Cleaning Babynames

In this post I download, clean and tidy open data from the City of Munich regarding baby names. The available datasets have different filename patterns, column names, delimiters, typos and implicitly missing data. I’ll use the cleaned data for future posts.

data cleaning

EDA

Author

Christian Gebhard

Published

June 21, 2023

Introduction and Credits

The Open Data Portal of the City of Munich (Germany) provides all kinds of datasets regarding infrastructure, traffic, economy and demographics. The names given to babies can be found here (credits: Datenquelle: dl-de/by-2-0: Landeshauptstadt München – opendata.muenchen.de).

The data has some issues which I’ll address in this post below as preparation for future posts using this dataset.

Short dictionary for non-German readers

German	English
Vorname	first name
Anzahl	count
Geschlecht	sex¹
Jahr	year

Reading and Cleaning the Data

Libraries used

library(tidyverse)
library(readr)
library(janitor)
library(stringr)
library(stringi)
library(summarytools)

Unfortunately the files have different naming patterns and use different delim-characters. In addition the year 2015 has a different column name for the first names. Since the file list is rather short, it’s quicker to hard-code this than to work out a fancy algorithm…

I really appreciate the open data portal of our city, but there’s still some room for improvement @OpenData München

Show the code

vornamen_2022 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2022.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2022)

vornamen_2021 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2021.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2021)

vornamen_2020 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2020.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2020)

vornamen_2019 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2019.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2019)

vornamen_2018 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2018.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2018)

vornamen_2017 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2017.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2017)

vornamen_2016 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2016.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2016)

vornamen_2015 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2015.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2015) |> 
  rename(vorname = vornamen)

vornamen_2014 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2014.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2014)

vornamen_2013 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2013.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2013)

vornamen_raw <- rbind(
  vornamen_2013, 
  vornamen_2014, 
  vornamen_2015, 
  vornamen_2016, 
  vornamen_2017, 
  vornamen_2018, 
  vornamen_2019, 
  vornamen_2020, 
  vornamen_2021, 
  vornamen_2022 
  )

In addition, Unicode characters are escaped in the form of e.g. <U+0219> for the letter ș. So we’ll clean that with the help of the {stringi} package.². After that we’ll drop 4 lines without name and sex, and 4 others where the name is “+”.

Code

vornamen <- vornamen_raw |> 
  mutate(
    vorname = stringi::stri_unescape_unicode(   # this function doesn't recognize the <U+1234> escape, so we
      gsub("<U\\+(....)>", "\\\\u\\1", vorname) # need to replace these with \u1234 as intermediate step
      ),
    anzahl = ifelse(anzahl == "4 oder wenniger", "4 oder weniger", anzahl) # there's a typo in ~14k entries...
  ) |> 
  drop_na() |> 
  filter(vorname != "+")

Now that this is done, let’s have a first look at the head of the cleaned data…

Code

knitr::kable(head(vornamen))

vorname	anzahl	geschlecht	jahr
Maximilian	166	m	2013
Felix	124	m	2013
Anna	109	w	2013
David	109	m	2013
Sophia	108	w	2013
Emilia	103	w	2013

…and a dataframe summary:

Code

summarytools::dfSummary(vornamen)

Data Frame Summary  
vornamen  
Dimensions: 42077 x 4  
Duplicates: 10  

--------------------------------------------------------------------------------------------------------------
No   Variable      Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
---- ------------- -------------------------- --------------------- --------------------- ---------- ---------
1    vorname       1. Jona                       19 ( 0.0%)                               42077      0        
     [character]   2. Mika                       19 ( 0.0%)                               (100.0%)   (0.0%)   
                   3. Toni                       19 ( 0.0%)                                                   
                   4. Charlie                    18 ( 0.0%)                                                   
                   5. Luca                       18 ( 0.0%)                                                   
                   6. Noa                        18 ( 0.0%)                                                   
                   7. Iman                       17 ( 0.0%)                                                   
                   8. Lou                        17 ( 0.0%)                                                   
                   9. Minh                       17 ( 0.0%)                                                   
                   10. Deniz                     16 ( 0.0%)                                                   
                   [ 18219 others ]           41899 (99.6%)         IIIIIIIIIIIIIIIIIII                       

2    anzahl        1. 4 oder weniger          36499 (86.7%)         IIIIIIIIIIIIIIIII     42077      0        
     [character]   2. 5                         833 ( 2.0%)                               (100.0%)   (0.0%)   
                   3. 6                         646 ( 1.5%)                                                   
                   4. 7                         486 ( 1.2%)                                                   
                   5. 8                         386 ( 0.9%)                                                   
                   6. 9                         306 ( 0.7%)                                                   
                   7. 10                        273 ( 0.6%)                                                   
                   8. 11                        206 ( 0.5%)                                                   
                   9. 12                        149 ( 0.4%)                                                   
                   10. 14                       146 ( 0.3%)                                                   
                   [ 124 others ]              2147 ( 5.1%)         I                                         

3    geschlecht    1. m                       20857 (49.6%)         IIIIIIIII             42077      0        
     [character]   2. w                       21220 (50.4%)         IIIIIIIIII            (100.0%)   (0.0%)   

4    jahr          Mean (sd) : 2017.6 (2.9)   2013 : 4012 ( 9.5%)   I                     42077      0        
     [numeric]     min < med < max:           2014 : 4030 ( 9.6%)   I                     (100.0%)   (0.0%)   
                   2013 < 2018 < 2022         2015 : 3931 ( 9.3%)   I                                         
                   IQR (CV) : 5 (0)           2016 : 4308 (10.2%)   II                                        
                                              2017 : 4241 (10.1%)   II                                        
                                              2018 : 4182 ( 9.9%)   I                                         
                                              2019 : 4263 (10.1%)   II                                        
                                              2020 : 4342 (10.3%)   II                                        
                                              2021 : 4378 (10.4%)   II                                        
                                              2022 : 4390 (10.4%)   II                                        
--------------------------------------------------------------------------------------------------------------

Imputing Missing Values

For a large part of names (36499), there’s no exact count given. It only says that there are 4 or less babies born with this name (“4 oder weniger” in German). Since this affects >80% of distinct name/year combinations, I cannot just drop those. To fill these implicit missing values, I’ll impute them.³

To estimate the frequencies of the rarer names, I assume a roughly exponential distribution with a tendency to more unique names. The following probabilities are used:

n = 1: 0.52
n = 2: 0.27
n = 3: 0.14
n = 4: 0.07

Imputation

# count missing values 
n <- nrow(filter(vornamen, anzahl == "4 oder weniger")) 


{
  # set seed for reproducibility for this step only
  set.seed(753)
  
  vornamen_imputed <- vornamen |> 
    # impute using the sample() function using the above probabilities
    mutate(
      anzahl = as.numeric(ifelse(anzahl == "4 oder weniger", sample(c(1, 2, 3, 4), size = n, replace = TRUE, prob = c(0.52, 0.27, 0.14, 0.07)), anzahl))
    )
}

If the estimation/imputation is suitable, then we should come to a total of births similar to what the official birth statisics say. The official reports state these numbers for the years 2017-2022 respectively: 17629, 17587, 17509, 17593, 18330 and 16540.

Accuracy check

vornamen_diff <- vornamen_imputed |> 
  uncount(anzahl) |> 
  group_by(jahr) |> 
  summarise(estimated = n()) |> 
  filter(jahr >= 2017) |> 
  mutate(
    reference = c(17629, 17587, 17509, 17593, 18330, 16540),
    difference = reference - estimated,
    diff_pct = difference / reference *100
  )
  
knitr::kable(vornamen_diff)

jahr	estimated	reference	difference	diff_pct
2017	17647	17629	-18	-0.1021045
2018	17472	17587	115	0.6538921
2019	17295	17509	214	1.2222286
2020	17797	17593	-204	-1.1595521
2021	18196	18330	134	0.7310420
2022	16715	16540	-175	-1.0580411

The imputation is quite okay, with an RMSE of around 157.87. In percent, the (root mean squared) difference is about 0.91%, with a maximum of 1.22%, which is acceptable for the purpose of this blog post (and future ones).

(Short) Exploratory Analysis

There are 9501 distinct female names and 9249 distinct male names.
The longest name (that is not a composed name with a “-”) is 16 characters long: “Sakthi……avani”
The shortest names are 2 characters long and there are 44 of those.

The most common names by year and sex are:

Code

vornamen_imputed |> 
  group_by(jahr, geschlecht) |> 
  summarise(anzahl = max(anzahl)) |> 
  left_join(vornamen_imputed) |> 
  knitr::kable()

jahr	geschlecht	anzahl	vorname
2013	m	166	Maximilian
2013	w	109	Anna
2014	m	178	Maximilian
2014	w	119	Anna
2015	m	165	Maximilian
2015	w	117	Anna
2016	m	154	Maximilian
2016	w	129	Emilia
2017	m	149	Maximilian
2017	w	136	Emilia
2018	m	155	Maximilian
2018	w	126	Anna
2019	m	136	Maximilian
2019	w	120	Emilia
2020	m	144	Maximilian
2020	w	144	Emilia
2021	m	136	Maximilian
2021	w	131	Emilia
2022	m	104	Maximilian
2022	w	103	Emilia

Wordcloud of the last 10 years

Code

library(wordcloud2)
vornamen_wc <- vornamen_imputed |> 
  group_by(vorname) |> 
  summarise(median_anzahl = median(anzahl)) |> 
  filter(median_anzahl >= 5) |> 
  transmute(
    word = vorname,
    freq = median_anzahl
  )

wordcloud2::wordcloud2(
  data = vornamen_wc,
  size = 0.4,
  color = "random-light",
  minSize = 7,
  widgetsize = c("99%", 600)
)

Footnotes

as stated by the parents or the clinic when registering a birth↩︎
credits to the answer of this stack overflow question ↩︎
Imputations on such a large part of data will obviously introduce some blur compared to the actual data. As more precise data is not publicly available and I’ll mostly work on the more frequent names, where exact numbers are available, this is sufficient for me.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@misc{gebhard2023,
  author = {Gebhard, Christian},
  title = {Cleaning {Babynames}},
  date = {2023-06-21},
  url = {https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html},
  langid = {en}
}

For attribution, please cite this work as:

Gebhard, Christian. 2023. “Cleaning Babynames.” June 21, 2023. https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html.