Cleaning Babynames

In this post I download, clean and tidy open data from the City of Munich regarding baby names. The available datasets have different filename patterns, column names, delimiters, typos and implicitly missing data. I’ll use the cleaned data for future posts.

data cleaning
EDA
R
Author
Published

June 21, 2023

Foto by Brett Jordan on Unsplash

Introduction and Credits

The Open Data Portal of the City of Munich (Germany) provides all kinds of datasets regarding infrastructure, traffic, economy and demographics. The names given to babies can be found here (credits: Datenquelle: dl-de/by-2-0: Landeshauptstadt München – opendata.muenchen.de).

The data has some issues which I’ll address in this post below as preparation for future posts using this dataset.

German English
Vorname first name
Anzahl count
Geschlecht sex1
Jahr year

Reading and Cleaning the Data

Unfortunately the files have different naming patterns and use different delim-characters. In addition the year 2015 has a different column name for the first names. Since the file list is rather short, it’s quicker to hard-code this than to work out a fancy algorithm…

I really appreciate the open data portal of our city, but there’s still some room for improvement @OpenData München
Show the code
vornamen_2022 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2022.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2022)

vornamen_2021 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2021.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2021)

vornamen_2020 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2020.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2020)

vornamen_2019 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2019.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2019)

vornamen_2018 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2018.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2018)

vornamen_2017 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2017.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2017)

vornamen_2016 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2016.csv")), delim = ",") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2016)

vornamen_2015 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2015.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2015) |> 
  rename(vorname = vornamen)

vornamen_2014 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2014.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2014)

vornamen_2013 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2013.csv")), delim = ";") |> 
  janitor::clean_names() |> 
  mutate(jahr = 2013)

vornamen_raw <- rbind(
  vornamen_2013, 
  vornamen_2014, 
  vornamen_2015, 
  vornamen_2016, 
  vornamen_2017, 
  vornamen_2018, 
  vornamen_2019, 
  vornamen_2020, 
  vornamen_2021, 
  vornamen_2022 
  )

In addition, Unicode characters are escaped in the form of e.g. <U+0219> for the letter ș. So we’ll clean that with the help of the stringi package.2. After that we’ll drop 4 lines without name and sex, and 4 others where the name is “+”.

Code
vornamen <- vornamen_raw |> 
  mutate(
    vorname = stringi::stri_unescape_unicode(   # this function doesn't recognize the <U+1234> escape, so we
      gsub("<U\\+(....)>", "\\\\u\\1", vorname) # need to replace these with \u1234 as intermediate step
      ),
    anzahl = ifelse(anzahl == "4 oder wenniger", "4 oder weniger", anzahl) # there's a typo in ~14k entries...
  ) |> 
  drop_na() |> 
  filter(vorname != "+")

Now that this is done, let’s have a first look at the head of the cleaned data…

Code
knitr::kable(head(vornamen))
vorname anzahl geschlecht jahr
Maximilian 166 m 2013
Felix 124 m 2013
Anna 109 w 2013
David 109 m 2013
Sophia 108 w 2013
Emilia 103 w 2013

…and a dataframe summary:

Code
summarytools::dfSummary(vornamen)
Data Frame Summary  
vornamen  
Dimensions: 42077 x 4  
Duplicates: 10  

--------------------------------------------------------------------------------------------------------------
No   Variable      Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
---- ------------- -------------------------- --------------------- --------------------- ---------- ---------
1    vorname       1. Jona                       19 ( 0.0%)                               42077      0        
     [character]   2. Mika                       19 ( 0.0%)                               (100.0%)   (0.0%)   
                   3. Toni                       19 ( 0.0%)                                                   
                   4. Charlie                    18 ( 0.0%)                                                   
                   5. Luca                       18 ( 0.0%)                                                   
                   6. Noa                        18 ( 0.0%)                                                   
                   7. Iman                       17 ( 0.0%)                                                   
                   8. Lou                        17 ( 0.0%)                                                   
                   9. Minh                       17 ( 0.0%)                                                   
                   10. Deniz                     16 ( 0.0%)                                                   
                   [ 18219 others ]           41899 (99.6%)         IIIIIIIIIIIIIIIIIII                       

2    anzahl        1. 4 oder weniger          36499 (86.7%)         IIIIIIIIIIIIIIIII     42077      0        
     [character]   2. 5                         833 ( 2.0%)                               (100.0%)   (0.0%)   
                   3. 6                         646 ( 1.5%)                                                   
                   4. 7                         486 ( 1.2%)                                                   
                   5. 8                         386 ( 0.9%)                                                   
                   6. 9                         306 ( 0.7%)                                                   
                   7. 10                        273 ( 0.6%)                                                   
                   8. 11                        206 ( 0.5%)                                                   
                   9. 12                        149 ( 0.4%)                                                   
                   10. 14                       146 ( 0.3%)                                                   
                   [ 124 others ]              2147 ( 5.1%)         I                                         

3    geschlecht    1. m                       20857 (49.6%)         IIIIIIIII             42077      0        
     [character]   2. w                       21220 (50.4%)         IIIIIIIIII            (100.0%)   (0.0%)   

4    jahr          Mean (sd) : 2017.6 (2.9)   2013 : 4012 ( 9.5%)   I                     42077      0        
     [numeric]     min < med < max:           2014 : 4030 ( 9.6%)   I                     (100.0%)   (0.0%)   
                   2013 < 2018 < 2022         2015 : 3931 ( 9.3%)   I                                         
                   IQR (CV) : 5 (0)           2016 : 4308 (10.2%)   II                                        
                                              2017 : 4241 (10.1%)   II                                        
                                              2018 : 4182 ( 9.9%)   I                                         
                                              2019 : 4263 (10.1%)   II                                        
                                              2020 : 4342 (10.3%)   II                                        
                                              2021 : 4378 (10.4%)   II                                        
                                              2022 : 4390 (10.4%)   II                                        
--------------------------------------------------------------------------------------------------------------

Imputing Missing Values

For a large part of names (36499), there’s no exact count given. It only says that there are 4 or less babies born with this name (“4 oder weniger” in German). Since this affects >80% of distinct name/year combinations, I cannot just drop those. To fill these implicit missing values, I’ll impute them.3

To estimate the frequencies of the rarer names, I assume a roughly exponential distribution with a tendency to more unique names. The following probabilities are used:

n = 1: 0.52
n = 2: 0.27
n = 3: 0.14
n = 4: 0.07
Imputation
# count missing values 
n <- nrow(filter(vornamen, anzahl == "4 oder weniger")) 


{
  # set seed for reproducibility for this step only
  set.seed(753)
  
  vornamen_imputed <- vornamen |> 
    # impute using the sample() function using the above probabilities
    mutate(
      anzahl = as.numeric(ifelse(anzahl == "4 oder weniger", sample(c(1, 2, 3, 4), size = n, replace = TRUE, prob = c(0.52, 0.27, 0.14, 0.07)), anzahl))
    )
}

If the estimation/imputation is suitable, then we should come to a total of births similar to what the official birth statisics say. The official reports state these numbers for the years 2017-2022 respectively: 17629, 17587, 17509, 17593, 18330 and 16540.

Accuracy check
vornamen_diff <- vornamen_imputed |> 
  uncount(anzahl) |> 
  group_by(jahr) |> 
  summarise(estimated = n()) |> 
  filter(jahr >= 2017) |> 
  mutate(
    reference = c(17629, 17587, 17509, 17593, 18330, 16540),
    difference = reference - estimated,
    diff_pct = difference / reference *100
  )
  
knitr::kable(vornamen_diff)
jahr estimated reference difference diff_pct
2017 17647 17629 -18 -0.1021045
2018 17472 17587 115 0.6538921
2019 17295 17509 214 1.2222286
2020 17797 17593 -204 -1.1595521
2021 18196 18330 134 0.7310420
2022 16715 16540 -175 -1.0580411

The imputation is quite okay, with an RMSE of around 157.87. In percent, the (root mean squared) difference is about 0.91%, with a maximum of 1.22%, which is acceptable for the purpose of this blog post (and future ones).

(Short) Exploratory Analysis

  • There are 9501 distinct female names and 9249 distinct male names.

  • The longest name (that is not a composed name with a “-”) is 16 characters long: “Sakthi……avani”

  • The shortest names are 2 characters long and there are 44 of those.

  • The most common names by year and sex are:

    Code
    vornamen_imputed |> 
      group_by(jahr, geschlecht) |> 
      summarise(anzahl = max(anzahl)) |> 
      left_join(vornamen_imputed) |> 
      knitr::kable()
    jahr geschlecht anzahl vorname
    2013 m 166 Maximilian
    2013 w 109 Anna
    2014 m 178 Maximilian
    2014 w 119 Anna
    2015 m 165 Maximilian
    2015 w 117 Anna
    2016 m 154 Maximilian
    2016 w 129 Emilia
    2017 m 149 Maximilian
    2017 w 136 Emilia
    2018 m 155 Maximilian
    2018 w 126 Anna
    2019 m 136 Maximilian
    2019 w 120 Emilia
    2020 m 144 Maximilian
    2020 w 144 Emilia
    2021 m 136 Maximilian
    2021 w 131 Emilia
    2022 m 104 Maximilian
    2022 w 103 Emilia

Wordcloud of the last 10 years

Code
library(wordcloud2)
vornamen_wc <- vornamen_imputed |> 
  group_by(vorname) |> 
  summarise(median_anzahl = median(anzahl)) |> 
  filter(median_anzahl >= 5) |> 
  transmute(
    word = vorname,
    freq = median_anzahl
  )

wordcloud2::wordcloud2(
  data = vornamen_wc,
  size = 0.4,
  color = "random-light",
  minSize = 7,
  widgetsize = c("99%", 600)
)

Footnotes

  1. as stated by the parents or the clinic when registering a birth↩︎

  2. credits to the answer of this stack overflow question↩︎

  3. Imputations on such a large part of data will obviously introduce some blur compared to the actual data. As more precise data is not publicly available and I’ll mostly work on the more frequent names, where exact numbers are available, this is sufficient for me.↩︎

Reuse

Citation

BibTeX citation:
@misc{gebhard2023,
  author = {Gebhard, Christian},
  title = {Cleaning {Babynames}},
  date = {2023-06-21},
  url = {https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html},
  langid = {en}
}
For attribution, please cite this work as:
Gebhard, Christian. 2023. “Cleaning Babynames.” June 21, 2023. https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html.