Cleaning Babynames
In this post I download, clean and tidy open data from the City of Munich regarding baby names. The available datasets have different filename patterns, column names, delimiters, typos and implicitly missing data. I’ll use the cleaned data for future posts.
Introduction and Credits
The Open Data Portal of the City of Munich (Germany) provides all kinds of datasets regarding infrastructure, traffic, economy and demographics. The names given to babies can be found here (credits: Datenquelle: dl-de/by-2-0: Landeshauptstadt München – opendata.muenchen.de).
The data has some issues which I’ll address below, as preparation for future posts that use this dataset. For reference, the relevant column names translate as follows:
German | English |
---|---|
Vorname | first name |
Anzahl | count |
Geschlecht | sex1 |
Jahr | year |
Reading and Cleaning the Data
Unfortunately, the files follow different naming patterns and use different delimiter characters. In addition, the 2015 file uses a different column name for the first names. Since the file list is rather short, it’s quicker to hard-code this than to work out a fancy algorithm…
# readr, dplyr and tidyr are used throughout; janitor, here and stringi are called via ::
library(tidyverse)

vornamen_2022 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2022.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2022)
vornamen_2021 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2021.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2021)
vornamen_2020 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2020.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2020)
vornamen_2019 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2019.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2019)
vornamen_2018 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2018.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2018)
vornamen_2017 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2017.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2017)
vornamen_2016 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2016.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2016)
vornamen_2015 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2015.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2015) |>
rename(vorname = vornamen)
vornamen_2014 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2014.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2014)
vornamen_2013 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2013.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2013)
vornamen_raw <- rbind(
vornamen_2013,
vornamen_2014,
vornamen_2015,
vornamen_2016,
vornamen_2017,
vornamen_2018,
vornamen_2019,
vornamen_2020,
vornamen_2021,
vornamen_2022
)
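As a small sanity check, a quick count by year shows that all ten files ended up in the combined data frame:

# one group of rows per year, 2013 through 2022
dplyr::count(vornamen_raw, jahr)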
In addition, some Unicode characters are escaped in the form of e.g. <U+0219> for the letter ș. We’ll clean that up with the help of the stringi package.2 After that, we’ll drop 4 rows without name and sex, and 4 others where the name is “+”.
vornamen <- vornamen_raw |>
mutate(
vorname = stringi::stri_unescape_unicode( # this function doesn't recognize the <U+1234> escape, so we
gsub("<U\\+(....)>", "\\\\u\\1", vorname) # need to replace these with \u1234 as intermediate step
),
anzahl = ifelse(anzahl == "4 oder wenniger", "4 oder weniger", anzahl) # there's a typo in ~14k entries...
) |>
drop_na() |>
filter(vorname != "+")
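To illustrate what the two steps do, here they are applied to a single made-up value (the name below is a hypothetical example, not taken from the data):

x <- "Ra<U+0219>id"                        # hypothetical escaped value
x <- gsub("<U\\+(....)>", "\\\\u\\1", x)   # intermediate step: "Ra\\u0219id"
stringi::stri_unescape_unicode(x)          # "Rașid"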
Now that this is done, let’s have a first look at the head of the cleaned data…
vorname | anzahl | geschlecht | jahr |
---|---|---|---|
Maximilian | 166 | m | 2013 |
Felix | 124 | m | 2013 |
Anna | 109 | w | 2013 |
David | 109 | m | 2013 |
Sophia | 108 | w | 2013 |
Emilia | 103 | w | 2013 |
…and a dataframe summary:
Code
summarytools::dfSummary(vornamen)
Data Frame Summary
vornamen
Dimensions: 42077 x 4
Duplicates: 10
--------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ------------- -------------------------- --------------------- --------------------- ---------- ---------
1 vorname 1. Jona 19 ( 0.0%) 42077 0
[character] 2. Mika 19 ( 0.0%) (100.0%) (0.0%)
3. Toni 19 ( 0.0%)
4. Charlie 18 ( 0.0%)
5. Luca 18 ( 0.0%)
6. Noa 18 ( 0.0%)
7. Iman 17 ( 0.0%)
8. Lou 17 ( 0.0%)
9. Minh 17 ( 0.0%)
10. Deniz 16 ( 0.0%)
[ 18219 others ] 41899 (99.6%) IIIIIIIIIIIIIIIIIII
2 anzahl 1. 4 oder weniger 36499 (86.7%) IIIIIIIIIIIIIIIII 42077 0
[character] 2. 5 833 ( 2.0%) (100.0%) (0.0%)
3. 6 646 ( 1.5%)
4. 7 486 ( 1.2%)
5. 8 386 ( 0.9%)
6. 9 306 ( 0.7%)
7. 10 273 ( 0.6%)
8. 11 206 ( 0.5%)
9. 12 149 ( 0.4%)
10. 14 146 ( 0.3%)
[ 124 others ] 2147 ( 5.1%) I
3 geschlecht 1. m 20857 (49.6%) IIIIIIIII 42077 0
[character] 2. w 21220 (50.4%) IIIIIIIIII (100.0%) (0.0%)
4 jahr Mean (sd) : 2017.6 (2.9) 2013 : 4012 ( 9.5%) I 42077 0
[numeric] min < med < max: 2014 : 4030 ( 9.6%) I (100.0%) (0.0%)
2013 < 2018 < 2022 2015 : 3931 ( 9.3%) I
IQR (CV) : 5 (0) 2016 : 4308 (10.2%) II
2017 : 4241 (10.1%) II
2018 : 4182 ( 9.9%) I
2019 : 4263 (10.1%) II
2020 : 4342 (10.3%) II
2021 : 4378 (10.4%) II
2022 : 4390 (10.4%) II
--------------------------------------------------------------------------------------------------------------
Imputing Missing Values
For a large share of the entries (36499), no exact count is given. The data only states that 4 or fewer babies were given that name (“4 oder weniger” in German). Since this affects >80% of the distinct name/year combinations, I cannot just drop those rows. To fill these implicitly missing values, I’ll impute them.3
To estimate the frequencies of the rarer names, I assume a roughly exponential distribution with a tendency towards more unique names. The following probabilities are used (a quick expected-value check follows after the list):
- n = 1: 0.52
- n = 2: 0.27
- n = 3: 0.14
- n = 4: 0.07
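These weights sum to 1 and imply an expected count of 1.76 babies per imputed name/year combination:

# expected imputed count per "4 oder weniger" entry
sum(c(1, 2, 3, 4) * c(0.52, 0.27, 0.14, 0.07))
# [1] 1.76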
Imputation
# count missing values
n <- nrow(filter(vornamen, anzahl == "4 oder weniger"))
{
# set seed for reproducibility for this step only
set.seed(753)
vornamen_imputed <- vornamen |>
# impute with the sample() function, using the probabilities listed above
mutate(
anzahl = as.numeric(ifelse(anzahl == "4 oder weniger", sample(c(1, 2, 3, 4), size = n, replace = TRUE, prob = c(0.52, 0.27, 0.14, 0.07)), anzahl))
)
}
If the estimation/imputation is suitable, we should arrive at yearly birth totals similar to what the official birth statistics say. The official reports state these numbers for the years 2017–2022, respectively: 17629, 17587, 17509, 17593, 18330 and 16540.
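The comparison in the table below can be assembled roughly like this (a sketch rather than the exact code behind the table; the reference tibble holds the official numbers quoted above, and accuracy is just my name for the result):

# official yearly totals for 2017-2022 as quoted above
reference <- tibble(
  jahr = 2017:2022,
  reference = c(17629, 17587, 17509, 17593, 18330, 16540)
)

accuracy <- vornamen_imputed |>
  filter(jahr >= 2017) |>
  group_by(jahr) |>
  summarise(estimated = sum(anzahl)) |>
  left_join(reference, by = "jahr") |>
  mutate(
    difference = reference - estimated,
    diff_pct = difference / reference * 100
  )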
Accuracy check
jahr | estimated | reference | difference | diff_pct |
---|---|---|---|---|
2017 | 17647 | 17629 | -18 | -0.1021045 |
2018 | 17472 | 17587 | 115 | 0.6538921 |
2019 | 17295 | 17509 | 214 | 1.2222286 |
2020 | 17797 | 17593 | -204 | -1.1595521 |
2021 | 18196 | 18330 | 134 | 0.7310420 |
2022 | 16715 | 16540 | -175 | -1.0580411 |
The imputation works quite well, with an RMSE of around 157.87 births per year. In percent, the (root mean squared) difference is about 0.91%, with a maximum of 1.22%, which is acceptable for the purpose of this blog post (and future ones).
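Based on the accuracy sketch above, those two figures correspond to:

sqrt(mean(accuracy$difference^2))   # RMSE in births, ~157.87
sqrt(mean(accuracy$diff_pct^2))     # root mean squared difference in percent, ~0.91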
(Short) Exploratory Analysis
There are 9501 distinct female names and 9249 distinct male names.
The longest name (that is not a compound name joined with a “-”) is 16 characters long: “Sakthi……avani”
The shortest names are 2 characters long and there are 44 of those.
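These figures can be derived with a few short pipelines along these lines (a sketch, not necessarily the exact code behind the numbers above):

# distinct names per sex
vornamen_imputed |>
  distinct(vorname, geschlecht) |>
  count(geschlecht)

# longest name without a hyphen
vornamen_imputed |>
  filter(!grepl("-", vorname)) |>
  slice_max(nchar(vorname), n = 1)

# number of distinct two-character names
vornamen_imputed |>
  filter(nchar(vorname) == 2) |>
  distinct(vorname) |>
  nrow()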
The most common names by year and sex are:
jahr | geschlecht | anzahl | vorname |
---|---|---|---|
2013 | m | 166 | Maximilian |
2013 | w | 109 | Anna |
2014 | m | 178 | Maximilian |
2014 | w | 119 | Anna |
2015 | m | 165 | Maximilian |
2015 | w | 117 | Anna |
2016 | m | 154 | Maximilian |
2016 | w | 129 | Emilia |
2017 | m | 149 | Maximilian |
2017 | w | 136 | Emilia |
2018 | m | 155 | Maximilian |
2018 | w | 126 | Anna |
2019 | m | 136 | Maximilian |
2019 | w | 120 | Emilia |
2020 | m | 144 | Maximilian |
2020 | w | 144 | Emilia |
2021 | m | 136 | Maximilian |
2021 | w | 131 | Emilia |
2022 | m | 104 | Maximilian |
2022 | w | 103 | Emilia |
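A summary like this can be produced by keeping the top entry per year and sex, roughly as follows (a sketch, not necessarily the exact code used):

vornamen_imputed |>
  group_by(jahr, geschlecht) |>
  slice_max(anzahl, n = 1, with_ties = FALSE) |>
  arrange(jahr, geschlecht) |>
  select(jahr, geschlecht, anzahl, vorname)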
Wordcloud of the last 10 years
library(wordcloud2)
# median yearly count per name; keep names with a median of at least 5 and rename columns for wordcloud2
vornamen_wc <- vornamen_imputed |>
group_by(vorname) |>
summarise(median_anzahl = median(anzahl)) |>
filter(median_anzahl >= 5) |>
transmute(
word = vorname,
freq = median_anzahl
)
wordcloud2::wordcloud2(
data = vornamen_wc,
size = 0.4,
color = "random-light",
minSize = 7,
widgetsize = c("99%", 600)
)
Footnotes
1. As stated by the parents or the clinic when registering a birth.
2. Credits to the answer to this Stack Overflow question.
3. Imputations on such a large part of the data will obviously introduce some blur compared to the actual values. As more precise data is not publicly available and I’ll mostly work with the more frequent names, where exact numbers are available, this is sufficient for me.
Citation
@misc{gebhard2023,
author = {Gebhard, Christian},
title = {Cleaning {Babynames}},
date = {2023-06-21},
url = {https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html},
langid = {en}
}