Libraries used
library(tidyverse)
library(readr)
library(janitor)
library(stringr)
library(stringi)
library(summarytools)
In this post I download, clean and tidy open data from the City of Munich regarding baby names. The available datasets have different filename patterns, column names, delimiters, typos and implicitly missing data. I’ll use the cleaned data for future posts.
June 21, 2023
The Open Data Portal of the City of Munich (Germany) provides all kinds of datasets regarding infrastructure, traffic, economy and demographics. The names given to babies can be found here (credits: Datenquelle: dl-de/by-2-0: Landeshauptstadt München – opendata.muenchen.de).
The data has some issues which I’ll address in this post below as preparation for future posts using this dataset.
German | English |
---|---|
Vorname | first name |
Anzahl | count |
Geschlecht | sex1 |
Jahr | year |
Unfortunately the files have different naming patterns and use different delim-characters. In addition the year 2015 has a different column name for the first names. Since the file list is rather short, it’s quicker to hard-code this than to work out a fancy algorithm…
vornamen_2022 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2022.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2022)
vornamen_2021 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2021.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2021)
vornamen_2020 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2020.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2020)
vornamen_2019 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2019.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2019)
vornamen_2018 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2018.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2018)
vornamen_2017 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2017.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2017)
vornamen_2016 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2016.csv")), delim = ",") |>
janitor::clean_names() |>
mutate(jahr = 2016)
vornamen_2015 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2015.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2015) |>
rename(vorname = vornamen)
vornamen_2014 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2014.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2014)
vornamen_2013 <- read_delim(Sys.glob(here::here("posts", "2023-06-babynames", "original_data", "*2013.csv")), delim = ";") |>
janitor::clean_names() |>
mutate(jahr = 2013)
vornamen_raw <- rbind(
vornamen_2013,
vornamen_2014,
vornamen_2015,
vornamen_2016,
vornamen_2017,
vornamen_2018,
vornamen_2019,
vornamen_2020,
vornamen_2021,
vornamen_2022
)
In addition, Unicode characters are escaped in the form of e.g. <U+0219> for the letter ș. So we’ll clean that with the help of the {stringi}
package.2. After that we’ll drop 4 lines without name and sex, and 4 others where the name is “+”.
vornamen <- vornamen_raw |>
mutate(
vorname = stringi::stri_unescape_unicode( # this function doesn't recognize the <U+1234> escape, so we
gsub("<U\\+(....)>", "\\\\u\\1", vorname) # need to replace these with \u1234 as intermediate step
),
anzahl = ifelse(anzahl == "4 oder wenniger", "4 oder weniger", anzahl) # there's a typo in ~14k entries...
) |>
drop_na() |>
filter(vorname != "+")
Now that this is done, let’s have a first look at the head of the cleaned data…
vorname | anzahl | geschlecht | jahr |
---|---|---|---|
Maximilian | 166 | m | 2013 |
Felix | 124 | m | 2013 |
Anna | 109 | w | 2013 |
David | 109 | m | 2013 |
Sophia | 108 | w | 2013 |
Emilia | 103 | w | 2013 |
…and a dataframe summary:
Data Frame Summary
vornamen
Dimensions: 42077 x 4
Duplicates: 10
--------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ------------- -------------------------- --------------------- --------------------- ---------- ---------
1 vorname 1. Jona 19 ( 0.0%) 42077 0
[character] 2. Mika 19 ( 0.0%) (100.0%) (0.0%)
3. Toni 19 ( 0.0%)
4. Charlie 18 ( 0.0%)
5. Luca 18 ( 0.0%)
6. Noa 18 ( 0.0%)
7. Iman 17 ( 0.0%)
8. Lou 17 ( 0.0%)
9. Minh 17 ( 0.0%)
10. Deniz 16 ( 0.0%)
[ 18219 others ] 41899 (99.6%) IIIIIIIIIIIIIIIIIII
2 anzahl 1. 4 oder weniger 36499 (86.7%) IIIIIIIIIIIIIIIII 42077 0
[character] 2. 5 833 ( 2.0%) (100.0%) (0.0%)
3. 6 646 ( 1.5%)
4. 7 486 ( 1.2%)
5. 8 386 ( 0.9%)
6. 9 306 ( 0.7%)
7. 10 273 ( 0.6%)
8. 11 206 ( 0.5%)
9. 12 149 ( 0.4%)
10. 14 146 ( 0.3%)
[ 124 others ] 2147 ( 5.1%) I
3 geschlecht 1. m 20857 (49.6%) IIIIIIIII 42077 0
[character] 2. w 21220 (50.4%) IIIIIIIIII (100.0%) (0.0%)
4 jahr Mean (sd) : 2017.6 (2.9) 2013 : 4012 ( 9.5%) I 42077 0
[numeric] min < med < max: 2014 : 4030 ( 9.6%) I (100.0%) (0.0%)
2013 < 2018 < 2022 2015 : 3931 ( 9.3%) I
IQR (CV) : 5 (0) 2016 : 4308 (10.2%) II
2017 : 4241 (10.1%) II
2018 : 4182 ( 9.9%) I
2019 : 4263 (10.1%) II
2020 : 4342 (10.3%) II
2021 : 4378 (10.4%) II
2022 : 4390 (10.4%) II
--------------------------------------------------------------------------------------------------------------
For a large part of names (36499), there’s no exact count given. It only says that there are 4 or less babies born with this name (“4 oder weniger” in German). Since this affects >80% of distinct name/year combinations, I cannot just drop those. To fill these implicit missing values, I’ll impute them.3
To estimate the frequencies of the rarer names, I assume a roughly exponential distribution with a tendency to more unique names. The following probabilities are used:
n = 1: 0.52
n = 2: 0.27
n = 3: 0.14
n = 4: 0.07
# count missing values
n <- nrow(filter(vornamen, anzahl == "4 oder weniger"))
{
# set seed for reproducibility for this step only
set.seed(753)
vornamen_imputed <- vornamen |>
# impute using the sample() function using the above probabilities
mutate(
anzahl = as.numeric(ifelse(anzahl == "4 oder weniger", sample(c(1, 2, 3, 4), size = n, replace = TRUE, prob = c(0.52, 0.27, 0.14, 0.07)), anzahl))
)
}
If the estimation/imputation is suitable, then we should come to a total of births similar to what the official birth statisics say. The official reports state these numbers for the years 2017-2022 respectively: 17629, 17587, 17509, 17593, 18330 and 16540.
jahr | estimated | reference | difference | diff_pct |
---|---|---|---|---|
2017 | 17647 | 17629 | -18 | -0.1021045 |
2018 | 17472 | 17587 | 115 | 0.6538921 |
2019 | 17295 | 17509 | 214 | 1.2222286 |
2020 | 17797 | 17593 | -204 | -1.1595521 |
2021 | 18196 | 18330 | 134 | 0.7310420 |
2022 | 16715 | 16540 | -175 | -1.0580411 |
The imputation is quite okay, with an RMSE of around 157.87. In percent, the (root mean squared) difference is about 0.91%, with a maximum of 1.22%, which is acceptable for the purpose of this blog post (and future ones).
There are 9501 distinct female names and 9249 distinct male names.
The longest name (that is not a composed name with a “-”) is 16 characters long: “Sakthi……avani”
The shortest names are 2 characters long and there are 44 of those.
The most common names by year and sex are:
jahr | geschlecht | anzahl | vorname |
---|---|---|---|
2013 | m | 166 | Maximilian |
2013 | w | 109 | Anna |
2014 | m | 178 | Maximilian |
2014 | w | 119 | Anna |
2015 | m | 165 | Maximilian |
2015 | w | 117 | Anna |
2016 | m | 154 | Maximilian |
2016 | w | 129 | Emilia |
2017 | m | 149 | Maximilian |
2017 | w | 136 | Emilia |
2018 | m | 155 | Maximilian |
2018 | w | 126 | Anna |
2019 | m | 136 | Maximilian |
2019 | w | 120 | Emilia |
2020 | m | 144 | Maximilian |
2020 | w | 144 | Emilia |
2021 | m | 136 | Maximilian |
2021 | w | 131 | Emilia |
2022 | m | 104 | Maximilian |
2022 | w | 103 | Emilia |
library(wordcloud2)
vornamen_wc <- vornamen_imputed |>
group_by(vorname) |>
summarise(median_anzahl = median(anzahl)) |>
filter(median_anzahl >= 5) |>
transmute(
word = vorname,
freq = median_anzahl
)
wordcloud2::wordcloud2(
data = vornamen_wc,
size = 0.4,
color = "random-light",
minSize = 7,
widgetsize = c("99%", 600)
)
as stated by the parents or the clinic when registering a birth↩︎
credits to the answer of this stack overflow question↩︎
Imputations on such a large part of data will obviously introduce some blur compared to the actual data. As more precise data is not publicly available and I’ll mostly work on the more frequent names, where exact numbers are available, this is sufficient for me.↩︎
@misc{gebhard2023,
author = {Gebhard, Christian},
title = {Cleaning {Babynames}},
date = {2023-06-21},
url = {https://christiangebhard.com/posts/2023-06-babynames/2023-06-babynames.html},
langid = {en}
}