Bundestag Part II.1 - Go back to start…

R, germany, politics, RegEx, record linkage

After further research I found a way to get more data with better quality.

Christian A. Gebhard
03-22-2021

…do not pass Go.

Introduction

I should have known better. When I couldn’t find a proper API to access the Bundestag data, I immediately went for what looked like the second-best option: scraping the xls files that are hosted on the website.1 I started scraping right away, set up Selenium and wrote a script to automatically click through the dynamic website and download all available xls files.

The result wasn’t too bad, actually: I obtained around 480 datasets, including exact dates of the polls, all individual votes, names of representatives and party/parliamentary groups. Yet, I couldn’t get the script to finish correctly. I had hoped to scrape all ~700 available polls, not only ~480.

Had I stepped back for a second and done some further research first, I would have found a comfortable way to access the data for all ~700 polls via an easily iterable URL pattern. Better yet, someone has already been doing the work of web scraping the Bundestag data in quite a professional way: the project behind https://www.democracy-deutschland.de/ provides several tools for scraping parliamentary websites in Germany (e.g. the national and the federal-state parliaments). Some of the data scraped with these tools is regularly updated and hosted in the repositories at https://github.com/bundestag.

As I still learned a lot during my own scraping process, I decided to keep the now obsolete post and make this update a separate post.

Preview Image: © 2021 Christian A. Gebhard

The data

The Bundestag provides polling data going back quite a long time as PDF reports, but machine-readable tables only since around 2012. Roll-call data since 2009 is available as lists that can be scraped from the website. Unfortunately there is no specific license mentioned for the open data, but the data is offered to “interested users” for “further machine processing”. Sounds good enough for my use.

I’ll be using the data hosted in a GitHub repo, which was generated by scraping the www.bundestag.de website. Among the different datasets there are JSON files with the individual votes, linked to the ID number of the ballot, and others with the metadata on the ballots, including title, date, etc. Both are unofficial data, but I probably couldn’t scrape it in better quality myself, so I will continue to use this data.

Both data sources (NamedPolls and NamedPollDeputies) are published under “The Unlicense” with no conditions whatsoever.
I still want to give credit to the contributors of these repositories and thank them for their great and helpful work!

Processing the data

I want to get a single dataset containing the names and dates of the ballots, the names and votes of the representatives, and their party membership. Unfortunately the data is split across two “folders”: one for the poll metadata and one for the actual votes. Additionally, each poll has its own file in each folder.

In this post I will

  1. read the json files,
  2. reshape them into a rectangular form,
  3. concatenate all separate poll files to a single dataframe,
  4. do additional cleaning / completion steps,
  5. and join the voting and the meta data.

So let’s get started with loading the required packages:
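
A minimal setup covering the functions used in this post might look like this (the exact set of packages is my assumption, not taken from the original chunk):

library(tidyverse)   # dplyr, stringr, readr, purrr, ... for wrangling and export
library(jsonlite)    # read_json() / fromJSON() for the poll files
library(reclin)      # pair_blocking(), compare_pairs(), ... for the record linkage below
library(rmarkdown)   # paged_table() for the interactive result tables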

Reading the meta data of the polls

I cloned the above-mentioned repositories “NamedPolls” and “NamedPollDeputies” to my data_sources/2021_bundestag folder. Using Sys.glob() I obtain the file paths for all data files, loop over them, and rbind the extracted information into the desired metadata tibble.

# obtain a vector with all relevant file paths
poll_files <- Sys.glob(file.path("../../../data_sources/2021_bundestag/NamedPolls/data/*.json"))

# instantiate an empty tibble
poll_meta <- tibble()

# loop over the file paths, read each file, extract information of interest and append to the tibble
for (file in poll_files) {
  json_temp <- read_json(file)
  p_id <- as.numeric(json_temp$data$id)
  p_title <- json_temp$data$title
  p_date <- as.Date(json_temp$data$date)
  poll_meta <- rbind(poll_meta, tibble(p_id, p_date, p_title))
}
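
Growing a tibble with rbind() inside a for loop works fine for a few hundred files. An equivalent, somewhat tidier alternative would be to map over the paths with purrr; this is just a sketch, not the code that produced the output below:

# alternative sketch: read each file and row-bind the resulting one-row tibbles
poll_meta_alt <- map_dfr(poll_files, function(file) {
  json_temp <- read_json(file)
  tibble(p_id    = as.numeric(json_temp$data$id),
         p_date  = as.Date(json_temp$data$date),
         p_title = json_temp$data$title)
})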

Let’s get the dimensions and a summary of the metadata.

#> [1] "Dimensions:"
#> [1] 656   3
#> [1] "Summary of Metadata:"
#>       p_id           p_date             p_title         
#>  Min.   :  1.0   Min.   :2009-12-03   Length:656        
#>  1st Qu.:164.8   1st Qu.:2012-06-29   Class :character  
#>  Median :347.5   Median :2015-07-02   Mode  :character  
#>  Mean   :355.3   Mean   :2015-07-15                     
#>  3rd Qu.:541.2   3rd Qu.:2018-10-18                     
#>  Max.   :720.0   Max.   :2021-03-26

Reading the voting data

In this snippet I use jsonlite::fromJSON(), as it can simplify the data to a dataframe automatically. The dataframe with the votes of each deputy is nested within the resulting list, as there is other data “surrounding” the actual voting data:

# obtain a vector with all relevant file paths
vote_files <- Sys.glob(file.path("../../../data_sources/2021_bundestag/NamedPollDeputies/data/*.json"))

# instantiate an empty tibble
voting_data_raw <- tibble()

# loop over the files, extract the poll id and the list item with the 
# nested data frame. Then append the dataframe to the rest of the data.
for (file in vote_files) {
  votes_temp <- fromJSON(file, simplifyVector = TRUE, simplifyDataFrame = TRUE)
  temp_id <- votes_temp$data$id
  df_temp <- as_tibble(votes_temp$data$votes$deputies) %>% 
    mutate(p_id = as.numeric(temp_id))
  voting_data_raw <- rbind(voting_data_raw, df_temp)
}

print("Dimensions of the dataset:")
#> [1] "Dimensions of the dataset:"
dim(voting_data_raw)
#> [1] 427039      8

Cleaning and Quality Checks

For the further analysis in this series I want to make sure that each deputy is identified by a unique identifier and not listed under, e.g., slightly different names. The scraped data isn’t perfect in this regard:

…so I can use neither the ID nor the URL.

My first idea was to use the official basic deputy data provided by the Bundestag2, because it lists all name variants, name changes, academic titles and titles of nobility, linked to a constant deputy ID number. However, the votes dataset above has 427039 entries and the deputy dataset is substantial as well. As the order of first/last name etc. differs between the current voting data and the “reference” dataset, I wanted to use record linkage / fuzzy string matching. Unfortunately my midrange 2013 notebook ran out of memory comparing all possible pairings, so I had to take a different approach.

Removing titles

In the end I decided to use the name as identifier. To avoid titles causing different name strings for the same person, I simply removed from the names all variants of academic titles that I could find or think of:

voting_data_no_titles <- voting_data_raw %>% 
  mutate(name_clean = str_replace_all(name, "Prof\\.", ""),
         name_clean = str_replace_all(name_clean, "Dr\\.", ""),
         name_clean = str_replace_all(name_clean, "h\\.\\s?c\\.", ""),
         name_clean = str_replace_all(name_clean, "rer\\.\\s?nat\\.", ""),
         name_clean = str_replace_all(name_clean, "rer\\.\\s?pol\\.", ""),
         name_clean = str_replace_all(name_clean, "iur\\.", ""),
         name_clean = str_replace_all(name_clean, ",\\s*", ", ") # replace excessive spaces with a single space as this is missing in some cases causing name-duplicates 
         )
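
For reference, the same cleaning could also be written as a single alternation pattern, with str_squish() collapsing any leftover whitespace. This is only a sketch; because of the additional whitespace handling it might merge slightly more variants than the chain above:

# alternative sketch: remove all title variants in one pass
title_pattern <- "Prof\\.|Dr\\.|h\\.\\s?c\\.|rer\\.\\s?nat\\.|rer\\.\\s?pol\\.|iur\\."

voting_data_no_titles_alt <- voting_data_raw %>% 
  mutate(name_clean = str_remove_all(name, title_pattern),         # drop the titles
         name_clean = str_replace_all(name_clean, ",\\s*", ", "),  # normalize spacing after the comma
         name_clean = str_squish(name_clean))                      # collapse leftover double spaces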

This reduced the number of unique names considerably, from 1269 to 1207.
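
These counts can be read off the two name columns directly, for example:

# distinct raw names vs. distinct cleaned names;
# should return 1269 and 1207, the values reported above
voting_data_no_titles %>% 
  summarise(raw_names   = n_distinct(name),
            clean_names = n_distinct(name_clean))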

As quality checks I answered the following questions (written in the code comments):

# Q1: Are there rare name variations, e.g. caused by typos (occurring 
# once or twice), missed by the above cleaning?

voting_data_no_titles %>% 
  count(name_clean) %>% 
  arrange(n) %>% 
  head(10) %>% 
  paged_table()

# Q2: Did the above process cause "name collision" for two deputies,
# resulting in >1 vote in a poll for the same name?

voting_data_no_titles %>% 
  group_by(p_id) %>% 
  count(name_clean, sort = TRUE) %>% 
  head(10) %>% 
  paged_table()

# Q3: Are there different names for the same biography-URL?

voting_data_no_titles %>% 
  distinct(URL, name_clean) %>% 
  count(URL, sort = TRUE) %>% 
  head(10) %>% 
  paged_table()

# Q4: Are there different names for the same id number?

voting_data_no_titles %>% 
  distinct(id, name_clean) %>% 
  count(id, sort = TRUE) %>% 
  head(10) %>% 
  paged_table()

I can answer all four questions with “no”.

Record linkage

To check for systematic spelling differences that do not appear as “single exceptions” above, I went back to record linkage and in fact identified a few names, mostly due to accents or non-standard characters.3


unique_names <- voting_data_no_titles %>% 
  distinct(name_clean)

pair_blocking(unique_names, unique_names) %>% 
  compare_pairs(by = "name_clean", default_comparator = lcs()) %>% 
  score_problink() %>%
  select_threshold(threshold = 0.6) %>% 
  link() %>% 
  filter(name_clean.x != name_clean.y) %>% 
  arrange(name_clean.x)
#>                 name_clean.x              name_clean.y
#> 1             Barthel, Klaus              Breil, Klaus
#> 2             Baumann, Bernd            Neumann, Bernd
#> 3         Bergner, Christoph      Bernstiel, Christoph
#> 4       Bernstiel, Christoph        Bergner, Christoph
#> 5               Beyer, Peter             Bleser, Peter
#> 6              Bleser, Peter              Beyer, Peter
#> 7              Bleser, Peter             Felser, Peter
#> 8             Brand, Michael            Brandt, Michel
#> 9           Brandl, Reinhard         Grindel, Reinhard
#> 10            Brandt, Michel            Brand, Michael
#> 11              Breil, Klaus            Barthel, Klaus
#> 12            Brugger, Agnes        Brugger, Agnieszka
#> 13        Brugger, Agnieszka            Brugger, Agnes
#> 14             Canel, Sylvia            Pantel, Sylvia
#> 15           Dagdelen, Sevim           Dağdelen, Sevim
#> 16           Dağdelen, Sevim           Dagdelen, Sevim
#> 17            Donth, Michael             Roth, Michael
#> 18             Feist, Thomas             Seitz, Thomas
#> 19             Felser, Peter             Bleser, Peter
#> 20     Friedrich, Hans-Peter          Friedrich, Peter
#> 21          Friedrich, Peter     Friedrich, Hans-Peter
#> 22            Fuchs, Michael            Kauch, Michael
#> 23             Glos, Michael             Groß, Michael
#> 24         Grindel, Reinhard          Brandl, Reinhard
#> 25             Groß, Michael             Glos, Michael
#> 26           Hauer, Matthias          Heider, Matthias
#> 27          Heider, Matthias           Hauer, Matthias
#> 28          Hirte, Christian          Wirth, Christian
#> 29           Hohmann, Martin          Rosemann, Martin
#> 30           Jung, Christian          Lange, Christian
#> 31            Kauch, Michael            Fuchs, Michael
#> 32       Korkmaz-Emre, Elvan            Korkmaz, Elvan
#> 33            Korkmaz, Elvan       Korkmaz-Emre, Elvan
#> 34           Krestel, Holger             Ortel, Holger
#> 35          Lange, Christian           Jung, Christian
#> 36             Lehmann, Jens             Lehmann, Sven
#> 37             Lehmann, Sven             Lehmann, Jens
#> 38        Lengsfeld, Philipp      Lerchenfeld, Philipp
#> 39      Lerchenfeld, Philipp        Lengsfeld, Philipp
#> 40      Lerchenfeld, Philipp Lerchenfeld, Philipp Graf
#> 41 Lerchenfeld, Philipp Graf      Lerchenfeld, Philipp
#> 42          Leutert, Michael           Luther, Michael
#> 43           Luther, Michael          Leutert, Michael
#> 44             Moll, Claudia           Müller, Claudia
#> 45           Müller, Claudia             Moll, Claudia
#> 46        Neskovic, Wolfgang        Nešković, Wolfgang
#> 47        Nešković, Wolfgang        Neskovic, Wolfgang
#> 48            Neumann, Bernd            Baumann, Bernd
#> 49           Neumann, Martin          Rosemann, Martin
#> 50             Ortel, Holger           Krestel, Holger
#> 51             Özoguz, Aydan             Özoğuz, Aydan
#> 52             Özoğuz, Aydan             Özoguz, Aydan
#> 53            Pantel, Sylvia             Canel, Sylvia
#> 54          Rosemann, Martin           Neumann, Martin
#> 55          Rosemann, Martin           Hohmann, Martin
#> 56             Roth, Michael            Donth, Michael
#> 57          Scheuer, Andreas           Steier, Andreas
#> 58              Schulz, Swen               Schulz, Uwe
#> 59               Schulz, Uwe              Schulz, Swen
#> 60           Schuster, Armin          Schuster, Marina
#> 61          Schuster, Marina           Schuster, Armin
#> 62             Seitz, Thomas             Feist, Thomas
#> 63             Seitz, Thomas           Stritzl, Thomas
#> 64           Steier, Andreas          Scheuer, Andreas
#> 65           Stritzl, Thomas             Seitz, Thomas
#> 66          Wadephul, Johann    Wadephul, Johann David
#> 67    Wadephul, Johann David          Wadephul, Johann
#> 68          Wiehle, Wolfgang         Wieland, Wolfgang
#> 69         Wieland, Wolfgang          Wiehle, Wolfgang
#> 70          Wirth, Christian          Hirte, Christian

Above I chose a quite permissive threshold of 0.6. I wouldn’t normally do this, but as some names are quite short, even a few changed characters might have a strong impact on the string distance. Many of the pairings above are in fact different deputies, but some names have to be corrected:

voting_data_cleaned <- voting_data_no_titles %>% 
  mutate(name_clean = case_when(
            name_clean == "Brugger, Agnieszka" ~ "Brugger, Agnes", # http://webarchiv.bundestag.de/archive/2013/1212/bundestag/abgeordnete17/biografien/B/brugger_agnes.html
            name_clean == "Dagdelen, Sevim" ~ "Dağdelen, Sevim",
            name_clean == "Özoguz, Aydan" ~ "Özoğuz, Aydan",
            name_clean == "Jantz, Christina" ~ "Jantz-Herrmann, Christina", # name changed due to marriage?
            name_clean == "Aken, Jan" ~ "Aken, Jan van", # van forgotten in some records
            name_clean == "Lerchenfeld, Philipp" ~ "Lerchenfeld, Philipp Graf", # title forgotten in some records
            name_clean == "Neskovic, Wolfgang" ~ "Nešković, Wolfgang", # wrong spelling in some polls
            name_clean == "Korkmaz, Elvan" ~ "Korkmaz-Emre, Elvan", # name changed due to marriage?
            name_clean == "Wadephul, Johann" ~ "Wadephul, Johann David", # David forgotten in some records
            TRUE ~ name_clean
          ))

voting_data_cleaned %>% 
  distinct(name_clean) %>% 
  nrow()
#> [1] 1198

As expected this reduces the unique names by another 9 to 1198 distinct deputy names.

Where does this stop?

One thing I’ll still be missing is deputies who completely changed their last name, e.g. due to marriage or divorce. To catch those I would have to check against the reference basic deputy data mentioned above.
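
A sketch of what such a check might look like, assuming the official deputy data had been read into a tibble deputies_reference with a name_clean column in the same “last name, first name” format (both the object and the column are hypothetical here):

# names in the voting data with no match in the (hypothetical) reference data
# would be candidates for a manual check
voting_data_cleaned %>% 
  distinct(name_clean) %>% 
  anti_join(deputies_reference, by = "name_clean")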

But I decided to keep the rest of the data as it is. Since I’m not planning on doing any world-shaking inferences, the effort of manually checking every record against the official deputy data with all official name changes, earned or lost titles etc. seemed disproportionate. I’m going for an exploratory analysis.

Joining the datasets and Conclusion

After reading and cleaning the data, it’s time to save it and move on to the next questions to be answered!

poll_meta %>% 
  write_csv("../../../data_sources/2021_bundestag/parsed/poll_meta.csv")

voting_data_cleaned %>% 
  write_csv("../../../data_sources/2021_bundestag/parsed/voting_data_cleaned.csv")

voting_data_complete <- voting_data_cleaned %>% 
  left_join(poll_meta, by = "p_id") %>% 
  select(-c(imgURL, URL, id))

voting_data_complete %>% 
  write_csv("../../../data_sources/2021_bundestag/parsed/polls_voting_complete.csv")

This concludes this blog post. I didn’t do any “analysis” of the data or any visualization here. I still hope it was an interesting read, and I’m looking forward to the next post, where I can finally do some actual analysis.


  1. Actually I hadn’t tried web scraping before, so it also intrigued me as a possibility to learn this.↩︎

  2. I used this dataset in the first part of this series↩︎

  3. The differences might be caused by different systems/encodings when collecting/publishing or scraping the data.↩︎

Citation

For attribution, please cite this work as

Gebhard (2021, March 22). jolly data: Bundestag Part II.1 - Go back to start.... Retrieved from https://jollydata.blog/posts/2021-03-22-bundestag-part-ii1/

BibTeX citation

@misc{gebhard2021bundestag,
  author = {Gebhard, Christian A.},
  title = {jolly data: Bundestag Part II.1 - Go back to start...},
  url = {https://jollydata.blog/posts/2021-03-22-bundestag-part-ii1/},
  year = {2021}
}