ParlLawSpeech
Tutorials
# Tutorial author: Christian Rauh (@ChRauh) - 2025-01-20
# Packages used for creation of these tutorials ####
library(here) # 1.0.1; directory and file management
library(knitr) # 1.39, markdown functions
library(tidyverse) # 1.3.1, data wrangling and visualization
# KnitR options
knit_hooks$set(inline = function(x) { # Neater output of numbers from inline code (NB: assuming all is numeric!)
prettyNum(x, big.mark=",")
})
The structure provided by the ParlLawSpeech data - and in
particular the ability to link full texts of bills to MP speeches and
adopted laws - allows fine grained comparative analyses of the
decision-making in the parliamentary process. Of course, answering
substantial research questions with these data will often require good
theoretical and contextual information about the politics of legislative
debate in the respective parliament (for good introductions see Bäck
et al. 2022, for example) and retrieving concepts of interest from
the dense text data we provide may at times require advanced approaches
of natural language processing (for an overview of what is possible see
Jurafski & Martin
2021, for example).
But the potential of linked parliamentary text data can already be
demonstrated with comparatively low-key
examples. In the following we provide three such exemplary
applications that should help interested users in getting up to speed
with analysing the encompassing data that ParlLawSpeech
offers.
To this end, our tutorials directly include the code to reproduce (or to
adapt) the exemplary analyses. We work in the free and open-source R environment (specifically in R
version 4.4.2, ideally to be used with a dedicated IDE such as RStudio) and resort to a few add-on
packages that are loaded into the running R session here:
# Load packages used in the tutorials
library(here) # 1.0.1; directory and file management
library(tidyverse) # 1.3.1, data wrangling and visualization
library(lubridate) # 1.8.0; make dealing with dates a little easier
library(patchwork) # 1.1.1, compile multiple graphs into one
library(quanteda) # 3.2.1, encompassing, powerful suite for bags-of-words text analysis, see https://quanteda.io/
library(quanteda.textstats) # 0.95, textual statistics for characterizing and comparing textual data
If these packages are not yet available in your local setup, please
install them first with the following code:
install.packages(c("here", "tidyverse", "lubridate", "patchwork", "quanteda", "quanteda.textstats"), dependencies = TRUE)
Now you are ready to get started with analysing ParlLawSpeech!
Tutorial 1:
How much are legislatives bills debated on the
plenary floor?
Members of parliament (MPs) fulfill important representative and
communicative functions in linking societal preferences to legislative
decisions. They need to demonstrate that they represent their
constituencies in decision-making and they are crucial for communicating
and assessing the laws that are discussed and decided in parliament.
Plenary debates offer a useful and usually publicly very visible forum
to live up to these ideals.
But given the myriad of legislation that modern parliaments have to
process, one may reasonably ask: How many bills actually
reach the open debates on the plenary floor? How much plenary attention
do bills actually receive?
One basic indicator to systematically approach these questions would be
the average number of speeches per bill tabled in parliament. This
indicator, however, requires data linkage: one first needs to
know which speech covers which bill.
During our data collection we have carefully inspected, coded, and
ultimately linked the agenda information on plenary debates in
parliamentary archives with the meta information in document databases
of bills and adopted laws. The ParlLawSpeech files thus can provide
exactly this link through the unique
procedure_ID variable. In this example, we
exploit and illustrate this in the example of the Hrvatski
sabor, the unicameral legislature of the Republic of Croatia (HR),
which our data covers throughout the 2003-2020 period.
So let’s first load the respective speech data set (we are assuming that
you have PLS country folders and files in you current working directory
- if not you will have to adapt the path to the files
here).
speeches <- read_rds(here("Croatia", "Corpus_speeches_croatia.RDS"))
We then aggregate these data along the procedure_ID variable
and count how many speeches we have for each value on this variable.
speechesPerBill <- speeches %>% # Copy of the speeches data set ...
group_by(procedure_ID) %>% # ... group by bill-specific ID ...
summarize(nspeeches = n()) %>% # ... sumarise by counting the number of obs (=speeches) per ID ...
filter(procedure_ID != "") %>% # ... and ignore speeches that did not cover specific bills.
ungroup() %>%
arrange(desc(nspeeches)) # Sort bills in descending order by number of corresponding speeches
kable(head(speechesPerBill))
procedure_ID | nspeeches |
---|---|
P.Z.519_09 | 2253 |
P.Z.791_09 | 1935 |
P.Z.498_09 | 1299 |
P.Z.251_07 | 1270 |
P.Z.6_09 | 1107 |
P.Z.220_09 | 1069 |
Nice. But as we have started from the speech data set, we now only ‘see’
those bills that were actually covered by at least one speech on the
plenary floor. We thus might miss bills that have not have been debated
at all. With ParlLawSpeech we can check this with the help of the bills
data set provided for each country.
Let’s load the respective Croatian data and compare it to the aggregate
speeches-per-bill data we have just created above.
bills <- read_rds(here("Croatia", "Corpus_bills_croatia.RDS")) %>%
select(procedure_ID, title_bill) %>%
unique() # remove duplicates
nrow(bills)
## [1] 3676
nrow(speechesPerBill)
## [1] 2879
We see that the bills data set contains 797 rows more than we have
retrieved from the speeches data set, suggesting that many bills
actually never reached the plenary floor.
Before digging into this further, there is a second second issue to
note: parliaments often discuss several bills together in one
debate. In such instances our data contain multiple entries in the
procedure_ID variable. Look at one example:
speechesPerBill$procedure_ID[13]
## [1] "P.Z.452_09, P.Z.453_09, P.Z.454_09"
This debate apparently covered three bills (with consecutive numbering
along the conventions of the Croatian parliament). From the meta data
alone, unfortunately, it is impossible to say whether an individual
speech in this particular debate referred only to one, to two, or to all
of these bills (which are probably closely related anyway). For the
purposes here, we attribute each speech to each of the respective bills
in this debate (one alternative assumption may be that only one third of
the speeches can be attributed to each bill, but let’s keep it simple
for now…).
On this basis, we can now traverse through our full bill data for
Croatia and look up the number of speeches that each bill has received
in our earlier aggregation of the speech data set.
bills$nspeeches <- 0 # Set the number of speeches per bill initially to 0
for (i in 1:nrow(bills)) { # Loop through all bills ...
# ... look up in which row of the speechesPerBill aggregation the bill id occurs ...
row <- which(str_detect(fixed(speechesPerBill$procedure_ID), bills$procedure_ID[i]))
if (length(row)>0) { # ... and if the bill was debated on the plenary floor ...
# ... store the number of respective speeches in the bills data
bills$nspeeches[i] <- speechesPerBill$nspeeches[row[1]]
}
}
bills <- bills %>%
arrange(desc(nspeeches)) # Sort the data in descending number of speeches per bill
This provides the final number of plenary speeches per each
individual bill that was tabled in the Croatian parliament between 2003
and 2020. What does this information say about our initial
questions? Have a look at the basic summary statistics first.
summary(bills$nspeeches)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 14.00 39.00 88.76 106.25 2253.00
We get a first answer: In the Croatian parliament, an average
legislative bill is debated in 88.76 speeches. Quite a bit of
debate, indeed.
But wait, the minimum value is 0! As suspected there are also bills that
are never debated on the plenary floor. Let’s have a look at their
overall share:
mean(bills$nspeeches == 0)*100
## [1] 7.725789
This provides a second answer: about 8 per cent of bills are
actually never debated in the plenary of the Croation parliament
…
Moreover, the summary statistics above suggest that the distribution of
speeches per bill is rather skewed. The median value is ‘only’ 39
speeches per bill but the maximum goes up to 2,253 speeches. Let’s
visualize this distribution.
ggplot(bills, aes(x=nspeeches))+
geom_histogram(bins = 60, fill = "#0063a6", alpha = .8)+
labs(title = "Plenary speeches per legislative bill in the Croatian Hrvatski sabor, 2003-2020",
x = "Number of speeches per bill (binned)",
y = "Frequency")+
theme_bw()+
theme(
plot.title = element_text(face = "bold.italic"),
strip.background = element_rect(fill= NA),
panel.grid.minor = element_blank())
Plenary attention to individual bills is actually very
skewed. In other words, speeches in the plenary debates
concentrate on a few bills. Of the 3,676 bills, only 1,590 receive more
than 50 speeches (in a parliament that consists of 151 seats at the time
of writing).
Moreover, a few bills receive extraordinary plenary
attention. As we have sorted the data set in descending order
above, we can readily look at the bill that received the most MP
speeches in the Croatian Hrvatski sabor between 2003 and 2020.
kable(bills[1,])
procedure_ID | title_bill | nspeeches |
---|---|---|
P.Z.519_09 | Prijedlog zakona o financiranju političkih aktivnosti, izborne promidžbe i referenduma | 2253 |
On two days in early 2019, the “Proposal of the law on the financing of
political activities, election campaigns and referendums” (with a little
help by Google Translate …) was debated in 2,253 MP speeches. Party
finances seem to be a hot topic for MPs …
Of course, much more has to be done to turn this initial
demonstration into a substantially meaningful analysis. But already this
simple example shows that the linkages between different types of
parliamentary documents generate systematic insights into the
parliamentary process that were not visible with such precision
before.
From here, more in-depth analysis could, for example, classify the text
in the bill titles to analyse whether plenary attention is
systematically biased for or against certain topics. Or you might be
interested whether and to what extent the concentration of plenary
debates on few specific bills observed for the Croatian parliament is
different from other countries. Go ahead, ParlLawSpeech offers
data to pursue such questions …
Tutorial 2:
How much do legislative bills change during the
parliamentary process?
One often-heard criticism of modern parliaments is that they are
dominated by the executive: parliamentary majorities are accused to just
rubber-stamp the legislative bills that governments serve them rather
than actually shaping the content of binding law. To what
extent does this hold true? How much do legislative bills actually
change during the parliamentary process?
Again such questions require data linkage, here specifically links of
tabled bills and the finally adopted laws. Such questions also require
data along which these pairs of bills and adopted laws can be
systematically compared. The full-text vectors of legislative documents
in ParlLawSpeech encapsulate lots of information useful to that
end.
To illustrate how to exploit these ParlLawSpeech features, this tutorial
resorts to the data for the Spanish Congreso de los Diputados
in the 1996-2023 period.
Let’s load these and join the relevant information for bills and adopted
laws along the procedure_ID variable.
# Load bills data for Spain,
# Select relevant variables and mark government-sponsored bills
bills <- read_rds(here("Spain", "Corpus_bills_spain.RDS")) %>%
select(procedure_ID, title_bill, bill_text, initiator) %>%
mutate(sponsor = ifelse(initiator == "Gobierno",
"Government", "Other") %>%
factor(levels = c("Government", "Other"))) %>%
select(-initiator) %>%
filter(!is.na(sponsor))
# Law data for Spain
laws <- read_rds(here("Spain", "Corpus_laws_spain.RDS")) %>%
select(procedure_ID, title_law, law_text)
# Combine bill and law information
comp <- bills %>%
left_join(laws %>%
select(procedure_ID, title_law, law_text),
by = "procedure_ID")
This linked data initially provides us information on how many bills
actually turned into binding law - as indicated by the presence or
absence of a respective law text. The ParlLawSpeech data furthermore
provide information on who tabled the bill in the first place.
Pulling this together allows us to visualize the success rates of
government-sponsored bills vs. those of bills tabled by other actors
(typically individual partisan factions in the Congreso or autonomous
Spanish regions).
# Calculate bill adoption rates
adoptionrates <- comp %>% # Calculate adoption rates ...
group_by(sponsor) %>% # ... by bill sponsor
summarise(nbills = n(), # Number of bills
adoption.rate = 1 - mean(is.na(law_text))) %>% # Absence of law text as non-adopted
ungroup() %>%
mutate(sponsor2 = paste0(sponsor, "\n(", nbills, " bills)"))
# Plot these data
ggplot(adoptionrates, aes(y = adoption.rate, x = sponsor2, fill = sponsor2))+
geom_col(alpha = .8)+
scale_fill_manual(values = c("#0063a6", "#e41a1c"))+
scale_y_continuous(labels = scales::percent)+
labs(title = "Bill adoption rates in the Spanish Congreso de los Diputados",
subtitle = paste0("Based on ", sum(adoptionrates$nbills), " in the 1996-2023 period"),
x = " \nBill sponsor",
y = "Share of bills\nthat resulted in a binding law\n ")+
theme_bw()+
theme(legend.position = "none")
We clearly find that the likelihood that a bill is adopted by
the Spanish lower house is significantly higher when it comes from the
government (89.6% as opposed to 6.95%). This is initially
consistent with the ‘executive dominance’ criticism.
But, of course, that does not have to mean that parliament adopts these
laws exactly as the government has proposed them. To answer how much
bills are changed during the parliamentary process we have to
systematically compare their content.
For such a comparison, the full-text vectors that ParlLawSpeech offers
come in handy. We first reduce the data to procedures for which both a
bill and an adopted law text are available.
pairs <- comp %>%
filter(!is.na(bill_text) & !is.na(law_text))
Before comparing the text pairs, one disclaimer is in order. As
noted above, the full-text vectors in ParlLawSpeech equal the way in
which the respective source archive provides them so as to give
researchers maximum freedom in selecting, cleaning, and pre-processing
the text data in ways that fit the text analysis of interest. Inversely,
of course, this means that researchers will often need to clean the
texts in ways that suits their planned analysis. We thus recommend
that ParlLawSpeech users carefully inspect and acquaint themselves in
great detail with the structure of the provided texts before plugging
them into automated text analysis algorithms (recall: garbage in
-> garbage out).
With the code below, for example, users can export random examples to
local files, which can then be inspected with standard text editors (we
recommend Notepad++, for
example).
# Pick random row
i <- sample(1:nrow(pairs), 1)
i
# Meta-info on picked example
pairs$title_bill[i]
pairs$title_law[i]
pairs$procedure_ID[i]
# Export bill and law text to local files for inspection
writeLines(pairs$bill_text[i], "BillExample.txt")
writeLines(pairs$law_text[i], "LawExample.txt")
For our interest in bill-to-law change for this tutorial, we want to
compare only the legal substance of the documents, that is the
legal articles that actually stipulate the binding rules in a law. In
other words, we want to remove all recitals, preambles, justifications,
appendices and other boilerplate (which might be relevant for other
analyses, of course).
To achieve this we inspected and cross-checked numerous examples and
noted recurring patterns that would help us to isolate the text bits of
interest. We then generalized these patterns to regular
expressions which we then match across the texts with the functions
of the stringR package
(contained in the tidyverse package we have loaded above).
This is a time-consuming, but often necessary step. For learning how to
use regular expressions in R we can recommend this
resource. For developing specific regular expressions and for
testing them with exemplary texts, sandbox tools such as regexr.com are usually also very
helpful.
Such pre-processing steps should always be well documented and usually
also very extensively validated. For our purposes here we use the
following (admittedly somewhat crude) text cleaning steps to cut the raw
texts before and after their legal substance:
# Cleaning bill texts
pairs$bill_text_redux <-
# Copy of raw bill text
pairs$bill_text %>%
# Missing white spaces after punctuation
str_replace_all("(\\.|,|;)([A-Z])", "\\1 \\2") %>%
# Reduce multiple consecutive whitespaces to one regular whitespace
str_replace_all("\\s+", " ") %>%
# Remove everything before the legal text begins
# "Article 1" heading (only valid if followed by a capitalized word)
str_remove(regex("^.*? PREÁMBULO",
ignore_case = F)) %>% # Remove everything up to the preamble, needed in case law text has an article index
str_replace(regex("^.*? (art(í|i)culo)\\s+(1|(u|ú)nico|pr(í|i)mero)(\\.|:| )\\s*([A-ZÁÉÍÓÚ])",
ignore_case = T), "Artículo 1. X") %>%
# Remove Appendices
# str_remove("\\.\\s+ANEXO\\s+([0-9]|I).*") %>%
str_remove(regex("\\s+ANEXO\\s+([0-9]|I)\\s.*?$", ignore_case = F)) %>%
# Remove HTML and other gibberish
str_remove_all("<.*?>") %>%
str_remove_all("\\[\\*.*?\\*\\]") %>% # Some strange table / page formatting
# Final, non-substantial edits
str_replace_all("[[:punct:]]", " ") %>%
str_replace_all("\\s+", " ") %>%
str_trim() %>%
tolower()
# Cleaning law texts
pairs$law_text_redux <-
# Copy of raw law text
pairs$law_text %>%
# Missing whitespaces after punctuation
str_replace_all("(\\.|,|;)([A-Z])", "\\1 \\2") %>%
# Reduce multiple consecutive whitespaces to one regular whitespace
str_replace_all("\\s+", " ") %>%
# Remove everything before the legal text begins
# "Article 1" heading (only valid if followed by a capitalized word)
str_remove(regex("^.*? PREÁMBULO",
ignore_case = F)) %>% # Remove everything up to the preamble, needed in case law text has an article index
str_replace(regex("^.*? (art(í|i)culo)\\s+(1|(u|ú)nico|pr(í|i)mero)(\\.|:| )\\s*([A-ZÁÉÍÓÚ])",
ignore_case = T), "Artículo 1. X") %>%
# Remove Appendices
# str_remove("\\.\\s+ANEXO\\s+([0-9]|I).*") %>%
str_remove(regex("\\s+ANEXO\\s+([0-9]|I)\\s.*?$", ignore_case = F)) %>%
# Remove HTML and other gibberish
str_remove_all("<.*?>") %>%
str_remove_all("\\[\\*.*?\\*\\]") %>% # Some strange table / page formatting
# Final, non-substantial edits
str_replace_all("[[:punct:]]", " ") %>%
str_replace_all("\\s+", " ") %>%
str_trim() %>%
tolower()
# Re-check examples
# i <- sample(1:nrow(pairs), 1)
# i
# writeLines(pairs$bill_text_redux[i], "BillExample.txt")
# writeLines(pairs$law_text_redux[i], "LawExample.txt")
# Keep only bills that actually draft legal text
# (in a few instances Spnish bills just describe rather than actually propose legal text)
pairs <- pairs %>%
filter(str_detect(bill_text_redux, "artículo 1 x"))
# Remove everything after the final 'entry into force' article in each law text/proposal
pairs$bill_text_redux <-
pairs$bill_text_redux %>%
str_remove("entrará en vigor (a|e)l día siguiente (al ){0,1}de su publicación en el boletín oficial del estado.*?$")
pairs$law_text_redux <-
pairs$law_text_redux %>%
str_remove("entrará en vigor (a|e)l día siguiente (al ){0,1}de su publicación en el boletín oficial del estado.*?$")
# Filter case with apparent archive errors
pairs <- pairs %>%
filter(procedure_ID != "121/000065 Leg.6") %>% # Bill and law title don't match
filter(procedure_ID != "121/000012 Leg.7") %>% # Bill and law title don't match
filter(procedure_ID != "121/000033 Leg.8") # Bill text doesn't match title
# Store intermediate results (so to avoid having to repeat the above - this takes some time)
write_rds(pairs, here("TutorialOutputs", "ES_CleanedBillsAndLaws.RDS"))
Having reduced the bill and law text to their legal core, we can now
compare them pair by pair.
An initial, very simple indicator for legal change during the
parliamentary process is the amount of words that are added or
removed from the bill before it is adopted as law.
In the following steps we thus count the number of words in each text
(again using a regular expression) and then plot the length of bills
against the length of the respective laws.
# Reloaded the cleaned texts
pairs <- read_rds(here("TutorialOutputs", "ES_CleanedBillsAndLaws.RDS"))
# Word counts
pairs$billlength <- str_count(pairs$bill_text_redux, "\\w+")
pairs$lawlength <- str_count(pairs$law_text_redux, "\\w+")
pairs$lengthdiff <- pairs$lawlength - pairs$billlength
# Plot length differences
ggplot(pairs ,aes(x = billlength, y = lawlength, color = sponsor))+
geom_abline(intercept = 0, slope = 1)+ # 45° line indicating identical word counts of bills and laws
geom_point(alpha = .6)+
# geom_smooth(method = "loess")+
scale_color_manual(values = c("#0063a6", "#e41a1c"))+
labs(title = "Length of legislative bills vs. length of adopted laws",
subtitle = "Spanish Congreso de los Diputados, 1996-2023",
x = " \nNumber of words in bill text",
y = "Number of words in law text\n ",
color = "Bill sponsor: ")+
theme_bw()+
theme(legend.position = "bottom")
Apparently the majority of legislative procedures clusters around the
45° degree line which indicates that the length of the finally adopted
law largely equals the length of the initial bill. But we also note
quite a number of deviating cases mostly significantly above the 45°
line.
In other words, the word count of bills changes often only very
little during the parliamentary process in the Spanish Congreso, but
when it does the resulting laws tend to become longer than the initial
bill.
Of course, an apparently by-and-large stable word count may still
mask significant and political change depending on whether and which
words are added, removed or replaced. Such within-text change can be
aggregated by comparing word frequency matrices and normalizing their
overlap, for example by the Cosine
similarity measure.
But not only the differing frequency of words but also their order can
matter a great deal in politics. Consider the following hypothetical
examples:
- We prioritize the environment over the economy!
- We prioritize the economy over the environment!
Both sentences have the same length and they also contain exactly the
same words (the Cosine similarity of their word frequency matrix equals
1). But in terms of political meaning, they are arguably very
different.
One approach to retrieve a frequency-based similarity measure that is at
least somewhat sensitive to word order is to tokenize (= split) the
texts not into individual words but into sequences of consecutive words
(so-called ngrams). Tokenizing the above examples into overlapping
bigrams (sequences of two consecutive words), for example, would look
like this:
- [We prioritize] [prioritize the] [the environment] [environment over] [over the] [the economy]
- [We prioritize] [prioritize the] [the economy] [economy over] [over the] [the environment]
Of the six bigrams in each example, only five occur in both texts -
resulting in a lower cosine similarity of .83. The longer the ngrams
that we split the text into, the more sensitive our similarity measure
becomes to changes in word order.
Let’s apply these ideas to our text data from the Spanish Congreso. The
following code traverses through all 1,534 bill/law pairs and uses
functions from the quanteda R package
to split each text into overlapping 5-grams, to store their frequency in
a matrix, and to then calculate the Cosine similarity for each each
bill/law pair.
# Empty traget variable
pairs$cosine5 <- NA
# Loop through text pairs,
# get frequency matrix of 5-grams for respective bill and law
# calculate cosine similarity of both matrixes
for (i in 1:nrow(pairs)) {
a.mat <- corpus(pairs$bill_text_redux[i]) %>%
tokens() %>% tokens_ngrams(n = 5) %>%
dfm() %>% dfm_weight("prop")
b.mat <- corpus(pairs$law_text_redux[i]) %>%
tokens() %>% tokens_ngrams(n = 5) %>%
dfm() %>% dfm_weight("prop")
pairs$cosine5[i] <- textstat_simil(a.mat, b.mat,
method = "cosine",
margin = "documents")[1]
}
# Store intermediate result
# So as not having to run this again and again (a little bit of computation is involved here...)
write_rds(pairs %>% select(procedure_ID, sponsor, cosine5), here("TutorialOutputs", "ES_BillLawCosine.RDS"))
Based on 5-word sequences, this measure gives us a relative estimate
of the similarity between the bill and the finally adopted
law.
Inverting it accordingly allows us to plot legislative change during the
parliamentary process across the 1,534 bills/law pairs for the 1996-2023
period we observe here.
# Reload the Cosine similarity data
pairs <- read_rds(here("TutorialOutputs", "ES_BillLawCosine.RDS"))
# Invert cosine similarity to express change rather than similarity
pairs$cosine5inv <- 1 - pairs$cosine5
# Calculate average text changes
avchange <- pairs %>%
group_by(sponsor) %>%
summarise(avchange = mean(cosine5inv))
# Plot full distribution
ggplot(pairs,aes(x = cosine5inv, color = sponsor, fill = sponsor, ..scaled..))+
geom_density(alpha = .6, color = NA)+
facet_wrap(~sponsor, nrow = 2,
strip.position = "right")+
geom_vline(data=filter(pairs, sponsor=="Government"), aes(xintercept= mean(cosine5inv)), colour="#0063a6", linetype = "dashed")+
geom_vline(data=filter(pairs, sponsor=="Other"), aes(xintercept= mean(cosine5inv)), colour="#e41a1c", linetype = "dashed")+
# geom_density_ridges(alpha = .6, scale = 1.2, rel_min_height = 0.001)+
# stat_density_ridges(quantile_lines = TRUE, alpha = .6, scale = 1.2, rel_min_height = 0.001)+
labs(title = "Change in wording from bill to law, Spanish Congreso 1996-2023",
subtitle = paste0("Only around ",
round(avchange$avchange[avchange$sponsor == "Government"]*100,2) ,
"% of the text is changed in government bills. \nFor the few bills from other sponsors this average increases to ",
round(avchange$avchange[avchange$sponsor == "Other"]*100,2),
"%."),
x = " \nEstimated change of text before parliament adopts a bill into binding law\n(Inverted cosine similarity of bill and law texts based on moving 5-gram text windows)",
y = " ")+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = expansion(mult = c(0, .1)))+
scale_color_manual(values = c("#0063a6", "#e41a1c"))+
scale_fill_manual(values = c("#0063a6", "#e41a1c"))+
coord_cartesian(xlim = c(0, 1))+
theme_bw()+
theme(legend.position = "none",
axis.text = element_text(color = "black"),
strip.text = element_text(face = "bold"),
strip.background = element_rect(fill= NA),
plot.title = element_text(face = "bold.italic"))
This highlights that government bills actually change very
little during the parliamentary process. On average (and based
on comparing 5-gram sequences of words), only 9.32% of the text is
changed in a government bill before parliament adopts it as law. Looking
at the full distribution, moreover, suggests that this average is driven
by a few outliers - the median government bill ‘experiences’ only 4.11%
of textual change during processing in parliament.
In contrast, the few adopted bills tabled by other sponsors are
changed much more significantly during the parliamentary
process. On average, more than a third of the text of such
bills is altered and the overall distribution across all such bills is
much flatter than that for government-sponsored bills.
Again, this is a relatively quick analysis. Users wishing to push
this further may invest in more careful pre-processing and especially
look into minimum edit distance algorithms that have been recently
successfully applied to study legislative change (e.g Cross
and Hermanson, 2017, Rauh
2020). These tools can then be used to test theories about the
legislative influence of parliaments in a comparative fashion,
e.g. across specific bills, across different governments, or even across
countries.
For now, however, also this comparatively simple example showcases the
analytical potential in the availability of linked full-text data on the
parliamentary process. What can you do with
it?
Tutorial 3:
How does the debate on a specific bill differ from
other debates?
In many instances analysts will be interested in whether and
how the debate on one specific bill differs from typical behavior in the
respective parliament. To illustrate such an application
with the ParlLawSpeech data, this tutorial focuses on the legal
framework for containing COVID-19 in Germany.
Specifically, we look at the third installment of the so-called ‘Law for
the Protection of the Population in an Epidemic Situation of National
Concern’ (‘Drittes Gesetz zum Schutz der Bevölkerung bei einer
epidemischen Lage von nationaler Tragweite’) which was introduced
to the German Bundestag on November 3 2020.
In a context of still growing infection rates, legal and political
disputes on prior pandemic measures, and partially massive street
protests, this bill aimed to regulate and to define the conditions under
which the executive could enact partially far-reaching interventions
such as lockdowns or prohibitions of rallies, amongst others.
In this context, the grand-coalition government (Merkel IV) aimed for a
broad parliamentary debate that would ideally generate support of the
proposed measures (which were formally tabled not by the government for
that reason but by the faction leaders of the governing parties CDU/CSU
and SPD). So how much plenary attention did the different parties in
the German Bundestag devote to debating this law in comparison to
others? And did the resulting debate actually transcend the typical
government-opposition dynamics?
To approach these exemplary questions in a systematic manner, we first
load the speech data set for the German Bundestag and filter it for
meaningful comparison. Specifically, we include speeches that …
- were given during the Merkel IV government,
- were not given by the chair of the debate and were longer than five
words (excluding merely organisational statements),
- addressed any legislative bill,
- and were given by a member of a political faction.
speeches <- read_rds(here("Germany", "Corpus_speeches_germany.RDS")) %>%
select(date, chair, speaker, party, text, procedure_ID) %>%
filter(date >= "2018-03-14" & date < "2021-12-08") %>% # Debates during the Merkel IV government
filter(!chair) %>% # Drop organisational speeches from the chair
filter(str_count(text, "\\w+") >= 5) %>% # Keep only speeches longer than 5 words
filter(procedure_ID != "") %>% # Keep only bill-specific debates | TO DO: encode as NA
filter(!is.na(party) & party != "fraktionslos") # Drop non-partisan speakers
To identify which of these 7,148 speeches address the bill we are
interested in here, we initially load the bill data for Germany and also
reduce it to all bills tabled during the Merkel IV government.
# Load the German bill data
bills <- read_rds(here("Germany", "Corpus_bills_germany.RDS")) %>%
select(procedure_ID, initiator, initiation_date, title_bill) %>%
filter(initiation_date >= ymd("2018-03-14") & initiation_date < ymd("2021-12-08"))
Knowing the German title of the proposed law and the initiation date, we
can then retrieve the respective procedure_ID which provides
the link to the relevant speeches.
billsOfInterest <- bills %>%
filter(str_detect(title_bill, "Schutz der Bevölkerung bei einer epidemischen Lage von nationaler Tragweite")) %>% filter(initiation_date == ymd("2020-11-03"))
kable(billsOfInterest)
procedure_ID | initiator | initiation_date | title_bill |
---|---|---|---|
19/23944/19 | Fraktion der CDU/CSU, Fraktion der SPD | 2020-11-03 | Drittes Gesetz zum Schutz der Bevölkerung bei einer epidemischen Lage von nationaler Tragweite |
To assess the plenary attention to this bill (in comparison to all
others under the Merkel IV government), we proxy bill-specific
speaking time by the number of words that speakers from
each faction uttered on each bill.
NumberOfWords <-
# Start from the speech data ...
speeches %>%
# ... count number of words with a regular expression matching groups of so-called word characters
# including upper and lower case charcters as well as numbers ...
mutate(wordcount = str_count(text, "\\w+")) %>%
# ... group by bill and partisan faction ...
group_by(procedure_ID, party) %>%
# ... and retrieve the sum of words spoken for each group ... #
summarise(wordcount = sum(wordcount)) %>%
# ... mark the law of interest, using it's ID retrieved above ...
mutate(covid = str_detect(procedure_ID, "19/23944/19")) %>%
# ... and summarise across this law and all others ...
group_by(covid, party) %>%
# .. along the mean number of words spoken and a bootstrapped confidence interval.
summarise(ci = list(mean_cl_boot(wordcount) %>%
rename(mean=y, lwr=ymin, upr=ymax))) %>%
unnest(cols = c(ci))
Then we provide more telling labels for the bill group variable. We also
order the parties by their share of seats in the 19th German Bundestag
as this should be be roughly proportional to their speaking time
according to the Bundestag’s
internal rules.
# Label debate
NumberOfWords$covid2 <-
ifelse(NumberOfWords$covid == T,
"on the third installment of the Infection Protection Act",
"on all other bills debated during the Merkel IV government (average)") %>%
factor(levels = c("on all other bills debated during the Merkel IV government (average)", "on the third installment of the Infection Protection Act"))
# Order parties (by size of faction in 19th Bundestag)
NumberOfWords$party2 <- factor(NumberOfWords$party,
levels = c("BÜNDNIS 90/DIE GRÜNEN", "DIE LINKE", "FDP", "AfD", "SPD", "CDU/CSU")) %>%
fct_rev()
This data then allows us to comparatively visualize the plenary
attention that each party devoted to the third installment of the
Infection Protection Act - in comparison to all other bills during the
Merkel IV government.
# Plot number of speeches
ggplot(NumberOfWords, aes(x = party2, y = mean, ymin = lwr, ymax = upr, color = party2, fill = party2, group = covid2))+
geom_linerange(position = position_dodge(width = .5))+
geom_col(position = position_dodge(width = .5), width = .5, aes(alpha = covid2))+
scale_color_manual(values = c("black", "#E3000F", "#3c9dde", "#ffd600", "#b61c3e", "#46962b"), guide = "none")+
scale_fill_manual(values = c("black", "#E3000F", "#3c9dde", "#ffd600", "#b61c3e", "#46962b"), guide = "none")+
labs(title = "Length of partisan speech on\nthe third installment of the Infection Protection Act",
subtitle = "Compared to all other bill-specific debates in the Bundestag during the Merkel IV government (2018-2021)",
x = " \nParty\n(sorted by size of faction in the 19th German Bundestag)\n",
y = "Sum of words\n in plenary speeches on individual bills\n ",
alpha = "Debate:")+
theme_bw()+
theme(legend.position = "bottom",
axis.text = element_text(color = "black"))
This figure initially shows that all partisan factions in the German
Bundestag spoke significantly more on the third installment of the
Infection Protection Act than they spoke on the average bill during the
Merkel IV government. This act indeed garnered plenary attention
way above average levels.
The data also suggest that partisan speaking time is indeed roughly
proportional to the seat share a party holds in the Bundestag (recall
that the x-axis is sorted along this share). But interestingly, the
increase of speaking time on the Third Infection Protection Act does not
look really proportional. Let’s inspect the increases by party
numerically.
NumberOfWordsWide <- NumberOfWords %>%
select(party, covid, mean) %>%
mutate(covid = ifelse(covid, "InfectionAct", "OtherBills")) %>%
pivot_wider(id_cols = party, names_from = covid, values_from = mean) %>%
mutate(IncreaseFactor = round(InfectionAct/OtherBills, 1)) %>%
arrange(desc(IncreaseFactor))
names(NumberOfWordsWide) <- c("Party", "Average words on other bills", "Words on Infection Act", "Increase Factor")
kable(NumberOfWordsWide)
Party | Average words on other bills | Words on Infection Act | Increase Factor |
---|---|---|---|
AfD | 1103.3285 | 3242 | 2.9 |
CDU/CSU | 2711.6211 | 7795 | 2.9 |
FDP | 1030.9791 | 2885 | 2.8 |
SPD | 1934.3113 | 4394 | 2.3 |
DIE LINKE | 915.9644 | 1813 | 2.0 |
BÜNDNIS 90/DIE GRÜNEN | 977.9162 | 1757 | 1.8 |
This confirms the visual impression. In contrast to what the internal
Bundestag rules would lead us to expect and as compared to the average
bill, some parties seem to have increased their speaking time on
the Infection Protection Act more than others.
The conservative CDU/CSU faction as well as two largest opposition
parties, the far-right AfD and the liberal FDP, increased their number
of words by almost a factor of three compared to the average bill. In
contrast, MPs from the social-democratic SPD and from the two smaller
opposition parties, the Left and the Greens, roughly only doubled their
average amount of bill-specific speech.
Thus, the disproportional increases in plenary attention did not pit
government against opposition parties but rather seem to divide parties
tending to the right of the German political spectrum from those tending
towards the left.
How did these partisan speakers position themselves on this Third
Infection Protection Act more substantially?
To pursue this question we exploit the full-text data that ParlLawSpeech
offers and build on Proksch
et al (2018, LSQ) who demonstrate that the sentiment expressed
in bill-specific partisan speeches reliably reveals
government-opposition dynamics in plenary speeches.
In their simplest form, sentiment analyses classify texts based on their
ratio of positive to negative words, drawn from a separate dictionary.
For our exemplary application we use a publicly available
dictionary of positive and negative terms that has been shown to map
well on human impressions of German political language (Rauh
2018, JITP).
# Sentiment dictionary presented in Rauh (2018, Journal of Information Technology and Politics)
# Publically available at https://doi.org/10.7910/DVN/BKBXWD
load(here("TutorialInputs", "Rauh_SentDictionaryGerman.Rdata")) # Creates object called 'sent.dictionary'
This dictionary contains 17,330 positively and 19,750 negatively connoted German terms. To count how often these terms occur in our parliamentary speeches, we rely on the quanteda R package which offers a powerful suite of frequency-based text analysis tools, including dictionary-based approaches.
# Transform the Rauh word lists into 'dictionary' object as used by the quanteda package
# Essentially two named lists containing terms with positive (pos) and negative (neg) sentiment scores
dict <- list(pos = sent.dictionary$feature[sent.dictionary$sentiment == 1],
neg = sent.dictionary$feature[sent.dictionary$sentiment == -1]) %>%
dictionary()
We use quanteda’s core functions to tokenize each of our speeches into individual words, construct a document-frequency-matrix, in which we then count the occurrence of positive and negative terms from the sentiment dictionary.
sentiCount <-
corpus(speeches$text) %>%
tokens() %>%
dfm() %>%
dfm_lookup(dictionary = dict) %>%
convert(to = "data.frame")
We then calculate a sentiment score for each speech as the log-odds ratio of positive to negative words (cf. Lowe et al. 2011, Proksch et al 2018) and add this information to the speech data set.
sentiCount$score <- log((sentiCount$pos + .5)/ (sentiCount$neg +.5))
speeches$sentiment <- sentiCount$score
Finally, we mark the debate on the Third Infection Protection Act in this data and clean up some of the other group labels.
# Mark law
speeches$covid <- ifelse(speeches$procedure_ID == "19/23944/19",
"on the third installment of the Infection Protection Act",
"on all other bills debated during the Merkel IV government") %>%
factor(levels = c("on the third installment of the Infection Protection Act", "on all other bills debated during the Merkel IV government"))
# Order parties
speeches$party2 <- factor(speeches$party,
levels = c("DIE LINKE", "AfD", "BÜNDNIS 90/DIE GRÜNEN", "FDP", "SPD", "CDU/CSU"))
# Mark governing parties
speeches$gov <- ifelse(speeches$party == "CDU/CSU" | speeches$party == "SPD", "Government parties", "Opposition parties")
Now we have all the information we need in one place and can
visualize the sentiment that partisan speeches expressed on the
Third Infection Protection Act in comparison to all other 820 bills
tabled in the German Bundestag during the Merkel IV grand-coalition
government.
ggplot(speeches, aes(x = party2, y = sentiment, colour = party2, shape = covid))+
geom_hline(yintercept = mean(speeches$sentiment, na.rm = T), linetype = "dashed")+
stat_summary(geom = "pointrange", fun.data = mean_cl_boot, position = position_dodge(width = .5), size = .8)+
scale_color_manual(values = c("#b61c3e", "#3c9dde", "#46962b", "#ffd600", "#E3000F", "black"), guide = "none")+
scale_shape_manual(values = c(19, 4))+
guides(shape = guide_legend(nrow=2,byrow=TRUE))+
facet_grid(gov~., scales = "free_y", space = "free_y")+
labs(title = "Expressed sentiment in partisan speeches on\nthe third installment of the Infection Protection Act",
subtitle = "Compared to all other bill-specific speeches in the Bundestag during the Merkel IV government (2018-2021)",
x = " ",
y = "Average sentiment score of individual speech act\nand 95% confidence interval",
shape = "Plenary speeches:")+
coord_flip()+
theme_bw()+
theme(legend.position = "bottom",
legend.box="vertical",
axis.text = element_text(color = "black"))
These data initially confirm the findings of Proksch
et al 2018: The sentiment expressed in bill-specific debates clearly
distinguishes government from opposition parties. The average sentiment
values and their confidence intervals on bills during the Merkel IV
government (marked by cross-hair symbols in the figure) clearly
separate the governing CDU/CSU and SPD factions with above-average
sentiment levels, from the four opposition parties with below-average
sentiment score in their plenary speeches.
However, the debate on the Third Infection Protection Act does
not fit this typical pattern of government-opposition dynamics.
In particular, speakers from the governing CDU/CSU coalition use notably
more negative language than speakers from their coalition partner SPD.
And also the sentiment expressed among the opposition parties differs
strongly. While MPs from the far-right AfD, the liberal FDP, and
partially also from the far-left Linke express much more negative
sentiment than in their usual bill-specific speeches, Green MPs express
an above-average sentiment level that comes close to that expressed in
speeches of social-democratic MPs from the governing
coalition.
In sum, this exemplary analysis suggests that the parliamentary debate
on the major German law to provide the executive with partially
far-reaching counter-measures to contain Covid-19 indeed deviated from
the typical dynamics in Germany’s lower chamber, but probably not along
the lines that the government had hoped for: plenary attention was
indeed higher but also more disproportional than on the average bills,
pushing parties that expressed more negative sentiment which itself was
not structured along the usual split between governing and opposition
parties.
Of course, this exemplary analysis should also not be
over-interpreted - the bill was ultimately accepted with votes from the
governing coalition and the Greens. But it shows that analyzing and
comparing bill-specific debates holds insights that go beyond
the mere analysis of voting results.
From here, further analysis
could dig into party-specific word-frequency patterns, or apply more
advanced NLP methods of aspect-based stance detection or semantic
scaling of the arguments MPs provide. So go ahead, the
ParlLawSpeech data is waiting for you …