ParlLawSpeech: The Data
Overview
The ParlLawSpeech dataset offers 4.02 GB of data in total, including machine-readable full texts of 43,582 bills, 28,124 laws, and 3,092,431 plenary speeches from eight European parliaments, covering between roughly one and almost three decades each.
ParlLawSpeech is meant to advance the systematic analysis of European democracies with text-as-data/NLP methods. Compared to the other excellent political science text corpora (cf. Sebők et al. 2025), it adds two key innovations. First, the bill and law corpora are among the most encompassing full-text collections of legal documents handled by parliaments. Second, we provide novel data linkage possibilities by offering a common identifier across bills, the corresponding plenary speeches, and the laws finally adopted, opening up new analytical opportunities to study the legislative process (see the tutorial section for basic ideas).
This dataset is the result of the OPTED Work Package 5 (WP5) led by Sven-Oliver Proksch, Christian Rauh, and Miklós Sebők and funded by the European Union’s Horizon 2020 program (Grant agreement 951832). Our task was to develop different prototypes for improving access to linked parliamentary text data produced in European democracies.
This page provides a quick overview of the data structure to
potential users. More detailed information is provided in the codebook
shipped with the data repository that is permanently stored at
GESIS.
Data access and usage
The dataset is licensed under the Creative Commons
Attribution 4.0 International (CC BY 4.0) license, the
only authorized version for ParlLawSpeech (PLS). When using the data or any of its
components in your work, you are required to cite the source as follows:
Schwalbach, Jan; Hetzer, Lukas; Proksch, Sven-Oliver; Rauh, Christian; Sebők, Miklós (2025), “ParlLawSpeech”; GESIS, Cologne. doi:10.7802/2824.
Coverage
Two considerations drove the choice of countries/parliaments in the
initial ParlLawSpeech prototypes. First, the collected data should be
usable for diverse comparative research projects by including smaller
and larger EU member states from different regions of the Union
(including the EU itself). Second, we relied on our prior inventory of
parliamentary text sources in Europe (see also Sebők et al. 2025)
and our earlier collections of parliamentary speeches (ParlSpeech)
to identify those official archives from which we could collect the
respective text data within the time and resource constraints of the
project.
In a very labor-intensive process, all data were collected with our own
web scraping scripts, customized to each parliamentary archive and/or
legal database (APIs are available in very few cases only).
Links between the documents were established by semi-automated analyses
of the meta-information or the full texts that the respective archive
provides, which again required detailed customization to the conventions
of that archive.
The table below summarizes the countries and time periods eventually
included in the ParlLawSpeech collection. As an additional goodie, we
can offer large shares of the European and Hungarian speeches readily
translated to English (details on the translation here).
Country | Code | Parliament | Time period |
---|---|---|---|
Austria | AT | Nationalrat | 1996/01 - 2019/09 |
Croatia | HR | Hrvatski sabor | 2003/12 - 2020/05 |
Czech Republic | CZ | Poslanecká sněmovna Parlamentu České republiky | 2013/11 - 2023/10 |
Denmark | DK | Folketing | 2007/11 - 2022/10 |
EU | EU | European Parliament | 1999/07 - 2024/04 |
Germany | DE | Bundestag | 2009/10 - 2021/09 |
Hungary | HU | Országgyűlés | 1994/06 - 2022/03 |
Spain | ES | Congreso de los Diputados | 1996/03 - 2023/07 |
Data structure
For each of the above-mentioned countries/parliaments (including the
EU/EP), ParlLawSpeech provides three separate data files: one for bills,
one for laws, and one for plenary speeches. These three files are
structured along the following columns (core variables):
Corpus_bills_[country].RDS:
Data on legislative bills tabled in the respective parliament
- title_bill: Bill title as provided in the original parliamentary archive
- bill_ID: Identification number of the bill document, following the conventions of the respective parliament
- bill_text: Full text of the bill
- initiation_date: Day on which the bill was tabled
- status: Final decision of the legislative body on the bill
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws
Corpus_laws_[country].RDS:
Data on laws finally adopted by the respective parliament
- title_law: Title of the law as provided in the original archive
- law_ID: Identifier of the law text as used in the respective parliament or the official legislation database of the country
- law_text: Full text of the law
- adoption_date: Day on which the law was published
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws
Corpus_speeches_[country].RDS:
Data on plenary speeches in the respective parliament (in chronological order across and within sessions)
- agenda: Agenda item under which the speech was given (following the conventions of the respective parliament)
- date: Day on which the speech was given (YYYY-MM-DD)
- party: Party and/or partisan faction of the speaker
- speaker: Name of the person who gave the plenary speech
- text: Full text of the respective speech
- speechnumber: Unique speech ID within the session, reflecting the chronological order
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws (here NA for speeches that were not directly linked to a specific bill/law procedure)
If not otherwise indicated, all variables are provided as UTF-8 encoded
strings. The lists above show the minimal data structure that is
available across all countries and periods. Where readily available in
the source archives, individual ParlLawSpeech files include
additional meta information, for example on speaker
roles in the speeches data set, or on document types, committees
involved, and voting results in the bill data sets (in the case of the
EU also including Celex IDs and legal bases).
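To illustrate how the shared procedure_ID ties the three files together, the following is a minimal sketch in Python rather than an official ParlLawSpeech routine. It assumes the data have already been exported from R into lists of records; all values are invented toy data, and the function name link_by_procedure is hypothetical. Only the column names come from the lists above.

```python
from collections import defaultdict

# Toy records using the core column names listed above; all values are invented.
bills = [
    {"bill_ID": "B-1", "title_bill": "Toy bill one", "procedure_ID": "P1"},
    {"bill_ID": "B-2", "title_bill": "Toy bill two", "procedure_ID": "P2"},
]
laws = [
    {"law_ID": "L-1", "title_law": "Toy law one", "procedure_ID": "P1"},
]
speeches = [
    {"speechnumber": 1, "speaker": "Speaker A", "procedure_ID": "P1"},
    {"speechnumber": 2, "speaker": "Speaker B", "procedure_ID": None},  # not linked (NA)
    {"speechnumber": 3, "speaker": "Speaker C", "procedure_ID": "P1"},
    {"speechnumber": 4, "speaker": "Speaker A", "procedure_ID": "P2"},
]


def link_by_procedure(bills, laws, speeches):
    """Group bills, laws, and speeches under their shared procedure_ID.

    Hypothetical helper for illustration. Records without a procedure_ID
    (NA in the original data) are skipped.
    """
    procedures = defaultdict(lambda: {"bills": [], "laws": [], "speeches": []})
    for kind, records in (("bills", bills), ("laws", laws), ("speeches", speeches)):
        for record in records:
            pid = record["procedure_ID"]
            if pid is None:
                continue
            procedures[pid][kind].append(record)
    return dict(procedures)


linked = link_by_procedure(bills, laws, speeches)

# Procedures with a bill but no adopted law in the toy data.
not_adopted = sorted(pid for pid, docs in linked.items() if docs["bills"] and not docs["laws"])
```

Each procedure is stored with lists rather than single records, so the sketch makes no assumption about how many bills, laws, or speeches share one identifier.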
Note that the columns with meta information as well as the full texts
of bills and laws follow the conventions and document structures that
the respective parliamentary archive provides. Staying close to the
original conventions has two advantages. First, it allows users to
filter the ParlLawSpeech data along external knowledge about the
respective parliament. For example, researchers may want to isolate only
debates related to specific document numbers or document titles that
have been identified through the respective parliamentary website or
other qualitative research. Second, sticking to the original document
and data formats provides researchers with maximum freedom regarding
text cleaning and pre-processing choices. For example, researchers may
decide whether or not the recitals and justifications often provided
with a legal document should enter their text analyses.
The downside of largely sticking to the original formats is, however,
that comparative analysis especially across countries might require
additional text cleaning steps at times (for an example, see tutorial 2). We
advise users to inspect the columns of interest carefully before
processing them further. Different analytical interests and methods may
require different text preparation steps.
Files in the repository
The data have been compiled as .rds
files for programming use in the free and open-source R environment. Users working in
other environments can easily export them from R to any other format,
using either base R’s
export functions or add-on packages such as haven or feather.
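For users who export a file from R to CSV (for instance with base R's write.csv) and continue in Python, the sketch below shows one way to read such an export with the standard library. It is an illustration under assumptions, not part of the ParlLawSpeech tooling: the function name read_exported_corpus and the commented file name are hypothetical, and the sample rows are invented. Two details matter in practice: full texts can exceed the csv module's default per-field limit, and R writes missing values as the string NA by default.

```python
import csv
import io

# Full texts can be longer than the csv module's default per-field limit
# (131072 characters), so raise the limit before reading.
csv.field_size_limit(2**31 - 1)


def read_exported_corpus(handle):
    """Read a CSV export of a ParlLawSpeech file into a list of dicts.

    Hypothetical helper: converts R's "NA" marker (or an empty string)
    in procedure_ID to None.
    """
    rows = []
    for row in csv.DictReader(handle):
        if row.get("procedure_ID") in ("", "NA"):
            row["procedure_ID"] = None
        rows.append(row)
    return rows


# In-memory stand-in for a real export, e.g.
# open("Corpus_speeches_austria.csv", encoding="utf-8", newline="")
long_text = "x" * 200_000  # invented text, longer than the default field limit
sample = io.StringIO(
    "date,speaker,text,procedure_ID\n"
    f'2019-09-25,Speaker A,"{long_text}",P1\n'
    "2019-09-25,Speaker B,Short remark,NA\n"
)
speeches = read_exported_corpus(sample)
```

All values are read as strings here; dates and numeric columns would still need to be converted according to the analysis at hand.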
The table below summarizes the individual data files provided in
ParlLawSpeech, indicating the file size (MB), the number of variables
offered, and the number of documents (bills, laws, speeches)
therein.
Country | File | Size (MB) | Variables | Observations |
---|---|---|---|---|
Austria | Corpus_bills_austria.RDS | 67.21 | 11 | 5926 |
Austria | Corpus_laws_austria.RDS | 31.16 | 7 | 3030 |
Austria | Corpus_speeches_austria.RDS | 278.97 | 13 | 204881 |
Croatia | Corpus_bills_croatia.RDS | 125.00 | 14 | 3676 |
Croatia | Corpus_laws_croatia.RDS | 59.62 | 5 | 2972 |
Croatia | Corpus_speeches_croatia.RDS | 160.84 | 10 | 405260 |
Czech Republic | Corpus_bills_czech_republic.RDS | 123.03 | 6 | 2127 |
Czech Republic | Corpus_laws_czech_republic.RDS | 8.57 | 7 | 844 |
Czech Republic | Corpus_speeches_czech_republic.RDS | 80.75 | 13 | 192979 |
Denmark | Corpus_bills_denmark.RDS | 181.12 | 10 | 3615 |
Denmark | Corpus_laws_denmark.RDS | 15.60 | 7 | 3220 |
Denmark | Corpus_speeches_denmark.RDS | 147.46 | 10 | 716807 |
EP | Corpus_bills_EP.RDS | 191.31 | 11 | 14105 |
EP | Corpus_laws_EP.RDS | 90.34 | 6 | 10554 |
EP | Corpus_speeches_EP.RDS | 448.77 | 16 | 574119 |
Germany | Corpus_bills_germany.RDS | 100.80 | 10 | 2445 |
Germany | Corpus_laws_germany.RDS | 12.57 | 6 | 1638 |
Germany | Corpus_speeches_germany.RDS | 244.13 | 12 | 191932 |
Hungary | Corpus_bills_hungary.rds | 783.81 | 10 | 7500 |
Hungary | Corpus_laws_hungary.rds | 242.82 | 5 | 4303 |
Hungary | Corpus_speeches_hungary.rds | 318.71 | 17 | 487877 |
Spain | Corpus_bills_spain.RDS | 48.57 | 10 | 4188 |
Spain | Corpus_laws_spain.RDS | 35.72 | 6 | 1563 |
Spain | Corpus_speeches_spain.RDS | 320.77 | 12 | 318576 |