ParlLawSpeech: The Data

Overview
Data access and usage
Coverage
Data structure
Files in the repository

Overview

The ParlLawSpeech dataset offers 4.02 GB of data in total, including machine-readable full texts of 43,582 bills, 28,124 laws, and 3,092,431 plenary speeches from eight European parliaments. Depending on the parliament, the covered period spans between 10 and almost 28 years (see the coverage table below).

ParlLawSpeech is meant to push the systematic analysis of European democracies with advanced text-as-data/NLP methods. Compared to the other excellent extant political science text corpora (cf. Sebők et al. 2025), it adds two key innovations. First, the bill and law corpora are among the most encompassing full-text collections of legal documents handled by parliaments. Second, we provide novel data linkage possibilities by offering a common identifier across bills, the corresponding plenary speeches, and the laws finally adopted. This opens up new analytical opportunities to study the legislative process (for basic ideas, see the tutorial section).

This dataset is the result of the OPTED Work Package 5 (WP5) led by Sven-Oliver Proksch, Christian Rauh, and Miklós Sebők and funded by the European Union’s Horizon 2020 program (Grant agreement 951832). Our task was to develop different prototypes for improving access to linked parliamentary text data produced in European democracies.

This page provides a quick overview of the data structure to potential users. More detailed information is provided in the codebook shipped with the data repository that is permanently stored at GESIS.

Data access and usage

The dataset is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0), the only authorized version for ParlLawSpeech (PLS). When using the data or any of its components in your work, you are required to cite the source as follows:

Schwalbach, Jan; Hetzer, Lukas; Proksch, Sven-Oliver; Rauh, Christian; Sebők, Miklós (2025), “ParlLawSpeech”; GESIS, Cologne. doi:10.7802/2824.

Download ParlLawSpeech (via GESIS)

Coverage

Two considerations drove the choice of countries/parliaments in the initial ParlLawSpeech prototypes. First, the collected data should be usable for diverse comparative research projects by including smaller and larger EU member states from different regions of the Union (including the EU itself). Second, we relied on our prior inventory of parliamentary text sources in Europe (see also Sebők et al. 2025) and our earlier collections of parliamentary speeches (ParlSpeech) to identify those official archives from which we could collect the respective text data within the time and resource constraints of the project.

In a very labor-intensive process, all data were collected with our own web-scraping scripts, customized to each parliamentary archive and/or legal database (APIs are available in only very few cases). Links between the documents were established by semi-automated analyses of the meta-information or the full texts that the respective archive provides, which again required detailed customization to the conventions of that archive.

The table below summarizes the countries and time periods eventually included in the ParlLawSpeech collection. In addition, we offer large shares of the European and Hungarian speeches readily translated into English (details on the translation here).

Country	Code	Parliament	Time period
Austria	AT	Nationalrat	1996/01 - 2019/09
Croatia	HR	Hrvatski sabor	2003/12 - 2020/05
Czech Republic	CZ	Poslanecká sněmovna Parlamentu České republiky	2013/11 - 2023/10
Denmark	DK	Folketing	2007/11 - 2022/10
EU	EU	European Parliament	1999/07 - 2024/04
Germany	DE	Bundestag	2009/10 - 2021/09
Hungary	HU	Országgyűlés	1994/06 - 2022/03
Spain	ES	Congreso de los Diputados	1996/03 - 2023/07
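
The coverage periods in the table can also be compared numerically. The following minimal Python sketch is not part of the ParlLawSpeech repository. It parses the period strings from the table and counts the months covered per parliament, assuming that both the first and the last month are included. The names COVERAGE and months_covered are illustrative only.

```python
# Coverage periods copied from the table above (country code -> "YYYY/MM - YYYY/MM").
COVERAGE = {
    "AT": "1996/01 - 2019/09",
    "HR": "2003/12 - 2020/05",
    "CZ": "2013/11 - 2023/10",
    "DK": "2007/11 - 2022/10",
    "EU": "1999/07 - 2024/04",
    "DE": "2009/10 - 2021/09",
    "HU": "1994/06 - 2022/03",
    "ES": "1996/03 - 2023/07",
}


def months_covered(period):
    """Count the months in a 'YYYY/MM - YYYY/MM' period.

    Assumption: the first and the last month both count as covered.
    """
    start, end = (part.strip() for part in period.split(" - "))
    start_year, start_month = (int(x) for x in start.split("/"))
    end_year, end_month = (int(x) for x in end.split("/"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


months = {code: months_covered(period) for code, period in COVERAGE.items()}

# Print the parliaments from shortest to longest coverage.
for code, n in sorted(months.items(), key=lambda item: item[1]):
    print(f"{code}: {n} months (about {n / 12:.1f} years)")
```

Such a check is useful when planning comparative designs, because the overlap between parliaments is limited by the shortest series.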

Data structure

For each of the above-mentioned countries/parliaments (including the EU/EP), ParlLawSpeech provides three separate data files: one for bills, one for laws, and one for plenary speeches. These three files are structured along the following columns (core variables):

Corpus_bills_[country].RDS:
Data on legislative bills tabled in the respective parliament
- title_bill: Bill title as provided in the original parliamentary archive
- bill_ID: Identification number of the bill document, following the conventions of the respective parliament
- bill_text: Full text of the bill
- initiation_date: Day on which the bill was tabled
- status: Final decision of the legislative body on the bill
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws
Corpus_laws_[country].RDS:
Data on laws finally adopted by the respective parliament
- title_law: Title of the law as provided in the original archive
- law_ID: Identifier of the law text as used in the respective parliament or the official legislation database of the country
- law_text: Full text of the law
- adoption_date: Day on which the law was published
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws
Corpus_speeches_[country].RDS:
Data on plenary speeches in the respective parliament (in chronological order across and within sessions)
- agenda: Agenda item under which the speech was given (following the conventions of the respective parliament)
- date: Day on which the speech was given (YYYY-MM-DD)
- party: Party and/or partisan faction of speaker
- speaker: Name of the person having given the plenary speech
- text: Full text of the respective speech
- speechnumber: Unique speech ID within the session, reflecting the chronological order
- procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws (here NA for speeches that were not directly linked to a specific bill/law procedure)

If not otherwise indicated, all variables are provided as UTF-8 encoded strings. The lists above show the minimal data structure that is available across all countries and periods. Where readily available in the source archives, individual ParlLawSpeech files include additional meta information, for example on speaker roles in the speeches data set, or on document types, committees involved, and voting results in the bill data sets (in the case of the EU also including Celex IDs and legal bases).
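
To illustrate how the shared procedure_ID ties the three files together, the following minimal Python sketch groups toy records by that identifier. It is an illustration under stated assumptions, not the project's own code: all records are invented, the helper link_by_procedure is hypothetical, and None stands in for the NA values of unlinked speeches. In R, the same logic would typically be expressed as a join on procedure_ID.

```python
from collections import defaultdict

# Toy records mimicking some core columns; all values are invented for illustration.
bills = [
    {"bill_ID": "B-1", "title_bill": "Toy bill on data access", "procedure_ID": "P001"},
    {"bill_ID": "B-2", "title_bill": "Toy bill on transport", "procedure_ID": "P002"},
]
laws = [
    {"law_ID": "L-1", "title_law": "Toy data access act", "procedure_ID": "P001"},
]
speeches = [
    {"speaker": "Speaker A", "text": "I support this bill.", "procedure_ID": "P001"},
    {"speaker": "Speaker B", "text": "I have concerns.", "procedure_ID": "P001"},
    {"speaker": "Speaker C", "text": "Point of order.", "procedure_ID": None},
    {"speaker": "Speaker D", "text": "On transport.", "procedure_ID": "P002"},
]


def link_by_procedure(bills, laws, speeches):
    """Group bills, laws, and speeches that share a procedure_ID (hypothetical helper)."""
    linked = defaultdict(lambda: {"bills": [], "laws": [], "speeches": []})
    for name, rows in (("bills", bills), ("laws", laws), ("speeches", speeches)):
        for row in rows:
            pid = row["procedure_ID"]
            if pid is None:  # unlinked speeches carry no procedure_ID
                continue
            linked[pid][name].append(row)
    return dict(linked)


procedures = link_by_procedure(bills, laws, speeches)
for pid, docs in sorted(procedures.items()):
    outcome = "law linked" if docs["laws"] else "no law linked"
    print(pid, len(docs["bills"]), "bill(s),", len(docs["speeches"]), "speech(es),", outcome)
```

Grouping by procedure rather than merging row by row avoids duplicating long bill or law texts for every linked speech.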

Note that the columns with meta information as well as the full texts of bills and laws follow the conventions and document structures that the respective parliamentary archive provides. Staying close to the original conventions has two advantages. First, it allows users to filter the ParlLawSpeech data along external knowledge about the respective parliament. For example, researchers may want to isolate only debates related to specific document numbers or document titles that have been identified through the respective parliamentary website or other qualitative research. Second, sticking to the original document and data formats provides researchers with maximum freedom regarding text cleaning and pre-processing choices. For example, researchers may decide whether or not the recitals and justifications often provided with a legal document should enter their text analyses.

The downside of largely sticking to the original formats is, however, that comparative analysis, especially across countries, may at times require additional text cleaning steps (for an example, see tutorial 2). We advise users to inspect the columns of interest carefully before processing them further. Different analytical interests and methods may require different text preparation steps.
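
As a rough idea of what such an inspection could look like, the following Python sketch computes simple diagnostics for one text column and applies one generic cleaning step. It is only an illustration on invented records: the helpers summarize_text_column and normalize_whitespace are hypothetical, and whether whitespace normalization is appropriate depends on the respective archive and research question.

```python
import re


def summarize_text_column(rows, column):
    """Return simple diagnostics for one string column in a list of records."""
    values = [row.get(column) for row in rows]
    filled = [v for v in values if v is not None and v.strip()]
    lengths = [len(v) for v in filled]
    return {
        "n": len(values),
        "missing_or_empty": len(values) - len(filled),
        "min_chars": min(lengths) if lengths else 0,
        "max_chars": max(lengths) if lengths else 0,
    }


def normalize_whitespace(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Invented records standing in for a laws file; None mimics a missing text.
toy_laws = [
    {"law_text": "Article 1\n\n  This act   enters into force."},
    {"law_text": "   "},
    {"law_text": None},
]

summary = summarize_text_column(toy_laws, "law_text")
cleaned = normalize_whitespace(toy_laws[0]["law_text"])
print(summary)
print(cleaned)
```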

Files in the repository

The data have been compiled as .rds files for programming use in the free and open-source R environment. Users working in other environments can easily export them from R to any other format, using either base R’s export functions or add-on packages such as haven or feather, for example.

The table below summarizes the individual data files provided in ParlLawSpeech, indicating the file size (MB), the number of variables offered, and the number of documents (bills, laws, speeches) therein.

Country	File	Size (MB)	Variables	Observations
Austria	Corpus_bills_austria.RDS	67.21	11	5926
Austria	Corpus_laws_austria.RDS	31.16	7	3030
Austria	Corpus_speeches_austria.RDS	278.97	13	204881
Croatia	Corpus_bills_croatia.RDS	125.00	14	3676
Croatia	Corpus_laws_croatia.RDS	59.62	5	2972
Croatia	Corpus_speeches_croatia.RDS	160.84	10	405260
Czech Republic	Corpus_bills_czech_republic.RDS	123.03	6	2127
Czech Republic	Corpus_laws_czech_republic.RDS	8.57	7	844
Czech Republic	Corpus_speeches_czech_republic.RDS	80.75	13	192979
Denmark	Corpus_bills_denmark.RDS	181.12	10	3615
Denmark	Corpus_laws_denmark.RDS	15.60	7	3220
Denmark	Corpus_speeches_denmark.RDS	147.46	10	716807
EP	Corpus_bills_EP.RDS	191.31	11	14105
EP	Corpus_laws_EP.RDS	90.34	6	10554
EP	Corpus_speeches_EP.RDS	448.77	16	574119
Germany	Corpus_bills_germany.RDS	100.80	10	2445
Germany	Corpus_laws_germany.RDS	12.57	6	1638
Germany	Corpus_speeches_germany.RDS	244.13	12	191932
Hungary	Corpus_bills_hungary.rds	783.81	10	7500
Hungary	Corpus_laws_hungary.rds	242.82	5	4303
Hungary	Corpus_speeches_hungary.rds	318.71	17	487877
Spain	Corpus_bills_spain.RDS	48.57	10	4188
Spain	Corpus_laws_spain.RDS	35.72	6	1563
Spain	Corpus_speeches_spain.RDS	320.77	12	318576
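
The per-file figures can be aggregated to cross-check the totals given in the overview. The Python sketch below copies the rows of the table above and sums the observations per document type as well as the file sizes. It assumes that the document type is the second underscore-separated part of each file name and that 1 GB is taken as 1024 MB. The names FILES, docs_by_type, and total_mb are illustrative.

```python
from collections import defaultdict

# Rows copied from the table above: (country, file, size in MB, variables, observations).
FILES = [
    ("Austria", "Corpus_bills_austria.RDS", 67.21, 11, 5926),
    ("Austria", "Corpus_laws_austria.RDS", 31.16, 7, 3030),
    ("Austria", "Corpus_speeches_austria.RDS", 278.97, 13, 204881),
    ("Croatia", "Corpus_bills_croatia.RDS", 125.00, 14, 3676),
    ("Croatia", "Corpus_laws_croatia.RDS", 59.62, 5, 2972),
    ("Croatia", "Corpus_speeches_croatia.RDS", 160.84, 10, 405260),
    ("Czech Republic", "Corpus_bills_czech_republic.RDS", 123.03, 6, 2127),
    ("Czech Republic", "Corpus_laws_czech_republic.RDS", 8.57, 7, 844),
    ("Czech Republic", "Corpus_speeches_czech_republic.RDS", 80.75, 13, 192979),
    ("Denmark", "Corpus_bills_denmark.RDS", 181.12, 10, 3615),
    ("Denmark", "Corpus_laws_denmark.RDS", 15.60, 7, 3220),
    ("Denmark", "Corpus_speeches_denmark.RDS", 147.46, 10, 716807),
    ("EP", "Corpus_bills_EP.RDS", 191.31, 11, 14105),
    ("EP", "Corpus_laws_EP.RDS", 90.34, 6, 10554),
    ("EP", "Corpus_speeches_EP.RDS", 448.77, 16, 574119),
    ("Germany", "Corpus_bills_germany.RDS", 100.80, 10, 2445),
    ("Germany", "Corpus_laws_germany.RDS", 12.57, 6, 1638),
    ("Germany", "Corpus_speeches_germany.RDS", 244.13, 12, 191932),
    ("Hungary", "Corpus_bills_hungary.rds", 783.81, 10, 7500),
    ("Hungary", "Corpus_laws_hungary.rds", 242.82, 5, 4303),
    ("Hungary", "Corpus_speeches_hungary.rds", 318.71, 17, 487877),
    ("Spain", "Corpus_bills_spain.RDS", 48.57, 10, 4188),
    ("Spain", "Corpus_laws_spain.RDS", 35.72, 6, 1563),
    ("Spain", "Corpus_speeches_spain.RDS", 320.77, 12, 318576),
]

# Sum the number of documents per type ("bills", "laws", "speeches").
docs_by_type = defaultdict(int)
for _country, file_name, _size_mb, _n_vars, n_obs in FILES:
    doc_type = file_name.split("_")[1]
    docs_by_type[doc_type] += n_obs

total_mb = sum(size_mb for _, _, size_mb, _, _ in FILES)

for doc_type, n_docs in sorted(docs_by_type.items()):
    print(f"{doc_type}: {n_docs:,} documents")
print(f"total size: {total_mb:.2f} MB = {total_mb / 1024:.2f} GB (1 GB taken as 1024 MB)")
```

The resulting sums should correspond to the document counts and the overall size reported in the overview.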