ParlLawSpeech: The Data

Overview

The ParlLawSpeech dataset offers 4.02 GB of data in total, including machine-readable full texts of 43,582 bills, 28,124 laws, and 3,092,431 plenary speeches from eight European parliaments covering more than two decades each.

ParlLawSpeech is meant to push the systematic analysis of European democracies with advanced text-as-data/NLP methods. Compared to the other excellent extant political science text corpora (cf. Sebők et al. 2025), it adds two key innovations. First, the bill and law corpora are among the most encompassing full-text collections of legal documents handled by parliaments. Second, we provide novel data linkage possibilities by offering a common identifier across bills, corresponding plenary speeches, and finally adopted laws, opening up new analytical opportunities to study the legislative process (basic ideas in the tutorial section).

This dataset is the result of the OPTED Work Package 5 (WP5) led by Sven-Oliver Proksch, Christian Rauh, and Miklós Sebők and funded by the European Union’s Horizon 2020 program (Grant agreement 951832). Our task was to develop different prototypes for improving access to linked parliamentary text data produced in European democracies.

This page provides a quick overview of the data structure to potential users. More detailed information is provided in the codebook shipped with the data repository that is permanently stored at GESIS.

Data access and usage

The dataset is licensed under the Creative Commons Attribution 4.0 International (CC-BY 4.0 Deed) license, the only authorized version for ParlLawSpeech (PLS). When using the data or any of its components in your work, you are required to cite the source as follows:

Schwalbach, Jan; Hetzer, Lukas; Proksch, Sven-Oliver; Rauh, Christian; Sebők, Miklós (2025), “ParlLawSpeech”; GESIS, Cologne. doi:10.7802/2824.



Coverage

Two considerations drove the choice of countries/parliaments in the initial ParlLawSpeech prototypes. First, the collected data should be usable for diverse comparative research projects by including smaller and larger EU member states from different regions of the Union (including the EU itself). Second, we relied on our prior inventory of parliamentary text sources in Europe (see also Sebők et al. 2025) and our earlier collections of parliamentary speeches (ParlSpeech) to identify those official archives from which we could collect the respective text data within the time and resource constraints of the project.

In a very labor-intensive process, all data were collected with our own web scraping scripts, customized to each parliamentary archive and/or legal database (APIs are available in very few cases only). Links between the documents were established by semi-automated analyses of the meta-information or the full texts that the respective archive provides, which again required detailed customization to the conventions of the respective archive.

The table below summarizes the countries and time periods eventually included in the ParlLawSpeech collection. As an additional goodie, we can offer large shares of the European and Hungarian speeches readily translated to English (details on the translation here).

Country         Code  Parliament                                      Time period
Austria         AT    Nationalrat                                     1996/01 - 2019/09
Croatia         HR    Hrvatski sabor                                  2003/12 - 2020/05
Czech Republic  CZ    Poslanecká sněmovna Parlamentu České republiky  2013/11 - 2023/10
Denmark         DK    Folketing                                       2007/11 - 2022/10
EU              EU    European Parliament                             1999/07 - 2024/04
Germany         DE    Bundestag                                       2009/10 - 2021/09
Hungary         HU    Országgyűlés                                    1994/06 - 2022/03
Spain           ES    Congreso de los Diputados                       1996/03 - 2023/07


Data structure

For each of the above-mentioned countries/parliaments (including the EU/EP), ParlLawSpeech provides three separate data files: one for bills, one for laws, and one for plenary speeches. These three files are structured along the following columns (core variables):

  • Corpus_bills_[country].RDS:
    Data on legislative bills tabled in the respective parliament

    • title_bill: Bill title as provided in the original parliamentary archive
    • bill_ID: Identification number of the bill document, following the conventions of the respective parliament
    • bill_text: Full text of the bill
    • initiation_date: Day on which the bill was tabled
    • status: Final decision of the legislative body on the bill
    • procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws

  • Corpus_laws_[country].RDS:
    Data on laws finally adopted by the respective parliament

    • title_law: Title of the law as provided in the original archive
    • law_ID: Identifier of the law text as used in the respective parliament or the official legislation database of the country
    • law_text: Full text of the law
    • adoption_date: Day on which the law was published
    • procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws

  • Corpus_speeches_[country].RDS:
    Data on plenary speeches in the respective parliament (in chronological order across and within sessions)

    • agenda: Agenda item under which the speech was given (following the conventions of the respective parliament)
    • date: Day on which the speech was given (YYYY-MM-DD)
    • party: Party and/or partisan faction of speaker
    • speaker: Name of the person having given the plenary speech
    • text: Full text of the respective speech
    • speechnumber: Unique speech ID within the session, reflecting the chronological order
    • procedure_ID: A unique ParlLawSpeech identifier linking bills, speeches, and laws (here NA for speeches that were not directly linked to a specific bill/law procedure)
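
The shared procedure_ID column is what ties the three files together. The following is a minimal sketch in Python, assuming the records have already been exported from R and loaded as lists of dictionaries. All rows are invented toy data, and the helper name link_by_procedure is hypothetical, not part of the dataset or any package.

```python
from collections import defaultdict

# Toy rows mimicking the core columns; all values are invented for illustration.
bills = [
    {"bill_ID": "B-1", "title_bill": "Budget Act", "procedure_ID": "P001"},
    {"bill_ID": "B-2", "title_bill": "Energy Act", "procedure_ID": "P002"},
]
laws = [
    {"law_ID": "L-9", "title_law": "Budget Act 2020", "procedure_ID": "P001"},
]
speeches = [
    {"speechnumber": 1, "speaker": "A", "procedure_ID": "P001"},
    {"speechnumber": 2, "speaker": "B", "procedure_ID": None},  # unlinked (NA in R)
    {"speechnumber": 3, "speaker": "C", "procedure_ID": "P001"},
    {"speechnumber": 4, "speaker": "D", "procedure_ID": "P002"},
]


def link_by_procedure(bills, laws, speeches):
    """Group bills, laws, and speeches under their shared procedure_ID."""
    linked = defaultdict(lambda: {"bills": [], "laws": [], "speeches": []})
    for key, rows in (("bills", bills), ("laws", laws), ("speeches", speeches)):
        for row in rows:
            pid = row["procedure_ID"]
            if pid is None:  # speeches without a linked procedure are skipped
                continue
            linked[pid][key].append(row)
    return dict(linked)


linked = link_by_procedure(bills, laws, speeches)
for pid, docs in sorted(linked.items()):
    print(pid, len(docs["bills"]), len(docs["laws"]), len(docs["speeches"]))
```

In this toy example, procedure P002 has a bill and a speech but no adopted law, which is the kind of pattern the linkage is meant to make visible.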



If not otherwise indicated, all variables are provided as UTF-8 encoded strings. The lists above show the minimal data structure that is available across all countries and periods. Where readily available in the source archives, individual ParlLawSpeech files include additional meta information, for example on speaker roles in the speeches data set, or on document types, committees involved, and voting results in the bill data sets (in the case of the EU also including Celex IDs and legal bases).

Note that the columns with meta information as well as the full texts of bills and laws follow the conventions and document structures that the respective parliamentary archive provides. Staying close to the original conventions has two advantages. First, it allows users to filter the ParlLawSpeech data along external knowledge about the respective parliament. For example, researchers may want to isolate only debates related to specific document numbers or document titles that have been identified through the respective parliamentary website or other qualitative research. Second, sticking to the original document and data formats provides researchers with maximum freedom regarding text cleaning and pre-processing choices. For example, researchers may decide whether or not the recitals and justifications often provided with a legal document should enter their text analyses.
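
The first use case, filtering along external knowledge, might look roughly as follows. This is a sketch in Python on invented toy rows, assuming the data were exported from R beforehand; the document numbers and titles are made up and do not refer to real bills.

```python
# Toy rows; IDs, titles, and texts are invented for illustration.
bills = [
    {"bill_ID": "18/1234", "title_bill": "Climate Protection Act", "procedure_ID": "P010"},
    {"bill_ID": "18/5678", "title_bill": "Road Traffic Amendment", "procedure_ID": "P011"},
]
speeches = [
    {"speechnumber": 1, "speaker": "A", "text": "...", "procedure_ID": "P010"},
    {"speechnumber": 2, "speaker": "B", "text": "...", "procedure_ID": None},
    {"speechnumber": 3, "speaker": "C", "text": "...", "procedure_ID": "P010"},
    {"speechnumber": 4, "speaker": "D", "text": "...", "procedure_ID": "P011"},
]

# Document numbers identified externally, e.g. on the parliament's website.
bill_ids_of_interest = {"18/1234"}

# Step 1: translate the archive's own document numbers into procedure IDs.
procedure_ids = {
    b["procedure_ID"] for b in bills if b["bill_ID"] in bill_ids_of_interest
}

# Step 2: keep only the speeches linked to those procedures.
debate = [s for s in speeches if s["procedure_ID"] in procedure_ids]
print([s["speechnumber"] for s in debate])
```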

The downside of largely sticking to the original formats is, however, that comparative analysis especially across countries might require additional text cleaning steps at times (for an example, see tutorial 2). We advise users to inspect the columns of interest carefully before processing them further. Different analytical interests and methods may require different text preparation steps.
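
A first inspection of a column of interest can be as simple as counting missing values and looking at the spread of text lengths. The sketch below uses Python on toy rows; the helper name inspect_column is hypothetical, and treating empty strings as missing is an assumption that may not suit every file.

```python
from statistics import median

# Toy rows; real files would be exported from R first.
speeches = [
    {"text": "abc", "procedure_ID": "P001"},
    {"text": "", "procedure_ID": None},
    {"text": None, "procedure_ID": None},
    {"text": "abcdefg", "procedure_ID": "P002"},
]


def inspect_column(rows, column):
    """Summarise missingness and character lengths of one string column."""
    values = [row.get(column) for row in rows]
    # Assumption: None and empty strings both count as missing.
    missing = sum(1 for v in values if v is None or v == "")
    lengths = [len(v) for v in values if isinstance(v, str) and v]
    return {
        "n": len(values),
        "missing": missing,
        "min_chars": min(lengths) if lengths else 0,
        "median_chars": median(lengths) if lengths else 0,
        "max_chars": max(lengths) if lengths else 0,
    }


summary = inspect_column(speeches, "text")
print(summary)
```

Unusually short or long texts flagged this way are often a good starting point for deciding which cleaning steps a given analysis needs.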


Files in the repository

The data have been compiled as .rds files for programming use in the free and open-source R environment. Users working in other environments can easily export them from R to any other format, using either base R’s export functions or add-on packages such as haven or feather, for example.
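
For users working in Python, one possible route is a CSV export. The sketch below assumes a file was written from R with write.csv(..., row.names = FALSE) in UTF-8, so that missing values appear as the string NA; the helper name read_corpus is hypothetical. Long bill or law texts can exceed the default field size limit of Python's csv module, which is why the limit is raised first.

```python
import csv
import io

# Long legal texts can exceed the csv module's default field size limit.
csv.field_size_limit(10**9)


def read_corpus(handle):
    """Read an exported corpus and map R's NA marker to None."""
    rows = []
    for row in csv.DictReader(handle):
        # Caveat: a genuine string "NA" would also be mapped to None.
        rows.append({k: (None if v == "NA" else v) for k, v in row.items()})
    return rows


# Stand-in for a real file opened with open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    '"speaker","text","procedure_ID"\n'
    '"A","First speech, with a comma",NA\n'
    '"B","Second speech","P001"\n'
)
rows = read_corpus(sample)
print(rows[0]["procedure_ID"], rows[1]["procedure_ID"])
```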

The table below summarizes the individual data files provided in ParlLawSpeech, indicating the file size (MB), the number of variables offered, and the number of documents (bills, laws, speeches) therein.

Country         File                                Size (MB)  Variables  Observations
Austria         Corpus_bills_austria.RDS            67.21      11         5926
Austria         Corpus_laws_austria.RDS             31.16      7          3030
Austria         Corpus_speeches_austria.RDS         278.97     13         204881
Croatia         Corpus_bills_croatia.RDS            125.00     14         3676
Croatia         Corpus_laws_croatia.RDS             59.62      5          2972
Croatia         Corpus_speeches_croatia.RDS         160.84     10         405260
Czech Republic  Corpus_bills_czech_republic.RDS     123.03     6          2127
Czech Republic  Corpus_laws_czech_republic.RDS      8.57       7          844
Czech Republic  Corpus_speeches_czech_republic.RDS  80.75      13         192979
Denmark         Corpus_bills_denmark.RDS            181.12     10         3615
Denmark         Corpus_laws_denmark.RDS             15.60      7          3220
Denmark         Corpus_speeches_denmark.RDS         147.46     10         716807
EP              Corpus_bills_EP.RDS                 191.31     11         14105
EP              Corpus_laws_EP.RDS                  90.34      6          10554
EP              Corpus_speeches_EP.RDS              448.77     16         574119
Germany         Corpus_bills_germany.RDS            100.80     10         2445
Germany         Corpus_laws_germany.RDS             12.57      6          1638
Germany         Corpus_speeches_germany.RDS         244.13     12         191932
Hungary         Corpus_bills_hungary.rds            783.81     10         7500
Hungary         Corpus_laws_hungary.rds             242.82     5          4303
Hungary         Corpus_speeches_hungary.rds         318.71     17         487877
Spain           Corpus_bills_spain.RDS              48.57      10         4188
Spain           Corpus_laws_spain.RDS               35.72      6          1563
Spain           Corpus_speeches_spain.RDS           320.77     12         318576
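
The totals quoted in the overview can be recomputed from the table above. The short Python sketch below copies the size and observation columns by hand; the conversion of 1 GB = 1024 MB is an assumption, chosen because it reproduces the 4.02 GB figure.

```python
# (size in MB, number of observations) per file, copied from the table above,
# in the order Austria, Croatia, Czech Republic, Denmark, EP, Germany, Hungary, Spain.
files = {
    "bills": [(67.21, 5926), (125.00, 3676), (123.03, 2127), (181.12, 3615),
              (191.31, 14105), (100.80, 2445), (783.81, 7500), (48.57, 4188)],
    "laws": [(31.16, 3030), (59.62, 2972), (8.57, 844), (15.60, 3220),
             (90.34, 10554), (12.57, 1638), (242.82, 4303), (35.72, 1563)],
    "speeches": [(278.97, 204881), (160.84, 405260), (80.75, 192979), (147.46, 716807),
                 (448.77, 574119), (244.13, 191932), (318.71, 487877), (320.77, 318576)],
}

# Number of documents per corpus type, summed across parliaments.
totals = {kind: sum(n for _, n in rows) for kind, rows in files.items()}

# Overall size; assumes 1 GB = 1024 MB.
total_mb = sum(size for rows in files.values() for size, _ in rows)
total_gb = total_mb / 1024

print(totals)
print(round(total_mb, 2), round(total_gb, 2))
```

The sums come to 43,582 bills, 28,124 laws, and 3,092,431 speeches, matching the figures given in the overview.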