Instytut Filozofii i Socjologii

(Polska Akademia Nauk)

Frontiers in Psychology Corpus (2010–2021)

Version 1.0

Miłkowski, Marcin, 2026, "Frontiers in Psychology Corpus (2010–2021)", https://doi.org/10.18150/4LJ9WD, RepOD, V1

Description

A comprehensive text corpus of 21,084 papers published in Frontiers in Psychology between 2010 and 2021, processed for computational linguistics and semantic analysis.

Dataset Overview

This corpus contains full-text articles from Frontiers in Psychology, converted from XML to plain text and preprocessed for natural language processing tasks.

Files and Structure

`fpsyg_filtered.zip`

Contains the filtered text corpus with the following preprocessing applied:

- XML to text conversion: Original XML documents converted to plain text format

- Sentence segmentation: Text segmented into individual sentences

- Boilerplate removal: Journal metadata, headers, footers, and other non-content elements removed using filter_text_corpus.py
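The filtered archive can be read without unpacking it to disk. The following is a minimal sketch, assuming the archive holds UTF-8 text files; the internal layout of `fpsyg_filtered.zip` is not documented here, and the function name `iter_text_members` is hypothetical.

```python
import zipfile
from typing import Iterator, Tuple


def iter_text_members(zip_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every regular file in a ZIP archive.

    Assumption: members are text files in the given encoding. Bytes that
    cannot be decoded are replaced rather than raising an error.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with archive.open(info) as handle:
                yield info.filename, handle.read().decode(encoding, errors="replace")
```

Calling `iter_text_members("fpsyg_filtered.zip")` streams one member at a time, which keeps memory use bounded by the largest single file.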

`fpsyg_tagged.zip.*` (2 parts)

Contains the linguistically annotated corpus. This archive is split into multiple parts due to size constraints:

- fpsyg_tagged.zip.001

- fpsyg_tagged.zip.002

To extract, place both parts in the same directory and open the first part with 7-Zip:

```bash
7z x fpsyg_tagged.zip.001
```

7-Zip detects the remaining parts automatically, combines them, and extracts the contents.

After extraction:

- File: fpsyg_filtered_tagged.conllu

- POS tags: Penn Treebank part-of-speech tags

- Dependency parsing: Universal Dependencies (UD) relations

- Tagger: Processed using Stanza

- Format: CoNLL-U format
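Because the tagged file is large, it is usually better to stream it sentence by sentence than to load it whole. The sketch below relies only on the standard CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences). It assumes the Penn Treebank tags are stored in the XPOS column, which is the usual placement but is not confirmed by the description above; `iter_sentences` and `FIELDS` are hypothetical names.

```python
from typing import Dict, Iterator, List, TextIO

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def iter_sentences(stream: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence: List[Dict[str, str]] = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        else:
            values = line.split("\t")
            if len(values) == len(FIELDS):
                sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        yield sentence  # file did not end with a blank line
```

A typical call opens `fpsyg_filtered_tagged.conllu` with `encoding="utf-8"` and passes the file object to `iter_sentences`; each token dict then exposes, for example, `token["xpos"]` and `token["deprel"]`.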

`fpsyg_index.zip.*` (6 parts)

Contains a compiled binary index for semantic mining. This archive is split into multiple parts:

- fpsyg_index.zip.001 through fpsyg_index.zip.006

To extract, place all six parts in the same directory and open the first part with 7-Zip:

```bash
7z x fpsyg_index.zip.001
```

7-Zip detects the remaining parts automatically, combines them, and extracts the contents.

After extraction:

- Compatible with ConceptSketch semantic mining software

- Usage: Point ConceptSketch to the extraction directory
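Where 7-Zip is not available, the numbered parts of either split archive can in principle be joined by hand. This sketch assumes the `.001`, `.002`, ... files are plain byte-wise slices of a single ZIP file, which is how 7-Zip normally writes split volumes but is not stated in the description above. The function name `join_parts` is hypothetical.

```python
import glob
import shutil
from typing import List


def join_parts(first_part: str, out_path: str) -> List[str]:
    """Concatenate first_part and its numbered siblings into out_path.

    Assumption: the parts are raw consecutive slices of one archive,
    named <stem>.001, <stem>.002, and so on.
    """
    if not first_part.endswith(".001"):
        raise ValueError("expected a path ending in .001")
    stem = first_part[:-3]  # keeps the trailing dot, e.g. "fpsyg_tagged.zip."
    # Three-digit suffixes sort correctly as plain strings.
    parts = sorted(glob.glob(glob.escape(stem) + "[0-9][0-9][0-9]"))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, out, length=1024 * 1024)
    return parts
```

For example, `join_parts("fpsyg_tagged.zip.001", "fpsyg_tagged.zip")` would write a single archive that any ZIP tool can open, provided the assumption about the split format holds. The joined file needs as much free disk space as all the parts combined.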

Processing Pipeline

Original XML → Text Conversion → Sentence Segmentation → Boilerplate Removal → POS/UD Tagging (Stanza) → Final Corpus

License

This dataset is distributed under the Creative Commons Attribution License (CC BY), derived from the original papers' licensing terms.

Citation

If you use this dataset in your research, please cite:

Dataset compiled by: Marcin Miłkowski

Affiliation: Cognitive Metascience Lab & Center for AI in Society, Institute of Philosophy and Sociology, Polish Academy of Sciences

Requirements

Stanza (for re-tagging or custom processing): https://github.com/stanfordnlp/stanza
ConceptSketch (for semantic mining): https://github.com/cognitive-metascience/concept-sketch

Contact

For questions regarding this dataset, please contact the Cognitive Metascience Lab at the Institute of Philosophy and Sociology, Polish Academy of Sciences: https://cognitive-metascience.github.io

Subject

Arts and Humanities; Social Sciences

Keyword

corpus, metascience, journal articles

License

Different licenses for individual files

Files 1 to 10 of 11:

| File | Type | Size | Date | Downloads | MD5 | License | Description |
|---|---|---|---|---|---|---|---|
| fpsyg_filtered.zip | ZIP Archive | 372.8 MB | Mar 20, 2026 | 1 | 1eed7998f029e707dd089c02e73d4b01 | CC BY - Creative Commons Attribution 4.0 | The filtered text corpus with preprocessing applied |
| filter_text_corpus.py | Unknown | 6.8 KB | Mar 20, 2026 | 2 | a08f4705b10940c4bd68a7cdc9072341 | BSD 3-Clause "New" or "Revised" License | Filtering script |
| fpsyg_index.zip.001 | ZIP Archive | 2.0 GB | Mar 20, 2026 | 0 | c0c9d96db72986206a510e7183014d09 | CC BY - Creative Commons Attribution 4.0 | Compiled binary index, part 1 |
| fpsyg_index.zip.002 | Unknown | 2.0 GB | Mar 20, 2026 | 0 | 828e528e2ad3fa756c9ab99cfecb8f13 | CC BY - Creative Commons Attribution 4.0 | Compiled binary index, part 2 |
| fpsyg_index.zip.003 | Unknown | 2.0 GB | Mar 20, 2026 | 0 | b5c3532b638942f0096f1cd4898f9466 | CC BY - Creative Commons Attribution 4.0 | Compiled binary index, part 3 |
| fpsyg_index.zip.004 | Unknown | 2.0 GB | Mar 20, 2026 | 0 | 9371b4cc53a307785efc3c34132a7433 | CC BY - Creative Commons Attribution 4.0 | Compiled index, part 4 |
| fpsyg_index.zip.005 | Unknown | 2.0 GB | Mar 20, 2026 | 0 | e33b77000272e4284c8bbc8b5a95f1fc | CC BY - Creative Commons Attribution 4.0 | Compiled index, part 5 |
| fpsyg_index.zip.006 | Unknown | 914.7 MB | Mar 20, 2026 | 0 | 2f0ed5e7be2d4c367c34e9bc8fd2b46e | CC BY - Creative Commons Attribution 4.0 | Compiled index, part 6 |
| fpsyg_tagged.zip.001 | ZIP Archive | 2.0 GB | Mar 20, 2026 | 0 | 56305d298aba8a3236efe92c73ac0870 | CC BY - Creative Commons Attribution 4.0 | Tagged corpus (conllu format), part 1 |
| fpsyg_tagged.zip.002 | Unknown | 402.0 MB | Mar 20, 2026 | 0 | 377132d8e130dff00e54bbec32f429e3 | CC BY - Creative Commons Attribution 4.0 | |
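Since several parts are about 2 GB each, it can be worth checking each download against the MD5 value published in the file listing before extracting. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the published value, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, `verify("fpsyg_filtered.zip", "1eed7998f029e707dd089c02e73d4b01")` should return `True` for an intact download of that file. MD5 is adequate for detecting transfer corruption but should not be treated as a security guarantee.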