Frontiers in Psychology Corpus (2010–2021)
A comprehensive text corpus of 21,084 papers published in Frontiers in Psychology between 2010 and 2021, processed for computational linguistics and semantic analysis.
Dataset Overview
This corpus contains full-text articles from Frontiers in Psychology, converted from XML to plain text and preprocessed for natural language processing tasks.
Files and Structure
fpsyg_filtered.zip
Contains the filtered text corpus with the following preprocessing applied:
- XML to text conversion: Original XML documents converted to plain text format
- Sentence segmentation: Text segmented into individual sentences
- Boilerplate removal: Journal metadata, headers, footers, and other non-content elements removed using filter_text_corpus.py
fpsyg_tagged.zip.* (2 parts)
Contains the linguistically annotated corpus. This archive is split into multiple parts due to size constraints:
- fpsyg_tagged.zip.001
- fpsyg_tagged.zip.002
To extract: place all parts in the same directory and run 7-Zip on the first part:
```bash
7z x fpsyg_tagged.zip.001
```
7-Zip will automatically detect and combine all parts, then extract the contents.
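If 7-Zip is not available, the parts can also be joined by hand. The following is a minimal sketch, assuming the numbered parts are plain byte-wise slices of a single zip file (which is how 7-Zip produces `.001`, `.002`, ... volumes). The function names `join_parts` and `extract_split_zip` are illustrative and not part of this dataset's tooling.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(first_part: str) -> Path:
    """Concatenate name.zip.001, name.zip.002, ... into name.zip."""
    first = Path(first_part)
    # "fpsyg_tagged.zip.001" has the stem "fpsyg_tagged.zip"; every part
    # shares that stem followed by a three-digit suffix.
    parts = sorted(first.parent.glob(first.stem + ".[0-9][0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {first_part}")
    joined = first.parent / first.stem
    with open(joined, "wb") as out:
        for part in parts:  # zero-padded suffixes sort in the right order
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return joined


def extract_split_zip(first_part: str, dest: str) -> None:
    """Join the parts, then extract the resulting zip into dest."""
    joined = join_parts(first_part)
    with zipfile.ZipFile(joined) as zf:
        zf.extractall(dest)


# Example call (paths are assumptions about where the parts were saved):
# extract_split_zip("fpsyg_tagged.zip.001", "fpsyg_tagged")
```

The same call should work for the other split archive in this dataset by passing its first part instead. Note that joining writes a full copy of the archive to disk before extraction.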
After extraction:
- File: fpsyg_filtered_tagged.conllu
- POS tags: Penn Treebank part-of-speech tags
- Dependency parsing: Universal Dependencies (UD) relations
- Tagger: Processed using Stanza
- Format: CoNLL-U
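A CoNLL-U file stores one token per line in ten tab-separated columns, with comment lines starting with `#` and a blank line after each sentence. The sketch below streams such a file sentence by sentence using only the standard library. It assumes the Penn Treebank tags sit in the XPOS column, as is usual for Stanza's English models; the names `Token`, `read_conllu`, and `xpos_counts` are illustrative.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The ten standard CoNLL-U columns, kept as raw strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of Token tuples."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        else:
            fields = line.split("\t")
            if len(fields) == 10:  # silently skip malformed lines
                sentence.append(Token(*fields))
    if sentence:  # file may end without a trailing blank line
        yield sentence


def xpos_counts(lines: Iterable[str]) -> Counter:
    """Count XPOS tags, ignoring multiword ranges ("1-2") and empty nodes ("1.1")."""
    counts: Counter = Counter()
    for sentence in read_conllu(lines):
        for tok in sentence:
            if tok.id.isdigit():
                counts[tok.xpos] += 1
    return counts


# Example call on the extracted file; a file object is iterated lazily,
# so the corpus is never loaded into memory at once:
# with open("fpsyg_filtered_tagged.conllu", encoding="utf-8") as fh:
#     print(xpos_counts(fh).most_common(10))
```

Streaming by sentence keeps memory use flat, which matters for a corpus of this size. For richer handling (features, enhanced dependencies), a dedicated CoNLL-U library or Stanza's own document loader may be a better fit.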
fpsyg_index.zip.* (6 parts)
Contains a compiled binary index for semantic mining. This archive is split into multiple parts:
- fpsyg_index.zip.001 through fpsyg_index.zip.006
To extract: place all parts in the same directory and run 7-Zip on the first part:
```bash
7z x fpsyg_index.zip.001
```
7-Zip will automatically detect and combine all parts, then extract the contents.
After extraction:
- Compatible with ConceptSketch semantic mining software
- Usage: Point ConceptSketch to the extraction directory
Processing Pipeline
Original XML → Text Conversion → Sentence Segmentation → Boilerplate Removal → POS/UD Tagging (Stanza) → Final Corpus
License
This dataset is distributed under the Creative Commons Attribution License (CC BY), derived from the original papers' licensing terms.
Citation
If you use this dataset in your research, please cite:
Dataset compiled by: Marcin Miłkowski
Affiliation: Cognitive Metascience Lab & Center for AI in Society, Institute of Philosophy and Sociology, Polish Academy of Sciences
Requirements
- Stanza (for re-tagging or custom processing): https://github.com/stanfordnlp/stanza
- ConceptSketch (for semantic mining): https://github.com/cognitive-metascience/concept-sketch
Contact
For questions regarding this dataset, please contact the Cognitive Metascience Lab at the Institute of Philosophy and Sociology, Polish Academy of Sciences: https://cognitive-metascience.github.io