MDPI Open Peer Review Corpus 2
Section for Logic & Cognitive Science, Institute of Philosophy and Sociology, Polish Academy of Sciences
Cognitive Metascience Lab
Generated by Ksawery Jasieński, with some input from Remigiusz Depta, under supervision of Marcin Miłkowski (2022-2023)
---
MDPI is committed to the idea of open peer review, but publishing the reviews is voluntary. The reviews are not available for download in a single package, so they must be crawled from the MDPI website.
This dataset contains all peer reviews available on mdpi.com as of January 2023, covering over 135 thousand papers. The reviews are stored as plain text (look for TXT files). In addition, the corpus contains metadata in JSON format for individual reviews and author responses, as well as metadata for the original papers. For reference, see the JSON `schema` files available in the GitHub repository associated with this project.
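As an illustration of how the TXT and JSON files can be read after unpacking, here is a minimal Python sketch. The function names are hypothetical, and the sketch assumes lowercase `.txt` extensions and UTF-8 encoding, neither of which is stated in this README. It searches recursively, so it makes no assumption about the directory layout. The field names inside the JSON files are defined by the `schema` files and are not guessed here.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_review_texts(corpus_root: Path) -> Iterator[tuple[Path, str]]:
    """Yield (path, text) for every .txt file found anywhere under corpus_root."""
    for txt_path in sorted(corpus_root.rglob("*.txt")):
        # Assumption: files are UTF-8. errors="replace" keeps the loop
        # going if a file contains bytes that do not decode cleanly.
        yield txt_path, txt_path.read_text(encoding="utf-8", errors="replace")


def load_metadata(json_path: Path) -> object:
    """Parse one JSON metadata file and return the decoded Python object."""
    with json_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Because the loader returns the decoded object unchanged, it works for review, author-response, and paper metadata alike; validating against the schema files is left to the caller.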
Additionally, this dataset contains the source HTML for each web page from which the text of reviews was extracted, as well as any supplementary materials attached to the reviews. The original files were not enriched with any linguistic annotation or converted to any other format (these are predominantly PDF and DOCX files, as uploaded through the MDPI editorial system by authors and reviewers).
We are making source code available for the dedicated crawler that was built to scrape the MDPI database. See the GitHub link below:
- https://github.com/cognitive-metascience/review_crawler/tree/main/crawling
See this corpus on PubPeer:
- https://pubpeer.com/publications/25353AAFD4FC52E2BEC8C7AD08B259#
---
This dataset is split into parts because of the upload limits of this repository. The archives are available in ZIP (33 parts, look for `.z[01-33]` files) and 7z (23 parts) formats. In addition, we provide the set of excluded articles with incomplete information (e.g., articles missing some first-round reviews) in the `mdpi-dump-dir.zip` file.
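Since a multi-part archive cannot be extracted if any part is missing, it may be worth checking the download before unpacking. The following Python sketch lists gaps in the numbered `.zNN` parts. The function name is hypothetical, and the expected number of parts is passed in explicitly because this README does not say whether the final `.zip` file counts towards the 33 parts.

```python
import re
from pathlib import Path

# Matches file names ending in .z01, .z02, ... (two-digit part numbers).
PART_SUFFIX = re.compile(r"\.z(\d{2})$")


def missing_zip_parts(download_dir: Path, expected_parts: int) -> list[int]:
    """Return the part numbers in 1..expected_parts that have no .zNN file."""
    found = set()
    for path in download_dir.iterdir():
        match = PART_SUFFIX.search(path.name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, expected_parts + 1) if n not in found]
```

An empty list means every numbered part in the requested range is present. The check looks only at file names, so it does not detect truncated or corrupted downloads.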
The dataset after unpacking is a little over 170 GB in size.
The files are being made available under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Changelog
current version - August 2023
- no new reviews added
- re-scraped (June 2023) the reviews that had some sections of text missing
- dataset now also includes the source HTML from mdpi.com for each reviewed article. The webpage content was cleaned before being stored in an HTML file: specifically, all comments were removed from the document, as well as the following tags: 'script', 'style', 'noscript', 'link', 'rect'.
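
The cleaning step described in the last item above can be approximated with the Python standard library. This is only a sketch under stated assumptions, not the crawler's actual code: the class and function names are hypothetical, and it assumes that removing a tag also removes everything nested inside it. The real implementation lives in the GitHub repository and may use a different library.

```python
from html.parser import HTMLParser

# Tags named in the changelog entry; comments are dropped as well.
REMOVED_TAGS = {"script", "style", "noscript", "link", "rect"}
VOID_REMOVED_TAGS = {"link"}  # <link> never has a closing tag in HTML


class _Cleaner(HTMLParser):
    def __init__(self) -> None:
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            if tag not in VOID_REMOVED_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <rect ... /> or <br/>.
        if tag not in REMOVED_TAGS and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS:
            if tag not in VOID_REMOVED_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        if self.skip_depth == 0:
            self.parts.append(f"<!{decl}>")

    # handle_comment is not overridden: the default does nothing,
    # so comments never reach self.parts.


def clean_html(source: str) -> str:
    """Return source without comments and without the REMOVED_TAGS elements."""
    cleaner = _Cleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts)
```

The sketch emits end tags in lowercase and does not try to repair malformed markup, so its output may differ in detail from the HTML files in the dataset.
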
14th March 2023
- first submission of this dataset
- reviews were scraped in early January 2023
- dataset contains metadata for 135652 peer-reviewed articles from the MDPI database, along with full plain text for each review and any supplementary materials that were attached (PDF or DOC files containing e.g. author responses to comments)
(2023)