Bulk downloading from Dataverse
Today I wanted to download a dataset called DeepPatent2:
Wu, Jian, 2023, “Replication Data for DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding”, https://doi.org/10.7910/DVN/UG4SBD, Harvard Dataverse, V2, UNF:6:v+kPnjPdsW7S36aUW0I7bg== [fileUNF]
It’s hosted on Harvard’s Dataverse instance: Replication Data for DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding – Harvard Dataverse. The dataset is sharded into 2 GB chunks and comes to roughly 350 GB in total. This instance is not configured to allow a download of that size in bulk via the browser’s UI.
However, it does provide a Schema.org JSON-LD API for each deposit (which gets a DOI) and this makes it straightforward to retrieve URIs for each of the individual parts. Here is what that looks like:
https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/UG4SBD
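The part of that export that matters here is the `distribution` array, where each entry carries a `contentUrl`. As a rough illustration, the sketch below runs the jq filter over a tiny hand-made sample. The host, file names and IDs in the sample are invented, and the real export has many more fields.

```shell
# Hypothetical, heavily trimmed sample of a schema.org export.
# The host, names and file IDs below are made up for illustration.
sample='{
  "@type": "Dataset",
  "distribution": [
    {"@type": "DataDownload", "name": "part1.tar.gz",
     "contentUrl": "https://dataverse.example.org/api/access/datafile/1"},
    {"@type": "DataDownload", "name": "part2.tar.gz",
     "contentUrl": "https://dataverse.example.org/api/access/datafile/2"}
  ]
}'

# Pull out one download URL per line; -r prints raw strings without quotes.
urls=$(printf '%s' "$sample" | jq -r '.distribution[].contentUrl')
printf '%s\n' "$urls"
```

Each line of that output is a plain URL, which is the kind of list a downloader can take as input.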
Using curl, jq and aria2, it is straightforward to download the required files:
aria2c --input-file <(curl "{URI_TO_JSON_SCHEMA.ORG_FILE}" | jq -r '.distribution[].contentUrl')
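For a download of this size it may be worth keeping the URL list on disk so an interrupted run can be resumed. The following is only a sketch of how I might do that. The function names `export_url` and `download_all`, the file names `export.json` and `urls.txt`, and the choice of four parallel downloads are my own assumptions, not anything Dataverse prescribes.

```shell
# Build the schema.org export URL for a DOI on a given Dataverse host.
# (Hypothetical helper; the URL shape is copied from the export link above.)
export_url() {
  local host="$1" doi="$2"
  printf '%s\n' "https://${host}/api/datasets/export?exporter=schema.org&persistentId=doi%3A${doi}"
}

# Fetch the export, write one file URL per line to urls.txt, then download.
# (Hypothetical helper; the tuning values are arbitrary choices.)
download_all() {
  local url="$1" dest="$2"
  curl -fsSL "$url" -o export.json &&
    jq -r '.distribution[].contentUrl' export.json > urls.txt &&
    aria2c --input-file=urls.txt --dir="$dest" \
           --continue=true --max-concurrent-downloads=4
}

# Example invocation (left commented out, since it starts a ~350 GB download):
# download_all "$(export_url dataverse.harvard.edu 10.7910/DVN/UG4SBD)" ./deeppatent2
```

Re-running `download_all` with the same destination lets aria2 pick up partially downloaded files rather than starting over.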