
Bulk downloading from Dataverse

Today I wanted to download a dataset called DeepPatent2:

Wu, Jian, 2023, “Replication Data for DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding”, https://doi.org/10.7910/DVN/UG4SBD, Harvard Dataverse, V2, UNF:6:v+kPnjPdsW7S36aUW0I7bg== [fileUNF]

It’s hosted on Harvard’s Dataverse instance, and the dataset is sharded into 2GB chunks. This instance is not configured to allow a download that large (~350GB) in bulk through the browser UI.

However, it does provide a Schema.org JSON-LD export API for each deposit (each deposit gets a DOI), which makes it straightforward to retrieve the URI of every individual file. Here is the export URL for this dataset:

https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/UG4SBD
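Before downloading anything, it can help to check what the export actually lists. The following is a minimal sketch, not part of the original workflow: `list_files` is a hypothetical helper name, and the `.name` field is an assumption about the export's layout (only `.distribution[].contentUrl` is used elsewhere in this post).

```shell
# List the files in a Dataverse schema.org export as "name<TAB>url" lines.
# Reads the JSON-LD document on stdin. The .name field is an assumption
# about the export; .distribution[].contentUrl is the field this post relies on.
list_files() {
  jq -r '.distribution[] | [.name, .contentUrl] | @tsv'
}

# Usage (needs network access):
#   curl -fsSL 'https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/UG4SBD' | list_files | head
```

Quoting the URL matters here, because it contains `?` and `&`, which the shell would otherwise interpret.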

With curl, jq and aria2, downloading all of the files takes a single command (process substitution feeds the list of URLs to aria2c):

aria2c --input-file <(curl -s "{URI_TO_JSON_SCHEMA.ORG_FILE}" | jq -r '.distribution[].contentUrl')
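For a download of this size, a variant that can be restarted may be more comfortable. The sketch below is one possible approach, not the method used above: it saves the export and the URL list to files, then asks aria2c to resume partial files. The function names, the file names and the concurrency value are all hypothetical, and skipping entries without a `contentUrl` is an assumption about how the export might represent files that cannot be downloaded.

```shell
# Extract one download URL per line from a saved schema.org export.
# Entries without a contentUrl are skipped (assumption: such entries may exist).
extract_urls() {
  jq -r '.distribution[] | .contentUrl // empty' "$1"
}

# Download every URL listed in a file into a target directory.
# --continue=true lets an interrupted run pick up partially downloaded files.
download_all() {
  local url_file=$1 dest_dir=$2
  aria2c --input-file="$url_file" --dir="$dest_dir" \
         --continue=true --max-concurrent-downloads=4
}

# Usage (EXPORT_URL holds the schema.org export URL; file names are illustrative):
#   curl -fsSL "$EXPORT_URL" -o export.json
#   extract_urls export.json > urls.txt
#   download_all urls.txt ./deeppatent2
```

Keeping `urls.txt` on disk means the same list can be reused if the transfer is interrupted, without querying the export again.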