Working with the Datastore

The input data that PUDL processes comes from a variety of US government agencies. However, these agencies typically make the data available on their websites or via FTP without planning for programmatic access. To ensure reproducible, programmatic access, we periodically archive the input files on the Zenodo research archiving service maintained by CERN. (See our pudl-archiver repository on GitHub for more information.)

When PUDL needs a data resource, it will attempt to automatically retrieve it from Zenodo and store it locally in a file hierarchy organized by dataset and the versioned DOI of the corresponding Zenodo deposition.

The pudl_datastore script can also be used to pre-download the raw input data in bulk. It uses the routines defined in the pudl.workspace.datastore module. For details on what data is available, for what time periods, and how much of it there is, see the PUDL Data Sources. At present the pudl_datastore script downloads the entire collection of data available for each dataset. For the FERC Form 1 and EPA CEMS datasets, this is several gigabytes.

For example, to download the full EIA Form 860 – Annual Electric Generator Report dataset (covering 2001-present) you would use:

$ pudl_datastore --dataset eia860

For more detailed usage information, see:

$ pudl_datastore --help

The script stores the downloaded data in a datastore under your $PUDL_INPUT directory, organized by data source, form, and DOI:

data/censusdp1tract/
data/eia860/
data/eia860m/
data/eia861/
data/eia923/
data/epacems/
data/ferc1/
data/ferc2/
data/ferc60/
data/ferc714/
data/phmsagas/
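
To check what has already been downloaded, the hierarchy above can be summarized with a short script. This is a minimal sketch, not part of PUDL: the function name summarize_datastore is hypothetical, and it assumes only that each dataset occupies its own directory directly under the datastore root, with the downloaded files somewhere beneath it.

```python
import os
from pathlib import Path


def summarize_datastore(root):
    """Map each dataset directory under root to (file count, total bytes).

    Hypothetical helper: it only assumes one directory per dataset directly
    under root, and makes no assumption about how the DOI level is named.
    """
    summary = {}
    for dataset_dir in sorted(Path(root).iterdir()):
        if not dataset_dir.is_dir():
            continue  # ignore stray files at the top level
        files = [p for p in dataset_dir.rglob("*") if p.is_file()]
        total_bytes = sum(p.stat().st_size for p in files)
        summary[dataset_dir.name] = (len(files), total_bytes)
    return summary


if __name__ == "__main__":
    # Only report if PUDL_INPUT is set and points at an existing directory.
    root = os.environ.get("PUDL_INPUT")
    if root and Path(root).is_dir():
        for name, (count, size) in summarize_datastore(root).items():
            print(f"{name}: {count} files, {size / 1e6:.1f} MB")
```

Pass a different directory to summarize_datastore if your datastore root is not the directory named by $PUDL_INPUT itself.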

If the download fails to complete successfully, you can rerun the script until all the files have been downloaded. It will not try to re-download data that is already present locally.
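
Rerunning the script by hand can be automated with a small retry wrapper. The following is a sketch under stated assumptions: the helper run_until_success is hypothetical, and it assumes that pudl_datastore exits with a non-zero status when a download fails, which the text above does not confirm.

```python
import subprocess
import time


def run_until_success(command, max_attempts=5, wait_seconds=0.0):
    """Run command repeatedly until it exits with status 0.

    Returns the number of attempts used. Raises RuntimeError if the command
    is still failing after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(wait_seconds)  # brief pause before the next try
    raise RuntimeError(f"{command!r} still failing after {max_attempts} attempts")


# Example call (not executed here; assumes pudl_datastore is on your PATH):
# run_until_success(["pudl_datastore", "--dataset", "eia860"], wait_seconds=30)
```

Because files that are already present locally are skipped, each retry only has to fetch what is still missing.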

Adding a new Dataset to the Datastore

We maintain a tool, pudl-archiver, that manages the archival and versioning of datasets. See the pudl-archiver documentation for information on adding datasets to the datastore.

Tell PUDL about the archive

Once you have used pudl-archiver to prepare a Zenodo archive as above, you can make the PUDL Datastore aware of it by updating the appropriate DOI in pudl.workspace.datastore.ZenodoDoiSettings. A DOI can refer to a resource on the Zenodo sandbox server, for archives that are still in testing or development (sandbox DOIs have a prefix of 10.5072), or on the Zenodo production server, for archives that are ready for public use (production DOIs have a prefix of 10.5281).
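
Before updating a DOI, it can help to confirm which server it points at. The sketch below classifies a DOI by the prefixes given above. The function zenodo_server_for is a hypothetical helper, not part of pudl.workspace.datastore, and the record numbers in the example comments are placeholders.

```python
SANDBOX_PREFIX = "10.5072"     # Zenodo sandbox server (testing or development)
PRODUCTION_PREFIX = "10.5281"  # Zenodo production server (public use)


def zenodo_server_for(doi):
    """Return "sandbox" or "production" based on the DOI prefix.

    Hypothetical helper: the prefix is everything before the first "/".
    """
    prefix = doi.strip().split("/", 1)[0]
    if prefix == SANDBOX_PREFIX:
        return "sandbox"
    if prefix == PRODUCTION_PREFIX:
        return "production"
    raise ValueError(f"Not a recognized Zenodo DOI prefix: {prefix!r}")


# Placeholder DOIs for illustration only:
# zenodo_server_for("10.5072/zenodo.123456")   -> "sandbox"
# zenodo_server_for("10.5281/zenodo.1234567")  -> "production"
```

A check like this makes it harder to ship a sandbox DOI in a release that is meant for public use.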