Creating a Datastore

The input data that PUDL processes comes from a variety of US government agencies. These agencies typically make the data available on their websites or via FTP without really planning for programmatic access.

The pudl_data script helps you obtain and organize this data locally, for use by the rest of the PUDL system. It uses the routines defined in the pudl.workspace.datastore module. For details on what data is available, for what time periods, and how much of it there is, see the Data Catalog.

For example, if you wanted to download the 2018 EPA CEMS Hourly data for Colorado:

$ pudl_data --sources epacems --states CO --years 2018

If you do not specify years, the script will retrieve all available years of data for the requested sources. So to get everything for EIA Form 860 and EIA Form 923 you would run:

$ pudl_data --sources eia860 eia923

The script will download from all sources in parallel, so if you have a fast internet connection and need a lot of data, doing it all in one go makes sense. To pull down all the available data for all the sources (10+ GB) you would run:

$ pudl_data --sources eia860 eia923 epacems ferc1 epaipm

For more detailed usage information, see:

$ pudl_data --help

The downloaded data will be used by the script to populate a datastore under the data directory in your workspace, organized by data source, form, and date:

data/eia/form860/
data/eia/form923/
data/epa/cems/
data/epa/ipm/
data/ferc/form1/
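
To check what a download actually put on disk, you can walk these directories and count files and bytes. The following is a minimal sketch, not part of PUDL: the function name summarize_datastore is hypothetical, and the subdirectory names are simply copied from the layout listed above.

```python
from pathlib import Path

# Subdirectories copied from the datastore layout listed above.
DATASTORE_SUBDIRS = [
    "eia/form860",
    "eia/form923",
    "epa/cems",
    "epa/ipm",
    "ferc/form1",
]


def summarize_datastore(data_dir):
    """Return {subdir: (file_count, total_bytes)} for each expected subdirectory.

    Hypothetical helper for illustration; a subdirectory that does not exist
    yet is reported as (0, 0) rather than raising an error.
    """
    summary = {}
    for subdir in DATASTORE_SUBDIRS:
        path = Path(data_dir) / subdir
        # Collect regular files at any depth below this subdirectory.
        files = [p for p in path.rglob("*") if p.is_file()] if path.is_dir() else []
        summary[subdir] = (len(files), sum(p.stat().st_size for p in files))
    return summary
```

Calling summarize_datastore("data") from the root of your workspace would report on the data directory described above, assuming your workspace follows that layout.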

If the download fails (e.g. the FTP server times out), you can rerun this command until all the files are downloaded. It will not try to re-download data that is already present locally unless you use the --clobber option. Depending on which data sources and how many years or states you have requested, and the speed of your internet connection, the download may take minutes to hours to complete, and can consume 20+ GB of disk space even when the data is compressed.
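
Rerunning the command by hand can be automated with a small wrapper. The sketch below is not part of PUDL: run_until_success is a hypothetical helper, and it assumes that pudl_data exits with a non-zero status when a download fails, which this page does not confirm. Because --clobber is not passed, each retry should skip files that are already present locally.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, wait_seconds=60):
    """Run cmd repeatedly until it exits with status 0.

    Returns the number of attempts used. Raises RuntimeError if the command
    is still failing after max_attempts tries. Assumes (unverified) that a
    failed download makes the command exit with a non-zero status.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with status {result.returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            # Give a struggling server some time before trying again.
            time.sleep(wait_seconds)
    raise RuntimeError(f"{cmd[0]} still failing after {max_attempts} attempts")


# Example call, using only options shown on this page:
# run_until_success(
#     ["pudl_data", "--sources", "epacems", "--states", "CO", "--years", "2018"]
# )
```

Capping the number of attempts keeps the wrapper from looping forever when the failure is permanent, for example when an agency has moved its files.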

Occasionally, the federal agencies reorganize their websites or FTP servers, changing the names or locations of the files and causing the download script to fail. We try to update the version of the script in the GitHub repository as quickly as possible when this happens, but it may take a while for those changes to show up in the released software. We are working on creating an automatically updated, versioned archive of the raw source files on Zenodo so that we don’t need to refer directly to these unstable files. See our scrapers and zen_storage GitHub repositories for more information.