Creating a Datastore

The input data that PUDL processes comes from a variety of US government agencies. These agencies typically make the data available on their websites or via FTP without really planning for programmatic access.

The pudl_data script helps you obtain and organize this data locally, for use by the rest of the PUDL system. It uses the routines defined in the pudl.datastore.datastore module. For details on what data is available, for what time periods, and how much of it there is, see the Data Catalog.

Note

You may not need to use pudl_data. If you use pudl_etl to process data that is not yet in your datastore but is available for download, it will try to download that data for you automatically.

Todo

Should we allow / require pudl_data to read its options from a settings file for the sake of consistency, so that all these settings can be put explicitly in the pudl_etl_example.yml input file? Or do we want the obtaining of data to be only implicit / automatic, based on what data the user attempts to process? Zane is inclined to just make it something that the ETL script does automatically.

For example, to download the 2018 EPA CEMS Hourly data for Colorado, run:

$ pudl_data --sources epacems --states CO --years 2018

If you do not specify years, the script will retrieve all available data. So to get everything for EIA Form 860 and EIA Form 923 you would run:

$ pudl_data --sources eia860 eia923

The script will download from all sources in parallel, so if you have a fast internet connection and need a lot of data, doing it all in one go makes sense. To pull down all the available data for all the sources (10+ GB) you would run:

$ pudl_data --sources eia860 eia923 epacems ferc1 epaipm
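Because a full download is large, it can be worth checking the available disk space first. The following is a minimal sketch and not part of PUDL: the function name has_free_gb is hypothetical, and it assumes a POSIX df whose filesystem names contain no spaces.

```shell
# Hypothetical helper (not part of PUDL): succeed only if the filesystem
# holding <dir> has at least <min_gb> gigabytes available.
has_free_gb() {
    dir="$1"
    min_gb="$2"
    # df -P prints POSIX-format output; -k reports 1024-byte blocks.
    # The fourth field of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example (commented out so the sketch has no side effects):
# has_free_gb . 20 && pudl_data --sources eia860 eia923 epacems ferc1 epaipm
```

The threshold of 20 in the commented example is only an illustration; pick a value that matches the sources, years, and states you plan to request.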

For more detailed usage information, see:

$ pudl_data --help

The script uses the downloaded data to populate a datastore under the data directory in your workspace, organized by data source, form, and date:

data/eia/form860/
data/eia/form923/
data/epa/cems/
data/epa/ipm/
data/ferc/form1/
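
To see which parts of this layout exist after a download, the directories can be checked from the shell. This is a minimal sketch and not part of PUDL: the function name check_datastore is hypothetical, and it assumes the five subdirectories listed above sit under a data directory inside the workspace path you pass in.

```shell
# Hypothetical helper (not part of PUDL): report which of the datastore
# subdirectories listed above exist under <workspace>/data.
check_datastore() {
    workspace="$1"
    for subdir in eia/form860 eia/form923 epa/cems epa/ipm ferc/form1; do
        if [ -d "$workspace/data/$subdir" ]; then
            echo "present: data/$subdir"
        else
            echo "missing: data/$subdir"
        fi
    done
}
```

Calling check_datastore with the path to your workspace prints one line per directory. A directory reported as missing simply means no data for that source has been downloaded yet.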

If the download fails (e.g. the FTP server times out), you can rerun the command until all the files have been downloaded. It will not try to re-download data that is already present locally unless you use the --clobber option. Depending on which data sources and how many years or states you request, and on the speed of your internet connection, the download may take anywhere from minutes to hours, and can consume 20+ GB of disk space even when the data is compressed.
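
Rerunning the command after a failed download can be automated with a small retry wrapper. The following is a minimal sketch and not part of PUDL: the function name retry is hypothetical, and it assumes that pudl_data signals a failed download with a non-zero exit status, which this page does not confirm.

```shell
# Hypothetical helper (not part of PUDL): run a command up to max_tries
# times, stopping at the first success.
# Assumption: the command signals failure with a non-zero exit status.
retry() {
    max_tries="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_tries failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (commented out so the sketch has no side effects):
# retry 5 pudl_data --sources epacems --states CO --years 2018
```

Since files that are already present locally are skipped, each further attempt should only need to fetch what is still missing. Do not combine this wrapper with --clobber, which would make every attempt start over.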