pudl.workspace.datastore module

Download the original public data sources used by PUDL.

This module provides programmatic, platform-independent access to the original data sources that are used to populate the PUDL database. Those sources currently include: FERC Form 1, EIA Form 860, and EIA Form 923. The module can download the data and populate a local datastore, organized so that the rest of the PUDL package knows where to find all the raw data it needs.

Support for selectively downloading portions of the EPA’s large Continuous Emissions Monitoring System dataset will be added in the future.

pudl.workspace.datastore.assert_valid_param(source, year, month=None, state=None, check_month=None)[source]

Check whether parameters used in various datastore functions are valid.

Parameters
  • source (str) – A string indicating which data source we are going to be downloading. Currently it must be one of the following: eia860, eia861, eia923, ferc1, epacems.

  • year (int or None) – the year for which data should be downloaded. Must be within the range of valid data years, which is specified for each data source in the pudl.constants module. Use None for data sources that do not have multiple years.

  • month (int) – the month for which data should be downloaded. Only used for EPA CEMS.

  • state (str) – the state for which data should be downloaded. Only used for EPA CEMS.

  • check_month (bool) – Whether to check that the input month is valid. This is automatically set to True for EPA CEMS.
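The checks described by these parameters can be sketched as follows. This is a minimal illustration, not the module's implementation: `check_params`, `VALID_SOURCES`, and `EXAMPLE_YEARS` are hypothetical names, the year range is invented (the real ranges live in pudl.constants), and raising `ValueError` is this sketch's choice rather than the documented behavior.

```python
# Placeholder data: the real valid years are defined in pudl.constants and are
# not reproduced here, so this range is invented for the example.
VALID_SOURCES = ("eia860", "eia861", "eia923", "ferc1", "epacems")
EXAMPLE_YEARS = {source: range(2010, 2019) for source in VALID_SOURCES}


def check_params(source, year, month=None, state=None, check_month=None):
    """Raise ValueError if the arguments do not describe a valid request.

    The state argument is accepted but not validated in this sketch.
    """
    if source not in VALID_SOURCES:
        raise ValueError(f"Unknown data source: {source!r}")
    if year is not None and year not in EXAMPLE_YEARS[source]:
        raise ValueError(f"No {source} data available for year {year!r}")
    if source == "epacems":
        # Month checking is switched on automatically for EPA CEMS.
        check_month = True
    if check_month and month not in range(1, 13):
        raise ValueError(f"Invalid month: {month!r}")


check_params("eia923", 2012)                        # passes silently
check_params("epacems", 2012, month=3, state="CO")  # passes silently
```

Passing `year=None` skips the year check, mirroring the note above about sources that do not have multiple years.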

Raises
pudl.workspace.datastore.check_if_need_update(source, year, states, data_dir, clobber=False)[source]

Check to see if the file is already downloaded and clobber is False.

Do we really need to download the requested data? The only case in which we don’t have to do anything is when the downloaded file already exists and clobber is False.

Parameters
  • source (str) – the data source to retrieve. Must be one of: eia860, eia923, ferc1, or epacems.

  • year (int or None) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years. Note that for data (like EPA CEMS) that have multiple datasets per year, this function will download all the files for the specified year. Use None for data sources that do not have multiple years.

  • states (iterable) – List of two letter US state abbreviations indicating which states data should be downloaded for.

  • data_dir (path-like) – Path to the top level datastore directory.

  • clobber (bool) – If True, clobber the existing file and note that the file will need to be replaced with an updated file.

Returns

Whether an update is needed (True) or not (False)

Return type

bool
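The decision rule described above is small enough to sketch directly. This is an illustrative sketch only: `needs_update` and its `expected_paths` argument are hypothetical, and how the real function derives the files to look for from source, year, and states is not shown in this document.

```python
from pathlib import Path


def needs_update(expected_paths, clobber=False):
    """Return True if a download is needed for the given local files.

    expected_paths is a hypothetical list of the files that the requested
    source and year should produce in the datastore.
    """
    if clobber:
        # Clobbering always forces a fresh download.
        return True
    # Without clobber, any missing file means there is still work to do.
    return not all(Path(p).exists() for p in expected_paths)
```

The only combination that returns False is "every file exists and clobber is False", which matches the single do-nothing case the description names.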

pudl.workspace.datastore.download(source, year, states, data_dir)[source]

Download the original data for the specified data source and year.

Given a data source and the desired year of data, download the original data files from the appropriate federal website, and place them in a temporary directory within the datastore. This function does not check whether the file already exists or needs to be updated, and it does not organize the datastore after download; it simply gets the requested file.

Parameters
  • source (str) – the data source to retrieve. Must be one of: ‘eia860’, ‘eia923’, ‘ferc1’, or ‘epacems’.

  • year (int or None) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years. Note that for data (like EPA CEMS) that have multiple datasets per year, this function will download all the files for the specified year. Use None for data sources that do not have multiple years.

  • states (iterable) – List of two letter US state abbreviations indicating which states data should be downloaded for.

  • data_dir (path-like) – Path to the top level datastore directory.

Returns

The path to the local downloaded file.

Return type

path-like

pudl.workspace.datastore.organize(source, year, states, data_dir, unzip=True, dl=True)[source]

Put downloaded original data file where it belongs in the datastore.

Once we’ve downloaded an original file from the public website it lives on, we need to put it where it belongs in the datastore. Optionally, we also unzip it and clean up the directory hierarchy that results from unzipping.

Parameters
  • source (str) – the data source to retrieve. Must be one of: ‘eia860’, ‘eia923’, ‘ferc1’, or ‘epacems’.

  • year (int or None) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years. Use None for data sources that do not have multiple years.

  • data_dir (path-like) – Path to the top level datastore directory.

  • unzip (bool) – If True, unzip the file once downloaded, and place the resulting data files where they ought to be in the datastore.

  • dl (bool) – If False, the files were not downloaded in this run.

Returns

None
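The move-then-unzip step described above can be sketched with the standard library alone. This is a hedged illustration, not the module's code: `move_and_unzip` and its arguments are hypothetical, and the cleanup of the unzipped directory hierarchy mentioned in the description is omitted.

```python
import shutil
import zipfile
from pathlib import Path


def move_and_unzip(tmp_zip, dest_dir, unzip=True):
    """Move a downloaded archive into dest_dir and optionally extract it."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    final_zip = dest_dir / Path(tmp_zip).name
    # Relocate the archive from the temporary download area.
    shutil.move(str(tmp_zip), str(final_zip))
    if unzip:
        # Extract the contents alongside the archive.
        with zipfile.ZipFile(final_zip) as archive:
            archive.extractall(dest_dir)
    return final_zip
```

Keeping the archive next to its extracted contents is a choice made for this sketch; the layout the real datastore uses is not specified here.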

Todo

Replace 4 assert statements

pudl.workspace.datastore.parallel_update(sources, years_by_source, states, data_dir, clobber=False, unzip=True, dl=True)[source]

Download many original source data files in parallel using threads.
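A thread-based fan-out of this kind can be sketched with `concurrent.futures`. This is a minimal sketch under assumptions: `run_in_threads` and `worker` are hypothetical names, and `years_by_source` is assumed to be a dict mapping each source name to a list of years, which this page does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(worker, sources, years_by_source, max_workers=4):
    """Call worker(source, year) for every pair, using a thread pool."""
    jobs = [(s, y) for s in sources for y in years_by_source[s]]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, s, y) for s, y in jobs]
        # result() re-raises any exception raised inside a worker thread,
        # and the list preserves submission order.
        return [f.result() for f in futures]
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than using the CPU.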

pudl.workspace.datastore.path(source, data_dir, year=None, month=None, state=None, file=True)[source]

Construct a variety of local datastore paths for a given data source.

PUDL expects the original data it ingests to be organized in a particular way. This function allows you to easily construct useful paths that refer to various parts of the data store, by specifying the data source you are interested in, and optionally the year of data you’re seeking, as well as whether you want the originally downloaded files for that year, or the directory in which a given year’s worth of data for a particular data source can be found.

Note: if you change the default arguments here, you should also change them for paths_for_year()

Parameters
  • source (str) – A string indicating which data source we are going to be downloading. Currently it must be one of the following: ferc1, eia923, eia860, epacems.

  • data_dir (path-like) – Path to the top level datastore directory.

  • year (int or None) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years, unless year is set to zero, in which case only the top level directory for the data source specified in source is returned. If None, no subdirectory is used for the data source.

  • month (int) – Month of year (1-12). Only applies to epacems.

  • state (str) – Two letter US state abbreviation. Only applies to epacems.

  • file (bool) – If True, return the full path to the originally downloaded file specified by the data source and year. If file is true, year must not be set to zero, as a year is required to specify a particular downloaded file.

Returns

The path to the requested resource within the local PUDL datastore.

Return type

str

pudl.workspace.datastore.paths_for_year(source, data_dir, year=None, states=None, file=True)[source]

Derive all paths for a given source and year. See path() for details.

Parameters
  • source (str) – A string indicating which data source we are going to be downloading. Currently it must be one of the following: ferc1, eia923, eia860, epacems.

  • data_dir (path-like) – Path to the top level datastore directory.

  • year (int or None) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years, unless year is set to zero, in which case only the top level directory for the data source specified in source is returned. If None, no subdirectory is used for the data source.

  • month (int) – Month of year (1-12). Only applies to epacems.

  • state (str) – Two letter US state abbreviation. Only applies to epacems.

  • file (bool) – If True, return the full path to the originally downloaded file specified by the data source and year. If file is true, year must not be set to zero, as a year is required to specify a particular downloaded file.

Returns

The path to the requested resource within the local PUDL datastore.

Return type

str
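For a source such as EPA CEMS, which is split by month and state, "all paths for a year" implies one request per (month, state) pair. The sketch below only enumerates those combinations; `expand_requests` is a hypothetical helper, and the assumption that every other source has a single file per year is this sketch's, not a documented guarantee.

```python
from itertools import product


def expand_requests(source, year, states=None):
    """List the (source, year, month, state) combinations for one year."""
    if source == "epacems":
        # EPA CEMS is split by month and state, so one year means many files.
        return [(source, year, month, state)
                for month, state in product(range(1, 13), states or [])]
    # Other sources are assumed here to have a single file per year.
    return [(source, year, None, None)]
```

Each resulting tuple could then be passed to a single-path constructor to obtain the full list of paths.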

pudl.workspace.datastore.source_url(source, year, month=None, state=None, table=None)[source]

Construct a download URL for the specified federal data source and year.

Parameters
  • source (str) – A string indicating which data source we are going to be downloading. Currently it must be one of the following: eia860, eia861, eia923, ferc1, epacems.

  • year (int or None) – the year for which data should be downloaded. Must be within the range of valid data years, which is specified for each data source in the pudl.constants module. Use None for data sources that do not have multiple years.

  • month (int) – the month for which data should be downloaded. Only used for EPA CEMS.

  • state (str) – the state for which data should be downloaded. Only used for EPA CEMS.

  • table (str) – the table for which data should be downloaded. Only used for EPA IPM.

Returns

a full URL from which the requested data may be obtained

Return type

download_url (str)

pudl.workspace.datastore.update(source, year, states, data_dir, clobber=False, unzip=True, dl=True)[source]

Update the local datastore for the given source and year.

If necessary, pull down a new copy of the data for the specified data source and year. If we already have the requested data, do nothing, unless clobber is True – in which case remove the existing data and replace it with a freshly downloaded copy.

Note that update_datastore.py runs this function in parallel, so downloads for multiple sources and years may be in progress simultaneously.

Parameters
  • source (str) – the data source to retrieve. Must be one of: ‘eia860’, ‘eia923’, ‘ferc1’, or ‘epacems’.

  • year (int) – the year of data that the returned path should pertain to. Must be within the range of valid data years, which is specified for each data source in pudl.constants.data_years.

  • states (iterable) – List of two letter US state abbreviations indicating which states data should be downloaded for. Currently only affects the epacems dataset.

  • clobber (bool) – If True, replace any existing copy of the requested data with freshly downloaded data.

  • unzip (bool) – If True, unzip the file once downloaded, and place the resulting data files where they ought to be in the datastore. EPA CEMS files will never be unzipped.

  • data_dir (str) – The data directory which holds the PUDL datastore.

  • dl (bool) – If False, don’t download the files, only unzip ones that are already present. If True, do download the files. Either way, still obey the unzip and clobber settings. (unzip=False and dl=False will do nothing.)

Returns

None