pudl.etl module

Run the PUDL ETL Pipeline.

The PUDL project integrates several different public data sets into well-normalized data packages, allowing easier access to and interaction between the datasets. This module coordinates the extract/transform/load (ETL) process for data from:

  • US Energy Information Administration (EIA): Form 860 (eia860), Form 923 (eia923)

  • US Federal Energy Regulatory Commission (FERC): Form 1 (ferc1)

  • US Environmental Protection Agency (EPA): Continuous Emissions Monitoring System (epacems), Integrated Planning Model (epaipm)

pudl.etl.etl(datapkg_settings, output_dir, pudl_settings)[source]

Run ETL process for data package specified by datapkg_settings dictionary.

This is the coordinating function for generating all of the CSVs for a data package. For each of the datasets enumerated in datapkg_settings, this function runs the dataset-specific ETL function. Along the way, it accumulates the names of the tables that have been loaded, which are later used to generate the metadata associated with the package.

Parameters
  • datapkg_settings (dict) – Validated ETL parameters for a single datapackage, originally read in from the PUDL ETL input file.

  • output_dir (path-like) – The individual datapackage directory, which will contain the datapackage.json file and the data directory.

  • pudl_settings (dict) – a dictionary describing paths to various resources and outputs.

Returns

The names of the tables included in the output datapackage.

Return type

list
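
The dispatch-and-accumulate pattern described above can be sketched as follows. This is only an illustration, not the module's implementation: the "datasets" key, the per-dataset stub functions, the DATASET_ETL mapping, and the example table names are all assumptions made for the sketch, and the stubs return table names instead of writing CSVs.

```python
# Hypothetical stand-ins for the dataset-specific ETL functions.
# The real functions extract, transform, and write CSVs; these stubs
# only report which tables they would have loaded.
def _etl_eia_stub(params):
    return list(params.get("eia923_tables", [])) + list(params.get("eia860_tables", []))


def _etl_ferc1_stub(params):
    return list(params.get("ferc1_tables", []))


# Assumed mapping from dataset name to its ETL function.
DATASET_ETL = {"eia": _etl_eia_stub, "ferc1": _etl_ferc1_stub}


def run_etl_sketch(datapkg_settings):
    """Run each dataset's ETL stub and accumulate the loaded table names."""
    tables = []
    # Assumed layout: a list of single-key dicts, one per dataset.
    for dataset_dict in datapkg_settings["datasets"]:
        for dataset, params in dataset_dict.items():
            tables.extend(DATASET_ETL[dataset](params))
    return tables


settings = {
    "datasets": [
        {"eia": {"eia923_tables": ["generation_eia923"]}},
        {"ferc1": {"ferc1_tables": ["fuel_ferc1"]}},
    ]
}
loaded = run_etl_sketch(settings)
```

The accumulated list plays the role of the return value documented above: it tells the metadata step which tables ended up in the package.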

pudl.etl.generate_datapkg_bundle(datapkg_bundle_settings, pudl_settings, datapkg_bundle_name, datapkg_bundle_doi=None, clobber=False)[source]

Coordinate the generation of data packages.

For each data package laid out in datapkg_bundle_settings, this function generates a data package. First, the settings are validated, which checks each of the settings listed in datapkg_bundle_settings. Then, for each package, the ETL (extract, transform, load) functions are run, which generates the CSVs. Finally, the metadata for each package is generated by pulling from the stored metadata, a JSON file containing the schema for all of the possible PUDL tables.

Parameters
  • datapkg_bundle_settings (iterable) – a list of dictionaries. Each item in the list corresponds to a data package. Each data package’s dictionary contains the arguments for its ETL function.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • datapkg_bundle_name (str) – name of the directory in which the bundle of data packages will be stored.

  • clobber (bool) – If True and there is already a directory with data packages with the datapkg_bundle_name, the existing data packages will be deleted and new data packages will be generated in their place.

Returns

A dictionary with datapackage names as the keys, and Python dictionaries representing tabular datapackage resource descriptors as the values, one per datapackage that was generated as part of the bundle.

Return type

dict
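
The effect of the clobber flag on an existing bundle directory can be sketched as below. The helper name prepare_bundle_dir and its base_dir argument are hypothetical, and raising FileExistsError when clobber is False is an assumption: the description above only states what happens when clobber is True.

```python
import shutil
from pathlib import Path


def prepare_bundle_dir(base_dir, datapkg_bundle_name, clobber=False):
    """Create an empty directory for a bundle of data packages.

    Hypothetical helper: if the directory already exists, it is deleted
    when clobber is True. Refusing to continue otherwise is an assumption.
    """
    bundle_dir = Path(base_dir) / datapkg_bundle_name
    if bundle_dir.exists():
        if not clobber:
            raise FileExistsError(
                f"{bundle_dir} already exists; pass clobber=True to replace it."
            )
        # Remove the existing data packages so new ones take their place.
        shutil.rmtree(bundle_dir)
    bundle_dir.mkdir(parents=True)
    return bundle_dir
```

Deleting the whole directory up front, instead of overwriting files one at a time, keeps stale CSVs from an earlier run out of the regenerated bundle.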

pudl.etl.get_flattened_etl_parameters(datapkg_bundle_settings)[source]

Compile flattened etl parameters.

The datapkg_bundle_settings argument is a list of dictionaries, with the specific ETL parameters for each dataset nested inside each dictionary. This function extracts the years, states, tables, etc. from the list of data package settings and compiles them into one dictionary.

Parameters

datapkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.

Returns

A dictionary of ETL parameters, with the parameter names as keys (e.g. ferc1_years, eia923_years) and the parameter values as values (e.g. a list of years for ferc1_years).

Return type

dict
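
A minimal sketch of the flattening step, under stated assumptions: each data package dictionary is assumed to hold a "datasets" list of single-key dictionaries, and list-valued parameters that appear in more than one package are assumed to be merged without duplicates. How the real function combines overlapping values is not stated above.

```python
def flatten_etl_parameters_sketch(datapkg_bundle_settings):
    """Collapse nested per-dataset ETL parameters into one flat dictionary."""
    flat = {}
    for datapkg in datapkg_bundle_settings:
        # Assumed layout: a list of single-key dicts, one per dataset.
        for dataset_dict in datapkg["datasets"]:
            for params in dataset_dict.values():
                for name, value in params.items():
                    if isinstance(value, list):
                        merged = flat.setdefault(name, [])
                        # Keep first-seen order and skip repeated entries.
                        for item in value:
                            if item not in merged:
                                merged.append(item)
                    else:
                        # Scalar parameters: the last package seen wins.
                        flat[name] = value
    return flat


bundle = [
    {"datasets": [{"ferc1": {"ferc1_years": [2017, 2018]}}]},
    {"datasets": [{"ferc1": {"ferc1_years": [2018, 2019]}},
                  {"eia": {"eia923_years": [2018]}}]},
]
flat_params = flatten_etl_parameters_sketch(bundle)
```

With this input, the keys of flat_params are the parameter names (ferc1_years, eia923_years), matching the shape of the return value described above.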

pudl.etl.validate_params(datapkg_bundle_settings, pudl_settings)[source]

Enforce validity of ETL parameters found in datapackage bundle settings.

For each data package enumerated in datapkg_bundle_settings, this function checks that the input parameters for each dataset are consistent with the known input options. Most of those options are enumerated in pudl.constants. For each dataset, the years, states, tables, etc. are checked to ensure that they are valid and present. If any parameters are not valid, an assertion error is raised.

Some options have default values or are hard-coded during validation. Tables typically default to all of the available tables if they aren’t set. CEMS is always partitioned by year and state, which means the option to leave it unpartitioned, or to partition it some other way, has been functionally removed.

Parameters
  • datapkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.

  • pudl_settings (dict) – a dictionary describing paths to various resources and outputs.

Returns

A validated list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.

Return type

iterable
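
The validate-and-default behaviour described for this function can be illustrated for a single dataset. The constants KNOWN_FERC1_YEARS and KNOWN_FERC1_TABLES are hypothetical stand-ins for the options kept in pudl.constants, and their values are illustrative only; the function name is also an assumption.

```python
# Hypothetical stand-ins for the known options in pudl.constants.
KNOWN_FERC1_YEARS = range(2009, 2019)
KNOWN_FERC1_TABLES = ("fuel_ferc1", "plants_steam_ferc1")


def validate_ferc1_sketch(params):
    """Check FERC Form 1 parameters against the known options."""
    years = list(params.get("ferc1_years", []))
    # Tables default to all known tables when they are not set.
    tables = list(params.get("ferc1_tables") or KNOWN_FERC1_TABLES)
    for year in years:
        if year not in KNOWN_FERC1_YEARS:
            raise AssertionError(f"Unrecognized FERC Form 1 year: {year}")
    for table in tables:
        if table not in KNOWN_FERC1_TABLES:
            raise AssertionError(f"Unrecognized FERC Form 1 table: {table}")
    return {"ferc1_years": years, "ferc1_tables": tables}


validated = validate_ferc1_sketch({"ferc1_years": [2017]})
```

Raising AssertionError explicitly, as opposed to using bare assert statements, keeps the checks active even when Python runs with optimizations enabled.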