pudl.etl module

Module coordinating the PUDL ETL pipeline, generating data packages.

The PUDL project integrates several public data sets into well-normalized data packages, making each dataset easier to access and to use alongside the others. This module coordinates the extract/transform/load (ETL) process for data from:

  • US Energy Information Administration (EIA):
      - Form 860 (eia860)
      - Form 923 (eia923)

  • US Federal Energy Regulatory Commission (FERC):
      - Form 1 (ferc1)

  • US Environmental Protection Agency (EPA):
      - Continuous Emissions Monitoring System (epacems)
      - Integrated Planning Model (epaipm)

pudl.etl.etl_pkg(pkg_settings, pudl_settings, pkg_bundle_dir)[source]

Extracts, transforms and loads CSVs.

This is the coordinating function for generating all of the CSVs for a data package. For each dataset enumerated in pkg_settings, this function runs the dataset-specific ETL function. Along the way, it keeps track of which tables have been loaded, which is useful for generating the metadata associated with the package.

Parameters
  • pkg_settings (dict) – a dictionary of etl_params for a datapackage.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • uuid_pkgs (uuid) –

Returns

dictionary with data packages (keys) and lists of tables (values)

Return type

dict
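The dispatch-and-accumulate pattern described for etl_pkg can be sketched roughly as follows. This is an illustration only, not the PUDL implementation: the layout of pkg_settings (a "name" key plus a "datasets" list of single-key dictionaries), the ETL_FUNCS registry, the stand-in ETL functions, and the table names are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def _etl_eia(params: dict) -> List[str]:
    """Hypothetical stand-in for a dataset-specific ETL function."""
    return ["example_eia_table"]


def _etl_ferc1(params: dict) -> List[str]:
    """Hypothetical stand-in for a dataset-specific ETL function."""
    return ["example_ferc1_table"]


# Hypothetical registry mapping dataset names to their ETL functions.
ETL_FUNCS: Dict[str, Callable[[dict], List[str]]] = {
    "eia": _etl_eia,
    "ferc1": _etl_ferc1,
}


def run_package_etl(pkg_settings: dict) -> Dict[str, List[str]]:
    """Run each dataset's ETL function and record the tables it produced."""
    tables: List[str] = []
    # Assumed layout: each entry is a one-key dict of {dataset_name: params}.
    for dataset in pkg_settings["datasets"]:
        for dataset_name, params in dataset.items():
            tables.extend(ETL_FUNCS[dataset_name](params))
    # Package name (key) mapped to the list of loaded tables (value).
    return {pkg_settings["name"]: tables}
```

The accumulated table list is what a later metadata step would consume, which is why the function returns it rather than only writing files.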

pudl.etl.generate_data_packages(pkg_bundle_settings, pudl_settings, pkg_bundle_name, debug=False, clobber=False)[source]

Coordinate the generation of data packages.

For each package in the bundle laid out in pkg_bundle_settings, this function generates a data package. First, the settings are validated, which checks each of the settings listed in pkg_bundle_settings. Then the ETL (extract, transform, load) functions are run for each package, generating CSVs. Finally, the metadata for each package is generated by pulling from the stored metadata, a JSON file containing the schema for all of the possible PUDL tables.

Parameters
  • pkg_bundle_settings (iterable) – a list of dictionaries. Each item in the list corresponds to a data package. Each data package’s dictionary contains the arguments for its ETL function.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • pkg_bundle_name (string) – name of the directory in which the bundle of data packages will live.

  • debug (bool) – If True, return a dictionary with package names (keys) and a list with the data package metadata and report (values).

  • clobber (bool) –

Returns

A tuple containing the generated metadata for the packages laid out in pkg_bundle_settings.

Return type

tuple
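The three stages described for generate_data_packages (validate, run the ETL, build metadata) could be wired together along these lines. This is a hedged sketch: the function name, the three callables passed in, and the dictionary return value are hypothetical simplifications for illustration, and do not reflect the real function's signature or its tuple return type.

```python
from typing import Callable, Dict, Iterable, List


def generate_packages_sketch(
    pkg_bundle_settings: Iterable[dict],
    validate: Callable[[Iterable[dict]], Iterable[dict]],
    run_etl: Callable[[dict], List[str]],
    build_metadata: Callable[[dict, List[str]], dict],
) -> Dict[str, dict]:
    """Validate settings, run the ETL per package, then build metadata."""
    # Step 1: validate the whole bundle before doing any expensive work.
    validated = validate(pkg_bundle_settings)
    metadata_by_package: Dict[str, dict] = {}
    for pkg_settings in validated:
        # Step 2: the ETL writes CSVs and reports which tables it loaded.
        tables = run_etl(pkg_settings)
        # Step 3: metadata is generated only for the tables actually loaded.
        metadata_by_package[pkg_settings["name"]] = build_metadata(
            pkg_settings, tables
        )
    return metadata_by_package
```

Validating up front means a malformed entry anywhere in the bundle is caught before any package's ETL has started.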

pudl.etl.get_flattened_etl_parameters(pkg_bundle_settings)[source]

Compile flattened ETL parameters.

Parameters

pkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.

Returns

dictionary of ETL parameters (e.g. ferc1_years, eia923_years)

Return type

dict
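One plausible reading of "flattened" is that the per-package, per-dataset parameters are merged into a single dictionary keyed by parameter name. The sketch below follows that reading. The input layout, the function name, and the choice to merge list values as a sorted union are all assumptions for illustration; the actual merging rules are not documented here.

```python
from typing import Dict, Iterable, List


def flatten_etl_parameters_sketch(
    pkg_bundle_settings: Iterable[dict],
) -> Dict[str, List]:
    """Merge the ETL parameters of every package into one flat dict."""
    flat: Dict[str, List] = {}
    for pkg in pkg_bundle_settings:
        # Assumed layout: each entry is a one-key dict of {dataset_name: params}.
        for dataset in pkg["datasets"]:
            for params in dataset.values():
                for key, values in params.items():
                    # Assumption: values are lists, merged as a sorted union.
                    merged = set(flat.get(key, [])) | set(values)
                    flat[key] = sorted(merged)
    return flat
```

With two packages requesting overlapping ferc1_years, for example, the flattened result lists each year once.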

pudl.etl.validate_params(pkg_bundle_settings, pudl_settings)[source]

Read and validate the ETL inputs from a settings file.

Parameters

pkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.

Returns

validated list of inputs

Return type

iterable