pudl.etl

Run the PUDL ETL Pipeline.

The PUDL project integrates several different public datasets into a well normalized relational database allowing easier access and interaction between all datasets. This module coordinates the extract/transfrom/load process for data from:

  • US Energy Information Agency (EIA): - Form 860 (eia860) - Form 923 (eia923)

  • US Federal Energy Regulatory Commission (FERC): - Form 1 (ferc1)

  • US Environmental Protection Agency (EPA): - Continuous Emissions Monitory System (epacems)

Module Contents

Functions

_read_static_tables_eia() → Dict[str, pandas.DataFrame]

Build dataframes of static EIA tables for use as foreign key constraints.

_etl_eia(etl_settings: pudl.settings.EiaSettings, ds_kwargs: Dict[str, Any]) → Dict[str, pandas.DataFrame]

Extract, transform and load CSVs for the EIA datasets.

_read_static_tables_ferc1() → Dict[str, pandas.DataFrame]

Populate static PUDL tables with constants for use as foreign keys.

_etl_ferc1(etl_settings: pudl.settings.Ferc1Settings, pudl_settings: Dict[str, Any]) → Dict[str, pandas.DataFrame]

Extract, transform and load CSVs for FERC Form 1.

etl_epacems(etl_settings: pudl.settings.EpaCemsSettings, pudl_settings: Dict[str, Any], ds_kwargs: Dict[str, Any]) → None

Extract, transform and load CSVs for EPA CEMS.

_etl_glue(etl_settings: pudl.settings.GlueSettings) → Dict[str, pandas.DataFrame]

Extract, transform and load CSVs for the Glue tables.

etl(etl_settings: pudl.settings.EtlSettings, pudl_settings: Dict, clobber: bool = False, use_local_cache: bool = True, gcs_cache_path: str = None, check_foreign_keys: bool = True, check_types: bool = True, check_values: bool = True)

Run the PUDL Extract, Transform, and Load data pipeline.

Attributes

logger

PUDL_META

pudl.etl.logger[source]
pudl.etl.PUDL_META[source]
pudl.etl._read_static_tables_eia() Dict[str, pandas.DataFrame][source]

Build dataframes of static EIA tables for use as foreign key constraints.

There are many values specified within the data that are essentially constant, but which we need to store for data validation purposes, for use as foreign keys. E.g. the list of valid EIA fuel type codes, or the possible state and country codes indicating a coal delivery’s location of origin. For now these values are primarily stored in a large collection of lists, dictionaries, and dataframes which are specified in the pudl.constants module.

pudl.etl._etl_eia(etl_settings: pudl.settings.EiaSettings, ds_kwargs: Dict[str, Any]) Dict[str, pandas.DataFrame][source]

Extract, transform and load CSVs for the EIA datasets.

Parameters
  • etl_settings – Validated ETL parameters required by this data source.

  • ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

Returns

A dictionary of EIA dataframes ready for loading into the PUDL DB.

pudl.etl._read_static_tables_ferc1() Dict[str, pandas.DataFrame][source]

Populate static PUDL tables with constants for use as foreign keys.

There are many values specified within the data that are essentially constant, but which we need to store for data validation purposes, for use as foreign keys. E.g. the list of valid EIA fuel type codes, or the possible state and country codes indicating a coal delivery’s location of origin. For now these values are primarily stored in a large collection of lists, dictionaries, and dataframes which are specified in the pudl.constants module. This function uses those data structures to populate a bunch of small infrastructural tables within the PUDL DB.

pudl.etl._etl_ferc1(etl_settings: pudl.settings.Ferc1Settings, pudl_settings: Dict[str, Any]) Dict[str, pandas.DataFrame][source]

Extract, transform and load CSVs for FERC Form 1.

Parameters
  • etl_settings – Validated ETL parameters required by this data source.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

Returns

Dataframes containing PUDL database tables pertaining to the FERC Form 1 data, keyed by table name.

pudl.etl.etl_epacems(etl_settings: pudl.settings.EpaCemsSettings, pudl_settings: Dict[str, Any], ds_kwargs: Dict[str, Any]) None[source]

Extract, transform and load CSVs for EPA CEMS.

Parameters
  • etl_settings – Validated ETL parameters required by this data source.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

Returns

Unlike the other ETL functions, the EPACEMS writes its output to Parquet as it goes, since the dataset is too large to hold in memory. So it doesn’t return a dictionary of dataframes.

pudl.etl._etl_glue(etl_settings: pudl.settings.GlueSettings) Dict[str, pandas.DataFrame][source]

Extract, transform and load CSVs for the Glue tables.

Parameters

etl_settings (GlueSettings) – Validated ETL parameters required by this data source.

Returns

A dictionary of pandas.Dataframe whose keys are the names of the corresponding database table.

Return type

dict

pudl.etl.etl(etl_settings: pudl.settings.EtlSettings, pudl_settings: Dict, clobber: bool = False, use_local_cache: bool = True, gcs_cache_path: str = None, check_foreign_keys: bool = True, check_types: bool = True, check_values: bool = True)[source]

Run the PUDL Extract, Transform, and Load data pipeline.

First we validate the settings, and then process data destined for loading into SQLite, which includes The FERC Form 1 and the EIA Forms 860 and 923. Once those data have been output to SQLite we mvoe on to processing the long tables, which will be loaded into Apache Parquet files. Some of this processing depends on data that’s already been loaded into the SQLite DB.

Parameters
  • etl_settings – settings that describe datasets to be loaded.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • clobber – If True and there is already a pudl.sqlite database it will be deleted and a new one will be created.

  • use_local_cache – controls whether datastore should be using local file cache.

  • gcs_cache_path – controls whether datastore should be using Google Cloud Storage based cache.

Returns

None