pudl.etl#

Run the PUDL ETL Pipeline.

The PUDL project integrates several different public datasets into a well normalized relational database allowing easier access and interaction between all datasets. This module coordinates the extract/transfrom/load process for data from:

  • US Energy Information Agency (EIA): - Form 860 (eia860) - Form 923 (eia923)

  • US Federal Energy Regulatory Commission (FERC): - Form 1 (ferc1)

  • US Environmental Protection Agency (EPA): - Continuous Emissions Monitory System (epacems)

Module Contents#

Functions#

_etl_eia(→ dict[str, pandas.DataFrame])

Extract, transform and load CSVs for the EIA datasets.

_read_static_tables_ferc1()

Compile static tables for FERC1 for foriegn key constaints.

_etl_ferc1(→ dict[str, pandas.DataFrame])

Extract, transform and load CSVs for FERC Form 1.

_etl_one_year_epacems(→ None)

Process one year of EPA CEMS and output year-state paritioned Parquet files.

etl_epacems(→ None)

Extract, transform and load CSVs for EPA CEMS.

_etl_eia_bulk_elec(→ dict[str, pandas.DataFrame])

Extract and transform EIA bulk electricity aggregates.

_etl_glue(→ dict[str, pandas.DataFrame])

Extract, transform and load CSVs for the Glue tables.

_read_static_encoding_tables(→ dict[str, pandas.DataFrame])

Build dataframes of static tables from a data source for use as foreign keys.

_read_static_pudl_tables(→ dict[str, pandas.DataFrame])

Read static tables compiled as part of PUDL and not from any agency dataset.

etl(etl_settings, pudl_settings[, clobber, ...])

Run the PUDL Extract, Transform, and Load data pipeline.

Attributes#

pudl.etl.logger[source]#
pudl.etl._etl_eia(eia_settings: pudl.settings.EiaSettings, ds_kwargs: dict[str, Any]) dict[str, pandas.DataFrame][source]#

Extract, transform and load CSVs for the EIA datasets.

Parameters:
  • eia_settings – Validated ETL parameters required by this data source.

  • ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

Returns:

A dictionary of EIA dataframes ready for loading into the PUDL DB.

pudl.etl._read_static_tables_ferc1()[source]#

Compile static tables for FERC1 for foriegn key constaints.

This function grabs static encoded tables via _read_static_encoding_tables() as well as two static tables that are non-encoded tables (ferc_accounts and ferc_depreciation_lines).

pudl.etl._etl_ferc1(ferc1_settings: pudl.settings.Ferc1Settings, pudl_settings: dict[str, Any]) dict[str, pandas.DataFrame][source]#

Extract, transform and load CSVs for FERC Form 1.

Parameters:
  • ferc1_settings – Validated ETL parameters required by this data source.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

Returns:

Dataframes containing PUDL database tables pertaining to the FERC Form 1 data, keyed by table name.

pudl.etl._etl_one_year_epacems(year: int, states: list[str], pudl_db: str, out_dir: str, ds_kwargs: dict[str, Any]) None[source]#

Process one year of EPA CEMS and output year-state paritioned Parquet files.

pudl.etl.etl_epacems(epacems_settings: pudl.settings.EpaCemsSettings, pudl_settings: dict[str, Any], ds_kwargs: dict[str, Any], clobber: str = False) None[source]#

Extract, transform and load CSVs for EPA CEMS.

Parameters:
  • epacems_settings – Validated ETL parameters required by this data source.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

  • clobber – If True and there is already a hourly_emissions_epacems parquer file or directory it will be deleted and a new one will be created.

Returns:

Unlike the other ETL functions, the EPACEMS writes its output to Parquet as it goes, since the dataset is too large to hold in memory. So it doesn’t return a dictionary of dataframes.

pudl.etl._etl_eia_bulk_elec(ds_kwargs: dict[str, Any]) dict[str, pandas.DataFrame][source]#

Extract and transform EIA bulk electricity aggregates.

Parameters:

ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

Returns:

A dictionary of DataFrames whose keys are the names of the corresponding database table.

pudl.etl._etl_glue(glue_settings: pudl.settings.GlueSettings, ds_kwargs: dict[str, Any], sqlite_dfs: dict[str, pandas.DataFrame], eia_settings: pudl.settings.EiaSettings) dict[str, pandas.DataFrame][source]#

Extract, transform and load CSVs for the Glue tables.

Parameters:
  • glue_settings – Validated ETL parameters required by this data source.

  • ds_kwargs – Keyword arguments for instantiating a PUDL datastore, so that the ETL can access the raw input data.

  • sqlite_dfs – The dictionary of dataframes to be loaded into the pudl database. We pass the dictionary though because the EPACAMD-EIA crosswalk needs to know which EIA plants and generators are being loaded into the database (based on whether we run the full or fast etl). The tests will break if we pass the generators_entity_eia table as an argument because of the ferc1_solo test (where no eia tables are in the sqlite_dfs dict). Passing the whole dict avoids this because the crosswalk will only load if there are eia tables in the dict, but the dict will always be there.

  • eia_settings – Validated ETL parameters required by this data source.

Returns:

A dictionary of DataFrames whose keys are the names of the corresponding database table.

pudl.etl._read_static_encoding_tables(etl_group: Literal[static_eia, static_ferc1]) dict[str, pandas.DataFrame][source]#

Build dataframes of static tables from a data source for use as foreign keys.

There are many values specified within the data that are essentially constant, but which we need to store for data validation purposes, for use as foreign keys. E.g. the list of valid EIA fuel type codes, or the possible state and country codes indicating a coal delivery’s location of origin. For now these values are primarily stored in a large collection of lists, dictionaries, and dataframes which are specified in the pudl.metadata subpackage. This function uses those data structures to populate a bunch of small infrastructural tables within the PUDL DB.

Parameters:

etl_group – name of static table etl group.

Returns:

a dictionary with table names as keys and dataframes as values for all tables labeled as static tables in their resource etl_group

pudl.etl._read_static_pudl_tables() dict[str, pandas.DataFrame][source]#

Read static tables compiled as part of PUDL and not from any agency dataset.

pudl.etl.etl(etl_settings: pudl.settings.EtlSettings, pudl_settings: dict, clobber: bool = False, use_local_cache: bool = True, gcs_cache_path: str = None, check_foreign_keys: bool = True, check_types: bool = True, check_values: bool = True)[source]#

Run the PUDL Extract, Transform, and Load data pipeline.

First we validate the settings, and then process data destined for loading into SQLite, which includes The FERC Form 1 and the EIA Forms 860 and 923. Once those data have been output to SQLite we mvoe on to processing the long tables, which will be loaded into Apache Parquet files. Some of this processing depends on data that’s already been loaded into the SQLite DB.

Parameters:
  • etl_settings – settings that describe datasets to be loaded.

  • pudl_settings – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • clobber – If True and there is already a pudl.sqlite database it will be deleted and a new one will be created.

  • use_local_cache – controls whether datastore should be using local file cache.

  • gcs_cache_path – controls whether datastore should be using Google Cloud Storage based cache.

Returns:

None