pudl.transform.epacems#

Module to perform data cleaning functions on EPA CEMS data tables.

Module Contents#

Functions#

harmonize_eia_epa_orispl(→ pandas.DataFrame)

Harmonize the ORISPL code to match the EIA data.

convert_to_utc(→ pandas.DataFrame)

Convert CEMS datetime data to UTC.

_load_plant_utc_offset(→ pandas.DataFrame)

Load the UTC offset for each EIA plant.

correct_gross_load_mw(→ pandas.DataFrame)

Fix values of gross load that are wrong by orders of magnitude.

transform(→ pandas.DataFrame)

Transform EPA CEMS hourly data and ready it for export to Parquet.

Attributes#

pudl.transform.epacems.logger[source]#
pudl.transform.epacems.harmonize_eia_epa_orispl(df: pandas.DataFrame, crosswalk_df: pandas.DataFrame) → pandas.DataFrame[source]#

Harmonize the ORISPL code to match the EIA data.

The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other. The crosswalk is integrated into the PUDL db.

This function merges the crosswalk with the CEMS data, adding the official plant_id_eia column to CEMS. Where there is no plant_id_eia value for a given plant_id_epa (i.e., the plant isn’t in the crosswalk yet), fillna() copies the plant_id_epa value into the plant_id_eia column. This is a reasonable fallback because the plant_id_epa is almost always correct.

EIA IDs are treated as the more reliable of the two, so the crosswalk is used to fix any erroneous EPA IDs, and the plant_id_epa column is then dropped to avoid confusion.

The EPA crosswalk is published at https://github.com/USEPA/camd-eia-crosswalk

Note that this transformation needs to be run before convert_to_utc, because convert_to_utc uses the plant ID to look up timezones.

Parameters:
  • df – A CEMS hourly dataframe for one year-month-state.

  • crosswalk_df – The core_epa__assn_eia_epacamd dataframe from the database.

Returns:

The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.
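
The merge-then-fillna() logic described above can be sketched roughly as follows. This is a minimal illustration, not the PUDL implementation: the helper name harmonize_ids_sketch is hypothetical, and it assumes the crosswalk can be reduced to one plant_id_eia per plant_id_epa, which may not hold for the real crosswalk.

```python
import pandas as pd


def harmonize_ids_sketch(cems: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only (hypothetical helper, not the PUDL function)."""
    # Assumption: one EIA ID per EPA ID; keep the first match if there are several.
    xwalk = crosswalk[["plant_id_epa", "plant_id_eia"]].drop_duplicates(
        subset="plant_id_epa"
    )
    # A left merge keeps every CEMS row, even for plants missing from the crosswalk.
    merged = cems.merge(xwalk, on="plant_id_epa", how="left", validate="many_to_one")
    # Fall back to the EPA ID where the crosswalk has no EIA ID.
    merged["plant_id_eia"] = (
        merged["plant_id_eia"].fillna(merged["plant_id_epa"]).astype("int64")
    )
    # Drop the EPA column so that only one plant ID remains.
    return merged.drop(columns="plant_id_epa")


# Toy data: plant 1 is in the crosswalk, plant 2 is not.
cems = pd.DataFrame({"plant_id_epa": [1, 1, 2], "gross_load_mw": [10.0, 12.0, 7.5]})
crosswalk = pd.DataFrame({"plant_id_epa": [1], "plant_id_eia": [10]})
harmonized = harmonize_ids_sketch(cems, crosswalk)
```

In this toy case, plant 1 is remapped to EIA ID 10, while plant 2 keeps its EPA ID because it has no crosswalk entry.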

pudl.transform.epacems.convert_to_utc(df: pandas.DataFrame, plant_utc_offset: pandas.DataFrame) → pandas.DataFrame[source]#

Convert CEMS datetime data to UTC.

Transformations include:

  • Account for timezone differences with offset from UTC.

Parameters:
  • df – A CEMS hourly dataframe for one year-state.

  • plant_utc_offset – A dataframe associating plants with their timezones.

Returns:

The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
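
As a rough sketch of the conversion described above, the local timestamp can be built from op_date and op_hour and then shifted by each plant's offset. The helper name convert_to_utc_sketch and the utc_offset column name are assumptions made for illustration; the actual function may differ, for example in how it handles plants without an offset or whether the result is timezone-aware.

```python
import pandas as pd


def convert_to_utc_sketch(
    cems: pd.DataFrame, plant_utc_offset: pd.DataFrame
) -> pd.DataFrame:
    """Illustrative only (hypothetical helper, not the PUDL function)."""
    # Attach each plant's fixed UTC offset (assumed column name: utc_offset).
    out = cems.merge(
        plant_utc_offset[["plant_id_eia", "utc_offset"]],
        on="plant_id_eia",
        how="left",
    )
    # Build the local timestamp from the date and the hour of the day.
    local = pd.to_datetime(out["op_date"]) + pd.to_timedelta(out["op_hour"], unit="h")
    # UTC = local time minus the offset (an offset of -5h moves the time 5h later).
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


# Toy data: one plant with a UTC-5 offset, two hourly records.
offsets = pd.DataFrame({"plant_id_eia": [10], "utc_offset": [pd.Timedelta(hours=-5)]})
hourly = pd.DataFrame(
    {
        "plant_id_eia": [10, 10],
        "op_date": ["2020-01-01", "2020-01-01"],
        "op_hour": [0, 23],
    }
)
converted = convert_to_utc_sketch(hourly, offsets)
```

Note that the resulting op_datetime_utc values in this sketch are timezone-naive datetimes that represent UTC.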

pudl.transform.epacems._load_plant_utc_offset(core_eia__entity_plants: pandas.DataFrame) → pandas.DataFrame[source]#

Load the UTC offset for each EIA plant.

CEMS times don’t change for DST, so we get the UTC offset by using the offset for the plants’ timezones in January.

Parameters:

core_eia__entity_plants – The core_eia__entity_plants dataframe, which provides each plant’s timezone.

Returns:

Dataframe of applicable timezones taken from the core_eia__entity_plants table.
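
The January trick described above can be illustrated as follows. This is a sketch under stated assumptions rather than the actual implementation: the helper names are hypothetical, the timezone column is assumed to hold IANA timezone names, and the utc_offset output column name is made up for the example.

```python
import pandas as pd


def january_utc_offset(tz_name: str) -> pd.Timedelta:
    """Standard-time offset of a timezone, sampled on 1 January (no DST in effect)."""
    return pd.Timedelta(pd.Timestamp("2020-01-01", tz=tz_name).utcoffset())


def plant_utc_offset_sketch(plants: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only (hypothetical helper, not the PUDL function)."""
    # Assumption: plants without a known timezone are simply skipped.
    out = plants.dropna(subset=["timezone"])[["plant_id_eia", "timezone"]].copy()
    out["utc_offset"] = pd.to_timedelta(out["timezone"].map(january_utc_offset))
    return out[["plant_id_eia", "utc_offset"]]


# Toy data: two plants with timezones, one without.
plants = pd.DataFrame(
    {
        "plant_id_eia": [10, 20, 30],
        "timezone": ["America/New_York", "America/Chicago", None],
    }
)
plant_offsets = plant_utc_offset_sketch(plants)
```

Sampling the offset in January avoids picking up a daylight saving shift for timezones in the Northern Hemisphere, which matches the fact that CEMS times do not change for DST.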

pudl.transform.epacems.correct_gross_load_mw(df: pandas.DataFrame) → pandas.DataFrame[source]#

Fix values of gross load that are wrong by orders of magnitude.

Parameters:

df – A CEMS dataframe

Returns:

The same DataFrame with corrected gross load values.
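
The documentation above does not state how the wrong values are detected or corrected. One plausible approach, sketched here purely as an assumption, is to treat values above an implausibly high threshold as having been reported in the wrong unit and scale them down. The helper name, the gross_load_mw column name, the threshold, and the factor below are all hypothetical.

```python
import pandas as pd


def scale_down_outliers(
    df: pd.DataFrame, threshold_mw: float = 2000.0, factor: float = 1000.0
) -> pd.DataFrame:
    """Illustrative only: threshold_mw and factor are hypothetical values."""
    out = df.copy()
    # Flag values that are implausibly large for a single unit.
    too_big = out["gross_load_mw"] > threshold_mw
    # Rescale only the flagged rows; everything else is left untouched.
    out.loc[too_big, "gross_load_mw"] = out.loc[too_big, "gross_load_mw"] / factor
    return out


# Toy data: the second value looks like it was reported in kW rather than MW.
loads = pd.DataFrame({"gross_load_mw": [100.0, 250000.0]})
corrected = scale_down_outliers(loads)
```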

pudl.transform.epacems.transform(raw_df: pandas.DataFrame, core_epa__assn_eia_epacamd: pandas.DataFrame, core_eia__entity_plants: pandas.DataFrame) → pandas.DataFrame[source]#

Transform EPA CEMS hourly data and ready it for export to Parquet.

Parameters:
  • raw_df – An extracted but not yet transformed year_quarter of EPA CEMS data.

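  • core_epa__assn_eia_epacamd – The EPA-EIA crosswalk dataframe, used to harmonize plant IDs.

  • core_eia__entity_plants – The EIA plant entity dataframe, used to look up plant timezones.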

Returns:

A single year_quarter of transformed EPA CEMS data.