pudl.transform.epacems

Module to perform data cleaning functions on EPA CEMS data tables.

Module Contents

Functions

fix_up_dates(df, plant_utc_offset)

Fix the dates for the CEMS data.

_load_plant_utc_offset(pudl_engine)

Load the UTC offset of each EIA plant.

harmonize_eia_epa_orispl(df)

Harmonize the ORISPL code to match the EIA data -- NOT YET IMPLEMENTED.

add_facility_id_unit_id_epa(df)

Harmonize columns that are added later.

_all_na_or_values(series, values)

Test whether every element in the series is either missing or in values.

correct_gross_load_mw(df)

Fix values of gross load that are wrong by orders of magnitude.

transform(epacems_raw_dfs, pudl_engine)

Transform EPA CEMS hourly data and ready it for export to Parquet.

Attributes

logger

pudl.transform.epacems.logger[source]

pudl.transform.epacems.fix_up_dates(df, plant_utc_offset)[source]

Fix the dates for the CEMS data.

Transformations include:

  • Account for timezone differences with offset from UTC.

Parameters

  • df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.

  • plant_utc_offset (pandas.DataFrame) – A dataframe of plants’ timezones.

Returns

The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.

Return type

pandas.DataFrame
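The timezone adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the PUDL implementation: it assumes the CEMS frame carries a plant_id_eia column, that op_hour is an integer hour in local standard time, and that utc_offset is stored as a timedelta. The helper name sketch_fix_up_dates is hypothetical.

```python
import pandas as pd


def sketch_fix_up_dates(df, plant_utc_offset):
    """Illustrative only: build op_datetime_utc from op_date and op_hour."""
    # Attach each plant's UTC offset (assumed here to be a timedelta column).
    out = df.merge(plant_utc_offset, on="plant_id_eia", how="left")
    # Local standard time is the operating date plus the hour of day.
    local = pd.to_datetime(out["op_date"]) + pd.to_timedelta(out["op_hour"], unit="h")
    # UTC is local time minus the offset from UTC (e.g. -5 hours for US Eastern).
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


# Made-up sample rows, for illustration only.
cems = pd.DataFrame(
    {"plant_id_eia": [3], "op_date": ["2019-07-01"], "op_hour": [13]}
)
offsets = pd.DataFrame(
    {"plant_id_eia": [3], "utc_offset": [pd.Timedelta(hours=-5)]}
)
result = sketch_fix_up_dates(cems, offsets)
```

In this sample, 13:00 local standard time at an offset of -5 hours becomes 18:00 UTC.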

pudl.transform.epacems._load_plant_utc_offset(pudl_engine)[source]

Load the UTC offset of each EIA plant.

CEMS times don’t change for DST, so we get the UTC offset by using the offset for the plants’ timezones in January.

Parameters

pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for an existing PUDL DB.

Returns

A DataFrame with columns plant_id_eia and utc_offset.

Return type

pandas.DataFrame
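The January trick can be illustrated with the standard library alone. This sketch assumes timezones are given as IANA names and uses an arbitrary year. The helper january_utc_offset is hypothetical and does not touch the PUDL database.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def january_utc_offset(tz_name, year=2010):
    """Return the UTC offset in effect on January 1 for an IANA timezone."""
    # US timezones are on standard time in January, so this avoids DST.
    return datetime(year, 1, 1, tzinfo=ZoneInfo(tz_name)).utcoffset()


eastern = january_utc_offset("America/New_York")
pacific = january_utc_offset("America/Los_Angeles")
```

Here eastern is a timedelta of -5 hours and pacific is a timedelta of -8 hours, regardless of the time of year the function is called.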

pudl.transform.epacems.harmonize_eia_epa_orispl(df)[source]

Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.

The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:

https://github.com/USEPA/camd-eia-crosswalk

Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.

Parameters

df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.

Returns

The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.

Return type

pandas.DataFrame

Todo

Actually implement the function…

pudl.transform.epacems.add_facility_id_unit_id_epa(df)[source]

Harmonize columns that are added later.

The Parquet schema requires consistent column names across all partitions and facility_id and unit_id_epa aren’t present before August 2008, so this function adds them in.

Parameters

df (pandas.DataFrame) – A CEMS dataframe

Returns

The same DataFrame, guaranteed to have integer facility_id and unit_id_epa columns.

Return type

pandas.DataFrame
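One way to guarantee the two columns exist is sketched below. It is an illustration only: the use of the nullable Int64 dtype is an assumption (the docstring says only "int"), and the helper name sketch_add_id_columns is hypothetical.

```python
import pandas as pd


def sketch_add_id_columns(df):
    """Illustrative only: ensure facility_id and unit_id_epa exist as nullable ints."""
    out = df.copy()
    for col in ("facility_id", "unit_id_epa"):
        if col not in out.columns:
            # Fill the absent column with missing values.
            out[col] = float("nan")
        # Nullable Int64 keeps integer values while still allowing missing entries.
        out[col] = out[col].astype("Int64")
    return out


# A made-up early partition that lacks both ID columns.
early = pd.DataFrame({"plant_id_eia": [3, 3], "gross_load_mw": [100.0, 120.0]})
fixed = sketch_add_id_columns(early)
```

With every partition carrying the same columns and dtypes, a single Parquet schema can describe all of them.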

pudl.transform.epacems._all_na_or_values(series, values)[source]

Test whether every element in the series is either missing or in values.

This is fiddly because isin() changes behavior if the series is totally NaN (because of type issues).

Example:

x = pd.DataFrame({'a': ['x', np.nan], 'b': [np.nan, np.nan]})

x.isin({'x', np.nan})

Parameters

  • series (pd.Series) – A data column.

  • values (set) – A set of values.

Returns

True if every element is either missing or in values, False otherwise.

Return type

bool
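A minimal sketch of one way to sidestep the isin() issue is to drop missing entries before testing membership. This is not necessarily how the PUDL function is written, and the name sketch_all_na_or_values is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_all_na_or_values(series, values):
    """Illustrative only: True if each element is missing or a member of values."""
    # Drop missing entries first so isin() never has to compare against NaN.
    present = series.dropna()
    if present.empty:
        # An all-missing (or empty) series trivially satisfies the test.
        return True
    return bool(present.isin(values).all())


frame = pd.DataFrame({"a": ["x", np.nan], "b": [np.nan, np.nan]})
a_ok = sketch_all_na_or_values(frame["a"], {"x"})
b_ok = sketch_all_na_or_values(frame["b"], {"x"})
a_bad = sketch_all_na_or_values(frame["a"], {"y"})
```

Both a_ok and b_ok are True, while a_bad is False because 'x' is not in the allowed set.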

pudl.transform.epacems.correct_gross_load_mw(df)[source]

Fix values of gross load that are wrong by orders of magnitude.

Parameters

df (pandas.DataFrame) – A CEMS dataframe

Returns

The same DataFrame with corrected gross load values.

Return type

pandas.DataFrame
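The docstring does not say how out-of-range values are detected or corrected. As a sketch under stated assumptions, one plausible rule treats readings above a cutoff as kW reported in place of MW and divides them by 1000. The column name gross_load_mw, the 2000 MW cutoff, the factor of 1000, and the helper name are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical cutoff: readings above this are assumed to be kW reported as MW.
ASSUMED_MAX_PLAUSIBLE_MW = 2000.0


def sketch_correct_gross_load(df, threshold=ASSUMED_MAX_PLAUSIBLE_MW):
    """Illustrative only: rescale implausibly large gross load readings."""
    out = df.copy()
    # Missing values compare as False, so they are left untouched.
    too_big = out["gross_load_mw"] > threshold
    out.loc[too_big, "gross_load_mw"] = out.loc[too_big, "gross_load_mw"] / 1000
    return out


# Made-up readings: one plausible, one off by three orders of magnitude, one missing.
loads = pd.DataFrame({"gross_load_mw": [450.0, 512000.0, None]})
corrected = sketch_correct_gross_load(loads)
```

In this sample the 512000.0 reading becomes 512.0, while the plausible and missing readings are unchanged.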

pudl.transform.epacems.transform(epacems_raw_dfs, pudl_engine)[source]

Transform EPA CEMS hourly data and ready it for export to Parquet.

Parameters
  • epacems_raw_dfs – a generator of pandas.DataFrame objects that yields raw EPA CEMS data, one state-year at a time.

  • pudl_engine – a sqlalchemy.engine.Engine for connecting to an existing PUDL DB.

Yields

pandas.DataFrame – A single year-state of EPA CEMS data.
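The generator-in, generator-out shape of transform can be sketched as below. This is an illustration of the pattern only: sketch_transform, raw_frames, and double_load are hypothetical stand-ins, and the cleaning steps and their order in the real function are not shown here.

```python
import pandas as pd


def sketch_transform(raw_dfs, steps):
    """Illustrative only: lazily apply each cleaning step to each raw frame."""
    for raw_df in raw_dfs:
        df = raw_df
        for step in steps:
            df = step(df)
        # Yield one cleaned frame at a time so the caller can write it out
        # before the next partition is read.
        yield df


def raw_frames():
    # Stand-in for the raw state-year generator.
    yield pd.DataFrame({"gross_load_mw": [1.0, 2.0]})
    yield pd.DataFrame({"gross_load_mw": [3.0]})


def double_load(df):
    # Placeholder step, not one of the module's actual cleaning functions.
    return df.assign(gross_load_mw=df["gross_load_mw"] * 2)


cleaned = list(sketch_transform(raw_frames(), [double_load]))
```

Because the function yields instead of returning a list, a caller can iterate over the results and export each year-state to Parquet as it arrives.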