pudl.transform.epacems

Module to perform data cleaning functions on EPA CEMS data tables.

Module Contents

Functions

fix_up_dates(df: pandas.DataFrame, plant_utc_offset: pandas.DataFrame) → pandas.DataFrame

Fix the dates for the CEMS data.

_load_plant_utc_offset(pudl_engine)

Load the UTC offset for each EIA plant.

harmonize_eia_epa_orispl(df)

Harmonize the ORISPL code to match the EIA data -- NOT YET IMPLEMENTED.

add_facility_id_unit_id_epa(df)

Harmonize columns that are added later.

_all_na_or_values(series, values)

Test whether every element in the series is either missing or in values.

correct_gross_load_mw(df: pandas.DataFrame) → pandas.DataFrame

Fix values of gross load that are wrong by orders of magnitude.

transform(raw_df: pandas.DataFrame, pudl_engine: sqlalchemy.engine.Engine) → pandas.DataFrame

Transform EPA CEMS hourly data and ready it for export to Parquet.

Attributes

logger

pudl.transform.epacems.logger[source]
pudl.transform.epacems.fix_up_dates(df: pandas.DataFrame, plant_utc_offset: pandas.DataFrame) → pandas.DataFrame[source]

Fix the dates for the CEMS data.

Transformations include:

  • Account for timezone differences with offset from UTC.

Parameters
  • df – A CEMS hourly dataframe for one year-state.

  • plant_utc_offset – A dataframe associating plant_id_eia with timezones.

Returns

The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
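As a rough illustration of the conversion described above, the following sketch combines op_date and op_hour into a local timestamp and shifts it by each plant's offset. It is not the module's implementation: the function name fix_up_dates_sketch, the ISO date strings, and the assumption that utc_offset is stored as a Timedelta are all hypothetical choices made for the example.

```python
import pandas as pd


def fix_up_dates_sketch(df: pd.DataFrame, plant_utc_offset: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: combine op_date and op_hour, then shift to UTC."""
    # Attach each plant's fixed UTC offset (assumed here to be a Timedelta column).
    out = df.merge(plant_utc_offset, on="plant_id_eia", how="left")
    # Local standard time is the date at midnight plus the reported hour.
    local = pd.to_datetime(out["op_date"]) + pd.to_timedelta(out["op_hour"], unit="h")
    # UTC is local time minus the offset (offsets west of Greenwich are negative).
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


cems = pd.DataFrame(
    {"plant_id_eia": [1, 1], "op_date": ["2019-01-01", "2019-07-01"], "op_hour": [0, 23]}
)
offsets = pd.DataFrame({"plant_id_eia": [1], "utc_offset": [pd.Timedelta(hours=-5)]})
result = fix_up_dates_sketch(cems, offsets)
```

Because the offset is fixed per plant, the July row is shifted by the same five hours as the January row.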

pudl.transform.epacems._load_plant_utc_offset(pudl_engine)[source]

Load the UTC offset for each EIA plant.

CEMS times don’t change for DST, so we get the UTC offset by using the offset for the plants’ timezones in January.

Parameters

pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for an existing PUDL DB.

Returns

A dataframe with columns plant_id_eia and utc_offset.

Return type

pandas.DataFrame
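The January trick can be sketched with the standard library alone. This example assumes the plant timezones are IANA names such as America/New_York; the helper january_utc_offset and the default year are hypothetical and only illustrate the idea of sampling the offset on a date when DST is not in effect.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def january_utc_offset(tz_name: str, year: int = 2011) -> timedelta:
    """Return the UTC offset in effect on January 1 for an IANA timezone name."""
    # US timezones are on standard time on January 1, so DST never affects the result.
    return datetime(year, 1, 1, tzinfo=ZoneInfo(tz_name)).utcoffset()


offsets = {tz: january_utc_offset(tz) for tz in ("America/New_York", "America/Denver")}
```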

pudl.transform.epacems.harmonize_eia_epa_orispl(df)[source]

Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.

The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:

https://github.com/USEPA/camd-eia-crosswalk

Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.

Parameters

df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.

Returns

The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.

Return type

pandas.DataFrame

Todo

Actually implement the function…

pudl.transform.epacems.add_facility_id_unit_id_epa(df)[source]

Harmonize columns that are added later.

The Parquet schema requires consistent column names across all partitions and facility_id and unit_id_epa aren’t present before August 2008, so this function adds them in.

Parameters

df (pandas.DataFrame) – A CEMS dataframe

Returns

The same DataFrame, guaranteed to have integer facility_id and unit_id_epa columns.

Return type

pandas.DataFrame
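A minimal sketch of this kind of schema harmonization follows. It assumes the missing columns should be filled with missing values and stored using pandas' nullable Int64 dtype; the real function may choose a different fill value or integer type, and the name add_missing_id_columns is hypothetical.

```python
import pandas as pd


def add_missing_id_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: ensure facility_id and unit_id_epa exist as integers."""
    out = df.copy()
    for col in ("facility_id", "unit_id_epa"):
        if col in out.columns:
            # The nullable Int64 dtype holds integers and missing values together.
            out[col] = out[col].astype("Int64")
        else:
            # Older partitions lack the column entirely; add it as all-missing.
            out[col] = pd.array([pd.NA] * len(out), dtype="Int64")
    return out


old = add_missing_id_columns(pd.DataFrame({"gross_load_mw": [10.0, 12.5]}))
```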

pudl.transform.epacems._all_na_or_values(series, values)[source]

Test whether every element in the series is either missing or in values.

This is fiddly because isin() changes behavior if the series is totally NaN (because of type issues).

Example:

    x = pd.DataFrame({'a': ['x', np.nan], 'b': [np.nan, np.nan]})
    x.isin({'x', np.nan})

Parameters

  • series (pd.Series) – A data column.

  • values (set) – A set of values.

Returns

True if every element is either missing or in values, False otherwise.

Return type

bool
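One way to sidestep the isin() quirk is to test missingness separately from membership, so NaN never has to appear in values. The sketch below is an assumption about how such a check could be written, not the module's actual code, and the public name all_na_or_values is hypothetical.

```python
import numpy as np
import pandas as pd


def all_na_or_values(series: pd.Series, values: set) -> bool:
    """Illustrative only: True if every element is missing or in values."""
    # Check missingness on its own so the result does not depend on how isin() treats NaN.
    return bool((series.isna() | series.isin(values)).all())


x = pd.DataFrame({"a": ["x", np.nan], "b": [np.nan, np.nan]})
```

With this approach, an all-NaN column such as x["b"] passes the check regardless of its dtype.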

pudl.transform.epacems.correct_gross_load_mw(df: pandas.DataFrame) → pandas.DataFrame[source]

Fix values of gross load that are wrong by orders of magnitude.

Parameters

df – A CEMS dataframe

Returns

The same DataFrame with corrected gross load values.
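The docstring does not say how wrong values are detected or rescaled. One plausible reading, assumed purely for illustration, is a unit mix-up in which kW were reported as MW. The sketch below divides values above a hypothetical plausibility cutoff by 1000; both the cutoff MAX_PLAUSIBLE_MW and the function name are invented for this example.

```python
import pandas as pd

# Hypothetical cutoff: loads above this many MW are treated as implausible.
MAX_PLAUSIBLE_MW = 2000.0


def correct_gross_load_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: rescale implausibly large gross_load_mw values."""
    out = df.copy()
    too_big = out["gross_load_mw"] > MAX_PLAUSIBLE_MW
    # Assume implausibly large values were reported in kW and convert them to MW.
    out.loc[too_big, "gross_load_mw"] = out.loc[too_big, "gross_load_mw"] / 1000
    return out


fixed = correct_gross_load_sketch(pd.DataFrame({"gross_load_mw": [350.0, 420000.0, None]}))
```

Missing values compare as False against the cutoff, so they pass through unchanged.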

pudl.transform.epacems.transform(raw_df: pandas.DataFrame, pudl_engine: sqlalchemy.engine.Engine) → pandas.DataFrame[source]

Transform EPA CEMS hourly data and ready it for export to Parquet.

Parameters
  • raw_df – An extracted but not yet transformed state-year of EPA CEMS data.

  • pudl_engine – SQLAlchemy connection engine for connecting to an existing PUDL DB.

Returns

A single year-state of EPA CEMS data