`pudl.transform.epacems`

Module to perform data cleaning functions on EPA CEMS data tables.

Module Contents

Functions

`fix_up_dates`(df, plant_utc_offset)	Fix the dates for the CEMS data.
`_load_plant_utc_offset`(pudl_engine)	Load the UTC offset of each EIA plant.
`harmonize_eia_epa_orispl`(df)	Harmonize the ORISPL code to match the EIA data -- NOT YET IMPLEMENTED.
`add_facility_id_unit_id_epa`(df)	Harmonize columns that are added later.
`_all_na_or_values`(series, values)	Test whether every element in the series is either missing or in values.
`correct_gross_load_mw`(df)	Fix values of gross load that are wrong by orders of magnitude.
`transform`(epacems_raw_dfs, pudl_engine)	Transform EPA CEMS hourly data and ready it for export to Parquet.

Attributes

logger

pudl.transform.epacems.logger[source]

pudl.transform.epacems.fix_up_dates(df, plant_utc_offset)[source]

Fix the dates for the CEMS data.

Transformations include:

Account for timezone differences with offset from UTC.

Parameters:
    df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
    plant_utc_offset (pandas.DataFrame) – A dataframe of plants’ timezones.
Returns: The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
Return type: pandas.DataFrame
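
The timezone adjustment described above can be illustrated with a minimal sketch. It is not the PUDL implementation: the function name `sketch_fix_up_dates` is hypothetical, and it assumes the CEMS frame carries `plant_id_eia`, `op_date` and `op_hour` columns and that `plant_utc_offset` holds a timedelta-valued `utc_offset` column keyed by `plant_id_eia`.

```python
import pandas as pd


def sketch_fix_up_dates(df, plant_utc_offset):
    """Illustrative only: add op_datetime_utc, drop op_date and op_hour."""
    # Attach each plant's fixed UTC offset (column names are assumptions).
    out = df.merge(plant_utc_offset, on="plant_id_eia", how="left")
    # Build a naive local timestamp from the date plus the hour of day.
    local = pd.to_datetime(out["op_date"]) + pd.to_timedelta(out["op_hour"], unit="h")
    # Local standard time = UTC + offset, so UTC = local - offset.
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


cems = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "op_date": ["2011-07-01", "2011-07-01"],
        "op_hour": [0, 23],
    }
)
offsets = pd.DataFrame({"plant_id_eia": [1], "utc_offset": [pd.Timedelta(hours=-5)]})
result = sketch_fix_up_dates(cems, offsets)
```

With an offset of -5 hours, hour 0 on 2011-07-01 local time maps to 05:00 UTC on the same day, and hour 23 rolls over to 04:00 UTC on 2011-07-02.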

pudl.transform.epacems._load_plant_utc_offset(pudl_engine)[source]

Load the UTC offset of each EIA plant.

CEMS times don’t change for DST, so we get the UTC offset by using the offset for the plants’ timezones in January.

Parameters: pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for an existing PUDL DB.
Returns: A dataframe with columns plant_id_eia and utc_offset.
Return type: pandas.DataFrame
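
The January trick can be sketched with the standard library alone. This is an illustration rather than the module's code: the helper `january_utc_offset` and the choice of 2011 as the reference year are assumptions, and the sketch takes a timezone name directly instead of reading plant timezones from the PUDL DB.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def january_utc_offset(tz_name, year=2011):
    """Return the UTC offset in effect on 1 January of the given year."""
    # January falls outside DST in US timezones, so this yields the
    # standard-time offset that CEMS timestamps are assumed to use all year.
    return datetime(year, 1, 1, tzinfo=ZoneInfo(tz_name)).utcoffset()


eastern = january_utc_offset("America/New_York")
central = january_utc_offset("America/Chicago")
```

Both results are `datetime.timedelta` objects (minus five and minus six hours respectively), which can be stored in a `utc_offset` column next to `plant_id_eia`.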

pudl.transform.epacems.harmonize_eia_epa_orispl(df)[source]

Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.

The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:

https://github.com/USEPA/camd-eia-crosswalk

Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.

Parameters: df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
Returns: The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.
Return type: pandas.DataFrame

Todo

Actually implement the function…

pudl.transform.epacems.add_facility_id_unit_id_epa(df)[source]

Harmonize columns that are added later.

The Parquet schema requires consistent column names across all partitions, but facility_id and unit_id_epa aren’t present before August 2008, so this function adds them when they are missing.

Parameters: df (pandas.DataFrame) – A CEMS dataframe
Returns: The same DataFrame, guaranteed to have integer facility_id and unit_id_epa columns.
Return type: pandas.DataFrame
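
A minimal sketch of this kind of column harmonization follows. The function name `sketch_add_id_columns` is hypothetical, and representing the missing IDs as `<NA>` in pandas' nullable `Int64` dtype is an assumption made for illustration; the documentation above does not say how PUDL fills the new columns.

```python
import pandas as pd


def sketch_add_id_columns(df):
    """Illustrative only: ensure both ID columns exist with an integer dtype."""
    out = df.copy()
    for col in ("facility_id", "unit_id_epa"):
        if col not in out.columns:
            # Assumption: absent IDs become <NA> in a nullable integer column.
            out[col] = pd.array([pd.NA] * len(out), dtype="Int64")
        else:
            # Existing values are cast so every partition shares one dtype.
            out[col] = out[col].astype("Int64")
    return out


old_partition = pd.DataFrame({"plant_id_eia": [3], "gross_load_mw": [120.0]})
patched = sketch_add_id_columns(old_partition)
```

Because every partition ends up with the same column names and dtypes, the frames can be written under a single Parquet schema.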

pudl.transform.epacems._all_na_or_values(series, values)[source]

Test whether every element in the series is either missing or in values.

This is fiddly because isin() changes behavior if the series is totally NaN (because of type issues).

Example:

    x = pd.DataFrame({'a': ['x', np.nan], 'b': [np.nan, np.nan]})
    x.isin({'x', np.nan})

Parameters:
    series (pd.Series) – A data column.
    values (set) – A set of values.
Returns: True or False, whether the elements are missing or in values
Return type: bool
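
One way to sidestep the `isin()` subtlety is to test for missing values separately and combine the two masks. The sketch below takes that approach; it is not necessarily how the module implements the check, and the name `sketch_all_na_or_values` is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_all_na_or_values(series, values):
    """Illustrative only: True if every element is missing or in `values`."""
    # Check missingness on its own rather than relying on NaN matching in isin().
    return bool((series.isna() | series.isin(values)).all())


x = pd.DataFrame({"a": ["x", np.nan], "b": [np.nan, np.nan]})
mixed_ok = sketch_all_na_or_values(x["a"], {"x"})
all_nan_ok = sketch_all_na_or_values(x["b"], {"x"})
has_other = sketch_all_na_or_values(pd.Series(["x", "y"]), {"x"})
```

Here the first two calls return `True` (every element is either `'x'` or missing) and the third returns `False` because `'y'` is neither.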

pudl.transform.epacems.correct_gross_load_mw(df)[source]

Fix values of gross load that are wrong by orders of magnitude.

Parameters: df (pandas.DataFrame) – A CEMS dataframe
Returns: The same DataFrame with corrected gross load values.
Return type: pandas.DataFrame

pudl.transform.epacems.transform(epacems_raw_dfs, pudl_engine)[source]

Transform EPA CEMS hourly data and ready it for export to Parquet.

Parameters

epacems_raw_dfs – a pandas.DataFrame generator that yields raw EPA CEMS data, one state-year at a time.
pudl_engine – a sqlalchemy.engine.Engine for connecting to an existing PUDL DB.

Yields

pandas.DataFrame – A single year-state of EPA CEMS data.
