pudl.transform.epacems module
Module to perform data cleaning functions on EPA CEMS data tables.
pudl.transform.epacems.add_facility_id_unit_id_epa(df)
Harmonize columns that are added later.
The datapackage validation checks for consistent column names, but the facility_id and unit_id_epa columns aren’t present in data from before August 2008, so this function adds them when they are missing.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame, guaranteed to have integer facility_id and unit_id_epa columns.
- Return type
pandas.DataFrame
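As a rough illustration of the behavior described above, the following sketch adds the two ID columns when a dataframe lacks them. The helper name `add_missing_id_columns`, the use of pandas' nullable `Int64` dtype, and the sample columns are assumptions made for this example; the actual PUDL implementation may differ.

```python
import pandas as pd


def add_missing_id_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: make sure facility_id and unit_id_epa exist."""
    out = df.copy()
    for col in ("facility_id", "unit_id_epa"):
        if col not in out.columns:
            # Assumption: missing IDs are stored as <NA> in a nullable
            # integer column so the dtype stays integer.
            out[col] = pd.array([pd.NA] * len(out), dtype="Int64")
        else:
            out[col] = out[col].astype("Int64")
    return out


# A frame shaped like pre-August-2008 data (sample columns are made up).
old = pd.DataFrame({"plant_id_eia": [3, 3], "gross_load_mw": [150.0, 152.5]})
fixed = add_missing_id_columns(old)
print(fixed.dtypes)
```

Working on a copy keeps the caller's dataframe unchanged, which makes the step easier to test in isolation.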
pudl.transform.epacems.correct_gross_load_mw(df)
Fix values of gross load that are wrong by orders of magnitude.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame with corrected gross load values.
- Return type
pandas.DataFrame
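One way such a correction could work is to treat implausibly large loads as values reported in kW rather than MW and rescale them. The sketch below assumes a `gross_load_mw` column, a cutoff of 2000 MW, and a fixed factor of 1000; all three are illustrative assumptions, not details confirmed by this page.

```python
import pandas as pd

# Assumption: anything above this is taken to be kW mislabeled as MW.
MAX_PLAUSIBLE_MW = 2000.0


def rescale_implausible_load(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: divide implausibly large loads by 1000."""
    out = df.copy()
    too_big = out["gross_load_mw"] > MAX_PLAUSIBLE_MW
    # Only flagged rows are rescaled; NaN compares False and is left as-is.
    out.loc[too_big, "gross_load_mw"] = out.loc[too_big, "gross_load_mw"] / 1000.0
    return out


cems = pd.DataFrame({"gross_load_mw": [350.0, 412000.0, None]})
corrected = rescale_implausible_load(cems)
print(corrected)
```

A fixed cutoff is simple but blunt: a unit that legitimately exceeds it would be rescaled incorrectly, so the threshold needs to sit above any real unit's capacity.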
pudl.transform.epacems.fix_up_dates(df, plant_utc_offset)
Fix the dates for the CEMS data.
Transformations include:
Account for timezone differences with offset from UTC.
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state
plant_utc_offset (pandas.DataFrame) – A dataframe of plants’ timezones.
- Returns
The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
- Return type
pandas.DataFrame
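The timezone adjustment can be sketched as a merge followed by a subtraction: a local timestamp is built from op_date and op_hour, then each plant's offset from UTC is subtracted. The helper name, the `plant_id_eia` and `utc_offset` column names, and the `%m-%d-%Y` date format are assumptions for this example only.

```python
import pandas as pd


def local_hours_to_utc(df: pd.DataFrame, plant_utc_offset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: combine op_date and op_hour into a UTC timestamp."""
    # Attach each plant's offset (assumed to be a timedelta column).
    out = df.merge(plant_utc_offset, how="left", on="plant_id_eia")
    local = pd.to_datetime(out["op_date"], format="%m-%d-%Y") + pd.to_timedelta(
        out["op_hour"], unit="h"
    )
    # local = UTC + offset, so UTC = local - offset.
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


cems = pd.DataFrame(
    {
        "plant_id_eia": [3, 3],
        "op_date": ["01-01-2010", "01-01-2010"],
        "op_hour": [0, 23],
        "gross_load_mw": [100.0, 120.0],
    }
)
offsets = pd.DataFrame({"plant_id_eia": [3], "utc_offset": [pd.Timedelta(hours=-6)]})
result = local_hours_to_utc(cems, offsets)
print(result)
```

With a left merge, a plant missing from the offsets table ends up with a missing timestamp rather than being dropped, which makes gaps in the timezone data visible.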
pudl.transform.epacems.harmonize_eia_epa_orispl(df)
Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.
The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:
https://github.com/USEPA/camd-eia-crosswalk
Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
- Returns
The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.
- Return type
pandas.DataFrame
Todo
Actually implement the function…
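Since the function is not implemented, the following is only a sketch of how a crosswalk lookup of this kind might look in pandas. The column names `orispl_code` and `plant_id_eia`, the tiny crosswalk table, and the rule that unmapped codes are kept unchanged are all hypothetical; none of them are taken from the EPA crosswalk or from PUDL.

```python
import pandas as pd


def map_orispl_to_eia(df: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: translate ORISPL codes to EIA plant IDs."""
    # Build an ORISPL -> EIA lookup; assumes each ORISPL code appears once.
    lookup = crosswalk.set_index("orispl_code")["plant_id_eia"]
    out = df.copy()
    mapped = out["orispl_code"].map(lookup)
    # Assumption: codes absent from the crosswalk already match the EIA ID.
    out["plant_id_eia"] = mapped.fillna(out["orispl_code"]).astype("int64")
    return out


# Made-up data for illustration only.
cems = pd.DataFrame({"orispl_code": [3, 880001, 26]})
crosswalk = pd.DataFrame({"orispl_code": [880001], "plant_id_eia": [54321]})
harmonized = map_orispl_to_eia(cems, crosswalk)
print(harmonized)
```

A step like this would have to run before fix_up_dates, since the timezone lookup depends on the plant ID.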