pudl.transform.epacems
Module to perform data cleaning functions on EPA CEMS data tables.
Module Contents
Functions
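fix_up_dates(df, plant_utc_offset) | Fix the dates for the CEMS data.
_load_plant_utc_offset(pudl_engine) | Load the UTC offset of each EIA plant.
harmonize_eia_epa_orispl(df) | Harmonize the ORISPL code to match the EIA data -- NOT YET IMPLEMENTED.
add_facility_id_unit_id_epa(df) | Harmonize columns that are added later.
_all_na_or_values(series, values) | Test whether every element in the series is either missing or in values.
correct_gross_load_mw(df) | Fix values of gross load that are wrong by orders of magnitude.
transform(epacems_raw_dfs, pudl_engine) | Transform EPA CEMS hourly data and ready it for export to Parquet.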
Attributes
- pudl.transform.epacems.fix_up_dates(df, plant_utc_offset)[source]
Fix the dates for the CEMS data.
Transformations include:
Account for timezone differences with offset from UTC.
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
plant_utc_offset (pandas.DataFrame) – A dataframe of plants’ timezones.
- Returns
The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
- Return type
pandas.DataFrame
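The conversion described above can be sketched as follows. This is a minimal illustration, not the PUDL implementation: the helper name add_utc_datetime, the ISO-formatted date strings, and the assumption that utc_offset holds pandas Timedelta values are all hypothetical choices made for the example.

```python
import pandas as pd


def add_utc_datetime(df, plant_utc_offset):
    """Hypothetical stand-in for fix_up_dates (illustration only)."""
    # Attach each plant's fixed UTC offset to its hourly rows.
    out = df.merge(plant_utc_offset, on="plant_id_eia", how="left")
    # Local standard time is the operating date plus the hour of day.
    local = pd.to_datetime(out["op_date"]) + pd.to_timedelta(out["op_hour"], unit="h")
    # Subtracting the offset (negative for US plants) moves the time to UTC.
    out["op_datetime_utc"] = local - out["utc_offset"]
    return out.drop(columns=["op_date", "op_hour", "utc_offset"])


# Assumed input shapes: ISO date strings and Timedelta offsets.
cems = pd.DataFrame(
    {
        "plant_id_eia": [3, 3],
        "op_date": ["2019-07-01", "2019-07-01"],
        "op_hour": [0, 23],
    }
)
offsets = pd.DataFrame(
    {"plant_id_eia": [3], "utc_offset": [pd.Timedelta(hours=-6)]}
)
result = add_utc_datetime(cems, offsets)
```

In this sketch a plant six hours behind UTC that reports hour 23 ends up on the following UTC day, which is why the date and hour are combined before the offset is applied.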
- pudl.transform.epacems._load_plant_utc_offset(pudl_engine)[source]
Load the UTC offset of each EIA plant.
CEMS times don’t change for DST, so we get the UTC offset by using the offset for the plants’ timezones in January.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for an existing PUDL DB.
- Returns
A dataframe with columns plant_id_eia and utc_offset.
- Return type
pandas.DataFrame
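The January trick can be illustrated with the standard library. This is a sketch only: january_utc_offset is a hypothetical helper, the default year is arbitrary, and the use of zoneinfo is an assumption made for the example rather than a statement about which timezone library PUDL uses.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def january_utc_offset(tz_name, year=2011):
    """Return the standard-time UTC offset for an IANA timezone name."""
    # January 1 is outside daylight saving time in US timezones, so the
    # offset on that date is the zone's standard-time offset.
    jan1 = datetime(year, 1, 1, tzinfo=ZoneInfo(tz_name))
    return jan1.utcoffset()


chicago_standard = january_utc_offset("America/Chicago")

# For contrast, the same zone in July carries the DST offset instead.
chicago_july = datetime(2011, 7, 1, tzinfo=ZoneInfo("America/Chicago")).utcoffset()
```

Because CEMS timestamps stay on standard time all year, the January value is the one to apply to every hour, including summer hours.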
- pudl.transform.epacems.harmonize_eia_epa_orispl(df)[source]
Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.
The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:
https://github.com/USEPA/camd-eia-crosswalk
Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
- Returns
The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.
- Return type
pandas.DataFrame
Todo
Actually implement the function…
- pudl.transform.epacems.add_facility_id_unit_id_epa(df)[source]
Harmonize columns that are added later.
The Parquet schema requires consistent column names across all partitions and facility_id and unit_id_epa aren’t present before August 2008, so this function adds them in.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame, guaranteed to have integer facility_id and unit_id_epa columns.
- Return type
pandas.DataFrame
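A minimal sketch of this kind of column harmonization is shown below. The helper name ensure_id_columns is hypothetical, and the choice of the nullable "Int64" dtype is an assumption made so that an integer column can still hold missing values.

```python
import pandas as pd


def ensure_id_columns(df):
    """Hypothetical helper: guarantee nullable-integer ID columns exist."""
    out = df.copy()
    for col in ("facility_id", "unit_id_epa"):
        if col in out.columns:
            # Column already present: only normalize its dtype.
            out[col] = out[col].astype("Int64")
        else:
            # Column absent (older data): add it, filled with missing values.
            out[col] = pd.array([pd.NA] * len(out), dtype="Int64")
    return out


old_partition = pd.DataFrame({"plant_id_eia": [3, 3]})
fixed = ensure_id_columns(old_partition)
```

A plain NumPy integer column cannot represent missing values, which is why the sketch uses a nullable integer dtype for partitions where the IDs were never reported.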
- pudl.transform.epacems._all_na_or_values(series, values)[source]
Test whether every element in the series is either missing or in values.
This is fiddly because isin() changes behavior if the series is totally NaN (because of type issues).
- Example
x = pd.DataFrame({'a': ['x', np.nan], 'b': [np.nan, np.nan]})
x.isin({'x', np.nan})
- Parameters
series (pd.Series) – A data column.
values (set) – A set of values.
- Returns
True if every element is either missing or in values, False otherwise.
- Return type
bool
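One way to sidestep the isin() difficulty is to drop missing entries before testing membership. The sketch below takes that approach; all_na_or_in is a hypothetical name and this is not necessarily how the PUDL function is written.

```python
import pandas as pd


def all_na_or_in(series, values):
    """Return True if every element is missing or is a member of values."""
    # Remove missing entries first so isin() never has to compare NaN.
    present = series.dropna()
    # .all() on an empty selection is True, which covers the all-NaN case.
    return bool(present.isin(values).all())


mixed = all_na_or_in(pd.Series(["x", None]), {"x"})
all_missing = all_na_or_in(pd.Series([float("nan"), float("nan")]), {"x"})
unexpected = all_na_or_in(pd.Series(["x", "y"]), {"x"})
```

Here the all-missing column passes because nothing remains to check once the missing entries are dropped, regardless of the column's dtype.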
- pudl.transform.epacems.correct_gross_load_mw(df)[source]
Fix values of gross load that are wrong by orders of magnitude.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame with corrected gross load values.
- Return type
pandas.DataFrame
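An order-of-magnitude correction of this kind can be sketched as below. The helper name, the 2000 MW plausibility cutoff, and the factor of 1000 (as if a value had been reported in kW) are illustrative assumptions; this page does not state the rule the function actually applies.

```python
import pandas as pd


def scale_down_implausible_loads(df, col="gross_load_mw",
                                 max_plausible_mw=2000, factor=1000):
    """Hypothetical sketch: rescale loads above an assumed plausible maximum."""
    out = df.copy()
    # Missing values compare as False, so they are left untouched.
    too_big = out[col] > max_plausible_mw
    out.loc[too_big, col] = out.loc[too_big, col] / factor
    return out


raw = pd.DataFrame({"gross_load_mw": [150.0, 450000.0, None]})
corrected = scale_down_implausible_loads(raw)
```

A single fixed cutoff is a blunt instrument; it is used here only to show the masking-and-rescaling pattern.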
- pudl.transform.epacems.transform(epacems_raw_dfs, pudl_engine)[source]
Transform EPA CEMS hourly data and ready it for export to Parquet.
- Parameters
epacems_raw_dfs – a pandas.DataFrame generator that yields raw epacems data, one state-year at a time.
pudl_engine – a sqlalchemy.engine.Engine for connecting to an existing PUDL DB.
- Yields
pandas.DataFrame – A single year-state of EPA CEMS data,