pudl.extract.epacems#

Retrieve data from EPA CEMS hourly zipped CSVs.

Presently, this module is where the CEMS columns are renamed and dropped. Any columns in the IGNORE_COLS set are excluded from the final output. All of these columns are calculable rates, measurement flags, or descriptors (like facility name) that can be accessed by merging this data with the EIA860 plants entity table. We also remove the FACILITY_ID field because it is internal to the EPA’s business accounting database, and the UNIT_ID field because it is a unique (calculable) identifier for plant_id and emissions_unit_id (previously UNITID) groupings. The difference between the UNITID and UNIT_ID fields was confirmed through correspondence with the EPA’s CAMD team.
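The drop-then-rename step described above can be sketched in plain Python. This is only an illustration, not the module's pandas-based implementation: the two small mappings are hypothetical stand-ins for the real RENAME_DICT and IGNORE_COLS, the helper name read_renamed is made up, and FACILITY_NAME is an assumed raw column name.

```python
import csv
import io

# Hypothetical stand-ins for the module-level constants; the real ones are larger.
RENAME_DICT = {"ORISPL_CODE": "plant_id_epa", "UNITID": "emissions_unit_id"}
IGNORE_COLS = {"FACILITY_NAME", "FACILITY_ID", "UNIT_ID"}


def read_renamed(csv_file):
    """Read a CEMS-like CSV, skipping ignored columns and renaming the rest."""
    reader = csv.DictReader(csv_file)
    return [
        {
            RENAME_DICT.get(col, col): value  # keep the raw name if no rename exists
            for col, value in row.items()
            if col not in IGNORE_COLS
        }
        for row in reader
    ]


# A one-row sample with made-up values.
raw = io.StringIO(
    "ORISPL_CODE,UNITID,FACILITY_NAME,FACILITY_ID,UNIT_ID\n"
    "1001,1,Example Plant,42,777\n"
)
rows = read_renamed(raw)
```

Filtering while reading, rather than after loading, keeps the ignored columns out of memory entirely, which matters for hourly data of this size.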

Pre-transform, the plant_id_epa field is a close but not perfect indicator for plant_id_eia. In the raw data it’s called ORISPL_CODE, but that’s not entirely accurate. The epacamd_eia crosswalk shows the mapping between ORISPL_CODE as it appears in CEMS and the plant_id_eia field used in EIA data. Hence, we’ve called it plant_id_epa until it gets transformed into plant_id_eia during the transform process with help from the crosswalk.
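As a rough illustration of why the field keeps the plant_id_epa name until the transform step, the lookup below maps EPA IDs to EIA IDs through a crosswalk. The crosswalk contents, the ID values, and the helper name to_plant_id_eia are all made up for this sketch, and returning None for unmatched IDs is an assumption rather than a description of how PUDL handles them.

```python
# Hypothetical crosswalk: plant_id_epa (ORISPL_CODE in CEMS) -> plant_id_eia.
# The IDs are invented; most plants map to themselves, a few do not.
CROSSWALK = {1001: 1001, 900001: 55555}


def to_plant_id_eia(plant_id_epa, crosswalk):
    """Return the crosswalked EIA plant ID, or None when no mapping exists."""
    return crosswalk.get(plant_id_epa)


same_id = to_plant_id_eia(1001, CROSSWALK)         # EPA and EIA IDs agree
different_id = to_plant_id_eia(900001, CROSSWALK)  # EPA and EIA IDs differ
unmatched = to_plant_id_eia(123456, CROSSWALK)     # absent from the crosswalk
```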

Module Contents#

Classes#

EpaCemsPartition

Represents EpaCems partition identifying unique resource file.

EpaCemsDatastore

Helper class to extract EpaCems resources from datastore.

Functions#

extract(year, state, ds)

Coordinate the extraction of EPA CEMS hourly DataFrames.

Attributes#

logger

RENAME_DICT

A dictionary containing EPA CEMS column names (keys) and replacement names (values).

IGNORE_COLS

The set of EPA CEMS columns to ignore when reading data.

pudl.extract.epacems.logger[source]#
pudl.extract.epacems.RENAME_DICT[source]#

A dictionary containing EPA CEMS column names (keys) and replacement names to use when reading those columns into PUDL (values). There are some duplicate rename values because the column names change year to year.

Type:

dict
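To see what duplicate rename values look like in practice, the sketch below groups raw column names by the PUDL name they map to. The dictionary entries are hypothetical examples, not the actual contents of RENAME_DICT, and raw_names_by_target is a made-up helper.

```python
from collections import defaultdict

# Hypothetical excerpt: two raw spellings from different years share one target.
RENAME_DICT = {
    "ORISPL_CODE": "plant_id_epa",
    "UNITID": "emissions_unit_id",
    "SO2_MASS": "so2_mass_lbs",
    "SO2_MASS (lbs)": "so2_mass_lbs",
}


def raw_names_by_target(rename_dict):
    """Group raw column names by the replacement name they are mapped to."""
    grouped = defaultdict(list)
    for raw_name, target in rename_dict.items():
        grouped[target].append(raw_name)
    return dict(grouped)


grouped = raw_names_by_target(RENAME_DICT)
# Targets reachable from more than one raw column name.
duplicates = {target: raws for target, raws in grouped.items() if len(raws) > 1}
```

Because any single file uses only one spelling per column, a many-to-one mapping like this should not produce duplicate column names after renaming.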

pudl.extract.epacems.IGNORE_COLS[source]#

The set of EPA CEMS columns to ignore when reading data.

Type:

set

class pudl.extract.epacems.EpaCemsPartition[source]#

Bases: NamedTuple

Represents EpaCems partition identifying unique resource file.

year :str[source]#
state :str[source]#
get_key()[source]#

Returns hashable key for use with EpaCemsDatastore.

get_filters()[source]#

Returns filters for retrieving given partition resource from Datastore.

get_monthly_file(month: int) pathlib.Path[source]#

Returns the filename (without suffix) that contains the monthly data.
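A minimal sketch of how a partition like this might behave is shown below. PartitionSketch is a hypothetical stand-in, and the return shapes and the file naming pattern (year, lowercase state, zero-padded month) are assumptions made for illustration, not confirmed details of EpaCemsPartition.

```python
from pathlib import Path
from typing import NamedTuple


class PartitionSketch(NamedTuple):
    """Hypothetical stand-in for EpaCemsPartition."""

    year: str
    state: str

    def get_key(self):
        # Tuples of strings are hashable, so this can serve as a cache key.
        return (self.year, self.state.lower())

    def get_filters(self):
        # Assumed shape: keyword filters identifying one resource.
        return {"year": self.year, "state": self.state.lower()}

    def get_monthly_file(self, month: int) -> Path:
        # Assumed naming pattern; no suffix, so callers can add .zip or .csv.
        return Path(f"{self.year}{self.state.lower()}{month:02}")


part = PartitionSketch(year="2020", state="AL")
stem = part.get_monthly_file(3)
csv_name = stem.with_suffix(".csv")
```

Returning the name without a suffix lets the same stem address both the monthly zip archive and the csv file inside it.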

class pudl.extract.epacems.EpaCemsDatastore(datastore: pudl.workspace.datastore.Datastore)[source]#

Helper class to extract EpaCems resources from datastore.

EpaCems resources are identified by a year and a state. Each of these zip files contains monthly zip files that in turn contain csv files. This class implements the get_data_frame method, which concatenates the monthly tables for a given state and year across all months.

get_data_frame(partition: EpaCemsPartition) pandas.DataFrame[source]#

Constructs dataframe holding data for a given (year, state) partition.
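The nested-archive traversal can be sketched with the standard library alone. This is not the class's implementation: it collects rows as dictionaries instead of building a pandas.DataFrame, the member naming pattern is an assumption, skipping absent months is an assumption, and make_zip and read_state_year are hypothetical helpers.

```python
import csv
import io
import zipfile


def make_zip(members):
    """Build an in-memory zip archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_state_year(outer_bytes, year, state):
    """Concatenate the rows of every monthly csv inside a state-year archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        names = set(outer.namelist())
        for month in range(1, 13):
            stem = f"{year}{state.lower()}{month:02}"  # assumed naming pattern
            if f"{stem}.zip" not in names:
                continue  # assumption: months without a file are skipped
            monthly_bytes = outer.read(f"{stem}.zip")
            with zipfile.ZipFile(io.BytesIO(monthly_bytes)) as inner:
                with inner.open(f"{stem}.csv") as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    rows.extend(csv.DictReader(text))
    return rows


# Tiny two-month archive with made-up values.
january = make_zip({"2020al01.csv": b"ORISPL_CODE,UNITID\n1001,1\n"})
february = make_zip({"2020al02.csv": b"ORISPL_CODE,UNITID\n1001,2\n"})
outer = make_zip({"2020al01.zip": january, "2020al02.zip": february})
rows = read_state_year(outer, "2020", "AL")
```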

_csv_to_dataframe(csv_file) pandas.DataFrame[source]#

Convert a CEMS csv file into a pandas.DataFrame.

Parameters:

csv_file (file-like object) – data to be read

Returns:

A DataFrame containing the contents of the CSV file.

pudl.extract.epacems.extract(year: int, state: str, ds: pudl.workspace.datastore.Datastore)[source]#

Coordinate the extraction of EPA CEMS hourly DataFrames.

Parameters:
  • year – report year of the data to extract

  • state – state of the data to extract

  • ds – Initialized datastore

Yields:

pandas.DataFrame – A single state-year of EPA CEMS hourly emissions data.