pudl.extract.epacems#

Retrieve data from EPA CEMS hourly zipped CSVs.

Prior to August 2023, this data was retrieved from an FTP server. Since August 2023, it has been retrieved from the CEMS API. The file format has changed from monthly CSVs for each state to one CSV per state per year, and the column names have changed as well. The mapping between old and new column names was determined from the CEMS API documentation.

Presently, this module is where the CEMS columns are renamed and dropped. Any columns in the API_IGNORE_COLS set are excluded from the final output. All of these columns are calculable rates, measurement flags, or descriptors (like facility name) that can be accessed by merging this data with the EIA860 plants entity table. We also remove the FACILITY_ID field because it is internal to the EPA's business accounting database.

Pre-transform, the plant_id_epa field is a close but imperfect indicator of plant_id_eia. In the raw data it is called Facility ID (ORISPL code), but that name is not entirely accurate. The core_epa__assn_eia_epacamd crosswalk shows the mapping between Facility ID as it appears in CEMS and the plant_id_eia field used in EIA data. Hence, we call it plant_id_epa until it is transformed into plant_id_eia during the transform step with help from the crosswalk.
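For orientation, here is a minimal, hypothetical sketch of how the crosswalk resolves plant_id_epa to plant_id_eia during the transform step. The values and the gross_load_mw column are toy examples; the real crosswalk carries additional ID columns (for example unit-level identifiers):

    import pandas as pd

    # Toy frames only: real CEMS data and the core_epa__assn_eia_epacamd
    # crosswalk carry many more columns than shown here.
    cems = pd.DataFrame({"plant_id_epa": [3, 7], "gross_load_mw": [120.0, 45.5]})
    crosswalk = pd.DataFrame({"plant_id_epa": [3, 7], "plant_id_eia": [3, 7001]})

    # Attaching plant_id_eia lets CEMS records join cleanly to EIA tables.
    cems_with_eia_id = cems.merge(crosswalk, on="plant_id_epa", how="left")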

Module Contents#

Classes#

EpaCemsPartition

Represents an EpaCems partition identifying a unique resource file.

EpaCemsDatastore

Helper class to extract EpaCems resources from datastore.

Functions#

extract(→ pandas.DataFrame)

Coordinate the extraction of EPA CEMS hourly DataFrames.

Attributes#

logger

API_RENAME_DICT

A dictionary containing EPA CEMS column names (keys) and replacement names to use when reading those columns into PUDL (values).

API_IGNORE_COLS

The set of EPA CEMS columns to ignore when reading data.

API_DTYPE_DICT

pudl.extract.epacems.logger[source]#
pudl.extract.epacems.API_RENAME_DICT[source]#

A dictionary containing EPA CEMS column names (keys) and replacement names to use when reading those columns into PUDL (values).

There are some duplicate rename values because the column names change from year to year (see the sketch below).

Type:

Dict
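A small, illustrative fragment (not copied from the real dictionary) showing why rename values can repeat: the raw header changed between vintages, but both spellings map to the same PUDL name:

    import pandas as pd

    # Hypothetical entries; the actual API_RENAME_DICT is much larger.
    example_rename_dict = {
        "STATE": "state",               # older spelling of the column
        "State": "state",               # newer spelling of the same column
        "Facility ID": "plant_id_epa",
    }

    raw = pd.DataFrame({"State": ["TX"], "Facility ID": [3]})
    # pandas only renames the keys that are present, so either vintage works.
    renamed = raw.rename(columns=example_rename_dict)
    # renamed.columns -> Index(['state', 'plant_id_epa'], dtype='object')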

pudl.extract.epacems.API_IGNORE_COLS[source]#

The set of EPA CEMS columns to ignore when reading data.

Type:

Set

pudl.extract.epacems.API_DTYPE_DICT[source]#
class pudl.extract.epacems.EpaCemsPartition(/, **data: Any)[source]#

Bases: pydantic.BaseModel

Represents an EpaCems partition identifying a unique resource file.

property year[source]#

Return the year associated with the year_quarter.

year_quarter: Annotated[str, StringConstraints(strict=True, pattern='^(19|20)\\d{2}[q][1-4]$')][source]#
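A quick sketch of what the pattern above accepts (the lowercase q is required) and how the year might be derived from a year_quarter string; the class may compute it differently:

    import re

    YEAR_QUARTER_PATTERN = re.compile(r"^(19|20)\d{2}[q][1-4]$")

    assert YEAR_QUARTER_PATTERN.fullmatch("2022q1")        # valid
    assert not YEAR_QUARTER_PATTERN.fullmatch("2022Q1")    # uppercase Q rejected
    assert not YEAR_QUARTER_PATTERN.fullmatch("2022q5")    # quarters run 1-4

    year = int("2022q1"[:4])  # -> 2022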
get_filters()[source]#

Return filters for retrieving the given partition resource from the Datastore.

get_quarterly_file() pathlib.Path[source]#

Return the name of the CSV file that holds the quarterly hourly data.

class pudl.extract.epacems.EpaCemsDatastore(datastore: pudl.workspace.datastore.Datastore)[source]#

Helper class to extract EpaCems resources from datastore.

EpaCems resources are identified by a year and a quarter. Each of these zip files contains one CSV file. This class implements the get_data_frame method, which renames columns of a quarterly CSV file (a usage sketch follows the method description below).

get_data_frame(partition: EpaCemsPartition) pandas.DataFrame[source]#

Construct a dataframe from a zipfile for a given (year_quarter) partition.
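A hedged usage sketch; it assumes a configured PUDL workspace, and the exact Datastore construction (cache paths, remote settings) varies by setup:

    from pudl.extract.epacems import EpaCemsDatastore, EpaCemsPartition
    from pudl.workspace.datastore import Datastore

    # Datastore() may need cache-path arguments depending on your configuration.
    ds = EpaCemsDatastore(Datastore())
    partition = EpaCemsPartition(year_quarter="2022q1")
    cems_df = ds.get_data_frame(partition)  # renamed, filtered, dtyped DataFrame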

_csv_to_dataframe(csv_path: pathlib.Path, ignore_cols: dict[str, str], rename_dict: dict[str, str], dtype_dict: dict[str, type], chunksize: int = 100000) pandas.DataFrame[source]#

Convert a CEMS csv file into a pandas.DataFrame.

Parameters:

  • csv_path – Path to the CSV file containing data to read.

  • ignore_cols – Columns to exclude from the output.

  • rename_dict – Mapping from raw CEMS column names to the names used in PUDL.

  • dtype_dict – Mapping from column names to the dtypes they should be read as.

  • chunksize – Number of rows to read from the CSV at a time.

Returns:

A DataFrame containing the filtered and dtyped contents of the CSV file.
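The general pattern here is chunked reading with pandas: drop ignored columns, rename, and apply dtypes. Below is a minimal sketch of that pattern under stated assumptions, not the module's exact implementation (the real method may apply renames and dtypes at read time and is presumably fed from the module-level API_* constants):

    import pandas as pd
    from pathlib import Path

    def csv_to_dataframe_sketch(
        csv_path: Path,
        ignore_cols: set[str],
        rename_dict: dict[str, str],
        dtype_dict: dict[str, type],
        chunksize: int = 100_000,
    ) -> pd.DataFrame:
        """Illustrative only; mirrors the idea, not the code, of _csv_to_dataframe."""
        chunks = []
        with pd.read_csv(
            csv_path,
            chunksize=chunksize,
            usecols=lambda col: col not in ignore_cols,  # drop ignored columns early
        ) as reader:
            for chunk in reader:
                chunks.append(chunk.rename(columns=rename_dict))
        df = pd.concat(chunks, ignore_index=True)
        # Apply dtypes only to columns actually present after renaming.
        return df.astype({k: v for k, v in dtype_dict.items() if k in df.columns})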

pudl.extract.epacems.extract(year_quarter: str, ds: pudl.workspace.datastore.Datastore) pandas.DataFrame[source]#

Coordinate the extraction of EPA CEMS hourly DataFrames.

Parameters:
  • year_quarter – report year and quarter of the data to extract

  • ds – Initialized datastore

Returns:

A single quarter of EPA CEMS hourly emissions data.
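A usage sketch, assuming an initialized datastore (construction details depend on your local and remote cache configuration):

    from pudl.extract import epacems
    from pudl.workspace.datastore import Datastore

    ds = Datastore()  # may require cache-path arguments in practice
    cems_2022q1 = epacems.extract(year_quarter="2022q1", ds=ds)
    cems_2022q1.info()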