pudl.extract.epacems module

Retrieve data from EPA CEMS hourly zipped CSVs.

This module pulls data from EPA’s published CSV files.

pudl.extract.epacems.CSV_DTYPES = {'CO2_MASS': <class 'float'>, 'CO2_MASS (tons)': <class 'float'>, 'CO2_MASS_MEASURE_FLG': StringDtype, 'FAC_ID': Int64Dtype(), 'GLOAD': <class 'float'>, 'GLOAD (MW)': <class 'float'>, 'HEAT_INPUT': <class 'float'>, 'HEAT_INPUT (mmBtu)': <class 'float'>, 'NOX_MASS': <class 'float'>, 'NOX_MASS (lbs)': <class 'float'>, 'NOX_MASS_MEASURE_FLG': StringDtype, 'NOX_RATE': <class 'float'>, 'NOX_RATE (lbs/mmBtu)': <class 'float'>, 'NOX_RATE_MEASURE_FLG': StringDtype, 'OP_DATE': StringDtype, 'OP_HOUR': Int64Dtype(), 'OP_TIME': <class 'float'>, 'ORISPL_CODE': Int64Dtype(), 'SLOAD': <class 'float'>, 'SLOAD (1000 lbs)': <class 'float'>, 'SLOAD (1000lb/hr)': <class 'float'>, 'SO2_MASS': <class 'float'>, 'SO2_MASS (lbs)': <class 'float'>, 'SO2_MASS_MEASURE_FLG': StringDtype, 'STATE': StringDtype, 'UNITID': StringDtype, 'UNIT_ID': Int64Dtype()}

A dictionary containing column names (keys) and data types (values) for EPA CEMS.

Type

dict

class pudl.extract.epacems.EpaCemsDatastore(datastore: pudl.workspace.datastore.Datastore)[source]

Bases: object

Helper class to extract EpaCems resources from datastore.

EpaCems resources are identified by a year and a state. Each of these zip files contains monthly zip files that in turn contain CSV files. This class implements the get_data_frame method, which concatenates the monthly tables for a given state and year across all months.

get_data_frame(partition: pudl.extract.epacems.EpaCemsPartition) → pandas.core.frame.DataFrame[source]

Constructs a dataframe holding the data for a given (year, state) partition.
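The nested layout described above (a zip of monthly zips, each holding a CSV) can be illustrated with the standard library alone. This is a minimal sketch, not the PUDL implementation: the helper names make_nested_zip and read_all_months, the archive member names, and the toy rows are all hypothetical, and it returns a list of row dictionaries rather than a pandas DataFrame.

```python
import csv
import io
import zipfile


def make_nested_zip(monthly_rows):
    """Build an in-memory zip of monthly zips, each holding one CSV (toy data)."""
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w") as outer:
        for name, rows in monthly_rows.items():
            csv_buf = io.StringIO()
            csv.writer(csv_buf).writerows(rows)
            inner_buf = io.BytesIO()
            with zipfile.ZipFile(inner_buf, "w") as inner:
                inner.writestr(f"{name}.csv", csv_buf.getvalue())
            outer.writestr(f"{name}.zip", inner_buf.getvalue())
    return outer_buf.getvalue()


def read_all_months(outer_bytes):
    """Concatenate the rows of every monthly CSV, in archive name order."""
    records = []
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        for inner_name in sorted(outer.namelist()):
            # Each member of the outer archive is itself a zip file.
            with zipfile.ZipFile(io.BytesIO(outer.read(inner_name))) as inner:
                for csv_name in inner.namelist():
                    text = inner.read(csv_name).decode("utf-8")
                    records.extend(csv.DictReader(io.StringIO(text)))
    return records


# Hypothetical file names and values, for illustration only.
toy = {
    "2019id01": [["STATE", "OP_HOUR"], ["ID", "0"], ["ID", "1"]],
    "2019id02": [["STATE", "OP_HOUR"], ["ID", "0"]],
}
rows = read_all_months(make_nested_zip(toy))
```

Here rows holds the January records followed by the February records, which mirrors the "concatenate across all months" behaviour the class docstring describes.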

class pudl.extract.epacems.EpaCemsPartition(year: str, state: str)[source]

Bases: tuple

Represents EpaCems partition identifying unique resource file.

get_filters()[source]

Returns filters for retrieving given partition resource from Datastore.

get_key()[source]

Returns hashable key for use with EpaCemsDatastore.

get_monthly_file(month: int) → pathlib.Path[source]

Returns the filename (without suffix) that contains the monthly data.

state: str

Alias for field number 1

year: str

Alias for field number 0
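A tuple-based partition with helper methods could look roughly like the sketch below. Only the field names, their order, and the method names come from the documentation above; the class name PartitionSketch and every method body (lowercasing the state, the shape of the filters, the monthly file naming scheme) are assumptions made for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class PartitionSketch(NamedTuple):
    """Hypothetical stand-in for EpaCemsPartition; method bodies are guesses."""

    year: str   # field number 0
    state: str  # field number 1

    def get_key(self):
        # A tuple of strings is hashable, so it can index a dict or cache.
        return (self.year, self.state.lower())

    def get_filters(self):
        # Assumed shape: keyword filters naming the partition's year and state.
        return {"year": self.year, "state": self.state.lower()}

    def get_monthly_file(self, month: int) -> Path:
        # Assumed naming scheme: year + state + zero-padded month, no suffix.
        return Path(f"{self.year}{self.state.lower()}{month:02d}")


part = PartitionSketch(year="2019", state="ID")
csv_name = part.get_monthly_file(3).with_suffix(".csv")
```

Because the method returns a pathlib.Path without a suffix, a caller can attach whichever extension it needs with with_suffix, as the last line shows.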

pudl.extract.epacems.IGNORE_COLS = {'CO2_RATE', 'CO2_RATE (tons/mmBtu)', 'CO2_RATE_MEASURE_FLG', 'FACILITY_NAME', 'SO2_RATE', 'SO2_RATE (lbs/mmBtu)', 'SO2_RATE_MEASURE_FLG'}

The set of EPA CEMS columns to ignore when reading data.

Type

set

pudl.extract.epacems.RENAME_DICT = {'CO2_MASS': 'co2_mass_tons', 'CO2_MASS (tons)': 'co2_mass_tons', 'CO2_MASS_MEASURE_FLG': 'co2_mass_measurement_code', 'FAC_ID': 'facility_id', 'GLOAD': 'gross_load_mw', 'GLOAD (MW)': 'gross_load_mw', 'HEAT_INPUT': 'heat_content_mmbtu', 'HEAT_INPUT (mmBtu)': 'heat_content_mmbtu', 'NOX_MASS': 'nox_mass_lbs', 'NOX_MASS (lbs)': 'nox_mass_lbs', 'NOX_MASS_MEASURE_FLG': 'nox_mass_measurement_code', 'NOX_RATE': 'nox_rate_lbs_mmbtu', 'NOX_RATE (lbs/mmBtu)': 'nox_rate_lbs_mmbtu', 'NOX_RATE_MEASURE_FLG': 'nox_rate_measurement_code', 'OP_DATE': 'op_date', 'OP_HOUR': 'op_hour', 'OP_TIME': 'operating_time_hours', 'ORISPL_CODE': 'plant_id_eia', 'SLOAD': 'steam_load_1000_lbs', 'SLOAD (1000 lbs)': 'steam_load_1000_lbs', 'SLOAD (1000lb/hr)': 'steam_load_1000_lbs', 'SO2_MASS': 'so2_mass_lbs', 'SO2_MASS (lbs)': 'so2_mass_lbs', 'SO2_MASS_MEASURE_FLG': 'so2_mass_measurement_code', 'STATE': 'state', 'UNITID': 'unitid', 'UNIT_ID': 'unit_id_epa'}

A dictionary containing EPA CEMS column names (keys) and replacement names to use when reading those columns into PUDL (values).

Type

dict
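RENAME_DICT maps several raw header variants (with and without units) to a single PUDL column name, while IGNORE_COLS lists columns to skip. The following sketch shows how the two might be combined on one raw record. It uses small excerpts of the documented constants, and the helper clean_record and the sample values are hypothetical, not part of the module.

```python
# Excerpts of the documented constants, copied here so the sketch runs alone.
IGNORE_COLS = {"FACILITY_NAME", "SO2_RATE (lbs/mmBtu)"}
RENAME_DICT = {
    "GLOAD": "gross_load_mw",
    "GLOAD (MW)": "gross_load_mw",
    "ORISPL_CODE": "plant_id_eia",
    "STATE": "state",
}


def clean_record(record):
    """Drop ignored columns and rename the rest (hypothetical helper)."""
    return {
        RENAME_DICT.get(col, col): value
        for col, value in record.items()
        if col not in IGNORE_COLS
    }


# Two made-up records that differ only in which header variant they use.
bare_header = {"STATE": "ID", "ORISPL_CODE": "1", "GLOAD": "5.0",
               "FACILITY_NAME": "Example Plant"}
unit_header = {"STATE": "ID", "ORISPL_CODE": "1", "GLOAD (MW)": "5.0",
               "SO2_RATE (lbs/mmBtu)": "0.1"}

cleaned_bare = clean_record(bare_header)
cleaned_unit = clean_record(unit_header)
```

Both inputs end up with the same three keys (state, plant_id_eia, gross_load_mw), which is the point of listing each header variant separately in the mapping.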

pudl.extract.epacems.extract(epacems_years, states, ds: pudl.workspace.datastore.Datastore)[source]

Coordinate the extraction of EPA CEMS hourly DataFrames.

Parameters
  • epacems_years (list) – The years of CEMS data to extract, as 4-digit integers.

  • states (list) – The states whose CEMS data we want to extract, indicated by 2-letter US state codes.

  • ds (Datastore) – Initialized datastore

Yields

dict – a dictionary with a single EPA CEMS tabular data resource name as the key, of the form “hourly_emissions_epacems_YEAR_STATE”, where YEAR is a 4-digit number and STATE is a lowercase 2-letter code for a US state. The value is a pandas.DataFrame containing all the raw EPA CEMS hourly emissions data for the indicated state and year.
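The yield contract described above (one single-key dictionary per year and state) can be sketched as a plain generator. This is not the PUDL function: extract_sketch, the load_partition callback, and fake_loader are hypothetical names, the year-then-state iteration order is an assumption, and a list stands in for the pandas.DataFrame value.

```python
def extract_sketch(years, states, load_partition):
    """Yield one {resource_name: data} dict per (year, state) pair."""
    for year in years:
        for state in states:
            # Key format taken from the documentation: lowercase state code.
            name = f"hourly_emissions_epacems_{year}_{state.lower()}"
            yield {name: load_partition(year, state)}


def fake_loader(year, state):
    # Stand-in for reading a real partition; returns a tiny placeholder table.
    return [{"year": year, "state": state}]


resources = list(extract_sketch([2018, 2019], ["ID", "CO"], fake_loader))
names = [next(iter(resource)) for resource in resources]
```

Yielding one small dictionary at a time lets a caller process each state-year table and release it before the next one is loaded, instead of holding every partition in memory at once.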