pudl.etl.epacems_assets

EPA CEMS Hourly Emissions assets.

The core_epacems__hourly_emissions() asset defined in this module uses a Dagster pattern that no other PUDL asset uses. It is built from ops arranged in a dynamic graph, and that graph is wrapped in a graph-backed asset, which is an asset defined by a graph of ops. The dynamic graph lets Dagster generate one op per year of EPA CEMS data at run time and execute those ops in parallel. For more information see: https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs and https://docs.dagster.io/concepts/assets/graph-backed-assets.
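The fan-out and fan-in shape described above can be sketched in plain Python. This is only an analogy using the standard library, not Dagster and not the PUDL implementation: the function names (list_years, process_year, gather), the settings layout, and the placeholder data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def list_years(settings: dict) -> list[int]:
    """Fan-out step: one work item per configured year (hypothetical settings layout)."""
    return sorted(set(settings["years"]))


def process_year(year: int) -> tuple[int, int]:
    """Stand-in for the per-year work; returns (year, number of rows handled)."""
    rows = [f"{year}-record-{i}" for i in range(3)]  # placeholder data, not real CEMS rows
    return year, len(rows)


def gather(results: list[tuple[int, int]]) -> dict[int, int]:
    """Fan-in step: collect every per-year result into one output."""
    return dict(results)


def run(settings: dict) -> dict[int, int]:
    years = list_years(settings)
    # Each year is independent of the others, so the calls can run concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_year, years))
    return gather(results)


summary = run({"years": [2021, 2020, 2021]})
```

In Dagster, the number of per-year steps is likewise decided at run time from the configured years, and a final step runs only after all of them have finished.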

Module Contents

Functions

_partitioned_path(→ pathlib.Path)

get_years_from_settings(context)

Return set of years in settings.

process_single_year(→ YearPartitions)

Process a single year of EPA CEMS data.

consolidate_partitions(→ None)

Read partitions into memory and write to a single monolithic output.

core_epacems__hourly_emissions(→ None)

Extract, transform and load CSVs for EPA CEMS.

_core_epacems__emissions_unit_ids(→ pandas.DataFrame)

Make unique annual plant_id_eia and emissions_unit_id_epa.

Attributes

pudl.etl.epacems_assets.logger
pudl.etl.epacems_assets.YearPartitions
pudl.etl.epacems_assets._partitioned_path() → pathlib.Path
pudl.etl.epacems_assets.get_years_from_settings(context)

Return set of years in settings.

These will be used to kick off worker processes to process each year of data in parallel.

pudl.etl.epacems_assets.process_single_year(context, year, core_epa__assn_eia_epacamd: pandas.DataFrame, core_eia__entity_plants: pandas.DataFrame) → YearPartitions

Process a single year of EPA CEMS data.

Parameters:
  • context – dagster keyword that provides access to resources and config.

  • year – Year of data to process.

  • core_epa__assn_eia_epacamd – The EPA EIA crosswalk table used for harmonizing the ORISPL code with EIA.

  • core_eia__entity_plants – The EIA Plant entities used for aligning timezones.

pudl.etl.epacems_assets.consolidate_partitions(context, partitions: list[YearPartitions]) → None

Read partitions into memory and write to a single monolithic output.

Parameters:
  • context – dagster keyword that provides access to resources and config.

  • partitions – Year and state combinations in the output database.

pudl.etl.epacems_assets.core_epacems__hourly_emissions(_core_epa__assn_eia_epacamd_unique: pandas.DataFrame, core_eia__entity_plants: pandas.DataFrame) → None

Extract, transform and load CSVs for EPA CEMS.

This asset creates a dynamic graph of ops to process EPA CEMS data in parallel. It writes both a partitioned Parquet output and a single monolithic Parquet output. For more information see: https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs.

pudl.etl.epacems_assets._core_epacems__emissions_unit_ids(core_epacems__hourly_emissions: dask.dataframe.DataFrame) → pandas.DataFrame

Make unique annual plant_id_eia and emissions_unit_id_epa.

Returns:

A dataframe with the unique set of “plant_id_eia”, “year” and “emissions_unit_id_epa”.

Return type:

pandas.DataFrame
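As a rough illustration of what "unique annual plant_id_eia and emissions_unit_id_epa" means, the sketch below reduces a few hourly records to one entry per plant, year and unit. It uses plain Python dictionaries instead of the dask and pandas dataframes the real asset works with. The helper name unique_unit_ids and the sample values are hypothetical, and it assumes each hourly record already carries a "year" field.

```python
def unique_unit_ids(rows: list[dict]) -> list[tuple[int, int, str]]:
    """Return the sorted unique (plant_id_eia, year, emissions_unit_id_epa) triples."""
    seen = {
        (row["plant_id_eia"], row["year"], row["emissions_unit_id_epa"])
        for row in rows
    }
    return sorted(seen)


# Made-up hourly records: two hours for unit "1" and one hour for unit "2".
hourly_rows = [
    {"plant_id_eia": 3, "year": 2020, "emissions_unit_id_epa": "1", "hour": 0},
    {"plant_id_eia": 3, "year": 2020, "emissions_unit_id_epa": "1", "hour": 1},
    {"plant_id_eia": 3, "year": 2020, "emissions_unit_id_epa": "2", "hour": 0},
]

unit_ids = unique_unit_ids(hourly_rows)
```

With dataframes, the equivalent step would typically select the three columns and drop duplicate rows.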