pudl.analysis.epacamd_eia

Use the epacamd_eia crosswalk to connect EPA units to EIA generators and other data.

A major use case for this dataset is to identify subplants within plant_ids; subplants are the smallest coherent units for aggregation. Despite the name, a plant_id refers to a legal entity that often contains multiple distinct power plants, sometimes of different technology or fuel types.

EPA CEMS data combines information from several parts of a power plant:

  • emissions from smokestacks

  • fuel use from combustors

  • electricity production from generators

But smokestacks, combustors, and generators can be connected in complex, many-to-many relationships. This complexity makes attribution difficult when, for example, allocating pollution to energy producers. Furthermore, heterogeneity within plant_ids makes aggregation to the parent entity difficult or inappropriate.

But by analyzing the relationships between combustors and generators, as provided in the epacamd_eia crosswalk, we can identify distinct power plants. These are the smallest coherent units of aggregation.

In graph analysis terminology, the crosswalk is a list of edges between nodes (combustors and generators) in a bipartite graph. The networkx Python package provides functions to analyze this edge list and extract disjoint subgraphs (groups of combustors and generators that are connected to each other). These are the distinct power plants. To avoid a name collision with plant_id, we term these collections ‘subplants’, and identify them with a subplant_id that is unique within each plant_id. Subplants are thus identified with the composite key (plant_id, subplant_id).
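The idea of extracting disjoint subgraphs can be illustrated with a minimal sketch. It assumes the relevant networkx call is nx.connected_components (the text above does not name the function), and the node labels are hypothetical toy data rather than real crosswalk values.

```python
import networkx as nx

# Hypothetical toy crosswalk: each tuple is one edge (combustor, generator).
# The name prefixes keep the two node types from colliding.
edges = [
    ("combustor_1", "generator_A"),
    ("combustor_2", "generator_A"),  # two combustors feed one generator
    ("combustor_3", "generator_B"),
    ("combustor_3", "generator_C"),  # one combustor feeds two generators
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Each connected component is one group of mutually connected combustors
# and generators, i.e. one candidate subplant.
subplants = sorted(sorted(component) for component in nx.connected_components(graph))
```

In this toy graph, combustors 1 and 2 end up grouped with generator A, while combustor 3 forms a second group with generators B and C.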

Through this analysis, we found that 56% of plant_ids contain multiple distinct subplants, and 11% contain subplants with different technology types, such as a gas boiler and gas turbine (not in a combined cycle).

Usage Example:

    epacems = pudl.output.epacems.epacems(states=["ID"], years=[2020])  # subset for test
    epacamd_eia = pudl_out.epacamd_eia()
    filtered_crosswalk = filter_crosswalk(epacamd_eia, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Module Contents

Functions

_get_unique_keys(→ pandas.DataFrame)

Get unique unit IDs from CEMS data.

filter_crosswalk_by_epacems(→ pandas.DataFrame)

Inner join unique CEMS units with the epacamd_eia crosswalk.

filter_out_boiler_rows(→ pandas.DataFrame)

Remove rows that represent graph edges between generators and boilers.

_prep_for_networkx(→ pandas.DataFrame)

Make surrogate keys for combustors and generators.

_subplant_ids_from_prepped_crosswalk(→ pandas.DataFrame)

Use networkx graph analysis to create subplant IDs from crosswalk edge list.

_convert_global_id_to_composite_id(→ pandas.DataFrame)

Convert global_subplant_id to a composite key (plant_id_eia, subplant_id).

filter_crosswalk(→ pandas.DataFrame)

Remove unmapped crosswalk rows or duplicates due to many-to-many boiler relationships.

make_subplant_ids(→ pandas.DataFrame)

Identify sub-plants in the EPA/EIA crosswalk graph.

pudl.analysis.epacamd_eia._get_unique_keys(epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Get unique unit IDs from CEMS data.

Parameters:

epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems

Returns:

unique keys from the epacems dataset

Return type:

pd.DataFrame
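A minimal sketch of what extracting unique unit keys could look like with pandas. The key columns are taken from the filter_crosswalk documentation; the helper name unique_unit_keys and the co2_mass_tons column are hypothetical, and this is not necessarily how the library implements it.

```python
import pandas as pd

# Key columns named in the filter_crosswalk documentation.
KEY_COLUMNS = ["plant_id_eia", "emissions_unit_id_epa"]


def unique_unit_keys(epacems: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (plant, unit) pair found in the emissions data."""
    return epacems[KEY_COLUMNS].drop_duplicates().reset_index(drop=True)


# Hypothetical records: unit "1" at plant 10 reports twice.
epacems = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20],
        "emissions_unit_id_epa": ["1", "1", "2", "1"],
        "co2_mass_tons": [5.0, 6.0, 1.5, 9.0],  # hypothetical value column
    }
)
keys = unique_unit_keys(epacems)
```

A Dask DataFrame input would additionally need a .compute() call to materialize the result as a pandas DataFrame.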

pudl.analysis.epacamd_eia.filter_crosswalk_by_epacems(crosswalk: pandas.DataFrame, epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Inner join unique CEMS units with the epacamd_eia crosswalk.

This is essentially an empirical filter on EPA units. Instead of filtering by construction/retirement dates in the crosswalk (thus assuming they are accurate), use the presence/absence of CEMS data to filter the units.

Parameters:
  • crosswalk – epacamd_eia crosswalk

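  • epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems, whose unique unit IDs (as from _get_unique_keys) are joined to the crosswalk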

Returns:

The inner join of the epacamd_eia crosswalk and unique epacems units. Adds the global ID column unit_id_epa.
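The empirical filter described above amounts to an inner join on the unit keys. The following sketch assumes the join columns are plant_id_eia and emissions_unit_id_epa (as listed for filter_crosswalk) and uses hypothetical data; it does not reproduce the unit_id_epa column mentioned in the return value.

```python
import pandas as pd

KEY_COLUMNS = ["plant_id_eia", "emissions_unit_id_epa"]

# Hypothetical crosswalk rows; "generator_id" is an illustrative column name.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 20, 30],
        "emissions_unit_id_epa": ["1", "2", "1", "1"],
        "generator_id": ["G1", "G2", "G1", "G1"],
    }
)

# Units that actually appear in the emissions data (plant 30 never reports).
unique_keys = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 20],
        "emissions_unit_id_epa": ["1", "2", "1"],
    }
)

# how="inner" keeps only units present in both tables.
filtered = unique_keys.merge(crosswalk, on=KEY_COLUMNS, how="inner")
```

Plant 30 is dropped because it has no CEMS records, regardless of what construction or retirement dates the crosswalk might list for it.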

pudl.analysis.epacamd_eia.filter_out_boiler_rows(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove rows that represent graph edges between generators and boilers.

Parameters:

crosswalk (pd.DataFrame) – epacamd_eia crosswalk

Returns:

the epacamd_eia crosswalk with boiler rows (many/one-to-many) removed

Return type:

pd.DataFrame
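One plausible way to remove boiler-induced duplicates is to drop the boiler column and deduplicate what remains. This is a sketch under that assumption only: the page does not describe the mechanism, and the column names boiler_id and generator_id are hypothetical.

```python
import pandas as pd

# Hypothetical crosswalk: unit "1" maps to generator "G1" through two boilers,
# so the same unit-generator edge appears twice.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10],
        "emissions_unit_id_epa": ["1", "1", "2"],
        "generator_id": ["G1", "G1", "G2"],
        "boiler_id": ["B1", "B2", "B3"],  # hypothetical column name
    }
)

# Drop the boiler column, then collapse the now-identical rows.
deduplicated = crosswalk.drop(columns=["boiler_id"]).drop_duplicates()
```

After this step each unit-generator edge appears once, which is the shape an edge list needs.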

pudl.analysis.epacamd_eia._prep_for_networkx(crosswalk: pandas.DataFrame) → pandas.DataFrame

Make surrogate keys for combustors and generators.

Parameters:

crosswalk (pd.DataFrame) – epacamd_eia crosswalk

Returns:

copy of epacamd_eia crosswalk with new surrogate ID columns ‘combustor_id’ and ‘generator_id’

Return type:

pd.DataFrame
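Surrogate keys are useful here because a combustor or generator is only unique in combination with its plant, while graph nodes need a single identifier each. The sketch below assumes pandas groupby(...).ngroup() as the numbering technique and a hypothetical input column generator_id_eia; the offset on the generator numbering is an added assumption to keep the two node sets from sharing integers.

```python
import pandas as pd

crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 20],
        "emissions_unit_id_epa": ["1", "2", "1"],
        "generator_id_eia": ["G1", "G1", "G1"],  # hypothetical column name
    }
)

prepped = crosswalk.copy()

# ngroup() numbers each distinct (plant, unit) pair 0, 1, 2, ...
prepped["combustor_id"] = prepped.groupby(
    ["plant_id_eia", "emissions_unit_id_epa"]
).ngroup()

# Generators get their own numbering, shifted past the largest combustor ID
# so that no integer refers to both a combustor and a generator.
prepped["generator_id"] = (
    prepped.groupby(["plant_id_eia", "generator_id_eia"]).ngroup()
    + prepped["combustor_id"].max()
    + 1
)
```

Note that generator "G1" at plant 10 and "G1" at plant 20 receive different surrogate IDs, because the grouping includes the plant.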

pudl.analysis.epacamd_eia._subplant_ids_from_prepped_crosswalk(prepped: pandas.DataFrame) → pandas.DataFrame

Use networkx graph analysis to create subplant IDs from crosswalk edge list.

Parameters:

prepped (pd.DataFrame) – epacamd_eia crosswalk passed through _prep_for_networkx()

Returns:

copy of epacamd_eia crosswalk plus new column ‘global_subplant_id’

Return type:

pd.DataFrame
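A sketch of how component membership might be mapped back onto the rows of an edge list. It assumes nx.from_pandas_edgelist and nx.connected_components as the networkx calls, and a prepped frame whose combustor and generator IDs do not overlap; the data is hypothetical.

```python
import networkx as nx
import pandas as pd

# Hypothetical prepped crosswalk with non-overlapping integer node IDs.
prepped = pd.DataFrame(
    {
        "combustor_id": [0, 1, 2],
        "generator_id": [3, 3, 4],
    }
)

graph = nx.from_pandas_edgelist(
    prepped, source="combustor_id", target="generator_id"
)

# Map every node to the index of the connected component containing it.
node_to_component = {
    node: component_index
    for component_index, nodes in enumerate(nx.connected_components(graph))
    for node in nodes
}

with_ids = prepped.copy()
# Both endpoints of an edge lie in the same component, so looking up the
# combustor alone is enough to label the row.
with_ids["global_subplant_id"] = with_ids["combustor_id"].map(node_to_component)
```

The first two rows share a generator and therefore share an ID; the third row is isolated from them and gets a different one.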

pudl.analysis.epacamd_eia._convert_global_id_to_composite_id(crosswalk_with_ids: pandas.DataFrame) → pandas.DataFrame

Convert global_subplant_id to a composite key (plant_id_eia, subplant_id).

The composite key will be much more stable (though not fully stable!) over time. The global ID changes if ANY unit or generator changes, whereas the composite key only changes if units/generators change within that specific plant.

A global ID could also tempt users into using it as a crutch, even though it isn’t stable. A composite key should discourage that behavior.

Parameters:

crosswalk_with_ids (pd.DataFrame) – crosswalk with global_subplant_id, as from _subplant_ids_from_prepped_crosswalk()

Raises:

ValueError – if crosswalk_with_ids has a MultiIndex

Returns:

copy of crosswalk_with_ids with an added column: ‘subplant_id’

Return type:

pd.DataFrame
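The conversion can be sketched as renumbering the global IDs within each plant. This version assumes a dense rank per plant_id_eia and zero-based numbering, neither of which is confirmed by the page; the helper name to_composite_id is hypothetical. The MultiIndex check mirrors the documented ValueError.

```python
import pandas as pd


def to_composite_id(crosswalk_with_ids: pd.DataFrame) -> pd.DataFrame:
    """Add a subplant_id that restarts at 0 within each plant_id_eia."""
    if isinstance(crosswalk_with_ids.index, pd.MultiIndex):
        raise ValueError("crosswalk_with_ids must not have a MultiIndex")
    out = crosswalk_with_ids.copy()
    # A dense rank numbers the distinct global IDs within a plant 1, 2, ...;
    # subtracting 1 makes the numbering zero-based (an assumption).
    out["subplant_id"] = (
        out.groupby("plant_id_eia")["global_subplant_id"]
        .rank(method="dense")
        .astype(int)
        - 1
    )
    return out


# Hypothetical input: plant 10 has two subplants, plant 20 has one.
crosswalk_with_ids = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20],
        "global_subplant_id": [0, 0, 1, 2],
    }
)
result = to_composite_id(crosswalk_with_ids)
```

Because the numbering restarts per plant, adding a unit at plant 10 cannot shift the subplant_id values at plant 20, which is the stability property described above.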

pudl.analysis.epacamd_eia.filter_crosswalk(crosswalk: pandas.DataFrame, epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Remove unmapped crosswalk rows or duplicates due to many-to-many boiler relationships.

Parameters:
  • crosswalk (pd.DataFrame) – The epacamd_eia crosswalk.

  • epacems (Union[pd.DataFrame, dd.DataFrame]) – Emissions data. Must contain columns named [“plant_id_eia”, “emissions_unit_id_epa”]

Returns:

A filtered copy of epacamd_eia crosswalk

Return type:

pd.DataFrame

pudl.analysis.epacamd_eia.make_subplant_ids(crosswalk: pandas.DataFrame) → pandas.DataFrame

Identify sub-plants in the EPA/EIA crosswalk graph.

Any row filtering should be done before this step.

Usage Example:

    epacems = pudl.output.epacems.epacems(states=["ID"])  # small subset for quick test
    epacamd_eia = pudl_out.epacamd_eia()
    filtered_crosswalk = filter_crosswalk(epacamd_eia, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Note that sub-plant IDs should be used in conjunction with plant_id_eia rather than plant_id_epa, because the former is more granular and is integrated into CEMS during the transform process.

Parameters:

crosswalk (pd.DataFrame) – The epacamd_eia crosswalk

Returns:

An edge list connecting EPA units to EIA generators, with connected pieces issued a subplant_id

Return type:

pd.DataFrame
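Once subplant IDs exist, the composite key (plant_id_eia, subplant_id) can serve as the grouping key for aggregation. The sketch below is illustrative only: the frames are hypothetical stand-ins for the function's output and for emissions data, and co2_mass_tons is an assumed column name.

```python
import pandas as pd

KEY_COLUMNS = ["plant_id_eia", "emissions_unit_id_epa"]

# Hypothetical stand-in for the output of make_subplant_ids.
crosswalk_with_subplant_ids = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10],
        "emissions_unit_id_epa": ["1", "2", "3"],
        "subplant_id": [0, 0, 1],
    }
)

# Hypothetical emissions totals per unit.
emissions = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10],
        "emissions_unit_id_epa": ["1", "2", "3"],
        "co2_mass_tons": [5.0, 1.5, 9.0],  # assumed column name
    }
)

# One row per unit keeps the merge from duplicating emissions records
# when a unit connects to several generators.
unit_to_subplant = crosswalk_with_subplant_ids[
    KEY_COLUMNS + ["subplant_id"]
].drop_duplicates()

totals = (
    emissions.merge(unit_to_subplant, on=KEY_COLUMNS)
    .groupby(["plant_id_eia", "subplant_id"], as_index=False)["co2_mass_tons"]
    .sum()
)
```

Grouping on subplant_id alone would merge unrelated subplants from different plants, since the ID is only unique within a plant.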