pudl.analysis.epa_crosswalk

Use the EPA crosswalk to connect EPA units to EIA generators and other data.

A major use case for this dataset is to identify subplants within plant_ids. Subplants are the smallest coherent units for aggregation. Despite the name, plant_id refers to a legal entity that often contains multiple distinct power plants, sometimes of different technology or fuel types.

EPA CEMS data combines information from several parts of a power plant:

  • emissions from smokestacks

  • fuel use from combustors

  • electricity production from generators

But smokestacks, combustors, and generators can be connected in complex, many-to-many relationships. This complexity makes attribution difficult, for example when allocating pollution to energy producers. Furthermore, heterogeneity within plant_ids makes aggregation to the parent entity difficult or inappropriate.

By analyzing the relationships between combustors and generators, as provided in the EPA/EIA crosswalk, we can nevertheless identify distinct power plants. These are the smallest coherent units of aggregation.

In graph analysis terminology, the crosswalk is a list of edges between nodes (combustors and generators) in a bipartite graph. The networkx Python package provides functions to analyze this edge list and extract disjoint subgraphs (groups of combustors and generators that are connected to each other). These are the distinct power plants. To avoid a name collision with plant_id, we term these collections ‘subplants’, and identify them with a subplant_id that is unique within each plant_id. Subplants are thus identified with the composite key (plant_id, subplant_id).
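
The idea of extracting disjoint subgraphs can be illustrated with a minimal sketch. The node names and edge list below are made up for illustration and are not taken from the crosswalk; the sketch only assumes that networkx is installed and uses its `connected_components` function.

```python
import networkx as nx

# Hypothetical edge list: each tuple links one combustor to one generator.
# The name prefixes keep the two node types distinct, as in a bipartite graph.
edges = [
    ("combustor_1", "generator_A"),
    ("combustor_2", "generator_A"),  # two combustors feed one generator
    ("combustor_3", "generator_B"),
    ("combustor_3", "generator_C"),  # one combustor feeds two generators
    ("combustor_4", "generator_D"),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Each connected component is one candidate subplant.
subplants = [sorted(component) for component in nx.connected_components(graph)]
subplants.sort()  # deterministic order for display

for subplant_id, members in enumerate(subplants):
    print(subplant_id, members)
```

In this toy graph the five edges collapse into three groups, each of which would receive its own subplant_id.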

Through this analysis, we found that 56% of plant_ids contain multiple distinct subplants, and 11% contain subplants with different technology types, such as a gas boiler and gas turbine (not in a combined cycle).

Usage Example:

    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids
    epacems = pudl.output.epacems.epacems(states=['ID'])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Module Contents

Functions

_get_unique_keys(→ pandas.DataFrame)

Get unique unit IDs from CEMS data.

filter_crosswalk_by_epacems(→ pandas.DataFrame)

Inner join unique CEMS units with the EPA crosswalk.

filter_out_unmatched(→ pandas.DataFrame)

Remove unmatched or excluded (non-exporting) units.

filter_out_boiler_rows(→ pandas.DataFrame)

Remove rows that represent graph edges between generators and boilers.

_prep_for_networkx(→ pandas.DataFrame)

Make surrogate keys for combustors and generators.

_subplant_ids_from_prepped_crosswalk(→ pandas.DataFrame)

Use graph analysis to create global subplant IDs from a crosswalk edge list.

_convert_global_id_to_composite_id(→ pandas.DataFrame)

Convert global_subplant_id to a composite key (CAMD_PLANT_ID, subplant_id).

filter_crosswalk(→ pandas.DataFrame)

Remove irrelevant or duplicated rows from the crosswalk.

make_subplant_ids(→ pandas.DataFrame)

Identify sub-plants in the EPA/EIA crosswalk graph.

pudl.analysis.epa_crosswalk._get_unique_keys(epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Get unique unit IDs from CEMS data.

Parameters:

epacems – dataset from pudl.output.epacems.epacems()

Returns:

Unique keys from the epacems dataset.
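
A minimal sketch of this kind of de-duplication follows. It assumes the key columns are the ones listed for the epacems argument of filter_crosswalk (“plant_id_eia”, “unitid”, “unit_id_epa”); the helper name `unique_unit_keys` and the sample data are hypothetical, and the sketch covers only the pandas case. A Dask DataFrame would additionally need `.compute()` to materialize the result.

```python
import pandas as pd

# Assumed key columns, taken from the column list documented for filter_crosswalk.
KEY_COLUMNS = ["plant_id_eia", "unitid", "unit_id_epa"]


def unique_unit_keys(epacems: pd.DataFrame) -> pd.DataFrame:
    """Return one row per distinct unit key (hypothetical helper)."""
    return epacems[KEY_COLUMNS].drop_duplicates().reset_index(drop=True)


# Made-up hourly records: unit (1, "A") appears twice.
epacems = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "unitid": ["A", "A", "B", "A"],
        "unit_id_epa": [10, 10, 11, 20],
        "gross_load_mw": [5.0, 6.0, 7.0, 8.0],
    }
)

keys = unique_unit_keys(epacems)
print(keys)
```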

pudl.analysis.epa_crosswalk.filter_crosswalk_by_epacems(crosswalk: pandas.DataFrame, epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Inner join unique CEMS units with the EPA crosswalk.

This is essentially an empirical filter on EPA units. Instead of filtering by construction/retirement dates in the crosswalk (thus assuming they are accurate), use the presence/absence of CEMS data to filter the units.

Parameters:
  • crosswalk – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

  • epacems – CEMS data, as from pudl.output.epacems.epacems(); its unique unit ids are extracted as in _get_unique_keys

Returns:

The inner join of the EPA crosswalk and unique epacems units. Adds the global ID column unit_id_epa.
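
The empirical filter can be sketched as a pandas inner join. CAMD_PLANT_ID appears elsewhere in this module's documentation, but CAMD_UNIT_ID and EIA_GENERATOR_ID are assumed column names here, and all of the data is made up.

```python
import pandas as pd

# Hypothetical unique unit keys, as might be derived from CEMS data.
unit_keys = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "unitid": ["A", "B", "A"],
        "unit_id_epa": [10, 11, 20],
    }
)

# Hypothetical crosswalk rows. Plant 3 has no CEMS data, and
# unit (1, "B") has no crosswalk entry, so both drop out of the join.
crosswalk = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 2, 3],
        "CAMD_UNIT_ID": ["A", "A", "A"],
        "EIA_GENERATOR_ID": ["G1", "G2", "G3"],
    }
)

filtered = unit_keys.merge(
    crosswalk,
    left_on=["plant_id_eia", "unitid"],
    right_on=["CAMD_PLANT_ID", "CAMD_UNIT_ID"],
    how="inner",  # keep only units present on both sides
)
print(filtered)
```

Because the join starts from the CEMS keys, the result carries unit_id_epa alongside the crosswalk columns.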

pudl.analysis.epa_crosswalk.filter_out_unmatched(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove unmatched or excluded (non-exporting) units.

Unmatched rows reflect gaps in the EPA crosswalk itself, not in PUDL.

Parameters:

crosswalk – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns:

The EPA crosswalk with unmatched units removed.

pudl.analysis.epa_crosswalk.filter_out_boiler_rows(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove rows that represent graph edges between generators and boilers.

Parameters:

crosswalk – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns:

The EPA crosswalk with boiler rows (many-to-many or one-to-many) removed.
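
One way to picture this step: if a unit-to-generator edge is repeated once per associated boiler, dropping the boiler columns and de-duplicating leaves one row per edge. The sketch below illustrates that idea only; the column names other than CAMD_PLANT_ID are assumptions, the data is made up, and the actual function may work differently.

```python
import pandas as pd

# Hypothetical crosswalk: the edge (unit A, generator G1) is repeated
# because generator G1 is associated with two boilers.
crosswalk = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 1, 1],
        "CAMD_UNIT_ID": ["A", "A", "B"],
        "EIA_GENERATOR_ID": ["G1", "G1", "G2"],
        "EIA_BOILER_ID": ["B1", "B2", "B3"],
    }
)

# Keep every column that does not describe a boiler, then de-duplicate.
non_boiler_cols = [col for col in crosswalk.columns if "BOILER" not in col]
edges = crosswalk[non_boiler_cols].drop_duplicates().reset_index(drop=True)
print(edges)
```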

pudl.analysis.epa_crosswalk._prep_for_networkx(crosswalk: pandas.DataFrame) → pandas.DataFrame

Make surrogate keys for combustors and generators.

Parameters:

crosswalk – EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns:

A copy of the EPA crosswalk with new surrogate ID columns ‘combustor_id’ and ‘generator_id’.
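
Surrogate keys of this kind can be built with pandas `groupby(...).ngroup()`, which numbers each distinct combination of key values. The sketch below is an illustration under assumptions: CAMD_UNIT_ID and EIA_GENERATOR_ID are assumed column names, the data is made up, and offsetting the generator IDs so the two ID ranges never overlap is one possible design choice for keeping the node sets distinct.

```python
import pandas as pd

# Hypothetical crosswalk: generator "G1" exists at both plants,
# so the generator key must include the plant ID.
crosswalk = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 1, 2],
        "CAMD_UNIT_ID": ["A", "B", "A"],
        "EIA_GENERATOR_ID": ["G1", "G1", "G1"],
    }
)

prepped = crosswalk.copy()

# One integer per distinct (plant, unit) pair.
prepped["combustor_id"] = prepped.groupby(["CAMD_PLANT_ID", "CAMD_UNIT_ID"]).ngroup()

# One integer per distinct (plant, generator) pair, shifted past the
# combustor IDs so the two node sets cannot collide in a single graph.
prepped["generator_id"] = (
    prepped.groupby(["CAMD_PLANT_ID", "EIA_GENERATOR_ID"]).ngroup()
    + prepped["combustor_id"].max()
    + 1
)
print(prepped)
```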

pudl.analysis.epa_crosswalk._subplant_ids_from_prepped_crosswalk(prepped: pandas.DataFrame) → pandas.DataFrame

Use graph analysis to create global subplant IDs from a crosswalk edge list.

Parameters:

prepped – an EPA crosswalk that has passed through _prep_for_networkx()

Returns:

A copy of the EPA crosswalk plus a new column ‘global_subplant_id’.
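
A rough sketch of how component membership could be written back onto the edge list is shown below. It uses networkx `from_pandas_edgelist` and `connected_components`; the input data is made up, and the mapping approach is an assumption for illustration rather than a description of the actual implementation.

```python
import networkx as nx
import pandas as pd

# Hypothetical prepped edge list with non-overlapping surrogate node IDs.
prepped = pd.DataFrame(
    {
        "combustor_id": [0, 1, 2],
        "generator_id": [3, 3, 4],
    }
)

graph = nx.from_pandas_edgelist(
    prepped, source="combustor_id", target="generator_id"
)

# Map every node to the index of the connected component that contains it.
node_to_component = {
    node: component_index
    for component_index, nodes in enumerate(nx.connected_components(graph))
    for node in nodes
}

# Both endpoints of an edge share a component, so looking up the
# combustor alone is enough to label each row.
labelled = prepped.copy()
labelled["global_subplant_id"] = labelled["combustor_id"].map(node_to_component)
print(labelled)
```

Combustors 0 and 1 share generator 3, so their rows receive the same ID, while the edge between combustor 2 and generator 4 forms its own group.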

pudl.analysis.epa_crosswalk._convert_global_id_to_composite_id(crosswalk_with_ids: pandas.DataFrame) → pandas.DataFrame

Convert global_subplant_id to a composite key (CAMD_PLANT_ID, subplant_id).

The composite key will be much more stable (though not fully stable!) over time. The global ID changes if ANY unit or generator changes, whereas the composite key only changes if units/generators change within that specific plant.

A global ID could also tempt users into using it as a crutch, even though it isn’t stable. A composite key should discourage that behavior.

Parameters:

crosswalk_with_ids – crosswalk with global_subplant_id, as from _subplant_ids_from_prepped_crosswalk()

Raises:

ValueError – if crosswalk_with_ids has a MultiIndex

Returns:

A copy of crosswalk_with_ids with an added column ‘subplant_id’.
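
The conversion amounts to renumbering the global IDs within each plant. A minimal sketch follows, using a dense rank per CAMD_PLANT_ID group; the data is made up and the ranking approach is an assumption chosen for brevity, not necessarily what the function does internally.

```python
import pandas as pd

# Hypothetical input: plant 1 has two subplants, plant 2 has two more.
crosswalk_with_ids = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 1, 1, 2, 2],
        "global_subplant_id": [0, 0, 1, 2, 3],
    }
)

result = crosswalk_with_ids.copy()

# Within each plant, rank the distinct global IDs densely (1, 2, ...)
# and shift so that numbering restarts at 0 for every plant.
result["subplant_id"] = (
    result.groupby("CAMD_PLANT_ID")["global_subplant_id"]
    .rank(method="dense")
    .astype(int)
    - 1
)
print(result)
```

With this numbering, adding a unit at plant 1 can shift global IDs everywhere after it, but the subplant_id values at plant 2 are unaffected.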

pudl.analysis.epa_crosswalk.filter_crosswalk(crosswalk: pandas.DataFrame, epacems: pd.DataFrame | dd.DataFrame) → pandas.DataFrame

Remove irrelevant or duplicated rows from the crosswalk.

Remove crosswalk rows that do not correspond to an EIA facility or are duplicated due to many-to-many boiler relationships.

Parameters:
  • crosswalk – The EPA/EIA crosswalk from pudl.output.epacems.epa_crosswalk()

  • epacems – Emissions data. Must contain columns named [“plant_id_eia”, “unitid”, “unit_id_epa”]

Returns:

A filtered copy of the EPA crosswalk.

pudl.analysis.epa_crosswalk.make_subplant_ids(crosswalk: pandas.DataFrame) → pandas.DataFrame

Identify sub-plants in the EPA/EIA crosswalk graph.

Any row filtering should be done before this step.

Usage Example:

    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids
    epacems = pudl.output.epacems.epacems(states=['ID'])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Parameters:

crosswalk – The EPA/EIA crosswalk, from pudl.output.epacems.epa_crosswalk()

Returns:

An edge list connecting EPA units to EIA generators, with each connected component assigned a subplant_id.