pudl.analysis.epa_crosswalk

Use the EPA crosswalk to connect EPA units to EIA generators and other data.

A major use case for this dataset is to identify subplants within plant_ids, which are the smallest coherent units for aggregation. Despite the name, plant_id refers to a legal entity that often contains multiple distinct power plants, even of different technology or fuel types.

EPA CEMS data combines information from several parts of a power plant:

  • emissions from smokestacks
  • fuel use from combustors
  • electricity production from generators

But smokestacks, combustors, and generators can be connected in complex, many-to-many relationships. This complexity makes attribution difficult when, for example, allocating pollution to energy producers. Furthermore, heterogeneity within plant_ids makes aggregation to the parent entity difficult or inappropriate.

But by analyzing the relationships between combustors and generators, as provided in the EPA/EIA crosswalk, we can identify distinct power plants. These are the smallest coherent units of aggregation.

In graph analysis terminology, the crosswalk is a list of edges between nodes (combustors and generators) in a bipartite graph. The networkx Python package provides functions to analyze this edge list and extract disjoint subgraphs (groups of combustors and generators that are connected to each other). These are the distinct power plants. To avoid a name collision with plant_id, we term these collections ‘subplants’ and identify them with a subplant_id that is unique within each plant_id. Subplants are thus identified by the composite key (plant_id, subplant_id).
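The grouping idea can be illustrated without any dependencies. The following is a minimal sketch, not the module's implementation: it swaps in a plain union-find in place of the networkx connected-components analysis, and the edge list layout (plant ID, combustor, generator), the sample values, and the helper names `find` and `subplant_ids` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical edge list: (plant_id, combustor, generator). Values are made up.
edges = [
    (1, "B1", "G1"),
    (1, "B2", "G1"),   # B1 and B2 share generator G1, so they form one subplant
    (1, "CT1", "G2"),  # a separate turbine/generator pair is its own subplant
    (2, "B1", "G1"),   # a different plant; subplant numbering restarts here
]


def find(parent, node):
    """Return the representative of node's component, compressing the path."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def subplant_ids(edge_list):
    """Map each edge to a (plant_id, subplant_id) composite key."""
    parent = {}
    for plant, combustor, generator in edge_list:
        # Tag each node with its plant and role so that combustor and
        # generator names never collide, within or across plants.
        c = (plant, "combustor", combustor)
        g = (plant, "generator", generator)
        parent.setdefault(c, c)
        parent.setdefault(g, g)
        parent[find(parent, c)] = find(parent, g)  # merge the two components

    next_id = defaultdict(int)  # per-plant counter for subplant_id
    assigned = {}               # component representative -> subplant_id
    keys = []
    for plant, combustor, _ in edge_list:
        root = find(parent, (plant, "combustor", combustor))
        if root not in assigned:
            assigned[root] = next_id[plant]
            next_id[plant] += 1
        keys.append((plant, assigned[root]))
    return keys


keys = subplant_ids(edges)
```

Each connected group of combustors and generators receives one key, and the numbering in this sketch starts from zero within each plant; the numbering scheme used by the module itself may differ.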

Through this analysis, we found that 56% of plant_ids contain multiple distinct subplants, and 11% contain subplants with different technology types, such as a gas boiler and gas turbine (not in a combined cycle).

Usage Example:

    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids

    epacems = pudl.output.epacems.epacems(states=["ID"])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Module Contents

Functions

_get_unique_keys(epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Get unique unit IDs from CEMS data.

filter_crosswalk_by_epacems(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Inner join unique CEMS units with the EPA crosswalk.

filter_out_unmatched(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove unmatched or excluded (non-exporting) units.

filter_out_boiler_rows(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove rows that represent graph edges between generators and boilers.

_prep_for_networkx(crosswalk: pandas.DataFrame) → pandas.DataFrame

Make surrogate keys for combustors and generators.

_subplant_ids_from_prepped_crosswalk(prepped: pandas.DataFrame) → pandas.DataFrame

Use networkx graph analysis to create global subplant IDs from a preprocessed crosswalk edge list.

_convert_global_id_to_composite_id(crosswalk_with_ids: pandas.DataFrame) → pandas.DataFrame

Convert global_subplant_id to an equivalent composite key (CAMD_PLANT_ID, subplant_id).

filter_crosswalk(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Remove crosswalk rows that do not correspond to an EIA facility or are duplicated due to many-to-many boiler relationships.

make_subplant_ids(crosswalk: pandas.DataFrame) → pandas.DataFrame

Identify sub-plants in the EPA/EIA crosswalk graph. Any row filtering should be done before this step.

pudl.analysis.epa_crosswalk._get_unique_keys(epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Get unique unit IDs from CEMS data.

Parameters

epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems

Returns

unique keys from the epacems dataset

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk.filter_crosswalk_by_epacems(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Inner join unique CEMS units with the EPA crosswalk.

This is essentially an empirical filter on EPA units. Instead of filtering by construction/retirement dates in the crosswalk (thus assuming they are accurate), use the presence/absence of CEMS data to filter the units.
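The filter can be pictured as a join against the set of units that actually report data. This is a rough sketch under stated assumptions rather than the function's actual code: the sample frames are invented, and the crosswalk is given the same key column names as the CEMS data (`plant_id_eia`, `unitid`) purely for illustration; the real crosswalk columns may be named differently.

```python
import pandas as pd

# Hypothetical CEMS records: several rows per unit, one per reporting period.
cems = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20],
        "unitid": ["1", "1", "2", "A"],
        "operating_hours": [1.0, 0.5, 1.0, 1.0],
    }
)

# Hypothetical crosswalk: unit "3" at plant 10 never appears in the CEMS data.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20],
        "unitid": ["1", "2", "3", "A"],
        "generator_id": ["G1", "G2", "G3", "GA"],
    }
)

# Reduce the CEMS data to one row per unit that reported anything.
unique_units = cems[["plant_id_eia", "unitid"]].drop_duplicates()

# The inner join keeps only crosswalk rows whose unit has CEMS data,
# regardless of what construction or retirement dates the crosswalk lists.
filtered = crosswalk.merge(unique_units, on=["plant_id_eia", "unitid"], how="inner")
```

In this sketch the row for unit "3" drops out because no CEMS record exists for it, which is the sense in which the filter is empirical.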

Parameters
  • crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

  • epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems; its unique unit IDs are extracted with _get_unique_keys

Returns

the inner join of the EPA crosswalk and unique epacems units. Adds the global ID column unit_id_epa.

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk.filter_out_unmatched(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove unmatched or excluded (non-exporting) units.

Unmatched rows reflect gaps in the completeness of the EPA crosswalk itself, not in PUDL.

Parameters

crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns

the EPA crosswalk with unmatched units removed

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk.filter_out_boiler_rows(crosswalk: pandas.DataFrame) → pandas.DataFrame

Remove rows that represent graph edges between generators and boilers.

Parameters

crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns

the EPA crosswalk with boiler rows (many/one-to-many) removed

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk._prep_for_networkx(crosswalk: pandas.DataFrame) → pandas.DataFrame

Make surrogate keys for combustors and generators.
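One way to build such surrogate keys is sketched below. It is an illustration under assumptions, not necessarily how this function does it: the column names `unit` and `generator` and the sample rows are hypothetical, and offsetting the generator IDs so that they never overlap with combustor IDs is a design choice of the sketch, made so that both kinds of node can share one graph.

```python
import pandas as pd

# Hypothetical crosswalk rows; "unit" and "generator" are assumed column names.
crosswalk = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 1, 2],
        "unit": ["1", "2", "1"],
        "generator": ["G1", "G1", "G1"],
    }
)

prepped = crosswalk.copy()

# One integer per distinct (plant, unit) pair replaces the composite key.
prepped["combustor_id"] = prepped.groupby(["CAMD_PLANT_ID", "unit"]).ngroup()

# One integer per distinct (plant, generator) pair, shifted past the largest
# combustor_id so that no combustor and generator share a node ID.
prepped["generator_id"] = (
    prepped.groupby(["CAMD_PLANT_ID", "generator"]).ngroup()
    + prepped["combustor_id"].max()
    + 1
)
```

Single integer node IDs keep the edge list simple: each row becomes one (combustor_id, generator_id) edge.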

Parameters

crosswalk (pd.DataFrame) – EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns

copy of EPA crosswalk with new surrogate ID columns ‘combustor_id’ and ‘generator_id’

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk._subplant_ids_from_prepped_crosswalk(prepped: pandas.DataFrame) → pandas.DataFrame

Use networkx graph analysis to create global subplant IDs from a preprocessed crosswalk edge list.

Parameters

prepped (pd.DataFrame) – an EPA crosswalk that has passed through _prep_for_networkx()

Returns

copy of EPA crosswalk plus new column ‘global_subplant_id’

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk._convert_global_id_to_composite_id(crosswalk_with_ids: pandas.DataFrame) → pandas.DataFrame

Convert global_subplant_id to an equivalent composite key (CAMD_PLANT_ID, subplant_id).

The composite key will be much more stable (though not fully stable!) over time. The global ID changes if ANY unit or generator changes, whereas the composite key only changes if units/generators change within that specific plant.

A global ID could also tempt users into using it as a crutch, even though it isn’t stable. A composite key should discourage that behavior.
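The conversion amounts to renumbering the global IDs within each plant. The sketch below shows one way to do that with a dense rank per plant; it is an assumption about the approach, not the function's confirmed implementation, and the sample values are invented.

```python
import pandas as pd

# Hypothetical crosswalk with global subplant IDs already assigned.
crosswalk_with_ids = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [1, 1, 1, 2, 2],
        "global_subplant_id": [0, 0, 1, 2, 3],
    }
)

out = crosswalk_with_ids.copy()

# A dense rank within each plant renumbers that plant's global IDs as
# 0, 1, 2, ... without reference to any other plant.
out["subplant_id"] = (
    out.groupby("CAMD_PLANT_ID")["global_subplant_id"]
    .rank(method="dense")
    .astype(int)
    - 1
)
```

Because each plant is renumbered independently, adding or removing a unit at plant 1 in this sketch would leave the keys of plant 2 untouched, whereas the global IDs of plant 2 could shift.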

Parameters

crosswalk_with_ids (pd.DataFrame) – crosswalk with global_subplant_id, as from _subplant_ids_from_prepped_crosswalk()

Raises

ValueError – if crosswalk_with_ids has a MultiIndex

Returns

copy of crosswalk_with_ids with an added column: ‘subplant_id’

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk.filter_crosswalk(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) → pandas.DataFrame

Remove crosswalk rows that do not correspond to an EIA facility or are duplicated due to many-to-many boiler relationships.

Parameters
  • crosswalk (pd.DataFrame) – The EPA/EIA crosswalk, as from pudl.output.epacems.epa_crosswalk()

  • epacems (Union[pd.DataFrame, dd.DataFrame]) – Emissions data. Must contain columns named ["plant_id_eia", "unitid", "unit_id_epa"]

Returns

A filtered copy of the EPA crosswalk

Return type

pd.DataFrame

pudl.analysis.epa_crosswalk.make_subplant_ids(crosswalk: pandas.DataFrame) → pandas.DataFrame

Identify sub-plants in the EPA/EIA crosswalk graph. Any row filtering should be done before this step.

Usage Example:

    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids

    epacems = pudl.output.epacems.epacems(states=["ID"])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)

Parameters

crosswalk (pd.DataFrame) – The EPA/EIA crosswalk, as from pudl.output.epacems.epa_crosswalk()

Returns

An edge list connecting EPA units to EIA generators, with connected pieces issued a subplant_id

Return type

pd.DataFrame