pudl.analysis.epa_crosswalk
Use the EPA crosswalk to connect EPA units to EIA generators and other data.
A major use case for this dataset is to identify subplants within plant_ids, which are the smallest coherent units for aggregation. Despite the name, plant_id refers to a legal entity that often contains multiple distinct power plants, even of different technology or fuel types.
EPA CEMS data combines information from several parts of a power plant:
* emissions from smokestacks
* fuel use from combustors
* electricity production from generators
But smokestacks, combustors, and generators can be connected in complex, many-to-many relationships. This complexity makes attribution difficult, for example when allocating pollution to energy producers. Furthermore, heterogeneity within plant_ids makes aggregation to the parent entity difficult or inappropriate.
But by analyzing the relationships between combustors and generators, as provided in the EPA/EIA crosswalk, we can identify distinct power plants. These are the smallest coherent units of aggregation.
In graph analysis terminology, the crosswalk is a list of edges between nodes (combustors and generators) in a bipartite graph. The networkx python package provides functions to analyze this edge list and extract disjoint subgraphs (groups of combustors and generators that are connected to each other). These are the distinct power plants. To avoid a name collision with plant_id, we term these collections ‘subplants’, and identify them with a subplant_id that is unique within each plant_id. Subplants are thus identified with the composite key (plant_id, subplant_id).
Through this analysis, we found that 56% of plant_ids contain multiple distinct subplants, and 11% contain subplants with different technology types, such as a gas boiler and gas turbine (not in a combined cycle).
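The connected-component idea described above can be illustrated on a toy edge list. This is a minimal sketch, not the module's implementation: it uses a small pure-Python union-find in place of networkx (where networkx.connected_components performs the equivalent grouping), and the node labels and the helper names find and connected_groups are hypothetical.

```python
from collections import defaultdict


def find(parent, node):
    """Return the root of node's component, shortening the path as we go."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_groups(edges):
    """Group the nodes of an edge list into connected components."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return list(groups.values())


# Toy edge list for a single plant_id: (combustor, generator) pairs.
# The "C"/"G" prefixes are made up and keep the two node sets distinct.
edges = [
    ("C1", "G1"),
    ("C2", "G1"),  # two combustors feed one generator
    ("C3", "G2"),
    ("C3", "G3"),  # one combustor feeds two generators
]
subplants = connected_groups(edges)
```

Here the four edges collapse into two groups, {C1, C2, G1} and {C3, G2, G3}, which is what the text calls two subplants within one plant_id.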
Usage Example:
    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids
    epacems = pudl.output.epacems.epacems(states=["ID"])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)
Module Contents
Functions
| _get_unique_keys(epacems) | Get unique unit IDs from CEMS data. |
| filter_crosswalk_by_epacems(crosswalk, epacems) | Inner join unique CEMS units with the EPA crosswalk. |
| filter_out_unmatched(crosswalk) | Remove unmatched or excluded (non-exporting) units. |
| filter_out_boiler_rows(crosswalk) | Remove rows that represent graph edges between generators and boilers. |
| _prep_for_networkx(crosswalk) | Make surrogate keys for combustors and generators. |
| _subplant_ids_from_prepped_crosswalk(prepped) | Use networkx graph analysis to create global subplant IDs from a preprocessed crosswalk edge list. |
| _convert_global_id_to_composite_id(crosswalk_with_ids) | Convert global_subplant_id to an equivalent composite key (CAMD_PLANT_ID, subplant_id). |
| filter_crosswalk(crosswalk, epacems) | Remove crosswalk rows that do not correspond to an EIA facility or are duplicated due to many-to-many boiler relationships. |
| make_subplant_ids(crosswalk) | Identify sub-plants in the EPA/EIA crosswalk graph. Any row filtering should be done before this step. |
- pudl.analysis.epa_crosswalk._get_unique_keys(epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) -> pandas.DataFrame
Get unique unit IDs from CEMS data.
- Parameters
epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems
- Returns
unique keys from the epacems dataset
- Return type
pd.DataFrame
- pudl.analysis.epa_crosswalk.filter_crosswalk_by_epacems(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) -> pandas.DataFrame
Inner join unique CEMS units with the EPA crosswalk.
This is essentially an empirical filter on EPA units. Instead of filtering by construction/retirement dates in the crosswalk (thus assuming they are accurate), use the presence/absence of CEMS data to filter the units.
- Parameters
crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()
epacems (Union[pd.DataFrame, dd.DataFrame]) – epacems dataset from pudl.output.epacems.epacems; its unique unit IDs (see _get_unique_keys) are joined to the crosswalk
- Returns
the inner join of the EPA crosswalk and unique epacems units. Adds the global ID column unit_id_epa.
- Return type
pd.DataFrame
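The empirical filter described for filter_crosswalk_by_epacems can be sketched with plain Python records standing in for DataFrames. This is an illustration under assumptions, not the function's code: the join columns (CAMD_PLANT_ID with plant_id_eia, CAMD_UNIT_ID with unitid), the EIA_GENERATOR_ID column, the sample values, and the helper names unique_unit_keys and filter_by_cems are all assumed for the example.

```python
def unique_unit_keys(cems_rows):
    """Map each distinct (plant, unit) pair seen in CEMS to its unit_id_epa."""
    return {(r["plant_id_eia"], r["unitid"]): r["unit_id_epa"] for r in cems_rows}


def filter_by_cems(crosswalk_rows, cems_rows):
    """Keep only crosswalk rows whose unit actually reports CEMS data."""
    keys = unique_unit_keys(cems_rows)
    kept = []
    for row in crosswalk_rows:
        key = (row["CAMD_PLANT_ID"], row["CAMD_UNIT_ID"])  # assumed join columns
        if key in keys:
            kept.append({**row, "unit_id_epa": keys[key]})
    return kept


# Hourly CEMS records repeat the same unit many times; values are made up.
cems_rows = [
    {"plant_id_eia": 3, "unitid": "1", "unit_id_epa": 100},
    {"plant_id_eia": 3, "unitid": "1", "unit_id_epa": 100},  # another hour
    {"plant_id_eia": 3, "unitid": "2", "unit_id_epa": 101},
]
crosswalk_rows = [
    {"CAMD_PLANT_ID": 3, "CAMD_UNIT_ID": "1", "EIA_GENERATOR_ID": "GEN1"},
    {"CAMD_PLANT_ID": 3, "CAMD_UNIT_ID": "2", "EIA_GENERATOR_ID": "GEN1"},
    {"CAMD_PLANT_ID": 3, "CAMD_UNIT_ID": "3", "EIA_GENERATOR_ID": "GEN2"},  # no CEMS data
]
filtered = filter_by_cems(crosswalk_rows, cems_rows)
```

Unit "3" is dropped because no CEMS record mentions it, regardless of what its construction or retirement dates in the crosswalk might say.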
- pudl.analysis.epa_crosswalk.filter_out_unmatched(crosswalk: pandas.DataFrame) -> pandas.DataFrame
Remove unmatched or excluded (non-exporting) units.
Unmatched rows reflect gaps in the EPA crosswalk itself, not in PUDL.
- Parameters
crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()
- Returns
the EPA crosswalk with unmatched units removed
- Return type
pd.DataFrame
- pudl.analysis.epa_crosswalk.filter_out_boiler_rows(crosswalk: pandas.DataFrame) -> pandas.DataFrame
Remove rows that represent graph edges between generators and boilers.
- Parameters
crosswalk (pd.DataFrame) – the EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()
- Returns
the EPA crosswalk with boiler rows (many/one-to-many) removed
- Return type
pd.DataFrame
- pudl.analysis.epa_crosswalk._prep_for_networkx(crosswalk: pandas.DataFrame) -> pandas.DataFrame
Make surrogate keys for combustors and generators.
- Parameters
crosswalk (pd.DataFrame) – EPA crosswalk, as from pudl.output.epacems.epa_crosswalk()
- Returns
copy of EPA crosswalk with new surrogate ID columns ‘combustor_id’ and ‘generator_id’
- Return type
pd.DataFrame
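One reason surrogate keys are useful is that raw unit and generator IDs are often reused across plants, and a combustor and a generator may even share the same label. The sketch below shows one way such keys could be built; it is an assumption-laden illustration rather than what _prep_for_networkx does, and the column names CAMD_UNIT_ID and EIA_GENERATOR_ID, the sample values, and the helper add_surrogate_keys are hypothetical.

```python
def add_surrogate_keys(rows):
    """Give every distinct combustor and generator its own integer ID."""
    combustor_ids = {}  # (plant, unit) -> combustor_id
    generator_ids = {}  # (plant, generator) -> generator_id
    out = []
    for row in rows:
        c_key = (row["CAMD_PLANT_ID"], row["CAMD_UNIT_ID"])
        g_key = (row["CAMD_PLANT_ID"], row["EIA_GENERATOR_ID"])
        # setdefault assigns the next unused integer on first sight of a key
        c_id = combustor_ids.setdefault(c_key, len(combustor_ids))
        g_id = generator_ids.setdefault(g_key, len(generator_ids))
        out.append({**row, "combustor_id": c_id, "generator_id": g_id})
    return out


# Made-up rows: the labels "1" and "2" recur across plants and node types.
rows = [
    {"CAMD_PLANT_ID": 3, "CAMD_UNIT_ID": "1", "EIA_GENERATOR_ID": "1"},
    {"CAMD_PLANT_ID": 3, "CAMD_UNIT_ID": "2", "EIA_GENERATOR_ID": "1"},
    {"CAMD_PLANT_ID": 7, "CAMD_UNIT_ID": "1", "EIA_GENERATOR_ID": "1"},
]
keyed = add_surrogate_keys(rows)

# Tagging each node with its type keeps combustor 0 and generator 0 distinct
# when both kinds of node live in the same graph.
edge_list = [
    (("combustor", r["combustor_id"]), ("generator", r["generator_id"]))
    for r in keyed
]
```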
- pudl.analysis.epa_crosswalk._subplant_ids_from_prepped_crosswalk(prepped: pandas.DataFrame) -> pandas.DataFrame
Use networkx graph analysis to create global subplant IDs from a preprocessed crosswalk edge list.
- Parameters
prepped (pd.DataFrame) – an EPA crosswalk that has passed through _prep_for_networkx()
- Returns
copy of EPA crosswalk plus new column ‘global_subplant_id’
- Return type
pd.DataFrame
- pudl.analysis.epa_crosswalk._convert_global_id_to_composite_id(crosswalk_with_ids: pandas.DataFrame) -> pandas.DataFrame
Convert global_subplant_id to an equivalent composite key (CAMD_PLANT_ID, subplant_id).
The composite key is much more stable over time, though not fully stable. The global ID changes if ANY unit or generator changes, whereas the composite key only changes if units or generators change within that specific plant.
A global ID could also tempt users into relying on it as a crutch, even though it isn't stable. A composite key should discourage that behavior.
- Parameters
crosswalk_with_ids (pd.DataFrame) – crosswalk with global_subplant_id, as from _subplant_ids_from_prepped_crosswalk()
- Raises
ValueError – if crosswalk_with_ids has a MultiIndex
- Returns
copy of crosswalk_with_ids with an added column: ‘subplant_id’
- Return type
pd.DataFrame
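The stability argument above can be made concrete with a small sketch that renumbers global IDs within each plant. This is an illustrative stand-in using plain records, not the function's implementation; the helper add_subplant_id, the zero-based numbering, and the sample IDs are assumptions.

```python
def add_subplant_id(rows):
    """Number the global IDs 0, 1, 2, ... separately within each plant."""
    local_ids = {}  # plant -> {global_subplant_id: subplant_id}
    out = []
    ordered = sorted(rows, key=lambda r: (r["CAMD_PLANT_ID"], r["global_subplant_id"]))
    for row in ordered:
        per_plant = local_ids.setdefault(row["CAMD_PLANT_ID"], {})
        # first sight of a global ID in this plant gets the next local number
        subplant_id = per_plant.setdefault(row["global_subplant_id"], len(per_plant))
        out.append({**row, "subplant_id": subplant_id})
    return out


# Before: plant 10 has two subplants, plant 20 has one (made-up data).
before = add_subplant_id([
    {"CAMD_PLANT_ID": 10, "global_subplant_id": 0},
    {"CAMD_PLANT_ID": 10, "global_subplant_id": 1},
    {"CAMD_PLANT_ID": 20, "global_subplant_id": 2},
])
# After: plant 10 gains a third subplant, which shifts plant 20's global ID.
after = add_subplant_id([
    {"CAMD_PLANT_ID": 10, "global_subplant_id": 0},
    {"CAMD_PLANT_ID": 10, "global_subplant_id": 1},
    {"CAMD_PLANT_ID": 10, "global_subplant_id": 2},
    {"CAMD_PLANT_ID": 20, "global_subplant_id": 3},
])
plant_20_before = [(r["CAMD_PLANT_ID"], r["subplant_id"]) for r in before if r["CAMD_PLANT_ID"] == 20]
plant_20_after = [(r["CAMD_PLANT_ID"], r["subplant_id"]) for r in after if r["CAMD_PLANT_ID"] == 20]
```

In this toy case the global ID for plant 20 moves from 2 to 3 when plant 10 changes, while its composite key stays (20, 0).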
- pudl.analysis.epa_crosswalk.filter_crosswalk(crosswalk: pandas.DataFrame, epacems: Union[pandas.DataFrame, dask.dataframe.DataFrame]) -> pandas.DataFrame
Remove crosswalk rows that do not correspond to an EIA facility or are duplicated due to many-to-many boiler relationships.
- Parameters
crosswalk (pd.DataFrame) – The EPA/EIA crosswalk, as from pudl.output.epacems.epa_crosswalk()
epacems (Union[pd.DataFrame, dd.DataFrame]) – Emissions data. Must contain columns named ["plant_id_eia", "unitid", "unit_id_epa"]
- Returns
A filtered copy of EPA crosswalk
- Return type
pd.DataFrame
- pudl.analysis.epa_crosswalk.make_subplant_ids(crosswalk: pandas.DataFrame) -> pandas.DataFrame
Identify sub-plants in the EPA/EIA crosswalk graph. Any row filtering should be done before this step.
Usage Example:
    import pudl.output.epacems
    from pudl.analysis.epa_crosswalk import filter_crosswalk, make_subplant_ids
    epacems = pudl.output.epacems.epacems(states=["ID"])  # small subset for quick test
    epa_crosswalk_df = pudl.output.epacems.epa_crosswalk()
    filtered_crosswalk = filter_crosswalk(epa_crosswalk_df, epacems)
    crosswalk_with_subplant_ids = make_subplant_ids(filtered_crosswalk)
- Parameters
crosswalk (pd.DataFrame) – The EPA/EIA crosswalk, as from pudl.output.epacems.epa_crosswalk()
- Returns
An edge list connecting EPA units to EIA generators, with connected pieces issued a subplant_id
- Return type
pd.DataFrame