pudl.glue.epacamd_eia
#
Extract, clean, and normalize the EPACAMD-EIA crosswalk.
This module defines functions that read the raw EPACAMD-EIA crosswalk file and clean up the column names.
The crosswalk file was a joint effort on behalf on EPA and EIA and is published on the EPA’s github account at www.github.com/USEPA/camd-eia-crosswalk”. It’s a work in progress and worth noting that, at present, only pulls from 2018 data.
Module Contents#
Functions#
Attributes#
- pudl.glue.epacamd_eia.extract(ds: pudl.workspace.datastore.Datastore) pandas.DataFrame [source]#
Extract the EPACAMD-EIA Crosswalk from the Datastore.
- pudl.glue.epacamd_eia.transform(epacamd_eia: pandas.DataFrame, generators_entity_eia: pandas.DataFrame, boilers_entity_eia: pandas.DataFrame, processing_all_eia_years: bool) dict[str, pandas.DataFrame] [source]#
Clean up the EPACAMD-EIA Crosswalk file.
In its raw form, the crosswalk contains many fields. The transform process removes descriptive fields like state, location, facility name, capacity, operating status, and fuel type that can be found by linking this dataset to others already in the database. We’re primarily concerned with linking ids in this table, not including any other plant information.
The raw file contains several fields with the prefix
MOD
. These fields are used to create a single ID for matches with different spelling. For example,EIA_PLANT_ID
(plant_id_eia
in PUDL) has the generator IDCTG5
in EPA andGT5
in EIA. TheMOD
columns for both of these generators (MOD_CAMD_GENERATOR_ID
,MOD_EIA_GENERATOR_ID_GEN
) converts that value to 5. TheMATCH_TYPE_GEN
,MATCH_TYPE_BOILER
, andPLANT_ID_CHANGE_FLAG
fields indicate whether theseMOD
fields contain new information. Because we’re not concerned with creating a new, modified ID field for either EPA or EIA data, we don’t need theseMOD
orMATCH_TYPE
columns in our final output. We just care which EPA value maps to which EIA value.In terms of cleaning, we implement the standard column name changes: lower-case, no special characters, and underscores instead of spaces. We also rename some of the columns for clarity and to match how they appear in the tables you will merge with. Besides standardizing datatypes (again for merge compatability) the only meaningful data alteration we employ here is removing leading zeros from numeric strings on the
generator_id
andemissions_unit_id_epa
fields. This is because the same function is used to clean those same fields in all the other tables in which they appear. In order to merge properly, we need to clean the values in the crosswalk the same way. Lastly, we drop all rows withoutEIA_PLANT_ID
(plant_id_eia
) values because that means that they are unmatched and do not provide any useful information to users.It’s important to note that the crosswalk is kept intact (and not seperated into smaller reference tables) because the relationship between the ids is not 1:1. For example, you can’t isolate the plant_id fields, drop duplicates, and call it a day. The plant discrepancies depend on which generator ids it’s referring to. This goes for all fields. Be careful, and do some due diligence before eliminating columns.
We talk more about the complexities regarding EPA “units” in our Data Source documentation page for EPACEMS.
It’s also important to note that the crosswalk is a static file: there is no year field. The plant_id_eia and generator_id fields, however, are foreign keys from an annualized table. If the fast ETL is run (on one year of data) the test will break because the crosswalk tables with
plant_id_eia
andgenerator_id
contain values from various years. To keep the crosswalk in alignment with the available eia data, we’ll restrict it based on the generator entity table which hasplant_id_eia
andgenerator_id
so long as it’s not using the full suite of avilable years. If it is, we don’t want to restrict the crosswalk so we can get warnings and errors from any foreign key discrepancies. This isn’t an ideal solution, but it works for now.- Parameters:
epacamd_eia – The result of running this module’s extract() function.
generators_entity_eia – The generators_entity_eia table.
boilers_entity_eia – The boilers_entitiy_eia table.
processing_all_years – A boolean indicating whether the years from the Eia860Settings object match the EIA860 working partitions. This indicates whether or not to restrict the crosswalk data so the tests don’t fail on foreign key restraints.
- Returns:
A dictionary containing the cleaned EPACAMD-EIA crosswalk DataFrame.