Source code for pudl.glue.epacamd_eia

"""Extract, clean, and normalize the EPACAMD-EIA crosswalk.

This module defines functions that read the raw EPACAMD-EIA crosswalk file and clean
up the column names.

The crosswalk file was a joint effort on behalf on EPA and EIA and is published on the
EPA's github account at www.github.com/USEPA/camd-eia-crosswalk". It's a work in
progress and worth noting that, at present, only pulls from 2018 data.
"""
import pandas as pd

import pudl.logging_helpers
from pudl.helpers import remove_leading_zeros_from_numeric_strings, simplify_columns
from pudl.metadata.fields import apply_pudl_dtypes
from pudl.workspace.datastore import Datastore

[docs]logger = pudl.logging_helpers.get_logger(__name__)
[docs]def extract(ds: Datastore) -> pd.DataFrame: """Extract the EPACAMD-EIA Crosswalk from the Datastore.""" logger.info("Extracting the EPACAMD-EIA crosswalk from Zenodo") with ds.get_zipfile_resource("epacamd_eia", name="epacamd_eia.zip").open( "camd-eia-crosswalk-master/epa_eia_crosswalk.csv" ) as f: return pd.read_csv(f)
[docs]def transform( epacamd_eia: pd.DataFrame, generators_entity_eia: pd.DataFrame, boilers_entity_eia: pd.DataFrame, processing_all_eia_years: bool, ) -> dict[str, pd.DataFrame]: """Clean up the EPACAMD-EIA Crosswalk file. In its raw form, the crosswalk contains many fields. The transform process removes descriptive fields like state, location, facility name, capacity, operating status, and fuel type that can be found by linking this dataset to others already in the database. We're primarily concerned with linking ids in this table, not including any other plant information. The raw file contains several fields with the prefix `MOD`. These fields are used to create a single ID for matches with different spelling. For example, `EIA_PLANT_ID` (`plant_id_eia` in PUDL) has the generator ID `CTG5` in EPA and `GT5` in EIA. The `MOD` columns for both of these generators (`MOD_CAMD_GENERATOR_ID`, `MOD_EIA_GENERATOR_ID_GEN`) converts that value to 5. The `MATCH_TYPE_GEN`, `MATCH_TYPE_BOILER`, and `PLANT_ID_CHANGE_FLAG` fields indicate whether these `MOD` fields contain new information. Because we're not concerned with creating a new, modified ID field for either EPA or EIA data, we don't need these `MOD` or `MATCH_TYPE` columns in our final output. We just care which EPA value maps to which EIA value. In terms of cleaning, we implement the standard column name changes: lower-case, no special characters, and underscores instead of spaces. We also rename some of the columns for clarity and to match how they appear in the tables you will merge with. Besides standardizing datatypes (again for merge compatability) the only meaningful data alteration we employ here is removing leading zeros from numeric strings on the `generator_id` and `emissions_unit_id_epa` fields. This is because the same function is used to clean those same fields in all the other tables in which they appear. In order to merge properly, we need to clean the values in the crosswalk the same way. Lastly, we drop all rows without `EIA_PLANT_ID` (`plant_id_eia`) values because that means that they are unmatched and do not provide any useful information to users. It's important to note that the crosswalk is kept intact (and not seperated into smaller reference tables) because the relationship between the ids is not 1:1. For example, you can't isolate the plant_id fields, drop duplicates, and call it a day. The plant discrepancies depend on which generator ids it's referring to. This goes for all fields. Be careful, and do some due diligence before eliminating columns. We talk more about the complexities regarding EPA "units" in our Data Source documentation page for EPACEMS: https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/epacems.html It's also important to note that the crosswalk is a static file: there is no year field. The plant_id_eia and generator_id fields, however, are foreign keys from an annualized table. If the fast ETL is run (on one year of data) the test will break because the crosswalk tables with plant_id_eia and generator_id contain values from various years. To keep the crosswalk in alignment with the available eia data, we'll restrict it based on the generator entity table which has plant_id_eia and generator_id so long as it's not using the full suite of avilable years. If it is, we don't want to restrict the crosswalk so we can get warnings and errors from any foreign key discrepancies. This isn't an ideal solution, but it works for now. Args: epacamd_eia: The result of running this module's extract() function. generators_entity_eia: The generators_entity_eia table. boilers_entity_eia: The boilers_entitiy_eia table. processing_all_years: A boolean indicating whether the years from the Eia860Settings object match the EIA860 working partitions. This indicates whether or not to restrict the crosswalk data so the tests don't fail on foreign key restraints. Returns: A dictionary containing the cleaned EPACAMD-EIA crosswalk DataFrame. """ logger.info("Transforming the EPACAMD-EIA crosswalk") column_rename = { "camd_plant_id": "plant_id_epa", "camd_unit_id": "emissions_unit_id_epa", "camd_generator_id": "generator_id_epa", "eia_plant_id": "plant_id_eia", "eia_boiler_id": "boiler_id", # Eventually change to boiler_id_eia "eia_generator_id": "generator_id", # Eventually change to generator_id_eia } # Basic column rename, selection, and dtype alignment. crosswalk_clean = ( epacamd_eia.pipe(simplify_columns) .rename(columns=column_rename) .filter(list(column_rename.values())) .pipe(remove_leading_zeros_from_numeric_strings, col_name="generator_id") .pipe( remove_leading_zeros_from_numeric_strings, col_name="emissions_unit_id_epa" ) .pipe(apply_pudl_dtypes, "eia") .dropna(subset=["plant_id_eia"]) ) # Restrict crosswalk for tests if running fast etl if not processing_all_eia_years: logger.info( "Selected subset of avilable EIA years--restricting EPACAMD-EIA Crosswalk \ to chosen subset of EIA years" ) crosswalk_clean = pd.merge( crosswalk_clean, generators_entity_eia[["plant_id_eia", "generator_id"]], on=["plant_id_eia", "generator_id"], how="inner", ) crosswalk_clean = pd.merge( crosswalk_clean, boilers_entity_eia[["plant_id_eia", "boiler_id"]], on=["plant_id_eia", "boiler_id"], how="inner", ) return {"epacamd_eia": crosswalk_clean}