PUDL ID Mapping
The Status of FERC 1 and EIA IDs
Many of the same utilities are reporting data to FERC and EIA, but there is no official crosswalk file or universal ID to connect the records and obtain information from both sources in one table. The EIA assigns their own IDs to distinguish the utilities and plants that report to them over time. These IDs are useful for parsing EIA data, but they are not connected to utility and plant data reported by FERC. FERC Form 1 uses a respondent ID to keep track of utilities, but it does not have an ID system in place to keep track of unique plants or records over time.
Why is this significant? Without plant IDs, you can’t track plant characteristics over time. Without shared plant IDs, you can’t develop a holistic understanding of individual plants based on the data they report to different entities. Linking the financial information from FERC 1 with the operational data from EIA923 and EIA860 for instance helps us cultivate a more complete picture of the marginal cost of electricity.
To make the data from FERC 1 and EIA more usable and interchangeable, we have developed
utility_id_pudl values that are the same across
datasets. The IDs are assigned via a manual mapping process that is codified in a
spreadsheet located at
pudl/package_data/glue/pudl_id_mapping.xlsx. This spreadsheet
contains all unique plant and utility names reported to FERC and EIA from the oldest to
the newest year of data we’ve collected for each data source.
In addition to mapping universal IDs onto plant and utility records, this spreadsheet assigns universal plant and utility names to each record.
The spreadsheet has two tabs, one for mapping utilities and one for mapping plants. We assign plant and utility PUDL IDs to each record first by giving each record a unique ID and then by identifying and fixing records that should share an ID. See below for more detailed instructions.
Plants, as defined by this mapping process, are considered co-located generation assets.
Records that have the same
plant_id_eia should also have the same
PUDL IDs should never be hard-coded into any analysis or transformation functions as they may come to represent different plants or utilities as new records are added and mapped.
PUDL IDs are not static. They identify unique plants/utilities in a given iteration of the data, but they may not remain exactly the same over time. The spreadsheet assigns PUDL IDs based on a record’s position in the spreadsheet, so if you change the PUDL ID of a record in the middle, it will change the PUDL ID of all those below it (unless they are referencing a PUDL ID that already exists)
No more than one person should ever update the PUDL ID mapping spreadsheet at a time. Two people updating PUDL IDs simultaneously will lead to discrepancies in the PUDL ID values.
pudl.glue.ferc1_eia module is where most of the spreadsheet coordination
happens after the records are manually mapped. It also contains the functions that
determine whether plants/utilities have or haven’t been mapped.
Checking for Unmapped Records
With every new year of data comes the possibility of new plants and utilities. Once
you’ve integrated the new data into PUDL (see these instructions), you’ll need to check for unmapped utility and plants. To do this,
find_unmapped_plants_utils.py script. You can add the
--help flag for
more information. From the top level directory in the PUDL repository:
This script identifies plants and utilities which exist in the updated FERC 1 and EIA
datasets that do not yet appear in
pudl_id_mapping.xlsx. The script will output four
CSVs in the
devtools/ferc1-eia-glue directory that correspond to unmapped plants and
utilities from FERC 1 and EIA.
Assigning PUDL IDs to Unmapped Records
Here comes the manually intensive part of the process! Now we must ensure that 1) every record gets assigned a PUDL ID and 2) that records pertaining to the same plant have the same PUDL ID.
The ordering of the rows in the mapping spreadsheet is important. YOU MUST NOT SORT THE PUDL ID MAPPING SPREADSHEET, as it will change the values of many assigned IDs. If you need to view only a subset of the data in the sheet for ease of mapping you can filter it.
unmapped_plants_ferc1/eia.csv files should display basic plant information such
as the facility name, utility name, and capacity. We show capacity here so that we can
prioritize which plants to map. The larger the capacity, the more important it is to get
it mapped. Sort the records by capacity so the highest priority records at the top.
From the FERC and EIA unmapped plants spreadsheets, copy the
plant_id_eia (only in
columns and paste them at the bottom of the corresponding columns in the plants tab of
pudl_id_mapping.xlsx spreadsheet. Next drag the auto-incrementing formula in the
plant_id_pudl column and the naming formula in the
plant_name_pudl column so
that all new records are automatically assigned PUDL plant names and unique PUDL IDs.
You should also drag the
find_plant_id_eia_matches formula down, which we’ll use in
the next step.
In previous iterations of the spreadsheet, matching FERC and EIA records were placed in
the same row with the FERC version in the FERC columns and the EIA version in the EIA
columns. This is not necessary. As long as matching FERC and EIA records (and same-plant
records within a data source) have the same PUDL ID in the
you’re good to go!
Both FERC and EIA have utility IDs, so we’re fairly confident that they don’t require intra-dataset mapping. For this reason, we only focus on connecting utilities between datasets.
Linking FERC1-EIA Records
Copy the information output to the
unmapped_utils_eia/ferc1.csv files and paste it
in the appropriate columns at the bottom of the
pudl_id_mapping.xlsx sheet. Note
that FERC 1 utility information goes in the left-hand columns and EIA utility
information goes in the right-hand columns.
Next, you’ll have to manually assign
utility_id_pudl values to each row. There is no
formula you can drag down, so just find the largest
utility_id_pudl and create new
values incrementing from there. To double check whether a utility has already appeared,
drag down the formulas in the
columns. If there’s a match, the correct
utility_id_pudl will show up in the column,
and you can create a reference to the original
utility_id_pudl assignment above.
Make sure to save the file when you’re done!
Testing Newly Mapped Records
Before you integrate these newly mapped records into the PUDL database, you’ll want to run some basic tests in the command line to make sure you’ve covered all of the unmapped entities. This command assumes that you have all of the new EIA data loaded into your live PUDL DB, and all of the new FERC 1 data loaded into your cloned FERC 1 DB:
$ pytest --live-dbs test/integration/glue_test.py
Integrating Newly Mapped Records into PUDL
Once you’ve successfully mapped all unmapped PUDL IDs, you’ll want to rerun the ETL! This ensures that the newly mapped IDs get integrated into the PUDL database and output tables that folks are using. Make sure to tell everyone else to do so as well so that you can all use the newly mapped PUDL IDs.