pudl.glue.ferc1_eia#

Extract and transform glue tables between FERC Form 1 and EIA 860/923.

FERC1 and EIA report on many of the same plants and utilities, but have no embedded connection. We have combed through the FERC and EIA plants and utilities to generate id’s which can connect these datasets. The resulting fields in the PUDL tables are plant_id_pudl and utility_id_pudl, respectively. This was done by hand in a spreadsheet which is in the package_data/glue directory. When mapping plants, we considered a plant a co-located collection of electricity generation equipment. If a coal plant was converted to a natural gas unit, our aim was to consider this the same plant. This module simply reads in the mapping spreadsheet and converts it to a dictionary of dataframes.

Because these mappings were done by hand and for every one of FERC Form 1’s thousands of reported plants, we know there are probably some incorrect or incomplete mappings. If you see a plant_id_pudl or utility_id_pudl mapping that you think is incorrect, please open an issue on our Github!

Note that the PUDL IDs may change over time. They are not guaranteed to be stable. If you need to find a particular plant or utility reliably, you should use its plant_id_eia, utility_id_eia, or utility_id_ferc1.

Another note about these id’s: these id’s map our definition of plants, which is not the most granular level of plant unit. The generators are typically the smaller, more interesting unit. FERC does not typically report in units (although it sometimes does), but it does often break up gas units from coal units. EIA reports on the generator and boiler level. When trying to use these PUDL id’s, consider the granularity that you desire and the potential implications of using a co-located set of plant infrastructure as an id.

Module Contents#

Functions#

get_plant_map(→ pandas.DataFrame)

Read in the manual FERC to EIA plant mapping data.

get_utility_map(→ pandas.DataFrame)

Read in the manual FERC to EIA utility mapping data.

get_db_plants_ferc1(→ pandas.DataFrame)

Pull a dataframe of all plants in the FERC Form 1 DB for the given years.

get_mapped_plants_ferc1(→ pandas.DataFrame)

Generate a dataframe containing all previously mapped FERC 1 plants.

get_mapped_utils_ferc1()

Read in the list of manually mapped utilities for FERC Form 1.

get_unmapped_plants_ferc1(→ pandas.DataFrame)

Generate a DataFrame of all unmapped FERC plants in the given years.

get_unmapped_utils_ferc1(ferc1_engine)

Generate a list of as-of-yet unmapped utilities from the FERC Form 1 DB.

get_db_plants_eia(pudl_engine)

Get a list of all EIA plants appearing in the PUDL DB.

get_mapped_plants_eia()

Get a list of all EIA plants that have been assigned PUDL Plant IDs.

get_unmapped_plants_eia(pudl_engine)

Identify any as-of-yet unmapped EIA Plants.

get_lost_plants_eia(pudl_engine)

Identify any EIA plants which were mapped, but then lost from the DB.

get_db_utils_eia(pudl_engine)

Get a list of all EIA Utilities appearing in the PUDL DB.

get_utility_most_recent_capacity(→ pandas.DataFrame)

Get a list of all utilities' most recent total net capacity of their generators.

get_mapped_utils_eia(→ pandas.DataFrame)

Get a list of all the EIA Utilities that have PUDL IDs.

get_unmapped_utils_eia(→ pandas.DataFrame)

Get a list of all the EIA Utilities in the PUDL DB without PUDL IDs.

get_unmapped_utils_with_plants_eia(pudl_engine)

Get all EIA Utilities that lack PUDL IDs but have plants/ownership.

get_lost_utils_eia(pudl_engine)

Get a list of all mapped EIA Utilites not found in the PUDL DB.

glue([ferc1, eia])

Generates a dictionary of dataframes for glue tables between FERC1, EIA.

Attributes#

pudl.glue.ferc1_eia.logger[source]#
pudl.glue.ferc1_eia.DATA_TABLES_EIA923 :list[str] = ['boiler_fuel_eia923', 'fuel_receipts_costs_eia923', 'generation_eia923',...[source]#
pudl.glue.ferc1_eia.get_plant_map() pandas.DataFrame[source]#

Read in the manual FERC to EIA plant mapping data.

pudl.glue.ferc1_eia.get_utility_map() pandas.DataFrame[source]#

Read in the manual FERC to EIA utility mapping data.

pudl.glue.ferc1_eia.get_db_plants_ferc1(pudl_settings: dict[str, str], years: collections.abc.Iterable[int]) pandas.DataFrame[source]#

Pull a dataframe of all plants in the FERC Form 1 DB for the given years.

This function looks in the f1_steam, f1_gnrt_plant, f1_hydro and f1_pumped_storage tables, and generates a dataframe containing every unique combination of respondent_id (utility_id_ferc1) and plant_name is finds. Also included is the capacity of the plant in MW (as reported in the raw FERC Form 1 DB), the respondent_name (utility_name_ferc1) and a column indicating which of the plant tables the record came from. Plant and utility names are translated to lowercase, with leading and trailing whitespace stripped and repeating internal whitespace compacted to a single space.

This function is primarily meant for use generating inputs into the manual mapping of FERC to EIA plants with PUDL IDs.

Parameters:
  • pudl_settings – Dictionary containing various paths and database URLs used by PUDL.

  • years – Years for which plants should be compiled.

Returns:

A dataframe containing columns utility_id_ferc1, utility_name_ferc1, plant_name, capacity_mw, and plant_table. Each row is a unique combination of utility_id_ferc1 and plant_name.

pudl.glue.ferc1_eia.get_mapped_plants_ferc1() pandas.DataFrame[source]#

Generate a dataframe containing all previously mapped FERC 1 plants.

Many plants are reported in FERC Form 1 with different versions of the same name in different years. Because FERC provides no unique ID for plants, these names must be used as part of their identifier. We manually curate a list of all the versions of plant names which map to the same actual plant. In order to identify new plants each year, we have to compare the new plant names and respondent IDs against this raw mapping, not the contents of the PUDL data, since within PUDL we use one canonical name for the plant. This function pulls that list of various plant names and their corresponding utilities (both name and ID) for use in identifying which plants have yet to be mapped when we are integrating new data.

Parameters:

None

Returns:

plant_name, utility_id_ferc1, and utility_name_ferc1. Each row represents a unique combination of utility_id_ferc1 and plant_name.

Return type:

A DataFrame with three columns

pudl.glue.ferc1_eia.get_mapped_utils_ferc1()[source]#

Read in the list of manually mapped utilities for FERC Form 1.

Unless a new utility has appeared in the database, this should be identical to the full list of utilities available in the FERC Form 1 database.

Parameters:

None

Returns:

pandas.DataFrame

pudl.glue.ferc1_eia.get_unmapped_plants_ferc1(pudl_settings: dict[str, str], years: collections.abc.Iterable[int]) pandas.DataFrame[source]#

Generate a DataFrame of all unmapped FERC plants in the given years.

Pulls all plants from the FERC Form 1 DB for the given years, and compares that list against the already mapped plants. Any plants found in the database but not in the list of mapped plants are returned.

Parameters:
  • pudl_settings – Dictionary containing various paths and database URLs used by PUDL.

  • years – Years for which plants should be compiled from the raw FERC Form 1 DB.

Returns:

utility_id_ferc1, utility_name_ferc1, plant_name, capacity_mw, and plant_table. Each row is a unique combination of utility_id_ferc1 and plant_name, which appears in the FERC Form 1 DB, but not in the list of manually mapped plants.

Return type:

A dataframe containing five columns

pudl.glue.ferc1_eia.get_unmapped_utils_ferc1(ferc1_engine)[source]#

Generate a list of as-of-yet unmapped utilities from the FERC Form 1 DB.

Find any utilities which do exist in the cloned FERC Form 1 DB, but which do not show up in the already mapped FERC respondents.

Parameters:

ferc1_engine (sqlalchemy.engine.Engine) – A database connection engine for the cloned FERC Form 1 DB.

Returns:

with columns “utility_id_ferc1” and “utility_name_ferc1”

Return type:

pandas.DataFrame

pudl.glue.ferc1_eia.get_db_plants_eia(pudl_engine)[source]#

Get a list of all EIA plants appearing in the PUDL DB.

This list of plants is used to determine which plants need to be added to the FERC 1 / EIA plant mappings, where we assign PUDL Plant IDs. Unless a new year’s worth of data has been added to the PUDL DB, but the plants have not yet been mapped, all plants in the PUDL DB should also appear in the plant mappings. It only makes sense to run this with a connection to a PUDL DB that has all the EIA data in it.

Parameters:

pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for connecting to a PUDL SQLite database.

Returns:

A DataFrame with plant_id_eia, plant_name_eia, and state columns, for addition to the FERC 1 / EIA plant mappings.

Return type:

pandas.DataFrame

pudl.glue.ferc1_eia.get_mapped_plants_eia()[source]#

Get a list of all EIA plants that have been assigned PUDL Plant IDs.

Read in the list of already mapped EIA plants from the FERC 1 / EIA plant and utility mapping spreadsheet kept in the package_data.

Parameters:

None

Returns:

A DataFrame listing the plant_id_eia and plant_name_eia values for every EIA plant which has already been assigned a PUDL Plant ID.

Return type:

pandas.DataFrame

pudl.glue.ferc1_eia.get_unmapped_plants_eia(pudl_engine)[source]#

Identify any as-of-yet unmapped EIA Plants.

pudl.glue.ferc1_eia.get_lost_plants_eia(pudl_engine)[source]#

Identify any EIA plants which were mapped, but then lost from the DB.

pudl.glue.ferc1_eia.get_db_utils_eia(pudl_engine)[source]#

Get a list of all EIA Utilities appearing in the PUDL DB.

pudl.glue.ferc1_eia.get_utility_most_recent_capacity(pudl_engine) pandas.DataFrame[source]#

Get a list of all utilities’ most recent total net capacity of their generators.

pudl.glue.ferc1_eia.get_mapped_utils_eia() pandas.DataFrame[source]#

Get a list of all the EIA Utilities that have PUDL IDs.

pudl.glue.ferc1_eia.get_unmapped_utils_eia(pudl_engine: sqlalchemy.engine.Engine, data_tables_eia923: list[str] = DATA_TABLES_EIA923) pandas.DataFrame[source]#

Get a list of all the EIA Utilities in the PUDL DB without PUDL IDs.

Identify any EIA Utility that appears in the data but does not have a utility_id_pudl associated with it in our ID mapping spreadsheet. Label some of those utilities for potential linkage to FERC 1 utilities, but only if they have plants which report data somewhere in the EIA-923 data tables. For those utilites that do have plants reporting in EIA-923, sum up the total capacity of all of their plants and include that in the output dataframe so that we can effectively prioritize mapping them.

pudl.glue.ferc1_eia.get_unmapped_utils_with_plants_eia(pudl_engine)[source]#

Get all EIA Utilities that lack PUDL IDs but have plants/ownership.

pudl.glue.ferc1_eia.get_lost_utils_eia(pudl_engine)[source]#

Get a list of all mapped EIA Utilites not found in the PUDL DB.

pudl.glue.ferc1_eia.glue(ferc1=False, eia=False)[source]#

Generates a dictionary of dataframes for glue tables between FERC1, EIA.

That data is primarily stored in the plant_output and utility_output tabs of package_data/glue/pudl_id_mapping.xlsx in the repository. There are a total of seven relations described in this data:

  • utilities: Unique id and name for each utility for use across the PUDL DB.

  • plants: Unique id and name for each plant for use across the PUDL DB.

  • utilities_eia: EIA operator ids and names attached to a PUDL utility id.

  • plants_eia: EIA plant ids and names attached to a PUDL plant id.

  • utilities_ferc: FERC respondent ids & names attached to a PUDL utility id.

  • plants_ferc: A combination of FERC plant names and respondent ids, associated with a PUDL plant ID. This is necessary because FERC does not provide plant ids, so the unique plant identifier is a combination of the respondent id and plant name.

  • utility_plant_assn: An association table which describes which plants have relationships with what utilities. If a record exists in this table then combination of PUDL utility id & PUDL plant id does have an association of some kind. The nature of that association is somewhat fluid, and more scrutiny will likely be required for use in analysis.

Presently, the ‘glue’ tables are a very basic piece of infrastructure for the PUDL DB, because they contain the primary key fields for utilities and plants in FERC1.

Parameters:
  • ferc1 (bool) – Are we ingesting FERC Form 1 data?

  • eia (bool) – Are we ingesting EIA data?

Returns:

a dictionary of glue table DataFrames

Return type:

dict