pudl.transform.eia923

Module to perform data cleaning functions on EIA923 data tables.

Module Contents

Functions

_get_plant_nuclear_unit_id_map(nuc_fuel: pandas.DataFrame) → Dict[int, str]

Get a plant_id -> nuclear_unit_id mapping for all plants with one nuclear unit.

_backfill_nuclear_unit_id(nuc_fuel: pandas.DataFrame) → pandas.DataFrame

Backfill 2001 and 2002 nuclear_unit_id for plants with one nuclear unit.

_get_plant_prime_mover_map(gen_fuel: pandas.DataFrame) → Dict[int, str]

Get a plant_id -> prime_mover_code mapping for all plants with one prime mover.

_backfill_prime_mover_code(gen_fuel: pandas.DataFrame) → pandas.DataFrame

Backfill 2001 and 2002 prime_mover_code for plants with one prime mover.

_get_most_frequent_energy_source_map(gen_fuel: pandas.DataFrame) → Dict[str, str]

Get the a mapping of the most common energy_source for each fuel_type_code_aer.

_clean_gen_fuel_energy_sources(gen_fuel: pandas.DataFrame) → pandas.DataFrame

Clean the generator_fuel_eia923.energy_source_code field specifically.

_aggregate_generation_fuel_duplicates(gen_fuel: pandas.DataFrame, nuclear: bool = False) → pandas.DataFrame

Aggregate remaining duplicate generation fuels.

_yearly_to_monthly_records(df: pandas.DataFrame) → pandas.DataFrame

Converts an EIA 923 record of 12 months of data into 12 monthly records.

_coalmine_cleanup(cmi_df: pandas.DataFrame) → pandas.DataFrame

Clean up the coalmine_eia923 table.

plants(eia923_dfs, eia923_transformed_dfs)

Transforms the plants_eia923 table.

gen_fuel_nuclear(gen_fuel_nuke: pandas.DataFrame) → pandas.DataFrame

Transforms the generation_fuel_nuclear_eia923 table.

generation_fuel(eia923_dfs, eia923_transformed_dfs)

Transforms the generation_fuel_eia923 table.

_map_prime_mover_sets(prime_mover_set: numpy.ndarray) → str

Map unique prime mover combinations to a single prime mover code.

_aggregate_duplicate_boiler_fuel_keys(boiler_fuel_df: pandas.DataFrame) → pandas.DataFrame

Combine boiler_fuel rows with duplicate keys by aggregating them.

boiler_fuel(eia923_dfs, eia923_transformed_dfs)

Transforms the boiler_fuel_eia923 table.

generation(eia923_dfs, eia923_transformed_dfs)

Transforms the generation_eia923 table.

coalmine(eia923_dfs, eia923_transformed_dfs)

Transforms the coalmine_eia923 table.

fuel_receipts_costs(eia923_dfs, eia923_transformed_dfs)

Transforms the fuel_receipts_costs_eia923 dataframe.

transform(eia923_raw_dfs, eia923_settings: pudl.settings.Eia923Settings = Eia923Settings())

Transforms all the EIA 923 tables.

Attributes

logger

COALMINE_COUNTRY_CODES

A mapping of EIA foreign coal mine country codes to 3-letter ISO-3166-1 codes.

pudl.transform.eia923.logger[source]
pudl.transform.eia923.COALMINE_COUNTRY_CODES :Dict[str, str][source]

A mapping of EIA foreign coal mine country codes to 3-letter ISO-3166-1 codes.

The EIA-923 lists the US state of origin for coal deliveries using standard 2-letter US state abbreviations. However, foreign countries are also included as “states” in this category and because some of them have 2-letter abbreviation collisions with US states, their coding is non-standard.

Instead of using the provided non-standard codes, we convert to the ISO-3166-1 three letter country codes: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3

pudl.transform.eia923._get_plant_nuclear_unit_id_map(nuc_fuel: pandas.DataFrame) Dict[int, str][source]

Get a plant_id -> nuclear_unit_id mapping for all plants with one nuclear unit.

Parameters

nuc_fuel – dataframe of nuclear unit fuels.

Returns

one to one mapping of plant_id_eia to nuclear_unit_id.

Return type

plant_to_nuc_id

pudl.transform.eia923._backfill_nuclear_unit_id(nuc_fuel: pandas.DataFrame) pandas.DataFrame[source]

Backfill 2001 and 2002 nuclear_unit_id for plants with one nuclear unit.

2001 and 2002 generation_fuel_eia923 records do not include nuclear_unit_id which is required for the primary key of nuclear_unit_fuel_eia923. We backfill this field for plants with one nuclear unit. nuclear_unit_id is filled with ‘UNK’ if the nuclear_unit_id can’t be recovered.

Parameters

nuc_fuel – nuclear fuels dataframe.

Returns

nuclear fuels dataframe with backfilled nuclear_unit_id field.

Return type

nuc_fuel

pudl.transform.eia923._get_plant_prime_mover_map(gen_fuel: pandas.DataFrame) Dict[int, str][source]

Get a plant_id -> prime_mover_code mapping for all plants with one prime mover.

Parameters

gen_fuel – dataframe of generation fuels.

Returns

one to one mapping of plant_id_eia to prime_mover_codes.

Return type

fuel_type_map

pudl.transform.eia923._backfill_prime_mover_code(gen_fuel: pandas.DataFrame) pandas.DataFrame[source]

Backfill 2001 and 2002 prime_mover_code for plants with one prime mover.

2001 and 2002 generation_fuel_eia923 records do not include prime_mover_code which is required for the primary key. We backfill this field for plants with one prime mover. prime_mover_code is set to ‘UNK’ if future plants have multiple prime movers.

Parameters

gen_fuel – generation fuels dataframe.

Returns

generation fuels dataframe with backfilled prime_mover_code field.

Return type

gen_fuel

pudl.transform.eia923._get_most_frequent_energy_source_map(gen_fuel: pandas.DataFrame) Dict[str, str][source]

Get the a mapping of the most common energy_source for each fuel_type_code_aer.

Parameters

gen_fuel – generation_fuel dataframe.

Returns

mapping of fuel_type_code_aer to energy_source_code.

Return type

energy_source_map

pudl.transform.eia923._clean_gen_fuel_energy_sources(gen_fuel: pandas.DataFrame) pandas.DataFrame[source]

Clean the generator_fuel_eia923.energy_source_code field specifically.

Transformations include:

  • Remap MSW to biogenic and non biogenic fuel types.

  • Fill missing energy_source_code using most common code for each AER fuel codes.

Parameters

gen_fuel – generation fuels dataframe.

Returns

generation fuels dataframe with cleaned energy_source_code field.

Return type

gen_fuel

pudl.transform.eia923._aggregate_generation_fuel_duplicates(gen_fuel: pandas.DataFrame, nuclear: bool = False) pandas.DataFrame[source]

Aggregate remaining duplicate generation fuels.

There are a handful of plants (< 100) whose prime_mover_code can’t be imputed or duplicates exist in the raw table. We resolve these be aggregate the variable fields.

Parameters
  • gen_fuel – generation fuels dataframe.

  • nuclear – adds nuclear_unit_id to list of natural key fields.

Returns

generation fuels dataframe without duplicates in natural key fields.

Return type

gen_fuel

pudl.transform.eia923._yearly_to_monthly_records(df: pandas.DataFrame) pandas.DataFrame[source]

Converts an EIA 923 record of 12 months of data into 12 monthly records.

Much of the data reported in EIA 923 is monthly, but all 12 months worth of data is reported in a single record, with one field for each of the 12 months. This function converts these annualized composite records into a set of 12 monthly records containing the same information, by parsing the field names for months, and adding a month field. Non - time series data is retained in the same format.

Parameters

df – A pandas DataFrame containing the annual data to be converted into monthly records.

Returns

A dataframe containing the same data as was passed in via df, but with monthly records as rows instead of as columns.

pudl.transform.eia923._coalmine_cleanup(cmi_df: pandas.DataFrame) pandas.DataFrame[source]

Clean up the coalmine_eia923 table.

This function does most of the coalmine_eia923 table transformation. It is separate from the coalmine() transform function because of the peculiar way that we are normalizing the fuel_receipts_costs_eia923() table.

All of the coalmine information is originally coming from the EIA fuel_receipts_costs spreadsheet, but it really belongs in its own table. We strip it out of FRC, and create that separate table, but then we need to refer to that table through a foreign key. To do so, we actually merge the entire contents of the coalmine table into FRC, including the surrogate key, and then drop the data fields.

For this to work, we need to have exactly the same coalmine data fields in both the new coalmine table, and the FRC table. To ensure that’s true, we isolate the transformations here in this function, and apply them to the coalmine columns in both the FRC table and the coalmine table.

Parameters

cmi_df – Coal mine information table (e.g. mine name, county, state)

Returns

A cleaned DataFrame containing coalmine information.

pudl.transform.eia923.plants(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the plants_eia923 table.

Much of the static plant information is reported repeatedly, and scattered across several different pages of EIA 923. The data frame that this function uses is assembled from those many different pages, and passed in via the same dictionary of dataframes that all the other ingest functions use for uniformity.

Transformations include:

  • Map full spelling onto code values.

  • Convert Y/N columns to booleans.

  • Remove excess white space around values.

  • Drop duplicate rows.

Parameters
  • eia923_dfs (dictionary of pandas.DataFrame) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA 923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923.gen_fuel_nuclear(gen_fuel_nuke: pandas.DataFrame) pandas.DataFrame[source]

Transforms the generation_fuel_nuclear_eia923 table.

Transformations include:

  • Backfill nuclear_unit_ids for 2001 and 2002.

  • Set all prime_mover_codes to ‘ST’.

  • Aggregate remaining duplicate units.

Parameters

gen_fuel_nuke – dataframe of nuclear unit fuels.

Returns

Transformed nuclear generation fuel table.

pudl.transform.eia923.generation_fuel(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the generation_fuel_eia923 table.

Transformations include:

  • Remove fields implicated elsewhere.

  • Replace . values with NA.

  • Remove rows with utility ids 99999.

  • Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.

  • Combine year and month columns into a single date column.

  • Clean and impute fuel_type field.

  • Backfill missing prime_mover_codes

  • Create a separate generation_fuel_nuclear table.

  • Aggregate records with duplicate natural keys.

Parameters
  • eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923._map_prime_mover_sets(prime_mover_set: numpy.ndarray) str[source]

Map unique prime mover combinations to a single prime mover code.

In 2001-2019 data, the .value_counts() of the combinations is: (CA, CT) 750 (ST, CA) 101 (ST) 60 (CA) 17 (CS, ST, CT) 2 :param prime_mover_set: unique combinations of prime_mover_code :type prime_mover_set: np.ndarray

Returns

single prime mover code

Return type

str

pudl.transform.eia923._aggregate_duplicate_boiler_fuel_keys(boiler_fuel_df: pandas.DataFrame) pandas.DataFrame[source]

Combine boiler_fuel rows with duplicate keys by aggregating them.

Boiler_fuel_eia923 contains a few records with duplicate keys, mostly caused by CA and CT parts of combined cycle plants being mapped to the same boiler ID. This is most likely a data entry error. See GitHub issue #852

One solution (implemented here) is to simply aggregate those records together. This is cheap and easy compared to the more thorough solution of making surrogate boiler IDs. Aggregation was preferred to purity due to the low volume of affected records (4.5% of combined cycle plants).

Parameters

boiler_fuel_df – the boiler_fuel dataframe

Returns

A copy of boiler_fuel dataframe with duplicates removed and aggregates appended.

pudl.transform.eia923.boiler_fuel(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the boiler_fuel_eia923 table.

Transformations include:

  • Remove fields implicated elsewhere.

  • Drop values with plant and boiler id values of NA.

  • Replace . values with NA.

  • Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.

  • Combine year and month columns into a single date column.

Parameters
  • eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages

from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923.generation(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the generation_eia923 table.

Transformations include:

  • Drop rows with NA for generator id.

  • Remove fields implicated elsewhere.

  • Replace . values with NA.

  • Drop generator-date row duplicates (all have no data).

Parameters
  • eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923.coalmine(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the coalmine_eia923 table.

Transformations include:

  • Remove fields implicated elsewhere.

  • Drop duplicates with MSHA ID.

Parameters
  • eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923.fuel_receipts_costs(eia923_dfs, eia923_transformed_dfs)[source]

Transforms the fuel_receipts_costs_eia923 dataframe.

Transformations include:

  • Remove fields implicated elsewhere.

  • Replace . values with NA.

  • Standardize codes values.

  • Fix dates.

  • Replace invalid mercury content values with NA.

Fuel cost is reported in cents per mmbtu. Converts cents to dollars.

Parameters
  • eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.

  • eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Returns

eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).

Return type

dict

pudl.transform.eia923.transform(eia923_raw_dfs, eia923_settings: pudl.settings.Eia923Settings = Eia923Settings())[source]

Transforms all the EIA 923 tables.

Parameters
  • eia923_raw_dfs (dict) – a dictionary of tab names (keys) and DataFrames (values). Generated from pudl.extract.eia923.extract().

  • settings – Object containing validated settings relevant to EIA 923. Contains the tables and years to be loaded into PUDL.

Returns

A dictionary of DataFrame with table names as keys and pandas.DataFrame objects as values, where the contents of the DataFrames correspond to cleaned and normalized PUDL database tables, ready for loading.

Return type

dict