pudl.transform.eia923¶
Module to perform data cleaning functions on EIA923 data tables.
Attributes¶
A mapping of EIA foreign coal mine country codes to 3-letter ISO-3166-1 codes. |
Functions¶
|
Get a plant_id -> nuclear_unit_id mapping for all plants with one nuclear unit. |
|
Backfill 2001 and 2002 nuclear_unit_id for plants with one nuclear unit. |
|
Get a plant_id -> prime_mover_code mapping for all plants with one prime mover. |
|
Backfill 2001 and 2002 prime_mover_code for plants with one prime mover. |
|
Get the a mapping of the most common energy_source for each fuel_type_code_agg. |
|
Clean the generator_fuel_eia923.energy_source_code field specifically. |
|
Aggregate remaining duplicate generation fuels. |
|
Converts an EIA 923 record of 12 months of data into 12 monthly records. |
|
Clean up the core_eia923__entity_coalmine table. |
|
Transforms the plants_eia923 table. |
|
Transforms the core_eia923__monthly_generation_fuel_nuclear table. |
Transforms the raw_eia923__generation_fuel table. |
|
|
Map unique prime mover combinations to a single prime mover code. |
|
Combine boiler_fuel rows with duplicate keys by aggregating them. |
|
Transforms the core_eia923__monthly_boiler_fuel table. |
|
Deduplicate on primary keys for core_eia923__monthly_boiler_fuel. |
|
Transforms the EIA 923 generation table. |
|
Transforms the raw_eia923__fuel_receipts_costs table. |
|
Transforms the eia923__fuel_receipts_costs dataframe. |
Transforms the eia923__cooling_system_information dataframe. |
|
We do not expect any columns to be completely null. |
|
|
Withdrawal should be equal to discharge + consumption. |
Check to see if columns vary as slowly as expected. |
|
|
Transforms the _core_eia923__fgd_operation_maintenance table. |
Check that columns other than expected columns aren't null. |
|
Opex costs should sum to opex_fgd_total_cost. |
|
|
Check to see if columns vary as slowly as expected. |
|
Transforms the eia923_energy_storage table. |
Module Contents¶
- pudl.transform.eia923.COALMINE_COUNTRY_CODES: dict[str, str][source]¶
A mapping of EIA foreign coal mine country codes to 3-letter ISO-3166-1 codes.
The EIA-923 lists the US state of origin for coal deliveries using standard 2-letter US state abbreviations. However, foreign countries are also included as “states” in this category and because some of them have 2-letter abbreviation collisions with US states, their coding is non-standard.
Instead of using the provided non-standard codes, we convert to the ISO-3166-1 three letter country codes: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
- pudl.transform.eia923._get_plant_nuclear_unit_id_map(nuc_fuel: pandas.DataFrame) dict[int, str] [source]¶
Get a plant_id -> nuclear_unit_id mapping for all plants with one nuclear unit.
- Parameters:
nuc_fuel – dataframe of nuclear unit fuels.
- Returns:
one to one mapping of plant_id_eia to nuclear_unit_id.
- Return type:
plant_to_nuc_id
- pudl.transform.eia923._backfill_nuclear_unit_id(nuc_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Backfill 2001 and 2002 nuclear_unit_id for plants with one nuclear unit.
2001 and 2002 core_eia923__monthly_generation_fuel records do not include nuclear_unit_id which is required for the primary key of nuclear_unit_fuel_eia923. We backfill this field for plants with one nuclear unit. nuclear_unit_id is filled with ‘UNK’ if the nuclear_unit_id can’t be recovered.
- Params:
nuc_fuel: nuclear fuels dataframe.
- Returns:
nuclear fuels dataframe with backfilled nuclear_unit_id field.
- Return type:
nuc_fuel
- pudl.transform.eia923._get_plant_prime_mover_map(gen_fuel: pandas.DataFrame) dict[int, str] [source]¶
Get a plant_id -> prime_mover_code mapping for all plants with one prime mover.
- Parameters:
gen_fuel – dataframe of generation fuels.
- Returns:
one to one mapping of plant_id_eia to prime_mover_codes.
- Return type:
fuel_type_map
- pudl.transform.eia923._backfill_prime_mover_code(gen_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Backfill 2001 and 2002 prime_mover_code for plants with one prime mover.
2001 and 2002 core_eia923__monthly_generation_fuel records do not include prime_mover_code which is required for the primary key. We backfill this field for plants with one prime mover. prime_mover_code is set to ‘UNK’ if future plants have multiple prime movers.
- Parameters:
gen_fuel – generation fuels dataframe.
- Returns:
generation fuels dataframe with backfilled prime_mover_code field.
- Return type:
gen_fuel
- pudl.transform.eia923._get_most_frequent_energy_source_map(gen_fuel: pandas.DataFrame) dict[str, str] [source]¶
Get the a mapping of the most common energy_source for each fuel_type_code_agg.
- Parameters:
gen_fuel – generation_fuel dataframe.
- Returns:
mapping of fuel_type_code_agg to energy_source_code.
- Return type:
energy_source_map
- pudl.transform.eia923._clean_gen_fuel_energy_sources(gen_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Clean the generator_fuel_eia923.energy_source_code field specifically.
Transformations include:
Remap MSW to biogenic and non biogenic fuel types.
Fill missing energy_source_code using most common code for each AER fuel codes.
- Parameters:
gen_fuel – generation fuels dataframe.
- Returns:
generation fuels dataframe with cleaned energy_source_code field.
- Return type:
gen_fuel
- pudl.transform.eia923._aggregate_generation_fuel_duplicates(gen_fuel: pandas.DataFrame, nuclear: bool = False) pandas.DataFrame [source]¶
Aggregate remaining duplicate generation fuels.
There are a handful of plants (< 100) whose prime_mover_code can’t be imputed or duplicates exist in the raw table. We resolve these be aggregate the variable fields.
- Parameters:
gen_fuel – generation fuels dataframe.
nuclear – adds nuclear_unit_id to list of natural key fields.
- Returns:
generation fuels dataframe without duplicates in natural key fields.
- Return type:
gen_fuel
- pudl.transform.eia923._yearly_to_monthly_records(df: pandas.DataFrame) pandas.DataFrame [source]¶
Converts an EIA 923 record of 12 months of data into 12 monthly records.
Much of the data reported in EIA 923 is monthly, but all 12 months worth of data is reported in a single record, with one field for each of the 12 months. This function converts these annualized composite records into a set of 12 monthly records containing the same information, by parsing the field names for months, and adding a month field. Non - time series data is retained in the same format.
- Parameters:
df – A pandas DataFrame containing the annual data to be converted into monthly records.
- Returns:
A dataframe containing the same data as was passed in via df, but with monthly records as rows instead of as columns.
- pudl.transform.eia923._coalmine_cleanup(cmi_df: pandas.DataFrame) pandas.DataFrame [source]¶
Clean up the core_eia923__entity_coalmine table.
This function does most of the core_eia923__entity_coalmine table transformation. It is separate from the coalmine() transform function because of the peculiar way that we are normalizing the ref:core_eia923__monthly_fuel_receipts_costs table.
All of the coalmine information is originally coming from the EIA fuel_receipts_costs spreadsheet, but it really belongs in its own table. We strip it out of FRC, and create that separate table, but then we need to refer to that table through a foreign key. To do so, we actually merge the entire contents of the coalmine table into FRC, including the surrogate key, and then drop the data fields.
For this to work, we need to have exactly the same coalmine data fields in both the new coalmine table, and the FRC table. To ensure that’s true, we isolate the transformations here in this function, and apply them to the coalmine columns in both the FRC table and the coalmine table.
- Parameters:
cmi_df – Coal mine information table (e.g. mine name, county, state)
- Returns:
A cleaned DataFrame containing coalmine information.
- pudl.transform.eia923.plants_eia923(eia923_dfs: dict[str, pandas.DataFrame], eia923_transformed_dfs: dict[str, pandas.DataFrame]) dict[str, pandas.DataFrame] [source]¶
Transforms the plants_eia923 table.
Much of the static plant information is reported repeatedly, and scattered across several different pages of EIA 923. The data frame that this function uses is assembled from those many different pages, and passed in via the same dictionary of dataframes that all the other ingest functions use for uniformity.
Transformations include:
Map full spelling onto code values.
Convert Y/N columns to booleans.
Remove excess white space around values.
Drop duplicate rows.
- Parameters:
eia923_dfs – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA 923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns:
A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- pudl.transform.eia923.gen_fuel_nuclear(gen_fuel_nuke: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the core_eia923__monthly_generation_fuel_nuclear table.
Transformations include:
Backfill nuclear_unit_ids for 2001 and 2002.
Set all prime_mover_codes to ‘ST’.
Aggregate remaining duplicate units.
- Parameters:
gen_fuel_nuke – dataframe of nuclear unit fuels.
- Returns:
Transformed nuclear generation fuel table.
- pudl.transform.eia923._core_eia923__pre_generation_fuel(raw_eia923__generation_fuel: pandas.DataFrame)[source]¶
Transforms the raw_eia923__generation_fuel table.
Transformations include:
Remove fields implicated elsewhere.
Replace . values with NA.
Remove rows with utility ids 99999.
Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.
Combine year and month columns into a single date column.
Clean and impute fuel_type field.
Backfill missing prime_mover_codes
Create a separate generation_fuel_nuclear table.
Aggregate records with duplicate natural keys.
- Parameters:
raw_eia923__generation_fuel – The raw
raw_eia923__generation_fuel
dataframe.- Returns:
Cleaned
eia923__generation_fuel
dataframe ready for harvesting. _core_eia923__generation_fuel_nuclear: Cleanedeia923__generation_fuel_nuclear
dataframe ready for harvesting.- Return type:
_core_eia923__generation_fuel
- pudl.transform.eia923._map_prime_mover_sets(prime_mover_set: numpy.ndarray) str [source]¶
Map unique prime mover combinations to a single prime mover code.
In 2001-2019 data, the .value_counts() of the combinations is: (CA, CT) 750 (ST, CA) 101 (ST) 60 (CA) 17 (CS, ST, CT) 2 :param prime_mover_set: unique combinations of prime_mover_code :type prime_mover_set: np.ndarray
- Returns:
single prime mover code
- Return type:
- pudl.transform.eia923._aggregate_duplicate_boiler_fuel_keys(boiler_fuel_df: pandas.DataFrame) pandas.DataFrame [source]¶
Combine boiler_fuel rows with duplicate keys by aggregating them.
core_eia923__monthly_boiler_fuel contains a few records with duplicate keys, mostly caused by CA and CT parts of combined cycle plants being mapped to the same boiler ID. This is most likely a data entry error. See GitHub issue #852
One solution (implemented here) is to simply aggregate those records together. This is cheap and easy compared to the more thorough solution of making surrogate boiler IDs. Aggregation was preferred to purity due to the low volume of affected records (4.5% of combined cycle plants).
- Parameters:
boiler_fuel_df – the boiler_fuel dataframe
- Returns:
A copy of boiler_fuel dataframe with duplicates removed and aggregates appended.
- pudl.transform.eia923._core_eia923__boiler_fuel(raw_eia923__boiler_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the core_eia923__monthly_boiler_fuel table.
Transformations include:
Remove fields implicated elsewhere.
Drop values with plant and boiler id values of NA.
Replace . values with NA.
Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.
Combine year and month columns into a single date column.
Drop duplicate rows with NA or 0 in all value columns.
Eventually we should truncate this table by the last year-month that was integrated. Right now all months get integrated for a given year, regardless of whether there’s data for them.
- Parameters:
raw_eia923__boiler_fuel – The raw
raw_eia923__boiler_fuel
dataframe.- Returns:
Cleaned
core_eia923__monthly_boiler_fuel
dataframe ready for harvesting.
- pudl.transform.eia923.remove_duplicate_pks_boiler_fuel_eia923(bf: pandas.DataFrame) pandas.DataFrame [source]¶
Deduplicate on primary keys for core_eia923__monthly_boiler_fuel.
There are a relatively small number of records ~5% from the boiler fuel table that have duplicate records based on what we believe is this table’s primary keys. Fortunately, all of these duplicates have at least one records w/ only zeros and or nulls. So this method drops only the records which have duplicate pks and only have zeros or nulls in the non-primary key columns.
Note: There are 4 boilers in 2021 that are being dropped entirely during this cleaning. They have BOTH duplicate pks and only have zeros or nulls in the non-primary key columns. We could choose to preserve all instances of the pks even after
drop_invalid_rows()
or only dropping one when there are two. We chose to leave this be because it was minor and these boilers show up in other years. See comment for more details.
- pudl.transform.eia923._core_eia923__generation(raw_eia923__generator: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the EIA 923 generation table.
Transformations include:
Drop rows with NA for generator id.
Remove fields implicated elsewhere.
Replace . values with NA.
Drop generator-date row duplicates (all have no data).
- Parameters:
raw_eia923__generator – The raw
raw_eia923__generator
dataframe.- Returns:
Cleaned
_core_eia923__generation
dataframe ready for harvesting.
- pudl.transform.eia923._core_eia923__coalmine(raw_eia923__fuel_receipts_costs: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the raw_eia923__fuel_receipts_costs table.
Transformations include:
Remove fields implicated elsewhere.
Drop duplicates with MSHA ID.
- Parameters:
raw_eia923__fuel_receipts_costs – raw precursor to the core_eia923__monthly_fuel_receipts_costs table.
- Returns:
Cleaned
_core_eia923__coalmine
dataframe ready for harvesting.
- pudl.transform.eia923._core_eia923__fuel_receipts_costs(raw_eia923__fuel_receipts_costs: pandas.DataFrame, _core_eia923__coalmine: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the eia923__fuel_receipts_costs dataframe.
Transformations include:
Remove fields implicated elsewhere.
Replace . values with NA.
Standardize codes values.
Fix dates.
Replace invalid mercury content values with NA.
Fuel cost is reported in cents per mmbtu. Converts cents to dollars.
- Parameters:
raw_eia923__fuel_receipts_costs – The raw
raw_eia923__fuel_receipts_costs
dataframe._core_eia923__coalmine – The cleaned pre-harvest EIA 923 coal mine dataframe.
- Returns:
Cleaned
eia923__fuel_receipts_costs
dataframe ready for harvesting.
- pudl.transform.eia923._core_eia923__cooling_system_information(raw_eia923__cooling_system_information: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the eia923__cooling_system_information dataframe.
Applies typical NA conversion and date conversion.
As of 2024-02-28: 2008 and 2009 only have annual rates, but the “_rate_gallons_per_minute” values are otherwise monthly; we leave them NA for 2008-2009 to avoid confusion.
As of 2024-02-28: In 2008 and 2009 the rate columns are labeled as “0.1 cubic feet per second”; In 2013 they change it to gallons/min.
If taken to mean that the earlier unit is literally deci-cubic-feet per second, we find that all these rates jump 10x when we hit the 2013 data; so we interpret “0.1 cubic feet per second” to mean “cubic feet per second, with precision of 0.1 cfs.”
- pudl.transform.eia923.cooling_system_information_null_check(csi)[source]¶
We do not expect any columns to be completely null.
In fast-ETL context (only recent years), the annual columns may also be completely null.
- pudl.transform.eia923.cooling_system_information_withdrawal_discrepancy_check(csi)[source]¶
Withdrawal should be equal to discharge + consumption.
To allow for some data quality errors we assert that withdrawal ~= discharge + consumption at least 90% of the time (with a 1% acceptable discrepancy).
- pudl.transform.eia923.cooling_system_information_continuity(csi)[source]¶
Check to see if columns vary as slowly as expected.
- pudl.transform.eia923._core_eia923__fgd_operation_maintenance(raw_eia923__fgd_operation_maintenance: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the _core_eia923__fgd_operation_maintenance table.
Transformations include:
Drop values with plant and boiler id values of NA.
Replace . values with NA.
Convert thousands of dollars to dollars.
Fix datetimes for SO2 test dates.
Ensure a unique primary key and drop some duplicated rows.
- Parameters:
raw_eia923__fgd_operation_maintenance – The raw
raw_eia923__fgd_operation_maintenance
dataframe.- Returns:
Cleaned
_core_eia923__fgd_operation_maintenance
dataframe ready for harvesting.
- pudl.transform.eia923.fgd_operation_maintenance_null_check(fgd)[source]¶
Check that columns other than expected columns aren’t null.
- pudl.transform.eia923.fgd_cost_discrepancy_check(fgd)[source]¶
Opex costs should sum to opex_fgd_total_cost.
To allow for some data quality errors we assert that costs ~= total cost at least 99% of the time (with a 1% acceptable discrepancy).
- pudl.transform.eia923.fgd_continuity_check(fgd)[source]¶
Check to see if columns vary as slowly as expected.
- pudl.transform.eia923._core_eia923__energy_storage(raw_eia923__energy_storage: pandas.DataFrame) pandas.DataFrame [source]¶
Transforms the eia923_energy_storage table.
Transformations include:
Replace . values with NA.
Clean up
fuel_unit
strings.Make wide monthly columns into tall monthly columns.
Convert date to month.
Encode relevant columns.
Other cleaning that could be done:
Come up with an encoder for
fuel_unit
(tricky because different between FERC and EIA).