pudl.output.eia¶
A collection of denormalized EIA assets.
Attributes¶
Functions¶
out_eia__yearly_utilities – Pull all fields from the EIA Utilities table.
out_eia__yearly_plants – Pull all fields from the EIA Plants tables.
_out_eia__yearly_generators – Pull all fields from the EIA generators tables.
out_eia__yearly_boilers – Pull all fields reported in the EIA boilers tables.
_out_eia__plants_utilities – Create a dataframe of plant and utility IDs and names from EIA 860.
add_consistent_ba_code_column – Make a column containing each plant's most consistently reported BA code.
fill_in_missing_ba_codes – Fill in missing balancing_authority_code_eia values.
fill_generator_technology_description – Fill in missing technology_description based on unique mapping & backfilling.
assign_unit_ids – Group generators into operational units using various heuristics.
fill_unit_ids – Back and forward fill Unit IDs for each plant / gen combination.
max_unit_id_by_plant – Identify the largest unit ID associated with each plant so we don't overlap.
_append_masked_units – Replace rows with new PUDL Unit IDs in the original dataframe.
assign_single_gen_unit_ids – Assign a unique PUDL Unit ID to each generator of a given prime mover type.
assign_cc_unit_ids – Assign PUDL Unit IDs for combined cycle generation units.
assign_prime_fuel_unit_ids – Assign a PUDL Unit ID to all generators with a given prime mover and fuel.
Module Contents¶
- pudl.output.eia.out_eia__yearly_utilities(core_eia__entity_utilities: pandas.DataFrame, core_eia860__scd_utilities: pandas.DataFrame, core_pudl__assn_eia_pudl_utilities: pandas.DataFrame) pandas.DataFrame [source]¶
Pull all fields from the EIA Utilities table.
- Parameters:
core_eia__entity_utilities – EIA utility entity table.
core_eia860__scd_utilities – EIA 860 annual utility table.
core_pudl__assn_eia_pudl_utilities – Associations between EIA utilities and PUDL utility IDs.
- Returns:
A DataFrame containing utility attributes from EIA Forms 860 and 923.
- pudl.output.eia.out_eia__yearly_plants(core_eia__entity_plants: pandas.DataFrame, core_eia860__scd_plants: pandas.DataFrame, core_pudl__assn_eia_pudl_plants: pandas.DataFrame, core_pudl__assn_eia_pudl_utilities: pandas.DataFrame) pandas.DataFrame [source]¶
Pull all fields from the EIA Plants tables.
- Parameters:
core_eia__entity_plants – EIA plant entity table.
core_eia860__scd_plants – EIA 860 annual plant attribute table.
core_pudl__assn_eia_pudl_plants – Associations between EIA plants and PUDL plant IDs.
core_pudl__assn_eia_pudl_utilities – Associations between EIA utilities and PUDL utility IDs.
- Returns:
A DataFrame containing plant attributes from EIA Forms 860 and 923.
- pudl.output.eia._out_eia__yearly_generators(context, core_eia860__scd_generators: pandas.DataFrame, core_eia__entity_generators: pandas.DataFrame, core_eia__entity_plants: pandas.DataFrame, _out_eia__plants_utilities: pandas.DataFrame, core_eia860__assn_boiler_generator: pandas.DataFrame) pandas.DataFrame [source]¶
Pull all fields from the EIA generators tables.
- Parameters:
context – A Dagster context object.
core_eia860__scd_generators – EIA 860 annual generator table.
core_eia__entity_generators – EIA generators entity table.
core_eia__entity_plants – EIA plant entity table.
_out_eia__plants_utilities – Denormalized plant_utility EIA ID table.
core_eia860__assn_boiler_generator – Associations between EIA boiler and generator IDs.
- Returns:
A DataFrame containing all the fields of the EIA 860 generators table.
- pudl.output.eia.out_eia__yearly_boilers(core_eia860__scd_boilers: pandas.DataFrame, core_eia__entity_boilers: pandas.DataFrame, core_eia__entity_plants: pandas.DataFrame, _out_eia__plants_utilities: pandas.DataFrame, core_eia860__assn_boiler_generator: pandas.DataFrame) pandas.DataFrame [source]¶
Pull all fields reported in the EIA boilers tables.
Merge in other useful fields including the latitude & longitude of the plant that the boilers are part of, canonical plant & operator names and the PUDL IDs of the plant and operator, for merging with other PUDL data sources.
- Parameters:
core_eia860__scd_boilers – EIA 860 annual boiler table.
core_eia__entity_boilers – EIA boiler entity table.
core_eia__entity_plants – EIA plant entity table.
_out_eia__plants_utilities – Denormalized plant_utility EIA ID table.
core_eia860__assn_boiler_generator – Associations between EIA boiler and generator IDs.
- Returns:
A DataFrame containing boiler attributes from EIA 860.
- pudl.output.eia._out_eia__plants_utilities(out_eia__yearly_plants: pandas.DataFrame, out_eia__yearly_utilities: pandas.DataFrame) pandas.DataFrame [source]¶
Create a dataframe of plant and utility IDs and names from EIA 860.
Returns a pandas dataframe with the following columns:
report_date (in which data was reported)
plant_name_eia (from EIA entity)
plant_id_eia (from EIA entity)
plant_id_pudl
utility_id_eia (from EIA860)
utility_name_eia (from EIA860)
utility_id_pudl
- Parameters:
out_eia__yearly_plants – Denormalized EIA plants table.
out_eia__yearly_utilities – Denormalized EIA utilities table.
- Returns:
A DataFrame containing plant and utility IDs and names from EIA 860.
- pudl.output.eia.add_consistent_ba_code_column(plants: pandas.DataFrame) pandas.DataFrame [source]¶
Make a column containing each plant’s most consistently reported BA code.
Employ the harvesting function occurrence_consistency(), which determines how consistent the values in a table are across all records within each plant. This function grabs only the values determined to be at least 70% consistent and merges them onto the plants table as a new column: balancing_authority_code_eia_consistent.
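The consistency idea can be illustrated with plain pandas. This is only a sketch, not the occurrence_consistency() implementation: the helper name most_consistent_value, the sample data, and the choice to measure consistency as a share of non-null records per plant are all assumptions made for illustration.

```python
import pandas as pd


def most_consistent_value(df, group_col, value_col, threshold=0.7):
    """Return, per group, the value found in at least `threshold` of non-null records.

    Hypothetical helper: a stand-in for the harvesting logic, not the PUDL function.
    """
    non_null = df.dropna(subset=[value_col])
    # Count how often each value occurs within each group.
    counts = (
        non_null.groupby([group_col, value_col])
        .size()
        .rename("n")
        .reset_index()
    )
    # Convert the counts into a share of the group's non-null records.
    counts["share"] = counts["n"] / counts.groupby(group_col)["n"].transform("sum")
    # With a threshold above 0.5, at most one value per group can qualify.
    winners = counts.loc[counts["share"] >= threshold, [group_col, value_col]]
    return winners.rename(columns={value_col: f"{value_col}_consistent"})


# Made-up records: plant 1 reports SWPP in 3 of 4 years, plant 2 is split evenly.
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 2, 2, 2, 2],
        "balancing_authority_code_eia": [
            "SWPP", "SWPP", "SWPP", "MISO", "PJM", "PJM", "MISO", "MISO",
        ],
    }
)
consistent = most_consistent_value(
    plants, "plant_id_eia", "balancing_authority_code_eia"
)
plants = plants.merge(consistent, on="plant_id_eia", how="left")
```

In this sketch, plant 1 receives a consistent code because one value covers 75% of its records, while plant 2 is left null because no value reaches the 70% threshold.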
- pudl.output.eia.fill_in_missing_ba_codes(plants: pandas.DataFrame) pandas.DataFrame [source]¶
Fill in missing balancing_authority_code_eia values.
Balancing authority codes did not begin being reported until 2013. This function fills in the earlier years with BA codes using two main methods:
Backfilling with the oldest reported BA code for each plant.
Backfilling with the most frequently reported BA code for each plant.
We add a column to represent each of these two methodologies via add_backfilled_ba_code_column() and add_consistent_ba_code_column().
We know that the BA codes do change over time and are incorrectly reported at times. This means we can’t simply pd.fillna() using either the oldest or most consistently reported values. This function employs several filling methods based on our investigation of the data:
If the oldest code and the most consistent code are the same, use the consistent value (either would work!).
Use the oldest code for plants that have SWPP (Southwest Power Pool) as their most consistent BA code, because we know SWPP has acquired many smaller balancing authorities in recent years.
Use the oldest code for plants that have NWMT (NorthWestern Energy) as their most consistent BA code and WAUE (Western Area Power Administration) as their oldest BA code.
Use the oldest code for plants that have more than one year of older BA codes, using the assumption that more than one year of consistent old BA codes is not a reporting error.
- Parameters:
plants – Table of annual plant attributes, including balancing_authority_code_eia.
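The four filling rules can be sketched as a pair of boolean masks. This is an illustration under stated assumptions rather than the actual implementation: the helper name fill_ba_codes, the column names ba_code_oldest and years_of_oldest_code, and the placeholder codes such as OLD1 are hypothetical, and the sketch assumes that records matching none of the rules are left unfilled.

```python
import pandas as pd


def fill_ba_codes(plants):
    """Fill missing BA codes following the four rules described above (sketch)."""
    oldest = plants["ba_code_oldest"]  # hypothetical column name
    consistent = plants["balancing_authority_code_eia_consistent"]

    # Rule 1: both candidates agree, so either one can be used.
    same = oldest == consistent
    # Rules 2-4: cases where the oldest code is preferred.
    use_oldest = ~same & (
        (consistent == "SWPP")
        | ((consistent == "NWMT") & (oldest == "WAUE"))
        | (plants["years_of_oldest_code"] > 1)  # hypothetical column name
    )
    # Assumption: rows matching no rule keep a null fill value.
    fill_value = consistent.where(same).mask(use_oldest, oldest)
    # Only records with no reported code are filled.
    return plants["balancing_authority_code_eia"].fillna(fill_value)


# Made-up records; OLD*/NEW* are placeholder codes, not real balancing authorities.
plants = pd.DataFrame(
    {
        "balancing_authority_code_eia": [None, None, None, None, None, "PJM"],
        "ba_code_oldest": ["MISO", "OLD1", "WAUE", "OLD2", "OLD3", "OLD4"],
        "balancing_authority_code_eia_consistent": [
            "MISO", "SWPP", "NWMT", "NEW2", "NEW3", "NEW4",
        ],
        "years_of_oldest_code": [1, 1, 1, 3, 1, 3],
    }
)
filled = fill_ba_codes(plants)
```

The fifth record stays null because none of the rules apply to it, and the last record keeps its reported value because only missing codes are filled.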
- pudl.output.eia.fill_generator_technology_description(gens_df: pandas.DataFrame) pandas.DataFrame [source]¶
Fill in missing technology_description based on unique mapping & backfilling.
Prior to 2014, the EIA 860 did not report technology_description. This function first fills in missing values using the consistent, unique mappings that are observed between energy_source_code_1, prime_mover_code and technology_type across all years and generators. It then backfills those early years within groups defined by plant_id_eia, generator_id, energy_source_code_1 and prime_mover_code. As a result, more than 95% of all generator records end up having a technology_description associated with them.
- Parameters:
gens_df – A core_eia860__scd_generators dataframe containing at least the columns report_date, plant_id_eia, generator_id, energy_source_code_1, and technology_description.
- Returns:
A copy of the input dataframe, with technology_description filled in.
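The two-stage fill can be sketched in pandas as follows. This is a minimal illustration, not the PUDL implementation: the helper name fill_tech_desc and the sample records are made up, and the sketch assumes the input has a default integer index.

```python
import pandas as pd

KEYS = ["energy_source_code_1", "prime_mover_code"]


def fill_tech_desc(gens):
    """Fill technology_description via unique mappings, then backfill (sketch)."""
    # Stage 1: keep only key combinations that map to exactly one description.
    known = gens.dropna(subset=["technology_description"])
    n_desc = known.groupby(KEYS)["technology_description"].transform("nunique")
    mapping = (
        known.loc[n_desc == 1, KEYS + ["technology_description"]]
        .drop_duplicates()
        .rename(columns={"technology_description": "mapped"})
    )
    out = gens.merge(mapping, on=KEYS, how="left")
    out["technology_description"] = out["technology_description"].fillna(out["mapped"])
    out = out.drop(columns="mapped")

    # Stage 2: backfill earlier years within each generator / fuel / prime mover group.
    out = out.sort_values("report_date")
    groups = ["plant_id_eia", "generator_id"] + KEYS
    out["technology_description"] = out.groupby(groups)["technology_description"].bfill()
    return out.sort_index()


# Illustrative values only, not real EIA records.
gens = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2012-01-01", "2014-01-01", "2012-01-01", "2012-01-01",
             "2014-01-01", "2015-01-01", "2012-01-01"]
        ),
        "plant_id_eia": [1, 1, 2, 3, 3, 4, 5],
        "generator_id": ["A", "A", "B", "C", "C", "D", "E"],
        "energy_source_code_1": ["NG", "NG", "NG", "WAT", "WAT", "WAT", "WAT"],
        "prime_mover_code": ["CT", "CT", "CT", "HY", "HY", "HY", "HY"],
        "technology_description": [
            None, "Natural Gas Fired Combustion Turbine", None, None,
            "Conventional Hydroelectric", "Hydro Storage", None,
        ],
    }
)
filled = fill_tech_desc(gens)
```

Here the NG / CT combination maps to a single description, so it fills every NG / CT record. The WAT / HY combination is ambiguous in the sample, so generator C is filled only by backfilling from its own later record, and generator E stays null.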
- pudl.output.eia.assign_unit_ids(gens_df: pandas.DataFrame) pandas.DataFrame [source]¶
Group generators into operational units using various heuristics.
Splits a few columns off from the big generator dataframe and uses several heuristic functions to fill in missing unit_id_pudl values beyond those that are generated in the boiler generator association process. Then merges the new unit ID values back into the generators dataframe.
- Parameters:
gens_df – An EIA generator table. Must contain at least the columns: report_date, plant_id_eia, generator_id, unit_id_pudl, bga_source, fuel_type_code_pudl, prime_mover_code.
- Returns:
A dataframe that should differ from the input only in that some NA values in the unit_id_pudl and bga_source columns have been filled in with real values.
- Raises:
ValueError – If the input dataframe is missing required columns.
ValueError – If any generator is associated with more than one unit_id_pudl.
AssertionError – If row or column indices are changed.
AssertionError – If pre-existing unit_id_pudl or bga_source values are altered.
AssertionError – If contents of any other columns are altered at all.
- pudl.output.eia.fill_unit_ids(gens_df: pandas.DataFrame) pandas.DataFrame [source]¶
Back and forward fill Unit IDs for each plant / gen combination.
This routine assumes that the mapping of generators to units is constant over time, and extends those mappings into years where no boilers have been reported – since in the BGA we can only connect generators to each other if they are both connected to a boiler.
Prior to 2014, combined cycle units didn’t report any “boilers” but in later years, they have been given “boilers” that correspond to their generators, so that all of their fuel consumption is recorded alongside that of other types of generators.
The bga_source field is set to “bfill_units” for those that were backfilled, and “ffill_units” for those that were forward filled.
Note: We could back/forward fill the boiler IDs prior to the BGA process and we ought to get consistent units across all the years that are the same as what we fill in here. We could also back/forward fill boiler IDs and Unit IDs after the fact, and we should get the same result. This will address many currently “boilerless” CCNG units that use generator ID as boiler ID in the later years. We could try to apply this more generally, but in cases of generator IDs that haven’t been used as boiler IDs, it would break the foreign key relationship with the boiler table, unless we added them there too, which seems like too much deep muddling.
- Parameters:
gens_df – A core_eia860__scd_generators dataframe, which must contain columns: report_date, plant_id_eia, generator_id, unit_id_pudl, bga_source.
- Returns:
A DataFrame with the same columns as the input dataframe, but having some NA values filled in for both the unit_id_pudl and bga_source columns.
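A minimal sketch of the back and forward fill, using grouped bfill and ffill in pandas. The helper name fill_units, the sample data, and the "reported" source label are hypothetical, and applying the backfill before the forward fill is an assumption of this sketch.

```python
import pandas as pd

ID_COLS = ["plant_id_eia", "generator_id"]


def fill_units(gens):
    """Back and forward fill unit_id_pudl within each plant / generator (sketch)."""
    out = gens.sort_values("report_date").copy()

    # Backfill first, and label only the records that were actually filled.
    bfilled = out.groupby(ID_COLS)["unit_id_pudl"].bfill()
    newly_bfilled = out["unit_id_pudl"].isna() & bfilled.notna()
    out["unit_id_pudl"] = bfilled
    out.loc[newly_bfilled, "bga_source"] = "bfill_units"

    # Then forward fill whatever is still missing.
    ffilled = out.groupby(ID_COLS)["unit_id_pudl"].ffill()
    newly_ffilled = out["unit_id_pudl"].isna() & ffilled.notna()
    out["unit_id_pudl"] = ffilled
    out.loc[newly_ffilled, "bga_source"] = "ffill_units"
    return out.sort_index()


# Made-up records: generator A has a unit ID in one year, generator B never does.
gens = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2012-01-01", "2013-01-01", "2014-01-01", "2013-01-01", "2014-01-01"]
        ),
        "plant_id_eia": [1, 1, 1, 1, 1],
        "generator_id": ["A", "A", "A", "B", "B"],
        "unit_id_pudl": [None, 1, None, None, None],
        "bga_source": [None, "reported", None, None, None],  # placeholder label
    }
)
filled = fill_units(gens)
```

Generator B has no unit ID in any year, so there is nothing to propagate and both of its columns stay null.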
- pudl.output.eia.max_unit_id_by_plant(gens_df: pandas.DataFrame) pandas.DataFrame [source]¶
Identify the largest unit ID associated with each plant so we don’t overlap.
The PUDL Unit IDs are sequentially assigned integers. To assign a new ID, we need to know the largest existing Unit ID within a plant. This function calculates that largest existing ID, or uses zero, if no Unit IDs are set within the plant.
Note that this calculation depends on having all of the pre-existing generators and units still available in the dataframe!
- Parameters:
gens_df – A core_eia860__scd_generators dataframe containing at least the columns plant_id_eia and unit_id_pudl.
- Returns:
A DataFrame having two columns, plant_id_eia and max_unit_id_pudl, in which each row should be unique.
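The calculation amounts to a grouped maximum with a zero default. A minimal sketch, assuming a hypothetical helper name max_unit_ids and made-up data:

```python
import pandas as pd


def max_unit_ids(gens):
    """Largest existing unit_id_pudl per plant, or 0 if the plant has none (sketch)."""
    return (
        gens.groupby("plant_id_eia")["unit_id_pudl"]
        .max()
        .fillna(0)  # plants with no unit IDs at all start from zero
        .astype(int)
        .rename("max_unit_id_pudl")
        .reset_index()
    )


# Made-up records: plant 1 already has units 1 and 2, plant 2 has none.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "unit_id_pudl": [1, 2, None, None, None],
    }
)
max_ids = max_unit_ids(gens)
```

A new unit in plant 1 would then start at 3, and the first unit in plant 2 at 1.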
- pudl.output.eia._append_masked_units(gens_df: pandas.DataFrame, row_mask: numpy.ndarray, unit_ids: pandas.DataFrame, on: str | list[str]) pandas.DataFrame [source]¶
Replace rows with new PUDL Unit IDs in the original dataframe.
Merges the newly assigned Unit IDs found in unit_ids into the gens_df dataframe, but only for those rows which are selected by the boolean row_mask. Merges using the column or columns specified by on. This operation should only result in changes to the values of unit_id_pudl and bga_source in the output dataframe. All of gens_df, unit_ids and row_mask must be similarly indexed for this to work.
- Parameters:
gens_df – A gens_eia860 based dataframe.
row_mask (boolean mask) – A boolean array indicating which records in gens_df should be replaced using values from unit_ids.
unit_ids – A dataframe containing newly assigned unit_id_pudl values to be integrated into gens_df.
on – Column or list of columns to merge on.
- Returns:
A DataFrame with unit IDs.
- pudl.output.eia.assign_single_gen_unit_ids(gens_df: pandas.DataFrame, prime_mover_codes: list[str], fuel_type_code_pudl: str = None, label_prefix: str = 'single') pandas.DataFrame [source]¶
Assign a unique PUDL Unit ID to each generator of a given prime mover type.
Calculate the maximum pre-existing PUDL Unit ID within each plant, and assign each distinct generator in that plant that does not yet have a Unit ID an incrementing integer unit_id_pudl, beginning with 1 + the previous maximum unit_id_pudl found in that plant. Mark that generator with a label in the bga_source column consisting of label_prefix + the prime mover code.
If fuel_type_code_pudl is not None, then only assign new Unit IDs to those generators having the specified fuel type code, and use that fuel type code as the label prefix, e.g. “coal_st” for a coal-fired steam turbine.
Only generators having NA unit_id_pudl will be assigned a new ID.
- Parameters:
gens_df – A collection of EIA generator records. Must include the plant_id_eia, generator_id, prime_mover_code and unit_id_pudl columns.
prime_mover_codes – List of prime mover codes for which we are attempting to assign simple Unit IDs.
fuel_type_code_pudl – If not None, then limit the records assigned a unit_id to those that have the specified fuel_type_code_pudl (e.g. “coal”, “gas”, “oil”, “nuclear”)
label_prefix – String to use in labeling records as to how their unit_id_pudl was set. Will be concatenated with the prime mover code.
- Returns:
A new dataframe with the same rows and columns as were passed in, but with the unit_id_pudl and bga_source columns updated to reflect the newly assigned Unit IDs.
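The incrementing assignment can be sketched as follows. This is an illustration only: the helper name assign_single_units, the sample data, and the "reported" label are hypothetical, the fuel type filter is omitted, and the lowercase, underscore-joined label format is an assumption based on the “coal_st” example above.

```python
import pandas as pd

ID_COLS = ["plant_id_eia", "generator_id"]


def assign_single_units(gens, prime_mover_codes, label_prefix="single"):
    """Give each unassigned generator its own new unit ID within its plant (sketch)."""
    out = gens.copy()
    # Largest existing unit ID in each plant, or zero when the plant has none.
    plant_max = out.groupby("plant_id_eia")["unit_id_pudl"].transform("max").fillna(0)
    needs_id = out["unit_id_pudl"].isna() & out["prime_mover_code"].isin(
        prime_mover_codes
    )

    # Number the distinct unassigned generators 1, 2, 3... within each plant.
    new_gens = out.loc[needs_id, ID_COLS].drop_duplicates()
    new_gens["offset"] = new_gens.groupby("plant_id_eia").cumcount() + 1
    offset = pd.Series(
        out[ID_COLS].merge(new_gens, on=ID_COLS, how="left")["offset"].to_numpy(),
        index=out.index,
    )

    out.loc[needs_id, "unit_id_pudl"] = (plant_max + offset)[needs_id]
    # Assumed label format: prefix, underscore, lowercase prime mover code.
    out.loc[needs_id, "bga_source"] = (
        label_prefix + "_" + out.loc[needs_id, "prime_mover_code"].str.lower()
    )
    return out


# Made-up records: generator A already has unit 1; B, C and X have none.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 1, 2],
        "generator_id": ["A", "A", "B", "B", "C", "X"],
        "prime_mover_code": ["ST", "ST", "GT", "GT", "GT", "GT"],
        "unit_id_pudl": [1, 1, None, None, None, None],
        "bga_source": ["reported", "reported", None, None, None, None],
    }
)
assigned = assign_single_units(gens, ["GT"])
```

Both records of generator B share the same new ID, because the numbering is done per distinct generator rather than per row.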
- pudl.output.eia.assign_cc_unit_ids(gens_df: pandas.DataFrame) pandas.DataFrame [source]¶
Assign PUDL Unit IDs for combined cycle generation units.
This applies only to combined cycle units reported as a combination of CT and CA prime movers. All CT and CA generators within a plant that do not already have a unit_id_pudl assigned will be given the same unit ID. The bga_source column is set to one of several flags indicating what type of arrangement was found:
orphan_ct (zero CA gens, 1+ CT gens)
orphan_ca (zero CT gens, 1+ CA gens)
one_ct_one_ca_inferred (1 CT, 1 CA)
one_ct_many_ca_inferred (1 CT, 1+ CA)
many_ct_one_ca_inferred (1+ CT, 1 CA)
many_ct_many_ca_inferred (1+ CT, 1+ CA)
Orphaned generators are still assigned a unit_id_pudl so that they can potentially be associated with other generators in the same unit across years. It’s likely that these orphans are a result of mislabeled or missing generators. Note that as generators are added or removed over time, the flags associated with each generator may change, even though it remains part of the same inferred unit.
- Returns:
A dataframe with assigned PUDL unit IDs.
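The flag selection can be sketched by counting distinct CT and CA generators per plant. This is not the PUDL implementation: the helper name classify_cc_plants and the sample data are made up, the sketch ignores any pre-existing unit IDs and the time dimension, and it assumes that "many" means two or more generators.

```python
import pandas as pd


def classify_cc_plants(gens):
    """Return one bga_source flag per plant based on CT / CA generator counts (sketch)."""
    cc = gens[gens["prime_mover_code"].isin(["CT", "CA"])]
    counts = (
        cc.drop_duplicates(["plant_id_eia", "generator_id"])
        .groupby(["plant_id_eia", "prime_mover_code"])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=["CT", "CA"], fill_value=0)
    )

    def flag(n_ct, n_ca):
        if n_ca == 0:
            return "orphan_ct"
        if n_ct == 0:
            return "orphan_ca"
        # Assumption: "many" means two or more generators of that type.
        ct = "one" if n_ct == 1 else "many"
        ca = "one" if n_ca == 1 else "many"
        return f"{ct}_ct_{ca}_ca_inferred"

    flags = [flag(n_ct, n_ca) for n_ct, n_ca in zip(counts["CT"], counts["CA"])]
    return pd.Series(flags, index=counts.index, name="bga_source")


# Made-up plants covering four of the arrangements.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 3, 3, 4],
        "generator_id": ["CT1", "CT2", "CA1", "CT1", "CT1", "CA1", "CA1"],
        "prime_mover_code": ["CT", "CT", "CA", "CT", "CT", "CA", "CA"],
    }
)
flags = classify_cc_plants(gens)
```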
- pudl.output.eia.assign_prime_fuel_unit_ids(gens_df: pandas.DataFrame, prime_mover_code: str, fuel_type_code_pudl: str) pandas.DataFrame [source]¶
Assign a PUDL Unit ID to all generators with a given prime mover and fuel.
Within each plant, assign a Unit ID to all generators that don’t have one, and that share the same fuel_type_code_pudl and prime_mover_code. This is especially useful for differentiating between different types of steam turbine generators, as there are so many different kinds of steam turbines, and the only characteristic we have to differentiate between them in this context is the fuel they consume. E.g. nuclear, geothermal, solar thermal, natural gas, diesel, and coal can all run steam turbines, but it doesn’t make sense to lump those turbines together into a single unit just because they are located at the same plant.
This routine only assigns a PUDL Unit ID to generators that have a consistently reported value of fuel_type_code_pudl across all of the years of data in gens_df. This consistency is important because otherwise the prime-fuel based unit assignment could put the same generator into different units in different years, which is currently not compatible with our concept of “units.”
- Parameters:
gens_df (pandas.DataFrame) – A collection of EIA generator records. Must include the plant_id_eia, generator_id, prime_mover_code and unit_id_pudl columns.
prime_mover_code (str) – The prime mover code for which we are attempting to assign Unit IDs.
fuel_type_code_pudl (str) – Limit the records assigned a unit_id to those that have the specified fuel_type_code_pudl (e.g. “coal”, “gas”, “oil”, “nuclear”).
- Returns:
A DataFrame where generators without a unit ID are assigned one based on fuel_type_code_pudl.
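The consistency requirement can be sketched by counting distinct fuel types per generator before assigning anything. The helper name assign_prime_fuel_units and the sample data are hypothetical, and the sketch leaves bga_source untouched because the label used for these assignments is not stated here.

```python
import pandas as pd


def assign_prime_fuel_units(gens, prime_mover_code, fuel_type_code_pudl):
    """Group unassigned generators sharing a prime mover and fuel into one unit (sketch)."""
    out = gens.copy()
    # A generator qualifies only if it reports a single fuel type in every year.
    n_fuels = out.groupby(["plant_id_eia", "generator_id"])[
        "fuel_type_code_pudl"
    ].transform("nunique")
    mask = (
        out["unit_id_pudl"].isna()
        & (out["prime_mover_code"] == prime_mover_code)
        & (out["fuel_type_code_pudl"] == fuel_type_code_pudl)
        & (n_fuels == 1)
    )
    # All qualifying generators in a plant share the next available unit ID.
    plant_max = out.groupby("plant_id_eia")["unit_id_pudl"].transform("max").fillna(0)
    out.loc[mask, "unit_id_pudl"] = plant_max[mask] + 1
    return out


# Made-up records: generator C switches from coal to gas, so it is skipped.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 1, 1, 1],
        "generator_id": ["A", "A", "B", "C", "C", "D", "E"],
        "prime_mover_code": ["ST", "ST", "ST", "ST", "ST", "ST", "GT"],
        "fuel_type_code_pudl": ["coal", "coal", "coal", "coal", "gas", "nuclear", "gas"],
        "unit_id_pudl": [None, None, None, None, None, None, 1],
    }
)
assigned = assign_prime_fuel_units(gens, "ST", "coal")
```

Generators A and B end up in the same new unit, while the nuclear steam turbine D is left for a separate call with its own fuel type.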