pudl.analysis.allocate_gen_fuel
Allocate data from core_eia923__monthly_generation_fuel table to generator level.
The algorithm we’re using assumes the following about the reported data:
The core_eia923__monthly_generation_fuel table is the authoritative source of information about how much generation and fuel consumption is attributable to an entire plant. This table has the most complete data coverage, but it is not the most granular data reported. Its primary keys are IDX_PM_ESC.
The core_eia923__monthly_generation table contains the most granular net generation data. It is reported at the generator level with primary keys IDX_GENS. This table includes only ~39% of the total MWhs reported in the core_eia923__monthly_generation_fuel table.
The core_eia923__monthly_boiler_fuel table contains the most granular fuel consumption data. It is reported at the boiler/prime mover/energy source level with primary keys IDX_B_PM_ESC. This table includes only ~38% of the total MMBTUs reported in the core_eia923__monthly_generation_fuel table.
The core_eia860__scd_generators table provides an exhaustive list of all generators whose generation is being reported in the core_eia923__monthly_generation_fuel table, with primary keys IDX_GENS.
This module allocates the total net electricity generation and fuel consumption reported
in the core_eia923__monthly_generation_fuel table to individual generators, based
on more granular data reported in the core_eia923__monthly_generation and
core_eia923__monthly_boiler_fuel tables, as well as capacity (MW) found in the
core_eia860__scd_generators table. It uses other generator attributes from the
core_eia860__scd_generators table to associate the data found in the
core_eia923__monthly_generation_fuel table with generators. It also uses the
associations between boilers and generators found in the
core_eia860__assn_boiler_generator table to aggregate data from the
core_eia923__monthly_boiler_fuel table. The main coordinating functions here are
allocate_gen_fuel_by_generator_energy_source() and agg_by_generator().
Some definitions:
Data columns refers to the net generation and fuel consumption. The specific columns are defined in DATA_COLUMNS.
Granular tables refers to core_eia923__monthly_generation and core_eia923__monthly_boiler_fuel, which report granular data but do not have complete coverage.
There are six main stages of the allocation process in this module:
Read inputs: Read denormalized net generation and fuel consumption data from the PUDL DB and standardize data reporting frequency (see select_input_data() and standardize_input_frequency()).
Associate inputs: Merge data columns from the input tables described above on the basis of their shared primary key columns, producing an output with primary key IDX_GENS_PM_ESC. This broadcasts many data values across multiple rows for use in the allocation process below (see associate_generator_tables()).
Flag associated inputs: For each record in the associated inputs, add boolean flags that separately indicate whether the generation and fuel consumption in that record are directly reported in the granular tables. This lets us choose an appropriate data allocation method based on how complete the granular data coverage is for a given value of IDX_PM_ESC, which is the original primary key of the core_eia923__monthly_generation_fuel table (see prep_alloction_fraction()).
Allocate: Allocate the net generation and fuel consumption reported in the less granular core_eia923__monthly_generation_fuel table to the IDX_GENS_PM_ESC level. More details on the allocation process are below (see allocate_gen_fuel_by_gen_esc() and allocate_fuel_by_gen_esc()).
Sanity check allocation: Verify that the total allocated net generation and fuel consumption within each plant is equal to the total of the originally reported values within some tolerance (see test_original_gf_vs_the_allocated_by_gens_gf()). Warn if assumptions about the data and the outputs aren’t met (see warn_if_missing_pms(), _test_frac(), test_gen_fuel_allocation() and _test_gen_pm_fuel_output()).
Aggregate outputs: Aggregate the allocated net generation and fuel consumption to the generator level, going from having primary keys of IDX_GENS_PM_ESC to IDX_GENS (see agg_by_generator()).
High-level description of the allocation step:
We allocate the data columns reported in the core_eia923__monthly_generation_fuel table on the basis of plant, prime mover, and energy source among the generators in each plant that have matching energy sources.
We group the associated data columns by IDX_PM_ESC and categorize
each resulting group of generators based on whether ALL, SOME, or NONE of
them reported data in the granular tables. This is done separately for net generation
and fuel consumption, since the same generator may have reported differently in its
respective granular table.
In more detail, within each reporting period, we split the plants into three groups:
The ALL Coverage Records: where ALL generators report in the granular tables.
The NONE Coverage Records: where NONE of the generators report in the granular tables.
The SOME Coverage Records: where only SOME of the generators report in the granular tables.
In the ALL generators case, the data columns reported in the core_eia923__monthly_generation_fuel table are allocated in proportion to data reported in the granular data tables. We do this instead of directly using the data columns from the granular tables because there are discrepancies between the core_eia923__monthly_generation_fuel table and the granular tables and we are assuming the totals reported in the core_eia923__monthly_generation_fuel table are authoritative.
In the NONE generators case, the data columns reported in the core_eia923__monthly_generation_fuel table are allocated in proportion to each generator’s capacity.
In the SOME generators case, we use a combination of the two allocation methods described above. First, the data columns reported in the core_eia923__monthly_generation_fuel table are allocated between the two categories of generators: those that report granular data, and those that don’t. The fraction allocated to each of those categories is based on how much of the total is reported in the granular tables. If T is the total reported, and X is the quantity reported in the granular tables, then the allocation is X/T to the generators reporting granular data, and (T-X)/T to the generators not reporting granular data. Within each of those categories the allocation then follows the ALL or NONE allocation methods described above.
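The three cases above can be illustrated with a small sketch for a single IDX_PM_ESC group. The function name allocate_group and its inputs are hypothetical and are not part of this module. The sketch only mirrors the arithmetic described above and assumes that the granular total X does not exceed the reported total T.

```python
import pandas as pd


def allocate_group(
    total: float, granular: pd.Series, capacity_mw: pd.Series
) -> pd.Series:
    """Allocate one group's reported total among its generators (illustrative).

    total: value reported in the generation_fuel table for the group (T).
    granular: per-generator values from a granular table, NaN if not reported.
    capacity_mw: per-generator capacity, used where granular data are missing.
    """
    reports = granular.notna()
    if reports.all():
        # ALL: allocate in proportion to the granular data.
        frac = granular / granular.sum()
    elif not reports.any():
        # NONE: allocate in proportion to capacity.
        frac = capacity_mw / capacity_mw.sum()
    else:
        # SOME: X/T goes to the reporting generators, split by granular data;
        # (T - X)/T goes to the non-reporting generators, split by capacity.
        x = granular[reports].sum()
        frac_reporting = (x / total) * granular[reports] / x
        cap_missing = capacity_mw[~reports]
        frac_missing = ((total - x) / total) * cap_missing / cap_missing.sum()
        frac = pd.concat([frac_reporting, frac_missing]).reindex(granular.index)
    return total * frac


gens = ["g1", "g2", "g3"]
allocated = allocate_group(
    total=100.0,
    granular=pd.Series([60.0, float("nan"), float("nan")], index=gens),
    capacity_mw=pd.Series([50.0, 10.0, 30.0], index=gens),
)
print(allocated)
```

In this toy SOME case, g1 keeps the 60 units it reported, and the remaining 40 units are split between g2 and g3 in proportion to their 10 MW and 30 MW capacities.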
Known Drawbacks of this methodology:
Note that this methodology does not distinguish between primary and secondary energy_sources for generators. If generators in the same plant do not report detailed generation, have the same prime_mover_code, and use the same fuels, but have very different capacity factors in reality, this methodology will allocate generation such that they end up with very similar capacity factors. We imagine this is an uncommon scenario.
This methodology has several potential flaws and drawbacks. Because there is no indicator of what portion of a generator’s net generation is attributable to each of its energy_source_codes, we associate the net generation equally among them. In effect, if a plant had multiple generators with the same prime_mover_code but opposite primary and secondary fuels (e.g. gen 1 has a primary fuel of ‘NG’ and secondary fuel of ‘DFO’, while gen 2 has a primary fuel of ‘DFO’ and a secondary fuel of ‘NG’), the methodology associates the core_eia923__monthly_generation_fuel records similarly across these two generators. However, the allocated net generation will still be proportional to each generator’s net generation (if it’s reported) or capacity (if generation is not reported).
Module Contents
Functions
allocate_gen_fuel_asset_factory() | Build yearly and monthly net generation & fuel consumption allocation assets. |
allocate_gen_fuel_by_generator_energy_source() | Allocate net gen from gen_fuel table to the generator/energy_source_code level. |
select_input_data() | Select only the subset of input data needed for the allocation. |
standardize_input_frequency() | Standardize the frequency of the input tables. |
scale_allocated_net_gen_fuel_by_ownership() | Scale allocated net gen at the generator/energy_source_code level by ownership. |
agg_by_generator() | Aggregate the allocated gen fuel data to the generator level. |
stack_generators() | Stack the generator table with a set of columns. |
associate_generator_tables() | Associate the three tables needed to assign net gen and fuel to generators. |
remove_inactive_generators() | Remove the retired generators. |
identify_retiring_generators() | Identify any generators that retire mid-year. |
identify_retired_plants() | Identify entire plants that have previously retired but are reporting data. |
identify_generators_coming_online() | Identify generators that are coming online mid-year. |
identify_proposed_plants() | Identify entirely new plants that are proposed but are already reporting data. |
_allocate_unassociated_pm_records() | Associate unassociated core_eia923__monthly_boiler_fuel table records on idx_cols. |
prep_alloction_fraction() | Prepare the associated generators for allocation. |
allocate_gen_fuel_by_gen_esc() | Allocate net generation to generators/energy_source_code via three methods. |
allocate_fuel_by_gen_esc() | Allocate fuel_consumption to generators/energy_source_code via three methods. |
remove_aggregated_sentinel_value() | Replace the post-aggregation sentinel values in a column with zero. |
group_duplicate_keys() | Catches duplicate keys in the allocated data and groups them together. |
distribute_annually_reported_data_to_months_if_annual() | Allocates annually-reported data from the gen or bf table to each month. |
manually_fix_energy_source_codes() | Reassign fuel codes that differ between gen-fuel and gens tables. |
adjust_msw_energy_source_codes() | Adjusts MSW codes. |
add_missing_energy_source_codes_to_gens() | Add energy_source_codes to gens that were found only in the gf or bf tables. |
identify_missing_gf_escs_in_gens() | Identify energy_source_codes that exist in gf or bf but not gens. |
allocate_bf_data_to_gens() | Allocates boiler fuel data to the generator level. |
warn_if_missing_pms() | Log warning if there are too many null prime_mover_code values. |
_test_frac() | Check if each of the IDX_PM_ESC groups frac's add up to 1. |
_test_gen_pm_fuel_output() | |
test_gen_fuel_allocation() | Does the allocated MWh differ from the granular core_eia923__monthly_generation? |
test_original_gf_vs_the_allocated_by_gens_gf() | Test whether the allocated data and original data sum up to similar values. |
Attributes
IDX_GENS | Primary key columns for generator records. |
IDX_GENS_PM_ESC | Primary key columns for plant, generator, prime mover & energy source records. |
IDX_PM_ESC | Primary key columns for plant, prime mover & energy source records. |
IDX_B_PM_ESC | Primary key columns for plant, boiler, prime mover & energy source records. |
IDX_ESC | Primary key columns for plant & energy source records. |
IDX_UNIT_ESC | Primary key columns for plant, energy source & unit records. |
DATA_COLUMNS | Data columns from core_eia923__monthly_generation_fuel that are being allocated. |
MISSING_SENTINEL | A sentinel value for dealing with null or zero values. |
- pudl.analysis.allocate_gen_fuel.IDX_GENS = ['report_date', 'plant_id_eia', 'generator_id'][source]¶
Primary key columns for generator records.
- pudl.analysis.allocate_gen_fuel.IDX_GENS_PM_ESC = ['report_date', 'plant_id_eia', 'generator_id', 'prime_mover_code', 'energy_source_code'][source]¶
Primary key columns for plant, generator, prime mover & energy source records.
- pudl.analysis.allocate_gen_fuel.IDX_PM_ESC = ['report_date', 'plant_id_eia', 'energy_source_code', 'prime_mover_code'][source]¶
Primary key columns for plant, prime mover & energy source records.
- pudl.analysis.allocate_gen_fuel.IDX_B_PM_ESC = ['report_date', 'plant_id_eia', 'boiler_id', 'energy_source_code', 'prime_mover_code'][source]¶
Primary key columns for plant, boiler, prime mover & energy source records.
- pudl.analysis.allocate_gen_fuel.IDX_ESC = ['report_date', 'plant_id_eia', 'energy_source_code'][source]¶
Primary key columns for plant & energy source records.
- pudl.analysis.allocate_gen_fuel.IDX_UNIT_ESC = ['report_date', 'plant_id_eia', 'energy_source_code', 'unit_id_pudl'][source]¶
Primary key columns for plant, energy source & unit records.
- pudl.analysis.allocate_gen_fuel.DATA_COLUMNS = ['net_generation_mwh', 'fuel_consumed_mmbtu', 'fuel_consumed_for_electricity_mmbtu'][source]¶
Data columns from core_eia923__monthly_generation_fuel that are being allocated.
- pudl.analysis.allocate_gen_fuel.MISSING_SENTINEL = 1e-05[source]¶
A sentinel value for dealing with null or zero values.
Zeroes in the relevant data columns get filled in with the sentinel value in associate_generator_tables(). At this stage, all of the zeros from the original data are associated with generators, prime mover codes and energy source codes.
All of the nulls in the relevant data columns are filled with the sentinel value in prep_alloction_fraction(). (Could this also be done in associate_generator_tables()?)
After the allocation of net generation (within allocate_gen_fuel_by_gen_esc() and allocate_fuel_by_gen_esc(), via remove_aggregated_sentinel_value()), convert all of the aggregated values that are between 0 and twenty times this sentinel value back to zeros. This is meant to find all instances of aggregated sentinel values. We avoid any negative values because there are instances of negative original values, especially negative net generation.
- pudl.analysis.allocate_gen_fuel.allocate_gen_fuel_asset_factory(freq: Literal[YS, MS], io_manager_key: str | None = None) list[dagster.AssetsDefinition] [source]¶
Build yearly and monthly net generation & fuel consumption allocation assets.
- pudl.analysis.allocate_gen_fuel.allocate_gen_fuel_by_generator_energy_source(gf: pandas.DataFrame, bf: pandas.DataFrame, gen: pandas.DataFrame, bga: pandas.DataFrame, gens: pandas.DataFrame, freq: Literal[YS, MS], debug: bool = False) pandas.DataFrame [source]¶
Allocate net gen from gen_fuel table to the generator/energy_source_code level.
There are two main steps here:
associate core_eia923__monthly_generation_fuel table data w/ generators
allocate core_eia923__monthly_generation_fuel table data proportionally
The association process happens via associate_generator_tables().
The allocation process (via allocate_gen_fuel_by_gen_esc()) entails generating a fraction for each record within an IDX_PM_ESC group. We have two data points for generating this ratio: the net generation in the core_eia923__monthly_generation table and the capacity from the core_eia860__scd_generators table. The end result is a frac column which is unique for each combination of generator, prime_mover, and fuel and is used to allocate the associated net generation from the core_eia923__monthly_generation_fuel table.
- Parameters:
gf – Temporally aggregated out_eia923__generation_fuel_combined dataframe.
bf – Temporally aggregated core_eia923__monthly_boiler_fuel dataframe.
gen – Temporally aggregated core_eia923__monthly_generation dataframe.
bga – core_eia860__assn_boiler_generator dataframe.
gens – core_eia860__scd_generators dataframe.
freq – Frequency at which the tables are aggregated temporally.
debug – If True, return additional debugging information.
- pudl.analysis.allocate_gen_fuel.select_input_data(gf: pandas.DataFrame, bf: pandas.DataFrame, gen: pandas.DataFrame, bga: pandas.DataFrame, gens: pandas.DataFrame) tuple[pandas.DataFrame] [source]¶
Select only the subset of input data needed for the allocation.
This includes both selecting only a subset of columns from most input tables, and restricting the dates to those which are available in all inputs. Otherwise we end up with a bunch of NA values since the generators table has up to a year of more recent data from the EIA-860M.
- pudl.analysis.allocate_gen_fuel.standardize_input_frequency(bf: pandas.DataFrame, gens: pandas.DataFrame, gen: pandas.DataFrame, freq: Literal[MS, MS]) tuple [source]¶
Standardize the frequency of the input tables.
Employ distribute_annually_reported_data_to_months_if_annual() on the boiler fuel and generation table. Employ pudl.helpers.expand_timeseries() on the generators table. Also use the expanded generators table to ensure the generation table has all of the generators present.
- Parameters:
bf – core_eia923__monthly_boiler_fuel table
gens – core_eia860__scd_generators table
gen – core_eia923__monthly_generation table
freq – the (time) frequency at which the tables will be aggregated.
- pudl.analysis.allocate_gen_fuel.scale_allocated_net_gen_fuel_by_ownership(net_gen_fuel_alloc: pandas.DataFrame, gens: pandas.DataFrame, own_eia860: pandas.DataFrame) pandas.DataFrame [source]¶
Scale allocated net gen at the generator/energy_source_code level by ownership.
It can be helpful to have a table of net generation and fuel consumption at the generator/fuel-type level (i.e. the result of allocate_gen_fuel_by_generator_energy_source()) to be associated and scaled with all of the owners of those generators. This allows the aggregation of fuel use to the utility level.
This function uses the allocated net generation at the generator/fuel-type level, merges that with a generators table to ensure all necessary columns are available, and then feeds that table into the helper function scale_by_ownership() to scale generators by their owners’ ownership fraction.
- Parameters:
net_gen_fuel_alloc – table of allocated generation and fuel consumption at the generator, prime mover, and energy source level. From allocate_gen_fuel_by_generator_energy_source()
gens – core_eia860__scd_generators table with cols: IDX_GENS, capacity_mw and utility_id_eia
own_eia860 – core_eia860__scd_ownership table.
- pudl.analysis.allocate_gen_fuel.agg_by_generator(net_gen_fuel_alloc: pandas.DataFrame, by_cols: list[str] = IDX_GENS, sum_cols: list[str] = DATA_COLUMNS) pandas.DataFrame [source]¶
Aggregate the allocated gen fuel data to the generator level.
- Parameters:
net_gen_fuel_alloc – result of allocate_gen_fuel_by_generator_energy_source()
by_cols – list of columns to use as the pandas.groupby arg by
sum_cols – data columns that are being aggregated via a pandas.groupby.sum()
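As a rough illustration of the aggregation described by these parameters, the following sketch groups by by_cols and sums sum_cols. The function name agg_by_generator_sketch and the toy data are hypothetical. The actual function may handle nulls or column selection differently.

```python
import pandas as pd

IDX_GENS = ["report_date", "plant_id_eia", "generator_id"]
DATA_COLUMNS = [
    "net_generation_mwh",
    "fuel_consumed_mmbtu",
    "fuel_consumed_for_electricity_mmbtu",
]


def agg_by_generator_sketch(
    net_gen_fuel_alloc: pd.DataFrame,
    by_cols: list[str] = IDX_GENS,
    sum_cols: list[str] = DATA_COLUMNS,
) -> pd.DataFrame:
    """Sum the allocated data columns within each generator (illustrative)."""
    return net_gen_fuel_alloc.groupby(by_cols, as_index=False)[sum_cols].sum()


# Generator A burns two fuels, so it has two allocated records to combine.
alloc = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2020-01-01"] * 3),
        "plant_id_eia": [1, 1, 1],
        "generator_id": ["A", "A", "B"],
        "energy_source_code": ["NG", "DFO", "NG"],
        "net_generation_mwh": [10.0, 5.0, 7.0],
        "fuel_consumed_mmbtu": [100.0, 50.0, 70.0],
        "fuel_consumed_for_electricity_mmbtu": [90.0, 45.0, 60.0],
    }
)
by_generator = agg_by_generator_sketch(alloc)
print(by_generator)
```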
- pudl.analysis.allocate_gen_fuel.stack_generators(gens: pandas.DataFrame, cat_col: str = 'energy_source_code_num', stacked_col: str = 'energy_source_code') pandas.DataFrame [source]¶
Stack the generator table with a set of columns.
- Parameters:
gens – core_eia860__scd_generators table with cols: IDX_GENS and all of the energy_source_code columns
cat_col – name of category column which will end up having the column names of cols_to_stack
stacked_col – name of column which will end up with the stacked data from cols_to_stack
- Returns:
a dataframe with these columns: idx_stack, cat_col, stacked_col
- Return type:
pandas.DataFrame
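The reshaping can be pictured with the sketch below, which turns one wide row per generator into one row per generator and energy source. It uses pandas.DataFrame.melt rather than stack() purely for illustration. The function name stack_energy_sources and the column names energy_source_code_1 and energy_source_code_2 are assumptions, not taken from this module.

```python
import pandas as pd

IDX_GENS = ["report_date", "plant_id_eia", "generator_id"]


def stack_energy_sources(
    gens: pd.DataFrame,
    cat_col: str = "energy_source_code_num",
    stacked_col: str = "energy_source_code",
) -> pd.DataFrame:
    """Reshape wide energy source columns into one row per code (illustrative)."""
    # Assumed naming: every wide column starts with "energy_source_code".
    esc_cols = [c for c in gens.columns if c.startswith("energy_source_code")]
    stacked = gens.melt(
        id_vars=IDX_GENS,
        value_vars=esc_cols,
        var_name=cat_col,
        value_name=stacked_col,
    )
    # Generators with fewer energy sources have nulls in the unused columns.
    return stacked.dropna(subset=[stacked_col]).reset_index(drop=True)


gens = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "plant_id_eia": [1, 1],
        "generator_id": ["A", "B"],
        "energy_source_code_1": ["NG", "NG"],
        "energy_source_code_2": ["DFO", None],
    }
)
stacked_gens = stack_energy_sources(gens)
print(stacked_gens)
```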
- pudl.analysis.allocate_gen_fuel.associate_generator_tables(gens: pandas.DataFrame, gf: pandas.DataFrame, gen: pandas.DataFrame, bf: pandas.DataFrame, bga: pandas.DataFrame) pandas.DataFrame [source]¶
Associate the three tables needed to assign net gen and fuel to generators.
The core_eia923__monthly_generation_fuel table’s data is reported at the IDX_PM_ESC granularity. Each generator in the core_eia860__scd_generators table has one prime_mover_code, but potentially several energy_source_codes that are reported in several columns. We need to reshape the generators table such that each generator has a separate record corresponding to each of its reported energy_source_codes, so it can be merged with the core_eia923__monthly_generation_fuel table. We do this using stack_generators(), employing pd.DataFrame.stack().
The stacked generators table has a primary key of ["plant_id_eia", "generator_id", "report_date", "energy_source_code"]. The table also includes the prime_mover_code column to enable merges with other tables, the capacity_mw column which we use to determine the allocation when there is no data in the granular data tables, and the operational_status column which we use to remove inactive plants from the association and allocation process.
The remaining data tables are all less granular than this stacked generators table and have varying primary keys. We add suffixes to the data columns in these data tables to identify the source table before broadcast merging these data columns into the stacked generators. This broadcasted data will be used later in the allocation process.
This function also removes inactive generators so that we don’t associate any net generation or fuel to those generators. See remove_inactive_generators() for more details.
There are some records in the data tables that have either prime_mover_codes or energy_source_codes that do not appear in the core_eia860__scd_generators table. We employ _allocate_unassociated_pm_records() to make sure those records are associated.
- Parameters:
gens – core_eia860__scd_generators table with cols: IDX_GENS and all of the energy_source_code columns and expanded to the same frequency.
gf – core_eia923__monthly_generation_fuel table with columns: IDX_PM_ESC and net_generation_mwh and fuel_consumed_mmbtu.
gen – core_eia923__monthly_generation table with columns: IDX_GENS and net_generation_mwh.
bf – core_eia923__monthly_boiler_fuel table with columns: IDX_B_PM_ESC and fuel consumption columns.
bga – core_eia860__assn_boiler_generator table.
- Returns:
table of generators with stacked energy sources and broadcasted net generation and fuel data from the core_eia923__monthly_generation and core_eia923__monthly_generation_fuel tables. There are many duplicate values in this output which will later be used in the allocation process in allocate_gen_fuel_by_gen_esc() and allocate_fuel_by_gen_esc().
- pudl.analysis.allocate_gen_fuel.remove_inactive_generators(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Remove the retired generators.
We don’t want to associate and later allocate net generation or fuel to generators that are retired (or proposed! or any other operational_status besides existing). However, we do want to keep the generators that report operational statuses other than existing but which report non-zero data despite being retired or proposed. This includes several categories of generators/plants:
retiring_generators: generators that retire mid-year
retired_plants: entire plants that supposedly retired prior to the current year but which report data. If a plant has a mix of gens which are existing and retired, they are not included in this category.
proposed_generators: generators that become operational mid-year, or which are marked as proposed but start reporting non-zero data
proposed_plants: entire plants that have a proposed status but which start reporting data. If a plant has a mix of gens which are existing and proposed, they are not included in this category.
When we do not have generator-specific generation for a proposed/retired generator that is not coming online/retiring mid-year, we can also look at whether there is generation reported for this generator in the gf table. However, if a proposed/retired generator is part of an existing plant, it is possible that the reported generation from the gf table belongs to one of the other existing generators. Thus, we want to only keep proposed/retired generators where the entire plant is proposed/retired (in which case the gf-reported generation could only come from one of the new/retired generators).
We also want to keep unassociated plants that have no generator_id which will be associated via _allocate_unassociated_pm_records().
- Parameters:
gen_assoc – table of generators with stacked energy sources and broadcasted net generation data from the core_eia923__monthly_generation and core_eia923__monthly_generation_fuel tables. Output of associate_generator_tables().
- pudl.analysis.allocate_gen_fuel.identify_retiring_generators(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Identify any generators that retire mid-year.
These are generators with a retirement date after the earliest report_date or which report generator-specific generation data in the g table after their retirement date.
- pudl.analysis.allocate_gen_fuel.identify_retired_plants(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Identify entire plants that have previously retired but are reporting data.
- pudl.analysis.allocate_gen_fuel.identify_generators_coming_online(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Identify generators that are coming online mid-year.
These are defined as generators that have a proposed status but which report generator-specific generation data in the g table
- pudl.analysis.allocate_gen_fuel.identify_proposed_plants(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Identify entirely new plants that are proposed but are already reporting data.
- pudl.analysis.allocate_gen_fuel._allocate_unassociated_pm_records(gen_assoc: pandas.DataFrame, idx_cols: list[str], col_w_unexpected_codes: Literal[energy_source_code, prime_mover_code], data_columns: list[str]) pandas.DataFrame [source]¶
Associate unassociated core_eia923__monthly_boiler_fuel table records on idx_cols.
There is a subset of core_eia923__monthly_boiler_fuel and core_eia923__monthly_generation_fuel records which do not merge onto the stacked generator table on IDX_GENS_PM_ESC or IDX_PM_ESC respectively. These records generally don’t match with the set of prime movers and energy sources in the stacked generator table. In this method, we associate those straggler, unassociated records by merging these records with the stacked generators without the un-matching data column.
- Parameters:
gen_assoc – generators associated with data.
idx_cols – ID columns (includes col_w_unexpected_codes)
col_w_unexpected_codes – name of the column which has codes in it that were not found in the generators table.
data_columns – the data columns to associate and allocate.
- pudl.analysis.allocate_gen_fuel.prep_alloction_fraction(gen_assoc: pandas.DataFrame) pandas.DataFrame [source]¶
Prepare the associated generators for allocation.
Make flags and aggregations to prepare for the allocate_gen_fuel_by_gen_esc() and allocate_fuel_by_gen_esc() functions.
In allocate_gen_fuel_by_gen_esc(), we will break the generators out into three types - see allocate_gen_fuel_by_gen_esc() docs for details. This function adds flags for splitting the generators.
- Parameters:
gen_assoc – a table of generators that have been associated w/ energy sources, prime movers and boilers - result of associate_generator_tables()
- pudl.analysis.allocate_gen_fuel.allocate_gen_fuel_by_gen_esc(gen_pm_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Allocate net generation to generators/energy_source_code via three methods.
There are three main types of generators:
“all gen”: generators of plants which fully report to the core_eia923__monthly_generation table. This includes records that report more MWh to the core_eia923__monthly_generation table than to the core_eia923__monthly_generation_fuel table (if we did not include these records, the ).
“some gen”: generators of plants which partially report to the core_eia923__monthly_generation table.
“gf only”: generators of plants which do not report at all to the core_eia923__monthly_generation table.
Each different type of generator needs to be treated slightly differently, but all will end up with a frac column that can be used to allocate the net_generation_mwh_gf_tbl.
- Parameters:
gen_pm_fuel – output of prep_alloction_fraction().
- pudl.analysis.allocate_gen_fuel.allocate_fuel_by_gen_esc(gen_pm_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Allocate fuel_consumption to generators/energy_source_code via three methods.
There are three main types of generators:
“all bf”: generators of plants which fully report to the core_eia923__monthly_boiler_fuel table.
“some bf”: generators of plants which partially report to the core_eia923__monthly_boiler_fuel table.
“gf only”: generators of plants which do not report at all to the core_eia923__monthly_boiler_fuel table.
Each different type of generator needs to be treated slightly differently, but all will end up with a frac column that can be used to allocate the fuel_consumed_mmbtu_gf_tbl.
- Parameters:
gen_pm_fuel – output of prep_alloction_fraction().
- pudl.analysis.allocate_gen_fuel.remove_aggregated_sentinel_value(col: pandas.Series, scalar: float = 20.0) pandas.Series [source]¶
Replace the post-aggregation sentinel values in a column with zero.
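Based on the description of MISSING_SENTINEL above, the replacement could look roughly like the following. This is a sketch under that description, not the module's implementation. The function name remove_aggregated_sentinel_sketch is hypothetical, and the inclusive treatment of the interval bounds is an assumption.

```python
import pandas as pd

MISSING_SENTINEL = 1e-05


def remove_aggregated_sentinel_sketch(
    col: pd.Series, scalar: float = 20.0
) -> pd.Series:
    """Zero out values between 0 and scalar * MISSING_SENTINEL (illustrative)."""
    # Negative values are left alone because real negative values exist,
    # for example negative net generation. Nulls are also left unchanged.
    is_sentinel = col.between(0, scalar * MISSING_SENTINEL)
    return col.mask(is_sentinel, 0.0)


values = pd.Series([-5.0, 0.0, 1e-05, 1.5e-04, 3e-04, 10.0])
cleaned = remove_aggregated_sentinel_sketch(values)
print(cleaned.tolist())
```

With the default scalar the cutoff is 2e-04, so the sentinel-sized entries 1e-05 and 1.5e-04 become zero while -5.0, 3e-04 and 10.0 are kept.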
- pudl.analysis.allocate_gen_fuel.group_duplicate_keys(df: pandas.DataFrame) pandas.DataFrame [source]¶
Catches duplicate keys in the allocated data and groups them together.
Merging net_gen_alloc and fuel_alloc together requires unique keys in each df. Sometimes the allocation process creates duplicate keys. This function identifies when this happens, and aggregates the data on these keys to remove the duplicates.
- pudl.analysis.allocate_gen_fuel.distribute_annually_reported_data_to_months_if_annual(df: pandas.DataFrame, key_columns: list[str], data_column_name: str, freq: Literal[YS, MS]) pandas.DataFrame [source]¶
Allocates annually-reported data from the gen or bf table to each month.
Certain plants only report data to the generator table and boiler fuel table on an annual basis. In these cases, their annual total is reported as a single value in January or December, and the other 11 months are reported as missing values. This function first identifies which plants are annual respondents by identifying plants that have 11 months of missing data, with the one month of existing data being in January or December. This is an assumption based on seeing that over 40% of the plants that have 11 months of missing data report their one month of data in January and December (this ratio of reporting is checked and will raise a warning if it becomes untrue). It then distributes this annually-reported value evenly across all months in the year. Because we know some of the plants are reporting in only one month that is not January or December, the assumption about January and December only reporting is almost certainly resulting in some non-annual data being allocated across all months, but on average the data will be more accurate.
Note: We should be able to use the reporting_frequency_code column for the identification of annually reported data. This currently does not work because we assumed this was a plant-level annual attribute (and is thus stored in the core_eia860__scd_plants table). See Issue #1933.
- Parameters:
df – a pandas dataframe, either loaded from pudl_out.gen_original_eia923() or pudl_out.bf_eia923()
key_columns – a list of the primary key column names, either ["plant_id_eia","boiler_id","energy_source_code"] or ["plant_id_eia","generator_id"]
data_column_name – the name of the data column to allocate, either “net_generation_mwh” or “fuel_consumed_mmbtu” depending on the df specified
freq – frequency of input df. Must be either YS or MS.
- Returns:
df with the annually reported values allocated to each month
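The identification and even distribution described above can be sketched as follows. The function name distribute_annual_sketch and the temporary _year and _month columns are hypothetical. The sketch omits the freq argument and the reporting-ratio warning, and assumes report_date is a datetime column.

```python
import pandas as pd


def distribute_annual_sketch(
    df: pd.DataFrame, key_columns: list[str], data_column_name: str
) -> pd.DataFrame:
    """Spread a lone January/December value evenly over 12 months (illustrative)."""
    out = df.assign(
        _year=df["report_date"].dt.year,
        # Month of each reported (non-null) value, NaN where nothing was reported.
        _month=df["report_date"].dt.month.where(df[data_column_name].notna()),
    )
    grouped = out.groupby([*key_columns, "_year"])
    n_rows = grouped[data_column_name].transform("size")
    n_reported = grouped[data_column_name].transform("count")
    # With a single reported value, the group sum equals that value.
    annual_total = grouped[data_column_name].transform("sum")
    sole_month = grouped["_month"].transform("max")
    is_annual = (n_rows == 12) & (n_reported == 1) & sole_month.isin([1, 12])
    out.loc[is_annual, data_column_name] = annual_total[is_annual] / 12
    return out.drop(columns=["_year", "_month"])


dates = pd.date_range("2020-01-01", periods=12, freq="MS")
annual_reporter = pd.DataFrame(
    {
        "report_date": dates,
        "plant_id_eia": 1,
        "generator_id": "A",
        "net_generation_mwh": [120.0] + [float("nan")] * 11,
    }
)
monthly_reporter = annual_reporter.assign(
    plant_id_eia=2, net_generation_mwh=[float(m) for m in range(1, 13)]
)
monthly = distribute_annual_sketch(
    pd.concat([annual_reporter, monthly_reporter], ignore_index=True),
    key_columns=["plant_id_eia", "generator_id"],
    data_column_name="net_generation_mwh",
)
print(monthly)
```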
- pudl.analysis.allocate_gen_fuel.manually_fix_energy_source_codes(gf: pandas.DataFrame) pandas.DataFrame [source]¶
Reassign fuel codes that differ between gen-fuel and gens tables.
- pudl.analysis.allocate_gen_fuel.adjust_msw_energy_source_codes(gens: pandas.DataFrame, gf: pandas.DataFrame, bf_by_gens: pandas.DataFrame) pandas.DataFrame [source]¶
Adjusts MSW codes.
Adjust the MSW codes in gens to match those used in gf and bf.
In recent years, EIA-923 started splitting out the MSW (municipal_solid_waste) into its constituent components MSB (municipal_solid_waste_biogenic) and MSN (municipal_solid_nonbiogenic). However, the EIA-860 Generators table still only uses the MSW code.
This function identifies which MSW codes are used in the gf and bf tables and creates records to match these.
- pudl.analysis.allocate_gen_fuel.add_missing_energy_source_codes_to_gens(gens_at_freq, gf, bf)[source]¶
Add energy_source_codes to gens that were found only in the gf or bf tables.
In some cases, non-zero fuel consumption and net generation is reported in the EIA-923 generation and fuel table that is associated with an energy_source_code that is not associated with that plant-prime mover in the gens table, which would cause these data to get dropped when these two tables are merged. To fix this, for each plant-pm, this function identifies such esc, and adds them to the gens_at_freq table as new energy_source_code columns.
- pudl.analysis.allocate_gen_fuel.identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf)[source]¶
Identify energy_source_codes that exist in gf or bf but not gens.
- pudl.analysis.allocate_gen_fuel.allocate_bf_data_to_gens(bf: pandas.DataFrame, gens: pandas.DataFrame, bga: pandas.DataFrame) pandas.DataFrame [source]¶
Allocates boiler fuel data to the generator level.
Distributes boiler-level data from core_eia923__monthly_boiler_fuel to the generator level based on the boiler-generator association table and the nameplate capacity of the connected generators.
Because fuel consumption in the core_eia923__monthly_boiler_fuel table is reported per boiler_id, we must first map this data to generators using the core_eia860__assn_boiler_generator table. For boilers that have a 1:m or m:m relationship with generators, we allocate the reported fuel to each associated generator based on the nameplate capacity of each generator. So if boiler “1” was associated with generator A (25 MW) and generator B (75 MW), 25% of the fuel consumption would be allocated to generator A and 75% would be allocated to generator B.
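The worked example above can be reproduced with a small sketch. The function name allocate_boiler_fuel_sketch is hypothetical, and the toy tables omit the report_date, plant_id_eia and energy source keys that the real tables carry.

```python
import pandas as pd


def allocate_boiler_fuel_sketch(
    bf: pd.DataFrame, bga: pd.DataFrame, gens: pd.DataFrame
) -> pd.DataFrame:
    """Split each boiler's fuel among its generators by capacity (illustrative)."""
    merged = bf.merge(bga, on="boiler_id").merge(gens, on="generator_id")
    # Each generator's share of the total capacity attached to its boiler.
    cap_share = merged["capacity_mw"] / merged.groupby("boiler_id")[
        "capacity_mw"
    ].transform("sum")
    merged["fuel_consumed_mmbtu"] = merged["fuel_consumed_mmbtu"] * cap_share
    # A generator fed by several boilers receives the sum of its shares.
    return merged.groupby("generator_id", as_index=False)["fuel_consumed_mmbtu"].sum()


bf = pd.DataFrame({"boiler_id": ["1"], "fuel_consumed_mmbtu": [1000.0]})
bga = pd.DataFrame({"boiler_id": ["1", "1"], "generator_id": ["A", "B"]})
gens = pd.DataFrame({"generator_id": ["A", "B"], "capacity_mw": [25.0, 75.0]})
fuel_by_gen = allocate_boiler_fuel_sketch(bf, bga, gens)
print(fuel_by_gen)
```

Here generator A receives 250 MMBTU and generator B receives 750 MMBTU, matching the 25% and 75% split described above.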
- pudl.analysis.allocate_gen_fuel.warn_if_missing_pms(gens: pandas.DataFrame) None [source]¶
Log warning if there are too many null prime_mover_code values.
Warn if prime mover codes in gens do not match the codes in the gf table. This is something that should probably be fixed in the input data; see https://github.com/catalyst-cooperative/pudl/issues/1585. Set a threshold and ignore 2001 because most errors are 2001 errors.
- pudl.analysis.allocate_gen_fuel._test_frac(gen_pm_fuel: pandas.DataFrame) pandas.DataFrame [source]¶
Check whether the frac values within each IDX_PM_ESC group add up to 1.
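A check of this kind might be sketched as below. The function name find_bad_frac_groups, the frac_col and tol parameters, and the choice to return the offending groups are all assumptions made for illustration. The actual function may report problems differently.

```python
import numpy as np
import pandas as pd

IDX_PM_ESC = ["report_date", "plant_id_eia", "energy_source_code", "prime_mover_code"]


def find_bad_frac_groups(
    gen_pm_fuel: pd.DataFrame, frac_col: str = "frac", tol: float = 1e-6
) -> pd.Series:
    """Return the per-group frac sums that are not close to 1 (illustrative)."""
    sums = gen_pm_fuel.groupby(IDX_PM_ESC)[frac_col].sum()
    return sums[~np.isclose(sums, 1.0, atol=tol)]


gen_pm_fuel = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2020-01-01"] * 4),
        "plant_id_eia": [1, 1, 2, 2],
        "energy_source_code": ["NG"] * 4,
        "prime_mover_code": ["CT"] * 4,
        "frac": [0.25, 0.75, 0.5, 0.4],  # plant 2 only sums to 0.9
    }
)
bad_groups = find_bad_frac_groups(gen_pm_fuel)
print(bad_groups)
```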
- pudl.analysis.allocate_gen_fuel._test_gen_pm_fuel_output(gen_pm_fuel: pandas.DataFrame, gf: pandas.DataFrame, gen: pandas.DataFrame) pandas.DataFrame [source]¶
- pudl.analysis.allocate_gen_fuel.test_gen_fuel_allocation(gen: pandas.DataFrame, net_gen_alloc: pandas.DataFrame, ratio: float = 0.05) None [source]¶
Does the allocated MWh differ from the granular core_eia923__monthly_generation?
- Parameters:
gen – the core_eia923__monthly_generation table.
net_gen_alloc – the allocated net generation at the IDX_PM_ESC level
ratio – the tolerance
- pudl.analysis.allocate_gen_fuel.test_original_gf_vs_the_allocated_by_gens_gf(gf: pandas.DataFrame, gf_allocated: pandas.DataFrame, data_columns: list[str] = DATA_COLUMNS, by: list[str] = ['year', 'plant_id_eia'], acceptance_threshold: float = 0.07) pandas.DataFrame [source]¶
Test whether the allocated data and original data sum up to similar values.
- Raises:
AssertionError – If the number of plant/years that are off by more than 5% is not within acceptable level of tolerance.
AssertionError – If the difference between the allocated and original data for any plant/year is off by more than x10 or x-5.