pudl.analysis.allocate_net_gen

Allocate data from the generation_fuel_eia923 table to the generator level.

The algorithm we’re using assumes the following about the reported data:

  • The generation_fuel_eia923 table is the authoritative source of information about how much generation and fuel consumption is attributable to an entire plant. This table has the most complete data coverage, but it is not the most granular data reported. Its primary keys are IDX_PM_ESC.

  • The generation_eia923 table contains the most granular net generation data. It is reported at the generator level with primary keys IDX_GENS. This table includes only ~39% of the total MWhs reported in the generation_fuel_eia923 table.

  • The boiler_fuel_eia923 table contains the most granular fuel consumption data. It is reported at the boiler/prime mover/energy source level with primary keys IDX_B_PM_ESC. This table includes only ~38% of the total MMBTUs reported in the generation_fuel_eia923 table.

  • The generators_eia860 table provides an exhaustive list of all generators whose generation is being reported in the generation_fuel_eia923 table - with primary keys IDX_GENS.

This module allocates the total net electricity generation and fuel consumption reported in the generation_fuel_eia923 table to individual generators, based on more granular data reported in the generation_eia923 and boiler_fuel_eia923 tables, as well as capacity (MW) found in the generators_eia860 table. It uses other generator attributes from the generators_eia860 table to associate the data found in the generation_fuel_eia923 table with generators. It also uses the associations between boilers and generators found in the boiler_generator_assn_eia860 table to aggregate data from the boiler_fuel_eia923 table. The main coordinating functions here are allocate_gen_fuel_by_generator_energy_source() and aggregate_gen_fuel_by_generator().

Some definitions:

  • Data columns refers to the net generation and fuel consumption - the specific columns are defined in DATA_COLUMNS.

  • Granular tables refers to generation_eia923 and boiler_fuel_eia923, which report granular data but do not have complete coverage.

There are six main stages of the allocation process in this module:

  1. Read inputs: Read denormalized net generation and fuel consumption data from the PUDL DB and standardize data reporting frequency. (See extract_input_tables() and standardize_input_frequency()).

  2. Associate inputs: Merge data columns from the input tables described above on the basis of their shared primary key columns, producing an output with primary key IDX_GENS_PM_ESC. This broadcasts many data values across multiple rows for use in the allocation process below (see associate_generator_tables()).

  3. Flag associated inputs: For each record in the associated inputs, add boolean flags that separately indicate whether the generation and fuel consumption in that record are directly reported in the granular tables. This lets us choose an appropriate data allocation method based on how complete the granular data coverage is for a given value of IDX_PM_ESC, which is the original primary key of the generation_fuel_eia923 table (see prep_alloction_fraction()).

  4. Allocate: Allocate the net generation and fuel consumption reported in the less granular generation_fuel_eia923 table to the IDX_GENS_PM_ESC level. More details on the allocation process are below (see allocate_net_gen_by_gen_esc() and allocate_fuel_by_gen_esc()).

  5. Sanity check allocation: Verify that the total allocated net generation and fuel consumption within each plant is equal to the total of the originally reported values within some tolerance (see test_original_gf_vs_the_allocated_by_gens_gf()). Warn if assumptions about the data and the outputs aren’t met (see warn_if_missing_pms(), _test_frac(), test_gen_fuel_allocation() and _test_gen_pm_fuel_output()).

  6. Aggregate outputs: Aggregate the allocated net generation and fuel consumption to the generator level, going from having primary keys of IDX_GENS_PM_ESC to IDX_GENS (see aggregate_gen_fuel_by_generator()).

High-level description of the allocation step:

We allocate the data columns reported in the generation_fuel_eia923 table on the basis of plant, prime mover, and energy source among the generators in each plant that have matching energy sources.

We group the associated data columns by IDX_PM_ESC and categorize each resulting group of generators based on whether ALL, SOME, or NONE of them reported data in the granular tables. This is done for both the net generation and fuel consumption since the same generator may have reported differently in its respective granular table.

In more detail, within each reporting period, we split the plants into three groups:

  • The ALL Coverage Records: where ALL generators report in the granular tables.

  • The NONE Coverage Records: where NONE of the generators report in the granular tables.

  • The SOME Coverage Records: where only SOME of the generators report in the granular tables.
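The grouping above can be illustrated with a minimal plain-Python sketch. It is not the module's implementation (which works on pandas DataFrames). The record layout, the `coverage_by_group` helper, and the sample values are all hypothetical, and report_date is omitted from the group key for brevity.

```python
from collections import defaultdict

# Hypothetical records: (plant_id_eia, prime_mover_code, energy_source_code,
# generator_id, reported_in_granular_table). report_date is left out here.
records = [
    (1, "ST", "NG", "A", True),
    (1, "ST", "NG", "B", True),
    (2, "GT", "NG", "A", True),
    (2, "GT", "NG", "B", False),
    (3, "IC", "DFO", "A", False),
]


def coverage_by_group(rows):
    """Label each plant/prime mover/energy source group ALL, SOME or NONE."""
    flags = defaultdict(list)
    for plant, pm, esc, _gen, reported in rows:
        flags[(plant, pm, esc)].append(reported)
    labels = {}
    for key, reported_flags in flags.items():
        if all(reported_flags):
            labels[key] = "ALL"
        elif any(reported_flags):
            labels[key] = "SOME"
        else:
            labels[key] = "NONE"
    return labels


coverage = coverage_by_group(records)
```

In the module this categorization is done separately for net generation and for fuel consumption, so the same group may land in different categories for the two data columns.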

In the ALL generators case, the data columns reported in the generation_fuel_eia923 table are allocated in proportion to data reported in the granular data tables. We do this instead of directly using the data columns from the granular tables because there are discrepancies between the generation_fuel_eia923 table and the granular tables and we are assuming the totals reported in the generation_fuel_eia923 table are authoritative.

In the NONE generators case, the data columns reported in the generation_fuel_eia923 table are allocated in proportion to each generator’s capacity.

In the SOME generators case, we use a combination of the two allocation methods described above. First, the data columns reported in the generation_fuel_eia923 table are allocated between the two categories of generators: those that report granular data, and those that don’t. The fraction allocated to each of those categories is based on how much of the total is reported in the granular tables. If T is the total reported, and X is the quantity reported in the granular tables, then the allocation is X/T to the generators reporting granular data, and (T-X)/T to the generators not reporting granular data. Within each of those categories the allocation then follows the ALL or NONE allocation methods described above.
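As a worked illustration of the three cases, here is a small sketch for a single IDX_PM_ESC group. The `allocate_group` function and its input layout are hypothetical and are not part of the module. The sketch assumes positive values and, in the SOME case, that the granular total X does not exceed the reported total T.

```python
def allocate_group(total, gens):
    """Allocate `total` among the generators of one IDX_PM_ESC group.

    `gens` maps generator_id -> (granular_value, capacity_mw), where
    granular_value is None for generators absent from the granular table.
    """
    reporting = {g: val for g, (val, _cap) in gens.items() if val is not None}
    silent = {g: cap for g, (val, cap) in gens.items() if val is None}
    granular_total = sum(reporting.values())  # X in the text
    capacity_total = sum(silent.values())

    if not silent:  # ALL: everything goes to the reporting generators
        share_reporting = 1.0
    elif not reporting:  # NONE: everything is split by capacity
        share_reporting = 0.0
    else:  # SOME: X / T to reporters, (T - X) / T to the rest
        share_reporting = granular_total / total

    allocated = {}
    for g, val in reporting.items():
        # Within the reporting category, split in proportion to granular data.
        allocated[g] = total * share_reporting * val / granular_total
    for g, cap in silent.items():
        # Within the non-reporting category, split in proportion to capacity.
        allocated[g] = total * (1.0 - share_reporting) * cap / capacity_total
    return allocated


# SOME case: T = 100 and X = 20 + 30 = 50, so half goes to each category.
some = allocate_group(
    100.0,
    {"A": (20.0, 50.0), "B": (30.0, 50.0), "C": (None, 10.0), "D": (None, 30.0)},
)
# ALL case: granular values sum to 120, but the authoritative total is 100.
all_reported = allocate_group(100.0, {"A": (30.0, 50.0), "B": (90.0, 50.0)})
```

In the ALL example the granular values are only used as weights: the generators receive 25 and 75, which sum to the 100 reported in generation_fuel_eia923 rather than to the 120 in the granular table.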

Known Drawbacks of this methodology:

Note that this methodology does not distinguish between primary and secondary energy_sources for generators. It associates portions of net generation with each generator. If generators in the same plant do not report detailed generation, have the same prime_mover_code, and use the same fuels, but have very different capacity factors in reality, this methodology will allocate generation such that they end up with very similar capacity factors. We imagine this is an uncommon scenario.

This methodology has several potential flaws and drawbacks. Because there is no indicator of what portion of a generator’s output is attributable to each of its energy_source_codes, we associate the net generation equally among them. In effect, if a plant had multiple generators with the same prime_mover_code but opposite primary and secondary fuels (e.g. gen 1 has a primary fuel of ‘NG’ and secondary fuel of ‘DFO’, while gen 2 has a primary fuel of ‘DFO’ and a secondary fuel of ‘NG’), the methodology associates the generation_fuel_eia923 records similarly across these two generators. However, the allocated net generation will still be proportional to each generator’s net generation (if it’s reported) or capacity (if generation is not reported).

Module Contents

Functions

allocate_gen_fuel_by_generator_energy_source(pudl_out)

Allocate net gen from gen_fuel table to the generator/energy_source_code level.

aggregate_gen_fuel_by_generator(→ pandas.DataFrame)

Aggregate gen fuel data columns to generators.

extract_input_tables(pudl_out)

Extract the input tables from the pudl_out object.

standardize_input_frequency(bf, gens, gen, freq)

Standardize the frequency of the input tables.

scale_allocated_net_gen_by_ownership(→ pandas.DataFrame)

Scale allocated net gen at the generator/energy_source_code level by ownership.

agg_by_generator(→ pandas.DataFrame)

Aggregate the allocated gen fuel data to the generator level.

stack_generators(gens[, cat_col, stacked_col])

Stack the generator table with a set of columns.

associate_generator_tables(→ pandas.DataFrame)

Associate the three tables needed to assign net gen and fuel to generators.

remove_inactive_generators(→ pandas.DataFrame)

Remove the retired generators.

identify_retiring_generators(gen_assoc)

Identify any generators that retire mid-year.

identify_retired_plants(gen_assoc)

Identify entire plants that have previously retired but are reporting data.

identify_generators_coming_online(gen_assoc)

Identify generators that are coming online mid-year.

identify_proposed_plants(gen_assoc)

Identify entirely new plants that are proposed but are already reporting data.

_allocate_unassociated_records(→ pandas.DataFrame)

Associate unassociated gen_fuel table records on idx_cols.

prep_alloction_fraction(→ pandas.DataFrame)

Prepare the associated generators for allocation.

allocate_net_gen_by_gen_esc(→ pandas.DataFrame)

Allocate net generation to generators/energy_source_code via three methods.

allocate_fuel_by_gen_esc(→ pandas.DataFrame)

Allocate fuel_consumption to generators/energy_source_code via three methods.

remove_aggregated_sentinel_value(→ pandas.Series)

Replace the post-aggregation sentinel values in a column with zero.

group_duplicate_keys(→ pandas.DataFrame)

Catches duplicate keys in the allocated data and groups them together.

distribute_annually_reported_data_to_months_if_annual(...)

Allocates annually-reported data from the gen or bf table to each month.

manually_fix_energy_source_codes(→ pandas.DataFrame)

Patch: reassigns fuel codes in the gf table that don't match the fuel code in the gens table.

adjust_energy_source_codes(→ pandas.DataFrame)

Adjusts MSW codes.

allocate_bf_data_to_gens(→ pandas.DataFrame)

Allocates boiler fuel data to the generator level.

warn_if_missing_pms(gens)

Log warning if there are too many null prime_mover_code values.

_test_frac(gen_pm_fuel)

Check whether the frac values in each IDX_PM_ESC group add up to 1.

_test_gen_pm_fuel_output(gen_pm_fuel, gf, gen)

test_gen_fuel_allocation(gen, net_gen_alloc[, ratio])

Does the allocated MWh differ from the granular generation_eia923?

test_original_gf_vs_the_allocated_by_gens_gf(...)

Test whether the allocated data and original data sum up to similar values.

Attributes

logger

IDX_GENS

Primary key columns for generator records.

IDX_GENS_PM_ESC

Primary key columns for plant, generator, prime mover & energy source records.

IDX_PM_ESC

Primary key columns for plant, prime mover & energy source records.

IDX_B_PM_ESC

Primary key columns for plant, boiler, prime mover & energy source records.

IDX_ESC

Primary key columns for plant & energy source records.

IDX_UNIT_ESC

Primary key columns for plant, energy source & unit records.

DATA_COLUMNS

Data columns from generation_fuel_eia923 that are being allocated.

MISSING_SENTINEL

A sentinel value for dealing with null or zero values.

pudl.analysis.allocate_net_gen.logger[source]#
pudl.analysis.allocate_net_gen.IDX_GENS = ['report_date', 'plant_id_eia', 'generator_id'][source]#

Primary key columns for generator records.

pudl.analysis.allocate_net_gen.IDX_GENS_PM_ESC = ['report_date', 'plant_id_eia', 'generator_id', 'prime_mover_code', 'energy_source_code'][source]#

Primary key columns for plant, generator, prime mover & energy source records.

pudl.analysis.allocate_net_gen.IDX_PM_ESC = ['report_date', 'plant_id_eia', 'energy_source_code', 'prime_mover_code'][source]#

Primary key columns for plant, prime mover & energy source records.

pudl.analysis.allocate_net_gen.IDX_B_PM_ESC = ['report_date', 'plant_id_eia', 'boiler_id', 'energy_source_code', 'prime_mover_code'][source]#

Primary key columns for plant, boiler, prime mover & energy source records.

pudl.analysis.allocate_net_gen.IDX_ESC = ['report_date', 'plant_id_eia', 'energy_source_code'][source]#

Primary key columns for plant & energy source records.

pudl.analysis.allocate_net_gen.IDX_UNIT_ESC = ['report_date', 'plant_id_eia', 'energy_source_code', 'unit_id_pudl'][source]#

Primary key columns for plant, energy source & unit records.

pudl.analysis.allocate_net_gen.DATA_COLUMNS = ['net_generation_mwh', 'fuel_consumed_mmbtu', 'fuel_consumed_for_electricity_mmbtu'][source]#

Data columns from generation_fuel_eia923 that are being allocated.

pudl.analysis.allocate_net_gen.MISSING_SENTINEL = 1e-05[source]#

A sentinel value for dealing with null or zero values.

  1. Zeros in the relevant data columns get filled in with the sentinel value in associate_generator_tables(). At this stage, all of the zeros from the original data are associated with generators, prime mover codes and energy source codes.

  2. All of the nulls in the relevant data columns are filled with the sentinel value in prep_alloction_fraction(). (Could this also be done in associate_generator_tables()?)

  3. After the allocation of net generation and fuel (within allocate_net_gen_by_gen_esc() and allocate_fuel_by_gen_esc() via remove_aggregated_sentinel_value()), convert all of the aggregated values that are between 0 and twenty times this sentinel value back to zeros. This is meant to find all instances of aggregated sentinel values. We avoid any negative values because there are instances of negative original values - especially negative net generation.
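The sentinel round trip described in these three steps can be sketched as follows. The helper names are hypothetical and are not the module's functions. Whether the bounds of the "between 0 and twenty times" check are inclusive is not stated here, so the strict comparison below is an assumption.

```python
MISSING_SENTINEL = 1e-05  # the value documented for this module


def fill_with_sentinel(values):
    """Replace zeros and nulls (None) with the sentinel value."""
    return [MISSING_SENTINEL if v is None or v == 0 else v for v in values]


def remove_aggregated_sentinel(value, scalar=20.0):
    """Map small positive sums of sentinels back to zero; keep negatives."""
    if 0 < value < scalar * MISSING_SENTINEL:
        return 0.0
    return value


filled = fill_with_sentinel([0, None, 0])
total = sum(filled)  # three sentinels summed during aggregation
cleaned = remove_aggregated_sentinel(total)
```

Filling with a small positive number rather than zero keeps the allocation fractions defined when every value in a group is zero or missing, and the final step removes the resulting residue without touching genuinely negative values.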

pudl.analysis.allocate_net_gen.allocate_gen_fuel_by_generator_energy_source(pudl_out, drop_interim_cols: bool = True)[source]#

Allocate net gen from gen_fuel table to the generator/energy_source_code level.

Three main steps here:
  • grab the three input tables from pudl_out with only the needed columns

  • associate generation_fuel_eia923 table data w/ generators

  • allocate generation_fuel_eia923 table data proportionally

The association process happens via associate_generator_tables().

The allocation process (via allocate_net_gen_by_gen_esc()) entails generating a fraction for each record within an IDX_PM_ESC group. We have two data points for generating this ratio: the net generation in the generation_eia923 table and the capacity from the generators_eia860 table. The end result is a frac column which is unique for each generator/prime_mover/fuel record and is used to allocate the associated net generation from the generation_fuel_eia923 table.

Parameters:
  • pudl_out (pudl.output.pudltabl.PudlTabl) – An object used to create the tables for EIA and FERC Form 1 analysis.

  • drop_interim_cols – True/False flag for dropping interim columns which are used to generate the net_generation_mwh column (they are mostly the frac column and net generation reported in the original generation_eia923 and generation_fuel_eia923 tables) that are useful for debugging. Default is True, which will drop the columns.

pudl.analysis.allocate_net_gen.aggregate_gen_fuel_by_generator(pudl_out: pudl.output.pudltabl.PudlTabl, net_gen_fuel_alloc: pandas.DataFrame, sum_cols: list[str] = DATA_COLUMNS) pandas.DataFrame[source]#

Aggregate gen fuel data columns to generators.

The generation_fuel_eia923 table includes net generation and fuel consumption data at the plant/energy source/prime mover level. The most granular level of plants that PUDL typically uses is at the plant/generator level. This function takes the plant/energy source code/prime mover level allocation, aggregates it to the generator level and then denormalizes it to make it more structurally in-line with the original generation_eia923 table (see pudl.output.eia923.denorm_generation_eia923()).

Parameters:
  • pudl_out – An object used to create the tables for EIA and FERC Form 1 analysis.

  • net_gen_fuel_alloc – table of allocated generation at the generator/prime mover/energy source level. Result of allocate_gen_fuel_by_generator_energy_source()

  • sum_cols – Data columns that are being aggregated via a pandas.groupby.sum() in agg_by_generator

Returns:

table with columns IDX_GENS and net generation and fuel consumption scaled to the level of the IDX_GENS.

pudl.analysis.allocate_net_gen.extract_input_tables(pudl_out: pudl.output.pudltabl.PudlTabl)[source]#

Extract the input tables from the pudl_out object.

Extract all of the tables from pudl_out early in the process and select only the columns we need.

Parameters:

pudl_out – instantiated pudl output object.

pudl.analysis.allocate_net_gen.standardize_input_frequency(bf: pandas.DataFrame, gens: pandas.DataFrame, gen: pandas.DataFrame, freq: Literal[MS, MS])[source]#

Standardize the frequency of the input tables.

Employ distribute_annually_reported_data_to_months_if_annual() on the boiler fuel and generation table. Employ pudl.helpers.expand_timeseries() on the generators table. Also use the expanded generators table to ensure the generation table has all of the generators present.

pudl.analysis.allocate_net_gen.scale_allocated_net_gen_by_ownership(gen_pm_fuel: pandas.DataFrame, gens: pandas.DataFrame, own_eia860: pandas.DataFrame) pandas.DataFrame[source]#

Scale allocated net gen at the generator/energy_source_code level by ownership.

It can be helpful to have a table of net generation and fuel consumption at the generator/fuel-type level (i.e. the result of allocate_gen_fuel_by_generator_energy_source()) to be associated and scaled with all of the owners of those generators. This allows the aggregation of fuel use to the utility level.

Scaling generators with their owners’ ownership fraction is currently possible via pudl.analysis.plant_parts_eia.MakeMegaGenTbl. This function uses the allocated net generation at the generator/fuel-type level, merges that with a generators table to ensure all necessary columns are available, and then feeds that table into the EIA Plant-parts’ scale_by_ownership().
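To make the idea of ownership scaling concrete, here is a toy sketch. It is not the scale_by_ownership() implementation referenced above. The `scale_by_owner` helper and the owner labels are hypothetical, and the ownership fractions are assumed to sum to 1.

```python
def scale_by_owner(allocated_value, ownership_fractions):
    """Split one generator's allocated value among its owners.

    `ownership_fractions` maps an owner identifier to the fraction of the
    generator that it owns (assumed to sum to 1).
    """
    return {
        owner: allocated_value * fraction
        for owner, fraction in ownership_fractions.items()
    }


# A generator allocated 1000 MWh, owned 75% / 25% by two utilities.
scaled = scale_by_owner(1000.0, {"utility_1": 0.75, "utility_2": 0.25})
```

Summing the scaled values by owner across all generators would then give the utility-level totals that this function is meant to enable.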

Parameters:
  • gen_pm_fuel – table of allocated generation at the generator/prime mover/energy source level. Result of allocate_gen_fuel_by_generator_energy_source()

  • gens – generators_eia860 table with cols: IDX_GENS, capacity_mw and utility_id_eia

  • own_eia860 – ownership_eia860 table.

pudl.analysis.allocate_net_gen.agg_by_generator(net_gen_fuel_alloc: pandas.DataFrame, by_cols: list[str] = IDX_GENS, sum_cols: list[str] = DATA_COLUMNS) pandas.DataFrame[source]#

Aggregate the allocated gen fuel data to the generator level.

pudl.analysis.allocate_net_gen.stack_generators(gens: pandas.DataFrame, cat_col: str = 'energy_source_code_num', stacked_col: str = 'energy_source_code')[source]#

Stack the generator table with a set of columns.

Parameters:
  • gens – generators_eia860 table with cols: IDX_GENS and all of the energy_source_code columns

  • cat_col – name of category column which will end up having the column names of cols_to_stack

  • stacked_col – name of column which will end up with the stacked data from cols_to_stack

Returns:

a dataframe with these columns: idx_stack, cat_col, stacked_col

Return type:

pandas.DataFrame

pudl.analysis.allocate_net_gen.associate_generator_tables(gens: pandas.DataFrame, gf: pandas.DataFrame, gen: pandas.DataFrame, bf: pandas.DataFrame, bga: pandas.DataFrame) pandas.DataFrame[source]#

Associate the three tables needed to assign net gen and fuel to generators.

The generation_fuel_eia923 table’s data is reported at the IDX_PM_ESC granularity. Each generator in the generators_eia860 table has one prime_mover_code, but potentially several energy_source_codes that are reported in several columns. We need to reshape the generators table such that each generator has a separate record corresponding to each of its reported energy_source_codes, so it can be merged with the generation_fuel_eia923 table. We do this using stack_generators(), employing pd.DataFrame.stack().

The stacked generators table has a primary key of ["plant_id_eia", "generator_id", "report_date", "energy_source_code"]. The table also includes the prime_mover_code column to enable merges with other tables, the capacity_mw column which we use to determine the allocation when there is no data in the granular data tables, and the operational_status column which we use to remove inactive plants from the association and allocation process.
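The reshaping can be pictured with a plain-Python sketch of the same idea. The module itself uses pd.DataFrame.stack(). The `stack_energy_sources` helper, the `energy_source_code_1`/`energy_source_code_2` column names, and the sample rows are assumptions for illustration, and report_date is omitted for brevity.

```python
# Hypothetical wide-format generator records with two energy source columns.
gens = [
    {"plant_id_eia": 1, "generator_id": "A", "prime_mover_code": "ST",
     "capacity_mw": 100.0, "energy_source_code_1": "NG",
     "energy_source_code_2": "DFO"},
    {"plant_id_eia": 1, "generator_id": "B", "prime_mover_code": "GT",
     "capacity_mw": 50.0, "energy_source_code_1": "NG",
     "energy_source_code_2": None},
]

ESC_PREFIX = "energy_source_code_"


def stack_energy_sources(rows):
    """Emit one record per generator per non-null energy source code."""
    stacked = []
    for row in rows:
        # Columns shared by every stacked record for this generator.
        base = {k: v for k, v in row.items() if not k.startswith(ESC_PREFIX)}
        for col, code in row.items():
            if col.startswith(ESC_PREFIX) and code is not None:
                stacked.append({**base, "energy_source_code": code})
    return stacked


stacked = stack_energy_sources(gens)
```

Generator A yields two records (NG and DFO) and generator B yields one, each carrying the prime_mover_code and capacity_mw needed for the later merges and allocation.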

The remaining data tables are all less granular than this stacked generators table and have varying primary keys. We add suffixes to the data columns in these data tables to identify the source table before broadcast merging these data columns into the stacked generators. This broadcasted data will be used later in the allocation process.

This function also removes inactive generators so that we don’t associate any net generation or fuel to those generators. See remove_inactive_generators() for more details.

There are some records in the data tables that have either prime_mover_codes or energy_source_codes that do not appear in the generators_eia860 table. We employ _allocate_unassociated_records() to make sure those records are associated.

Returns:

table of generators with stacked energy sources and broadcasted net generation and fuel data from the generation_eia923 and generation_fuel_eia923 tables. There are many duplicate values in this output which will later be used in the allocation process in allocate_net_gen_by_gen_esc() and allocate_fuel_by_gen_esc().

pudl.analysis.allocate_net_gen.remove_inactive_generators(gen_assoc: pandas.DataFrame) pandas.DataFrame[source]#

Remove the retired generators.

We don’t want to associate and later allocate net generation or fuel to generators that are retired (or proposed! or any other operational_status besides existing). However, we do want to keep the generators that report operational statuses other than existing but which report non-zero data despite being retired or proposed. This includes several categories of generators/plants:

  • retiring_generators: generators that retire mid-year

  • retired_plants: entire plants that supposedly retired prior to the current year but which report data. If a plant has a mix of gens which are existing and retired, they are not included in this category.

  • proposed_generators: generators that become operational mid-year, or which are marked as proposed but start reporting non-zero data

  • proposed_plants: entire plants that have a proposed status but which start reporting data. If a plant has a mix of gens which are existing and proposed, they are not included in this category.

When we do not have generator-specific generation for a proposed/retired generator that is not coming online/retiring mid-year, we can also look at whether there is generation reported for this generator in the gf table. However, if a proposed/retired generator is part of an existing plant, it is possible that the reported generation from the gf table belongs to one of the other existing generators. Thus, we want to only keep proposed/retired generators where the entire plant is proposed/retired (in which case the gf-reported generation could only come from one of the new/retired generators).

We also want to keep unassociated plants that have no generator_id which will be associated via _allocate_unassociated_records().

Parameters:

gen_assoc – table of generators with stacked energy sources and broadcasted net generation data from the generation_eia923 and generation_fuel_eia923 tables. Output of associate_generator_tables().

pudl.analysis.allocate_net_gen.identify_retiring_generators(gen_assoc)[source]#

Identify any generators that retire mid-year.

These are generators with a retirement date after the earliest report_date or which report generator-specific generation data in the g table after their retirement date.

pudl.analysis.allocate_net_gen.identify_retired_plants(gen_assoc)[source]#

Identify entire plants that have previously retired but are reporting data.

pudl.analysis.allocate_net_gen.identify_generators_coming_online(gen_assoc)[source]#

Identify generators that are coming online mid-year.

These are defined as generators that have a proposed status but which report generator-specific generation data in the g table.

pudl.analysis.allocate_net_gen.identify_proposed_plants(gen_assoc)[source]#

Identify entirely new plants that are proposed but are already reporting data.

pudl.analysis.allocate_net_gen._allocate_unassociated_records(gen_assoc: pandas.DataFrame, idx_cols: list[str], col_w_unexpected_codes: Literal[energy_source_code, prime_mover_code], data_columns: list[str]) pandas.DataFrame[source]#

Associate unassociated gen_fuel table records on idx_cols.

There is a subset of generation_fuel_eia923 or boiler_fuel_eia923 records which do not merge onto the stacked generator table on IDX_PM_ESC or IDX_GENS_PM_ESC respectively. These records generally don’t match with the set of prime movers and energy sources in the stacked generator table. In this method, we associate those straggler, unassociated records by merging these records with the stacked generators without the un-matching data column.

Parameters:
  • gen_assoc – generators associated with data.

  • idx_cols – ID columns (includes col_w_unexpected_codes)

  • col_w_unexpected_codes – name of the column which has codes in it that were not found in the generators table.

  • data_columns – the data columns to associate and allocate.

pudl.analysis.allocate_net_gen.prep_alloction_fraction(gen_assoc: pandas.DataFrame) pandas.DataFrame[source]#

Prepare the associated generators for allocation.

Make flags and aggregations to prepare for the allocate_net_gen_by_gen_esc() and allocate_fuel_by_gen_esc() functions.

In allocate_net_gen_by_gen_esc(), we will break the generators out into three types - see allocate_net_gen_by_gen_esc() docs for details. This function adds flags for splitting the generators.

Parameters:

gen_assoc – a table of generators that have been associated with energy sources, prime movers and boilers - result of associate_generator_tables()

pudl.analysis.allocate_net_gen.allocate_net_gen_by_gen_esc(gen_pm_fuel: pandas.DataFrame) pandas.DataFrame[source]#

Allocate net generation to generators/energy_source_code via three methods.

There are three main types of generators:
  • “all gen”: generators of plants which fully report to the generation_eia923 table. This includes records that report more MWh to the generation_eia923 table than to the generation_fuel_eia923 table (if we did not include these records, the ).

  • “some gen”: generators of plants which partially report to the generation_eia923 table.

  • “gf only”: generators of plants which do not report at all to the generation_eia923 table.

Each different type of generator needs to be treated slightly differently, but all will end up with a frac column that can be used to allocate the net_generation_mwh_gf_tbl.

Parameters:

gen_pm_fuel – output of prep_alloction_fraction().

pudl.analysis.allocate_net_gen.allocate_fuel_by_gen_esc(gen_pm_fuel: pandas.DataFrame) pandas.DataFrame[source]#

Allocate fuel_consumption to generators/energy_source_code via three methods.

There are three main types of generators:

  • “all bf”: generators of plants which fully report to the boiler_fuel_eia923 table.

  • “some bf”: generators of plants which partially report to the boiler_fuel_eia923 table.

  • “gf only”: generators of plants which do not report at all to the boiler_fuel_eia923 table.

Each different type of generator needs to be treated slightly differently, but all will end up with a frac column that can be used to allocate the fuel_consumed_mmbtu_gf_tbl.

Parameters:

gen_pm_fuel – output of prep_alloction_fraction().

pudl.analysis.allocate_net_gen.remove_aggregated_sentinel_value(col: pandas.Series, scalar: float = 20.0) pandas.Series[source]#

Replace the post-aggregation sentinel values in a column with zero.

pudl.analysis.allocate_net_gen.group_duplicate_keys(df: pandas.DataFrame) pandas.DataFrame[source]#

Catches duplicate keys in the allocated data and groups them together.

Merging net_gen_alloc and fuel_alloc together requires unique keys in each df. Sometimes the allocation process creates duplicate keys. This function identifies when this happens, and aggregates the data on these keys to remove the duplicates.
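A minimal sketch of this de-duplication, assuming the aggregation is a sum over the data columns. The `sum_duplicate_keys` helper and the sample rows are hypothetical, and the module's own function operates on a pandas DataFrame.

```python
from collections import defaultdict


def sum_duplicate_keys(rows, key_cols, data_cols):
    """Collapse rows that share a key by summing their data columns."""
    sums = defaultdict(lambda: dict.fromkeys(data_cols, 0.0))
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        for col in data_cols:
            sums[key][col] += row[col]
    # Rebuild one row per unique key, in first-seen order.
    return [dict(zip(key_cols, key), **vals) for key, vals in sums.items()]


rows = [
    {"plant_id_eia": 1, "generator_id": "A", "net_generation_mwh": 10.0},
    {"plant_id_eia": 1, "generator_id": "A", "net_generation_mwh": 5.0},
    {"plant_id_eia": 1, "generator_id": "B", "net_generation_mwh": 7.0},
]
deduped = sum_duplicate_keys(
    rows, ["plant_id_eia", "generator_id"], ["net_generation_mwh"]
)
```

After this step each key appears exactly once, which is the precondition for merging the net generation and fuel allocations one-to-one.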

pudl.analysis.allocate_net_gen.distribute_annually_reported_data_to_months_if_annual(df: pandas.DataFrame, key_columns: list[str], data_column_name: str, freq: Literal[AS, MS]) pandas.DataFrame[source]#

Allocates annually-reported data from the gen or bf table to each month.

Certain plants only report data to the generator table and boiler fuel table on an annual basis. In these cases, their annual total is reported as a single value in January or December, and the other 11 months are reported as missing values. This function first identifies which plants are annual respondents by identifying plants that have 11 months of missing data, with the one month of existing data being in January or December. This is an assumption based on seeing that over 40% of the plants that have 11 months of missing data report their one month of data in January and December (this ratio of reporting is checked and will raise a warning if it becomes untrue). It then distributes this annually-reported value evenly across all months in the year. Because we know some of the plants are reporting in only one month that is not January or December, the assumption about January and December only reporting is almost certainly resulting in some non-annual data being allocated across all months, but on average the data will be more accurate.

Note: We should be able to use the reporting_frequency_code column for the identification of annually reported data. This currently does not work because we assumed this was a plant-level annual attribute (and is thus stored in the plants_eia860 table). See Issue #1933.
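The identification rule and the even spread can be sketched for a single plant-year as follows. The `distribute_if_annual` helper and its list-based input are hypothetical simplifications, and the reporting-ratio check and warning mentioned above are not reproduced.

```python
def distribute_if_annual(monthly):
    """Spread a lone January or December value evenly over the year.

    `monthly` is a list of 12 values, January first, with None marking
    missing months. Any other reporting pattern is returned unchanged.
    """
    present = [i for i, value in enumerate(monthly) if value is not None]
    if len(present) == 1 and present[0] in (0, 11):
        return [monthly[present[0]] / 12] * 12
    return list(monthly)


annual = distribute_if_annual([None] * 11 + [1200.0])    # December only
june_only = distribute_if_annual([None] * 5 + [600.0] + [None] * 6)
```

A record reporting only in June is left alone, which reflects the assumption in the text that a single value outside January or December is not treated as an annual total.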

Parameters:
  • df – a pandas dataframe, either loaded from pudl_out.gen_original_eia923() or pudl_out.bf_eia923()

  • key_columns – a list of the primary key column names, either ["plant_id_eia","boiler_id","energy_source_code"] or ["plant_id_eia","generator_id"]

  • data_column_name – the name of the data column to allocate, either “net_generation_mwh” or “fuel_consumed_mmbtu” depending on the df specified

  • freq – frequency of input df. Must be either AS or MS.

Returns:

df with the annually reported values allocated to each month

pudl.analysis.allocate_net_gen.manually_fix_energy_source_codes(gf: pandas.DataFrame) pandas.DataFrame[source]#

Patch: reassigns fuel codes in the gf table that don’t match the fuel code in the gens table.

pudl.analysis.allocate_net_gen.adjust_energy_source_codes(gens: pandas.DataFrame, gf: pandas.DataFrame, bf_by_gens: pandas.DataFrame) pandas.DataFrame[source]#

Adjusts MSW codes.

Adjust the MSW codes in gens to match those used in gf and bf.

In recent years, EIA-923 started splitting out the MSW (municipal_solid_waste) into its constituent components MSB (municipal_solid_waste_biogenic) and MSN (municipal_solid_nonbiogenic). However, the EIA-860 Generators table still only uses the MSW code.

This function identifies which MSW codes are used in the gf and bf tables and creates records to match these.

pudl.analysis.allocate_net_gen.allocate_bf_data_to_gens(bf: pandas.DataFrame, gens: pandas.DataFrame, bga: pandas.DataFrame) pandas.DataFrame[source]#

Allocates boiler fuel data to the generator level.

Distributes boiler-level data from boiler_fuel_eia923 to the generator level based on the boiler-generator association table and the nameplate capacity of the connected generators.

Because fuel consumption in the boiler_fuel_eia923 table is reported per boiler_id, we must first map this data to generators using the boiler_generator_assn_eia860 table. For boilers that have a 1:m or m:m relationship with generators, we allocate the reported fuel to each associated generator based on the nameplate capacity of each generator. So if boiler “1” was associated with generator A (25 MW) and generator B (75 MW), 25% of the fuel consumption would be allocated to generator A and 75% would be allocated to generator B.
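The 25 MW / 75 MW example above works out as in the following sketch. The `allocate_boiler_fuel` helper is hypothetical and handles a single boiler, and the 1000 MMBtu figure is made up for illustration.

```python
def allocate_boiler_fuel(fuel_mmbtu, capacities_mw):
    """Split one boiler's fuel among its generators by nameplate capacity."""
    total_capacity = sum(capacities_mw.values())
    return {
        gen_id: fuel_mmbtu * capacity / total_capacity
        for gen_id, capacity in capacities_mw.items()
    }


# Boiler "1" burned 1000 MMBtu and feeds generator A (25 MW) and B (75 MW).
boiler_alloc = allocate_boiler_fuel(1000.0, {"A": 25.0, "B": 75.0})
```

Generator A receives 250 MMBtu and generator B receives 750 MMBtu, matching the 25% and 75% shares described in the text.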

pudl.analysis.allocate_net_gen.warn_if_missing_pms(gens)[source]#

Log warning if there are too many null prime_mover_code values.

Warn if prime mover codes in gens do not match the codes in the gf table. This is something that should probably be fixed in the input data; see https://github.com/catalyst-cooperative/pudl/issues/1585. Set a threshold and ignore 2001, because most errors are 2001 errors.

pudl.analysis.allocate_net_gen._test_frac(gen_pm_fuel)[source]#

Check whether the frac values in each IDX_PM_ESC group add up to 1.
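A check of this kind might look like the sketch below. The `groups_with_bad_frac` helper, the tuple-based input, and the tolerance are assumptions, not the module's actual test.

```python
import math
from collections import defaultdict


def groups_with_bad_frac(rows, tol=1e-9):
    """Return the group keys whose frac values do not sum to 1.

    `rows` is an iterable of (group_key, frac) pairs.
    """
    totals = defaultdict(float)
    for key, frac in rows:
        totals[key] += frac
    return [
        key
        for key, total in totals.items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    ]


frac_rows = [
    ((1, "ST", "NG"), 0.25),
    ((1, "ST", "NG"), 0.75),
    ((2, "GT", "NG"), 0.5),
    ((2, "GT", "NG"), 0.4),  # this group only sums to 0.9
]
bad_groups = groups_with_bad_frac(frac_rows)
```

If the fractions in a group do not sum to 1, the allocated values for that group will not add back up to the total reported in generation_fuel_eia923.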

pudl.analysis.allocate_net_gen._test_gen_pm_fuel_output(gen_pm_fuel, gf, gen)[source]#
pudl.analysis.allocate_net_gen.test_gen_fuel_allocation(gen, net_gen_alloc, ratio=0.05)[source]#

Does the allocated MWh differ from the granular generation_eia923?

Parameters:
  • gen – the generation_eia923 table.

  • net_gen_alloc – the allocated net generation at the IDX_PM_ESC level

  • ratio – the tolerance

pudl.analysis.allocate_net_gen.test_original_gf_vs_the_allocated_by_gens_gf(gf: pandas.DataFrame, gf_allocated: pandas.DataFrame, data_columns: list[str] = DATA_COLUMNS, by: list[str] = ['year', 'plant_id_eia']) pandas.DataFrame[source]#

Test whether the allocated data and original data sum up to similar values.

Raises:
  • AssertionError – If the number of plant/years that are off by more than 5% is not within an acceptable level of tolerance.

  • AssertionError – If the difference between the allocated and original data for any plant/year is off by more than x10 or x-5.
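The first of these checks can be approximated with the sketch below, which lists the plant/years whose allocated total differs from the original by more than 5%. The `plant_years_off` helper and the dictionary inputs are hypothetical. How many such plant/years are acceptable is not stated here, so the sketch only reports them and does not raise.

```python
def plant_years_off(original, allocated, tolerance=0.05):
    """Return plant/year keys whose totals differ by more than `tolerance`.

    Both arguments map (year, plant_id_eia) to a summed data column.
    """
    off = []
    for key, original_total in original.items():
        allocated_total = allocated.get(key, 0.0)
        if original_total == 0:
            # Avoid dividing by zero: any non-zero allocation is a mismatch.
            if allocated_total != 0:
                off.append(key)
            continue
        if abs(allocated_total / original_total - 1.0) > tolerance:
            off.append(key)
    return off


original_totals = {(2020, 1): 100.0, (2020, 2): 200.0}
allocated_totals = {(2020, 1): 101.0, (2020, 2): 150.0}
off_keys = plant_years_off(original_totals, allocated_totals)
```

Here plant 1 is within tolerance (1% off) while plant 2 is flagged (25% off).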