`pudl.analysis.plant_parts_eia`

Aggregate plant parts to make an EIA master plant-part table.

Practically speaking, a plant is a collection of generator(s). Generators have many attributes (e.g. prime mover, primary fuel source, technology type). We can use these generator attributes to group generator records into larger aggregate records which we call “plant-parts”. A plant part is a record which corresponds to a particular collection of generators that all share an identical attribute. For example, all of the generators with unit_id=2, or all of the generators with coal as their primary fuel source.

The EIA data about power plants (from EIA 923 and 860) is reported in tables with records that correspond to mostly generators and plants. Other datasets (cough cough FERC1) are less well organized and include plants, generators and other plant-parts all in the same table without any clear labels. The master plant-part table is an attempt to create records corresponding to many different plant-parts in order to connect specific slices of EIA plants to other datasets.

Because generators are often owned by multiple utilities, another dimension of the master unit list involves generating two records for each owner: one for the portion of the plant part they own and one for the plant part as a whole. The portion records are labeled in the ownership column as “owned” and the total records are labeled as “total”.

This module refers to “true granularities”. Many plant parts we cobble together here in the master plant-part list refer to the same collection of infrastructure as other plant-part list records. For example, a “plant_prime_mover” plant part record and a “plant_unit” plant part record may both be cobbled together from the same two generators. We want to be able to reduce the plant-part list to only unique collections of generators, so we label the first unique granularity as a true granularity and label the subsequent records as false granularities with the true_gran column. In order to choose which plant-part to keep in these instances, we assigned an order in PLANT_PARTS_ORDERED and label whichever plant-part comes first as the unique granularity.
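The labeling idea can be sketched in a few lines of pandas. This is only an illustration under stated assumptions, not the module's implementation: the shortened parts_ordered list and the id_cols mapping are hypothetical stand-ins for PLANT_PARTS_ORDERED and PLANT_PARTS, and the example covers a single plant.

```python
import pandas as pd

# One plant: generators a and b form unit 1 and share a prime mover.
gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1],
    "generator_id": ["a", "b", "c"],
    "unit_id_pudl": [1, 1, 2],
    "prime_mover_code": ["GT", "GT", "ST"],
})

# Hypothetical, shortened priority order and id columns (illustration only).
parts_ordered = ["plant", "plant_unit", "plant_prime_mover"]
id_cols = {
    "plant": ["plant_id_eia"],
    "plant_unit": ["plant_id_eia", "unit_id_pudl"],
    "plant_prime_mover": ["plant_id_eia", "prime_mover_code"],
}

seen = set()
records = []
for part in parts_ordered:
    for _, grp in gens.groupby(id_cols[part]):
        members = frozenset(grp["generator_id"])
        # The first part (in priority order) that covers a given set of
        # generators is the true granularity; later duplicates are false.
        records.append({"plant_part": part, "true_gran": members not in seen})
        seen.add(members)

labels = pd.DataFrame(records)
```

Here both plant_prime_mover records are labeled as false granularities because the plant_unit records already cover the same generator sets.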

Recipe Book for the plant-part list

PLANT_PARTS is the main recipe book for how each of the plant-parts needs to be compiled. These plant-parts represent ways to group generators based on widely reported values in EIA. All of these are logical ways to group collections of generators - in most cases - but some groupings of generators are more prevalent or relevant than others for certain types of plants.

The canonical example here is the plant_unit. A unit is a collection of generators that operate together - most notably in combined-cycle natural gas plants. Combined-cycle units generally consist of a number of gas turbines whose exhaust heat is used to raise steam for a number of steam turbines.

>>> df_gens = pd.DataFrame({
...     'plant_id_eia': [1, 1, 1],
...     'generator_id': ['a', 'b', 'c'],
...     'unit_id_pudl': [1, 1, 1],
...     'prime_mover_code': ['CT', 'CT', 'CA'],
...     'capacity_mw': [50, 50, 100],
... })
>>> df_gens
    plant_id_eia    generator_id    unit_id_pudl    prime_mover_code    capacity_mw
0              1               a               1                  CT             50
1              1               b               1                  CT             50
2              1               c               1                  CA            100

A good example of a plant-part that isn’t really logical also comes from a combined-cycle unit. Grouping this example plant by the prime_mover_code would generate two records that would basically never show up in FERC1. This stems from the inseparability of the generators.

>>> df_plant_prime_mover = pd.DataFrame({
...     'plant_id_eia': [1, 1],
...     'plant_part': ['plant_prime_mover', 'plant_prime_mover'],
...     'prime_mover_code': ['CT', 'CA'],
...     'capacity_mw': [100, 100],
... })
>>> df_plant_prime_mover
    plant_id_eia         plant_part    prime_mover_code    capacity_mw
0              1  plant_prime_mover                  CT            100
1              1  plant_prime_mover                  CA            100

In this case the unit is more relevant:

>>> df_plant_unit = pd.DataFrame({
...     'plant_id_eia': [1],
...     'plant_part': ['plant_unit'],
...     'unit_id_pudl': [1],
...     'capacity_mw': [200],
... })
>>> df_plant_unit
    plant_id_eia    plant_part    unit_id_pudl    capacity_mw
0              1    plant_unit               1            200
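The two aggregated tables above can be reproduced from the three-generator example with a plain groupby. This is a minimal sketch of the aggregation idea only; the module itself also handles dates, ownership and operational status.

```python
import pandas as pd

df_gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1],
    "generator_id": ["a", "b", "c"],
    "unit_id_pudl": [1, 1, 1],
    "prime_mover_code": ["CT", "CT", "CA"],
    "capacity_mw": [50, 50, 100],
})

# Grouping by prime mover splits the combined-cycle unit into two records.
by_prime_mover = (
    df_gens.groupby(["plant_id_eia", "prime_mover_code"], as_index=False)["capacity_mw"]
    .sum()
    .assign(plant_part="plant_prime_mover")
)

# Grouping by unit keeps the inseparable generators together.
by_unit = (
    df_gens.groupby(["plant_id_eia", "unit_id_pudl"], as_index=False)["capacity_mw"]
    .sum()
    .assign(plant_part="plant_unit")
)
```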

But if this same plant had both this combined-cycle unit and two more generators that were self-contained “GT” (gas combustion turbine) generators, a frequent way to group these generators is different for the combined-cycle unit and the gas turbines.

>>> df_gens = pd.DataFrame({
...     'plant_id_eia': [1, 1, 1, 1, 1],
...     'generator_id': ['a', 'b', 'c', 'd', 'e'],
...     'unit_id_pudl': [1, 1, 1, 2, 3],
...     'prime_mover_code': ['CT', 'CT', 'CA', 'GT', 'GT'],
...     'capacity_mw': [50, 50, 100, 75, 75],
... })
>>> df_gens
    plant_id_eia    generator_id    unit_id_pudl    prime_mover_code    capacity_mw
0              1               a               1                  CT             50
1              1               b               1                  CT             50
2              1               c               1                  CA            100
3              1               d               2                  GT             75
4              1               e               3                  GT             75

>>> df_plant_part = pd.DataFrame({
...     'plant_id_eia': [1, 1],
...     'plant_part': ['plant_unit', 'plant_prime_mover'],
...     'unit_id_pudl': [1, pd.NA],
...     'prime_mover_code': [pd.NA, 'GT',],
...     'capacity_mw': [200, 150],
... })
>>> df_plant_part
    plant_id_eia           plant_part    unit_id_pudl    prime_mover_code    capacity_mw
0              1           plant_unit               1                <NA>            200
1              1    plant_prime_mover            <NA>                  GT            150

In this last case, the plant_unit record would have a null prime_mover_code because the unit contains more than one prime_mover_code. The same goes for the unit_id_pudl of the plant_prime_mover record. This is handled in AddConsistentAttributes.
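The "null unless consistent" rule can be sketched as follows. The helper consistent_attribute is hypothetical and written only for illustration; it is not the AddConsistentAttributes API.

```python
import pandas as pd

df_gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1, 1],
    "generator_id": ["a", "b", "c", "d", "e"],
    "unit_id_pudl": [1, 1, 1, 2, 3],
    "prime_mover_code": ["CT", "CT", "CA", "GT", "GT"],
})


def consistent_attribute(gens, id_cols, attribute_col):
    """Keep attribute_col only where every generator in the group agrees."""
    grouped = gens.groupby(id_cols)[attribute_col]
    # first() picks a representative value; where() nulls it out whenever
    # the group holds more than one distinct value.
    out = grouped.first().where(grouped.nunique() == 1)
    return out.reset_index()


unit_prime_mover = consistent_attribute(
    df_gens, ["plant_id_eia", "unit_id_pudl"], "prime_mover_code"
)
```

Unit 1 mixes CT and CA generators, so its prime_mover_code comes back null, while units 2 and 3 keep GT.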

Overview of flow for generating the master unit list:

The three main classes which enable the generation of the plant-part table are:

MakeMegaGenTbl: All of the plant parts are compiled from generators. So this class generates a big dataframe of generators with any ID and data columns we’ll need. This is also where we add records regarding utility ownership slices. The table includes two records for every generator-owner: one for the “total” generator (assuming the owner owns 100% of the generator) and one for the report ownership fraction of that generator with all of the data columns scaled to the ownership fraction.
LabelTrueGranularities: This class creates labels for all generators which note whether the plant-part records that will be compiled from each generator will be a “true granularity”, as described above.
MakePlantParts: This class uses the generator dataframe and the granularity dataframe from the above two classes, as well as the information stored in PLANT_PARTS, to know how to aggregate each of the plant parts. Then we have plant part dataframes with the columns which identify the plant part and all of the data columns aggregated to the level of the plant part. With that compiled plant part dataframe we also add in qualifier columns with AddConsistentAttributes. A qualifier column is a column which contains data that is not endemic to the plant part record (it is not one of the identifying columns or aggregated data columns) but is still useful data that is attributable to each of the plant part records. For more detail on what a qualifier column is, see AddConsistentAttributes.execute().

Generating the plant-parts list

There are two ways to generate the plant-parts table: one directly using the pudl.output.pudltabl.PudlTabl object and the other using the classes from this module. Either option needs a pudl.output.pudltabl.PudlTabl object.

Create the pudl.output.pudltabl.PudlTabl object:

import sqlalchemy as sa
import pudl
pudl_engine = sa.create_engine(pudl.workspace.setup.get_defaults()['pudl_db'])
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS')

Then make the table via pudl_out:

plant_parts_eia = pudl_out.plant_parts_eia()

OR make the table via objects in this module:

# mcoe and own_eia860 are the tables from pudl_out.mcoe() and pudl_out.own_eia860()
gens_mega = MakeMegaGenTbl().execute(mcoe, own_eia860)
true_grans = LabelTrueGranularities().execute(gens_mega)
parts_compiler = MakePlantParts(pudl_out)
plant_parts_eia = parts_compiler.execute(gens_mega=gens_mega, true_grans=true_grans)

Module Contents

Classes

`MakeMegaGenTbl`	Compiler for a MEGA generator table with ownership integrated.
`LabelTrueGranularities`	True Granularity Labeler.
`MakePlantParts`	Compile the plant parts for the master unit list.
`PlantPart`	Plant-part table maker.
`PartTrueGranLabeler`	Label a plant-part as a unique (or not) granularity.
`AddAttribute`	Base class for adding attributes to plant-part tables.
`AddConsistentAttributes`	Adder of attributes records to a plant-part table.
`AddPriorityAttribute`	Add Attributes based on a priority sorting from `PRIORITY_ATTRIBUTES`.

Functions

`validate_run_aggregations`(plant_parts_eia, gens_mega)	Run a test of the aggregated columns.
`_test_prep_merge`(part_name, plant_parts_eia, gens_mega)	Run the test groupby and merge with the aggregations.
`make_id_cols_list`()	Get a list of the id columns (primary keys) for all of the plant parts.
`make_parts_to_ids_dict`()	Make dict w/ plant-part names (keys) to the main id column (values).
`add_record_id`(part_df, id_cols, plant_part_col='plant_part', year=True)	Add a record id to a compiled part df.
`assign_record_id_eia`(test_df, plant_part_col='plant_part')	Assign record ids to a df with a mix of plant parts.

Attributes

`logger`
`width`
`max_columns`
`PLANT_PARTS`	this dictionary contains a key for each of the 'plant parts' that should
`PLANT_PARTS_ORDERED`
`IDX_TO_ADD`	list of additional columns to add to the id_cols in `PLANT_PARTS`.
`IDX_OWN_TO_ADD`	list of additional columns beyond the `IDX_TO_ADD` to add to the
`SUM_COLS`	list of columns to sum when aggregating a table.
`WTAVG_DICT`	a dictionary of columns (keys) to perform weighted averages on and
`CONSISTENT_ATTRIBUTE_COLS`	a list of column names to add as attributes when they are consistent into
`PRIORITY_ATTRIBUTES_DICT`
`FIRST_COLS`

pudl.analysis.plant_parts_eia.logger[source]

pudl.analysis.plant_parts_eia.width = 1000[source]

pudl.analysis.plant_parts_eia.max_columns = 1000[source]

pudl.analysis.plant_parts_eia.PLANT_PARTS :Dict[str, Dict[str, List]][source]

This dictionary contains a key for each of the ‘plant parts’ that should end up in the master unit list. The top-level value for each key is another dictionary, which contains keys:

id_cols (the primary key type id columns for this plant part). The plant_id_eia column must come first.

Type: dict

pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED :List[str] = ['plant', 'plant_unit', 'plant_prime_mover', 'plant_technology', 'plant_prime_fuel',...[source]

pudl.analysis.plant_parts_eia.IDX_TO_ADD :List[str] = ['report_date', 'operational_status_pudl'][source]

list of additional columns to add to the id_cols in PLANT_PARTS. The id_cols are the base columns that we need to aggregate on, but we also need to add the report date to keep the records time sensitive and the operational_status_pudl to separate the operating plant-parts from the non-operating plant-parts.

Type: list

pudl.analysis.plant_parts_eia.IDX_OWN_TO_ADD :List[str] = ['utility_id_eia', 'ownership'][source]

list of additional columns beyond the IDX_TO_ADD to add to the id_cols in PLANT_PARTS when we are dealing with plant-part records that have been broken out into “owned” and “total” records for each of their owners.

Type: list

pudl.analysis.plant_parts_eia.SUM_COLS :List[str] = ['total_fuel_cost', 'net_generation_mwh', 'capacity_mw', 'capacity_eoy_mw', 'total_mmbtu'][source]

list of columns to sum when aggregating a table.

Type: list

pudl.analysis.plant_parts_eia.WTAVG_DICT[source]

a dictionary of columns (keys) to perform weighted averages on and the weight column (values)

Type: dict
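As a rough sketch of how a columns-to-weights mapping like this can drive a weighted average during aggregation: the contents of wtavg_dict below are hypothetical (the actual entries of WTAVG_DICT are not shown on this page), and weighted_average is a helper written only for this example.

```python
import pandas as pd

# Hypothetical stand-in for WTAVG_DICT: data column -> weight column.
wtavg_dict = {"fuel_cost_per_mwh": "net_generation_mwh"}

gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "fuel_cost_per_mwh": [20.0, 40.0, 30.0],
    "net_generation_mwh": [300.0, 100.0, 50.0],
})


def weighted_average(df, by, data_col, weight_col):
    """Weighted mean of data_col per group: sum(x * w) / sum(w)."""
    tmp = df.assign(_xw=df[data_col] * df[weight_col])
    sums = tmp.groupby(by)[["_xw", weight_col]].sum()
    return (sums["_xw"] / sums[weight_col]).rename(data_col).reset_index()


results = {
    col: weighted_average(gens, ["plant_id_eia"], col, weight)
    for col, weight in wtavg_dict.items()
}
```

Plant 1 comes out at 25.0 rather than the unweighted 30.0 because the cheaper generator produced three times as much energy.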

pudl.analysis.plant_parts_eia.CONSISTENT_ATTRIBUTE_COLS = ['fuel_type_code_pudl', 'planned_retirement_date', 'retirement_date', 'generator_id',...[source]

a list of column names to add as attributes when they are consistent into the aggregated plant-part records.

Type: list

pudl.analysis.plant_parts_eia.PRIORITY_ATTRIBUTES_DICT[source]

pudl.analysis.plant_parts_eia.FIRST_COLS = ['plant_id_eia', 'report_date', 'plant_part', 'generator_id', 'unit_id_pudl',...[source]

class pudl.analysis.plant_parts_eia.MakeMegaGenTbl[source]

Bases: object

Compiler for a MEGA generator table with ownership integrated.

Examples

Input Tables

Here is an example of one plant with three generators. We will use capacity_mw as the data column.

>>> mcoe = pd.DataFrame({
...     'plant_id_eia': [1, 1, 1],
...     'report_date': ['2020-01-01', '2020-01-01','2020-01-01'],
...     'generator_id': ['a', 'b', 'c'],
...     'utility_id_eia': [111, 111, 111],
...     'unit_id_pudl': [1, 1, 1],
...     'prime_mover_code': ['CT', 'CT', 'CA'],
...     'technology_description': [
...         'Natural Gas Fired Combined Cycle', 'Natural Gas Fired Combined Cycle', 'Natural Gas Fired Combined Cycle'
...     ],
...     'operational_status': ['existing', 'existing','existing'],
...     'retirement_date': [pd.NA, pd.NA, pd.NA],
...     'capacity_mw': [50, 50, 100],
... }).astype({
...     'retirement_date': "datetime64[ns]",
...     'report_date': "datetime64[ns]",
... })
>>> mcoe
    plant_id_eia    report_date     generator_id   utility_id_eia   unit_id_pudl    prime_mover_code              technology_description   operational_status  retirement_date      capacity_mw
0              1     2020-01-01                a              111              1                  CT    Natural Gas Fired Combined Cycle             existing               NaT              50
1              1     2020-01-01                b              111              1                  CT    Natural Gas Fired Combined Cycle             existing               NaT              50
2              1     2020-01-01                c              111              1                  CA    Natural Gas Fired Combined Cycle             existing               NaT             100

The ownership table from EIA 860 includes one record for every owner of each generator. In this example generator c has two owners.

>>> df_own_eia860 = pd.DataFrame({
...     'plant_id_eia': [1, 1, 1, 1],
...     'report_date': ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-01'],
...     'generator_id': ['a', 'b', 'c', 'c'],
...     'utility_id_eia': [111, 111, 111, 111],
...     'owner_utility_id_eia': [111, 111, 111, 888],
...     'fraction_owned': [1, 1, .75, .25]
... }).astype({'report_date': "datetime64[ns]"})
>>> df_own_eia860
    plant_id_eia    report_date   generator_id      utility_id_eia  owner_utility_id_eia  fraction_owned
0              1     2020-01-01              a                 111                   111            1.00
1              1     2020-01-01              b                 111                   111            1.00
2              1     2020-01-01              c                 111                   111            0.75
3              1     2020-01-01              c                 111                   888            0.25

Output Mega Generators Table

MakeMegaGenTbl().execute(mcoe, df_own_eia860, slice_cols=['capacity_mw']) produces the output table gens_mega which includes two main sections: the generators with a “total” ownership stake for each of their owners and the generators with an “owned” ownership stake for each of their owners. For the generators that are owned 100% by one utility, the records are identical except for the ownership column. For the generators that have more than one owner, there are two “total” records with 100% of the capacity of that generator - one for each owner - and two “owned” records with the capacity scaled to the ownership stake of each of the owner utilities - represented by fraction_owned.
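The shape of that output can be approximated with a merge and two assignments. This is a simplified sketch using trimmed versions of the example tables, not the real execute() logic; it skips the date, status and utility-id handling the module performs.

```python
import pandas as pd

mcoe = pd.DataFrame({
    "plant_id_eia": [1, 1, 1],
    "generator_id": ["a", "b", "c"],
    "capacity_mw": [50.0, 50.0, 100.0],
})
df_own_eia860 = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1],
    "generator_id": ["a", "b", "c", "c"],
    "owner_utility_id_eia": [111, 111, 111, 888],
    "fraction_owned": [1.0, 1.0, 0.75, 0.25],
})

# One row per generator-owner pair.
per_owner = mcoe.merge(
    df_own_eia860, on=["plant_id_eia", "generator_id"], how="left"
)

# "owned": the data column scaled to each owner's stake.
owned = per_owner.assign(
    ownership="owned",
    capacity_mw=per_owner["capacity_mw"] * per_owner["fraction_owned"],
)
# "total": the full generator repeated for every owner.
total = per_owner.assign(ownership="total", fraction_owned=1.0)

gens_mega_sketch = pd.concat([total, owned], ignore_index=True)
```

Generator c ends up with two “total” records of 100 MW and two “owned” records of 75 MW and 25 MW.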

execute(self, mcoe: pandas.DataFrame, own_eia860: pandas.DataFrame, slice_cols: List[str] = SUM_COLS) → pandas.DataFrame[source]

Make the mega generators table with ownership integrated.

Parameters

mcoe – generator-based mcoe table from pudl.output.PudlTabl.mcoe()
own_eia860 – ownership table from pudl.output.PudlTabl.own_eia860()
slice_cols – list of columns to slice by ownership fraction in MakeMegaGenTbl.slice_by_ownership(). Default is SUM_COLS

Returns

a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner. The “owned” records have the generator’s data scaled to the ownership percentage (e.g. if a 200 MW generator has a 75% stake owner and a 25% stake owner, this will result in two “owned” records with 150 MW and 50 MW). The “total” records correspond to the full generator for every owner (e.g. using the same 2-owner 200 MW generator as above, each owner will have a record with 200 MW).

get_gens_mega_table(self, mcoe)[source]

Compile the main generators table that will be used as base of PPL.

Get a table of all of the generators there ever were and all of the data PUDL has to offer about those generators. This generator table will be used to compile all of the “plant-parts”, so we need to ensure that any of the id columns from the other plant-parts are in this generator table as well as all of the data columns that we are going to aggregate to the various plant-parts.

Returns: pandas.DataFrame

label_operating_gens(self, gen_df: pandas.DataFrame) → pandas.DataFrame[source]

Label the operating generators.

We want to distinguish between “operating” generators (those that report as “existing” and those that retire mid-year) and everything else so that we can group the operating generators into their own plant-parts separate from retired or proposed generators. We do this by creating a new label column called “operational_status_pudl”.

This method also adds a column called “capacity_eoy_mw”, which is the end of year capacity of the generators. We assume that if a generator isn’t “existing”, its EOY capacity should be zero.
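A minimal sketch of this labeling, under two labeled assumptions: "retires mid-year" is read as the retirement year matching the report year, and the label for everything else is taken to be 'non-operating'. Neither detail is confirmed by this page.

```python
import pandas as pd

gen_df = pd.DataFrame({
    "generator_id": ["a", "b", "c"],
    "report_date": pd.to_datetime(["2020-01-01"] * 3),
    "operational_status": ["existing", "retired", "proposed"],
    "retirement_date": pd.to_datetime(
        [pd.NaT, pd.Timestamp("2020-06-30"), pd.NaT]
    ),
    "capacity_mw": [50.0, 50.0, 100.0],
})

existing = gen_df["operational_status"] == "existing"
# Assumption: a mid-year retiree retires in the same year it is reported.
mid_year_retiree = (
    gen_df["retirement_date"].dt.year == gen_df["report_date"].dt.year
)

# Assumption: 'non-operating' is the label for all other generators.
gen_df["operational_status_pudl"] = (existing | mid_year_retiree).map(
    {True: "operating", False: "non-operating"}
)
# Generators that are not "existing" get zero end-of-year capacity.
gen_df["capacity_eoy_mw"] = gen_df["capacity_mw"].where(existing, 0.0)
```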

Parameters: gen_df (pandas.DataFrame) – annual table of all generators from EIA.

Returns: pandas.DataFrame: annual table of all generators from EIA that operated within each reporting year.

Todo

This function results in warning: PerformanceWarning: DataFrame is highly fragmented… I expect this is because of the number of columns that are being assigned here via .loc[:, col_to_assign].

slice_by_ownership(self, gens_mega, own_eia860, slice_cols=SUM_COLS)[source]

Generate proportional data by ownership %s.

Why do we have to do this at all? Sometimes generators are owned by many different utility owners that own slices of that generator. EIA reports which portion of each generator is owned by which utility relatively clearly in their ownership table. On the other hand, in FERC1, sometimes a partial owner reports the full plant-part, and sometimes they report only their ownership portion of the plant-part. And of course it is not labeled in FERC1. Because of this, we need to compile all of the possible ownership slices of the EIA generators.

In order to accumulate every possible version of how a generator could be reported, this method generates two records for each generator’s reported owners: one of the portion of the plant part they own and one for the plant-part as a whole. The portion records are labeled in the ownership column as “owned” and the total records are labeled as “total”.

In this function we merge in the ownership table so that generators with multiple owners then have one record per owner with the ownership fraction (in column fraction_owned). Because the ownership table only contains records for generators that have multiple owners, we assume that all other generators are owned 100% by their operator. Then we generate the “total” records by duplicating the “owned” records but assigning the fraction_owned to be 1 (i.e. 100%).
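The default-ownership assumption described above might look like this in pandas. It is a sketch only; the column names follow the example ownership table and the fill logic is an illustration rather than the method's actual code.

```python
import pandas as pd

gens = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "generator_id": ["a", "c"],
    "utility_id_eia": [111, 111],  # the operator
    "capacity_mw": [50.0, 100.0],
})
# Only the jointly owned generator appears in the ownership table.
own = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "generator_id": ["c", "c"],
    "owner_utility_id_eia": [111, 888],
    "fraction_owned": [0.75, 0.25],
})

merged = gens.merge(own, on=["plant_id_eia", "generator_id"], how="left")

# Generators missing from the ownership table: assume the operator owns 100%.
merged["fraction_owned"] = merged["fraction_owned"].fillna(1.0)
merged["owner_utility_id_eia"] = (
    merged["owner_utility_id_eia"].fillna(merged["utility_id_eia"]).astype(int)
)
```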

class pudl.analysis.plant_parts_eia.LabelTrueGranularities[source]

Bases: object

True Granularity Labeler.

execute(self, gens_mega: pandas.DataFrame, drop_extra_cols: bool = True)[source]

Prep the table that denotes true_gran for all generators.

This method will generate a dataframe based on gens_mega that has boolean columns that denote whether each plant-part is a true or false granularity.

There are four main steps in this process:

For every combination of plant-parts, count the number of unique types of peer plant-parts (see make_all_the_counts() for more details).
Convert those counts to boolean values that indicate whether there is exactly one unique parent or child plant-part (see make_all_the_bools() for more details).
Using the boolean values, label each plant-part as a true or false granularity based on both the parent-to-child and the child-to-parent booleans (see label_true_grans_by_part() for more details).
For each plant-part, label it with its appropriate plant-part counterpart - if it is a true granularity, the appropriate label is itself (see label_true_id_by_part() for more details).

Parameters

gens_mega – a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner.
drop_extra_cols – If True, the extra columns used to generate the true_gran columns are dropped. Default is True. Use False for debugging only.

get_parts_to_parent_parts(self)[source]

Make a dictionary of each plant-part’s parent parts.

We have imposed a hierarchy on the plant-parts with the PLANT_PARTS_ORDERED list and this method generates a dictionary of each plant-part’s (key) parent-parts (value).
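One way to picture such a dictionary, assuming a part's parents are simply every part listed before it. The list below holds only the first five entries visible on this page; the real PLANT_PARTS_ORDERED is longer.

```python
# Shortened copy of the visible start of PLANT_PARTS_ORDERED (illustration only).
PLANT_PARTS_ORDERED_SKETCH = [
    "plant",
    "plant_unit",
    "plant_prime_mover",
    "plant_technology",
    "plant_prime_fuel",
]

# Assumption: every part earlier in the list is a parent of the current part.
parts_to_parent_parts = {
    part: PLANT_PARTS_ORDERED_SKETCH[:i]
    for i, part in enumerate(PLANT_PARTS_ORDERED_SKETCH)
}
```

Under this assumption “plant” has no parents and “plant_prime_mover” has “plant” and “plant_unit” as parents.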

make_all_the_counts(self, gens_mega: pandas.DataFrame) → pandas.DataFrame[source]

For each plant-part, count the unique child and parent parts.

All plant-parts are situated within a hierarchy that is defined within PLANT_PARTS_ORDERED. Child parts are plant-parts that are defined as a lower priority within PLANT_PARTS_ORDERED and parent parts are those with a higher priority.

In order to determine if a particular plant-part is a unique granularity, we want to know if a corresponding parent or child plant-part is comprised of the same collection of generators. In order to do that, we count the unique instances of the ID columns in both the corresponding parent and child parts. We use this count in make_all_the_bools() and subsequently label_true_grans_by_part() to label each plant-part as a unique granularity or not based on whether there is only one type of plant-part.

Parameters: gens_mega – a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner.
Returns: an augmented version of the gens_mega dataframe with new columns for each of the child and parent plant-parts with counts of unique instances of those parts. The columns will be named in the following format: {child/parent_part_name}_count_per_{part_name}
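The counting step can be sketched with a grouped nunique transform. The id_cols mapping and the helper count_peer_per_part are hypothetical and cover only two parts; the output columns follow the naming format given above.

```python
import pandas as pd

gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1],
    "generator_id": ["a", "b", "c", "d"],
    "unit_id_pudl": [1, 1, 1, 2],
    "prime_mover_code": ["CT", "CT", "CA", "GT"],
})

# Hypothetical id columns for two plant-parts (illustration only).
id_cols = {
    "plant_unit": ["plant_id_eia", "unit_id_pudl"],
    "plant_prime_mover": ["plant_id_eia", "prime_mover_code"],
}


def count_peer_per_part(df, part_name, peer_name):
    """Count unique peer ids within each part_name group, per generator row."""
    peer_id = id_cols[peer_name][-1]
    col = f"{peer_name}_count_per_{part_name}"
    counts = df.groupby(id_cols[part_name])[peer_id].transform("nunique")
    return df.assign(**{col: counts})


counts = count_peer_per_part(gens, "plant_unit", "plant_prime_mover")
counts = count_peer_per_part(counts, "plant_prime_mover", "plant_unit")
```

Unit 1 contains two prime movers (CT and CA), so its generators get a count of 2, while every prime mover here sits inside exactly one unit.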

make_all_the_bools(self, counts)[source]

Make booleans to indicate whether a corresponding plant-part is consistent.

We’ve counted all of the child- and parent-parts contained within a plant-part in make_all_the_counts(). If there is only one corresponding plant-part within a different plant-part, then we can assume it is in effect a non-unique plant-part. So we convert the count columns into boolean columns to indicate whether a plant-part has only one corresponding child- or parent-part.

Parameters: counts (pandas.DataFrame) – result of make_all_the_counts()
Returns: a table with generator records where we have new boolean columns which indicate whether or not the plant-part has more than one child/parent-part. These columns are formatted as: {child/parent_part_name}_has_only_one_{part_name}
Return type: pandas.DataFrame
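Converting counts to booleans is then a comparison against one. The input below is a hand-written stand-in for the output of make_all_the_counts(), and the loop is a sketch rather than the method's code.

```python
import pandas as pd

counts = pd.DataFrame({
    "generator_id": ["a", "b", "c", "d"],
    "plant_prime_mover_count_per_plant_unit": [2, 2, 2, 1],
    "plant_unit_count_per_plant_prime_mover": [1, 1, 1, 1],
})

bools = counts.copy()
for col in counts.filter(like="_count_per_").columns:
    # Split "{peer}_count_per_{part}" back into its two part names.
    peer, part = col.split("_count_per_")
    bools[f"{peer}_has_only_one_{part}"] = counts[col] == 1
```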

label_true_grans_by_part(self, part_bools)[source]

Label the true/false granularities for each part/parent-part combo.

This method uses the indicator columns which let us know whether or not there is more than one unique value for both the parent and child plant-part ids to generate an additional indicator column that lets us know whether the child plant-part is a true or false granularity when compared to the parent plant-part. With all of the indicator columns from each plant-part’s parent plant-parts, if all of those determined that the plant-part is a true granularity, then this method will label the plant-part as being a true granularity and vice versa.

Because we have forced a hierarchy within the PLANT_PARTS_ORDERED, the process for labeling true or false granularities must investigate bi-directionally. This is because all of the plant-parts besides ‘plant’ and ‘plant_gen’ are not necessarily bigger or smaller than their parent plant-part and thus there is overlap. Because of this, this method uses the checks in both directions (from parent to child and from child to parent).

Parameters: part_bools (pandas.DataFrame) – result of make_all_the_bools()
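The bi-directional check can be illustrated for one child/parent pair. When a unit holds only one prime mover and that prime mover sits in only one unit, the two records cover the same generators, so the child is a false granularity. The name of the result column is hypothetical.

```python
import pandas as pd

# Hand-written stand-in for the output of make_all_the_bools().
part_bools = pd.DataFrame({
    "generator_id": ["a", "b", "c", "d"],
    "plant_prime_mover_has_only_one_plant_unit": [False, False, False, True],
    "plant_unit_has_only_one_plant_prime_mover": [True, True, True, True],
})

# Both directions must be one-to-one for the two parts to be the same set.
one_to_one = (
    part_bools["plant_prime_mover_has_only_one_plant_unit"]
    & part_bools["plant_unit_has_only_one_plant_prime_mover"]
)

# Hypothetical output column: is plant_prime_mover a true granularity
# when compared with its parent plant_unit?
part_bools["true_gran_plant_prime_mover_vs_plant_unit"] = ~one_to_one
```

Only generator d, whose unit and prime mover groups contain the same single generator, is flagged as a false granularity.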

Todo

This function results in warning: PerformanceWarning: DataFrame is highly fragmented... I expect this is because of the number of columns that are being assigned here via .loc[:, col_to_assign]. This warning shows up only after the 5th iteration through the top-level loop (when part_name = ‘plant_prime_fuel’).

label_true_id_by_part(self, part_trues)[source]

Label the appropriate plant-part.

For each plant-part, we need to make a label which indicates what the “true” unique plant-part is. If a gen and a unit are a non-unique set of records, we only want to label one of them as a false granularity. We are going to use the get_parts_to_parent_parts() dictionary to help us with this. We want to “save” the biggest parent plant-part as the true granularity.

Because we have columns in part_trues that indicate whether a plant-part is a true gran vs each parent part, we can cycle through the possible parent-parts from biggest to smallest and the first time we find that a plant-part is a false gran, we label its true id as that parent-part.

class pudl.analysis.plant_parts_eia.MakePlantParts(pudl_out)[source]

Bases: object

Compile the plant parts for the master unit list.

This object generates a master list of different “plant-parts”, which are various collections of generators - e.g. units, fuel-types, whole plants, etc. - as well as various ownership arrangements. Each plant-part is included in the master plant-part table associated with each of the plant-part’s owners twice - once with the data scaled to the fraction of each owner’s ownership and another for a total plant-part for each owner.

This master plant parts table is generated by first creating a complete generators table - with all of the data columns we will be aggregating to different plant-parts and sliced and scaled by ownership. Then we make a label for each plant-part record which indicates whether or not the record is a unique grouping of generator records. Then we use the complete generator table to aggregate by each of the plant-part categories.

The coordinating function here is execute().

execute(self, gens_mega, true_grans)[source]

Aggregate and slice data points by each plant part.

Return type: pandas.DataFrame

add_additonal_cols(self, plant_parts_eia)[source]

Add additional data and id columns.

This method adds a set of either calculated columns or PUDL ID columns.

Returns

master unit list table with these additional columns:

utility_id_pudl
plant_id_pudl
capacity_factor
ownership_dupe (boolean): indicator of whether the “owned” record has a corresponding “total” duplicate.

Return type

pandas.DataFrame

_clean_plant_parts(self, plant_parts_eia)[source]

validate_ownership_for_owned_records(self, plant_parts_eia)[source]

Test ownership - fraction owned for owned records.

This test can be run at the end of or with the result of MakePlantParts.execute(). It tests a few aspects of the fraction_owned column and raises assertions if the tests fail.

class pudl.analysis.plant_parts_eia.PlantPart(part_name)[source]

Bases: object

Plant-part table maker.

The coordinating method here is execute().

Examples

Below are some examples of how the main processing step in this class operates: PlantPart.ag_part_by_own_slice(). If we have a plant with four generators that looks like this:

>>> gens_mega = pd.DataFrame({
...     'plant_id_eia': [1, 1, 1, 1],
...     'report_date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01',],
...     'utility_id_eia': [111, 111, 111, 111],
...     'generator_id': ['a', 'b', 'c', 'd'],
...     'prime_mover_code': ['ST', 'GT', 'CT', 'CA'],
...     'energy_source_code_1': ['BIT', 'NG', 'NG', 'NG'],
...     'ownership': ['total', 'total', 'total', 'total',],
...     'operational_status_pudl': ['operating', 'operating', 'operating', 'operating'],
...     'capacity_mw': [400, 50, 125, 75],
... }).astype({
...     'report_date': 'datetime64[ns]',
... })
>>> gens_mega
    plant_id_eia   report_date      utility_id_eia  generator_id    prime_mover_code        energy_source_code_1    ownership       operational_status_pudl         capacity_mw
0              1    2020-01-01                 111             a                  ST                         BIT        total                     operating                 400
1              1    2020-01-01                 111             b                  GT                          NG        total                     operating                  50
2              1    2020-01-01                 111             c                  CT                          NG        total                     operating                 125
3              1    2020-01-01                 111             d                  CA                          NG        total                     operating                  75

This gens_mega table can then be aggregated by plant, plant_prime_fuel, plant_prime_mover, or plant_gen.

execute(self, gens_mega: pandas.DataFrame, sum_cols: List[str] = SUM_COLS, wtavg_dict: Dict = WTAVG_DICT) → pandas.DataFrame[source]

Get a table of data aggregated by a specific plant-part.

This method will take gens_mega and aggregate the generator records to the level of the plant-part. This is mostly done via ag_part_by_own_slice(). Then several additional columns are added and the records are labeled as true or false granularities.

Returns: a table with records that have been aggregated to a plant-part.

ag_part_by_own_slice(self, gens_mega, sum_cols=SUM_COLS, wtavg_dict=WTAVG_DICT) → pandas.DataFrame[source]

Aggregate the plant part by separating ownership types.

There are total records and owned records in this master unit list. Those records need to be aggregated differently to scale. The “total” ownership slice is grouped and aggregated as a single version of the full plant and then the utilities are merged back. The “owned” ownership slice is grouped and aggregated with the utility_id_eia, so the portions of generators created by slice_by_ownership will be appropriately aggregated to each plant part level.

Returns: dataframe aggregated to the level of the part_name
Return type: pandas.DataFrame
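A simplified sketch of the two aggregation paths, here at the plant level with capacity_mw as the only data column. The id_cols list is a hypothetical stand-in and the report date and operational status columns are left out for brevity.

```python
import pandas as pd

gens_mega = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1, 1, 1],
    "generator_id": ["a", "c", "c", "a", "c", "c"],
    "utility_id_eia": [111, 111, 888, 111, 111, 888],
    "ownership": ["total", "total", "total", "owned", "owned", "owned"],
    "capacity_mw": [50.0, 100.0, 100.0, 50.0, 75.0, 25.0],
})
id_cols = ["plant_id_eia"]  # hypothetical id columns for the "plant" part

total = gens_mega[gens_mega["ownership"] == "total"]
owned = gens_mega[gens_mega["ownership"] == "owned"]

# "total": count each generator once, then attach every owner of the part.
total_ag = (
    total.drop_duplicates(subset=id_cols + ["generator_id"])
    .groupby(id_cols, as_index=False)["capacity_mw"]
    .sum()
)
owners = total[id_cols + ["utility_id_eia"]].drop_duplicates()
total_ag = owners.merge(total_ag, on=id_cols).assign(ownership="total")

# "owned": aggregate per owner so the ownership slices stay separate.
owned_ag = (
    owned.groupby(id_cols + ["utility_id_eia"], as_index=False)["capacity_mw"]
    .sum()
    .assign(ownership="owned")
)

part_ag = pd.concat([total_ag, owned_ag], ignore_index=True)
```

Both owners get a 150 MW “total” plant record, while their “owned” records are 125 MW and 25 MW. Dropping duplicate generators before summing the “total” slice avoids counting generator c twice.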

ag_fraction_owned(self, part_ag)[source]

Calculate the fraction owned for a plant-part df.

This method takes a dataframe of records that are aggregated to the level of a plant-part (with certain id_cols) and appends a fraction_owned column, which indicates the % ownership that a particular utility owner has for each aggregated plant-part record.

For partial owner records (ownership == “owned”), fraction_owned is calculated based on the portion of the capacity and the total capacity of the plant. For total owner records (ownership == “total”), the fraction_owned is always 1.

This method is meant to be run after ag_part_by_own_slice().

Parameters: part_ag (pandas.DataFrame) –
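
One way to picture the fraction_owned calculation is the sketch below. It is an illustration only: the toy part_ag, the plant-level id_cols, and the choice to read the total capacity from the “total” records are assumptions, not a description of the method's internals.

```python
import pandas as pd

# Hypothetical plant-level part_ag, as if produced by ag_part_by_own_slice().
part_ag = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "utility_id_eia": [111, 222, 111, 222],
        "ownership": ["owned", "owned", "total", "total"],
        "capacity_mw": [125.0, 25.0, 150.0, 150.0],
    }
)
id_cols = ["plant_id_eia"]  # assumed ID columns for the "plant" part

# Total capacity of each plant part, taken here from the "total" records.
total_cap = (
    part_ag.loc[part_ag["ownership"] == "total", id_cols + ["capacity_mw"]]
    .drop_duplicates()
    .rename(columns={"capacity_mw": "capacity_mw_total"})
)

out = part_ag.merge(total_cap, on=id_cols, how="left", validate="m:1")
# "owned" records: owned capacity over total capacity.
out["fraction_owned"] = out["capacity_mw"] / out["capacity_mw_total"]
# "total" records: always 1.
out.loc[out["ownership"] == "total", "fraction_owned"] = 1.0
out = out.drop(columns="capacity_mw_total")
```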

add_install_year(self, part_df, gens_mega)[source]

Add the install year from the entities table to your plant part.

TODO: This should be converted into an AddAttribute… an AddSortedAttribute or something like that.

add_new_plant_name(self, part_df, gens_mega)[source]

Add plant names into the compiled plant part df.

Parameters

part_df (pandas.DataFrame) – dataframe containing records associated with one plant part (e.g. all plant records or all plant_prime_mover records).
gens_mega (pandas.DataFrame) – a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner.

add_record_count_per_plant(self, part_df: pandas.DataFrame) → pandas.DataFrame[source]

Add a record count for each set of plant part records in each plant.

Parameters: part_df – dataframe containing records associated with one plant part (e.g. all plant records or all plant_prime_mover records).
Returns: augmented version of part_df with a new column named record_count
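
A per-plant record count of this kind can be sketched with a grouped transform. The toy part_df and the grouping columns below are assumptions chosen for illustration; the columns the method actually groups on are not stated here.

```python
import pandas as pd

# Hypothetical plant_gen records: three generators in plant 1, one in plant 2.
part_df = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "report_date": pd.to_datetime(["2020-01-01"] * 4),
        "utility_id_eia": [111] * 4,
        "ownership": ["total"] * 4,
        "generator_id": ["a", "b", "c", "a"],
    }
)

# Assumed grouping: one count per plant, date, utility and ownership slice.
group_cols = ["plant_id_eia", "report_date", "utility_id_eia", "ownership"]
part_df["record_count"] = part_df.groupby(group_cols)["generator_id"].transform("count")
```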

class pudl.analysis.plant_parts_eia.PartTrueGranLabeler(part_name: Literal[PLANT_PARTS_ORDERED])[source]

Label a plant-part as a unique (or not) granularity.

execute(self, part_df, true_grans)[source]

Merge the true granularity labels into the plant part df.

Parameters: part_df (pandas.DataFrame) –

class pudl.analysis.plant_parts_eia.AddAttribute(attribute_col, part_name)[source]

Bases: object

Base class for adding attributes to plant-part tables.

class pudl.analysis.plant_parts_eia.AddConsistentAttributes(attribute_col, part_name)[source]

Bases: AddAttribute

Add consistent attributes to the records of a plant-part table.

execute(self, part_df, gens_mega)[source]

Get qualifier records.

For an individual dataframe of one plant part (e.g. only “plant_prime_mover” plant part records), we typically have identifying columns and aggregated data columns. The identifying columns for a given plant part are only those columns which are required to uniquely specify a record of that type of plant part. For example, to uniquely specify a plant_unit record, we need plant_id_eia, unit_id_pudl, and report_date, and nothing else. In other words, the identifying columns for a given plant part would make up a natural composite primary key for a table composed entirely of that type of plant part. Every plant part is cobbled together from generator records, so each record in each part_df can be thought of as a collection of generators.

Identifier and qualifier columns are the same columns; whether a column is an identifier or a qualifier is a function of the plant part you’re considering. All the other columns which could be identifiers in the context of other plant parts (but aren’t for this plant part) are qualifiers.

This method takes a part_df and checks whether the data we are trying to grab from the record_name column is consistent across every component generator of each record.

Parameters

part_df (pandas.DataFrame) – dataframe containing records associated with one plant part.
gens_mega (pandas.DataFrame) – a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner.

get_consistent_qualifiers(self, record_df)[source]

Get fully consistent qualifier records.

When the data in a qualifier column is identical for every generator record in a plant part, we associate this data point with the plant-part record. If the data points for the related generator records are not identical, then nothing is associated with the record.

Parameters

record_df (pandas.DataFrame) – the dataframe with the record
base_cols (list) – list of identifying columns.
record_name (string) – name of qualitative record
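
The consistency rule can be sketched as a count of distinct values per group. This is a minimal sketch under assumptions: the toy record_df and the qualifier column name fuel_type_code_pudl are used for illustration only, and report_date is left out of base_cols for brevity.

```python
import pandas as pd

# Hypothetical generator-level records for two units in one plant.
record_df = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "unit_id_pudl": [1, 1, 2, 2],
        "fuel_type_code_pudl": ["gas", "gas", "gas", "coal"],
    }
)
base_cols = ["plant_id_eia", "unit_id_pudl"]  # identifying columns
record_name = "fuel_type_code_pudl"  # qualifier column being checked

# Number of distinct qualifier values among each record's generators.
n_unique = record_df.groupby(base_cols)[record_name].transform("nunique")

# Keep the qualifier only where every component generator agrees.
consistent = (
    record_df[n_unique == 1][base_cols + [record_name]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Unit 1 keeps its qualifier because both generators agree. Unit 2 has no row in consistent, so a left merge back onto the plant-part records would leave its qualifier null.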

class pudl.analysis.plant_parts_eia.AddPriorityAttribute(attribute_col, part_name)[source]

Bases: AddAttribute

Add Attributes based on a priority sorting from PRIORITY_ATTRIBUTES.

This object associates one attribute from the generators that make up a plant-part based on a sorted list within PRIORITY_ATTRIBUTES. For example, for “operational_status” we will grab the highest level of operational status that is associated with each record’s component generators. The order of operational status is defined within the method as: ‘existing’, ‘proposed’, then ‘retired’. For example, if a plant_unit is composed of two generators, and one of them is “existing” and the other is “retired”, the entire plant_unit will be considered “existing”.

execute(self, part_df, gens_mega)[source]

Add the attribute to the plant-part df based on priority.

Parameters

part_df (pandas.DataFrame) – dataframe containing records associated with one plant part.
gens_mega (pandas.DataFrame) – a table of all of the generators with identifying columns and data columns, sliced by ownership which makes “total” and “owned” records for each generator owner.
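
The priority rule for operational_status can be sketched with an ordered categorical. The priority order is the one given above; the toy generator table and id_cols are assumptions, and this is not necessarily how the class implements the sort.

```python
import pandas as pd

# Priority order as described above: earlier entries win.
priority = ["existing", "proposed", "retired"]

# Hypothetical generators: unit 1 mixes statuses, unit 2 is fully retired.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1],
        "unit_id_pudl": [1, 1, 2],
        "operational_status": ["retired", "existing", "retired"],
    }
)
id_cols = ["plant_id_eia", "unit_id_pudl"]

# An ordered categorical makes sort_values follow the priority list.
gens["operational_status"] = pd.Categorical(
    gens["operational_status"], categories=priority, ordered=True
)

# Sort by priority and keep the first (highest-priority) status per record.
top = (
    gens.sort_values("operational_status")
    .drop_duplicates(subset=id_cols, keep="first")
    .sort_values(id_cols)
    .reset_index(drop=True)
)
```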

pudl.analysis.plant_parts_eia.validate_run_aggregations(plant_parts_eia, gens_mega)[source]

Run a test of the aggregated columns.

This test will use the plant_parts_eia, re-run the groupbys, and check that the results are similar.
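
The idea of re-running a groupby and comparing it to the compiled table can be sketched as follows. The toy tables, the single capacity_mw column, and the use of numpy.isclose for the similarity check are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical generator-level input and already-compiled plant records.
gens_mega = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["a", "b", "a"],
        "capacity_mw": [100.0, 50.0, 30.0],
    }
)
plant_parts_eia = pd.DataFrame(
    {"plant_id_eia": [1, 2], "capacity_mw": [150.0, 30.0]}
)
id_cols = ["plant_id_eia"]

# Re-run the aggregation independently of the compiled table.
test_agg = gens_mega.groupby(id_cols, as_index=False)["capacity_mw"].sum()

# Line the two versions up and flag any record whose values differ.
# An outer merge leaves NaN where a record is missing on one side, and
# isclose treats NaN as a mismatch.
compared = plant_parts_eia.merge(
    test_agg, on=id_cols, how="outer", suffixes=("", "_test")
)
compared["matches"] = np.isclose(compared["capacity_mw"], compared["capacity_mw_test"])
mismatches = compared[~compared["matches"]]
```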

pudl.analysis.plant_parts_eia._test_prep_merge(part_name, plant_parts_eia, gens_mega)[source]

Run the test groupby and merge with the aggregations.

pudl.analysis.plant_parts_eia.make_id_cols_list()[source]

Get a list of the id columns (primary keys) for all of the plant parts.

Returns: a list of the ID columns for all of the plant-parts, including report_date
Return type: list

pudl.analysis.plant_parts_eia.make_parts_to_ids_dict()[source]

Make dict w/ plant-part names (keys) to the main id column (values).

All plant-parts have 1 or 2 ID columns in PLANT_PARTS: plant_id_eia and a secondary column (with the exception of the “plant” plant-part, which has only plant_id_eia). The plant_id_eia column is always first, so we’re going to grab the last column.

Returns: plant-part names (keys) corresponding to the main ID column (value).
Return type: dictionary
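
Taking the last ID column of each plant part can be sketched with a dictionary comprehension. The structure of PLANT_PARTS_EXAMPLE below is hypothetical and stands in for the module's PLANT_PARTS, whose exact layout is not shown here.

```python
# Hypothetical stand-in for PLANT_PARTS: each part lists its ID columns,
# with plant_id_eia always first.
PLANT_PARTS_EXAMPLE = {
    "plant": {"id_cols": ["plant_id_eia"]},
    "plant_gen": {"id_cols": ["plant_id_eia", "generator_id"]},
    "plant_unit": {"id_cols": ["plant_id_eia", "unit_id_pudl"]},
}

# The main ID column is the last one; for "plant" that is plant_id_eia itself.
parts_to_ids = {
    part_name: spec["id_cols"][-1]
    for part_name, spec in PLANT_PARTS_EXAMPLE.items()
}
```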

pudl.analysis.plant_parts_eia.add_record_id(part_df, id_cols, plant_part_col='plant_part', year=True)[source]

Add a record id to a compiled part df.

We need a standardized way to refer to these compiled records that contains enough information in the id itself that in theory we could deconstruct the id and determine which plant id and plant part id columns are associated with this record.
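
A record id that can be deconstructed back into its parts might be built by joining the relevant columns with a separator. The format below (ID columns, year, plant part, ownership, utility, joined by underscores) is an assumption made for illustration and may not match the ids this function produces.

```python
import pandas as pd

# Hypothetical compiled plant_gen records.
part_df = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "generator_id": ["a", "b"],
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "plant_part": ["plant_gen", "plant_gen"],
        "ownership": ["total", "owned"],
        "utility_id_eia": [111, 111],
    }
)
id_cols = ["plant_id_eia", "generator_id"]

# Collect each component as strings, in the order they appear in the id.
pieces = [part_df[col].astype(str) for col in id_cols]
pieces.append(part_df["report_date"].dt.year.astype(str))
pieces += [
    part_df[col].astype(str)
    for col in ["plant_part", "ownership", "utility_id_eia"]
]

# Join the components row by row with an underscore separator.
part_df["record_id_eia"] = pieces[0].str.cat(pieces[1:], sep="_")
```

An underscore separator is ambiguous here because several column values already contain underscores, so an id in this form can only be split back apart if the plant part name is known.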

pudl.analysis.plant_parts_eia.assign_record_id_eia(test_df, plant_part_col='plant_part')[source]

Assign record ids to a df with a mix of plant parts.

Parameters

test_df (pandas.DataFrame) –
plant_part_col (string) –

Todo

This function results in warning: PerformanceWarning: DataFrame is highly fragmented... I’m honestly not sure if this is happening because of this function specifically or is a result from all of the column assignments in label_true_id_by_part() or label_true_grans_by_part() where we are also getting this warning.
