pudl.analysis.record_linkage.eia_ferc1_record_linkage#

Connect FERC1 plant tables to EIA’s plant-parts via record linkage.

FERC plant records are reported very non-uniformly. In the same table there are records that are reported as whole plants, individual generators, and collections of prime movers. This means portions of EIA plants that correspond to a plant record in FERC Form 1 are heterogeneous, which complicates using the two data sets together.

The EIA plant data is much cleaner and more uniformly structured. The are generators with ids and plants with ids reported in seperate tables. Several generator IDs are typically grouped under a single plant ID. In pudl.analysis.plant_parts_eia, we create a large number of synthetic aggregated records representing many possible slices of a power plant which could in theory be what is actually reported in the FERC Form 1.

In this module we infer which of the many plant_parts_eia records is most likely to correspond to an actually reported FERC Form 1 plant record. this is done with a logistic regression model.

We train the logistic regression model using manually labeled training data that links together several thousand EIA and FERC plant records, and use grid search cross validation to select a best set of hyperparameters. This trained model is used to predict matches on the full dataset (see run_model()). The model can return multiple EIA match options for each FERC1 record, so we rank the matches and choose the one with the highest score (see find_best_matches()). Any matches identified by the model which are in conflict with our training data are overwritten with the manually assigned associations (see overwrite_bad_predictions()). The final match results are the connections we keep as the matches between FERC1 plant records and EIA plant-parts.

Module Contents#

Classes#

InputManager

Class prepare inputs for linking FERC1 and EIA.

Functions#

get_compiled_input_manager(plants_all_ferc1, ...)

Get InputManager object with compiled inputs for model.

get_pairs_dfs(inputs)

Get a dataframe with all possible FERC to EIA record pairs.

get_y_label_df(train_pairs_df, inputs)

Get the dataframe of y labels.

get_best_matches_with_overwrites(match_df, inputs)

Get dataframe with the best EIA match for each FERC record.

run_matching_model(features_train, features_all, y_df, ...)

Run model to match EIA to FERC records.

get_match_full_records(best_match_df, inputs)

Join full dataframe onto matches to make usable and get stats.

ferc_to_eia(→ pandas.DataFrame)

Coordinate the connection between FERC1 plants and EIA plant-parts.

run_model(→ pandas.DataFrame)

Train Logistic Regression model using GridSearch cross validation.

find_best_matches(match_df)

Only keep the best EIA match for each FERC record.

overwrite_bad_predictions(→ pandas.DataFrame)

Overwrite incorrect predictions with the correct match from training data.

restrict_train_connections_on_date_range(...)

Restrict the training data based on the date ranges of the input tables.

prep_train_connections(→ pandas.DataFrame)

Get and prepare the training connections for the model.

prettyify_best_matches(→ pandas.DataFrame)

Make the EIA-FERC best matches usable.

_log_match_coverage(connects_ferc1_eia)

check_match_consistency(→ pandas.DataFrame)

Check how consistent FERC-EIA matches are with FERC-FERC matches.

add_null_overrides(connects_ferc1_eia)

Override known null matches with pd.NA.

Attributes#

pudl.analysis.record_linkage.eia_ferc1_record_linkage.logger[source]#
pudl.analysis.record_linkage.eia_ferc1_record_linkage.pair_vectorizers[source]#
pudl.analysis.record_linkage.eia_ferc1_record_linkage.get_compiled_input_manager(plants_all_ferc1, fbp_ferc1, plant_parts_eia)[source]#

Get InputManager object with compiled inputs for model.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.get_pairs_dfs(inputs)[source]#

Get a dataframe with all possible FERC to EIA record pairs.

Merge the FERC and EIA records on block_col to generate possible record pairs for the matching model.

Parameters:

inputsInputManager object.

Returns:

A dataframe with all possible record pairs from all the input data and a dataframe with all possible record pairs from the training data.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.get_y_label_df(train_pairs_df, inputs)[source]#

Get the dataframe of y labels.

For each record pair in train_pairs_df, a 0 if the pair is not a match and a 1 if the pair is a match.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.get_best_matches_with_overwrites(match_df, inputs)[source]#

Get dataframe with the best EIA match for each FERC record.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.run_matching_model(features_train, features_all, y_df, experiment_tracker)[source]#

Run model to match EIA to FERC records.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.get_match_full_records(best_match_df, inputs)[source]#

Join full dataframe onto matches to make usable and get stats.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.ferc_to_eia(experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker, out_ferc1__yearly_all_plants: pandas.DataFrame, out_ferc1__yearly_steam_plants_fuel_by_plant_sched402: pandas.DataFrame, out_eia__yearly_plant_parts: pandas.DataFrame) pandas.DataFrame[source]#

Coordinate the connection between FERC1 plants and EIA plant-parts.

Parameters:
  • out_ferc1__yearly_all_plants – Table of all of the FERC1-reporting plants.

  • out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 – Table of the fuel reported aggregated to the FERC1 plant-level.

  • out_eia__yearly_plant_parts – The EIA plant parts list.

class pudl.analysis.record_linkage.eia_ferc1_record_linkage.InputManager(plants_all_ferc1: pandas.DataFrame, fbp_ferc1: pandas.DataFrame, plant_parts_eia: pandas.DataFrame)[source]#

Class prepare inputs for linking FERC1 and EIA.

get_plant_parts_eia_true(clobber: bool = False) pandas.DataFrame[source]#

Get the EIA plant-parts with only the unique granularities.

get_plants_ferc1(clobber: bool = False) pandas.DataFrame[source]#

Prepare FERC1 plants data for record linkage with EIA plant-parts.

This method grabs two tables (plants_all_ferc1 and fuel_by_plant_ferc1, accessed originally via pudl.output.pudltabl.PudlTabl.plants_all_ferc1() and pudl.output.pudltabl.PudlTabl.fbp_ferc1() respectively) and ensures that the columns the same as their EIA counterparts, because the output of this method will be used to link FERC and EIA.

Returns:

A cleaned table of FERC1 plants plant records with fuel cost data.

get_train_df() pandas.DataFrame[source]#

Get the training connections.

Prepare them if the training data hasn’t been connected to FERC data yet.

get_train_records(dataset_df: pandas.DataFrame, dataset_id_col: Literal[record_id_eia, record_id_ferc1]) pandas.DataFrame[source]#

Generate a set of known connections from a dataset using training data.

This method grabs only the records from the the datasets (EIA or FERC) that we have in our training data.

Parameters:
  • dataset_df – either FERC1 plants table (result of get_plants_ferc1()) or EIA plant-parts (result of get_plant_parts_eia_true()).

  • dataset_id_col – Identifying column name. Either record_id_eia for plant_parts_eia_true or record_id_ferc1 for plants_ferc1.

get_train_eia(clobber: bool = False) pandas.DataFrame[source]#

Get the known training data from EIA.

get_train_ferc1(clobber: bool = False) pandas.DataFrame[source]#

Get the known training data from FERC1.

execute(clobber: bool = False)[source]#

Compile all the inputs.

This method is only run if/when you want to ensure all of the inputs are generated all at once. While using InputManager, it is preferred to access each input dataframe or index via their get_ method instead of accessing the attribute.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.run_model(features_train: pandas.DataFrame, features_all: pandas.DataFrame, y_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) pandas.DataFrame[source]#

Train Logistic Regression model using GridSearch cross validation.

Search over the parameter grid for the best fit parameters for the Logistic Regression estimator on the training data. Predict matches on all the input features.

Parameters:
  • features_train – Dataframe of the feature vectors for the training data.

  • features_all – Dataframe of the feature vectors for all the input data.

  • y_df – Dataframe with 1 if a pair in features_train is a match and 0 if a pair is not a match.

Returns:

A dataframe of matches with record_id_ferc1 and record_id_eia as the index and a column for the probability of a match.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.find_best_matches(match_df)[source]#

Only keep the best EIA match for each FERC record.

We only want one EIA match for each FERC1 plant record. If there are multiple predicted matches for a FERC1 record, the match with the highest probability found by the model is chosen.

Parameters:

match_df – A dataframe of matches with record_id_eia and record_id_ferc1 as the index and a column for the probability of the match.

Returns:

Dataframe of matches with one EIA record for each FERC1 record.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.overwrite_bad_predictions(match_df: pandas.DataFrame, train_df: pandas.DataFrame) pandas.DataFrame[source]#

Overwrite incorrect predictions with the correct match from training data.

Parameters:
  • match_df – A dataframe of the best matches with only one match for each FERC1 record.

  • train_df – A dataframe of the training data.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.restrict_train_connections_on_date_range(train_df: pandas.DataFrame, id_col: Literal[record_id_eia, record_id_ferc1], start_date: pandas.Timestamp, end_date: pandas.Timestamp) pandas.DataFrame[source]#

Restrict the training data based on the date ranges of the input tables.

The training data for this model spans the full PUDL date range. We don’t want to add training data from dates that are outside of the range of the FERC and EIA data we are attempting to match. So this function restricts the training data based on start and end dates.

The training data is only the record IDs, which contain the report year inside them. This function compiles a regex using the date range to grab only training records which contain the years in the date range followed by and preceeded by _ - in the format of record_id_eia``and ``record_id_ferc1. We use that extracted year to determine

pudl.analysis.record_linkage.eia_ferc1_record_linkage.prep_train_connections(ppe: pandas.DataFrame, start_date: pandas.Timestamp, end_date: pandas.Timestamp) pandas.DataFrame[source]#

Get and prepare the training connections for the model.

We have stored training data, which consists of records with ids columns for both FERC and EIA. Those id columns serve as a connection between ferc1 plants and the EIA plant-parts. These connections indicate that a ferc1 plant records is reported at the same granularity as the connected EIA plant-parts record.

Parameters:
  • ppe – The EIA plant parts. Records from this dataframe will be connected to the training data records. This needs to be the full EIA plant parts, not just the distinct/true granularities because the training data could contain non-distinct records and this function reassigns those to their distinct counterparts.

  • start_date – Beginning date for records from the training data. Should match the start date of ppe. Default is None and all the training data will be used.

  • end_date – Ending date for records from the training data. Should match the end date of ppe. Default is None and all the training data will be used.

Returns:

A dataframe of training connections which has a MultiIndex of record_id_eia and record_id_ferc1.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.prettyify_best_matches(matches_best: pandas.DataFrame, plant_parts_eia_true: pandas.DataFrame, plants_ferc1: pandas.DataFrame, train_df: pandas.DataFrame, debug: bool = False) pandas.DataFrame[source]#

Make the EIA-FERC best matches usable.

Use the ID columns from the best matches to merge together both EIA plant-parts data and FERC plant data. This removes the comparison vectors (the floats between 0 and 1 that compare the two columns from each dataset).

pudl.analysis.record_linkage.eia_ferc1_record_linkage._log_match_coverage(connects_ferc1_eia)[source]#
pudl.analysis.record_linkage.eia_ferc1_record_linkage.check_match_consistency(connects_ferc1_eia: pandas.DataFrame, train_df: pandas.DataFrame, match_set: Literal[all, overrides] = 'all') pandas.DataFrame[source]#

Check how consistent FERC-EIA matches are with FERC-FERC matches.

We have two record linkage processes: one that links FERC plant records across time, and another that links FERC plant records to EIA plant-parts. This function checks that the two processes are as consistent with each other as we expect. Here “consistent” means that each FERC plant ID is associated with a single EIA plant parts ID across time. The reverse is not necessarily required – a single EIA plant part ID may be associated with various FERC plant IDs across time.

Parameters:
  • connects_ferc1_eia – Matches of FERC1 to EIA.

  • train_df – training data.

  • match_set – either all - to check all of the matches - or overrides - to check just the overrides. Default is all. The overrides are less consistent than all of the data, so this argument changes the consistency threshold for this check.

pudl.analysis.record_linkage.eia_ferc1_record_linkage.add_null_overrides(connects_ferc1_eia)[source]#

Override known null matches with pd.NA.

There is no way to indicate in the training data that certain FERC records have no proper EIA match. That is to say–you can’t specifiy a blank match or tell the AI not to match a given record. Because we’ve gone through by hand and know for a fact that some FERC records have no EIA match (even when you aggregate generators), we have to add in these null matches after the fact.

This function reads in a list of record_id_ferc1 values that are known to have no cooresponding EIA record match and makes sure they are mapped as NA in the final record linkage output. It also updates the match_type field to indicate that this value has been overriden.