pudl.analysis.record_linkage.eia_ferc1_inputs

Prepare the inputs to the FERC1 to EIA record linkage model.

Module Contents

Classes

InputManager

Class to prepare inputs for linking FERC1 and EIA.

Functions

restrict_train_connections_on_date_range(...)

Restrict the training data based on the date ranges of the input tables.

prep_train_connections(→ pandas.DataFrame)

Get and prepare the training connections for the model.

Attributes

pudl.analysis.record_linkage.eia_ferc1_inputs.logger[source]
class pudl.analysis.record_linkage.eia_ferc1_inputs.InputManager(plants_all_ferc1: pandas.DataFrame, fbp_ferc1: pandas.DataFrame, plant_parts_eia: pandas.DataFrame)[source]

Class to prepare inputs for linking FERC1 and EIA.

get_plant_parts_eia_true(clobber: bool = False) pandas.DataFrame[source]

Get the EIA plant-parts with only the unique granularities.

get_plants_ferc1(clobber: bool = False) pandas.DataFrame[source]

Prepare FERC1 plants data for record linkage with EIA plant-parts.

This method grabs two tables (plants_all_ferc1 and fuel_by_plant_ferc1, originally accessed via pudl.output.pudltabl.PudlTabl.plants_all_ferc1() and pudl.output.pudltabl.PudlTabl.fbp_ferc1() respectively) and ensures that their columns are the same as their EIA counterparts, because the output of this method will be used to link FERC and EIA.

Returns:

A cleaned table of FERC1 plant records with fuel cost data.

get_train_df() pandas.DataFrame[source]

Get the training connections.

Prepare them if the training data hasn’t been connected to FERC data yet.

get_train_records(dataset_df: pandas.DataFrame, dataset_id_col: Literal[record_id_eia, record_id_ferc1]) pandas.DataFrame[source]

Generate a set of known connections from a dataset using training data.

This method grabs only the records from the dataset (EIA or FERC) that we have in our training data.

Parameters:
  • dataset_df – Either the FERC1 plants table (result of get_plants_ferc1()) or the EIA plant-parts table (result of get_plant_parts_eia_true()).

  • dataset_id_col – Identifying column name. Either record_id_eia for plant_parts_eia_true or record_id_ferc1 for plants_ferc1.
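The selection described above can be sketched with plain pandas. This is a minimal illustration, not PUDL's implementation: the helper name select_train_records, the explicit train_df argument, the capacity_mw column, and the toy record IDs are all assumptions made for the example (the real method takes the training data from the InputManager instance).

```python
import pandas as pd


def select_train_records(
    dataset_df: pd.DataFrame, train_df: pd.DataFrame, dataset_id_col: str
) -> pd.DataFrame:
    """Keep only the rows of dataset_df whose ID appears in the training data.

    Hypothetical helper: train_df is assumed to hold the training
    connections with record_id_eia and record_id_ferc1 as ordinary columns.
    """
    train_ids = train_df[dataset_id_col].dropna().unique()
    return dataset_df[dataset_df[dataset_id_col].isin(train_ids)]


# Toy stand-ins for the training connections and the FERC1 plants table.
train_df = pd.DataFrame(
    {
        "record_id_eia": ["eia_a", "eia_b"],
        "record_id_ferc1": ["ferc_1", "ferc_2"],
    }
)
plants_ferc1 = pd.DataFrame(
    {
        "record_id_ferc1": ["ferc_1", "ferc_2", "ferc_3"],
        "capacity_mw": [100.0, 50.0, 25.0],
    }
)

# Only the FERC1 records that have a known connection are kept.
train_ferc1 = select_train_records(plants_ferc1, train_df, "record_id_ferc1")
```

The same call with dataset_id_col set to record_id_eia would select the EIA side of the known connections.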

get_train_eia(clobber: bool = False) pandas.DataFrame[source]

Get the known training data from EIA.

get_train_ferc1(clobber: bool = False) pandas.DataFrame[source]

Get the known training data from FERC1.

execute(clobber: bool = False)[source]

Compile all the inputs.

Run this method only when you want to ensure that all of the inputs are generated at once. While using InputManager, it is preferred to access each input dataframe or index via its get_ method instead of accessing the attribute directly.
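The preference for get_ methods over direct attribute access suggests a cached-getter pattern. The sketch below shows one common way such a pattern is written; it assumes that clobber=True means "rebuild even if a cached result exists", which this page does not state. The class CachedInputs, its attributes, and the dropna() preparation step are hypothetical and are not part of InputManager.

```python
import pandas as pd


class CachedInputs:
    """Hypothetical stand-in showing a cached getter with a clobber flag."""

    def __init__(self, raw: pd.DataFrame):
        self.raw = raw
        self._prepared = None  # cache; filled on first access
        self.build_count = 0  # counts how often the table is rebuilt

    def get_prepared(self, clobber: bool = False) -> pd.DataFrame:
        """Return the prepared table, rebuilding it only when needed."""
        # Assumed semantics: clobber forces a rebuild of the cached table.
        if clobber or self._prepared is None:
            self.build_count += 1
            self._prepared = self.raw.dropna()
        return self._prepared


inputs = CachedInputs(pd.DataFrame({"x": [1.0, None, 3.0]}))
first = inputs.get_prepared()  # builds the table
second = inputs.get_prepared()  # reuses the cached table
third = inputs.get_prepared(clobber=True)  # forces a rebuild
```

Going through the getter means a caller never sees an attribute that has not been built yet, which is presumably why the get_ methods are preferred.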

pudl.analysis.record_linkage.eia_ferc1_inputs.restrict_train_connections_on_date_range(train_df: pandas.DataFrame, id_col: Literal[record_id_eia, record_id_ferc1], start_date: pandas.Timestamp, end_date: pandas.Timestamp) pandas.DataFrame[source]

Restrict the training data based on the date ranges of the input tables.

The training data for this model spans the full PUDL date range. We don’t want to add training data from dates that are outside of the range of the FERC and EIA data we are attempting to match. So this function restricts the training data based on start and end dates.

The training data consists only of record IDs, which contain the report year. This function compiles a regex from the date range to select only the training records whose IDs contain a year within that range, preceded and followed by _, in the format of record_id_eia and record_id_ferc1. We use that extracted year to determine
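The regex approach described here can be sketched as follows. This is an illustration under stated assumptions, not the function's actual body: the helper name restrict_ids_to_years is hypothetical, the record IDs are made up, and the IDs are held in an ordinary column rather than an index.

```python
import pandas as pd


def restrict_ids_to_years(
    train_df: pd.DataFrame,
    id_col: str,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp,
) -> pd.DataFrame:
    """Keep rows whose ID contains a year in the date range, wrapped in underscores."""
    years = "|".join(str(year) for year in range(start_date.year, end_date.year + 1))
    # Non-capturing group so the pattern is used purely as a filter.
    pattern = rf"_(?:{years})_"
    return train_df[train_df[id_col].str.contains(pattern, regex=True)]


# Made-up IDs; only the "_<year>_" part matters for this sketch.
train_df = pd.DataFrame(
    {
        "record_id_eia": [
            "1_2018_plant_owned_10",
            "1_2019_plant_owned_10",
            "1_2020_plant_owned_10",
            "2_2021_plant_total_20",
        ]
    }
)

restricted = restrict_ids_to_years(
    train_df,
    "record_id_eia",
    start_date=pd.Timestamp("2019-01-01"),
    end_date=pd.Timestamp("2020-12-31"),
)
```

Requiring the underscores on both sides keeps a four-digit plant or utility ID at the start of a record ID from being mistaken for a report year.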

pudl.analysis.record_linkage.eia_ferc1_inputs.prep_train_connections(ppe: pandas.DataFrame, start_date: pandas.Timestamp, end_date: pandas.Timestamp) pandas.DataFrame[source]

Get and prepare the training connections for the model.

We have stored training data, which consists of records with ID columns for both FERC and EIA. Those ID columns serve as a connection between FERC1 plants and the EIA plant-parts. These connections indicate that a FERC1 plant record is reported at the same granularity as the connected EIA plant-parts record.

Parameters:
  • ppe – The EIA plant parts. Records from this dataframe will be connected to the training data records. This needs to be the full EIA plant parts, not just the distinct/true granularities because the training data could contain non-distinct records and this function reassigns those to their distinct counterparts.

  • start_date – Beginning date for records from the training data. Should match the start date of ppe. Default is None and all the training data will be used.

  • end_date – Ending date for records from the training data. Should match the end date of ppe. Default is None and all the training data will be used.

Returns:

A dataframe of training connections which has a MultiIndex of record_id_eia and record_id_ferc1.
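For readers unfamiliar with this return shape, the sketch below builds a toy dataframe with the same kind of MultiIndex and shows two ways to query it. The record IDs and the source column are invented for the example; only the index level names come from the description above.

```python
import pandas as pd

# Toy training connections: each row pairs one EIA record with one FERC1 record.
train_df = pd.DataFrame(
    {
        "record_id_eia": ["eia_a", "eia_b"],
        "record_id_ferc1": ["ferc_1", "ferc_2"],
        "source": ["toy", "toy"],  # hypothetical extra column
    }
).set_index(["record_id_eia", "record_id_ferc1"])

# Pull out one side of the connections from the index.
eia_ids = list(train_df.index.get_level_values("record_id_eia"))

# Check whether a particular pair is a known connection.
is_known_pair = ("eia_a", "ferc_1") in train_df.index
is_unknown_pair = ("eia_a", "ferc_2") in train_df.index
```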