pudl.analysis.record_linkage.eia_ferc1_train

Create spreadsheets for manually mapping FERC-EIA records and validate matches.

pudl.analysis.record_linkage.eia_ferc1_record_linkage uses machine learning to link records from FERC Form 1 with records from EIA. While this process is way more efficient and logical than a human, it requires a set of hand-compiled training data in order to do it’s job.

The training data also serve as overrides for otherwise bad AI matches. There are several examples of plants that require human intuition to make sense of. For instance, sometimes FERC capacities lag behind by several years or are comprised of two or more EIA records.

This module creates an output spreadsheet, based on a certain utility, that makes the matching and machine-matched human validation process much easier. It also contains functions that will read those new/updated/validated matches from the spreadsheet, validate them, and incorporate them into the existing training data.

Module Contents

Functions

_pct_diff(→ pandas.DataFrame)

Calculate percent difference between EIA and FERC versions of a column.

_is_best_match(→ pandas.DataFrame)

Fill the best_match column with strings to show cap, net_gen, inst_year match.

_prep_eia_ferc1(→ pandas.DataFrame)

Prep FERC-EIA for use in override output sheet pre-utility subgroups.

_prep_ppe(→ pandas.DataFrame)

Prep PPE table for use in override output sheet pre-utility subgroups.

_prep_deprish(→ pandas.DataFrame)

Prep depreciation data for use in override output sheet pre-utility subgroups.

_get_util_year_subsets(→ dict)

Get utility and year subsets for each of the input dfs.

_output_override_spreadsheet(→ None)

Output spreadsheet with tabs for ferc-eia, ppe, deprish for one utility.

generate_all_override_spreadsheets(→ None)

Output override spreadsheets for all specified utilities and years.

_check_id_consistency(→ None)

Check for rogue FERC or EIA ids that don't exist.

check_if_already_in_training(training_data, ...)

Check whether any manually mapped records aren't yet in the training data.

validate_override_fixes(→ pandas.DataFrame)

Process the verified and/or fixed matches and look for human error.

get_multi_match_df(→ pandas.DataFrame)

Process the verified and/or fixed matches and generate a list of 1:m matches.

_add_to_training(→ None)

Add the new overrides to the old override sheet.

_add_to_null_overrides(→ None)

Take record_id_ferc1 values verified to have no EIA match and add them to csv.

_add_to_one_to_many_overrides(→ None)

Add record_id_ferc1 values verified to have multiple EIA matches to csv.

validate_and_add_to_training(→ None)

Validate, combine, and add overrides to the training data.

Attributes

pudl.analysis.record_linkage.eia_ferc1_train.logger[source]
pudl.analysis.record_linkage.eia_ferc1_train.RENAME_COLS_FERC1_EIA: dict[source]
pudl.analysis.record_linkage.eia_ferc1_train.RELEVANT_COLS_PPE: list = ['record_id_eia', 'report_year', 'utility_id_pudl', 'utility_id_eia', 'utility_name_eia',...[source]
pudl.analysis.record_linkage.eia_ferc1_train._pct_diff(df, col) pandas.DataFrame[source]

Calculate percent difference between EIA and FERC versions of a column.

pudl.analysis.record_linkage.eia_ferc1_train._is_best_match(df, cap_pct_diff=6, net_gen_pct_diff=6, inst_year_diff=3) pandas.DataFrame[source]

Fill the best_match column with strings to show cap, net_gen, inst_year match.

The process of manually checking all of the FERC-EIA matches made by the machine learning algorithm is tedius. This function makes it easier to speed through the obviously good matches and pay more attention to those that are more questionable.

By default, a “best match” is comprised of a FERC-EIA match with a capacity percent difference of less than 6%, a net generation percent difference of less than 6%, and an installation year difference of less than 3 years.

pudl.analysis.record_linkage.eia_ferc1_train._prep_eia_ferc1(eia_ferc1, utils_eia860) pandas.DataFrame[source]

Prep FERC-EIA for use in override output sheet pre-utility subgroups.

Parameters:
  • eia_ferc1 (pd.DataFrame) – The dataframe resulting from pudl_out.ferc1_eia().

  • utils_eia860 (pd.DataFrame) – The dataframe resulting from pudl_out.utils_eia860.

Returns:

A version of the eia_ferc1 table that’s been modified for

the purposes of creating an manual mapping spreadsheet.

Return type:

pd.DataFrame

pudl.analysis.record_linkage.eia_ferc1_train._prep_ppe(ppe, utils_eia860) pandas.DataFrame[source]

Prep PPE table for use in override output sheet pre-utility subgroups.

Parameters:
  • ppe (pd.DataFrame) – The dataframe resulting from pudl_out.plant_parts_eia

  • utils_eia860 (pd.DataFrame) – The dataframe resulting from pudl_out.utils_eia860.

Returns:

A version of the plant_parts_eia table that’s been modified for

the purposes of creating an manual mapping spreadsheet.

Return type:

pd.DataFrame

pudl.analysis.record_linkage.eia_ferc1_train._prep_deprish(deprish, utils_eia860) pandas.DataFrame[source]

Prep depreciation data for use in override output sheet pre-utility subgroups.

pudl.analysis.record_linkage.eia_ferc1_train._get_util_year_subsets(inputs_dict, util_id_eia_list, years) dict[source]

Get utility and year subsets for each of the input dfs.

After generating the dictionary with all of the inputs tables loaded, we’ll want to create subsets of each of those tables based on the utility and year inputs we’re given. This function takes the input dict generated in _generate_input_dfs() and outputs an updated version with df values pertaining to the utilities in util_id_eia_list and years in years.

Parameters:
  • inputs_dict (dict) – The output of running _generation_input_dfs()

  • util_id_eia_list (list) – A list of the utility_id_eia values you want to include in a single spreadsheet output. Generally this is a list of the subsidiaries that pertain to a single parent company.

  • years (list) – A list of the years you’d like to add to the override sheets.

Returns:

A subset of the inputs_dict that contains versions of the value dfs that

pertain only to the utilites and years specified in util_id_eia_list and years.

Return type:

dict

pudl.analysis.record_linkage.eia_ferc1_train._output_override_spreadsheet(util_year_subset_dict, util_name, output_dir_path) None[source]

Output spreadsheet with tabs for ferc-eia, ppe, deprish for one utility.

Parameters:
  • util_year_subset_dict (dict) – The output from _get_util_year_subsets()

  • util_name (str) – A string indicating the name of the utility that you are creating an override sheet for. The string will be used as the suffix for the name of the excel file. Ex: for util_name = “BHE”, the file name will be BHE_fix_FERC-EIA_overrides.xlsx.

  • output_dir_path (str) – The relative path to the folder where you’d like to output the override spreadsheets that this function creates.

pudl.analysis.record_linkage.eia_ferc1_train.generate_all_override_spreadsheets(eia_ferc1, ppe, utils_eia860, util_dict, years, output_dir_path) None[source]

Output override spreadsheets for all specified utilities and years.

These manual override files will be output to a folder called “overrides” in the output directory.

Parameters:
  • eia_ferc1 (pd.DataFrame) – The dataframe resulting from pudl_out.ferc1_eia().

  • ppe (pd.DataFrame) – The dataframe resulting from pudl_out.plant_parts_eia

  • utils_eia860 (pd.DataFrame) – The dataframe resulting from pudl_out.utils_eia860.

  • util_dict (dict) – A dictionary with keys that are the names of utility parent companies and values that are lists of subsidiary utility_id_eia values. EIA values are used instead of PUDL in this case because PUDL values are subject to change.

  • years (list) – A list of the years you’d like to add to the override sheets.

  • output_dir_path (str) – The relative path to the folder where you’d like to output the override spreadsheets that this function creates.

pudl.analysis.record_linkage.eia_ferc1_train._check_id_consistency(id_col: Literal[record_id_eia_override_1, record_id_ferc1], df, actual_ids, error_message) None[source]

Check for rogue FERC or EIA ids that don’t exist.

Parameters:
  • id_col (str) – The name of either the ferc record id column: record_id_ferc1 or the eia record override column: record_id_eia_override_1.

  • df (pd.DataFrame) – A dataframe of intended overrides.

  • actual_ids (list) – A list of the ferc or eia ids that are valid and come from either the ppe or official ferc-eia record linkage.

  • error_message (str) – A short string to indicate the type of error you’re checking for. This could be looking for values that aren’t in the official list or values that are already in the training data.

pudl.analysis.record_linkage.eia_ferc1_train.check_if_already_in_training(training_data, validated_connections)[source]

Check whether any manually mapped records aren’t yet in the training data.

This function is useful for instances where you’ve started the manual mapping process, taken an extended break, and need to check whether the data you’ve mapped has been integrated into the training data or not.

pudl.analysis.record_linkage.eia_ferc1_train.validate_override_fixes(validated_connections, ppe, eia_ferc1, training_data, expect_override_overrides=False, allow_mismatched_utilities=True) pandas.DataFrame[source]

Process the verified and/or fixed matches and look for human error.

Parameters:
  • validated_connections (pd.DataFrame) – A dataframe in the add_to_training directory that is ready to be added to be validated and subsumed into the training data.

  • ppe (pd.DataFrame) – The dataframe resulting from pudl_out.plant_parts_eia

  • eia_ferc1 (pd.DataFrame) – The dataframe resulting from pudl_out.ferc1_eia

  • training_data (pd.DataFrame) – The current FERC-EIA training data

  • expect_override_overrides (boolean) – Whether you expect the tables to have overridden matches already in the training data.

  • allow_mismatched_utilities (boolean) – Whether you want to allow FERC and EIA record ids to come from different utilities.

Raises:
  • AssertionError – If there are EIA override id records that aren’t in the original FERC-EIA connection.

  • AssertionError – If there are FERC record ids that aren’t in the original FERC-EIA connection.

  • AssertionError – If there are EIA override ids that are duplicated throughout the override document.

  • AssertionError – If the utility id in the EIA override id doesn’t match the pudl id cooresponding with the FERC record.

  • AssertionError – If there are EIA override id records that don’t correspond to the correct report year.

  • AssertionError – If you didn’t expect to override overrides but the new training data implies an override to the existing training data.

Returns:

The validated FERC-EIA dataframe you’re trying to add to the

training data.

Return type:

pd.DataFrame

pudl.analysis.record_linkage.eia_ferc1_train.get_multi_match_df(training_data) pandas.DataFrame[source]

Process the verified and/or fixed matches and generate a list of 1:m matches.

Filter the dataframe to only include FERC records with more than one EIA match. Melt this dataframe to report all matched EIA records in the record_id_eia_override_1 column.

Parameters:

training_data (pd.DataFrame) – A dataframe in the add_to_training directory that is ready to be validated and subsumed into the training data.

Returns:

A dataframe of 1:m matches formatted to fit into the existing

validation framework.

Return type:

pd.DataFrame

pudl.analysis.record_linkage.eia_ferc1_train._add_to_training(new_overrides, path_to_current_training) None[source]

Add the new overrides to the old override sheet.

pudl.analysis.record_linkage.eia_ferc1_train._add_to_null_overrides(null_matches, current_null_overrides_path) None[source]

Take record_id_ferc1 values verified to have no EIA match and add them to csv.

pudl.analysis.record_linkage.eia_ferc1_train._add_to_one_to_many_overrides(one_to_many, current_one_to_many_path) None[source]

Add record_id_ferc1 values verified to have multiple EIA matches to csv.

pudl.analysis.record_linkage.eia_ferc1_train.validate_and_add_to_training(utils_eia860, ppe, eia_ferc1, input_dir_path, expect_override_overrides=False, allow_mismatched_utilities=True, one_to_many=True) None[source]

Validate, combine, and add overrides to the training data.

Validating and combinging the records so you only have to loop through the files once. Runs the validate_override_fixes() function and add_to_training.

Parameters:
  • pudl_out (PudlTabl) – the pudl_out object generated in a notebook and passed in.

  • rmi_out (Output) – the rmi_out object generated in a notebook and passed in.

  • input_dir_path (str) – The path to the place where the matched files that you want to validate or integrate are.

  • expect_override_overrides (bool) – This value is explicitly assigned at the top of the notebook.

  • allow_mismatched_utilities (bool) – Whether you are allowed to have FERC-EIA matches from different utilities.

  • one_to_many (bool) – If True, will also validate and save csv of 1:m matches.

Returns:

A DataFrame with all of the new overrides combined.

Return type:

pandas.DataFrame