pudl.transform.ferc1 module

Routines for transforming FERC Form 1 data before loading into the PUDL DB.

This module provides a variety of functions that are used in cleaning up the FERC Form 1 data prior to loading into our database. This includes adopting standardized units and column names, standardizing the formatting of some string values, and correcting data entry errors which we can infer based on the existing data. It may also include removing bad data, or replacing it with the appropriate NA values.

class pudl.transform.ferc1.FERCPlantClassifier(min_sim=0.75, plants_df=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

A classifier for identifying FERC plant time series in FERC Form 1 data.

We want to be able to give the classifier a FERC plant record, and get back the group of records(or the ID of the group of records) that it ought to be part of.

There are hundreds of different groups of records, and we can only know what they are by looking at the whole dataset ahead of time. This is the “fitting” step, in which the groups of records resulting from a particular set of model parameters(e.g. the weights that are attributes of the class) are generated.

Once we have that set of record categories, we can test how well the classifier performs, by checking it against test / training data which we have already classified by hand. The test / training set is a list of lists of unique FERC plant record IDs(each record ID is the concatenation of: report year, respondent id, supplement number, and row number). It could also be stored as a dataframe where each column is associated with a year of data(some of which could be empty). Not sure what the best structure would be.

If it’s useful, we can assign each group a unique ID that is the time ordered concatenation of each of the constituent record IDs. Need to understand what the process for checking the classification of an input record looks like.

To score a given classifier, we can look at what proportion of the records in the test dataset are assigned to the same group as in our manual classification of those records. There are much more complicated ways to do the scoring too… but for now let’s just keep it as simple as possible.

fit(X, y=None)[source]

Use weighted FERC plant features to group records into time series.

The fit method takes the vectorized, normalized, weighted FERC plant features (X) as input, calculates the pairwise cosine similarity matrix between all records, and groups the records in their best time series. The similarity matrix and best time series are stored as data members in the object for later use in scoring & predicting.

This isn’t quite the way a fit method would normally work.

Parameters
  • () (y) – a sparse matrix of size n_samples x n_features.

  • ()

Returns

Return type

pandas.DataFrame

Todo

Zane revisit args and returns

predict(X, y=None)[source]

Identify time series of similar records to input record_ids.

Given a one-dimensional dataframe X, containing FERC record IDs, return a dataframe in which each row corresponds to one of the input record_id values (ordered as the input was ordered), with each column corresponding to one of the years worth of data. Values in the returned dataframe are the FERC record_ids of the record most similar to the input record within that year. Some of them may be null, if there was no sufficiently good match.

Row index is the seed record IDs. Column index is years.

TODO: * This method is hideously inefficient. It should be vectorized. * There’s a line that throws a FutureWarning that needs to be fixed.

score(X, y=None)[source]

Scores a collection of FERC plant categorizations.

For every record ID in X, predict its record group and calculate a metric of similarity between the prediction and the “ground truth” group that was passed in for that value of X.

Parameters
  • X (pandas.DataFrame) – an n_samples x 1 pandas dataframe of FERC Form 1 record IDs.

  • y (pandas.DataFrame) – a dataframe of “ground truth” FERC Form 1 record groups, corresponding to the list record IDs in X

Returns

The average of all the similarity metrics as the score.

Return type

numpy.ndarray

transform(X, y=None)[source]

Passthrough transform method – just returns self.

pudl.transform.ferc1.accumulated_depreciation(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 depreciation data for loading into PUDL.

This information is organized by FERC account, with each line of the FERC Form 1 having a different descriptive identifier like ‘balance_end_of_year’ or ‘transmission’.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.fuel(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 fuel data for loading into PUDL Database.

This process includes converting some columns to be in terms of our preferred units, like MWh and mmbtu instead of kWh and btu. Plant names are also standardized (stripped & Title Case). Fuel and fuel unit strings are also standardized using our cleanstrings() function and string cleaning dictionaries found in pudl.constants.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.fuel_by_plant_ferc1(fuel_df, thresh=0.5)[source]

Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.

Each record in the FERC Form 1 corresponds to a particular type of fuel. Many plants – especially coal plants – use more than one fuel, with gas and/or diesel serving as startup fuels. In order to be able to classify the type of plant based on relative proportions of fuel consumed or fuel costs it is useful to aggregate these per-fuel records into a single record for each plant.

Fuel cost (in nominal dollars) and fuel heat content (in mmBTU) are calculated for each fuel based on the cost and heat content per unit, and the number of units consumed, and then summed by fuel type (there can be more than one record for a given type of fuel in each plant because we are simplifying the fuel categories). The per-fuel records are then pivoted to create one column per fuel type. The total is summed and stored separately, and the individual fuel costs & heat contents are divided by that total, to yield fuel proportions. Based on those proportions and a minimum threshold that’s passed in, a “primary” fuel type is then assigned to the plant-year record and given a string label.

Parameters
  • fuel_df (pandas.DataFrame) – Pandas DataFrame resembling the post-transform result for the fuel_ferc1 table.

  • thresh (float) – A value between 0.5 and 1.0 indicating the minimum fraction of overall heat content that must have been provided by a fuel in a plant-year for it to be considered the “primary” fuel for the plant in that year. Default value: 0.5.

Returns

A DataFrame with a single record for each

Return type

pandas.DataFrame

plant-year, including the columns required to merge it with the plants_steam_ferc1 table/DataFrame (report_year, utility_id_ferc1, and plant_name) as well as totals for fuel mmbtu consumed in that plant-year, and the cost of fuel in that year, the proportions of heat content and fuel costs for each fuel in that year, and a column that labels the plant’s primary fuel for that year.

Raises

AssertionError – If the DataFrame input does not have the columns required to run the function.

pudl.transform.ferc1.make_ferc_clf(plants_df, ngram_min=2, ngram_max=10, min_sim=0.75, plant_name_wt=2.0, plant_type_wt=2.0, construction_type_wt=1.0, capacity_mw_wt=1.0, construction_year_wt=1.0, utility_id_ferc1_wt=1.0, fuel_fraction_wt=1.0)[source]

Create a FERC Plant Classifier using several weighted features.

Given a FERC steam plants dataframe plants_df, which also includes fuel consumption information, transform a selection of useful columns into features suitable for use in calculating inter-record cosine similarities. Individual features are weighted according to the keyword arguments.

Features include:
  • plant_name (via TF-IDF, with ngram_min and ngram_max as parameters)

  • plant_type (OneHot encoded categorical feature)

  • construction_type (OneHot encoded categorical feature)

  • capacity_mw (MinMax scaled numerical feature)

  • construction year (OneHot encoded categorical feature)

  • utility_id_ferc1 (OneHot encoded categorical feature)

  • fuel_fraction_mmbtu (several MinMax scaled numerical columns, which are normalized and treated as a single feature.)

This feature matrix is then used to instantiate a FERCPlantClassifier.

The combination of the ColumnTransformer and FERCPlantClassifier are combined in a sklearn Pipeline, which is returned by the function.

Parameters
  • ngram_min (int) – the minimum lengths to consider in the vectorization of the plant_name feature.

  • ngram_max (int) – the maximum n-gram lengths to consider in the vectorization of the plant_name feature.

  • min_sim (float) – the minimum cosine similarity between two records that can be considered a “match” (a number between 0.0 and 1.0).

  • plant_name_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • plant_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • capacity_mw_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_year_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • utility_id_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • fuel_fraction_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

Returns

an sklearn Pipeline that performs reprocessing and classification with a FERCPlantClassifier object.

Return type

sklearn.pipeline.Pipeline

pudl.transform.ferc1.plant_in_service(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_in_service data for loading into PUDL.

This information is organized by FERC account, with each line of the FERC Form 1 having a different FERC account id (most are numeric and correspond to FERC’s Uniform Electric System of Accounts). As of PUDL v0.1, this data is only valid from 2007 onward, as the line numbers for several accounts are different in earlier years.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.plants_hydro(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_hydro data for loading into PUDL Database.

Standardizes plant names (stripping whitespace and Using Title Case). Also converts into our preferred units of MW and MWh.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_pumped_storage(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 pumped storage data for loading into PUDL.

Standardizes plant names (stripping whitespace and Using Title Case). Also converts into our preferred units of MW and MWh.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_small(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_small data for loading into PUDL Database.

This FERC Form 1 table contains information about a large number of small plants, including many small hydroelectric and other renewable generation facilities. Unfortunately the data is not well standardized, and so the plants have been categorized manually, with the results of that categorization stored in an Excel spreadsheet. This function reads in the plant type data from the spreadsheet and merges it with the rest of the information from the FERC DB based on record number, FERC respondent ID, and report year. When possible the FERC license number for small hydro plants is also manually extracted from the data.

This categorization will need to be renewed with each additional year of FERC data we pull in. As of v0.1 the small plants have been categorized for 2004-2015.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_steam(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_steam data for loading into PUDL Database.

This includes converting to our preferred units of MWh and MW, as well as standardizing the strings describing the kind of plant and construction.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

of transformed dataframes, including the newly transformed plants_steam_ferc1 dataframe.

Return type

dict

pudl.transform.ferc1.plants_steam_validate_ids(ferc1_steam_df)[source]

Tests that plant_id_ferc1 times series includes one record per year.

Parameters

ferc1_steam_df (pandas.DataFrame) – A DataFrame of the data from the FERC 1 Steam table.

Returns

None

pudl.transform.ferc1.purchased_power(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 pumped storage data for loading into PUDL.

This table has data about inter-utility power purchases into the PUDL DB. This includes how much electricty was purchased, how much it cost, and who it was purchased from. Unfortunately the field describing which other utility the power was being bought from is poorly standardized, making it difficult to correlate with other data. It will need to be categorized by hand or with some fuzzy matching eventually.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.transform(ferc1_raw_dfs, ferc1_tables=('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'plant_in_service_ferc1', 'purchased_power_ferc1', 'accumulated_depreciation_ferc1'))[source]

Transforms FERC 1.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database

  • ferc1_tables (tuple) – A tuple containing the set of tables which have been successfully integrated into PUDL

Returns

A dictionary of the transformed DataFrames.

Return type

dict