pudl.analysis.classify_plants_ferc1#

Scikit-Learn classification pipeline for identifying related FERC 1 plant records.

Sadly FERC doesn’t provide any kind of real IDs for the plants that report to them – all we have is their names (a freeform string) and the data that is reported alongside them. This is often enough information to recognize which records ought to be associated with each other year to year, creating a continuous time series. However, we want to do that programmatically, which means using some clustering / categorization tools from scikit-learn.

Module Contents#

Classes#

FERCPlantClassifier

A classifier for identifying FERC plant time series in FERC Form 1 data.

Functions#

make_ferc1_clf(plants_df[, ngram_min, ngram_max, ...])

Create a FERC Plant Classifier using several weighted features.

plants_steam_assign_plant_ids(→ pandas.DataFrame)

Assign IDs to the large steam plants.

revert_filled_in_string_nulls(→ pandas.DataFrame)

Revert the filled nulls from string columns.

revert_filled_in_float_nulls(→ pandas.DataFrame)

Revert the filled nulls from float columns.

plants_steam_validate_ids(→ pandas.DataFrame)

Tests that each plant_id_ferc1 time series includes one record per year.

fuel_by_plant_ferc1(→ pandas.DataFrame)

Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.

Attributes#

pudl.analysis.classify_plants_ferc1.logger[source]#
class pudl.analysis.classify_plants_ferc1.FERCPlantClassifier(plants_df: pandas.DataFrame, min_sim: float = 0.75)[source]#

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

A classifier for identifying FERC plant time series in FERC Form 1 data.

We want to be able to give the classifier a FERC plant record, and get back the group of records (or the ID of the group of records) that it ought to be part of.

There are hundreds of different groups of records, and we can only know what they are by looking at the whole dataset ahead of time. This is the “fitting” step, in which the groups of records resulting from a particular set of model parameters (e.g. the weights that are attributes of the class) are generated.

Once we have that set of record categories, we can test how well the classifier performs by checking it against test / training data which we have already classified by hand. The test / training set is a list of lists of unique FERC plant record IDs (each record ID is the concatenation of: report year, respondent id, supplement number, and row number). It could also be stored as a dataframe where each column is associated with a year of data (some of which could be empty). It’s not yet clear what the best structure would be.

If it’s useful, we can assign each group a unique ID that is the time-ordered concatenation of each of the constituent record IDs. We still need to understand what the process for checking the classification of an input record looks like.

To score a given classifier, we can look at what proportion of the records in the test dataset are assigned to the same group as in our manual classification of those records. There are much more complicated ways to do the scoring too… but for now let’s just keep it as simple as possible.
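
For illustration, the simple scoring idea above might look like this on toy data (the record IDs and group labels here are invented, not real FERC records):

    import pandas as pd

    # Compare predicted group assignments against a hand-classified "ground truth"
    # and compute the fraction of records that land in the same group.
    predicted = pd.Series(["g1", "g1", "g2", "g3"], index=["rec_a", "rec_b", "rec_c", "rec_d"])
    manual = pd.Series(["g1", "g2", "g2", "g3"], index=["rec_a", "rec_b", "rec_c", "rec_d"])
    score = (predicted == manual).mean()  # proportion assigned to the same group
    print(score)  # 0.75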

fit(X, y=None) → FERCPlantClassifier[source]#

Use weighted FERC plant features to group records into time series.

The fit method takes the vectorized, normalized, weighted FERC plant features (X) as input, calculates the pairwise cosine similarity matrix between all records, and groups the records in their best time series. The similarity matrix and best time series are stored as data members in the object for later use in scoring & predicting.

This isn’t quite the way a fit method would normally work.

Parameters:
  • X – a sparse matrix of size n_samples x n_features.

  • y – Included only for API compatibility.
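
A minimal sketch of the similarity calculation described above, assuming X is the weighted, normalized feature matrix produced by the preprocessing pipeline (the toy array below stands in for real FERC plant features):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy stand-in for the n_samples x n_features matrix the feature pipeline produces.
    X = np.array([
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.0],
        [0.0, 0.1, 0.9],
    ])
    sim_matrix = cosine_similarity(X)        # pairwise cosine similarity, shape (3, 3)
    candidate_matches = sim_matrix >= 0.75   # pairs above min_sim are candidate matches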

transform(X, y=None)[source]#

Passthrough transform method – just returns self.

predict(X, y=None)[source]#

Identify time series of similar records to input record_ids.

Given a one-dimensional dataframe X, containing FERC record IDs, return a dataframe in which each row corresponds to one of the input record_id values (ordered as the input was ordered), with each column corresponding to one year’s worth of data. Values in the returned dataframe are the FERC record_ids of the record most similar to the input record within that year. Some of them may be null, if there was no sufficiently good match.

Row index is the seed record IDs. Column index is years.

Todo:
  • This method is hideously inefficient. It should be vectorized.
  • There’s a line that throws a FutureWarning that needs to be fixed.
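
Purely as an illustration, the returned dataframe looks roughly like this (the record IDs below are invented and do not follow the real record ID format):

    import pandas as pd

    # Hypothetical output: rows are the seed record IDs passed in, columns are report
    # years, and values are the best-matching record ID in that year (or NA when no
    # match cleared min_sim).
    pd.DataFrame(
        {
            1994: ["rec_1994_a", pd.NA],
            1995: ["rec_1995_a", "rec_1995_b"],
        },
        index=["rec_1995_a", "rec_1995_b"],
    )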

score(X, y=None)[source]#

Scores a collection of FERC plant categorizations.

For every record ID in X, predict its record group and calculate a metric of similarity between the prediction and the “ground truth” group that was passed in for that value of X.

Parameters:
  • X (pandas.DataFrame) – an n_samples x 1 pandas dataframe of FERC Form 1 record IDs.

  • y (pandas.DataFrame) – a dataframe of “ground truth” FERC Form 1 record groups, corresponding to the record IDs listed in X.

Returns:

The average of all the similarity metrics, which serves as the score.

Return type:

numpy.ndarray

_best_by_year()[source]#

Finds the best match for each plant record in each other year.

pudl.analysis.classify_plants_ferc1.make_ferc1_clf(plants_df, ngram_min=2, ngram_max=10, min_sim=0.75, plant_name_ferc1_wt=2.0, plant_type_wt=2.0, construction_type_wt=1.0, capacity_mw_wt=1.0, construction_year_wt=1.0, utility_id_ferc1_wt=1.0, fuel_fraction_wt=1.0)[source]#

Create a FERC Plant Classifier using several weighted features.

Given a FERC steam plants dataframe plants_df, which also includes fuel consumption information, transform a selection of useful columns into features suitable for use in calculating inter-record cosine similarities. Individual features are weighted according to the keyword arguments.

Features include:

  • plant_name (via TF-IDF, with ngram_min and ngram_max as parameters)

  • plant_type (OneHot encoded categorical feature)

  • construction_type (OneHot encoded categorical feature)

  • capacity_mw (MinMax scaled numerical feature)

  • construction_year (OneHot encoded categorical feature)

  • utility_id_ferc1 (OneHot encoded categorical feature)

  • fuel_fraction_mmbtu (several MinMax scaled numerical columns, which are normalized and treated as a single feature)

This feature matrix is then used to instantiate a FERCPlantClassifier.

The ColumnTransformer and FERCPlantClassifier are combined into an sklearn Pipeline, which is returned by the function.
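
As a rough sketch (not the actual implementation), the weighted feature matrix described above could be assembled with an sklearn ColumnTransformer along these lines; the column names and weights shown here are illustrative assumptions:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    feature_xfmr = ColumnTransformer(
        transformers=[
            # Character n-grams of the freeform plant name, TF-IDF weighted.
            # The column is passed as a bare string so TfidfVectorizer receives 1-D input.
            ("plant_name_ferc1",
             TfidfVectorizer(analyzer="char", ngram_range=(2, 10)),
             "plant_name_ferc1"),
            # Categorical features, one-hot encoded.
            ("plant_type", OneHotEncoder(handle_unknown="ignore"), ["plant_type"]),
            ("construction_type", OneHotEncoder(handle_unknown="ignore"), ["construction_type"]),
            ("construction_year", OneHotEncoder(handle_unknown="ignore"), ["construction_year"]),
            ("utility_id_ferc1", OneHotEncoder(handle_unknown="ignore"), ["utility_id_ferc1"]),
            # Numerical feature, rescaled to [0, 1].
            ("capacity_mw", MinMaxScaler(), ["capacity_mw"]),
        ],
        # Per-feature weights applied before the record vectors are normalized.
        transformer_weights={
            "plant_name_ferc1": 2.0,
            "plant_type": 2.0,
            "construction_type": 1.0,
            "construction_year": 1.0,
            "utility_id_ferc1": 1.0,
            "capacity_mw": 1.0,
        },
    )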

Parameters:
  • ngram_min (int) – the minimum n-gram length to consider in the vectorization of the plant_name feature.

  • ngram_max (int) – the maximum n-gram length to consider in the vectorization of the plant_name feature.

  • min_sim (float) – the minimum cosine similarity between two records that can be considered a “match” (a number between 0.0 and 1.0).

  • plant_name_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • plant_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • capacity_mw_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_year_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • utility_id_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • fuel_fraction_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

Returns:

an sklearn Pipeline that performs preprocessing and classification with a FERCPlantClassifier object.

Return type:

sklearn.pipeline.Pipeline
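
A hedged usage sketch, assuming plants_df is an already-prepared FERC 1 steam plants dataframe that includes the fuel fraction columns described above (this is not a verbatim excerpt from the PUDL ETL):

    from pudl.analysis.classify_plants_ferc1 import make_ferc1_clf

    # plants_df: assumed input, an already-prepared FERC 1 steam plants dataframe
    # with fuel fraction columns; it is not constructed here.
    ferc1_pipe = make_ferc1_clf(plants_df, ngram_min=2, ngram_max=10, min_sim=0.75)
    ferc1_pipe.fit(plants_df)  # groups the records into candidate time series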

pudl.analysis.classify_plants_ferc1.plants_steam_assign_plant_ids(ferc1_steam_df: pandas.DataFrame, ferc1_fuel_df: pandas.DataFrame, fuel_categories: list[str]) → pandas.DataFrame[source]#

Assign IDs to the large steam plants.

pudl.analysis.classify_plants_ferc1.revert_filled_in_string_nulls(df: pandas.DataFrame) → pandas.DataFrame[source]#

Revert the filled nulls from string columns.

Many of the columns used for classification in plants_steam_assign_plant_ids() contain a lot of nulls. The classifier can’t handle nulls well, so for string columns we filled in the nulls with empty strings. This function replaces those empty strings with null values in the specific columns that are known to contain empty strings introduced for the classifier.

pudl.analysis.classify_plants_ferc1.revert_filled_in_float_nulls(df: pandas.DataFrame) → pandas.DataFrame[source]#

Revert the filled nulls from float columns.

Many of the columns used for classification in plants_steam_assign_plant_ids() contain a lot of nulls. The classifier can’t handle nulls well, so for float columns we filled in the nulls with zeros. This function replaces those zeros with nulls in all float columns.
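
A minimal sketch of both null reversions described above (string and float), using a hypothetical helper and an illustrative list of string columns; the real functions operate on a specific, known set of columns:

    import numpy as np
    import pandas as pd

    def _revert_classifier_nulls(df: pd.DataFrame, string_cols: list[str]) -> pd.DataFrame:
        """Hypothetical helper: undo the null-filling that was done for the classifier."""
        out = df.copy()
        # Empty strings introduced for the classifier become proper nulls again.
        for col in string_cols:
            out[col] = out[col].replace("", pd.NA)
        # Zeros introduced in float columns become NaN again.
        float_cols = out.select_dtypes(include=["float"]).columns
        out[float_cols] = out[float_cols].replace(0.0, np.nan)
        return out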

pudl.analysis.classify_plants_ferc1.plants_steam_validate_ids(ferc1_steam_df: pandas.DataFrame) → pandas.DataFrame[source]#

Tests that each plant_id_ferc1 time series includes one record per year.

Parameters:

ferc1_steam_df – A DataFrame of the data from the FERC 1 Steam table.

Returns:

The input dataframe, to enable method chaining.
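
An illustrative check in the spirit of this validation; the real function’s error handling and logging may differ:

    import pandas as pd

    def _check_one_record_per_year(ferc1_steam_df: pd.DataFrame) -> pd.DataFrame:
        """Hypothetical check: no plant_id_ferc1 may appear twice in one report_year."""
        dupes = ferc1_steam_df.duplicated(subset=["plant_id_ferc1", "report_year"])
        if dupes.any():
            raise ValueError(f"{dupes.sum()} duplicate plant-year records found.")
        return ferc1_steam_df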

pudl.analysis.classify_plants_ferc1.fuel_by_plant_ferc1(fuel_df: pandas.DataFrame, fuel_categories: list[str], thresh: float = 0.5) → pandas.DataFrame[source]#

Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.

Each record in the FERC Form 1 corresponds to a particular type of fuel. Many plants – especially coal plants – use more than one fuel, with gas and/or diesel serving as startup fuels. In order to be able to classify the type of plant based on relative proportions of fuel consumed or fuel costs it is useful to aggregate these per-fuel records into a single record for each plant.

Fuel cost (in nominal dollars) and fuel heat content (in mmBTU) are calculated for each fuel based on the cost and heat content per unit and the number of units consumed, and are then summed by fuel type (there can be more than one record for a given type of fuel in each plant because we are simplifying the fuel categories). The per-fuel records are then pivoted to create one column per fuel type. The totals are stored separately, and the individual fuel costs & heat contents are divided by those totals to yield fuel proportions. Based on those proportions and a minimum threshold that’s passed in, a “primary” fuel type is then assigned to the plant-year record and given a string label.
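
A simplified sketch of that aggregation using made-up data, covering only the heat-content side of the calculation (the real function also handles fuel costs and additional columns):

    import pandas as pd

    fuel = pd.DataFrame({
        "report_year": [2020, 2020, 2020],
        "plant_name": ["plant_a", "plant_a", "plant_a"],
        "fuel_type": ["coal", "gas", "gas"],
        "fuel_mmbtu": [900.0, 50.0, 50.0],
    })

    # Sum heat content by fuel within each plant-year, then pivot to one column per fuel.
    by_plant = (
        fuel.groupby(["report_year", "plant_name", "fuel_type"])["fuel_mmbtu"]
        .sum()
        .unstack("fuel_type")
        .fillna(0.0)
    )
    total_mmbtu = by_plant.sum(axis="columns")
    fractions = by_plant.div(total_mmbtu, axis="index")  # per-fuel share of heat content
    thresh = 0.5
    primary_fuel = fractions.idxmax(axis="columns").where(
        fractions.max(axis="columns") >= thresh
    )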

Parameters:
  • fuel_df – Pandas DataFrame resembling the post-transform result for the fuel_ferc1 table.

  • thresh – A value between 0.5 and 1.0 indicating the minimum fraction of overall heat content that must have been provided by a fuel in a plant-year for it to be considered the “primary” fuel for the plant in that year. Default value: 0.5.

Returns:

DataFrame with a single record for each plant-year, including the columns required to merge it with the plants_steam_ferc1 table/DataFrame (report_year, utility_id_ferc1, and plant_name), as well as the total fuel heat content (mmBTU) consumed and the total fuel cost for that plant-year, the proportions of heat content and fuel cost attributable to each fuel in that year, and a column labeling the plant’s primary fuel for that year.

Raises:

AssertionError – If the DataFrame input does not have the columns required to run the function.