pudl.transform.ferc1#

Classes & functions to process FERC Form 1 data before loading into the PUDL DB.

Note that many of the classes/objects here inherit from/are instances of classes defined in pudl.transform.classes. Their design and relationships to each other are documented in that module.

See pudl.transform.params.ferc1 for the values that parameterize many of these transformations.

Module Contents#

Classes#

Ferc1Source

Enumeration of allowed FERC 1 raw data sources.

Ferc1TableId

Enumeration of the allowed FERC 1 table IDs.

Ferc1RenameColumns

Dictionaries for renaming either XBRL or DBF derived FERC 1 columns.

Ferc1TableTransformParams

A model defining what TransformParams are allowed for FERC Form 1.

Ferc1AbstractTableTransformer

An abstract class defining methods common to many FERC Form 1 tables.

FuelFerc1TableTransformer

A table transformer specific to the fuel_ferc1 table.

PlantsSteamFerc1TableTransformer

Transformer class for the plants_steam_ferc1 table.

PlantsHydroFerc1TableTransformer

A table transformer specific to the plants_hydro_ferc1 table.

PlantsPumpedStorageFerc1TableTransformer

Transformer class for plants_pumped_storage_ferc1 table.

PurchasedPowerTableTransformer

Transformer class for purchased_power_ferc1 table.

PlantInServiceFerc1TableTransformer

A transformer for the plant_in_service_ferc1 table.

PlantsSmallFerc1TableTransformer

A table transformer specific to the plants_small_ferc1 table.

Functions#

get_ferc1_dbf_rows_to_map(→ pandas.DataFrame)

Identify DBF rows that need to be mapped to XBRL columns.

update_dbf_to_xbrl_map(→ pandas.DataFrame)

Regenerate the FERC 1 DBF+XBRL glue while retaining existing mappings.

read_dbf_to_xbrl_map(→ pandas.DataFrame)

Read the manually compiled DBF row to XBRL column mapping for a given table.

fill_dbf_to_xbrl_map(→ pandas.DataFrame)

Forward-fill missing years in the minimal, manually compiled DBF to XBRL mapping.

get_data_cols_raw_xbrl(→ list[str])

Get a list of all XBRL data columns appearing in a given XBRL table.

transform(→ dict[str, pandas.DataFrame])

Coordinate the transformation of all FERC Form 1 tables.

Attributes#

pudl.transform.ferc1.logger[source]#
class pudl.transform.ferc1.Ferc1Source[source]#

Bases: enum.Enum

Enumeration of allowed FERC 1 raw data sources.

XBRL = xbrl[source]#
DBF = dbf[source]#
class pudl.transform.ferc1.Ferc1TableId[source]#

Bases: enum.Enum

Enumeration of the allowed FERC 1 table IDs.

Hard coding this doesn’t seem ideal. Somehow it should be either defined in the context of the Package, the Ferc1Settings, an etl_group, or DataSource. All of the table transformers associated with a given data source should have a table_id that’s from that data source’s subset of the database. Where should this really happen? Alternatively, the allowable values could be derived from the structure of the Package. But this works for now.

FUEL_FERC1 = fuel_ferc1[source]#
PLANTS_STEAM_FERC1 = plants_steam_ferc1[source]#
PLANTS_HYDRO_FERC1 = plants_hydro_ferc1[source]#
PLANTS_SMALL_FERC1 = plants_small_ferc1[source]#
PLANTS_PUMPED_STORAGE_FERC1 = plants_pumped_storage_ferc1[source]#
PLANT_IN_SERVICE_FERC1 = plant_in_service_ferc1[source]#
PURCHASED_POWER_FERC1 = purchased_power_ferc1[source]#
class pudl.transform.ferc1.Ferc1RenameColumns[source]#

Bases: pudl.transform.classes.TransformParams

Dictionaries for renaming either XBRL or DBF derived FERC 1 columns.

This is FERC 1 specific, because we need to store both DBF and XBRL rename dictionaires separately. Note that this parameter model does not have its own unique transform function. Like the generic pudl.transform.classes.RenameColumns it depends on the build in pd.rename() method, which is called with the values DBF or XBRL parameters depending on the context.

Potential parameters validations that could be implemented

  • Validate that all keys appear in the original dbf/xbrl sources. This has to be true, but right now we don’t have stored metadata enumerating all of the columns that exist in the raw data, so we don’t have anything to check against. Implement once when we have schemas defined for after the extract step.

  • Validate all values appear in PUDL tables, and all expected PUDL names are mapped. Actually we can’t require that the rename values appear in the PUDL tables, because there will be cases in which the original column gets dropped or modified, e.g. in the case of unit conversions with a column rename.

dbf :pudl.transform.classes.RenameColumns[source]#
xbrl :pudl.transform.classes.RenameColumns[source]#
class pudl.transform.ferc1.Ferc1TableTransformParams[source]#

Bases: pudl.transform.classes.TableTransformParams

A model defining what TransformParams are allowed for FERC Form 1.

This adds additional parameter models beyond the ones inherited from the pudl.transform.classes.AbstractTableTransformer class.

class Config[source]#

Only allow the known table transform params.

extra = forbid[source]#
rename_columns_ferc1 :Ferc1RenameColumns[source]#
rename_columns_instant_xbrl :pudl.transform.classes.RenameColumns[source]#
rename_columns_duration_xbrl :pudl.transform.classes.RenameColumns[source]#
pudl.transform.ferc1.get_ferc1_dbf_rows_to_map(ferc1_engine: sqlalchemy.engine.Engine) pandas.DataFrame[source]#

Identify DBF rows that need to be mapped to XBRL columns.

Select all records in the f1_row_lit_tbl where the row literal associated with a given combination of table and row number is different from the preceeding year. This is the smallest set of records which we can use to reproduce the whole table by expanding the time series to include all years, and forward filling the row literals.

pudl.transform.ferc1.update_dbf_to_xbrl_map(ferc1_engine: sqlalchemy.engine.Engine) pandas.DataFrame[source]#

Regenerate the FERC 1 DBF+XBRL glue while retaining existing mappings.

Reads all rows that need to be mapped out of the f1_row_lit_tbl and appends columns containing any previously mapped values, returning the resulting dataframe.

pudl.transform.ferc1.read_dbf_to_xbrl_map(dbf_table_name: str) pandas.DataFrame[source]#

Read the manually compiled DBF row to XBRL column mapping for a given table.

Parameters:

dbf_table_name – The original name of the table in the FERC Form 1 DBF database whose mapping to the XBRL data you want to extract. for example f1_plant_in_srvce.

Returns:

DataFrame with columns [report_year, row_number, row_type, xbrl_column_stem]

pudl.transform.ferc1.fill_dbf_to_xbrl_map(df: pandas.DataFrame, dbf_years: list[int] | None = None) pandas.DataFrame[source]#

Forward-fill missing years in the minimal, manually compiled DBF to XBRL mapping.

The relationship between a DBF row and XBRL column/fact/entity/whatever is mostly consistent from year to year. To minimize the amount of manual mapping work we have to do, we only map the years in which the relationship changes. In the end we do need a complete correspondence for all years though, and this function uses the minimal information we’ve compiled to fill in all the gaps, producing a complete mapping across all requested years.

One complication is that we need to explicitly indicate which DBF rows have headers in them (which don’t exist in XBRL), to differentiate them from null values in the exhaustive index we create below. We set a HEADER_ROW sentinel value so we can distinguish between two different reasons that we might find NULL values in the xbrl_column_stem field:

  1. It’s NULL because it’s between two valid mapped values (the NULL was created in our filling of the time series) and should thus be filled in, or

  2. It’s NULL because it was a header row in the DBF data, which means it should NOT be filled in. Without the HEADER_ROW value, when a row number from year X becomes associated with a non-header row in year X+1 the ffill will keep right on filling, associating all of the new header rows with the value of xbrl_column_stem that was associated with the old row number.

Parameters:
  • df – A dataframe containing a DBF row to XBRL mapping for a single FERC 1 DBF table.

  • dbf_years – The list of years that should have their DBF row to XBRL mapping filled in. This defaults to all available years of DBF data for FERC 1. In general this parameter should only be set to a non-default value for testing purposes.

Returns:

A complete mapping of DBF row number to XBRL columns for all years of data within a single FERC 1 DBF table. Has columns of [report_year, row_number, xbrl_column_stem]

pudl.transform.ferc1.get_data_cols_raw_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) list[str][source]#

Get a list of all XBRL data columns appearing in a given XBRL table.

Returns:

A list of all the data columns found in the original XBRL DB that correspond to the given PUDL table. Includes columns from both the instant and duration tables but excludes structural columns that appear in all XBRL tables.

class pudl.transform.ferc1.Ferc1AbstractTableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: pudl.transform.classes.AbstractTableTransformer

An abstract class defining methods common to many FERC Form 1 tables.

This subclass remains abstract because it does not define transform_main(), which is always going to be table-specific.

  • Methods that only apply to XBRL data should end with _xbrl

  • Methods that only apply to DBF data should end with _dbf

table_id :Ferc1TableId[source]#
parameter_model[source]#
params :Ferc1AbstractTableTransformer.parameter_model[source]#
has_unique_record_ids :bool = True[source]#

True if each record in the transformed table corresponds to one input record.

For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.

xbrl_metadata_json :list[dict[Any]] = [][source]#

An array of JSON objects extracted from the FERC 1 XBRL taxonomy.

xbrl_metadata_normalized :pandas.DataFrame[source]#

A semi-normalized dataframe containing table-specific XBRL metadata.

transform_start(raw_dbf: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame[source]#

Process the raw data until the XBRL and DBF inputs have been unified.

transform_main(df: pandas.DataFrame) pandas.DataFrame[source]#

Generic FERC1 main table transformer.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.

transform_end(df: pandas.DataFrame) pandas.DataFrame[source]#

Enforce the database schema and remove any cached dataframes.

normalize_metadata_xbrl(xbrl_fact_names: list[str] | None) pandas.DataFrame[source]#

Normalize XBRL metadata, select table-specific rows, and transform them.

In order to select the relevant rows from the normalized metadata, this function needs to know which XBRL facts pertain to the table being transformed. These are the names of the data columns in the raw XBRL data. To avoid creating a separate dependency on the FERC 1 XBRL DB, we defer the assignment of this class attribute until Ferc1AbstractTableTransformer.process_xbrl() is called, and read the column labels from the input dataframes.

align_row_numbers_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Align row-numbers across multiple years of DBF data.

This is a no-op in the abstract base class, but for row-oriented DBF data where the same data shows up in different row numbers in different years, it needs to be implemented. Parameterization TBD with additional experience. See: https://github.com/catalyst-cooperative/pudl/issues/2012

process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame[source]#

DBF-specific transformations that take place before concatenation.

process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame[source]#

XBRL-specific transformations that take place before concatenation.

wide_to_tidy_xbrl(df: pandas.DataFrame) pandas.DataFrame[source]#

Reshape the XBRL data from wide to tidy format.

This is a no-op in the abstract base class, but should be implemented for child classes which need to reshape the XBRL data for concatenation with DBF. Parameterization TBD based on additional experience. See: https://github.com/catalyst-cooperative/pudl/issues/2012

merge_instant_and_duration_tables_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame[source]#

Merge XBRL instant and duration tables, reshaping instant as needed.

FERC1 XBRL instant period signifies that it is true as of the reported date, while a duration fact pertains to the specified time period. The date column for an instant fact corresponds to the end_date column of a duration fact.

When merging the instant and duration tables, we need to preserve row order. For the small generators table, row order is how we label and extract information from header and note rows. Outer merging messes up the order, so we need to use a one-sided merge. So far, it seems like the duration df contains all the index values in the instant df. To be sure, there’s a check that makes sure there are no unique intant df index values. If that passes, we merge the instant table into the duration table, and the row order is preserved.

Note: This should always be applied before :meth:rename_columns

Parameters:
  • raw_xbrl_instant – table representing XBRL instant facts.

  • raw_xbrl_duration – table representing XBRL duration facts.

Returns:

A unified table combining the XBRL duration and instant facts, if both types of facts were present. If either input dataframe is empty, the other dataframe is returned unchanged, except that several unused columns are dropped. If both input dataframes are empty, an empty dataframe is returned.

process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame[source]#

Pre-processing required to make instant and duration tables compatible.

Column renaming is sometimes required because a few columns in the instant and duration tables do not have corresponding names that follow the naming conventions of ~95% of all the columns, which we rely on programmatically when reshaping and concatenating these tables together.

process_duration_xbrl(df: pandas.DataFrame) pandas.DataFrame[source]#

Pre-processing required to make instant and duration tables compatible.

Column renaming is sometimes required because a few columns in the instant and duration tables do not have corresponding names that follow the naming conventions of ~95% of all the columns, which we rely on programmatically when reshaping and concatenating these tables together.

drop_footnote_columns_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Drop DBF footnote reference columns, which all end with _f.

source_table_id(source_ferc1: Ferc1Source) str[source]#

Look up the ID of the raw data source table.

source_table_primary_key(source_ferc1: Ferc1Source) list[str][source]#

Look up the pre-renaming source table primary key columns.

renamed_table_primary_key(source_ferc1: Ferc1Source) list[str][source]#

Look up the post-renaming primary key columns.

drop_unused_original_columns_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Remove residual DBF specific columns.

assign_record_id(df: pandas.DataFrame, source_ferc1: Ferc1Source) pandas.DataFrame[source]#

Add a column identifying the original source record for each row.

It is often useful to be able to tell exactly which record in the FERC Form 1 database a given record within the PUDL database came from.

Within each FERC Form 1 DBF table, each record is supposed to be uniquely identified by the combination of: report_year, report_prd, utility_id_ferc1_dbf, spplmnt_num, row_number.

The FERC Form 1 XBRL tables do not have these supplement and row number columns, so we construct an id based on: report_year, utility_id_ferc1_xbrl, and the primary key columns of the XBRL table

Parameters:
  • df – table to assign record_id to

  • table_name – name of table

  • source_ferc1 – data source of raw ferc1 database.

Raises:
  • ValueError – If any of the primary key columns are missing from the DataFrame being processed.

  • ValueError – If there are any null values in the primary key columns.

  • ValueError – If the resulting record_id column is non-unique.

assign_utility_id_ferc1(df: pandas.DataFrame, source_ferc1: Ferc1Source) pandas.DataFrame[source]#

Assign the PUDL-assigned utility_id_ferc1 based on the native utility ID.

We need to replace the natively reported utility ID from each of the two FERC1 sources with a PUDL-assigned utilty. The mapping between the native ID’s and these PUDL-assigned ID’s can be accessed in the database tables utilities_dbf_ferc1 and utilities_xbrl_ferc1.

Parameters:
  • df – the input table with the native utilty ID column.

  • source_ferc1 – the

Returns:

an augemented version of the input df with a new column that replaces the natively reported utility ID with the PUDL-assigned utility ID.

class pudl.transform.ferc1.FuelFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

A table transformer specific to the fuel_ferc1 table.

The fuel_ferc1 table reports data about fuel consumed by large thermal power plants in the plants_steam_ferc1 table. Each record in the steam table is typically associated with several records in the fuel table, with each fuel record reporting data for a particular type of fuel consumed by that plant over the course of a year. The fuel table presents several challenges.

The type of fuel, which is part of the primary key for the table, is a freeform string with hundreds of different nonstandard values. These strings are categorized manually and converted to fuel_type_code_pudl. Some values cannot be categorized and are set to other. In other string categorizations we set the unidentifiable values to NA, but in this table the fuel type is part of the primary key and primary keys cannot contain NA values.

This simplified categorization occasionally results in records with duplicate primary keys. In those cases the records are aggregated into a single record if they have the same apparent physical units. If the fuel units are different, only the first record is retained.

Several columns have unspecified, inconsistent, fuel-type specific units of measure associated with them. In order for records to be comparable and aggregatable, we have to infer and standardize these units.

In the raw FERC Form 1 data there is a fuel_units column which describes the units of fuel delivered or consumed. Most commonly this is short tons for solid fuels (coal), thousands of cubic feet (Mcf) for gaseous fuels, and barrels (bbl) for liquid fuels. However, the fuel_units column is also a freeform string with hundreds of nonstandard values which we have to manually categorize, and many of the values do not map directly to the most commonly used units for fuel quantities. E.g. some solid fuel quantities are reported in pounds, or thousands of pounds, not tons; some liquid fuels are reported in gallons or thousands of gallons, not barrels; and some gaseous fuels are reported in cubic feet not thousands of cubic feet.

Two additional columns report fuel price per unit of heat content and fuel heat content per physical unit of fuel. The units of those columns are not explicitly reported, vary by fuel, and are inconsistent within individual fuel types.

We adopt standardized units and attempt to convert all reported values in the fuel table into those units. For physical fuel units we adopt those that are used by the EIA: short tons (tons) for solid fuels, barrels (bbl) for liquid fuels, and thousands of cubic feet (mcf) for gaseous fuels. For heat content per (physical) unit of fuel, we use millions of British thermal units (mmbtu). All fuel prices are converted to US dollars, while many are reported in cents.

Because the reported fuel price and heat content units are implicit, we have to infer them based on observed values. This is only possible because these quantities are ratios with well defined ranges of valid values. The common units that we observe and attempt to standardize include:

  • coal: primarily BTU/pound, but also MMBTU/ton and MMBTU/pound.

  • oil: primarily BTU/gallon.

  • gas: reported in a mix of MMBTU/cubic foot, and MMBTU/thousand cubic feet.

table_id :Ferc1TableId[source]#
transform_main(df: pandas.DataFrame) pandas.DataFrame[source]#

Table specific transforms for fuel_ferc1.

Parameters:

df – Pre-processed, concatenated XBRL and DBF data.

Returns:

A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.

process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame[source]#

Start with inherited method and do some fuel-specific processing.

We have to do most of the transformation before the DBF and XBRL data have been concatenated because the fuel type column is part of the primary key and it is extensively modified in the cleaning process.

process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame[source]#

Special pre-concat treatment of the fuel_ferc1 table.

We have to do most of the transformation before the DBF and XBRL data have been concatenated because the fuel type column is part of the primary key and it is extensively modified in the cleaning process. For the XBRL data, this means we can’t create a record ID until that fuel type value is clean. In addition, the categorization of fuel types results in a number of duplicate fuel records which need to be aggregated.

Parameters:
  • raw_xbrl_instant – Freshly extracted XBRL instant fact table.

  • raw_xbrl_duration – Freshly extracted XBRL duration fact table.

Returns:

Almost fully transformed XBRL data table, with instant and duration facts merged together.

standardize_physical_fuel_units(df: pandas.DataFrame) pandas.DataFrame[source]#

Convert reported fuel quantities to standard units depending on fuel type.

Use the categorized fuel type and reported fuel units to convert all fuel quantities to the following standard units, depending on whether the fuel is a solid, liquid, or gas. When a single fuel reports its quantity in fundamentally different units, convert based on typical values. E.g. 19.85 MMBTU per ton of coal, 1.037 Mcf per MMBTU of natural gas, 7.46 barrels per ton of oil.

  • solid fuels (coal and waste): short tons [ton]

  • liquid fuels (oil): barrels [bbl]

  • gaseous fuels (gas): thousands of cubic feet [mcf]

Columns to which these physical units apply:

  • fuel_consumed_units (tons, bbl, mcf)

  • fuel_cost_per_unit_burned (usd/ton, usd/bbl, usd/mcf)

  • fuel_cost_per_unit_delivered (usd/ton, usd/bbl, usd/mcf)

One remaining challenge in this standardization is that nuclear fuel is reported in both mass of Uranium and fuel heat content, and it’s unclear if there’s any reasonable typical conversion between these units, since available heat content depends on the degree of U235 enrichement, the type of reactor, and whether the fuel is just Uranium, or a mix of Uranium and Plutonium from decommissioned nuclear weapons. See:

https://world-nuclear.org/information-library/facts-and-figures/heat-values-of-various-fuels.aspx

aggregate_duplicate_fuel_types_xbrl(fuel_xbrl: pandas.DataFrame) pandas.DataFrame[source]#

Aggregate the fuel records having duplicate primary keys.

drop_total_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Drop rows that represent plant totals rather than individual fuels.

This is an imperfect, heuristic process. The rows we identify as probably representing totals rather than individual fuels:

  • have zero or null values in all of their numerical data columns

  • have no identifiable fuel type

  • have no identifiable fuel units

  • DO report a value for MMBTU / MWh (heat rate)

In the case of the fuel_ferc1 table, we drop any row where all the data columns are null AND there’s a non-null value in the fuel_mmbtu_per_mwh column, as it typically indicates a “total” row for a plant. We also require a null value for the fuel_units and an “other” value for the fuel type.

drop_invalid_rows(df: pandas.DataFrame, params: InvalidRows | None = None) pandas.DataFrame[source]#

Drop invalid rows from the fuel table.

This method both drops rows in which all required data columns are null (using the inherited parameterized method) and then also drops those rows we believe represent plant totals. See FuelFerc1TableTransformer.drop_total_rows().

class pudl.transform.ferc1.PlantsSteamFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

Transformer class for the plants_steam_ferc1 table.

table_id :Ferc1TableId[source]#
transform_main(df: pandas.DataFrame, transformed_fuel: pandas.DataFrame) pandas.DataFrame[source]#

Perform table transformations for the plants_steam_ferc1 table.

Note that this method has a non-standard call signature, since the plants_steam_ferc1 table depends on the fuel_ferc1 table.

Parameters:
  • df – The pre-processed steam plants table.

  • transformed_fuel – The fully transformed fuel_ferc1 table. This is required because fuel consumption information is used to help link steam plant records together across years using plants_steam_assign_plant_ids()

transform(raw_dbf: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame, transformed_fuel: pandas.DataFrame) pandas.DataFrame[source]#

Redfine the transform method to accommodate the use of transformed_fuel.

This is duplicating code from the parent class, but is necessary because the steam table needs the fuel table for its transform. Is there a better way to do this that doesn’t require cutting and pasting the whole method just to stick the extra dataframe input into transform_main()?

class pudl.transform.ferc1.PlantsHydroFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

A table transformer specific to the plants_hydro_ferc1 table.

table_id :Ferc1TableId[source]#
transform_main(df)[source]#

Add bespoke removal of duplicate record after standard transform_main.

targeted_drop_duplicates(df)[source]#

Targeted removal of known duplicate record.

There are two records in 2019 with a utility_id_ferc1 of 200 and a plant_name_ferc1 of “marmet”. The records are nearly duplicates of eachother, except one have nulls in the capex columns. Surgically remove the record with the nulls.

class pudl.transform.ferc1.PlantsPumpedStorageFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

Transformer class for plants_pumped_storage_ferc1 table.

table_id :Ferc1TableId[source]#
class pudl.transform.ferc1.PurchasedPowerTableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

Transformer class for purchased_power_ferc1 table.

This table has data about inter-utility power purchases into the PUDL DB. This includes how much electricty was purchased, how much it cost, and who it was purchased from. Unfortunately the field describing which other utility the power was being bought from is poorly standardized, making it difficult to correlate with other data. It will need to be categorized by hand or with some fuzzy matching eventually.

table_id :Ferc1TableId[source]#
class pudl.transform.ferc1.PlantInServiceFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

A transformer for the plant_in_service_ferc1 table.

table_id :Ferc1TableId[source]#
has_unique_record_ids :bool = False[source]#
align_row_numbers_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Align historical FERC1 DBF row numbers with XBRL account IDs.

process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame[source]#

Pre-processing required to make the instant and duration tables compatible.

Each year the plant account balances are reported twice, in two separate records: one for the end of the previous year, and one for the end of the current year, with appropriate dates for the two year ends. Here we are reshaping the table so that we instead have two columns: starting_balance and ending_balance that both pertain to the current year, so that all of the records pertaining to a single report_year can be identified without dealing with the instant / duration distinction.

wide_to_tidy_xbrl(df: pandas.DataFrame) pandas.DataFrame[source]#

Reshape wide tables with FERC account columns to tidy format.

The XBRL table coming into this method contains all the data from both the instant and duration tables in a wide format – with one column for every combination of value type (e.g. additions, ending_balance) and accounting category, which means ~500 columns.

We tidy this into a long table with one column for each of the value types (6 in all), and a new column that contains the accounting categories. This allows aggregation across columns to calculate the ending balance based on the starting balance and all of the reported changes, and aggregation across groups of rows to total up various hierarchical accounting categories (hydraulic turbines -> hydraulic production plant -> all production plant -> all electric utility plant) though the categorical columns required for that aggregation are added later.

normalize_metadata_xbrl(xbrl_fact_names: list[str] | None) pandas.DataFrame[source]#

Transform the metadata to reflect the transformed data.

The XBRL Taxonomy metadata as extracted pertains to the XBRL data as extracted. When we re-shape the data, we also need to adjust the metadata to be usable alongside the reshaped data. For the plant in service table, this means selecting metadata fields that pertain to the “stem” column name (not differentiating between starting/ending balance, retirements, additions, etc.)

We fill in some gaps in the metadata, e.g. for FERC accounts that have been split across multiple rows, or combined without being calculated. We also need to rename the XBRL metadata categories to conform to the same naming convention that we are using in the data itself (since FERC doesn’t quite follow their own naming conventions…). We use the same rename dictionary, but as an argument to pd.Series.replace() instead of pd.DataFrame.rename().

merge_metadata(df: pandas.DataFrame) pandas.DataFrame[source]#

Combine XBRL-derived metadata with the data it pertains to.

While the metadata we’re using to annotate the data comes from the more recent XBRL data, it applies generally to all the historical DBF data as well! This method reads the normalized metadata out of an attribute.

apply_sign_conventions(df) pandas.DataFrame[source]#

Adjust rows and column sign conventsion to enable aggregation by summing.

Columns have uniform sign conventions, which we have manually inferred from the original metadata. This can and probably should be done programmatically in the future. If not, we’ll probably want to store the column_weights as a parameter rather than hard-coding it in here.

targeted_drop_duplicates_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Drop bad duplicate records from a specific utility in 2018.

This is a very specific fix, meant to get rid of a particular observed set of duplicate records: FERC Respondent ID 187 in 2018 has two sets of plant in service records, one of which contains a bunch of null data.

This method is part of the DBF processing because we want to be able to hard-code a specific value of utility_id_ferc1_dbf and those IDs are no longer available later in the process. I think.

process_dbf(df: pandas.DataFrame) pandas.DataFrame[source]#

Drop targeted duplicates in the DBF data so we can use FERC respondent ID.

transform_main(df: pandas.DataFrame) pandas.DataFrame[source]#

The main table-specific transformations, affecting contents not structure.

Annotates and alters data based on information from the XBRL taxonomy metadata.

class pudl.transform.ferc1.PlantsSmallFerc1TableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, xbrl_metadata_json: list[dict[Any]] | None = None)[source]#

Bases: Ferc1AbstractTableTransformer

A table transformer specific to the plants_small_ferc1 table.

table_id :Ferc1TableId[source]#
transform_main(df: pandas.DataFrame) pandas.DataFrame[source]#

Table specific transforms for plants_small_ferc1.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.

extract_ferc1_license(df: pandas.DataFrame) pandas.DataFrame[source]#

Extract FERC license number from plant_name_ferc1.

Many FERC license numbers are embedded in the plant_name_ferc1 column, but not all numbers in the plant_name_ferc1 column are FERC licenses. Some are dates, dollar amounts, page numbers, or numbers of wind turbines. This function extracts valid FERC license numbers and puts them in a new column called license_id_ferc1.

Potential FERC license numbers are valid when:

  • Two or more integers were found.

  • The found integers were accompanied by key phrases such as: ["license", "no.", "ferc", "project"].

  • The accompanying name does not contain phrases such as: ["page", "pg", "$",  "wind", "units"].

  • The found integers don’t fall don’t fall within the range of a valid year, defined as: 1900-2050.

  • The plant record is categorized as hydro or not categorized via the plant_type and fuel_type columns.

This function also fills other fuel types with hydro for all plants with valid FERC licenses because only hydro plants have FERC licenses.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

The same input DataFrame but with a new column called license_id_ferc1 that contains FERC 1 license infromation extracted from plant_name_ferc1.

_find_possible_header_or_note_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Find and label rows that might be headers or notes.

Called by the coordinating function label_row_types().

This function creates a column called possible_header_or_note that is either True or False depending on whether a group of columns are all NA. Rows labeled as True will be further scrutinized in the _label_header_rows() and _label_note_rows() functions to determine whether they are actually headers or notes.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

The same input DataFrame but with a new column called possible_header_or_note that flags rows that might contain useful header or note information.

_find_note_clumps(group: pandas.core.groupby.DataFrameGroupBy) tuple[pandas.core.groupby.DataFrameGroupBy, pandas.DataFrame][source]#

Find groups of rows likely to be notes.

Once the _find_possible_header_or_note_rows() function identifies rows that are either headers or notes, we must deterine which one they are. As described in the _label_note_rows() function, notes rows are usually adjecent rows with no content.

This function itentifies instances of two or more adjecent rows where possible_header_or_note = True. It takes individual utility-year groups as a parameter as opposed to the entire dataset because adjecent rows are only meaningful if they are from the same reporting entity in the same year. If we were to run this on the whole dataframe, we would see “note clumps” that are actually notes from the end of one utility’s report and headers from the beginning of another. For this reason, we run this function from within the _label_note_rows_group() function.

The output of this function is not a modified version of the original utility-year group, rather, it is a DataFrame containing information about the nature of the possible_header_or_note = True rows that is used to determine if that row is a note or not. It also returns the original utility-year-group as groupby objects seperated by each time possible_header_or_note changes from True to False or vice versa.

If you pass in the following df:

plant_name_ferc1

possible_header_or_note

HYDRO:

True

rainbow falls (b)

False

cadyville (a)

False

keuka (c)

False

  1. project #2738

True

  1. project #2835

True

  1. project #2852

True

You will get the following output (in addition to the groupby objects for each clump):

header_or_note

rows_per_clump

True

1

False

3

True

3

This shows each clump of adjecent records where possible_header_or_note is True or False and how many records are in each clump.

Params:
group: A utility-year grouping of the concatenated FERC XBRL and DBF tables.

This table must have been run through the _find_possible_header_or_note_rows() function and contain the column possible_header_or_note.

Returns:

A tuple containing groupby objects for each of the note and non-note clumps and a DataFrame indicating the number of rows in each note or non-note clump.

_label_header_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Label header rows by adding header to row_type column.

Called by the coordinating function label_row_types().

Once possible header or notes rows have been identified via the _find_possible_header_or_note_rows() function, this function sorts out which ones are headers. It does this by identifying a list of strings that, when found in the plant_name_ferc1 column, indicate that the row is or is not a header.

Sometimes this function identifies a header that is acutally a note. For this reason, it’s important that the function be called before _label_note_rows() so that the bad header values get overridden by the note designation.

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.

Returns:

The same input DataFrame but with likely headers rows containing the string header in the row_type column.

_label_note_rows_group(util_year_group: pandas.core.groupby.DataFrameGroupBy) pandas.core.groupby.DataFrameGroupBy[source]#

Label note rows by adding note to row_type column.

Called within the wraper function _label_note_rows()

This function breaks the data down by reporting unit (utility and year) and determines whether a possible_header_note = True row is a note based on two criteria:

  • Clumps of 2 or more adjecent rows where possible_header_or_note is True.

  • Instances where the last row in a utility-year group has possible_header_or_note as True.

There are a couple of important exceptions that this function also addresses. Utilities often have multiple headers in a single utility-year grouping. You might see something like: pd.Series([header, plant1, plant2, note, header, plant3, plant4]). In this case, a note clump is actually comprised of a note followed by a header. This function will not override the header as a note. Unfortunately, there is always the possability that a header row is followed by a plant that had no values reported. This would look like, and therefore be categorized as a note clump. I haven’t built a work around, but hopefully there aren’t very many of these.

Params:

util_year_group: A groupby object that contains a single year and utility.

Returns:

The same input but with likely note rows containing the string note in the row_type column.

_label_note_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Wrapper for _label_note_rows_group().

The small plants table has lots of note rows that contain useful information. Unfortunately, the notes are in their own row rather than their own column! This means that useful information pertaining to plant rows is floating around as a junk row with no other information except the note in the plant_name_ferc1 field. Luckily, the data are reported just like they would be on paper. I.e., The headers are at the top, and the notes are at the bottom. See the table in label_row_types() for more detail. This function labels note rows.

Note rows are determined by row location within a given report, so we must break the data into reporting units (utility and year) and then apply note-finding methodology defined in _label_note_rows_group() to each group.

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.

Returns:

The same input DataFrame but with likely note rows containing the string note in the row_type column.

_label_total_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Label total rows by adding total to row_type column.

Called within the wraper function _label_note_rows()

For the most part, when plant_name_ferc1 contains the string total, the values therein are duplicates of what is already reported, i.e.: a total value. However, there are some cases where that’s not true. For example, the phrase amounts are for the total appears when chunks of plants (usually but not always wind) are reported together. It’s a total, but it’s not double counting which is the reason for the total flag.

Similar to _label_header_rows(), it’s important that this be called before _label_note_rows() in label_row_types() so that not clumps can override certain non-totals that are mistakenly labeled as such.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

The same input DataFrame but with likely total rows containing the string total in the row_type column.

label_row_types(df: pandas.DataFrame) pandas.DataFrame[source]#

Coordinate labeling of row_types as headers, notes, or totals.

The small plants table is more like a digitized PDF than an actual data table. The rows contain all sorts of information in addition to what the columns might suggest. For instance, there are header rows, note rows, and total rows that contain useful information, but cause confusion in their current state, mixed in with the rest of the data.

Here’s an example of what you might find in the small plants table:

plant_name_ferc1

plant_type

capacity_mw

HYDRO:

NA

NA

rainbow falls (b)

NA

30

cadyville (a)

NA

100

keuka (c)

NA

80

total plants

NA

310

  1. project #2738

NA

NA

  1. project #2835

NA

NA

  1. project #2852

NA

NA

Notice how misleading it is to have all this infomration in one column. The goal of this function is to coordinate labeling functions so that we can identify which rows contain specific plant information and which rows are headers, notes, or totals.

Once labeled, other functions can either remove rows that might cause double counting, extract useful plant or fuel type information from headers, and extract useful context or license id information from notes.

Coordinates _label_header_rows(), _label_total_rows(), _label_note_rows().

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.

Returns:

The same input DataFrame but with a column called row_type containg the strings header, note, total, or NA to indicate what type of row it is.

prep_header_fuel_and_plant_types(df: pandas.DataFrame, show_unmapped_headers=False) pandas.DataFrame[source]#

Forward fill header rows to prep for fuel and plant type extraction.

The headers we’ve identified in _label_header_rows() can be used to supplement the values in the plant_type and fuel_type columns.

This function groups the data by utility, year, and header; extracts the header into a new column; and forward fills the headers so that each record in the header group is associated with that header. Because the headers map to different fuel types and plant types (ex: solar pv maps to fuel type solar and plant type photovoltaic), the new forward-filled header column is duplicated and called fuel_type_from_header and plant_type_from_header. In map_header_fuel_and_plant_types(), these columns will be mapped to their respective fuel and plant types, used to fill in blank values in the plant_type and fuel_type, and then eventually removed.

Why separate the prep step from the map step?

We trust the values originally reported in the fuel_type and plant_type columns more than the extracted and forward filled header values, so we only want to replace fuel_type and plant_type values that are labeled as pd.NA or other. The values reported to those columns are extremely messy and must be cleaned via pudl.transform.classes.categorize_strings() in order for us to know which are truely pd.NA or other. Because we also use pudl.transform.classes.categorize_strings() to map the headers to fuel and plant types, it makes sense to clean all four columns at once and then combine them.

Here’s a look at what this function does. It starts with the following table:

plant_name_ferc1

plant_type

fuel_type

row_type

HYDRO:

NA

NA

header

rainbow falls (b)

NA

NA

NA

cadyville (a)

NA

NA

NA

keuka (c)

NA

NA

NA

Wind Turbines:

NA

NA

header

sunny grove

NA

NA

NA

green park wind

NA

wind

NA

And ends with this:

plant_name_ferc1

plant _type

fuel _type

plant_type _from_header

fuel_type _from_header

HYDRO:

NA

NA

HYDRO:

HYDRO:

rainbow falls (b)

NA

NA

HYDRO:

HYDRO:

cadyville (a)

NA

NA

HYDRO:

HYDRO:

keuka (c)

NA

NA

HYDRO:

HYDRO:

Wind Turbines:

NA

NA

Wind Turbines:

Wind Turbines:

sunny grove

NA

NA

Wind Turbines:

Wind Turbines:

green park wind

NA

wind

Wind Turbines:

Wind Turbines:

NOTE: If a utility’s plant_name_ferc1 values look like this: ["STEAM", "coal_plant1", "coal_plant2", "wind_turbine1"], then this algorythem will think that last wind turbine is a steam plant. Luckily, when a utility embeds headers in the data it usually includes them for all plant types: ["STEAM", "coal_plant1", "coal_plant2", "WIND", "wind_turbine"].

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through _label_row_type() and contains the columns row_type.

Returns:

The same input DataFrame but with new columns plant_type_from_header and fuel_type_from_header that forward fill the values in the header rows by utility, year, and header group.

map_header_fuel_and_plant_types(df: pandas.DataFrame) pandas.DataFrame[source]#

Fill pd.NA and other plant and fuel types with cleaned headers.

prep_header_fuel_and_plant_types() extracted and forward filled the header values; pudl.transform.params.categorize_strings() cleaned them according to both the fuel and plant type parameters. This function combines the fuel_type_from_header with fuel_type and plant_type_from_header with plant_type when the reported, cleaned values are pd.NA or other.

To understand more about why these steps are necessary read the docstrings for prep_header_fuel_and_plant_types().

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through prep_header_fuel_and_plant_types() and contains the columns fuel_type_from_header and plant_type_from_header.

Returns:

The same input DataFrame but with rows with pd.NA or other in the fuel_type and plant_type columns filled in with the respective values from fuel_type_from_header and plant_type_from_header when available. fuel_type_from_header and plant_type_from_header columns removed.

map_plant_name_fuel_types(df: pandas.DataFrame) pandas.DataFrame[source]#

Suppliment fuel_type with information in plant_name_ferc1.

Sometimes fuel type is embedded in a plant name (not just headers). In this case we can identify that what that fuel is from the name and fill in empty fuel_type values. Right now, this only works for hydro plants because the rest are complicated and have a slew of exceptions. This could probably be applied to the plant_type column in the future too.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

The same input DataFrame but with rows with other in the fuel_type column filled in notable fuel types extracted from the the plant_name_ferc1 column.

associate_notes_with_values(df: pandas.DataFrame) pandas.DataFrame[source]#

Use footnote indicators to map notes and FERC licenses to plant rows.

There are many utilities that report a bunch of mostly empty note rows at the bottom of their yearly entry. These notes often pertain to specific plant rows above. Sometimes the notes and their respective plant rows are linked by a common footnote indicator such as (a) or (1) etc.

This function takes this:

plant_name_ferc1

row_type

license_id_ferc1

HYDRO:

header

NA

rainbow falls (b)

NA

NA

cadyville (a)

NA

NA

keuka (c)

NA

NA

total plants

total

NA

  1. project #2738

note

2738

  1. project #2835

note

2738

  1. project #2852

note

2738

Finds the note rows with footnote indicators, maps the content from the note row into a new note column that’s associated with the value row, and maps any FERC license extracted from this note column to the license_id_ferc1 column in the value row.

plant_name_ferc1

row_type

notes

license_id_ferc1

HYDRO:

header

NA

NA

rainbow falls (b)

NA

  1. project #2835

2835

cadyville (a)

NA

  1. project #2738

2738

keuka (c)

NA

  1. project #2852

2752

total plants

total

NA

NA

  1. project #2738

note

NA

2738

  1. project #2835

note

NA

2835

  1. project #2852

note

NA

2752

(Header and note rows are removed later).

NOTE: Note rows that don’t have a footnote indicator or note rows with a footnote indicator that don’t have a cooresponding plant row with the same indicator are not captured. They will ultimately get removed and their content will not be preserved.

Params:

df: Pre-processed, concatenated XBRL and DBF data that has been run through label_row_types() and contains the column row_type.

Returns:

The same input DataFrame but with a column called notes that contains notes, reported below, in the same row as the plant values they pertain to. Also, any further additions to the license_id_ferc1 field as extracted from these newly associated notes.

spot_fix_rows(df: pandas.DataFrame) pandas.DataFrame[source]#

Fix one-off row errors.

In 2004, utility_id_ferc1 251 reports clumps of units together. Each unit clump looks something like this: intrepid wind farm (107 units @ 1.5 mw each) and is followed by a row that looks like this: (amounts are for the total of all 107 units). For the most part, these rows are useless note rows. However, there is one instance where important values are reported in this note row rather than in the actual plant row above.

There are probably plenty of other spot fixes one could add here.

Params:

df: Pre-processed, concatenated XBRL and DBF data.

Returns:

The same input DataFrame but with some spot fixes corrected.

pudl.transform.ferc1.transform(ferc1_dbf_raw_dfs: dict[str, pandas.DataFrame], ferc1_xbrl_raw_dfs: dict[str, dict[str, pandas.DataFrame]], xbrl_metadata_json: list[dict[Any]], ferc1_settings: Ferc1Settings | None = None) dict[str, pandas.DataFrame][source]#

Coordinate the transformation of all FERC Form 1 tables.

Parameters:
  • ferc1_dbf_raw_dfs – Dictionary mapping PUDL table names (keys) to raw DBF dataframes (values).

  • ferc1_xbrl_raw_dfs – Nested dictionary containing both an instant and duration table for each input XBRL table. Some of these are empty.

  • xbrl_metadata_json – FERC 1XBRL taxonomy metadata exported as an array of JSON objects.

  • ferc1_settings – Validated FERC 1 ETL settings.

Returns:

A dictionary of transformed DataFrames.

pudl.transform.ferc1.ferc1_settings[source]#