pudl.transform.ferc1¶
Classes & functions to process FERC Form 1 data before loading into the PUDL DB.
Note that many of the classes/objects here inherit from, or are instances of, classes defined in pudl.transform.classes. Their design and relationships to each other are documented in that module.
See pudl.transform.params.ferc1 for the values that parameterize many of these transformations.
Classes¶
Enumeration of allowed FERC 1 raw data sources.
Enumeration of the allowed FERC 1 table IDs.
Dictionaries for renaming either XBRL or DBF derived FERC 1 columns.
Parameters for converting a wide table to a tidy table with value types.
Parameters for converting either or both XBRL and DBF tables from wide to tidy.
Parameters for merging in XBRL metadata.
Parameter for dropping duplicate DBF rows.
Parameters for aligning DBF row numbers with metadata from manual maps.
Parameters for select_dbf_rows_by_category().
Parameters for unstack_balances_to_report_year_instant_xbrl().
Parameters for combine_axis_columns_xbrl().
Parameters for transferring quarterly reported data to annual columns.
Parameters for adding a column to a table with a single value.
Parameters for adding columns to a table with a single value.
Info for testing a particular check.
Calc params organized by check type.
Tolerances for all data checks to be performed within a grouped df.
Data quality expectations related to FERC 1 calculations.
Input for checking calculations organized by group and test.
Parameters for reconciling XBRL-metadata-based calculations within a table.
Base class for checking a particular metric within a group.
Check error frequency in XBRL calculations.
Check relative magnitude of errors in XBRL calculations.
Check absolute magnitude of errors in XBRL calculations.
Check the frequency of null calculated values.
Check the frequency of null reported values.
A model defining what TransformParams are allowed for FERC Form 1.
An abstract class defining methods common to many FERC Form 1 tables.
A table transformer specific to the core_ferc1__yearly_steam_plants_fuel_sched402 table.
Transformer class for the core_ferc1__yearly_steam_plants_sched402 table.
A table transformer specific to the core_ferc1__yearly_hydroelectric_plants_sched406 table.
Transformer class for the core_ferc1__yearly_pumped_storage_plants_sched408 table.
Transformer class for the core_ferc1__yearly_purchased_power_and_exchanges_sched326 table.
A transformer for the core_ferc1__yearly_plant_in_service_sched204 table.
A table transformer specific to the core_ferc1__yearly_small_plants_sched410 table.
A table transformer for the core_ferc1__yearly_transmission_lines_sched422 table.
Transformer class for the core_ferc1__yearly_energy_sources_sched401 table.
Transformer class for the core_ferc1__yearly_energy_dispositions_sched401 table.
Transformer class for the core_ferc1__yearly_utility_plant_summary_sched200 table.
Transformer class for the core_ferc1__yearly_balance_sheet_liabilities_sched110 table.
Transformer class for the core_ferc1__yearly_balance_sheet_assets_sched110 table.
Transformer class for the core_ferc1__yearly_income_statements_sched114 table.
Transformer class for the core_ferc1__yearly_retained_earnings_sched118 table.
Transformer class for the core_ferc1__yearly_depreciation_summary_sched336 table.
Transformer class for the core_ferc1__yearly_depreciation_changes_sched219 table.
Transformer for the core_ferc1__yearly_depreciation_by_function_sched219 table.
Transformer class for the core_ferc1__yearly_operating_expenses_sched320 table.
Transformer class for the core_ferc1__yearly_operating_revenues_sched300 table.
Transform class for the core_ferc1__yearly_cash_flows_sched120 table.
Transform class for the core_ferc1__yearly_sales_by_rate_schedules_sched304 table.
Transformer class for the core_ferc1__yearly_other_regulatory_liabilities_sched278 table.
Functions¶
Generate cleaned JSON XBRL metadata.
Add a source_tables field into metadata calculation components.
Reshape wide tables with FERC account columns to tidy format.
Merge metadata based on params.
Drop duplicate DBF rows if duplicates have identical data or one row has nulls.
Rename the xbrl_factoid column after align_row_numbers_dbf().
Select DBF rows with values listed or found in XBRL in a categorical-like column.
Turn start-year and end-year rows into columns for each value type.
Combine axis columns from squished XBRL tables into one column with no NAs.
Transfer 4th quarter reported data to the annual columns.
Add a column to a table with a single value.
Ensure intra-table calculated values match reported values within a tolerance.
Calculate values, run metric checks, and add corrections.
Add total to subdimension calculations into calculation components.
Add all observed subdimensions into the calculation components.
Apply calculations derived from XBRL metadata to reported XBRL data.
Tabulate the results of the calculation checks by group.
Run the calculation metrics and determine if calculations are within tolerance.
Add corrections to discrepancies between reported & calculated values.
Select for annual records within their report_year.
Identify DBF rows that need to be mapped to XBRL columns.
Regenerate the FERC 1 DBF+XBRL glue while retaining existing mappings.
Read the manually compiled DBF row to XBRL column mapping for a given table.
Forward-fill missing years in the minimal, manually compiled DBF to XBRL mapping.
Get a list of all XBRL data columns appearing in a given XBRL table.
Read in the table of calculation fixes.
Create an asset that pulls in raw FERC Form 1 assets and applies transformations.
Create a list of transformed FERC Form 1 assets.
Get a list of the other dimension columns across all of the transformers.
Build a dictionary of table name (keys) to …
Build a dictionary of table name (keys) to column_to_check from reconcile_table_calculations.
Build a table of values of dimensions observed in the transformed data tables.
Build a table of all of the tables' XBRL metadata.
Create calculation-component table from table-level metadata.
Find unexpected components in within-fact total calculations.
Check for duplicate calculation records.
Fill in null dimensions w/ the values observed in …
Add dimensions to calculation parents.
Define dimension total calculations.
Add records into the calculation components table.
Check calculation metrics for all transformed tables which have reconciled calcs.
Module Contents¶
- pudl.transform.ferc1._core_ferc1_xbrl__metadata_json(raw_ferc1_xbrl__metadata_json: dict[str, dict[str, list[dict[str, Any]]]]) dict[str, dict[str, list[dict[str, Any]]]] [source]¶
Generate cleaned JSON XBRL metadata.
For now, this only runs add_source_tables_to_xbrl_metadata().
- pudl.transform.ferc1.add_source_tables_to_xbrl_metadata(raw_ferc1_xbrl__metadata_json: dict[str, dict[str, list[dict[str, Any]]]]) dict[str, dict[str, list[dict[str, Any]]]] [source]¶
Add a source_tables field into metadata calculation components.
When a particular component of a calculation does not originate from the table in which the calculated field is being reported, label the source table.
- class pudl.transform.ferc1.SourceFerc1(*args, **kwds)[source]¶
Bases:
enum.Enum
Enumeration of allowed FERC 1 raw data sources.
- class pudl.transform.ferc1.TableIdFerc1(*args, **kwds)[source]¶
Bases:
enum.Enum
Enumeration of the allowed FERC 1 table IDs.
Hard coding this doesn’t seem ideal. Somehow it should be either defined in the context of the Package, the Ferc1Settings, an etl_group, or DataSource. All of the table transformers associated with a given data source should have a table_id that’s from that data source’s subset of the database. Where should this really happen? Alternatively, the allowable values could be derived from the structure of the Package. But this works for now.
- class pudl.transform.ferc1.RenameColumnsFerc1(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Dictionaries for renaming either XBRL or DBF derived FERC 1 columns.
This is FERC 1 specific, because we need to store both DBF and XBRL rename dictionaries separately. Note that this parameter model does not have its own unique transform function. Like the generic pudl.transform.classes.RenameColumns, it depends on the built-in pd.rename() method, which is called with either the DBF or XBRL parameters depending on the context.
Potential parameter validations that could be implemented:
Validate that all keys appear in the original DBF/XBRL sources. This has to be true, but right now we don't have stored metadata enumerating all of the columns that exist in the raw data, so we don't have anything to check against. Implement once we have schemas defined for the post-extract step.
Validate that all values appear in PUDL tables, and that all expected PUDL names are mapped. Actually, we can't require that the rename values appear in the PUDL tables, because there will be cases in which the original column gets dropped or modified, e.g. in the case of unit conversions with a column rename.
- duration_xbrl: pudl.transform.classes.RenameColumns[source]¶
- instant_xbrl: pudl.transform.classes.RenameColumns[source]¶
- class pudl.transform.ferc1.WideToTidy(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for converting a wide table to a tidy table with value types.
- stacked_column_name: str | None = None[source]¶
Name of column that will contain the stacked categories.
- value_types: list[str] | None = None[source]¶
List of names of value types that will end up being the column names.
Some of the FERC tables have multiple data types spread across many different categories. In the input dataframe given to wide_to_tidy(), the value types must be the suffixes of the column names. If the table does not natively have the pattern of "{to-be-stacked category}_{value_type}", rename the columns using a rename_columns.duration_xbrl, rename_columns.instant_xbrl, or rename_columns.dbf parameter, which will be employed in process_duration_xbrl(), process_instant_xbrl(), or process_dbf().
- expected_drop_cols: int = 0[source]¶
The number of columns that are expected to be dropped.
wide_to_tidy_xbrl() will generate a regex pattern assuming the value_types are the column names' suffixes. If a column does not conform to that pattern, it will be filtered out. This keeps columns from the input dataframe from being incorrectly included in the stacking process. We could enumerate every column that we want to drop, but this could be tedious and potentially error prone. It does mean, however, that a column which is incorrectly named, or which barely misses the pattern, will be dropped. This parameter enables us to lock in the number of expected dropped columns; if a different number of columns is dropped, an error will be raised.
- class pudl.transform.ferc1.WideToTidySourceFerc1(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for converting either or both XBRL and DBF table from wide to tidy.
- xbrl: WideToTidy | list[WideToTidy][source]¶
- dbf: WideToTidy | list[WideToTidy][source]¶
- pudl.transform.ferc1.wide_to_tidy(df: pandas.DataFrame, params: WideToTidy) pandas.DataFrame [source]¶
Reshape wide tables with FERC account columns to tidy format.
The XBRL table coming into this method could contain all the data from both the instant and duration tables in a wide format – with one column for every combination of value type (e.g. additions, ending_balance) and value category, which means ~500 columns for some tables.
We tidy this into a long table with one column for each of the value types in params.value_types and a new column named xbrl_factoid that contains categories that were previously the XBRL column name stems.
This allows aggregations of multiple xbrl_factoid categories in a columnar fashion, such as aggregation across groups of rows to total up various hierarchical accounting categories (hydraulic turbines -> hydraulic production plant -> all production plant -> all electric utility plant), though the categorical columns required for that aggregation are added later.
For tables that have an internal relationship between the values in params.value_types, such as the core_ferc1__yearly_plant_in_service_sched204 table, this also enables aggregation across columns to calculate the ending balance based on the starting balance and all of the reported changes.
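As a rough illustration of this reshaping (not the actual implementation), here is a minimal pandas sketch; the index columns and value-type suffixes are hypothetical:

```python
import re
import pandas as pd

def wide_to_tidy_sketch(
    df: pd.DataFrame, value_types: list[str], idx_cols: list[str]
) -> pd.DataFrame:
    """Stack "{category}_{value_type}" columns into one row per category."""
    pat = re.compile(rf"^(?P<stem>.+)_(?P<vtype>{'|'.join(value_types)})$")
    stems: dict[str, dict[str, pd.Series]] = {}
    for col in df.columns.difference(idx_cols):
        if m := pat.match(col):  # columns not matching the suffix pattern are dropped
            stems.setdefault(m["stem"], {})[m["vtype"]] = df[col]
    # One block of rows per category stem; the value types become the data columns.
    tidy = pd.concat(
        {stem: df[idx_cols].assign(**vals) for stem, vals in stems.items()},
        names=["xbrl_factoid"],
    )
    return tidy.reset_index(level="xbrl_factoid").reset_index(drop=True)
```

Under these assumptions, a frame with columns like land_additions and land_ending_balance would yield rows with xbrl_factoid == "land" and data columns additions and ending_balance.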
- class pudl.transform.ferc1.MergeXbrlMetadata(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for merging in XBRL metadata.
- rename_columns: dict[str, str][source]¶
Dictionary to rename columns in the normalized metadata before merging.
This dictionary will be passed as the columns parameter of pd.DataFrame.rename().
- on: str | None = None[source]¶
Column name to merge on in merge_xbrl_metadata().
- pudl.transform.ferc1.merge_xbrl_metadata(df: pandas.DataFrame, xbrl_metadata: pandas.DataFrame, params: MergeXbrlMetadata) pandas.DataFrame [source]¶
Merge metadata based on params.
- class pudl.transform.ferc1.DropDuplicateRowsDbf(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameter for dropping duplicate DBF rows.
- table_name: TableIdFerc1 | None = None[source]¶
Name of table used to grab primary keys of PUDL table to check for duplicates.
- pudl.transform.ferc1.drop_duplicate_rows_dbf(df: pandas.DataFrame, params: DropDuplicateRowsDbf, return_dupes_w_unique_data: bool = False) pandas.DataFrame [source]¶
Drop duplicate DBF rows if duplicates have identical data or one row has nulls.
There are several instances of the DBF data reporting the same value on multiple rows. This function checks whether all of the duplicate records that share the same primary keys have reported the same data, or whether one record has null data in some of the data columns while the other record has complete data. If the duplicates have no unique data, the duplicates are dropped with keep="first". If any duplicates neither contain the same data nor form a null/complete pair, an assertion will be raised.
- Parameters:
df – DBF table containing PUDL primary key columns
params – an instance of DropDuplicateRowsDbf
return_dupes_w_unique_data – Boolean flag used for debugging only, which returns the duplicates that contain genuinely unique data instead of raising an assertion. Default is False.
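A simplified sketch of the null-tolerant deduplication described above, with hypothetical primary key and data column arguments:

```python
import pandas as pd

def drop_dupes_sketch(
    df: pd.DataFrame, pk_cols: list[str], data_cols: list[str]
) -> pd.DataFrame:
    """Drop PK duplicates that agree on data or differ only by nulls."""
    filled = df.copy()
    # Fill nulls within each PK group so a null/complete pair collapses together.
    filled[data_cols] = filled.groupby(pk_cols)[data_cols].transform(
        lambda s: s.ffill().bfill()
    )
    # Any PK group still carrying distinct data values is genuinely conflicting.
    if filled.drop_duplicates(subset=pk_cols + data_cols).duplicated(subset=pk_cols).any():
        raise AssertionError("Duplicate primary keys with conflicting data.")
    return filled.drop_duplicates(subset=pk_cols, keep="first")
```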
- class pudl.transform.ferc1.AlignRowNumbersDbf(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for aligning DBF row numbers with metadata from manual maps.
- pudl.transform.ferc1.align_row_numbers_dbf(df: pandas.DataFrame, params: AlignRowNumbersDbf) pandas.DataFrame [source]¶
Rename the xbrl_factoid column after align_row_numbers_dbf().
- class pudl.transform.ferc1.SelectDbfRowsByCategory(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for select_dbf_rows_by_category().
- select_by_xbrl_categories: bool = False[source]¶
Boolean flag to indicate whether or not to use the categories in the XBRL table.
If True, select_dbf_rows_by_category() will find the list of categories that exist in the passed-in processed_xbrl to select by.
- additional_categories: list[str] = [][source]¶
List of additional categories to select by.
If select_by_xbrl_categories is True, these categories will be added to the XBRL categories and both will be used to select rows from the DBF data. If select_by_xbrl_categories is False, only the "additional" categories will be used to select rows from the DBF data.
- len_expected_categories_to_drop: int = 0[source]¶
Number of categories that are expected to be dropped from the DBF data.
This is here to ensure no unexpected manipulations to the categories have occurred. A warning will be flagged if this number differs from the number of categories actually being dropped.
- pudl.transform.ferc1.select_dbf_rows_by_category(processed_dbf: pandas.DataFrame, processed_xbrl: pandas.DataFrame, params: SelectDbfRowsByCategory) pandas.DataFrame [source]¶
Select DBF rows with values listed or found in XBRL in a categorical-like column.
The XBRL data often breaks out sub-sections of DBF tables into their own table. These breakout tables are often messy, unstructured portions of a particular schedule or page on the FERC1 PDF. We often want to preserve some of the ways the XBRL data is segmented so we need to be able to select only portions of the DBF table to be concatenated with the XBRL data.
In mapping DBF data to XBRL data for the tables that rely on their row_number, we map each row to its corresponding xbrl_factoid. The standard use of this transformer is to use the column_name that corresponds to the xbrl_factoid that was merged into the DBF data via align_row_numbers_dbf() and was converted into a column in the XBRL data via wide_to_tidy().
Note: Often, the unstructured portion of the DBF table that (possibly) sums up into a single value in structured data has the same xbrl_factoid name in the XBRL tables. By convention, we are employing a pattern in the dbf_to_xbrl.csv map that involves adding an _unstructed suffix to the rows that correspond to the unstructured portion of the table. This enables a simple selection of the structured part of the table. When processing the unstructured table, you can either rename the XBRL data's factoid name to include an _unstructed suffix, or you can specify the categories with _unstructed suffixes using the additional_categories parameter. A sketch of the selection appears below.
- class pudl.transform.ferc1.UnstackBalancesToReportYearInstantXbrl(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for unstack_balances_to_report_year_instant_xbrl().
- pudl.transform.ferc1.unstack_balances_to_report_year_instant_xbrl(df: pandas.DataFrame, params: UnstackBalancesToReportYearInstantXbrl, primary_key_cols: list[str]) pandas.DataFrame [source]¶
Turn start-year and end-year rows into columns for each value type.
Called in Ferc1AbstractTableTransformer.process_instant_xbrl().
Some instant tables report year-end data, with their datestamps in different years, but we want year-start and year-end data within a single report_year (which is equivalent) stored in two separate columns for compatibility with the DBF data.
This function unstacks that table and adds the suffixes _starting_balance and _ending_balance to each of the columns. These can then be used as value_types in wide_to_tidy() to normalize the table.
There are two checks in place:
First, it will make sure that there are no duplicate entries for a single year plus the other primary key fields. E.g. a row for 2020-12-31 and another for 2020-06-30 for entity_id X means that the data isn't annually unique. We could just drop these mid-year values, but we might want to keep them, or at least check that there is no funny business with the data.
We also check that there are no mid-year dates at all. If an entity reports a value from the middle of the year, we can’t identify it as a start/end of year value.
- Params:
primary_key_cols: The columns that should be used to check for duplicated data, and also for unstacking the balance – these are set to be the index before unstack is called. These are typically set by the wrapping method and generated automatically based on other class transformation parameters via Ferc1AbstractTableTransformer.source_table_primary_key().
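A minimal sketch of the unstacking, assuming a date column of year-end timestamps and a report_year column already derived from it:

```python
import pandas as pd

def unstack_balances_sketch(df: pd.DataFrame, primary_key_cols: list[str]) -> pd.DataFrame:
    """Pivot year-end instant rows into *_starting_balance/*_ending_balance columns."""
    if not pd.to_datetime(df["date"]).dt.is_year_end.all():
        raise AssertionError("Mid-year dates can't be labeled start/end of year balances.")
    # A year-end value is the ending balance of its own report_year and the
    # starting balance of the following report_year.
    ending = df.assign(balance_type="ending_balance")
    starting = df.assign(report_year=df["report_year"] + 1, balance_type="starting_balance")
    both = pd.concat([ending, starting]).drop(columns="date")
    # Duplicate primary key entries (e.g. two dates in one year) make unstack fail.
    wide = both.set_index(primary_key_cols + ["balance_type"]).unstack("balance_type")
    wide.columns = [f"{col}_{balance}" for col, balance in wide.columns]
    return wide.reset_index()
```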
- class pudl.transform.ferc1.CombineAxisColumnsXbrl(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for combine_axis_columns_xbrl().
- pudl.transform.ferc1.combine_axis_columns_xbrl(df: pandas.DataFrame, params: CombineAxisColumnsXbrl) pandas.DataFrame [source]¶
Combine axis columns from squished XBRL tables into one column with no NAs.
Called in Ferc1AbstractTableTransformer.process_xbrl().
There are instances (ex: sales_by_rate_schedule_ferc1) where the DBF table is equal to several concatenated XBRL tables. These XBRL tables are extracted together with the function extract_xbrl_concat(). Once combined, we need to deal with their axis columns.
We use the axis columns (the primary key for the raw XBRL tables) in the creation of record_id values for each of the rows. If each of the concatenated XBRL tables has the same axis column name, then there's no need to fret. However, if the columns have slightly different names (ex: residential_sales_axis vs. industrial_sales_axis), we'll need to combine them. We combine them to get rid of NA values, which aren't allowed in primary keys. Otherwise it would look like this:

residential_sales_axis  industrial_sales_axis
value1                  NA
value2                  NA
NA                      valueA
NA                      valueB

vs. this:

sales_axis
value1
value2
valueA
valueB
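The combination itself can be as simple as this sketch (the axis column names are hypothetical):

```python
import pandas as pd

def combine_axis_sketch(df: pd.DataFrame, axis_cols: list[str], new_axis: str) -> pd.DataFrame:
    """Collapse mutually exclusive *_axis columns into a single NA-free column."""
    # Each row should carry a value in at most one of the axis columns.
    assert (df[axis_cols].notna().sum(axis=1) <= 1).all()
    combined = df[axis_cols].bfill(axis=1).iloc[:, 0]  # first non-null across columns
    return df.assign(**{new_axis: combined}).drop(columns=axis_cols)
```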
- class pudl.transform.ferc1.AssignQuarterlyDataToYearlyDbf(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for transferring quarterly reported data to annual columns.
- pudl.transform.ferc1.assign_quarterly_data_to_yearly_dbf(df: pandas.DataFrame, params: AssignQuarterlyDataToYearlyDbf) pandas.DataFrame [source]¶
Transfer 4th quarter reported data to the annual columns.
For some reason, the DBF data for this table reported all of the balance data as quarterly data between specific years. We already select the end-of-year records in select_annual_rows_dbf(), which ensures that by this point any quarterly data remaining in the input dataframe pertains to the 4th quarter.
- class pudl.transform.ferc1.AddColumnWithUniformValue(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for adding a column to a table with a single value.
- class pudl.transform.ferc1.AddColumnsWithUniformValues(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for adding columns to a table with a single value.
- columns_to_add: dict[str, AddColumnWithUniformValue][source]¶
Dictionary of column names (keys) with AddColumnWithUniformValue (values).
- pudl.transform.ferc1.add_columns_with_uniform_values(df: pandas.DataFrame, params: AddColumnsWithUniformValues) pandas.DataFrame [source]¶
Add a column to a table with a single value.
- class pudl.transform.ferc1.IsCloseTolerance(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Info for testing a particular check.
- class pudl.transform.ferc1.CalculationIsCloseTolerance(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Calc params organized by check type.
- error_frequency: IsCloseTolerance[source]¶
- relative_error_magnitude: IsCloseTolerance[source]¶
- null_calculated_value_frequency: IsCloseTolerance[source]¶
- absolute_error_magnitude: IsCloseTolerance[source]¶
- null_reported_value_frequency: IsCloseTolerance[source]¶
- class pudl.transform.ferc1.MetricTolerances(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Tolerances for all data checks to be performed within a grouped df.
- class pudl.transform.ferc1.GroupMetricTolerances(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Data quality expectations related to FERC 1 calculations.
We are doing a lot of comparisons between calculated and reported values to identify reporting errors in the data, errors in FERC’s metadata, and bugs in our own code. This class provides a structure for encoding our expectations about the level of acceptable (or at least expected) errors, and allows us to pass them around.
In the future we might also want to specify much more granular expectations, pertaining to individual tables, years, utilities, or facts, to ensure that a low overall error rate isn't masking a problem with the way the data or metadata is reported in a particular year. We could also define per-filing and per-table error tolerances to help us identify individual utilities that have e.g. used an outdated version of Form 1 when filing.
- ungrouped: MetricTolerances[source]¶
- xbrl_factoid: MetricTolerances[source]¶
- utility_id_ferc1: MetricTolerances[source]¶
- report_year: MetricTolerances[source]¶
- table_name: MetricTolerances[source]¶
- class pudl.transform.ferc1.GroupMetricChecks(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Input for checking calculations organized by group and test.
- groups_to_check: list[Literal['ungrouped', 'table_name', 'xbrl_factoid', 'utility_id_ferc1', 'report_year']] = ['ungrouped', 'report_year', 'xbrl_factoid', 'utility_id_ferc1'][source]¶
- metrics_to_check: list[str] = ['error_frequency', 'relative_error_magnitude', 'null_calculated_value_frequency',...[source]¶
- group_metric_tolerances: GroupMetricTolerances[source]¶
- is_close_tolerance: CalculationIsCloseTolerance[source]¶
- class pudl.transform.ferc1.ReconcileTableCalculations(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TransformParams
Parameters for reconciling XBRL-metadata-based calculations within a table.
- column_to_check: str | None = None[source]¶
Name of data column to check.
This will typically be the dollar_value or ending_balance column for the income statement and balance sheet tables.
- group_metric_checks: GroupMetricChecks[source]¶
Fraction of calculated values which we allow not to match reported values.
- subdimension_column: str | None = None[source]¶
Sub-dimension column name (e.g. utility type) to compare calculations against in reconcile_table_calculations().
- pudl.transform.ferc1.reconcile_table_calculations(df: pandas.DataFrame, calculation_components: pandas.DataFrame, xbrl_metadata: pandas.DataFrame, xbrl_factoid_name: str, table_name: str, params: ReconcileTableCalculations) pandas.DataFrame [source]¶
Ensure intra-table calculated values match reported values within a tolerance.
In addition to checking whether all reported "calculated" values match the output of our repaired calculations, this function adds a correction record to the dataframe, which is included in the calculations so that after the fact the calculations match exactly. This is only done when the fraction of records that don't match within the tolerances of numpy.isclose() is below a set threshold.
Note that only calculations which are off by a significant amount result in the creation of a correction record. Many calculations are off from the reported values by exactly one dollar, presumably due to rounding errors. These records typically do not fail the numpy.isclose() test and so are not corrected.
- Parameters:
df – processed table containing data values to check.
calculation_components – processed calculation component metadata.
xbrl_metadata – A dataframe of fact-level metadata, required for inferring the sub-dimension total calculations.
xbrl_factoid_name – The name of the column which contains XBRL factoid values in the processed table.
table_name – name of the PUDL table whose data and metadata is being processed. This is necessary so we can ensure the metadata has the same structure as the calculation components, which at a minimum need both table_name and xbrl_factoid to identify them.
params – ReconcileTableCalculations parameters.
- Returns:
A dataframe that includes new *_correction records with values that ensure the calculations all match to within the required tolerance. It will also contain columns created by the calculation checking process, like abs_diff and rel_diff.
- pudl.transform.ferc1.reconcile_one_type_of_table_calculations(data: pandas.DataFrame, calculation_components: pandas.DataFrame, calc_idx: list[str], value_col: str, group_metric_checks: GroupMetricChecks, table_name: str, is_subdimension: bool, calc_to_data_merge_validation: Literal['one_to_many', 'many_to_many'] = 'one_to_many') pandas.DataFrame [source]¶
Calculate values, run metric checks, and add corrections.
- Parameters:
data – exploded FERC data to apply the calculations to. Primary key should be report_year, utility_id_ferc1, table_name, xbrl_factoid, and whatever additional dimensions are relevant to the data.
calculation_components – Table defining the calculations, with each row defining a single component, including its weight. Groups of rows identified by table_name_parent and xbrl_factoid_parent indicate the values being calculated.
calc_idx – primary key columns that uniquely identify a calculation component (not including the _parent columns).
value_col – label of the column in data that contains the values to apply the calculations to (typically dollar_value or ending_balance).
- pudl.transform.ferc1._calculation_components_subdimension_calculations(intra_table_calcs: pandas.DataFrame, table_dims: pandas.DataFrame, xbrl_metadata: pandas.DataFrame, dim_cols: list[str], table_name: str) pandas.DataFrame [source]¶
Add total to subdimension calculations into calculation components.
- pudl.transform.ferc1._add_intra_table_calculation_dimensions(intra_table_calcs: pandas.DataFrame, table_dims: pandas.DataFrame, dim_cols: list[str]) pandas.DataFrame [source]¶
Add all observed subdimensions into the calculation components.
- pudl.transform.ferc1.calculate_values_from_components(calculation_components: pandas.DataFrame, data: pandas.DataFrame, calc_idx: list[str], value_col: str, calc_to_data_merge_validation: Literal['one_to_many', 'many_to_many'] = 'one_to_many') pandas.DataFrame [source]¶
Apply calculations derived from XBRL metadata to reported XBRL data.
- Parameters:
calculation_components – Table defining the calculations, with each row defining a single component, including its weight. Groups of rows identified by table_name_parent and xbrl_factoid_parent indicate the values being calculated.
data – exploded FERC data to apply the calculations to. Primary key should be report_year, utility_id_ferc1, table_name, xbrl_factoid, and whatever additional dimensions are relevant to the data.
calc_idx – primary key columns that uniquely identify a calculation component (not including the _parent columns).
value_col – label of the column in data that contains the values to apply the calculations to (typically dollar_value or ending_balance).
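A minimal sketch of the weighted-sum application, assuming the column names described in the parameters above:

```python
import pandas as pd

def calculate_sketch(
    calculation_components: pd.DataFrame,
    data: pd.DataFrame,
    calc_idx: list[str],
    value_col: str,
) -> pd.DataFrame:
    """Compute each parent fact as the weighted sum of its reported components."""
    merged = calculation_components.merge(data, on=calc_idx, validate="one_to_many")
    merged["weighted_value"] = merged[value_col] * merged["weight"]
    return (
        merged.groupby(
            ["table_name_parent", "xbrl_factoid_parent", "report_year", "utility_id_ferc1"]
        )["weighted_value"]
        .sum(min_count=1)  # stay null when no components were reported
        .rename("calculated_value")
        .reset_index()
    )
```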
- pudl.transform.ferc1.check_calculation_metrics_by_group(calculated_df: pandas.DataFrame, group_metric_checks: GroupMetricChecks) pandas.DataFrame [source]¶
Tabulate the results of the calculation checks by group.
Convert all of the groups' checks into one big dataframe. It will have two indexes: one for the group name (group) and one for the group's values. The columns will include three for each test: the test metric (named after the test, e.g. error_frequency), the tolerance for that group/test, and a boolean indicating whether that metric failed to meet the tolerance.
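A rough sketch of how that tabulation might be assembled, assuming a hypothetical registry of the ErrorMetric subclasses defined below:

```python
import pandas as pd

# Hypothetical registry; one entry per name in GroupMetricChecks.metrics_to_check.
METRICS = {
    "error_frequency": ErrorFrequency,
    "relative_error_magnitude": RelativeErrorMagnitude,
    # ...
}

def check_by_group_sketch(calculated_df: pd.DataFrame, checks: GroupMetricChecks) -> pd.DataFrame:
    """Stack one (metric, tolerance, is_error) frame per group/metric pair."""
    results = []
    for group in checks.groups_to_check:
        for name in checks.metrics_to_check:
            metric = METRICS[name](
                by=group,
                is_close_tolerance=getattr(checks.is_close_tolerance, name),
            )
            # check() reports the metric value, its tolerance, and an is_error flag.
            results.append(metric.check(calculated_df).assign(group=group, metric=name))
    return pd.concat(results).set_index(["group", "metric"], append=True)
```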
- pudl.transform.ferc1.check_calculation_metrics(calculated_df: pandas.DataFrame, group_metric_checks: GroupMetricChecks) pandas.DataFrame [source]¶
Run the calculation metrics and determine if calculations are within tolerance.
- class pudl.transform.ferc1.ErrorMetric(/, **data: Any)[source]¶
Bases:
pydantic.BaseModel
Base class for checking a particular metric within a group.
- by: Literal['ungrouped', 'table_name', 'xbrl_factoid', 'utility_id_ferc1', 'report_year'][source]¶
Name of group to check the metric based on.
With the exception of the ungrouped case, all groups depend on table_name as well as the other column specified via by.
If by==”table_name” then that is the only column used in the groupby().
If by==”ungrouped” then all records are included in the “group” (via a dummy column named ungrouped that contains only the value ungrouped). This allows us to use the same infrastructure for applying the metrics to grouped and ungrouped data.
- is_close_tolerance: IsCloseTolerance[source]¶
Inputs for the metric to determine is_not_close(). Instance of IsCloseTolerance.
- required_cols: list[str] = ['table_name', 'xbrl_factoid', 'report_year', 'utility_id_ferc1', 'reported_value',...[source]¶
- has_required_cols(df: pandas.DataFrame)[source]¶
Check that the input dataframe has all required columns.
- abstract metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Metric function that will be applied to each group of values being checked.
- is_not_close(df: pandas.DataFrame) pandas.Series [source]¶
Flag records where reported and calculated values differ significantly.
We only want to check this metric when there is a non-null abs_diff, because we want to avoid the instances in which there are either null reported or calculated values.
- groupby_cols() list[str] [source]¶
The list of columns to group by.
We want to default to adding the table_name into all groupbys, but two of our by options need special treatment.
- apply_metric(df: pandas.DataFrame) pandas.Series [source]¶
Generate the metric values within each group through an apply method.
This method adds an is_not_close column into the df before the groupby, because that column is used in many of the metric() implementations.
- check(calculated_df) pandas.DataFrame [source]¶
Make a df w/ the metric, tolerance and is_error columns.
- class pudl.transform.ferc1.ErrorFrequency(/, **data: Any)[source]¶
Bases:
ErrorMetric
Check error frequency in XBRL calculations.
- metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Calculate the frequency with which records are tagged as errors.
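For intuition, this metric amounts to something like the following sketch, using the is_not_close flag added by apply_metric():

```python
import pandas as pd

def error_frequency_sketch(gb: pd.core.groupby.DataFrameGroupBy) -> pd.Series:
    """Fraction of records in each group flagged as calculation errors."""
    # is_not_close is boolean, so its mean is the error frequency.
    return gb["is_not_close"].mean()
```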
- class pudl.transform.ferc1.RelativeErrorMagnitude(/, **data: Any)[source]¶
Bases:
ErrorMetric
Check relative magnitude of errors in XBRL calculations.
- metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Calculate the magnitude of the errors relative to the total reported value.
- class pudl.transform.ferc1.AbsoluteErrorMagnitude(/, **data: Any)[source]¶
Bases:
ErrorMetric
Check absolute magnitude of errors in XBRL calculations.
These numbers may vary wildly from table to table so no default values for the expected errors are provided here…
- metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Calculate the absolute magnitude of XBRL calculation errors.
- class pudl.transform.ferc1.NullCalculatedValueFrequency(/, **data: Any)[source]¶
Bases:
ErrorMetric
Check the frequency of null calculated values.
- apply_metric(df: pandas.DataFrame) pandas.Series [source]¶
Only apply metric to rows that contain calculated values.
- metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Fraction of non-null reported values that have null corresponding calculated values.
- class pudl.transform.ferc1.NullReportedValueFrequency(/, **data: Any)[source]¶
Bases:
ErrorMetric
Check the frequency of null reported values.
- metric(gb: pandas.core.groupby.DataFrameGroupBy) pandas.Series [source]¶
Frequency with which the reported values are Null.
- pudl.transform.ferc1.add_corrections(calculated_df: pandas.DataFrame, value_col: str, is_close_tolerance: IsCloseTolerance, table_name: str, is_subdimension: bool) pandas.DataFrame [source]¶
Add corrections to discrepancies between reported & calculated values.
To isolate the sources of error, and to ensure that all totals add up as expected in later phases of the transformation, we add correction records to the dataframe which compensate for any difference between the calculated and reported values. The _correction factoids that are added here have already been added to the calculation components during the metadata processing.
- Parameters:
calculated_df – DataFrame containing the data to correct. Must already have the abs_diff column that was added by check_calculation_metrics().
value_col – Label of the column whose values are being calculated.
is_close_tolerance – Data structure containing various calculation tolerances.
table_name – Name of the table whose data we are working with. For logging.
is_subdimension – Indicator of whether or not the correction to add is a total to subdimension calculated value.
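A simplified sketch of the correction logic, assuming the calculated_value and abs_diff columns produced by the checking process and a relative tolerance:

```python
import numpy as np
import pandas as pd

def add_corrections_sketch(
    calculated_df: pd.DataFrame, value_col: str, rtol: float
) -> pd.DataFrame:
    """Append *_correction records that close significant reported/calculated gaps."""
    off = calculated_df[
        calculated_df["abs_diff"].notna()
        & ~np.isclose(calculated_df["calculated_value"], calculated_df[value_col], rtol=rtol)
    ]
    corrections = off.assign(
        xbrl_factoid=off["xbrl_factoid"] + "_correction",
        # The correction is the amount needed to make the calculation balance.
        **{value_col: off[value_col] - off["calculated_value"]},
    )
    return pd.concat([calculated_df, corrections], ignore_index=True)
```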
- class pudl.transform.ferc1.Ferc1TableTransformParams(/, **data: Any)[source]¶
Bases:
pudl.transform.classes.TableTransformParams
A model defining what TransformParams are allowed for FERC Form 1.
This adds additional parameter models beyond the ones inherited from the pudl.transform.classes.AbstractTableTransformer class.
- rename_columns_ferc1: RenameColumnsFerc1[source]¶
- wide_to_tidy: WideToTidySourceFerc1[source]¶
- merge_xbrl_metadata: MergeXbrlMetadata[source]¶
- align_row_numbers_dbf: AlignRowNumbersDbf[source]¶
- drop_duplicate_rows_dbf: DropDuplicateRowsDbf[source]¶
- assign_quarterly_data_to_yearly_dbf: AssignQuarterlyDataToYearlyDbf[source]¶
- select_dbf_rows_by_category: SelectDbfRowsByCategory[source]¶
- unstack_balances_to_report_year_instant_xbrl: UnstackBalancesToReportYearInstantXbrl[source]¶
- combine_axis_columns_xbrl: CombineAxisColumnsXbrl[source]¶
- reconcile_table_calculations: ReconcileTableCalculations[source]¶
- add_columns_with_uniform_values: AddColumnsWithUniformValues[source]¶
- property rename_dicts_xbrl[source]¶
Compile all of the XBRL rename dictionaries into an ordered list.
- property wide_to_tidy_value_types: list[str][source]¶
Compile a list of all of the value_types from wide_to_tidy.
- pudl.transform.ferc1.select_current_year_annual_records_duration_xbrl(df: pandas.DataFrame, table_name: str)[source]¶
Select for annual records within their report_year.
Select only records that have a start_date at the beginning of the report_year and an end_date at the end of the report_year.
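In pandas terms, the selection is roughly this sketch, assuming start_date and end_date are parseable dates:

```python
import pandas as pd

def select_annual_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep duration records that span exactly their full report_year."""
    start = pd.to_datetime(df["start_date"])
    end = pd.to_datetime(df["end_date"])
    annual = (
        (start.dt.year == df["report_year"]) & start.dt.is_year_start
        & (end.dt.year == df["report_year"]) & end.dt.is_year_end
    )
    return df[annual]
```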
- pudl.transform.ferc1.get_ferc1_dbf_rows_to_map(ferc1_engine: sqlalchemy.Engine) pandas.DataFrame [source]¶
Identify DBF rows that need to be mapped to XBRL columns.
Select all records in the f1_row_lit_tbl where the row literal associated with a given combination of table and row number is different from the preceding year. This is the smallest set of records which we can use to reproduce the whole table, by expanding the time series to include all years and forward-filling the row literals.
- pudl.transform.ferc1.update_dbf_to_xbrl_map(ferc1_engine: sqlalchemy.Engine) pandas.DataFrame [source]¶
Regenerate the FERC 1 DBF+XBRL glue while retaining existing mappings.
Reads all rows that need to be mapped out of the f1_row_lit_tbl and appends columns containing any previously mapped values, returning the resulting dataframe.
- pudl.transform.ferc1.read_dbf_to_xbrl_map(dbf_table_names: list[str]) pandas.DataFrame [source]¶
Read the manually compiled DBF row to XBRL column mapping for a given table.
- Parameters:
dbf_table_name – The original name of the table in the FERC Form 1 DBF database whose mapping to the XBRL data you want to extract, for example f1_plant_in_srvce.
- Returns:
DataFrame with columns [sched_table_name, report_year, row_number, row_type, xbrl_factoid]
- pudl.transform.ferc1.fill_dbf_to_xbrl_map(df: pandas.DataFrame, dbf_years: list[int] | None = None) pandas.DataFrame [source]¶
Forward-fill missing years in the minimal, manually compiled DBF to XBRL mapping.
The relationship between a DBF row and XBRL column/fact/entity/whatever is mostly consistent from year to year. To minimize the amount of manual mapping work we have to do, we only map the years in which the relationship changes. In the end we do need a complete correspondence for all years though, and this function uses the minimal information we’ve compiled to fill in all the gaps, producing a complete mapping across all requested years.
One complication is that we need to explicitly indicate which DBF rows have headers in them (which don't exist in XBRL), to differentiate them from null values in the exhaustive index we create below. We set a HEADER_ROW sentinel value so we can distinguish between two different reasons that we might find NULL values in the xbrl_factoid field:
It's NULL because it's between two valid mapped values (the NULL was created in our filling of the time series) and should thus be filled in, or
It's NULL because it was a header row in the DBF data, which means it should NOT be filled in. Without the HEADER_ROW value, when a row number from year X becomes associated with a non-header row in year X+1, the ffill will keep right on filling, associating all of the new header rows with the value of xbrl_factoid that was associated with the old row number.
- Parameters:
df – A dataframe containing a DBF row to XBRL mapping for a single FERC 1 DBF table.
dbf_years – The list of years that should have their DBF row to XBRL mapping filled in. This defaults to all available years of DBF data for FERC 1. In general this parameter should only be set to a non-default value for testing purposes.
- Returns:
A complete mapping of DBF row number to XBRL columns for all years of data within a single FERC 1 DBF table. Has columns of [report_year, row_number, xbrl_factoid]
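Schematically, the fill works like this sketch, with the HEADER_ROW sentinel keeping old factoids from filling past header rows:

```python
import pandas as pd

HEADER_ROW = "HEADER_ROW"  # sentinel marking DBF header rows with no XBRL equivalent

def fill_map_sketch(df: pd.DataFrame, years: list[int]) -> pd.DataFrame:
    """Expand a sparse year mapping to all years, forward-filling xbrl_factoid."""
    idx = pd.MultiIndex.from_product(
        [years, sorted(df["row_number"].unique())],
        names=["report_year", "row_number"],
    )
    full = df.set_index(["report_year", "row_number"]).reindex(idx).sort_index()
    # ffill within each row number; the sentinel blocks the previous factoid.
    full["xbrl_factoid"] = full.groupby(level="row_number")["xbrl_factoid"].ffill()
    return full[full["xbrl_factoid"] != HEADER_ROW].reset_index()
```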
- pudl.transform.ferc1.get_data_cols_raw_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) list[str] [source]¶
Get a list of all XBRL data columns appearing in a given XBRL table.
- Returns:
A list of all the data columns found in the original XBRL DB that correspond to the given PUDL table. Includes columns from both the instant and duration tables but excludes structural columns that appear in all XBRL tables.
- pudl.transform.ferc1.read_xbrl_calculation_fixes() pandas.DataFrame [source]¶
Read in the table of calculation fixes.
- class pudl.transform.ferc1.Ferc1AbstractTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
pudl.transform.classes.AbstractTableTransformer
An abstract class defining methods common to many FERC Form 1 tables.
This subclass remains abstract because it does not define transform_main(), which is always going to be table-specific.
Methods that only apply to XBRL data should end with _xbrl
Methods that only apply to DBF data should end with _dbf
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- parameter_model[source]¶
The pydantic model that is used to contain & instantiate parameters.
In child classes this should be replaced with the data source-specific TableTransformParams class, if it has been defined.
- params: Ferc1AbstractTableTransformer.parameter_model[source]¶
The parameters that will be used to control the transformation functions.
This attribute is of type parameter_model, which is defined above. This type varies across datasets and is used to construct and validate the parameters, so it needs to be set separately in child classes. See pudl.transform.ferc1.Ferc1AbstractTableTransformer for an example.
- has_unique_record_ids: bool = True[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- xbrl_metadata: pandas.DataFrame[source]¶
Dataframe combining XBRL metadata for both instant and duration table columns.
- xbrl_calculations: pandas.DataFrame | None = None[source]¶
Dataframe of calculation components.
If None, the calculations have not been instantiated. If the table has been instantiated but is an empty table, then there are no calculations for that table.
- transform_start(raw_dbf: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Process the raw data until the XBRL and DBF inputs have been unified.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Generic FERC1 main table transformer.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.
- transform_end(df: pandas.DataFrame) pandas.DataFrame [source]¶
Standardized final cleanup after the transformations are done.
Checks calculations. Enforces dataframe schema. Checks for empty dataframes and null columns.
- select_dbf_rows_by_category(processed_dbf: pandas.DataFrame, processed_xbrl: pandas.DataFrame, params: SelectDbfRowsByCategory | None = None) pandas.DataFrame [source]¶
Wrapper method for select_dbf_rows_by_category().
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Normalize the XBRL JSON metadata, turning it into a dataframe.
This process concatenates and deduplicates the metadata which is associated with the instant and duration tables, since the metadata is only combined with the data after the instant and duration (and DBF) tables have been merged. This happens in Ferc1AbstractTableTransformer.merge_xbrl_metadata().
- process_xbrl_metadata(xbrl_metadata_converted: pandas.DataFrame, xbrl_calculations: pandas.DataFrame) pandas.DataFrame [source]¶
Process XBRL metadata after the calculations have been cleaned.
Add row_type_xbrl and is_within_table_calc columns, and create xbrl_factoid records for the calculation corrections.
- Parameters:
xbrl_metadata_converted – Dataframe of relatively unprocessed metadata. Result of convert_xbrl_metadata_json_to_df().
xbrl_calculations – Dataframe of calculation components. Result of process_xbrl_metadata_calculations().
- deduplicate_xbrl_factoid_xbrl_metadata(tbl_meta: pandas.DataFrame) pandas.DataFrame [source]¶
De-duplicate the xbrl_metadata based on xbrl_factoid.
Default is to do nothing besides check for duplicate values, because almost all tables need no deduplication. Deduplication needs to be applied before apply_xbrl_calculation_fixes() inside of process_xbrl_metadata().
- raw_xbrl_factoid_to_pudl_name(col_name_xbrl: str) str [source]¶
Rename a column name from original XBRL name to the transformed PUDL name.
There are several transform params that either explicitly or implicitly rename columns:
RenameColumnsFerc1
WideToTidySourceFerc1
UnstackBalancesToReportYearInstantXbrl
ConvertUnits
This method attempts to use the table params to translate a column name.
Note: Instead of doing this for each individual column name, we could compile a rename dict for the whole table with a similar process and then apply it for each group of columns, instead of running through this full process every time. If this took longer than… ~5 ms on a single table w/ lots of calcs, this would probably be worth it for simplicity.
- rename_xbrl_factoid(col: pandas.Series) pandas.Series [source]¶
Rename a series of raw factoid names to PUDL names via raw_xbrl_factoid_to_pudl_name().
- rename_xbrl_factoid_other_tables(calc_comps)[source]¶
Rename the factoids from calculation components from other tables.
Note: It is probably possible to build an apply style function that takes a series of factoid names and a series of table names and returns a table-specific rename_xbrl_factoid.
- add_metadata_corrections(tbl_meta: pandas.DataFrame) pandas.DataFrame [source]¶
Create metadata records for the calculation correction factoids.
- Parameters:
tbl_meta – processed metadata table which contains the row_type_xbrl column.
- add_calculation_corrections(calc_components: pandas.DataFrame) pandas.DataFrame [source]¶
Add correction components and parent-only factoids to calculation metadata.
- Parameters:
calc_components – Partially transformed table metadata in dataframe form.
- Returns:
An updated version of the table metadata containing calculation definitions that include a correction component.
- get_xbrl_calculation_fixes() pandas.DataFrame [source]¶
Grab the XBRL calculation file.
- apply_xbrl_calculation_fixes(calc_components: pandas.DataFrame, calc_fixes: pandas.DataFrame) pandas.DataFrame [source]¶
Use the fixes we’ve compiled to update calculations in the XBRL metadata.
Note: Temp fix. These updates should probably be moved into the table params and integrated into the calculations via TableCalcs.
- process_xbrl_metadata_calculations(xbrl_metadata_converted: pandas.DataFrame) pandas.DataFrame [source]¶
Convert xbrl metadata calculations into a table of calculation components.
This method extracts the calculations from the xbrl_metadata_converted dataframe that are stored as JSON embedded within the calculations column, and converts those into calculation component records. The resulting table includes columns pertaining to both the calculation components and the parent factoid that the components pertain to. The parental columns have suffixes of _parent.
This method also adds fixes to the calculations via apply_xbrl_calculation_fixes(), adds correction records via add_calculation_corrections(), and adds the column is_within_table_calc.
- Parameters:
xbrl_metadata_converted – Dataframe of relatively unprocessed metadata. Result of convert_xbrl_metadata_json_to_df().
- add_columns_with_uniform_values(df: pandas.DataFrame, params: AddColumnsWithUniformValues | None = None) pandas.DataFrame [source]¶
Add a column with a uniform value.
- merge_xbrl_metadata(df: pandas.DataFrame, params: MergeXbrlMetadata | None = None) pandas.DataFrame [source]¶
Combine XBRL-derived metadata with the data it pertains to.
While the metadata we’re using to annotate the data comes from the more recent XBRL data, it applies generally to all the historical DBF data as well! This method reads the normalized metadata out of an attribute.
- align_row_numbers_dbf(df: pandas.DataFrame, params: AlignRowNumbersDbf | None = None) pandas.DataFrame [source]¶
Align historical FERC1 DBF row numbers with XBRL account IDs.
Additional Parameterization TBD with additional experience. See: https://github.com/catalyst-cooperative/pudl/issues/2012
- drop_duplicate_rows_dbf(df: pandas.DataFrame, params: DropDuplicateRowsDbf | None = None) pandas.DataFrame [source]¶
Drop the DBF rows where the PKs and data columns are duplicated.
Wrapper function for drop_duplicate_rows_dbf().
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
DBF-specific transformations that take place before concatenation.
- process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
XBRL-specific transformations that take place before concatenation.
- rename_columns(df: pandas.DataFrame, rename_stage: Literal['dbf', 'xbrl', 'xbrl_instant', 'xbrl_duration'] | None = None, params: pudl.transform.classes.RenameColumns | None = None)[source]¶
Grab the params based on the rename stage and run default rename_columns.
- Parameters:
df – Table to be renamed.
rename_stage – Name of stage in the transform process. Used to get specific stage’s parameters if None have been passed.
params – Rename column parameters.
- select_annual_rows_dbf(df)[source]¶
Select only annually reported DBF Rows.
There are some DBF tables that include a mix of reporting frequencies. For now, the default for PUDL tables is to have only the annual records.
- assign_quarterly_data_to_yearly_dbf(df, params: AssignQuarterlyDataToYearlyDbf | None = None)[source]¶
Transfer quarterly filed data to annual columns.
- unstack_balances_to_report_year_instant_xbrl(df: pandas.DataFrame, params: UnstackBalancesToReportYearInstantXbrl | None = None) pandas.DataFrame [source]¶
Turn start year end year rows into columns for each value type.
- wide_to_tidy(df: pandas.DataFrame, source_ferc1: SourceFerc1, params: WideToTidy | None = None) pandas.DataFrame [source]¶
Reshape wide tables with FERC account columns to tidy format.
The XBRL table coming into this method contains all the data from both the instant and duration tables in a wide format – with one column for every combination of value type (e.g. additions, ending_balance) and accounting category, which means ~500 columns.
We tidy this into a long table with one column for each of the value types (6 in all), and a new column that contains the accounting categories. This allows aggregation across columns to calculate the ending balance based on the starting balance and all of the reported changes, and aggregation across groups of rows to total up various hierarchical accounting categories (hydraulic turbines -> hydraulic production plant -> all production plant -> all electric utility plant) though the categorical columns required for that aggregation are added later.
- combine_axis_columns_xbrl(df: pandas.DataFrame, params: CombineAxisColumnsXbrl | None = None) pandas.DataFrame [source]¶
Combine axis columns from squished XBRL tables into one column with no NA.
- merge_instant_and_duration_tables_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Merge XBRL instant and duration tables, reshaping instant as needed.
A FERC1 XBRL instant fact is true as of the reported date, while a duration fact pertains to the specified time period. The date column for an instant fact corresponds to the end_date column of a duration fact.
When merging the instant and duration tables, we need to preserve row order. For the small generators table, row order is how we label and extract information from header and note rows. Outer merging messes up the order, so we need to use a one-sided merge. So far, it seems like the duration df contains all the index values in the instant df. To be sure, there's a check that makes sure there are no unique instant df index values. If that passes, we merge the instant table into the duration table, and the row order is preserved.
Note: This should always be applied before rename_columns().
- Parameters:
raw_xbrl_instant – table representing XBRL instant facts.
raw_xbrl_duration – table representing XBRL duration facts.
- Returns:
A unified table combining the XBRL duration and instant facts, if both types of facts were present. If either input dataframe is empty, the other dataframe is returned unchanged, except that several unused columns are dropped. If both input dataframes are empty, an empty dataframe is returned.
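A schematic version of that one-sided merge; key_cols is assumed to include end_date after the instant table's date column is renamed:

```python
import pandas as pd

def merge_instant_duration_sketch(
    instant: pd.DataFrame, duration: pd.DataFrame, key_cols: list[str]
) -> pd.DataFrame:
    """Merge instant facts into duration facts, preserving duration row order."""
    instant = instant.rename(columns={"date": "end_date"})
    # The duration table must cover every instant key, or a left merge drops data.
    missing = instant.set_index(key_cols).index.difference(
        duration.set_index(key_cols).index
    )
    if not missing.empty:
        raise AssertionError(f"{len(missing)} instant keys absent from duration table.")
    return duration.merge(instant, on=key_cols, how="left")
```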
- process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame [source]¶
Pre-processing required to make instant and duration tables compatible.
Column renaming is sometimes required because a few columns in the instant and duration tables do not have corresponding names that follow the naming conventions of ~95% of all the columns, which we rely on programmatically when reshaping and concatenating these tables together.
- process_duration_xbrl(df: pandas.DataFrame) pandas.DataFrame [source]¶
Pre-processing required to make instant and duration tables compatible.
Column renaming is sometimes required because a few columns in the instant and duration tables do not have corresponding names that follow the naming conventions of ~95% of all the columns, which we rely on programmatically when reshaping and concatenating these tables together.
- drop_footnote_columns_dbf(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop DBF footnote reference columns, which all end with _f.
- source_table_primary_key(source_ferc1: SourceFerc1) list[str] [source]¶
Look up the pre-renaming source table primary key columns.
- renamed_table_primary_key(source_ferc1: SourceFerc1) list[str] [source]¶
Look up the post-renaming primary key columns.
- drop_unused_original_columns_dbf(df: pandas.DataFrame) pandas.DataFrame [source]¶
Remove residual DBF specific columns.
- assign_record_id(df: pandas.DataFrame, source_ferc1: SourceFerc1) pandas.DataFrame [source]¶
Add a column identifying the original source record for each row.
It is often useful to be able to tell exactly which record in the FERC Form 1 database a given record within the PUDL database came from.
Within each FERC Form 1 DBF table, each record is supposed to be uniquely identified by the combination of: report_year, report_prd, utility_id_ferc1_dbf, spplmnt_num, row_number.
The FERC Form 1 XBRL tables do not have these supplement and row number columns, so we construct an id based on: report_year, utility_id_ferc1_xbrl, and the primary key columns of the XBRL table
- Parameters:
df – table to assign record_id to
source_ferc1 – data source of raw ferc1 database.
- Raises:
ValueError – If any of the primary key columns are missing from the DataFrame being processed.
ValueError – If there are any null values in the primary key columns.
ValueError – If the resulting record_id column is non-unique.
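Schematically, record ID assignment concatenates the source primary key values; the helper below is a sketch with hypothetical arguments:

```python
import pandas as pd

def assign_record_id_sketch(
    df: pd.DataFrame, pk_cols: list[str], table_name: str
) -> pd.DataFrame:
    """Build a record_id from the source table name and primary key values."""
    if df[pk_cols].isna().any(axis=None):
        raise ValueError("Null values found in primary key columns.")
    record_id = table_name + "_" + df[pk_cols].astype(str).agg("_".join, axis=1)
    if not record_id.is_unique:
        raise ValueError("Resulting record_id values are not unique.")
    return df.assign(record_id=record_id)
```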
- assign_utility_id_ferc1(df: pandas.DataFrame, source_ferc1: SourceFerc1) pandas.DataFrame [source]¶
Assign the PUDL-assigned utility_id_ferc1 based on the native utility ID.
We need to replace the natively reported utility ID from each of the two FERC1 sources with a PUDL-assigned utility ID. The mapping between the native IDs and these PUDL-assigned IDs can be accessed in the database tables utilities_dbf_ferc1 and utilities_xbrl_ferc1.
- Parameters:
df – the input table with the native utilty ID column.
source_ferc1 – the
- Returns:
an augemented version of the input
df
with a new column that replaces the natively reported utility ID with the PUDL-assigned utility ID.
- reconcile_table_calculations(df: pandas.DataFrame, params: ReconcileTableCalculations | None = None)[source]¶
Check how well a table’s calculated values match reported values.
- class pudl.transform.ferc1.SteamPlantsFuelTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
A table transformer specific to the core_ferc1__yearly_steam_plants_fuel_sched402 table.
The core_ferc1__yearly_steam_plants_fuel_sched402 table reports data about fuel consumed by large thermal power plants in the core_ferc1__yearly_steam_plants_sched402 table. Each record in the steam table is typically associated with several records in the fuel table, with each fuel record reporting data for a particular type of fuel consumed by that plant over the course of a year. The fuel table presents several challenges.
The type of fuel, which is part of the primary key for the table, is a freeform string with hundreds of different nonstandard values. These strings are categorized manually and converted to fuel_type_code_pudl. Some values cannot be categorized and are set to other. In other string categorizations we set the unidentifiable values to NA, but in this table the fuel type is part of the primary key, and primary keys cannot contain NA values.

This simplified categorization occasionally results in records with duplicate primary keys. In those cases the records are aggregated into a single record if they have the same apparent physical units. If the fuel units are different, only the first record is retained.
Several columns have unspecified, inconsistent, fuel-type specific units of measure associated with them. In order for records to be comparable and aggregatable, we have to infer and standardize these units.
In the raw FERC Form 1 data there is a fuel_units column which describes the units of fuel delivered or consumed. Most commonly this is short tons for solid fuels (coal), thousands of cubic feet (Mcf) for gaseous fuels, and barrels (bbl) for liquid fuels. However, the fuel_units column is also a freeform string with hundreds of nonstandard values which we have to manually categorize, and many of the values do not map directly to the most commonly used units for fuel quantities. E.g. some solid fuel quantities are reported in pounds, or thousands of pounds, not tons; some liquid fuels are reported in gallons or thousands of gallons, not barrels; and some gaseous fuels are reported in cubic feet, not thousands of cubic feet.

Two additional columns report fuel price per unit of heat content and fuel heat content per physical unit of fuel. The units of those columns are not explicitly reported, vary by fuel, and are inconsistent within individual fuel types.
We adopt standardized units and attempt to convert all reported values in the fuel table into those units. For physical fuel units we adopt those that are used by the EIA: short tons (tons) for solid fuels, barrels (bbl) for liquid fuels, and thousands of cubic feet (mcf) for gaseous fuels. For heat content per (physical) unit of fuel, we use millions of British thermal units (mmbtu). All fuel prices are converted to US dollars, though many are originally reported in cents.
Because the reported fuel price and heat content units are implicit, we have to infer them based on observed values. This is only possible because these quantities are ratios with well defined ranges of valid values. The common units that we observe and attempt to standardize include:
coal: primarily BTU/pound, but also MMBTU/ton and MMBTU/pound.
oil: primarily BTU/gallon.
gas: reported in a mix of MMBTU/cubic foot, and MMBTU/thousand cubic feet.
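As an illustration of this range-based inference, here is a minimal sketch; the numeric windows below are illustrative assumptions, not PUDL's actual tolerances:

# Plausible windows of valid values for each (fuel, implied unit) pairing.
HEAT_CONTENT_RANGES = {
    ("coal", "mmbtu_per_ton"): (15.0, 30.0),
    ("coal", "btu_per_lb"): (7_500.0, 15_000.0),
    ("oil", "btu_per_gallon"): (100_000.0, 160_000.0),
    ("gas", "mmbtu_per_mcf"): (0.8, 1.2),
    ("gas", "mmbtu_per_cf"): (0.0008, 0.0012),
}

def infer_heat_content_unit(fuel: str, value: float) -> str | None:
    """Return the first unit whose valid range contains the observed value."""
    for (fuel_type, unit), (low, high) in HEAT_CONTENT_RANGES.items():
        if fuel_type == fuel and low <= value <= high:
            return unit
    return None  # Outside all known ranges: leave the value unconverted.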
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Table specific transforms for core_ferc1__yearly_steam_plants_fuel_sched402.
- Parameters:
df – Pre-processed, concatenated XBRL and DBF data.
- Returns:
A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
Start with inherited method and do some fuel-specific processing.
We have to do most of the transformation before the DBF and XBRL data have been concatenated because the fuel type column is part of the primary key and it is extensively modified in the cleaning process.
- process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Special pre-concat treatment of the core_ferc1__yearly_steam_plants_fuel_sched402 table.
We have to do most of the transformation before the DBF and XBRL data have been concatenated because the fuel type column is part of the primary key and it is extensively modified in the cleaning process. For the XBRL data, this means we can’t create a record ID until that fuel type value is clean. In addition, the categorization of fuel types results in a number of duplicate fuel records which need to be aggregated.
- Parameters:
raw_xbrl_instant – Freshly extracted XBRL instant fact table.
raw_xbrl_duration – Freshly extracted XBRL duration fact table.
- Returns:
Almost fully transformed XBRL data table, with instant and duration facts merged together.
- to_numeric(df: pandas.DataFrame) pandas.DataFrame [source]¶
Convert columns containing numeric strings to numeric types.
- standardize_physical_fuel_units(df: pandas.DataFrame) pandas.DataFrame [source]¶
Convert reported fuel quantities to standard units depending on fuel type.
Use the categorized fuel type and reported fuel units to convert all fuel quantities to the following standard units, depending on whether the fuel is a solid, liquid, or gas. When a single fuel reports its quantity in fundamentally different units, convert based on typical values. E.g. 19.85 MMBTU per ton of coal, 1.037 Mcf per MMBTU of natural gas, 7.46 barrels per ton of oil.
solid fuels (coal and waste): short tons [ton]
liquid fuels (oil): barrels [bbl]
gaseous fuels (gas): thousands of cubic feet [mcf]
Columns to which these physical units apply:
fuel_consumed_units (tons, bbl, mcf)
fuel_cost_per_unit_burned (usd/ton, usd/bbl, usd/mcf)
fuel_cost_per_unit_delivered (usd/ton, usd/bbl, usd/mcf)
One remaining challenge in this standardization is that nuclear fuel is reported in both mass of Uranium and fuel heat content, and it's unclear if there's any reasonable typical conversion between these units, since available heat content depends on the degree of U235 enrichment, the type of reactor, and whether the fuel is just Uranium, or a mix of Uranium and Plutonium from decommissioned nuclear weapons. See:
https://world-nuclear.org/information-library/facts-and-figures/heat-values-of-various-fuels.aspx
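A minimal sketch of multiplier-based unit standardization; the unit strings and the multiplier table are illustrative assumptions (the real transformation categorizes hundreds of freeform unit strings via its TransformParams):

import pandas as pd

FUEL_UNIT_MULTIPLIERS = {
    "lbs": 1.0 / 2000.0,    # pounds -> short tons
    "klbs": 1.0 / 2.0,      # thousands of pounds -> short tons
    "gal": 1.0 / 42.0,      # gallons -> barrels
    "kgal": 1000.0 / 42.0,  # thousands of gallons -> barrels
    "cf": 1.0 / 1000.0,     # cubic feet -> mcf
}

def standardize_units(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale quantities into standard units; per-unit costs move inversely."""
    mult = df["fuel_units"].map(FUEL_UNIT_MULTIPLIERS).fillna(1.0)
    return df.assign(
        fuel_consumed_units=df["fuel_consumed_units"] * mult,
        fuel_cost_per_unit_burned=df["fuel_cost_per_unit_burned"] / mult,
        fuel_cost_per_unit_delivered=df["fuel_cost_per_unit_delivered"] / mult,
    )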
- aggregate_duplicate_fuel_types_xbrl(fuel_xbrl: pandas.DataFrame) pandas.DataFrame [source]¶
Aggregate the fuel records having duplicate primary keys.
- drop_total_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop rows that represent plant totals rather than individual fuels.
This is an imperfect, heuristic process. The rows we identify as probably representing totals rather than individual fuels:
have zero or null values in all of their numerical data columns
have no identifiable fuel type
have no identifiable fuel units
DO report a value for MMBTU / MWh (heat rate)
In the case of the core_ferc1__yearly_steam_plants_fuel_sched402 table, we drop any row where all the data columns are null AND there's a non-null value in the fuel_mmbtu_per_mwh column, as that typically indicates a "total" row for a plant. We also require a null value for fuel_units and an "other" value for the fuel type.
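Expressed as a pandas mask, the heuristic above might look like this sketch; the data_cols list is an assumed stand-in for the table's numeric data columns:

import pandas as pd

def total_row_mask(df: pd.DataFrame, data_cols: list[str]) -> pd.Series:
    """True for rows that look like plant totals rather than fuel records."""
    return (
        df[data_cols].isnull().all(axis=1)        # no individual fuel data
        & df["fuel_mmbtu_per_mwh"].notnull()      # but a reported heat rate
        & df["fuel_units"].isnull()               # no identifiable fuel units
        & (df["fuel_type_code_pudl"] == "other")  # no identifiable fuel type
    )

# Usage: df = df[~total_row_mask(df, data_cols)]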
- drop_invalid_rows(df: pandas.DataFrame, params: pudl.transform.classes.InvalidRows | None = None) pandas.DataFrame [source]¶
Drop invalid rows from the fuel table.
This method both drops rows in which all required data columns are null (using the inherited parameterized method) and then also drops those rows we believe represent plant totals. See
SteamPlantsFuelTableTransformer.drop_total_rows()
.
- class pudl.transform.ferc1.SteamPlantsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for the core_ferc1__yearly_steam_plants_sched402 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- class pudl.transform.ferc1.HydroelectricPlantsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
A table transformer specific to the core_ferc1__yearly_hydroelectric_plants_sched406 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- transform_main(df)[source]¶
Standard transform_main, plus bespoke removal of a known duplicate record and removal of "." from the project_num column.
- targeted_drop_duplicates(df)[source]¶
Targeted removal of known duplicate record.
There are two records in 2019 with a utility_id_ferc1 of 200 and a plant_name_ferc1 of "marmet". The records are nearly duplicates of each other, except one has nulls in the capex columns. Surgically remove the record with the nulls.
- class pudl.transform.ferc1.PumpedStoragePlantsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_pumped_storage_plants_sched408 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- class pudl.transform.ferc1.PurchasedPowerAndExchangesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_purchased_power_and_exchanges_sched326.
This table has data about inter-utility power purchases, including how much electricity was purchased, how much it cost, and who it was purchased from. Unfortunately the field describing which other utility the power was being bought from is poorly standardized, making it difficult to correlate with other data. It will need to be categorized by hand or with some fuzzy matching eventually.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- class pudl.transform.ferc1.PlantInServiceTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
A transformer for the core_ferc1__yearly_plant_in_service_sched204 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- process_xbrl_metadata(xbrl_metadata_converted: pandas.DataFrame, xbrl_calculations: pandas.DataFrame) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
We fill in some gaps in the metadata, e.g. for FERC accounts that have been split across multiple rows, or combined without being calculated. We also need to rename the XBRL metadata categories to conform to the same naming convention that we are using in the data itself (since FERC doesn't quite follow their own naming conventions...). We use the same rename dictionary, but as an argument to pd.Series.replace() instead of pd.DataFrame.rename().
- deduplicate_xbrl_factoid_xbrl_metadata(tbl_meta: pandas.DataFrame) pandas.DataFrame [source]¶
De-duplicate the XBRL metadata.

We deduplicate the metadata on the basis of the xbrl_factoid name. This table in particular has multiple wide_to_tidy value_types because there are multiple dollar columns embedded (it has both the standard start/end balances as well as modifications like transfers/retirements). In the XBRL metadata, each fact has its own set of metadata and possibly its own set of calculations, which means that one xbrl_factoid for this table natively could have multiple calculations or other metadata.

For merging, we need the metadata to have one field per xbrl_factoid. Because we normally only use the start/end balance in calculations, when there are duplicate renamed xbrl_factoids in our processed metadata, we prefer the one that refers to the start/end balances. In an ideal world, we would be able to access this metadata based on both the xbrl_factoid and any column from value_types, but that would require a larger change in architecture.
- apply_sign_conventions(df) pandas.DataFrame [source]¶
Adjust row and column sign conventions to enable aggregation by summing.
Columns have uniform sign conventions, which we have manually inferred from the original metadata. This can and probably should be done programmatically in the future. If not, we’ll probably want to store the column_weights as a parameter rather than hard-coding it in here.
- targeted_drop_duplicates_dbf(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop bad duplicate records from a specific utility in 2018.
This is a very specific fix, meant to get rid of a particular observed set of duplicate records: FERC Respondent ID 187 in 2018 has two sets of plant in service records, one of which contains a bunch of null data.
This method is part of the DBF processing because we want to be able to hard-code a specific value of
utility_id_ferc1_dbf
and those IDs are no longer available later in the process. I think.
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
Drop targeted duplicates in the DBF data so we can use FERC respondent ID.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
The main table-specific transformations, affecting contents not structure.
Annotates and alters data based on information from the XBRL taxonomy metadata. Also assigns utility type, plant status & function for use in table explosions. Makes all electric_plant_sold balances positive.
- class pudl.transform.ferc1.SmallPlantsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
A table transformer specific to the core_ferc1__yearly_small_plants_sched410 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Table specific transforms for core_ferc1__yearly_small_plants_sched410.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
A single transformed table concatenating multiple years of cleaned data derived from the raw DBF and/or XBRL inputs.
- extract_ferc1_license(df: pandas.DataFrame) pandas.DataFrame [source]¶
Extract FERC license number from plant_name_ferc1.

Many FERC license numbers are embedded in the plant_name_ferc1 column, but not all numbers in the plant_name_ferc1 column are FERC licenses. Some are dates, dollar amounts, page numbers, or numbers of wind turbines. This function extracts valid FERC license numbers and puts them in a new column called license_id_ferc1.

Potential FERC license numbers are valid when:
Two or more integers were found.
The found integers were accompanied by key phrases such as: ["license", "no.", "ferc", "project"].
The accompanying name does not contain phrases such as: ["page", "pg", "$", "wind", "units"].
The found integers don't fall within the range of a valid year, defined as: 1900-2050.
The plant record is categorized as hydro or not categorized via the plant_type and fuel_type columns.
This function also fills other fuel types with hydro for all plants with valid FERC licenses, because only hydro plants have FERC licenses.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
The same input DataFrame but with a new column called license_id_ferc1 that contains FERC 1 license information extracted from plant_name_ferc1.
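A rough sketch of the extraction logic; the regex and the helper name are assumptions, though the keyword lists come straight from the criteria above:

import re

LICENSE_HINTS = ["license", "no.", "ferc", "project"]
EXCLUDE_HINTS = ["page", "pg", "$", "wind", "units"]

def extract_license(plant_name: str) -> int | None:
    """Return a FERC license number found in a plant name, else None."""
    numbers = re.findall(r"\d{2,}", plant_name)  # runs of two or more digits
    name = plant_name.lower()
    if (
        numbers
        and any(hint in name for hint in LICENSE_HINTS)
        and not any(hint in name for hint in EXCLUDE_HINTS)
        and not any(1900 <= int(num) <= 2050 for num in numbers)  # not a year
    ):
        return int(numbers[0])
    return None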
- _find_possible_header_or_note_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Find and label rows that might be headers or notes.
Called by the coordinating function
label_row_types()
This function creates a column called possible_header_or_note that is either True or False depending on whether a group of columns are all NA. Rows labeled as True will be further scrutinized in the _label_header_rows() and _label_note_rows() functions to determine whether they are actually headers or notes.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
The same input DataFrame but with a new column called possible_header_or_note that flags rows that might contain useful header or note information.
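A minimal sketch of this flagging step, assuming data_cols stands in for the group of columns being checked:

import pandas as pd

def flag_possible_header_or_note(
    df: pd.DataFrame, data_cols: list[str]
) -> pd.DataFrame:
    """Flag rows whose substantive data columns are entirely NA."""
    return df.assign(possible_header_or_note=df[data_cols].isnull().all(axis=1))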
- _find_note_clumps(group: pandas.core.groupby.DataFrameGroupBy) tuple[pandas.core.groupby.DataFrameGroupBy, pandas.DataFrame] [source]¶
Find groups of rows likely to be notes.
Once the _find_possible_header_or_note_rows() function identifies rows that are either headers or notes, we must determine which one they are. As described in the _label_note_rows() function, note rows are usually adjacent rows with no content.

This function identifies instances of two or more adjacent rows where possible_header_or_note = True. It takes individual utility-year groups as a parameter, as opposed to the entire dataset, because adjacent rows are only meaningful if they are from the same reporting entity in the same year. If we were to run this on the whole dataframe, we would see "note clumps" that are actually notes from the end of one utility's report and headers from the beginning of another. For this reason, we run this function from within the _label_note_rows_group() function.

The output of this function is not a modified version of the original utility-year group; rather, it is a DataFrame containing information about the nature of the possible_header_or_note = True rows that is used to determine whether each such row is a note. It also returns the original utility-year group as groupby objects separated each time possible_header_or_note changes from True to False or vice versa.

If you pass in the following df:
plant_name_ferc1     possible_header_or_note
HYDRO:               True
rainbow falls (b)    False
cadyville (a)        False
keuka (c)            False
project #2738        True
project #2835        True
project #2852        True
You will get the following output (in addition to the groupby objects for each clump):
header_or_note    rows_per_clump
True              1
False             3
True              3
This shows each clump of adjacent records where possible_header_or_note is True or False, and how many records are in each clump.
- Params:
- group: A utility-year grouping of the concatenated FERC XBRL and DBF tables. This table must have been run through the _find_possible_header_or_note_rows() function and contain the column possible_header_or_note.
- Returns:
A tuple containing groupby objects for each of the note and non-note clumps and a DataFrame indicating the number of rows in each note or non-note clump.
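The clump detection itself can be sketched with the usual shift/cumsum idiom; the names here are illustrative, not the actual implementation:

import pandas as pd

def find_clumps(group: pd.DataFrame) -> pd.DataFrame:
    """Summarize runs of identical possible_header_or_note values."""
    flags = group["possible_header_or_note"]
    # Increment the clump id every time the flag flips True <-> False:
    clump_id = (flags != flags.shift()).cumsum()
    return (
        group.groupby(clump_id)
        .agg(
            header_or_note=("possible_header_or_note", "first"),
            rows_per_clump=("possible_header_or_note", "size"),
        )
        .reset_index(drop=True)
    )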
- _label_header_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Label header rows by adding header to the row_type column.

Called by the coordinating function label_row_types().

Once possible header or note rows have been identified via the _find_possible_header_or_note_rows() function, this function sorts out which ones are headers. It does this by identifying a list of strings that, when found in the plant_name_ferc1 column, indicate that the row is or is not a header.

Sometimes this function identifies a header that is actually a note. For this reason, it's important that the function be called before _label_note_rows() so that the bad header values get overridden by the note designation.
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.
- Returns:
The same input DataFrame but with likely header rows containing the string header in the row_type column.
- _label_note_rows_group(util_year_group: pandas.core.groupby.DataFrameGroupBy) pandas.core.groupby.DataFrameGroupBy [source]¶
Label note rows by adding note to the row_type column.

Called within the wrapper function _label_note_rows().

This function breaks the data down by reporting unit (utility and year) and determines whether a possible_header_or_note = True row is a note based on two criteria:
Clumps of 2 or more adjacent rows where possible_header_or_note is True.
Instances where the last row in a utility-year group has possible_header_or_note as True.
There are a couple of important exceptions that this function also addresses. Utilities often have multiple headers in a single utility-year grouping. You might see something like: pd.Series([header, plant1, plant2, note, header, plant3, plant4]). In this case, a note clump is actually a note followed by a header, and this function will not override the header as a note. Unfortunately, there is always the possibility that a header row is followed by a plant that had no values reported. This would look like, and therefore be categorized as, a note clump. I haven't built a workaround, but hopefully there aren't very many of these.
- Params:
util_year_group: A groupby object that contains a single year and utility.
- Returns:
The same input but with likely note rows containing the string note in the row_type column.
- _label_note_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Wrapper for _label_note_rows_group().

The small plants table has lots of note rows that contain useful information. Unfortunately, the notes are in their own row rather than their own column! This means that useful information pertaining to plant rows is floating around as a junk row with no other information except the note in the plant_name_ferc1 field. Luckily, the data are reported just like they would be on paper, i.e., the headers are at the top and the notes are at the bottom. See the table in label_row_types() for more detail. This function labels note rows.

Note rows are determined by row location within a given report, so we must break the data into reporting units (utility and year) and then apply the note-finding methodology defined in _label_note_rows_group() to each group.
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.
- Returns:
The same input DataFrame but with likely note rows containing the string note in the row_type column.
- _label_total_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Label total rows by adding total to the row_type column.

Called by the coordinating function label_row_types().

For the most part, when plant_name_ferc1 contains the string total, the values therein are duplicates of what is already reported, i.e., a total value. However, there are some cases where that's not true. For example, the phrase amounts are for the total appears when chunks of plants (usually but not always wind) are reported together. It's a total, but it's not double counting, which is the reason for the total flag.

Similar to _label_header_rows(), it's important that this be called before _label_note_rows() in label_row_types() so that note clumps can override certain non-totals that are mistakenly labeled as such.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
The same input DataFrame but with likely total rows containing the string total in the row_type column.
- label_row_types(df: pandas.DataFrame) pandas.DataFrame [source]¶
Coordinate labeling of row_types as headers, notes, or totals.

The small plants table is more like a digitized PDF than an actual data table. The rows contain all sorts of information in addition to what the columns might suggest. For instance, there are header rows, note rows, and total rows that contain useful information, but cause confusion in their current state, mixed in with the rest of the data.

Here's an example of what you might find in the small plants table:

plant_name_ferc1     plant_type   capacity_mw
HYDRO:               NA           NA
rainbow falls (b)    NA           30
cadyville (a)        NA           100
keuka (c)            NA           80
total plants         NA           310
project #2738        NA           NA
project #2835        NA           NA
project #2852        NA           NA

Notice how misleading it is to have all this information in one column. The goal of this function is to coordinate labeling functions so that we can identify which rows contain specific plant information and which rows are headers, notes, or totals.

Once labeled, other functions can remove rows that might cause double counting, extract useful plant or fuel type information from headers, and extract useful context or license id information from notes.

Coordinates _label_header_rows(), _label_total_rows(), _label_note_rows().
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through the _find_possible_header_or_note_rows() function and contains the column possible_header_or_note.
- Returns:
The same input DataFrame but with a column called row_type containing the strings header, note, total, or NA to indicate what type of row it is.
- prep_header_fuel_and_plant_types(df: pandas.DataFrame, show_unmapped_headers=False) pandas.DataFrame [source]¶
Forward fill header rows to prep for fuel and plant type extraction.

The headers we've identified in _label_header_rows() can be used to supplement the values in the plant_type and fuel_type columns.

This function groups the data by utility, year, and header; extracts the header into a new column; and forward fills the headers so that each record in the header group is associated with that header. Because the headers map to different fuel types and plant types (ex: solar pv maps to fuel type solar and plant type photovoltaic), the new forward-filled header column is duplicated and called fuel_type_from_header and plant_type_from_header. In map_header_fuel_and_plant_types(), these columns will be mapped to their respective fuel and plant types, used to fill in blank values in the plant_type and fuel_type columns, and then eventually removed.

Why separate the prep step from the map step?

We trust the values originally reported in the fuel_type and plant_type columns more than the extracted and forward-filled header values, so we only want to replace fuel_type and plant_type values that are labeled as pd.NA or other. The values reported to those columns are extremely messy and must be cleaned via pudl.transform.classes.categorize_strings() in order for us to know which are truly pd.NA or other. Because we also use pudl.transform.classes.categorize_strings() to map the headers to fuel and plant types, it makes sense to clean all four columns at once and then combine them.

Here's a look at what this function does. It starts with the following table:

plant_name_ferc1     plant_type   fuel_type   row_type
HYDRO:               NA           NA          header
rainbow falls (b)    NA           NA          NA
cadyville (a)        NA           NA          NA
keuka (c)            NA           NA          NA
Wind Turbines:       NA           NA          header
sunny grove          NA           NA          NA
green park wind      NA           wind        NA

And ends with this:

plant_name_ferc1     plant_type   fuel_type   plant_type_from_header   fuel_type_from_header
HYDRO:               NA           NA          HYDRO:                   HYDRO:
rainbow falls (b)    NA           NA          HYDRO:                   HYDRO:
cadyville (a)        NA           NA          HYDRO:                   HYDRO:
keuka (c)            NA           NA          HYDRO:                   HYDRO:
Wind Turbines:       NA           NA          Wind Turbines:           Wind Turbines:
sunny grove          NA           NA          Wind Turbines:           Wind Turbines:
green park wind      NA           wind        Wind Turbines:           Wind Turbines:

NOTE: If a utility's plant_name_ferc1 values look like this: ["STEAM", "coal_plant1", "coal_plant2", "wind_turbine1"], then this algorithm will think that last wind turbine is a steam plant. Luckily, when a utility embeds headers in the data it usually includes them for all plant types: ["STEAM", "coal_plant1", "coal_plant2", "WIND", "wind_turbine"].
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through _label_row_type() and contains the column row_type.
- Returns:
The same input DataFrame but with new columns plant_type_from_header and fuel_type_from_header that forward fill the values in the header rows by utility, year, and header group.
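A condensed sketch of the forward-fill step described above, assuming utility_id_ferc1 and report_year as the grouping columns:

import pandas as pd

def prep_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Extract header text and forward fill it within utility-year groups."""
    header_text = df["plant_name_ferc1"].where(df["row_type"] == "header")
    filled = (
        df.assign(header=header_text)
        .groupby(["utility_id_ferc1", "report_year"])["header"]
        .ffill()
    )
    # Duplicate the filled header so each copy can later be categorized
    # against its own parameter set (fuel types vs. plant types):
    return df.assign(
        fuel_type_from_header=filled, plant_type_from_header=filled
    )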
- map_header_fuel_and_plant_types(df: pandas.DataFrame) pandas.DataFrame [source]¶
Fill pd.NA and other plant and fuel types with cleaned headers.

prep_header_fuel_and_plant_types() extracted and forward filled the header values; pudl.transform.params.categorize_strings() cleaned them according to both the fuel and plant type parameters. This function combines fuel_type_from_header with fuel_type, and plant_type_from_header with plant_type, when the reported, cleaned values are pd.NA or other.

To understand more about why these steps are necessary, read the docstring for prep_header_fuel_and_plant_types().
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through prep_header_fuel_and_plant_types() and contains the columns fuel_type_from_header and plant_type_from_header.
- Returns:
The same input DataFrame but with pd.NA or other values in the fuel_type and plant_type columns filled in with the respective values from fuel_type_from_header and plant_type_from_header when available, and with the fuel_type_from_header and plant_type_from_header columns removed.
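The fill rule can be sketched as a simple mask; the helper name is illustrative:

import pandas as pd

def fill_from_header(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace NA or "other" values in col with the header-derived value."""
    replaceable = df[col].isna() | (df[col] == "other")
    filled = df[col].mask(replaceable, df[f"{col}_from_header"])
    return df.assign(**{col: filled}).drop(columns=[f"{col}_from_header"])

# Usage: df = fill_from_header(fill_from_header(df, "fuel_type"), "plant_type")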
- map_plant_name_fuel_types(df: pandas.DataFrame) pandas.DataFrame [source]¶
Supplement fuel_type with information in plant_name_ferc1.

Sometimes fuel type is embedded in a plant name (not just headers). In this case we can identify what that fuel is from the name and fill in empty fuel_type values. Right now, this only works for hydro plants because the rest are complicated and have a slew of exceptions. This could probably be applied to the plant_type column in the future too.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
The same input DataFrame but with other values in the fuel_type column filled in with fuel types extracted from the plant_name_ferc1 column.
- associate_notes_with_values(df: pandas.DataFrame) pandas.DataFrame [source]¶
Use footnote indicators to map notes and FERC licenses to plant rows.
There are many utilities that report a bunch of mostly empty note rows at the bottom of their yearly entry. These notes often pertain to specific plant rows above. Sometimes the notes and their respective plant rows are linked by a common footnote indicator such as (a) or (1) etc.
This function takes this:

plant_name_ferc1     row_type   license_id_ferc1
HYDRO:               header     NA
rainbow falls (b)    NA         NA
cadyville (a)        NA         NA
keuka (c)            NA         NA
total plants         total      NA
project #2738        note       2738
project #2835        note       2835
project #2852        note       2852

Finds the note rows with footnote indicators, maps the content from the note row into a new note column that's associated with the value row, and maps any FERC license extracted from this note column to the license_id_ferc1 column in the value row:

plant_name_ferc1     row_type   notes           license_id_ferc1
HYDRO:               header     NA              NA
rainbow falls (b)    NA         project #2835   2835
cadyville (a)        NA         project #2738   2738
keuka (c)            NA         project #2852   2852
total plants         total      NA              NA
project #2738        note       NA              2738
project #2835        note       NA              2835
project #2852        note       NA              2852

(Header and note rows are removed later.)

NOTE: Note rows that don't have a footnote indicator, or note rows with a footnote indicator that don't have a corresponding plant row with the same indicator, are not captured. They will ultimately get removed and their content will not be preserved.
- Params:
df: Pre-processed, concatenated XBRL and DBF data that has been run through label_row_types() and contains the column row_type.
- Returns:
The same input DataFrame but with a column called notes that contains notes, reported below, in the same row as the plant values they pertain to, along with any further additions to the license_id_ferc1 field as extracted from these newly associated notes.
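A rough sketch of the footnote-indicator matching, assuming note rows carry the same parenthesized indicator (e.g. "(a)") as their plant rows; the real method is more involved and operates within utility-year groups:

import pandas as pd

def associate_notes(df: pd.DataFrame) -> pd.DataFrame:
    """Copy note text onto the plant row sharing its footnote indicator."""
    indicator = df["plant_name_ferc1"].str.extract(r"\(([a-z0-9]+)\)")[0]
    df = df.assign(footnote=indicator)
    notes = (
        df.loc[df["row_type"] == "note", ["footnote", "plant_name_ferc1"]]
        .dropna(subset=["footnote"])
        .rename(columns={"plant_name_ferc1": "notes"})
    )
    out = df.merge(notes, on="footnote", how="left")
    out.loc[out["row_type"] == "note", "notes"] = pd.NA  # note rows keep NA
    return out.drop(columns=["footnote"])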
- spot_fix_rows(df: pandas.DataFrame) pandas.DataFrame [source]¶
Fix one-off row errors.
In 2004, utility_id_ferc1 251 reports clumps of units together. Each unit clump looks something like this: intrepid wind farm (107 units @ 1.5 mw each), and is followed by a row that looks like this: (amounts are for the total of all 107 units). For the most part, these rows are useless note rows. However, there is one instance where important values are reported in this note row rather than in the actual plant row above.

There are probably plenty of other spot fixes one could add here.
- Params:
df: Pre-processed, concatenated XBRL and DBF data.
- Returns:
The same input DataFrame but with some spot fixes corrected.
- class pudl.transform.ferc1.TransmissionLinesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
A table transformer for the core_ferc1__yearly_transmission_lines_sched422 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Do some string-to-numeric ninja moves.
- class pudl.transform.ferc1.EnergySourcesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_energy_sources_sched401 table.
The raw DBF and XBRL table will be split up into two tables. This transformer generates the sources of electricity for utilities, dropping the information about dispositions. For XBRL, this is a duration-only table. Right now we are merging in the metadata but not actually keeping anything from it. We are also not yet doing anything with the sign.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Perform default xbrl metadata processing plus adding 1 new xbrl_factoid.
Note: we should probably parameterize this and add it into the standard
process_xbrl_metadata()
.
- class pudl.transform.ferc1.EnergyDispositionsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_energy_dispositions_sched401 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- class pudl.transform.ferc1.UtilityPlantSummaryTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_utility_plant_summary_sched200 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Remove the end-of-previous-year instant data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Do the default metadata processing plus add a new factoid.
The new factoid corresponds to the aggregated factoid in
aggregated_xbrl_factoids()
.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Default transforming, plus spot fixing and building aggregate xbrl_factoid.
- aggregated_xbrl_factoids(df: pandas.DataFrame) pandas.DataFrame [source]¶
Aggregate xbrl_factoids records for linking to core_ferc1__yearly_plant_in_service_sched204.
This table has two xbrl_factoid values which can be linked via calculations to one xbrl_factoid in core_ferc1__yearly_plant_in_service_sched204. Doing this 2:1 linkage would be fine in theory, but core_ferc1__yearly_plant_in_service_sched204 is in most senses the more detailed table, and because we want to build tree-like relationships between factoids, we need to build a new factoid to link this table to core_ferc1__yearly_plant_in_service_sched204 in a 1:1 manner.

We'll also add this factoid into the metadata via process_xbrl_metadata() and add the linking calculation via apply_xbrl_calculation_fixes().
- spot_fix_bad_signs(df: pandas.DataFrame) pandas.DataFrame [source]¶
Spot fix depreciation_utility_plant_in_service records with bad signs.
- class pudl.transform.ferc1.BalanceSheetLiabilitiesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_balance_sheet_liabilities_sched110 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Duplicate data that appears in multiple distinct calculations.
There is one case in which exactly the same data values are referenced in multiple calculations, and which can't be resolved by choosing one of the referenced values as the canonical location for that data. In order to preserve all of the calculation structure, we need to duplicate those records in the data, the metadata, and the calculation specifications. Here we duplicate the data and associate it with newly defined facts, which we will also add to the metadata and calculations.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Perform default xbrl metadata processing plus adding 2 new xbrl_factoids.
We add two new factoids which are defined (by PUDL) only for the DBF data, and also duplicate and redefine several factoids which are referenced in multiple calculations and need to be distinguishable from each other.
Note: we should probably parameterize this and add it into the standard
process_xbrl_metadata()
.
- class pudl.transform.ferc1.BalanceSheetAssetsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_balance_sheet_assets_sched110 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Duplicate data that appears in multiple distinct calculations.
There is one case in which exactly the same data values are referenced in multiple calculations, and which can't be resolved by choosing one of the referenced values as the canonical location for that data. In order to preserve all of the calculation structure, we need to duplicate those records in the data, the metadata, and the calculation specifications. Here we duplicate the data and associate it with newly defined facts, which we will also add to the metadata and calculations.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Default xbrl metadata processing plus some error correction.
We add two new factoids which are defined (by PUDL) only for the DBF data, and also duplicate and redefine several factoids which are referenced in multiple calculations and need to be distinguishable from each other.
Note: we should probably parameterize this and add it into the standard
process_xbrl_metadata()
.
- class pudl.transform.ferc1.IncomeStatementsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for the core_ferc1__yearly_income_statements_sched114 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Perform default xbrl metadata processing plus adding a new xbrl_factoid.
Note: we should probably parameterize this and add it into the standard
process_xbrl_metadata()
.
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
Drop incorrect row numbers from f1_incm_stmnt_2 before standard processing.
In 2003, two rows were added to the f1_income_stmnt DBF table, which bumped the starting row_number of f1_incm_stmnt_2 from 25 to 27. A small handful of respondents seem to have not gotten the memo about this in 2003 and have information on these row numbers that shouldn't exist at all for this table.

This step necessitates the ability to know which source table each record actually comes from, which required adding a column (sched_table_name) in the extract step before these two DBF input tables were concatenated.

Right now we are just dropping these bad row numbers. Should we actually be bumping the whole respondent's row numbers, assuming they reported incorrectly for the whole table? See: https://github.com/catalyst-cooperative/pudl/issues/471
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop duplicate records from f1_income_stmnt.
Because net_utility_operating_income is reported on both page 1 and 2 of the form, it ends up introducing a bunch of duplicated records, so we need to drop one of them. Since the value is used in the calculations that are part of the second page, we’ll drop it from the first page.
- class pudl.transform.ferc1.RetainedEarningsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_retained_earnings_sched118 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
Beyond the standard
Ferc1AbstractTableTransformer.process_xbrl_metadata()
processing, add FERC account values for a few known values.
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
Perform generic process_dbf(), plus deal with duplicates.

Along with the standard processing in Ferc1AbstractTableTransformer.process_dbf(), this method runs:
targeted_drop_duplicates_dbf()
reconcile_double_year_earnings_types_dbf()
- transform_main(df)[source]¶
Add _previous_year factoids after standard transform_main.
Add _previous_year factoids for unappropriated_retained_earnings and unappropriated_undistributed_subsidiary_earnings after standard transform_main. This should only affect XBRL data, but we do it after merging to enable access to DBF data to fill this in as well.
- transform_end(df: pandas.DataFrame) pandas.DataFrame [source]¶
Check
_previous_year
factoids for consistency after the transformation is done.
- check_double_year_earnings_types(df: pandas.DataFrame) pandas.DataFrame [source]¶
Check previous year/current year factoids for consistency.
The terminology can be very confusing - here are the expectations:
“inter year consistency”: earlier year’s “current starting/end balance” == later year’s “previous starting/end balance”
“intra year consistency”: each year’s “previous ending balance” == “current starting balance”
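A compact sketch of the inter-year check, assuming a long-format table with utility_id_ferc1, report_year, earnings_type, and ending_balance columns (the factoid names come from the surrounding docs, but the exact shape is an assumption):

import pandas as pd

def inter_year_consistent(df: pd.DataFrame) -> pd.Series:
    """Year X's current ending balance == year X+1's previous ending balance."""
    current = df[df["earnings_type"] == "unappropriated_retained_earnings"]
    previous = df[
        df["earnings_type"] == "unappropriated_retained_earnings_previous_year"
    ]
    merged = current.assign(next_year=current["report_year"] + 1).merge(
        previous,
        left_on=["utility_id_ferc1", "next_year"],
        right_on=["utility_id_ferc1", "report_year"],
        suffixes=("_cur", "_prev"),
    )
    return merged["ending_balance_cur"].eq(merged["ending_balance_prev"])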
- targeted_drop_duplicates_dbf(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop duplicates with truly duplicate data.
There are instances of utilities that reported multiple values for several earnings types for a specific year (utility_id_ferc1 68 in 1998 and utility_id_ferc1 296 in 2015). We are taking the largest value reported and dropping the rest. There could very well be a better strategy here, but there are only 25 records that have this problem, so we're going with this.
- reconcile_double_year_earnings_types_dbf(df: pandas.DataFrame) pandas.DataFrame [source]¶
Reconcile current and past year data reported in 1 report_year.
The DBF table includes two different earnings types that have "Beginning of Period" and "End of Period" rows. But the table has both an amount column that corresponds to a balance and a starting balance column. For these two earnings types, this means that there are in effect two years of data in this table for each report year: a starting and ending balance for the previous year, and a starting and ending balance for the current year. The ending balance for the previous year should be the same as the starting balance for the current year.
We need to keep both pieces of data in order to calculate ending_balances, so we want to check these assumptions, extract as much information from these two years of data, and keep both records for each of these two earnings types for each utility.
- Raises:
AssertionError – There are a very small number of instances in which the ending balance from the previous year does not match the starting balance from the current year. The share of these non-matching instances should be less than 2% of the records with these date-duplicative earnings types.
- add_previous_year_factoid(df: pandas.DataFrame) pandas.DataFrame [source]¶
Create *_previous_year factoids for XBRL data.

XBRL doesn't include the previous year's data, but DBF does, so we try to check that year X's *_current_year factoid has the same value as year X+1's *_previous_year factoid.

To do this, we need to add some *_previous_year factoids to the XBRL data.
- deduplicate_xbrl_factoid_xbrl_metadata(tbl_meta) pandas.DataFrame [source]¶
Deduplicate the xbrl_metadata based on the
xbrl_factoid
.

The metadata relating to the dollar_value column generally had the same name as the renamed xbrl_factoid. We'll double check that we a) didn't remove too many factoids by doing this AND b) that we have a fully deduped output below. In an ideal world, we would have multiple pieces of metadata information (like calculations and FERC account #'s) for every single wide_to_tidy() value column.

Note: This is almost the same as the method for core_ferc1__yearly_operating_revenues_sched300. If we wanted to lean into this version of deduplication more generally, this might be a fine way to start an abstraction, but ideally we wouldn't need to dedupe this at all and instead would enable metadata for every value column from wide_to_tidy().
- class pudl.transform.ferc1.DepreciationSummaryTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_depreciation_summary_sched336 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- process_xbrl_metadata(xbrl_metadata_converted: pandas.DataFrame, xbrl_calculations: pandas.DataFrame) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
Beyond the standard
Ferc1AbstractTableTransformer.process_xbrl_metadata()
processing, add FERC account values for a few known values.
- class pudl.transform.ferc1.DepreciationChangesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_depreciation_changes_sched219 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate
TableTransformParams
object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
Warning: The calculations in this table are currently being corrected using reconcile_table_calculations(), but they still contain high rates of error. This function replaces the name of the single balance column reported in the XBRL Instant table with starting_balance / ending_balance. We pull those two values into their own separate labeled rows, each of which should get the metadata from the original column. We do this pre-processing before we call the main function in order for the calculation fixes and renaming to work as expected.
- process_dbf(raw_df: pandas.DataFrame) pandas.DataFrame [source]¶
Accumulated Depreciation table specific DBF cleaning operations.
The XBRL reports a utility_type which is always electric in this table, but which may be necessary for differentiating between different values when this data is combined with other tables. The DBF data doesn’t report this value so we are adding it here for consistency across the two data sources.
Also rename ending_balance_accounts to ending_balance.
- process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame [source]¶
Pre-processing required to make the instant and duration tables compatible.
This table has a rename that needs to take place in an unusual spot: after the starting / ending balances have been unstacked, but before the instant & duration tables are merged. This method reverses the order in which these operations happen, compared to the inherited method.
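A schematic of that reordering, using illustrative (assumed) column names such as entity_id, balance_type and ending_balance_accounts:

    import pandas as pd

    def reorder_rename_sketch(instant: pd.DataFrame, duration: pd.DataFrame) -> pd.DataFrame:
        # Unstack the instant table so starting/ending balances become columns...
        wide = instant.pivot_table(
            index=["entity_id", "report_year"], columns="balance_type", values="balance"
        ).reset_index()
        # ...then apply the rename that the inherited method would do earlier...
        wide = wide.rename(columns={"ending_balance_accounts": "ending_balance"})
        # ...and only then merge the instant and duration tables.
        return wide.merge(duration, on=["entity_id", "report_year"], how="outer")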
- class pudl.transform.ferc1.DepreciationByFunctionTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer for core_ferc1__yearly_depreciation_by_function_sched219 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Create a metadata table with the one factoid we’ve assigned to this table.
Instead of adding facts to the metadata like many of the other table-specific convert_xbrl_metadata_json_to_df() methods, this method creates a metadata table with one single xbrl_factoid. We assign that factoid to the table in transform_main().
- raw_xbrl_factoid_to_pudl_name(col_name_xbrl: str) str [source]¶
Return the one fact name for this table.
We’ve artificially assigned this table one xbrl_factoid during transform_main(). Because this table has only one value for its xbrl_factoid column, all col_name_xbrl values should be converted to “accumulated_depreciation”.
- process_dbf(raw_df: pandas.DataFrame) pandas.DataFrame [source]¶
Accumulated Depreciation table specific DBF cleaning operations.
The XBRL reports a utility_type which is always electric in this table, but which may be necessary for differentiating between different values when this data is combined with other tables. The DBF data doesn’t report this value so we are adding it here for consistency across the two data sources.
- process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame [source]¶
Pre-processing required to make the instant and duration tables compatible.
This table has a rename that needs to take place in an unusual spot: after the starting / ending balances have been unstacked, but before the instant & duration tables are merged. This method reverses the order in which these operations happen compared to the inherited method. We also want to strip the accumulated_depreciation that appears on every plant functional class.
- transform_main(df: pandas.DataFrame) pandas.DataFrame [source]¶
Add depreciation_type then run default transform_main().
We are adding depreciation_type as the xbrl_factoid column for this table, with one value (“accumulated_depreciation”) across the whole table. This table has multiple “dimension” columns, such as utility_type and plant_function, which differentiate what slice of a utility’s assets each record pertains to. We added this new column as the xbrl_factoid of the table, instead of using one of the dimensions of the table, so that the table can conform to the same pattern of treatment for these dimension columns.
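The column assignment itself is simple; a sketch:

    import pandas as pd

    def add_factoid_column_sketch(df: pd.DataFrame) -> pd.DataFrame:
        # Every record gets the same factoid; the dimension columns
        # (utility_type, plant_function, ...) carry the real variation.
        return df.assign(depreciation_type="accumulated_depreciation")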
- transform_end(df: pandas.DataFrame) pandas.DataFrame [source]¶
Run standard Ferc1AbstractTableTransformer.transform_end() plus a data validation step.
In infer_intra_factoid_totals(), we restrict the child calculation components to only those without “total” in any of the dimension columns (e.g. plant_status == “total”). Because of this, when there is more than one dimension with totals in a table, as in this table, records with two totals (e.g. plant_status == “total” and plant_function == “total”) only get linked to children with no “total” in any of their subdimensions. This is fine and good because it avoids possible double counting of mixed total and sub-dimension calculations. But it means that records with totals in one sub-dimension (e.g. plant_status == “in_service” and plant_function == “total”) aren’t linked to double-total parent factoids. To ensure that there aren’t many instances of data where most or all of the data is reported in these mixed-total records, we add a validation step to guard against large-scale data loss in pudl.output.ferc1.Exploder.
- class pudl.transform.ferc1.OperatingExpensesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_operating_expenses_sched320 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- targeted_drop_duplicates_dbf(raw_df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop incorrect duplicate from 2002.
In 2002, utility_id_ferc1_dbf 96 reported two values for administrative_and_general_operation_expense. I found the correct value by looking at the prev_yr_amt value in 2003. This removes the incorrect row.
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Default XBRL metadata processing, plus adding a DBF-only xbrl factoid.
Note: we should probably parameterize this and add it into the standard process_xbrl_metadata().
- process_dbf(raw_dbf: pandas.DataFrame) pandas.DataFrame [source]¶
Process DBF but drop a bad row that is flagged by drop_duplicates.
- class pudl.transform.ferc1.OperatingRevenuesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_operating_revenues_sched300 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- deduplicate_xbrl_factoid_xbrl_metadata(tbl_meta: pandas.DataFrame) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
Employ the standard process for processing metadata. Then remove duplication on the basis of the xbrl_factoid. This table used wide_to_tidy() with three separate value columns, which results in one xbrl_factoid referencing three separate data columns. This method grabs only one piece of metadata for each renamed xbrl_factoid, preferring the calculated value or the factoid referencing the dollar columns.
In an ideal world, we would have multiple pieces of metadata (like calculations and FERC account numbers) for every single wide_to_tidy() value column. We would probably want to employ that across the board, adding suffixes or something like that to stack the metadata in a similar fashion to how we stack the data.
- class pudl.transform.ferc1.CashFlowsTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transform class for core_ferc1__yearly_cash_flows_sched120 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- process_instant_xbrl(df: pandas.DataFrame) pandas.DataFrame [source]¶
Pre-processing required to make the instant and duration tables compatible.
This table has a rename that needs to take place in an unusual spot: after the starting / ending balances have been unstacked, but before the instant & duration tables are merged. This method reverses the order in which these operations happen, compared to the inherited method.
- targeted_drop_duplicates(df)[source]¶
Drop one duplicate record from 2020, utility_id_ferc1 2037.
Note: This step could be avoided if we employed a drop_invalid_rows() transform step with required_valid_cols = ["amount"].
- validate_start_end_balance(df)[source]¶
Validate that starting balance + net change = ending balance.
A quick check to ensure the vast majority of the ending balances are calculable as the starting balance plus the net change.
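A sketch of such a check, assuming a tidy table with an amount_type column taking the values starting_balance / net_change / ending_balance (these names are assumptions):

    import numpy as np
    import pandas as pd

    def validate_start_end_balance_sketch(df: pd.DataFrame, min_frac: float = 0.95) -> pd.DataFrame:
        # Pivot so each (utility, year) has starting/net/ending values side by side.
        wide = df.pivot_table(
            index=["utility_id_ferc1", "report_year"],
            columns="amount_type",
            values="amount",
        )
        ok = np.isclose(wide["starting_balance"] + wide["net_change"], wide["ending_balance"])
        if ok.mean() < min_frac:
            raise AssertionError(f"Only {ok.mean():.1%} of ending balances are calculable.")
        return df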
- convert_xbrl_metadata_json_to_df(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]]) pandas.DataFrame [source]¶
Transform the metadata to reflect the transformed data.
Replace the name of the balance column reported in the XBRL Instant table with starting_balance / ending_balance since we pull those two values into their own separate labeled rows, each of which should get the original metadata for the Instant column.
- class pudl.transform.ferc1.SalesByRateSchedulesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transform class for core_ferc1__yearly_sales_by_rate_schedules_sched304 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids: bool = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- add_axis_to_total_table_rows(df: pandas.DataFrame)[source]¶
Add total to the axis column for rows from the total table.
Because we’re adding the sales_of_electricity_by_rate_schedules_account_totals_304 table into the mix, we have a bunch of total values that get mixed in with all the _billed columns from the individual tables. If left alone, these totals aren’t labeled in any way because they don’t have the same _axis columns explaining what each of the values is. In order to distinguish them from the rest of the sub-total data, we use this function to create an _axis value for them noting that they are totals (a sketch follows the parameters below).
It’s worth noting that there are also some total values in there already. Those would be hard to clean. The idea is that if you want the actual totals, don’t try to sum the sub-components; look at the actual labeled total rows.
This function relies on the sched_table_name column, so it must be called before that column gets dropped.
- Parameters:
df – The sales table with a sched_table_name column.
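A sketch of the labeling step; the axis column name used here (rate_schedule_axis) is an assumption:

    import pandas as pd

    def add_axis_to_totals_sketch(df: pd.DataFrame) -> pd.DataFrame:
        totals_table = "sales_of_electricity_by_rate_schedules_account_totals_304"
        # Rows from the totals table get an explicit axis label so they can be
        # distinguished from the unlabeled sub-total data.
        df.loc[df["sched_table_name"] == totals_table, "rate_schedule_axis"] = "total"
        return df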
- process_xbrl(raw_xbrl_instant: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Rename columns before running wide_to_tidy.
- class pudl.transform.ferc1.OtherRegulatoryLiabilitiesTableTransformer(xbrl_metadata_json: dict[Literal['instant', 'duration'], list[dict[str, Any]]] | None = None, params: pudl.transform.classes.TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True)[source]¶
Bases:
Ferc1AbstractTableTransformer
Transformer class for core_ferc1__yearly_other_regulatory_liabilities_sched278 table.
- table_id: TableIdFerc1[source]¶
Name of the PUDL database table that this table transformer produces.
Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.
- has_unique_record_ids = False[source]¶
True if each record in the transformed table corresponds to one input record.
For tables that have been transformed from wide-to-tidy format, or undergone other kinds of reshaping, there is not a simple one-to-one relationship between input and output records, and so we should not expect record IDs to be unique. In those cases they serve only a forensic purpose, telling us where to find the original source of the transformed data.
- pudl.transform.ferc1.FERC1_TFR_CLASSES: collections.abc.Mapping[str, type[Ferc1AbstractTableTransformer]][source]¶
- pudl.transform.ferc1.ferc1_transform_asset_factory(table_name: str, tfr_class: Ferc1AbstractTableTransformer, io_manager_key: str = 'pudl_io_manager', convert_dtypes: bool = True, generic: bool = False) dagster.AssetsDefinition [source]¶
Create an asset that pulls in raw FERC Form 1 assets and applies transformations.
This is a convenient way to create assets for tables that depend only on raw DBF tables, raw XBRL instant and duration tables, and XBRL metadata.
- Parameters:
table_name – The name of the table to create an asset for.
tfr_class – A transformer class corresponding to the table_name.
io_manager_key – the dagster io_manager key to use. None defaults to the fs_io_manager.
convert_dtypes – convert dtypes of transformed dataframes.
generic – If using GenericPlantFerc1TableTransformer pass table_id to constructor.
- Returns:
An asset for the clean table.
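A hypothetical usage sketch, building one asset per transformer class registered in FERC1_TFR_CLASSES (the real create_ferc1_transform_assets() may set additional keyword flags per table):

    from pudl.transform.ferc1 import FERC1_TFR_CLASSES, ferc1_transform_asset_factory

    # Build one dagster asset per transformer class, using the factory's defaults.
    assets = [
        ferc1_transform_asset_factory(table_name, tfr_class)
        for table_name, tfr_class in FERC1_TFR_CLASSES.items()
    ]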
- pudl.transform.ferc1.create_ferc1_transform_assets() list[dagster.AssetsDefinition] [source]¶
Create a list of transformed FERC Form 1 assets.
- Returns:
A list of AssetsDefinitions where each asset is a clean FERC Form 1 table.
- pudl.transform.ferc1.other_dimensions(table_names: list[str]) list[str] [source]¶
Get a list of the other dimension columns across all of the transformers.
- pudl.transform.ferc1.table_to_xbrl_factoid_name() dict[str, str] [source]¶
Build a dictionary of table names (keys) to xbrl_factoid column names (values).
- pudl.transform.ferc1.table_to_column_to_check() dict[str, list[str]] [source]¶
Build a dictionary of table names (keys) to the column_to_check value from reconcile_table_calculations.
- pudl.transform.ferc1._core_ferc1__table_dimensions(**kwargs) pandas.DataFrame [source]¶
Build a table of values of dimensions observed in the transformed data tables.
Compile a dataframe indicating what distinct values are observed in the data for each dimension column in association with each unique combination of table_name and xbrl_factoid. E.g. for all factoids found in the core_ferc1__yearly_depreciation_by_function_sched219 table, the only value observed for utility_type is electric, and the values observed for plant_status include future, in_service, leased and total.
We need to include the xbrl_factoid column because these dimensions can differ based on the xbrl_factoid. So we first rename all of the columns which contain the xbrl_factoid using the table_to_xbrl_factoid_name() rename dictionary. Then we concatenate all of the tables together and drop duplicates, so that we have unique instances of the observed table_name and xbrl_factoid combinations plus the other dimension columns found in other_dimensions().
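A compact sketch of the concatenate-and-deduplicate step described above (assuming the factoid columns have already been renamed to xbrl_factoid and each table carries a table_name column):

    import pandas as pd

    def table_dimensions_sketch(
        tables: dict[str, pd.DataFrame], dimensions: list[str]
    ) -> pd.DataFrame:
        # Keep only the identifying columns plus dimensions, then deduplicate
        # to get the distinct observed combinations.
        return (
            pd.concat(tables.values())
            .loc[:, ["table_name", "xbrl_factoid"] + dimensions]
            .drop_duplicates()
            .reset_index(drop=True)
        )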
- pudl.transform.ferc1._core_ferc1_xbrl__metadata(**kwargs) pandas.DataFrame [source]¶
Build a table of all of the tables’ XBRL metadata.
- pudl.transform.ferc1._core_ferc1_xbrl__calculation_components(**kwargs) pandas.DataFrame [source]¶
Create calculation-component table from table-level metadata.
- pudl.transform.ferc1.unexpected_total_components(calc_comps: pandas.DataFrame, dimensions: list[str]) pandas.DataFrame [source]¶
Find unexpected components in within-fact total calculations.
This doesn’t check anything about the calcs we get from the metadata; we are only looking at within-fact totals which we’ve added ourselves. A code sketch follows the parameters below.
Finds calculation relationships where:
child components do not match their parent in non-total dimensions. For example, if utility_type_parent is not “total”, then utility_type must be the same as utility_type_parent.
child components that share table_name/xbrl_factoid with their parent have “total” for any dimension - these should instead be represented by their child components.
- Parameters:
calc_comps – calculation component join table
dimensions – list of dimensions we resolved “total” values for
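A sketch of the two checks, assuming the _parent-suffixed column convention used elsewhere in this module:

    import pandas as pd

    def unexpected_components_sketch(
        calc_comps: pd.DataFrame, dimensions: list[str]
    ) -> pd.DataFrame:
        # (a) a non-"total" parent dimension pins the value: children must share it.
        mismatched = pd.Series(False, index=calc_comps.index)
        for dim in dimensions:
            parent = calc_comps[f"{dim}_parent"]
            mismatched |= (parent != "total") & (parent != calc_comps[dim])
        # (b) within-fact children should never themselves be totals.
        same_fact = (calc_comps["table_name"] == calc_comps["table_name_parent"]) & (
            calc_comps["xbrl_factoid"] == calc_comps["xbrl_factoid_parent"]
        )
        child_is_total = calc_comps[dimensions].eq("total").any(axis=1)
        return calc_comps[mismatched | (same_fact & child_is_total)]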
- pudl.transform.ferc1.check_for_calc_components_duplicates(calc_components: pandas.DataFrame, table_names_known_dupes: list[str], idx: list[str]) None [source]¶
Check for duplicate calculation records.
We need to remove the core_ferc1__yearly_sales_by_rate_schedules_sched304 table because there are duplicate renamed factoids in that table (originally billed/unbilled).
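A sketch of the check, with known-duplicate tables excluded before testing:

    import pandas as pd

    def check_duplicates_sketch(
        calc_components: pd.DataFrame,
        table_names_known_dupes: list[str],
        idx: list[str],
    ) -> None:
        # Ignore tables with known, accepted duplicates before checking.
        to_check = calc_components[~calc_components["table_name"].isin(table_names_known_dupes)]
        dupes = to_check[to_check.duplicated(subset=idx, keep=False)]
        if not dupes.empty:
            raise AssertionError(f"Found {len(dupes)} duplicate calculation component records.")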
- pudl.transform.ferc1.make_xbrl_factoid_dimensions_explicit(df_w_xbrl_factoid: pandas.DataFrame, table_dimensions_ferc1: pandas.DataFrame, dimensions: list[str], parent: bool = False) pandas.DataFrame [source]¶
Fill in null dimensions with the values observed in _core_ferc1__table_dimensions().
In the raw XBRL metadata’s calculations, there is an implicit assumption that calculated values are aggregated within categorical columns called Axes or dimensions, in addition to being grouped by date, utility, table, and fact. The dimensions and their values don’t need to be specified explicitly in the calculation components because the same calculation is assumed to apply in all cases.
We have extended this calculation system to allow independent calculations to be specified for different values within a given dimension. For example, the core_ferc1__yearly_utility_plant_summary_sched200 table contains records with a variety of different utility_type values (gas, electric, etc.). For many combinations of fact and utility_type, no more detailed information about the source of the data is available, but for some, and only in the case of electric utilities, much more detail can be found in the core_ferc1__yearly_plant_in_service_sched204 table. In order to use this additional information when it is available, we sometimes explicitly specify different calculations for different values of additional dimension columns.
This function uses the observed associations between table_name, xbrl_factoid and the other dimension columns compiled by _core_ferc1__table_dimensions() to fill in missing (previously implied) dimension values in the calculation components table.
This is often a broadcast merge, because many tables contain many values within these dimension columns, so it is expected that the new calculation components table will have many more records than the input calculation components table. A sketch of this merge follows the parameters below.
Any dimension that was already specified in the calculation fixes will be left unchanged. If no value of a particular dimension has ever been observed in association with a given combination of table_name and xbrl_factoid, it will remain null.
- Parameters:
df_w_xbrl_factoid – a table of calculation component records which have had some manual calculation fixes applied.
table_dimensions_ferc1 – table with all observed values of other_dimensions() for each table_name and xbrl_factoid.
dimensions – list of dimension columns to check.
parent – boolean to indicate whether the dimensions to be added are the parental dimensions or the child dimensions.
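A simplified sketch of the broadcast merge, one dimension at a time (the real implementation may differ in detail):

    import pandas as pd

    def fill_implied_dimensions_sketch(
        df_w_xbrl_factoid: pd.DataFrame,
        table_dimensions: pd.DataFrame,
        dimensions: list[str],
    ) -> pd.DataFrame:
        out = df_w_xbrl_factoid
        for dim in dimensions:
            observed = table_dimensions[["table_name", "xbrl_factoid", dim]].drop_duplicates()
            # Broadcast: one output row per observed value of this dimension for
            # each null-dimension input row; explicit values pass through as-is.
            implied = (
                out[out[dim].isna()]
                .drop(columns=dim)
                .merge(observed, on=["table_name", "xbrl_factoid"], how="left")
            )
            out = pd.concat([out[out[dim].notna()], implied], ignore_index=True)
        return out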
- pudl.transform.ferc1.assign_parent_dimensions(calc_components: pandas.DataFrame, table_dimensions: pandas.DataFrame, dimensions: list[str]) pandas.DataFrame [source]¶
Add dimensions to calculation parents.
We now add in parent-dimension values for all of the original calculation component records using the observed dimensions.
- Parameters:
calc_components – a table of calculation component records which have had some manual calculation fixes applied.
table_dimensions – table with all observed values of other_dimensions() for each table_name and xbrl_factoid.
dimensions – list of dimension columns to check.
- pudl.transform.ferc1.infer_intra_factoid_totals(calc_components: pandas.DataFrame, meta_w_dims: pandas.DataFrame, table_dimensions: pandas.DataFrame, dimensions: list[str]) pandas.DataFrame [source]¶
Define dimension total calculations.
Some factoids are marked as a total along some dimension in the metadata, which means that they are the sum of all the non-total factoids along that dimension.
We match the parent factoids from the metadata to child factoids from the table_dimensions. We treat “total” as a wildcard value.
We exclude child factoids that are themselves totals, because that would result in a double-count.
Here are a few examples:
Imagine a factoid with the following dimensions & values:
utility types: “total”, “gas”, “electric”;
plant status: “total”, “in_service”, “future”
Then the following parents would match/not-match:
parent: “total”, “in_service”
child: “gas”, “in_service” WOULD match.
child: “electric”, “in_service” WOULD match.
child: “electric”, “future” WOULD NOT match.
parent: “total”, “total”
child: “gas”, “in_service” WOULD match.
child: “electric”, “future” WOULD match.
See the unit test in ferc1_test.py for more details, and the sketch following the parameters below for this matching rule in code.
To be able to define these within-dimension calculations we also add dimension columns to all of the parent factoids in the table.
- Parameters:
calc_components – a table of calculation component records which have had some manual calculation fixes applied. Passed through unmodified.
meta_w_dims – metadata table with the dimensions.
table_dimensions – table with all observed values of other_dimensions() for each table_name and xbrl_factoid.
dimensions – list of dimension columns to check.
- Returns:
A table associating calculation components with the parents they will be aggregated into. The components and the parents are each identified by table_name, xbrl_factoid, and columns defining the additional dimensions (utility_type, plant_status, plant_function). The parent columns have a _parent suffix.
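A toy sketch of the matching rule (a hypothetical helper mirroring the examples above, not part of the module):

    def total_wildcard_match_sketch(
        parent: dict[str, str], child: dict[str, str], dimensions: list[str]
    ) -> bool:
        # Children that are themselves totals are excluded to avoid double counting.
        if any(child[dim] == "total" for dim in dimensions):
            return False
        # A "total" parent dimension is a wildcard; anything else must match exactly.
        return all(parent[dim] in ("total", child[dim]) for dim in dimensions)

    dims = ["utility_type", "plant_status"]
    assert total_wildcard_match_sketch(
        {"utility_type": "total", "plant_status": "in_service"},
        {"utility_type": "gas", "plant_status": "in_service"},
        dims,
    )
    assert not total_wildcard_match_sketch(
        {"utility_type": "total", "plant_status": "in_service"},
        {"utility_type": "electric", "plant_status": "future"},
        dims,
    )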
- pudl.transform.ferc1.add_calculation_component_corrections(calc_components: pandas.DataFrame) pandas.DataFrame [source]¶
Add records into the calculation components table.
All calculated records should have correction records. All total-to-subdimension calculations should also have correction records. For the core (non-subdimension) calculations, all calculation parents require a record with a correction child. For the total-to-subdimension calculations, we assume we can identify those calculations with the is_total_to_subdimensions_calc boolean column.
All calculations will get a correction record in the calculation components table whether or not there is ever a correction record in the data.