pudl.transform.eiaaeo

Transform raw AEO tables into normalized assets.

Raw AEO tables often contain many different types of data which are split out along different dimensions. For example, one table may contain generation split out by fuel type as well as prices split out by service category.

As a result, we need to split these large tables into smaller tables that have more uniform data. We do this by filtering the large table to its relevant subsets, and then transforming some human-readable string fields into useful metadata fields.

Attributes

Classes

AeoCheckSpec

Define some simple checks that can run on any AEO asset.

Functions

__sanitize_string(→ pandas.Series)

get_series_info(→ pandas.DataFrame)

Break human-readable series name into machine-readable fields.

get_category_info(→ pandas.Series)

Break human-readable category name into machine-readable fields.

subtotals_match_reported_totals_ratio(→ float)

When subtotals and totals are reported in the same column, check their sums.

series_sum_ratio(→ float)

Find how well multiple columns sum to another column.

filter_enrich_sanitize(→ pandas.DataFrame)

Basic cleaning steps common to all AEO tables.

_collect_totals(→ pandas.DataFrame)

Various columns have different names for their "total" fact.

unstack(df, eventual_pk)

Unstack the values by the various variable names provided.

core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology(...)

Projected net summer generation capacity and additions/retirements.

core_eiaaeo__yearly_projected_electric_sales(...)

Projected electricity sales by customer class.

core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type(...)

Projected generation capacity + gross generation in end-use sectors.

core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type(...)

Projected fuel cost for the electric power sector.

make_check(→ dagster.AssetChecksDefinition)

Turn the AeoCheckSpec into an actual Dagster asset check.

Module Contents

pudl.transform.eiaaeo.__sanitize_string(series: pandas.Series) → pandas.Series

pudl.transform.eiaaeo.get_series_info(series_name: pandas.Series) → pandas.DataFrame

Break human-readable series name into machine-readable fields.

The series name contains several comma-separated fields: the variable, the region, the case, and the report year.

The variable then contains its own colon-separated fields: a general topic, a less general subtopic, and specific variable name. It may also contain a fourth field for a specific dimension such as fuel type.
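The two-level split described above can be sketched roughly as follows. This is only an illustration, not the actual get_series_info implementation: the helper name split_series_name, the output column names, and the example series names are all made up for this sketch.

```python
import pandas as pd


def split_series_name(series_name: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: split series names into machine-readable fields."""
    # Outer level: variable, region, case, report year (comma-separated).
    # Splitting from the right keeps any commas inside the variable intact.
    outer = series_name.str.rsplit(",", n=3, expand=True)
    outer.columns = ["variable", "region", "case", "report_year"]
    outer = outer.apply(lambda col: col.str.strip())

    # Inner level: topic, subtopic, variable name, optional dimension
    # (colon-separated).
    inner = outer["variable"].str.split(":", expand=True)
    inner = inner.apply(lambda col: col.str.strip())
    # Guarantee four columns even if no row carries a dimension.
    inner = inner.reindex(columns=range(4))
    inner.columns = ["topic", "subtopic", "variable_name", "dimension"]

    return pd.concat([inner, outer[["region", "case", "report_year"]]], axis=1)


# Made-up series names, shaped like the description above.
names = pd.Series(
    [
        "Topic A : Subtopic B : Capacity : Coal, Region X, Reference, AEO2023",
        "Topic A : Subtopic B : Sales, Region Y, High Growth, AEO2023",
    ]
)
info = split_series_name(names)
```

Rows without a fourth colon-separated field end up with a missing value in the dimension column.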

pudl.transform.eiaaeo.get_category_info(category_name: pandas.Series) → pandas.Series

Break human-readable category name into machine-readable fields.

Fortunately the only field we’re pulling out of the category so far is the region, which is the last of two comma-separated fields.
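A minimal sketch of pulling out the region, assuming the category name looks like "<table name>, <region>". The helper name region_from_category and the example strings are hypothetical.

```python
import pandas as pd


def region_from_category(category_name: pd.Series) -> pd.Series:
    """Hypothetical helper: take the text after the last comma as the region."""
    return category_name.str.rsplit(",", n=1).str[-1].str.strip().rename("region")


# Made-up category names.
categories = pd.Series(["Some Table, Region X", "Some Table, Region Y"])
regions = region_from_category(categories)
```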

pudl.transform.eiaaeo.subtotals_match_reported_totals_ratio(df: pandas.DataFrame, pk: list[str], fact_columns: list[str], dimension_column: str) → float

When subtotals and totals are reported in the same column, check their sums.

Group by some key, then check that within each group the non-"total" values sum up to the corresponding "total" value.

Checks the list of fact columns in aggregate. If you want to check that each column sums up correctly on its own, call this function once per column.

TODO 2024-05-06: it may make sense to pass the threshold into this function, which would clean up the call sites.

Parameters:
  • df – the dataframe to investigate

  • pk – the key to group facts by

  • fact_columns – the columns containing facts you’d like to sum

  • dimension_column – the column which tells you if a fact is a sub-total or a total.

Returns:

The ratio of reported totals that are np.isclose() to the sum of their component parts.
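One way the grouping and comparison described above might look is sketched below. It assumes that total rows are marked with the literal value "total" in the dimension column; the function name subtotal_match_ratio and the example data are illustrative only and may not match the real implementation.

```python
import numpy as np
import pandas as pd


def subtotal_match_ratio(
    df: pd.DataFrame, pk: list[str], fact_columns: list[str], dimension_column: str
) -> float:
    """Hypothetical sketch: share of reported totals that match their subtotals."""
    # Assumption: rows holding a reported total are labeled "total".
    is_total = df[dimension_column] == "total"
    # Sum all fact columns together within each group (the aggregate check).
    reported = df.loc[is_total].groupby(pk)[fact_columns].sum().sum(axis="columns")
    summed = df.loc[~is_total].groupby(pk)[fact_columns].sum().sum(axis="columns")
    # Align subtotal sums to the reported totals; groups without any
    # subtotals become NaN, which np.isclose treats as a mismatch.
    summed = summed.reindex(reported.index)
    return float(np.isclose(reported, summed).mean())


# Made-up data: region "a" adds up, region "b" does not.
example = pd.DataFrame(
    {
        "report_year": [2023] * 6,
        "region": ["a", "a", "a", "b", "b", "b"],
        "fuel": ["coal", "gas", "total", "coal", "gas", "total"],
        "capacity_mw": [1.0, 2.0, 3.0, 1.0, 2.0, 4.0],
    }
)
ratio = subtotal_match_ratio(
    example, pk=["report_year", "region"], fact_columns=["capacity_mw"], dimension_column="fuel"
)
```

In this example one of the two reported totals matches, so the ratio is 0.5. A caller would then compare the ratio against whatever threshold it considers acceptable.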

pudl.transform.eiaaeo.series_sum_ratio(summands: pandas.DataFrame, total: pandas.Series) → float

Find how well multiple columns sum to another column.

Parameters:
  • summands – the columns that should sum to total

  • total – the target total column

Returns:

the ratio of values in total that are np.isclose() to the sum of summands.
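A minimal sketch of this kind of check, assuming summands and total share the same index. The name sum_ratio and the example values are hypothetical.

```python
import numpy as np
import pandas as pd


def sum_ratio(summands: pd.DataFrame, total: pd.Series) -> float:
    """Hypothetical sketch: share of rows where the summands add up to total."""
    row_sums = summands.sum(axis="columns")
    return float(np.isclose(row_sums, total).mean())


# Made-up values: the first two rows add up, the third does not.
parts = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1.0, 1.0, 1.0]})
reported_total = pd.Series([2.0, 3.0, 5.0])
match_share = sum_ratio(parts, reported_total)
```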

pudl.transform.eiaaeo.filter_enrich_sanitize(raw_df: pandas.DataFrame, relevant_series_names: tuple[str]) → pandas.DataFrame

Basic cleaning steps common to all AEO tables.

  1. Filter the AEO rows based on the series name

  2. Break the series name and category names into useful fields

  3. Sanitize strings & turn data values into a numeric field

  4. Make some defensive checks about data from multiple sources that should agree.
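The four steps above could be strung together roughly like this. Everything specific here is an assumption made for illustration: the raw column names (series_name, category_name, value), the prefix-matching rule used for filtering, the lowercase-and-strip sanitization, and the choice of region as the field that both sources should agree on.

```python
import pandas as pd


def filter_enrich_sanitize_sketch(
    raw_df: pd.DataFrame, relevant_prefixes: tuple[str, ...]
) -> pd.DataFrame:
    """Hypothetical sketch of the four cleaning steps."""
    # 1. Filter rows by series name (assumed rule: prefix match).
    keep = raw_df["series_name"].map(lambda name: name.startswith(relevant_prefixes))
    df = raw_df.loc[keep].copy()

    # 2. Break the series name and category name into fields.
    parts = df["series_name"].str.rsplit(",", n=3, expand=True)
    df["variable"] = parts[0]
    df["region"] = parts[1]
    category_region = df["category_name"].str.rsplit(",", n=1).str[-1]

    # 3. Sanitize strings and turn the data values into numbers.
    for col in ["variable", "region"]:
        df[col] = df[col].str.strip().str.lower()
    category_region = category_region.str.strip().str.lower()
    df["value"] = pd.to_numeric(df["value"], errors="coerce")

    # 4. Defensive check: the two sources of the region should agree.
    if not (df["region"] == category_region).all():
        raise ValueError("Region in series name and category name disagree.")

    return df[["variable", "region", "value"]].reset_index(drop=True)


# Made-up raw rows; only the first matches the "Sales" prefix.
raw = pd.DataFrame(
    {
        "series_name": [
            "Sales : Residential, Region X, Reference, AEO2023",
            "Prices : Residential, Region X, Reference, AEO2023",
        ],
        "category_name": ["Some Table, Region X", "Some Table, Region X"],
        "value": ["1.5", "9.9"],
    }
)
clean = filter_enrich_sanitize_sketch(raw, ("Sales",))
```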

pudl.transform.eiaaeo._collect_totals(df: pandas.DataFrame, total_colname='dimension') → pandas.DataFrame

Various columns have different names for their “total” fact.

This combines them into one “total” dimension.
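As a rough illustration of the idea, the sketch below relabels any value that mentions "total" with a single "total" label. The matching rule and the example labels are guesses for illustration; the real function may identify totals differently.

```python
import pandas as pd


def collect_totals_sketch(df: pd.DataFrame, total_colname: str = "dimension") -> pd.DataFrame:
    """Hypothetical sketch: map differently named totals to one "total" label."""
    out = df.copy()
    # Assumption: any label containing "total" (case-insensitive) is a total.
    is_total = out[total_colname].str.contains("total", case=False, na=False)
    out.loc[is_total, total_colname] = "total"
    return out


# Made-up labels.
labeled = pd.DataFrame(
    {"dimension": ["coal", "Total", "total capacity", "gas"], "value": [1, 3, 3, 2]}
)
collected = collect_totals_sketch(labeled)
```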

pudl.transform.eiaaeo.unstack(df: pandas.DataFrame, eventual_pk: list[str])

Unstack the values by the various variable names provided.
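A minimal sketch of this kind of reshaping, assuming a long table with a variable_name column and a value column (both names are assumptions). Each distinct variable becomes its own column, and eventual_pk identifies the rows of the result.

```python
import pandas as pd


def unstack_sketch(df: pd.DataFrame, eventual_pk: list[str]) -> pd.DataFrame:
    """Hypothetical sketch: pivot variable names into their own columns."""
    # The combination of eventual_pk and variable_name must be unique.
    wide = df.set_index(eventual_pk + ["variable_name"])["value"].unstack("variable_name")
    wide.columns.name = None
    return wide.reset_index()


# Made-up long-format data.
long_df = pd.DataFrame(
    {
        "report_year": [2023, 2023, 2023, 2023],
        "region": ["a", "a", "b", "b"],
        "variable_name": ["capacity", "generation", "capacity", "generation"],
        "value": [1.0, 10.0, 2.0, 20.0],
    }
)
wide_df = unstack_sketch(long_df, eventual_pk=["report_year", "region"])
```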

pudl.transform.eiaaeo.core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology(raw_eiaaeo__electric_power_projections_regional)

Projected net summer generation capacity and additions/retirements.

pudl.transform.eiaaeo.core_eiaaeo__yearly_projected_electric_sales(raw_eiaaeo__electric_power_projections_regional)

Projected electricity sales by customer class.

pudl.transform.eiaaeo.core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type(raw_eiaaeo__electric_power_projections_regional)

Projected generation capacity + gross generation in end-use sectors.

This includes data that’s reported by fuel type and ignores data that’s only reported at the system-wide level, such as total generation, sales to grid, and generation for own use. Those three facts are reported in core_eiaaeo__yearly_projected_generation_in_end_use_sectors instead.

pudl.transform.eiaaeo.core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type(raw_eiaaeo__electric_power_projections_regional)

Projected fuel cost for the electric power sector.

Includes 2022 US dollars per million BTU and nominal US dollars per million BTU.

In future report years, the base year for the real cost will change, so we store that base year as well.

class pudl.transform.eiaaeo.AeoCheckSpec

Define some simple checks that can run on any AEO asset.

name: str
asset: str
num_rows_by_report_year: dict[int, int]
category_counts: dict[str, int]

pudl.transform.eiaaeo.BASE_AEO_CATEGORIES

pudl.transform.eiaaeo.check_specs

pudl.transform.eiaaeo.make_check(spec: AeoCheckSpec) → dagster.AssetChecksDefinition

Turn the AeoCheckSpec into an actual Dagster asset check.
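To give a feel for what a spec like this could verify, here is a plain-pandas sketch that deliberately leaves out the Dagster wiring. The class and function names are hypothetical, and two things are assumed: that the asset has a report_year column, and that category_counts maps a column name to the expected number of distinct values in it.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CheckSpecSketch:
    """Hypothetical stand-in mirroring the AeoCheckSpec fields."""

    name: str
    asset: str
    num_rows_by_report_year: dict[int, int]
    category_counts: dict[str, int]


def find_failures(spec: CheckSpecSketch, df: pd.DataFrame) -> list[str]:
    """Return a description of every expectation in the spec that df violates."""
    failures = []
    rows_per_year = df.groupby("report_year").size()
    for year, expected in spec.num_rows_by_report_year.items():
        actual = int(rows_per_year.get(year, 0))
        if actual != expected:
            failures.append(f"{year}: expected {expected} rows, found {actual}")
    for column, expected in spec.category_counts.items():
        # Assumption: the count is the number of distinct values in the column.
        actual = int(df[column].nunique())
        if actual != expected:
            failures.append(f"{column}: expected {expected} values, found {actual}")
    return failures


# Made-up asset data and specs.
asset_df = pd.DataFrame({"report_year": [2023, 2023, 2023], "region": ["a", "b", "b"]})
good_spec = CheckSpecSketch("rows_and_regions", "some_asset", {2023: 3}, {"region": 2})
bad_spec = CheckSpecSketch("rows_and_regions", "some_asset", {2023: 4}, {"region": 2})
good_failures = find_failures(good_spec, asset_df)
bad_failures = find_failures(bad_spec, asset_df)
```

A function like make_check would presumably wrap this kind of comparison in a Dagster asset check and report whether it passed.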

pudl.transform.eiaaeo._checks