pudl.extract.eiaaeo

Extract EIA AEO data from the bulk JSON.

Module Contents

Classes

AEOCategory

Describe how the AEO data is categorized.

AEOSeries

Describe actual AEO timeseries data.

AEOTable

Data schema for a raw AEO table.

AEOTaxonomy

Container for all the information in one AEO report.

Functions

raw_eiaaeo(context)

Extract tables from EIA's Annual Energy Outlook.

raw_table_54_invariants(→ dagster.AssetCheckResult)

Check that the AEO Table 54 raw data conforms to some assumptions.

Attributes

pudl.extract.eiaaeo.logger

class pudl.extract.eiaaeo.AEOCategory(/, **data: Any)

Bases: pydantic.BaseModel

Describe how the AEO data is categorized.

Categories are the basic way to represent metadata that is shared across multiple data series.

category_id: int
parent_category_id: int
name: str
notes: str
childseries: list[str]

class pudl.extract.eiaaeo.AEOSeries(/, **data: Any)

Bases: pydantic.BaseModel

Describe actual AEO timeseries data.

This includes the data itself as well as some timeseries-specific metadata that may not be shared across multiple timeseries.

series_id: str
name: str
last_updated: str
units: str | None
data: list[tuple[str, str | float]]

class pudl.extract.eiaaeo.AEOTable

Bases: pandera.DataFrameModel

Data schema for a raw AEO table.

date: pandera.dtypes.Timestamp
value: str
units: str
series_name: str
category_name: str
case: str

class pudl.extract.eiaaeo.AEOTaxonomy(records: collections.abc.Iterable[str])

Container for all the information in one AEO report.

AEO reports are composed of categories, which are metadata about multiple data series, and series, which are the actual data plus the metadata associated with one specific time series.

The categories and series form a DAG structure with 5 generations: root, case, subject, leaf category, and data series.

The first generation is the root: a single nameless node from which all other nodes descend.

The second generation is the “cases.” Cases are different scenarios within the AEO, with names like “Reference case,” “High Economic Growth,” and “Low Oil and Gas Supply.” All direct children of the root node are cases.

The third generation is the “subjects.” These are high-level tags with names like “Energy Prices” and “Energy Consumption.” They are largely used for filtering in the AEO data UI, so we ignore them.

The fourth generation is the “leaf categories.” These are named things like “Table 54. Electric Power Projections by Electricity Market Module Region, United States” and have a long list of “child series” which actually contain the data. In other words, these leaf categories map the notion of an AEO “table” to the actual data.

The fifth generation is the “data series.” These actually contain the data points and have no children. They have names like “Electricity : Electric Power Sector : Cumulative Planned Additions : Coal” and “Coal Supply : Delivered Prices : Electric Power.” As you can see, the names imply several different dimensions, which we don’t try to make sense of in the extract step.

The first four generations form a strictly branching tree, but many leaf categories can point at the same data series, so the whole taxonomy is a DAG rather than a tree. There are two reasons for this:

  • The subject tag doesn’t affect data values, but because of the tree structure each leaf category is repeated once per subject, so multiple duplicate leaf categories point at the same data series.

  • Some data series are relevant to several tables, so multiple distinct leaf categories point at the same data series. In this case we would expect the names of the leaf categories to reflect their different identities.

Note, also, that there is no structural notion of a “Table” in the AEO data. That information is carried purely by the names of the leaf categories.
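To make the shape concrete, here is a toy version of the taxonomy built from plain dictionaries. Every node name and ID below is invented, and the real class builds a networkx.DiGraph instead; the sketch only shows how the five generations fall out of a breadth-first walk and why a shared data series makes the structure a DAG.

```python
# Toy edges: parent -> children. All names and IDs are made up for illustration.
edges = {
    "root": ["Reference case"],
    "Reference case": ["Energy Prices", "Energy Consumption"],
    # The same table shows up once under each subject, with a different ID.
    "Energy Prices": ["Table 54 [id 101]"],
    "Energy Consumption": ["Table 54 [id 102]"],
    # Both duplicate leaf categories point at the same data series.
    "Table 54 [id 101]": ["series A"],
    "Table 54 [id 102]": ["series A"],
}


def generations(edges: dict[str, list[str]], root: str) -> list[set[str]]:
    """Group nodes by distance from the root, one set per generation."""
    result = [{root}]
    while True:
        children = {c for node in result[-1] for c in edges.get(node, [])}
        if not children:
            return result
        result.append(children)


def in_degree(edges: dict[str, list[str]], node: str) -> int:
    """Count how many parents point at a node."""
    return sum(node in children for children in edges.values())


gens = generations(edges, "root")
```

In this toy graph `gens` has five entries, every node in the first four generations has at most one parent, and `series A` has two parents.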

class EntityType(*args, **kwds)

Bases: enum.Enum

These are the three types of entities in AEO.

ROOT = 1001
CATEGORY = 1002
SERIES = 1003

class CheckSpec

Encapsulate shared checks for the taxonomy structure.

generation: str
typecheck: collections.abc.Callable[[int | str], bool]
in_degree: collections.abc.Callable[[int], bool]
out_degree: collections.abc.Callable[[int], bool]

__load_records(records: collections.abc.Iterable[str]) → tuple[dict[int, AEOCategory], dict[str, AEOSeries]]

Read AEO JSON blob into memory.

A single JSON object can represent either a category or a series, so we parse those into two separate mappings.
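As a rough illustration of that split, the sketch below parses a few JSON lines into two dictionaries. It uses only the standard library rather than the pydantic models, and the function name `split_records` and the sample records are hypothetical. It also assumes that a series record carries a `series_id` key and a category record carries a `category_id` key, which matches the field names documented above but is not confirmed here for the raw bulk file.

```python
import json
from collections.abc import Iterable


def split_records(records: Iterable[str]) -> tuple[dict[int, dict], dict[str, dict]]:
    """Parse JSON lines into separate category and series mappings."""
    categories: dict[int, dict] = {}
    series: dict[str, dict] = {}
    for line in records:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # Assumption: the ID key that is present tells us the record type.
        if "series_id" in obj:
            series[str(obj["series_id"])] = obj
        elif "category_id" in obj:
            categories[int(obj["category_id"])] = obj
        else:
            raise ValueError(f"Unrecognized record: {line!r}")
    return categories, series


# Invented sample records, one of each kind.
sample = [
    '{"category_id": "10", "parent_category_id": "1", "name": "Reference case",'
    ' "notes": "", "childseries": []}',
    '{"series_id": "AEO.EXAMPLE.A", "name": "Example series",'
    ' "last_updated": "2023-01-01", "units": "GW", "data": [["2025", 1.5]]}',
]
categories, series = split_records(sample)
```

Keying categories by integer ID and series by string ID mirrors the return type in the signature above.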

__generate_graph(categories: dict[int, AEOCategory], series: dict[str, AEOSeries]) → networkx.DiGraph

Stitch categories and series together into a DAG.

__generation_invariants() → list

Check that the graph behaves the way we expect.

We have a few generic checks for all generations - node type, in-degree, and out-degree.

We also have bespoke checks for individual generations as needed.

Returns the list of generations for further manipulation.
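A minimal sketch of how such generic checks could be expressed, assuming a dataclass named `Check` (hypothetical, loosely mirroring the `CheckSpec` attributes) and a toy graph stored as adjacency dictionaries. The degree rules below are illustrative guesses based on the generation descriptions, not the rules the real method applies.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Check:
    """Hypothetical stand-in for CheckSpec: one set of rules per generation."""

    generation: str
    typecheck: Callable[[object], bool]
    in_degree: Callable[[int], bool]
    out_degree: Callable[[int], bool]


def is_category(node: object) -> bool:
    return isinstance(node, int)


def is_series(node: object) -> bool:
    return isinstance(node, str)


# Toy graph: category IDs are ints, series IDs are strings.
children = {0: [1], 1: [2], 2: [3], 3: ["s1", "s2"], "s1": [], "s2": []}
parents = {0: [], 1: [0], 2: [1], 3: [2], "s1": [3], "s2": [3]}
generations = [[0], [1], [2], [3], ["s1", "s2"]]

# Illustrative rules only.
checks = [
    Check("root", is_category, lambda d: d == 0, lambda d: d >= 1),
    Check("case", is_category, lambda d: d == 1, lambda d: d >= 1),
    Check("subject", is_category, lambda d: d == 1, lambda d: d >= 1),
    Check("leaf category", is_category, lambda d: d == 1, lambda d: d >= 1),
    Check("data series", is_series, lambda d: d >= 1, lambda d: d == 0),
]


def violations(checks, generations, parents, children) -> list[str]:
    """Apply each generation's generic checks and collect failures."""
    problems = []
    for check, nodes in zip(checks, generations):
        for node in nodes:
            if not check.typecheck(node):
                problems.append(f"{check.generation}: {node!r} has wrong type")
            if not check.in_degree(len(parents[node])):
                problems.append(f"{check.generation}: {node!r} has bad in-degree")
            if not check.out_degree(len(children[node])):
                problems.append(f"{check.generation}: {node!r} has bad out-degree")
    return problems


problems = violations(checks, generations, parents, children)
```

With the toy graph as written, `problems` is empty. Removing the parent of `s1` from `parents` would produce a single in-degree failure for the data series generation.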

__sanitize(s: str) → str

__series_to_records(series_id: str, potential_parents: set[int]) → pandas.DataFrame

Turn a data series into records we can feed into a DataFrame.

This uses graph ancestor data to figure out what case this series belongs to.

This series may be associated with multiple different tables in the graph. In that case, we need to filter down to only the leaf categories that are relevant to the table we’re creating a DataFrame for. We do that by passing in potential_parents as a parameter.
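The filtering step can be sketched with plain dictionaries. Everything here is a simplified stand-in: the function names `case_of` and `series_to_rows`, the `parent` and `leaf_parents` keys, and the sample values are invented, and the real method walks a networkx graph and returns a pandas DataFrame. The output keys follow the AEOTable columns.

```python
# Toy category tree; IDs, names, and the "parent" key are invented.
categories = {
    1: {"name": "", "parent": None},  # root
    10: {"name": "Reference case", "parent": 1},
    20: {"name": "Energy Prices", "parent": 10},
    30: {"name": "Table 54. Electric Power Projections", "parent": 20},
    31: {"name": "Table 1. Hypothetical Other Table", "parent": 20},
}

# One series that two different leaf categories (30 and 31) point at.
series = {
    "name": "Electricity : Electric Power Sector : Cumulative Planned Additions : Coal",
    "units": "GW",  # invented unit
    "data": [("2025", 1.5), ("2026", "NA")],  # invented data points
    "leaf_parents": {30, 31},
}


def case_of(category_id: int) -> str:
    """Walk up the tree until we reach the direct child of the root."""
    node = categories[category_id]
    while categories[node["parent"]]["parent"] is not None:
        node = categories[node["parent"]]
    return node["name"]


def series_to_rows(series: dict, potential_parents: set[int]) -> list[dict]:
    """Emit one row per data point for each relevant leaf category."""
    rows = []
    # Keep only the leaf categories that belong to the table being built.
    for leaf_id in sorted(series["leaf_parents"] & potential_parents):
        for date, value in series["data"]:
            rows.append(
                {
                    "date": date,
                    "value": str(value),
                    "units": series["units"],
                    "series_name": series["name"],
                    "category_name": categories[leaf_id]["name"],
                    "case": case_of(leaf_id),
                }
            )
    return rows


rows = series_to_rows(series, potential_parents={30})
```

Because only category 30 is passed in, the rows attached to the other table (category 31) are left out even though the series belongs to both.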

get_table(table_number: int) → pandas.DataFrame

Get a specific table number as a DataFrame.
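Since the table identity lives only in leaf category names such as “Table 54. Electric Power Projections by Electricity Market Module Region, United States”, one plausible way to find the relevant leaf categories is to match on the name prefix. This is a sketch under that assumption, not the method's actual logic; `leaf_ids_for_table` and the sample IDs are hypothetical.

```python
import re


def leaf_ids_for_table(names: dict[int, str], table_number: int) -> set[int]:
    """Return IDs of categories whose name starts with 'Table <n>.'."""
    # The trailing literal dot keeps "Table 5." from matching "Table 54."
    pattern = re.compile(rf"^Table {table_number}\.")
    return {cid for cid, name in names.items() if pattern.match(name)}


# Invented category IDs; 30 and 32 stand for the same table under two subjects.
names = {
    30: "Table 54. Electric Power Projections by Electricity Market Module Region, United States",
    31: "Table 5. Hypothetical Other Table",
    32: "Table 54. Electric Power Projections by Electricity Market Module Region, United States",
    40: "Energy Prices",
}

table_54_ids = leaf_ids_for_table(names, 54)
```

The resulting set of IDs is the kind of value that could be passed as `potential_parents` when converting each child series to records.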

pudl.extract.eiaaeo.raw_eiaaeo(context: dagster.AssetExecutionContext)

Extract tables from EIA’s Annual Energy Outlook.

We first extract a taxonomy from the AEO JSON blob, which connects individual data series to “categories”. Some categories are associated with a specific table; others are associated with an AEO case or subject.

The AEO cases are different scenarios such as “High Economic Growth” or “High Oil Price.” They include “Reference” and “2022 AEO reference case” as well.

The AEO subjects are only used for filtering which tables are relevant to which subjects, e.g. “Table 54 is relevant to Energy Prices,” so we ignore them for now.

The series each have their own timeseries data, as well as some metadata such as a series name and units. Many different dimensions can be inferred from the series names, but the data is somewhat heterogeneous so we do not try to infer those here and leave that to the transformation step.

pudl.extract.eiaaeo.raw_table_54_invariants(df: pandas.DataFrame) → dagster.AssetCheckResult

Check that the AEO Table 54 raw data conforms to some assumptions.