Clean and normalize EIA bulk electricity data.

EIA’s bulk electricity data contains 680,000 timeseries. These timeseries contain a variety of measures (fuel amount and cost are just two) across multiple levels of aggregation, from individual plants to national averages.

The data is distributed as a single 1.1 GB text file of line-delimited JSON, with one line per timeseries. Each JSON record has two nested levels: the top level contains metadata describing the series, and the second level (under the “data” key) contains an array of timestamp/value pairs. This structure leads to a natural normalization into two tables: one of metadata and one of timeseries. That is the format delivered by the extract module.
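The two-table normalization described above can be sketched as follows. This is a minimal illustration, not the extract module's actual code: only the “series_id” and “data” keys come from the description, while the “units” field, the sample values, and the names `raw_lines` and `normalize` are assumptions made for the example.

```python
import json

import pandas as pd

# Two made-up records in the shape described above. Only "series_id" and
# "data" are taken from the text; "units" and all values are illustrative.
raw_lines = [
    '{"series_id": "ELEC.COST_BTU.NG-US-2.Q", "units": "dollars per mmbtu", '
    '"data": [["2021Q2", 3.1], ["2021Q1", 2.9]]}',
    '{"series_id": "ELEC.RECEIPTS_BTU.NG-US-2.Q", "units": "billion btu", '
    '"data": [["2021Q2", 100.0]]}',
]


def normalize(lines):
    """Split line-delimited JSON records into metadata and timeseries tables."""
    metadata_rows = []
    timeseries_rows = []
    for line in lines:
        record = json.loads(line)
        # Remove the nested array so the remaining keys are pure metadata.
        series = record.pop("data")
        metadata_rows.append(record)
        # One output row per timestamp/value pair, keyed by series_id.
        for timestamp, value in series:
            timeseries_rows.append(
                {"series_id": record["series_id"], "date": timestamp, "value": value}
            )
    return pd.DataFrame(metadata_rows), pd.DataFrame(timeseries_rows)


metadata, timeseries = normalize(raw_lines)
```

Here `metadata` has one row per series and `timeseries` has one row per observation; the two tables can be joined again on “series_id”.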

The transform module parses a compound primary key out of the long string IDs (“series_id”). The remaining metadata fields are of little value, so they are neither transformed nor returned.

The EIA aggregates are related to their component categories via a set of association tables defined in pudl.metadata.dfs. For example, the “all_coal” fuel aggregate is linked to all the coal-related energy_source_code values: BIT, SUB, LIG, and WC. Similar relationships are defined for aggregates over fuel, sector, geography, and time.
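As a rough illustration of how such an association table can be used, the sketch below links the “all_coal” aggregate to its component codes and sums plant-level values up to the aggregate. The table name `fuel_agg_assn`, the `receipts` frame, and its numbers are hypothetical; the real tables are defined in pudl.metadata.dfs and may be shaped differently.

```python
import pandas as pd

# Hypothetical association table: one row per (aggregate, component) pair.
# The codes themselves are the ones listed in the text above.
fuel_agg_assn = pd.DataFrame(
    {
        "fuel_agg": ["all_coal"] * 4,
        "energy_source_code": ["BIT", "SUB", "LIG", "WC"],
    }
)

# Made-up component-level records, including one non-coal fuel (NG).
receipts = pd.DataFrame(
    {
        "energy_source_code": ["BIT", "SUB", "NG", "WC"],
        "fuel_received_mmbtu": [10.0, 20.0, 5.0, 1.0],
    }
)

# An inner join keeps only the records that belong to some aggregate
# and tags each of them with that aggregate's name.
tagged = receipts.merge(fuel_agg_assn, on="energy_source_code", how="inner")

# Summing by aggregate gives a value comparable to EIA's own aggregate.
totals = tagged.groupby("fuel_agg")["fuel_received_mmbtu"].sum()
```

The NG row drops out of the join because it has no entry in the association table, so only the coal-related records contribute to the “all_coal” total.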

Module Contents


_extract_keys_from_series_id(→ pandas.DataFrame)

Parse primary key codes from EIA series_id.

_map_key_codes_to_readable_values(→ pandas.DataFrame)

_transform_timeseries(→ pandas.DataFrame)

Transform raw timeseries.

transform(→ dict[str, pandas.DataFrame])

Transform raw EIA bulk electricity aggregates.

pudl.transform.eia_bulk_elec._extract_keys_from_series_id(raw_df: pandas.DataFrame) → pandas.DataFrame

Parse primary key codes from EIA series_id.

These codes comprise the compound primary key that uniquely identifies a data series: (metric, fuel, region, sector, frequency).
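The parsing step might look something like the sketch below. It is not the function's actual implementation: it assumes the IDs follow the layout `ELEC.<metric>.<fuel>-<region>-<sector>.<frequency>`, and the names `SERIES_ID_PATTERN` and `extract_keys`, as well as the sample IDs, are hypothetical.

```python
import pandas as pd

# Assumed series_id layout: ELEC.<metric>.<fuel>-<region>-<sector>.<frequency>
# Each named group becomes one column of the compound key.
SERIES_ID_PATTERN = (
    r"^ELEC\.(?P<metric>[^.]+)\."
    r"(?P<fuel>[^-]+)-(?P<region>[^-]+)-(?P<sector>[^.]+)\."
    r"(?P<frequency>[^.]+)$"
)


def extract_keys(series_ids: pd.Series) -> pd.DataFrame:
    """Return one column per key component; non-matching IDs become NaN rows."""
    return series_ids.str.extract(SERIES_ID_PATTERN)


# Illustrative IDs written to match the assumed layout.
ids = pd.Series(["ELEC.COST_BTU.NG-US-2.Q", "ELEC.RECEIPTS_BTU.BIT-TX-98.M"])
keys = extract_keys(ids)
```

Using named groups keeps the column names next to the pattern that produces them, and an ID that does not fit the assumed layout yields a row of missing values rather than raising an error.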

pudl.transform.eia_bulk_elec._map_key_codes_to_readable_values(compound_keys: pandas.DataFrame) → pandas.DataFrame

pudl.transform.eia_bulk_elec._transform_timeseries(raw_ts: pandas.DataFrame) → pandas.DataFrame

Transform raw timeseries.

Transform to tidy format and replace the obscure series_id with a readable compound primary key.


Returns:

A dataframe with compound key (“fuel_agg”, “geo_agg”, “sector_agg”, “temporal_agg”, “report_date”) and two value columns: “fuel_received_mmbtu”, “fuel_cost_per_mmbtu”.
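To make the target shape concrete, the sketch below reshapes a long table with one row per (key, metric) pair into one row per key with the two value columns. It only illustrates the tidy output format, not how the function itself is written; the intermediate “metric” and “value” columns, the name `long_df`, and all sample values are assumptions.

```python
import pandas as pd

key_cols = ["fuel_agg", "geo_agg", "sector_agg", "temporal_agg", "report_date"]

# Made-up long-format input: two metrics reported for the same compound key.
long_df = pd.DataFrame(
    {
        "fuel_agg": ["all_coal", "all_coal"],
        "geo_agg": ["US", "US"],
        "sector_agg": ["all_sectors", "all_sectors"],
        "temporal_agg": ["annual", "annual"],
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "metric": ["fuel_received_mmbtu", "fuel_cost_per_mmbtu"],
        "value": [1000.0, 2.5],
    }
)

# Pivot so each metric becomes its own column, then restore the key
# columns from the index and drop the leftover "metric" axis label.
tidy = (
    long_df.pivot(index=key_cols, columns="metric", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
```

After the pivot, each compound key appears exactly once, with its received quantity and its cost side by side.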

pudl.transform.eia_bulk_elec.transform(raw_dfs: dict[str, pandas.DataFrame]) → dict[str, pandas.DataFrame]

Transform raw EIA bulk electricity aggregates.


Parameters:

raw_dfs – raw timeseries dataframe

Returns:

Transformed timeseries dataframe with compound key (“fuel_agg”, “geo_agg”, “sector_agg”, “temporal_agg”, “report_date”) and two value columns: “fuel_received_mmbtu”, “fuel_cost_per_mmbtu”

Return type:

dict[str, pandas.DataFrame]