pudl.extract.eia_bulk_elec

Module to extract aggregate data from the EIA bulk electricity download.

EIA’s bulk electricity data contains approximately 680,000 objects, most of which are timeseries. These timeseries cover a variety of measures (fuel amount and cost are just two) at multiple levels of aggregation, from individual plants to national averages.

The data is formatted as a single 1.1 GB text file of line-delimited JSON, with one line per object. Each JSON structure has two nested levels: the top level contains metadata describing the series, and the second level (under the “data” heading) contains an array of timestamp/value pairs. This structure leads to a natural normalization into two tables: one of metadata and one of timeseries. This module delivers the data in that two-table form.
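As a rough illustration of that normalization, the sketch below splits a couple of made-up line-delimited JSON records into a metadata list and a timeseries list. It uses only the standard library rather than pandas, and the series IDs, the field names other than series_id and data, and the normalize helper are all hypothetical, not the module's actual implementation.

```python
import json

# Two made-up records in the layout described above. Only "series_id" and
# "data" come from the description; the other fields are illustrative.
raw_lines = [
    '{"series_id": "ELEC.EXAMPLE_A.M", "name": "Example A", "units": "x",'
    ' "data": [["202001", 1.5], ["202002", 2.0]]}',
    '{"series_id": "ELEC.EXAMPLE_B.M", "name": "Example B", "units": "y",'
    ' "data": [["202001", null]]}',
]


def normalize(lines):
    """Split each record into one metadata row and many timeseries rows."""
    metadata = []
    timeseries = []
    for line in lines:
        record = json.loads(line)
        # Remove the nested array so the remainder is pure metadata.
        data = record.pop("data")
        metadata.append(record)
        # Each timestamp/value pair becomes a row keyed by series_id.
        for timestamp, value in data:
            timeseries.append(
                {
                    "series_id": record["series_id"],
                    "date": timestamp,
                    "value": value,
                }
            )
    return metadata, timeseries


metadata, timeseries = normalize(raw_lines)
```

The shared series_id column is what lets the two tables be joined back together later.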

Module Contents

Functions

_filter_for_fuel_receipts_costs_series(→ pandas.DataFrame)

Pick out the desired data series.

_filter_and_read_to_dataframe(→ pandas.DataFrame)

Decompress and filter the 1100 MB file down to the 16 MB we actually want.

_parse_data_column(→ pandas.DataFrame)

_extract(→ dict[str, pandas.DataFrame])

Extract metadata and timeseries from raw EIA bulk electricity data.

extract(→ dict[str, pandas.DataFrame])

Extract metadata and timeseries from raw EIA bulk electricity data.

pudl.extract.eia_bulk_elec._filter_for_fuel_receipts_costs_series(df: pandas.DataFrame) → pandas.DataFrame

Pick out the desired data series.

Fuel receipts and costs series make up about 1% of the total lines. This function keeps only the series whose series_id contains “RECEIPTS_BTU” or “COST_BTU”.

Of the approximately 680,000 objects in the dataset, about 19,000 represent things other than data series (such as category definitions or plot axes). Those non-series objects do not have a field called series_id. The except KeyError: clause handles that situation.
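A minimal sketch of that filtering logic follows. It works on plain dictionaries instead of a DataFrame, and the helper name and the example records are hypothetical; only the two substrings and the KeyError handling come from the description above.

```python
def is_fuel_receipts_costs_series(record):
    """Return True if the record is a fuel receipts or costs series."""
    try:
        series_id = record["series_id"]
    except KeyError:
        # Non-series objects (category definitions, plot axes, ...) have no
        # series_id field, so they are never kept.
        return False
    return "RECEIPTS_BTU" in series_id or "COST_BTU" in series_id


# Made-up records: two wanted series, one unrelated series, one non-series.
records = [
    {"series_id": "ELEC.RECEIPTS_BTU.EXAMPLE.M"},
    {"series_id": "ELEC.COST_BTU.EXAMPLE.M"},
    {"series_id": "ELEC.OTHER_MEASURE.EXAMPLE.M"},
    {"category_id": "0", "name": "Example category"},
]

kept = [r for r in records if is_fuel_receipts_costs_series(r)]
```

Here kept holds only the first two records; the unrelated series fails the substring test and the category object is dropped by the except branch.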

pudl.extract.eia_bulk_elec._filter_and_read_to_dataframe(raw_zipfile: pathlib.Path) → pandas.DataFrame

Decompress and filter the 1100 MB file down to the 16 MB we actually want.

This produces a dataframe with all text fields. The timeseries data is left as JSON strings in the ‘data’ column. The other columns are metadata.
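One way to picture this step is the standard-library sketch below, which streams a zipped line-delimited JSON file, keeps only matching objects, and leaves the data field as a JSON string. It assumes the archive holds a single text file; the function name, the keep callback, the member name bulk.txt, and the sample records are all hypothetical, and the result is a list of dictionaries rather than a DataFrame.

```python
import io
import json
import zipfile


def filter_zip_to_records(zip_source, keep):
    """Stream a zipped JSON-lines file and return only the wanted objects."""
    records = []
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the archive contains exactly one text file.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            # Decode line by line so the whole file is never held in memory.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if not line.strip():
                    continue
                obj = json.loads(line)
                if keep(obj):
                    # Leave the timeseries as a JSON string for later parsing.
                    obj["data"] = json.dumps(obj["data"])
                    records.append(obj)
    return records


# Build a tiny in-memory archive with made-up content to exercise the sketch.
sample_lines = [
    '{"series_id": "ELEC.COST_BTU.EXAMPLE.M", "data": [["202001", 2.5]]}',
    '{"series_id": "ELEC.OTHER_MEASURE.EXAMPLE.M", "data": [["202001", 9.0]]}',
    '{"category_id": "0", "name": "Example category"}',
]
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("bulk.txt", "\n".join(sample_lines))
buffer.seek(0)

kept = filter_zip_to_records(
    buffer, lambda obj: "COST_BTU" in obj.get("series_id", "")
)
```

Filtering while streaming is what keeps memory use proportional to the small retained subset rather than to the full decompressed file.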

pudl.extract.eia_bulk_elec._parse_data_column(elec_df: pandas.DataFrame) → pandas.DataFrame

pudl.extract.eia_bulk_elec._extract(raw_zipfile) → dict[str, pandas.DataFrame]

Extract metadata and timeseries from raw EIA bulk electricity data.

Parameters:

raw_zipfile – Path or other file-like object that can be read by pd.read_json()

Returns:

Dictionary of dataframes with keys ‘metadata’ and ‘timeseries’

pudl.extract.eia_bulk_elec.extract(ds: pudl.workspace.datastore.Datastore) → dict[str, pandas.DataFrame]

Extract metadata and timeseries from raw EIA bulk electricity data.

Parameters:

ds – Datastore object

Returns:

Dictionary of dataframes with keys ‘metadata’ and ‘timeseries’