pudl.transform.eia

Code for transforming EIA data that pertains to more than one EIA Form.

This module helps normalize EIA datasets and infers additional connections between EIA entities (e.g. utilities, plants, units, generators…). This includes:

  • compiling a master list of plant, utility, boiler, and generator IDs that appear in any of the EIA 860 or 923 tables.

  • inferring more complete boiler-generator associations.

  • differentiating between static and time varying attributes associated with the EIA entities, storing the static fields with the entity table, and the variable fields in an annual table.

The boiler generator association (BGA) inference takes the associations provided by the EIA 860 and expands on them using several methods, which can be found in pudl.transform.eia._boiler_generator_assn().

Module Contents

Functions

find_timezone(*[, lng, lat, state, strict])

Find the timezone associated with a specified input location.

_occurrence_consistency(entity_id, compiled_df, col, ...)

Find the occurrence of plants & the consistency of records.

_lat_long(dirty_df, clean_df, entity_id_df, entity_id, ...)

Harvests more complete lat/long in special cases.

_add_timezone(→ pandas.DataFrame)

Add plant IANA timezone based on lat/lon or state if lat/lon is unavailable.

_add_additional_epacems_plants(plants_entity)

Adds the info for plants that have IDs in the CEMS data but not EIA data.

_compile_all_entity_records(entity, eia_transformed_dfs)

Compile all of the entity records from each table they appear in.

_manage_strictness(col, eia860m)

Manage the strictness level for each column.

harvesting(→ tuple)

Compile consistent records for various entities.

_boiler_generator_assn(eia_transformed_dfs[, ...])

Creates a set of more complete boiler generator associations.

_restrict_years(df[, eia923_years, eia860_years])

Restricts eia years for boiler generator association.

transform(eia_transformed_dfs[, eia_settings, debug])

Creates DataFrames for EIA Entity tables and modifies EIA tables.

Attributes

logger

TZ_FINDER

A global TimezoneFinder to cache geographies in memory for faster access.

APPROXIMATE_TIMEZONES

Approximate mapping of US & Canadian jurisdictions to canonical timezones

pudl.transform.eia.logger
pudl.transform.eia.TZ_FINDER

A global TimezoneFinder to cache geographies in memory for faster access.

pudl.transform.eia.APPROXIMATE_TIMEZONES: dict[str, str]

Approximate mapping of US & Canadian jurisdictions to canonical timezones

This is imperfect for states that have split timezones. See: https://en.wikipedia.org/wiki/List_of_time_offsets_by_U.S._state_and_territory For states that are split, we use the timezone that has more people in it. The list of valid timezones is in pytz.common_timezones. For Canada, see: https://en.wikipedia.org/wiki/Time_in_Canada#IANA_time_zone_database

pudl.transform.eia.find_timezone(*, lng=None, lat=None, state=None, strict=True)

Find the timezone associated with a specified input location.

Note that this function requires named arguments. The names are lng, lat, and state. lng and lat must be provided, but they may be NA. state isn’t required, and isn’t used unless lng/lat are NA or timezonefinder can’t find a corresponding timezone.

Timezones based on states are imprecise, so it’s far better to use lng/lat if possible. If strict is True, state will not be used. More on state-to-timezone conversion here: https://en.wikipedia.org/wiki/List_of_time_offsets_by_U.S._state_and_territory

Parameters:
  • lng (int or float in [-180,180]) – Longitude, in decimal degrees

  • lat (int or float in [-90, 90]) – Latitude, in decimal degrees

  • state (str) – Abbreviation for US state or Canadian province

  • strict (bool) – Raise an error if no timezone is found?

Returns:

The timezone (as an IANA string) for that location.

Return type:

str

Todo

Update docstring.
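The fallback order described above (coordinates first, state only when not strict) can be sketched as follows. This is a minimal illustration, not the module's implementation: find_timezone_sketch, STATE_TIMEZONES, and the coordinate_lookup parameter are hypothetical names, the coordinate lookup is a stand-in for a real timezonefinder query, and returning None in non-strict mode when nothing matches is an assumption.

```python
import math
from typing import Callable, Optional

# Hypothetical excerpt of a state-to-timezone table. The real
# APPROXIMATE_TIMEZONES mapping covers many more jurisdictions.
STATE_TIMEZONES = {"CO": "America/Denver", "NY": "America/New_York"}


def _is_missing(value) -> bool:
    """Treat None and NaN as missing coordinates."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_timezone_sketch(
    *,
    lng=None,
    lat=None,
    state: Optional[str] = None,
    strict: bool = True,
    # Stand-in for a coordinate-based lookup; by default it finds nothing.
    coordinate_lookup: Callable[..., Optional[str]] = lambda lng, lat: None,
) -> Optional[str]:
    """Prefer lng/lat; fall back to the state only when strict is False."""
    tz = None
    if not (_is_missing(lng) or _is_missing(lat)):
        tz = coordinate_lookup(lng=lng, lat=lat)
    if tz is None:
        if strict:
            # In strict mode the state is never consulted.
            raise ValueError(
                f"Can't find timezone for: lng={lng}, lat={lat}, state={state}"
            )
        # Assumption: an unknown state yields None rather than an error.
        tz = STATE_TIMEZONES.get(state)
    return tz
```

Keeping the arguments keyword-only, as the documented signature does, avoids silently swapping longitude and latitude at the call site.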

pudl.transform.eia._occurrence_consistency(entity_id, compiled_df, col, cols_to_consit, strictness=0.7)

Find the occurrence of plants & the consistency of records.

We need to determine how consistently a value is reported across all of the years or tables in which it appears, so we compile two key numbers: the number of occurrences of the entity and the number of occurrences of each reported record for each entity. With that information we can determine whether the reported records are consistent enough to meet the strictness threshold.

Parameters:
  • entity_id (list) – a list of the id(s) for the entity. Ex: for a plant entity, the entity_id is [‘plant_id_eia’]. For a generator entity, the entity_id is [‘plant_id_eia’, ‘generator_id’].

  • compiled_df (pandas.DataFrame) – a dataframe with every instance of the column we are trying to harvest.

  • col (str) – the column name of the column we are trying to harvest.

  • cols_to_consit (list) – a list of the columns to determine consistency. This is either the [entity_id] or the [entity_id, ‘report_date’], depending on whether the entity is static or annual.

  • strictness (float) – How consistent do you want the column records to be? The default setting is .7 (so 70% of the records need to be consistent in order to accept harvesting the record).

Returns:

A transformed version of compiled_df with NaNs removed and with new columns containing information about the consistency of the reported values.

Return type:

pandas.DataFrame
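The two counts described above, and the rate derived from them, can be illustrated with plain Python. This is a sketch under stated assumptions rather than the function's actual code: the real function operates on a pandas.DataFrame, while this version takes (entity_id, value) pairs, and the function name, the output keys, and the use of >= for the threshold comparison are all hypothetical choices.

```python
from collections import Counter


def occurrence_consistency_sketch(records, strictness=0.7):
    """Summarize how consistently each value is reported per entity.

    ``records`` is a list of (entity_id, value) pairs. Null values
    (None here, NaN in a dataframe) are dropped before counting.
    """
    clean = [(entity, value) for entity, value in records if value is not None]
    # How many non-null records exist for each entity.
    entity_occurrences = Counter(entity for entity, _ in clean)
    # How many times each distinct (entity, value) record was reported.
    record_occurrences = Counter(clean)

    summary = []
    for (entity, value), n_record in record_occurrences.items():
        n_entity = entity_occurrences[entity]
        rate = n_record / n_entity
        summary.append(
            {
                "entity_id": entity,
                "value": value,
                "entity_occurrences": n_entity,
                "record_occurrences": n_record,
                "consistent_rate": rate,
                # Assumption: "at least 70%" is read as an inclusive threshold.
                "consistent": rate >= strictness,
            }
        )
    return summary
```

For a plant reported in "CO" four times and in "WY" once, the "CO" record has a rate of 0.8 and passes the default threshold, while "WY" at 0.2 does not.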

pudl.transform.eia._lat_long(dirty_df, clean_df, entity_id_df, entity_id, col, cols_to_consit, round_to=2)

Harvests more complete lat/long in special cases.

For all of the entities where there is not a consistent enough reported record for latitude and longitude, this function reduces the precision of the reported lat/long by rounding down the reported records in order to get a more complete set of consistent records.

Parameters:
  • dirty_df (pandas.DataFrame) – a dataframe with entity records that have inconsistently reported lat/long.

  • clean_df (pandas.DataFrame) – a dataframe with entity records that have consistently reported lat/long.

  • entity_id_df (pandas.DataFrame) – a dataframe with a complete set of possible entity ids

  • entity_id (list) – a list of the id(s) for the entity. Ex: for a plant entity, the entity_id is [‘plant_id_eia’]. For a generator entity, the entity_id is [‘plant_id_eia’, ‘generator_id’].

  • col (string) – the column name of the column we are trying to harvest.

  • cols_to_consit (list) – a list of the columns to determine consistency. This is either the [entity_id] or the [entity_id, ‘report_date’], depending on whether the entity is static or annual.

  • round_to (integer) – The number of decimal places to preserve while rounding down.

Returns:

A dataframe with all of the entity ids. Some will have harvested records from the clean_df, some will have harvested records that were found after rounding, and some will have NaNs if no consistently reported records were found.

Return type:

pandas.DataFrame
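The effect of reducing precision can be seen on a single entity's reported latitudes. The sketch below is illustrative only: harvest_rounded is a hypothetical helper, it uses Python's built-in round() (round to nearest) as a stand-in for whatever rounding the module applies, and it reuses the 70% threshold as an inclusive cutoff.

```python
from collections import Counter


def harvest_rounded(values, round_to=2, strictness=0.7):
    """Return the dominant value after rounding, or None if none dominates."""
    reported = [value for value in values if value is not None]
    if not reported:
        return None
    # Nearby coordinates collapse onto the same rounded value.
    counts = Counter(round(value, round_to) for value in reported)
    value, count = counts.most_common(1)[0]
    return value if count / len(reported) >= strictness else None


# At full precision no two of these values agree; after rounding to two
# decimal places, three of the four collapse onto one value.
latitudes = [39.7392, 39.7401, 39.7388, 40.1]
harvested = harvest_rounded(latitudes)
```

The trade-off is deliberate: a coordinate that is accurate to two decimal places is more useful than a null.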

pudl.transform.eia._add_timezone(plants_entity: pandas.DataFrame) → pandas.DataFrame

Add plant IANA timezone based on lat/lon or state if lat/lon is unavailable.

Parameters:

plants_entity – Plant entity table, including columns named “latitude”, “longitude”, and optionally “state”

Returns:

A DataFrame containing the same table, with a “timezone” column added. Timezone may be missing if lat / lon is missing or invalid.

pudl.transform.eia._add_additional_epacems_plants(plants_entity)

Adds the info for plants that have IDs in the CEMS data but not EIA data.

The columns loaded are plant_id_eia, plant_name, state, latitude, and longitude. Note that a side effect will be resetting the index on plants_entity, if one exists. If that’s a problem, modify the code below.

Note that some of these plants disappear from the CEMS before the earliest EIA data PUDL processes, so if PUDL eventually ingests older data, these may be redundant.

The set of additional plants is every plant that appears in the hourly CEMS data (1995-2017) that never appears in the EIA 923 or 860 data (2009-2017 for EIA 923, 2011-2017 for EIA 860).

Parameters:

plants_entity (pandas.DataFrame) – The plant entity table to which the additional plants are appended.

Returns:

The same plants_entity table, with the addition of some missing EPA CEMS plants.

Return type:

pandas.DataFrame
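Conceptually, the additional plants are a set difference on plant_id_eia. A minimal sketch, assuming both tables are represented as lists of row dictionaries rather than dataframes, and using a hypothetical function name:

```python
def add_missing_cems_plants(plants_entity, cems_plants):
    """Append CEMS plants whose IDs never appear in the EIA plant table."""
    known_ids = {row["plant_id_eia"] for row in plants_entity}
    missing = [row for row in cems_plants if row["plant_id_eia"] not in known_ids]
    # Build a new list so the caller's table is left unmodified.
    return plants_entity + missing


# Hypothetical rows for illustration only.
eia_plants = [{"plant_id_eia": 1, "state": "CO"}]
cems_only = [{"plant_id_eia": 1, "state": "CO"}, {"plant_id_eia": 99, "state": "TX"}]
combined = add_missing_cems_plants(eia_plants, cems_only)
```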

pudl.transform.eia._compile_all_entity_records(entity, eia_transformed_dfs)

Compile all of the entity records from each table they appear in.

Comb through each of the dataframes in the eia_transformed_dfs dictionary to pull out every instance of the entity id.
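The combing step can be pictured as a loop over the table dictionary that keeps only rows carrying the full set of ID columns. This is a sketch with plain dictionaries standing in for dataframes; the function name and the added "table" key, which records where each row came from, are assumptions made for illustration.

```python
def compile_entity_records(id_cols, tables):
    """Collect every row that carries all of the entity's ID columns.

    ``tables`` maps table names to lists of row dictionaries.
    """
    compiled = []
    for table_name, rows in tables.items():
        for row in rows:
            if all(col in row for col in id_cols):
                record = dict(row)  # copy so the source table is untouched
                record["table"] = table_name  # hypothetical provenance column
                compiled.append(record)
    return compiled


# Hypothetical tables: only the first two contain generator IDs.
tables = {
    "generators_eia860": [{"plant_id_eia": 1, "generator_id": "G1", "state": "CO"}],
    "generation_eia923": [{"plant_id_eia": 1, "generator_id": "G1"}],
    "plants_eia860": [{"plant_id_eia": 1, "state": "CO"}],
}
generator_records = compile_entity_records(["plant_id_eia", "generator_id"], tables)
```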

pudl.transform.eia._manage_strictness(col, eia860m)

Manage the strictness level for each column.

Parameters:
  • col (str) – name of column

  • eia860m (boolean) – if True, the ETL run is attempting to include year-to-date updates from EIA 860M.

pudl.transform.eia.harvesting(entity: str, eia_transformed_dfs: dict[str, pandas.DataFrame], entities_dfs: dict[str, pandas.DataFrame], eia860m: bool = False, debug: bool = False) → tuple

Compile consistent records for various entities.

For each entity (plants, generators, boilers, utilities), this function finds all the harvestable columns from any table that they show up in. It then determines how consistent the records are and keeps the values that are mostly consistent. It compiles those consistent records into one normalized table.

There are a few things to note here. First, we do not expect the outcome to be perfect. We choose the most consistent record as reported across all the EIA tables and years, but we also require a “strictness” level of 70% (this is currently a hard-coded argument for _occurrence_consistency). That means at least 70% of the records must be the same for us to use that value. If the values for an entity haven’t been reported with at least 70% consistency, the result is a null value. We built in the ability to add special cases for columns that need a different method, but the only ones we added were for latitude and longitude because they are by far the dirtiest.

We have determined which columns should be considered “static” or “annual”. These can be found in constants in the entities dictionary. Static means that it should not change over time. Annual means there is annual variability. This distinction was made in part by testing the consistency and in part by an understanding of how the entities and columns relate in the real world.

Parameters:
  • entity – plants, generators, boilers, utilities

  • eia_transformed_dfs – A dictionary of tbl names (keys) and transformed dfs (values)

  • entities_dfs – A dictionary of entity table names (keys) and entity dfs (values)

  • eia860m – if True, the ETL run is attempting to include year-to-date updates from EIA 860M.

  • debug – If True, this function will also return an additional dictionary of dataframes that includes the pre-deduplicated compiled records with the number of occurrences of the entity and the record to see consistency of reported values.

Returns:

A tuple containing:

  • eia_transformed_dfs (dict): dictionary of tbl names (keys) and transformed dfs (values)

  • entity_dfs (dict): dictionary of entity table names (keys) and entity dfs (values)

Return type:

tuple

Raises:

AssertionError – If the consistency of any record value is <90%.

Todo

  • Return to role of debug.

  • Determine what to do with null records

  • Determine how to treat mostly static records
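The difference between static and annual harvesting comes down to the grouping key: the entity ID alone, or the entity ID plus report_date. The following sketch shows that idea on lists of row dictionaries. It is not the function's implementation; harvest_column and its parameters are hypothetical, and the inclusive 70% comparison is an assumption.

```python
from collections import Counter, defaultdict


def harvest_column(rows, id_cols, col, annual=False, strictness=0.7):
    """Pick the dominant value of ``col`` per entity (or per entity-year).

    Returns a dict mapping each grouping key to the harvested value, or
    to None when no value is reported consistently enough.
    """
    key_cols = list(id_cols) + (["report_date"] if annual else [])
    reported = defaultdict(list)
    for row in rows:
        values = reported[tuple(row[c] for c in key_cols)]
        if row.get(col) is not None:
            values.append(row[col])

    harvested = {}
    for key, values in reported.items():
        if not values:
            harvested[key] = None  # nothing but nulls was reported
            continue
        value, count = Counter(values).most_common(1)[0]
        harvested[key] = value if count / len(values) >= strictness else None
    return harvested


# Hypothetical compiled records for two plants.
rows = [
    {"plant_id_eia": 1, "report_date": 2015, "state": "CO", "capacity_mw": 100.0},
    {"plant_id_eia": 1, "report_date": 2016, "state": "CO", "capacity_mw": 100.0},
    {"plant_id_eia": 1, "report_date": 2017, "state": "CO", "capacity_mw": 120.0},
    {"plant_id_eia": 1, "report_date": 2018, "state": "WY", "capacity_mw": 120.0},
    {"plant_id_eia": 2, "report_date": 2015, "state": "TX", "capacity_mw": None},
    {"plant_id_eia": 2, "report_date": 2016, "state": "OK", "capacity_mw": 50.0},
]
static_state = harvest_column(rows, ["plant_id_eia"], "state")
annual_capacity = harvest_column(rows, ["plant_id_eia"], "capacity_mw", annual=True)
```

Here plant 1 keeps "CO" as its static state because three of four reports agree, plant 2 gets a null because its two reports conflict, and capacity is kept per year because it is grouped annually.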

pudl.transform.eia._boiler_generator_assn(eia_transformed_dfs, eia923_years=DataSource.from_id('eia923').working_partitions['years'], eia860_years=DataSource.from_id('eia860').working_partitions['years'], debug=False)

Creates a set of more complete boiler generator associations.

Creates a unique unit_id_pudl for each collection of boilers and generators within a plant that have ever been associated with each other, based on the boiler generator associations reported in EIA860. Unfortunately, this information is not complete for years before 2014, as the gas turbine portion of combined cycle power plants in those earlier years were not reporting their fuel consumption, or existence as part of the plants.

For years 2014 and on, EIA860 contains a unit_id_eia value, allowing the combined cycle plant components to be associated with each other. For many plants not listed in the reported boiler generator associations, it is nonetheless possible to associate boilers and generators on a one-to-one basis, as they use identical strings to describe the units.

In the end, between the reported BGA table, the string matching, and the unit_id_eia values, it’s possible to create a nearly complete mapping of the generation units, at least for 2014 and later.

Parameters:
  • eia_transformed_dfs (dict) – a dictionary of post-transform dataframes representing the EIA database tables.

  • eia923_years (list-like) – a list of the years of EIA 923 data that should be used to infer the boiler-generator associations. By default it is all the working years of data.

  • eia860_years (list-like) – a list of the years of EIA 860 data that should be used to infer the boiler-generator associations. By default it is all the working years of data.

  • debug (bool) – If True, include columns in the returned dataframe indicating by what method the individual boiler generator associations were inferred.

Returns:

Returns the same dictionary of dataframes that was passed in, and adds a new dataframe to it representing the boiler-generator associations as records containing plant_id_eia, generator_id, boiler_id, and unit_id_pudl

Return type:

eia_transformed_dfs (dict)

Raises:
  • AssertionError – If the boiler - generator association graphs are not bi-partite, meaning generators only connect to boilers, and boilers only connect to generators.

  • AssertionError – If all the boilers do not end up with the same unit_id each year.

  • AssertionError – If all the generators do not end up with the same unit_id each year.
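Grouping boilers and generators that "have ever been associated with each other" amounts to finding connected components in a bipartite graph within each plant. The sketch below swaps in a plain union-find for whatever graph tooling the module relies on, so it should be read as an illustration of the idea only. The function name is hypothetical, and numbering units sequentially within each plant is an assumption.

```python
def assign_unit_ids(associations):
    """Assign one unit ID per connected boiler-generator group in a plant.

    ``associations`` holds (plant_id_eia, boiler_id, generator_id) tuples.
    """
    associations = sorted(set(associations))
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Boilers and generators are distinct node types, so the graph is
    # bipartite by construction even if the two share an ID string.
    for plant, boiler, generator in associations:
        boiler_root = find((plant, "boiler", boiler))
        generator_root = find((plant, "generator", generator))
        parent[boiler_root] = generator_root

    unit_ids = {}  # component root -> unit ID
    next_id = {}  # plant -> last unit ID handed out
    result = []
    for plant, boiler, generator in associations:
        root = find((plant, "boiler", boiler))
        if root not in unit_ids:
            next_id[plant] = next_id.get(plant, 0) + 1
            unit_ids[root] = next_id[plant]
        result.append(
            {
                "plant_id_eia": plant,
                "boiler_id": boiler,
                "generator_id": generator,
                "unit_id_pudl": unit_ids[root],
            }
        )
    return result


# Hypothetical associations: in plant 1, B1 and B2 share generator G1, so
# B1, B2, G1, and G2 form one unit while B3 and G3 form another.
bga = assign_unit_ids(
    [(1, "B1", "G1"), (1, "B2", "G1"), (1, "B2", "G2"), (1, "B3", "G3"), (2, "B1", "G1")]
)
```

Because each component receives exactly one ID, every boiler and every generator ends up with a single unit_id_pudl, which is the property the assertions listed above check.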

pudl.transform.eia._restrict_years(df, eia923_years=DataSource.from_id('eia923').working_partitions['years'], eia860_years=DataSource.from_id('eia860').working_partitions['years'])

Restricts eia years for boiler generator association.

pudl.transform.eia.transform(eia_transformed_dfs, eia_settings: pudl.settings.EiaSettings = EiaSettings(), debug=False)

Creates DataFrames for EIA Entity tables and modifies EIA tables.

This function coordinates two main actions: generating the entity tables via harvesting() and generating the boiler generator associations via _boiler_generator_assn().

There is also some removal of tables that are no longer needed after the entity harvesting is finished.

Parameters:
  • eia_transformed_dfs (dict) – a dictionary of table names (keys) and transformed dataframes (values).

  • eia_settings – Object containing validated settings relevant to EIA datasets.

  • debug (bool) – if true, informational columns will be added into boiler_generator_assn

Returns:

Two dictionaries, each with table names as keys and dataframes as values: one for the entity tables and one for the transformed EIA dataframes.

Return type:

tuple