pudl.transform.eia module

Code for transforming EIA data that pertains to more than one EIA Form.

This module helps normalize EIA datasets and infers additional connections between EIA entities (e.g. utilities, plants, units, generators…). This includes:

  • compiling a master list of plant, utility, boiler, and generator IDs that appear in any of the EIA 860 or 923 tables.

  • inferring more complete boiler-generator associations.

  • differentiating between static and time varying attributes associated with the EIA entities, storing the static fields with the entity table, and the variable fields in an annual table.
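As a rough illustration of the first item above, the sketch below collects every distinct ID that appears in any table carrying the ID column. It is not the module's actual implementation: the helper name compile_ids and the toy tables are hypothetical, and only pandas is assumed.

```python
import pandas as pd


def compile_ids(dfs, id_cols):
    """Collect the distinct ID combinations found in any table that has them.

    Hypothetical helper for illustration; not part of pudl.transform.eia.
    """
    # Only tables that contain every requested ID column contribute.
    frames = [df[id_cols] for df in dfs.values() if set(id_cols).issubset(df.columns)]
    if not frames:
        return pd.DataFrame(columns=id_cols)
    return (
        pd.concat(frames)
        .dropna()
        .drop_duplicates()
        .sort_values(id_cols)
        .reset_index(drop=True)
    )


# Toy tables; the names and values are illustrative only.
tables = {
    "plants_eia860": pd.DataFrame({"plant_id_eia": [1, 2], "state": ["CO", "TX"]}),
    "generation_eia923": pd.DataFrame({"plant_id_eia": [2, 3], "net_mwh": [5.0, 7.0]}),
    "utilities_eia860": pd.DataFrame({"utility_id_eia": [10]}),
}
plant_ids = compile_ids(tables, ["plant_id_eia"])
```

Here plant_ids holds one row per plant ID seen in either plant-bearing table, and the utilities table is skipped because it lacks the plant ID column.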

The boiler generator association inference (bga) takes the associations provided by the EIA 860 and expands on them using several methods, which can be found in pudl.transform.eia._boiler_generator_assn().

pudl.transform.eia.harvesting(entity, eia_transformed_dfs, entities_dfs, eia860_ytd=False, debug=False)[source]

Compiles consistent records for various entities.

For each entity (plants, generators, boilers, utilities), this function finds all the harvestable columns from any table in which they appear. It then determines how consistent the records are and keeps the values that are mostly consistent. It compiles those consistent records into one normalized table.

There are a few things to note here. First, we do not expect the outcome to be perfect. We pull the most consistent record as reported across all the EIA tables and years, but we also require a “strictness” level of 70% (currently a hard-coded argument for _occurrence_consistency). That means at least 70% of the records must be the same for us to use that value. If the values for an entity have not been reported with at least 70% consistency, the result is a null value. We built in the ability to add special cases for columns that need a different method, but the only ones we added are for latitude and longitude, because they are by far the dirtiest.
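The 70% rule can be sketched with pandas as follows. This is a minimal illustration, not the code behind _occurrence_consistency: the function name most_consistent_value is hypothetical, and it assumes null reports are ignored when computing the share, which the text above does not confirm.

```python
import pandas as pd

STRICTNESS = 0.7  # mirrors the 70% threshold described above


def most_consistent_value(records, id_col, value_col, strictness=STRICTNESS):
    """Return one value per entity, or null when no value is consistent enough.

    Hypothetical helper for illustration; assumes strictness > 0.5 so that
    at most one value per entity can pass the threshold.
    """
    # Assumption: null reports do not count toward the total.
    reported = records.dropna(subset=[value_col])
    # Count how often each (entity, value) pair was reported.
    counts = (
        reported.groupby([id_col, value_col])
        .size()
        .rename("occurrences")
        .reset_index()
    )
    # Share of each value among all non-null reports for that entity.
    counts["total"] = counts.groupby(id_col)["occurrences"].transform("sum")
    counts["share"] = counts["occurrences"] / counts["total"]
    winners = counts.loc[counts["share"] >= strictness, [id_col, value_col]]
    # Entities without a sufficiently consistent value keep a null.
    all_ids = records[[id_col]].drop_duplicates()
    return all_ids.merge(winners, on=id_col, how="left").reset_index(drop=True)


# Toy data: plant 1 reports "CO" in 3 of 4 records, plant 2 reports "TX" in 2 of 3.
records = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1, 2, 2, 2],
    "state": ["CO", "CO", "CO", "WY", "TX", "OK", "TX"],
})
result = most_consistent_value(records, "plant_id_eia", "state")
```

In this toy data plant 1 keeps "CO" (75% of reports), while plant 2 ends up null because its most common value reaches only about 67%.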

We have determined which columns should be considered “static” or “annual”. These can be found in constants in the entities dictionary. Static means that it should not change over time. Annual means there is annual variability. This distinction was made in part by testing the consistency and in part by an understanding of how the entities and columns relate in the real world.
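To make the static/annual split concrete, here is a small sketch that separates an already-consistent compiled table into an entity table and an annual table. The column lists and the function split_static_annual are hypothetical stand-ins; the real lists live in the entities dictionary mentioned above.

```python
import pandas as pd

# Hypothetical column lists for illustration only.
ID_COLS = ["plant_id_eia"]
STATIC_COLS = ["plant_name", "state"]
ANNUAL_COLS = ["utility_id_eia"]


def split_static_annual(df, id_cols, static_cols, annual_cols, date_col="report_date"):
    """Split compiled records into an entity table and an annual table.

    Assumes static values have already been made consistent per entity.
    """
    # One row per entity, holding the attributes that should not change.
    entity = (
        df[id_cols + static_cols]
        .drop_duplicates(subset=id_cols)
        .reset_index(drop=True)
    )
    # One row per entity and report date, holding the time-varying attributes.
    annual = (
        df[id_cols + [date_col] + annual_cols]
        .drop_duplicates(subset=id_cols + [date_col])
        .reset_index(drop=True)
    )
    return entity, annual


compiled = pd.DataFrame({
    "plant_id_eia": [1, 1, 2, 2],
    "report_date": pd.to_datetime(
        ["2018-01-01", "2019-01-01", "2018-01-01", "2019-01-01"]
    ),
    "plant_name": ["Alpha", "Alpha", "Beta", "Beta"],
    "state": ["CO", "CO", "TX", "TX"],
    "utility_id_eia": [10, 11, 20, 20],
})
entity_df, annual_df = split_static_annual(compiled, ID_COLS, STATIC_COLS, ANNUAL_COLS)
```

The entity table keeps one row per plant, while the annual table keeps one row per plant and report date, so the change of utility for plant 1 between 2018 and 2019 is preserved.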

Parameters
  • entity (str) – plants, generators, boilers, or utilities

  • eia_transformed_dfs (dict) – A dictionary of tbl names (keys) and transformed dfs (values)

  • entities_dfs (dict) – A dictionary of entity table names (keys) and entity dfs (values)

  • eia860_ytd (boolean) – If True, the ETL run attempts to include year-to-date updates from EIA 860M.

  • debug (bool) – If True, this function will also return an additional dictionary of dataframes that includes the pre-deduplicated compiled records, with the number of occurrences of the entity and of the record, to show the consistency of reported values.

Returns

A tuple containing:

  • eia_transformed_dfs (dict): dictionary of tbl names (keys) and transformed dfs (values)

  • entity_dfs (dict): dictionary of entity table names (keys) and entity dfs (values)

Return type

tuple

Raises

AssertionError – If the consistency of any record value is <90%.

Todo

  • Return to role of debug.

  • Determine what to do with null records

  • Determine how to treat mostly static records

pudl.transform.eia.transform(eia_transformed_dfs, eia860_years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia923_years=(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia860_ytd=False, debug=False)[source]

Creates DataFrames for EIA Entity tables and modifies EIA tables.

This function coordinates two main actions: generating the entity tables via harvesting() and generating the boiler generator associations via _boiler_generator_assn().

There is also some removal of tables that are no longer needed after the entity harvesting is finished.

Parameters
  • eia_transformed_dfs (dict) – a dictionary of table names (keys) and transformed dataframes (values).

  • eia860_years (list) – a list of years for EIA 860, must be continuous, and only include working years.

  • eia923_years (list) – a list of years for EIA 923, must be continuous, and include only working years.

  • eia860_ytd (boolean) – If True, the ETL run attempts to include year-to-date updates from EIA 860M.

  • debug (bool) – if true, informational columns will be added into boiler_generator_assn

Returns

Two dictionaries with table names as keys and dataframes as values: one for the entity tables and one for the transformed EIA dataframes.

Return type

tuple