pudl.transform.eia module
Code for transforming EIA data that pertains to more than one EIA Form.
This module helps normalize EIA datasets and infers additional connections between EIA entities (e.g. utilities, plants, units, generators…). This includes:
compiling a master list of plant, utility, boiler, and generator IDs that appear in any of the EIA 860 or 923 tables.
inferring more complete boiler-generator associations.
differentiating between static and time-varying attributes associated with the EIA entities, storing the static fields in the entity table and the time-varying fields in an annual table.
The boiler generator association (bga) inference takes the associations provided by the EIA 860 and expands on them using several methods, which can be found in pudl.transform.eia._boiler_generator_assn().
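As a rough illustration of the first item in the list above (compiling a master list of IDs), the following is a minimal sketch and not the module's actual implementation. The helper name `compile_ids`, the table names, and the column names `plant_id_eia` and `generator_id` are assumptions made for the example, and the data is made up.

```python
import pandas as pd


def compile_ids(dfs, id_cols):
    """Collect every distinct combination of id_cols found in any table.

    Hypothetical helper: tables that lack one of the id columns are skipped.
    """
    frames = [
        df[id_cols] for df in dfs.values()
        if set(id_cols).issubset(df.columns)
    ]
    return (
        pd.concat(frames, ignore_index=True)
        .dropna()
        .drop_duplicates()
        .sort_values(id_cols)
        .reset_index(drop=True)
    )


# Toy stand-ins for transformed EIA 860 / 923 tables (not real data).
toy_dfs = {
    "generators_eia860": pd.DataFrame(
        {"plant_id_eia": [1, 1, 2], "generator_id": ["a", "b", "a"]}
    ),
    "generation_eia923": pd.DataFrame(
        {
            "plant_id_eia": [1, 3],
            "generator_id": ["a", "c"],
            "net_generation_mwh": [10.0, 5.0],
        }
    ),
    "plants_eia860": pd.DataFrame({"plant_id_eia": [1, 2, 4]}),
}

generator_ids = compile_ids(toy_dfs, ["plant_id_eia", "generator_id"])
plant_ids = compile_ids(toy_dfs, ["plant_id_eia"])
```

The point of the sketch is that an ID only needs to appear in one table to end up in the master list, which is why plant 4 (seen only in the toy plants table) is still included.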
pudl.transform.eia.harvesting(entity, eia_transformed_dfs, entities_dfs, eia860_ytd=False, debug=False)
Compiles consistent records for various entities.
For each entity (plants, generators, boilers, utilities), this function finds all the harvestable columns from any table in which they appear. It then determines how consistent the records are and keeps the values that are mostly consistent. It compiles those consistent records into one normalized table.
There are a few things to note here. First, we do not expect the outcome to be perfect. We choose the most consistent record as reported across all the EIA tables and years, but we also require a “strictness” level of 70% (this is currently a hard-coded argument for _occurrence_consistency). That means at least 70% of the records must be the same for us to use that value. If the values for an entity have not been reported with at least 70% consistency, the result is a null value. We built in the ability to add special cases for columns that need a different method, but the only ones we added were for latitude and longitude, because they are by far the dirtiest.
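The 70% rule described above can be sketched roughly as follows. This is an illustrative approximation, not the code in _occurrence_consistency: the function name `most_consistent_value`, the column names, and the decision to ignore null reports when computing the share are all assumptions, and the data is made up.

```python
import pandas as pd


def most_consistent_value(records, id_cols, col, strictness=0.7):
    """Per entity, keep the value of `col` reported in >= `strictness` of its records.

    Hypothetical helper. Null reports are ignored when computing the share,
    which is an assumption of this sketch.
    """
    reported = records.dropna(subset=[col])
    # How many times each (entity, value) pair was reported.
    counts = (
        reported.groupby(id_cols + [col])
        .size()
        .rename("occurrences")
        .reset_index()
    )
    # How many non-null records each entity has in total.
    counts["total"] = counts.groupby(id_cols)["occurrences"].transform("sum")
    consistent = counts[counts["occurrences"] / counts["total"] >= strictness]
    # Entities without a sufficiently consistent value are left null.
    entities = records[id_cols].drop_duplicates()
    return entities.merge(consistent[id_cols + [col]], on=id_cols, how="left")


# Toy records: plant 1 reports "CO" in 3 of 4 records, plant 2 is split.
records = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 2, 2, 2],
        "state": ["CO", "CO", "CO", "TX", "WY", "MT", None],
    }
)
result = most_consistent_value(records, ["plant_id_eia"], "state")
```

In this toy data, plant 1 keeps "CO" (75% of its reports agree), while plant 2 gets a null because neither of its two reported values reaches the threshold. Because the threshold is above 50%, at most one value per entity can pass.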
We have determined which columns should be considered “static” or “annual”. These can be found in constants in the entities dictionary. Static means that the column should not change over time. Annual means that the column varies from year to year. This distinction was made in part by testing the consistency and in part by an understanding of how the entities and columns relate in the real world.
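To make the static/annual distinction concrete, here is a minimal sketch of splitting one harvested table into an entity table and an annual table. The dictionary layout, the helper `split_static_annual`, and the column names (`plant_id_eia`, `report_date`, `state`, `utility_id_eia`) are assumptions for illustration and may not match the real entities dictionary; the sketch also assumes the static values have already been made consistent.

```python
import pandas as pd

# Hypothetical classification for illustration only; the real lists are
# defined in the entities dictionary in constants.
ENTITY_COLS = {
    "plants": {
        "id_cols": ["plant_id_eia"],
        "static_cols": ["state"],
        "annual_cols": ["utility_id_eia"],
    },
}


def split_static_annual(df, entity, date_col="report_date"):
    """Split a harvested table into a static entity table and an annual table."""
    spec = ENTITY_COLS[entity]
    id_cols = spec["id_cols"]
    # One row per entity, holding only the static attributes.
    entity_df = (
        df[id_cols + spec["static_cols"]]
        .drop_duplicates(subset=id_cols)
        .reset_index(drop=True)
    )
    # One row per entity per report date, holding the varying attributes.
    annual_df = (
        df[id_cols + [date_col] + spec["annual_cols"]]
        .drop_duplicates(subset=id_cols + [date_col])
        .reset_index(drop=True)
    )
    return entity_df, annual_df


# Toy harvested records (not real data).
harvested = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "report_date": pd.to_datetime(
            ["2018-01-01", "2019-01-01", "2019-01-01"]
        ),
        "state": ["CO", "CO", "WY"],
        "utility_id_eia": [10, 11, 20],
    }
)
plants_entity, plants_annual = split_static_annual(harvested, "plants")
```

The entity table ends up keyed by the entity ID alone, while the annual table is keyed by the entity ID plus the report date.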
- Parameters
entity (str) – One of plants, generators, boilers, or utilities.
eia_transformed_dfs (dict) – A dictionary of table names (keys) and transformed dataframes (values).
entities_dfs (dict) – A dictionary of entity table names (keys) and entity dataframes (values).
eia860_ytd (bool) – If True, the ETL run is attempting to include year-to-date updates from EIA 860M.
debug (bool) – If True, this function will also return an additional dictionary of dataframes that includes the pre-deduplicated compiled records, with the number of occurrences of the entity and of the record, to show the consistency of reported values.
- Returns
- A tuple containing:
eia_transformed_dfs (dict): dictionary of table names (keys) and transformed dataframes (values)
entities_dfs (dict): dictionary of entity table names (keys) and entity dataframes (values)
- Return type
- Raises
AssertionError – If the consistency of any record value is <90%.
Todo
Return to role of debug.
Determine what to do with null records
Determine how to treat mostly static records
pudl.transform.eia.transform(eia_transformed_dfs, eia860_years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia923_years=(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia860_ytd=False, debug=False)
Creates DataFrames for EIA Entity tables and modifies EIA tables.
This function coordinates two main actions: generating the entity tables via harvesting() and generating the boiler generator associations via _boiler_generator_assn(). There is also some removal of tables that are no longer needed after the entity harvesting is finished.
- Parameters
eia_transformed_dfs (dict) – A dictionary of table names (keys) and transformed dataframes (values).
eia860_years (list) – A list of years for EIA 860; must be continuous and include only working years.
eia923_years (list) – A list of years for EIA 923; must be continuous and include only working years.
eia860_ytd (bool) – If True, the ETL run is attempting to include year-to-date updates from EIA 860M.
debug (bool) – If True, informational columns will be added to boiler_generator_assn.
- Returns
Two dictionaries, each with table names as keys and dataframes as values: one for the entity tables and one for the transformed EIA dataframes.
- Return type