PUDL Release Notes#
Incorporated 2021 data from the EPA Hourly Continuous Emission Monitoring System (CEMS) dataset. See #1778
Nightly Data Builds#
We added infrastructure to run the entire ETL and all tests nightly so we can catch data errors when they are merged into
dev. This allows us to automatically update the PUDL Intake data catalogs when there are new code releases. See #1177 for more details.
Created a docker image that installs PUDL and its dependencies. The build-deploy-pudl.yaml GitHub Action builds and pushes the image to Docker Hub and deploys the image on a Google Compute Engine instance. The ETL outputs are then loaded to Google Cloud buckets for the data catalogs to access.
censusdp1tract_to_sqlite commands and pytest.
Allow users to create monolithic and partitioned EPA CEMS outputs without having to clobber or move any existing CEMS outputs.
GoogleCloudStorageCache now supports accessing requester pays buckets.
--loglevel arg to the package entrypoint commands.
Database Schema Changes#
After learning that generators’ prime movers do very occasionally change over time, we recategorized the prime_mover_code column in our entity resolution process to allow for this rare but real variability. We moved the prime_mover_code column from the statically harvested/normalized data columns to the annually harvested data columns (i.e. from generators_entity_eia to generators_eia860) #1600. See #1585 for more details.
Added operational_status_eia to our static metadata tables (See PUDL Code Metadata). Used these standard codes and code fixes to clean operational_status_code in the generators_entity_eia table. #1624
Moved a number of slowly changing plant attributes from the plants_entity_eia table to the annual plants_eia860 table. See #1748 and #1749. This was initially inspired by the desire to more accurately reproduce the aggregated fuel prices which are available in the EIA’s API. Along with state, census region, month, year, and fuel type, those prices are broken down by industrial sector. Previously
sector_id_eia (an aggregation of several primary_purpose_naics_id values) had been assumed to be static over a plant’s lifetime, when in fact it can change if e.g. a plant is sold to an IPP by a regulated utility. Other plant attributes which are now allowed to vary annually include:
grid_voltage_1_kv in the plants_eia860 table, to follow the pattern of many other multiply reported values.
Date Merge Helper Function#
Replaced the PUDL helper function clean_merge_asof that merged two dataframes reported on different temporal granularities, for example monthly vs yearly data. The reworked function, pudl.helpers.date_merge, is better encapsulated and faster, and replaces clean_merge_asof in the MCOE table and EIA 923 tables. See #1103, #1550
The helper function pudl.helpers.expand_timeseries was also added, which expands a dataframe to include a full timeseries of data at a certain frequency. The coordinating function first calls pudl.helpers.date_merge to merge two dataframes of different temporal granularities, and then calls pudl.helpers.expand_timeseries to expand the merged dataframe to a full timeseries. The added timeseries_fillin argument makes it possible to optionally generate an MCOE table that includes a full monthly timeseries even in years when annually reported generators don’t have matching monthly data. See #1550
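The two steps described above (merge across granularities, then fill out the timeseries) can be sketched in plain pandas. This is only an illustration of the idea, not the implementation in pudl.helpers: the function names, the column names, and the choice to match annual records to monthly records by calendar year are all assumptions made for the example.

```python
import pandas as pd


def merge_annual_onto_monthly(monthly, annual, on, date_col="report_date"):
    """Broadcast annually reported rows onto the monthly rows of the same year.

    Hypothetical stand-in for a date-aware merge: both frames get a temporary
    year column, and the merge key is the ID columns plus that year.
    """
    monthly = monthly.assign(_year=monthly[date_col].dt.year)
    annual = annual.assign(_year=annual[date_col].dt.year).drop(columns=[date_col])
    merged = monthly.merge(annual, on=[*on, "_year"], how="left")
    return merged.drop(columns="_year")


def expand_to_monthly(df, key_cols, date_col="report_date"):
    """Give every key one row per month across all reported years.

    Months with no reported data are kept as rows of missing values.
    """
    years = df[date_col].dt.year
    months = pd.date_range(
        f"{years.min()}-01-01", f"{years.max()}-12-01", freq="MS"
    )
    keys = df[key_cols].drop_duplicates()
    # Cross join: every key combined with every month in the range.
    full = keys.merge(pd.DataFrame({date_col: months}), how="cross")
    return full.merge(df, on=[*key_cols, date_col], how="left")


monthly = pd.DataFrame(
    {
        "generator_id": ["A", "A"],
        "report_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
        "net_generation_mwh": [5.0, 7.0],
    }
)
annual = pd.DataFrame(
    {
        "generator_id": ["A"],
        "report_date": pd.to_datetime(["2020-01-01"]),
        "capacity_mw": [10.0],
    }
)
merged = merge_annual_onto_monthly(monthly, annual, on=["generator_id"])
full = expand_to_monthly(merged, key_cols=["generator_id"])
```

Here `merged` carries the annual capacity on both monthly rows, and `full` has twelve rows for generator A, ten of which hold missing values because no monthly data was reported for them.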
Plant Parts List Module Changes#
We refactored a couple of components of the Plant Parts List module in preparation for the next round of entity matching of EIA and FERC Form 1 records with the Panda model developed by the Chu Data Lab at Georgia Tech, through work funded by a CCAI Innovation Grant. The labeling of different aggregations of EIA generators as the true granularity was sped up, resulting in faster generation of the final plant parts list. In addition, the generation of the installation_year column in the plant parts list was fixed and a construction_year column was also added. Finally, operating_year was added as a level that the EIA generators are now aggregated to.
The mega generators table and in turn the plant parts list require the MCOE table to generate. The MCOE table is now created with the new pudl.helpers.date_merge helper function (described above). As a result, by default only the columns from the EIA 860 generators table that are necessary for the creation of the plant parts list will be included in the MCOE table. This list of columns is defined by the global pudl.analysis.mcoe.DEFAULT_GENS_COLS. If additional columns that are not part of the default list are needed from the EIA 860 generators table, these columns can be passed in with the gens_cols argument. See #1550
Dask v2022.4.2 introduced breaking changes into dask.dataframe.read_parquet(). However, we didn’t catch this when it happened because it’s only a problem when there’s more than one row-group. Now we’re processing 2019-2020 data for both ID and ME (two of the smallest states) in the tests. Also restricted the allowed Dask versions in our setup.py so that we get notified by the dependabot any time even a minor update happens to any of the packages we depend on that use calendar versioning. See #1618.
Fixed a testing bug where the partitioned EPA CEMS outputs generated using parallel processing were getting output in the same output directory as the real ETL, which should never happen. See #1618.
Dependencies / Environment#
In conjunction with getting the @dependabot set up to merge its own PRs if CI passes, we tightened the version constraints on a lot of our dependencies. This should reduce the frequency with which we get surprised by changes breaking things after release. See #1655
We’ve switched to using mambaforge to manage our environments internally, and are recommending that users use it as well.
We’re moving toward treating PUDL like an application rather than a library, and part of that is no longer trying to be compatible with a wide range of versions of our dependencies, instead focusing on a single reproducible environment that is associated with each release, using lockfiles, etc. See #1669
As an “application” PUDL is now only supporting the most recent major version of Python (currently 3.10). We used pyupgrade and pep585-upgrade to update our syntax to use Python 3.10 norms, and are now using those packages as pre-commit hooks as well. See #1685
For the purposes of linking EIA and FERC Form 1 records, we (mostly @cmgosnell) have created a new output called the Plant Parts List in pudl.analysis.plant_parts_eia which combines many different sub-parts of the EIA generators based on their fuel type, prime movers, ownership, etc. This allows a huge range of hypothetically possible FERC Form 1 plant records to be synthesized, so that we can identify exactly what data in EIA should be associated with what data in FERC using a variety of record linkage & entity matching techniques. This is still a work in progress, both with our partners at RMI, and in collaboration with the Chu Data Lab at Georgia Tech, through work funded by a CCAI Innovation Grant. #1157
Use the data source metadata classes to automatically export rich metadata for use with our Datasette deployment. #1479
Added static tables and metadata structures that store definitions and additional information related to the many coded categorical columns in the database. These tables are exported directly into the documentation (See PUDL Code Metadata). The metadata structures also document all of the non-standard values that we’ve identified in the raw data, and the standard codes that they are mapped to. #1388
As a result of all these metadata improvements we were finally able to close #52 and delete the pudl.constants junk-drawer module… after 5 years.
We are now using the coding table metadata mentioned above and the foreign key relationships that are part of the database schema to automatically recode any column that refers to the codes defined in the coding table. This results in much more uniformity across the whole database, especially in the EIA
In the raw input data, often NULL values will be represented by the empty string or other not really NULL values. We went through and cleaned these up in all of the categorical / coded columns so that their values can be validated based on either an ENUM constraint in the database, or a foreign key constraint linking them to the static coding tables. Now they should primarily use the pandas NA value, or numpy.nan in the case of floats. #1376
Many FIPS and ZIP codes that appear in the raw data are stored as integers rather than strings, meaning that they lose their leading zeros, rendering them invalid in many contexts. We use the same method to clean them all up now, and enforce a uniform field width with leading zero padding. This also allows us to enforce a regex pattern constraint on these fields in the database outputs. #1405, #1476
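As a rough illustration of the zero-padding repair described above, the sketch below restores leading zeros to codes that were read in as numbers. The function name and the example values are hypothetical, and this is not necessarily how the PUDL cleaning code is written; it only shows the general technique with pandas.

```python
import pandas as pd


def zero_pad_codes(codes: pd.Series, width: int) -> pd.Series:
    """Return fixed-width, zero-padded string codes; missing values stay NA.

    Hypothetical helper: values are coerced to numbers first so that
    integers, floats like 2138.0, and digit strings are all treated alike.
    """
    numeric = pd.to_numeric(codes, errors="coerce")
    # Nullable Int64 drops the ".0" that floats would otherwise carry
    # into the string representation.
    as_text = numeric.astype("Int64").astype("string")
    return as_text.str.zfill(width)


raw_county_fips = pd.Series([6037, "1001", None, 2138.0], dtype="object")
county_fips = zero_pad_codes(raw_county_fips, width=5)

# A regex pattern check of the kind a database constraint could enforce.
all_valid = bool(county_fips.dropna().str.fullmatch(r"\d{5}").all())
```

Routing every code column through one function like this is what makes a uniform width, and therefore a single pattern constraint, possible.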
We’re now able to fill in missing values in the very useful generators_eia860 technology_description field. Currently this is optionally available in the output layer, but we want to put more of this kind of data repair into the core database going forward. #1075
Made better use of our Pydantic settings classes to validate and manage the ETL settings that are read in from YAML files and passed around throughout the functions that orchestrate the ETL process. #1506
Addressed a bunch of deprecation warnings being raised by
Silenced a bunch of 3rd party module warnings in the tests. See #1476
In addressing #851, #1296, #1325 the generation_fuel_eia923 table was split to create a generation_fuel_nuclear_eia923 table since they have different primary keys. This meant that the pudl.output.pudltabl.PudlTabl.gf_eia923() method no longer included nuclear generation. This impacted the net generation allocation process and MCOE calculations downstream, which were expecting to have all the reported nuclear generation. This has now been fixed, and the generation fuel output includes both the nuclear and non-nuclear generation, with nuclear generation aggregated across nuclear unit IDs so that it has the same primary key as the rest of the generation fuel table. #1518
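The aggregation described in this fix can be sketched as a groupby followed by a concatenation. The key columns and data columns used here are assumptions chosen for illustration, and the function is hypothetical rather than the code behind gf_eia923().

```python
import pandas as pd

# Assumed primary key of the non-nuclear table, for illustration only.
GF_KEY = ["plant_id_eia", "report_date", "prime_mover_code", "energy_source_code"]


def combine_generation_fuel(gf, gf_nuclear):
    """Sum nuclear records across unit IDs, then append them to the rest."""
    nuclear_agg = gf_nuclear.groupby(GF_KEY, as_index=False)[
        ["net_generation_mwh"]
    ].sum()
    return pd.concat([gf, nuclear_agg], ignore_index=True)


gf = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "report_date": pd.to_datetime(["2020-01-01"]),
        "prime_mover_code": ["ST"],
        "energy_source_code": ["BIT"],
        "net_generation_mwh": [50.0],
    }
)
gf_nuclear = pd.DataFrame(
    {
        "plant_id_eia": [2, 2],
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "nuclear_unit_id": ["1", "2"],
        "prime_mover_code": ["ST", "ST"],
        "energy_source_code": ["NUC", "NUC"],
        "net_generation_mwh": [100.0, 200.0],
    }
)
combined = combine_generation_fuel(gf, gf_nuclear)
```

After the aggregation the two nuclear units collapse into a single row of 300 MWh, so the combined table has no duplicate values of the shared key.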
EIA changed the URL of their API to only accept connections over HTTPS, but we had a hard-coded HTTP URL, meaning the historical fuel price filling that uses the API broke. This has been fixed.
Everything is fiiiiiine.
Data Coverage Changes#
Integration of 2020 data for all our core datasets (See #1255):
EPA IPM / NEEDS data has been removed from PUDL as we didn’t have the internal resources to maintain it, and it was no longer working. Apologies to @gschivley!
SQLite and Parquet Outputs#
The ETL pipeline now outputs SQLite databases and Apache Parquet datasets directly, rather than generating tabular data packages. This is much faster and simpler, and also takes up less space on disk. Running the full ETL including all EPA CEMS data should now take around 2 hours if you have all the data downloaded.
pudl.load.parquet modules contain this logic. The
pudl.load.metadata modules have been removed along with other remaining datapackage infrastructure. See #1211
Many more tables now have natural primary keys explicitly specified within the database schema.
datapkg_to_sqlite script has been removed and the
epacems_to_parquet script can now be used to process the original EPA CEMS CSV data directly to Parquet using an existing PUDL database to source plant timezones. See #1176, #806.
Data types, specified value constraints, and the uniqueness / non-null constraints on primary keys are validated during insertion into the SQLite DB.
The PUDL ETL CLI
pudl.cli now has flags to toggle various constraint checks including
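The kinds of checks mentioned in this section (data types, value constraints, and unique, non-null primary keys enforced at insertion time) can be demonstrated with the standard library sqlite3 module. The table and column names below are invented for the example, and the actual PUDL schema is defined elsewhere; this is only a minimal sketch of how SQLite rejects bad rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE plants (
        plant_id INTEGER NOT NULL,
        report_year INTEGER NOT NULL,
        state TEXT CHECK (length(state) = 2),
        capacity_mw REAL CHECK (typeof(capacity_mw) IN ('real', 'null')),
        PRIMARY KEY (plant_id, report_year)
    )
    """
)


def try_insert(conn, row):
    """Insert one row, reporting whether a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO plants VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


results = [
    try_insert(conn, (1, 2020, "CO", 12.5)),        # valid row
    try_insert(conn, (1, 2020, "CO", 13.0)),        # duplicate primary key
    try_insert(conn, (None, 2020, "CO", 1.0)),      # null in a key column
    try_insert(conn, (2, 2020, "Colorado", 1.0)),   # value constraint
    try_insert(conn, (3, 2020, "CO", "abc")),       # wrong data type
]
```

Only the first insert succeeds. The other four raise sqlite3.IntegrityError, which is the point at which problems in the data would surface during loading.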
New Metadata System#
With the deprecation of tabular data package outputs, we’ve adopted a more
modular metadata management system that uses Pydantic. This setup will allow us to easily
validate the metadata schema and export to a variety of formats to support data
distribution via Datasette and Intake catalogs, and automatic generation of data
dictionaries and documentation. See #806, #1271, #1272 and the
subpackage. Many thanks to @ezwelty for most of this work.
ETL Settings File Format Changed#
We are also using Pydantic to parse and
validate the YAML settings files that tell PUDL what data to include in an ETL run. If
you have any old settings files of your own lying around they’ll need to be updated.
Examples of the new format will be deployed to your system if you re-run the
pudl_setup script. Or you can make a copy of the
etl_fast.yml files that are stored under
edit them to reflect your needs.
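To give a sense of what Pydantic validation of settings looks like, here is a minimal sketch. The class and field names are hypothetical and do not reproduce the real PUDL settings schema. It assumes the YAML file has already been parsed into a plain dictionary.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class EpaCemsSettings(BaseModel):
    """Hypothetical per-dataset settings: which years and states to process."""

    years: List[int]
    states: List[str]


class EtlSettings(BaseModel):
    """Hypothetical top-level settings for one ETL run."""

    name: str
    epacems: EpaCemsSettings


def load_settings(raw: dict) -> EtlSettings:
    """Validate a dictionary of settings, raising ValidationError if malformed."""
    return EtlSettings(**raw)


good = {"name": "fast", "epacems": {"years": [2020], "states": ["ID", "ME"]}}
settings = load_settings(good)

bad = {"name": "fast", "epacems": {"years": ["not-a-year"], "states": ["ID"]}}
try:
    load_settings(bad)
    bad_was_rejected = False
except ValidationError:
    bad_was_rejected = True
```

Because the structure is declared once as classes, a settings file written in an outdated layout fails immediately with a descriptive error instead of partway through an ETL run.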
Database Schema Changes#
With the direct database output and the new metadata system, it’s much easier for us to create foreign key relationships automatically. Updates that are in progress to the database normalization and entity resolution process also benefit from using natural primary keys when possible. As a result we’ve made some changes to the PUDL database schema, which will probably affect some users.
We have split out a new generation_fuel_nuclear_eia923 table from the existing generation_fuel_eia923 table, as nuclear generation and fuel consumption are reported at the generation unit level, rather than the plant level, requiring a different natural primary key. See #851, #1296, #1325.
Implementing a natural primary key for the boiler_fuel_eia923 table required the aggregation of a small number of records that didn’t have well-defined prime_mover_code values. See #852, #1306, #1311.
We repaired, aggregated, or dropped a small number of records in the generation_eia923 (See #1208, #1248) and ownership_eia860 (See #1207, #1258) tables due to null values in their primary key columns.
Many new foreign key constraints are being enforced between the EIA data tables, entity tables, and coding tables. See #1196.
Fuel types and energy sources reported to EIA are now defined in / constrained by the static energy_sources_eia table.
The columns that indicate the mode of transport for various fuels now contain short codes rather than longer labels, and are defined in / constrained by the static fuel_transportation_modes_eia table.
In the simplified FERC 1 fuel type categories, we’re now using
Several columns have been renamed to harmonize meanings between different tables and datasets, including:
In generation_fuel_eia923 and boiler_fuel_eia923 the fuel_type_code columns have been replaced with energy_source_code, which appears in various forms in generators_eia860 and fuel_receipts_costs_eia923.
mine_type (a human readable label, not a code).
Added a deployed console script for running the state-level hourly electricity demand allocation, using FERC 714 and EIA 861 data, simply called state_demand and implemented in pudl.analysis.state_demand. This script existed in the v0.4.0 release, but was not deployed on the user’s system.
SQLAlchemy 1.4.x: Addressed all deprecation warnings associated with API changes coming in SQLAlchemy 2.0, and bumped current requirement to 1.4.x
Pandas 1.3.x: Addressed many data type issues resulting from changes in how Pandas preserves and propagates ExtensionArray / nullable data types.
PyArrow v5.0.0 Updated to the most recent version
PyGEOS v0.10.x Updated to the most recent version
contextily has been removed, since we only used it optionally for making a single visualization and it has substantial dependencies itself.
goodtables-pandas-py has been removed since we’re no longer producing or validating datapackages.
SQLite 3.32.0 The type checks that we’ve implemented currently only work with SQLite version 3.32.0 or later, as we discovered in debugging build failures on PR #1228. Unfortunately Ubuntu 20.04 LTS shipped with SQLite 3.31.1. Using conda to manage your Python environment avoids this issue.
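One way to find out which SQLite library a given Python environment is linked against is the standard library sqlite3 module. The helper name below is hypothetical; the version threshold is the one stated above.

```python
import sqlite3

# Minimum SQLite version stated in the note above.
MIN_SQLITE = (3, 32, 0)


def sqlite_is_new_enough(
    version_info=sqlite3.sqlite_version_info, minimum=MIN_SQLITE
):
    """Return True if the given SQLite version meets the minimum."""
    return tuple(version_info) >= minimum


# Version of the SQLite library this interpreter is actually using.
print(sqlite3.sqlite_version, sqlite_is_new_enough())
```

Note that sqlite3.sqlite_version_info describes the underlying SQLite library, not the Python sqlite3 module, so it is the relevant value to compare.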
This is a ridiculously large update including more than a year and a half’s worth of work.
New Data Coverage#
EIA Form 860 for 2004-2008 + 2019, plus eia860m through 2020.
EIA Form 923 for 2001-2008 + 2019
EPA Hourly Continuous Emission Monitoring System (CEMS) for 2019-2020
FERC Form 1 for 2019
US Census Demographic Profile (DP1) for 2010
FERC Form 714 for 2006-2019 (experimental)
EIA Form 861 for 2001-2019 (experimental)
Documentation & Data Accessibility#
We’ve updated and (hopefully) clarified the documentation, and no longer expect most users to perform the data processing on their own. Instead, we are offering several methods of directly accessing already processed data:
Processed data archives on Zenodo that include a Docker container preserving the required software environment for working with the data.
Users who still want to run the ETL themselves will need to set up the PUDL development environment
Data Cleaning & Integration#
We now inject placeholder utilities in the cloned FERC Form 1 database when respondent IDs appear in the data tables, but not in the respondent table. This addresses a bunch of unsatisfied foreign key constraints in the original databases published by FERC.
We’re doing much more software testing and data validation, and so hopefully we’re catching more issues early on.
Hourly Electricity Demand and Historical Utility Territories#
With support from GridLab and in collaboration with researchers at Berkeley’s Center for Environmental Public Policy, we did a bunch of work on spatially attributing hourly historical electricity demand. This work was largely done by @ezwelty and @yashkumar1803 and included:
Semi-programmatic compilation of historical utility and balancing authority service territory geometries based on the counties associated with utilities, and the utilities associated with balancing authorities in the EIA 861 (2001-2019). See e.g. #670 but also many others.
A method for spatially allocating hourly electricity demand from FERC 714 to US states based on the overlapping historical utility service territories described above. See #741
A fast timeseries outlier detection routine for cleaning up the FERC 714 hourly data using correlations between the time series reported by all of the different entities. See #871
Net Generation and Fuel Consumption for All Generators#
We have developed an experimental methodology to produce net generation and fuel consumption for all generators. The process has known issues and is being actively developed. See #989
Net electricity generation and fuel consumption are reported in multiple ways in the EIA 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.
The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.
pudl.analysis.allocate_net_gen module estimates the net electricity
generation and fuel consumption attributable to individual generators based on
the more expansive reporting of the data in the generation_fuel_eia923
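The general idea of pushing plant-level totals down to individual generators can be illustrated with a simple proportional allocation. The sketch below splits each plant, prime mover, and fuel total among the matching generators in proportion to capacity. That weighting, the function name, and the column names are assumptions made for this example, and the experimental method in pudl.analysis.allocate_net_gen may differ substantially.

```python
import pandas as pd

GROUP_COLS = ["plant_id_eia", "prime_mover_code", "energy_source_code"]


def allocate_by_capacity(gf, gens):
    """Split each reported total among generators by their capacity share.

    Hypothetical simplification: the weights come only from capacity_mw.
    """
    gens = gens.copy()
    group_capacity = gens.groupby(GROUP_COLS)["capacity_mw"].transform("sum")
    gens["capacity_share"] = gens["capacity_mw"] / group_capacity
    out = gens.merge(gf, on=GROUP_COLS, how="left")
    out["net_generation_mwh_alloc"] = (
        out["net_generation_mwh"] * out["capacity_share"]
    )
    return out[[*GROUP_COLS, "generator_id", "net_generation_mwh_alloc"]]


gf = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "prime_mover_code": ["GT"],
        "energy_source_code": ["NG"],
        "net_generation_mwh": [1000.0],
    }
)
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "generator_id": ["G1", "G2"],
        "prime_mover_code": ["GT", "GT"],
        "energy_source_code": ["NG", "NG"],
        "capacity_mw": [30.0, 10.0],
    }
)
allocated = allocate_by_capacity(gf, gens)
```

In this toy case G1 receives 750 MWh and G2 receives 250 MWh, and the allocated values sum back to the reported 1000 MWh, which is the basic property any such allocation should preserve.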
Data Management and Archiving#
We now use a series of web scrapers to collect snapshots of the raw input data that is processed by PUDL. These original data are archived as Frictionless Data Packages on Zenodo, so that they can be accessed reproducibly and programmatically via a REST API. This addresses the problems we were having with the v0.3.x releases, in which the original data on the agency websites was liable to be modified long after its “final” release, rendering it incompatible with our software. These scrapers and the Zenodo archiving scripts can be found in our pudl-scrapers and pudl-zenodo-storage repositories. The archives themselves can be found within the Catalyst Cooperative community on Zenodo
There’s an experimental caching system that allows these Zenodo archives to work as long-term “cold storage” for citation and reproducibility, with cloud object storage acting as a much faster way to access the same data for day to day non-local use, implemented by @rousik
We’ve decided to shift to producing a combination of relational databases (SQLite files) and columnar data stores (Apache Parquet files) as the primary outputs of PUDL. Tabular Data Packages didn’t end up serving either database or spreadsheet users very well. The CSV files were often too large to access via spreadsheets, and users missed out on the relationships between data tables. Needing to separately load the data packages into SQLite and Parquet was a hassle and generated a lot of overly complicated and fragile code.
The EIA 861 and FERC 714 data are not yet integrated into the SQLite database outputs, because we need to overhaul our entity resolution process to accommodate them in the database structure. That work is ongoing, see #639
The EIA 860 and EIA 923 data don’t cover exactly the same range of years. EIA 860 only goes back to 2004, while EIA 923 goes back to 2001. This is because the pre-2004 EIA 860 data is stored in the DBF file format, and we need to update our extraction code to deal with the different format. This means some analyses that require both EIA 860 and EIA 923 data (like the calculation of heat rates) can only be performed as far back as 2004 at the moment. See #848
There are 387 EIA utilities and 228 EIA plants which appear in the EIA 923, but which haven’t yet been assigned PUDL IDs and associated with the corresponding utilities and plants reported in the FERC Form 1. These entities show up in the 2001-2008 EIA 923 data that was just integrated. These older plants and utilities can’t yet be used in conjunction with FERC data. When the EIA 860 data for 2001-2003 has been integrated, we will finish this manual ID assignment process. See #848, #1069
52 of the algorithmically assigned plant_id_ferc1 values found in the plants_steam_ferc1 table are currently associated with more than one plant_id_pudl value (99 PUDL plant IDs are involved), indicating either that the algorithm is making poor assignments, or that the manually assigned plant_id_pudl values are incorrect. This is out of several thousand distinct plant_id_ferc1 values. See #954
The county FIPS codes associated with coal mines reported in the Fuel Receipts and Costs table are being treated inconsistently in terms of their data types, especially in the output functions, so they are currently being output as floating point numbers that have been cast to strings, rather than zero-padded integers that are strings. See #1119
The primary changes in this release:
The 2009-2010 data for EIA 860 have been integrated, including updates to the data validation test cases.
Output tables are more uniform and less restrictive in what they include, no longer requiring PUDL Plant & Utility IDs in some tables. This release was used to compile v1.1.0 of the PUDL Data Release, which is archived at Zenodo under this DOI: https://doi.org/10.5281/zenodo.3672068
With this release, the EIA 860 & 923 data now (finally!) cover the same span of time. We do not anticipate integrating any older EIA 860 or 923 data at this time.
A couple of minor bugs were found in the preparation of the first PUDL data release:
No maximum version of Python was being specified in setup.py. PUDL currently only works on Python 3.7, not 3.8.
epacems_to_parquet conversion script was erroneously attempting to verify the availability of raw input data files, despite the fact that it now relies on the packaged post-ETL epacems data. Didn’t catch this before since it was always being run in a context where the original data was lying around… but that’s not the case when someone just downloads the released data packages and tries to load them.
This release is mostly about getting the infrastructure in place to do regular data releases via Zenodo, and updating ETL with 2018 data.
Added lots of data validation / quality assurance test cases in anticipation of archiving data. See the pudl.validate module for more details.
New data since v0.2.0 of PUDL:
EIA Form 860 for 2018
EIA Form 923 for 2018
FERC Form 1 for 1994-2003 and 2018 (select tables)
We removed the FERC Form 1 accumulated depreciation table from PUDL because it requires detailed row-mapping in order to be accurate across all the years. It and many other FERC tables will be integrated soon, using new row-mapping methods.
Lots of new plants and utilities integrated into the PUDL ID mapping process, for the earlier years (1994-2003). All years of FERC 1 data should be integrated for all future ferc1 tables.
Command line interfaces of some of the ETL scripts have changed, see their help messages for details.
This is the first release of PUDL to generate data packages as the canonical output, rather than loading data into a local PostgreSQL database. The data packages can then be used to generate a local SQLite database, without relying on any software being installed outside of the Python requirements specified for the catalyst.coop package.
This change will enable easier installation of PUDL, as well as archiving and bulk distribution of the data products in a platform independent format.
This is the only release of PUDL that will be made that makes use of PostgreSQL as the primary data product. It is provided for reference, in case there are users relying on this setup who need access to a well defined release.