PUDL Release Notes

0.5.0 (Unreleased)

SQLite and Parquet Outputs

  • The ETL pipeline now outputs SQLite databases and Apache Parquet datasets directly, rather than generating tabular data packages. This is much faster and simpler, and also takes up less space on disk. Running the full ETL including all EPA CEMS data should now take around 2 hours if you have all the data downloaded.

  • The new pudl.load.sqlite and pudl.load.parquet modules contain this logic. The pudl.load.csv and pudl.load.metadata modules have been deprecated and will be removed shortly along with other remaining datapackage infrastructure. See #1211

  • Many more tables now have natural primary keys explicitly specified within the database schema.

  • The datapkg_to_sqlite script has been removed and the epacems_to_parquet script can now be used to process the original EPA CEMS CSV data directly to Parquet using an existing PUDL database to source plant timezones. See #1176, #806.

  • Data types, specified value constraints, and the uniqueness / non-null constraints on primary keys are validated during insertion into the SQLite DB.

  • The PUDL ETL CLI pudl.cli now has flags to toggle various constraint checks including --ignore-foreign-key-constraints, --ignore-type-constraints, and --ignore-value-constraints.

New Metadata System

With the deprecation of tabular data package outputs, we’ve adopted a more modular metadata management system that uses Pydantic. This setup will allow us to easily validate the metadata schema and export to a variety of formats to support data distribution via Datasette and Intake catalogs, and automatic generation of data dictionaries and documentation. See #806 and the pudl.metadata subpackage. Many thanks to @ezwelty for most of this work.

Updated Dependencies

  • SQLAlchemy 1.4.x: Addressed all deprecation warnings associated with API changes coming in SQLAlchemy 2.0, and bumped current requirement to 1.4.x

  • Pandas 1.3.x: Addressed many data type issues resulting from changes in how Pandas preserves and propagates ExtensionArray / nullable data types.

  • PyArrow v5.0.0 Updated to the most recent version

  • PyGEOS v0.10.x Updated to the most recent version

  • contextily has been removed, since we only used it optionally for making a single visualization and it has substantial dependencies itself.

  • goodtables-pandas-py has been removed since we’re no longer producing or validating datapackages.

  • SQLite 3.32.0 The type checks that we’ve implemented currently only work with SQLite version 3.32.0 or later, as we discovered in debugging build failures on PR #1228. Unfortunately Ubuntu 20.04 LTS shipped with SQLite 3.31.1. Using conda to manage your Python environment avoids this issue.

New Analyses

  • Added a deployed console script for running the state-level hourly electricity demand allocation, using FERC 714 and EIA 861 data, simply called state_demand and implemented in pudl.analysis.state_demand. This script existed in the v0.4.0 release, but was not deployed on the user’s system.

Data Coverage Changes

  • EIA Form 860 for 2001-2003. See #1122.

  • EPA IPM / NEEDS data has been removed from PUDL as we didn’t have the internal resources to maintain it, and it was no longer working. Apologies to @gschivley!

Known Issues

  • The pudl_territories script has been disabled temporarily due to a memory issue. See #1174

  • The full extent of pre-load data validation that was previously being done with goodtables-pandas has not yet been fully reimplemented. Foreign key constraints are still being debugged. See #1196.

  • Several tables that should have natural primary keys currently do not, because of null or duplicate values in those columns. These need to be resolved in the ETL process. See #851, #852, #1207, #1208

0.4.0 (2021-08-16)

This is a ridiculously large update including more than a year and a half’s worth of work.

New Data Coverage

Documentation & Data Accessibility

We’ve updated and (hopefully) clarified the documentation, and no longer expect most users to perform the data processing on their own. Instead, we are offering several methods of directly accessing already processed data:

Users who still want to run the ETL themselves will need to set up the set up the PUDL development environment

Data Cleaning & Integration

  • We now inject placeholder utilities in the cloned FERC Form 1 database when respondent IDs appear in the data tables, but not in the respondent table. This addresses a bunch of unsatisfied foreign key constraints in the original databases published by FERC.

  • We’re doing much more software testing and data validation, and so hopefully we’re catching more issues early on.

Hourly Electricity Demand and Historical Utility Territories

With support from GridLab and in collaboration with researchers at Berkeley’s Center for Environmental Public Policy, we did a bunch of work on spatially attributing hourly historical electricity demand. This work was largely done by @ezwelty and @yashkumar1803 and included:

  • Semi-programmatic compilation of historical utility and balancing authority service territory geometries based on the counties associated with utilities, and the utilities associated with balancing authorities in the EIA 861 (2001-2019). See e.g. #670 but also many others.

  • A method for spatially allocating hourly electricity demand from FERC 714 to US states based on the overlapping historical utility service territories described above. See #741

  • A fast timeseries outlier detection routine for cleaning up the FERC 714 hourly data using correlations between the time series reported by all of the different entities. See #871

Net Generation and Fuel Consumption for All Generators

We have developed an experimental methodology to produce net generation and fuel consumption for all generators. The process has known issues and is being actively developed. See #989

Net electricity generation and fuel consumption are reported in multiple ways in the EIA 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.

The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.

The pudl.analysis.allocate_net_gen module estimates the net electricity generation and fuel consumption attributable to individual generators based on the more expansive reporting of the data in the generation_fuel_eia923 table.

Data Management and Archiving

  • We now use a series of web scrapers to collect snapshots of the raw input data that is processed by PUDL. These original data are archived as Frictionless Data Packages on Zenodo, so that they can be accessed reproducibly and programmatically via a REST API. This addresses the problems we were having with the v0.3.x releases, in which the original data on the agency websites was liable to be modified long after its “final” release, rendering it incompatible with our software. These scrapers and the Zenodo archiving scripts can be found in our pudl-scrapers and pudl-zenodo-storage repositories. The archives themselves can be found within the Catalyst Cooperative community on Zenodo

  • There’s an experimental caching system that allows these Zenodo archives to work as long-term “cold storage” for citation and reproducibility, with cloud object storage acting as a much faster way to access the same data for day to day non-local use, implemented by @rousik

  • We’ve decided to shift to producing a combination of relational databases (SQLite files) and columnar data stores (Apache Parquet files) as the primary outputs of PUDL. Tabular Data Packages didn’t end up serving either database or spreadsheet users very well. The CSV file were often too large to access via spreadsheets, and users missed out on the relationships between data tables. Needing to separately load the data packages into SQLite and Parquet was a hassle and generated a lot of overly complicated and fragile code.

Known Issues

  • The EIA 861 and FERC 714 data are not yet integrated into the SQLite database outputs, because we need to overhaul our entity resolution process to accommodate them in the database structure. That work is ongoing, see #639

  • The EIA 860 and EIA 923 data don’t cover exactly the same rage of years. EIA 860 only goes back to 2004, while EIA 923 goes back to 2001. This is because the pre-2004 EIA 860 data is stored in the DBF file format, and we need to update our extraction code to deal with the different format. This means some analyses that require both EIA 860 and EIA 923 data (like the calculation of heat rates) can only be performed as far back as 2004 at the moment. See #848

  • There are 387 EIA utilities and 228 EIA palnts which appear in the EIA 923, but which haven’t yet been assigned PUDL IDs and associated with the corresponding utilities and plants reported in the FERC Form 1. These entities show up in the 2001-2008 EIA 923 data that was just integrated. These older plants and utilities can’t yet be used in conjuction with FERC data. When the EIA 860 data for 2001-2003 has been integrated, we will finish this manual ID assignment process. See #848, #1069

  • 52 of the algorithmically assigned plant_id_ferc1 values found in the plants_steam_ferc1 table are currently associated with more than one plant_id_pudl value (99 PUDL plant IDs are involved), indicating either that the algorithm is making poor assignments, or that the manually assigned plant_id_pudl values are incorrect. This is out of several thousand distinct plant_id_ferc1 values. See #954

  • The county FIPS codes associated with coal mines reported in the Fuel Receipts and Costs table are being treated inconsistently in terms of their data types, especially in the output functions, so they are currently being output as floating point numbers that have been cast to strings, rather than zero-padded integers that are strings. See #1119

0.3.2 (2020-02-17)

The primary changes in this release:

  • The 2009-2010 data for EIA 860 have been integrated, including updates to the data validation test cases.

  • Output tables are more uniform and less restrictive in what they include, no longer requiring PUDL Plant & Utility IDs in some tables. This release was used to compile v1.1.0 of the PUDL Data Release, which is archived at Zenodo under this DOI: https://doi.org/10.5281/zenodo.3672068

    With this release, the EIA 860 & 923 data now (finally!) cover the same span of time. We do not anticipate integrating any older EIA 860 or 923 data at this time.

0.3.1 (2020-02-05)

A couple of minor bugs were found in the preparation of the first PUDL data release:

  • No maximum version of Python was being specified in setup.py. PUDL currently only works on Python 3.7, not 3.8.

  • epacems_to_parquet conversion script was erroneously attempting to verify the availability of raw input data files, despite the fact that it now relies on the packaged post-ETL epacems data. Didn’t catch this before since it was always being run in a context where the original data was lying around… but that’s not the case when someone just downloads the released data packages and tries to load them.

0.3.0 (2020-01-30)

This release is mostly about getting the infrastructure in place to do regular data releases via Zenodo, and updating ETL with 2018 data.

Added lots of data validation / quality assurance test cases in anticipation of archiving data. See the pudl.validate module for more details.

New data since v0.2.0 of PUDL:

  • EIA Form 860 for 2018

  • EIA Form 923 for 2018

  • FERC Form 1 for 1994-2003 and 2018 (select tables)

We removed the FERC Form 1 accumulated depreciation table from PUDL because it requires detailed row-mapping in order to be accurate across all the years. It and many other FERC tables will be integrated soon, using new row-mapping methods.

Lots of new plants and utilities integrated into the PUDL ID mapping process, for the earlier years (1994-2003). All years of FERC 1 data should be integrated for all future ferc1 tables.

Command line interfaces of some of the ETL scripts have changed, see their help messages for details.

0.2.0 (2019-09-17)

This is the first release of PUDL to generate data packages as the canonical output, rather than loading data into a local PostgreSQL database. The data packages can then be used to generate a local SQLite database, without relying on any software being installed outside of the Python requirements specified for the catalyst.coop package.

This change will enable easier installation of PUDL, as well as archiving and bulk distribution of the data products in a platform independent format.

0.1.0 (2019-09-12)

This is the only release of PUDL that will be made that makes use of PostgreSQL as the primary data product. It is provided for reference, in case there are users relying on this setup who need access to a well defined release.