Data Catalog

Available Data


Write up more extensive descriptions of each dataset, what’s in them, what the ETL process looks like for each of them, etc. Maybe use this page as an index, with each dataset having its own catalog page. We’ve got a lot of this information written up elsewhere and should be able to cut-and-paste.

EIA Form 860

Source URL:
Source Format: Microsoft Excel (.xls/.xlsx)
Source Years:
Size (Download): 127 MB
Size (Uncompressed): 247 MB
Years Liberated:
Records Liberated:



open issues labeled epacems

All of the data reported to the EIA on Form 860 is being pulled into the PUDL database for the years 2011-2017.

We are working on integrating the 2009-2010 EIA 860 data, which has a similar format. That will give EIA 860 the same 2009-2017 coverage as EIA 923, which matters because the two datasets are tightly integrated.

In the meantime, we extend the 2011 EIA 860 data back to 2009 where needed to integrate it with EIA 923.

EIA Form 923

Source URL:
Source Format: Microsoft Excel (.xls/.xlsx)
Source Years:
Size (Download): 196 MB
Size (Uncompressed): 299 MB
Years Liberated:
Records Liberated: ~2 million


open issues labeled epacems

Nearly all of EIA Form 923 is being pulled into the PUDL database for the years 2009-2017. Earlier data is available from EIA, but the reporting format for those years differs substantially from the current one and will require more work to integrate. Monthly year-to-date releases are not yet being integrated.

EPA CEMS Hourly


Source URL:
Source Format: Comma Separated Value (.csv)
Source Years:
Size (Download): 7.6 GB
Size (Uncompressed): ~100 GB
Years Liberated:
Records Liberated: ~1 billion


open issues labeled epacems

All of the EPA’s hourly Continuous Emissions Monitoring System (CEMS) data is available. It is by far the largest dataset in PUDL at the moment, with hourly records for thousands of plants covering decades. Note that the ETL process can easily take all day for the full dataset. PUDL also provides a script that converts the raw EPA CEMS data into Apache Parquet files, which can be read and queried very efficiently from disk. For usage details run:

$ epacems_to_parquet --help
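
Because the raw EPA CEMS data arrives as very large CSV files, any direct pass over it should avoid loading a whole file into memory. The following is a minimal sketch of that idea using only Python's standard library. It is not PUDL's ETL code, and the column names (plant_id, operating_datetime, gross_load_mw) are hypothetical stand-ins rather than the actual EPA CEMS schema.

```python
import csv
import io
from collections import defaultdict


def total_load_by_plant(lines):
    """Sum a numeric column per plant while reading one row at a time.

    `lines` is any iterable of CSV text lines, such as an open file.
    The column names used here are illustrative assumptions.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        value = row["gross_load_mw"]
        if value == "":
            continue  # skip hours with no reported value
        totals[row["plant_id"]] += float(value)
    return dict(totals)


# A tiny in-memory stand-in for one raw hourly CSV file.
sample = io.StringIO(
    "plant_id,operating_datetime,gross_load_mw\n"
    "3,2017-01-01 00:00,100.0\n"
    "3,2017-01-01 01:00,110.5\n"
    "8,2017-01-01 00:00,\n"
    "8,2017-01-01 01:00,42.0\n"
)
totals = total_load_by_plant(sample)
print(totals)
```

With a real file, the same function could be called as total_load_by_plant(open(path, newline="")). Memory use stays flat regardless of file size, but every query still rereads the whole file, which is the cost the Parquet conversion is meant to avoid.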

Thanks to Karl Dunkle Werner for contributing much of the EPA CEMS Hourly ETL code.

EPA IPM


Source URL:
Source Format: Microsoft Excel (.xlsx)
Source Years:
Size (Download): 14 MB
Size (Uncompressed): 14 MB
Years Liberated:
Records Liberated:



open issues labeled epacems


Get Greg Schivley to write up a description of the EPA IPM dataset.

FERC Form 1

Source URL:
Source Format: FoxPro Database (.DBC/.DBF)
Source Years:
Size (Download): 1.4 GB
Size (Uncompressed): 2.5 GB
Years Liberated: 1994-2018 (raw), 2004-2017 (parboiled)
Records Liberated: ~12 million (raw), ~270,000 (parboiled)


open issues labeled

We have integrated a subset of the FERC Form 1 data, mostly pertaining to power plants, their capital and operating expenses, and fuel consumption, for 2004-2017. More work will be required to integrate the remaining years and data. However, we make all of the FERC Form 1 data available (7.2 GB of data in 116 tables, going back to 1994) in its raw form via an SQLite database. See Cloning FERC Form 1 for details.
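
Since the raw FERC Form 1 data is exposed as an SQLite database, it can be explored with Python's built-in sqlite3 module. The sketch below lists every table in a database along with its row count. It runs against a small in-memory stand-in so that it is self-contained. The table and column names are illustrative assumptions, and the path of the real cloned database is not specified here.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names cannot break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# In-memory stand-in for the cloned database (hypothetical tables and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f1_steam (respondent_id INTEGER, report_year INTEGER)")
conn.execute("CREATE TABLE f1_fuel (respondent_id INTEGER, report_year INTEGER)")
conn.executemany(
    "INSERT INTO f1_steam VALUES (?, ?)", [(1, 2004), (1, 2005), (2, 2004)]
)
conn.execute("INSERT INTO f1_fuel VALUES (?, ?)", (1, 2004))

counts = table_row_counts(conn)
print(counts)
```

To inspect an actual clone, replace ":memory:" with the path to the SQLite file produced by the cloning process and skip the CREATE and INSERT statements.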

We continue to improve the integration between the FERC Form 1 plants and the EIA plants and generators, many of which represent the same utility assets. Over time, if there’s demand, we may pull in and clean up additional FERC Form 1 tables.

When we integrate the 2018 FERC Form 1 data, we will also attempt to extend coverage for already integrated tables as far back as 1994.

Work in Progress

Thanks to a grant from the Alfred P. Sloan Foundation Energy & Environment Program, we have support to integrate the following new datasets.

EIA Form 861

Source URL:
Source Format: Microsoft Excel (.xls/.xlsx)
Source Years:
Size (Download):
Size (Uncompressed):
Years Liberated:
Records Liberated:


open issues labeled epacems

This form includes information about utility demand-side management programs, distribution systems, total sales by customer class, net generation, ultimate disposition of power, and more. It is a smaller dataset (100s of MB) distributed as Microsoft Excel spreadsheets.


ISO/RTO LMP

This dataset covers locational marginal pricing (LMP) of electricity, as published by the various grid operators (e.g. MISO, CAISO, ISO-NE, PJM, ERCOT…). At high time resolution, with many different delivery nodes, this will be a very large dataset (hundreds of GB). Each ISO publishes the data in a different format, and the physical location of the delivery nodes is not always publicly available.

Future Data

There’s a huge variety and quantity of data about the US electric utility system available to the public. The data listed above is just the beginning! Other datasets we’ve heard demand for are listed below. If you’re interested in using one of them and would like to add it to PUDL, check out our contribution guidelines. If there are other datasets you think we should look at integrating, don’t hesitate to open an issue on GitHub requesting the data and explaining why it would be useful.

EIA Water Usage

EIA Water records water use by thermal generating stations in the US.

FERC Form 714

FERC Form 714 includes hourly loads, reported by load balancing authorities annually. This is a modestly sized dataset, in the 100s of MB, distributed as Microsoft Excel spreadsheets.


FERC EQR

The FERC EQR, also known as the Electric Quarterly Report or Form 920, includes the details of many transactions between different utilities, and between utilities and merchant generators. It covers ancillary services as well as energy and capacity, time and location of delivery, prices, contract length, etc. It’s one of the few public sources of information about renewable energy power purchase agreements (PPAs). This is a large (~100s of GB) dataset, composed of a very large number of relatively clean CSV files, but it requires fuzzy processing to get at some of the interesting attributes that are only indirectly reported.

MSHA Mines and Production

The MSHA Mines & Production dataset describes coal production by mine and operating company, along with statistics about labor productivity and safety. This is a smaller dataset (100s of MB) available as relatively clean and well-structured CSV files.

PHMSA Natural Gas Pipelines

The PHMSA Natural Gas Pipelines dataset, published by the Pipeline and Hazardous Materials Safety Administration (part of the US Dept. of Transportation), contains data about natural gas transmission and distribution systems, including their age, length, diameter, materials, and carrying capacity.

Transmission and Distribution Systems

In order to run electricity system operations models and cost optimizations, you need some kind of model of the interconnections between generation and loads. There doesn’t appear to be a generally accepted, publicly available set of these network descriptions (yet!).