Data Access#

We publish the PUDL pipeline outputs in several ways to serve different users and use cases. We’re always trying to make PUDL data more accessible, so if you have a suggestion, please open a GitHub issue. If you have a question, you can create a GitHub discussion.

How Should You Access PUDL Data?#

We provide four primary ways of interacting with PUDL data. Here’s how to find out which one is right for you and your use case.

Access Method: Datasette
Types of User: Curious Explorer, Spreadsheet Analyst, Web Developer
Use Cases: Explore the PUDL database interactively in a web browser. Select data to download as CSVs for local analysis in spreadsheets. Create shareable links to a particular selection of data. Access PUDL data via a REST API.

Access Method: Nightly Builds
Types of User: Cloud Developer, Database User, Beta Tester
Use Cases: Get the freshest data that has passed all data validations, updated most weekday mornings. Fast downloads from AWS S3 storage buckets.

Access Method: Zenodo Archives
Types of User: Researcher, Database User, Notebook Analyst
Use Cases: Use a stable, citable, fully processed version of PUDL on your own computer. Use PUDL in Jupyter Notebooks running in a stable, archived Docker container. Access the SQLite DB and Parquet files directly using any toolset.

Access Method: Development Environment
Types of User: Python Developer, Data Wrangler
Use Cases: Run the PUDL data processing pipeline on your own computer. Edit the PUDL source code and run the software tests and data validations. Integrate a new data source or newly released data from one of the existing sources.

Datasette#

We provide web-based access to the PUDL data via a Datasette deployment at https://data.catalyst.coop.

Datasette is an open source tool that wraps SQLite databases in an interactive front end. It allows users to browse database tables, select portions of them using dropdown menus, build their own SQL queries, and download data as CSVs. It also provides a REST API, so the data in the database can be queried programmatically. All the query parameters are stored in the URL, so you can also share links to the data you’ve selected.
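For example, you can query the REST API from the command line with curl by appending .json to the database URL and passing a SQL query. This is a minimal sketch: the pudl database name matches the deployment above, but the table queried here is an assumption, so substitute any table you find while browsing.

# _shape=array returns the rows as a plain JSON array; the table name is illustrative
$ curl "https://data.catalyst.coop/pudl.json?sql=select+*+from+plants_entity_eia+limit+5&_shape=array"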

Note

The only SQLite database containing cleaned and integrated data is the core PUDL database. There are also several FERC SQLite databases derived from FERC’s old Visual FoxPro and new XBRL data formats, which we publish as SQLite to improve the accessibility of the raw inputs, but they should generally not be used directly if the data you need has already been integrated into the PUDL database.

Nightly Builds#

Every night we attempt to process all of the data that’s part of PUDL using the most recent version of the dev branch. If the ETL succeeds and the resulting outputs pass all of the data validation tests we’ve defined, the outputs are automatically uploaded to the AWS Open Data Registry and used to deploy a new version of Datasette (see above). These nightly build outputs can be accessed using the AWS CLI, or programmatically via the S3 API, and can also be downloaded directly over HTTPS.
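For example, with the AWS CLI installed you can list and fetch the build outputs anonymously. This is a minimal sketch: it assumes the bucket is named pudl.catalyst.coop and that the outputs live under a nightly/ prefix, so check the AWS Open Data Registry entry for the authoritative paths.

# bucket name and nightly/ prefix are assumptions; verify against the registry entry
$ aws s3 ls --no-sign-request s3://pudl.catalyst.coop/nightly/
$ aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/pudl.sqlite.gz .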

Note

To reduce network transfer times, we gzip the SQLite database files, which can be quite large when uncompressed. To decompress them locally, you can use the gunzip command.

$ gunzip *.sqlite.gz

Zenodo Archives#

We use Zenodo to archive our fully processed data as SQLite databases and Parquet files. We also archive a Docker image that contains the software environment required to use PUDL within Jupyter Notebooks. You can find all our archived data products in the Catalyst Cooperative Community on Zenodo.

  • The current version of the archived data and Docker container can be downloaded from this Zenodo archive.

  • Detailed instructions on how to access the archived PUDL data using a Docker container can be found in our PUDL Examples repository.

  • The SQLite databases and Parquet files containing the PUDL data, the complete FERC 1 database, and the EPA CEMS hourly data are all included in that same archive, in case you want to access them directly without using the PUDL software, as sketched below.
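For instance, once you’ve downloaded and decompressed the archive, you can inspect the main database with the stock sqlite3 command-line shell. This is a minimal sketch, assuming the database file is named pudl.sqlite and that the table queried exists; check the filenames in the archive you actually downloaded.

# the filename and table name here are assumptions based on past archives
$ sqlite3 pudl.sqlite '.tables'
$ sqlite3 pudl.sqlite 'SELECT COUNT(*) FROM plants_entity_eia;'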

Note

If you’re already familiar with Docker, you can also pull the image we use to run Jupyter directly:

$ docker pull catalystcoop/pudl-jupyter:latest
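A typical invocation would then look something like the following; the port mapping is an assumption based on Jupyter’s default, so see the PUDL Examples repository for the exact supported options.

# -p 8888:8888 assumes the default Jupyter port inside the container
$ docker run -it -p 8888:8888 catalystcoop/pudl-jupyter:latest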

Development Environment#

If you want to run the PUDL data processing pipeline yourself from scratch, run the software tests, or make changes to the source code, you’ll need to set up our development environment. This is a bit involved, so it has its own separate documentation.

Most users shouldn’t need to do this, and will probably find it easier to work with the pre-processed data via one of the other access modes. But if you want to contribute to the project, please give it a shot!