Data Access

We publish the PUDL pipeline outputs in several ways to serve different users and use cases. We’re always trying to increase accessibility of the PUDL data, so if you have suggestions or questions please open a GitHub issue or email us at pudl@catalyst.coop.

How Should You Access PUDL Data?

We provide four primary ways of interacting with PUDL data. Here’s how to find out which one is right for you and your use case.

Access Method | Types of User | Use Cases
--- | --- | ---
Datasette | Curious Explorer, Spreadsheet Analyst, Web Developer | Explore the PUDL database interactively in a web browser. Select data to download as CSVs for local analysis in spreadsheets. Create sharable links to a particular selection of data. Access PUDL data via a REST API.
Zenodo Archives | Researcher, Database User, Notebook Analyst | Use a stable, citable, fully processed version of the PUDL data on your own computer. Use PUDL in Jupyter Notebooks running in a stable, archived Docker container. Access the SQLite DB and Parquet files directly using any toolset.
JupyterHub | New Python User, Notebook Analyst | Work through the PUDL example notebooks without any downloads or setup. Perform your own notebook-based analyses using PUDL data and limited computational resources.
Development Environment | Python Developer, Data Wrangler | Run the PUDL data processing pipeline on your own computer. Edit the PUDL source code and run the software tests and data validations. Integrate a new data source or newly released data from one of the existing sources.
Data Packages | Deprecated | For working with our published data prior to v0.4.0

Datasette

We provide web-based access to the PUDL data via a Datasette deployment at https://data.catalyst.coop.

Datasette is an open source tool that wraps SQLite databases in an interactive front-end. It allows users to browse database tables, select portions of them using dropdown menus, build their own SQL queries, and download data to CSVs. It also creates a REST API allowing the data in the database to be queried programmatically. All the query parameters are stored in the URL, so you can also share links to the data you’ve selected.

Note that only data that have been fully integrated into the SQLite databases are available here. Currently this includes the core PUDL database and our concatenation of all historical FERC Form 1 databases.
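As a minimal sketch of programmatic access, the Python below builds a Datasette JSON URL for one table and downloads the matching rows. The `.json` suffix and the `_shape` and `_size` query parameters are standard Datasette features; the helper names `table_url` and `fetch_rows` are hypothetical, and any database, table, or column names you pass in need to be checked against the live deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.catalyst.coop"


def table_url(database, table, size=10, **filters):
    """Build a Datasette JSON URL for one table.

    Extra keyword arguments become exact-match column filters,
    e.g. state="CO" adds "&state=CO" to the query string.
    """
    params = {"_shape": "array", "_size": size, **filters}
    return f"{BASE_URL}/{database}/{table}.json?{urlencode(params)}"


def fetch_rows(url):
    """Download the URL and decode the JSON array of row objects."""
    with urlopen(url) as response:
        return json.load(response)
```

For example, `fetch_rows(table_url("pudl", "plants_eia860", size=5, state="CO"))` would request five rows, assuming a database named `pudl` with a table `plants_eia860` and a `state` column exists on the server (these names are illustrative assumptions, not confirmed by this page). Because all query parameters live in the URL, the same string can be pasted into a browser or shared as a link.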

Zenodo Archives

We use Zenodo to archive our fully processed data as SQLite databases and Parquet files. We also archive a Docker image that contains the software environment required to use PUDL within Jupyter Notebooks. You can find all our archived data products in the Catalyst Cooperative Community on Zenodo.

  • The current (beta) version of the archived data and Docker container can be downloaded from this Zenodo archive.

  • Detailed instructions on how to access the archived PUDL data using a Docker container can be found in our PUDL Examples repository.

  • The SQLite databases and Parquet files containing the PUDL data, the complete FERC 1 database, and EPA CEMS hourly data are contained in that same archive, if you want to access them directly without using PUDL.
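If you work with a downloaded SQLite file directly, a first step is usually to see which tables it contains and preview a few rows. The sketch below does this with only the Python standard library; the function names are hypothetical, and the path you pass in depends on where you unpacked the archive.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def preview_table(db_path, table, limit=5):
    """Return up to `limit` rows of one table as dictionaries."""
    # Check the name against the real table list before interpolating it,
    # because table names cannot be passed as SQL parameters.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        return [dict(row) for row in cursor]
```

Assuming the archive unpacks to a file such as `pudl.sqlite` (the exact file name is an assumption), `list_tables("pudl.sqlite")` returns the table names and `preview_table("pudl.sqlite", name)` shows the first few rows of one of them.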

Note

If you’re already familiar with Docker, you can also pull the image we use to run Jupyter directly:

$ docker pull catalystcoop/pudl-jupyter:latest

JupyterHub

We’ve set up a JupyterHub in collaboration with 2i2c.org which provides access to all of the processed PUDL data and the software environment required to work with it. You don’t have to download or install anything to use it, but we do need to create an account for you.

  • Request an account by submitting this form.

  • Once we’ve created an account for you, follow this link to log in and open up the first example notebook on the JupyterHub.

  • You can create your own notebooks and upload, save, and download modest amounts of data on the hub.

We can only offer a small amount of memory (4-6 GB) and processing power (1 CPU) per user on the JupyterHub for free. If you need to work with lots of data or do computationally intensive analysis, you may want to look into using the Zenodo Archives option on your own computer. The JupyterHub uses exactly the same data and software environment as the Zenodo Archives. Eventually we also want to offer paid access to the JupyterHub with more computing power.

Development Environment

If you want to run the PUDL data processing pipeline yourself from scratch, run the software tests, or make changes to the source code, you’ll need to set up our development environment. This is a bit involved, so it has its own separate documentation.

Most users shouldn’t need to do this, and will probably find working with the pre-processed data via one of the other access modes easier. But if you want to contribute to the project please give it a shot!

Data Packages

Note

Prior to v0.4.0 of PUDL we only published processed data as tabular data packages. As of v0.4.0 we distribute the SQLite databases and Apache Parquet files alongside a set of data packages. As of PUDL v0.5.0 we will generate SQLite and Apache Parquet outputs directly and will no longer archive tabular data packages as the format of record, so the format conversions described below will no longer be necessary.

Archived Data Packages

We periodically publish data packages containing the full outputs from the PUDL ETL pipeline on Zenodo, an open data archiving service provided by CERN. The most recent release can always be found through this concept DOI: 10.5281/zenodo.3653158. Each individual version of the data release is assigned its own unique DOI.

All of our archived products can be found in the Catalyst Cooperative Community on Zenodo. These archives and the DOIs associated with them should be permanently accessible, and are suitable for use as references in academic and other publications.

Once you’ve downloaded or generated your own tabular data packages, you will probably want to convert them into a more analysis-oriented file format. We typically use SQLite for the core FERC and EIA data, and Apache Parquet files for the very long tables like EPA CEMS.

Converting to SQLite

If you want to access the data via SQL, we have provided a script that loads several data packages into a local sqlite3 database. Note that these data packages must have all been generated by the same ETL run, or they will be considered incompatible by the script. For example, to load three data packages generated by our example ETL configuration into your local SQLite DB, you could run the following command from within your PUDL workspace:

$ datapkg_to_sqlite \
    datapkg/pudl-example/ferc1-example/datapackage.json \
    datapkg/pudl-example/eia-example/datapackage.json \

Run datapkg_to_sqlite --help for more details.

Converting to Apache Parquet

The EPA CEMS Hourly data approaches 100 GB in size uncompressed, which is too large to work with directly in memory on most systems and takes a very long time to load into SQLite. Instead, we recommend converting the Hourly Emissions table into an Apache Parquet dataset stored locally on disk, and then either reading in only parts of it using pandas or using Dask dataframes to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.

If you have generated an EPA CEMS data package, you can use the epacems_to_parquet script to convert the hourly emissions table like this:

$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json

The script will automatically generate a Parquet Dataset which is partitioned by year and state in the parquet/epacems directory within your workspace. Run epacems_to_parquet --help for more details.
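Because the dataset is partitioned by year and state, you can read only the partitions you need rather than the whole table. The sketch below shows one way to do that with pandas and the pyarrow engine. It assumes the partition columns are literally named `year` and `state` and that `year` is stored as an integer; the function names are hypothetical, so adjust them and the column names to match your generated dataset.

```python
def partition_filters(years=None, states=None):
    """Build pyarrow-style filters for the year/state partitions.

    Returns None when no restriction is requested, which tells
    the reader to load every partition.
    """
    filters = []
    if years:
        filters.append(("year", "in", list(years)))
    if states:
        filters.append(("state", "in", list(states)))
    return filters or None


def read_epacems(path, years=None, states=None, columns=None):
    """Read only the requested partitions and columns into a DataFrame."""
    # Imported here so the filter helper works without pandas installed.
    import pandas as pd  # requires pandas with the pyarrow engine

    return pd.read_parquet(
        path,
        engine="pyarrow",
        columns=columns,
        filters=partition_filters(years, states),
    )
```

For instance, `read_epacems("parquet/epacems", years=[2019], states=["CO"])` would load a single year and state. Passing a short `columns` list reduces memory use further, since Parquet stores each column separately. For analyses that span many partitions, `dask.dataframe.read_parquet` offers a similar interface that evaluates lazily.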