We publish the PUDL pipeline outputs in several ways to serve different users and use cases. We’re always trying to increase the accessibility of the PUDL data, so if you have a suggestion, please open a GitHub issue. If you have a question, you can start a GitHub discussion.
PUDL’s primary data output is the pudl.sqlite database. We recommend working with the tables that have the out_ prefix, as they contain the most complete and most analysis-ready data. For more information about the different types of tables, read through PUDL’s naming conventions.
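As a quick illustration, you can list the out_ tables with Python’s built-in sqlite3 module. This is a sketch, assuming you’ve already downloaded pudl.sqlite to your working directory; the function name is just for illustration:

```python
import sqlite3

def list_out_tables(db_path):
    """Return the names of the denormalized out_ tables in a PUDL SQLite DB."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name LIKE 'out\\_%' ESCAPE '\\' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# Assumes you've already downloaded and decompressed pudl.sqlite:
# print(list_out_tables("pudl.sqlite"))
```

The ESCAPE clause makes the underscore in the LIKE pattern literal, so only tables whose names actually begin with “out_” are returned.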
How Should You Access PUDL Data?
We provide four primary ways of interacting with PUDL data. Here’s how to find out which one is right for you and your use case.
Types of User

- Curious Explorer, Spreadsheet Analyst, Web Developer: Explore the PUDL database interactively in a web browser. Select data to download as CSVs for local analysis in spreadsheets. Create shareable links to a particular selection of data. Access PUDL data via a REST API.
- Data Scientist, Data Analyst, Jupyter Notebook User: Get easy Jupyter notebook access to all PUDL data products, including example notebooks. Updated weekly based on the nightly builds.
- Cloud Developer, Database User, Beta Tester: Get the freshest data that has passed all of our data validations, updated most weekday mornings. Fast, free downloads from AWS S3 storage buckets.
- Researcher, Database User, Notebook Analyst: Use a stable, citable, fully processed version of the PUDL data on your own computer. Access the SQLite DB and Parquet files directly using any toolset.
- Researcher, Data Wrangler: Access the data that feeds into PUDL, unmodified from its original sources.
- Python Developer, Data Wrangler: Run the PUDL data processing pipeline on your own computer. Edit the PUDL source code and run the software tests and data validations. Integrate a new data source, or newly released data from one of our existing sources.
Datasette is an open source tool that wraps SQLite databases in an interactive front end. It allows users to browse database tables, select portions of them using dropdown menus, build their own SQL queries, and download data as CSVs. It also provides a REST API, allowing the data in the database to be queried programmatically. All of the query parameters are stored in the URL, so you can share links to the data you’ve selected.
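Because Datasette exposes a JSON API, those same queries can be built programmatically. A minimal sketch, assuming the PUDL Datasette deployment lives at data.catalyst.coop and using a hypothetical out_ table name; the _size limit and column__gte filter follow Datasette’s standard query-string syntax:

```python
from urllib.parse import urlencode

# Base URL of the PUDL Datasette deployment (assumed; check the PUDL docs).
BASE = "https://data.catalyst.coop/pudl"
# Hypothetical table name -- substitute whichever out_ table you need.
table = "out_eia__yearly_generators"
# Datasette query-string filters: cap the row count, filter by column value.
params = {"_size": 5, "report_date__gte": "2020-01-01"}
url = f"{BASE}/{table}.json?{urlencode(params)}"
print(url)
```

Fetching that URL returns the selected rows as JSON; dropping the .json suffix gives the equivalent interactive browser view.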
The only SQLite database containing cleaned and integrated data is the core PUDL database. There are also several FERC SQLite databases derived from FERC’s old Visual FoxPro and new XBRL data formats, which we publish as SQLite to improve the accessibility of those raw inputs. They should generally not be used directly, however, if the data you need has already been integrated into the PUDL database.
Want to explore the PUDL data interactively in a Jupyter Notebook without needing to do any setup? Our nightly build outputs (see below) automatically update the PUDL Project Dataset on Kaggle once a week. There are several notebooks associated with the dataset, some curated by Catalyst and some contributed by other Kaggle users, which you can use to get oriented to the PUDL database.
Every night we attempt to process all of the data that’s part of PUDL using the most recent version of the main branch. If the ETL succeeds and the resulting outputs pass all of the data validation tests we’ve defined, the outputs are automatically uploaded to the AWS Open Data Registry, and used to deploy a new version of Datasette (see above). These nightly build outputs can be accessed using the AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over HTTPS using the following links:
Raw FERC Form 1:
Raw FERC Form 2:
Raw FERC Form 6:
Raw FERC Form 60:
Raw FERC Form 714:
To reduce network transfer times, we compress the SQLite databases using gzip. To decompress them locally at the command line on Linux, MacOS, or Windows, you can use the gunzip command. (Git for Windows installs gunzip by default, and it can also be installed using the conda package manager.)
$ gunzip *.sqlite.gz
If you’re not familiar with using Unix command line tools on Windows, you can also use a third-party tool like 7zip.
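If you’d rather handle decompression from Python than install command-line tools, the standard library’s gzip module works the same on every platform. A sketch; the helper name is our own, and the file names assume you’ve downloaded pudl.sqlite.gz:

```python
import gzip
import shutil

def gunzip_file(src, dst):
    """Decompress a gzip file (e.g. pudl.sqlite.gz -> pudl.sqlite)."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Stream decompressed bytes to disk without loading the whole DB in memory.
        shutil.copyfileobj(f_in, f_out)

# gunzip_file("pudl.sqlite.gz", "pudl.sqlite")
```

Streaming with shutil.copyfileobj matters here because the decompressed database is several gigabytes.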
If you want a specific, immutable version of our data, you can find all of our versioned data releases on Zenodo. Zenodo assigns long-lived DOIs to each archive, suitable for citation in academic journals and other publications. The most recent versioned PUDL data release can always be found using this Concept DOI: https://doi.org/10.5281/zenodo.3653158
The documentation for the latest such stable build is here. You can access the documentation for a specific version by hovering over the version selector at the bottom left of the page.
If you’re not after a specific version, but rather the latest stable version, you can find it on the AWS Open Data Registry, in the stable/ namespace. You can run
aws s3 ls --no-sign-request s3://pudl.catalyst.coop/stable/
to see what’s available.
Sometimes you want to see the raw data that is published by the government, but it’s hard to find or difficult to download, or you want to see what an older version of the published data looked like prior to being revised or deleted.
We use Zenodo to archive and version our raw data inputs. You can find all of our archives in the Catalyst Cooperative Community.
These archives have been minimally processed: in some cases we’ve compressed the files or grouped them into ZIP archives to fit Zenodo’s repository requirements, and in all cases we’ve added some metadata to help identify the resources you’re looking for. Apart from that, the datasets are unmodified.
If you want to run the PUDL data processing pipeline yourself from scratch, run the software tests, or make changes to the source code, you’ll need to set up our development environment. This is a bit involved, so it has its own separate documentation.
Most users shouldn’t need to do this, and will probably find working with the pre-processed data via one of the other access modes easier. But if you want to contribute to the project please give it a shot!