The Public Utility Data Liberation Project
PUDL makes US energy data easier to access and use. Hundreds of gigabytes of information are available from government agencies, but the data is often difficult to work with, and different sources can be hard to combine. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified tabular data packages that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.
The project currently integrates data from:
EIA Form 860
EIA Form 923
EPA CEMS Hourly Emissions
EPA IPM
FERC Form 1
The project is focused on serving researchers, activists, journalists, and policy makers who might not otherwise be able to afford access to this data from existing commercial data providers. You can sign up for PUDL email updates here.
Create and activate a conda environment named pudl that installs packages from the community-maintained conda-forge channel. In addition to the catalystcoop.pudl package, install JupyterLab so we can work with the PUDL example notebooks interactively.
$ conda create -y -n pudl -c conda-forge \
    --strict-channel-priority python=3.7 \
    catalystcoop.pudl jupyter jupyterlab pip
$ conda activate pudl
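If you want to confirm the installation worked before going further, you can check for the package from inside the activated environment. This is a minimal sketch; it assumes the catalystcoop.pudl distribution is importable as pudl:

# Run inside the activated "pudl" conda environment.
import pkg_resources
import pudl  # the catalystcoop.pudl package imports as "pudl" (assumed)

print(pkg_resources.get_distribution("catalystcoop.pudl").version)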
Now create a data management workspace called pudl-work and download EIA, EPA, and FERC data for 2018 using the pudl_data script. The workspace has a well-defined directory structure that PUDL uses to organize the data it downloads, processes, and outputs. Run pudl_setup --help and pudl_data --help for details.
$ mkdir pudl-work
$ pudl_setup pudl-work
$ pudl_data --sources eia923 eia860 ferc1 epacems epaipm \
    --years 2018 --states id
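To see what pudl_setup actually created, you can walk the top of the workspace tree. This is a minimal sketch; the exact subdirectories vary with the PUDL version:

import os

# Print the workspace directory tree, two levels deep:
for root, dirs, files in os.walk("pudl-work"):
    depth = root.count(os.sep)
    if depth < 2:
        print("  " * depth + os.path.basename(root) + "/")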
Now that we have some raw data, we can run the PUDL ETL (Extract, Transform, Load) pipeline to clean and integrate it. There are several steps: cloning the FERC Form 1 database into SQLite, extracting data from that database and all the other sources and cleaning it up, outputting that data into well-organized, CSV/JSON-based data packages, and finally loading those data packages into a local database.
PUDL provides a script to clone the FERC Form 1 database. The script is controlled by a YAML file, which you can find in the settings folder:
$ ferc1_to_sqlite pudl-work/settings/ferc1_to_sqlite_example.yml
The main ETL process is controlled by another YAML file defining the data that will be processed. A well-commented etl_example.yml can be found in the settings directory of the PUDL workspace you set up. The script that runs the ETL process is called pudl_etl:
$ pudl_etl pudl-work/settings/etl_example.yml
This will generate a bundle of tabular data packages in pudl-work/datapkg/pudl-example.
Tabular data packages are made up of CSV and JSON files. They’re relatively easy to parse programmatically, and readable by humans. They are also well suited to archiving, citation, and bulk distribution, but they are static.
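For example, a package's metadata and data can be read with the Python standard library and pandas alone. This is a minimal sketch; the ferc1-example package comes from the example ETL run above, and the resources layout assumes the Frictionless tabular data package spec:

import json
from pathlib import Path

import pandas as pd

# One of the data packages generated by the example ETL run above:
pkg_dir = Path("pudl-work/datapkg/pudl-example/ferc1-example")

# Every tabular data package describes its contents in datapackage.json:
metadata = json.loads((pkg_dir / "datapackage.json").read_text())
for resource in metadata["resources"]:
    print(resource["name"], "->", resource["path"])

# Each resource is a plain CSV file, readable with pandas:
df = pd.read_csv(pkg_dir / metadata["resources"][0]["path"])
print(df.head())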
To make the data easier to query and work with interactively, we typically load it into a local SQLite database using the datapkg_to_sqlite script, which combines several data packages from the same bundle into a single data package:
$ datapkg_to_sqlite \
    -o pudl-work/datapkg/pudl-example/pudl-merged \
    pudl-work/datapkg/pudl-example/ferc1-example/datapackage.json \
    pudl-work/datapkg/pudl-example/eia-example/datapackage.json \
    pudl-work/datapkg/pudl-example/epaipm-example/datapackage.json
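Once loaded, the result is an ordinary SQLite database that any SQL client can query. This is a minimal sketch; the pudl.sqlite filename under pudl-work/sqlite/ is an assumption, so check that directory for the actual output:

import sqlite3

import pandas as pd

# Assumed default location of the merged database within the workspace:
conn = sqlite3.connect("pudl-work/sqlite/pudl.sqlite")

# List the tables that were loaded from the merged data package:
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
)
print(tables)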
The EPA CEMS data is 100 times larger than all of the other data we have integrated thus far, and loading it into SQLite takes a very long time. We’ve found the most convenient way to work with it is using Apache Parquet files, and have a script that converts the EPA CEMS Hourly table from the generated datapackage into that format. To convert the example EPA CEMS data package you can run:
$ epacems_to_parquet pudl-work/datapkg/pudl-example/epacems-eia-example/datapackage.json
The resulting Apache Parquet dataset will be stored in pudl-work/parquet/epacems and will be partitioned by year and by state, so that you can read in only the relevant portions of the dataset. (Though in the example, you'll only find 2018 data for Idaho.)
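Because the dataset is partitioned, pandas (via pyarrow) can read just the slice you need. This is a minimal sketch; the year and state partition-column names are assumptions based on the partitioning described above:

import pandas as pd

# Read only the 2018 Idaho data, the sole partition in this example:
epacems = pd.read_parquet(
    "pudl-work/parquet/epacems",
    filters=[("year", "=", 2018), ("state", "=", "ID")],
)
print(epacems.info())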
Now that you have a live database, you can easily work with it using a variety of tools, including Python, pandas dataframes, and Jupyter Notebooks. This command will start up a local Jupyter notebook server and open a notebook of PUDL usage examples:
$ jupyter lab pudl-work/notebook/pudl_intro.ipynb
For more usage and installation details, see our more in-depth documentation on Read The Docs.
Contributing to PUDL
Find PUDL useful? Want to help make it better? There are lots of ways to contribute!
Please be sure to read our Code of Conduct.
You can file a bug report, make a feature request, or ask questions in the GitHub issue tracker.
Feel free to fork the project and make a pull request with new code, better documentation, or example notebooks.
Make a recurring financial contribution to support our work liberating public energy data.
Hire us to do some custom analysis, and let us add the code to the project.
For more information, check out our Contribution Guidelines.
For help with initial setup, usage questions, bug reports, suggestions to make PUDL better, and anything else that could conceivably be of use or interest to the broader community of users, use the PUDL issue tracker on GitHub. For private communication about the project, you can email the team: firstname.lastname@example.org
About Catalyst Cooperative
Catalyst Cooperative is a small group of data scientists and policy wonks. We’re organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy making. Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
Do you work on renewable energy or climate policy? Have you found yourself scraping data from government PDFs, spreadsheets, websites, and databases, without getting something reusable? We build tools to pull this kind of information together reliably and automatically so you can focus on your real work instead — whether that’s political advocacy, energy journalism, academic research, or public policy making.