Running the ETL Pipeline
So you want to run the PUDL data processing pipeline? This is the most involved way to get access to PUDL data. It’s only recommended if you want to edit the ETL process or contribute to the code base. Check out the Data Access documentation if you just want to use already processed data.
These instructions assume you have already gone through the development setup (see: Development Setup).
There are two main scripts that are involved in the PUDL processing pipeline:
ferc1_to_sqlite
converts the FERC Form 1 DBF files into a single large SQLite database so that the data is easier to extract, and so all of the raw FERC Form 1 data is available in a modern format.pudl_etl
coordinates the “Extract, Transform, Load” process that processes 20+ years worth of data from the FERC Form 1 database, dozens of EIA spreadsheets, and the thousands of CSV files that make up the EPA CEMS hourly emissions data into a clean, well normalized SQLite database (for the FERC and EIA data), and an Apache Parquet dataset that is partitioned by state and year (for the EPA CEMS).
A settings file dictates which datasets, years, tables, or states get run through the
the processing pipeline. Two example settings files are provided in the settings
folder that is created when you run pudl_setup
: the Fast ETL and the
Full ETL.
See also
Creating a Workspace for more on how to create a PUDL data workspace.
Settings Files for info details on the contents of the settings files.
The Fast ETL
Running the Fast ETL processes one year of data for each dataset. This is what we do in our software integration tests. Depending on your computer, it should take around 15 minutes total.
$ ferc1_to_sqlite settings/etl_fast.yml
$ pudl_etl settings/etl_fast.yml
The Full ETL
The Full ETL settings includes all all available data that PUDL can process. All the years, all the states, and all the tables, including the ~1 billion record EPA CEMS dataset. Assuming you already have the data downloaded, on a computer with at least 16 GB of RAM, and a solid-state disk, he Full ETL including EPA CEMS should take around 2 hours.
$ ferc1_to_sqlite settings/etl_full.yml
$ pudl_etl settings/etl_full.yml
Processing EPA CEMS Separately
If you don’t want to process the EPA CEMS data (which takes about an hour and 20
minutes), you can delete or comment out its lines in the settings file, and
process it independently later without reprocessing the FERC and EIA data using
the epacems_to_parquet
script, which will refer to your existing PUDL
database for the information it needs, as if the FERC and EIA ETL had just been
run.
Warning
If you process the EPA CEMS data after the fact, be careful that the version of PUDL used to generate the DB is the same as the one you’re using to process the CEMS data. Otherwise the process and data may be incompatible with unpredictable results.
Additional Notes
The commands above should result in a bunch of Python logging
output
describing what the script is doing, and file outputs in the sqlite
, and
parquet
directories within your workspace. When the ETL is complete, you
should see new files at sqlite/ferc1.sqlite
and sqlite/pudl.sqlite
as
well as a new directory at parquet/epacems
containing nested directories
named by year and state.
You can use the pudl_etl
script to process more or different data by copying
and editing either of the settings files and running the script again with your
new settings file as an argument. Comments in the example settings file explain
the available parameters. However, these examples are the only configurations
that are tested automatically and known to work reliably.
If you need to re-run ferc1_to_sqlite
or pudl_etl
and want to overwrite
their previous outputs you can use the --clobber
option. All of the PUDL
scripts have help messages if you want additional information (run script_name
--help
).