PUDL implements a data processing pipeline. This pipeline takes raw data provided by public agencies in a variety of formats and integrates it into a single (more) coherent whole. In the data science world this is often called “ETL,” which stands for “Extract, Transform, Load”:
1. Extract the data from its original source formats into pandas.DataFrame objects for easy manipulation.
2. Transform the extracted data into tidy tabular data structures, applying a variety of cleaning routines, and creating connections both within and between the various datasets.
3. Load the data into a standardized output, in our case CSV/JSON based Tabular Data Packages, and subsequently an SQLite database or Apache Parquet files.
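As a minimal sketch of what those three stages look like using plain pandas (the file names and cleaning steps here are illustrative, not PUDL's actual code):

    import pandas as pd

    # Extract: pull a raw source file into a DataFrame.
    raw = pd.read_excel("eia923_generation.xlsx")  # hypothetical input file

    # Transform: tidy the table by normalizing column names and dropping empty rows.
    tidy = raw.rename(columns=str.lower).dropna(how="all")

    # Load: write the cleaned table to a standardized CSV output.
    tidy.to_csv("data/generation.csv", index=False)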
The PUDL python package is organized around these steps as well, with pudl.extract and pudl.transform subpackages that contain dataset-specific modules like pudl.transform.eia923. The Load step is handled by the pudl.load subpackage, which contains modules that deal separately with generating the CSVs containing the output data (pudl.load.csv) and the JSON files that contain the corresponding metadata (pudl.load.metadata).
The ETL pipeline is coordinated by the top-level pudl.etl module, which has a command line interface accessible via the pudl_etl script installed by the PUDL Python package. The script reads a YAML file as input. An example is provided in the settings folder that is created when you run pudl_setup (see: Creating a Workspace).
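The settings file is ordinary YAML, so it can also be inspected or modified programmatically with PyYAML; a minimal sketch (the exact keys are documented by the comments in the example file):

    import yaml

    # Load the example ETL settings shipped with the workspace.
    with open("settings/etl_example.yml") as f:
        settings = yaml.safe_load(f)

    # Print the top-level parameters the pipeline will be run with.
    print(sorted(settings.keys()))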
To run the ETL pipeline for the example, from within your PUDL workspace you would need to run five commands, which download the original data, clone the FERC Form 1 database, convert that and other raw data into datapackages, load those datapackages into an SQLite database, and convert the EPA CEMS data into Apache Parquet files, respectively:
    $ pudl_data --sources eia923 eia860 ferc1 epacems epaipm --years 2018 --states id
    $ ferc1_to_sqlite settings/ferc1_to_sqlite_example.yml
    $ pudl_etl settings/etl_example.yml
    $ datapkg_to_sqlite \
        -o datapkg/pudl-example/pudl-merged \
        datapkg/pudl-example/ferc1-example/datapackage.json \
        datapkg/pudl-example/eia-example/datapackage.json \
        datapkg/pudl-example/epaipm-example/datapackage.json
    $ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json
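If you would rather drive these stages from Python than type them into a shell, the same scripts can be invoked with the standard library; a minimal sketch using a subset of the commands above:

    import subprocess

    # Run the pipeline stages in order, aborting if any step fails.
    steps = [
        ["pudl_data", "--sources", "ferc1", "--years", "2018", "--states", "id"],
        ["ferc1_to_sqlite", "settings/ferc1_to_sqlite_example.yml"],
        ["pudl_etl", "settings/etl_example.yml"],
    ]
    for step in steps:
        subprocess.run(step, check=True)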
Those commands should result in a bunch of Python logging output describing what the scripts are doing, and new outputs in the sqlite, datapkg, and parquet directories within your workspace. In particular, you should see new SQLite database files in the sqlite directory, and a new directory at datapkg/pudl-example containing several datapackage directories, one for each of the ferc1, eia (Forms 860 and 923), epacems-eia, and epaipm datasets.
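A quick way to verify that the bundle was written is to look for the datapackage.json descriptors it should contain, using only the paths mentioned above:

    from pathlib import Path

    # Every datapackage directory in the bundle carries its own metadata file.
    bundle = Path("datapkg/pudl-example")
    for descriptor in sorted(bundle.glob("*/datapackage.json")):
        print(descriptor)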
Under the hood, these scripts are extracting data from the datastore, including spreadsheets, CSV files, and binary DBF files, generating an SQLite database containing the raw FERC Form 1 data, and combining it all into pudl-example, a bundle of tabular data packages that can be used together to create a database.
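Because Tabular Data Packages follow the Frictionless Data specification, they can be opened with the datapackage Python library; a minimal sketch, assuming the library is installed and the example bundle has been generated:

    from datapackage import Package

    # Open one data package from the bundle via its metadata descriptor.
    pkg = Package("datapkg/pudl-example/ferc1-example/datapackage.json")

    # List the tabular resources (CSV files) the package describes.
    print(pkg.resource_names)

    # Read the first resource into a list of dicts keyed by column name.
    rows = pkg.get_resource(pkg.resource_names[0]).read(keyed=True)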
Each of the data packages in the bundle has metadata describing its structure, stored in a file called datapackage.json. The data itself is stored in a bunch of CSV files (some of which may be compressed) in the data/ directory of each data package.
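Those CSVs can be read directly with pandas, which infers compression from the file extension; a small sketch (the table name here is hypothetical):

    import pandas as pd

    # pandas transparently decompresses gzipped CSVs based on the .gz suffix.
    df = pd.read_csv("datapkg/pudl-example/ferc1-example/data/some_table.csv.gz")
    print(df.head())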
You can use the
pudl_etl script to process more or different data by
copying and editing the
settings/etl_example.yml file, and running the
script again with your new settings file as an argument. Comments in the
example settings file explain the available parameters.
If you want to re-run
pudl_etl and replace an existing bundle of data
packages, you can use
--clobber. If you want to generate a new bundle of data packages with a new or modified settings file, you can change the name of the output datapackage bundle in the configuration file.
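A sketch of that workflow in Python, assuming the bundle name lives under a key like datapkg_bundle_name (a hypothetical name; check the example settings file for the real key):

    import shutil
    import yaml

    # Copy the example settings so the original stays intact.
    shutil.copy("settings/etl_example.yml", "settings/etl_custom.yml")

    with open("settings/etl_custom.yml") as f:
        settings = yaml.safe_load(f)

    # Rename the output bundle so a new run does not clash with pudl-example.
    settings["datapkg_bundle_name"] = "pudl-custom"  # hypothetical key name

    with open("settings/etl_custom.yml", "w") as f:
        yaml.safe_dump(settings, f)

Then re-run the pipeline with pudl_etl settings/etl_custom.yml.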