Basic Usage

Quickstart

If you’ve already installed PUDL using conda, set up a workspace, and activated the PUDL conda environment, then from within your workspace you should be able to run the example ETL and bring up a Jupyter notebook that works with the resulting PUDL data by running:

$ pudl_etl settings/pudl_etl_example.yml
$ jupyter-lab --notebook-dir=notebooks

Note

This example only downloads and processes a small portion of the available data as a demonstration. You can copy the example settings file and edit it to add more data. See the data catalog for a full listing of all the available data.

Running the ETL Pipeline

PUDL implements a data processing pipeline. This pipeline takes raw data provided by public agencies in a variety of formats and integrates it into a single (more) coherent whole. In the data-science world this is often called “ETL,” which stands for “Extract, Transform, Load” (the pattern is sketched in miniature after the list):

  • Extract the data from its original source formats into pandas.DataFrame objects for easy manipulation.

  • Transform the extracted data into tidy tabular data structures, applying a variety of cleaning routines, and creating connections both within and between the various datasets.

  • Load the data into a standardized output, in our case CSV/JSON-based Tabular Data Packages.
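To make the pattern concrete, here is a deliberately simplified sketch of the Extract-Transform-Load idea. The file names and cleaning steps are made up for illustration; PUDL’s real pipeline lives in the subpackages described below:

# A simplified illustration of the Extract-Transform-Load pattern.
# The file names here are hypothetical, not part of PUDL.
import pandas as pd

# Extract: read an original source format into a DataFrame.
raw = pd.read_csv("raw_agency_data.csv")

# Transform: clean and tidy the extracted data.
tidy = (
    raw.rename(columns=str.lower)  # normalize column names
       .dropna(how="all")          # drop entirely empty records
       .drop_duplicates()
)

# Load: write the cleaned data out in a standardized format.
tidy.to_csv("cleaned_agency_data.csv", index=False)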

The PUDL Python package is organized around these same steps, with pudl.extract and pudl.transform subpackages that contain dataset-specific modules like pudl.extract.ferc1 and pudl.transform.eia923. The Load step is currently just a single module called pudl.load.

The ETL pipeline is coordinated by the top-level pudl.etl module, which has a command-line interface accessible via the pudl_etl script that is installed with the PUDL Python package. The script takes a YAML settings file as input. An example is provided in the settings folder that is created when you run pudl_setup (see: Creating a Workspace).
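If you’re curious what a settings file contains, it’s an ordinary YAML document, so you can peek at its top-level parameters from Python. This is just a minimal sketch using PyYAML; the actual keys depend on the example file shipped with your version of PUDL:

# Peek at the top-level parameters in a PUDL settings file.
import yaml

with open("settings/pudl_etl_example.yml") as f:
    settings = yaml.safe_load(f)

print(sorted(settings))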

To run the ETL pipeline on the example settings, from within your PUDL workspace run:

$ pudl_etl settings/pudl_etl_example.yml

This should produce a fair amount of Python logging output describing what the script is doing, along with new outputs in the sqlite and datapackage directories within your workspace. In particular, you should see a new file at sqlite/ferc1.sqlite and a new directory at datapackage/pudl-example.
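To confirm the cloned FERC Form 1 database landed where expected, you can list its tables with Python’s built-in sqlite3 module. This is just a quick sanity check, not part of the pipeline itself:

# List the tables in the freshly cloned FERC Form 1 database.
import sqlite3

conn = sqlite3.connect("sqlite/ferc1.sqlite")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
).fetchall()
print([name for (name,) in tables])
conn.close()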

Under the hood, the pudl_etl script has downloaded data from the federal agencies and organized it into a local datastore, cloned the original FERC Form 1 database into that ferc1.sqlite file, extracted data from that database and from a variety of Microsoft Excel spreadsheets and CSV files, and combined it all into the pudl-example tabular data package. The metadata describing the overall structure of the output is found in datapackage/pudl-example/datapackage.json, and the associated data are stored in a collection of CSV files (some of which may be gzip compressed) in the datapackage/pudl-example/data/ directory.
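One way to explore the output is to read the package metadata and then load one of its resources with pandas, which decompresses gzipped CSVs transparently. This sketch assumes only the structure of a standard Tabular Data Package (a “resources” list whose entries have “name” and “path” fields); which resources you actually get depends on your settings:

# Inspect the tabular data package produced by the example ETL run.
import json
import pandas as pd

with open("datapackage/pudl-example/datapackage.json") as f:
    metadata = json.load(f)

# Each resource describes one CSV file of output data.
for resource in metadata["resources"]:
    print(resource["name"], "->", resource["path"])

# Resource paths are relative to the datapackage.json file, and pandas
# handles .gz compressed CSVs based on the file extension.
first = metadata["resources"][0]["path"]
df = pd.read_csv("datapackage/pudl-example/" + first)
print(df.head())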

You can use the pudl_etl script to download and process more or different data by copying the settings/pudl_etl_example.yml file, editing the copy, and running the script again with your new settings file as an argument. Comments in the example settings file explain the available parameters.
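For example, you might copy the example file under a new (arbitrary) name before editing it, so the original stays intact for reference:

# Make a personal copy of the example settings file to customize.
# The destination name is arbitrary; after editing it, run:
#   $ pudl_etl settings/my_custom_etl.yml
import shutil

shutil.copy("settings/pudl_etl_example.yml", "settings/my_custom_etl.yml")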

Todo

  • Create updated example settings file, ensure it explains all available options.

  • Integrate datastore management and ferc1 DB cloning into pudl_etl script.

It’s sometimes useful to update the datastore or clone the FERC Form 1 database independently of the full ETL pipeline. Those (optional) processes are explained next.