Installation and Setup

System Requirements

Note

The PUDL data processing pipeline does a lot of work in-memory with pandas.DataFrame objects. Exhaustive record linkage within the 25 years of FERC Form 1 data requires up to 24 GB of memory. The full EPA CEMS Hourly dataset is nearly 100 GB on disk uncompressed.
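As a rough pre-flight check, the following sketch uses only the Python standard library to report how much free disk space is available at a candidate location, and compares it with the roughly 100 GB figure quoted above. The function name free_disk_gb and the constant EPACEMS_DISK_GB are illustrative and are not part of PUDL.

```python
import shutil

# Approximate uncompressed size of the full EPA CEMS Hourly dataset,
# taken from the note above. Illustrative constant, not a PUDL setting.
EPACEMS_DISK_GB = 100


def free_disk_gb(path="."):
    """Return free space, in GB (10**9 bytes), on the filesystem holding path."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    free = free_disk_gb(".")
    verdict = "enough" if free >= EPACEMS_DISK_GB else "probably not enough"
    print(f"{free:.1f} GB free: {verdict} for the full EPA CEMS dataset")
```

Point the path argument at the directory you intend to use as your PUDL workspace, since that is where the raw data will be stored.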

Python 3.7+ (and conda)

PUDL requires Python 3.7; it does not yet work on Python 3.8. While not strictly necessary, we highly recommend using the most recent version of Anaconda Python, or its smaller cousin miniconda if you are fond of the command line and want a lightweight install.
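If you are unsure which interpreter you are running, a minimal sketch like the one below can confirm it. It assumes, based on the text above, that Python 3.7 is the only supported minor version; the names SUPPORTED and is_supported are made up for this example.

```python
import sys

# Assumption: (3, 7) is the only supported major.minor version,
# as stated in the text above.
SUPPORTED = (3, 7)


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor part of version_info equals SUPPORTED."""
    return tuple(version_info[:2]) == SUPPORTED


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not supported"
    print(f"Python {found} is {status} by this PUDL release")
```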

Both Anaconda and miniconda provide conda, a command-line tool that helps you manage your Python software environment, packages, and their dependencies. PUDL provides an environment.yml file defining a software environment that should work well for most users in conjunction with conda.

We recommend using conda because, while PUDL is written entirely in Python, it makes heavy use of Python’s open data science stack, including packages like numpy, scipy, pandas, and sklearn, which depend on extensions written in C and C++. These extensions can be difficult to build locally when installed with pip, but conda provides pre-compiled, platform-specific binaries that should Just Work™.

Installing the Package

PUDL and all of its dependencies are available via conda on the community-managed conda-forge channel, and we recommend installing PUDL within its own conda environment like this:

$ conda create --yes --name pudl --channel conda-forge \
    --strict-channel-priority python=3.7 catalystcoop.pudl pip

Then you activate that conda environment to access it:

$ conda activate pudl

Once you’ve activated the pudl environment, you may want to install additional software within it. For example, to work with PUDL interactively in Jupyter notebooks:

$ conda install jupyter jupyterlab

You may also want to update your global conda settings:

$ conda config --add channels conda-forge
$ conda config --set channel_priority strict

PUDL is also available via the official Python Package Index (PyPI) and can be installed with pip like this:

$ pip install catalystcoop.pudl

Note

pip will only install the dependencies required for PUDL to work as a development library and command line tool. If you want to check out the source code from GitHub for development purposes, see the Development Setup documentation.

In addition to making the pudl package available for import in Python, installing catalystcoop.pudl provides the following command line tools:

  • pudl_setup

  • pudl_data

  • ferc1_to_sqlite

  • pudl_etl

  • datapkg_to_sqlite

  • epacems_to_parquet

For information on how to use any of these scripts, run it with the --help option. ferc1_to_sqlite and pudl_etl are configured with YAML files. Examples are provided with the catalystcoop.pudl package, and deployed by running pudl_setup as described below. Additional information about the settings files can be found in our documentation on Settings Files.
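To confirm that the command line tools listed above were actually installed into the active environment, you could look them up on your PATH. The sketch below does this with the standard library; the helper name locate_scripts is hypothetical and not provided by PUDL.

```python
import shutil

# Script names as listed above.
PUDL_SCRIPTS = [
    "pudl_setup",
    "pudl_data",
    "ferc1_to_sqlite",
    "pudl_etl",
    "datapkg_to_sqlite",
    "epacems_to_parquet",
]


def locate_scripts(names=PUDL_SCRIPTS):
    """Map each script name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_scripts().items():
        print(f"{name:20s} {path or 'NOT FOUND (is the pudl environment active?)'}")
```

Any script reported as missing usually means the pudl environment is not active in the current shell.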

Creating a Workspace

PUDL needs to know where to store its big piles of inputs and outputs. It also provides some example configuration files and Jupyter notebooks. The pudl_setup script lets PUDL know where all this stuff should go. We call this a “PUDL workspace”:

$ pudl_setup <PUDL_DIR>

Here <PUDL_DIR> is the path to the directory where you want PUDL to do its business: this is where the datastore will be located, and where any generated outputs end up. The script also puts a configuration file called .pudl.yml in your home directory, which records the location of this workspace so that PUDL uses it by default in the future. If you run pudl_setup with no arguments, it assumes you want to use the current working directory.
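To see which workspace has been recorded, you can simply print the configuration file. The sketch below makes no assumptions about the keys stored inside .pudl.yml and only returns its raw text; the function read_pudl_config is illustrative and not part of PUDL.

```python
from pathlib import Path


def read_pudl_config(home=None):
    """Return the text of .pudl.yml in the given home directory, or None if absent."""
    home = Path(home) if home is not None else Path.home()
    config = home / ".pudl.yml"
    return config.read_text() if config.is_file() else None


if __name__ == "__main__":
    text = read_pudl_config()
    print(text if text is not None else "No .pudl.yml found; run pudl_setup first.")
```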

The workspace is laid out like this:

Directory / File    Contents
------------------  ----------------------------------------------------------
data/               Raw data, automatically organized by source, year, etc.
datapkg/            Tabular data packages generated by PUDL.
environment.yml     A file describing the PUDL conda environment.
notebook/           Interactive Jupyter notebooks that use PUDL.
parquet/            Apache Parquet files generated by PUDL.
settings/           Example configuration files for controlling PUDL scripts.
sqlite/             sqlite3 databases generated by PUDL.
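After running pudl_setup, you may want to verify that the workspace contains the entries listed above. The following is a minimal sketch of such a check; the EXPECTED list mirrors the layout described here, and the function missing_entries is a hypothetical helper rather than something PUDL ships.

```python
from pathlib import Path

# Entries from the workspace layout above; a trailing slash marks a directory.
EXPECTED = [
    "data/",
    "datapkg/",
    "environment.yml",
    "notebook/",
    "parquet/",
    "settings/",
    "sqlite/",
]


def missing_entries(workspace):
    """Return the expected entries that are absent under the workspace path."""
    root = Path(workspace)
    missing = []
    for entry in EXPECTED:
        path = root / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    gaps = missing_entries(".")
    print("Workspace looks complete." if not gaps else f"Missing: {', '.join(gaps)}")
```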

The PUDL conda Environment

In addition to creating a conda environment using the command line arguments referred to above, you can specify an environment in a file, usually named environment.yml. We deploy a basic version of this file into a PUDL workspace when it’s created, as listed above.

Create the Environment

Because you won’t have the environment.yml file until after you’ve installed PUDL, you will probably create your PUDL environment on the command line as described above. To do the same thing using an environment file, you’d run:

$ conda env create --name pudl --file environment.yml

You may want to periodically update PUDL and the packages it depends on by running the following commands in the directory with environment.yml in it:

$ conda update conda
$ conda env update --name pudl --file environment.yml

If you get an error No such file or directory: environment.yml, it probably means you aren’t in the same directory as the environment.yml file.

Activate the Environment

conda allows you to set up different software environments for different projects. However, this means you need to tell conda which environment you want to be using at any given time. To select a particular conda environment (like the one named pudl that you just created) use conda activate followed by the name of the environment you want to use:

$ conda activate pudl

After running this command you should see an indicator (like (pudl)) in your command prompt, signaling that the environment is in use.
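The same check can be made from inside Python, which is handy in scripts and notebooks where the prompt is not visible. conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated; the sketch below reads it. The function name active_conda_env is illustrative only.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if there is none."""
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    env = active_conda_env()
    if env == "pudl":
        print("The pudl environment is active.")
    else:
        print(f"Active environment: {env!r}. Run 'conda activate pudl' first.")
```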

See also

Managing Environments, in the conda documentation.