Installation and Setup
The PUDL data processing pipeline does a lot of work in-memory with
pandas.DataFrame objects. Exhaustive record linkage within the
25 years of FERC Form 1 data requires up to 24 GB of memory.
The full EPA CEMS Hourly dataset is nearly 100 GB on disk.
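As a rough pre-flight check, the figures above can be compared with what a machine actually has. The following is a minimal sketch and not part of PUDL: the function name `check_resources` and its parameters are hypothetical, the default thresholds are simply the 24 GB and 100 GB figures quoted above, and the memory lookup relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def check_resources(path=".", min_ram_gb=24, min_disk_gb=100):
    """Compare this machine's RAM and free disk space with the given thresholds.

    Hypothetical helper, not part of PUDL. The defaults mirror the figures
    quoted in the text above.
    """
    free_disk_gb = shutil.disk_usage(path).free / GIB
    try:
        # Total physical memory = page size * number of physical pages.
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        total_ram_gb = page_size * page_count / GIB
    except (AttributeError, ValueError, OSError):
        total_ram_gb = None  # not available on this platform (e.g. Windows)
    return {
        "free_disk_gb": free_disk_gb,
        "total_ram_gb": total_ram_gb,
        "enough_disk": free_disk_gb >= min_disk_gb,
        "enough_ram": total_ram_gb is not None and total_ram_gb >= min_ram_gb,
    }


if __name__ == "__main__":
    for key, value in check_resources().items():
        print(f"{key}: {value}")
```

Pass the directory where you plan to keep your data as `path`, since free space is reported for the filesystem that holds it.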
Python 3.7+ (and conda)
PUDL requires Python 3.7 (it does not yet work on Python 3.8). While not strictly necessary, we highly recommend using the most recent version of Anaconda Python, or its smaller cousin miniconda if you are fond of the command line and want a lightweight install.
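To confirm that an interpreter matches the requirement stated above, a short check like the following may help. It is only a sketch: `is_supported_python` is a hypothetical helper, not something PUDL provides, and it encodes nothing more than the statement that Python 3.7 works and Python 3.8 does not yet.

```python
import sys


def is_supported_python(version_info=None):
    """Return True if the given (or the running) interpreter is Python 3.7.x.

    Hypothetical helper based only on the version statement in the text.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (3, 7)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    print("Supported by PUDL as described here:", is_supported_python())
```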
Both Anaconda and miniconda provide
conda, a command-line tool that helps
you manage your Python software environment, packages, and their dependencies.
PUDL provides an
environment.yml file defining a software environment that
should work well for most users in conjunction with conda.
We recommend using
conda because, while PUDL is written entirely in Python,
it makes heavy use of Python’s open data science stack, including packages like
sklearn, which depend on
extensions written in C and C++. These extensions can be difficult to build
locally when installed with pip, whereas
conda provides pre-compiled,
platform-specific binaries that should Just Work™.
Installing the Package
PUDL and all of its dependencies are available via
conda on the community
managed conda-forge channel, and we recommend
installing PUDL within its own
conda environment like this:
$ conda create --yes --name pudl --channel conda-forge \
    --strict-channel-priority python=3.7 catalystcoop.pudl pip
Then you activate that
conda environment to access it:
$ conda activate pudl
Once you’ve activated the pudl environment, you may want to install additional software within it, for example if you want to use Jupyter notebooks to work with PUDL interactively:
$ conda install jupyter jupyterlab
You may also want to update your global conda settings to use the conda-forge channel with strict channel priority:
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
PUDL is also available via the official
Python Package Index (PyPI) and can be installed with
pip like this:
$ pip install catalystcoop.pudl
pip will only install the dependencies required for PUDL to work as a
development library and command line tool. If you want to check out the
source code from GitHub for development purposes, see the
Development Setup documentation.
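After installing with either conda or pip, it can be useful to confirm that the installation is visible from the environment you are in. The sketch below uses only the Python standard library; the helper names `is_importable` and `is_on_path` are hypothetical, while `pudl` (the importable package) and `pudl_setup` (one of the command line scripts) are the names used in this documentation.

```python
import importlib.util
import shutil


def is_importable(package_name):
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None


def is_on_path(command_name):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(command_name) is not None


if __name__ == "__main__":
    print("pudl importable:", is_importable("pudl"))
    print("pudl_setup on PATH:", is_on_path("pudl_setup"))
```

If both checks report False inside a conda setup, the most likely cause is that the pudl environment has not been activated in the current shell.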
In addition to making the
pudl package available for import in Python,
catalystcoop.pudl provides the following command line tools:
For information on how to use these scripts, each can be run with the
pudl_etl are configured with
YAML files. Examples are provided with the
catalystcoop.pudl package, and
deployed by running
pudl_setup as described below. Additional information
about the settings files can be found in our documentation on
Creating a Workspace
PUDL needs to know where to store its big piles of inputs and outputs. It
also provides some example configuration files and
Jupyter notebooks. The
pudl_setup script lets
PUDL know where all this stuff should go. We call this a “PUDL workspace”:
$ pudl_setup <PUDL_DIR>
Here <PUDL_DIR> is the path to the directory where you want PUDL to do its
business – this is where the datastore will be located, and where any outputs
that are generated end up. The script will also put a configuration file called
.pudl.yml in your home directory. This file records the location of the
workspace so that PUDL uses it by default in the future. If you run
pudl_setup with
no arguments, it assumes you want to use the current working directory.
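To see which workspace has been recorded, you can simply look at the configuration file. The following sketch prints the raw contents of .pudl.yml from a home directory; `read_pudl_config` is a hypothetical helper, and it deliberately does not parse the file, because the keys it contains are not described here.

```python
from pathlib import Path


def read_pudl_config(home=None):
    """Return the text of <home>/.pudl.yml, or None if the file does not exist.

    Hypothetical helper. `home` defaults to the current user's home directory.
    """
    home = Path.home() if home is None else Path(home)
    config_path = home / ".pudl.yml"
    if not config_path.is_file():
        return None
    return config_path.read_text()


if __name__ == "__main__":
    contents = read_pudl_config()
    if contents is None:
        print("No .pudl.yml found; pudl_setup may not have been run yet.")
    else:
        print(contents)
```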
The workspace is laid out like this:
Directory / File
Raw data, automatically organized by source, year, etc.
Tabular data packages generated by PUDL.
A file describing the PUDL conda environment.
Interactive Jupyter notebooks that use PUDL.
Apache Parquet files generated by PUDL.
Example configuration files for controlling PUDL scripts.
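To compare this layout with what pudl_setup actually created on your machine, you can list the top-level entries of the workspace. This is a minimal sketch using only the standard library; `list_workspace` is a hypothetical helper, and the path you pass in is whatever you supplied as <PUDL_DIR>.

```python
from pathlib import Path


def list_workspace(pudl_dir):
    """Return the sorted top-level entries of a directory.

    Subdirectories are marked with a trailing slash. Hypothetical helper.
    """
    entries = []
    for path in Path(pudl_dir).iterdir():
        entries.append(path.name + "/" if path.is_dir() else path.name)
    return sorted(entries)


if __name__ == "__main__":
    # Assumes the current working directory is the PUDL workspace.
    for entry in list_workspace("."):
        print(entry)
```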
In addition to creating a
conda environment using the command line
arguments referred to above, you can specify an environment in a file, usually named
environment.yml. We deploy a basic version of this file into a
PUDL workspace when it’s created, as listed above.
Create the Environment
Because you won’t have the
environment.yml file until after you’ve
installed PUDL, you will probably create your PUDL environment on the command
line as described above. To do the same thing using an environment file, you’d run:
$ conda env create --name pudl --file environment.yml
You may want to periodically update PUDL and the packages it depends on
by running the following commands in the directory with environment.yml:
$ conda update conda
$ conda env update pudl
If you get an error
No such file or directory: environment.yml, it
probably means you aren’t in the same directory as the environment.yml file.
Activate the Environment
conda allows you to set up different software environments for different
projects. However, this means you need to tell
conda which environment you
want to be using at any given time. To select a particular
environment (like the one named
pudl that you just created), use
conda activate followed by the name of the environment you want to use:
$ conda activate pudl
After running this command you should see an indicator (like (pudl)) in
your command prompt, signaling that the environment is in use.
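If the prompt indicator is missing or you are running a script rather than an interactive shell, the active environment can also be checked programmatically. The sketch below relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets on activation; that variable is conda behavior not described in this documentation, and `active_conda_env` is a hypothetical helper.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if none is set.

    Reads CONDA_DEFAULT_ENV, which conda sets when an environment is activated.
    `environ` defaults to the current process environment.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    name = active_conda_env()
    if name == "pudl":
        print("The pudl environment is active.")
    else:
        print(f"Active conda environment: {name!r}; expected 'pudl'.")
```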
Managing Environments, in the