Installation and Setup

System Requirements

Note

The PUDL data processing pipeline does a lot of work in-memory with pandas.DataFrame objects. The full EPA CEMS Hourly dataset is nearly 100 GB uncompressed. To handle all of the data that is available via PUDL we recommend that your system have at least:

  • 8 GB of memory

  • 100 GB of free disk space
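
As a rough pre-flight check, the disk space and Python version requirements can be tested from Python itself. The sketch below is illustrative only and is not part of PUDL: the function name `check_requirements` and its parameters are assumptions made for this example. It does not check memory, because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)        # minimum Python version stated in this document
MIN_FREE_DISK_GB = 100     # recommended free disk space from the list above


def check_requirements(path=".", version=sys.version_info, free_bytes=None):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper for illustration, not a PUDL API.
    """
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.7")
    if free_bytes is None:
        # Free space on the filesystem that contains `path`.
        free_bytes = shutil.disk_usage(path).free
    free_gb = free_bytes / 1e9
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_gb:.0f} GB free; 100 GB recommended")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Pass the directory you plan to use as your workspace as `path`, so that the free space is measured on the right disk.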

Python 3.7+ (and conda)

PUDL requires Python 3.7 or later. In addition, while not strictly necessary, we highly recommend using the most recent version of the Anaconda Python distribution, or its smaller cousin miniconda (miniconda is nice if you are fond of the command line and want a lightweight install).

Both Anaconda and miniconda provide conda, a command-line tool that helps you manage your Python software environment, packages, and their dependencies. PUDL provides an environment.yml file defining a software environment that should work well for most users in conjunction with conda.

We recommend using conda because, while PUDL is written entirely in Python, it makes heavy use of Python’s open data science stack, including packages like numpy, scipy, pandas, and sklearn, which depend on extensions written in C and C++. These extensions can be difficult to build locally when installed with pip, but conda provides pre-compiled, platform-specific binaries.

Installing the Package

PUDL is available via conda on the community-managed conda-forge channel. This is the recommended way to install PUDL:

$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
$ conda install PUDL_PACKAGE

PUDL is also available via the official Python Package Index (PyPI) and can be installed with pip:

$ pip install PUDL_PACKAGE

Note

pip will only install the dependencies required for PUDL to work as a development library and command line tool. If you want to check out the source code from GitHub for development purposes, see the Development Setup documentation.

In addition to making the pudl package available for import in Python, installing PUDL_PACKAGE installs the following command line tools:

  • epacems_to_parquet

  • ferc1_to_sqlite

  • pudl_data

  • pudl_etl

  • pudl_setup

For information on how to use them, run them with the --help option. Most of them are configured using settings files. Example settings files are provided with PUDL_PACKAGE and are made available by running pudl_setup as described below.
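
One quick way to confirm that the installation put these tools on your PATH is to look them up from Python. This is a minimal sketch using only the standard library. The helper name `find_tools` is an assumption made for this example, and the tool names are copied from the list above.

```python
import shutil

# Command-line tools listed above as being installed with the package.
PUDL_TOOLS = [
    "epacems_to_parquet",
    "ferc1_to_sqlite",
    "pudl_data",
    "pudl_etl",
    "pudl_setup",
]


def find_tools(names=PUDL_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH.

    Hypothetical helper for illustration, not a PUDL API.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, the most likely cause is that the environment in which you installed the package is not the one currently active.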

Todo

Fill out the precise details of installation after we’ve tested it with a pre-release.

Creating a Workspace

PUDL needs to know where to store its big pile of input and output data. It also provides some example configuration files and Jupyter notebooks. The pudl_setup script lets PUDL know where all this stuff should go. We call this a “PUDL workspace”:

$ pudl_setup <PUDL_DIR>

Here <PUDL_DIR> is the path to the directory where you want PUDL to do its business: this is where the datastore will be located, and where any generated outputs will end up. The script will also put a configuration file called .pudl.yml in your home directory. This file records the location of the workspace so that PUDL uses it by default in the future.

The workspace is laid out like this:

Directory / File    Contents

data/               Raw data, automatically organized by source, year, etc.
datapackage/        Tabular data packages generated by PUDL.
environment.yml     A file describing the PUDL conda environment.
notebooks/          Interactive Jupyter notebooks that use PUDL.
parquet/            Apache Parquet files generated by PUDL.
settings/           Example configuration files for controlling PUDL scripts.
sqlite/             sqlite3 databases generated by PUDL.
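
To verify that a workspace was set up as expected, you can compare a directory against the layout above. The following is a sketch under the assumption that every entry in the layout exists immediately after pudl_setup runs, which this document does not guarantee. The function name `missing_workspace_entries` is made up for this example.

```python
from pathlib import Path

# Entries taken from the workspace layout described above.
EXPECTED_DIRS = ["data", "datapackage", "notebooks", "parquet", "settings", "sqlite"]
EXPECTED_FILES = ["environment.yml"]


def missing_workspace_entries(pudl_dir):
    """Return the expected workspace entries that are absent under pudl_dir.

    Hypothetical helper for illustration, not a PUDL API.
    """
    root = Path(pudl_dir)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_workspace_entries` with the path you passed to pudl_setup returns an empty list when all of the listed directories and files are present.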

The PUDL conda Environment

To make sure all of the software PUDL depends on is available, we use the conda environment described in the environment.yml file stored in the main directory of the GitHub repository.

Create the Environment

To create the PUDL conda environment, make sure you are in the same directory as environment.yml and run:

$ conda env create --name=pudl --file=environment.yml

This will probably download a bunch of Python packages, and might take a while. Future updates to the conda environment will be much faster, since only a couple of packages typically get updated at a time.

If you get an error No such file or directory: environment.yml, it probably means you aren’t in the pudl repository directory.

Activate the Environment

conda allows you to set up different software environments for different projects. However, this means you need to tell conda which environment you want to be using at any given time. To select a particular conda environment (like the one named pudl that you just created) use conda activate followed by the name of the environment you want to use:

$ conda activate pudl

After running this command you should see an indicator (like (pudl)) in your command prompt, signaling that the environment is in use.
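
Scripts and notebooks can also check which environment is active without relying on the prompt. conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated. The sketch below reads it. The function name `active_conda_env` is an assumption made for this example, and the environment name pudl matches the one created above.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is active.

    Hypothetical helper for illustration, not a PUDL API.
    """
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    env = active_conda_env()
    if env != "pudl":
        print(f"Active conda environment is {env!r}; expected 'pudl'.")
```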

See also

Managing Environments, in the conda documentation.