Installation and Setup¶
System Requirements¶
Note
The PUDL data processing pipeline does a lot of work in-memory with
pandas.DataFrame objects. Exhaustive record linkage within the 25 years of
FERC Form 1 data requires up to 24 GB of memory. The full EPA CEMS Hourly
dataset is nearly 100 GB on disk uncompressed.
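As a rough way to compare a machine against the figures in the note above, the following sketch reports free disk space and total physical memory using only the Python standard library. The function names are illustrative and not part of PUDL, and the memory lookup relies on os.sysconf, which is assumed to be available only on Unix-like systems.

```python
import os
import shutil

GB = 1000 ** 3

# Figures quoted in the note above; treat them as rough guides.
FERC1_LINKAGE_MEMORY_GB = 24
EPACEMS_DISK_GB = 100


def free_disk_gb(path="."):
    """Return the free space, in GB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / GB


def total_memory_gb():
    """Return total physical memory in GB, or None if it cannot be determined.

    os.sysconf does not exist on Windows, and not every Unix platform
    exposes these two names, so every failure mode maps to None.
    """
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / GB


if __name__ == "__main__":
    disk = free_disk_gb(".")
    print("Free disk: {:.1f} GB (full EPA CEMS needs ~{} GB)".format(
        disk, EPACEMS_DISK_GB))
    memory = total_memory_gb()
    if memory is None:
        print("Total memory could not be determined on this platform.")
    else:
        print("Total memory: {:.1f} GB (FERC 1 linkage may need {} GB)".format(
            memory, FERC1_LINKAGE_MEMORY_GB))
```

The check is advisory only: smaller subsets of the data need far less than these worst-case figures.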
Python 3.7+ (and conda)¶
PUDL requires Python 3.7; it does not yet work on Python 3.8. While not strictly necessary, we highly recommend using the most recent version of Anaconda Python, or its smaller cousin miniconda if you are fond of the command line and want a lightweight install.
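A quick way to confirm that an interpreter matches this requirement is to inspect sys.version_info. The sketch below assumes that "supported" means exactly Python 3.7, as the paragraph above states. The helper name is hypothetical and not part of PUDL.

```python
import sys

# Assumption drawn from the text above: 3.7 works, 3.8 does not yet.
SUPPORTED_VERSION = (3, 7)


def is_supported_python(version_info=None):
    """Return True if the given (or running) interpreter is Python 3.7.x."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED_VERSION


if __name__ == "__main__":
    running = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    if is_supported_python():
        print("Python " + running + " matches the documented requirement.")
    else:
        print("Python " + running + " is not the documented 3.7 series.")
```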
Both Anaconda and miniconda provide conda, a command-line tool that helps
you manage your Python software environment, packages, and their dependencies.
PUDL provides an environment.yml file defining a software environment that
should work well for most users in conjunction with conda.
We recommend using conda because, while PUDL is written entirely in Python,
it makes heavy use of Python’s open data science stack, including packages like
numpy, scipy, pandas, and sklearn, which depend on extensions written in C and
C++. These extensions can be difficult to build locally when installed with
pip, but conda provides pre-compiled platform-specific binaries that should
Just Work™.
Installing the Package¶
PUDL and all of its dependencies are available via conda on the
community-managed conda-forge channel, and we recommend installing PUDL within
its own conda environment like this:
$ conda create --yes --name pudl --channel conda-forge \
--strict-channel-priority python=3.7 catalystcoop.pudl pip
Then activate that conda environment to access it:
$ conda activate pudl
Once you’ve activated the pudl environment, you may want to install additional software within it, for example if you want to use Jupyter notebooks to work with PUDL interactively:
$ conda install jupyter jupyterlab
You may also want to update your global conda settings so that the conda-forge
channel is used with strict priority by default:
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
PUDL is also available via the official Python Package Index (PyPI) and can be
installed with pip like this:
$ pip install catalystcoop.pudl
Note
pip will only install the dependencies required for PUDL to work as a
development library and command line tool. If you want to check out the
source code from GitHub for development purposes, see the
Development Setup documentation.
In addition to making the pudl package available for import in Python,
installing catalystcoop.pudl provides the following command line tools:
pudl_setup
pudl_data
ferc1_to_sqlite
pudl_etl
datapkg_to_sqlite
epacems_to_parquet
For information on how to use these scripts, run any of them with the --help
option. ferc1_to_sqlite and pudl_etl are configured with YAML files. Examples
are provided with the catalystcoop.pudl package, and deployed by running
pudl_setup as described below. Additional information about the settings files
can be found in our documentation on Settings Files.
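After installing, it can be useful to confirm that the package imports and that the command line tools listed above are on your PATH. The following is a minimal sketch using only the standard library. The tool names come from the list above, while the helper functions are illustrative and not part of PUDL.

```python
import importlib.util
import shutil

# Command line tools named in this section.
PUDL_TOOLS = [
    "pudl_setup",
    "pudl_data",
    "ferc1_to_sqlite",
    "pudl_etl",
    "datapkg_to_sqlite",
    "epacems_to_parquet",
]


def pudl_importable():
    """Return True if a top-level package named pudl can be found."""
    return importlib.util.find_spec("pudl") is not None


def missing_tools(tools=PUDL_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("pudl importable:", pudl_importable())
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All PUDL command line tools were found on PATH.")
```

If tools are reported missing, the most likely cause is that the pudl conda environment is not active in the current shell.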
Creating a Workspace¶
PUDL needs to know where to store its big piles of inputs and outputs. It
also provides some example configuration files and Jupyter notebooks. The
pudl_setup script lets PUDL know where all this stuff should go. We call this
a “PUDL workspace”:
$ pudl_setup <PUDL_DIR>
Here <PUDL_DIR> is the path to the directory where you want PUDL to do its
business: this is where the datastore will be located, and where any
generated outputs end up. The script will also put a configuration file called
.pudl.yml in your home directory, which records the location of this workspace
so that it is used by default in the future. If you run pudl_setup with no
arguments, it assumes you want to use the current working directory.
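To see which workspace PUDL has recorded, you can look at the .pudl.yml file in your home directory. The sketch below only locates the file and returns its raw text. It makes no assumptions about the keys inside, since they are not described here, and the helper names are hypothetical.

```python
from pathlib import Path


def pudl_config_path(home=None):
    """Return the expected location of .pudl.yml under the home directory."""
    base = Path.home() if home is None else Path(home)
    return base / ".pudl.yml"


def read_pudl_config(home=None):
    """Return the raw text of .pudl.yml, or None if the file does not exist."""
    path = pudl_config_path(home)
    if not path.is_file():
        return None
    return path.read_text()


if __name__ == "__main__":
    text = read_pudl_config()
    if text is None:
        print("No .pudl.yml found; pudl_setup may not have been run yet.")
    else:
        print("Contents of " + str(pudl_config_path()) + ":")
        print(text)
```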
The workspace is laid out like this:
Directory / File | Contents
 | Raw data, automatically organized by source, year, etc.
 | Tabular data packages generated by PUDL.
environment.yml | A file describing the PUDL conda environment.
 | Interactive Jupyter notebooks that use PUDL.
 | Apache Parquet files generated by PUDL.
 | Example configuration files for controlling PUDL scripts.
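To inspect a workspace once pudl_setup has created it, you can list its top-level entries. This is a small sketch using pathlib. The function name is illustrative, and the path you pass is whatever directory you gave to pudl_setup.

```python
from pathlib import Path


def describe_workspace(pudl_dir):
    """Return sorted (name, kind) pairs for the top level of a workspace."""
    root = Path(pudl_dir)
    return sorted(
        (entry.name, "directory" if entry.is_dir() else "file")
        for entry in root.iterdir()
    )


if __name__ == "__main__":
    # The current directory stands in for <PUDL_DIR> in this demonstration.
    for name, kind in describe_workspace("."):
        print("{:<12}{}".format(kind, name))
```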
The PUDL conda Environment¶
In addition to creating a conda environment using the command line
arguments referred to above, you can specify an environment in a file, usually
named environment.yml. We deploy a basic version of this file into a
PUDL workspace when it’s created, as listed above.
Create the Environment¶
Because you won’t have the environment.yml file until after you’ve
installed PUDL, you will probably create your PUDL environment on the command
line as described above. To do the same thing using an environment file, you’d
run:
$ conda env create --name pudl --file environment.yml
You may want to periodically update PUDL and the packages it depends on
by running the following commands in the directory with environment.yml in it:
$ conda update conda
$ conda env update --name pudl --file environment.yml
If you get an error like No such file or directory: environment.yml, it
probably means you aren’t in the same directory as the environment.yml file.
Activate the Environment¶
conda allows you to set up different software environments for different
projects. However, this means you need to tell conda which environment you
want to be using at any given time. To select a particular conda environment
(like the one named pudl that you just created) use conda activate followed by
the name of the environment you want to use:
$ conda activate pudl
After running this command you should see an indicator (like (pudl)) in
your command prompt, signaling that the environment is in use.
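Besides the prompt indicator, the active environment can be checked from within Python. The sketch below reads the CONDA_DEFAULT_ENV environment variable, which conda sets on activation. For environments activated by path rather than by name, the variable may hold a path instead, so treat the comparison as a heuristic. The helper names are hypothetical.

```python
import os


def active_conda_env(environ=None):
    """Return the active conda environment's name, or None if none is set."""
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV")


def pudl_env_is_active(environ=None, name="pudl"):
    """Return True if the active conda environment has the given name."""
    return active_conda_env(environ) == name


if __name__ == "__main__":
    current = active_conda_env()
    if current is None:
        print("No conda environment appears to be active.")
    elif pudl_env_is_active():
        print("The pudl environment is active.")
    else:
        print("Active environment is " + current + "; "
              "run 'conda activate pudl' to switch.")
```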
See also
Managing Environments, in the conda documentation.