Installation and Setup
System Requirements
Note
The PUDL data processing pipeline does a lot of work in-memory with
pandas.DataFrame objects. The full EPA CEMS Hourly dataset is nearly 100 GB
uncompressed. To handle all of the data that is available via PUDL we
recommend that your system have at least:
8 GB of memory
100 GB of free disk space
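As a rough way to compare a machine against these recommendations, the following sketch uses only the Python standard library. The function names are illustrative and not part of PUDL, and the memory check relies on os.sysconf, which is not available on every platform (the sketch returns None in that case).

```python
import os
import shutil

GB = 10 ** 9  # decimal gigabytes


def free_disk_gb(path="."):
    """Free disk space on the filesystem containing `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def total_memory_gb():
    """Total physical memory in GB, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB
    except (AttributeError, ValueError, OSError):
        return None


def meets_recommendation(memory_gb, disk_gb, min_memory_gb=8, min_disk_gb=100):
    """True if both values reach the recommended minimums (8 GB and 100 GB)."""
    return memory_gb >= min_memory_gb and disk_gb >= min_disk_gb


if __name__ == "__main__":
    memory = total_memory_gb()
    disk = free_disk_gb(".")
    print(f"free disk: {disk:.0f} GB")
    if memory is None:
        print("total memory: could not be determined on this platform")
    else:
        print(f"total memory: {memory:.0f} GB")
        print("meets recommendation:", meets_recommendation(memory, disk))
```

Pass the directory you plan to use for PUDL data to free_disk_gb, since free space can differ between disks.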
Python 3.7+ (and conda)
PUDL requires Python 3.7 or later. In addition, while not strictly necessary, we highly recommend using the most recent version of the Anaconda Python distribution, or its smaller cousin miniconda (miniconda is nice if you are fond of the command line and want a lightweight install).
Both Anaconda and miniconda provide conda, a command-line tool that helps
you manage your Python software environment, packages, and their dependencies.
PUDL provides an environment.yml file defining a software environment that
should work well for most users in conjunction with conda.
We recommend using conda because, while PUDL is written entirely in Python,
it makes heavy use of Python’s open data science stack, including packages like
numpy, scipy, pandas, and sklearn, which depend on extensions written in C and
C++. These extensions can be difficult to build locally when installed with
pip, but conda provides pre-compiled, platform-specific binaries.
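Before installing, it can help to confirm that the interpreter is new enough and that conda is on your PATH. This is a minimal sketch using the standard library; the helper names are illustrative and not part of PUDL.

```python
import shutil
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 7)):
    """True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


def conda_path():
    """Full path to the conda executable, or None if it is not on PATH."""
    return shutil.which("conda")


if __name__ == "__main__":
    print("Python 3.7 or later:", python_ok())
    print("conda found at:", conda_path())
```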
Installing the Package
PUDL is available via conda on the community-managed conda-forge channel.
This is the recommended way to install PUDL:
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
$ conda install PUDL_PACKAGE
PUDL is also available via the official Python Package Index (PyPI) and can
be installed with pip:
$ pip install PUDL_PACKAGE
Note
pip will only install the dependencies required for PUDL to work as a
development library and command line tool. If you want to check out the
source code from GitHub for development purposes, see the
Development Setup documentation.
In addition to making the pudl package available for import in Python,
installing PUDL_PACKAGE installs the following command line tools:
epacems_to_parquet
ferc1_to_sqlite
pudl_data
pudl_etl
pudl_setup
For information on how to use them, run them with the --help option. Most
of them are configured using settings files. Examples are provided with the
PUDL_PACKAGE, and made available by running pudl_setup as described below.
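One way to confirm that the installation put these tools on your PATH is to look each one up with shutil.which. This is an illustrative sketch, not part of PUDL; the tool names are the ones listed above.

```python
import shutil

PUDL_TOOLS = [
    "epacems_to_parquet",
    "ferc1_to_sqlite",
    "pudl_data",
    "pudl_etl",
    "pudl_setup",
]


def missing_tools(tools=PUDL_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All PUDL command line tools were found.")
```

If tools are reported missing after a conda install, check that the environment you installed into is the one currently active.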
Todo
Fill out the precise details of installation after we’ve tested it with a pre-release.
Creating a Workspace
PUDL needs to know where to store its big pile of input and output data. It
also provides some example configuration files and Jupyter notebooks. The
pudl_setup script lets PUDL know where all this stuff should go. We call this
a “PUDL workspace”:
$ pudl_setup <PUDL_DIR>
Here <PUDL_DIR> is the path to the directory where you want PUDL to do its
business: this is where the datastore will be located, and where any outputs
that are generated will end up. The script will also put a configuration file
called .pudl.yml in your home directory. This file records the location of the
workspace so that PUDL uses it by default in the future.
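To see what pudl_setup recorded, you can print the configuration file from your home directory. The sketch below only reads the file as raw text, because the keys it contains are not described here; the function names are illustrative and not part of PUDL.

```python
from pathlib import Path


def pudl_config_path(home=None):
    """Path of the .pudl.yml file in `home` (default: the user's home directory)."""
    base = Path(home) if home is not None else Path.home()
    return base / ".pudl.yml"


def read_pudl_config(home=None):
    """Raw text of .pudl.yml, or None if pudl_setup has not created it yet."""
    path = pudl_config_path(home)
    return path.read_text() if path.is_file() else None


if __name__ == "__main__":
    text = read_pudl_config()
    if text is None:
        print(f"{pudl_config_path()} not found; run pudl_setup first.")
    else:
        print(text)
```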
The workspace is laid out like this:
Directory / File | Contents
 | Raw data, automatically organized by source, year, etc.
 | Tabular data packages generated by PUDL.
environment.yml | A file describing the PUDL conda environment.
 | Interactive Jupyter notebooks that use PUDL.
 | Apache Parquet files generated by PUDL.
 | Example configuration files for controlling PUDL scripts.
The PUDL conda Environment
To make sure all of the software PUDL depends on is available, we use the
conda environment described in the environment.yml file stored in the main
directory of the GitHub repository.
Create the Environment
To create the PUDL conda environment, make sure you are in the same
directory as environment.yml and run:
$ conda env create --name=pudl --file=environment.yml
This will probably download a bunch of Python packages, and might take a while.
Future updates to the conda environment will be much faster, since only a
couple of packages typically get updated at a time.
If you get an error No such file or directory: environment.yml, it
probably means you aren’t in the pudl repository directory.
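If you script the environment creation, checking for environment.yml first gives a clearer error than the one above. This is a minimal sketch, assuming you want to build the same conda env create command from Python; the helper name is illustrative and not part of PUDL.

```python
from pathlib import Path


def conda_create_command(directory=".", env_name="pudl"):
    """Build the `conda env create` command, failing early if environment.yml is missing."""
    env_file = Path(directory) / "environment.yml"
    if not env_file.is_file():
        raise FileNotFoundError(
            f"{env_file} not found; change into the pudl repository directory first."
        )
    return ["conda", "env", "create", f"--name={env_name}", f"--file={env_file}"]
```

The returned list can be passed to subprocess.run(command, check=True) to actually create the environment.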
Activate the Environment
conda allows you to set up different software environments for different
projects. However, this means you need to tell conda which environment you
want to be using at any given time. To select a particular conda environment
(like the one named pudl that you just created) use conda activate followed by
the name of the environment you want to use:
$ conda activate pudl
After running this command you should see an indicator (like (pudl)) in
your command prompt, signaling that the environment is in use.
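The active environment can also be checked from Python. conda activate sets the CONDA_DEFAULT_ENV environment variable to the name of the active environment, so a script can compare it against the expected name. The helper names below are illustrative and not part of PUDL.

```python
import os


def active_conda_env(environ=os.environ):
    """Name of the active conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def pudl_env_active(environ=os.environ, expected="pudl"):
    """True if the active conda environment has the expected name."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    print("active conda environment:", active_conda_env())
    print("pudl environment active:", pudl_env_active())
```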
See also
Managing Environments, in the conda documentation.