Development Setup

This page walks you through the setup you need to contribute code or documentation to the PUDL project.

These instructions assume that you are working on a Unix-like operating system (macOS or Linux) and are already familiar with git, GitHub, and the Unix shell.

Warning

While it should be possible to set up the development environment on Windows, we haven’t done it. In the future we may create a Docker image that provides the development environment, e.g. for use with VS Code’s Containers extension.

Note

If you’re new to git and GitHub, you’ll want to check out:

Install conda

We use the conda package manager to specify and update our development environment, preferentially installing packages from the community-maintained conda-forge distribution channel. We recommend using miniconda rather than the large pre-defined collection of scientific packages bundled together in the Anaconda Python distribution. You may also want to consider using mamba, a faster drop-in replacement for conda written in C++.

After installing a conda package manager, make sure it’s configured to use strict channel priority with the following commands:

$ conda update conda
$ conda config --set channel_priority strict

Fork and Clone the PUDL Repository

Unless you’re already part of the Catalyst Cooperative organization, you’ll need to fork the PUDL repository. This makes a copy of it in your personal (or organizational) account on GitHub that is independent of, but linked to, the original “upstream” project.

Then, clone the repository from your fork to your local computer where you’ll be editing the code or docs. This will download the whole history of the project, including the most recent version, and put it in a local directory where you can make changes.
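
As a rough sketch of the fork-and-clone workflow described above, the helper below clones a fork and registers the original project as a second remote. The function name clone_fork, the remote name upstream, and the <your-github-username> placeholder are illustrative assumptions rather than anything PUDL provides, and the upstream URL in the example call is assumed to be the Catalyst Cooperative repository on GitHub.

```shell
# Hypothetical helper (not part of PUDL): clone your fork, then add the
# original project as a second remote so you can pull in upstream changes.
clone_fork() {
    fork_url=$1        # URL of your fork on GitHub
    upstream_url=$2    # URL of the original project
    dest=${3:-pudl}    # local directory to clone into
    git clone "$fork_url" "$dest" &&
        git -C "$dest" remote add upstream "$upstream_url"
}

# Example call (replace the placeholder with your own account name):
# clone_fork "https://github.com/<your-github-username>/pudl.git" \
#            "https://github.com/catalyst-cooperative/pudl.git"
```

After the call, running git remote -v inside the new directory should list both origin (your fork) and upstream (the original project).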

Create the PUDL Dev Environment

Inside the devtools directory of your newly cloned repository, you’ll see an environment.yml file that specifies the pudl-dev conda environment. You can create and activate that environment from within the main repository directory by running:

$ conda update conda
$ conda env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev

This environment installs the catalystcoop.pudl package directly using the code in your cloned repository so that it can be edited during development. It also installs all of the software PUDL depends on, some packages for testing and quality control, packages for working with interactive Jupyter Notebooks, and a few Python packages that have binary dependencies which can be easier to satisfy through conda packages.

Getting and Storing an EIA API Key

PUDL accesses U.S. Energy Information Administration (EIA) datasets via an API, which requires permission from the EIA. New users must register for an API key; registration is free, nearly instantaneous, and requires only an email address.

To make this key accessible to pudl, store it in an environment variable and reactivate the environment:

$ conda activate pudl-dev
$ conda env config vars set API_KEY_EIA='your_api_key_here'
$ conda activate pudl-dev
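
To confirm that the key is visible after reactivating the environment, a small check like the following can help. This is a minimal sketch: the function name check_eia_key is a hypothetical helper, not part of PUDL, and it only tests whether the variable is non-empty, not whether the EIA accepts the key.

```shell
# Hypothetical helper (not part of PUDL): report whether API_KEY_EIA is set
# in the current shell, without printing the key itself.
check_eia_key() {
    if [ -n "${API_KEY_EIA:-}" ]; then
        echo "API_KEY_EIA is set"
    else
        echo "API_KEY_EIA is not set" >&2
        return 1
    fi
}

# Example call, after reactivating pudl-dev:
# check_eia_key
```

You can also run conda env config vars list with the environment active to see the variables conda has stored for it.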

Updating the PUDL Dev Environment

You will need to periodically update your development (pudl-dev) conda environment to get newer versions of existing dependencies and to incorporate any changes to the environment specification that have been made by other contributors. The most reliable way to do this is to remove the existing environment and recreate it.

Note

Different development branches within the repository may specify their own slightly different versions of the pudl-dev conda environment. As a result, you may need to update your environment when switching from one branch to another.

If you want to work with the most recent version of the code on a branch named new-feature, then from within the top directory of the PUDL repository you would do:

$ git checkout new-feature
$ git pull
$ conda deactivate
$ conda update conda
$ conda env remove --name pudl-dev
$ conda env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev

If you find yourself recreating the environment frequently, and are frustrated by how long it takes conda to solve the dependencies, we recommend using the mamba solver. You’ll want to install it in your base conda environment (i.e. with no conda environment activated):

$ conda deactivate
$ conda install mamba

Then the above development environment update process would become:

$ git checkout new-feature
$ git pull
$ conda deactivate
$ mamba update mamba
$ mamba env remove --name pudl-dev
$ mamba env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev
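
If you repeat these steps often, they can be wrapped in a shell function. The sketch below is one possible arrangement, not something PUDL ships: the name rebuild_pudl_env and the DRY_RUN variable are hypothetical, and it assumes you call it from the top directory of the repository in a shell where conda has been initialized.

```shell
# Hypothetical helper (not part of PUDL): recreate the pudl-dev environment,
# using mamba when it is on the PATH and falling back to conda otherwise.
# Set DRY_RUN=1 to print the commands instead of running them.
rebuild_pudl_env() {
    solver=conda
    if command -v mamba >/dev/null 2>&1; then
        solver=mamba
    fi
    run=${DRY_RUN:+echo}   # expands to "echo" only when DRY_RUN is non-empty
    $run conda deactivate &&
        $run "$solver" env remove --name pudl-dev &&
        $run "$solver" env create --name pudl-dev --file devtools/environment.yml &&
        $run conda activate pudl-dev
}

# Example: preview the commands without changing anything.
# DRY_RUN=1 rebuild_pudl_env
```

Checking out and pulling the branch is left out of the helper on purpose, so that you choose which branch's environment specification gets used.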

If you are working with locally processed data and there have been changes to the expectations about that data in the PUDL software, you may also need to regenerate your PUDL SQLite database or other outputs. See Running the ETL Pipeline for more details.

Set Up Code Linting

We use several automated tools to apply uniform coding style and formatting across the project codebase. This is known as code linting, and it reduces merge conflicts, makes the code easier to read, and helps catch some types of bugs before they are committed. These tools are part of the pudl-dev conda environment and their configuration files are checked into the GitHub repository. If you’ve cloned the pudl repo and are working inside the pudl-dev conda environment, they should be installed and ready to go.

Git Pre-commit Hooks

Git hooks let you automatically run scripts at various points as you manage your source code. “Pre-commit” hook scripts are run when you try to make a new commit. These scripts can review your code and identify bugs, formatting errors, bad coding habits, and other issues before the code gets checked in. This gives you the opportunity to fix those issues before publishing them.

To make sure they are run before you commit any code, you need to enable the pre-commit hook scripts with this command:

$ pre-commit install

The scripts that run are configured in the .pre-commit-config.yaml file.
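
To check that the install step took effect, you can look for the hook script that pre-commit writes into the repository. The sketch below assumes the default git layout, where hooks live under .git/hooks (a custom core.hooksPath or a git worktree would put them elsewhere); the function name hook_installed is a hypothetical helper, not part of PUDL or pre-commit.

```shell
# Hypothetical helper: succeed if a pre-commit hook script exists in the
# given repository (default: current directory), assuming the standard
# .git/hooks location.
hook_installed() {
    [ -f "${1:-.}/.git/hooks/pre-commit" ]
}

if hook_installed; then
    echo "pre-commit hook is installed"
else
    echo "pre-commit hook not found; run 'pre-commit install' in the repo"
fi
```

To run all of the configured hooks against every file in the repository, rather than only the files staged for a commit, use pre-commit run --all-files.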

Code and Docs Linters

Flake8 is a popular Python linting framework, with a large selection of plugins. We use it to check the formatting and syntax of the code and docstrings embedded within the PUDL packages. Doc8 is a lot like flake8, but for Python documentation written in the reStructuredText format and built by Sphinx, which together are the de facto standard for Python documentation. The doc8 tool checks for syntax errors and other formatting issues in the documentation source files under the docs/ directory.

Automatic Formatting

Rather than alerting you that there’s a style issue in your Python code, autopep8 tries to fix it for you automatically, applying consistent formatting rules based on PEP 8. Similarly, isort automatically groups and orders Python import statements in each module to minimize diffs and merge conflicts.

Linting Within Your Editor

If you are using an editor designed for Python development, many of these code linting and formatting tools can be run automatically in the background while you write code or documentation. Popular editors that work with the above tools include:

Each of these editors has its own collection of plugins and settings for working with linters and other code analysis tools.

Creating a Workspace

PUDL needs to know where to store its big piles of inputs and outputs. It also comes with some example configuration files. The pudl_setup script lets PUDL know where all this stuff should go. We call this a “PUDL workspace”:

$ pudl_setup <PUDL_DIR>

Here <PUDL_DIR> is the path to the directory where you want PUDL to do its business – this is where the datastore will be located and where any outputs that are generated end up. The script will also put a configuration file called .pudl.yml in your home directory that records the location of this workspace and uses it by default in the future. If you run pudl_setup with no arguments, it assumes you want to use the current working directory.

The workspace is laid out like this:

Directory / File    Contents
data/               Raw data, automatically organized by source, year, etc.
parquet/            Apache Parquet files generated by PUDL.
settings/           Example configuration files for controlling PUDL scripts.
sqlite/             sqlite3 databases generated by PUDL.
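
As a quick sanity check after running pudl_setup, you can confirm that the subdirectories listed above exist. This is a minimal sketch: check_workspace is a hypothetical helper, not a PUDL command, and it only tests that the directories are present, not what they contain.

```shell
# Hypothetical helper (not part of PUDL): verify that a workspace contains
# the four subdirectories described above. Prints any that are missing and
# returns 1 if at least one is absent.
check_workspace() {
    ws=${1:-.}
    missing=0
    for sub in data parquet settings sqlite; do
        if [ ! -d "$ws/$sub" ]; then
            echo "missing: $ws/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using whatever path you passed to pudl_setup:
# check_workspace <PUDL_DIR>
```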