If you want to contribute code or documentation directly, you’ll need to create your own fork of the project on Github, and set up some version of the development environment described below, before making pull requests to submit new code, documentation, or examples of use.
If you’re new to git and Github, you may want to check out:
Install Python 3.7¶
As mentioned in the Installation and Setup documentation, PUDL currently requires Python 3.7. We use miniconda to manage our software environments. While using conda isn’t strictly required, having everyone on the same platform makes everything easier.
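If you are unsure which interpreter is active, a quick check like the one below can help. This is only a minimal sketch: the helper name `meets_minimum` is made up for illustration, and it assumes that Python 3.7 or any newer release is acceptable, which the text above does not state explicitly.

```python
import sys

# Assumption: 3.7 is treated as a minimum version, not an exact pin.
MINIMUM = (3, 7)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Report on the interpreter that is running this script.
    status = "OK" if meets_minimum(sys.version_info) else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the same `python` you intend to use for development, since several interpreters may be installed side by side.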
Fork and Clone the PUDL Repository¶
On the main page of the PUDL repository you should see a Fork button in the upper right hand corner. Forking the repository makes a copy of it in your personal (or organizational) account on Github that is independent of, but linked to, the original “upstream” project.
Depending on your operating system and the git client you’re using to access
Github, the exact cloning process might be different, but if you’re using a
UNIX-like terminal, cloning the repository
from your fork will look like this (with your own Github username or
organizational name in place of
USERNAME of course):
$ git clone https://github.com/USERNAME/pudl.git
This will download the whole history of the project, including the most recent
version, and put it in a local directory called
Inside your newly cloned local repository, you should see the following:
Directory / File
Development tools not distributed with the package.
A copy of the MIT License, under which PUDL is distributed.
Template describing files included in the python package.
Jupyter Notebooks, examples and development in progress.
Configuration for development tools used with the project.
Concise, top-level project documentation.
Python build and packaging script.
Package source code, isolated to avoid unintended imports.
Modules for use with PyTest.
Configuration for the Tox build and test framework.
Create and activate the pudl-dev conda environment¶
In the devtools directory of your newly cloned repository, you should see an environment.yml file, which specifies the pudl-dev conda environment. You can create that environment locally from within the main repository directory by running:
$ conda update conda
$ conda config --set channel_priority strict
$ conda env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev
This environment mostly includes additional code quality assurance and testing packages, on top of the basic PUDL requirements.
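To confirm that the environment really is active in your current shell, you can inspect the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The sketch below does that from Python; the function name `active_conda_env` is hypothetical and not part of PUDL or conda.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if there is none.

    conda sets the CONDA_DEFAULT_ENV variable when an environment is activated.
    """
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    name = active_conda_env()
    if name == "pudl-dev":
        print("pudl-dev is active")
    else:
        print(f"Expected pudl-dev, found: {name!r}")
```

Passing the environment in as a dictionary keeps the function easy to test without touching your real shell environment.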
Install PUDL for development¶
The catalystcoop.pudl package isn’t part of the pudl-dev environment, since you’re going to be editing it. To install the local version that now exists in your cloned repository into that environment using pip, run the following from the main repository directory:
$ pip install --editable ./
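After an editable install, the package should be imported from your cloned repository rather than from a copy inside the environment. One way to check is to ask Python where it would load the package from. This is a sketch under the assumption that the package's import name is `pudl`; the helper `package_location` is made up for illustration.

```python
import importlib.util


def package_location(import_name):
    """Return the file a top-level package would be imported from, or None.

    None means the package cannot be found in the current environment
    (or that it has no single origin file).
    """
    spec = importlib.util.find_spec(import_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Assumption: the catalystcoop.pudl distribution is imported as "pudl".
    print(package_location("pudl"))
```

If the printed path does not point inside your clone, the editable install probably went into a different environment than the one you are running.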
Install PUDL QA/QC tools¶
We use automated tools to apply uniform coding style and formatting across the project codebase. This reduces merge conflicts, makes the code easier to read, and helps catch bugs before they are committed. These tools are part of the pudl conda environment, and their configuration files are checked into the Github repository, so they should be installed and ready to go if you’ve cloned the pudl repo and are working inside the pudl conda environment.
These tools can be run at three different stages in development:
inside your text editor or IDE, while you are writing code or documentation,
before you make a new commit to the repository using Git’s pre-commit hook scripts,
Real Python Code Quality Tools and Best Practices gives a good overview of available linters and static code analysis tools.
PyFlakes, which checks Python code for correctness,
pep8-naming checks that variable names comply with Python naming conventions.
flake8-builtins checks to make sure you haven’t accidentally clobbered any reserved Python names with your own variables.
Doc8 is a lot like flake8, but for Python
documentation written in the reStructuredText format and built by
Sphinx. This is the de-facto
standard for Python documentation. The
doc8 tool checks for syntax errors
and other formatting issues in the documentation source files under the
Many of the tools outlined above can be run automatically in the background while you are writing code or documentation, if you are using an editor that works well for Python development. A couple of popular options are the free Atom editor developed by Github, and the less free Sublime Text editor. Both of them have many community maintained addons and plugins.
Catalyst primarily uses the Atom editor, with the following plugins and settings. These plugins require that the tools described above are installed on your system – which is done automatically in the pudl conda environment.
atom-beautify set to “beautify on save,” with autopep8 as the beautifier and formatter, and set to “sort imports.”
linter the base linter package used by all Atom linters.
linter-flake8 set to use .flake8 as the project config file.
python-autopep8 to actually do the work of tidying up.
Git Pre-commit Hooks¶
Git hooks let you automatically run scripts at various points as you manage your source code. “Pre-commit” hook scripts are run when you try to make a new commit. These scripts can review your code and identify bugs, formatting errors, bad coding habits, and other issues before the code gets checked in. This gives you the opportunity to fix those issues first.
Pretty much all you need to do is enable pre-commit hooks:
$ pre-commit install
The scripts that run are configured in the
In addition to
pre-commit hooks also run
bandit (a tool for identifying
common security issues in Python code) and several other checks that keep you
from accidentally committing large binary files, leaving
in your code, forgetting to resolve merge conflicts, and other gotchas that can
be hard for humans to catch but are easy for a computer.
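To give a feel for what such a check does, here is a minimal sketch of a scan for unresolved merge conflict markers. It is not the implementation used by the actual hooks, and the name `find_conflict_markers` is invented for this example.

```python
def find_conflict_markers(text):
    """Return the 1-based line numbers that look like git merge conflict markers."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # git writes "<<<<<<< " and ">>>>>>> " followed by a ref name, and a
        # bare "=======" between the two sides. A bare "=======" can also be a
        # reStructuredText heading underline, so this simple test can produce
        # false positives on documentation files.
        if line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "=======":
            hits.append(lineno)
    return hits


if __name__ == "__main__":
    sample = "a\n<<<<<<< HEAD\nb\n=======\nc\n>>>>>>> branch\n"
    print(find_conflict_markers(sample))
```

A real hook would run a check like this over every staged file and refuse the commit if anything is found, which is why it is cheap for a computer and easy for a person to miss.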
If you want to make a pull request, it’s important that all these checks pass. Otherwise the build will fail, since these same checks are run by the tests on Travis.
The pre-commit project: A framework for managing and maintaining multi-language pre-commit hooks.
Install and Validate the Data¶
In order to work on PUDL development, you’ll probably need to have a bunch of
the data available locally. Follow the instructions in Creating a Datastore to set
up a local data management environment and download some data locally, then
run the ETL pipeline to generate some data packages and use them to populate a local SQLite database with as much
PUDL data as you can stand (for development, we typically load all of the
available data for
but only a single state’s worth of data for the much larger
Using Tox to Validate PUDL¶
If you’ve done all of the above, you should be able to use
tox to run our
test suite, and perform data validation. For example, to validate the data
stored in your PUDL SQLite database, you would simply run:
$ tox -v -e validate
This process may take 30 minutes to an hour to complete.
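Before starting a long validation run, it can be useful to confirm that the database actually contains tables. The sketch below lists the tables visible through an SQLite connection using Python's standard `sqlite3` module. The helper `list_tables` and the table name in the demo are made up; to inspect real data you would connect to your own database file, whose location is not given here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of the tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Demo on an in-memory database with a made-up table, so that running
    # this script never creates or modifies a file on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY)")
    print(list_tables(conn))
    conn.close()
```

An empty list from your real database would suggest that the ETL step has not populated it yet, in which case validation has nothing to check.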
Running the Tests¶
We also use
tox to run PyTest against a packaged and separately installed
version of the local repository package. Take a peek inside
see what test environments are available. To run the same tests that will be
run on Travis CI when you make a pull request, you can run:
$ tox -v -e travis -- --fast
This will run the linters and pre-commit checks on all the code, make sure that the docs can be built by Sphinx, and run the ETL process on a single year of data. The --fast flag is passed through to PyTest by tox because it comes after the --. That test will also attempt to download a year of data into a temporary directory. If you want to skip the download step and use your already downloaded datastore, you can point the tests at it with --pudl_in=AUTO:
$ tox -v -e travis -- --fast --pudl_in=AUTO
Additional details can be found in Building and Testing PUDL.
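The role of the `--` separator in the commands above can be illustrated with a few lines of Python. This is only a sketch of the general command line convention, not tox's actual implementation, and `split_posargs` is a hypothetical name. Whether the trailing arguments reach PyTest also depends on how the tox environment is configured.

```python
def split_posargs(argv):
    """Split command line arguments at the first "--" separator.

    Returns a tuple (args_before, args_after). Everything after the
    separator is what gets passed through to the underlying command.
    """
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return list(argv), []


if __name__ == "__main__":
    before, after = split_posargs(
        ["-v", "-e", "travis", "--", "--fast", "--pudl_in=AUTO"]
    )
    print("handled by tox:", before)
    print("passed through:", after)
```

Without the separator, options such as `--fast` would be read by tox itself instead of being handed on to the tests.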
Making a Pull Request¶
Before you make a pull request, please check that:
Your code passes all of the Travis tests by running them with
You can generate a new complete bundle of data packages, including all the available data (with the exception of epacems, for which all the years of a couple of states are sufficient for testing).
Those data packages can be used to populate an SQLite database locally, using the
The epacems_to_parquet script is able to convert the EPA CEMS Hourly Emissions table from the data package into an Apache Parquet dataset.
The data validation tests can be run against that SQLite database, using tox -v -e validate as outlined above.
If you’ve added new data or substantial new code, please also include new tests and data validation. See the modules under
Then you can push the new code to your fork of the PUDL repository on Github, and from there, you can make a Pull Request inviting us to review your code and merge your improvements in with the main repository!