Testing PUDL#
We use pytest to specify software unit and integration tests, and to coordinate data validation tests. Several pytest commands are stored as targets in the PUDL Makefile for convenience and to ensure that we’re all running the tests in similar ways by default.
To run the software unit and integration tests that will be run in our automated CI on GitHub, you can use the following command:
$ make pytest-unit pytest-integration
Note
If you aren’t familiar with pytest and Make already, you may want to check out:
Software Tests#
Our pytest-based software tests are all stored under the test/ directory in the main repository. They are organized into 3 broad categories, each with its own subdirectory:
Software Unit Tests (test/unit/) can be run in seconds and don’t require any external data. They test the basic functionality of various functions and classes, often using minimal inline data structures that are specified in the test modules themselves.
Software Integration Tests (test/integration/) test larger collections of functionality including the interactions between different parts of the overall software system and in some cases interactions with external systems requiring network connectivity. The main thing our integration tests do is run the full PUDL data processing pipeline for the most recent year of data. Depending on your machine, this can take from 20 minutes to an hour or more.
Data Validations (test/validate/) sanity check the PUDL outputs generated by the data processing pipeline. This helps us catch issues with the input data as well as more subtle bugs that don’t prevent the code from executing but do have unintended or unexpected impacts on the output data. The data validation requires a fully populated PUDL database and is quite different from the other tests.
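As an illustration of the unit test style described above, here is a minimal sketch of a test that uses only inline data. The helper `clean_plant_names` and its field names are hypothetical and stand in for a small piece of real functionality; they are not part of PUDL.

```python
# Hypothetical helper standing in for a small piece of functionality under test.
def clean_plant_names(records):
    """Strip whitespace and title-case the ``plant_name`` field of each record."""
    return [
        {**rec, "plant_name": rec["plant_name"].strip().title()}
        for rec in records
    ]


def test_clean_plant_names():
    """Unit test: inline input data, no database or network access required."""
    raw = [
        {"plant_id": 1, "plant_name": "  big creek "},
        {"plant_id": 2, "plant_name": "SUNNY mesa"},
    ]
    expected = [
        {"plant_id": 1, "plant_name": "Big Creek"},
        {"plant_id": 2, "plant_name": "Sunny Mesa"},
    ]
    assert clean_plant_names(raw) == expected
```

Because both the input and the expected output live in the test module itself, a test like this runs in milliseconds and can be collected by pytest from anywhere under test/unit/.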
Running tests with Make#
The Makefile targets that pertain to software and data tests which are coordinated by pytest are prefixed with pytest-. In addition to running the pytest-unit and pytest-integration targets mentioned above, there are also:
pytest-validate: The full data validation tests, which run on an already existing PUDL DB.
pytest-integration-full: The integration tests, but run on all years of data rather than just the most recent year. This test also assumes you already have a live PUDL DB.
pytest-jupyter: Check that select Jupyter notebooks checked into the repository can run successfully. (Currently disabled)
pytest-minmax-rows: Check that various database tables have the expected number of records in them, and report back the actual number of records found. Requires an existing PUDL DB.
pytest-coverage: Run all the software tests and generate a test coverage report.
pytest-ci: Run the unit and integration tests (those tests that get run in CI).
Running Other Commands with Make#
There are a number of non-test make targets. To see them all, open up the Makefile.
ferc: Delete all existing XBRL and DBF derived FERC databases and metadata and re-extract them from scratch.
pudl: Delete your existing pudl.sqlite and re-run the full ETL from scratch. Assumes that the FERC DBs already exist.
nuke: Delete your existing FERC and PUDL databases, rebuild them from scratch, and run all of the tests and data validations (akin to running the nightly builds) for an extensive check of everything. This will take 3 hours or more to complete, and likely fully utilize your computer’s CPU and memory.
install-pudl: Remove your existing pudl-dev conda environment and reinstall all dependencies as well as the catalystcoop.pudl package defined by the repository in --editable mode for development.
docs-build: Remove existing PUDL documentation outputs and rebuild from scratch.
dagster: Start up the Dagster UI (will remain running in your terminal until you kill it with Control-C).
jlab: Start up a JupyterLab notebook server (will remain running in your terminal until you kill it with Control-C).
ci: Run all the checks that would be run in CI on GitHub, including the pre-commit hooks, docs build, and software unit and integration tests.
Selecting Input Data for Integration Tests#
The software integration tests need a year’s worth of input data to process. By default they will look in your local PUDL datastore to find it. If the data they need isn’t available locally, they will download it from Zenodo and put it in the local datastore.
However, if you’re editing code that affects how the datastore works, you probably don’t want to risk contaminating your working datastore. You can use a disposable temporary datastore instead by passing our custom --tmp-data flag to pytest:
$ pytest --tmp-data test/integration
See also
Development Setup for more on how to set up a PUDL workspace and datastore.
Working with the Datastore for more on how to work with the datastore in general.
Data Validation#
Given the processed outputs of the PUDL ETL pipeline, we have a collection of tests that can be run to verify that the outputs look correct. We run all available data validations before each data release is archived on Zenodo. It is useful to run the data validation tests prior to making a pull request that makes changes to the ETL process or output functions to ensure that the outputs have not been unintentionally affected.
These data validation tests are organized into datasource specific modules under test/validate. Running the full data validation can take as much as an hour, depending on your computer. These tests require a fully populated PUDL database which contains all available FERC and EIA data, as specified by the src/pudl/package_data/settings/etl_full.yml input file. They are run against the “live” SQLite database in your pudl workspace at $PUDL_OUTPUT/pudl.sqlite. To run the full data validation against an existing database:
$ make pytest-validate
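Since the validations run against the live database, it can be helpful to confirm that the file exists before starting a long run. The following is a minimal sketch under that assumption; the function name `live_pudl_db` is hypothetical and is not part of PUDL. It only relies on the `$PUDL_OUTPUT/pudl.sqlite` location described above.

```python
import os
from pathlib import Path


def live_pudl_db(env=os.environ):
    """Return the path to the live PUDL DB, or raise if it can't be found.

    Hypothetical helper: it only checks the location named in the text above.
    """
    output_dir = env.get("PUDL_OUTPUT")
    if not output_dir:
        raise RuntimeError("PUDL_OUTPUT is not set.")
    db_path = Path(output_dir) / "pudl.sqlite"
    if not db_path.is_file():
        raise FileNotFoundError(f"No PUDL database found at {db_path}")
    return db_path
```

Calling `live_pudl_db()` with no arguments reads the real environment; passing a plain dictionary instead makes the check easy to exercise in isolation.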
The data validation cases that pertain to the contents of the data tables are currently stored as part of the pudl.validate module. The expected number of records in each output table is stored in the validation test modules under test/validate as pytest parameterizations.
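To make the idea of a row count check concrete, here is a rough sketch that compares table sizes against expected bounds. The table names, bounds, and function names are all hypothetical, and a small in-memory SQLite database stands in for pudl.sqlite. In a pytest suite, each tuple in `EXPECTED_ROWS` would typically become one parameterized test case rather than being looped over by hand.

```python
import sqlite3

# Hypothetical expected row-count bounds: (table name, min rows, max rows).
EXPECTED_ROWS = [
    ("plants", 2, 3),
    ("generators", 3, 5),
]


def count_rows(conn, table):
    """Return the number of records in ``table``."""
    # Table names come from the trusted list above, not from user input.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def check_minmax_rows(conn, expected=EXPECTED_ROWS):
    """Return (table, actual, min, max) for every table outside its bounds."""
    failures = []
    for table, min_rows, max_rows in expected:
        actual = count_rows(conn, table)
        if not min_rows <= actual <= max_rows:
            failures.append((table, actual, min_rows, max_rows))
    return failures


# Build a tiny in-memory database to stand in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER)")
conn.execute("CREATE TABLE generators (generator_id INTEGER)")
conn.executemany("INSERT INTO plants VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO generators VALUES (?)", [(1,), (2,)])

# The generators table has 2 rows, below its hypothetical minimum of 3.
print(check_minmax_rows(conn))
```

Reporting the actual count alongside the bounds mirrors the behavior described for the pytest-minmax-rows target, which reports back the number of records it found.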
Data Validation Notebooks#
We have a collection of Jupyter Notebooks that run the same functions as the data validation. The notebooks also produce some visualizations of the data to make it easier to understand what’s wrong when validation fails. These notebooks are stored in test/validate/notebooks. Like the data validations, the notebooks will only run successfully when there’s a full PUDL SQLite database available in your PUDL workspace.
Running pytest Directly#
Running tests directly with pytest gives you the ability to run only tests from a particular test module or even a single individual test case. It’s also faster because there’s no testing environment to set up. Instead, it just uses your Python environment, which should be the pudl-dev conda environment discussed in Development Setup.
This is convenient if you’re debugging something specific or developing new test cases.
Running specific tests#
To run the software unit tests with pytest directly:
$ pytest test/unit
To run only the unit tests for the Excel spreadsheet extraction module:
$ pytest test/unit/extract/excel_test.py
To run only the unit tests defined by a single test class within that module:
$ pytest test/unit/extract/excel_test.py::TestGenericExtractor
Custom PUDL pytest flags#
We have defined several custom flags to control pytest’s behavior when running the PUDL tests.
You can always check to see what custom flags exist by running pytest --help and looking at the custom options section:
custom options:
  --live-dbs            Use existing PUDL/FERC1 DBs instead of creating temporary ones.
  --tmp-data            Download fresh input data for use with this test run only.
  --etl-settings=ETL_SETTINGS
                        Path to a non-standard ETL settings file to use.
  --gcs-cache-path=GCS_CACHE_PATH
                        If set, use this GCS path as a datastore cache layer.
The main flexibility that these custom options provide is in selecting where the raw input data comes from and what data the tests should be run against. Being able to specify the tests to run and the data to run them against independently simplifies the test suite and keeps the data and tests very clearly separated.
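For readers unfamiliar with how pytest learns about custom flags, the sketch below shows the general mechanism: a `pytest_addoption` hook in a conftest.py file registers each option. The option names and help strings are taken from the help output above, but the defaults, the actions, and the `use_live_dbs` helper are assumptions for illustration and may not match the actual PUDL conftest.py.

```python
# conftest.py (sketch): register custom command line flags with pytest.
def pytest_addoption(parser):
    """Add custom options; pytest calls this hook when it starts up."""
    parser.addoption(
        "--live-dbs",
        action="store_true",
        default=False,
        help="Use existing PUDL/FERC1 DBs instead of creating temporary ones.",
    )
    parser.addoption(
        "--tmp-data",
        action="store_true",
        default=False,
        help="Download fresh input data for use with this test run only.",
    )
    parser.addoption(
        "--etl-settings",
        action="store",
        default=None,
        help="Path to a non-standard ETL settings file to use.",
    )
    parser.addoption(
        "--gcs-cache-path",
        action="store",
        default=None,
        help="If set, use this GCS path as a datastore cache layer.",
    )


def use_live_dbs(config):
    """Read the flag back; inside a fixture, ``config`` is ``request.config``."""
    return config.getoption("--live-dbs")
```

Keeping the data-selection logic in options like these, rather than in the test functions, is what lets the same test module run against either a freshly built database or an existing one.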
The --live-dbs option lets you use your existing FERC 1 and PUDL databases instead of building new ones. This can be useful if you want to test code that only operates on an existing database, and has nothing to do with the construction of that database. For example, the output routines:
$ pytest --live-dbs test/integration/output_test.py
We also use this option to run the data validations.
Assuming you do want to run the ETL and build new databases as part of the test you’re running, the contents of that database are determined by an ETL settings file. By default, the settings file that’s used is src/pudl/package_data/settings/etl_fast.yml. But it’s also possible to use a different input file, generating a different database, and then run some tests against that database. We use the src/pudl/package_data/settings/etl_full.yml settings file to specify an exhaustive collection of input data.
The raw input data that all the tests use ultimately comes from our archives on Zenodo. However, you can optionally tell the tests to look in different places for more rapidly accessible caches of that data, or to force the download of a fresh copy (especially useful when you are testing the datastore functionality specifically). By default, the tests will use the datastore that’s part of your local PUDL workspace.
For example, to run the ETL portion of the integration tests and download fresh input data to a temporary datastore that’s later deleted automatically:
$ pytest --tmp-data test/integration/etl_test.py