Converting raw FERC data to SQLite#
FERC publishes its data (e.g. FERC Form 1 – Annual Report of Major Electric Utilities) in particularly difficult to use formats. From 1994-2020 it used the proprietary FoxPro database binary format. Then in 2021 it switched to XBRL, a dialect of XML used for financial reporting.
In addition to using two different difficult to parse file formats, the data itself is unclean and poorly organized. As a result, very few people are currently able to use it. This means that, while we have not yet integrated the vast majority of the available data into PUDL, it’s useful to just provide programmatic access to the bulk raw data, independent of the cleaner subset of the data included within PUDL.
To provide that access, we’ve broken the
pudl.extract.ferc1 process down into
several distinct steps:
Clone the 1994-2020 annual database from FoxPro (DBF) into a local file-based
Select a limited subset of the tables in these databases for further processing and integration into the PUDL
The FoxPro / XBRL derived FERC Form 1 databases include 100+ tables, containing 3000+ columns.
If you need to work with this relatively unprocessed data, we highly recommend downloading it from one of our periodic data releases or our Nightly Builds.
Cloning the original FERC database is the first step in the PUDL ETL process. This can
be done using the dagster UI (see Running the ETL via the Dagster UI) or with the
script (see Running the ETL with CLI Commands).
We recommend using the Dagster UI to execute the ETL as it provides additional functionality for re-execution and viewing dependences.
ferc_to_sqlite job will create several outputs in your
directory. For example the FERC Form 1 outputs will include:
$PUDL_OUTPUT/ferc1_dbf.sqlite: Data from 1994-2020 (FoxPro/DBF)
$PUDL_OUTPUT/ferc1_xbrl.sqlite: Data from 2021 onward (XBRL)
$PUDL_OUTPUT/ferc1_xbrl_datapackage.json: Frictionless data package descriptor for the XBRL derived database.
pudl_output/ferc1_xbrl_taxonomy_metadata.json: A JSON version of the XBRL Taxonomy, containing additional metadata.
By default, the script pulls in all available years and tables of data. The output is
roughly 1GB on disk. The
ferc_to_sqlite jobs also extracts the XBRL data for FERC
Form 1, 2, 6, 60 and 714.