pudl.extract.ferc1¶
Extract FERC Form 1 data from SQLite DBs derived from original DBF or XBRL files.
The FERC Form 1 data is available in two primary formats, spanning different years. The early digital data (1994-2020) was distributed using annual Visual FoxPro databases. Starting in 2021, the agency moved to using XBRL (a dialect of XML) published via an RSS feed one filing at a time. First we convert both of those difficult to use original formats into relational databases (currently stored in SQLite). We use those databases as the starting point for our extensive cleaning and reorganization of a small portion of the available tables into a well normalized database that covers all the years of available data. The complete input databases are published separately to provide users access to all of the original tables, since we’ve only been able to clean up a small subset of them.
The conversion from both DBF and XBRL to SQLite is coordinated by the
pudl.convert.ferc_to_sqlite
script. The code for the XBRL to SQLite conversion
is used across all the modern FERC forms, and is contained in a standalone
ferc-xbrl-extractor package.
The code for converting the older FERC 1 DBF files into an SQLite DB is contained in this module.
One challenge with both of these data sources is that each year of data is treated as a standalone resource by FERC. The databases are not explicitly linked together across years. Over time the structure of the Visual FoxPro DB has changed as new tables and fields have been added. In order to be able to use the data to do analyses across many years, we need to bring all of it into a unified structure. These structural changes have only ever been additive – more recent versions of the DBF databases contain all the tables and fields that existed in earlier versions.
PUDL uses the most recently released year of DBF data (2020) as a template for the database schema, since it is capable of containing all the= fields and tables found in the other years. The structure of the database is also informed by other documentation we have been able to compile over the years from the FERC website and other sources. Copies of these resoruces are included in the FERC Form 1 data source documentation
Using this inferred structure PUDL creates an SQLite database mirroring the FERC
database using sqlalchemy
. Then we use a python package called dbfread to extract the data from the DBF tables,
and insert it virtually unchanged into the SQLite database.
Note that many quantities in the Visual FoxPro databases are tied not just to a
particular table and column, but to a row number within an individual filing, and
those row numbers have changed slowly over the years for some tables as rows have been
added or removed from the form. The f1_row_lit_tbl
table contains a record of these
changes, and can be used to align reported quantities across time.
The one significant change we make to the raw input data is to ensure that there’s a
master table of the all the respondent IDs and respondent names. All the other tables
refer to this table. Unlike the other tables the f1_respondent_id
table has no
report_year
and so it represents a merge of all the years of data. In the event that
the name associated with a given respondent ID has changed over time, we retain the most
recently reported name.
Note that there are a small number of respondent IDs that do not appear in any year
of the f1_respondent_id
table, but that do appear in the data tables. We add
these observed but not directly reported IDs to the f1_respondent_id
table and have
done our best to identify what utility they correspond to based on the assets associated
with those respondent IDs.
This SQLite compilation of the original FERC Form 1 databases accommodates all 116 tables from all the published years of DBF data (1994-2020) and takes up about 1GB of space on disk. You can interact with the most recent development version of this database online here.
Attributes¶
A mapping of PUDL DB table names to their XBRL and DBF source table names. |
|
A mapping of XBRL to (future) PUDL table names for tables not yet in PUDL. |
|
Classes¶
Wrapper for running the foxpro to sqlite conversion of FERC1 dataset. |
Functions¶
|
Create SourceAssets for raw ferc1 tables. |
|
Extract the FERC 1 XBRL Taxonomy metadata we've stored as JSON. |
|
Combine multiple raw dbf tables into one. |
|
Combine multiple raw dbf tables into one. |
|
Coordinates the extraction of all FERC Form 1 tables into PUDL. |
|
Coordinates the extraction of all FERC Form 1 tables into PUDL from XBRL data. |
Module Contents¶
- pudl.extract.ferc1.TABLE_NAME_MAP_FERC1: dict[str, dict[str, str]][source]¶
A mapping of PUDL DB table names to their XBRL and DBF source table names.
- pudl.extract.ferc1.XBRL_META_ONLY_FERC1[source]¶
A mapping of XBRL to (future) PUDL table names for tables not yet in PUDL.
We need this mapping so that we can have a meaningful reference to outside tables in some inter-table XBRL calculations, even though we aren’t yet reading those tables into PUDL. We pull information about these tables from the XBRL metadata, but the data is not extracted from the DBF and XBRL derived SQLite DBs.
- class pudl.extract.ferc1.Ferc1DbfExtractor(datastore: pudl.workspace.datastore.Datastore, settings: pudl.settings.FercToSqliteSettings, output_path: pathlib.Path)[source]¶
Bases:
pudl.extract.dbf.FercDbfExtractor
Wrapper for running the foxpro to sqlite conversion of FERC1 dataset.
- get_settings(global_settings: pudl.settings.FercToSqliteSettings) pudl.settings.GenericDatasetSettings [source]¶
Returns settings for FERC Form 1 DBF dataset.
- finalize_schema(meta: sqlalchemy.MetaData) sqlalchemy.MetaData [source]¶
Modifies schema before it’s written to sqlite database.
This marks f1_responent_id.respondent_id as a primary key and adds foreign key constraints on all tables with respondent_id column.
- postprocess()[source]¶
Applies final transformations on the data in sqlite database.
This identifies respondents that are referenced by other tables but that are not in the f1_respondent_id tables. These missing respondents are inserted back into f1_respondent_id table.
- PUDL_RIDS: dict[int, str][source]¶
Missing FERC 1 Respondent IDs for which we have identified the respondent.
- add_missing_respondents()[source]¶
Add missing respondents to f1_respondent_id table.
Some respondent_ids are referenced by other tables, but are not listed in the f1_respondent_id table. This function finds all of these missing respondents and backfills them into the f1_respondent_id table.
- get_observed_respondents() set[int] [source]¶
Compile the set of all observed respondent IDs found in the FERC 1 database.
A significant number of FERC 1 respondent IDs appear in the data tables, but not in the f1_respondent_id table. In order to construct a self-consistent database with we need to find all of those missing respondent IDs and inject them into the table when we clone the database.
- Returns:
Every respondent ID reported in any of the FERC 1 DB tables.
- aggregate_table_frames(table_name: str, dfs: list[pudl.extract.dbf.PartitionedDataFrame]) pandas.DataFrame | None [source]¶
Deduplicates records in f1_respondent_id table.
- pudl.extract.ferc1.create_raw_ferc1_assets() list[dagster.SourceAsset] [source]¶
Create SourceAssets for raw ferc1 tables.
SourceAssets allow you to access assets that are generated elsewhere. In our case, the xbrl and dbf database are created in a separate dagster Definition.
- Returns:
A list of ferc1 SourceAssets.
- pudl.extract.ferc1.raw_ferc1_xbrl__metadata_json(context) dict[str, dict[str, list[dict[str, Any]]]] [source]¶
Extract the FERC 1 XBRL Taxonomy metadata we’ve stored as JSON.
- Returns:
A dictionary keyed by PUDL table name, with an instant and a duration entry for each table, corresponding to the metadata for each of the respective instant or duration tables from XBRL if they exist. Table metadata is returned as a list of dictionaries, each of which can be interpreted as a row in a tabular structure, with each row annotating a separate XBRL concept from the FERC 1 filings. If there is no instant/duration table, an empty list is returned instead.
- pudl.extract.ferc1.extract_dbf_generic(table_names: list[str], io_manager: pudl.io_managers.FercDBFSQLiteIOManager, dataset_settings: pudl.settings.DatasetsSettings) pandas.DataFrame [source]¶
Combine multiple raw dbf tables into one.
- Parameters:
table_names – The name of the raw dbf tables you want to combine under dbf. These are the tables you want to combine.
io_manager – IO Manager that reads tables out of
ferc1_dbf.sqlite
.dataset_settings – object containing desired years to extract.
- Returns:
Concatenation of all tables in table_names as a dataframe.
- pudl.extract.ferc1.extract_xbrl_generic(table_names: list[str], io_manager: pudl.io_managers.FercXBRLSQLiteIOManager, dataset_settings: pudl.settings.DatasetsSettings, period: Literal['duration', 'instant']) pandas.DataFrame [source]¶
Combine multiple raw dbf tables into one.
- Parameters:
table_names – The name of the raw dbf tables you want to combine under xbrl. These are the tables you want to combine.
io_manager – IO Manager that reads tables out of
ferc1_xbrl.sqlite
.dataset_settings – object containing desired years to extract.
period – Either duration or instant, specific to xbrl data.
- Returns:
Concatenation of all tables in table_names as a dataframe.
- pudl.extract.ferc1.extract_dbf(dataset_settings: pudl.settings.DatasetsSettings) dict[str, pandas.DataFrame] [source]¶
Coordinates the extraction of all FERC Form 1 tables into PUDL.
This function is not used in the dagster ETL and is only intended to be used in notebooks for debugging the FERC Form 1 transforms.
- Parameters:
dataset_settings – object containing desired years to extract.
- Returns:
A dictionary of DataFrames, with the names of PUDL database tables as the keys. These are the raw unprocessed dataframes, reflecting the data as it is in the FERC Form 1 DB, for passing off to the data tidying and cleaning functions found in the
pudl.transform.ferc1
module.
- pudl.extract.ferc1.extract_xbrl(dataset_settings: pudl.settings.DatasetsSettings) dict[str, dict[Literal['duration', 'instant'], pandas.DataFrame]] [source]¶
Coordinates the extraction of all FERC Form 1 tables into PUDL from XBRL data.
This function is not used in the dagster ETL and is only intended to be used in notebooks for debugging the FERC Form 1 transforms.
- Parameters:
dataset_settings – object containing desired years to extract.
- Returns:
A dictionary where keys are the names of the PUDL database tables, values are dictionaries of DataFrames coresponding to the instant and duration tables from the XBRL derived FERC 1 database.