pudl.extract.ferc1#

Extract FERC Form 1 data from SQLite DBs derived from original DBF or XBRL files.

The FERC Form 1 data is available in two primary formats, spanning different years. The early digital data (1994-2020) was distributed using annual Visual FoxPro databases. Starting in 2021, the agency moved to using XBRL (a dialect of XML) published via an RSS feed one filing at a time. First we convert both of those difficult to use original formats into relational databases (currently stored in SQLite). We use those databases as the starting point for our extensive cleaning and reorganization of a small portion of the available tables into a well normalized database that covers all the years of available data. The complete input databases are published separately to provide users access to all of the original tables, since we’ve only been able to clean up a small subset of them.

The conversion from both DBF and XBRL to SQLite is coordinated by the pudl.convert.ferc_to_sqlite script. The code for the XBRL to SQLite conversion is used across all the modern FERC forms, and is contained in a standalone package:

https://github.com/catalyst-cooperative/ferc-xbrl-extractor

The code for converting the older FERC 1 DBF files into an SQLite DB is contained in this module.

One challenge with both of these data sources is that each year of data is treated as a standalone resource by FERC. The databases are not explicitly linked together across years. Over time the structure of the Visual FoxPro DB has changed as new tables and fields have been added. In order to be able to use the data to do analyses across many years, we need to bring all of it into a unified structure. These structural changes have only ever been additive – more recent versions of the DBF databases contain all the tables and fields that existed in earlier versions.

PUDL uses the most recently released year of DBF data (2020) as a template for the database schema, since it is capable of containing all the fields and tables found in the other years. The structure of the database is also informed by other documentation we have been able to compile over the years from the FERC website and other sources. Copies of these resoruces are included in the FERC Form 1 data source documentation

Using this inferred structure PUDL creates an SQLite database mirroring the FERC database using sqlalchemy. Then we use a python package called dbfread to extract the data from the DBF tables, and insert it virtually unchanged into the SQLite database.

Note that many quantities in the Visual FoxPro databases are tied not just to a particular table and column, but to a row number within an individual filing, and those row numbers have changed slowly over the years for some tables as rows have been added or removed from the form. The f1_row_lit_tbl table contains a record of these changes, and can be used to align reported quantities across time.

The one significant change we make to the raw input data is to ensure that there’s a master table of the all the respondent IDs and respondent names. All the other tables refer to this table. Unlike the other tables the f1_respondent_id table has no report_year and so it represents a merge of all the years of data. In the event that the name associated with a given respondent ID has changed over time, we retain the most recently reported name.

Note that there are a small number of respondent IDs that do not appear in any year of the f1_respondent_id table, but that do appear in the data tables. We add these observed but not directly reported IDs to the f1_respondent_id table and have done our best to identify what utility they correspond to based on the assets associated with those respondent IDs.

This SQLite compilation of the original FERC Form 1 databases accommodates all 116 tables from all the published years of DBF data (1994-2020) and takes up about 1GB of space on disk. You can interact with the most recent development version of this database online at:

https://data.catalyst.coop/ferc1

Module Contents#

Classes#

Ferc1DbfDatastore

A wrapper to standardize access to FERC 1 resources by year and filename.

FERC1FieldParser

A custom DBF parser to deal with bad FERC Form 1 data types.

Functions#

missing_respondents(→ list[dict[str, int | str]])

Fill in missing respondents for the f1_respondent_id table.

observed_respondents(→ set[int])

Compile the set of all observed respondent IDs found in the FERC 1 database.

add_sqlite_table(→ None)

Add a new Table to the FERC Form 1 database schema.

get_fields(→ dict[str, list[str]])

Produce the expected table names and fields from a DBC file.

get_dbc_map(→ dict[str, dict[str, str]])

Extract names of all tables and fields from a FERC Form 1 DBC file.

define_sqlite_db(sqlite_engine, sqlite_meta, dbc_map, ...)

Defines a FERC Form 1 DB structure in a given SQLAlchemy MetaData object.

get_raw_df(.working_partitions[]) → pandas.DataFrame)

Combine several years of a given FERC Form 1 DBF table into a dataframe.

dbf2sqlite(→ None)

Clone the FERC Form 1 Visual FoxPro databases into SQLite.

get_ferc1_meta(→ sqlalchemy.MetaData)

Grab the FERC Form 1 DB metadata and check that tables exist.

extract_dbf(→ dict[str, pandas.DataFrame])

Coordinates the extraction of all FERC Form 1 tables into PUDL.

extract_xbrl(→ dict[str, dict[Literal[duration, ...)

Coordinates the extraction of all FERC Form 1 tables into PUDL from XBRL data.

extract_xbrl_generic(→ pandas.DataFrame)

Extract a single FERC Form 1 XBRL table by name.

extract_dbf_generic(→ pandas.DataFrame)

Extract a single FERC Form 1 DBF table by name.

extract_xbrl_metadata(→ list[dict[Any]])

Extract the XBRL Taxonomy we've stored as JSON.

Attributes#

logger

DBF_TYPES

A mapping of DBF field types to SQLAlchemy Column types.

TABLE_NAME_MAP

A mapping of PUDL DB table names to their XBRL and DBF source table names.

PUDL_RIDS

Missing FERC 1 Respondent IDs for which we have identified the respondent.

pudl.extract.ferc1.logger[source]#
pudl.extract.ferc1.DBF_TYPES[source]#

A mapping of DBF field types to SQLAlchemy Column types.

This dictionary maps the strings which are used to denote field types in the DBF objects to the corresponding generic SQLAlchemy Column types: These definitions come from a combination of the dbfread example program dbf2sqlite and this DBF file format documentation page: http://www.dbase.com/KnowledgeBase/int/db7_file_fmt.htm

Unmapped types left as ‘XXX’ which should result in an error if encountered.

Type:

dict

pudl.extract.ferc1.TABLE_NAME_MAP: dict[str, dict[str, str]][source]#

A mapping of PUDL DB table names to their XBRL and DBF source table names.

pudl.extract.ferc1.PUDL_RIDS: dict[int, str][source]#

Missing FERC 1 Respondent IDs for which we have identified the respondent.

pudl.extract.ferc1.missing_respondents(reported: collections.abc.Iterable[int], observed: collections.abc.Iterable[int], identified: dict[int, str]) list[dict[str, int | str]][source]#

Fill in missing respondents for the f1_respondent_id table.

Parameters:
  • reported – Respondent IDs appearing in the f1_respondent_id table.

  • observed – Respondent IDs appearing anywhere in the FERC 1 DB.

  • identified – A dictionary mapping respondent_id: to respondent_name for those observed but unreported respondent IDs we’ve been able to identify based on circumstantial evidence. See pudl.extract.ferc1.PUDL_RIDS.

Returns:

A list of dictionaries representing minimal f1_respondent_id table records, of the form {“respondent_id”: ID, “respondent_name”: NAME}. These records are generated only for unreported respondents. Identified respondents get the values passed in through identified and the other observed but unidentified respondents are named “Missing Respondent ID”

pudl.extract.ferc1.observed_respondents(ferc1_engine: sqlalchemy.engine.Engine) set[int][source]#

Compile the set of all observed respondent IDs found in the FERC 1 database.

A significant number of FERC 1 respondent IDs appear in the data tables, but not in the f1_respondent_id table. In order to construct a self-consistent database with we need to find all of those missing respondent IDs and inject them into the table when we clone the database.

Parameters:

ferc1_engine – An engine for connecting to the FERC 1 database.

Returns:

Every respondent ID reported in any of the FERC 1 DB tables.

class pudl.extract.ferc1.Ferc1DbfDatastore(datastore: pudl.workspace.datastore.Datastore)[source]#

A wrapper to standardize access to FERC 1 resources by year and filename.

The internal directory structure of the published zipfiles containing FERC Form 1 data changes from year to year unpredictably, but the names of the individual database files which we parse is consistent. This wrapper encapsulates the annual directory structure variation and lets us request a particular filename by year without needing to understand the directory structure.

PACKAGE_PATH = 'pudl.package_data.ferc1'[source]#
get_dir(year: int) pathlib.Path[source]#

Get path to directory containing DBF files for an annual archive.

get_file(year: int, filename: str)[source]#

Opens given ferc1 file from the corresponding archive.

pudl.extract.ferc1.add_sqlite_table(table_name: str, sqlite_meta: sqlalchemy.schema.MetaData, dbc_map: dict[str, dict[str, str]], ferc1_dbf_ds: Ferc1DbfDatastore, refyear: int | None = None) None[source]#

Add a new Table to the FERC Form 1 database schema.

Creates a new sa.Table object named table_name and add it to the database schema contained in sqlite_meta. Use the information in the dictionary dbc_map to translate between the DBF filenames in the datastore (e.g. F1_31.DBF), and the full name of the table in the FoxPro database (e.g. f1_fuel) and also between truncated column names extracted from that DBF file, and the full column names extracted from the DBC file. Read the column datatypes out of each DBF file and use them to define the columns in the new Table object.

Parameters:
  • table_name – The name of the new table to be added to the database schema.

  • sqlite_meta – The database schema to which the newly defined sqlalchemy.Table will be added.

  • dbc_map – A dictionary of dictionaries

  • ferc1_dbf_ds – Initialized FERC1 DBF datastore.

  • refyear – Reference year to use as a template for the database schema.

pudl.extract.ferc1.get_fields(filedata) dict[str, list[str]][source]#

Produce the expected table names and fields from a DBC file.

Parameters:

filedata – Contents of the DBC file from which to extract.

Returns:

Dictionary mapping table names to the list of fields contained in that table.

pudl.extract.ferc1.get_dbc_map(ferc1_dbf_ds: Ferc1DbfDatastore, year: int) dict[str, dict[str, str]][source]#

Extract names of all tables and fields from a FERC Form 1 DBC file.

Read the DBC file associated with the FERC Form 1 database for the given year, and extract all embedded table and column names.

Parameters:
  • ferc1_dbf_ds – Initialized FERC 1 datastore.

  • year – The year of data from which the database table and column names are to be extracted. Typically this is expected to be the most recently available year of FERC Form 1 DBF data.

Returns:

A dictionary whose keys are the long table names extracted from the DBC file, and whose values are dictionaries mapping the first of which is the full name of each field in the table with the same name as the key, and the second of which is the truncated (<=10 character) long name of that field as found in the DBF file.

pudl.extract.ferc1.define_sqlite_db(sqlite_engine: sqlalchemy.engine.Engine, sqlite_meta: sqlalchemy.MetaData, dbc_map: dict[str, dict[str, str]], ferc1_dbf_ds: Ferc1DbfDatastore, ferc1_to_sqlite_settings: pudl.settings.Ferc1DbfToSqliteSettings = Ferc1DbfToSqliteSettings())[source]#

Defines a FERC Form 1 DB structure in a given SQLAlchemy MetaData object.

Given a template from an existing year of FERC data, and a list of target tables to be cloned, convert that information into table and column names, and data types, stored within a SQLAlchemy MetaData object. Use that MetaData object (which is bound to the SQLite database) to create all the tables to be populated later.

Parameters:
  • sqlite_engine – A connection engine for an existing FERC 1 DB.

  • sqlite_meta – A SQLAlchemy MetaData object which is bound to the FERC Form 1 SQLite database.

  • dbc_map – A dictionary of dictionaries, from get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.

  • ferc1_dbf_ds – Initialized FERC 1 Datastore.

  • ferc1_to_sqlite_settings – Object containing Ferc1 to SQLite validated settings.

Returns:

the effects of the function are stored inside sqlite_meta

Return type:

None

class pudl.extract.ferc1.FERC1FieldParser(table, memofile=None)[source]#

Bases: dbfread.FieldParser

A custom DBF parser to deal with bad FERC Form 1 data types.

parseN(field, data: bytes) int | float | None[source]#

Augments the Numeric DBF parser to account for bad FERC data.

There are a small number of bad entries in the backlog of FERC Form 1 data. They take the form of leading/trailing zeroes or null characters in supposedly numeric fields, and occasionally a naked ‘.’

Accordingly, this custom parser strips leading and trailing zeros and null characters, and replaces a bare ‘.’ character with zero, allowing all these fields to be cast to numeric values.

Parameters:
  • field – The DBF field being parsed.

  • data – Binary data (bytes) read from the DBF file.

pudl.extract.ferc1.get_raw_df(ferc1_dbf_ds: Ferc1DbfDatastore, table: str, dbc_map: dict[str, dict[str, str]], years: list[int] = DataSource.from_id('ferc1').working_partitions['years']) pandas.DataFrame[source]#

Combine several years of a given FERC Form 1 DBF table into a dataframe.

Parameters:
  • ferc1_dbf_ds – Initialized FERC 1 DBF datastore

  • table – The name of the FERC Form 1 table from which data is read.

  • dbc_map – A dictionary returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.

  • years – List of years to be combined into a single DataFrame.

Returns:

A DataFrame containing multiple years of FERC Form 1 data for the requested table.

pudl.extract.ferc1.dbf2sqlite(ferc1_to_sqlite_settings: Ferc1DbfToSqliteSettings | None = None, pudl_settings: dict[str, Any] | None = None, clobber: bool = False, datastore: Datastore | None = None) None[source]#

Clone the FERC Form 1 Visual FoxPro databases into SQLite.

Parameters:
  • ferc1_to_sqlite_settings – Object containing Ferc1 to SQLite validated settings. If None (the default) then a default Ferc1DbfToSqliteSettings object will be used.

  • pudl_settings – Dictionary containing paths and database URLs used by PUDL.

  • clobber – Whether to clobber an existing FERC 1 database.

  • datastore – instance of a datastore providing access to raw resources.

pudl.extract.ferc1.get_ferc1_meta(ferc1_engine: sqlalchemy.engine.Engine) sqlalchemy.MetaData[source]#

Grab the FERC Form 1 DB metadata and check that tables exist.

Connects to the FERC Form 1 SQLite database and reads in its metadata (table schemas, types, etc.) by reflecting the database. Checks to make sure the DB is not empty, and returns the metadata object.

Parameters:

ferc1_engine – SQL Alchemy database connection engine for the PUDL FERC 1 DB.

Returns:

A SQL Alchemy metadata object, containing the definition of the DB structure.

Raises:

ValueError – If there are no tables in the SQLite Database.

pudl.extract.ferc1.extract_dbf(ferc1_settings: Ferc1Settings | None = None, pudl_settings: dict[str, Any] | None = None) dict[str, pandas.DataFrame][source]#

Coordinates the extraction of all FERC Form 1 tables into PUDL.

Parameters:
  • ferc1_settings – Object containing validated settings relevant to FERC Form 1. Contains the tables and years to be loaded into PUDL.

  • pudl_settings – A PUDL settings dictionary.

Returns:

A dictionary of DataFrames, with the names of PUDL database tables as the keys. These are the raw unprocessed dataframes, reflecting the data as it is in the FERC Form 1 DB, for passing off to the data tidying and cleaning fuctions found in the pudl.transform.ferc1 module.

Raises:

ValueError – If the FERC table requested is not integrated into PUDL

pudl.extract.ferc1.extract_xbrl(ferc1_settings: Ferc1Settings | None = None, pudl_settings: dict[str, Any] | None = None) dict[str, dict[Literal[duration, instant], pandas.DataFrame]][source]#

Coordinates the extraction of all FERC Form 1 tables into PUDL from XBRL data.

Parameters:
  • ferc1_settings – Object containing validated settings relevant to FERC Form 1. Contains the tables and years to be loaded into PUDL.

  • pudl_settings – A PUDL settings dictionary.

Returns:

A dictionary where keys are the names of the PUDL database tables, values are dictionaries of DataFrames coresponding to the instant and duration tables from the XBRL derived FERC 1 database.

Raises:

ValueError – If the FERC table requested is not yet integrated into PUDL.

pudl.extract.ferc1.extract_xbrl_generic(ferc1_engine: sqlalchemy.engine.Engine, ferc1_settings: pudl.settings.Ferc1Settings, table_name: str) pandas.DataFrame[source]#

Extract a single FERC Form 1 XBRL table by name.

Parameters:
  • ferc1_engine – An SQL Alchemy connection engine for the FERC Form 1 database.

  • ferc1_settings – Object containing validated settings relevant to FERC Form 1.

  • table_name – Name of the XBRL table to extract, as it appears in the original XBRL derived SQLite database.

pudl.extract.ferc1.extract_dbf_generic(ferc1_engine: sqlalchemy.engine.Engine, ferc1_settings: pudl.settings.Ferc1Settings, table_name: str) pandas.DataFrame[source]#

Extract a single FERC Form 1 DBF table by name.

Parameters:
  • ferc1_engine – An SQL Alchemy connection engine for the FERC Form 1 database.

  • ferc1_settings – Object containing validated settings relevant to FERC Form 1.

  • table_name – Name of desired output table to produce.

pudl.extract.ferc1.extract_xbrl_metadata(pudl_settings: dict[Any]) list[dict[Any]][source]#

Extract the XBRL Taxonomy we’ve stored as JSON.