pudl.extract.dbf#

Generalized DBF extractor for FERC data.

Module Contents#

Classes#

DbfTableSchema

Simple data wrapper for the FoxPro table schema.

FercDbfArchive

Represents API for accessing files within a single DBF archive.

AbstractFercDbfReader

This is the interface definition for dealing with FoxPro datastores.

FercFieldParser

A custom DBF parser to deal with bad FERC data types.

PartitionedDataFrame

This class bundles pandas.DataFrame with partition information.

FercDbfReader

Wrapper to provide standardized access to FERC DBF databases.

FercDbfExtractor

Generalized class for loading data from FoxPro databases into SQLAlchemy.

Functions#

add_key_constraints(→ sqlalchemy.MetaData)

Adds primary and foreign key constraints to tables present in meta.

deduplicate_by_year(→ pandas.DataFrame | None)

Deduplicate records by year, keeping the most recent version of each record.

Attributes#

logger

DBF_TYPES

A mapping of DBF field types to SQLAlchemy Column types.

pudl.extract.dbf.logger[source]#
exception pudl.extract.dbf.DbcFileMissingError[source]#

Bases: Exception

This is raised when the DBC index file is missing.

class pudl.extract.dbf.DbfTableSchema(table_name: str)[source]#

Simple data wrapper for the FoxPro table schema.

add_column(col_name: str, col_type: sqlalchemy.types.TypeEngine, short_name: str | None = None)[source]#

Adds a new column to this table schema.

get_columns() collections.abc.Iterator[tuple[str, sqlalchemy.types.TypeEngine]][source]#

Iterates over the (column_name, column_type) pairs.

get_column_names() set[str][source]#

Returns set of long column names.

get_column_rename_map() dict[str, str][source]#

Returns dictionary that maps from short to long column names.

create_sa_table(sa_meta: sqlalchemy.MetaData) sqlalchemy.Table[source]#

Creates SQLAlchemy table described by this instance.

Parameters:

sa_meta – the new table will be added to this MetaData object.
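A minimal usage sketch, grounded in the signatures above; the table and column names are hypothetical:

import sqlalchemy as sa

# Hypothetical table and column names, for illustration only.
schema = DbfTableSchema("f1_example")
schema.add_column("respondent_id", sa.Integer())
schema.add_column("report_year", sa.Integer())
schema.add_column("plant_name", sa.String(), short_name="plntnm")

meta = sa.MetaData()
table = schema.create_sa_table(meta)  # the table is now registered in meta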

class pudl.extract.dbf.FercDbfArchive(zipfile: zipfile.ZipFile, dbc_path: pathlib.Path, table_file_map: dict[str, str], partition: dict[str, Any], field_parser: dbfread.FieldParser)[source]#

Represents API for accessing files within a single DBF archive.

Typically, an archive contains data for a single year and a single FERC form dataset (e.g. FERC Form 1 or FERC Form 2).

get_file(filename: str) IO[bytes][source]#

Opens the file within this archive.

get_db_schema() dict[str, list[str]][source]#

Returns dict with table names as keys, and list of column names as values.

get_table_dbf(table_name: str) dbfread.DBF[source]#

Opens the DBF for a given table.

get_table_schema(table_name: str) DbfTableSchema[source]#

Returns the DbfTableSchema for a given table within this archive.

load_table(table_name: str) pandas.DataFrame[source]#

Returns dataframe that holds data for a table contained within this archive.

Parameters:

table_name – name of the table.

class pudl.extract.dbf.AbstractFercDbfReader[source]#

Bases: Protocol

This is the interface definition for dealing with FoxPro datastores.

get_dataset() str[source]#

Returns name of the dataset that this datastore provides access to.

get_table_names() list[str][source]#

Returns list of all available table names.

get_archive(**filters) FercDbfArchive[source]#

Returns single archive matching specific filters.

get_table_schema(table_name: str, year: int) DbfTableSchema[source]#

Returns schema for a given table and a given year.

load_table_dfs(table_name: str, partitions: list[dict[str, Any]]) pandas.DataFrame | None[source]#

Returns dataframe that contains data for a given table across given years.

class pudl.extract.dbf.FercFieldParser(table, memofile=None)[source]#

Bases: dbfread.FieldParser

A custom DBF parser to deal with bad FERC data types.

parseN(field, data: bytes) int | float | None[source]#

Augments the Numeric DBF parser to account for bad FERC data.

There are a small number of bad entries in the backlog of FERC Form 1 data. They take the form of leading/trailing zeroes or null characters in supposedly numeric fields, and occasionally a naked ‘.’

Accordingly, this custom parser strips leading and trailing zeros and null characters, and replaces a bare ‘.’ character with zero, allowing all these fields to be cast to numeric values.

Parameters:
  • field – The DBF field being parsed.

  • data – Binary data (bytes) read from the DBF file.
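A sketch of the cleanup this implies; the real parser's byte-level handling may differ in detail:

import dbfread

class CleaningFieldParser(dbfread.FieldParser):
    # Illustrative only; mirrors the cleanup described above.
    def parseN(self, field, data: bytes) -> int | float | None:
        # Strip whitespace, null characters, and leading zeroes.
        data = data.strip().strip(b"*\x00").lstrip(b"0")
        # A bare "." is not numeric; treat it as zero so it casts cleanly.
        if data == b".":
            data = b"0"
        return super().parseN(field, data)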

pudl.extract.dbf.DBF_TYPES[source]#

A mapping of DBF field types to SQLAlchemy Column types.

This dictionary maps the strings used to denote field types in DBF objects to the corresponding generic SQLAlchemy Column types. These definitions come from a combination of the dbfread example program dbf2sqlite and the DBF file format documentation page: http://www.dbase.com/KnowledgeBase/int/db7_file_fmt.htm. Unmapped types are left as ‘XXX’, which should result in an error if encountered.

Type:

Dict
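An illustrative subset of what such a mapping looks like; the actual DBF_TYPES dictionary covers more field codes:

import sqlalchemy as sa

DBF_TYPES_SUBSET = {
    "C": sa.String,   # Character
    "D": sa.Date,     # Date
    "L": sa.Boolean,  # Logical
    "M": sa.Text,     # Memo (long text)
    "N": sa.Numeric,  # Numeric; dbfread returns int or float
    "0": "XXX",       # Unmapped; should result in an error if encountered
}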

class pudl.extract.dbf.PartitionedDataFrame(df: pandas.DataFrame, partition: dict[str, Any])[source]#

This class bundles pandas.DataFrame with partition information.

class pudl.extract.dbf.FercDbfReader(datastore: pudl.workspace.datastore.Datastore, dataset: str, field_parser: dbfread.FieldParser = FercFieldParser)[source]#

Wrapper to provide standardized access to FERC DBF databases.

get_dataset() str[source]#

Return the name of the dataset this datastore works with.

_open_csv_resource(base_filename: str) csv.DictReader[source]#

Open the given resource file as csv.DictReader.

get_archive(year: int, **filters) FercDbfArchive[source]#

Returns single dbf archive matching given filters.

get_table_names() list[str][source]#

Returns list of tables that this datastore provides access to.

static _normalize(filters: dict[str, Any]) dict[str, str][source]#

Casts partition values to lowercase strings.

valid_partition_filter(fl: dict[str, Any]) bool[source]#

Returns True if a given filter fl is considered to be valid.

This can be used to eliminate partitions that are not suitable for processing; e.g., for early years of FERC Form 2, databases marked with part=1 or part=2 are not suitable.
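A hypothetical override along those lines; the subclass name is a placeholder and the part key is taken from the FERC Form 2 example above:

from typing import Any

class Ferc2DbfReader(FercDbfReader):  # hypothetical subclass
    def valid_partition_filter(self, fl: dict[str, Any]) -> bool:
        # Skip early-year archives that are split into part=1/part=2.
        return fl.get("part") is None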

load_table_dfs(table_name: str, partitions: list[dict[str, Any]]) list[PartitionedDataFrame][source]#

Returns all data for a given table.

Merges data for a given table across all partitions.

Parameters:
  • table_name – name of the table to load.

  • partitions – list of partition filters to use.
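A hypothetical usage sketch; the Datastore construction, table name, and partition filters are placeholders, and it assumes PartitionedDataFrame exposes its constructor arguments as df and partition attributes:

reader = FercDbfReader(datastore, dataset="ferc1")
frames = reader.load_table_dfs(
    "f1_respondent_id",
    partitions=[{"year": 2019}, {"year": 2020}],
)
for pdf in frames:
    print(pdf.partition, len(pdf.df))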

class pudl.extract.dbf.FercDbfExtractor(datastore: pudl.workspace.datastore.Datastore, settings: pudl.settings.FercToSqliteSettings, output_path: pathlib.Path, clobber: bool = False)[source]#

Generalized class for loading data from FoxPro databases into SQLAlchemy.

When subclassing from this generic extractor, one should implement dataset specific logic in the following manner:

1. Set the DATABASE_NAME class attribute. This controls what filename is used for the output sqlite database.

2. Implement the get_dbf_reader() method to return the right kind of dataset-specific AbstractFercDbfReader instance.

Dataset specific logic and transformations can be injected by overriding:

1. finalize_schema() in order to modify the sqlite schema. This is called just before the schema is written into the sqlite database. This is a good place for adding primary and/or foreign key constraints to tables.

2. aggregate_table_frames() is responsible for concatenating individual data frames (one per input partition) into a single one. This is where deduplication can take place.

3. transform_table(table_name, df) will be invoked after a dataframe is loaded from the FoxPro database and before it is written to sqlite. This is a good place for table-specific preprocessing and/or cleanup.

4. postprocess() is called after data is written to sqlite. This can be used for database-level final cleanup and transformations (e.g. injecting missing respondent_ids).

The extraction logic is invoked by calling the execute() method of this class.
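A hypothetical subclass sketch tying these hooks together; the dataset name, database filename, and table/column names are placeholders:

import pandas as pd
import sqlalchemy as sa

class Ferc2DbfExtractor(FercDbfExtractor):
    DATASET = "ferc2"
    DATABASE_NAME = "ferc2.sqlite"

    def get_dbf_reader(self, datastore) -> AbstractFercDbfReader:
        return FercDbfReader(datastore, dataset=self.DATASET)

    def transform_table(self, table_name: str, in_df: pd.DataFrame) -> pd.DataFrame:
        # Table-specific cleanup before the data is written to sqlite.
        return in_df.convert_dtypes()

    def finalize_schema(self, meta: sa.MetaData) -> sa.MetaData:
        # Add key constraints just before the schema is written out.
        return add_key_constraints(
            meta, pk_table="f2_respondent_id", column="respondent_id"
        )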

DATABASE_NAME[source]#
DATASET[source]#
get_settings(global_settings: pudl.settings.FercToSqliteSettings) pudl.settings.GenericDatasetSettings[source]#

Returns dataset-relevant settings from the global_settings.

get_dbf_reader(datastore: pudl.workspace.datastore.Datastore) AbstractFercDbfReader[source]#

Returns appropriate instance of AbstractFercDbfReader to access the data.

get_db_path() str[source]#

Returns the connection string for the sqlite database.

classmethod get_dagster_op() collections.abc.Callable[source]#

Returns a Dagster op that runs this extractor.

execute()[source]#

Runs the extraction of the data from DBF to sqlite.

delete_schema()[source]#

Drops all tables from the existing sqlite database.

create_sqlite_tables()[source]#

Creates database schema based on the input tables.

transform_table(table_name: str, in_df: pandas.DataFrame) pandas.DataFrame[source]#

Transforms the content of a single table.

This method can be used to modify the contents of the dataframe after it has been loaded from the FoxPro database and before it is written to the sqlite database.

Parameters:
  • table_name – name of the table that the dataframe is associated with

  • in_df – dataframe that holds all records.

static is_valid_partition(fl: dict[str, Any]) bool[source]#

Returns True if the partition filter should be considered for processing.

aggregate_table_frames(table_name: str, dfs: list[PartitionedDataFrame]) pandas.DataFrame | None[source]#

Function to aggregate partitioned data frames into a single one.

By default, this simply concatenates the frames, but custom dataset-specific behaviors can be implemented.
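A hypothetical override, written as it would appear inside a FercDbfExtractor subclass, that wires deduplicate_by_year() (documented below) into this hook; the pk_column name is a placeholder:

import pandas as pd

def aggregate_table_frames(
    self, table_name: str, dfs: list[PartitionedDataFrame]
) -> pd.DataFrame | None:
    # Keep only the most recent version of each record instead of
    # plain concatenation.
    return deduplicate_by_year(dfs, pk_column="record_id")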

load_table_data()[source]#

Loads all tables from the FoxPro database and writes them to sqlite.

finalize_schema(meta: sqlalchemy.MetaData) sqlalchemy.MetaData[source]#

This method is called just before the schema is written to sqlite.

You can use this method to apply dataset specific alterations to the schema, such as adding primary and foreign key constraints.

postprocess()[source]#

This method is called after all the data is loaded into sqlite.

pudl.extract.dbf.add_key_constraints(meta: sqlalchemy.MetaData, pk_table: str, column: str, pk_column: str | None = None) sqlalchemy.MetaData[source]#

Adds primary and foreign key constraints to tables present in meta.

Parameters:
  • meta – constraints will be applied to this metadata instance

  • pk_table – name of the table that contains the primary key.

  • column – foreign key column name. Tables that contain this column will have a foreign-key constraint added.

  • pk_column – (optional) if specified, this is the primary key column name in pk_table. If not specified, it is assumed to be the same as column.
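Hypothetical usage, e.g. within finalize_schema(); the table and column names are placeholders:

meta = add_key_constraints(
    meta,
    pk_table="f1_respondent_id",
    column="respondent_id",
)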

pudl.extract.dbf.deduplicate_by_year(dfs: list[PartitionedDataFrame], pk_column: str) pandas.DataFrame | None[source]#

Deduplicate records by year, keeping the most recent version of each record.

pk_column is used as the primary key column. The report_yr column is expected to be present; if it is not, it is derived from partition["year"].
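An illustrative sketch of the expected behavior, assuming PartitionedDataFrame exposes df and partition attributes; the column names are placeholders:

import pandas as pd

dfs = [
    PartitionedDataFrame(pd.DataFrame({"id": [1], "val": ["old"]}), {"year": 2019}),
    PartitionedDataFrame(pd.DataFrame({"id": [1], "val": ["new"]}), {"year": 2020}),
]
df = deduplicate_by_year(dfs, pk_column="id")
# Expected result: a single row with val == "new" and report_yr == 2020.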