pudl.extract.dbf
Generalized DBF extractor for FERC data.
Module Contents#
Classes#
DbfTableSchema | Simple data-wrapper for the fox-pro table schema.
AbstractFercDbfReader | This is the interface definition for dealing with fox-pro datastores.
FercFieldParser | A custom DBF parser to deal with bad FERC data types.
FercDbfReader | Wrapper to provide standardized access to FERC DBF databases.
FercDbfExtractor | Generalized class for loading data from foxpro databases into SQLAlchemy.
Attributes#
- class pudl.extract.dbf.DbfTableSchema(table_name: str)[source]#
Simple data-wrapper for the fox-pro table schema.
- add_column(col_name: str, col_type: sqlalchemy.types.TypeEngine, short_name: str | None = None)[source]#
Adds a new column to this table schema.
- get_columns() collections.abc.Iterator[tuple[str, sqlalchemy.types.TypeEngine]] [source]#
Iterates over the (column_name, column_type) pairs.
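The add_column() and get_columns() pair can be illustrated with a small stand-in class. This is only a sketch of the documented interface, not the PUDL implementation: SketchTableSchema, the table and column names, the use of plain strings in place of SQLAlchemy types, and the way short_name is stored are all assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Iterator


class SketchTableSchema:
    """Hypothetical stand-in mirroring the documented DbfTableSchema interface."""

    def __init__(self, table_name: str):
        self.name = table_name
        self._columns: list[tuple[str, str]] = []
        # Assumption: short names are kept as a lookup back to the full column name.
        self._short_names: dict[str, str] = {}

    def add_column(
        self, col_name: str, col_type: str, short_name: str | None = None
    ) -> None:
        """Record a column; plain strings stand in for SQLAlchemy types."""
        self._columns.append((col_name, col_type))
        if short_name is not None:
            self._short_names[short_name] = col_name

    def get_columns(self) -> Iterator[tuple[str, str]]:
        """Yield (column_name, column_type) pairs in insertion order."""
        yield from self._columns


# Illustrative table and column names only.
schema = SketchTableSchema("example_table")
schema.add_column("respondent_id", "Integer", short_name="RESPONDENT")
schema.add_column("respondent_name", "String")
columns = list(schema.get_columns())
```

Because get_columns() returns an iterator, callers that need to walk the columns more than once should materialize it first, as the last line does.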
- class pudl.extract.dbf.AbstractFercDbfReader[source]#
Bases: Protocol
This is the interface definition for dealing with fox-pro datastores.
- get_table_schema(table_name: str, year: int) DbfTableSchema [source]#
Returns schema for a given table and a given year.
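Since this interface is a Protocol, a reader can satisfy it structurally without inheriting from it. The sketch below shows that idea with hypothetical names (SchemaReader, InMemoryReader, column_names) and a plain dict standing in for DbfTableSchema; none of these come from PUDL.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class SchemaReader(Protocol):
    """Hypothetical protocol shaped like the documented get_table_schema method."""

    def get_table_schema(self, table_name: str, year: int) -> dict[str, str]:
        ...


class InMemoryReader:
    """Satisfies SchemaReader structurally; it does not inherit from it."""

    def __init__(self, schemas: dict[tuple[str, int], dict[str, str]]):
        self._schemas = schemas

    def get_table_schema(self, table_name: str, year: int) -> dict[str, str]:
        # Look up the schema for this (table, year) pair.
        return self._schemas[(table_name, year)]


def column_names(reader: SchemaReader, table_name: str, year: int) -> list[str]:
    """Works with any object that provides get_table_schema."""
    return list(reader.get_table_schema(table_name, year))


reader = InMemoryReader({("example_table", 2020): {"id": "Integer", "name": "String"}})
names = column_names(reader, "example_table", 2020)
```

Code written against the protocol, like column_names() here, stays independent of how a particular reader stores or fetches its data.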
- class pudl.extract.dbf.FercFieldParser(table, memofile=None)[source]#
Bases: dbfread.FieldParser
A custom DBF parser to deal with bad FERC data types.
- parseN(field, data: bytes) int | float | None [source]#
Augments the Numeric DBF parser to account for bad FERC data.
There are a small number of bad entries in the backlog of FERC Form 1 data. They take the form of leading/trailing zeros or null characters in supposedly numeric fields, and occasionally a bare ‘.’ character.
Accordingly, this custom parser strips leading and trailing zeros and null characters, and replaces a bare ‘.’ character with zero, allowing all these fields to be cast to numeric values.
- Parameters:
field – The DBF field being parsed.
data – Binary data (bytes) read from the DBF file.
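The cleanup described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the parseN() source: clean_numeric_field is a hypothetical helper that does not use dbfread, and it removes only leading zeros, because stripping trailing zeros from an integer string would change its value.

```python
from __future__ import annotations


def clean_numeric_field(data: bytes) -> int | float | None:
    """Hypothetical helper: tidy raw bytes from a numeric DBF field, then cast."""
    # Drop surrounding spaces and null characters.
    text = data.strip(b" \x00").decode("ascii")
    # A bare "." carries no digits; treat it as zero.
    if text == ".":
        return 0
    # Nothing left means the field was empty.
    if not text:
        return None
    # Drop leading zeros, keeping one digit if the field was all zeros.
    text = text.lstrip("0") or "0"
    try:
        return int(text)
    except ValueError:
        # Not an integer, so fall back to a float.
        return float(text)
```

For example, clean_numeric_field(b"  0042\x00") gives 42 and clean_numeric_field(b".") gives 0, while an empty or all-null field gives None.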
- pudl.extract.dbf.DBF_TYPES[source]#
A mapping of DBF field types to SQLAlchemy Column types.
This dictionary maps the strings which are used to denote field types in the DBF objects to the corresponding generic SQLAlchemy Column types. These definitions come from a combination of the dbfread example program dbf2sqlite and this DBF file format documentation page: http://www.dbase.com/KnowledgeBase/int/db7_file_fmt.htm
Unmapped types are left as ‘XXX’, which should result in an error if encountered.
- Type:
- class pudl.extract.dbf.FercDbfReader(datastore: pudl.workspace.datastore.Datastore, dataset: str, field_parser: dbfread.FieldParser = FercFieldParser)[source]#
Wrapper to provide standardized access to FERC DBF databases.
- _open_csv_resource(base_filename: str) csv.DictReader [source]#
Open the given resource file as csv.DictReader.
- _get_dir(year: int) pathlib.Path [source]#
Returns the directory where the files for given year are stored.
- _get_file(year: int, filename: str) Any [source]#
Returns the file descriptor for a given year and base filename.
- _get_file_by_path(year: int, path: pathlib.Path) Any [source]#
Returns the file descriptor for a file identified by its full path.
- get_table_dbf(table_name: str, year: int) dbfread.DBF [source]#
Opens the DBF for a given table and year.
- get_table_names() list[str] [source]#
Returns list of tables that this datastore provides access to.
- get_db_schema(year: int) dict[str, list[str]] [source]#
Returns dict with table names as keys, and list of column names as values.
- get_table_schema(table_name: str, year: int) DbfTableSchema [source]#
Returns the DbfTableSchema for a given table and a given year.
- _load_single_year(table_name: str, year: int) pandas.DataFrame [source]#
Returns dataframe that holds data for a single year for a given table.
- Parameters:
table_name – name of the table.
year – year for which the data should be loaded.
- class pudl.extract.dbf.FercDbfExtractor(datastore: pudl.workspace.datastore.Datastore, settings: Any, output_path: pathlib.Path, clobber: bool = False)[source]#
Generalized class for loading data from foxpro databases into SQLAlchemy.
When subclassing from this generic extractor, one should implement dataset specific logic in the following manner:
1. Set the DATABASE_NAME class attribute. This controls what filename is used for the output sqlite database.
2. Implement the get_dbf_reader() method to return the right kind of dataset-specific AbstractFercDbfReader instance.
Dataset specific logic and transformations can be injected by overriding:
1. finalize_schema() in order to modify the sqlite schema. This is called just before the schema is written into the sqlite database. This is a good place for adding primary and/or foreign key constraints to tables.
2. transform_table(table_name, df) will be invoked after the dataframe is loaded from the foxpro database and before it’s written to sqlite. This is a good place for table-specific preprocessing and/or cleanup.
3. postprocess() is called after data is written to sqlite. This can be used for database-level final cleanup and transformations (e.g. injecting missing respondent_ids).
The extraction logic is invoked by calling the execute() method of this class.
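The order in which these hooks run can be sketched with a toy base class. Everything here is an assumption made for illustration: SketchExtractor is not the PUDL class, load_tables() stands in for reading through a dataset-specific reader, lists of dicts stand in for dataframes and sqlite tables, and the hook signatures for finalize_schema() and postprocess() are guesses.

```python
from __future__ import annotations

from abc import ABC, abstractmethod

Row = dict[str, object]


class SketchExtractor(ABC):
    """Toy stand-in for the hook order described above; not the PUDL class."""

    DATABASE_NAME: str = ""

    @abstractmethod
    def load_tables(self) -> dict[str, list[Row]]:
        """Stands in for reading tables through a dataset-specific reader."""

    def finalize_schema(self, schema: dict[str, list[str]]) -> dict[str, list[str]]:
        """Hook: adjust the schema just before it is written."""
        return schema

    def transform_table(self, table_name: str, rows: list[Row]) -> list[Row]:
        """Hook: clean one table after loading and before writing."""
        return rows

    def postprocess(self, db: dict[str, list[Row]]) -> None:
        """Hook: database-level cleanup after all data is written."""

    def execute(self) -> dict[str, list[Row]]:
        tables = self.load_tables()
        # Derive a column list per table, then let the subclass adjust it.
        schema = {n: sorted(r[0]) if r else [] for n, r in tables.items()}
        self.schema = self.finalize_schema(schema)
        # Transform each table before it is "written" to the in-memory db.
        db = {n: self.transform_table(n, r) for n, r in tables.items()}
        self.postprocess(db)
        return db


class ExampleExtractor(SketchExtractor):
    DATABASE_NAME = "example.sqlite"

    def load_tables(self):
        return {"plants": [{"plant_name": " Alpha ", "respondent_id": 1}]}

    def transform_table(self, table_name, rows):
        # Table-specific cleanup: trim whitespace from string values.
        return [
            {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows
        ]

    def postprocess(self, db):
        # Database-level cleanup: record which tables were written.
        db["_tables_written"] = [{"name": name} for name in sorted(db)]


result = ExampleExtractor().execute()
```

Keeping the sequencing in execute() and the dataset-specific behavior in overridable hooks means a subclass only has to supply the parts that differ between datasets.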
- abstract get_dbf_reader(datastore: pudl.workspace.datastore.Datastore) AbstractFercDbfReader [source]#
Returns appropriate instance of AbstractFercDbfReader to access the data.
- transform_table(table_name: str, in_df: pandas.DataFrame) pandas.DataFrame [source]#
Transforms the content of a single table.
This method can be used to modify the contents of the dataframe after it has been loaded from the fox-pro database and before it’s written to the sqlite database.
- Parameters:
table_name – name of the table that the dataframe is associated with
in_df – dataframe that holds all records.