pudl.extract.ferc1 module¶
Tools for extracting data from the FERC Form 1 FoxPro database for use in PUDL.
FERC distributes the annual responses to Form 1 as binary FoxPro database files. This format is no longer widely supported, and so our first challenge in accessing the Form 1 data is to convert it into a modern format. In addition, FERC distributes one database for each year, and these databases are not explicitly linked together. Over time the structure has changed as new tables and fields have been added. In order to be able to use the data to do analyses across many years, we need to bring all of it into a unified structure. However it appears that these changes are only entirely additive – the most recent versions of the DB contain all the tables and fields that existed in earlier versions.
PUDL uses the most recently released year of data as a template, and infers the structure of the FERC Form 1 database based on the strings embedded within the binary files, pulling out the names of tables and their constituent columns. The structure of the database is also informed by information we found on the FERC website, including a mapping between the table names, DBF file names, and the pages of the Form 1 (add link to file, which should distributed with the docs) that the data was gathered from, as well as a diagram of the structure of the database as it existed in 2015 (add link/embed image).
Using this inferred structure PUDL creates an SQLite database mirroring the
FERC database using sqlalchemy
. Then we use a python package called
dbfread to extract the data from
the DBF tables, and insert it virtually unchanged into the SQLite database.
However, we do compile a master table of the all the respondent IDs and
respondent names, which all the other tables refer to. Unlike the other tables,
this table has no report_year
and so it represents a merge of all the years
of data. In the event that the name associated with a given respondent ID has
changed over time, we retain the most recently reported name.
Ths SQLite based compilation of the original FERC Form 1 databases can accommodate all 116 tables from all the published years of data (beginning in 1994). Including all the data through 2018, the database takes up more than 7GB of disk space. However, almost 90% of that “data” is embeded binary files in two tables. If those tables are excluded, the database is less than 800MB in size.
The process of cloning the FERC Form 1 database(s) is coordinated by a script
called ferc1_to_sqlite
implemented in pudl.convert.ferc1_to_sqlite
which is controlled by a YAML file. See the example file distributed with the
package.
Once the cloned SQLite database has been created, we use it as an input into the PUDL ETL pipeline, and we extract a small subset of the available tables for further processing and integration with other data sources like the EIA 860 and EIA 923.
-
class
pudl.extract.ferc1.
FERC1FieldParser
(table, memofile=None)[source]¶ Bases:
dbfread.field_parser.FieldParser
A custom DBF parser to deal with bad FERC Form 1 data types.
-
parseN
(field, data)[source]¶ Augments the Numeric DBF parser to account for bad FERC data.
There are a small number of bad entries in the backlog of FERC Form 1 data. They take the form of leading/trailing zeroes or null characters in supposedly numeric fields, and occasionally a naked ‘.’
Accordingly, this custom parser strips leading and trailing zeros and null characters, and replaces a bare ‘.’ character with zero, allowing all these fields to be cast to numeric values.
- Parameters
() (data) –
() –
() –
-
-
class
pudl.extract.ferc1.
Ferc1Datastore
(datastore: pudl.workspace.datastore.Datastore)[source]¶ Bases:
object
Simple datastore wrapper for accessing ferc1 resources.
-
PACKAGE_PATH
= 'pudl.package_data.meta.ferc1_row_maps'¶
-
get_dir
(year: int) → pathlib.Path[source]¶ Returns the path where individual ferc1 files are stored inside the yearly archive.
-
-
pudl.extract.ferc1.
PUDL_RIDS
= {514: 'AEP Texas', 519: 'Upper Michigan Energy Resources Company', 522: 'Luning Energy Holdings LLC, Invenergy Investments', 529: 'Tri-State Generation and Transmission Association', 531: 'Basin Electric Power Cooperative'}¶ Missing FERC 1 Respondent IDs for which we have identified the respondent.
-
pudl.extract.ferc1.
accumulated_depreciation
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of the fields of accumulated_depreciation_ferc1.
- Parameters
- Returns
A DataFrame containing all accumulated_depreciation_ferc1 records.
- Return type
-
pudl.extract.ferc1.
add_sqlite_table
(table_name, sqlite_meta, dbc_map, ds, refyear=2019, testing=False, bad_cols=())[source]¶ Adds a new Table to the FERC Form 1 database schema.
Creates a new sa.Table object named
table_name
and add it to the database schema contained insqlite_meta
. Use the information in the dictionarydbc_map
to translate between the DBF filenames in the datastore (e.g.F1_31.DBF
), and the full name of the table in the FoxPro database (e.g.f1_fuel
) and also between truncated column names extracted from that DBF file, and the full column names extracted from the DBC file. Read the column datatypes out of each DBF file and use them to define the columns in the new Table object.- Parameters
table_name (str) – The name of the new table to be added to the database schema.
sqlite_meta (
sqlalchemy.schema.MetaData
) – The database schema to which the newly definedsqlalchemy.Table
will be added.dbc_map (dict) – A dictionary of dictionaries
ds (
Ferc1Datastore
) – Initialized datastoretesting (bool) – Assume this is a test run, use sandboxes
bad_cols (iterable of 2-tuples) – A list or other iterable containing pairs of strings of the form (table_name, column_name), indicating columns (and their parent tables) which should not be cloned into the SQLite database for some reason.
- Returns
None
-
pudl.extract.ferc1.
check_ferc1_tables
(refyear)[source]¶ Test each FERC 1 data year for compatibility with reference year schema.
-
pudl.extract.ferc1.
dbf2sqlite
(tables, years, refyear, pudl_settings, bad_cols=(), clobber=False, datastore=None)[source]¶ Clone the FERC Form 1 Databsae to SQLite.
- Parameters
tables (iterable) – What tables should be cloned?
years (iterable) – Which years of data should be cloned?
refyear (int) – Which database year to use as a template.
pudl_settings (dict) – Dictionary containing paths and database URLs used by PUDL.
bad_cols (iterable of tuples) – A list of (table, column) pairs indicating columns that should be skipped during the cloning process. Both table and column are strings in this case, the names of their respective entities within the database metadata.
datastore (Datastore) – instance of a datastore to access the resources.
- Returns
None
-
pudl.extract.ferc1.
define_sqlite_db
(sqlite_meta, dbc_map, ds, tables={'f1_106_2009': 'F1_106_2009', 'f1_106a_2009': 'F1_106A_2009', 'f1_106b_2009': 'F1_106B_2009', 'f1_208_elc_dep': 'F1_208_ELC_DEP', 'f1_231_trn_stdycst': 'F1_231_TRN_STDYCST', 'f1_324_elc_expns': 'F1_324_ELC_EXPNS', 'f1_325_elc_cust': 'F1_325_ELC_CUST', 'f1_331_transiso': 'F1_331_TRANSISO', 'f1_338_dep_depl': 'F1_338_DEP_DEPL', 'f1_397_isorto_stl': 'F1_397_ISORTO_STL', 'f1_398_ancl_ps': 'F1_398_ANCL_PS', 'f1_399_mth_peak': 'F1_399_MTH_PEAK', 'f1_400_sys_peak': 'F1_400_SYS_PEAK', 'f1_400a_iso_peak': 'F1_400A_ISO_PEAK', 'f1_429_trans_aff': 'F1_429_TRANS_AFF', 'f1_acb_epda': 'F1_2', 'f1_accumdepr_prvsn': 'F1_3', 'f1_accumdfrrdtaxcr': 'F1_4', 'f1_adit_190_detail': 'F1_5', 'f1_adit_190_notes': 'F1_6', 'f1_adit_amrt_prop': 'F1_7', 'f1_adit_other': 'F1_8', 'f1_adit_other_prop': 'F1_9', 'f1_allowances': 'F1_10', 'f1_allowances_nox': 'F1_ALLOWANCES_NOX', 'f1_audit_log': 'F1_78', 'f1_bal_sheet_cr': 'F1_11', 'f1_capital_stock': 'F1_12', 'f1_cash_flow': 'F1_13', 'f1_cmmn_utlty_p_e': 'F1_14', 'f1_cmpinc_hedge': 'F1_CMPINC_HEDGE', 'f1_cmpinc_hedge_a': 'F1_CMPINC_HEDGE_A', 'f1_co_directors': 'F1_18', 'f1_codes_val': 'F1_76', 'f1_col_lit_tbl': 'F1_79', 'f1_comp_balance_db': 'F1_15', 'f1_construction': 'F1_16', 'f1_control_respdnt': 'F1_17', 'f1_cptl_stk_expns': 'F1_19', 'f1_csscslc_pcsircs': 'F1_20', 'f1_dacs_epda': 'F1_21', 'f1_dscnt_cptl_stk': 'F1_22', 'f1_edcfu_epda': 'F1_23', 'f1_elc_op_mnt_expn': 'F1_27', 'f1_elc_oper_rev_nb': 'F1_26', 'f1_elctrc_erg_acct': 'F1_24', 'f1_elctrc_oper_rev': 'F1_25', 'f1_electric': 'F1_28', 'f1_email': 'F1_EMAIL', 'f1_envrnmntl_expns': 'F1_29', 'f1_envrnmntl_fclty': 'F1_30', 'f1_footnote_data': 'F1_85', 'f1_footnote_tbl': 'F1_87', 'f1_fuel': 'F1_31', 'f1_general_info': 'F1_32', 'f1_gnrt_plant': 'F1_33', 'f1_hydro': 'F1_86', 'f1_ident_attsttn': 'F1_88', 'f1_important_chg': 'F1_34', 'f1_incm_stmnt_2': 'F1_35', 'f1_income_stmnt': 'F1_36', 'f1_leased': 'F1_90', 'f1_load_file_names': 'F1_80', 'f1_long_term_debt': 'F1_93', 'f1_misc_dfrrd_dr': 'F1_38', 'f1_miscgen_expnelc': 'F1_37', 'f1_mthly_peak_otpt': 'F1_39', 'f1_mtrl_spply': 'F1_40', 'f1_nbr_elc_deptemp': 'F1_41', 'f1_nonutility_prop': 'F1_42', 'f1_note_fin_stmnt': 'F1_43', 'f1_nuclear_fuel': 'F1_44', 'f1_officers_co': 'F1_45', 'f1_othr_dfrrd_cr': 'F1_46', 'f1_othr_pd_in_cptl': 'F1_47', 'f1_othr_reg_assets': 'F1_48', 'f1_othr_reg_liab': 'F1_49', 'f1_overhead': 'F1_50', 'f1_pccidica': 'F1_51', 'f1_plant': 'F1_92', 'f1_plant_in_srvce': 'F1_52', 'f1_privilege': 'F1_81', 'f1_pumped_storage': 'F1_53', 'f1_purchased_pwr': 'F1_54', 'f1_r_d_demo_actvty': 'F1_59', 'f1_reconrpt_netinc': 'F1_55', 'f1_reg_comm_expn': 'F1_56', 'f1_respdnt_control': 'F1_57', 'f1_respondent_id': 'F1_1', 'f1_retained_erng': 'F1_58', 'f1_rg_trn_srv_rev': 'F1_RG_TRN_SRV_REV', 'f1_row_lit_tbl': 'F1_84', 'f1_s0_checks': 'F1_S0_CHECKS', 'f1_s0_filing_log': 'F1_S0_FILING_LOG', 'f1_sale_for_resale': 'F1_61', 'f1_sales_by_sched': 'F1_60', 'f1_sbsdry_detail': 'F1_91', 'f1_sbsdry_totals': 'F1_62', 'f1_sched_lit_tbl': 'F1_77', 'f1_schedules_list': 'F1_63', 'f1_security': 'F1_SECURITY', 'f1_security_holder': 'F1_64', 'f1_slry_wg_dstrbtn': 'F1_65', 'f1_steam': 'F1_89', 'f1_substations': 'F1_66', 'f1_sys_error_log': 'F1_82', 'f1_taxacc_ppchrgyr': 'F1_67', 'f1_unique_num_val': 'F1_83', 'f1_unrcvrd_cost': 'F1_68', 'f1_utltyplnt_smmry': 'F1_69', 'f1_work': 'F1_70', 'f1_xmssn_adds': 'F1_71', 'f1_xmssn_elc_bothr': 'F1_72', 'f1_xmssn_elc_fothr': 'F1_73', 'f1_xmssn_line': 'F1_74', 'f1_xtraordnry_loss': 'F1_75'}, refyear=2019, bad_cols=())[source]¶ Defines a FERC Form 1 DB structure in a given SQLAlchemy MetaData object.
Given a template from an existing year of FERC data, and a list of target tables to be cloned, convert that information into table and column names, and data types, stored within a SQLAlchemy MetaData object. Use that MetaData object (which is bound to the SQLite database) to create all the tables to be populated later.
- Parameters
sqlite_meta (sa.MetaData) – A SQLAlchemy MetaData object which is bound to the FERC Form 1 SQLite database.
dbc_map (dict of dicts) – A dictionary of dictionaries, of the kind returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.
ds (
Ferc1Datastore
) – Initialized Ferc1Datastoretables (iterable of strings) – List or other iterable of FERC database table names that should be included in the database being defined. e.g. ‘f1_fuel’ and ‘f1_steam’
refyear (integer) – The year of the FERC Form 1 DB to use as a template for creating the overall multi-year database schema.
bad_cols (iterable of 2-tuples) – A list or other iterable containing pairs of strings of the form (table_name, column_name), indicating columns (and their parent tables) which should not be cloned into the SQLite database for some reason.
- Returns
the effects of the function are stored inside sqlite_meta
- Return type
-
pudl.extract.ferc1.
drop_tables
(engine)[source]¶ Drop all FERC Form 1 tables from the SQLite database.
Creates an sa.schema.MetaData object reflecting the structure of the database that the passed in
engine
refers to, and uses that schema to drop all existing tables.Todo
Treat DB connection as a context manager (with/as).
- Parameters
engine (
sqlalchemy.engine.Engine
) – A DB Engine pointing at an exising SQLite database to be deleted.- Returns
None
-
pudl.extract.ferc1.
extract
(ferc1_tables=('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1'), ferc1_years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), pudl_settings=None)[source]¶ Coordinates the extraction of all FERC Form 1 tables into PUDL.
- Parameters
ferc1_tables (iterable of strings) – List of the FERC 1 database tables to be loaded into PUDL. These are the names of the tables in the PUDL database, not the FERC Form 1 database.
ferc1_years (iterable of ints) – List of years for which FERC Form 1 data should be loaded into PUDL. Note that not all years for which FERC data is available may have been integrated into PUDL yet.
- Returns
A dictionary of pandas DataFrames, with the names of PUDL database tables as the keys. These are the raw unprocessed dataframes, reflecting the data as it is in the FERC Form 1 DB, for passing off to the data tidying and cleaning fuctions found in the
pudl.transform.ferc1
module.- Return type
- Raises
ValueError – If the year is not in the list of years for which FERC data is available
ValueError – If the year is not in the list of working FERC years
ValueError – If the FERC table requested is not integrated into PUDL
-
pudl.extract.ferc1.
fuel
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_fuel table records with plant names, >0 fuel.
- Parameters
- Returns
A DataFrame containing f1_fuel records that have plant_names and non-zero fuel amounts.
- Return type
-
pudl.extract.ferc1.
get_dbc_map
(ds, year, min_length=4)[source]¶ Extract names of all tables and fields from a FERC Form 1 DBC file.
Read the DBC file associated with the FERC Form 1 database for the given
year
, and extract all printable strings longer thanmin_lengh
. Select those strings that appear to be database table names, and their associated field for use in re-naming the truncated column names extracted from the corresponding DBF files (those names are limited to having only 10 characters in their names.)- Parameters
ds (
Ferc1Datastore
) – Initialized datastoreyear – The year of data from which the database table and column names are to be extracted. Typically this is expected to be the most recently available year of FERC Form 1 data.
- Returns
a dictionary whose keys are the long table names extracted from the DBC file, and whose values are lists of pairs of values, the first of which is the full name of each field in the table with the same name as the key, and the second of which is the truncated (<=10 character) long name of that field as found in the DBF file.
- Return type
-
pudl.extract.ferc1.
get_ferc1_meta
(ferc1_engine)[source]¶ Grab the FERC Form 1 DB metadata and check that tables exist.
Connects to the FERC Form 1 SQLite database and reads in its metadata (table schemas, types, etc.) by reflecting the database. Checks to make sure the DB is not empty, and returns the metadata object.
- Parameters
ferc1_engine (sqlalchemy.engine.Engine) – SQL Alchemy database connection engine for the PUDL FERC 1 DB.
- Returns
sqlalchemy.Metadata A SQL Alchemy metadata object, containing the definition of the DB structure.
- Raises
ValueError – If there are no tables in the SQLite Database.
-
pudl.extract.ferc1.
get_fields
(filedata)[source]¶ Produce the expected table names and fields from a DBC file.
- Parameters
filedata – Contents of the DBC file from which to extract.
- Returns
[fields]
- Return type
dict of table_name
-
pudl.extract.ferc1.
get_raw_df
(ds, table, dbc_map, years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019))[source]¶ Combine several years of a given FERC Form 1 DBF table into a dataframe.
- Parameters
ds (
Ferc1Datastore
) – Initialized datastoretable (string) – The name of the FERC Form 1 table from which data is read.
dbc_map (dict of dicts) – A dictionary of dictionaries, of the kind returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.
min_length (int) – The minimum number of consecutive printable
years (list) – Range of years to be combined into a single DataFrame.
- Returns
A DataFrame containing several years of FERC Form 1 data for the given table.
- Return type
-
pudl.extract.ferc1.
missing_respondents
(reported, observed, identified)[source]¶ Fill in missing respondents for the f1_respondent_id table.
- Parameters
reported (iterable) – Respondent IDs appearing in f1_respondent_id.
observed (iterable) – Respondent IDs appearing anywhere in the ferc1 DB.
identified (dict) – A {respondent_id: respondent_name} mapping for those observed but not reported respondent IDs which we have been able to identify based on circumstantial evidence. See also: pudl.extract.ferc1.PUDL_RIDS
- Returns
A list of dictionaries representing minimal f1_respondent_id table records, of the form {“respondent_id”: ID, “respondent_name”: NAME}. These records are generated only for unreported respondents. Identified respondents get the values passed in through
identified
and the other observed but unidentified respondents are named “Missing Respondent ID”- Return type
-
pudl.extract.ferc1.
observed_respondents
(ferc1_engine)[source]¶ Compile the set of all observed respondent IDs found in the FERC 1 database.
A significant number of FERC 1 respondent IDs appear in the data tables, but not in the f1_respondent_id table. In order to construct a self-consisten database with we need to find all of those missing respondent IDs and inject them into the table when we clone the database.
- Parameters
ferc1_engine (sqlalchemy.engine.Engine) – An engine for connecting to the FERC 1 database.
- Returns
Every respondent ID reported in any of the FERC 1 DB tables.
- Return type
-
pudl.extract.ferc1.
plant_in_service
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of the fields of plant_in_service_ferc1.
- Parameters
- Returns
A DataFrame containing all plant_in_service_ferc1 records.
- Return type
-
pudl.extract.ferc1.
plants_hydro
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_hydro for records that have plant names.
- Parameters
- Returns
A DataFrame containing f1_hydro records that have plant names.
- Return type
-
pudl.extract.ferc1.
plants_pumped_storage
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_plants_pumped_storage records with plant names.
- Parameters
- Returns
A DataFrame containing f1_plants_pumped_storage records that have plant names.
- Return type
-
pudl.extract.ferc1.
plants_small
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_small for records with minimum data criteria.
- Parameters
- Returns
A DataFrame containing f1_small records that have plant names and non zero demand, generation, operations, maintenance, and fuel costs.
- Return type
-
pudl.extract.ferc1.
plants_steam
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Create a
pandas.DataFrame
containing valid raw f1_steam records.Selected records must indicate a plant capacity greater than 0, and include a non-null plant name.
- Parameters
- Returns
A DataFrame containing f1_steam records that have plant names and non-zero capacities.
- Return type
-
pudl.extract.ferc1.
purchased_power
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame the fields of purchased_power_ferc1.
- Parameters
- Returns
A DataFrame containing all purchased_power_ferc1 records.
- Return type
-
pudl.extract.ferc1.
show_dupes
(table, dbc_map, data_dir, years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), pk=('respondent_id', 'report_year', 'report_prd', 'row_number', 'spplmnt_num'))[source]¶ Identify duplicate primary keys by year within a given FERC Form 1 table.
- Parameters
table (str) – Name of the original FERC Form 1 table to identify duplicate records in.
years (iterable) – a list or other iterable containing the years that should be searched for duplicate records. By default it is all available years of FERC Form 1 data.
pk (list) – A list of strings identifying the columns in the FERC Form 1 table that should be treated as a composite primary key. By default this includes: respondent_id, report_year, report_prd, row_number, and spplmnt_num.
- Returns
None