pudl.transform.ferc714¶
Transformation of the FERC Form 714 data.
FERC Form 714 has two separate raw data sources - CSV and XBRL. For both sources there is usually some specific processing that needs to happen before the two data sources get concatenated together to create the full timeseries. We are currently processing three tables from 714. Each one is processed using a similar pattern: we’ve defined a class with a run classmethod as a coordinating method, any table-specific transforms are defined as staticmethod’s within the table class and any generic 714 transforms are defined as internal module functions. The table assets are created through a small function that calls the run method. Any of the methods or functions that only apply to either of the raw data sources should include a raw datasource suffix.
Attributes¶
Fake respondent IDs for database test entities. |
|
A mapping of timezone offset codes to Timedelta offsets from UTC. |
|
Mapping between standardized time offset codes and canonical timezones. |
|
Overrides of FERC 714 respondent IDs with wrong or missing EIA Codes. |
|
Classes¶
Class for building the core_ferc714__respondent_id asset. |
|
Class for building the out_ferc714__hourly_planning_area_demand asset. |
|
Class for building the core_ferc714__yearly_planning_area_demand_forecast asset. |
|
Define some simple checks that can run on FERC 714 assets. |
Functions¶
|
A simple transform function for processing the CSV raw data. |
|
Assign the PUDL-assigned respondent_id_ferc714 based on the native respondent ID. |
|
Wrapper around filter_for_freshest_data_xbrl. |
|
Fill missing CSV or XBRL respondent id. |
|
Add a report_day column. |
|
Transform the FERC 714 respondent IDs, names, and EIA utility IDs. |
Build the out_ferc714__hourly_planning_area_demand. |
|
Build the core_ferc714__yearly_planning_area_demand_forecast. |
|
|
Turn the Ferc714CheckSpec into an actual Dagster asset check. |
Module Contents¶
- pudl.transform.ferc714.BAD_RESPONDENTS = [2, 319, 99991, 99992, 99993, 99994, 99995][source]¶
Fake respondent IDs for database test entities.
- pudl.transform.ferc714.TIMEZONE_OFFSET_CODES[source]¶
A mapping of timezone offset codes to Timedelta offsets from UTC.
Note that the FERC 714 instructions state that all hourly demand is to be reported in STANDARD time for whatever timezone is being used. Even though many respondents use daylight savings / standard time abbreviations, a large majority do appear to conform to using a single UTC offset throughout the year. There are 6 instances in which the timezone associated with reporting changed dropped.
- pudl.transform.ferc714.TIMEZONE_CODES[source]¶
Mapping between standardized time offset codes and canonical timezones.
- pudl.transform.ferc714.EIA_CODE_FIXES: dict[Literal['combined', 'csv', 'xbrl'], dict[int | str], int][source]¶
Overrides of FERC 714 respondent IDs with wrong or missing EIA Codes.
This is used in
RespondentId.spot_fix_eia_codes()
. The dictionary is organized by “source” keys (“combined”, “csv”, or “xbrl”). Each source’s value is a secondary dictionary which contains source respondent ID’s as keys and fixes for EIA codes as values.We separated these fixes by either coming directly from the CSV data, the XBRL data, or the combined data. We use the corresponding source or PUDL-derived respondent ID to identify the EIA code to overwrite. We could have combined these fixes all into one set of combined fixes identified by the PUDL-derived
respondent_id_ferc714
, but this way we can do more targeted source-based cleaning and test each source’s EIA codes before the sources are concatenated together.
- pudl.transform.ferc714._pre_process_csv(df: pandas.DataFrame, table_name: str) pandas.DataFrame [source]¶
A simple transform function for processing the CSV raw data.
Removes footnotes columns ending with _f
Drops report_prd, spplmnt_num, and row_num columns
Excludes records which pertain to bad (test) respondents.
- pudl.transform.ferc714._assign_respondent_id_ferc714(df: pandas.DataFrame, source: Literal['csv', 'xbrl']) pandas.DataFrame [source]¶
Assign the PUDL-assigned respondent_id_ferc714 based on the native respondent ID.
We need to replace the natively reported respondent ID from each of the two FERC714 sources with a PUDL-assigned respondent ID. The mapping between the native ID’s and these PUDL-assigned ID’s can be accessed in the database tables
respondents_csv_ferc714
andrespondents_xbrl_ferc714
.- Parameters:
df – the input table with the native respondent ID column.
source – the lower-case string name of the source of the FERC714 data. Either csv
xbrl. (or)
- Returns:
an augmented version of the input
df
with a new column that replaces the natively reported respondent ID with the PUDL-assigned respondent ID.
- pudl.transform.ferc714._filter_for_freshest_data_xbrl(raw_xbrl: pandas.DataFrame, table_name: str, instant_or_duration: Literal['instant', 'duration'])[source]¶
Wrapper around filter_for_freshest_data_xbrl.
Most of the specific stuff here is in just converting the core table name into the raw instant or duration XBRL table name.
- pudl.transform.ferc714._fillna_respondent_id_ferc714_source(df: pandas.DataFrame, source: Literal['csv', 'xbrl']) pandas.DataFrame [source]¶
Fill missing CSV or XBRL respondent id.
The source (CSV or XBRL) tables get assigned a PUDL-derived
respondent_id_ferc714
ID column (via_assign_respondent_id_ferc714()
). After we concatenate the source tables, we sometimes backfill and forward-fill the source IDs (respondent_id_ferc714_csv
andrespondent_id_ferc714_xbrl
). This way the older records from the CSV years will also have the XBRL ID’s and vice versa. This will enable users to find the full timeseries of a respondent that given either source ID (instead of using the source ID to find the PUDL-derived ID and then finding the records).
- pudl.transform.ferc714.assign_report_day(df: pandas.DataFrame, date_col: str) pandas.DataFrame [source]¶
Add a report_day column.
- class pudl.transform.ferc714.RespondentId[source]¶
Class for building the core_ferc714__respondent_id asset.
Most of the methods in this class as staticmethods. The purpose of using a class in this instance is mostly for organizing the table specific transforms under the same name-space.
- classmethod run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Build the table for the core_ferc714__respondent_id asset.
Process and combine the CSV and XBRL based data.
There are two main threads of transforms happening here:
Table compatibility: The CSV raw table is static (does not even report years) while the xbrl table is reported annually. A lot of the downstream analysis expects this table to be static. So the first step was to check whether or not the columns that we have in the CSV years had consistent data over the few XBRL years that we have. There are a small number of eia_code’s we needed to clean up, but besides that it was static. We then convert the XBRL data into a static table, then we concat-ed the tables and checked the static-ness again via
ensure_eia_code_uniqueness()
.eia_code cleaning: Clean up FERC-714 respondent names and manually assign EIA utility IDs to a few FERC Form 714 respondents that report planning area demand, but which don’t have their corresponding EIA utility IDs provided by FERC for some reason (including PacifiCorp). Done all via
spot_fix_eia_codes()
& EIA_CODE_FIXES.
- static spot_fix_eia_codes(df: pandas.DataFrame, source: Literal['csv', 'xbrl', 'combined']) pandas.DataFrame [source]¶
Spot fix the eia_codes.
Using the manually compiled fixes to the
eia_code
column stored inEIA_CODE_FIXES
, replace the reported values by respondent.
- static ensure_eia_code_uniqueness(df: pandas.DataFrame, source: Literal['csv', 'xbrl', 'combined']) pandas.DataFrame [source]¶
Ensure there is only one unique eia_code for each respondent.
- static clean_eia_codes_xbrl(xbrl: pandas.DataFrame) pandas.DataFrame [source]¶
Make eia_code’s cleaner coming from the XBRL data.
Desired outcomes here include all respondents have only one non-null eia_code and all eia_codes that are actually the respondent_id_ferc714_xbrl are nulled.
- static convert_into_static_table_xbrl(xbrl: pandas.DataFrame) pandas.DataFrame [source]¶
Convert this annually reported table into a skinnier, static table.
The CSV table is entirely static - it doesn’t have any reported changes that vary over time. The XBRL table does have start and end dates in it. In order to merge these two sources, we are checking whether or not the shared variables change over time and then converting this table into a non-time-varying table.
- static condense_into_one_source_table(df)[source]¶
Condense the CSV and XBRL records together into one record.
We have two records coming from each of the two sources in this table. This method simply drops duplicates based on the PKs of the table. We know that the names are different in the CSV vs the XBRL source. We are going to grab the XBRL names because they are more recent.
NOTE: We could have merged the data in
run()
instead of concatenating along the index. We would have had to develop different methods forensure_eia_code_uniqueness()
.
- pudl.transform.ferc714.core_ferc714__respondent_id(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Transform the FERC 714 respondent IDs, names, and EIA utility IDs.
This is a light wrapper around
RespondentId
because you need to build an asset from a function - not a staticmethod of a class.- Parameters:
raw_csv – Raw table describing the FERC 714 Respondents from the CSV years.
raw_xbrl_duration – Raw table describing the FERC 714 Respondents from the XBRL years.
- Returns:
A clean(er) version of the FERC-714 respondents table.
- class pudl.transform.ferc714.HourlyPlanningAreaDemand[source]¶
Class for building the out_ferc714__hourly_planning_area_demand asset.
The out_ferc714__hourly_planning_area_demand table is an hourly time series of demand by Planning Area.
Most of the methods in this class as staticmethods. The purpose of using a class in this instance is mostly for organizing the table specific transforms under the same name-space.
- classmethod run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame) pandas.DataFrame [source]¶
Build the out_ferc714__hourly_planning_area_demand asset.
To transform this table we have to process the instant and duration xbrl tables so we can merge them together and process the XBRL data. We also have to process the CSV data so we can concatenate it with the XBLR data. Then we can process all of the data together.
For both the CSV and XBRL data, the main transforms that are happening have to do with cleaning the timestamps in the data, resulting in timestamps that are in a datetime format and are nearly continuous for every respondent.
Once the CSV and XBRL data is merged together, the transforms are mostly focused on cleaning the timezone codes reported to FERC and then using those timezone codes to convert all of timestamps into UTC datetime.
The outcome here is nearly continuous and non-duplicative time series.
- static melt_hourx_columns_csv(df)[source]¶
Melt hourX columns into hours.
There are some instances of the CSVs with a 25th hour. We drop those entirely because almost all of them are unusable (0.0 or daily totals), and they shouldn’t really exist at all based on FERC instructions.
- static parse_date_strings_csv(csv)[source]¶
Convert report_date into pandas Datetime types.
Make the report_date column from the daily string
report_date
and the integerhour
column.
- static remove_yearly_records_duration_xbrl(duration_xbrl)[source]¶
Convert a table with mostly daily records with some annuals into fully daily.
Almost all of the records have a start_date that == the end_date which I’m assuming means the record spans the duration of one day there are a small handful of records which seem to span a full year.
- static merge_instant_and_duration_tables_xbrl(instant_xbrl: pandas.DataFrame, duration_xbrl: pandas.DataFrame, table_name: str) pandas.DataFrame [source]¶
Merge XBRL instant and duration tables, reshaping instant as needed.
FERC714 XBRL instant period signifies that it is true as of the reported date, while a duration fact pertains to the specified time period. The
date
column for an instant fact corresponds to theend_date
column of a duration fact.- Parameters:
instant_xbrl – table representing XBRL instant facts.
raw_xbrl_duration – table representing XBRL duration facts.
- Returns:
A unified table combining the XBRL duration and instant facts, if both types of facts were present. If either input dataframe is empty, the other dataframe is returned unchanged, except that several unused columns are dropped. If both input dataframes are empty, an empty dataframe is returned.
- static convert_dates_to_zero_offset_hours_xbrl(xbrl: pandas.DataFrame) pandas.DataFrame [source]¶
Convert all hours to: Hour (24-hour clock) as a zero-padded decimal number.
The FERC 714 form includes columns for the hours of each day. Those columns are labeled with 1-24 to indicate the hours of the day. The XBRL filings themselves have time-like string associated with each of the facts. They include both a the year-month-day portion (formatted as %Y-%m-%d) as well as an hour-minute-second component (semi-formatted as T%H:%M:%S). Attempting to simply convert this timestamp information to a datetime using the format
"%Y-%m-%dT%H:%M:%S"
fails because about a third of the records include hour 24 - which is not an accepted hour in standard datetime formats.The respondents that report hour 24 do not report hour 00. We have done some spot checking of values reported to FERC and have determined that hour 24 seems to correspond with hour 00 (of the next day). We have not gotten complete confirmation from FERC staff that this is always the case, but it seems like a decent assumption.
So, this step converts all of the hour 24 records to be hour 00 of the next day.
- static convert_dates_to_zero_seconds_xbrl(xbrl: pandas.DataFrame) pandas.DataFrame [source]¶
Convert the last second of the day records to the first (0) second of the next day.
There are a small amount of records which report the last “hour” of the day as last second of the day, as opposed to T24 cleaned in
convert_dates_to_zero_offset_hours_xbrl()
or T00 which is standard for a datetime. This function finds these records and adds one second to them and then ensures all of the records has 0’s for seconds.
- static spot_fix_records_xbrl(xbrl: pandas.DataFrame)[source]¶
Spot fix some specific XBRL records.
- static ensure_dates_are_continuous(df: pandas.DataFrame, source: Literal['csv', 'xbrl'])[source]¶
Assert that almost all respondents have continuous timestamps.
In the xbrl data, we found 41 gaps in the timeseries! They are almost entirely on the hour in which daylight savings times goes into effect. The csv data had 10 gaps. Pretty good all in all!
- static standardize_offset_codes(df: pandas.DataFrame, offset_fixes) pandas.Series [source]¶
Convert to standardized UTC offset abbreviations.
This function ensures that all of the 3-4 letter abbreviations used to indicate a timestamp’s localized offset from UTC are standardized, so that they can be used to make the timestamps timezone aware. The standard abbreviations we’re using are:
“HST”: Hawaii Standard Time “AKST”: Alaska Standard Time “AKDT”: Alaska Daylight Time “PST”: Pacific Standard Time “PDT”: Pacific Daylight Time “MST”: Mountain Standard Time “MDT”: Mountain Daylight Time “CST”: Central Standard Time “CDT”: Central Daylight Time “EST”: Eastern Standard Time “EDT”: Eastern Daylight Time
In some cases different respondents use the same non-standard abbreviations to indicate different offsets, and so the fixes are applied on a per-respondent basis, as defined by offset_fixes.
- Parameters:
df – DataFrame containing a utc_offset_code column that needs to be standardized.
offset_fixes – A dictionary with respondent_id_ferc714 values as the keys, and a dictionary mapping non-standard UTC offset codes to the standardized UTC offset codes as the value.
- Returns:
Standardized UTC offset codes.
- static construct_utc_datetime(df: pandas.DataFrame) pandas.DataFrame [source]¶
Construct datetime_utc column.
- static spot_fix_values(df: pandas.DataFrame) pandas.DataFrame [source]¶
Spot fix values.
- pudl.transform.ferc714.out_ferc714__hourly_planning_area_demand(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame) pandas.DataFrame [source]¶
Build the out_ferc714__hourly_planning_area_demand.
This is a light wrapper around
HourlyPlanningAreaDemand
because it seems you need to build an asset from a function - not a staticmethod of a class.
- class pudl.transform.ferc714.YearlyPlanningAreaDemandForecast[source]¶
Class for building the core_ferc714__yearly_planning_area_demand_forecast asset.
The core_ferc714__yearly_planning_area_demand_forecast table is an annual, forecasted time series of demand by Planning Area.
Most of the methods in this class as staticmethods. The purpose of using a class in this instance is mostly for organizing the table specific transforms under the same name-space.
- classmethod run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Build the core_ferc714__yearly_planning_area_demand_forecast asset.
To transform this table we have to process the CSV data and the XBRL duration data (this data has not instant table), merge together the XBRL and CSV data, and process the combined datasets.
The main transforms include spot-fixing forecast years with
spot_fix_forecast_years_xbrl()
and averaging out duplicate forecast values for duplicate primary key rows in the CSV table.
- static spot_fix_forecast_years_xbrl(df)[source]¶
Spot fix forecast year errors.
This function fixes the following errors:
There’s one record with an NA forecast_year value. This row also has no demand forcast values. Because forcast_year is a primary key we can’t have any NA values. Because there are no substantive forcasts in this row, we can safely remove this row.
respondent_id_ferc714 number 107 reported their forecast_year as YY instead of YYYY values.
There’s also at least one forecast year value reported as 3033 that should be 2033.
This function also checks that the values for forecast year are within an expected range.
- static average_duplicate_pks_csv(df)[source]¶
Average forecast values for duplicate primary keys.
The XBRL data had duplicate primary keys, but it was easy to parse them by keeping rows with the most recent publication_time value. The CSVs have no such distinguishing column, dispite having some duplicate primary keys.
This function takes the average of the forecast values for rows with duplicate primary keys. There are only 6 respondent/report_year/ forecast year rows where the forecast values differ. One of those is a pair where one forecast value is 0. We’ll take the non-zero value here and average out the rest.
- pudl.transform.ferc714.core_ferc714__yearly_planning_area_demand_forecast(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) pandas.DataFrame [source]¶
Build the core_ferc714__yearly_planning_area_demand_forecast.
This is a light wrapper around
YearlyPlanningAreaDemandForecast
because it seems you need to build an asset from a function - not a staticmethod of a class.
- class pudl.transform.ferc714.Ferc714CheckSpec[source]¶
Define some simple checks that can run on FERC 714 assets.
- pudl.transform.ferc714.make_check(spec: Ferc714CheckSpec) dagster.AssetChecksDefinition [source]¶
Turn the Ferc714CheckSpec into an actual Dagster asset check.