pudl.transform.ferc714
======================

.. py:module:: pudl.transform.ferc714

.. autoapi-nested-parse::

   Transformation of the FERC Form 714 data.

   FERC Form 714 has two separate raw data sources: CSV and XBRL. For both sources,
   some source-specific processing usually needs to happen before the two data
   sources are concatenated together to create the full timeseries. We are
   currently processing three tables from FERC 714. Each one is processed using a
   similar pattern: we've defined a class with a ``run`` classmethod as a
   coordinating method, any table-specific transforms are defined as staticmethods
   within the table class, and any generic FERC 714 transforms are defined as
   internal module functions. The table assets are created through a small function
   that calls the ``run`` method. Any method or function that applies to only one
   of the raw data sources includes a raw data source suffix.

Attributes
----------

.. autoapisummary::

   pudl.transform.ferc714.logger
   pudl.transform.ferc714.TIMEZONE_OFFSET_CODE_FIXES
   pudl.transform.ferc714.TIMEZONE_OFFSET_CODE_FIXES_BY_YEAR
   pudl.transform.ferc714.BAD_RESPONDENTS
   pudl.transform.ferc714.TIMEZONE_OFFSET_CODES
   pudl.transform.ferc714.TIMEZONE_CODES
   pudl.transform.ferc714.EIA_CODE_FIXES
   pudl.transform.ferc714.RENAME_COLS
   pudl.transform.ferc714.check_specs
   pudl.transform.ferc714._checks

Classes
-------

.. autoapisummary::

   pudl.transform.ferc714.RespondentId
   pudl.transform.ferc714.HourlyPlanningAreaDemand
   pudl.transform.ferc714.YearlyPlanningAreaDemandForecast
   pudl.transform.ferc714.Ferc714CheckSpec

Functions
---------

.. autoapisummary::

   pudl.transform.ferc714._pre_process_csv
   pudl.transform.ferc714._assign_respondent_id_ferc714
   pudl.transform.ferc714._filter_for_freshest_data_xbrl
   pudl.transform.ferc714._fillna_respondent_id_ferc714_source
   pudl.transform.ferc714.assign_report_day
   pudl.transform.ferc714.core_ferc714__respondent_id
   pudl.transform.ferc714.out_ferc714__hourly_planning_area_demand
   pudl.transform.ferc714.core_ferc714__yearly_planning_area_demand_forecast
   pudl.transform.ferc714.make_check

Module Contents
---------------

.. py:data:: logger

.. py:data:: TIMEZONE_OFFSET_CODE_FIXES

.. py:data:: TIMEZONE_OFFSET_CODE_FIXES_BY_YEAR

.. py:data:: BAD_RESPONDENTS
   :value: [2, 319, 99991, 99992, 99993, 99994, 99995]

   Fake respondent IDs for database test entities.

.. py:data:: TIMEZONE_OFFSET_CODES

   A mapping of timezone offset codes to Timedelta offsets from UTC.

   Note that the FERC 714 instructions state that all hourly demand is to be
   reported in STANDARD time for whatever timezone is being used. Even though many
   respondents use daylight savings / standard time abbreviations, a large majority
   do appear to conform to using a single UTC offset throughout the year. There are
   six instances in which the timezone associated with reporting changed or was
   dropped over time.

.. py:data:: TIMEZONE_CODES

   Mapping between standardized time offset codes and canonical timezones.

.. py:data:: EIA_CODE_FIXES
   :type: dict[Literal['combined', 'csv', 'xbrl'], dict[int | str, int]]

   Overrides of FERC 714 respondent IDs with wrong or missing EIA codes.

   This is used in :meth:`RespondentId.spot_fix_eia_codes`. The dictionary is
   organized by "source" keys ("combined", "csv", or "xbrl"). Each source's value
   is a secondary dictionary with source respondent IDs as keys and fixes for EIA
   codes as values. We separated these fixes by whether they come directly from
   the CSV data, the XBRL data, or the combined data. We use the corresponding
   source or PUDL-derived respondent ID to identify the EIA code to overwrite. We
   could have combined all of these fixes into one set identified by the
   PUDL-derived ``respondent_id_ferc714``, but this way we can do more targeted
   source-based cleaning and test each source's EIA codes before the sources are
   concatenated together.
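   As a rough illustration of how a mapping with this shape gets applied, here is
   a minimal pandas sketch. The respondent IDs, EIA codes, and column names below
   are hypothetical placeholders, not actual entries from
   :py:const:`EIA_CODE_FIXES`:

   .. code-block:: python

      import pandas as pd

      # Hypothetical fixes with the same shape as EIA_CODE_FIXES:
      # source -> {source respondent ID -> corrected EIA code}.
      eia_code_fixes = {
          "csv": {125: 20169},         # placeholder values
          "xbrl": {"C000123": 54321},  # placeholder values
          "combined": {42: 98765},     # placeholder values
      }

      def spot_fix_eia_codes(df: pd.DataFrame, source: str) -> pd.DataFrame:
          """Overwrite eia_code for the respondents listed under this source."""
          id_col = (
              "respondent_id_ferc714"
              if source == "combined"
              else f"respondent_id_ferc714_{source}"
          )
          for respondent_id, eia_code in eia_code_fixes[source].items():
              df.loc[df[id_col] == respondent_id, "eia_code"] = eia_code
          return df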
.. py:data:: RENAME_COLS

.. py:function:: _pre_process_csv(df: pandas.DataFrame, table_name: str) -> pandas.DataFrame

   A simple transform function for processing the CSV raw data.

   * Removes footnote columns ending with ``_f``
   * Drops the ``report_prd``, ``spplmnt_num``, and ``row_num`` columns
   * Excludes records which pertain to bad (test) respondents

.. py:function:: _assign_respondent_id_ferc714(df: pandas.DataFrame, source: Literal['csv', 'xbrl']) -> pandas.DataFrame

   Assign the PUDL-assigned respondent_id_ferc714 based on the native respondent ID.

   We need to replace the natively reported respondent ID from each of the two
   FERC 714 sources with a PUDL-assigned respondent ID. The mapping between the
   native IDs and these PUDL-assigned IDs can be accessed in the database tables
   ``respondents_csv_ferc714`` and ``respondents_xbrl_ferc714``.

   :param df: the input table with the native respondent ID column.
   :param source: the lower-case string name of the source of the FERC 714 data.
                  Either "csv" or "xbrl".

   :returns: an augmented version of the input ``df`` with a new column that
             replaces the natively reported respondent ID with the PUDL-assigned
             respondent ID.

.. py:function:: _filter_for_freshest_data_xbrl(raw_xbrl: pandas.DataFrame, table_name: str, instant_or_duration: Literal['instant', 'duration'])

   Wrapper around filter_for_freshest_data_xbrl.

   Most of the work specific to this wrapper is in converting the core table name
   into the raw instant or duration XBRL table name.

.. py:function:: _fillna_respondent_id_ferc714_source(df: pandas.DataFrame, source: Literal['csv', 'xbrl']) -> pandas.DataFrame

   Fill missing CSV or XBRL respondent IDs.

   The source (CSV or XBRL) tables get assigned a PUDL-derived
   ``respondent_id_ferc714`` ID column (via :func:`_assign_respondent_id_ferc714`).
   After we concatenate the source tables, we sometimes backfill and forward-fill
   the source IDs (``respondent_id_ferc714_csv`` and
   ``respondent_id_ferc714_xbrl``). This way the older records from the CSV years
   will also have the XBRL IDs and vice versa. This enables users to find the full
   timeseries of a respondent given either source ID (instead of using the source
   ID to find the PUDL-derived ID and then finding the records).
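   A minimal sketch of the cross-source ID propagation described above, assuming
   the concatenated table and the column names used elsewhere in this module:

   .. code-block:: python

      import pandas as pd

      def fillna_respondent_id_source(df: pd.DataFrame, source: str) -> pd.DataFrame:
          """Spread each respondent's CSV or XBRL source ID to all of their records.

          Within each PUDL-derived respondent_id_ferc714 group, forward- and
          back-fill the source ID so records from the other source's years
          also carry it.
          """
          col = f"respondent_id_ferc714_{source}"
          df[col] = df.groupby("respondent_id_ferc714")[col].transform(
              lambda ids: ids.ffill().bfill()
          )
          return df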
.. py:function:: assign_report_day(df: pandas.DataFrame, date_col: str) -> pandas.DataFrame

   Add a report_day column.

.. py:class:: RespondentId

   Class for building the :ref:`core_ferc714__respondent_id` asset.

   Most of the methods in this class are staticmethods. The purpose of using a
   class here is mostly to organize the table-specific transforms under the same
   namespace.

   .. py:method:: run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) -> pandas.DataFrame
      :classmethod:

      Build the table for the :ref:`core_ferc714__respondent_id` asset.

      Process and combine the CSV and XBRL based data. There are two main threads
      of transforms happening here:

      * Table compatibility: The CSV raw table is static (it does not even report
        years) while the XBRL table is reported annually. A lot of the downstream
        analysis expects this table to be static. So the first step was to check
        whether the columns that we have in the CSV years had consistent data over
        the few XBRL years that we have. There are a small number of ``eia_code``
        values we needed to clean up, but besides that the data was static. We
        then convert the XBRL data into a static table, concatenate the tables,
        and check the static-ness again via :meth:`ensure_eia_code_uniqueness`.
      * ``eia_code`` cleaning: Clean up FERC 714 respondent names and manually
        assign EIA utility IDs to a few FERC Form 714 respondents that report
        planning area demand but whose corresponding EIA utility IDs are not
        provided by FERC for some reason (including PacifiCorp). Done via
        :meth:`spot_fix_eia_codes` and :py:const:`EIA_CODE_FIXES`.

   .. py:method:: spot_fix_eia_codes(df: pandas.DataFrame, source: Literal['csv', 'xbrl', 'combined']) -> pandas.DataFrame
      :staticmethod:

      Spot fix the eia_codes.

      Using the manually compiled fixes to the ``eia_code`` column stored in
      :py:const:`EIA_CODE_FIXES`, replace the reported values by respondent.

   .. py:method:: ensure_eia_code_uniqueness(df: pandas.DataFrame, source: Literal['csv', 'xbrl', 'combined']) -> pandas.DataFrame
      :staticmethod:

      Ensure there is only one unique eia_code for each respondent.
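      The kind of uniqueness check this method performs might look like the
      following simplified sketch (column names assumed from the surrounding
      docs, error handling reduced to a bare assertion):

      .. code-block:: python

         import pandas as pd

         def ensure_eia_code_uniqueness(df: pd.DataFrame, source: str) -> pd.DataFrame:
             """Raise if any respondent has more than one non-null eia_code."""
             # nunique() ignores nulls, so only real conflicting codes count.
             code_counts = df.groupby("respondent_id_ferc714")["eia_code"].nunique()
             offenders = code_counts[code_counts > 1]
             if not offenders.empty:
                 raise AssertionError(
                     f"{source}: respondents with multiple eia_codes:\n{offenders}"
                 )
             return df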
   .. py:method:: clean_eia_codes_xbrl(xbrl: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Clean up the eia_codes coming from the XBRL data.

      The desired outcomes here are that every respondent has only one non-null
      eia_code, and that all eia_codes which are actually
      ``respondent_id_ferc714_xbrl`` values are nulled.

   .. py:method:: convert_into_static_table_xbrl(xbrl: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Convert this annually reported table into a skinnier, static table.

      The CSV table is entirely static: it doesn't have any reported changes that
      vary over time. The XBRL table does have start and end dates in it. In order
      to merge these two sources, we check whether the shared variables change
      over time and then convert this table into a non-time-varying table.

   .. py:method:: condense_into_one_source_table(df)
      :staticmethod:

      Condense the CSV and XBRL records together into one record.

      Each respondent can have two records in this table, one from each of the two
      sources. This method simply drops duplicates based on the primary keys of
      the table. We know that the names are different in the CSV vs. the XBRL
      source. We are going to grab the XBRL names because they are more recent.

      NOTE: We could have merged the data in :meth:`run` instead of concatenating
      along the index, but we would have had to develop different methods for
      :meth:`ensure_eia_code_uniqueness`.

.. py:function:: core_ferc714__respondent_id(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) -> pandas.DataFrame

   Transform the FERC 714 respondent IDs, names, and EIA utility IDs.

   This is a light wrapper around :class:`RespondentId` because you need to build
   an asset from a function - not a staticmethod of a class.

   :param raw_csv: Raw table describing the FERC 714 respondents from the CSV
                   years.
   :param raw_xbrl_duration: Raw table describing the FERC 714 respondents from
                             the XBRL years.

   :returns: A clean(er) version of the FERC 714 respondents table.

.. py:class:: HourlyPlanningAreaDemand

   Class for building the :ref:`out_ferc714__hourly_planning_area_demand` asset.

   The :ref:`out_ferc714__hourly_planning_area_demand` table is an hourly time
   series of demand by planning area.

   Most of the methods in this class are staticmethods. The purpose of using a
   class here is mostly to organize the table-specific transforms under the same
   namespace.

   .. py:method:: run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame) -> pandas.DataFrame
      :classmethod:

      Build the :ref:`out_ferc714__hourly_planning_area_demand` asset.

      To transform this table we have to process the instant and duration XBRL
      tables so we can merge them together and process the XBRL data. We also have
      to process the CSV data so we can concatenate it with the XBRL data. Then we
      can process all of the data together.

      For both the CSV and XBRL data, the main transforms have to do with cleaning
      the timestamps in the data, resulting in timestamps that are in a datetime
      format and are nearly continuous for every respondent. Once the CSV and XBRL
      data are merged together, the transforms are mostly focused on cleaning the
      timezone codes reported to FERC and then using those codes to convert all of
      the timestamps into UTC datetimes. The outcome is a nearly continuous and
      non-duplicative time series.

   .. py:method:: melt_hourx_columns_csv(df)
      :staticmethod:

      Melt hourX columns into hours.

      There are some instances of the CSVs with a 25th hour. We drop those
      entirely because almost all of them are unusable (0.0 or daily totals), and
      they shouldn't really exist at all based on FERC instructions.

   .. py:method:: parse_date_strings_csv(csv)
      :staticmethod:

      Convert report_date into pandas Datetime types.

      Construct the ``report_date`` column from the daily string ``report_date``
      and the integer ``hour`` column.

   .. py:method:: remove_yearly_records_duration_xbrl(duration_xbrl)
      :staticmethod:

      Convert a table of mostly daily records with some annual ones into fully
      daily records.

      Almost all of the records have a ``start_date`` equal to their ``end_date``,
      which we assume means the record spans the duration of one day. There is a
      small handful of records which seem to span a full year.

   .. py:method:: merge_instant_and_duration_tables_xbrl(instant_xbrl: pandas.DataFrame, duration_xbrl: pandas.DataFrame, table_name: str) -> pandas.DataFrame
      :staticmethod:

      Merge XBRL instant and duration tables, reshaping instant as needed.

      A FERC 714 XBRL instant fact is true as of the reported date, while a
      duration fact pertains to the specified time period. The ``date`` column for
      an instant fact corresponds to the ``end_date`` column of a duration fact.

      :param instant_xbrl: table representing XBRL instant facts.
      :param duration_xbrl: table representing XBRL duration facts.

      :returns: A unified table combining the XBRL duration and instant facts, if
                both types of facts were present. If either input dataframe is
                empty, the other dataframe is returned unchanged, except that
                several unused columns are dropped. If both input dataframes are
                empty, an empty dataframe is returned.

   .. py:method:: convert_dates_to_zero_offset_hours_xbrl(xbrl: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Convert all hours to: Hour (24-hour clock) as a zero-padded decimal number.

      The FERC 714 form includes columns for the hours of each day, labeled 1-24.
      The XBRL filings themselves have a time-like string associated with each of
      the facts, including both a year-month-day portion (formatted as %Y-%m-%d)
      and an hour-minute-second component (semi-formatted as T%H:%M:%S).
      Attempting to simply convert this timestamp information to a datetime using
      the format ``"%Y-%m-%dT%H:%M:%S"`` fails because about a third of the
      records include hour 24, which is not an accepted hour in standard datetime
      formats.

      The respondents that report hour 24 do not report hour 00. We have done some
      spot checking of values reported to FERC and have determined that hour 24
      seems to correspond to hour 00 of the next day. We have not gotten complete
      confirmation from FERC staff that this is always the case, but it seems like
      a decent assumption. This step therefore converts all of the hour 24 records
      to hour 00 of the next day.
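      The hour-24 rollover can be expressed compactly in pandas. This is a
      simplified sketch, assuming the raw timestamp strings live in a hypothetical
      ``date`` column:

      .. code-block:: python

         import pandas as pd

         def convert_hour24_to_next_day(xbrl: pd.DataFrame) -> pd.DataFrame:
             """Rewrite T24 timestamps as T00 of the following day, then parse."""
             is_hour_24 = xbrl["date"].str.contains("T24:", na=False)
             # "2020-01-01T24:00:00" parses once rewritten as "2020-01-01T00:00:00"...
             xbrl["date"] = pd.to_datetime(
                 xbrl["date"].str.replace("T24:", "T00:", regex=False),
                 format="%Y-%m-%dT%H:%M:%S",
             )
             # ...and then rolls forward one day, so hour 24 becomes hour 00 of
             # the next day.
             xbrl.loc[is_hour_24, "date"] += pd.Timedelta(days=1)
             return xbrl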
   .. py:method:: convert_dates_to_zero_seconds_xbrl(xbrl: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Convert last-second-of-the-day records to the first (0) second of the next day.

      There is a small number of records which report the last "hour" of the day
      as the last second of the day, as opposed to T24 (cleaned in
      :meth:`convert_dates_to_zero_offset_hours_xbrl`) or T00, which is standard
      for a datetime. This function finds these records, adds one second to them,
      and then ensures that all of the records have zeros for seconds.

   .. py:method:: spot_fix_records_xbrl(xbrl: pandas.DataFrame)
      :staticmethod:

      Spot fix some specific XBRL records.

   .. py:method:: ensure_dates_are_continuous(df: pandas.DataFrame, source: Literal['csv', 'xbrl'])
      :staticmethod:

      Assert that almost all respondents have continuous timestamps.

      In the XBRL data, we found 41 gaps in the timeseries! They are almost
      entirely in the hour in which daylight savings time goes into effect. The
      CSV data had 10 gaps. Pretty good all in all!

   .. py:method:: standardize_offset_codes(df: pandas.DataFrame, offset_fixes) -> pandas.Series
      :staticmethod:

      Convert to standardized UTC offset abbreviations.

      This function ensures that all of the 3-4 letter abbreviations used to
      indicate a timestamp's localized offset from UTC are standardized, so that
      they can be used to make the timestamps timezone aware. The standard
      abbreviations we're using are:

      * "HST": Hawaii Standard Time
      * "AKST": Alaska Standard Time
      * "AKDT": Alaska Daylight Time
      * "PST": Pacific Standard Time
      * "PDT": Pacific Daylight Time
      * "MST": Mountain Standard Time
      * "MDT": Mountain Daylight Time
      * "CST": Central Standard Time
      * "CDT": Central Daylight Time
      * "EST": Eastern Standard Time
      * "EDT": Eastern Daylight Time

      In some cases different respondents use the same non-standard abbreviations
      to indicate different offsets, so the fixes are applied on a per-respondent
      basis, as defined by ``offset_fixes``.

      :param df: DataFrame containing a ``utc_offset_code`` column that needs to
                 be standardized.
      :param offset_fixes: A dictionary with ``respondent_id_ferc714`` values as
                           the keys, and a dictionary mapping non-standard UTC
                           offset codes to standardized UTC offset codes as the
                           values.

      :returns: Standardized UTC offset codes.

   .. py:method:: clean_utc_code_offsets_and_set_timezone(df)
      :staticmethod:

      Clean UTC codes and set timezones.

   .. py:method:: drop_missing_utc_offset(df)
      :staticmethod:

      Drop records with missing UTC offsets and zero demand.

   .. py:method:: construct_utc_datetime(df: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Construct the datetime_utc column.

   .. py:method:: ensure_non_duplicated_datetimes(df)
      :staticmethod:

      Report and drop duplicated UTC datetimes.

   .. py:method:: spot_fix_values(df: pandas.DataFrame) -> pandas.DataFrame
      :staticmethod:

      Spot fix values.
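   Taken together, the offset-code cleaning and UTC conversion steps above boil
   down to something like the following sketch, where the ``utc_offset`` and
   ``report_date`` column names are assumptions, with ``utc_offset`` holding
   ``pandas.Timedelta`` offsets looked up from the standardized codes (per
   :py:const:`TIMEZONE_OFFSET_CODES`):

   .. code-block:: python

      import pandas as pd

      def construct_utc_datetime(df: pd.DataFrame) -> pd.DataFrame:
          """Shift local standard-time timestamps to UTC.

          Local time equals UTC plus the offset, so UTC is recovered by
          subtracting the offset (e.g. EST is UTC-5, so subtracting a
          -5 hour offset adds 5 hours).
          """
          df["datetime_utc"] = (df["report_date"] - df["utc_offset"]).dt.tz_localize("UTC")
          return df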
.. py:function:: out_ferc714__hourly_planning_area_demand(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame, raw_xbrl_instant: pandas.DataFrame) -> pandas.DataFrame

   Build the :ref:`out_ferc714__hourly_planning_area_demand` asset.

   This is a light wrapper around :class:`HourlyPlanningAreaDemand` because it
   seems you need to build an asset from a function - not a staticmethod of a
   class.

.. py:class:: YearlyPlanningAreaDemandForecast

   Class for building the :ref:`core_ferc714__yearly_planning_area_demand_forecast` asset.

   The :ref:`core_ferc714__yearly_planning_area_demand_forecast` table is an
   annual, forecasted time series of demand by planning area.

   Most of the methods in this class are staticmethods. The purpose of using a
   class here is mostly to organize the table-specific transforms under the same
   namespace.

   .. py:method:: run(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) -> pandas.DataFrame
      :classmethod:

      Build the :ref:`core_ferc714__yearly_planning_area_demand_forecast` asset.

      To transform this table we have to process the CSV data and the XBRL
      duration data (this dataset has no instant table), merge the XBRL and CSV
      data together, and process the combined dataset. The main transforms include
      spot-fixing forecast years with :meth:`spot_fix_forecast_years_xbrl` and
      averaging out duplicate forecast values for duplicate primary key rows in
      the CSV table.

   .. py:method:: spot_fix_forecast_years_xbrl(df)
      :staticmethod:

      Spot fix forecast year errors.

      This function fixes the following errors:

      - There's one record with an NA forecast_year value. This row also has no
        demand forecast values. Because forecast_year is a primary key column we
        can't have any NA values, and because there are no substantive forecasts
        in this row, we can safely remove it.
      - respondent_id_ferc714 number 107 reported their forecast_year values as
        YY instead of YYYY.
      - There's also at least one forecast year value reported as 3033 that should
        be 2033.

      This function also checks that the values for forecast year are within an
      expected range.

   .. py:method:: average_duplicate_pks_csv(df)
      :staticmethod:

      Average forecast values for duplicate primary keys.

      The XBRL data had duplicate primary keys, but it was easy to deduplicate
      them by keeping the rows with the most recent publication_time value. The
      CSVs have no such distinguishing column, despite having some duplicate
      primary keys. This function takes the average of the forecast values for
      rows with duplicate primary keys. There are only 6
      respondent / report_year / forecast_year rows where the forecast values
      differ. One of those is a pair where one forecast value is 0; we take the
      non-zero value there and average out the rest.

.. py:function:: core_ferc714__yearly_planning_area_demand_forecast(raw_csv: pandas.DataFrame, raw_xbrl_duration: pandas.DataFrame) -> pandas.DataFrame

   Build the :ref:`core_ferc714__yearly_planning_area_demand_forecast` asset.

   This is a light wrapper around :class:`YearlyPlanningAreaDemandForecast`
   because it seems you need to build an asset from a function - not a
   staticmethod of a class.

.. py:class:: Ferc714CheckSpec

   Define some simple checks that can run on FERC 714 assets.

   .. py:attribute:: name
      :type: str

   .. py:attribute:: asset
      :type: str

   .. py:attribute:: num_rows_by_report_year
      :type: dict[int, int]

.. py:data:: check_specs

.. py:function:: make_check(spec: Ferc714CheckSpec) -> dagster.AssetChecksDefinition

   Turn the Ferc714CheckSpec into an actual Dagster asset check.

.. py:data:: _checks
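To make the check machinery concrete, here is a hedged sketch of what a factory
like :func:`make_check` might look like using Dagster's ``@asset_check``
decorator. The ``report_date`` column and the metadata layout are assumptions
for illustration, not the actual implementation:

.. code-block:: python

   import dagster
   import pandas as pd

   def make_check(spec) -> dagster.AssetChecksDefinition:
       """Build a row-count asset check from a Ferc714CheckSpec-like object."""

       @dagster.asset_check(asset=spec.asset, name=spec.name)
       def _check(df: pd.DataFrame) -> dagster.AssetCheckResult:
           errors = []
           for year, expected in spec.num_rows_by_report_year.items():
               # Assumes the checked asset has a report_date datetime column.
               found = int((df["report_date"].dt.year == year).sum())
               if found != expected:
                   errors.append(f"{year}: expected {expected} rows, found {found}")
           return dagster.AssetCheckResult(
               passed=not errors, metadata={"errors": errors}
           )

       return _check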