pudl.transform.eia923 ===================== .. py:module:: pudl.transform.eia923 .. autoapi-nested-parse:: Module to perform data cleaning functions on EIA923 data tables. Attributes ---------- .. autoapisummary:: pudl.transform.eia923.logger pudl.transform.eia923.COALMINE_COUNTRY_CODES Functions --------- .. autoapisummary:: pudl.transform.eia923._get_plant_nuclear_unit_id_map pudl.transform.eia923._backfill_nuclear_unit_id pudl.transform.eia923._get_plant_prime_mover_map pudl.transform.eia923._backfill_prime_mover_code pudl.transform.eia923._get_most_frequent_energy_source_map pudl.transform.eia923._clean_gen_fuel_energy_sources pudl.transform.eia923._aggregate_generation_fuel_duplicates pudl.transform.eia923._yearly_to_monthly_records pudl.transform.eia923._coalmine_cleanup pudl.transform.eia923.plants_eia923 pudl.transform.eia923.gen_fuel_nuclear pudl.transform.eia923._core_eia923__pre_generation_fuel pudl.transform.eia923._map_prime_mover_sets pudl.transform.eia923._aggregate_duplicate_boiler_fuel_keys pudl.transform.eia923._core_eia923__boiler_fuel pudl.transform.eia923.remove_duplicate_pks_boiler_fuel_eia923 pudl.transform.eia923._core_eia923__generation pudl.transform.eia923._core_eia923__coalmine pudl.transform.eia923._core_eia923__fuel_receipts_costs pudl.transform.eia923._core_eia923__cooling_system_information pudl.transform.eia923.cooling_system_information_null_check pudl.transform.eia923.cooling_system_information_withdrawal_discrepancy_check pudl.transform.eia923.cooling_system_information_continuity pudl.transform.eia923._core_eia923__fgd_operation_maintenance pudl.transform.eia923.fgd_operation_maintenance_null_check pudl.transform.eia923.fgd_cost_discrepancy_check pudl.transform.eia923.fgd_continuity_check pudl.transform.eia923._core_eia923__energy_storage Module Contents --------------- .. py:data:: logger .. py:data:: COALMINE_COUNTRY_CODES :type: dict[str, str] A mapping of EIA foreign coal mine country codes to 3-letter ISO-3166-1 codes. The EIA-923 lists the US state of origin for coal deliveries using standard 2-letter US state abbreviations. However, foreign countries are also included as "states" in this category and because some of them have 2-letter abbreviation collisions with US states, their coding is non-standard. Instead of using the provided non-standard codes, we convert to the ISO-3166-1 three letter country codes: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3 .. py:function:: _get_plant_nuclear_unit_id_map(nuc_fuel: pandas.DataFrame) -> dict[int, str] Get a plant_id -> nuclear_unit_id mapping for all plants with one nuclear unit. :param nuc_fuel: dataframe of nuclear unit fuels. :returns: one to one mapping of plant_id_eia to nuclear_unit_id. :rtype: plant_to_nuc_id .. py:function:: _backfill_nuclear_unit_id(nuc_fuel: pandas.DataFrame) -> pandas.DataFrame Backfill 2001 and 2002 nuclear_unit_id for plants with one nuclear unit. 2001 and 2002 core_eia923__monthly_generation_fuel records do not include nuclear_unit_id which is required for the primary key of nuclear_unit_fuel_eia923. We backfill this field for plants with one nuclear unit. nuclear_unit_id is filled with 'UNK' if the nuclear_unit_id can't be recovered. Params: nuc_fuel: nuclear fuels dataframe. :returns: nuclear fuels dataframe with backfilled nuclear_unit_id field. :rtype: nuc_fuel .. py:function:: _get_plant_prime_mover_map(gen_fuel: pandas.DataFrame) -> dict[int, str] Get a plant_id -> prime_mover_code mapping for all plants with one prime mover. :param gen_fuel: dataframe of generation fuels. :returns: one to one mapping of plant_id_eia to prime_mover_codes. :rtype: fuel_type_map .. py:function:: _backfill_prime_mover_code(gen_fuel: pandas.DataFrame) -> pandas.DataFrame Backfill 2001 and 2002 prime_mover_code for plants with one prime mover. 2001 and 2002 core_eia923__monthly_generation_fuel records do not include prime_mover_code which is required for the primary key. We backfill this field for plants with one prime mover. prime_mover_code is set to 'UNK' if future plants have multiple prime movers. :param gen_fuel: generation fuels dataframe. :returns: generation fuels dataframe with backfilled prime_mover_code field. :rtype: gen_fuel .. py:function:: _get_most_frequent_energy_source_map(gen_fuel: pandas.DataFrame) -> dict[str, str] Get the a mapping of the most common energy_source for each fuel_type_code_agg. :param gen_fuel: generation_fuel dataframe. :returns: mapping of fuel_type_code_agg to energy_source_code. :rtype: energy_source_map .. py:function:: _clean_gen_fuel_energy_sources(gen_fuel: pandas.DataFrame) -> pandas.DataFrame Clean the generator_fuel_eia923.energy_source_code field specifically. Transformations include: * Remap MSW to biogenic and non biogenic fuel types. * Fill missing energy_source_code using most common code for each AER fuel codes. :param gen_fuel: generation fuels dataframe. :returns: generation fuels dataframe with cleaned energy_source_code field. :rtype: gen_fuel .. py:function:: _aggregate_generation_fuel_duplicates(gen_fuel: pandas.DataFrame, nuclear: bool = False) -> pandas.DataFrame Aggregate remaining duplicate generation fuels. There are a handful of plants (< 100) whose prime_mover_code can't be imputed or duplicates exist in the raw table. We resolve these be aggregate the variable fields. :param gen_fuel: generation fuels dataframe. :param nuclear: adds nuclear_unit_id to list of natural key fields. :returns: generation fuels dataframe without duplicates in natural key fields. :rtype: gen_fuel .. py:function:: _yearly_to_monthly_records(df: pandas.DataFrame) -> pandas.DataFrame Converts an EIA 923 record of 12 months of data into 12 monthly records. Much of the data reported in EIA 923 is monthly, but all 12 months worth of data is reported in a single record, with one field for each of the 12 months. This function converts these annualized composite records into a set of 12 monthly records containing the same information, by parsing the field names for months, and adding a month field. Non - time series data is retained in the same format. :param df: A pandas DataFrame containing the annual data to be converted into monthly records. :returns: A dataframe containing the same data as was passed in via df, but with monthly records as rows instead of as columns. .. py:function:: _coalmine_cleanup(cmi_df: pandas.DataFrame) -> pandas.DataFrame Clean up the core_eia923__entity_coalmine table. This function does most of the core_eia923__entity_coalmine table transformation. It is separate from the coalmine() transform function because of the peculiar way that we are normalizing the ref:`core_eia923__monthly_fuel_receipts_costs` table. All of the coalmine information is originally coming from the EIA fuel_receipts_costs spreadsheet, but it really belongs in its own table. We strip it out of FRC, and create that separate table, but then we need to refer to that table through a foreign key. To do so, we actually merge the entire contents of the coalmine table into FRC, including the surrogate key, and then drop the data fields. For this to work, we need to have exactly the same coalmine data fields in both the new coalmine table, and the FRC table. To ensure that's true, we isolate the transformations here in this function, and apply them to the coalmine columns in both the FRC table and the coalmine table. :param cmi_df: Coal mine information table (e.g. mine name, county, state) :returns: A cleaned DataFrame containing coalmine information. .. py:function:: plants_eia923(eia923_dfs: dict[str, pandas.DataFrame], eia923_transformed_dfs: dict[str, pandas.DataFrame]) -> dict[str, pandas.DataFrame] Transforms the plants_eia923 table. Much of the static plant information is reported repeatedly, and scattered across several different pages of EIA 923. The data frame that this function uses is assembled from those many different pages, and passed in via the same dictionary of dataframes that all the other ingest functions use for uniformity. Transformations include: * Map full spelling onto code values. * Convert Y/N columns to booleans. * Remove excess white space around values. * Drop duplicate rows. :param eia923_dfs: Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA 923 form, as reported in the Excel spreadsheets they distribute. :param eia923_transformed_dfs: A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values). :returns: A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values). .. py:function:: gen_fuel_nuclear(gen_fuel_nuke: pandas.DataFrame) -> pandas.DataFrame Transforms the core_eia923__monthly_generation_fuel_nuclear table. Transformations include: * Backfill nuclear_unit_ids for 2001 and 2002. * Set all prime_mover_codes to 'ST'. * Aggregate remaining duplicate units. :param gen_fuel_nuke: dataframe of nuclear unit fuels. :returns: Transformed nuclear generation fuel table. .. py:function:: _core_eia923__pre_generation_fuel(raw_eia923__generation_fuel: pandas.DataFrame) Transforms the raw_eia923__generation_fuel table. Transformations include: * Remove fields implicated elsewhere. * Replace . values with NA. * Remove rows with utility ids 99999. * Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories. * Combine year and month columns into a single date column. * Clean and impute fuel_type field. * Backfill missing prime_mover_codes * Create a separate generation_fuel_nuclear table. * Aggregate records with duplicate natural keys. :param raw_eia923__generation_fuel: The raw ``raw_eia923__generation_fuel`` dataframe. :returns: Cleaned ``eia923__generation_fuel`` dataframe ready for harvesting. _core_eia923__generation_fuel_nuclear: Cleaned ``eia923__generation_fuel_nuclear`` dataframe ready for harvesting. :rtype: _core_eia923__generation_fuel .. py:function:: _map_prime_mover_sets(prime_mover_set: numpy.ndarray) -> str Map unique prime mover combinations to a single prime mover code. In 2001-2019 data, the .value_counts() of the combinations is: (CA, CT) 750 (ST, CA) 101 (ST) 60 (CA) 17 (CS, ST, CT) 2 :param prime_mover_set: unique combinations of prime_mover_code :type prime_mover_set: np.ndarray :returns: single prime mover code :rtype: str .. py:function:: _aggregate_duplicate_boiler_fuel_keys(boiler_fuel_df: pandas.DataFrame) -> pandas.DataFrame Combine boiler_fuel rows with duplicate keys by aggregating them. core_eia923__monthly_boiler_fuel contains a few records with duplicate keys, mostly caused by CA and CT parts of combined cycle plants being mapped to the same boiler ID. This is most likely a data entry error. See GitHub issue #852 One solution (implemented here) is to simply aggregate those records together. This is cheap and easy compared to the more thorough solution of making surrogate boiler IDs. Aggregation was preferred to purity due to the low volume of affected records (4.5% of combined cycle plants). :param boiler_fuel_df: the boiler_fuel dataframe :returns: A copy of boiler_fuel dataframe with duplicates removed and aggregates appended. .. py:function:: _core_eia923__boiler_fuel(raw_eia923__boiler_fuel: pandas.DataFrame) -> pandas.DataFrame Transforms the core_eia923__monthly_boiler_fuel table. Transformations include: * Remove fields implicated elsewhere. * Drop values with plant and boiler id values of NA. * Replace . values with NA. * Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories. * Combine year and month columns into a single date column. * Drop duplicate rows with NA or 0 in all value columns. Eventually we should truncate this table by the last year-month that was integrated. Right now all months get integrated for a given year, regardless of whether there's data for them. :param raw_eia923__boiler_fuel: The raw ``raw_eia923__boiler_fuel`` dataframe. :returns: Cleaned ``core_eia923__monthly_boiler_fuel`` dataframe ready for harvesting. .. py:function:: remove_duplicate_pks_boiler_fuel_eia923(bf: pandas.DataFrame) -> pandas.DataFrame Deduplicate on primary keys for :ref:`core_eia923__monthly_boiler_fuel`. There are a relatively small number of records ~5% from the boiler fuel table that have duplicate records based on what we believe is this table's primary keys. Fortunately, all of these duplicates have at least one records w/ only zeros and or nulls. So this method drops only the records which have duplicate pks and only have zeros or nulls in the non-primary key columns. Note: There are 4 boilers in 2021 that are being dropped entirely during this cleaning. They have BOTH duplicate pks and only have zeros or nulls in the non-primary key columns. We could choose to preserve all instances of the pks even after :func:`drop_invalid_rows` or only dropping one when there are two. We chose to leave this be because it was minor and these boilers show up in other years. See `comment `_ for more details. .. py:function:: _core_eia923__generation(raw_eia923__generator: pandas.DataFrame) -> pandas.DataFrame Transforms the EIA 923 generation table. Transformations include: * Drop rows with NA for generator id. * Remove fields implicated elsewhere. * Replace . values with NA. * Drop generator-date row duplicates (all have no data). :param raw_eia923__generator: The raw ``raw_eia923__generator`` dataframe. :returns: Cleaned ``_core_eia923__generation`` dataframe ready for harvesting. .. py:function:: _core_eia923__coalmine(raw_eia923__fuel_receipts_costs: pandas.DataFrame) -> pandas.DataFrame Transforms the raw_eia923__fuel_receipts_costs table. Transformations include: * Remove fields implicated elsewhere. * Drop duplicates with MSHA ID. :param raw_eia923__fuel_receipts_costs: raw precursor to the :ref:`core_eia923__monthly_fuel_receipts_costs` table. :returns: Cleaned ``_core_eia923__coalmine`` dataframe ready for harvesting. .. py:function:: _core_eia923__fuel_receipts_costs(raw_eia923__fuel_receipts_costs: pandas.DataFrame, _core_eia923__coalmine: pandas.DataFrame) -> pandas.DataFrame Transforms the eia923__fuel_receipts_costs dataframe. Transformations include: * Remove fields implicated elsewhere. * Replace . values with NA. * Standardize codes values. * Fix dates. * Replace invalid mercury content values with NA. Fuel cost is reported in cents per mmbtu. Converts cents to dollars. :param raw_eia923__fuel_receipts_costs: The raw ``raw_eia923__fuel_receipts_costs`` dataframe. :param _core_eia923__coalmine: The cleaned pre-harvest EIA 923 coal mine dataframe. :returns: Cleaned ``eia923__fuel_receipts_costs`` dataframe ready for harvesting. .. py:function:: _core_eia923__cooling_system_information(raw_eia923__cooling_system_information: pandas.DataFrame) -> pandas.DataFrame Transforms the eia923__cooling_system_information dataframe. Applies typical NA conversion and date conversion. As of 2024-02-28: 2008 and 2009 only have annual rates, but the "_rate_gallons_per_minute" values are otherwise monthly; we leave them NA for 2008-2009 to avoid confusion. As of 2024-02-28: In 2008 and 2009 the rate columns are labeled as "0.1 cubic feet per second"; In 2013 they change it to gallons/min. If taken to mean that the earlier unit is literally deci-cubic-feet per second, we find that all these rates jump 10x when we hit the 2013 data; so we interpret "0.1 cubic feet per second" to mean "cubic feet per second, with precision of 0.1 cfs." .. py:function:: cooling_system_information_null_check(csi) We do not expect any columns to be completely null. In fast-ETL context (only recent years), the annual columns may also be completely null. .. py:function:: cooling_system_information_withdrawal_discrepancy_check(csi) Withdrawal should be equal to discharge + consumption. To allow for *some* data quality errors we assert that withdrawal ~= discharge + consumption at least 90% of the time (with a 1% acceptable discrepancy). .. py:function:: cooling_system_information_continuity(csi) Check to see if columns vary as slowly as expected. .. py:function:: _core_eia923__fgd_operation_maintenance(raw_eia923__fgd_operation_maintenance: pandas.DataFrame) -> pandas.DataFrame Transforms the _core_eia923__fgd_operation_maintenance table. Transformations include: * Drop values with plant and boiler id values of NA. * Replace . values with NA. * Convert thousands of dollars to dollars. * Fix datetimes for SO2 test dates. * Ensure a unique primary key and drop some duplicated rows. :param raw_eia923__fgd_operation_maintenance: The raw ``raw_eia923__fgd_operation_maintenance`` dataframe. :returns: Cleaned ``_core_eia923__fgd_operation_maintenance`` dataframe ready for harvesting. .. py:function:: fgd_operation_maintenance_null_check(fgd) Check that columns other than expected columns aren't null. .. py:function:: fgd_cost_discrepancy_check(fgd) Opex costs should sum to opex_fgd_total_cost. To allow for *some* data quality errors we assert that costs ~= total cost at least 99% of the time (with a 1% acceptable discrepancy). .. py:function:: fgd_continuity_check(fgd) Check to see if columns vary as slowly as expected. .. py:function:: _core_eia923__energy_storage(raw_eia923__energy_storage: pandas.DataFrame) -> pandas.DataFrame Transforms the eia923_energy_storage table. Transformations include: * Replace . values with NA. * Clean up ``fuel_unit`` strings. * Make wide monthly columns into tall monthly columns. * Convert date to month. * Encode relevant columns. Other cleaning that could be done: * Come up with an encoder for ``fuel_unit`` (tricky because different between FERC and EIA).