pudl.helpers#

General utility functions that are used in a variety of contexts.

The functions in this module are used in various stages of the ETL and post-ETL processes. They are usually, but not always, dataset agnostic. If a function is designed to be used as a general purpose tool, applicable in multiple scenarios, it should probably live here. There are lots of transform-type functions in here that help with cleaning and restructuring dataframes.

Module Contents#

Functions#

label_map(→ collections.defaultdict[str, str | NAType])

Build a mapping dictionary from two columns of a labeling / coding dataframe.

find_new_ferc1_strings(→ set[str])

Identify as-of-yet uncategorized freeform strings in FERC Form 1.

find_foreign_key_errors(→ list[dict[str, Any]])

Report foreign key violations from a dictionary of dataframes.

download_zip_url(url, save_path[, chunk_size])

Download and save a Zipfile locally.

add_fips_ids(df[, state_col, county_col, vintage])

Add State and County FIPS IDs to a dataframe.

clean_eia_counties(df, fixes[, state_col, county_col])

Replace non-standard county names with county names from US Census.

oob_to_nan(df, cols[, lb, ub])

Set non-numeric values and those outside of a given range to NaN.

prep_dir(dir_path[, clobber])

Create (or delete and recreate) a directory.

is_doi(doi)

Determine if a string is a valid digital object identifier (DOI).

convert_col_to_datetime(df, date_col_name)

Convert a column in a dataframe to a datetime.

full_timeseries_date_merge(left, right, on[, ...])

Merge dataframes with different date frequencies and expand to a full timeseries.

_add_suffix_to_date_on(date_on)

Check date_on list is valid and add _temp_for_merge suffix.

date_merge(→ pandas.DataFrame)

Merge two dataframes that have different report date frequencies.

expand_timeseries(→ pandas.DataFrame)

Expand a dataframe to include a full time series at a given frequency.

organize_cols(df, cols)

Organize columns into key ID & name fields & alphabetical data columns.

simplify_strings(df, columns)

Simplify the strings contained in a set of dataframe columns.

cleanstrings_series(col, str_map[, unmapped, simplify])

Clean up the strings in a single column/Series.

cleanstrings(df, columns, stringmaps[, unmapped, simplify])

Consolidate freeform strings in several dataframe columns.

fix_int_na(df, columns[, float_na, int_na, str_na])

Convert NA containing integer columns from float to string.

month_year_to_date(df)

Convert all pairs of year/month fields in a dataframe into Date fields.

remove_leading_zeros_from_numeric_strings(...)

Remove leading zeros from column values that are numeric strings.

convert_to_date(df[, date_col, year_col, month_col, ...])

Convert specified year, month or day columns into a datetime object.

fix_eia_na(df)

Replace common ill-posed EIA NA spreadsheet values with np.nan.

simplify_columns(df)

Simplify column labels for use as snake_case database fields.

drop_tables(engine[, clobber])

Drops all tables from a SQLite database.

merge_dicts(list_of_dicts)

Merge multiple dictionaries together.

convert_cols_dtypes(→ pandas.DataFrame)

Convert a PUDL dataframe's columns to the correct data type.

generate_rolling_avg(df, group_cols, data_col, window, ...)

Generate a rolling average.

fillna_w_rolling_avg(df_og, group_cols, data_col[, window])

Fill NaNs with a rolling average.

count_records(df, cols, new_count_col_name)

Count the number of unique records in each group in a dataframe.

cleanstrings_snake(df, cols)

Clean the strings in columns of a dataframe with snake case.

zero_pad_numeric_string(→ pandas.Series)

Clean up fixed-width leading zero padded numeric (e.g. ZIP, FIPS) codes.

iterate_multivalue_dict(**kwargs)

Make dicts from dict with main dict key and one value of main dict.

get_working_eia_dates()

Get all working EIA dates as a DatetimeIndex.

dedupe_on_category(→ pandas.DataFrame)

Deduplicate a df using a sorted category to retain preferred values.

calc_capacity_factor(df, freq[, min_cap_fact, ...])

Calculate capacity factor.

weighted_average(df, data_col, weight_col, by)

Generate a weighted average.

sum_and_weighted_average_agg(→ pandas.DataFrame)

Aggregate dataframe by summing and using weighted averages.

get_eia_ferc_acct_map()

Get map of EIA technology_description/pm codes <> ferc accounts.

dedupe_n_flatten_list_of_lists(mega_list)

Flatten a list of lists and remove duplicates.

convert_df_to_excel_file(→ pandas.ExcelFile)

Converts a pandas dataframe to a pandas ExcelFile object.

Attributes#

sum_na

A sum function that returns NA if the Series includes any NA values.

logger

pudl.helpers.sum_na[source]#

A sum function that returns NA if the Series includes any NA values.

In many of our aggregations we need to override the default behavior of treating NA values as if they were zero. E.g. when calculating the heat rates of generation units, if there are some months where fuel consumption is reported as NA, but electricity generation is reported normally, then the fuel consumption for the year needs to be NA, otherwise we’ll get unrealistic heat rates.
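
A minimal sketch of how such an NA-preserving sum can be defined with pandas (the actual module-level definition may differ):

```python
from functools import partial

import pandas as pd

# Disable skipna so any NA in the Series propagates into the result
# instead of being treated as zero.
sum_na = partial(pd.Series.sum, skipna=False)

fuel_mmbtu = pd.Series([10.0, None, 12.0])
print(fuel_mmbtu.sum())    # 22.0 -- default sum treats NA as zero
print(sum_na(fuel_mmbtu))  # nan  -- NA propagates, as needed for heat rates
```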

pudl.helpers.logger[source]#
pudl.helpers.label_map(df: pandas.DataFrame, from_col: str = 'code', to_col: str = 'label', null_value: str | NAType = pd.NA) → collections.defaultdict[str, str | NAType][source]#

Build a mapping dictionary from two columns of a labeling / coding dataframe.

These dataframes document the meanings of the codes that show up in much of the originally reported data. They’re defined in pudl.metadata.codes. This function is mostly used to build maps that can translate the hard-to-understand short codes into longer human-readable labels.

Parameters:
  • df – The coding / labeling dataframe. Must contain columns from_col and to_col.

  • from_col – Label of column containing the existing codes to be replaced.

  • to_col – Label of column containing the new codes to be swapped in.

  • null_value – Default (null) value to map to when a value which doesn’t appear in from_col is encountered.

Returns:

A mapping dictionary suitable for use with pandas.Series.map().
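
A usage sketch with a made-up coding dataframe (the codes and labels here are hypothetical):

```python
import pandas as pd

from pudl.helpers import label_map

# Hypothetical coding dataframe in the style of pudl.metadata.codes:
codes = pd.DataFrame({
    "code": ["NG", "BIT", "SUB"],
    "label": ["natural_gas", "bituminous_coal", "subbituminous_coal"],
})

mapper = label_map(codes, from_col="code", to_col="label")
fuels = pd.Series(["NG", "BIT", "???"])
print(fuels.map(mapper))
# Codes not present in from_col map to null_value (pd.NA by default),
# because the returned defaultdict supplies it for missing keys.
```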

pudl.helpers.find_new_ferc1_strings(table: str, field: str, strdict: dict[str, list[str]], ferc1_engine: sqlalchemy.engine.Engine) → set[str][source]#

Identify as-of-yet uncategorized freeform strings in FERC Form 1.

Parameters:
  • table – Name of the FERC Form 1 DB table to search.

  • field – Name of the column in that table to search.

  • strdict – A string cleaning dictionary. See e.g. pudl.transform.ferc1.FUEL_UNIT_STRINGS

  • ferc1_engine – SQL Alchemy DB connection engine for the FERC Form 1 DB.

Returns:

Any string found in the searched table + field that was not part of any of the categories enumerated in strdict.

pudl.helpers.find_foreign_key_errors(dfs: dict[str, pandas.DataFrame]) → list[dict[str, Any]][source]#

Report foreign key violations from a dictionary of dataframes.

The database schema to check against is generated based on the names of the dataframes (keys of the dictionary) and the PUDL metadata structures.

Parameters:

dfs – Keys are table names, and values are dataframes ready for loading into the SQLite database.

Returns:

A list of dictionaries, each one pertains to a single database table in which a foreign key constraint violation was found, and it includes the table name, foreign key definition, and the elements of the dataframe that violated the foreign key constraint.

pudl.helpers.download_zip_url(url, save_path, chunk_size=128)[source]#

Download and save a Zipfile locally.

Useful for acquiring and storing non-PUDL data locally.

Parameters:
  • url (str) – The URL from which to download the Zipfile

  • save_path (pathlib.Path) – The location to save the file.

  • chunk_size (int) – Data chunk in bytes to use while downloading.

Returns:

None

pudl.helpers.add_fips_ids(df, state_col='state', county_col='county', vintage=2015)[source]#

Add State and County FIPS IDs to a dataframe.

To add just the State FIPS IDs, set county_col = None.

pudl.helpers.clean_eia_counties(df, fixes, state_col='state', county_col='county')[source]#

Replace non-standard county names with county names from US Census.

pudl.helpers.oob_to_nan(df, cols, lb=None, ub=None)[source]#

Set non-numeric values and those outside of a given range to NaN.

Parameters:
  • df (pandas.DataFrame) – The dataframe containing values to be altered.

  • cols (iterable) – Labels of the columns whose values are to be changed.

  • lb (number) – Lower bound, below which values are set to NaN. If None, don’t use a lower bound.

  • ub (number) – Upper bound, above which values are set to NaN. If None, don’t use an upper bound.

Returns:

The altered DataFrame.

Return type:

pandas.DataFrame
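
A usage sketch with a hypothetical column and made-up bounds:

```python
import pandas as pd

from pudl.helpers import oob_to_nan

df = pd.DataFrame({"heat_content_mmbtu_per_unit": [0.5, 1.2, -3.0, "oops", 99.0]})
# Null out anything non-numeric, below 0, or above 50:
df = oob_to_nan(df, cols=["heat_content_mmbtu_per_unit"], lb=0, ub=50)
print(df)
```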

pudl.helpers.prep_dir(dir_path, clobber=False)[source]#

Create (or delete and recreate) a directory.

Parameters:
  • dir_path (path-like) – path to the directory that you are trying to clean and prepare.

  • clobber (bool) – If True and dir_path exists, it will be removed and replaced with a new, empty directory.

Raises:

FileExistsError – if a file or directory already exists at dir_path.

Returns:

Path to the created directory.

Return type:

pathlib.Path

pudl.helpers.is_doi(doi)[source]#

Determine if a string is a valid digital object identifier (DOI).

Function simply checks whether the offered string matches a regular expression – it doesn’t check whether the DOI is actually registered with the relevant authority.

Parameters:

doi (str) – String to validate.

Returns:

True if doi matches the regex for valid DOIs, False otherwise.

Return type:

bool

pudl.helpers.convert_col_to_datetime(df, date_col_name)[source]#

Convert a column in a dataframe to a datetime.

If the column isn’t a datetime, it needs to be converted to a string type first so that integer years are formatted correctly.

Parameters:
  • df (pandas.DataFrame) – Dataframe with column to convert.

  • date_col_name (string) – name of the column to convert.

Returns:

Dataframe with the converted datetime column.

pudl.helpers.full_timeseries_date_merge(left: pandas.DataFrame, right: pandas.DataFrame, on: list[str], left_date_col: str = 'report_date', right_date_col: str = 'report_date', new_date_col: str = 'report_date', date_on: list[str] = ['year'], how: Literal['inner', 'outer', 'left', 'right', 'cross'] = 'inner', report_at_start: bool = True, freq: str = 'MS', **kwargs)[source]#

Merge dataframes with different date frequencies and expand to a full timeseries.

Arguments: see arguments for date_merge and expand_timeseries

pudl.helpers._add_suffix_to_date_on(date_on)[source]#

Check date_on list is valid and add _temp_for_merge suffix.

pudl.helpers.date_merge(left: pandas.DataFrame, right: pandas.DataFrame, on: list[str], left_date_col: str = 'report_date', right_date_col: str = 'report_date', new_date_col: str = 'report_date', date_on: list[str] = None, how: Literal['inner', 'outer', 'left', 'right', 'cross'] = 'inner', report_at_start: bool = True, **kwargs) → pandas.DataFrame[source]#

Merge two dataframes that have different report date frequencies.

We often need to bring together data that is reported at different temporal granularities, e.g. on a monthly versus an annual basis. This function acts as a wrapper around a pandas merge to allow merging at different temporal granularities. The date columns of both dataframes are separated into year, quarter, month, and day columns. Then the dataframes are merged according to how, on the columns specified by the on and date_on arguments, which list the new temporal columns to merge on as well as any additional shared columns. Finally, the datetime column is reconstructed in the output dataframe and named according to the new_date_col parameter.

Parameters:
  • left – The left dataframe in the merge. Typically monthly in our use cases if doing a left merge, e.g. generation_eia923. Must contain the columns specified by the left_date_col and on arguments.

  • right – The right dataframe in the merge. Typically annual in our use cases if doing a left merge, e.g. generators_eia860. Must contain the columns specified by the right_date_col and on arguments.

  • on – The columns to merge on that are shared between both dataframes. Typically ID columns like plant_id_eia, generator_id or boiler_id.

  • left_date_col – Column in left containing datetime like data. Default is report_date. Must be a Datetime or convertible to a Datetime using pandas.to_datetime()

  • right_date_col – Column in right containing datetime like data. Default is report_date. Must be a Datetime or convertible to a Datetime using pandas.to_datetime().

  • new_date_col – Name of the reconstructed datetime column in the output dataframe.

  • date_on – The temporal columns to merge on. Values in this list must be drawn from [year, quarter, month, day]. E.g. if a monthly reported dataframe is being merged onto a daily reported dataframe, then the merge would be performed on ["year", "month"]. If one of these temporal columns already exists in the dataframe it will not be clobbered by the merge, as the suffix “_temp_for_merge” is added when expanding the datetime column into year, quarter, month, and day. By default, date_on will just include year.

  • how – How the dataframes should be merged. See pandas.DataFrame.merge().

  • report_at_start – Whether the data in the dataframe whose report date is not being kept in the merged output (in most cases the less frequently reported dataframe) is reported at the start or end of the time period e.g. January 1st for annual data.

  • kwargs – Additional arguments to pass to pandas.DataFrame.merge().

Returns:

Merged contents of left and right input dataframes.

Raises:
  • ValueError – if left_date_col or right_date_col columns are missing from their respective input dataframes.

  • ValueError – if any of the labels referenced in on are missing from either the left or right dataframes.
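
A usage sketch, merging hypothetical annual generator attributes onto monthly generation records (all data made up):

```python
import pandas as pd

from pudl.helpers import date_merge

gen_monthly = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "generator_id": ["a", "a"],
    "report_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "net_generation_mwh": [100.0, 120.0],
})
gens_annual = pd.DataFrame({
    "plant_id_eia": [1],
    "generator_id": ["a"],
    "report_date": pd.to_datetime(["2020-01-01"]),
    "capacity_mw": [50.0],
})

# Merge on year so the annual capacity_mw value is broadcast onto
# every monthly record from the same report year:
out = date_merge(
    left=gen_monthly,
    right=gens_annual,
    on=["plant_id_eia", "generator_id"],
    date_on=["year"],
    how="left",
)
```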

pudl.helpers.expand_timeseries(df: pandas.DataFrame, key_cols: list[str], date_col: str = 'report_date', freq: str = 'MS', fill_through_freq: Literal['year', 'month', 'day'] = 'year') → pandas.DataFrame[source]#

Expand a dataframe to include a full time series at a given frequency.

This function adds a full timeseries to the given dataframe for each group of columns specified by key_cols. The data in the timeseries is filled in with the most recent preceding chronological observation for each group of primary key columns specified by key_cols.

Parameters:
  • df – The dataframe to expand. Must have date_col in columns.

  • key_cols – Column names of the non-date primary key columns in the dataframe. The resulting dataframe will have a full timeseries expanded for each unique group of these ID columns that are present in the dataframe.

  • date_col – Name of the datetime column being expanded into a full timeseries.

  • freq – The frequency of the time series to expand the data to. See the pandas documentation on offset aliases for a list of valid frequency strings.

  • fill_through_freq – The frequency in which to fill in the data through. For example, if equal to “year” the data will be filled in through the end of the last reported year for each grouping of key_cols. Valid frequencies are only year, month, or day.
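
A usage sketch with made-up annual data expanded to month-start frequency:

```python
import pandas as pd

from pudl.helpers import expand_timeseries

df = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "report_date": pd.to_datetime(["2020-01-01", "2021-01-01"]),
    "capacity_mw": [50.0, 55.0],
})

# Each month between observations carries the most recent annual value,
# filled through the end of the last reported year by default:
monthly = expand_timeseries(df, key_cols=["plant_id_eia"], freq="MS")
```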

pudl.helpers.organize_cols(df, cols)[source]#

Organize columns into key ID & name fields & alphabetical data columns.

For readability, it’s nice to group a few key columns at the beginning of the dataframe (e.g. report_year or report_date, plant_id…) and then put all the rest of the data columns in alphabetical order.

Parameters:
  • df – The DataFrame to be re-organized.

  • cols – The columns to put first, in their desired output ordering.

Returns:

A dataframe with the same columns as the input DataFrame df, but with cols first, in the same order as they were passed in, and the remaining columns sorted alphabetically.

Return type:

pandas.DataFrame

pudl.helpers.simplify_strings(df, columns)[source]#

Simplify the strings contained in a set of dataframe columns.

Performs several operations to simplify strings for comparison and parsing purposes. These include removing Unicode control characters, stripping leading and trailing whitespace, using lowercase characters, and compacting all internal whitespace to a single space.

Leaves null values unaltered. Casts other values with astype(str).

Parameters:
  • df (pandas.DataFrame) – DataFrame whose columns are being cleaned up.

  • columns (iterable) – The labels of the string columns to be simplified.

Returns:

The whole DataFrame that was passed in, with the string columns cleaned up.

Return type:

pandas.DataFrame
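
A usage sketch (the column name is hypothetical):

```python
import pandas as pd

from pudl.helpers import simplify_strings

df = pd.DataFrame({"plant_name": ["  Big   Creek\tPLANT ", None]})
df = simplify_strings(df, columns=["plant_name"])
# -> "big creek plant" and an unaltered null value.
```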

pudl.helpers.cleanstrings_series(col, str_map, unmapped=None, simplify=True)[source]#

Clean up the strings in a single column/Series.

Parameters:
  • col (pandas.Series) – A pandas Series, typically a single column of a dataframe, containing the freeform strings that are to be cleaned.

  • str_map (dict) – A dictionary of lists of strings, in which the keys are the simplified canonical strings, with which each string found in the corresponding list will be replaced.

  • unmapped (str) – A value with which to replace any string found in col that is not found in one of the lists of strings in str_map. Typically the null string ‘’. If None, these strings will not be replaced.

  • simplify (bool) – If True, strip and compact whitespace, and lowercase all strings in both the list of values to be replaced, and the values found in col. This can reduce the number of strings that need to be kept track of.

Returns:

The cleaned up Series / column, suitable for replacing the original messy column in a pandas.DataFrame.

Return type:

pandas.Series
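
A usage sketch with a made-up string cleaning dictionary in the style of pudl.transform.ferc1.FUEL_UNIT_STRINGS:

```python
import pandas as pd

from pudl.helpers import cleanstrings_series

fuel_map = {
    "coal": ["coal", "bit coal", "coal-bituminous"],
    "gas": ["gas", "natural gas", "ng"],
}
col = pd.Series(["  Natural  GAS ", "bit coal", "mystery fuel"])
cleaned = cleanstrings_series(col, fuel_map, unmapped="")
# -> "gas", "coal", "" (strings not in any list become the null string,
# after whitespace compaction and lowercasing via simplify=True).
```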

pudl.helpers.cleanstrings(df, columns, stringmaps, unmapped=None, simplify=True)[source]#

Consolidate freeform strings in several dataframe columns.

This function will consolidate freeform strings found in columns into simplified categories, as defined by stringmaps. This is useful when a field contains many different strings that are really meant to represent a finite number of categories, e.g. a type of fuel. It can also be used to create simplified categories that apply to similar attributes that are reported in various data sources from different agencies that use their own taxonomies.

The function takes and returns a pandas.DataFrame, making it suitable for use with the pandas.DataFrame.pipe() method in a chain.

Parameters:
  • df (pandas.DataFrame) – the DataFrame containing the string columns to be cleaned up.

  • columns (list) – a list of string column labels found in the column index of df. These are the columns that will be cleaned.

  • stringmaps (list) – a list of dictionaries. The keys of these dictionaries are strings, and the values are lists of strings. Each dictionary in the list corresponds to a column in columns. The keys of the dictionaries are the values with which every string in the list of values will be replaced.

  • unmapped (str, None) – the value with which strings not found in the stringmap dictionary will be replaced. Typically the null string ‘’. If None, then strings found in the columns but not in the stringmap will be left unchanged.

  • simplify (bool) – If true, strip whitespace, remove duplicate whitespace, and force lower-case on both the string map and the values found in the columns to be cleaned. This can reduce the overall number of string values that need to be tracked.

Returns:

The function returns a new DataFrame containing the cleaned strings.

Return type:

pandas.DataFrame

pudl.helpers.fix_int_na(df, columns, float_na=np.nan, int_na=-1, str_na='')[source]#

Convert NA containing integer columns from float to string.

Numpy doesn’t have a real NA value for integers. When pandas stores integer data which has NA values, it thus upcasts integers to floating point values, using np.nan values for NA. However, in order to dump some of our dataframes to CSV files for use in data packages, we need to write out integer formatted numbers, with empty strings as the NA value. This function replaces np.nan values with a sentinel value, converts the column to integers, and then to strings, finally replacing the sentinel value with the desired NA string.

This is an interim solution – now that pandas extension arrays have been implemented, we need to go back through and convert all of these integer columns that contain NA values to Nullable Integer types like Int64.

Parameters:
  • df (pandas.DataFrame) – The dataframe to be fixed. This argument allows method chaining with the pipe() method.

  • columns (iterable of strings) – A list of DataFrame column labels indicating which columns need to be reformatted for output.

  • float_na (float) – The floating point value to be interpreted as NA and replaced in col.

  • int_na (int) – Sentinel value to substitute for float_na prior to conversion of the column to integers.

  • str_na (str) – String value to substitute for int_na after the column has been converted to strings.

Returns:

a new DataFrame, with the selected columns converted to strings that look like integers, compatible with the postgresql COPY FROM command.

Return type:

df (pandas.DataFrame)
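
A usage sketch showing the float-to-integer-string round trip (the column name is made up):

```python
import numpy as np
import pandas as pd

from pudl.helpers import fix_int_na

df = pd.DataFrame({"utility_id_ferc1": [1.0, np.nan, 23.0]})
out = df.pipe(fix_int_na, columns=["utility_id_ferc1"])
# -> the strings "1", "", "23": integer-looking output with "" for NA.
```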

pudl.helpers.month_year_to_date(df)[source]#

Convert all pairs of year/month fields in a dataframe into Date fields.

This function finds all column names within a dataframe that match the regular expressions ‘_month$’ and ‘_year$’, and looks for pairs that have identical prefixes before the underscore. These fields are assumed to describe a date, accurate to the month. The two fields are used to construct a new _date column (having the same prefix) and the month/year columns are then dropped.

Todo

This function needs to be combined with convert_to_date, and improved:

  • find and use a _day$ column as well

  • allow specification of default month & day values, if none are found

  • allow specification of lists of year, month, and day columns to be combined, rather than automatically finding all the matching ones

  • Do the Right Thing when invalid or NA values are encountered.

Parameters:

df (pandas.DataFrame) – The DataFrame in which to convert year/month fields to Date fields.

Returns:

A DataFrame in which the year/month fields have been converted into Date fields.

Return type:

pandas.DataFrame

pudl.helpers.remove_leading_zeros_from_numeric_strings(df: pandas.DataFrame, col_name: str) → pandas.DataFrame[source]#

Remove leading zeros from column values that are numeric strings.

Sometimes an ID column (like generator_id or unit_id) will be reported with leading zeros and sometimes it won’t. For example, in the Excel spreadsheets published by EIA, the same generator may show up with the ID “0001” and “1” in different years. This function strips the leading zeros from those numeric strings so the data can be mapped across years and datasets more reliably.

Alphanumeric generator IDs with leading zeroes are not affected, as we found no instances in which an alphanumeric ID appeared both with and without leading zeroes. The ID “0A1” will stay “0A1”.

Parameters:
  • df – A DataFrame containing the column you’d like to remove numeric leading zeros from.

  • col_name – The name of the column you’d like to remove numeric leading zeros from.

Returns:

A DataFrame without leading zeros for numeric string values in the desired column.
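
A usage sketch:

```python
import pandas as pd

from pudl.helpers import remove_leading_zeros_from_numeric_strings

df = pd.DataFrame({"generator_id": ["0001", "1", "0A1"]})
df = remove_leading_zeros_from_numeric_strings(df, col_name="generator_id")
# -> "1", "1", "0A1": only purely numeric strings lose their leading zeros.
```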

pudl.helpers.convert_to_date(df, date_col='report_date', year_col='report_year', month_col='report_month', day_col='report_day', month_value=1, day_value=1)[source]#

Convert specified year, month or day columns into a datetime object.

If the input date_col already exists in the input dataframe, then no conversion is applied, and the original dataframe is returned unchanged. Otherwise the constructed date is placed in that column, and the columns which were used to create the date are dropped.

Parameters:
  • df (pandas.DataFrame) – dataframe to convert

  • date_col (str) – the name of the column you want in the output.

  • year_col (str) – the name of the year column in the original table.

  • month_col (str) – the name of the month column in the original table.

  • day_col (str) – the name of the day column in the original table.

  • month_value (int) – generated month if no month exists.

  • day_value (int) – generated day if no day exists.

Returns:

A DataFrame in which the year, month, day columns values have been converted into datetime objects.

Return type:

pandas.DataFrame

Todo

Update docstring.

pudl.helpers.fix_eia_na(df)[source]#

Replace common ill-posed EIA NA spreadsheet values with np.nan.

Currently replaces the empty string, a lone decimal point with no digits, and any single whitespace character with np.nan.

Parameters:

df (pandas.DataFrame) – The DataFrame to clean.

Returns:

The cleaned DataFrame.

Return type:

pandas.DataFrame

pudl.helpers.simplify_columns(df)[source]#

Simplify column labels for use as snake_case database fields.

All columns will be re-labeled by:

  • Replacing all non-alphanumeric characters with spaces.

  • Forcing all letters to be lower case.

  • Compacting internal whitespace to a single space.

  • Stripping leading and trailing whitespace.

  • Replacing all remaining whitespace with underscores.

Parameters:

df (pandas.DataFrame) – The DataFrame to clean.

Returns:

The cleaned DataFrame.

Return type:

pandas.DataFrame

Todo

Update docstring.
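
A usage sketch:

```python
import pandas as pd

from pudl.helpers import simplify_columns

df = pd.DataFrame(columns=["Plant Name ", "Net Generation (MWh)"])
df = simplify_columns(df)
# -> columns are relabeled "plant_name" and "net_generation_mwh".
```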

pudl.helpers.drop_tables(engine: sqlalchemy.engine.Engine, clobber: bool = False)[source]#

Drops all tables from a SQLite database.

Creates an sa.schema.MetaData object reflecting the structure of the database that the passed-in engine refers to, and uses that schema to drop all existing tables.

Todo

Treat DB connection as a context manager (with/as).

Parameters:
  • engine – An SQL Alchemy SQLite database Engine pointing at an existing SQLite database whose tables are to be dropped.

  • clobber – Whether or not to allow a non-empty DB to be removed.

Raises:

AssertionError – if clobber is False and there are any tables in the database.

Returns:

None

pudl.helpers.merge_dicts(list_of_dicts)[source]#

Merge multiple dictionaries together.

Given any number of dicts, shallow copy and merge them into a new dict; precedence goes to key/value pairs in later dicts.

Parameters:

list_of_dicts (list) – a list of dictionaries.

Returns:

dict
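
A usage sketch:

```python
from pudl.helpers import merge_dicts

defaults = {"states": ["CA"], "years": [2019]}
overrides = {"years": [2020, 2021]}
merged = merge_dicts([defaults, overrides])
# -> {"states": ["CA"], "years": [2020, 2021]}: later dicts take precedence.
```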

pudl.helpers.convert_cols_dtypes(df: pandas.DataFrame, data_source: str | None = None, name: str | None = None) → pandas.DataFrame[source]#

Convert a PUDL dataframe’s columns to the correct data type.

Boolean type conversions created a special problem, because null values in boolean columns get converted to True (which is bonkers!)… we generally want to preserve the null values and definitely don’t want them to be True, so we are keeping those columns as objects and performing a simple mask for the boolean columns.

The other exception in here is with the utility_id_eia column. It is often an object column of strings. All of the strings are numbers, so it should be possible to convert to pandas.Int32Dtype() directly, but it is requiring us to convert to int first. There will probably be other columns that have this problem… and hopefully pandas just enables this direct conversion.

Parameters:
  • df – dataframe with columns that appear in the PUDL tables.

  • data_source – the name of the datasource (eia, ferc1, etc.)

  • name – name of the table (for logging only!)

Returns:

Input dataframe, but with column types as specified by pudl.metadata.fields.FIELD_METADATA

pudl.helpers.generate_rolling_avg(df, group_cols, data_col, window, **kwargs)[source]#

Generate a rolling average.

For a given dataframe with a report_date column, generate a monthly rolling average and use this rolling average to impute missing values.

Parameters:
  • df (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a report_date column.

  • group_cols (iterable) – a list of columns to groupby.

  • data_col (str) – the name of the data column.

  • window (int) – window from pandas.Series.rolling().

  • kwargs – Additional arguments to pass to pandas.Series.rolling().

Returns:

pandas.DataFrame

pudl.helpers.fillna_w_rolling_avg(df_og, group_cols, data_col, window=12, **kwargs)[source]#

Fill NaNs with a rolling average.

Imputes null values in a dataframe using a rolling monthly average. Note that this was designed to work with the PudlTabl object’s tables.

Parameters:
  • df_og (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a ‘report_date’ column.

  • group_cols (iterable) – a list of columns to groupby.

  • data_col (str) – the name of the data column.

  • window (int) – window from pandas.Series.rolling

  • kwargs – Additional arguments to pass to pandas.Series.rolling.

Returns:

dataframe with nulls filled in.

Return type:

pandas.DataFrame

pudl.helpers.count_records(df, cols, new_count_col_name)[source]#

Count the number of unique records in each group in a dataframe.

Parameters:
  • df (pandas.DataFrame) – dataframe you would like to group by and count.

  • cols (iterable) – list of columns to group and count by.

  • new_count_col_name (string) – the name that will be assigned to the column that will contain the count.

Returns:

dataframe containing only cols and new_count_col_name.

Return type:

pandas.DataFrame

pudl.helpers.cleanstrings_snake(df, cols)[source]#

Clean the strings in columns of a dataframe with snake case.

Parameters:
  • df (pandas.DataFrame) – original dataframe.

  • cols (list) – list of columns in df to apply snake case to.

pudl.helpers.zero_pad_numeric_string(col: pandas.Series, n_digits: int) → pandas.Series[source]#

Clean up fixed-width leading zero padded numeric (e.g. ZIP, FIPS) codes.

Often values like ZIP and FIPS codes are stored as integers, or get converted to floating point numbers because there are NA values in the column. Sometimes other non-digit strings are included like Canadian postal codes mixed in with ZIP codes, or IMP (imported) instead of a FIPS county code. This function attempts to manage these irregularities and produce either fixed-width leading zero padded strings of digits having a specified length (n_digits) or NA.

  • Convert the Series to a nullable string.

  • Remove any decimal point and all digits following it.

  • Remove any non-digit characters.

  • Replace any empty strings with NA.

  • Replace any strings longer than n_digits with NA.

  • Pad remaining digit-only strings to n_digits length.

  • Replace (invalid) all-zero codes with NA.

Parameters:
  • col – The Series to clean. May be numeric, string, object, etc.

  • n_digits – the desired length of the output strings.

Returns:

A Series of nullable strings, containing only all-numeric strings having length n_digits, padded with leading zeroes if necessary.
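
A usage sketch with a mix of typical ZIP code irregularities:

```python
import pandas as pd

from pudl.helpers import zero_pad_numeric_string

zips = pd.Series([501, "02134", 2134.0, "IMP", None])
print(zero_pad_numeric_string(zips, n_digits=5))
# -> "00501", "02134", "02134", <NA>, <NA>
```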

pudl.helpers.iterate_multivalue_dict(**kwargs)[source]#

Make dicts from dict with main dict key and one value of main dict.

pudl.helpers.get_working_eia_dates()[source]#

Get all working EIA dates as a DatetimeIndex.

pudl.helpers.dedupe_on_category(dedup_df: pandas.DataFrame, base_cols: list[str], category_name: str, sorter: list[str]) → pandas.DataFrame[source]#

Deduplicate a df using a sorted category to retain preferred values.

Use a sorted category column to retain your preferred values when a dataframe is deduplicated.

Parameters:
  • dedup_df – the dataframe with the records to deduplicate.

  • base_cols – list of columns which must not be duplicated.

  • category_name – name of the categorical column to order values for deduplication.

  • sorter – sorted list of categorical values found in the category_name column.

Returns:

The deduplicated dataframe.
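
A usage sketch; the data and the category column are made up, and the preference ordering implied by sorter should be checked against the implementation:

```python
import pandas as pd

from pudl.helpers import dedupe_on_category

df = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "data_maturity": ["provisional", "final"],  # hypothetical category values
    "capacity_mw": [50.0, 52.0],
})
# Keep one record per plant_id_eia, retaining records according to the
# ordering of category values given in sorter:
out = dedupe_on_category(
    df,
    base_cols=["plant_id_eia"],
    category_name="data_maturity",
    sorter=["final", "provisional"],
)
```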

pudl.helpers.calc_capacity_factor(df, freq, min_cap_fact=None, max_cap_fact=None)[source]#

Calculate capacity factor.

Capacity factor is calculated from the capacity, the net generation over a time period, and the hours in that same time period. The dates from that dataframe are pulled out to determine the hours in each period based on the frequency. The number of hours is used in calculating the capacity factor. Then records with capacity factors outside the range specified by min_cap_fact and max_cap_fact are dropped.

Parameters:
  • df (pandas.DataFrame) – table with components of capacity factor (report_date, net_generation_mwh and capacity_mw).

  • min_cap_fact (float) – Lower bound, below which values are set to NaN. If None, don’t use a lower bound. Default is None.

  • max_cap_fact (float) – Upper bound, above which values are set to NaN. If None, don’t use an upper bound. Default is None.

  • freq (str) – String describing time frequency at which to aggregate the reported data, such as ‘MS’ (month start) or ‘AS’ (annual start).

Returns:

modified version of input df with one additional column (capacity_factor).

Return type:

pandas.DataFrame
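
The underlying arithmetic, sketched for a single annual record (numbers made up):

```python
# Capacity factor = energy actually generated / energy the unit could
# have generated running at full capacity for the whole period.
capacity_mw = 100.0
net_generation_mwh = 350_000.0
hours_in_period = 8760  # one non-leap year
capacity_factor = net_generation_mwh / (capacity_mw * hours_in_period)
print(capacity_factor)  # ~0.40
```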

pudl.helpers.weighted_average(df, data_col, weight_col, by)[source]#

Generate a weighted average.

Parameters:
  • df (pandas.DataFrame) – A DataFrame containing, at minimum, the columns specified in the other parameters data_col and weight_col.

  • data_col (string) – column name of data column to average

  • weight_col (string) – column name to weight on

  • by (list) – A list of the columns to group by when calculating the weighted average value.

Returns:

a table with by columns as the index and the weighted data_col.

Return type:

pandas.DataFrame
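
A usage sketch, computing a generation-weighted heat rate per plant (the column names and data are illustrative):

```python
import pandas as pd

from pudl.helpers import weighted_average

df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "heat_rate_mmbtu_mwh": [10.0, 12.0, 9.0],
    "net_generation_mwh": [300.0, 100.0, 500.0],
})
wa = weighted_average(
    df,
    data_col="heat_rate_mmbtu_mwh",
    weight_col="net_generation_mwh",
    by=["plant_id_eia"],
)
# plant 1: (10*300 + 12*100) / (300 + 100) = 10.5
```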

pudl.helpers.sum_and_weighted_average_agg(df_in: pandas.DataFrame, by: list, sum_cols: list, wtavg_dict: dict[str, str]) → pandas.DataFrame[source]#

Aggregate dataframe by summing and using weighted averages.

Many times we want to aggregate a data table using the same groupby columns but with different aggregation methods. This function combines two of our most common aggregation methods (summing and applying a weighted average) into one function. Because pandas does not have a built-in weighted average method for groupby, we use weighted_average().

Parameters:
  • df_in (pandas.DataFrame) – input table to aggregate. Must have columns in by, sum_cols and keys from wtavg_dict.

  • by (list) – columns to group/aggregate based on. These columns will be passed into groupby as the by argument.

  • sum_cols (list) – columns to sum.

  • wtavg_dict (dictionary) – dictionary of columns to average (keys) and columns to weight by (values).

Returns:

table with join of columns from by, sum_cols and keys of wtavg_dict. Primary key of table will be by.
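
A usage sketch combining both aggregation methods (data made up):

```python
import pandas as pd

from pudl.helpers import sum_and_weighted_average_agg

df = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "net_generation_mwh": [300.0, 100.0],
    "heat_rate_mmbtu_mwh": [10.0, 12.0],
})
agg = sum_and_weighted_average_agg(
    df,
    by=["plant_id_eia"],
    sum_cols=["net_generation_mwh"],
    wtavg_dict={"heat_rate_mmbtu_mwh": "net_generation_mwh"},
)
# net_generation_mwh is summed (400); the heat rate is weighted-averaged (10.5).
```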

pudl.helpers.get_eia_ferc_acct_map()[source]#

Get map of EIA technology_description/pm codes <> ferc accounts.

Returns:

A table which maps the combination of EIA’s technology description and prime mover code to FERC Uniform System of Accounts (USOA) accounting names. The output table has the following columns: [‘technology_description’, ‘prime_mover_code’, ‘ferc_acct_name’]

Return type:

pandas.DataFrame

pudl.helpers.dedupe_n_flatten_list_of_lists(mega_list)[source]#

Flatten a list of lists and remove duplicates.

pudl.helpers.convert_df_to_excel_file(df: pandas.DataFrame, **kwargs) → pandas.ExcelFile[source]#

Converts a pandas dataframe to a pandas ExcelFile object.

You can pass parameters for the pandas.DataFrame.to_excel() function.