pudl.helpers

General utility functions that are used in a variety of contexts.

The functions in this module are used in various stages of the ETL and post-etl processes. They are usually not dataset specific, but not always. If a function is designed to be used as a general purpose tool, applicable in multiple scenarios, it should probably live here. There are lost of transform type functions in here that help with cleaning and restructing dataframes.

Module Contents

Functions

label_map(df: pandas.DataFrame, from_col: str = 'code', to_col: str = 'label', null_value=pd.NA) → DefaultDict[str, Union[str, Literal[pandas.NA]]]

Build a mapping dictionary from two columns of a labeling / coding dataframe.

find_new_ferc1_strings(table: str, field: str, strdict: Dict[str, List[str]], ferc1_engine: sqlalchemy.engine.Engine) → Set[str]

Identify as-of-yet uncategorized freeform strings in FERC Form 1.

find_foreign_key_errors(dfs: Dict[str, pandas.DataFrame]) → List[Dict[str, Any]]

Report foreign key violations from a dictionary of dataframes.

download_zip_url(url, save_path, chunk_size=128)

Download and save a Zipfile locally.

add_fips_ids(df, state_col='state', county_col='county', vintage=2015)

Add State and County FIPS IDs to a dataframe.

clean_eia_counties(df, fixes, state_col='state', county_col='county')

Replace non-standard county names with county nmes from US Census.

oob_to_nan(df, cols, lb=None, ub=None)

Set non-numeric values and those outside of a given rage to NaN.

prep_dir(dir_path, clobber=False)

Create (or delete and recreate) a directory.

is_doi(doi)

Determine if a string is a valid digital object identifier (DOI).

clean_merge_asof(left, right, left_on='report_date', right_on='report_date', by={})

Merge two dataframes having different time report_date frequencies.

get_pudl_dtype(col, data_source)

Look up a column's canonical data type based on its PUDL data source.

get_pudl_dtypes(col_source_dict)

Look up canonical PUDL data types for columns based on data sources.

organize_cols(df, cols)

Organize columns into key ID & name fields & alphabetical data columns.

simplify_strings(df, columns)

Simplify the strings contained in a set of dataframe columns.

cleanstrings_series(col, str_map, unmapped=None, simplify=True)

Clean up the strings in a single column/Series.

cleanstrings(df, columns, stringmaps, unmapped=None, simplify=True)

Consolidate freeform strings in several dataframe columns.

fix_int_na(df, columns, float_na=np.nan, int_na=-1, str_na='')

Convert NA containing integer columns from float to string.

month_year_to_date(df)

Convert all pairs of year/month fields in a dataframe into Date fields.

fix_leading_zero_gen_ids(df)

Remove leading zeros from EIA generator IDs which are numeric strings.

convert_to_date(df, date_col='report_date', year_col='report_year', month_col='report_month', day_col='report_day', month_value=1, day_value=1)

Convert specified year, month or day columns into a datetime object.

fix_eia_na(df)

Replace common ill-posed EIA NA spreadsheet values with np.nan.

simplify_columns(df)

Simplify column labels for use as snake_case database fields.

drop_tables(engine, clobber=False)

Drops all tables from a SQLite database.

merge_dicts(list_of_dicts)

Merge multipe dictionaries together.

convert_cols_dtypes(df, data_source, name=None)

Convert the data types for a dataframe.

convert_dfs_dict_dtypes(dfs_dict, data_source)

Convert the data types of a dictionary of dataframes.

generate_rolling_avg(df, group_cols, data_col, window, **kwargs)

Generate a rolling average.

fillna_w_rolling_avg(df_og, group_cols, data_col, window=12, **kwargs)

Filling NaNs with a rolling average.

count_records(df, cols, new_count_col_name)

Count the number of unique records in group in a dataframe.

cleanstrings_snake(df, cols)

Clean the strings in a columns in a dataframe with snake case.

zero_pad_zips(zip_series, n_digits)

Retain prefix zeros in zipcodes.

iterate_multivalue_dict(**kwargs)

Make dicts from dict with main dict key and one value of main dict.

get_working_eia_dates()

Get all working EIA dates as a DatetimeIndex.

convert_df_to_excel_file(df: pandas.DataFrame, **kwargs) → pandas.ExcelFile

Converts a pandas dataframe to a pandas ExcelFile object.

Attributes

logger

sum_na

A sum function that returns NA if the Series includes any NA values.

pudl.helpers.logger[source]
pudl.helpers.sum_na[source]

A sum function that returns NA if the Series includes any NA values.

In many of our aggregations we need to override the default behavior of treating NA values as if they were zero. E.g. when calculating the heat rates of generation units, if there are some months where fuel consumption is reported as NA, but electricity generation is reported normally, then the fuel consumption for the year needs to be NA, otherwise we’ll get unrealistic heat rates.

pudl.helpers.label_map(df: pandas.DataFrame, from_col: str = 'code', to_col: str = 'label', null_value=pd.NA) DefaultDict[str, Union[str, Literal[pandas.NA]]][source]

Build a mapping dictionary from two columns of a labeling / coding dataframe.

These dataframes document the meanings of the codes that show up in much of the originally reported data. They’re defined in pudl.metadata.codes. This function is mostly used to build maps that can translate the hard to understand short codes into longer human-readable codes.

Parameters
  • df – The coding / labeling dataframe. Must contain columns from_col and to_col.

  • from_col – Label of column containing the existing codes to be replaced.

  • to_col – Label of column containing the new codes to be swapped in.

  • null_value – Defualt (Null) value to map to when a value which doesn’t appear in from_col is encountered.

Returns

A mapping dictionary suitable for use with pandas.Series.map().

pudl.helpers.find_new_ferc1_strings(table: str, field: str, strdict: Dict[str, List[str]], ferc1_engine: sqlalchemy.engine.Engine) Set[str][source]

Identify as-of-yet uncategorized freeform strings in FERC Form 1.

Parameters
  • table – Name of the FERC Form 1 DB to search.

  • field – Name of the column in that table to search.

  • strdict – A string cleaning dictionary. See e.g. pudl.transform.ferc1.FUEL_UNIT_STRINGS

  • ferc1_engine – SQL Alchemy DB connection engine for the FERC Form 1 DB.

Returns

Any string found in the searched table + field that was not part of any of categories enumerated in strdict.

pudl.helpers.find_foreign_key_errors(dfs: Dict[str, pandas.DataFrame]) List[Dict[str, Any]][source]

Report foreign key violations from a dictionary of dataframes.

The database schema to check against is generated based on the names of the dataframes (keys of the dictionary) and the PUDL metadata structures.

Parameters

dfs – Keys are table names, and values are dataframes ready for loading into the SQLite database.

Returns

A list of dictionaries, each one pertains to a single database table in which a foreign key constraint violation was found, and it includes the table name, foreign key definition, and the elements of the dataframe that violated the foreign key constraint.

pudl.helpers.download_zip_url(url, save_path, chunk_size=128)[source]

Download and save a Zipfile locally.

Useful for acquiring and storing non-PUDL data locally.

Parameters
  • url (str) – The URL from which to download the Zipfile

  • save_path (pathlib.Path) – The location to save the file.

  • chunk_size (int) – Data chunk in bytes to use while downloading.

Returns

None

pudl.helpers.add_fips_ids(df, state_col='state', county_col='county', vintage=2015)[source]

Add State and County FIPS IDs to a dataframe.

pudl.helpers.clean_eia_counties(df, fixes, state_col='state', county_col='county')[source]

Replace non-standard county names with county nmes from US Census.

pudl.helpers.oob_to_nan(df, cols, lb=None, ub=None)[source]

Set non-numeric values and those outside of a given rage to NaN.

Parameters
  • df (pandas.DataFrame) – The dataframe containing values to be altered.

  • cols (iterable) – Labels of the columns whose values are to be changed.

  • lb – (number): Lower bound, below which values are set to NaN. If None, don’t use a lower bound.

  • ub – (number): Upper bound, below which values are set to NaN. If None, don’t use an upper bound.

Returns

The altered DataFrame.

Return type

pandas.DataFrame

pudl.helpers.prep_dir(dir_path, clobber=False)[source]

Create (or delete and recreate) a directory.

Parameters
  • dir_path (path-like) – path to the directory that you are trying to clean and prepare.

  • clobber (bool) – If True and dir_path exists, it will be removed and replaced with a new, empty directory.

Raises

FileExistsError – if a file or directory already exists at dir_path.

Returns

Path to the created directory.

Return type

pathlib.Path

pudl.helpers.is_doi(doi)[source]

Determine if a string is a valid digital object identifier (DOI).

Function simply checks whether the offered string matches a regular expresssion – it doesn’t check whether the DOI is actually registered with the relevant authority.

Parameters

doi (str) – String to validate.

Returns

True if doi matches the regex for valid DOIs, False otherwise.

Return type

bool

pudl.helpers.clean_merge_asof(left, right, left_on='report_date', right_on='report_date', by={})[source]

Merge two dataframes having different time report_date frequencies.

We often need to bring together data which is reported on a monthly basis, and entity attributes that are reported on an annual basis. The pandas.merge_asof() is designed to do this, but requires that dataframes are sorted by the merge keys (left_on, right_on, and by.keys() here). We also need to make sure that all merge keys have identical data types in the two dataframes (e.g. plant_id_eia needs to be a nullable integer in both dataframes, not a python int in one, and a nullable pandas.Int64Dtype() in the other). Note that pandas.merge_asof() performs a left merge, so the higher frequency dataframe must be the left dataframe.

We also force both left_on and right_on to be a Datetime using pandas.to_datetime() to allow merging dataframes having integer years with those having datetime columns.

Because pandas.merge_asof() searches backwards for the first matching date, this function only works if the less granular dataframe uses the convention of reporting the first date in the time period for which it reports. E.g. annual dataframes need to have January 1st as the date. This is what happens by defualt if only a year or year-month are provided to pandas.to_datetime() as strings.

Parameters
  • left (pandas.DataFrame) – The higher frequency “data” dataframe. Typically monthly in our use cases. E.g. generation_eia923. Must contain report_date and any columns specified in the by argument.

  • right (pandas.DataFrame) – The lower frequency “attribute” dataframe. Typically annual in our uses cases. E.g. generators_eia860. Must contain report_date and any columns specified in the by argument.

  • left_on (str) – Column in left to merge on using merge_asof. Default is report_date. Must be convertible to a Datetime using pandas.to_datetime()

  • right_on (str) – Column in right to merge on using merge_asof. Default is report_date. Must be convertible to a Datetime using pandas.to_datetime()

  • by (dict) – A dictionary enumerating any columns to merge on other than report_date. Typically ID columns like plant_id_eia, generator_id or boiler_id. The keys of the dictionary are the names of the columns, and the values are their data source, as defined in pudl.constants (e.g. ferc1 or eia). The data source is used to look up the column’s canonical data type.

Returns

Merged contents of left and right input dataframes. Will be sorted by left_on and any columns specified in by. See documentation for pandas.merge_asof() to understand how this kind of merge works.

Return type

pandas.DataFrame

Raises
  • ValueError – if left_on or right_on columns are missing from their respective input dataframes.

  • ValueError – if any of the labels referenced in by are missing from either the left or right dataframes.

pudl.helpers.get_pudl_dtype(col, data_source)[source]

Look up a column’s canonical data type based on its PUDL data source.

pudl.helpers.get_pudl_dtypes(col_source_dict)[source]

Look up canonical PUDL data types for columns based on data sources.

pudl.helpers.organize_cols(df, cols)[source]

Organize columns into key ID & name fields & alphabetical data columns.

For readability, it’s nice to group a few key columns at the beginning of the dataframe (e.g. report_year or report_date, plant_id…) and then put all the rest of the data columns in alphabetical order.

Parameters
  • df – The DataFrame to be re-organized.

  • cols – The columns to put first, in their desired output ordering.

Returns

A dataframe with the same columns as the input DataFrame df, but with cols first, in the same order as they were passed in, and the remaining columns sorted alphabetically.

Return type

pandas.DataFrame

pudl.helpers.simplify_strings(df, columns)[source]

Simplify the strings contained in a set of dataframe columns.

Performs several operations to simplify strings for comparison and parsing purposes. These include removing Unicode control characters, stripping leading and trailing whitespace, using lowercase characters, and compacting all internal whitespace to a single space.

Leaves null values unaltered. Casts other values with astype(str).

Parameters
  • df (pandas.DataFrame) – DataFrame whose columns are being cleaned up.

  • columns (iterable) – The labels of the string columns to be simplified.

Returns

The whole DataFrame that was passed in, with the string columns cleaned up.

Return type

pandas.DataFrame

pudl.helpers.cleanstrings_series(col, str_map, unmapped=None, simplify=True)[source]

Clean up the strings in a single column/Series.

Parameters
  • col (pandas.Series) – A pandas Series, typically a single column of a dataframe, containing the freeform strings that are to be cleaned.

  • str_map (dict) – A dictionary of lists of strings, in which the keys are the simplified canonical strings, witch which each string found in the corresponding list will be replaced.

  • unmapped (str) – A value with which to replace any string found in col that is not found in one of the lists of strings in map. Typically the null string ‘’. If None, these strings will not be replaced.

  • simplify (bool) – If True, strip and compact whitespace, and lowercase all strings in both the list of values to be replaced, and the values found in col. This can reduce the number of strings that need to be kept track of.

Returns

The cleaned up Series / column, suitable for replacing the original messy column in a pandas.DataFrame.

Return type

pandas.Series

pudl.helpers.cleanstrings(df, columns, stringmaps, unmapped=None, simplify=True)[source]

Consolidate freeform strings in several dataframe columns.

This function will consolidate freeform strings found in columns into simplified categories, as defined by stringmaps. This is useful when a field contains many different strings that are really meant to represent a finite number of categories, e.g. a type of fuel. It can also be used to create simplified categories that apply to similar attributes that are reported in various data sources from different agencies that use their own taxonomies.

The function takes and returns a pandas.DataFrame, making it suitable for use with the pandas.DataFrame.pipe() method in a chain.

Parameters
  • df (pandas.DataFrame) – the DataFrame containing the string columns to be cleaned up.

  • columns (list) – a list of string column labels found in the column index of df. These are the columns that will be cleaned.

  • stringmaps (list) – a list of dictionaries. The keys of these dictionaries are strings, and the values are lists of strings. Each dictionary in the list corresponds to a column in columns. The keys of the dictionaries are the values with which every string in the list of values will be replaced.

  • unmapped (str, None) – the value with which strings not found in the stringmap dictionary will be replaced. Typically the null string ‘’. If None, then strings found in the columns but not in the stringmap will be left unchanged.

  • simplify (bool) – If true, strip whitespace, remove duplicate whitespace, and force lower-case on both the string map and the values found in the columns to be cleaned. This can reduce the overall number of string values that need to be tracked.

Returns

The function returns a new DataFrame containing the cleaned strings.

Return type

pandas.DataFrame

pudl.helpers.fix_int_na(df, columns, float_na=np.nan, int_na=- 1, str_na='')[source]

Convert NA containing integer columns from float to string.

Numpy doesn’t have a real NA value for integers. When pandas stores integer data which has NA values, it thus upcasts integers to floating point values, using np.nan values for NA. However, in order to dump some of our dataframes to CSV files for use in data packages, we need to write out integer formatted numbers, with empty strings as the NA value. This function replaces np.nan values with a sentinel value, converts the column to integers, and then to strings, finally replacing the sentinel value with the desired NA string.

This is an interim solution – now that pandas extension arrays have been implemented, we need to go back through and convert all of these integer columns that contain NA values to Nullable Integer types like Int64.

Parameters
  • df (pandas.DataFrame) – The dataframe to be fixed. This argument allows method chaining with the pipe() method.

  • columns (iterable of strings) – A list of DataFrame column labels indicating which columns need to be reformatted for output.

  • float_na (float) – The floating point value to be interpreted as NA and replaced in col.

  • int_na (int) – Sentinel value to substitute for float_na prior to conversion of the column to integers.

  • str_na (str) – sa.String value to substitute for int_na after the column has been converted to strings.

Returns

a new DataFrame, with the selected columns converted to strings that look like integers, compatible with the postgresql COPY FROM command.

Return type

df (pandas.DataFrame)

pudl.helpers.month_year_to_date(df)[source]

Convert all pairs of year/month fields in a dataframe into Date fields.

This function finds all column names within a dataframe that match the regular expression ‘_month$’ and ‘_year$’, and looks for pairs that have identical prefixes before the underscore. These fields are assumed to describe a date, accurate to the month. The two fields are used to construct a new _date column (having the same prefix) and the month/year columns are then dropped.

Todo

This function needs to be combined with convert_to_date, and improved: * find and use a _day$ column as well * allow specification of default month & day values, if none are found. * allow specification of lists of year, month, and day columns to be combined, rather than automataically finding all the matching ones. * Do the Right Thing when invalid or NA values are encountered.

Parameters

df (pandas.DataFrame) – The DataFrame in which to convert year/months fields to Date fields.

Returns

A DataFrame in which the year/month fields have been converted into Date fields.

Return type

pandas.DataFrame

pudl.helpers.fix_leading_zero_gen_ids(df)[source]

Remove leading zeros from EIA generator IDs which are numeric strings.

If the DataFrame contains a column named generator_id then that column will be cast to a string, and any all numeric value with leading zeroes will have the leading zeroes removed. This is necessary because in some but not all years of data, some of the generator IDs are treated as integers in the Excel spreadsheets published by EIA, so the same generator may show up with the ID “0001” and “1” in different years.

Alphanumeric generator IDs with leadings zeroes are not affected, as we found no instances in which an alphanumeric generator ID appeared both with and without leading zeroes.

Parameters

df (pandas.DataFrame) – DataFrame, presumably containing a column named generator_id (otherwise no action will be taken.)

Returns

pandas.DataFrame

pudl.helpers.convert_to_date(df, date_col='report_date', year_col='report_year', month_col='report_month', day_col='report_day', month_value=1, day_value=1)[source]

Convert specified year, month or day columns into a datetime object.

If the input date_col already exists in the input dataframe, then no conversion is applied, and the original dataframe is returned unchanged. Otherwise the constructed date is placed in that column, and the columns which were used to create the date are dropped.

Parameters
  • df (pandas.DataFrame) – dataframe to convert

  • date_col (str) – the name of the column you want in the output.

  • year_col (str) – the name of the year column in the original table.

  • month_col (str) – the name of the month column in the original table.

  • day_col – the name of the day column in the original table.

  • month_value (int) – generated month if no month exists.

  • day_value (int) – generated day if no month exists.

Returns

A DataFrame in which the year, month, day columns values have been converted into datetime objects.

Return type

pandas.DataFrame

Todo

Update docstring.

pudl.helpers.fix_eia_na(df)[source]

Replace common ill-posed EIA NA spreadsheet values with np.nan.

Currently replaces empty string, single decimal points with no numbers, and any single whitespace character with np.nan.

Parameters

df (pandas.DataFrame) – The DataFrame to clean.

Returns

The cleaned DataFrame.

Return type

pandas.DataFrame

pudl.helpers.simplify_columns(df)[source]

Simplify column labels for use as snake_case database fields.

All columns will be re-labeled by: * Replacing all non-alphanumeric characters with spaces. * Forcing all letters to be lower case. * Compacting internal whitespace to a single ” “. * Stripping leading and trailing whitespace. * Replacing all remaining whitespace with underscores.

Parameters

df (pandas.DataFrame) – The DataFrame to clean.

Returns

The cleaned DataFrame.

Return type

pandas.DataFrame

Todo

Update docstring.

pudl.helpers.drop_tables(engine, clobber=False)[source]

Drops all tables from a SQLite database.

Creates an sa.schema.MetaData object reflecting the structure of the database that the passed in engine refers to, and uses that schema to drop all existing tables.

Todo

Treat DB connection as a context manager (with/as).

Parameters

engine (sa.engine.Engine) – An SQL Alchemy SQLite database Engine pointing at an exising SQLite database to be deleted.

Returns

None

pudl.helpers.merge_dicts(list_of_dicts)[source]

Merge multipe dictionaries together.

Given any number of dicts, shallow copy and merge into a new dict, precedence goes to key value pairs in latter dicts.

Parameters

dict_args (list) – a list of dictionaries.

Returns

dict

pudl.helpers.convert_cols_dtypes(df, data_source, name=None)[source]

Convert the data types for a dataframe.

This function will convert a PUDL dataframe’s columns to the correct data type. It uses a dictionary in constants.py called COLUMN_DTYPES to assign the right type. Within a given data source (e.g. eia923, ferc1) each column name is assumed to always have the same data type whenever it is found.

Boolean type conversions created a special problem, because null values in boolean columns get converted to True (which is bonkers!)… we generally want to preserve the null values and definitely don’t want them to be True, so we are keeping those columns as objects and preforming a simple mask for the boolean columns.

The other exception in here is with the utility_id_eia column. It is often an object column of strings. All of the strings are numbers, so it should be possible to convert to pandas.Int32Dtype() directly, but it is requiring us to convert to int first. There will probably be other columns that have this problem… and hopefully pandas just enables this direct conversion.

Parameters
  • df (pandas.DataFrame) – dataframe with columns that appear in the PUDL tables.

  • data_source (str) – the name of the datasource (eia, ferc1, etc.)

  • name (str) – name of the table (for logging only!)

Returns

a dataframe with columns as specified by the pudl.constants COLUMN_DTYPES dictionary.

Return type

pandas.DataFrame

pudl.helpers.convert_dfs_dict_dtypes(dfs_dict, data_source)[source]

Convert the data types of a dictionary of dataframes.

This is a wrapper for pudl.helpers.convert_cols_dtypes() which loops over an entire dictionary of dataframes, assuming they are all from the specified data source, and appropriately assigning data types to each column based on the data source specific type map stored in pudl.constants

pudl.helpers.generate_rolling_avg(df, group_cols, data_col, window, **kwargs)[source]

Generate a rolling average.

For a given dataframe with a report_date column, generate a monthly rolling average and use this rolling average to impute missing values.

Parameters
  • df (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a report_date column.

  • group_cols (iterable) – a list of columns to groupby.

  • data_col (str) – the name of the data column.

  • window (int) – window from pandas.Series.rolling().

  • kwargs – Additional arguments to pass to pandas.Series.rolling().

Returns

pandas.DataFrame

pudl.helpers.fillna_w_rolling_avg(df_og, group_cols, data_col, window=12, **kwargs)[source]

Filling NaNs with a rolling average.

Imputes null values from a dataframe on a rolling monthly average. To note, this was designed to work with the PudlTabl object’s tables.

Parameters
  • df_og (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a ‘report_date’ column.

  • group_cols (iterable) – a list of columns to groupby.

  • data_col (str) – the name of the data column.

  • window (int) – window from pandas.Series.rolling

  • kwargs – Additional arguments to pass to pandas.Series.rolling.

Returns

dataframe with nulls filled in.

Return type

pandas.DataFrame

pudl.helpers.count_records(df, cols, new_count_col_name)[source]

Count the number of unique records in group in a dataframe.

Parameters
  • df (panda.DataFrame) – dataframe you would like to groupby and count.

  • cols (iterable) – list of columns to group and count by.

  • new_count_col_name (string) – the name that will be assigned to the column that will contain the count.

Returns

dataframe containing only cols and new_count_col_name.

Return type

pandas.DataFrame

pudl.helpers.cleanstrings_snake(df, cols)[source]

Clean the strings in a columns in a dataframe with snake case.

Parameters
  • df (panda.DataFrame) – original dataframe.

  • cols (list) – list of columns in df to apply snake case to.

pudl.helpers.zero_pad_zips(zip_series, n_digits)[source]

Retain prefix zeros in zipcodes.

Parameters
  • zip_series (pd.Series) – series containing the zipcode values.

  • n_digits (int) – zipcode length (likely 4 or 5 digits).

Returns

a series containing zipcodes with their prefix zeros intact and invalid zipcodes rendered as na.

Return type

pandas.Series

pudl.helpers.iterate_multivalue_dict(**kwargs)[source]

Make dicts from dict with main dict key and one value of main dict.

pudl.helpers.get_working_eia_dates()[source]

Get all working EIA dates as a DatetimeIndex.

pudl.helpers.convert_df_to_excel_file(df: pandas.DataFrame, **kwargs) pandas.ExcelFile[source]

Converts a pandas dataframe to a pandas ExcelFile object.

You can pass parameters for pandas.to_excel() function.