pudl.helpers module¶
General utility functions that are used in a variety of contexts.
The functions in this module are used in various stages of the ETL and post-etl processes. They are usually not dataset specific, but not always. If a function is designed to be used as a general purpose tool, applicable in multiple scenarios, it should probably live here. There are lost of transform type functions in here that help with cleaning and restructing dataframes.
-
pudl.helpers.
cleanstrings
(df, columns, stringmaps, unmapped=None, simplify=True)[source]¶ Consolidate freeform strings in several dataframe columns.
This function will consolidate freeform strings found in columns into simplified categories, as defined by stringmaps. This is useful when a field contains many different strings that are really meant to represent a finite number of categories, e.g. a type of fuel. It can also be used to create simplified categories that apply to similar attributes that are reported in various data sources from different agencies that use their own taxonomies.
The function takes and returns a pandas.DataFrame, making it suitable for use with the
pandas.DataFrame.pipe()
method in a chain.- Parameters
df (pandas.DataFrame) – the DataFrame containing the string columns to be cleaned up.
columns (list) – a list of string column labels found in the column index of df. These are the columns that will be cleaned.
stringmaps (list) – a list of dictionaries. The keys of these dictionaries are strings, and the values are lists of strings. Each dictionary in the list corresponds to a column in columns. The keys of the dictionaries are the values with which every string in the list of values will be replaced.
unmapped (str, None) – the value with which strings not found in the stringmap dictionary will be replaced. Typically the null string ‘’. If None, then strings found in the columns but not in the stringmap will be left unchanged.
simplify (bool) – If true, strip whitespace, remove duplicate whitespace, and force lower-case on both the string map and the values found in the columns to be cleaned. This can reduce the overall number of string values that need to be tracked.
- Returns
The function returns a new pandas series/column that can be used to set the values of the original data.
- Return type
-
pudl.helpers.
cleanstrings_series
(col, str_map, unmapped=None, simplify=True)[source]¶ Clean up the strings in a single column/Series.
- Parameters
col (pandas.Series) – A pandas Series, typically a single column of a dataframe, containing the freeform strings that are to be cleaned.
map (dict) – A dictionary of lists of strings, in which the keys are the simplified canonical strings, witch which each string found in the corresponding list will be replaced.
unmapped (str) – A value with which to replace any string found in col that is not found in one of the lists of strings in map. Typically the null string ‘’. If None, these strings will not be replaced.
simplify (bool) – If True, strip and compact whitespace, and lowercase all strings in both the list of values to be replaced, and the values found in col. This can reduce the number of strings that need to be kept track of.
- Returns
The cleaned up Series / column, suitable for replacing the original messy column in a
pandas.DataFrame
.- Return type
-
pudl.helpers.
cleanstrings_snake
(df, cols)[source]¶ Clean the strings in a columns in a dataframe with snake case.
- Parameters
df (panda.DataFrame) – original dataframe.
cols (list) – list of columns in df to apply snake case to.
-
pudl.helpers.
convert_cols_dtypes
(df, data_source, name=None)[source]¶ Convert the data types for a dataframe.
This function will convert a PUDL dataframe’s columns to the correct data type. It uses a dictionary in constants.py called column_dtypes to assign the right type. Within a given data source (e.g. eia923, ferc1) each column name is assumed to always have the same data type whenever it is found.
Boolean type conversions created a special problem, because null values in boolean columns get converted to True (which is bonkers!)… we generally want to preserve the null values and definitely don’t want them to be True, so we are keeping those columns as objects and preforming a simple mask for the boolean columns.
The other exception in here is with the utility_id_eia column. It is often an object column of strings. All of the strings are numbers, so it should be possible to convert to
pandas.Int32Dtype()
directly, but it is requiring us to convert to int first. There will probably be other columns that have this problem… and hopefully pandas just enables this direct conversion.- Parameters
df (pandas.DataFrame) – dataframe with columns that appear in the PUDL tables.
data_source (str) – the name of the datasource (eia, ferc1, etc.)
name (str) – name of the table (for logging only!)
- Returns
a dataframe with columns as specified by the
pudl.constants
column_dtypes
dictionary.- Return type
-
pudl.helpers.
convert_dfs_dict_dtypes
(dfs_dict, data_source)[source]¶ Convert the data types of a dictionary of dataframes.
This is a wrapper for
pudl.helpers.convert_cols_dtypes()
which loops over an entire dictionary of dataframes, assuming they are all from the specified data source, and appropriately assigning data types to each column based on the data source specific type map stored in pudl.constants
-
pudl.helpers.
convert_to_date
(df, date_col='report_date', year_col='report_year', month_col='report_month', day_col='report_day', month_value=1, day_value=1)[source]¶ Convert specified year, month or day columns into a datetime object.
If the input
date_col
already exists in the input dataframe, then no conversion is applied, and the original dataframe is returned unchanged. Otherwise the constructed date is placed in that column, and the columns which were used to create the date are dropped.- Parameters
df (pandas.DataFrame) – dataframe to convert
date_col (str) – the name of the column you want in the output.
year_col (str) – the name of the year column in the original table.
month_col (str) – the name of the month column in the original table.
day_col – the name of the day column in the original table.
month_value (int) – generated month if no month exists.
day_value (int) – generated day if no month exists.
- Returns
A DataFrame in which the year, month, day columns values have been converted into datetime objects.
- Return type
Todo
Update docstring.
-
pudl.helpers.
count_records
(df, cols, new_count_col_name)[source]¶ Count the number of unique records in group in a dataframe.
- Parameters
df (panda.DataFrame) – dataframe you would like to groupby and count.
cols (iterable) – list of columns to group and count by.
new_count_col_name (string) – the name that will be assigned to the column that will contain the count.
- Retruns:
pandas.DataFrame: dataframe with only the cols definted and the new_count_col_name.
-
pudl.helpers.
drop_tables
(engine, clobber=False)[source]¶ Drops all tables from a SQLite database.
Creates an sa.schema.MetaData object reflecting the structure of the database that the passed in
engine
refers to, and uses that schema to drop all existing tables.Todo
Treat DB connection as a context manager (with/as).
- Parameters
engine (sa.engine.Engine) – An SQL Alchemy SQLite database Engine pointing at an exising SQLite database to be deleted.
- Returns
None
-
pudl.helpers.
extend_annual
(df, date_col='report_date', start_date=None, end_date=None)[source]¶ Extend time range in a DataFrame by duplicating first and last years.
Takes the earliest year’s worth of annual data and uses it to create earlier years by duplicating it, and changing the year. Similarly, extends a dataset into the future by duplicating the last year’s records.
This is primarily used to extend the EIA860 data about utilities, plants, and generators, so that we can analyze a larger set of EIA923 data. EIA923 data has been integrated a bit further back, and the EIA860 data has a year long lag in being released.
- Parameters
df (pandas.DataFrame) – The dataframe to extend.
date_col (str) – The column in the dataframe which contains date information.
start_date (date) – The earliest date to which data should be extended.
end_date (date) – The latest date to which data should be extended.
- Raises
ValueError – if the data column is found not to be consistent with annual reporting.
- Returns
A dataframe resembling the input dataframe, but with the first and/or last years of data copied to provide an approximation of earlier/later data that is not available.
- Return type
-
pudl.helpers.
fillna_w_rolling_avg
(df_og, group_cols, data_col, window=12, **kwargs)[source]¶ Filling NaNs with a rolling average.
Imputes null values from a dataframe on a rolling monthly average. To note, this was designed to work with the PudlTabl object’s tables.
- Parameters
df_og (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a ‘report_date’ column.
group_cols (iterable) – a list of columns to groupby.
data_col (str) – the name of the data column.
window (int) – window from pandas.Series.rolling
**kwargs – Additional arguments to pass to
pandas.Series.rolling
.
- Returns
dataframe with nulls filled in.
- Return type
-
pudl.helpers.
find_timezone
(*, lng=None, lat=None, state=None, strict=True)[source]¶ Find the timezone associated with the a specified input location.
Note that this function requires named arguments. The names are lng, lat, and state. lng and lat must be provided, but they may be NA. state isn’t required, and isn’t used unless lng/lat are NA or timezonefinder can’t find a corresponding timezone.
Timezones based on states are imprecise, so it’s far better to use lng/lat if possible. If strict is True, state will not be used. More on state-to-timezone conversion here: https://en.wikipedia.org/wiki/List_of_time_offsets_by_U.S._state_and_territory
- Parameters
- Returns
The timezone (as an IANA string) for that location.
- Return type
Todo
Update docstring.
-
pudl.helpers.
fix_eia_na
(df)[source]¶ Replace common ill-posed EIA NA spreadsheet values with np.nan.
- Parameters
df (pandas.DataFrame) – The DataFrame to clean.
- Returns
The cleaned DataFrame.
- Return type
Todo
Update docstring.
-
pudl.helpers.
fix_int_na
(df, columns, float_na=nan, int_na=-1, str_na='')[source]¶ Convert NA containing integer columns from float to string.
Numpy doesn’t have a real NA value for integers. When pandas stores integer data which has NA values, it thus upcasts integers to floating point values, using np.nan values for NA. However, in order to dump some of our dataframes to CSV files for use in data packages, we need to write out integer formatted numbers, with empty strings as the NA value. This function replaces np.nan values with a sentinel value, converts the column to integers, and then to strings, finally replacing the sentinel value with the desired NA string.
This is an interim solution – now that pandas extension arrays have been implemented, we need to go back through and convert all of these integer columns that contain NA values to Nullable Integer types like Int64.
- Parameters
df (pandas.DataFrame) – The dataframe to be fixed. This argument allows method chaining with the pipe() method.
columns (iterable of strings) – A list of DataFrame column labels indicating which columns need to be reformatted for output.
float_na (float) – The floating point value to be interpreted as NA and replaced in col.
int_na (int) – Sentinel value to substitute for float_na prior to conversion of the column to integers.
str_na (str) – sa.String value to substitute for int_na after the column has been converted to strings.
- Returns
a new DataFrame, with the selected columns converted to strings that look like integers, compatible with the postgresql COPY FROM command.
- Return type
df (pandas.DataFrame)
-
pudl.helpers.
generate_rolling_avg
(df, group_cols, data_col, window, **kwargs)[source]¶ Generate a rolling average.
For a given dataframe with a report_date column, generate a monthly rolling average and use this rolling average to impute missing values.
- Parameters
df (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a ‘report_date’ column.
group_cols (iterable) – a list of columns to groupby.
data_col (str) – the name of the data column.
window (int) – window from pandas.Series.rolling
**kwargs – Additional arguments to pass to
pandas.Series.rolling
.
- Returns
pandas.DataFrame
-
pudl.helpers.
is_annual
(df_year, year_col='report_date')[source]¶ Determine whether a DataFrame contains consistent annual time-series data.
Some processes will only work with consistent yearly reporting. This means if you have two non-contiguous years of data or the datetime reporting is inconsistent, the process will break. This function attempts to infer the temporal frequency of the dataframe, or if that is impossible, to at least see whether the data would be consistent with annual reporting – e.g. if there is only a single year of data, it should all have the same date, and that date should correspond to January 1st of a given year.
This function is known to be flaky and needs to be re-written to deal with the edge cases better.
- Parameters
df_year (
pandas.DataFrame
) – A pandas DataFrame that might contain time-series data at annual resolution.year_col (str) – The column of the DataFrame in which the year is reported.
- Returns
True if df_year is found to be consistent with continuous annual time resolution, False otherwise.
- Return type
-
pudl.helpers.
is_doi
(doi)[source]¶ Determine if a string is a valid digital object identifier (DOI).
Function simply checks whether the offered string matches a regular expresssion – it doesn’t check whether the DOI is actually registered with the relevant authority.
-
pudl.helpers.
merge_dicts
(list_of_dicts)[source]¶ Merge multipe dictionaries together.
Given any number of dicts, shallow copy and merge into a new dict, precedence goes to key value pairs in latter dicts.
- Parameters
dict_args (list) – a list of dictionaries.
- Returns
dict
-
pudl.helpers.
merge_on_date_year
(df_date, df_year, on=(), how='inner', date_col='report_date', year_col='report_date')[source]¶ Merge two dataframes based on a shared year.
Some of our data is annual, and has an integer year column (e.g. FERC 1). Some of our data is annual, and uses a Date column (e.g. EIA 860), and some of our data has other temporal resolutions, and uses date columns (e.g. EIA 923 fuel receipts are monthly, EPA CEMS data is hourly). This function takes two data frames and merges them based on the year that the data pertains to. It requires one of the dataframes to have annual resolution, and allows the annual time to be described as either an integer year or a Date. The non-annual dataframe must have a Date column.
By default, it is assumed that both the date and annual columns to be merged on are called ‘report_date’ since that’s the common case when bringing together EIA860 and EIA923 data.
- Parameters
df_date – the dataframe with a more granular date column, the label of which is specified by date_col (report_date by default)
df_year – the dataframe with a column containing annual dates, the label of which is specified by year_col (report_date by default)
on – The list of columns to merge on, other than the year and date columns.
date_col – name of the date column to use to find the year to merge on. Must be a Date.
year_col – name of the year column to merge on. Must be a Date column with annual resolution.
- Returns
a dataframe with a date column, but no year columns, and only one copy of any shared columns that were not part of the list of columns to be merged on. The values from df1 are the ones which are retained for any shared, non-merging columns
- Return type
- Raises
ValueError – if the date or year columns are not found, or if the year column is found to be inconsistent with annual reporting.
-
pudl.helpers.
month_year_to_date
(df)[source]¶ Convert all pairs of year/month fields in a dataframe into Date fields.
This function finds all column names within a dataframe that match the regular expression ‘_month$’ and ‘_year$’, and looks for pairs that have identical prefixes before the underscore. These fields are assumed to describe a date, accurate to the month. The two fields are used to construct a new _date column (having the same prefix) and the month/year columns are then dropped.
Todo
This function needs to be combined with convert_to_date, and improved: * find and use a _day$ column as well * allow specification of default month & day values, if none are found. * allow specification of lists of year, month, and day columns to be combined, rather than automataically finding all the matching ones. * Do the Right Thing when invalid or NA values are encountered.
- Parameters
df (pandas.DataFrame) – The DataFrame in which to convert year/months fields to Date fields.
- Returns
A DataFrame in which the year/month fields have been converted into Date fields.
- Return type
-
pudl.helpers.
oob_to_nan
(df, cols, lb=None, ub=None)[source]¶ Set non-numeric values and those outside of a given rage to NaN.
- Parameters
df (pandas.DataFrame) – The dataframe containing values to be altered.
cols (iterable) – Labels of the columns whose values are to be changed.
lb – (number): Lower bound, below which values are set to NaN. If None, don’t use a lower bound.
ub – (number): Upper bound, below which values are set to NaN. If None, don’t use an upper bound.
- Returns
The altered DataFrame.
- Return type
-
pudl.helpers.
organize_cols
(df, cols)[source]¶ Organize columns into key ID & name fields & alphabetical data columns.
For readability, it’s nice to group a few key columns at the beginning of the dataframe (e.g. report_year or report_date, plant_id…) and then put all the rest of the data columns in alphabetical order.
- Parameters
df – The DataFrame to be re-organized.
cols – The columns to put first, in their desired output ordering.
- Returns
A dataframe with the same columns as the input DataFrame df, but with cols first, in the same order as they were passed in, and the remaining columns sorted alphabetically.
- Return type
-
pudl.helpers.
prep_dir
(dir_path, clobber=False)[source]¶ Create (or delete and recreate) a directory.
- Parameters
dir_path (path-like) – path to the directory that you are trying to clean and prepare.
clobber (bool) – If True and dir_path exists, it will be removed and replaced with a new, empty directory.
- Raises
FileExistsError – if a file or directory already exists at dir_path.
- Returns
Path to the created directory.
- Return type
-
pudl.helpers.
simplify_columns
(df)[source]¶ Simplify column labels for use as database fields.
- This transformation includes:
Replacing all non-alphanumeric characters with spaces.
Forcing all letters to be lower case.
Compacting internal whitespace.
Stripping leading and trailing whitespace.
- Parameters
df (pandas.DataFrame) – The DataFrame to clean.
- Returns
The cleaned DataFrame.
- Return type
Todo
Update docstring.
-
pudl.helpers.
strip_lower
(df, columns)[source]¶ Strip and compact whitespace, lowercase listed DataFrame columns.
First converts all listed columns (if present in df) to string type, then applies the str.strip() and str.lower() methods to them, and replaces all internal whitespace to a single space. The columns are assigned in place.
- Parameters
df (pandas.DataFrame) – DataFrame whose columns are being cleaned up.
columns (iterable) – The labels of the columns to be stripped and converted to lowercase.
- Returns
The whole DataFrame that was passed in, with the columns cleaned up in place, allowing method chaining.
- Return type
-
pudl.helpers.
sum_na
(self, axis=None, *, skipna=False, level=None, numeric_only=None, min_count=0, **kwargs)¶ Return the sum of the values for the requested axis.
This is equivalent to the method
numpy.sum
.- Parameters
axis ({index (0)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count (int, default 0) –
The required number of valid values to perform the operation. If fewer than
min_count
non-NA values are present the result will be NA.New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
scalar or Series (if level specified)
See also
Series.sum()
Return the sum.
Series.min()
Return the minimum.
Series.max()
Return the maximum.
Series.idxmin()
Return the index of the minimum.
Series.idxmax()
Return the index of the maximum.
DataFrame.sum()
Return the sum over the requested axis.
DataFrame.min()
Return the minimum over the requested axis.
DataFrame.max()
Return the maximum over the requested axis.
DataFrame.idxmin()
Return the index of the minimum over the requested axis.
DataFrame.idxmax()
Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.sum() 14
Sum using level names, as well as indices.
>>> s.sum(level='blooded') blooded warm 6 cold 8 Name: legs, dtype: int64
>>> s.sum(level=0) blooded warm 6 cold 8 Name: legs, dtype: int64
By default, the sum of an empty or all-NA Series is
0
.>>> pd.Series([]).sum() # min_count=0 is the default 0.0
This can be controlled with the
min_count
parameter. For example, if you’d like the sum of an empty series to be NaN, passmin_count=1
.>>> pd.Series([]).sum(min_count=1) nan
Thanks to the
skipna
parameter,min_count
handles all-NA and empty series identically.>>> pd.Series([np.nan]).sum() 0.0
>>> pd.Series([np.nan]).sum(min_count=1) nan
-
pudl.helpers.
verify_input_files
(ferc1_years, eia923_years, eia860_years, epacems_years, epacems_states, pudl_settings)[source]¶ Verify that all required data files exist prior to the ETL process.
- Parameters
ferc1_years (iterable) – Years of FERC1 data we’re going to import.
eia923_years (iterable) – Years of EIA923 data we’re going to import.
eia860_years (iterable) – Years of EIA860 data we’re going to import.
epacems_years (iterable) – Years of CEMS data we’re going to import.
epacems_states (iterable) – States of CEMS data we’re going to import.
data_dir (path-like) – Path to the top level of the PUDL datastore.
- Raises
FileNotFoundError – If any of the requested data is missing.