pudl.transform.classes#

Classes for defining & coordinating the transformation of tabular data sources.

We define our data transformations in four separate components:

  • The data being transformed (pd.DataFrame or pd.Series).

  • The functions & methods doing the transformations.

  • Non-data parameters that control the behavior of the transform functions & methods.

  • Classes that organize the functions & parameters that transform a given input table.

Separating out the transformation functions and the parameters that control them allows us to re-use the same transforms in many different contexts without duplicating the code.

Transform functions take data (either a Series or DataFrame) and a TransformParams object as inputs, and return transformed data of the same type that they consumed (Series or DataFrame). They operate on the data, and their particular behavior is controlled by the TransformParams. Like the TableTransformer classes discussed below, they are organized into 3 separate levels of abstraction:

  • general-purpose: always available from the abstract base class.

  • dataset-specific: used repeatedly by a dataset, from an intermediate abstract class.

  • table-specific: used only once for a particular table, defined in a concrete class.

These functions are not generally meant to be used independently of a TableTransformer class. They are wrapped by methods within the class definitions which handle logging and intermediate dataframe caching.

Using a hierarchy of TableTransformer classes to organize the functions and parameters allows us to apply a particular set of transformations uniformly across every table that’s part of a family of similar data. It also allows us to keep transform functions that apply only to a particular collection of tables, or to an individual table, separate from data they should not be used with.

Currently there are 3 levels of abstraction in the TableTransformer classes:

  • The AbstractTableTransformer abstract base class that defines methods useful across a wide range of data sources.

  • A dataset-specific abstract class that can define transforms which are consistently useful across many tables in the dataset (e.g. the pudl.transform.ferc1.Ferc1AbstractTableTransformer class).

  • Table-specific concrete classes that inherit from both of the higher levels, and contain any bespoke transformations or parameters that only pertain to that table. (e.g. the pudl.transform.ferc1.SteamPlantsFerc1TableTransformer class).

The TransformParams classes are immutable pydantic models that store the parameters which are passed to the transform functions / methods described above. These models are defined alongside the functions they’re used with. General purpose transforms have their parameter models defined in this module. Dataset-specific transforms should have their parameters defined in the module that defines the associated transform function. The MultiColumnTransformParams models are dictionaries keyed by column name, whose values must be per-column parameters that are all of the same type.

Specific TransformParams classes are instantiated using dictionaries of values defined in the per-dataset modules under pudl.transform.params e.g. pudl.transform.params.ferc1.

Module Contents#

Classes#

TransformParams

An immutable base model for transformation parameters.

MultiColumnTransformParams

A dictionary of TransformParams to apply to several columns in a table.

ColumnTransformFunc

Callback protocol defining a per-column transformation function.

TableTransformFunc

Callback protocol defining a per-table transformation function.

MultiColumnTransformFunc

Callback protocol defining a multi-column transformation function.

RenameColumns

A dictionary for mapping old column names to new column names in a dataframe.

StringNormalization

Options to control string normalization.

EnforceSnakeCase

Boolean parameter for enforce_snake_case().

StripNonNumericValues

Boolean parameter for strip_non_numeric_values().

StringCategories

Mappings to categorize the values in freeform string columns.

UnitConversion

A column-wise unit conversion which can also rename the column.

ValidRange

Column level specification of min and/or max values.

UnitCorrections

Fix outlying values resulting from apparent unit errors.

InvalidRows

Parameters that identify invalid rows to drop.

ReplaceWithNa

Parameters that replace certain values with NA.

SpotFixes

Parameters that replace certain values with a manually corrected value.

TableTransformParams

A collection of all the generic transformation parameters for a table.

AbstractTableTransformer

An abstract base table transformer class.

Functions#

multicol_transform_factory(→ MultiColumnTransformFunc)

Construct MultiColumnTransformFunc from a ColumnTransformFunc.

normalize_strings(→ pandas.Series)

Derive a canonical, simplified version of the strings in the column.

enforce_snake_case(→ pandas.Series)

Enforce snake_case for a column.

strip_non_numeric_values(→ pandas.Series)

Strip a column of any non-numeric values.

categorize_strings(→ pandas.Series)

Impose a controlled vocabulary on a freeform string column.

convert_units(→ pandas.Series)

Convert column units and rename the column to reflect the change.

nullify_outliers(→ pandas.Series)

Set any values outside the valid range to NA.

correct_units(→ pandas.DataFrame)

Correct outlying values based on inferred discrepancies in reported units.

drop_invalid_rows(→ pandas.DataFrame)

Drop rows with only invalid values in all specified columns.

replace_with_na(→ pandas.Series)

Replace specified values with NA.

spot_fix_values(→ pandas.DataFrame)

Manually fix one-off singular missing values and typos across a DataFrame.

cache_df(→ collections.abc.Callable[Ellipsis, pandas.DataFrame])

A decorator for caching dataframes within an AbstractTableTransformer.

Attributes#

logger

normalize_strings_multicol

A multi-column version of the normalize_strings() function.

enforce_snake_case_multicol

strip_non_numeric_values_multicol

categorize_strings_multicol

A multi-column version of the categorize_strings() function.

convert_units_multicol

A multi-column version of the convert_units() function.

nullify_outliers_multicol

A multi-column version of the nullify_outliers() function.

replace_with_na_multicol

A multi-column version of the replace_with_na() function.

pudl.transform.classes.logger[source]#
class pudl.transform.classes.TransformParams(/, **data: Any)[source]#

Bases: pydantic.BaseModel

An immutable base model for transformation parameters.

TransformParams instances created without any arguments should have no effect when applied by their associated function.

model_config[source]#
class pudl.transform.classes.MultiColumnTransformParams(/, **data: Any)[source]#

Bases: TransformParams

A dictionary of TransformParams to apply to several columns in a table.

These parameter dictionaries are dynamically generated for each multi-column transformation specified within a TableTransformParams object, and passed in to the MultiColumnTransformFunc callables which are constructed by multicol_transform_factory().

The keys are column names, values must all be the same type of TransformParams object. For examples, see e.g. the categorize_strings or convert_units elements within pudl.transform.ferc1.TRANSFORM_PARAMS.

The dictionary structure is not explicitly stated in this class, because it’s messy to use Pydantic for validation when the data to be validated isn’t contained within a Pydantic model. When Pydantic v2 is available, it will be easy, and we’ll do it: https://pydantic-docs.helpmanual.io/blog/pydantic-v2/#validation-without-a-model

single_param_type(info: pydantic.ValidationInfo)[source]#

Check that all TransformParams in the dictionary are of the same type.
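
For illustration, a dictionary with this structure (written as a plain dict, since the structure is enforced dynamically rather than declared on the model) might pair hypothetical column names with ValidRange parameters:

>>> from pudl.transform.classes import ValidRange
>>> nullify_outliers_params = {
...     "capacity_mw": ValidRange(lower_bound=0.0, upper_bound=5000.0),
...     "capacity_factor": ValidRange(lower_bound=0.0, upper_bound=1.0),
... }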

class pudl.transform.classes.ColumnTransformFunc[source]#

Bases: Protocol

Callback protocol defining a per-column transformation function.

__call__(col: pandas.Series, params: TransformParams) pandas.Series[source]#

Create a callable.

class pudl.transform.classes.TableTransformFunc[source]#

Bases: Protocol

Callback protocol defining a per-table transformation function.

__call__(df: pandas.DataFrame, params: TransformParams) pandas.DataFrame[source]#

Create a callable.

class pudl.transform.classes.MultiColumnTransformFunc[source]#

Bases: Protocol

Callback protocol defining a multi-column transformation function.

__call__(df: pandas.DataFrame, params: MultiColumnTransformParams) pandas.DataFrame[source]#

Create a callable.

pudl.transform.classes.multicol_transform_factory(col_func: ColumnTransformFunc, drop=True) MultiColumnTransformFunc[source]#

Construct MultiColumnTransformFunc from a ColumnTransformFunc.

This factory function saves us from having to iterate over dataframes in many separate places, applying the same transform functions with different parameters to multiple columns. Instead, we define a function that transforms a column given some parameters, and then easily apply that function to many columns using a dictionary of parameters (a MultiColumnTransformParams). Uniform logging output is also integrated into the constructed function.

Parameters:

col_func – A single column transform function.

Returns:

A multi-column transform function.

Examples

>>> class AddInt(TransformParams):
...     val: int
...
>>> def add_int(col: pd.Series, params: AddInt):
...     return col + params.val
...
>>> add_int_multicol = multicol_transform_factory(add_int)
...
>>> df = pd.DataFrame(
...     {
...         "col1": [1, 2, 3],
...         "col2": [10, 20, 30],
...     }
... )
...
>>> actual = add_int_multicol(
...     df,
...     params={
...         "col1": AddInt(val=1),
...         "col2": AddInt(val=2),
...     }
... )
...
>>> expected = pd.DataFrame(
...     {
...         "col1": [2, 3, 4],
...         "col2": [12, 22, 32],
...     }
... )
...
>>> pd.testing.assert_frame_equal(actual, expected)
class pudl.transform.classes.RenameColumns(/, **data: Any)[source]#

Bases: TransformParams

A dictionary for mapping old column names to new column names in a dataframe.

This parameter model has no associated transform function since it is used with the pd.DataFrame.rename() method. Because it renames all of the columns in a dataframe at once, it’s a table transformation (though it could also have been implemented as a column transform).

columns: dict[str, str][source]#
class pudl.transform.classes.StringNormalization(/, **data: Any)[source]#

Bases: TransformParams

Options to control string normalization.

Most of the string normalization is standardized and handled by the normalize_strings() function, since the normalized values of different columns need to be comparable, but there are a couple of useful column-specific parameterizations, which are encapsulated by this class.

remove_chars: str[source]#

A string of individual ASCII characters removed at the end of normalization.

nullable: bool = False[source]#

Whether the normalized string should be cast to pd.StringDtype.

pudl.transform.classes.normalize_strings(col: pandas.Series, params: StringNormalization) pandas.Series[source]#

Derive a canonical, simplified version of the strings in the column.

Transformations include:

  • Convert to pd.StringDtype.

  • Decompose composite unicode characters.

  • Translate to ASCII character equivalents if they exist.

  • Translate to lower case.

  • Strip leading and trailing whitespace.

  • Consolidate multiple internal whitespace characters into a single space.

Parameters:
  • col – series of strings to normalize.

  • params – settings enumerating any particular characters to remove, and whether the resulting series should be a nullable string.
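
A minimal usage sketch, with hypothetical input strings and parameters; the expected output follows from the transformations listed above:

>>> import pandas as pd
>>> from pudl.transform.classes import StringNormalization, normalize_strings
>>> col = pd.Series(["  Coal  (Bituminous)", "GAS turbine "])
>>> params = StringNormalization(remove_chars="()", nullable=True)
>>> normalize_strings(col, params)  # -> ["coal bituminous", "gas turbine"] as a nullable string series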

pudl.transform.classes.normalize_strings_multicol[source]#

A multi-column version of the normalize_strings() function.

class pudl.transform.classes.EnforceSnakeCase(/, **data: Any)[source]#

Bases: TransformParams

Boolean parameter for enforce_snake_case().

enforce_snake_case: bool[source]#
pudl.transform.classes.enforce_snake_case(col: pandas.Series, params: EnforceSnakeCase | None = None) pandas.Series[source]#

Enforce snake_case for a column.

Removes leading whitespace, lower-cases, replaces spaces with underscores, and removes any remaining characters that are not valid in snake_case.

Parameters:
  • col – a column of strings.

  • params – an EnforceSnakeCase parameter object. The default is None, which instantiates EnforceSnakeCase with enforce_snake_case=True, enforcing snake_case on col. If enforce_snake_case is False, the column is returned unaltered.
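
A short sketch of the expected behavior, using hypothetical input strings:

>>> import pandas as pd
>>> from pudl.transform.classes import enforce_snake_case
>>> col = pd.Series(["  Total Plant Cost", "Net Generation (MWh)"])
>>> enforce_snake_case(col)  # -> ["total_plant_cost", "net_generation_mwh"]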

pudl.transform.classes.enforce_snake_case_multicol[source]#
class pudl.transform.classes.StripNonNumericValues(/, **data: Any)[source]#

Bases: TransformParams

Boolean parameter for strip_non_numeric_values().

Stores a named boolean variable that is employed in strip_non_numeric_values() to determine whether or not the transform treatment should be applied. Pydantic 2.0 will allow validation of these simple variables without needing to define a model.

strip_non_numeric_values: bool[source]#
pudl.transform.classes.strip_non_numeric_values(col: pandas.Series, params: StripNonNumericValues | None = None) pandas.Series[source]#

Strip a column of any non-numeric values.

Uses the following options in pd.Series.str.extract():

  • an optional + or - followed by at least one digit followed by an optional decimal place followed by any number of digits (including zero)

  • OR an optional + or - followed by a period followed by at least one digit

Unless the found match is followed by a letter (this is done using a negative lookahead).

Note: This will not work with exponential values. If there are two possible matches of numeric values within a value, only the first match will be returned (ex: "FERC1 Licenses 1234 & 5678" will return "1234").

pudl.transform.classes.strip_non_numeric_values_multicol[source]#
class pudl.transform.classes.StringCategories(/, **data: Any)[source]#

Bases: TransformParams

Mappings to categorize the values in freeform string columns.

property mapping: dict[str, str][source]#

A 1-to-1 mapping appropriate for use with pd.Series.map().

categories: dict[str, set[str]][source]#

Mapping from a categorical string to the set of the values it should replace.

na_category: str = 'na_category'[source]#

All strings mapped to this category will be set to NA at the end.

The NA category is a special case because testing whether a value is NA is complex, given the many different values which can be used to represent NA. See categorize_strings() to see how it is used.

classmethod categories_are_disjoint(v)[source]#

Ensure that each string to be categorized only appears in one category.

classmethod categories_are_idempotent(v)[source]#

Ensure that every category contains the string it will map to.

This ensures that if the categorization is applied more than once, it doesn’t change the output.

pudl.transform.classes.categorize_strings(col: pandas.Series, params: StringCategories) pandas.Series[source]#

Impose a controlled vocabulary on a freeform string column.

Note that any value present in the data that is not mapped to one of the output categories will be set to NA.
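
A sketch using a hypothetical fuel-type vocabulary. Note that each category’s set contains its own label (idempotency), and that unmapped values, like those assigned to the NA category, come out as NA:

>>> import pandas as pd
>>> from pudl.transform.classes import StringCategories, categorize_strings
>>> params = StringCategories(
...     categories={
...         "coal": {"coal", "bituminous coal", "coal-fired"},
...         "gas": {"gas", "natural gas"},
...         "na_category": {"na_category", "unknown", ""},
...     }
... )
>>> col = pd.Series(["bituminous coal", "natural gas", "unknown", "wind"])
>>> categorize_strings(col, params)  # -> ["coal", "gas", NA, NA]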

pudl.transform.classes.categorize_strings_multicol[source]#

A multi-column version of the categorize_strings() function.

class pudl.transform.classes.UnitConversion(/, **data: Any)[source]#

Bases: TransformParams

A column-wise unit conversion which can also rename the column.

Allows simple linear conversions of the form y(x) = a*x + b. Note that the default values result in no alteration of the column.

Parameters:
  • multiplier – A multiplicative coefficient; “a” in the equation above. Set to 1.0 by default.

  • adder – An additive constant; “b” in the equation above. Set to 0.0 by default.

  • from_unit – A string that will be replaced in the input series name. If None or the empty string, the series is not renamed.

  • to_unit – The string from_unit is replaced with. If None or the empty string, the series is not renamed. Note that either both or neither of from_unit and to_unit can be left unset, but not just one of them.

property pattern: str[source]#

Regular expression based on from_unit for use with re.sub().

property repl: str[source]#

Regex backreference to parentheticals, for use with re.sub().

multiplier: float = 1.0[source]#
adder: float = 0.0[source]#
from_unit: str = ''[source]#
to_unit: str = ''[source]#
both_or_neither_units_are_none()[source]#

Ensure that either both or neither of the units strings are None.

inverse() UnitConversion[source]#

Construct a UnitConversion that is the inverse of self.

Allows a unit conversion to be undone. This is currently used in the context of validating the combination of UnitConversions that are used in the UnitCorrections parameter model.

pudl.transform.classes.convert_units(col: pandas.Series, params: UnitConversion) pandas.Series[source]#

Convert column units and rename the column to reflect the change.
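
For example, converting a hypothetical column from kWh to MWh (y = 0.001 * x) and renaming it accordingly:

>>> import pandas as pd
>>> from pudl.transform.classes import UnitConversion, convert_units
>>> params = UnitConversion(multiplier=1e-3, from_unit="kwh", to_unit="mwh")
>>> col = pd.Series([1000.0, 2500.0], name="net_generation_kwh")
>>> convert_units(col, params)  # -> [1.0, 2.5], renamed "net_generation_mwh"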

pudl.transform.classes.convert_units_multicol[source]#

A multi-column version of the convert_units() function.

class pudl.transform.classes.ValidRange(/, **data: Any)[source]#

Bases: TransformParams

Column level specification of min and/or max values.

lower_bound: float[source]#
upper_bound: float[source]#
classmethod upper_bound_gte_lower_bound(upper_bound: float, info: pydantic.ValidationInfo)[source]#

Require upper bound to be greater than or equal to lower bound.

pudl.transform.classes.nullify_outliers(col: pandas.Series, params: ValidRange) pandas.Series[source]#

Set any values outside the valid range to NA.

The column is coerced to be numeric.
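
A small sketch with hypothetical values:

>>> import pandas as pd
>>> from pudl.transform.classes import ValidRange, nullify_outliers
>>> params = ValidRange(lower_bound=0.0, upper_bound=100.0)
>>> col = pd.Series(["5", 42, -1, 1e6])  # strings are coerced to numeric
>>> nullify_outliers(col, params)  # -> [5.0, 42.0, NA, NA]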

pudl.transform.classes.nullify_outliers_multicol[source]#

A multi-column version of the nullify_outliers() function.

class pudl.transform.classes.UnitCorrections(/, **data: Any)[source]#

Bases: TransformParams

Fix outlying values resulting from apparent unit errors.

Note that since the unit correction depends on other columns in the dataframe to select a relevant subset of records, it is a table transform not a column transform, and so needs to know what column it applies to internally.

data_col: str[source]#

The label of the column to be modified.

cat_col: str[source]#

Label of a categorical column which will be used to select records to correct.

cat_val: str[source]#

Categorical value to use to select records for correction.

valid_range: ValidRange[source]#

The range of values expected to be found in data_col.

unit_conversions: list[UnitConversion][source]#

A list of unit conversions to use to identify errors and correct them.

classmethod no_column_rename(params: list[UnitConversion]) list[UnitConversion][source]#

Ensure that the unit conversions used in corrections don’t rename the column.

This constraint is imposed so that the same unit conversion definitions can be re-used both for unit corrections and normal columnwise unit conversions.

distinct_domains()[source]#

Verify that all unit conversions map distinct domains to the valid range.

If the domains being mapped to the valid range overlap, then it is ambiguous which unit conversion should be applied to the original value.

  • For all unit conversions calculate the range of original values that result from the inverse of the specified unit conversion applied to the valid ranges of values.

  • For all pairs of unit conversions verify that their original data ranges do not overlap with each other. We must also ensure that the original and converted ranges of each individual correction do not overlap. For example, if the valid range is from 1 to 10, and the unit conversion multiplies by 3, we’d be unable to distinguish a valid value of 6 from a value that should be corrected to be 2.

pudl.transform.classes.correct_units(df: pandas.DataFrame, params: UnitCorrections) pandas.DataFrame[source]#

Correct outlying values based on inferred discrepancies in reported units.

In many cases we know that a particular column in the database should have a value within a particular range (e.g. the heat content of a ton of coal is a well defined physical quantity – it can be 15 mmBTU/ton or 22 mmBTU/ton, but it can’t be 1 mmBTU/ton or 100 mmBTU/ton).

Sometimes these fields are reported in the wrong units (e.g. kWh of electricity generated rather than MWh) resulting in several recognizable populations of reported values showing up at different ranges of value within the data. In cases where the unit conversion and range of valid values are such that these populations do not overlap, it’s possible to convert them to the canonical units fairly unambiguously.

This issue is especially common in the context of fuel attributes, because fuels are reported in terms of many different units. Because fuels with different units are often reported in the same column, and different fuels have different valid ranges of values, it’s also necessary to be able to select only a subset of the data that pertains to a particular fuel. This means filtering based on another column, so the function needs to have access to the whole dataframe.

Data values which are not found in one of the expected ranges are set to NA.
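
A hedged sketch: suppose a hypothetical coal heat content column is sometimes reported in BTU/lb rather than MMBTU/ton. Converting BTU/lb to MMBTU/ton means multiplying by 0.002, and because the two populations of values don’t overlap, the correction is unambiguous. Note that the unit conversions used here must not rename the column, so from_unit and to_unit are left unset:

>>> import pandas as pd
>>> from pudl.transform.classes import (
...     UnitConversion, UnitCorrections, ValidRange, correct_units,
... )
>>> df = pd.DataFrame({
...     "fuel_type_code": ["coal", "coal", "gas"],
...     "fuel_mmbtu_per_unit": [25.0, 12500.0, 1.05],
... })
>>> params = UnitCorrections(
...     data_col="fuel_mmbtu_per_unit",
...     cat_col="fuel_type_code",
...     cat_val="coal",
...     valid_range=ValidRange(lower_bound=10.0, upper_bound=30.0),
...     unit_conversions=[UnitConversion(multiplier=2e-3)],  # BTU/lb -> MMBTU/ton
... )
>>> correct_units(df, params)  # coal value 12500.0 (BTU/lb) -> 25.0 MMBTU/ton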

class pudl.transform.classes.InvalidRows(/, **data: Any)[source]#

Bases: TransformParams

Parameters that identify invalid rows to drop.

invalid_values: Annotated[set[Any], Field(min_length=1)] | None[source]#

A set of values that should be considered invalid in the selected columns.

required_valid_cols: list[str] | None[source]#

List of columns passed into pd.DataFrame.filter() as the items argument.

allowed_invalid_cols: list[str] | None[source]#

List of columns to exclude when checking rows for valid values to preserve.

Used to construct an items argument for pd.DataFrame.filter(). This option is useful when a table is wide, and specifying all required_valid_cols would be tedious.

like: str | None[source]#

A string to use as the like argument to pd.DataFrame.filter().

regex: str | None[source]#

A regular expression to use as the regex argument to pd.DataFrame.filter().

one_filter_argument()[source]#

Validate that only one argument is specified for pd.DataFrame.filter().

pudl.transform.classes.drop_invalid_rows(df: pandas.DataFrame, params: InvalidRows) pandas.DataFrame[source]#

Drop rows with only invalid values in all specified columns.

This method finds all rows in a dataframe that contain ONLY invalid data in ALL of the columns that we are checking, and drops those rows, logging the % of all rows that were dropped.
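
For example, with hypothetical data, dropping the one row whose non-ID columns are all invalid:

>>> import pandas as pd
>>> from pudl.transform.classes import InvalidRows, drop_invalid_rows
>>> df = pd.DataFrame({
...     "plant_id": [1, 2],
...     "capacity_mw": [450.0, 0],
...     "net_generation_mwh": [pd.NA, 0],
... })
>>> params = InvalidRows(
...     invalid_values={0, pd.NA},
...     allowed_invalid_cols=["plant_id"],  # don't check the ID column
... )
>>> drop_invalid_rows(df, params)  # drops the row for plant 2 only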

class pudl.transform.classes.ReplaceWithNa(/, **data: Any)[source]#

Bases: TransformParams

Parameters that replace certain values with NA.

The categorize_strings() function also replaces bad values, but it requires all the values in the column to fall under a defined category. This function allows you to replace particular values with NA without having to categorize the rest of the column.

replace_with_na: list[str][source]#

A list of values that should be replaced with NA.

pudl.transform.classes.replace_with_na(col: pandas.Series, params: ReplaceWithNa) pandas.Series[source]#

Replace specified values with NA.
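
A minimal sketch with hypothetical values:

>>> import pandas as pd
>>> from pudl.transform.classes import ReplaceWithNa, replace_with_na
>>> params = ReplaceWithNa(replace_with_na=["", "-", "none"])
>>> col = pd.Series(["coal", "", "-", "gas"])
>>> replace_with_na(col, params)  # -> ["coal", NA, NA, "gas"]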

pudl.transform.classes.replace_with_na_multicol[source]#

A multi-column version of the replace_with_na() function.

class pudl.transform.classes.SpotFixes(/, **data: Any)[source]#

Bases: TransformParams

Parameters that replace certain values with a manually corrected value.

idx_cols: list[str][source]#

The column(s) used to identify a record.

fix_cols: list[str][source]#

The column(s) to be fixed.

expect_unique: bool[source]#

Set to True if each fix should correspond to only one row.

spot_fixes: list[tuple[str | int | float | bool, Ellipsis]][source]#

A tuple containing the values of the idx_cols and fix_cols for each fix.

pudl.transform.classes.spot_fix_values(df: pandas.DataFrame, params: SpotFixes) pandas.DataFrame[source]#

Manually fix one-off singular missing values and typos across a DataFrame.

Use this function to correct typos, missing values that are easily identified through manual investigation of records, and consistent issues affecting a small number of records (e.g. incorrectly entered capacity data for 2-3 plants).

From an instance of SpotFixes, this function takes a list of sets of manual fixes and applies them to the specified records in a given dataframe. Each set of fixes contains a list of identifying columns, a list of columns to be fixed, and the values to be updated. A ValueError will be raised if spot-fixed datatypes do not match those of the input dataframe. For each set of fixes, the expect_unique parameter allows users to specify whether each fix should be applied to only one row.

Returns:

The same input DataFrame but with some spot fixes corrected.
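
A sketch of a single hypothetical fix, correcting a mistyped capacity for one plant-year:

>>> import pandas as pd
>>> from pudl.transform.classes import SpotFixes, spot_fix_values
>>> df = pd.DataFrame({
...     "plant_id": [123, 456],
...     "report_year": [2020, 2020],
...     "capacity_mw": [45000.0, 300.0],  # 45000.0 is a hypothetical typo
... })
>>> params = SpotFixes(
...     idx_cols=["plant_id", "report_year"],
...     fix_cols=["capacity_mw"],
...     expect_unique=True,
...     spot_fixes=[(123, 2020, 450.0)],  # (idx values..., then fixed values)
... )
>>> spot_fix_values(df, params)  # capacity for plant 123 becomes 450.0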

class pudl.transform.classes.TableTransformParams(/, **data: Any)[source]#

Bases: TransformParams

A collection of all the generic transformation parameters for a table.

This class is used to instantiate and contain all of the individual TransformParams objects that are associated with transforming a given table. It can be instantiated using one of the table-level dictionaries of parameters defined in the dataset-specific modules in pudl.transform.params

Data source-specific TableTransformParams classes should be defined in the data source-specific transform modules and inherit from this class. See e.g. pudl.transform.ferc1.Ferc1TableTransformParams

convert_units: dict[str, UnitConversion][source]#
categorize_strings: dict[str, StringCategories][source]#
nullify_outliers: dict[str, ValidRange][source]#
normalize_strings: dict[str, StringNormalization][source]#
strip_non_numeric_values: dict[str, StripNonNumericValues][source]#
replace_with_na: dict[str, ReplaceWithNa][source]#
correct_units: list[UnitCorrections] = [][source]#
rename_columns: RenameColumns[source]#
drop_invalid_rows: list[InvalidRows] = [][source]#
spot_fix_values: list[SpotFixes] = [][source]#
classmethod from_dict(params: dict[str, Any]) TableTransformParams[source]#

Construct TableTransformParams from a dictionary of keyword arguments.

Typically these will be the table-level dictionaries defined in the dataset-specific modules in the pudl.transform.params subpackage. See also the TableTransformParams.from_id() method.
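
A hedged sketch of the kind of dictionary this method accepts; the keys mirror the attributes above, and the column names are hypothetical:

>>> from pudl.transform.classes import TableTransformParams
>>> params = TableTransformParams.from_dict(
...     {
...         "rename_columns": {"columns": {"respondent_id": "utility_id"}},
...         "convert_units": {
...             "net_generation_kwh": {
...                 "multiplier": 1e-3, "from_unit": "kwh", "to_unit": "mwh",
...             },
...         },
...         "nullify_outliers": {
...             "capacity_factor": {"lower_bound": 0.0, "upper_bound": 1.0},
...         },
...     }
... )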

classmethod from_id(table_id: enum.Enum) TableTransformParams[source]#

A factory method that looks up transform parameters based on table_id.

This is a shortcut, which allows us to construct the parameter models based on the table they are associated with, without having to pass in a potentially large nested data structure, which gets messy in Dagster.

pudl.transform.classes.cache_df(key: str = 'main') collections.abc.Callable[Ellipsis, pandas.DataFrame][source]#

A decorator for caching dataframes within an AbstractTableTransformer.

It’s often useful during development or debugging to be able to track the evolution of data as it passes through several transformation steps. Especially when some of the steps are time consuming, it’s nice to still get a copy of the last known state of the data when a transform raises an exception and fails.

This decorator lets you easily save a copy of the dataframe being returned by a class method for later reference, before moving on to the next step. Each unique key used within a given AbstractTableTransformer instance results in a new dataframe being cached. Re-using the same key will overwrite previously cached dataframes that were stored with that key.

Saving many intermediate steps can provide lots of detailed information, but will use more memory. Updating the same cached dataframe as it successfully passes through each step lets you access the last known state it had before an error occurred.

This decorator requires that the decorated function return a single pd.DataFrame, but it can take any type of inputs.

There’s a lot of nested functions in here. For a more thorough explanation, see: https://realpython.com/primer-on-python-decorators/#fancy-decorators

Parameters:

key – The key that will be used to store and look up the cached dataframe in the internal self._cached_dfs dictionary.

Returns:

The decorated class method.

class pudl.transform.classes.AbstractTableTransformer(params: TableTransformParams | None = None, cache_dfs: bool = False, clear_cached_dfs: bool = True, **kwargs)[source]#

Bases: abc.ABC

An abstract base table transformer class.

This class provides methods for applying the general purpose transform functions to dataframes. These methods should each log that they are running, and the table_id of the table they’re being applied to. By default they should obtain their parameters from the params which are stored in the class, but should allow other parameters to be passed in.

The class also provides a template for coordinating the high level flow of data through the transformations. The main coordinating function that’s used to run the full transformation is AbstractTableTransformer.transform(), and the transform is broken down into 3 distinct steps: start, main, and end. Those individual steps need to be defined by child classes. Usually the start and end methods will handle transformations that need to be applied uniformly across all the tables in a given dataset, with the main step containing transformations that are specific to a particular table.

In development it’s often useful to be able to review the state of the data at various stages as it progresses through the transformation. The cache_df() decorator defined above can be applied to individual transform methods or the start, main, and end methods defined in the child classes, to allow intermediate dataframes to be reviewed after the fact. Whether to cache dataframes and whether to delete them upon successful completion of the transform is controlled by flags set when the TableTransformer class is created.

Table-specific transform parameters need to be associated with the class. They can either be passed in explicitly when the class is instantiated, or looked up based on the table_id associated with the class. See TableTransformParams.from_id()

The call signature of the AbstractTableTransformer.transform_start() method accepts any type of inputs by default, and returns a single pd.DataFrame. Later transform steps are assumed to take a single dataframe as input, and return a single dataframe. Since Python is lazy about enforcing types and interfaces you can get away with other kinds of arguments when they’re sometimes necessary, but this isn’t a good arrangement and we should figure out how to do it right. See the pudl.transform.ferc1.SteamPlantsTableTransformer class for an example.
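
A minimal sketch of what a concrete child class might look like, assuming a hypothetical table_id enum and that transform parameters for that table exist; the cache_df() decorator is applied so intermediate results can be inspected:

>>> import enum
>>> import pandas as pd
>>> from pudl.transform.classes import AbstractTableTransformer, cache_df
>>> class TableId(enum.Enum):  # hypothetical table ID enum
...     FUEL_TEST = "fuel_test"
...
>>> class FuelTestTableTransformer(AbstractTableTransformer):
...     table_id = TableId.FUEL_TEST
...     @cache_df(key="start")
...     def transform_start(self, df: pd.DataFrame) -> pd.DataFrame:
...         # Consolidate inputs into one dataframe; here a pass-through.
...         return self.rename_columns(df)
...     @cache_df(key="main")
...     def transform_main(self, df: pd.DataFrame) -> pd.DataFrame:
...         # Table-specific cleaning, driven by self.params.
...         return self.convert_units(df)
...     @cache_df(key="end")
...     def transform_end(self, df: pd.DataFrame) -> pd.DataFrame:
...         return self.enforce_schema(df)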

table_id: enum.Enum[source]#

Name of the PUDL database table that this table transformer produces.

Must be defined in the database schema / metadata. This ID is used to instantiate the appropriate TableTransformParams object.

cache_dfs: bool = False[source]#

Whether to cache copies of intermediate dataframes until transformation is done.

When True, the TableTransformer will save dataframes internally at each step of the transform, so that they can be inspected easily if the transformation fails.

clear_cached_dfs: bool = True[source]#

Determines whether cached dataframes are deleted at the end of the transform.

_cached_dfs: dict[str, pandas.DataFrame][source]#

Cached intermediate dataframes for use in development and debugging.

The dictionary keys are the strings passed to the cache_df() method decorator.

parameter_model[source]#

The pydantic model that is used to contain & instantiate parameters.

In child classes this should be replaced with the data source-specific TableTransformParams class, if it has been defined.

params: AbstractTableTransformer.parameter_model[source]#

The parameters that will be used to control the transformation functions.

This attribute is of type parameter_model which is defined above. This type varies across datasets and is used to construct and validate the parameters, so it needs to be set separately in child classes. See pudl.transform.ferc1.Ferc1AbstractTableTransformer for an example.

abstract transform_start(*args, **kwargs) pandas.DataFrame[source]#

Transformations applied to many tables within a dataset at the beginning.

This method should be implemented by the dataset-level abstract table transformer class. It does not specify its inputs because different data sources need different inputs. E.g. the FERC 1 transform needs 2 XBRL derived dataframes, and one DBF derived dataframe, while (most) EIA tables just receive and return a single dataframe.

This step is often used to organize initial transformations that are applied uniformly across all the tables in a dataset.

At the end of this step, all the inputs should have been consolidated into a single dataframe to return.

abstract transform_main(df: pandas.DataFrame, **kwargs) pandas.DataFrame[source]#

The method used to do most of the table-specific transformations.

Typically the transformations grouped together into this method will be unique to the table that is being transformed. Generally this method will take and return a single dataframe, and that pattern is implemented in the AbstractTableTransformer.transform() method. In cases where transforms take or return more than one dataframe, you will need to define a new transform method within the child class. See SteamPlantsTableTransformer as an example.

abstract transform_end(df: pandas.DataFrame) pandas.DataFrame[source]#

Transformations applied to many tables within a dataset at the end.

This method should be implemented by the dataset-level abstract table transformer class. It should do any standard cleanup that’s required after the table-specific transformations have been applied. E.g. enforcing the table’s database schema and dropping invalid records based on parameterized criteria.

transform(*args, **kwargs) pandas.DataFrame[source]#

Apply all specified transformations to the appropriate input dataframes.

rename_columns(df: pandas.DataFrame, params: RenameColumns | None = None, **kwargs) pandas.DataFrame[source]#

Rename the whole collection of dataframe columns using input params.

Log if there’s any mismatch between the columns in the dataframe, and the columns that have been defined in the mapping for renaming.

normalize_strings(df: pandas.DataFrame, params: dict[str, bool] | None = None) pandas.DataFrame[source]#

Method wrapper for string normalization.

strip_non_numeric_values(df: pandas.DataFrame, params: dict[str, bool] | None = None) pandas.DataFrame[source]#

Method wrapper for stripping non-numeric values.

categorize_strings(df: pandas.DataFrame, params: dict[str, StringCategories] | None = None) pandas.DataFrame[source]#

Method wrapper for string categorization.

nullify_outliers(df: pandas.DataFrame, params: dict[str, ValidRange] | None = None) pandas.DataFrame[source]#

Method wrapper for nullifying outlying values.

convert_units(df: pandas.DataFrame, params: dict[str, UnitConversion] | None = None) pandas.DataFrame[source]#

Method wrapper for columnwise unit conversions.

correct_units(df: pandas.DataFrame, params: UnitCorrections | None = None) pandas.DataFrame[source]#

Apply all specified unit corrections to the table in order.

Note: this is a table transform, not a multi-column transform.

drop_invalid_rows(df: pandas.DataFrame, params: list[InvalidRows] | None = None) pandas.DataFrame[source]#

Drop rows with only invalid values in all specificed columns.

replace_with_na(df: pandas.DataFrame, params: dict[str, ReplaceWithNa] | None = None) pandas.DataFrame[source]#

Replace specified values with NA.

spot_fix_values(df: pandas.DataFrame, params: list[SpotFixes] | None = None) pandas.DataFrame[source]#

Replace specified values with manually corrected values.

enforce_schema(df: pandas.DataFrame) pandas.DataFrame[source]#

Drop columns not in the DB schema and enforce specified types.