pudl.analysis.record_linkage.embed_dataframe

Tools for embedding a DataFrame to create feature matrix for models.

Module Contents

Classes

FeatureMatrix

Class to wrap a feature matrix returned from dataframe embedding.

TransformStep

TransformSteps can be combined to vectorize one or more columns.

ColumnVectorizer

Define a set of transformations to apply to one or more columns.

TextVectorizer

Implement TransformStep for sklearn.feature_extraction.text.TfidfVectorizer.

CategoricalVectorizer

Implement TransformStep for sklearn.preprocessing.OneHotEncoder.

NumericalVectorizer

Implement TransformStep for MinMaxScaler.

NumericalNormalizer

Implement TransformStep for Normalizer.

ColumnCleaner

Implement TransformStep for cleaning functions.

NameCleaner

Implement TransformStep for CompanyNameCleaner.

FuelTypeFiller

Fill missing fuel types from another column.

StringSimilarityScorer

Vectorize two string columns with Jaro-Winkler similarity.

NumericSimilarityScorer

Vectorize two numeric columns with a similarity score.

Functions

log_dataframe_embedder_config(embedder_name, ...)

Log embedder config to mlflow experiment.

dataframe_embedder_factory(name_prefix, vectorizers)

Return a configured op graph to embed an input dataframe.

dataframe_cleaner_factory(name_prefix, vectorizers)

Return a configured op graph to clean an input dataframe.

_apply_cleaning_func(df[, function_key])

_extract_keyword_from_column(→ pandas.Series)

Extract keywords contained in a Pandas series with a regular expression.

_fill_fuel_type_from_name(→ pandas.DataFrame)

Impute missing fuel type data from a name column.

_apply_string_similarity_func(df, function_key, col1, col2)

_apply_numeric_similarity_func(df, function_key, col1, ...)

Attributes

pudl.analysis.record_linkage.embed_dataframe.logger[source]
class pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix[source]

Class to wrap a feature matrix returned from dataframe embedding.

Depending on the transformations applied, a feature matrix may be a sparse or a dense matrix. Using this wrapper enables Dagster's type checking while allowing both dense and sparse matrices underneath.

matrix: numpy.ndarray | scipy.sparse.csr_matrix[source]
index: pandas.Index[source]
class pudl.analysis.record_linkage.embed_dataframe.TransformStep(/, **data: Any)[source]

Bases: pydantic.BaseModel, abc.ABC

TransformSteps can be combined to vectorize one or more columns.

This class defines a very simple interface for TransformSteps: a TransformStep takes configuration and implements the method as_transformer.

name: str[source]
abstract as_transformer() → sklearn.base.BaseEstimator[source]

This method should use configuration to produce a sklearn.base.BaseEstimator.
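
The pattern can be pictured with a minimal, dependency-free sketch. It is only an illustration: it uses a dataclass in place of pydantic.BaseModel and a plain callable in place of a sklearn.base.BaseEstimator, and the names StepSketch and LowercaseStep are hypothetical, not part of PUDL.

```python
from abc import ABC, abstractmethod
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class StepSketch(ABC):
    """Hypothetical stand-in for TransformStep: holds config, builds a transformer."""

    name: str

    @abstractmethod
    def as_transformer(self) -> Callable[[list[str]], list[str]]:
        """Use the stored configuration to build the transformer."""


@dataclass
class LowercaseStep(StepSketch):
    """Hypothetical concrete step with a single configuration option."""

    name: str = "lowercase"
    strip: bool = True

    def as_transformer(self) -> Callable[[list[str]], list[str]]:
        def transform(values: list[str]) -> list[str]:
            # Lowercase every value, then optionally trim whitespace.
            cleaned = [v.lower() for v in values]
            return [v.strip() for v in cleaned] if self.strip else cleaned

        return transform


step = LowercaseStep()
print(step.as_transformer()([" Washington Hydro "]))  # ['washington hydro']
```

Keeping the configuration separate from the transformer it produces means the configuration can be validated and logged without building the transformer.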

class pudl.analysis.record_linkage.embed_dataframe.ColumnVectorizer(/, **data: Any)[source]

Bases: pydantic.BaseModel

Define a set of transformations to apply to one or more columns.

transform_steps: list[TransformStep][source]
weight: float = 1.0[source]
columns: list[str][source]
as_pipeline()[source]

Return sklearn.pipeline.Pipeline with configuration.
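
A sklearn.pipeline.Pipeline feeds the output of each step into the next one. The dependency-free sketch below shows only that chaining idea; chain, fill_missing and scale_to_unit are hypothetical names, and the real method presumably builds a Pipeline from transform_steps rather than composing plain functions.

```python
from collections.abc import Callable
from functools import reduce

Transform = Callable[[list], list]


def chain(steps: list[Transform]) -> Transform:
    """Compose steps so each one receives the previous step's output."""

    def run(values: list) -> list:
        return reduce(lambda acc, step: step(acc), steps, values)

    return run


def fill_missing(values: list) -> list:
    # Replace missing entries with 0.0 before scaling.
    return [0.0 if v is None else v for v in values]


def scale_to_unit(values: list) -> list:
    # Divide by the largest value; fall back to 1.0 if that value is 0.
    top = max(values) or 1.0
    return [v / top for v in values]


pipeline = chain([fill_missing, scale_to_unit])
print(pipeline([10.0, None, 40.0]))  # [0.25, 0.0, 1.0]
```

Order matters here: scaling before filling would fail on the missing value, which is why the steps are kept as an ordered list.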

as_config_dict()[source]

Return config dict formatted for logging to mlflow.

pudl.analysis.record_linkage.embed_dataframe.log_dataframe_embedder_config(embedder_name: str, vectorizers: dict[str, ColumnVectorizer], experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker)[source]

Log embedder config to mlflow experiment.

pudl.analysis.record_linkage.embed_dataframe.dataframe_embedder_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]

Return a configured op graph to embed an input dataframe.

pudl.analysis.record_linkage.embed_dataframe.dataframe_cleaner_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]

Return a configured op graph to clean an input dataframe.

class pudl.analysis.record_linkage.embed_dataframe.TextVectorizer(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for sklearn.feature_extraction.text.TfidfVectorizer.

name: str = 'tfidf_vectorizer'[source]
options: dict[source]
as_transformer()[source]

Return configured TfidfVectorizer.

class pudl.analysis.record_linkage.embed_dataframe.CategoricalVectorizer(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for sklearn.preprocessing.OneHotEncoder.

name: str = 'one_hot_encoder_vectorizer'[source]
options: dict[source]
as_transformer()[source]

Return configured OneHotEncoder.
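
As a reminder of what one-hot encoding produces, here is a small dependency-free sketch. It mirrors the idea behind OneHotEncoder (one indicator column per category) but is not the scikit-learn implementation; one_hot is a hypothetical helper, and the fuel names are made-up sample data.

```python
def one_hot(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # Exactly one column is set per row.
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["gas", "hydro", "gas", "coal"])
print(categories)  # ['coal', 'gas', 'hydro']
print(rows)        # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```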

class pudl.analysis.record_linkage.embed_dataframe.NumericalVectorizer(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for MinMaxScaler.

name: str = 'numerical_vectorizer'[source]
options: dict[source]
as_transformer()[source]

Return configured MinMaxScaler.
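
For orientation, min-max scaling with default settings maps each column linearly onto the range [0, 1]. The sketch below is a dependency-free illustration of that arithmetic, not the scikit-learn code; min_max_scale is a hypothetical helper and any values passed through options could change the behaviour.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Constant column: map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


print(min_max_scale([50.0, 100.0, 250.0]))  # [0.0, 0.25, 1.0]
```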

class pudl.analysis.record_linkage.embed_dataframe.NumericalNormalizer(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for Normalizer.

name: str = 'numerical_normalizer'[source]
options: dict[source]
as_transformer()[source]

Return configured Normalizer.
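
Unlike min-max scaling, which works column by column, scikit-learn's Normalizer rescales each sample (row) to unit norm, using the L2 norm by default. A dependency-free sketch of that default follows; l2_normalize is a hypothetical helper, not part of PUDL or scikit-learn.

```python
import math


def l2_normalize(row: list[float]) -> list[float]:
    """Scale one sample (row) so its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # An all-zero row is left unchanged.
    return [v / norm for v in row]


print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```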

pudl.analysis.record_linkage.embed_dataframe._apply_cleaning_func(df, function_key: str = None)[source]
class pudl.analysis.record_linkage.embed_dataframe.ColumnCleaner(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for cleaning functions.

name: str = 'column_cleaner'[source]
cleaning_function: str[source]
as_transformer()[source]

Return configured cleaning function transformer.

class pudl.analysis.record_linkage.embed_dataframe.NameCleaner(/, **data: Any)[source]

Bases: TransformStep

Implement TransformStep for CompanyNameCleaner.

name: str = 'name_cleaner'[source]
company_cleaner: pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner[source]
return_as_dframe: bool = False[source]
as_transformer()[source]

Return configured CompanyNameCleaner.

class pudl.analysis.record_linkage.embed_dataframe.FuelTypeFiller(/, **data: Any)[source]

Bases: TransformStep

Fill missing fuel types from another column.

name: str = 'fuel_type_filler'[source]
fuel_type_col: str = 'fuel_type_code_pudl'[source]
name_col: str = 'plant_name'[source]
as_transformer()[source]

Return configured FuelTypeFiller.

pudl.analysis.record_linkage.embed_dataframe._extract_keyword_from_column(ser: pandas.Series, keyword_list: list[str]) → pandas.Series[source]

Extract keywords contained in a Pandas series with a regular expression.

pudl.analysis.record_linkage.embed_dataframe._fill_fuel_type_from_name(df: pandas.DataFrame, fuel_type_col: str, name_col: str) → pandas.DataFrame[source]

Impute missing fuel type data from a name column.

If a record's fuel type code is missing but a fuel type appears in its plant name (e.g. “Washington Hydro”), fill in the PUDL fuel type code for that record from the name.
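
The imputation described above can be sketched without pandas. Everything here is an assumption made for illustration: the helper names, the keyword list, the whole-word and case-insensitive matching, and the use of plain dicts instead of a DataFrame. The actual functions operate on pandas objects and may match differently.

```python
import re
from typing import Optional


def extract_keyword(text: Optional[str], keywords: list[str]) -> Optional[str]:
    """Return the first keyword found in text as a whole word, else None."""
    if not text:
        return None
    pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def fill_fuel_type(
    records: list[dict], fuel_type_col: str, name_col: str, keywords: list[str]
) -> list[dict]:
    """Fill only missing fuel types; existing values are never overwritten."""
    filled = []
    for rec in records:
        rec = dict(rec)  # Copy so the input is not mutated.
        if rec.get(fuel_type_col) is None:
            rec[fuel_type_col] = extract_keyword(rec.get(name_col), keywords)
        filled.append(rec)
    return filled


plants = [
    {"plant_name": "Washington Hydro", "fuel_type_code_pudl": None},
    {"plant_name": "Solar Ridge", "fuel_type_code_pudl": "wind"},
    {"plant_name": "Unit 7", "fuel_type_code_pudl": None},
]
result = fill_fuel_type(
    plants, "fuel_type_code_pudl", "plant_name", ["hydro", "solar", "wind"]
)
print([r["fuel_type_code_pudl"] for r in result])  # ['hydro', 'wind', None]
```

The second record keeps its reported value even though the name suggests something else, since only missing codes are filled.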

pudl.analysis.record_linkage.embed_dataframe._apply_string_similarity_func(df, function_key: str, col1: str, col2: str)[source]
class pudl.analysis.record_linkage.embed_dataframe.StringSimilarityScorer(/, **data: Any)[source]

Bases: TransformStep

Vectorize two string columns with Jaro-Winkler similarity.

name: str = 'string_sim'[source]
metric: str[source]
col1: str[source]
col2: str[source]
as_transformer()[source]

Return configured Jaro-Winkler similarity function.
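
For readers unfamiliar with the metric, a textbook pure-Python version of Jaro-Winkler similarity is sketched below. It is a reference illustration only: the scorer presumably delegates to a string-matching library selected by metric, and the function names and the default prefix weight of 0.1 are assumptions of this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half each.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```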

pudl.analysis.record_linkage.embed_dataframe._apply_numeric_similarity_func(df, function_key: str, col1: str, col2: str, scale: float, offset: float, origin: float, missing_value: float, label: str)[source]
class pudl.analysis.record_linkage.embed_dataframe.NumericSimilarityScorer(/, **data: Any)[source]

Bases: TransformStep

Vectorize two numeric columns with a similarity score.

If two values are the same, the similarity is 1; in the case of complete disagreement, it is 0. The implementation is adapted from the numeric comparison in the recordlinkage Python package and is similar to numeric comparison in Elasticsearch, a full-text search tool.

Parameters:
  • name – The name of the transformation step. Default is numeric_sim.

  • col1 – The name of the first column to compare. Must be a numeric column.

  • col2 – The name of the second column to compare. Must be a numeric column.

  • output_name – The name of the output Series of compared values.

  • method – The metric used. Options are “exponential”, “linear”, “exact”. Default is “linear”.

  • scale – The rate of decay, how quickly the score should drop the further from the origin that a value lies. Default is 1.0.

  • offset – Setting a nonzero offset expands the central point to cover a range of values instead of just the single point specified by the origin. Default is 0.

  • origin – The central point, or the best possible value for the difference between records. Differences that fall at the origin will get a similarity score of 1.0. The default is 0.

  • missing_value – The value if one or both records have a missing value on the compared field. Default 0.

name: str = 'numeric_sim'[source]
col1: str[source]
col2: str[source]
output_name: str[source]
method: str = 'linear'[source]
scale: float = 1.0[source]
offset: float = 0.0[source]
origin: float = 0.0[source]
missing_value: float = 0.0[source]
as_transformer()[source]

Return configured numeric similarity function.
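
To make the roles of scale, offset, origin and missing_value concrete, here is a scalar, dependency-free sketch of decay-style scoring. The formulas are assumptions modeled on the common decay-function convention in which the score is 0.5 at a distance of scale beyond the offset; PUDL's exact formulas, its handling of “exact”, and its vectorized pandas implementation may differ. numeric_similarity is a hypothetical name.

```python
import math
from typing import Optional


def numeric_similarity(
    a: Optional[float],
    b: Optional[float],
    method: str = "linear",
    scale: float = 1.0,
    offset: float = 0.0,
    origin: float = 0.0,
    missing_value: float = 0.0,
) -> float:
    """Score the difference a - b between 0 and 1 (assumed formulas)."""
    if a is None or b is None or math.isnan(a) or math.isnan(b):
        return missing_value
    # Distance from the origin, ignoring anything inside the offset band.
    d = max(abs((a - b) - origin) - offset, 0.0)
    if method == "exact":
        return 1.0 if d == 0 else 0.0
    if method == "linear":
        # Falls to 0.5 at d == scale and reaches 0 at d == 2 * scale.
        return max(1.0 - d / (2.0 * scale), 0.0)
    if method == "exponential":
        # Halves every `scale` units and never quite reaches 0.
        return 2.0 ** (-d / scale)
    raise ValueError(f"Unknown method: {method}")


print(numeric_similarity(2010, 2011, method="linear", scale=2.0))       # 0.75
print(numeric_similarity(2010, 2012, method="exponential", scale=2.0))  # 0.5
print(numeric_similarity(2010, None, missing_value=0.0))                # 0.0
```

Under these assumptions, a larger scale makes the score more tolerant of differences, while a nonzero offset gives a full score of 1.0 to every difference within that distance of the origin.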