`pudl.analysis.record_linkage.embed_dataframe`¶

Tools for embedding a DataFrame to create a feature matrix for models.

Module Contents¶

Classes¶

`FeatureMatrix`	Class to wrap a feature matrix returned from dataframe embedding.
`TransformStep`	TransformSteps can be combined to vectorize one or more columns.
`ColumnVectorizer`	Define a set of transformations to apply to one or more columns.
`TextVectorizer`	Implement TransformStep for `sklearn.feature_extraction.text.TfidfVectorizer`.
`CategoricalVectorizer`	Implement TransformStep for `sklearn.preprocessing.OneHotEncoder`.
`NumericalVectorizer`	Implement TransformStep for MinMaxScaler.
`NumericalNormalizer`	Implement TransformStep for Normalizer.
`ColumnCleaner`	Implement TransformStep for cleaning functions.
`NameCleaner`	Implement TransformStep for CompanyNameCleaner.
`FuelTypeFiller`	Fill missing fuel types from another column.
`StringSimilarityScorer`	Vectorize two string columns with Jaro-Winkler similarity.
`NumericSimilarityScorer`	Vectorize two numeric columns with a similarity score.

Functions¶

`log_dataframe_embedder_config`(embedder_name, ...)	Log embedder config to mlflow experiment.
`dataframe_embedder_factory`(name_prefix, vectorizers)	Return a configured op graph to embed an input dataframe.
`dataframe_cleaner_factory`(name_prefix, vectorizers)	Return a configured op graph to clean an input dataframe.
`_apply_cleaning_func`(df[, function_key])
`_extract_keyword_from_column`(→ pandas.Series)	Extract keywords contained in a Pandas series with a regular expression.
`_fill_fuel_type_from_name`(→ pandas.DataFrame)	Impute missing fuel type data from a name column.
`_apply_string_similarity_func`(df, function_key, col1, col2)
`_apply_numeric_similarity_func`(df, function_key, col1, ...)

Attributes¶

logger

pudl.analysis.record_linkage.embed_dataframe.logger[source]¶

class pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix[source]¶

Class to wrap a feature matrix returned from dataframe embedding.

Depending on the transformations applied, a feature matrix may be either a sparse or a dense matrix. Using this wrapper enables Dagster's type checking while allowing both dense and sparse matrices underneath.

matrix: numpy.ndarray | scipy.sparse.csr_matrix[source]¶

index: pandas.Index[source]¶
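As a rough illustration of why a wrapper is convenient here, the following sketch uses a plain dataclass as a hypothetical stand-in for `FeatureMatrix` (the names `FeatureMatrixSketch` and `n_records` are assumptions, not part of this module). It shows that downstream code can treat dense and sparse matrices uniformly behind a single type.

```python
from dataclasses import dataclass
from typing import Union

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


@dataclass
class FeatureMatrixSketch:
    """Hypothetical stand-in for FeatureMatrix: one type for both layouts."""

    matrix: Union[np.ndarray, csr_matrix]
    index: pd.Index


def n_records(features: FeatureMatrixSketch) -> int:
    # .shape is available on both dense arrays and CSR matrices.
    return features.matrix.shape[0]


index = pd.Index([101, 102, 103], name="record_id")
dense = FeatureMatrixSketch(matrix=np.eye(3), index=index)
sparse = FeatureMatrixSketch(matrix=csr_matrix(np.eye(3)), index=index)
```

Both `dense` and `sparse` can be passed to the same function, and the index keeps each row tied to the record it came from.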

class pudl.analysis.record_linkage.embed_dataframe.TransformStep(/, **data: Any)[source]¶

Bases: pydantic.BaseModel, abc.ABC

TransformSteps can be combined to vectorize one or more columns.

This class defines a very simple interface for TransformSteps: a TransformStep takes configuration and implements the method as_transformer.

name: str[source]¶

abstract as_transformer() → sklearn.base.BaseEstimator[source]¶: This method should use configuration to produce a sklearn.base.BaseEstimator.
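The interface can be sketched as follows. This is a minimal stand-in that uses `abc` and `dataclasses` instead of `pydantic.BaseModel` so that it runs without PUDL installed; `TransformStepSketch` and `ScalerStep` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

from sklearn.base import BaseEstimator
from sklearn.preprocessing import MinMaxScaler


@dataclass
class TransformStepSketch(ABC):
    """Hypothetical stand-in: holds configuration, builds an estimator."""

    name: str

    @abstractmethod
    def as_transformer(self) -> BaseEstimator:
        """Turn the stored configuration into a scikit-learn estimator."""


@dataclass
class ScalerStep(TransformStepSketch):
    """Hypothetical concrete step wrapping MinMaxScaler."""

    name: str = "scaler_step"
    feature_range: tuple = (0, 1)

    def as_transformer(self) -> BaseEstimator:
        return MinMaxScaler(feature_range=self.feature_range)


step = ScalerStep(feature_range=(0, 10))
transformer = step.as_transformer()
```

Keeping configuration separate from the estimator means a step can be logged or serialized as plain data and only turned into a scikit-learn object when the pipeline is built.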

class pudl.analysis.record_linkage.embed_dataframe.ColumnVectorizer(/, **data: Any)[source]¶

Bases: pydantic.BaseModel

Define a set of transformations to apply to one or more columns.

transform_steps: list[TransformStep][source]¶

weight: float = 1.0[source]¶

columns: list[str][source]¶

as_pipeline()[source]¶: Return sklearn.pipeline.Pipeline with configuration.

as_config_dict()[source]¶: Return config dict formatted for logging to mlflow.
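The sketch below shows one plausible way the three fields fit together in scikit-learn terms: the transform steps become a `Pipeline`, `columns` selects the input, and `weight` scales the resulting features. It assumes the weights are applied through `ColumnTransformer(transformer_weights=...)`; the `vectorizers` dictionary and the column names are hypothetical and do not use the actual `ColumnVectorizer` class.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(
    {"capacity_mw": [10.0, 55.0, 100.0], "net_generation_mwh": [0.0, 50.0, 100.0]}
)

# Hypothetical configuration mirroring the ColumnVectorizer fields.
vectorizers = {
    "capacity": {
        "columns": ["capacity_mw"],
        "weight": 2.0,
        "steps": [("numerical_vectorizer", MinMaxScaler())],
    },
    "generation": {
        "columns": ["net_generation_mwh"],
        "weight": 1.0,
        "steps": [("numerical_vectorizer", MinMaxScaler())],
    },
}

embedder = ColumnTransformer(
    transformers=[
        (name, Pipeline(cfg["steps"]), cfg["columns"])
        for name, cfg in vectorizers.items()
    ],
    # Each block of output features is multiplied by its weight.
    transformer_weights={name: cfg["weight"] for name, cfg in vectorizers.items()},
)
features = embedder.fit_transform(df)
```

In this sketch the `capacity` features span 0 to 2 while the `generation` features span 0 to 1, so differences in capacity count twice as much in any distance computed on the combined matrix.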

pudl.analysis.record_linkage.embed_dataframe.log_dataframe_embedder_config(embedder_name: str, vectorizers: dict[str, ColumnVectorizer], experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker)[source]¶: Log embedder config to mlflow experiment.

pudl.analysis.record_linkage.embed_dataframe.dataframe_embedder_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]¶: Return a configured op graph to embed an input dataframe.

pudl.analysis.record_linkage.embed_dataframe.dataframe_cleaner_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]¶: Return a configured op graph to clean an input dataframe.

class pudl.analysis.record_linkage.embed_dataframe.TextVectorizer(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for sklearn.feature_extraction.text.TfidfVectorizer.

name: str = 'tfidf_vectorizer'[source]¶

options: dict[source]¶

as_transformer()[source]¶: Return configured TfidfVectorizer.
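For orientation, here is a minimal sketch of what a configured `TfidfVectorizer` does with an `options` dictionary. It assumes `options` is passed through as keyword arguments; the option values and plant names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative options: character n-grams tolerate small spelling differences.
options = {"analyzer": "char", "ngram_range": (2, 3)}
vectorizer = TfidfVectorizer(**options)

names = ["washington hydro", "washington hydroelectric", "coal creek"]
matrix = vectorizer.fit_transform(names)  # sparse, one row per name

# Rows are L2-normalized by default, so dot products are cosine similarities.
similarity = (matrix @ matrix.T).toarray()
```

The first two names share most of their character n-grams, so `similarity[0, 1]` is larger than `similarity[0, 2]`.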

class pudl.analysis.record_linkage.embed_dataframe.CategoricalVectorizer(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for sklearn.preprocessing.OneHotEncoder.

name: str = 'one_hot_encoder_vectorizer'[source]¶

options: dict[source]¶

as_transformer()[source]¶: Return configured OneHotEncoder.
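A minimal sketch of the underlying encoder, assuming `options` is forwarded to `OneHotEncoder` as keyword arguments. The category list and column values here are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative options: fix the category order and ignore unseen values.
options = {"categories": [["coal", "gas", "hydro"]], "handle_unknown": "ignore"}
encoder = OneHotEncoder(**options)

df = pd.DataFrame({"fuel_type_code_pudl": ["gas", "hydro", "coal"]})
# fit_transform returns a sparse matrix by default; densify for display.
encoded = encoder.fit_transform(df).toarray()
```

Each row of `encoded` has a single 1 in the column of its category, in the order given by `categories`.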

class pudl.analysis.record_linkage.embed_dataframe.NumericalVectorizer(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for MinMaxScaler.

name: str = 'numerical_vectorizer'[source]¶

options: dict[source]¶

as_transformer()[source]¶: Return configured MinMaxScaler.
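A minimal sketch of the scaling this step performs, assuming `options` is forwarded to `sklearn.preprocessing.MinMaxScaler` as keyword arguments. The values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

options = {"feature_range": (0, 1)}  # illustrative; this is also the default
scaler = MinMaxScaler(**options)

capacity_mw = np.array([[10.0], [55.0], [100.0]])
# Each column is rescaled independently so its minimum maps to 0 and maximum to 1.
scaled = scaler.fit_transform(capacity_mw)
```

Scaling is column-wise, so features with very different units end up on a comparable range before they are combined.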

class pudl.analysis.record_linkage.embed_dataframe.NumericalNormalizer(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for Normalizer.

name: str = 'numerical_normalizer'[source]¶

options: dict[source]¶

as_transformer()[source]¶: Return configured Normalizer.
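A minimal sketch of the normalization this step performs, assuming `options` is forwarded to `sklearn.preprocessing.Normalizer` as keyword arguments. Note that `Normalizer` rescales each row (sample), not each column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

options = {"norm": "l2"}  # illustrative; this is also the default
normalizer = Normalizer(**options)

rows = np.array([[3.0, 4.0], [0.0, 5.0]])
# Each row is divided by its own L2 norm, so every row ends up with unit length.
normalized = normalizer.fit_transform(rows)
```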

pudl.analysis.record_linkage.embed_dataframe._apply_cleaning_func(df, function_key: str = None)[source]¶

class pudl.analysis.record_linkage.embed_dataframe.ColumnCleaner(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for cleaning functions.

name: str = 'column_cleaner'[source]¶

cleaning_function: str[source]¶

as_transformer()[source]¶: Return a transformer that applies the configured cleaning function.
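One plausible way to wire a string key to a cleaning function is to wrap a dispatch function in `sklearn.preprocessing.FunctionTransformer`. The sketch below assumes that approach; the registry `CLEANING_FUNCTIONS`, its keys, and `apply_cleaning_func` are hypothetical stand-ins rather than the module's actual contents.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical registry mapping configuration keys to cleaning functions.
CLEANING_FUNCTIONS = {
    "null_to_zero": lambda df: df.fillna(0.0),
    "null_to_empty_str": lambda df: df.fillna(""),
}


def apply_cleaning_func(df, function_key=None):
    """Look up the cleaning function by key and apply it to the input."""
    return CLEANING_FUNCTIONS[function_key](df)


cleaner = FunctionTransformer(
    apply_cleaning_func, kw_args={"function_key": "null_to_zero"}
)
cleaned = cleaner.fit_transform(pd.DataFrame({"capacity_mw": [1.5, None]}))
```

Storing only the key in the configuration keeps the step serializable, since the function itself is resolved when the transformer runs.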

class pudl.analysis.record_linkage.embed_dataframe.NameCleaner(/, **data: Any)[source]¶

Bases: TransformStep

Implement TransformStep for CompanyNameCleaner.

name: str = 'name_cleaner'[source]¶

company_cleaner: pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner[source]¶

return_as_dframe: bool = False[source]¶

as_transformer()[source]¶: Return configured CompanyNameCleaner.

class pudl.analysis.record_linkage.embed_dataframe.FuelTypeFiller(/, **data: Any)[source]¶

Bases: TransformStep

Fill missing fuel types from another column.

name: str = 'fuel_type_filler'[source]¶

fuel_type_col: str = 'fuel_type_code_pudl'[source]¶

name_col: str = 'plant_name'[source]¶

as_transformer()[source]¶: Return configured FuelTypeFiller.

pudl.analysis.record_linkage.embed_dataframe._extract_keyword_from_column(ser: pandas.Series, keyword_list: list[str]) → pandas.Series[source]¶: Extract keywords contained in a Pandas series with a regular expression.
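A minimal sketch of keyword extraction with a regular expression, assuming whole-word, case-insensitive matching. The function name `extract_keyword` and the keyword list are illustrative, and the actual matching rules in this module may differ.

```python
import re

import pandas as pd


def extract_keyword(ser: pd.Series, keyword_list: list) -> pd.Series:
    """Return the first listed keyword found in each string, else NaN."""
    # Build one alternation group; escape keywords so they match literally.
    pattern = r"\b(" + "|".join(re.escape(k) for k in keyword_list) + r")\b"
    # Lowercase first so the extracted keyword has a consistent case.
    return ser.str.lower().str.extract(pattern, expand=False)


names = pd.Series(["Washington Hydro", "Big Coal Plant", "Unnamed Station"])
keywords = extract_keyword(names, ["coal", "gas", "hydro"])
```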

pudl.analysis.record_linkage.embed_dataframe._fill_fuel_type_from_name(df: pandas.DataFrame, fuel_type_col: str, name_col: str) → pandas.DataFrame[source]¶

Impute missing fuel type data from a name column.

If a fuel type code appears in the plant name of a record whose fuel type is missing, fill in the PUDL fuel type code for that record, e.g. a plant named “Washington Hydro”.
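The imputation can be sketched as follows, assuming that only missing fuel types are filled and existing values are left untouched. The constant `FUEL_TYPES` and the function name are hypothetical, and the real list of fuel type codes is not shown here.

```python
import re

import pandas as pd

FUEL_TYPES = ["coal", "gas", "hydro"]  # hypothetical keyword list


def fill_fuel_type_from_name(
    df: pd.DataFrame, fuel_type_col: str, name_col: str
) -> pd.DataFrame:
    """Fill missing fuel types with a fuel keyword found in the name column."""
    out = df.copy()
    pattern = r"\b(" + "|".join(re.escape(f) for f in FUEL_TYPES) + r")\b"
    from_name = out[name_col].str.lower().str.extract(pattern, expand=False)
    # fillna only touches rows where the fuel type is currently missing.
    out[fuel_type_col] = out[fuel_type_col].fillna(from_name)
    return out


plants = pd.DataFrame(
    {
        "plant_name": ["Washington Hydro", "Coal Creek", "Unnamed Station"],
        "fuel_type_code_pudl": [None, "gas", None],
    }
)
filled = fill_fuel_type_from_name(plants, "fuel_type_code_pudl", "plant_name")
```

In this sketch the second record keeps its reported value `gas` even though its name contains `coal`, and the third record stays missing because no keyword matches.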

pudl.analysis.record_linkage.embed_dataframe._apply_string_similarity_func(df, function_key: str, col1: str, col2: str)[source]¶

class pudl.analysis.record_linkage.embed_dataframe.StringSimilarityScorer(/, **data: Any)[source]¶

Bases: TransformStep

Vectorize two string columns with Jaro-Winkler similarity.

name: str = 'string_sim'[source]¶

metric: str[source]¶

col1: str[source]¶

col2: str[source]¶

as_transformer()[source]¶: Return configured Jaro-Winkler similarity function.
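To make the metric concrete, the sketch below is a pure-Python Jaro-Winkler implementation applied row-wise to two columns. It is written for illustration only: production code would typically call a dedicated string-matching library, and the column names `name_l` and `name_r` are hypothetical.

```python
import pandas as pd


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: shared characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)


df = pd.DataFrame(
    {"name_l": ["MARTHA", "coal creek"], "name_r": ["MARHTA", "coal creek"]}
)
scores = df.apply(lambda row: jaro_winkler(row["name_l"], row["name_r"]), axis=1)
```

The result is a single numeric column per pair of input columns, which is the shape a similarity-based vectorizer contributes to the feature matrix.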

pudl.analysis.record_linkage.embed_dataframe._apply_numeric_similarity_func(df, function_key: str, col1: str, col2: str, scale: float, offset: float, origin: float, missing_value: float, label: str)[source]¶

class pudl.analysis.record_linkage.embed_dataframe.NumericSimilarityScorer(/, **data: Any)[source]¶

Bases: TransformStep

Vectorize two numeric columns with a similarity score.

If two values are the same, the similarity is 1; in case of complete disagreement it is 0. The implementation is adapted from the numeric comparison in the recordlinkage Python package and is similar to numeric comparison in Elasticsearch, a full-text search tool.

Parameters:

name – The name of the transformation step. Default is numeric_sim.
col1 – The name of the first column to compare. Must be a numeric column.
col2 – The name of the second column to compare. Must be a numeric column.
output_name – The name of the output Series of compared values.
method – The metric used. Options are “exponential”, “linear”, “exact”.
scale – The rate of decay, how quickly the score should drop the further from the origin that a value lies. Default is 1.0.
offset – Setting a nonzero offset expands the central point to cover a range of values instead of just the single point specified by the origin. Default is 0.
origin – The central point, or the best possible value for the difference between records. Differences that fall at the origin will get a similarity score of 1.0. The default is 0.
missing_value – The value to assign when one or both records have a missing value in the compared field. Default is 0.
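The parameters above can be illustrated with the following sketch. It assumes decay functions modelled on those in recordlinkage (a halving of the score for every `scale` units under “exponential”, and a straight-line drop to 0 at twice `scale` under “linear”); the exact formulas used by this class may differ, and `numeric_similarity` is a hypothetical name.

```python
import numpy as np


def numeric_similarity(
    x, y, method="linear", scale=1.0, offset=0.0, origin=0.0, missing_value=0.0
):
    """Score pairwise numeric agreement in [0, 1]; a sketch, not the PUDL code."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(x) | np.isnan(y)
    # Distance of the difference from the origin, ignoring anything within offset.
    d = np.clip(np.abs(np.abs(x - y) - origin), offset, None) - offset
    if method == "exponential":
        score = 2.0 ** (-d / scale)
    elif method == "linear":
        score = np.clip(1.0 - d / (2.0 * scale), 0.0, 1.0)
    elif method == "exact":
        score = (x == y).astype(float)
    else:
        raise ValueError(f"Unknown method: {method}")
    # Pairs with a missing value get the configured fallback score.
    return np.where(missing, missing_value, score)


left = [10.0, 10.0, 10.0, np.nan]
right = [10.0, 11.0, 12.0, 5.0]
linear_scores = numeric_similarity(left, right, method="linear", scale=1.0)
exp_scores = numeric_similarity(left, right, method="exponential", scale=1.0)
```

With `scale=1.0`, a difference of 1 scores 0.5 under both methods in this sketch, while a difference of 2 scores 0 under “linear” and 0.25 under “exponential”. A larger `scale` makes the score fall off more slowly.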