pudl.analysis.record_linkage.embed_dataframe¶
Tools for embedding a DataFrame to create a feature matrix for models.
Classes¶
- FeatureMatrix: Class to wrap a feature matrix returned from dataframe embedding.
- TransformStep: TransformSteps can be combined to vectorize one or more columns.
- ColumnVectorizer: Define a set of transformations to apply to one or more columns.
- TextVectorizer: Implement TransformStep for sklearn.feature_extraction.text.TfidfVectorizer.
- CategoricalVectorizer: Implement TransformStep for sklearn.preprocessing.OneHotEncoder.
- NumericalVectorizer: Implement TransformStep for MinMaxScaler.
- NumericalNormalizer: Implement TransformStep for Normalizer.
- ColumnCleaner: Implement TransformStep for cleaning functions.
- NameCleaner: Implement TransformStep for CompanyNameCleaner.
- FuelTypeFiller: Fill missing fuel types from another column.
- StringSimilarityScorer: Vectorize two string columns with Jaro-Winkler similarity.
- NumericSimilarityScorer: Vectorize two numeric columns with a similarity score.
Functions¶
- log_dataframe_embedder_config: Log embedder config to mlflow experiment.
- dataframe_embedder_factory: Return a configured op graph to embed an input dataframe.
- dataframe_cleaner_factory: Return a configured op graph to clean an input dataframe.
- _apply_cleaning_func
- _extract_keyword_from_column: Extract keywords contained in a Pandas series with a regular expression.
- _fill_fuel_type_from_name: Impute missing fuel type data from a name column.
- _apply_string_similarity_func
- _apply_numeric_similarity_func
Module Contents¶
- class pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix[source]¶
Class to wrap a feature matrix returned from dataframe embedding.
Depending on the transformations applied, a feature matrix may be sparse or dense matrix. Using this wrapper enables Dagsters type checking while allowing both dense and sparse matrices underneath.
- index: pandas.Index[source]¶
- class pudl.analysis.record_linkage.embed_dataframe.TransformStep(/, **data: Any)[source]¶
Bases: pydantic.BaseModel, abc.ABC
TransformSteps can be combined to vectorize one or more columns.
This class defines a very simple interface for TransformSteps, which essentially says that a TransformStep should take configuration and implement the method as_transformer.
- abstract as_transformer() → sklearn.base.BaseEstimator[source]¶
This method should use configuration to produce a sklearn.base.BaseEstimator.
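A concrete TransformStep only needs to hold its configuration and build an sklearn estimator on demand. The pattern can be sketched as follows, using a plain abstract class in place of the pydantic model; the `LogTransform` step and its use of `FunctionTransformer` are hypothetical illustrations, not part of PUDL:

```python
import abc

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.preprocessing import FunctionTransformer


class SketchTransformStep(abc.ABC):
    """Minimal stand-in for TransformStep: configuration in, estimator out."""

    @abc.abstractmethod
    def as_transformer(self) -> BaseEstimator: ...


class LogTransform(SketchTransformStep):
    """Hypothetical step that log-scales a numeric column."""

    def as_transformer(self) -> BaseEstimator:
        # log1p handles zeros safely; purely illustrative.
        return FunctionTransformer(np.log1p)


transformer = LogTransform().as_transformer()
result = transformer.fit_transform(np.array([[0.0], [np.e - 1]]))
```

The key point is that the step itself is pure configuration; nothing stateful happens until `as_transformer()` builds the estimator.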
- class pudl.analysis.record_linkage.embed_dataframe.ColumnVectorizer(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
Define a set of transformations to apply to one or more columns.
- transform_steps: list[TransformStep][source]¶
- as_pipeline()[source]¶
Return sklearn.pipeline.Pipeline with configuration.
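Conceptually, each ColumnVectorizer contributes a per-column pipeline, and the pipelines are concatenated column-wise into one feature matrix. The same idea can be sketched directly with sklearn; the column names and step choices below are hypothetical, not the actual PUDL configuration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame(
    {
        "capacity_mw": [10.0, 55.0, 100.0],
        "fuel_type": ["gas", "coal", "gas"],
    }
)

# One pipeline per column group; outputs are stacked side by side.
embedder = ColumnTransformer(
    [
        ("capacity", Pipeline([("scale", MinMaxScaler())]), ["capacity_mw"]),
        ("fuel", Pipeline([("onehot", OneHotEncoder())]), ["fuel_type"]),
    ]
)
matrix = embedder.fit_transform(df)  # one scaled column + one-hot columns
```

This also explains why the result is wrapped in FeatureMatrix: depending on the steps involved, the stacked output may come back sparse or dense.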
- pudl.analysis.record_linkage.embed_dataframe.log_dataframe_embedder_config(embedder_name: str, vectorizers: dict[str, ColumnVectorizer], experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker)[source]¶
Log embedder config to mlflow experiment.
- pudl.analysis.record_linkage.embed_dataframe.dataframe_embedder_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]¶
Return a configured op graph to embed an input dataframe.
- pudl.analysis.record_linkage.embed_dataframe.dataframe_cleaner_factory(name_prefix: str, vectorizers: dict[str, ColumnVectorizer])[source]¶
Return a configured op graph to clean an input dataframe.
- class pudl.analysis.record_linkage.embed_dataframe.TextVectorizer(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for sklearn.feature_extraction.text.TfidfVectorizer.
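TfidfVectorizer turns a column of free-text strings into a sparse matrix of TF-IDF weights, one row per string. A minimal standalone example (the plant names are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical plant names; the real inputs come from the PUDL dataframes.
names = ["colstrip steam plant", "colstrip energy", "washington hydro"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(names)  # sparse matrix, one row per string
```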
- class pudl.analysis.record_linkage.embed_dataframe.CategoricalVectorizer(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for sklearn.preprocessing.OneHotEncoder.
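OneHotEncoder maps a categorical column to one indicator column per category. A minimal standalone example with made-up fuel codes:

```python
from sklearn.preprocessing import OneHotEncoder

fuel = [["gas"], ["coal"], ["gas"], ["wind"]]

# handle_unknown="ignore" keeps transform() from failing on unseen categories.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(fuel)  # sparse (4, 3): coal, gas, wind
```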
- class pudl.analysis.record_linkage.embed_dataframe.NumericalVectorizer(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for sklearn.preprocessing.MinMaxScaler.
- class pudl.analysis.record_linkage.embed_dataframe.NumericalNormalizer(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for sklearn.preprocessing.Normalizer.
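These two numeric steps do different things: MinMaxScaler rescales each column onto [0, 1], while Normalizer rescales each row to unit norm. A quick illustration of the contrast, with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column-wise: each feature is mapped onto [0, 1].
scaled = MinMaxScaler().fit_transform(X)

# Row-wise: each sample is divided by its L2 norm.
normalized = Normalizer().fit_transform(X)
```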
- pudl.analysis.record_linkage.embed_dataframe._apply_cleaning_func(df, function_key: str = None)[source]¶
- class pudl.analysis.record_linkage.embed_dataframe.ColumnCleaner(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for cleaning functions.
- class pudl.analysis.record_linkage.embed_dataframe.NameCleaner(/, **data: Any)[source]¶
Bases: TransformStep
Implement TransformStep for CompanyNameCleaner.
- class pudl.analysis.record_linkage.embed_dataframe.FuelTypeFiller(/, **data: Any)[source]¶
Bases: TransformStep
Fill missing fuel types from another column.
- pudl.analysis.record_linkage.embed_dataframe._extract_keyword_from_column(ser: pandas.Series, keyword_list: list[str]) pandas.Series [source]¶
Extract keywords contained in a Pandas series with a regular expression.
- pudl.analysis.record_linkage.embed_dataframe._fill_fuel_type_from_name(df: pandas.DataFrame, fuel_type_col: str, name_col: str) pandas.DataFrame [source]¶
Impute missing fuel type data from a name column.
If a missing fuel type code is contained in the plant name, fill in that fuel type code for the record, e.g. “Washington Hydro”.
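The idea can be sketched with plain pandas: scan the name column for known fuel keywords with a regular expression and use any match to fill a missing fuel type code. The keyword list, column names, and helper below are hypothetical, not PUDL's actual implementation:

```python
import pandas as pd

# Hypothetical fuel keywords to search for in plant names.
FUEL_KEYWORDS = ["hydro", "solar", "wind", "coal", "gas"]


def fill_fuel_type_from_name(df: pd.DataFrame, fuel_col: str, name_col: str) -> pd.DataFrame:
    """Fill missing fuel types with a keyword found in the plant name."""
    pattern = rf"({'|'.join(FUEL_KEYWORDS)})"
    # One capture group, expand=False -> a Series of matches (NaN if none).
    extracted = df[name_col].str.lower().str.extract(pattern, expand=False)
    df = df.copy()
    df[fuel_col] = df[fuel_col].fillna(extracted)
    return df


plants = pd.DataFrame(
    {
        "plant_name": ["Washington Hydro", "Big Bend Steam"],
        "fuel_type_code": [None, "coal"],
    }
)
filled = fill_fuel_type_from_name(plants, "fuel_type_code", "plant_name")
```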
- pudl.analysis.record_linkage.embed_dataframe._apply_string_similarity_func(df, function_key: str, col1: str, col2: str)[source]¶
- class pudl.analysis.record_linkage.embed_dataframe.StringSimilarityScorer(/, **data: Any)[source]¶
Bases: TransformStep
Vectorize two string columns with Jaro-Winkler similarity.
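Jaro-Winkler similarity returns 1.0 for identical strings, 0.0 for entirely dissimilar ones, and boosts the score for strings sharing a common prefix. Libraries such as jellyfish provide it ready-made; the self-contained sketch below shows the mechanics and is not PUDL's implementation:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity from matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if not n1 or not n2:
        return 0.0
    window = max(n1, n2) // 2 - 1  # how far apart matches may sit
    m1, m2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, n2)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(n1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / n1 + matches / n2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * p * (1 - score)
```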
- pudl.analysis.record_linkage.embed_dataframe._apply_numeric_similarity_func(df, function_key: str, col1: str, col2: str, scale: float, offset: float, origin: float, missing_value: float, label: str)[source]¶
- class pudl.analysis.record_linkage.embed_dataframe.NumericSimilarityScorer(/, **data: Any)[source]¶
Bases: TransformStep
Vectorize two numeric columns with a similarity score.
If two values are the same, the similarity is 1; in case of complete disagreement it is 0. The implementation is adapted from the numeric comparison library of the recordlinkage Python package and is similar to the numeric comparison in Elasticsearch, a full-text search tool.
- Parameters:
name – The name of the transformation step. Default is numeric_sim.
col1 – The name of the first column to compare. Must be a numeric column.
col2 – The name of the second column to compare. Must be a numeric column.
output_name – The name of the output Series of compared values.
method – The metric used. Options are “exponential”, “linear”, “exact”.
scale – The rate of decay: how quickly the score should drop the further from the origin a value lies. Default is 1.0.
offset – Setting a nonzero offset expands the central point to cover a range of values instead of just the single point specified by the origin. Default is 0.
origin – The central point, or the best possible value for the difference between records. Differences that fall at the origin will get a similarity score of 1.0. The default is 0.
missing_value – The similarity score used if one or both records have a missing value in the compared field. Default is 0.
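Following the decay description above, the "exponential" method can be sketched as: take the absolute difference, shift it by the origin, clip away the offset, and halve the score for every `scale` units of remaining distance. This is a sketch of the idea in the recordlinkage/Elasticsearch style, not PUDL's implementation:

```python
def exponential_similarity(
    a: float,
    b: float,
    scale: float = 1.0,
    offset: float = 0.0,
    origin: float = 0.0,
) -> float:
    """Score halves each time the distance grows by `scale` beyond `offset`."""
    distance = max(abs(a - b - origin) - offset, 0.0)
    return 2.0 ** (-distance / scale)
```

With the defaults, equal values score 1.0, values `scale` apart score 0.5, and a nonzero `offset` widens the band of differences that still score 1.0.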