pudl.analysis.record_linkage.link_cross_year
Define a record linkage model interface and implement common functionality.
Attributes
Classes
PenalizeReportYearDistanceConfig | Compute distance between records and add penalty to records from same year.
DistanceMatrix | Class to wrap a distance matrix saved in a np.memmap.
DBSCANConfig | Configuration for DBSCAN step.
SplitClustersConfig | Configuration for AgglomerativeClustering used to split overmerged clusters.
MatchOrphanedRecordsConfig | Configuration for match_orphaned_records() op.
Functions
get_cluster_distance_matrix() | Return a distance matrix with only distances within a cluster.
get_average_distance_matrix() | Compute average distance between two clusters of records given indices of each cluster.
compute_distance_with_year_penalty() | Compute a distance matrix and penalize records from the same year.
cluster_records_dbscan() | Generate initial IDs using DBSCAN algorithm.
split_clusters() | Split clusters with multiple records from same report_year.
match_orphaned_records() | DBSCAN assigns 'noisy' records a label of '-1', which will be labeled by this step.
link_ids_cross_year() | Apply model and return column of estimated record labels.
Module Contents
- class pudl.analysis.record_linkage.link_cross_year.PenalizeReportYearDistanceConfig(**config_dict)
Bases: dagster.Config
Compute distance between records and add penalty to records from same year.
The metric can be any string accepted by scipy.spatial.distance.pdist(), e.g. cosine or euclidean.
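As a rough illustration of the idea, the sketch below computes pairwise distances with scipy.spatial.distance.pdist() and adds a fixed penalty to every pair of distinct records that share a report year. The function name penalized_distances, the penalty value, and the sample data are assumptions made for this example; they are not taken from the PUDL implementation or its config fields.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def penalized_distances(features, years, metric="euclidean", penalty=1000.0):
    """Pairwise distances with a fixed penalty on same-year pairs (hypothetical helper)."""
    # pdist returns a condensed vector; squareform expands it to n x n.
    dist = squareform(pdist(features, metric=metric))
    years = np.asarray(years)
    # True where two different records share a report year.
    same_year = years[:, None] == years[None, :]
    np.fill_diagonal(same_year, False)
    return dist + penalty * same_year


features = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
years = [2020, 2020, 2021]
dist = penalized_distances(features, years)
```

Records 0 and 1 share a year, so their distance is pushed far above that of any cross-year pair, which discourages a clustering step from linking them.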
- class pudl.analysis.record_linkage.link_cross_year.DistanceMatrix(feature_matrix: numpy.ndarray, original_df: pandas.DataFrame, config: PenalizeReportYearDistanceConfig)
Class to wrap a distance matrix saved in a np.memmap.
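For readers unfamiliar with np.memmap, the following minimal sketch shows how a square distance matrix can be backed by a file on disk rather than held in memory. It only illustrates the NumPy mechanism; the file name, dtype, and size are assumptions, and it does not reproduce how DistanceMatrix itself is built.

```python
import os
import tempfile

import numpy as np

n = 4  # number of records (illustrative)
path = os.path.join(tempfile.mkdtemp(), "distances.dat")

# mode="w+" creates the backing file; writes go to disk-backed pages.
dist = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
dist[:] = 0.0
dist[0, 1] = dist[1, 0] = 2.5
dist.flush()

# Reopen read-only. The shape must be supplied again because the
# file holds raw bytes with no header.
reopened = np.memmap(path, dtype=np.float32, mode="r", shape=(n, n))
value = float(reopened[1, 0])
```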
- pudl.analysis.record_linkage.link_cross_year.get_cluster_distance_matrix(distance_matrix: numpy.ndarray, cluster_inds: numpy.ndarray) -> numpy.ndarray
Return a distance matrix with only distances within a cluster.
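One common way to restrict a square matrix to a subset of records is NumPy's np.ix_ indexing, sketched below. The helper name cluster_submatrix and the sample matrix are assumptions for illustration; the actual function may be implemented differently.

```python
import numpy as np


def cluster_submatrix(distance_matrix, cluster_inds):
    """Keep only rows and columns belonging to one cluster (hypothetical helper)."""
    # np.ix_ builds an open mesh so both axes are restricted to cluster_inds.
    return distance_matrix[np.ix_(cluster_inds, cluster_inds)]


full = np.array(
    [
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 4.0, 5.0],
        [2.0, 4.0, 0.0, 6.0],
        [3.0, 5.0, 6.0, 0.0],
    ]
)
sub = cluster_submatrix(full, np.array([0, 2]))
```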
- pudl.analysis.record_linkage.link_cross_year.get_average_distance_matrix(distance_matrix: numpy.ndarray, cluster_groups: list[list[int]]) -> numpy.ndarray
Compute average distance between two clusters of records given indices of each cluster.
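The averaging step can be pictured as follows: for every pair of clusters, take the block of record-level distances between their members and average it. This is a sketch under assumptions; the helper name, the choice to leave the diagonal at zero, and the sample data are not taken from the source.

```python
import numpy as np


def average_distance_matrix(distance_matrix, cluster_groups):
    """Mean record-to-record distance for each pair of clusters (hypothetical helper)."""
    k = len(cluster_groups)
    avg = np.zeros((k, k))
    for i, rows in enumerate(cluster_groups):
        for j, cols in enumerate(cluster_groups):
            if i != j:
                # Block of distances between members of cluster i and cluster j.
                avg[i, j] = distance_matrix[np.ix_(rows, cols)].mean()
    return avg


full = np.array(
    [
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 4.0, 5.0],
        [2.0, 4.0, 0.0, 6.0],
        [3.0, 5.0, 6.0, 0.0],
    ]
)
# Records 0 and 1 form one cluster; records 2 and 3 are singletons.
avg = average_distance_matrix(full, [[0, 1], [2], [3]])
```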
- pudl.analysis.record_linkage.link_cross_year.compute_distance_with_year_penalty(config: PenalizeReportYearDistanceConfig, feature_matrix: pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix, original_df: pandas.DataFrame) -> DistanceMatrix
Compute a distance matrix and penalize records from the same year.
- class pudl.analysis.record_linkage.link_cross_year.DBSCANConfig(**config_dict)
Bases: dagster.Config
Configuration for DBSCAN step.
- pudl.analysis.record_linkage.link_cross_year.cluster_records_dbscan(config: DBSCANConfig, distance_matrix: DistanceMatrix, original_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) -> pandas.DataFrame
Generate initial IDs using DBSCAN algorithm.
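To show what clustering on a precomputed distance matrix looks like, here is a minimal sketch using scikit-learn's DBSCAN with metric="precomputed". The eps and min_samples values and the toy matrix are assumptions chosen for the example; they are not the defaults of DBSCANConfig.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy symmetric distance matrix: records 0 and 1 are close together,
# records 2 and 3 are far from everything.
dist = np.array(
    [
        [0.0, 0.1, 5.0, 9.0],
        [0.1, 0.0, 5.0, 9.0],
        [5.0, 5.0, 0.0, 9.0],
        [9.0, 9.0, 9.0, 0.0],
    ]
)

# metric="precomputed" tells DBSCAN that `dist` already holds distances.
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
```

Records that do not fall in any dense neighborhood receive the label -1, which is why a later step is needed to handle them.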
- class pudl.analysis.record_linkage.link_cross_year.SplitClustersConfig(**config_dict)
Bases: dagster.Config
Configuration for AgglomerativeClustering used to split overmerged clusters.
- pudl.analysis.record_linkage.link_cross_year.split_clusters(config: SplitClustersConfig, distance_matrix: DistanceMatrix, id_year_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) -> pandas.DataFrame
Split clusters with multiple records from same report_year.
DBSCAN will sometimes match records from the same report year, which breaks the assumption that each entity has only one record per report year. To fix this, agglomerative clustering is applied to each such cluster. Agglomerative clustering could replace DBSCAN in the initial linkage step and avoid these matches in the first place. However, it is very inefficient on a large number of records, so applying it only to the smaller sets of overmerged records is much faster and uses much less memory.
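The splitting idea can be sketched on a single overmerged cluster. The example below uses SciPy's hierarchical clustering (average linkage cut at a distance threshold) as a stand-in for scikit-learn's AgglomerativeClustering; the helper name, the threshold, and the data are assumptions. In this simplified form nothing guarantees that same-year records end up apart, so a real pipeline would rely on a year-penalized distance matrix for that.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def split_overmerged(sub_dist, years, threshold):
    """Split one cluster if it holds several records from the same year (hypothetical helper)."""
    years = np.asarray(years)
    if len(np.unique(years)) == len(years):
        # No duplicated report year: keep the cluster as a single group.
        return np.ones(len(years), dtype=int)
    # linkage expects the condensed form of the distance matrix.
    tree = linkage(squareform(sub_dist), method="average")
    # Cut the tree wherever merges exceed the distance threshold.
    return fcluster(tree, t=threshold, criterion="distance")


sub_dist = np.array(
    [
        [0.0, 0.2, 3.0, 3.1],
        [0.2, 0.0, 3.2, 3.0],
        [3.0, 3.2, 0.0, 0.3],
        [3.1, 3.0, 0.3, 0.0],
    ]
)
years = [2019, 2020, 2019, 2020]
labels = split_overmerged(sub_dist, years, threshold=1.0)
```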
- class pudl.analysis.record_linkage.link_cross_year.MatchOrphanedRecordsConfig(**config_dict)
Bases: dagster.Config
Configuration for match_orphaned_records() op.
- pudl.analysis.record_linkage.link_cross_year.match_orphaned_records(config: MatchOrphanedRecordsConfig, distance_matrix: DistanceMatrix, id_year_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) -> pandas.DataFrame
DBSCAN assigns ‘noisy’ records a label of ‘-1’, which will be labeled by this step.
To label orphaned records, points are separated into clusters, with each orphaned record forming a cluster of a single point. Then a distance matrix is computed from the average distance between each pair of clusters and used in a round of agglomerative clustering. This matches orphaned records to existing clusters, or assigns them unique IDs if they are not close enough to any existing cluster.
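The procedure described above can be sketched end to end on toy data. As before, SciPy's average-linkage clustering stands in for scikit-learn's AgglomerativeClustering, and the helper name match_orphans, the threshold, and the sample matrix are assumptions rather than the actual implementation. Note that in this simplified form two existing clusters could also be merged if they fall under the threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def match_orphans(dist, labels, threshold):
    """Attach records labeled -1 to nearby clusters (hypothetical helper).

    Assumes at least two groups exist after orphans become singletons.
    """
    labels = np.asarray(labels)
    # Existing clusters keep their members; each orphan becomes its own group.
    groups = [np.flatnonzero(labels == lab) for lab in np.unique(labels) if lab != -1]
    groups += [np.array([i]) for i in np.flatnonzero(labels == -1)]

    # Average record-level distance between every pair of groups.
    k = len(groups)
    avg = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            avg[i, j] = avg[j, i] = dist[np.ix_(groups[i], groups[j])].mean()

    # One round of agglomerative clustering over the groups.
    tree = linkage(squareform(avg), method="average")
    group_ids = fcluster(tree, t=threshold, criterion="distance")

    # Map each group's new ID back onto its member records.
    new_labels = np.empty(len(labels), dtype=int)
    for members, gid in zip(groups, group_ids):
        new_labels[members] = gid
    return new_labels


dist = np.array(
    [
        [0.0, 0.1, 0.3, 9.0],
        [0.1, 0.0, 0.4, 9.0],
        [0.3, 0.4, 0.0, 9.0],
        [9.0, 9.0, 9.0, 0.0],
    ]
)
new_labels = match_orphans(dist, [0, 0, -1, -1], threshold=1.0)
```

Here record 2 sits close to the existing cluster and joins it, while record 3 is far from everything and keeps an ID of its own.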
- pudl.analysis.record_linkage.link_cross_year.link_ids_cross_year(df: pandas.DataFrame, feature_matrix: pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker)
Apply model and return column of estimated record labels.