PUDL 0.1.dev50+g9c75267.d20250516 documentation

pudl.analysis.record_linkage.link_cross_year

Define a record linkage model interface and implement common functionality.

Attributes

logger

Classes

PenalizeReportYearDistanceConfig

Compute distance between records and add penalty to records from same year.

DistanceMatrix

Class to wrap a distance matrix saved in a np.memmap.

DBSCANConfig

Configuration for DBSCAN step.

SplitClustersConfig

Configuration for AgglomerativeClustering used to split overmerged clusters.

MatchOrphanedRecordsConfig

Configuration for match_orphaned_records() op.

Functions

get_cluster_distance_matrix(→ numpy.ndarray)

Return a distance matrix with only distances within a cluster.

get_average_distance_matrix(→ numpy.ndarray)

Compute average distance between two clusters of records given indices of each cluster.

compute_distance_with_year_penalty(→ DistanceMatrix)

Compute a distance matrix and penalize records from the same year.

cluster_records_dbscan(→ pandas.DataFrame)

Generate initial IDs using DBSCAN algorithm.

split_clusters(→ pandas.DataFrame)

Split clusters with multiple records from same report_year.

match_orphaned_records(→ pandas.DataFrame)

DBSCAN assigns 'noisy' records a label of '-1', which will be labeled by this step.

link_ids_cross_year(df, feature_matrix, experiment_tracker)

Apply model and return column of estimated record labels.

Module Contents

pudl.analysis.record_linkage.link_cross_year.logger

class pudl.analysis.record_linkage.link_cross_year.PenalizeReportYearDistanceConfig(**config_dict)

Bases: dagster.Config

Compute distance between records and add penalty to records from same year.

The metric can be any string accepted by scipy.spatial.distance.pdist(), e.g. cosine or euclidean.

distance_penalty: float = 10000.0
metric: str = 'euclidean'
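
A minimal sketch of how a same-year penalty could be applied to a pairwise distance matrix. It only illustrates the idea described above and is not taken from the PUDL source: the helper name `distance_with_year_penalty` is hypothetical, and adding the penalty to every off-diagonal same-year pair is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def distance_with_year_penalty(
    features, report_years, metric="euclidean", distance_penalty=10000.0
):
    """Hypothetical helper: pairwise distances plus a same-year penalty."""
    # pdist returns condensed distances; squareform expands them to (n, n).
    dist = squareform(pdist(features, metric=metric))
    # Boolean mask that is True wherever two records share a report year.
    same_year = report_years[:, None] == report_years[None, :]
    # A record should not be penalized against itself.
    np.fill_diagonal(same_year, False)
    dist[same_year] += distance_penalty
    return dist


features = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
report_years = np.array([2020, 2020, 2021])
penalized = distance_with_year_penalty(features, report_years)
```

Here records 0 and 1 share a report year, so their distance is pushed far above any plausible clustering threshold, while the distance between records 0 and 2 is left unchanged.
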

class pudl.analysis.record_linkage.link_cross_year.DistanceMatrix(feature_matrix: numpy.ndarray, original_df: pandas.DataFrame, config: PenalizeReportYearDistanceConfig)

Class to wrap a distance matrix saved in a np.memmap.

file_buffer
distance_matrix
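
The following sketch shows, in general terms, how a square distance matrix can be backed by a file with `np.memmap` so that it does not have to live entirely in RAM. The file name, dtype, and shape are assumptions chosen for illustration; how `DistanceMatrix` itself manages its buffer is not shown here.

```python
import os
import tempfile

import numpy as np

n_records = 3
with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical backing file for the matrix.
    path = os.path.join(tmpdir, "distances.dat")
    # mode="w+" creates the file, sized for the requested shape and dtype.
    distances = np.memmap(
        path, dtype=np.float32, mode="w+", shape=(n_records, n_records)
    )
    distances[0, 1] = distances[1, 0] = 2.5
    distances.flush()  # Write pending changes to disk.

    # Reopen read-only; the values come from the file, not from memory.
    reopened = np.memmap(
        path, dtype=np.float32, mode="r", shape=(n_records, n_records)
    )
    value = float(reopened[1, 0])
    total = float(reopened.sum())
    # Drop the memmap references before the directory is cleaned up.
    del distances, reopened
```
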

pudl.analysis.record_linkage.link_cross_year.get_cluster_distance_matrix(distance_matrix: numpy.ndarray, cluster_inds: numpy.ndarray) → numpy.ndarray

Return a distance matrix with only distances within a cluster.
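
One way to extract the within-cluster block of a full distance matrix is NumPy's `np.ix_`, sketched below. The function name `cluster_submatrix` is hypothetical and the approach is an assumption, not necessarily what this function does internally.

```python
import numpy as np


def cluster_submatrix(distance_matrix, cluster_inds):
    """Hypothetical helper: keep only rows and columns in cluster_inds."""
    # np.ix_ builds an open mesh, selecting the full cross product of indices.
    return distance_matrix[np.ix_(cluster_inds, cluster_inds)]


full = np.array(
    [
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 4.0, 5.0],
        [2.0, 4.0, 0.0, 6.0],
        [3.0, 5.0, 6.0, 0.0],
    ]
)
sub = cluster_submatrix(full, np.array([0, 2, 3]))
```

The result is the 3 × 3 matrix of distances among records 0, 2, and 3, with record 1 dropped from both axes.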

pudl.analysis.record_linkage.link_cross_year.get_average_distance_matrix(distance_matrix: numpy.ndarray, cluster_groups: list[list[int]]) → numpy.ndarray

Compute average distance between two clusters of records given indices of each cluster.
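
As an illustration of averaging distances between clusters, the sketch below takes the mean over every cross-cluster pair of records. The helper name `average_distance_matrix` and the zero diagonal are assumptions made for this example.

```python
import itertools

import numpy as np


def average_distance_matrix(distance_matrix, cluster_groups):
    """Hypothetical helper: mean record-to-record distance per cluster pair."""
    n_groups = len(cluster_groups)
    avg = np.zeros((n_groups, n_groups))
    for i, j in itertools.combinations(range(n_groups), 2):
        # All distances from records in group i to records in group j.
        block = distance_matrix[np.ix_(cluster_groups[i], cluster_groups[j])]
        avg[i, j] = avg[j, i] = block.mean()
    return avg


full = np.array(
    [
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 4.0, 5.0],
        [2.0, 4.0, 0.0, 6.0],
        [3.0, 5.0, 6.0, 0.0],
    ]
)
avg = average_distance_matrix(full, [[0, 1], [2], [3]])
```

With these inputs the first cluster holds records 0 and 1, so its average distance to record 2 is (2 + 4) / 2 = 3 and to record 3 is (3 + 5) / 2 = 4.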

pudl.analysis.record_linkage.link_cross_year.compute_distance_with_year_penalty(config: PenalizeReportYearDistanceConfig, feature_matrix: pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix, original_df: pandas.DataFrame) → DistanceMatrix

Compute a distance matrix and penalize records from the same year.

class pudl.analysis.record_linkage.link_cross_year.DBSCANConfig(**config_dict)

Bases: dagster.Config

Configuration for DBSCAN step.

eps: float = 0.5
min_samples: int = 1

pudl.analysis.record_linkage.link_cross_year.cluster_records_dbscan(config: DBSCANConfig, distance_matrix: DistanceMatrix, original_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) → pandas.DataFrame

Generate initial IDs using DBSCAN algorithm.
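
For orientation, this is roughly what running scikit-learn's DBSCAN on a precomputed distance matrix looks like with the `eps` and `min_samples` defaults listed in `DBSCANConfig`. The toy distances are made up, and passing `metric="precomputed"` is an assumption about how the distance matrix would be consumed.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy distances: records 0-1 and 2-3 are close; record 4 is far from all.
distances = np.full((5, 5), 5.0)
np.fill_diagonal(distances, 0.0)
distances[0, 1] = distances[1, 0] = 0.1
distances[2, 3] = distances[3, 2] = 0.2

# Defaults from DBSCANConfig: eps=0.5, min_samples=1.
labels_default = DBSCAN(eps=0.5, min_samples=1, metric="precomputed").fit_predict(
    distances
)
# A stricter setting, used here only to show how noise labels arise.
labels_strict = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(
    distances
)
```

In scikit-learn, `min_samples` counts the point itself, so with `min_samples=1` every record is a core point and the isolated record gets its own cluster. With `min_samples=2` the same record is labeled `-1` (noise).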

class pudl.analysis.record_linkage.link_cross_year.SplitClustersConfig(**config_dict)

Bases: dagster.Config

Configuration for AgglomerativeClustering used to split overmerged clusters.

distance_threshold: float = 0.5

pudl.analysis.record_linkage.link_cross_year.split_clusters(config: SplitClustersConfig, distance_matrix: DistanceMatrix, id_year_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) → pandas.DataFrame

Split clusters with multiple records from same report_year.

DBSCAN will sometimes match records from the same report year, which breaks the assumption that there should only be one record for each entity in a single report year. To fix this, agglomerative clustering is applied to each such cluster. Agglomerative clustering could replace DBSCAN in the initial linkage step and avoid these matches in the first place, but it is very inefficient on a large number of records, so applying it only to the smaller sets of overmerged records is much faster and uses much less memory.

class pudl.analysis.record_linkage.link_cross_year.MatchOrphanedRecordsConfig(**config_dict)

Bases: dagster.Config

Configuration for match_orphaned_records() op.

distance_threshold: float = 0.5

pudl.analysis.record_linkage.link_cross_year.match_orphaned_records(config: MatchOrphanedRecordsConfig, distance_matrix: DistanceMatrix, id_year_df: pandas.DataFrame, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker) → pandas.DataFrame

DBSCAN assigns ‘noisy’ records a label of ‘-1’, which will be labeled by this step.

To label orphaned records, points are separated into clusters, with each orphaned record treated as a cluster of a single point. A distance matrix holding the average distance between each pair of clusters is then computed and used in a round of agglomerative clustering. This matches orphaned records to existing clusters, or assigns them unique IDs if they are not close enough to any existing cluster.

pudl.analysis.record_linkage.link_cross_year.link_ids_cross_year(df: pandas.DataFrame, feature_matrix: pudl.analysis.record_linkage.embed_dataframe.FeatureMatrix, experiment_tracker: pudl.analysis.ml_tools.experiment_tracking.ExperimentTracker)

Apply model and return column of estimated record labels.

Copyright © 2016-2025, Catalyst Cooperative, CC-BY-4.0