pudl.analysis.record_linkage.name_cleaner

This module contains the implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.

Attributes

Classes

LegalTermLocation

The location of the legal terms within the name string.

Lettercase

Allowed cases for output strings.

HandleLegalTerms

Whether to leave, remove, or normalize legal terms.

CompanyNameCleaner

Class to normalize/clean up text-based company names.

Functions

_get_legal_terms_dict(→ dict[str, list])

Module Contents

pudl.analysis.record_linkage.name_cleaner.logger
pudl.analysis.record_linkage.name_cleaner.CLEANING_RULES_DICT
pudl.analysis.record_linkage.name_cleaner.DEFAULT_CLEANING_RULES_LIST = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...
pudl.analysis.record_linkage.name_cleaner.NAME_JSON_ENTRY_LEGAL_TERMS = 'legal_forms'
class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)

Bases: enum.Enum

The location of the legal terms within the name string.

AT_THE_END = 1
ANYWHERE = 2
class pudl.analysis.record_linkage.name_cleaner.Lettercase(*args, **kwds)

Bases: enum.Enum

Allowed cases for output strings.

LOWER = 1
TITLE = 2
UPPER = 3
class pudl.analysis.record_linkage.name_cleaner.HandleLegalTerms(*args, **kwds)

Bases: enum.Enum

Whether to leave, remove, or normalize legal terms.

NORMALIZE = 3
LEAVE_AS_IS = 1
REMOVE = 2
class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)

Bases: pydantic.BaseModel

Class to normalize/clean up text-based company names.

cleaning_rules_list: list[str] = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...

A list of cleaning rules that the CompanyNameCleaner should apply.

The list is validated to ensure that every rule corresponds to an allowed cleaning function.

A flag to indicate how to handle legal terms (see HandleLegalTerms).

Options are to remove, normalize, or keep them as is.

place_word_the_at_beginning: bool = False

A flag to indicate whether to move ‘the’ to the start of a string.

If True, then if the word ‘the’ appears at the end of a string, remove it and place ‘the’ at the beginning of the string.
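The behavior described for this flag can be illustrated on a single plain string. This is only a sketch under stated assumptions: `move_the_to_beginning` and the regular expression are hypothetical stand-ins written for this example, not the code used by CompanyNameCleaner, which operates on a pandas.Series.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a name body followed
# by whitespace and/or commas and a final standalone "the".
_TRAILING_THE = re.compile(r"^(?P<body>.*?)[\s,]+the$", flags=re.IGNORECASE)


def move_the_to_beginning(name: str) -> str:
    """Move a trailing 'the' to the start of the name, if one is present."""
    match = _TRAILING_THE.match(name.strip())
    if match is None:
        # No trailing "the": return the input unchanged.
        return name
    return f"the {match.group('body')}"


print(move_the_to_beginning("southern company, the"))
```

Names that merely end in the letters "the" (for example "bathe") are left alone, because the pattern requires a separator before the final word.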

remove_unicode: bool = False

Define whether unicode characters should be removed from the names.

This cleaning rule is handled separately from the regex rules because it depends on the language of the names. For instance, Russian or Japanese names may contain unicode characters, while Portuguese and French company names may not.
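One common way to drop characters that have no ASCII representation is an encode/decode round trip. The sketch below assumes that this is roughly what the flag does; `remove_non_ascii` is a hypothetical helper for a single string, not a function from this module.

```python
def remove_non_ascii(name: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return name.encode("ascii", errors="ignore").decode("ascii")


# "\u00c9" and "\u00e9" are accented E characters; both are dropped entirely.
print(remove_non_ascii("\u00c9lectricit\u00e9 de France"))
```

Note that this approach deletes accented letters rather than replacing them, which is one reason accent removal is a separate option.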

output_lettercase: Lettercase

Define the letter case of the cleaning output.
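As an illustration of how the three documented Lettercase members could map onto string operations, here is a minimal sketch. The enum below is a local stand-in that mirrors the member names and values listed above, and `apply_lettercase` is a hypothetical helper, not part of this module.

```python
from enum import Enum


class Lettercase(Enum):
    """Local stand-in mirroring the documented enum members."""

    LOWER = 1
    TITLE = 2
    UPPER = 3


def apply_lettercase(name: str, case: Lettercase) -> str:
    """Return the name converted to the requested letter case."""
    if case is Lettercase.LOWER:
        return name.lower()
    if case is Lettercase.TITLE:
        return name.title()
    return name.upper()


print(apply_lettercase("duke energy corp", Lettercase.TITLE))
```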

legal_term_location: LegalTermLocation

Indicates where in the string legal terms are found.

remove_accents: bool = False

Flag to indicate whether to remove accents from strings.

If True, replace accented letters with their unaccented equivalents.
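A standard-library way to replace accented letters with unaccented ones is to decompose each character and discard the combining marks. The sketch below assumes this technique for illustration; `strip_accents` is a hypothetical helper and may differ from what the class does internally.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("S\u00e3o Paulo \u00c9lectricit\u00e9"))
```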

legal_terms_dict: dict[str, list] = None
_validate_cleaning_rules() → Self
_apply_regex_rules(col: pandas.Series, dict_regex_rules: dict[str, list[str]]) → pandas.Series

Applies several cleaning rules based on a custom dictionary.

The dictionary must contain cleaning rules written in regex format.

Parameters:
  • col (pd.Series) – The column that needs to be cleaned.

  • dict_regex_rules (dict) – a dictionary of cleaning rules written as regular expressions, in the format [rule name] : [‘replacement’, ‘regex rule’]

Returns:

the modified/cleaned column.

Return type:

(pd.Series)
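The rule format described above, [rule name] : [‘replacement’, ‘regex rule’], can be sketched with plain strings as follows. The rule names and patterns in `EXAMPLE_RULES` are invented for this example (the contents of CLEANING_RULES_DICT are not shown here), and `apply_regex_rules` is a hypothetical list-based analogue of the Series-based method.

```python
import re

# Hypothetical rules for illustration only: rule name -> [replacement, regex].
EXAMPLE_RULES: dict[str, list[str]] = {
    "replace_ampersand_with_and": [" and ", r"\s*&\s*"],
    "remove_punctuation": ["", r"[.,;:!?]"],
    "collapse_whitespace": [" ", r"\s+"],
}


def apply_regex_rules(names: list[str], rules: dict[str, list[str]]) -> list[str]:
    """Apply every rule, in dictionary order, to every name."""
    cleaned = []
    for name in names:
        for replacement, pattern in rules.values():
            name = re.sub(pattern, replacement, name)
        # Trimming the result is an extra step added in this sketch.
        cleaned.append(name.strip())
    return cleaned


print(apply_regex_rules(["Smith & Sons,  Inc."], EXAMPLE_RULES))
```

Because the rules are applied in insertion order, a rule that collapses whitespace is best placed after rules that may introduce extra spaces.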

_remove_unicode_chars(col: pandas.Series) → pandas.Series

Removes unicode characters that cannot be represented in ASCII.

Parameters:

col (pd.Series) – series containing unicode characters.

Returns:

the corresponding input series without unicode characters.

Return type:

(pd.Series)

_move_the_to_beginning(col: pandas.Series) → pandas.Series
_apply_cleaning_rules(col: pandas.Series) → pandas.Series

Apply the cleaning rules from the dictionary of regex rules.

Apply the normalization of legal terms according to dictionary of regex rules.

Remove legal terms from a string.
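To make the difference between leaving, removing, and normalizing legal terms concrete, here is a rough sketch on lowercase strings with the legal term at the end of the name. Everything in it is an assumption made for illustration: the `LEGAL_TERMS` mapping is invented (the real legal_terms_dict contents are not shown here), the enum is a local stand-in mirroring the documented members, and `handle_trailing_legal_term` is a hypothetical helper.

```python
import re
from enum import Enum


class HandleLegalTerms(Enum):
    """Local stand-in mirroring the documented enum members."""

    LEAVE_AS_IS = 1
    REMOVE = 2
    NORMALIZE = 3


# Hypothetical mapping: normalized legal term -> abbreviated variants.
LEGAL_TERMS: dict[str, list[str]] = {
    "incorporated": ["inc", "incorp"],
    "limited": ["ltd"],
    "corporation": ["corp"],
}


def handle_trailing_legal_term(name: str, mode: HandleLegalTerms) -> str:
    """Leave, remove, or normalize a legal term found at the end of a name."""
    if mode is HandleLegalTerms.LEAVE_AS_IS:
        return name
    for normalized, variants in LEGAL_TERMS.items():
        for term in [normalized, *variants]:
            # Match the term only as a whole word at the end of the name.
            pattern = rf"\b{re.escape(term)}$"
            if re.search(pattern, name):
                new = normalized if mode is HandleLegalTerms.NORMALIZE else ""
                return re.sub(pattern, new, name).strip()
    return name


print(handle_trailing_legal_term("acme power inc", HandleLegalTerms.NORMALIZE))
```

This sketch only covers the case where the term sits at the end of the string; matching terms anywhere in the name would need a different pattern.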

get_clean_data(col: pandas.Series) → pandas.Series

Clean names and normalize legal terms.

Parameters:

col (pd.Series) – the column that is to be cleaned

Returns:

the clean version of the column

Return type:

clean_col (pd.Series)

apply_name_cleaning(df: pandas.DataFrame, return_as_dframe: bool = False) → pandas.DataFrame

Clean up text names in a dataframe.

Parameters:
  • df (dataframe) – the input dataframe containing the names to be cleaned

  • return_as_dframe (bool) – whether to return the cleaned data as a dataframe or series. Useful to return as a dataframe if used in a cleaning pipeline with no vectorization step after name cleaning. If multiple columns are passed in for cleaning then output will be a dataframe regardless of this parameter.

Returns:

the clean version of the input dataframe

Return type:

df (dataframe)