pudl.analysis.record_linkage.name_cleaner

This module contains the implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.

Attributes

logger

CLEANING_RULES_DICT

Classes

LegalTermLocation

The location of the legal terms within the name string.

CompanyNameCleaner

Class to normalize/clean up text-based company names.

Module Contents

pudl.analysis.record_linkage.name_cleaner.logger
pudl.analysis.record_linkage.name_cleaner.CLEANING_RULES_DICT
class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)

Bases: enum.Enum

The location of the legal terms within the name string.

AT_THE_END = 1
ANYWHERE = 2
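
To illustrate what the two locations could mean in practice, the sketch below builds a regex that matches a legal term either only at the end of a name or anywhere in it. The helper `build_legal_term_pattern` and the sample name are hypothetical and are not part of this module; the way PUDL actually uses the enum may differ.

```python
import re
from enum import Enum


class LegalTermLocation(Enum):
    """Local copy of the enum documented above, so the sketch is self-contained."""

    AT_THE_END = 1
    ANYWHERE = 2


def build_legal_term_pattern(term: str, location: LegalTermLocation) -> re.Pattern:
    """Hypothetical helper: match ``term`` as a whole word at the chosen location."""
    pattern = rf"\b{re.escape(term)}\b"
    if location is LegalTermLocation.AT_THE_END:
        # Only accept the term when nothing but whitespace follows it.
        pattern += r"\s*$"
    return re.compile(pattern)


name = "inc power holdings inc"
at_end = build_legal_term_pattern("inc", LegalTermLocation.AT_THE_END)
anywhere = build_legal_term_pattern("inc", LegalTermLocation.ANYWHERE)

end_matches = len(at_end.findall(name))    # only the trailing "inc"
any_matches = len(anywhere.findall(name))  # both occurrences of "inc"
```

With AT_THE_END only the trailing term is a candidate for normalization, which avoids touching a word that merely looks like a legal term earlier in the name.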
class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)

Bases: pydantic.BaseModel

Class to normalize/clean up text-based company names.

__NAME_JSON_ENTRY_LEGAL_TERMS = 'legal_forms'
cleaning_rules_list: list[str] = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...
remove_unicode: bool = False
output_lettercase: Literal['lower', 'title'] = 'lower'
legal_term_location: LegalTermLocation
remove_accents: bool = False
_apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) → str

Applies several cleaning rules based on a custom dictionary.

The dictionary must contain cleaning rules written in regex format.

Parameters:
  • str_value (str) – the string value to be cleaned up.

  • dict_regex_rules (dict) – a dictionary of cleaning rules written in regex, with the format [rule name] : [‘replacement’, ‘regex rule’]

Returns:

the modified/cleaned value.

Return type:

(str)
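
As a rough illustration of the rule format described above, the following sketch applies a dictionary of `[rule name] : ['replacement', 'regex rule']` entries in insertion order. The rule names and patterns in `EXAMPLE_RULES` are invented for this example and are not taken from CLEANING_RULES_DICT, and the standalone function is only an approximation of what the private method might do.

```python
import re

# Hypothetical rules in the documented format: rule name -> [replacement, regex rule].
EXAMPLE_RULES: dict[str, list[str]] = {
    "replace_ampersand_by_and": [" and ", r"\s*&\s*"],
    "remove_punctuation": ["", r"[^\w\s]"],
    "collapse_whitespace": [" ", r"\s+"],
}


def apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) -> str:
    """Apply each rule in turn; dicts preserve insertion order in Python 3.7+."""
    clean_value = str_value
    for replacement, regex_rule in dict_regex_rules.values():
        clean_value = re.sub(regex_rule, replacement, clean_value)
    return clean_value


cleaned = apply_regex_rules("Smith  & Sons, Inc.", EXAMPLE_RULES)
print(cleaned)  # Smith and Sons Inc
```

Because each rule operates on the output of the previous one, the order of the entries matters: here whitespace is collapsed last so that gaps left by earlier substitutions are cleaned up.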

_remove_unicode_chars(value: str) → str

Removes Unicode characters that become unreadable when converted to ASCII.

Parameters:

value (str) – any string containing Unicode characters.

Returns:

the input string with those Unicode characters removed.

Return type:

(str)
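
One common way to drop characters that have no ASCII equivalent is to decompose the string and then encode it to ASCII while ignoring errors. The sketch below uses that approach as an assumption; it is not confirmed to be how this method is implemented. Note that this particular technique also strips accents, whereas the class exposes a separate remove_accents option, so the real behavior may differ.

```python
import unicodedata


def remove_unicode_chars(value: str) -> str:
    """Illustrative stand-in: keep only what survives a round trip through ASCII."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    # Encoding with errors="ignore" silently drops every non-ASCII code point,
    # including the combining marks produced by the decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = remove_unicode_chars("Hydro-Québec✓")
print(result)  # Hydro-Quebec
```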

_apply_cleaning_rules(company_name: str) → str

Apply the cleaning rules from the dictionary of regex rules.

Apply the normalization of legal terms according to the dictionary of regex rules.

get_clean_data(company_name: str) → str

Clean a name and normalize legal terms.

If company_name is null or not a string, pd.NA is returned.

Parameters:

company_name (str) – the original text

Returns:

the clean version of the text

Return type:

clean_company_name (str)

apply_name_cleaning(df: pandas.DataFrame, return_as_dframe: bool = False) → pandas.DataFrame

Clean up text names in a dataframe.

Parameters:
  • df (dataframe) – the input dataframe containing the text names to be cleaned

  • return_as_dframe (bool) – whether to return the cleaned data as a dataframe or a series. Returning a dataframe is useful when this method is used in a cleaning pipeline with no vectorization step after name cleaning. If multiple columns are passed in for cleaning, the output will be a dataframe regardless of this parameter.

Returns:

the clean version of the input dataframe

Return type:

df (dataframe)