pudl.analysis.record_linkage.name_cleaner

This module contains the implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.

Module Contents

Classes

LegalTermLocation

The location of the legal terms within the name string.

CompanyNameCleaner

Class to normalize/clean up text-based company names.

Attributes

pudl.analysis.record_linkage.name_cleaner.logger
pudl.analysis.record_linkage.name_cleaner.CLEANING_RULES_DICT
class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)

Bases: enum.Enum

The location of the legal terms within the name string.

AT_THE_END = 1
ANYWHERE = 2
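
As a rough illustration of how such a location setting could influence matching, the sketch below builds a regex for a legal term depending on the chosen member. The enum mirrors the two documented members; `legal_term_pattern` is a hypothetical helper introduced here for illustration and is not part of the module.

```python
import re
from enum import Enum


class LegalTermLocation(Enum):
    """Mirrors the two members documented above."""

    AT_THE_END = 1
    ANYWHERE = 2


def legal_term_pattern(term: str, location: LegalTermLocation) -> re.Pattern:
    """Hypothetical helper: build a regex that finds `term` per `location`."""
    escaped = re.escape(term)
    if location is LegalTermLocation.AT_THE_END:
        # Only match when the term closes the name (trailing spaces allowed).
        return re.compile(rf"\b{escaped}\s*$")
    # Otherwise match the term as a whole word wherever it occurs.
    return re.compile(rf"\b{escaped}\b")
```

Under this assumption, `AT_THE_END` would leave a term like "inc" untouched in "acme inc holdings", while `ANYWHERE` would match it.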
class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)

Bases: pydantic.BaseModel

Class to normalize/clean up text-based company names.

__NAME_JSON_ENTRY_LEGAL_TERMS = 'legal_forms'
cleaning_rules_list: list[str] = ['replace_amperstand_between_space_by_AND', 'replace_hyphen_between_spaces_by_single_space',...
remove_unicode: bool = False
output_lettercase: Literal['lower', 'title'] = 'lower'
legal_term_location: LegalTermLocation
remove_accents: bool = False
_apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) → str

Applies several cleaning rules based on a custom dictionary.

The dictionary must contain cleaning rules written in regex format.

Parameters:
  • str_value (str) – the string value to be cleaned up.

  • dict_regex_rules (dict) – a dictionary of cleaning rules written in regex with the format [rule name] : [‘replacement’, ‘regex rule’]

Returns:

the modified/cleaned value.

Return type:

(str)
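
The rule format described above can be sketched as a standalone function. This is a minimal illustration that assumes rules are applied one after another in dictionary order; the function name without the leading underscore and the `example_rules` entries are hypothetical and are not the contents of `CLEANING_RULES_DICT`.

```python
import re


def apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) -> str:
    """Apply each [replacement, regex rule] pair in insertion order."""
    cleaned = str_value
    for _rule_name, (replacement, pattern) in dict_regex_rules.items():
        # Each rule rewrites the output of the previous one.
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned


# Illustrative rules only, in the documented format:
# [rule name] : ['replacement', 'regex rule']
example_rules = {
    "replace_ampersand_by_and": [" and ", r"\s*&\s*"],
    "collapse_spaces": [" ", r"\s+"],
}

print(apply_regex_rules("Smith  &  Sons", example_rules))  # Smith and Sons
```

Because the rules are chained in this sketch, their order matters: a later rule sees the text already modified by earlier ones.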

_remove_unicode_chars(value: str) → str

Removes Unicode characters that are unreadable when converted to ASCII format.

Parameters:

value (str) – any string containing unicode characters.

Returns:

the corresponding input string without unicode characters.

Return type:

(str)
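
One common way to get this effect is an encode/decode round trip that ignores anything outside ASCII. The sketch below shows that technique under the assumption that dropping the characters is the desired behavior; it is not necessarily how the method is implemented.

```python
def remove_unicode_chars(value: str) -> str:
    """Drop every character that has no ASCII encoding."""
    # "ignore" silently discards characters that cannot be encoded.
    return value.encode("ascii", "ignore").decode("ascii")


print(remove_unicode_chars("Acme™ Energy"))  # Acme Energy
```

Note that this approach discards accented letters outright rather than transliterating them, so "Électricité" would become "lectricit".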

_apply_cleaning_rules(company_name: str) → str

Apply the cleaning rules from the dictionary of regex rules.

Apply the normalization of legal terms according to the dictionary of regex rules.

get_clean_data(company_name: str) → str

Clean a name and normalize legal terms.

If company_name is null or not a string value, pd.NA will be returned.

Parameters:

company_name (str) – the original text

Returns:

the clean version of the text

Return type:

clean_company_name (str)
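
The null-handling contract described above can be sketched as follows. This is a simplified stand-in, not the method itself: `get_clean_data_sketch` is a hypothetical name, the single whitespace rule is a placeholder for the real cleaning rules, and `MISSING` falls back to `None` only so the example runs without pandas installed.

```python
import re

try:
    import pandas as pd

    MISSING = pd.NA
except ImportError:  # fallback so the sketch runs without pandas
    MISSING = None


def get_clean_data_sketch(company_name, output_lettercase="lower"):
    """Return a cleaned name, or the missing-value marker for non-strings."""
    if not isinstance(company_name, str):
        # Covers None, NaN, numbers, and any other non-string input.
        return MISSING
    # Placeholder rule: collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", company_name).strip()
    return cleaned.title() if output_lettercase == "title" else cleaned.lower()
```

Returning a missing-value marker instead of raising keeps the function safe to apply element-wise over a column that mixes strings and nulls.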

apply_name_cleaning(df: pandas.DataFrame) → pandas.Series

Clean up text names in a dataframe.

Parameters:
  • df (dataframe) – the input dataframe containing the names to be cleaned

  • in_company_name_attribute (str) – the attribute in the dataframe that contains the names

  • out_company_name_attribute (str) – the attribute to be created for the cleaned version of the names

Returns:

the clean version of the input dataframe

Return type:

df (dataframe)