`pudl.analysis.record_linkage.name_cleaner`

This module contains the implementation of the `CompanyNameCleaner` class from OS-Climate’s financial-entity-cleaner package.

Module Contents

Classes

`LegalTermLocation`	The location of the legal terms within the name string.
`CompanyNameCleaner`	Class to normalize/clean up text-based company names.

Attributes

`logger`
`CLEANING_RULES_DICT`

pudl.analysis.record_linkage.name_cleaner.logger

pudl.analysis.record_linkage.name_cleaner.CLEANING_RULES_DICT

class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)

Bases: enum.Enum

The location of the legal terms within the name string.

AT_THE_END = 1

ANYWHERE = 2
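To illustrate what the two members might control, here is a minimal sketch. The enum mirrors the members documented above; `build_legal_term_pattern` is a hypothetical helper (not part of this module) that assumes the location decides whether a legal term's regex is anchored to the end of the name or allowed to match at any word boundary.

```python
import re
from enum import Enum


class LegalTermLocation(Enum):
    """Mirror of the documented enum members."""

    AT_THE_END = 1
    ANYWHERE = 2


def build_legal_term_pattern(term: str, location: LegalTermLocation) -> str:
    """Hypothetical helper: build a regex for a legal term.

    Assumption: AT_THE_END anchors the term to the end of the string,
    while ANYWHERE matches it as a whole word at any position.
    """
    escaped = re.escape(term)
    if location is LegalTermLocation.AT_THE_END:
        return rf"\b{escaped}$"
    return rf"\b{escaped}\b"


name = "ltd holdings ltd"
at_end = re.sub(
    build_legal_term_pattern("ltd", LegalTermLocation.AT_THE_END), "limited", name
)
anywhere = re.sub(
    build_legal_term_pattern("ltd", LegalTermLocation.ANYWHERE), "limited", name
)
```

With the end-anchored pattern only the trailing term is rewritten; with the unanchored pattern every whole-word occurrence is.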

class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)

Bases: pydantic.BaseModel

Class to normalize/clean up text-based company names.

__NAME_LEGAL_TERMS_DICT_FILE = 'us_legal_forms.json'

__NAME_JSON_ENTRY_LEGAL_TERMS = 'legal_forms'

cleaning_rules_list: list[str] = ['replace_amperstand_between_space_by_AND', 'replace_hyphen_between_spaces_by_single_space',...

normalize_legal_terms: bool = True

remove_unicode: bool = False

output_lettercase: Literal['lower', 'title'] = 'lower'

legal_term_location: LegalTermLocation

remove_accents: bool = False

_apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) → str

Applies several cleaning rules based on a custom dictionary.

The dictionary must contain cleaning rules written in regex format.

Parameters:

str_value (str) – any value as string to be cleaned up.
dict_regex_rules (dict) – a dictionary of cleaning rules written in regex, in the format [rule name] : [‘replacement’, ‘regex rule’]

Returns:

the modified/cleaned value.

Return type:

(str)
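The rule format described above can be sketched as follows. This is an illustrative stand-in rather than the module's actual implementation: `apply_regex_rules` is a standalone function written for this example, and the two rule names and patterns in `example_rules` are made up to show the `[rule name] : [replacement, regex rule]` layout. It assumes rules are applied in dictionary order.

```python
import re


def apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) -> str:
    """Apply each [replacement, regex rule] pair in turn to the input string."""
    clean_value = str_value
    for _rule_name, (replacement, regex_rule) in dict_regex_rules.items():
        # Every match of the rule's pattern is replaced by its replacement text.
        clean_value = re.sub(regex_rule, replacement, clean_value)
    return clean_value


# Hypothetical rules, for illustration only.
example_rules = {
    "replace_ampersand_by_and": [" and ", r"\s*&\s*"],
    "collapse_whitespace": [" ", r"\s+"],
}

cleaned = apply_regex_rules("Smith  &  Sons", example_rules)
```

Because each rule operates on the output of the previous one, the order of entries in the dictionary can change the result.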

_remove_unicode_chars(value: str) → str

Removes Unicode characters that are unreadable when converted to ASCII format.

Parameters:

value (str) – any string containing unicode characters.

Returns:

the corresponding input string without unicode characters.

Return type:

(str)
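One common way to get this behaviour is to encode to ASCII while ignoring characters that cannot be represented. The sketch below shows that approach; `remove_non_ascii` is a name chosen for this example, and the method documented above may be implemented differently.

```python
def remove_non_ascii(value: str) -> str:
    """Drop every character that has no ASCII representation."""
    # errors="ignore" silently discards characters that cannot be encoded.
    return value.encode("ascii", errors="ignore").decode("ascii")


stripped = remove_non_ascii("Café Ltd")
```

Note that this drops accented letters entirely rather than mapping them to their unaccented form.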

_apply_cleaning_rules(company_name: str) → str

Apply the cleaning rules from the dictionary of regex rules.

_apply_normalization_of_legal_terms(company_name: str) → str

Apply the normalization of legal terms according to the dictionary of regex rules.

get_clean_data(company_name: str) → str

Clean a name and normalize legal terms.

If company_name is null or not a string value, pd.NA will be returned.

Parameters:

company_name (str) – the original text

Returns:

the clean version of the text

Return type:

clean_company_name (str)

apply_name_cleaning(df: pandas.DataFrame) → pandas.Series

Clean up text names in a dataframe.