`pudl.analysis.record_linkage.name_cleaner`¶

This module contains an implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.

Module Contents¶

Classes¶

`LegalTermLocation`	The location of the legal terms within the name string.
`CompanyNameCleaner`	Class to normalize/clean up text-based company names.

Attributes¶

`logger`
`CLEANING_RULES_DICT`

pudl.analysis.record_linkage.name_cleaner.logger[source]¶

pudl.analysis.record_linkage.name_cleaner.CLEANING_RULES_DICT[source]¶

class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)[source]¶

Bases: enum.Enum

The location of the legal terms within the name string.

AT_THE_END = 1[source]¶

ANYWHERE = 2[source]¶
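
As a rough illustration of why the location of a legal term matters, the sketch below anchors a regex to the end of the string for `AT_THE_END` and leaves it unanchored for `ANYWHERE`. This is one plausible reading of the two members, not the module's actual logic: the enum here is a local copy of the documented members, and `replace_legal_term` is a hypothetical helper.

```python
import re
from enum import Enum


class LegalTermLocation(Enum):
    """Local copy of the documented members, for illustration only."""

    AT_THE_END = 1
    ANYWHERE = 2


def replace_legal_term(
    name: str, term: str, replacement: str, location: LegalTermLocation
) -> str:
    """Hypothetical helper: replace a whole-word legal term in a name."""
    # Match the term only as a whole word.
    pattern = r"\b" + re.escape(term) + r"\b"
    if location is LegalTermLocation.AT_THE_END:
        # Assumption: AT_THE_END restricts matches to the end of the name.
        pattern += r"\s*$"
    return re.sub(pattern, replacement, name)


at_end = replace_legal_term(
    "inc builders inc", "inc", "incorporated", LegalTermLocation.AT_THE_END
)
anywhere = replace_legal_term(
    "inc builders inc", "inc", "incorporated", LegalTermLocation.ANYWHERE
)
```

With the end anchor only the trailing `inc` is rewritten; without it, both occurrences are.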

class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)[source]¶

Bases: pydantic.BaseModel

Class to normalize/clean up text-based company names.

__NAME_LEGAL_TERMS_DICT_FILE = 'us_legal_forms.json'[source]¶

__NAME_JSON_ENTRY_LEGAL_TERMS = 'legal_forms'[source]¶

cleaning_rules_list: list[str] = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...[source]¶

normalize_legal_terms: bool = True[source]¶

remove_unicode: bool = False[source]¶

output_lettercase: Literal['lower', 'title'] = 'lower'[source]¶

legal_term_location: LegalTermLocation[source]¶

remove_accents: bool = False[source]¶

_apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) → str[source]¶

Applies several cleaning rules based on a custom dictionary.

The dictionary must contain cleaning rules written in regex format.

Parameters:

str_value (str) – the string value to be cleaned up.
dict_regex_rules (dict) – a dictionary of cleaning rules written as regular expressions, in the format [rule name]: [‘replacement’, ‘regex rule’]

Returns:

the modified/cleaned value.

Return type:

(str)
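
The rule format described above can be sketched with `re.sub`. This is a minimal illustration under stated assumptions, not the method's actual body: `apply_regex_rules` is a standalone stand-in, the rules are applied in dictionary order, and the rule names and patterns in `example_rules` are hypothetical rather than taken from `CLEANING_RULES_DICT`.

```python
import re


def apply_regex_rules(value: str, rules: dict[str, list[str]]) -> str:
    """Apply each [replacement, pattern] rule to ``value`` in dictionary order."""
    cleaned = value
    for _rule_name, (replacement, pattern) in rules.items():
        # Each rule is stored as [replacement, regex], matching the documented format.
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned


# Hypothetical rules for illustration only.
example_rules = {
    "replace_ampersand_by_and": [" and ", r"\s*&\s*"],
    "remove_punctuation": ["", r"[.,]"],
    "collapse_whitespace": [" ", r"\s+"],
}

cleaned_name = apply_regex_rules("Smith  & Sons, Inc.", example_rules)
print(cleaned_name)  # Smith and Sons Inc
```

Because later rules see the output of earlier ones, the order of the dictionary entries affects the result.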

_remove_unicode_chars(value: str) → str[source]¶

Removes Unicode characters that cannot be represented in ASCII.

Parameters:

value (str) – any string containing Unicode characters.

Returns:

the input string with the non-ASCII characters removed.

Return type:

(str)
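
One common way to drop characters that have no ASCII form is an encode/decode round trip. The sketch below assumes that approach; the source does not confirm how `_remove_unicode_chars` is implemented, and both helper names are hypothetical. The second helper shows accent stripping via NFKD decomposition for contrast, which may or may not correspond to the `remove_accents` option.

```python
import unicodedata


def remove_non_ascii(value: str) -> str:
    """Drop every character that has no ASCII representation."""
    return value.encode("ascii", "ignore").decode("ascii")


def strip_accents(value: str) -> str:
    """Keep base letters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


dropped = remove_non_ascii("Café Zürich™ Ltd")
unaccented = strip_accents("Café Zürich")
print(dropped)     # Caf Zrich Ltd
print(unaccented)  # Cafe Zurich
```

The plain round trip deletes accented letters outright, so stripping accents first preserves more of the name when both steps are wanted.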

_apply_cleaning_rules(company_name: str) → str[source]¶

Apply the cleaning rules from the dictionary of regex rules.

_apply_normalization_of_legal_terms(company_name: str) → str[source]¶

Apply the normalization of legal terms according to the dictionary of regex rules.

get_clean_data(company_name: str) → str[source]¶

Clean a name and normalize legal terms.

If company_name is null or not a string value, pd.NA will be returned.

Parameters:

company_name (str) – the original text

Returns:

the clean version of the text

Return type:

clean_company_name (str)

apply_name_cleaning(df: pandas.DataFrame, return_as_dframe: bool = False) → pandas.DataFrame[source]¶

Clean up text names in a dataframe.