pudl.analysis.record_linkage.name_cleaner
This module contains the implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.
Module Contents#
Classes#
LegalTermLocation | The location of the legal terms within the name string.
CompanyNameCleaner | Class to normalize/clean up text-based company names.
Attributes#
- class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)[source]#
Bases: enum.Enum
The location of the legal terms within the name string.
- class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)[source]#
Bases: pydantic.BaseModel
Class to normalize/clean up text-based company names.
- cleaning_rules_list: list[str] = ['replace_amperstand_between_space_by_AND', 'replace_hyphen_between_spaces_by_single_space',...[source]#
- legal_term_location: LegalTermLocation[source]#
- _apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) -> str [source]#
Applies several cleaning rules based on a custom dictionary.
The dictionary must contain cleaning rules written in regex format.
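As a rough illustration of a regex-driven cleaning pass of this kind, the sketch below applies each rule in turn with `re.sub`. The rule names, the `[pattern, replacement]` layout of each list, and the function name `apply_regex_rules_sketch` are assumptions made for illustration; they are not taken from the module.

```python
import re

# Hypothetical rules: each maps a rule name to [pattern, replacement].
# This layout is an assumption; the module's real dictionary may differ.
EXAMPLE_RULES: dict[str, list[str]] = {
    "replace_ampersand_by_and": [r"\s&\s", " and "],
    "collapse_whitespace": [r"\s+", " "],
}


def apply_regex_rules_sketch(value: str, rules: dict[str, list[str]]) -> str:
    """Apply every [pattern, replacement] rule to value, in dict order."""
    for pattern, replacement in rules.values():
        value = re.sub(pattern, replacement, value)
    return value.strip()


cleaned = apply_regex_rules_sketch("Smith  &  Sons", EXAMPLE_RULES)
print(cleaned)  # Smith and Sons
```

Because the rules run in insertion order, a whitespace-collapsing rule placed last tidies up any extra spaces left behind by earlier substitutions.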
- _remove_unicode_chars(value: str) -> str [source]#
Removes Unicode characters that are unreadable when converted to ASCII format.
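One common way to strip characters that do not survive an ASCII conversion is to decompose them first and then drop whatever cannot be encoded. The sketch below shows that approach with the standard library. It is an assumption about how such a step could work, not a copy of the method's body, and `remove_unicode_chars_sketch` is a hypothetical name.

```python
import unicodedata


def remove_unicode_chars_sketch(value: str) -> str:
    """Decompose accented characters, then drop anything outside ASCII."""
    # NFKD splits e.g. "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", value)
    # Encoding with errors="ignore" silently discards non-ASCII code points.
    return decomposed.encode("ascii", "ignore").decode("ascii")


ascii_name = remove_unicode_chars_sketch("Électricité Montréal")
print(ascii_name)  # Electricite Montreal
```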
- _apply_cleaning_rules(company_name: str) -> str [source]#
Apply the cleaning rules from the dictionary of regex rules.
- _apply_normalization_of_legal_terms(company_name: str) -> str [source]#
Apply the normalization of legal terms according to the dictionary of regex rules.
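To make the idea of legal-term normalization concrete, here is a minimal sketch that maps spelled-out terms to short forms, either only at the end of the name or anywhere in it (the distinction that LegalTermLocation describes). The term mapping, the `at_the_end_only` flag, and the function name are all hypothetical, and the sketch assumes the name has already been lower-cased.

```python
import re

# Hypothetical mapping from spelled-out legal terms to a normalized form.
LEGAL_TERMS_SKETCH = {
    "incorporated": "inc",
    "corporation": "corp",
    "limited": "ltd",
}


def normalize_legal_terms_sketch(name: str, at_the_end_only: bool = True) -> str:
    """Replace known legal terms at the end of the name, or anywhere in it."""
    # Anchor to the end of the string when only trailing terms should change.
    suffix = r"$" if at_the_end_only else r"\b"
    for term, normalized in LEGAL_TERMS_SKETCH.items():
        name = re.sub(rf"\b{re.escape(term)}{suffix}", normalized, name)
    return name


trailing = normalize_legal_terms_sketch("limited power corporation")
anywhere = normalize_legal_terms_sketch(
    "limited power corporation", at_the_end_only=False
)
print(trailing)  # limited power corp
print(anywhere)  # ltd power corp
```

Restricting replacement to the end of the name avoids rewriting words such as "limited" when they are part of the company's actual name rather than its legal form.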
- get_clean_data(company_name: str) -> str [source]#
Clean a name and normalize legal terms.
If company_name is null or not a string value, pd.NA will be returned.
- apply_name_cleaning(df: pandas.DataFrame) -> pandas.Series [source]#
Clean up text names in a dataframe.
- Parameters:
- Returns: the clean version of the input dataframe
- Return type: df (dataframe)