pudl.analysis.record_linkage.name_cleaner
This module contains an implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.
Attributes
Classes
LegalTermLocation | The location of the legal terms within the name string.
CompanyNameCleaner | Class to normalize/clean up text-based company names.
Module Contents
- class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)
Bases: enum.Enum
The location of the legal terms within the name string.
- class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)
Bases: pydantic.BaseModel
Class to normalize/clean up text-based company names.
- cleaning_rules_list: list[str] = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...
- legal_term_location: LegalTermLocation
- _apply_regex_rules(str_value: str, dict_regex_rules: dict[str, list[str]]) → str
Applies several cleaning rules based on a custom dictionary.
The dictionary must contain cleaning rules written in regex format.
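As an illustration of dictionary-driven regex cleaning, here is a minimal sketch. It assumes each rule name maps to a two-item list of `[pattern, replacement]`; that layout, the rule names in `EXAMPLE_RULES`, and the helper `apply_regex_rules` are hypothetical and not taken from the module itself.

```python
import re

# Hypothetical rules: each name maps to [pattern, replacement].
# The real module's dictionary layout and rule set may differ.
EXAMPLE_RULES: dict[str, list[str]] = {
    "replace_ampersand_by_and": [r"&", " and "],
    "remove_punctuation": [r"[^\w\s]", ""],
    "collapse_whitespace": [r"\s+", " "],
}


def apply_regex_rules(value: str, rules: dict[str, list[str]]) -> str:
    """Apply every rule in insertion order and trim the result."""
    for pattern, replacement in rules.values():
        value = re.sub(pattern, replacement, value)
    return value.strip()


print(apply_regex_rules("Smith & Sons, Inc.", EXAMPLE_RULES))
```

Because Python dictionaries preserve insertion order, rules run in the order they are declared, which matters here: whitespace is collapsed only after the other substitutions have introduced extra spaces.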
- _remove_unicode_chars(value: str) → str
Remove Unicode characters that are unreadable when converted to ASCII.
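One common way to do this kind of conversion with the standard library is sketched below. It is an assumption about the general technique, not a copy of this method's body: the text is decomposed with NFKD normalization, then anything that still cannot be encoded as ASCII is dropped. The function name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(value: str) -> str:
    """Strip accents and drop characters with no ASCII equivalent."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    # Encoding with errors="ignore" discards the marks and any other non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Électricité de France"))
```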
- _apply_cleaning_rules(company_name: str) → str
Apply the cleaning rules from the dictionary of regex rules.
- _apply_normalization_of_legal_terms(company_name: str) → str
Apply the normalization of legal terms according to the dictionary of regex rules.
- get_clean_data(company_name: str) → str
Clean a name and normalize legal terms.
If company_name is null or not a string value, pd.NA will be returned.
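The overall flow (guard against non-string input, apply cleaning rules, then normalize legal terms) can be sketched as follows. This is a simplified stand-in and not the class's actual code: the `LEGAL_TERMS` table, the lowercasing step, and the function `clean_name` are all hypothetical, and `None` is used in place of `pd.NA` so that the example needs only the standard library.

```python
import re

# Hypothetical abbreviation table; the module's real legal-term
# dictionary is not reproduced here.
LEGAL_TERMS = {
    "inc": "incorporated",
    "corp": "corporation",
    "co": "company",
    "ltd": "limited",
}


def clean_name(company_name, missing=None):
    """Return a cleaned name, or `missing` for null / non-string input."""
    # Guard: anything that is not a string yields the missing marker
    # (the documented method returns pd.NA in this case).
    if not isinstance(company_name, str):
        return missing
    # Cleaning rules (assumed): lowercase, drop punctuation, collapse spaces.
    name = company_name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    # Legal-term normalization: expand known abbreviations word by word.
    return " ".join(LEGAL_TERMS.get(word, word) for word in name.split(" "))


print(clean_name("Duke Energy Corp."))
print(clean_name(None))
```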
- apply_name_cleaning(df: pandas.DataFrame, return_as_dframe: bool = False) → pandas.DataFrame
Clean up text names in a dataframe.
- Parameters:
df (dataframe) – the input dataframe that contains the text names to be cleaned
return_as_dframe (bool) – whether to return the cleaned data as a dataframe or series. Useful to return as a dataframe if used in a cleaning pipeline with no vectorization step after name cleaning. If multiple columns are passed in for cleaning then output will be a dataframe regardless of this parameter.
- Returns:
the clean version of the input dataframe
- Return type:
df (dataframe)