pudl.analysis.record_linkage.name_cleaner
This module contains the implementation of the CompanyNameCleaner class from OS-Climate’s financial-entity-cleaner package.
Attributes
Classes
LegalTermLocation | The location of the legal terms within the name string.
Lettercase | Allowed cases for output strings.
HandleLegalTerms | Whether to leave, remove, or normalize legal terms.
CompanyNameCleaner | Class to normalize/clean up text-based company names.
Functions
|
Module Contents
- pudl.analysis.record_linkage.name_cleaner.DEFAULT_CLEANING_RULES_LIST = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...
- pudl.analysis.record_linkage.name_cleaner.NAME_LEGAL_TERMS_DICT_FILE = 'us_legal_forms.json'
- class pudl.analysis.record_linkage.name_cleaner.LegalTermLocation(*args, **kwds)
Bases:
enum.Enum
The location of the legal terms within the name string.
- class pudl.analysis.record_linkage.name_cleaner.Lettercase(*args, **kwds)
Bases:
enum.Enum
Allowed cases for output strings.
- class pudl.analysis.record_linkage.name_cleaner.HandleLegalTerms(*args, **kwds)
Bases:
enum.Enum
Whether to leave, remove, or normalize legal terms.
- class pudl.analysis.record_linkage.name_cleaner.CompanyNameCleaner(/, **data: Any)
Bases:
pydantic.BaseModel
Class to normalize/clean up text-based company names.
- cleaning_rules_list: list[str] = ['remove_word_the_from_the_end', 'remove_word_the_from_the_beginning',...
A list of cleaning rules that the CompanyNameCleaner should apply.
The list is validated to ensure that every rule corresponds to an allowed cleaning function.
- handle_legal_terms: HandleLegalTerms
A flag to indicate how to handle legal terms.
Options are to remove, normalize, or keep them as is.
- place_word_the_at_beginning: bool = False
A flag to indicate whether to move ‘the’ to the start of a string.
If True and the word ‘the’ appears at the end of a string, it is removed from the end and placed at the beginning of the string.
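The behavior described for place_word_the_at_beginning can be sketched with a single regular expression. This is a minimal illustration, not the module's implementation: the helper name move_trailing_the, the pattern, and the choice to also drop a comma before the trailing ‘the’ are all assumptions.

```python
import re

import pandas as pd

# Assumed pattern: a name followed by whitespace and/or commas and a final "the".
_TRAILING_THE = re.compile(r"^(?P<name>.*?)[\s,]+the$", flags=re.IGNORECASE)


def move_trailing_the(col: pd.Series) -> pd.Series:
    """Move a trailing 'the' to the start of each name (hypothetical helper)."""
    # A callable replacement rebuilds the string with "the" in front.
    return col.str.replace(
        _TRAILING_THE,
        lambda match: f"the {match.group('name')}",
        regex=True,
    )


names = pd.Series(["duke energy corp, the", "the southern company", "ohio power the"])
moved = move_trailing_the(names)
```

Names that do not end in a standalone ‘the’ pass through unchanged, so the helper is safe to apply to a whole column.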
- remove_unicode: bool = False
Define whether unicode characters should be removed from the names.
This cleaning rule is handled separately from the regex rules because it depends on the language of the names. For instance, Russian or Japanese names may contain unicode characters, while Portuguese and French company names may not.
- output_lettercase: Lettercase
Define the letter case of the cleaning output.
- legal_term_location: LegalTermLocation
Indicates where in the string legal terms are found.
- remove_accents: bool = False
Flag to indicate whether to remove accents from strings.
If True, replace accented letters with their unaccented equivalents.
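One common way to replace accented letters with unaccented ones is Unicode decomposition followed by dropping the combining marks. The sketch below shows that general technique; the helper name strip_accents is hypothetical and the module may implement remove_accents differently.

```python
import unicodedata

import pandas as pd


def strip_accents(col: pd.Series) -> pd.Series:
    """Replace accented letters with unaccented ones (hypothetical helper)."""

    def _strip(text):
        # Leave missing values and non-string entries untouched.
        if not isinstance(text, str):
            return text
        # NFKD splits "é" into "e" plus a combining accent mark.
        decomposed = unicodedata.normalize("NFKD", text)
        # Keep only the characters that are not combining marks.
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    return col.map(_strip)


accented = pd.Series(["Électricité de France", "São Paulo Energia"])
unaccented = strip_accents(accented)
```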
- _apply_regex_rules(col: pandas.Series, dict_regex_rules: dict[str, list[str]]) → pandas.Series
Applies several cleaning rules based on a custom dictionary.
The dictionary must contain cleaning rules written in regex format.
- Parameters:
col (pd.Series) – The column that needs to be cleaned.
dict_regex_rules (dict) – a dictionary of cleaning rules written in regex with the format [rule name] : [‘replacement’, ‘regex rule’]
- Returns:
the modified/cleaned column.
- Return type:
(pd.Series)
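To make the [rule name] : [‘replacement’, ‘regex rule’] format concrete, the following sketch applies such a dictionary to a column, one rule at a time. The rule names and patterns in EXAMPLE_RULES and the function apply_regex_rules are illustrative assumptions, not the rules shipped with the module.

```python
import pandas as pd

# Hypothetical rules in the documented format: rule name -> [replacement, regex].
EXAMPLE_RULES = {
    "remove_punctuation": ["", r"[^\w\s]"],
    "collapse_whitespace": [" ", r"\s+"],
}


def apply_regex_rules(col: pd.Series, dict_regex_rules: dict) -> pd.Series:
    """Apply each [replacement, regex] rule in insertion order (sketch)."""
    cleaned = col
    for _rule_name, (replacement, pattern) in dict_regex_rules.items():
        cleaned = cleaned.str.replace(pattern, replacement, regex=True)
    return cleaned


raw = pd.Series(["Acme,  Inc.", "Big   Power & Light"])
cleaned = apply_regex_rules(raw, EXAMPLE_RULES)
```

Rule order matters in this sketch: removing punctuation first can leave doubled spaces, which the whitespace rule then collapses.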
- _remove_unicode_chars(col: pandas.Series) → pandas.Series
Removes unicode characters that are unreadable in ASCII format.
- Parameters:
col (pd.Series) – series containing unicode characters.
- Returns:
the corresponding input series without unicode characters.
- Return type:
(pd.Series)
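A simple way to drop characters that have no ASCII representation is an encode/decode round trip that ignores encoding errors. This is a sketch of that general approach under the assumption that plain removal is acceptable; the helper name drop_non_ascii is hypothetical and the method above may work differently.

```python
import pandas as pd


def drop_non_ascii(col: pd.Series) -> pd.Series:
    """Drop every character that cannot be encoded as ASCII (hypothetical helper)."""
    # Encoding with errors="ignore" silently discards non-ASCII characters;
    # decoding turns the resulting bytes back into strings.
    return col.str.encode("ascii", errors="ignore").str.decode("ascii")


mixed = pd.Series(["Газпром Energy", "東京 Electric", "Plain Co"])
ascii_only = drop_non_ascii(mixed)
```

With this approach an accented letter such as ‘é’ is removed entirely rather than converted to ‘e’, so any accent removal would need to run first.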
- _move_the_to_beginning(col: pandas.Series) → pandas.Series
- _apply_cleaning_rules(col: pandas.Series) → pandas.Series
Apply the cleaning rules from the dictionary of regex rules.
- _apply_normalization_of_legal_terms(col: pandas.Series) → pandas.Series
Apply the normalization of legal terms according to the dictionary of regex rules.
- _apply_removal_of_legal_terms(col: pandas.Series) → pandas.Series
Remove legal terms from a string.
- get_clean_data(col: pandas.Series) → pandas.Series
Clean names and normalize legal terms.
- Parameters:
col (pd.Series) – the column that is to be cleaned
- Returns:
the clean version of the column
- Return type:
clean_col (pd.Series)
- apply_name_cleaning(df: pandas.DataFrame, return_as_dframe: bool = False) → pandas.DataFrame
Clean up text names in a dataframe.
- Parameters:
df (dataframe) – the input dataframe that contains the names to be cleaned
return_as_dframe (bool) – whether to return the cleaned data as a dataframe or series. Useful to return as a dataframe if used in a cleaning pipeline with no vectorization step after name cleaning. If multiple columns are passed in for cleaning then output will be a dataframe regardless of this parameter.
- Returns:
the clean version of the input dataframe
- Return type:
df (dataframe)