pudl.metadata.helpers
Functions for manipulating metadata constants.
Module Contents
Functions
format_errors – Format multiple errors into a single error.
_parse_field_names – Parse field names.
_parse_foreign_key_rule – Parse foreign key rule from resource descriptor.
_build_foreign_key_tree – Build foreign key tree.
_traverse_foreign_key_tree – Traverse foreign key tree.
build_foreign_keys – Build foreign keys for each resource.
split_period – Split the time period from a column name.
expand_periodic_column_names – Add smaller periods to a list of column names.
most_frequent – Return most frequent value (or error if none exists).
most_and_more_frequent – Return most frequent value if more frequent than minimum (or error if none exists).
unique – Return single unique value (or error if none exists).
as_dict – Return dictionary of values, listed by index.
try_aggfunc – Wrap aggregate function in a try-except for error handling.
groupby_apply – Aggregate dataframe and capture errors (using apply).
groupby_aggregate – Aggregate dataframe and capture errors (using aggregate).
- pudl.metadata.helpers.format_errors(*errors: str, title: str = None, pydantic: bool = False) str [source]
Format multiple errors into a single error.
- Parameters
errors – Error messages.
title – Title for error messages.
pydantic – Whether to format the message for pydantic.
Examples
>>> e = format_errors('worse', title='bad')
>>> print(e)
bad
* worse
>>> e = format_errors('worse', title='bad', pydantic=True)
>>> print(e)
bad
* worse
>>> e = format_errors('bad', 'worse')
>>> print(e)
* bad
* worse
>>> e = format_errors('bad', 'worse', pydantic=True)
>>> print(e)
* bad
* worse
- pudl.metadata.helpers._parse_field_names(fields: List[Union[str, dict]]) List[str] [source]
Parse field names.
- Parameters
fields – Either field names or field descriptors with a name key.
- Returns
Field names.
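Since the implementation is not shown here, the documented behavior can be sketched as follows (a hypothetical reimplementation, assuming field descriptors are dicts with a name key, not the actual PUDL code):

```python
from typing import List, Union

def parse_field_names(fields: List[Union[str, dict]]) -> List[str]:
    """Return each field's name, whether given as a plain string or a descriptor dict."""
    return [field if isinstance(field, str) else field["name"] for field in fields]
```

For example, `parse_field_names(['id', {'name': 'year', 'type': 'integer'}])` returns `['id', 'year']`.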
- pudl.metadata.helpers._parse_foreign_key_rule(rule: dict, name: str, key: List[str]) List[dict] [source]
Parse foreign key rule from resource descriptor.
- Parameters
rule – Foreign key rule from a resource descriptor.
name – Resource name.
key – Resource primary key.
- Returns
fields (List[str]): Local fields.
reference[‘resource’] (str): Reference resource name.
reference[‘fields’] (List[str]): Reference primary key fields.
exclude (List[str]): Names of resources to exclude, including name.
- Return type
Parsed foreign key rules
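A minimal sketch consistent with the documented return structure (hypothetical; the actual parsing may differ):

```python
from typing import Dict, List

def parse_foreign_key_rule(rule: dict, name: str, key: List[str]) -> List[Dict]:
    """Expand a foreign_key_rules block into one parsed rule per field set."""
    return [
        {
            "fields": list(fields),
            "reference": {"resource": name, "fields": list(key)},
            # Exclude the reference resource itself, plus any explicit exclusions.
            "exclude": [name] + list(rule.get("exclude", [])),
        }
        for fields in rule["fields"]
    ]
```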
- pudl.metadata.helpers._build_foreign_key_tree(resources: Dict[str, dict]) Dict[str, Dict[Tuple[str, ...], dict]] [source]
Build foreign key tree.
- Parameters
resources – Resource descriptors by name.
- Returns
Foreign key tree where the first key is a resource name (str), the second key is resource field names (Tuple[str, …]), and the value describes the reference resource (dict):
reference[‘resource’] (str): Reference name.
reference[‘fields’] (List[str]): Reference field names.
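Putting the pieces together, the tree construction can be sketched like this (a hypothetical reimplementation based only on the documented inputs and outputs, assuming field entries are plain strings):

```python
from typing import Dict, Tuple

def build_foreign_key_tree(resources: Dict[str, dict]) -> Dict[str, Dict[Tuple[str, ...], dict]]:
    """Index parsed foreign key rules by resource name and local field tuple."""
    # Parse rules from every resource that defines foreign_key_rules.
    rules = []
    for name, resource in resources.items():
        schema = resource["schema"]
        if "foreign_key_rules" not in schema:
            continue
        rule = schema["foreign_key_rules"]
        for fields in rule["fields"]:
            rules.append({
                "fields": fields,
                "reference": {"resource": name, "fields": schema["primary_key"]},
                "exclude": [name] + list(rule.get("exclude", [])),
            })
    # Attach each rule to every non-excluded resource containing its fields.
    tree = {}
    for name, resource in resources.items():
        present = set(resource["schema"]["fields"])
        for rule in rules:
            if name not in rule["exclude"] and set(rule["fields"]) <= present:
                tree.setdefault(name, {})[tuple(rule["fields"])] = rule["reference"]
    return tree
```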
- pudl.metadata.helpers._traverse_foreign_key_tree(tree: Dict[str, Dict[Tuple[str, ...], dict]], name: str, fields: Tuple[str, ...]) List[Dict[str, Any]] [source]
Traverse foreign key tree.
- Parameters
tree – Foreign key tree (see _build_foreign_key_tree()).
name – Local resource name.
fields – Local resource fields.
- Returns
fields (List[str]): Local fields.
reference[‘resource’] (str): Reference resource name.
reference[‘fields’] (List[str]): Reference primary key fields.
- Return type
Sequence of foreign keys starting from name and fields
- pudl.metadata.helpers.build_foreign_keys(resources: Dict[str, dict], prune: bool = True) Dict[str, List[dict]] [source]
Build foreign keys for each resource.
A resource’s foreign_key_rules (if present) determines which other resources will be assigned a foreign key (foreign_keys) to the reference’s primary key:
fields (List[List[str]]): Sets of field names for which to create a foreign key. These are assumed to match the order of the reference’s primary key fields.
exclude (Optional[List[str]]): Names of resources to exclude.
- Parameters
resources – Resource descriptors by name.
prune – Whether to prune redundant foreign keys.
- Returns
Foreign keys for each resource (if any), by resource name.
fields (List[str]): Field names.
reference[‘resource’] (str): Reference resource name.
reference[‘fields’] (List[str]): Reference resource field names.
Examples
>>> resources = {
...     'x': {
...         'schema': {
...             'fields': ['z'],
...             'primary_key': ['z'],
...             'foreign_key_rules': {'fields': [['z']]}
...         }
...     },
...     'y': {
...         'schema': {
...             'fields': ['z', 'yy'],
...             'primary_key': ['z', 'yy'],
...             'foreign_key_rules': {'fields': [['z', 'zz']]}
...         }
...     },
...     'z': {'schema': {'fields': ['z', 'zz']}}
... }
>>> keys = build_foreign_keys(resources)
>>> keys['z']
[{'fields': ['z', 'zz'], 'reference': {'resource': 'y', 'fields': ['z', 'yy']}}]
>>> keys['y']
[{'fields': ['z'], 'reference': {'resource': 'x', 'fields': ['z']}}]
>>> keys = build_foreign_keys(resources, prune=False)
>>> keys['z'][0]
{'fields': ['z'], 'reference': {'resource': 'x', 'fields': ['z']}}
- pudl.metadata.helpers.split_period(name: str) Tuple[str, Optional[str]] [source]
Split the time period from a column name.
- Parameters
name – Column name.
- Returns
Base name and time period, if any.
Examples
>>> split_period('report_date')
('report', 'date')
>>> split_period('report_day')
('report_day', None)
>>> split_period('date')
('date', None)
- pudl.metadata.helpers.expand_periodic_column_names(names: Iterable[str]) List[str] [source]
Add smaller periods to a list of column names.
- Parameters
names – Column names.
- Returns
Column names with additional names for smaller periods.
Examples
>>> expand_periodic_column_names(['id', 'report_year'])
['id', 'report_year', 'report_quarter', 'report_month', 'report_date']
- exception pudl.metadata.helpers.AggregationError[source]
Bases: ValueError
Error raised by aggregation functions.
- pudl.metadata.helpers.most_frequent(x: pandas.Series) Any [source]
Return most frequent value (or error if none exists).
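Based on the error message shown in the try_aggfunc example below, the behavior can be sketched as follows (a minimal reimplementation, not the actual PUDL code):

```python
import pandas as pd

class AggregationError(ValueError):
    """Error raised by aggregation functions."""

def most_frequent(x: pd.Series):
    """Return the modal value of the series, or raise if several values tie."""
    modes = x.mode()  # pandas returns all values tied for most frequent
    if len(modes) != 1:
        raise AggregationError("No value is most frequent.")
    return modes.iloc[0]
```

For example, a series of [0, 0, 1] yields 0, while [0, 0, 1, 1] raises because 0 and 1 tie.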
- pudl.metadata.helpers.most_and_more_frequent(x: pandas.Series, min_frequency: float = None) Any [source]
Return most frequent value if more frequent than minimum (or error if none exists).
The minimum frequency ignores null values, so for example, 1 in [1, 1, 1, nan] has a frequency of 1.
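A sketch of this frequency check (hypothetical reimplementation; the exact error message is assumed):

```python
import pandas as pd

class AggregationError(ValueError):
    """Error raised by aggregation functions."""

def most_and_more_frequent(x: pd.Series, min_frequency: float = None):
    """Return the mode if its share of non-null values is at least min_frequency."""
    non_null = x.dropna()  # nulls are ignored when computing frequency
    modes = non_null.mode()
    if len(modes) != 1:
        raise AggregationError("No value is most frequent.")
    value = modes.iloc[0]
    if min_frequency is not None:
        frequency = (non_null == value).sum() / len(non_null)
        if frequency < min_frequency:
            raise AggregationError("Most frequent value is not frequent enough.")
    return value
```

As in the docstring's example, 1 in [1, 1, 1, nan] has a frequency of 1.0, so it passes even with min_frequency=1.0.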
- pudl.metadata.helpers.unique(x: pandas.Series) Any [source]
Return single unique value (or error if none exists).
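A sketch consistent with the "Not unique." errors in the groupby examples below (hypothetical; whether nulls are dropped is an assumption):

```python
import pandas as pd

class AggregationError(ValueError):
    """Error raised by aggregation functions."""

def unique(x: pd.Series):
    """Return the single unique non-null value, or raise."""
    values = x.dropna().unique()
    if len(values) != 1:
        raise AggregationError("Not unique.")
    return values[0]
```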
- pudl.metadata.helpers.as_dict(x: pandas.Series) Dict[Any, list] [source]
Return dictionary of values, listed by index.
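The try_aggfunc example below shows as_dict mapping a series indexed ['a', 'a', 'a', 'b'] to {'a': [0, 0, 1], 'b': [1]}, which can be sketched as (a minimal reimplementation, not the actual PUDL code):

```python
import pandas as pd

def as_dict(x: pd.Series):
    """Group the series values into lists keyed by their index labels."""
    return {key: list(group) for key, group in x.groupby(x.index)}
```

For example, `as_dict(pd.Series([0, 0, 1, 1], index=['a', 'a', 'a', 'b']))` returns `{'a': [0, 0, 1], 'b': [1]}`.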
- pudl.metadata.helpers.try_aggfunc(func: Callable, raised: bool = True, error: Union[str, Callable] = None) Callable [source]
Wrap aggregate function in a try-except for error handling.
- Parameters
func – Aggregate function.
raised – Whether AggregationError errors are raised or returned.
error – Error value, whose type and format depends on raised. Below, x is the original input and e is the original error.
raised=True: A string with substitutions (e.g. ‘Error at {x.name}: {e}’) that replaces the arguments of the original error. By default, the original error is raised unchanged.
raised=False: A function with signature f(x, e) returning a value that replaces the arguments of the original error. By default, the original error is returned unchanged.
- Returns
Aggregate function with custom error handling.
Examples
>>> x = pd.Series([0, 0, 1, 1], index=['a', 'a', 'a', 'b'])
>>> most_frequent(x)
Traceback (most recent call last):
AggregationError: No value is most frequent.
>>> try_aggfunc(most_frequent, raised=False)(x)
AggregationError('No value is most frequent.')
>>> try_aggfunc(most_frequent, error='Bad dtype {x.dtype}')(x)
Traceback (most recent call last):
AggregationError: Bad dtype int64
>>> error = lambda x, e: as_dict(x)
>>> try_aggfunc(most_frequent, raised=False, error=error)(x)
AggregationError({'a': [0, 0, 1], 'b': [1]})
- pudl.metadata.helpers.groupby_apply(df: pandas.DataFrame, by: Iterable, aggfuncs: Dict[Any, Callable], raised: bool = True, error: Callable = None) Tuple[pandas.DataFrame, Dict[Any, pandas.Series]] [source]
Aggregate dataframe and capture errors (using apply).
- Parameters
df – Dataframe to aggregate.
by – Column names used to group rows (see pandas.DataFrame.groupby()).
aggfuncs – Aggregation functions for columns not in by.
raised – Whether AggregationError errors are raised or replaced with np.nan and returned in an error report.
error – A function with signature f(x, e) -> Tuple[Any, Any], where x is the original input and e is the original error, used when raised=False. The first and second values of the returned tuple are used as the index and values, respectively, of the pandas.Series returned for each column. By default, the first value is x.name (the values of the by columns for that row group), and the second is the original error.
- Returns
Aggregated dataframe with the by columns set as the index, and an error report with (if raised=False) a pandas.Series for each column where errors occurred.
Examples
>>> df = pd.DataFrame({'x': [0, 0, 1, 1], 'y': pd.Series([2, 2, 2, 3], dtype='Int64')})
>>> df.index = [0, 0, 0, 1]
>>> base = dict(df=df, by='x', aggfuncs={'y': unique})
>>> groupby_apply(**base)
Traceback (most recent call last):
AggregationError: Could not aggregate y at x = 1: Not unique.
>>> _, report = groupby_apply(**base, raised=False)
>>> report['y']
x
1    Not unique.
dtype: object
>>> error = lambda x, e: (x.name, as_dict(x))
>>> _, report = groupby_apply(**base, raised=False, error=error)
>>> report['y']
x
1    {0: [2], 1: [3]}
dtype: object
- pudl.metadata.helpers.groupby_aggregate(df: pandas.DataFrame, by: Iterable, aggfuncs: Dict[Any, Callable], raised: bool = True, error: Callable = None) Tuple[pandas.DataFrame, Dict[Any, pandas.Series]] [source]
Aggregate dataframe and capture errors (using aggregate).
Although faster than groupby_apply(), it has some limitations:
- Raised errors cannot access the group index.
- Aggregation functions must return a scalar (must ‘reduce’). This is not a limitation with pandas.Series.apply().
- Parameters
df – Dataframe to aggregate.
by – Column names used to group rows (see pandas.DataFrame.groupby()).
aggfuncs – Aggregation functions for columns not in by.
raised – Whether AggregationError errors are raised or replaced with np.nan and returned in an error report.
error – A function with signature f(x, e) -> Any, where x is the original input and e is the original error, used when raised=False. By default, the original error is returned.
- Returns
Aggregated dataframe with the by columns set as the index, and an error report with (if raised=False) a pandas.Series of errors (or the value returned by error) for each column where errors occurred.
Examples
>>> df = pd.DataFrame({
...     'x': [0, 0, 1, 1],
...     'y': pd.Series([2, 2, 2, 3], dtype='Int64')
... })
>>> df.index = [0, 0, 0, 1]
>>> base = dict(df=df, by='x', aggfuncs={'y': unique})
>>> groupby_aggregate(**base)
Traceback (most recent call last):
AggregationError: Could not aggregate y: Not unique.
>>> result, report = groupby_aggregate(**base, raised=False)
>>> result
      y
x
0     2
1  <NA>
>>> report['y']
x
1    Not unique.
Name: y, dtype: object
>>> error = lambda x, e: as_dict(x)
>>> result, report = groupby_aggregate(**base, raised=False, error=error)
>>> report['y']
x
1    {0: [2], 1: [3]}
Name: y, dtype: object