pudl.metadata.classes¶
Metadata data classes.
Attributes¶
SnakeCase | Snake-case variable name (e.g. 'pudl', 'entity_eia860').
StrictList | Non-empty list.
PUDL_PACKAGE | A global PUDL package object for use across the entire codebase.
Classes¶
PudlMeta | A base model that configures some options for PUDL metadata classes.
FieldConstraints | Field constraints (resource.schema.fields[...].constraints).
FieldHarvest | Field harvest parameters (resource.schema.fields[...].harvest).
Encoder | A class that allows us to standardize reported categorical codes.
Field | Field (resource.schema.fields[...]).
ForeignKeyReference | Foreign key reference (resource.schema.foreign_keys[...].reference).
ForeignKey | Foreign key (resource.schema.foreign_keys[...]).
Schema | Table schema (resource.schema).
License | Data license (package|resource.licenses[...]).
Contributor | Data contributor (package.contributors[...]).
DataSource | A data source that has been integrated into PUDL.
ResourceHarvest | Resource harvest parameters (resource.harvest).
PudlResourceDescriptor | The form we expect the RESOURCE_METADATA elements to take.
Resource | Tabular data resource (package.resources[...]).
Package | Tabular data package.
CodeMetadata | A list of Encoders for standardizing and documenting categorical codes.
DatasetteMetadata | A collection of Data Sources and Resources for metadata export.
Functions¶
_unique() | Return a list of all unique values, in order of first appearance.
_format_for_sql() | Format value for use in raw SQL(ite).
_check_unique() | Check that input list has unique values.
_validator() | Construct reusable Pydantic validator.
Module Contents¶
- pudl.metadata.classes._unique(*args: collections.abc.Iterable) list [source]¶
Return a list of all unique values, in order of first appearance.
- Parameters:
args – Iterables of values.
Examples
>>> _unique([0, 2], (2, 1))
[0, 2, 1]
>>> _unique([{'x': 0, 'y': 1}, {'y': 1, 'x': 0}], [{'z': 2}])
[{'x': 0, 'y': 1}, {'z': 2}]
- pudl.metadata.classes._format_for_sql(x: Any, identifier: bool = False) str [source]¶
Format value for use in raw SQL(ite).
- Parameters:
x – Value to format.
identifier – Whether x represents an identifier (e.g. table, column) name.
Examples
>>> _format_for_sql('table_name', identifier=True)
'"table_name"'
>>> _format_for_sql('any string')
"'any string'"
>>> _format_for_sql("Single's quote")
"'Single''s quote'"
>>> _format_for_sql(None)
'null'
>>> _format_for_sql(1)
'1'
>>> _format_for_sql(True)
'True'
>>> _format_for_sql(False)
'False'
>>> _format_for_sql(re.compile("^[^']*$"))
"'^[^'']*$'"
>>> _format_for_sql(datetime.date(2020, 1, 2))
"'2020-01-02'"
>>> _format_for_sql(datetime.datetime(2020, 1, 2, 3, 4, 5, 6))
"'2020-01-02 03:04:05'"
- pudl.metadata.classes.SnakeCase[source]¶
Snake-case variable name (str), e.g. 'pudl', 'entity_eia860'.
- pudl.metadata.classes.StrictList[source]¶
Non-empty list. Allows list, tuple, set, frozenset, collections.deque, or generators, and casts to a list.
- pudl.metadata.classes._check_unique(value: list = None) list | None [source]¶
Check that input list has unique values.
- pudl.metadata.classes._validator(*names, fn: collections.abc.Callable) collections.abc.Callable [source]¶
Construct reusable Pydantic validator.
- Parameters:
names – Names of attributes to validate.
fn – Validation function (see pydantic.field_validator()).
Examples
>>> class Class(BaseModel):
...     x: list = None
...     _check_unique = _validator("x", fn=_check_unique)
>>> Class(x=[0, 0])
Traceback (most recent call last):
ValidationError: ...
- class pudl.metadata.classes.PudlMeta(/, **data: Any)[source]¶
Bases:
pydantic.BaseModel
A base model that configures some options for PUDL metadata classes.
- class pudl.metadata.classes.FieldConstraints(/, **data: Any)[source]¶
Bases:
PudlMeta
Field constraints (resource.schema.fields[…].constraints).
See https://specs.frictionlessdata.io/table-schema/#constraints.
- minimum: pydantic.StrictInt | pydantic.StrictFloat | datetime.date | datetime.datetime | None = None[source]¶
- maximum: pydantic.StrictInt | pydantic.StrictFloat | datetime.date | datetime.datetime | None = None[source]¶
- pattern: re.Pattern | None = None[source]¶
- enum: StrictList[String | pydantic.StrictInt | pydantic.StrictFloat | pydantic.StrictBool | datetime.date | datetime.datetime] | None = None[source]¶
- classmethod _check_max_length(value, info: pydantic.ValidationInfo)[source]¶
- classmethod _check_max(value, info: pydantic.ValidationInfo)[source]¶
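As an illustrative sketch (the numeric bounds and enum values below are made up), constraints can be declared directly via the attributes listed above:
>>> bounds = FieldConstraints(minimum=0, maximum=100)
>>> categories = FieldConstraints(enum=['ST', 'GT', 'HY'])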
- class pudl.metadata.classes.FieldHarvest(/, **data: Any)[source]¶
Bases:
PudlMeta
Field harvest parameters (resource.schema.fields[…].harvest).
- aggregate: collections.abc.Callable[[pandas.Series], pandas.Series][source]¶
Computes a single value from all field values in a group.
- class pudl.metadata.classes.Encoder(/, **data: Any)[source]¶
Bases:
PudlMeta
A class that allows us to standardize reported categorical codes.
Often the original data we are integrating uses short codes to indicate a categorical value, like ST in place of “steam turbine” or LIG in place of “lignite coal”. Many of these coded fields contain non-standard codes due to data-entry errors. The codes have also evolved over the years.
In order to allow easy comparison of records across all years and tables, we define a standard set of codes, a mapping from non-standard codes to standard codes (where possible), and a set of known but unfixable codes which will be ignored and replaced with NA values. These definitions can be found in pudl.metadata.codes, and we refer to these as coding tables.
In our metadata structures, each coding table is defined just like any other DB table, with the addition of an associated Encoder object defining the standard, fixable, and ignored codes.
In addition, a Package class that has been instantiated using the Package.from_resource_ids() method will associate an Encoder object with any column that has a foreign key constraint referring to a coding table (this column-level encoder is the same as the encoder associated with the referenced table). This Encoder can be used to standardize the codes found within the column.
Field and Resource objects have encode() methods that will use the column-level encoders to recode the original values, either for a single column or for all coded columns within a Resource, given either a corresponding pandas.Series or pandas.DataFrame containing actual values.
If any unrecognized values are encountered, an exception is raised, alerting us that a new code has been identified and needs to be classified as fixable or ignored.
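A minimal usage sketch based on the methods documented below; the coding table name is hypothetical, and encode() raises if a value is neither standard, fixable, nor ignored:
>>> encoder = Encoder.from_id('core_eia__codes_prime_movers')  # hypothetical coding table name
>>> raw = encoder.generate_encodable_data(size=5)  # random mix of good, fixable, and ignored codes
>>> clean = encoder.encode(raw)  # standardized codes; ignored codes become NA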
- df: pandas.DataFrame[source]¶
A table associating short codes with long descriptions and other information.
Each coding table contains at least a code column containing the standard codes and a description column with a human readable explanation of what the code stands for. Additional metadata pertaining to the codes and their categories may also appear in this dataframe, which will be loaded into the PUDL DB as a static table. The code column is a natural primary key and must contain no duplicate values.
- ignored_codes: list[pydantic.StrictInt | str] = [][source]¶
A list of non-standard codes which appear in the data, and will be set to NA.
These codes may be the result of data entry errors, and we are unable to map them to the appropriate canonical code. They are discarded from the raw input data.
- code_fixes: dict[pydantic.StrictInt | String, pydantic.StrictInt | String][source]¶
A dictionary mapping non-standard codes to canonical, standardized codes.
The intended meanings of some non-standard codes are clear, and therefore they can be mapped to the standardized, canonical codes with confidence. Sometimes these are the result of data entry errors or changes in the standard codes over time.
- model_config[source]¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod _df_is_encoding_table(df: pandas.DataFrame)[source]¶
Verify that the coding table provides both codes and descriptions.
- classmethod _good_and_ignored_codes_are_disjoint(ignored_codes, info: pydantic.ValidationInfo)[source]¶
Check that there’s no overlap between good and ignored codes.
- classmethod _good_and_fixable_codes_are_disjoint(code_fixes, info: pydantic.ValidationInfo)[source]¶
Check that there’s no overlap between the good and fixable codes.
- classmethod _fixable_and_ignored_codes_are_disjoint(code_fixes, info: pydantic.ValidationInfo)[source]¶
Check that there’s no overlap between the ignored and fixable codes.
- classmethod _check_fixed_codes_are_good_codes(code_fixes, info: pydantic.ValidationInfo)[source]¶
Check that every fixed code is also one of the good codes.
- property code_map: dict[str, str | pandas._libs.missing.NAType][source]¶
A mapping of all known codes to their standardized values, or NA.
- encode(col: pandas.Series, dtype: type | None = None) pandas.Series [source]¶
Apply the stored code mapping to an input Series.
- static dict_from_id(x: str) dict [source]¶
Look up the encoder by coding table name in the metadata.
- classmethod from_id(x: str) Encoder [source]¶
Construct an Encoder based on Resource.name of a coding table.
- classmethod from_code_id(x: str) Encoder [source]¶
Construct an Encoder by looking up name of coding table in codes metadata.
- to_rst(top_dir: pydantic.DirectoryPath, csv_subdir: pydantic.DirectoryPath, is_header: pydantic.StrictBool) String [source]¶
Output the dataframe to a CSV for use in the Jinja template.
Then output to an RST file.
- generate_encodable_data(size: int = 10) pandas.Series [source]¶
Produce a series of data which can be encoded by this encoder.
Selects values randomly from valid, ignored, and fixable codes.
- class pudl.metadata.classes.Field(/, **data: Any)[source]¶
Bases:
PudlMeta
Field (resource.schema.fields[…]).
See https://specs.frictionlessdata.io/table-schema/#field-descriptors.
Examples
>>> field = Field(name='x', type='string', description='X', constraints={'enum': ['x', 'y']})
>>> field.to_pandas_dtype()
CategoricalDtype(categories=['x', 'y'], ordered=False, categories_dtype=object)
>>> field.to_sql()
Column('x', Enum('x', 'y'), CheckConstraint(...), table=None, comment='X')
>>> field = Field.from_id('utility_id_eia')
>>> field.name
'utility_id_eia'
- constraints: FieldConstraints[source]¶
- harvest: FieldHarvest[source]¶
- classmethod _check_constraints(value, info: pydantic.ValidationInfo)[source]¶
- classmethod _check_encoder(value, info: pydantic.ValidationInfo)[source]¶
- to_pandas_dtype(compact: bool = False) str | pandas.CategoricalDtype [source]¶
Return Pandas data type.
- Parameters:
compact – Whether to return a low-memory data type (32-bit integer or float).
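For example, a sketch assuming PUDL's integer fields map to pandas' nullable integer dtypes (the field name is made up, and the exact dtype strings depend on the field type map):
>>> field = Field(name='plant_id_eia', type='integer', description='EIA plant ID')
>>> dtype = field.to_pandas_dtype()  # a nullable integer dtype such as 'Int64' (assumed)
>>> compact_dtype = field.to_pandas_dtype(compact=True)  # the 32-bit variant, e.g. 'Int32' (assumed)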
- to_pyarrow_dtype() pyarrow.lib.DataType [source]¶
Return PyArrow data type.
- to_pyarrow() pyarrow.Field [source]¶
Return a PyArrow Field appropriate to the field.
- to_sql(dialect: Literal['sqlite'] = 'sqlite', check_types: bool = True, check_values: bool = True) sqlalchemy.Column [source]¶
Return equivalent SQL column.
- encode(col: pandas.Series, dtype: type | None = None) pandas.Series [source]¶
Recode the Field if it has an associated encoder.
- class pudl.metadata.classes.ForeignKeyReference(/, **data: Any)[source]¶
Bases:
PudlMeta
Foreign key reference (resource.schema.foreign_keys[…].reference).
See https://specs.frictionlessdata.io/table-schema/#foreign-keys.
- class pudl.metadata.classes.ForeignKey(/, **data: Any)[source]¶
Bases:
PudlMeta
Foreign key (resource.schema.foreign_keys[…]).
See https://specs.frictionlessdata.io/table-schema/#foreign-keys.
- reference: ForeignKeyReference[source]¶
- classmethod _check_fields_equal_length(value, info: pydantic.ValidationInfo)[source]¶
- class pudl.metadata.classes.Schema(/, **data: Any)[source]¶
Bases:
PudlMeta
Table schema (resource.schema).
See https://specs.frictionlessdata.io/table-schema.
- foreign_keys: list[ForeignKey] = [][source]¶
- classmethod _check_primary_key_in_fields(pk, info: pydantic.ValidationInfo)[source]¶
Verify that all primary key elements also appear in the schema fields.
- class pudl.metadata.classes.License(/, **data: Any)[source]¶
Bases:
PudlMeta
Data license (package|resource.licenses[…]).
See https://specs.frictionlessdata.io/data-package/#licenses.
- class pudl.metadata.classes.Contributor(/, **data: Any)[source]¶
Bases:
PudlMeta
Data contributor (package.contributors[…]).
See https://specs.frictionlessdata.io/data-package/#contributors.
- role: Literal['author', 'contributor', 'maintainer', 'publisher', 'wrangler'] = 'contributor'[source]¶
- zenodo_role: Literal['contact person', 'data collector', 'data curator', 'data manager', 'distributor', 'editor', 'hosting institution', 'other', 'producer', 'project leader', 'project member', 'registration agency', 'registration authority', 'related person', 'researcher', 'rights holder', 'sponsor', 'supervisor', 'work package leader'] = 'project member'[source]¶
- classmethod from_id(x: str) Contributor [source]¶
Construct from PUDL identifier.
- class pudl.metadata.classes.DataSource(/, **data: Any)[source]¶
Bases:
PudlMeta
A data source that has been integrated into PUDL.
This metadata is used for:
Generating PUDL documentation.
Annotating long-term archives of the raw input data on Zenodo.
Defining what data partitions can be processed using PUDL.
It can also be used to populate the “source” fields of frictionless data packages and data resources (package|resource.sources[…]).
See https://specs.frictionlessdata.io/data-package/#sources.
- contributors: list[Contributor] = [][source]¶
- get_resource_ids() list[str] [source]¶
Compile list of resource IDs associated with this data source.
- get_temporal_coverage(partitions: dict = None) str [source]¶
Return a string describing the time span covered by the data source.
- to_rst(docs_dir: pydantic.DirectoryPath, source_resources: list, extra_resources: list, output_path: str = None) None [source]¶
Output a representation of the data source in RST for documentation.
- classmethod from_field_namespace(x: str) list[DataSource] [source]¶
Return list of DataSource objects by field namespace.
- classmethod from_id(x: str) DataSource [source]¶
Construct Source by source name in the metadata.
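A brief sketch of looking up a source by its PUDL identifier ('eia860' is one of the data source ids used elsewhere in this module); the return values are summarized in comments:
>>> src = DataSource.from_id('eia860')
>>> resource_ids = src.get_resource_ids()   # resource names associated with this source
>>> coverage = src.get_temporal_coverage()  # string describing the covered time span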
- class pudl.metadata.classes.ResourceHarvest(/, **data: Any)[source]¶
Bases:
PudlMeta
Resource harvest parameters (resource.harvest).
- class pudl.metadata.classes.PudlResourceDescriptor(/, **data: Any)[source]¶
Bases:
PudlMeta
The form we expect the RESOURCE_METADATA elements to take.
This differs from Resource and Schema, etc., in that we represent many complex types (Field, DataSource, etc.) with string IDs that we then turn into instances of those types with lookups. We also use foreign_key_rules to generate the actual foreign_key relationships that are represented in a Schema.
This is all very useful in that we can describe the resources more concisely!
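A hypothetical RESOURCE_METADATA-style entry, sketched only from the description above; the resource, field, and group names are made up, and the exact set of keys (including the shape of foreign_key_rules) is defined by this class and its nested descriptors:
>>> example_descriptor = {
...     'description': 'A hypothetical coded table.',
...     'schema': {
...         'fields': ['plant_id_eia', 'fuel_type_code'],  # string field IDs, expanded via Field.from_id()
...         'primary_key': ['plant_id_eia'],
...         'foreign_key_rules': {'fields': [['fuel_type_code']]},  # used to generate foreign keys (assumed shape)
...     },
...     'field_namespace': 'eia',
...     'sources': ['eia860'],  # data source IDs, expanded via DataSource.from_id()
...     'etl_group': 'static_eia',
... }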
TODO: In the future, we could convert from a PudlResourceDescriptor to various standard formats, such as a Frictionless resource or a pandera schema. This would require some of the logic currently in Resource to move into this class.
- class PudlSchemaDescriptor(/, **data: Any)[source]¶
Bases:
PudlMeta
Container to hold the schema shape.
- class PudlForeignKeyRules(/, **data: Any)[source]¶
Bases:
PudlMeta
Container to describe what foreign key rules look like.
- foreign_key_rules: PudlResourceDescriptor.PudlSchemaDescriptor.PudlForeignKeyRules[source]¶
- class PudlCodeMetadata(/, **data: Any)[source]¶
Bases:
PudlMeta
Describes a bunch of codes.
- class CodeDataFrame[source]¶
Bases:
pandera.DataFrameModel
The DF we use to represent code/label/description associations.
- code: pandera.typing.Series[Any][source]¶
- label: pandera.typing.Series[str] | None[source]¶
- description: pandera.typing.Series[str][source]¶
- operational_status: pandera.typing.Series[str] | None[source]¶
- encoder: PudlResourceDescriptor.PudlCodeMetadata | None = None[source]¶
- class pudl.metadata.classes.Resource(/, **data: Any)[source]¶
Bases:
PudlMeta
Tabular data resource (package.resources[…]).
See https://specs.frictionlessdata.io/tabular-data-resource.
Examples
A simple example illustrates the conversion to SQLAlchemy objects.
>>> fields = [{'name': 'x', 'type': 'year', 'description': 'X'}, {'name': 'y', 'type': 'string', 'description': 'Y'}]
>>> fkeys = [{'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}]
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': fkeys}
>>> resource = Resource(name='a', schema=schema, description='A')
>>> table = resource.to_sql()
>>> table.columns.x
Column('x', Integer(), ForeignKey('b.x'), CheckConstraint(...), table=<a>, primary_key=True, nullable=False, comment='X')
>>> table.columns.y
Column('y', Text(), ForeignKey('b.y'), CheckConstraint(...), table=<a>, comment='Y')
To illustrate harvesting operations, say we have a resource with two fields - a primary key (id) and a data field - which we want to harvest from two different dataframes.
>>> from pudl.metadata.helpers import unique, as_dict
>>> fields = [
...     {'name': 'id', 'type': 'integer', 'description': 'ID'},
...     {'name': 'x', 'type': 'integer', 'harvest': {'aggregate': unique, 'tolerance': 0.25}, 'description': 'X'}
... ]
>>> resource = Resource(**{
...     'name': 'a',
...     'harvest': {'harvest': True},
...     'schema': {'fields': fields, 'primary_key': ['id']},
...     'description': 'A',
... })
>>> dfs = {
...     'a': pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1, 1, 2, 2]}),
...     'b': pd.DataFrame({'id': [2, 3, 3], 'x': [3, 4, 4]})
... }
Skip aggregation to access all the rows concatenated from the input dataframes. The names of the input dataframes are used as the index.
>>> df, _ = resource.harvest_dfs(dfs, aggregate=False)
>>> df
    id  x
df
a    1  1
a    1  1
a    2  2
a    2  2
b    2  3
b    3  4
b    3  4
Field names and data types are enforced.
>>> resource.to_pandas_dtypes() == df.dtypes.apply(str).to_dict()
True
Alternatively, aggregate by primary key (the default when harvest.harvest=True) and report aggregation errors.
>>> df, report = resource.harvest_dfs(dfs)
>>> df
      x
id
1     1
2  <NA>
3     4
>>> report['stats']
{'all': 2, 'invalid': 1, 'tolerance': 0.0, 'actual': 0.5}
>>> report['fields']['x']['stats']
{'all': 3, 'invalid': 1, 'tolerance': 0.25, 'actual': 0.33...}
>>> report['fields']['x']['errors']
id
2    Not unique.
Name: x, dtype: object
Customize the error values in the error report.
>>> error = lambda x, e: as_dict(x)
>>> df, report = resource.harvest_dfs(
...     dfs, aggregate_kwargs={'raised': False, 'error': error}
... )
>>> report['fields']['x']['errors']
id
2    {'a': [2, 2], 'b': [3]}
Name: x, dtype: object
Limit harvesting to the input dataframe of the same name by setting harvest.harvest=False.
>>> resource.harvest.harvest = False
>>> df, _ = resource.harvest_dfs(dfs, aggregate_kwargs={'raised': False})
>>> df
    id  x
df
a    1  1
a    1  1
a    2  2
a    2  2
Harvesting can also handle conversion to longer time periods. Period harvesting requires primary key fields with a datetime data type, except for year fields which can be integer.
>>> fields = [{'name': 'report_year', 'type': 'year', 'description': 'Report year'}]
>>> resource = Resource(**{
...     'name': 'table', 'harvest': {'harvest': True},
...     'schema': {'fields': fields, 'primary_key': ['report_year']},
...     'description': 'Table',
... })
>>> df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']})
>>> resource.format_df(df)
  report_year
0  2000-01-01
1  2000-01-01
>>> df = pd.DataFrame({'report_year': [2000, 2000]})
>>> resource.format_df(df)
  report_year
0  2000-01-01
1  2000-01-01
- harvest: ResourceHarvest[source]¶
- contributors: list[Contributor] = [][source]¶
- sources: list[DataSource] = [][source]¶
- field_namespace: Literal['eia', 'eiaaeo', 'eia_bulk_elec', 'epacems', 'ferc1', 'ferc714', 'glue', 'gridpathratoolkit', 'ppe', 'pudl', 'nrelatb', 'vcerare'] | None = None[source]¶
- etl_group: Literal['eia860', 'eia861', 'eia861_disabled', 'eia923', 'eia930', 'eiaaeo', 'entity_eia', 'epacems', 'ferc1', 'ferc1_disabled', 'ferc714', 'glue', 'gridpathratoolkit', 'outputs', 'static_ferc1', 'static_eia', 'static_eia_disabled', 'eia_bulk_elec', 'state_demand', 'static_pudl', 'service_territories', 'nrelatb', 'vcerare'] | None = None[source]¶
- property sphinx_ref_name[source]¶
Get legal Sphinx ref name.
Sphinx throws an error when creating a cross ref target for a resource that has a preceding underscore. It is also possible for resources to have identical names when the preceding underscore is removed. This function adds a preceding ‘i’ to cross ref targets for resources with preceding underscores. The ‘i’ will not be rendered in the docs, only in the hyperlinks in the .rst files.
- classmethod _check_harvest_primary_key(value, info: pydantic.ValidationInfo)[source]¶
- static dict_from_id(resource_id: str) dict [source]¶
Construct dictionary from PUDL identifier (resource.name).
- static dict_from_resource_descriptor(resource_id: str, descriptor: PudlResourceDescriptor) dict [source]¶
Get a Resource-shaped dict from a PudlResourceDescriptor.
- schema.fields
  - Field names are expanded (Field.from_id()).
  - Field attributes are replaced with any specific to the resource.group and field.name.
- sources: Source ids are expanded (Source.from_id()).
- licenses: License ids are expanded (License.from_id()).
- contributors: Contributor ids are fetched by source ids, then expanded (Contributor.from_id()).
- keywords: Keywords are fetched by source ids.
- schema.foreign_keys: Foreign keys are fetched by resource name.
- get_field(name: str) Field [source]¶
Return the field with the given name if it's part of the Resource.
- to_sql(metadata: sqlalchemy.MetaData = None, check_types: bool = True, check_values: bool = True) sqlalchemy.Table [source]¶
Return equivalent SQL Table.
- to_pyarrow() pyarrow.Schema [source]¶
Construct a PyArrow schema for the resource.
- to_pandas_dtypes(**kwargs: Any) dict[str, str | pandas.CategoricalDtype] [source]¶
Return Pandas data type of each field by field name.
- Parameters:
kwargs – Arguments to Field.to_pandas_dtype().
- match_primary_key(names: collections.abc.Iterable[str]) dict[str, str] | None [source]¶
Match primary key fields to input field names.
An exact match is required unless harvest.harvest=True, in which case periodic names may also match a basename with a smaller period.
- Parameters:
names – Field names.
- Raises:
ValueError – Field names are not unique.
ValueError – Multiple field names match primary key field.
- Returns:
The name matching each primary key field (if any) as a dict, or None if not all primary key fields have a match.
Examples
>>> fields = [{'name': 'x_year', 'type': 'year', 'description': 'Year'}]
>>> schema = {'fields': fields, 'primary_key': ['x_year']}
>>> resource = Resource(name='r', schema=schema, description='R')
By default, when harvest.harvest=False, exact matches are required.
>>> resource.harvest.harvest
False
>>> resource.match_primary_key(['x_month']) is None
True
>>> resource.match_primary_key(['x_year', 'x_month'])
{'x_year': 'x_year'}
When harvest.harvest=True, in the absence of an exact match, periodic names may also match a basename with a smaller period.
>>> resource.harvest.harvest = True
>>> resource.match_primary_key(['x_year', 'x_month'])
{'x_year': 'x_year'}
>>> resource.match_primary_key(['x_month'])
{'x_month': 'x_year'}
>>> resource.match_primary_key(['x_month', 'x_date'])
Traceback (most recent call last):
ValueError: ... {'x_month', 'x_date'} match primary key field 'x_year'
- format_df(df: pandas.DataFrame | None = None, **kwargs: Any) pandas.DataFrame [source]¶
Format a dataframe according to the resource's table schema.
DataFrame columns not in the schema are dropped.
Any columns missing from the DataFrame are added with the right dtype, but will be empty.
All columns are cast to their specified pandas dtypes.
Primary key columns must be present and non-null.
Periodic primary key fields are snapped to the start of the desired period.
If the primary key fields could not be matched to columns in df (match_primary_key()) or if df=None, an empty dataframe is returned.
- Parameters:
df – Dataframe to format.
kwargs – Arguments to Field.to_pandas_dtype().
- Returns:
Dataframe with column names and data types matching the resource fields.
- enforce_schema(df: pandas.DataFrame) pandas.DataFrame [source]¶
Drop columns not in the DB schema and enforce specified types.
- aggregate_df(df: pandas.DataFrame, raised: bool = False, error: collections.abc.Callable = None) tuple[pandas.DataFrame, dict] [source]¶
Aggregate dataframe by primary key.
The dataframe is grouped by primary key fields and aggregated with the aggregate function of each field (schema_.fields[*].harvest.aggregate).
The report is formatted as follows:
valid (bool): Whether resource is valid.
stats (dict): Error statistics for resource fields.
fields (dict):
<field_name> (str)
valid (bool): Whether field is valid.
stats (dict): Error statistics for field groups.
errors (pandas.Series): Error values indexed by primary key.
…
Each stats (dict) contains the following:
all (int): Number of entities (field or field group).
invalid (int): Invalid number of entities.
tolerance (float): Fraction of invalid entities below which parent entity is considered valid.
actual (float): Actual fraction of invalid entities.
- Parameters:
df – Dataframe to aggregate. It is assumed to have column names and data types matching the resource fields.
raised – Whether aggregation errors are raised or replaced with np.nan and returned in an error report.
error – A function with signature f(x, e) -> Any, where x are the original field values as a pandas.Series and e is the original error. If provided, the returned value is reported instead of e.
- Raises:
ValueError – A primary key is required for aggregating.
- Returns:
The aggregated dataframe indexed by primary key fields, and an aggregation report (described above) that includes all aggregation errors and whether the result meets the resource's and fields' tolerance.
- _build_aggregation_report(df: pandas.DataFrame, errors: dict) dict [source]¶
Build report from aggregation errors.
- Parameters:
df – Harvested dataframe (see harvest_dfs()).
errors – Aggregation errors (see groupby_aggregate()).
- Returns:
Aggregation report, as described in aggregate_df().
- harvest_dfs(dfs: dict[str, pandas.DataFrame], aggregate: bool = None, aggregate_kwargs: dict[str, Any] = {}, format_kwargs: dict[str, Any] = {}) tuple[pandas.DataFrame, dict] [source]¶
Harvest from named dataframes.
For standard resources (harvest.harvest=False), the columns matching all primary key fields and any data fields are extracted from the input dataframe of the same name.
For harvested resources (harvest.harvest=True), the columns matching all primary key fields and any data fields are extracted from each compatible input dataframe and concatenated into a single dataframe. Periodic key fields (e.g. ‘report_month’) are matched to any column of the same name with an equal or smaller period (e.g. ‘report_day’) and snapped to the start of the desired period.
If aggregate=False, rows are indexed by the name of the input dataframe. If aggregate=True, rows are indexed by primary key fields.
- Parameters:
dfs – Dataframes to harvest.
aggregate – Whether to aggregate the harvested rows by their primary key. By default, this is True if self.harvest.harvest=True and False otherwise.
aggregate_kwargs – Optional arguments to aggregate_df().
format_kwargs – Optional arguments to format_df().
- Returns:
A dataframe harvested from the dataframes, with column names and data types matching the resource fields, alongside an aggregation report.
- encode(df: pandas.DataFrame) pandas.DataFrame [source]¶
Standardize coded columns using the foreign column they refer to.
- class pudl.metadata.classes.Package(/, **data: Any)[source]¶
Bases:
PudlMeta
Tabular data package.
See https://specs.frictionlessdata.io/data-package.
Examples
Foreign keys between resources are checked for completeness and consistency.
>>> fields = [{'name': 'x', 'type': 'year', 'description': 'X'}, {'name': 'y', 'type': 'string', 'description': 'Y'}]
>>> fkey = {'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': [fkey]}
>>> a = Resource(name='a', schema=schema, description='A')
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x']), description='B')
>>> Package(name='ab', resources=[a, b])
Traceback (most recent call last):
ValidationError: ...
>>> b.schema.primary_key = ['x', 'y']
>>> package = Package(name='ab', resources=[a, b])
SQLAlchemy can sort tables, based on foreign keys, in the order in which they need to be loaded into a database.
>>> metadata = package.to_sql()
>>> [table.name for table in metadata.sorted_tables]
['b', 'a']
- created: datetime.datetime[source]¶
- contributors: list[Contributor] = [][source]¶
- sources: list[DataSource] = [][source]¶
- model_config[source]¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- _populate_from_resources()[source]¶
Populate Package attributes from similar deduplicated Resource attributes.
Resources and Packages share some descriptive attributes. When building a Package out of a collection of Resources, we want the Package to reflect the union of all the analogous values found in the Resources, but we don’t want any duplicates. We may also get values directly from the Package inputs.
- classmethod from_resource_ids(resource_ids: tuple[str] = tuple(sorted(RESOURCE_METADATA)), resolve_foreign_keys: bool = False, excluded_etl_groups: tuple[str] = ()) Package [source]¶
Construct a collection of Resources from PUDL identifiers (resource.name).
Identify any fields that have foreign key relationships referencing the coding tables defined in pudl.metadata.codes and, if so, associate the coding table's encoder with those columns for later use cleaning them up.
The result is cached, since we so often need to generate the metadata for the full collection of PUDL tables.
- Parameters:
resource_ids – Resource PUDL identifiers (resource.name). Needs to be a Tuple so that the set of identifiers is hashable, allowing return value caching through lru_cache.
resolve_foreign_keys – Whether to add resources as needed based on foreign keys.
excluded_etl_groups – Collection of ETL groups used to filter resources out of Package.
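A minimal usage sketch; the resource name passed to get_resource() is hypothetical:
>>> pkg = Package.from_resource_ids()  # cached; covers all of RESOURCE_METADATA by default
>>> res = pkg.get_resource('core_eia__entity_plants')  # hypothetical resource name
>>> metadata = pkg.to_sql()  # sqlalchemy.MetaData describing every table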
- static get_etl_group_tables(etl_group: str) tuple[str] [source]¶
Get a sorted tuple of table names for an etl_group.
- Parameters:
etl_group – the etl_group key.
- Returns:
A sorted tuple of table names for the etl_group.
- get_resource(name: str) Resource [source]¶
Return the resource with the given name if it is in the Package.
- to_sql(check_types: bool = True, check_values: bool = True) sqlalchemy.MetaData [source]¶
Return equivalent SQL MetaData.
- get_sorted_resources() StrictList[Resource] [source]¶
Get a list of sorted Resources.
Currently, Resources are listed in reverse alphabetical order based on their name, which promotes output tables to users and pushes intermediate tables to the bottom of the docs: output, core, intermediate. In the future we might want finer-grained control over how Resources are sorted.
- Returns:
A sorted list of resources.
- property encoders: dict[SnakeCase, Encoder][source]¶
Compile a mapping of field names to their encoders, if they exist.
This dictionary will be used many times, so it makes sense to build it once when the Package is instantiated so it can be reused.
- encode(df: pandas.DataFrame) pandas.DataFrame [source]¶
Clean up all coded columns in a dataframe based on PUDL coding tables.
- Returns:
A modified copy of the input dataframe.
- pudl.metadata.classes.PUDL_PACKAGE[source]¶
Define a global PUDL package object for use across the entire codebase.
This needs to happen after the definition of the Package class above, and it is used in some of the class definitions below, but having it defined in the middle of this module is kind of obscure, so it is imported in the __init__.py for this subpackage and then imported in other modules from that more prominent location.
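A short sketch of using it to standardize coded columns; the column name is hypothetical, and the empty dataframe stands in for real input data whose coded columns share names with PUDL fields:
>>> df = pd.DataFrame({'energy_source_code_1': pd.Series([], dtype='string')})  # hypothetical coded column
>>> coded_fields = sorted(PUDL_PACKAGE.encoders)  # field names that have associated Encoders
>>> cleaned = PUDL_PACKAGE.encode(df)  # modified copy with coded columns standardized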
- class pudl.metadata.classes.CodeMetadata(/, **data: Any)[source]¶
Bases:
PudlMeta
A list of Encoders for standardizing and documenting categorical codes.
Used to export static coding metadata to PUDL documentation automatically.
- classmethod from_code_ids(code_ids: collections.abc.Iterable[str]) CodeMetadata [source]¶
Construct a list of encoders from code dictionaries.
- Parameters:
code_ids – A list of Code PUDL identifiers, keys to entries in the CODE_METADATA dictionary.
- class pudl.metadata.classes.DatasetteMetadata(/, **data: Any)[source]¶
Bases:
PudlMeta
A collection of Data Sources and Resources for metadata export.
Used to create metadata YAML file to accompany Datasette.
- data_sources: list[DataSource][source]¶
- classmethod from_data_source_ids(output_path: pathlib.Path, data_source_ids: list[str] = ['pudl', 'eia860', 'eia860m', 'eia861', 'eia923', 'ferc1', 'ferc2', 'ferc6', 'ferc60', 'ferc714'], xbrl_ids: list[str] = ['ferc1_xbrl', 'ferc2_xbrl', 'ferc6_xbrl', 'ferc60_xbrl', 'ferc714_xbrl']) DatasetteMetadata [source]¶
Construct a dictionary of DataSources from data source names.
Create dictionary of first and last year or year-month for each source.
- Parameters:
output_path – PUDL_OUTPUT path.
data_source_ids – IDs of data sources currently included in Datasette.
xbrl_ids – IDs of converted XBRL data to be included in Datasette.
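A brief sketch using the default source and XBRL id lists; the path is a placeholder for a real PUDL_OUTPUT directory:
>>> from pathlib import Path
>>> dm = DatasetteMetadata.from_data_source_ids(output_path=Path('/path/to/pudl_output'))
>>> sources = dm.data_sources  # DataSource objects for each of the default source ids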