pudl.metadata.classes

Metadata data classes.

Module Contents

Classes

Base

Custom Pydantic base class.

BaseType

Base class for custom pydantic types.

Date

Any datetime.date.

Datetime

Any datetime.datetime.

Pattern

Regular expression pattern.

FieldConstraints

Field constraints (resource.schema.fields[...].constraints).

FieldHarvest

Field harvest parameters (resource.schema.fields[...].harvest).

Encoder

A class that allows us to standardize reported categorical codes.

Field

Field (resource.schema.fields[...]).

ForeignKeyReference

Foreign key reference (resource.schema.foreign_keys[...].reference).

ForeignKey

Foreign key (resource.schema.foreign_keys[...]).

Schema

Table schema (resource.schema).

License

Data license (package|resource.licenses[...]).

Source

Data source (package|resource.sources[...]).

Contributor

Data contributor (package.contributors[...]).

ResourceHarvest

Resource harvest parameters (resource.harvest).

Resource

Tabular data resource (package.resources[...]).

Package

Tabular data package.

Functions

_unique(*args: Iterable) → list

Return a list of all unique values, in order of first appearance.

_format_for_sql(x: Any, identifier: bool = False) → str

Format value for use in raw SQL(ite).

StrictList(item_type: Type = Any) → pydantic.ConstrainedList

Non-empty list.

_check_unique(value: list = None) → Optional[list]

Check that input list has unique values.

_validator(*names, fn: Callable) → Callable

Construct reusable Pydantic validator.

Attributes

logger

JINJA_ENVIRONMENT

String

Non-empty str with no trailing or leading whitespace.

SnakeCase

Snake-case variable name str (e.g. 'pudl', 'entity_eia860').

Bool

Any bool (True or False).

Float

Any float.

Int

Any int.

PositiveInt

Positive int.

PositiveFloat

Positive float.

Email

String representing an email.

HttpUrl

Http(s) URL.

pudl.metadata.classes.logger[source]
pudl.metadata.classes._unique(*args: Iterable) list[source]

Return a list of all unique values, in order of first appearance.

Parameters

args – Iterables of values.

Examples

>>> _unique([0, 2], (2, 1))
[0, 2, 1]
>>> _unique([{'x': 0, 'y': 1}, {'y': 1, 'x': 0}], [{'z': 2}])
[{'x': 0, 'y': 1}, {'z': 2}]
pudl.metadata.classes._format_for_sql(x: Any, identifier: bool = False) str[source]

Format value for use in raw SQL(ite).

Parameters
  • x – Value to format.

  • identifier – Whether x represents an identifier (e.g. table, column) name.

Examples

>>> _format_for_sql('table_name', identifier=True)
'"table_name"'
>>> _format_for_sql('any string')
"'any string'"
>>> _format_for_sql("Single's quote")
"'Single''s quote'"
>>> _format_for_sql(None)
'null'
>>> _format_for_sql(1)
'1'
>>> _format_for_sql(True)
'True'
>>> _format_for_sql(False)
'False'
>>> _format_for_sql(re.compile("^[^']*$"))
"'^[^'']*$'"
>>> _format_for_sql(datetime.date(2020, 1, 2))
"'2020-01-02'"
>>> _format_for_sql(datetime.datetime(2020, 1, 2, 3, 4, 5, 6))
"'2020-01-02 03:04:05'"
pudl.metadata.classes.JINJA_ENVIRONMENT :jinja2.Environment[source]
class pudl.metadata.classes.Base[source]

Bases: pydantic.BaseModel

Custom Pydantic base class.

It overrides fields() and schema() to allow properties with those names. To use them in a class, use an underscore prefix and an alias.

Examples

>>> class Class(Base):
...     fields_: List[str] = pydantic.Field(alias="fields")
>>> m = Class(fields=['x'])
>>> m
Class(fields=['x'])
>>> m.fields
['x']
>>> m.fields = ['y']
>>> m.dict()
{'fields': ['y']}
class Config[source]

Custom Pydantic configuration.

validate_all :bool = True[source]
validate_assignment :bool = True[source]
extra :str = forbid[source]
arbitrary_types_allowed = True[source]
dict(self, *args, by_alias=True, **kwargs) dict[source]

Return as a dictionary.

json(self, *args, by_alias=True, **kwargs) str[source]

Return as JSON.

__getattribute__(self, name: str) Any[source]

Get attribute.

__setattr__(self, name, value) None[source]

Set attribute.

__repr_args__(self) List[Tuple[str, Any]][source]

Returns the attributes to show in __str__, __repr__, and __pretty__.

pudl.metadata.classes.String[source]

Non-empty str with no trailing or leading whitespace.

pudl.metadata.classes.SnakeCase[source]

Snake-case variable name str (e.g. ‘pudl’, ‘entity_eia860’).

pudl.metadata.classes.Bool[source]

Any bool (True or False).

pudl.metadata.classes.Float[source]

Any float.

pudl.metadata.classes.Int[source]

Any int.

pudl.metadata.classes.PositiveInt[source]

Positive int.

pudl.metadata.classes.PositiveFloat[source]

Positive float.

pudl.metadata.classes.Email[source]

String representing an email.

pudl.metadata.classes.HttpUrl[source]

Http(s) URL.
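
For illustration, these aliases are ordinary Pydantic constrained types and can be used directly as model field annotations. A minimal sketch (the model is hypothetical; exact error messages depend on the pydantic version):

>>> class Class(Base):
...     x: String = None
>>> Class(x='pudl').x
'pudl'
>>> Class(x='  ')
Traceback (most recent call last):
ValidationError: ...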

class pudl.metadata.classes.BaseType[source]

Base class for custom pydantic types.

classmethod __get_validators__(cls) Callable[source]

Yield validator methods.

class pudl.metadata.classes.Date[source]

Bases: BaseType

Any datetime.date.

classmethod validate(cls, value: Any) datetime.date[source]

Validate as date.

class pudl.metadata.classes.Datetime[source]

Bases: BaseType

Any datetime.datetime.

classmethod validate(cls, value: Any) datetime.datetime[source]

Validate as datetime.

class pudl.metadata.classes.Pattern[source]

Bases: BaseType

Regular expression pattern.

classmethod validate(cls, value: Any) re.Pattern[source]

Validate as pattern.

pudl.metadata.classes.StrictList(item_type: Type = Any) pydantic.ConstrainedList[source]

Non-empty list.

Allows list, tuple, set, frozenset, collections.deque, or generators and casts to a list.
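
A hedged sketch of the intended behavior (the model is hypothetical):

>>> class Class(Base):
...     x: StrictList(Int) = None
>>> Class(x=(1, 2)).x
[1, 2]
>>> Class(x=[])
Traceback (most recent call last):
ValidationError: ...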

pudl.metadata.classes._check_unique(value: list = None) Optional[list][source]

Check that input list has unique values.

pudl.metadata.classes._validator(*names, fn: Callable) Callable[source]

Construct reusable Pydantic validator.

Parameters
  • names – Names of attributes to validate.

  • fn – Validation function (see pydantic.validator()).

Examples

>>> class Class(Base):
...     x: list = None
...     _check_unique = _validator("x", fn=_check_unique)
>>> Class(x=[0, 0])
Traceback (most recent call last):
ValidationError: ...
class pudl.metadata.classes.FieldConstraints[source]

Bases: Base

Field constraints (resource.schema.fields[…].constraints).

See https://specs.frictionlessdata.io/table-schema/#constraints.

required :Bool = False[source]
unique :Bool = False[source]
min_length :PositiveInt[source]
max_length :PositiveInt[source]
minimum :Union[Int, Float, Date, Datetime][source]
maximum :Union[Int, Float, Date, Datetime][source]
pattern :Pattern[source]
enum :StrictList(Union[pydantic.StrictStr, Int, Float, Bool, Date, Datetime])[source]
_check_unique[source]
_check_max_length(cls, value, values)[source]
_check_max(cls, value, values)[source]
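
For example, these validators reject internally inconsistent bounds (a sketch; the exact error text may differ):

>>> FieldConstraints(minimum=2, maximum=1)
Traceback (most recent call last):
ValidationError: ...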
class pudl.metadata.classes.FieldHarvest[source]

Bases: Base

Field harvest parameters (resource.schema.fields[…].harvest).

aggregate :Callable[[pandas.Series], pandas.Series][source]

Computes a single value from all field values in a group.

tolerance :PositiveFloat = 0.0[source]

Fraction of invalid groups above which result is considered invalid.

class pudl.metadata.classes.Encoder[source]

Bases: Base

A class that allows us to standardize reported categorical codes.

Often the original data we are integrating uses short codes to indicate a categorical value, like ST in place of “steam turbine” or LIG in place of “lignite coal”. Many of these coded fields contain non-standard codes due to data-entry errors. The codes have also evolved over the years.

In order to allow easy comparison of records across all years and tables, we define a standard set of codes, a mapping from non-standard codes to standard codes (where possible), and a set of known but unfixable codes which will be ignored and replaced with NA values. These definitions can be found in pudl.metadata.codes and we refer to these as coding tables.

In our metadata structures, each coding table is defined just like any other DB table, with the addition of an associated Encoder object defining the standard, fixable, and ignored codes.

In addition, a Package class that has been instantiated using the Package.from_resource_ids() method will associate an Encoder object with any column that has a foreign key constraint referring to a coding table (this column-level encoder is the same as the one associated with the referenced coding table). This Encoder can be used to standardize the codes found within the column.

Field and Resource objects have encode() methods that will use the column-level encoders to recode the original values, either for a single column or for all coded columns within a Resource, given either a corresponding pandas.Series or pandas.DataFrame containing actual values.

If any unrecognized values are encountered, an exception will be raised, alerting us that a new code has been identified and needs to be classified as fixable or ignored.
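
Examples

A hedged sketch of constructing and applying an Encoder. The codes, the minimal coding table, and the printed output are illustrative, not taken from PUDL's actual coding tables:

>>> df = pd.DataFrame({
...     'code': ['ST', 'GT'],
...     'definition': ['steam turbine', 'gas turbine'],
... })
>>> enc = Encoder(df=df, code_fixes={'st': 'ST'}, ignored_codes=['XX'])
>>> enc.encode(pd.Series(['ST', 'st', 'XX']))  # 'st' is fixed; 'XX' becomes NA
0      ST
1      ST
2    <NA>
dtype: object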

df :pandas.DataFrame[source]

A table associating short codes with long descriptions and other information.

Each coding table contains at least a code column containing the standard codes and a definition column with a human readable explanation of what the code stands for. Additional metadata pertaining to the codes and their categories may also appear in this dataframe, which will be loaded into the PUDL DB as a static table. The code column is a natural primary key and must contain no duplicate values.

ignored_codes :List[Union[Int, String]] = [][source]

A list of non-standard codes which appear in the data, and will be set to NA.

These codes may be the result of data entry errors, and we are unable to map them to the appropriate canonical code. They are discarded from the raw input data.

code_fixes :Dict[Union[Int, String], Union[Int, String]][source]

A dictionary mapping non-standard codes to canonical, standardized codes.

The intended meanings of some non-standard codes are clear, and therefore they can be mapped to the standardized, canonical codes with confidence. Sometimes these are the result of data entry errors or changes in the standard codes over time.

_df_is_encoding_table(cls, df)[source]

Verify that the coding table provides both codes and descriptions.

_good_and_ignored_codes_are_disjoint(cls, ignored_codes, values)[source]

Check that there’s no overlap between good and ignored codes.

_good_and_fixable_codes_are_disjoint(cls, code_fixes, values)[source]

Check that there’s no overlap between the good and fixable codes.

_fixable_and_ignored_codes_are_disjoint(cls, code_fixes, values)[source]

Check that there’s no overlap between the ignored and fixable codes.

_check_fixed_codes_are_good_codes(cls, code_fixes, values)[source]

Check that every fixed code is also one of the good codes.

property code_map(self) Dict[str, Union[str, type(pd.NA)]][source]

A mapping of all known codes to their standardized values, or NA.

encode(self, col: pandas.Series, dtype: Union[type, None] = None) pandas.Series[source]

Apply the stored code mapping to an input Series.

static dict_from_id(x: str) dict[source]

Look up the encoder by coding table name in the metadata.

classmethod from_id(cls, x: str) Encoder[source]

Construct an Encoder based on Resource.name of a coding table.

class pudl.metadata.classes.Field[source]

Bases: Base

Field (resource.schema.fields[…]).

See https://specs.frictionlessdata.io/table-schema/#field-descriptors.

Examples

>>> field = Field(name='x', type='string', constraints={'enum': ['x', 'y']})
>>> field.to_pandas_dtype()
CategoricalDtype(categories=['x', 'y'], ordered=False)
>>> field.to_sql()
Column('x', Enum('x', 'y'), CheckConstraint(...), table=None)
>>> field = Field.from_id('utility_id_eia')
>>> field.name
'utility_id_eia'
name :SnakeCase[source]
type :Literal[string, number, integer, boolean, date, datetime, year][source]
format :Literal[default] = default[source]
description :String[source]
constraints :FieldConstraints[source]
harvest :FieldHarvest[source]
encoder :Encoder[source]
_check_constraints(cls, value, values)[source]
_check_encoder(cls, value, values)[source]
static dict_from_id(x: str) dict[source]

Construct dictionary from PUDL identifier (Field.name).

classmethod from_id(cls, x: str) Field[source]

Construct from PUDL identifier (Field.name).

to_pandas_dtype(self, compact: bool = False) Union[str, pandas.CategoricalDtype][source]

Return Pandas data type.

Parameters

compact – Whether to return a low-memory data type (32-bit integer or float).
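
For instance, an integer field maps to pandas' nullable integer extension types (a sketch; the exact dtype names are our assumption, inferred from the compact behavior described above):

>>> Field(name='x', type='integer').to_pandas_dtype()
'Int64'
>>> Field(name='x', type='integer').to_pandas_dtype(compact=True)
'Int32'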

to_sql_dtype(self) sqlalchemy.sql.visitors.VisitableType[source]

Return SQLAlchemy data type.

to_sql(self, dialect: Literal[sqlite] = 'sqlite', check_types: bool = True, check_values: bool = True) sqlalchemy.Column[source]

Return equivalent SQL column.

encode(self, col: pandas.Series, dtype: Union[type, None] = None) pandas.Series[source]

Recode the Field if it has an associated encoder.

class pudl.metadata.classes.ForeignKeyReference[source]

Bases: Base

Foreign key reference (resource.schema.foreign_keys[…].reference).

See https://specs.frictionlessdata.io/table-schema/#foreign-keys.

resource :SnakeCase[source]
fields_ :StrictList(SnakeCase)[source]
_check_unique[source]
class pudl.metadata.classes.ForeignKey[source]

Bases: Base

Foreign key (resource.schema.foreign_keys[…]).

See https://specs.frictionlessdata.io/table-schema/#foreign-keys.

fields_ :StrictList(SnakeCase)[source]
reference :ForeignKeyReference[source]
_check_unique[source]
_check_fields_equal_length(cls, value, values)[source]
is_simple(self) bool[source]

Indicate whether the FK relationship contains a single column.

to_sql(self) sqlalchemy.ForeignKeyConstraint[source]

Return equivalent SQL Foreign Key.
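
A small sketch of is_simple() (the field and resource names are illustrative):

>>> ForeignKey(fields=['x'], reference={'resource': 'b', 'fields': ['x']}).is_simple()
True
>>> ForeignKey(fields=['x', 'y'], reference={'resource': 'b', 'fields': ['x', 'y']}).is_simple()
False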

class pudl.metadata.classes.Schema[source]

Bases: Base

Table schema (resource.schema).

See https://specs.frictionlessdata.io/table-schema.

fields_ :StrictList(Field)[source]
missing_values :List[pydantic.StrictStr] = [''][source]
primary_key :StrictList(SnakeCase)[source]
foreign_keys :List[ForeignKey] = [][source]
_check_unique[source]
_check_field_names_unique(cls, value)[source]
_check_primary_key_in_fields(cls, value, values)[source]
_check_foreign_key_in_fields(cls, value, values)[source]
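
For example, a primary key naming a field that is not among the schema's fields is rejected (a sketch; the exact error text may differ):

>>> Schema(fields=[{'name': 'x', 'type': 'integer'}], primary_key=['y'])
Traceback (most recent call last):
ValidationError: ...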
class pudl.metadata.classes.License[source]

Bases: Base

Data license (package|resource.licenses[…]).

See https://specs.frictionlessdata.io/data-package/#licenses.

name :String[source]
title :String[source]
path :HttpUrl[source]
static dict_from_id(x: str) dict[source]

Construct dictionary from PUDL identifier.

classmethod from_id(cls, x: str) License[source]

Construct from PUDL identifier.
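
A hedged usage sketch (assuming 'cc-by-4.0' is among the license identifiers defined in PUDL's metadata):

>>> license = License.from_id('cc-by-4.0')
>>> isinstance(license, License)
True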

class pudl.metadata.classes.Source[source]

Bases: Base

Data source (package|resource.sources[…]).

See https://specs.frictionlessdata.io/data-package/#sources.

title :String[source]
path :HttpUrl[source]
email :Email[source]
static dict_from_id(x: str) dict[source]

Construct dictionary from PUDL identifier.

classmethod from_id(cls, x: str) Source[source]

Construct from PUDL identifier.

class pudl.metadata.classes.Contributor[source]

Bases: Base

Data contributor (package.contributors[…]).

See https://specs.frictionlessdata.io/data-package/#contributors.

title :String[source]
path :HttpUrl[source]
email :Email[source]
role :Literal[author, contributor, maintainer, publisher, wrangler] = contributor[source]
organization :String[source]
static dict_from_id(x: str) dict[source]

Construct dictionary from PUDL identifier.

classmethod from_id(cls, x: str) Contributor[source]

Construct from PUDL identifier.

class pudl.metadata.classes.ResourceHarvest[source]

Bases: Base

Resource harvest parameters (resource.harvest).

harvest :Bool = False[source]

Whether to harvest from dataframes based on field names.

If False, the dataframe with the same name is used and the process is limited to dropping unwanted fields.

tolerance :PositiveFloat = 0.0[source]

Fraction of invalid fields above which result is considered invalid.

class pudl.metadata.classes.Resource[source]

Bases: Base

Tabular data resource (package.resources[…]).

See https://specs.frictionlessdata.io/tabular-data-resource.

Examples

A simple example illustrates the conversion to SQLAlchemy objects.

>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}]
>>> fkeys = [{'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}]
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': fkeys}
>>> resource = Resource(name='a', schema=schema)
>>> table = resource.to_sql()
>>> table.columns.x
Column('x', Integer(), ForeignKey('b.x'), CheckConstraint(...), table=<a>, primary_key=True, nullable=False)
>>> table.columns.y
Column('y', Text(), ForeignKey('b.y'), CheckConstraint(...), table=<a>)

To illustrate harvesting operations, say we have a resource with two fields, a primary key (id) and a data field (x), which we want to harvest from two different dataframes.

>>> from pudl.metadata.helpers import unique, as_dict
>>> fields = [
...     {'name': 'id', 'type': 'integer'},
...     {'name': 'x', 'type': 'integer', 'harvest': {'aggregate': unique, 'tolerance': 0.25}}
... ]
>>> resource = Resource(**{
...     'name': 'a',
...     'harvest': {'harvest': True},
...     'schema': {'fields': fields, 'primary_key': ['id']}
... })
>>> dfs = {
...     'a': pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1, 1, 2, 2]}),
...     'b': pd.DataFrame({'id': [2, 3, 3], 'x': [3, 4, 4]})
... }

Skip aggregation to access all the rows concatenated from the input dataframes. The names of the input dataframes are used as the index.

>>> df, _ = resource.harvest_dfs(dfs, aggregate=False)
>>> df
    id  x
df
a    1  1
a    1  1
a    2  2
a    2  2
b    2  3
b    3  4
b    3  4

Field names and data types are enforced.

>>> resource.to_pandas_dtypes() == df.dtypes.apply(str).to_dict()
True

Alternatively, aggregate by primary key (the default when harvest.harvest=True) and report aggregation errors.

>>> df, report = resource.harvest_dfs(dfs)
>>> df
       x
id
1      1
2   <NA>
3      4
>>> report['stats']
{'all': 2, 'invalid': 1, 'tolerance': 0.0, 'actual': 0.5}
>>> report['fields']['x']['stats']
{'all': 3, 'invalid': 1, 'tolerance': 0.25, 'actual': 0.33...}
>>> report['fields']['x']['errors']
id
2    Not unique.
Name: x, dtype: object

Customize the error values in the error report.

>>> error = lambda x, e: as_dict(x)
>>> df, report = resource.harvest_dfs(
...    dfs, aggregate_kwargs={'raised': False, 'error': error}
... )
>>> report['fields']['x']['errors']
id
2    {'a': [2, 2], 'b': [3]}
Name: x, dtype: object

Limit harvesting to the input dataframe of the same name by setting harvest.harvest=False.

>>> resource.harvest.harvest = False
>>> df, _ = resource.harvest_dfs(dfs, aggregate_kwargs={'raised': False})
>>> df
    id  x
df
a    1  1
a    1  1
a    2  2
a    2  2

Harvesting can also handle conversion to longer time periods. Period harvesting requires primary key fields with a datetime data type, except for year fields, which may be integers.

>>> fields = [{'name': 'report_year', 'type': 'year'}]
>>> resource = Resource(**{
...     'name': 'table', 'harvest': {'harvest': True},
...     'schema': {'fields': fields, 'primary_key': ['report_year']}
... })
>>> df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']})
>>> resource.format_df(df)
  report_year
0  2000-01-01
1  2000-01-01
>>> df = pd.DataFrame({'report_year': [2000, 2000]})
>>> resource.format_df(df)
  report_year
0  2000-01-01
1  2000-01-01
name :SnakeCase[source]
title :String[source]
description :String[source]
harvest :ResourceHarvest[source]
group :Literal[eia, epacems, ferc1, ferc714, glue, pudl][source]
schema_ :Schema[source]
contributors :List[Contributor] = [][source]
licenses :List[License] = [][source]
sources :List[Source] = [][source]
keywords :List[String] = [][source]
encoder :Encoder[source]
_check_unique[source]
_check_harvest_primary_key(cls, value, values)[source]
static dict_from_id(x: str) dict[source]

Construct dictionary from PUDL identifier (resource.name).

  • schema.fields

    • Field names are expanded (Field.from_id()).

    • Field attributes are replaced with any specific to the resource.group and field.name.

  • sources: Source ids are expanded (Source.from_id()).

  • licenses: License ids are expanded (License.from_id()).

  • contributors: Contributor ids are fetched by source ids, then expanded (Contributor.from_id()).

  • keywords: Keywords are fetched by source ids.

  • schema.foreign_keys: Foreign keys are fetched by resource name.

classmethod from_id(cls, x: str) Resource[source]

Construct from PUDL identifier (resource.name).

get_field(self, name: str) Field[source]

Return the field with the given name if it's part of the resource's schema.

to_sql(self, metadata: sqlalchemy.MetaData = None, check_types: bool = True, check_values: bool = True) sqlalchemy.Table[source]

Return equivalent SQL Table.

to_pandas_dtypes(self, **kwargs: Any) Dict[str, Union[str, pandas.CategoricalDtype]][source]

Return Pandas data type of each field by field name.

Parameters

kwargs – Arguments to Field.to_pandas_dtype().

match_primary_key(self, names: Iterable[str]) Optional[Dict[str, str]][source]

Match primary key fields to input field names.

An exact match is required unless harvest.harvest=True, in which case periodic names may also match a basename with a smaller period.

Parameters

names – Field names.

Raises
  • ValueError – Field names are not unique.

  • ValueError – Multiple field names match primary key field.

Returns

The name matching each primary key field (if any) as a dict, or None if not all primary key fields have a match.

Examples

>>> fields = [{'name': 'x_year', 'type': 'year'}]
>>> schema = {'fields': fields, 'primary_key': ['x_year']}
>>> resource = Resource(name='r', schema=schema)

By default, when harvest.harvest=False, exact matches are required.

>>> resource.harvest.harvest
False
>>> resource.match_primary_key(['x_month']) is None
True
>>> resource.match_primary_key(['x_year', 'x_month'])
{'x_year': 'x_year'}

When harvest.harvest=True, in the absence of an exact match, periodic names may also match a basename with a smaller period.

>>> resource.harvest.harvest = True
>>> resource.match_primary_key(['x_year', 'x_month'])
{'x_year': 'x_year'}
>>> resource.match_primary_key(['x_month'])
{'x_month': 'x_year'}
>>> resource.match_primary_key(['x_month', 'x_date'])
Traceback (most recent call last):
ValueError: ... {'x_month', 'x_date'} match primary key field 'x_year'
format_df(self, df: pandas.DataFrame = None, **kwargs: Any) pandas.DataFrame[source]

Format a dataframe.

Parameters
  • df – Dataframe to format.

  • kwargs – Arguments to Field.to_pandas_dtypes().

Returns

Dataframe with column names and data types matching the resource fields. Periodic primary key fields are snapped to the start of the desired period. If the primary key fields could not be matched to columns in df (match_primary_key()) or if df=None, an empty dataframe is returned.

aggregate_df(self, df: pandas.DataFrame, raised: bool = False, error: Callable = None) Tuple[pandas.DataFrame, dict][source]

Aggregate dataframe by primary key.

The dataframe is grouped by primary key fields and aggregated with the aggregate function of each field (schema_.fields[*].harvest.aggregate).

The report is formatted as follows:

  • valid (bool): Whether the resource is valid.

  • stats (dict): Error statistics for resource fields.

  • fields (dict):

    • <field_name> (str)

      • valid (bool): Whether field is valid.

      • stats (dict): Error statistics for field groups.

      • errors (pandas.Series): Error values indexed by primary key.

Each stats (dict) contains the following:

  • all (int): Number of entities (field or field group).

  • invalid (int): Invalid number of entities.

  • tolerance (float): Fraction of invalid entities below which parent entity is considered valid.

  • actual (float): Actual fraction of invalid entities.

Parameters
  • df – Dataframe to aggregate. It is assumed to have column names and data types matching the resource fields.

  • raised – Whether aggregation errors are raised or replaced with np.nan and returned in an error report.

  • error – A function with signature f(x, e) -> Any, where x is the original field values as a pandas.Series and e is the original error. If provided, the returned value is reported instead of e.

Raises

ValueError – A primary key is required for aggregating.

Returns

The aggregated dataframe indexed by primary key fields, and an aggregation report (described above) that includes all aggregation errors and whether the result meets the resource's and fields' tolerances.

_build_aggregation_report(self, df: pandas.DataFrame, errors: dict) dict[source]

Build report from aggregation errors.

Parameters
  • df – Harvested dataframe (see harvest_dfs()).

  • errors – Aggregation errors (see groupby_aggregate()).

Returns

Aggregation report, as described in aggregate_df().

harvest_dfs(self, dfs: Dict[str, pandas.DataFrame], aggregate: bool = None, aggregate_kwargs: Dict[str, Any] = {}, format_kwargs: Dict[str, Any] = {}) Tuple[pandas.DataFrame, dict][source]

Harvest from named dataframes.

For standard resources (harvest.harvest=False), the columns matching all primary key fields and any data fields are extracted from the input dataframe of the same name.

For harvested resources (harvest.harvest=True), the columns matching all primary key fields and any data fields are extracted from each compatible input dataframe and concatenated into a single dataframe. Periodic key fields (e.g. 'report_month') are matched to any column of the same name with an equal or smaller period (e.g. 'report_day') and snapped to the start of the desired period.

If aggregate=False, rows are indexed by the name of the input dataframe. If aggregate=True, rows are indexed by primary key fields.

Parameters
  • dfs – Dataframes to harvest.

  • aggregate – Whether to aggregate the harvested rows by their primary key. By default, this is True if self.harvest.harvest=True and False otherwise.

  • aggregate_kwargs – Optional arguments to aggregate_df().

  • format_kwargs – Optional arguments to format_df().

Returns

A dataframe harvested from the input dataframes, with column names and data types matching the resource fields, alongside an aggregation report.

to_rst(self, path: str) None[source]

Output to an RST file.

encode(self, df: pandas.DataFrame) pandas.DataFrame[source]

Standardize coded columns using the foreign column they refer to.

class pudl.metadata.classes.Package[source]

Bases: Base

Tabular data package.

See https://specs.frictionlessdata.io/data-package.

Examples

Foreign keys between resources are checked for completeness and consistency.

>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}]
>>> fkey = {'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': [fkey]}
>>> a = Resource(name='a', schema=schema)
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x']))
>>> Package(name='ab', resources=[a, b])
Traceback (most recent call last):
ValidationError: ...
>>> b.schema.primary_key = ['x', 'y']
>>> package = Package(name='ab', resources=[a, b])

SQLAlchemy can sort tables, based on foreign keys, in the order in which they need to be loaded into a database.

>>> metadata = package.to_sql()
>>> [table.name for table in metadata.sorted_tables]
['b', 'a']
name :String[source]
title :String[source]
description :String[source]
keywords :List[String] = [][source]
homepage :HttpUrl = https://catalyst.coop/pudl[source]
created :Datetime[source]
contributors :List[Contributor] = [][source]
sources :List[Source] = [][source]
licenses :List[License] = [][source]
resources :StrictList(Resource)[source]
_check_foreign_keys(cls, value)[source]
_populate_from_resources(cls, values)[source]
classmethod from_resource_ids(cls, resource_ids: Iterable[str], resolve_foreign_keys: bool = False) Package[source]

Construct a collection of Resources from PUDL identifiers (resource.name).

Any fields with foreign key relationships referencing the coding tables defined in pudl.metadata.codes are identified, and each coding table's encoder is associated with those columns for later use in cleaning them up.

Parameters
  • resource_ids – Resource PUDL identifiers (resource.name).

  • resolve_foreign_keys – Whether to add resources as needed based on foreign keys.
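
A hedged usage sketch ('fuel_ferc1' stands in for any defined resource identifier):

>>> package = Package.from_resource_ids(['fuel_ferc1'])
>>> package.get_resource('fuel_ferc1').name
'fuel_ferc1'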

get_resource(self, name: str) Resource[source]

Return the resource with the given name if it is in the Package.

to_rst(self, path: str) None[source]

Output to an RST file.

to_sql(self, check_types: bool = True, check_values: bool = True) sqlalchemy.MetaData[source]

Return equivalent SQL MetaData.