pudl.load.metadata module

Routines for generating PUDL tabular data package and resource metadata.

This module enables the generation and use of the metadata for tabular data packages. It also saves and validates the datapackage once the metadata is compiled. In general, the routines in this module can only be used after the referenced CSVs have been generated by the top level PUDL ETL module and written out to the datapackage data directory by the pudl.load.csv module.

The metadata comes from three basic sources: the datapkg_settings that are read in from the YAML file specifying the datapackage or bundle of datapackages to be generated; the CSV files themselves (their names, sizes, and hash values); and the stored metadata template, which ultimately determines the structure of the relational database that these output tabular data packages represent and encodes field-specific table schemas. This template is the “megadata”, which is stored in src/pudl/package_data/meta/datapkg/datapackage.json.

For unpartitioned tables which are contained in a single tabular data resource this is a relatively straightforward process. However, larger tables that have been partitioned into smaller tabular data resources that are part of a resource group (e.g. EPA CEMS) have additional complexities. We have tried to say “resource” when referring to an individual output CSV that has its own metadata entry, and “table” when referring to whole tables which typically contain only a single resource, but may be composed of hundreds or even thousands of individual resources.

See https://frictionlessdata.io for more details on the tabular data package standards.

In addition, we have included PUDL specific metadata fields that document the ETL parameters which were used to process the data, temporal and spatial coverage for each resource, Zenodo DOIs if appropriate, UUIDs to identify the individual data packages as well as co-generated bundles of data packages that can be used together to instantiate a single database, etc.

pudl.load.metadata.compile_keywords(data_sources)

Compile the set of all keywords associated with given data sources.

The list of keywords we associate with each data source is stored in the pudl.constants.keywords_by_data_source dictionary.

Parameters: data_sources (iterable) – List of data source codes (eia923, ferc1, etc.) from which to gather keywords.
Returns: the unique keywords associated with any of the input data sources, collected into a list.
Return type: list
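
A minimal sketch of the lookup described above, assuming a small hypothetical stand-in for pudl.constants.keywords_by_data_source (the keyword lists here are invented for illustration). It gathers keywords into a set to remove duplicates, then returns them as a list:

```python
# Hypothetical stand-in for pudl.constants.keywords_by_data_source;
# the real dictionary's contents are not shown on this page.
KEYWORDS_BY_DATA_SOURCE = {
    "eia923": ["eia923", "fuel", "generation"],
    "ferc1": ["ferc1", "form 1", "generation"],
}


def compile_keywords_sketch(data_sources):
    """Collect the unique keywords for the given data source codes."""
    keywords = set()
    for source in data_sources:
        keywords.update(KEYWORDS_BY_DATA_SOURCE[source])
    # Sorted only so that the illustration is deterministic.
    return sorted(keywords)


print(compile_keywords_sketch(["eia923", "ferc1"]))
```

The shared keyword "generation" appears only once in the result, which is the point of going through a set.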

pudl.load.metadata.compile_partitions(datapkg_settings)

Given a datapackage settings dictionary, extract dataset partitions.

Iterates through all the datasets enumerated in the datapackage settings, and compiles a dictionary indicating which datasets should be partitioned and on what basis when they are output as tabular data resources. Currently this only applies to the epacems dataset. Datapackage settings must be validated because currently we inject EPA CEMS partitioning variables (epacems_years, epacems_states) during the validation process.

Parameters: datapkg_settings (dict) – a dictionary containing validated datapackage settings, mostly read in from a PUDL ETL settings file.
Returns: Uses table name (e.g. hourly_emissions_epacems) as keys, and lists of partition variables (e.g. [“epacems_years”, “epacems_states”]) as the values. If no datasets within the datapackage are being partitioned, this is an empty dictionary.
Return type: dict
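
The shape of the returned dictionary can be sketched as follows. The layout of the settings dictionary (a "datasets" list of one-key dictionaries, each with an optional "partition" entry) is an assumption made for this illustration, not something this page specifies:

```python
def compile_partitions_sketch(datapkg_settings):
    """Collect partition variables for every dataset that declares them."""
    partitions = {}
    # ASSUMPTION: "datasets" is a list of one-key dicts, each mapping a
    # dataset name to its parameters, with an optional "partition" entry.
    for dataset in datapkg_settings["datasets"]:
        for params in dataset.values():
            partitions.update(params.get("partition", {}))
    return partitions


# Hypothetical validated settings with one partitioned dataset.
settings = {
    "datasets": [
        {"eia": {"eia923_years": [2018]}},
        {
            "epacems": {
                "epacems_years": [2018],
                "epacems_states": ["ID"],
                "partition": {
                    "hourly_emissions_epacems": ["epacems_years", "epacems_states"]
                },
            }
        },
    ]
}
print(compile_partitions_sketch(settings))
```

With no partitioned datasets in the settings, the sketch returns an empty dictionary, matching the behavior described above.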

pudl.load.metadata.data_sources_from_tables(table_names)

Look up data sources used by the given list of PUDL database tables.

Parameters: table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.
Returns: The set of data sources for the list of PUDL table names.
Return type: set

pudl.load.metadata.generate_metadata(datapkg_settings, datapkg_resources, datapkg_dir, datapkg_bundle_uuid=None, datapkg_bundle_doi=None)

Generate metadata for package tables and validate package.

The metadata for this package is compiled from the datapkg_settings and from the “megadata”, which is a JSON file containing the schema for all of the possible PUDL tables. Given a set of tables, this function compiles the metadata, then validates both the metadata and the package. This function assumes the datapackage CSVs have already been generated.

See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/

Parameters

datapkg_settings (dict) – a dictionary of package settings, containing the top level elements of the data package JSON descriptor that are specific to this data package, including:
* name: short, unique package name, e.g. pudl-eia923, ferc1-test
* title: One line human readable description.
* description: A paragraph long description.
* version: the version of the data package being published.
* keywords: For search purposes.
datapkg_resources (list) – The names of tabular data resources that are included in this data package.
datapkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapkg_dir directory, with the name of the package as the name of the subdirectory.
datapkg_bundle_uuid – A type 4 UUID identifying the ETL run which generated the data package. Data packages that share this UUID are compatible with each other.
datapkg_bundle_doi – A digital object identifier (DOI) that will be used to archive the bundle of mutually compatible data packages. Needs to be provided by an archiving service like Zenodo. This field may also be added after the data package has been generated.

Returns

a Python dictionary representing a valid tabular data package descriptor.

Return type

dict
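
As an illustration of the datapkg_settings elements listed in the parameters above, the sketch below builds a settings dictionary and derives the package directory from it. Every value is an invented placeholder, and the "datapkg" parent directory is hypothetical:

```python
from pathlib import Path

# All values below are invented placeholders, not real PUDL settings.
datapkg_settings = {
    "name": "pudl-eia923",
    "title": "An example one line title.",
    "description": "An example paragraph long description of the package.",
    "version": "0.1.0",
    "keywords": ["eia923", "example"],
}

# The data package directory is a subdirectory of datapkg_dir, named after
# the package. "datapkg" is a hypothetical parent directory.
datapkg_dir = Path("datapkg")
package_dir = datapkg_dir / datapkg_settings["name"]
print(package_dir)
```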

pudl.load.metadata.get_autoincrement_columns(unpartitioned_tables)

Grab the autoincrement columns for pkg tables.

pudl.load.metadata.get_datapkg_fks(datapkg_json)

Get a dictionary of foreign key relationships from datapackage metadata.

Parameters

datapkg_json (path-like) – Path to the datapackage.json containing the schema from which the foreign key relationships will be read.

Returns

table names (keys) with lists of table names (values) which the key table has foreign key relationships with.

Return type

dict
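
A minimal sketch of how such a dictionary could be read from a descriptor, assuming foreign keys are stored in each resource's schema under "foreignKeys" as in the Frictionless Table Schema specification. The function name and the two-table descriptor are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def datapkg_fks_sketch(datapkg_json):
    """Map each table name to the tables its foreign keys reference."""
    descriptor = json.loads(Path(datapkg_json).read_text(encoding="utf-8"))
    fk_relash = {}
    for resource in descriptor["resources"]:
        foreign_keys = resource.get("schema", {}).get("foreignKeys", [])
        fk_relash[resource["name"]] = [
            fk["reference"]["resource"] for fk in foreign_keys
        ]
    return fk_relash


# Hypothetical two-table descriptor, written to a temporary file for the demo.
example = {
    "resources": [
        {"name": "plants", "schema": {"fields": [{"name": "plant_id"}]}},
        {
            "name": "generators",
            "schema": {
                "fields": [{"name": "plant_id"}, {"name": "generator_id"}],
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        },
    ]
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "datapackage.json"
    path.write_text(json.dumps(example), encoding="utf-8")
    fks = datapkg_fks_sketch(path)
print(fks)
```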

pudl.load.metadata.get_dependent_tables(table_name, fk_relash)

For a given table, get the list of all the other tables it depends on.

Parameters

table_name (str) – The table whose dependencies we are looking for.
fk_relash (dict) – table names (keys) with lists of table names (values) which the key table has foreign key relationships with.

Returns

the set of all the tables the specified table depends upon.

Return type

set
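
One way to compute such a set is to walk the foreign key dictionary until no new tables turn up. The sketch below assumes that "all the other tables it depends on" includes indirect dependencies; the function name and table names are hypothetical:

```python
def dependent_tables_sketch(table_name, fk_relash):
    """Return every table reachable from table_name via foreign keys."""
    dependents = set()
    to_visit = list(fk_relash.get(table_name, []))
    while to_visit:
        table = to_visit.pop()
        if table in dependents:
            continue  # Already handled; also guards against cycles.
        dependents.add(table)
        to_visit.extend(fk_relash.get(table, []))
    return dependents


# Hypothetical relationships: generators -> plants -> utilities.
fk_relash = {
    "generators": ["plants"],
    "plants": ["utilities"],
    "utilities": [],
}
print(dependent_tables_sketch("generators", fk_relash))
```

An explicit work list is used here instead of recursion so that circular foreign key relationships cannot cause unbounded recursion.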

pudl.load.metadata.get_dependent_tables_from_list(table_names)

Given a list of tables, find all the other tables they depend on.

Iterate over a list of input tables, adding them and all of their dependent tables to a set, and return that set. Useful for determining which tables need to be exported together to yield a self-contained subset of the PUDL database.

Parameters: table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.
Returns: All tables with which any of the input tables have ForeignKey relations.
Return type: set

pudl.load.metadata.get_tabular_data_resource(resource_name, datapkg_dir, datapkg_settings, partitions=False)

Create a Tabular Data Resource descriptor for a PUDL table.

Based on the information in the database and some additional metadata, this function will generate a valid Tabular Data Resource descriptor, according to the Frictionless Data specification, which can be found here: https://frictionlessdata.io/specs/tabular-data-resource/

Parameters

resource_name (string) – name of the tabular data resource for which you want to generate a Tabular Data Resource descriptor. This is the resource name, rather than the database table name, because we partition large tables into resource groups consisting of many files.
datapkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapkg_dir directory, with the name of the package as the name of the subdirectory.
datapkg_settings (dict) – Python dictionary representing the ETL parameters read in from the settings file, pertaining to the tabular datapackage this resource is part of.
partitions (dict) – A dictionary with PUDL database table names as the keys (e.g. hourly_emissions_epacems), and lists of partition variables (e.g. [“epacems_years”, “epacems_states”]) as the values.

Returns

A Python dictionary representing a tabular data resource descriptor that complies with the Frictionless Data specification.

Return type

dict

pudl.load.metadata.get_unpartitioned_tables(resources, datapkg_settings)

Generate a list of database table names from a list of data resources.

In the case of EPA CEMS and potentially other large datasets, we are partitioning a single table into many tabular data resources that are part of a resource group. However, in some contexts we want to refer to the list of corresponding database tables, rather than the list of resources.

The partition key in the datapackage settings is the name of the table without the partition elements, and so in the case of partitioned tables we use that key as the name of the table. Otherwise we just use the name of the resource.

Parameters

resources (iterable) – A list of tabular data resource names. Each one is expected to appear in the datapackage specified by datapkg_settings.
datapkg_settings (dict) – a dictionary containing validated datapackage settings, mostly read in from a PUDL ETL settings file.

Returns

The names of the database tables corresponding to the tabular datapackage resource names that were passed in.

Return type

list
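
The mapping from resource names back to table names can be sketched as below. For simplicity the sketch takes the partitions dictionary directly rather than the datapackage settings, and it assumes that a partitioned resource's name begins with its table name; both choices, and all of the names used, are assumptions made for illustration:

```python
def unpartitioned_tables_sketch(resources, partitions):
    """Collapse partitioned resource names into their table names."""
    tables = []
    for resource in resources:
        # ASSUMPTION: a partitioned resource name begins with its table name.
        table = next(
            (name for name in partitions if resource.startswith(name)),
            resource,  # Unpartitioned: the resource name is the table name.
        )
        if table not in tables:
            tables.append(table)
    return tables


partitions = {"hourly_emissions_epacems": ["epacems_years", "epacems_states"]}
resources = [
    "plants",
    "hourly_emissions_epacems_2018_id",
    "hourly_emissions_epacems_2018_co",
]
print(unpartitioned_tables_sketch(resources, partitions))
```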

pudl.load.metadata.hash_csv(csv_path)

Calculates a SHA-256 hash of the CSV file for data integrity checking.

Parameters: csv_path (path-like) – Path to the CSV file to hash.
Returns: the hexdigest of the hash, with a ‘sha256:’ prefix.
Return type: str
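
A minimal sketch of this kind of hashing with the standard library, reading the file in blocks so that large CSVs do not need to fit in memory. The function name and block size are illustrative choices, not taken from the PUDL source:

```python
import hashlib
import tempfile
from pathlib import Path


def hash_csv_sketch(csv_path, blocksize=65536):
    """Return the SHA-256 hexdigest of a file with a 'sha256:' prefix."""
    hasher = hashlib.sha256()
    with open(csv_path, "rb") as csv_file:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: csv_file.read(blocksize), b""):
            hasher.update(block)
    return f"sha256:{hasher.hexdigest()}"


# Demonstrate on a small temporary CSV file.
content = b"plant_id,year\n1,2018\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"
    csv_path.write_bytes(content)
    digest = hash_csv_sketch(csv_path)
print(digest)
```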

pudl.load.metadata.pull_resource_from_megadata(resource_name)

Read metadata for a given data resource from the stored PUDL megadata.

Parameters: resource_name (str) – the name of the tabular data resource whose JSON descriptor we are reading.
Returns: A Python dictionary containing the resource descriptor portion of a data package descriptor, not expected to be valid or complete.
Return type: dict
Raises: ValueError – If resource_name is not found exactly one time in the PUDL metadata library.
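
The "exactly one time" check can be sketched as follows. The sketch operates on an in-memory descriptor rather than the stored megadata file, and the function name and example resources are hypothetical:

```python
import copy


def pull_resource_sketch(resource_name, megadata):
    """Return the single resource descriptor matching resource_name."""
    matches = [
        resource
        for resource in megadata["resources"]
        if resource["name"] == resource_name
    ]
    if len(matches) != 1:
        raise ValueError(
            f"{resource_name} found {len(matches)} times in the metadata; "
            "expected exactly 1."
        )
    # Copy so that callers can modify the descriptor without altering megadata.
    return copy.deepcopy(matches[0])


# Hypothetical megadata with two resources.
megadata = {
    "resources": [
        {"name": "plants", "schema": {"fields": [{"name": "plant_id"}]}},
        {"name": "generators", "schema": {"fields": [{"name": "generator_id"}]}},
    ]
}
print(pull_resource_sketch("plants", megadata)["name"])
```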

pudl.load.metadata.spatial_coverage(resource_name)

Extract spatial coverage (country and state) for a given source.

Parameters: resource_name (str) – The name of the (potentially partitioned) resource for which we are enumerating the spatial coverage. Currently this is the only place we are able to access the partitioned spatial coverage after the ETL process has completed.
Returns: A dictionary containing country and potentially state level spatial coverage elements. Country keys are “country” for the full name of country, “iso_3166-1_alpha-2” for the 2-letter ISO code, and “iso_3166-1_alpha-3” for the 3-letter ISO code. State level elements are “state” (a two letter ISO code for sub-national jurisdiction) and “iso_3166-2” for the combined country-state code conforming to that standard.
Return type: dict
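
To make the key names concrete, the sketch below builds a dictionary of this shape. It assumes the country is the United States and takes the state abbreviation as an argument instead of deriving it from the resource name; the function name is hypothetical:

```python
def spatial_coverage_sketch(state=None):
    """Build a coverage dict with country keys and optional state keys."""
    # ASSUMPTION: all resources cover the United States.
    coverage = {
        "country": "United States of America",
        "iso_3166-1_alpha-2": "US",
        "iso_3166-1_alpha-3": "USA",
    }
    if state is not None:
        coverage["state"] = state
        # ISO 3166-2 codes combine the country and subdivision codes.
        coverage["iso_3166-2"] = f"US-{state}"
    return coverage


print(spatial_coverage_sketch("ID"))
print(spatial_coverage_sketch())
```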

pudl.load.metadata.temporal_coverage(resource_name, datapkg_settings)

Extract start and end dates from ETL parameters for a given source.

Parameters

resource_name (str) – The name of the (potentially partitioned) resource for which we are enumerating the temporal coverage. Currently this is the only place we are able to access the partitioned temporal coverage after the ETL process has completed.
datapkg_settings (dict) – Python dictionary representing the ETL parameters read in from the settings file, pertaining to the tabular datapackage this resource is part of.

Returns

A dictionary of two items, keys “start_date” and “end_date” with values in ISO 8601 YYYY-MM-DD format, indicating the extent of the time series data contained within the resource. If the resource does not contain time series data, the dates are null.

Return type

dict
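
A dictionary of this shape can be sketched as below. The sketch assumes a resource that covers one full calendar year and takes that year as an argument instead of reading it from the settings; the function name is hypothetical:

```python
import datetime


def temporal_coverage_sketch(year=None):
    """Build a start/end date dict for a resource covering one year."""
    if year is None:
        # No time series data: both dates are null (None in Python).
        return {"start_date": None, "end_date": None}
    return {
        "start_date": datetime.date(year, 1, 1).isoformat(),
        "end_date": datetime.date(year, 12, 31).isoformat(),
    }


print(temporal_coverage_sketch(2018))
print(temporal_coverage_sketch())
```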

pudl.load.metadata.validate_save_datapkg(datapkg_descriptor, datapkg_dir, row_limit=1000, table_limit=10)

Validate datapackage descriptor, save it, and validate some sample data.

Parameters

datapkg_descriptor (dict) – A Python dictionary representation of a (hopefully valid) tabular datapackage descriptor.
datapkg_dir (path-like) – Directory into which the datapackage.json file containing the tabular datapackage descriptor should be written.
row_limit (int) – Number of rows to validate in each table. Passed in to goodtables.validate().
table_limit (int) – Number of different tables to validate within the datapackage. Passed in to goodtables.validate(). Note that larger numbers of tables have caused memory issues, for reasons that are not yet understood.

Returns

A dictionary containing the goodtables datapackage validation report. Note that this will only be returned if there are no errors, otherwise it is output as an error message.

Return type

dict

Raises

ValueError – if the datapackage descriptor passed in is invalid, or if any of the tables has a data validation error.