pudl.load.metadata module

Make me metadata!!!

Lists of dictionaries of dictionaries of lists, forever. This module enables the generation and use of the metadata for tabular data packages. It also saves and validates the data package once the metadata is compiled. The intended use of the module is to run it after generating the CSVs via etl.py.

At a basic level, we compile information about a data package based on the settings in pkg_settings and on the tables and sources associated with the package. The table metadata is pulled from the megadata (pudl/package_data/meta/datapackage/datapackage.json). Most of the other elements of the metadata are regenerated.

For most tables, this is a relatively straightforward process, but we are attempting to enable partitioning of tables (storing parts of a table in individual CSVs). These partitioned tables are parts of a “group” which can be read by frictionlessdata tools as one table. At each step of the process, this module needs to know whether to deal with the full partitioned table names or the canonical table name.
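As a rough illustration of the grouping idea, the sketch below collects CSV paths under a canonical table name. The resource names, the paths, and the use of a "group" key on each resource descriptor are assumptions made for this example, not details confirmed by this module.

```python
from collections import defaultdict

# Hypothetical resource descriptors. The names, paths, and the "group" key
# are illustrative assumptions, not taken from the PUDL metadata.
resources = [
    {
        "name": "hourly_emissions_epacems_2017",
        "group": "hourly_emissions_epacems",
        "path": "data/hourly_emissions_epacems_2017.csv",
    },
    {
        "name": "hourly_emissions_epacems_2018",
        "group": "hourly_emissions_epacems",
        "path": "data/hourly_emissions_epacems_2018.csv",
    },
    {"name": "plants_eia", "path": "data/plants_eia.csv"},
]


def paths_by_canonical_table(resources):
    """Map each canonical table name to the CSV paths that make it up."""
    grouped = defaultdict(list)
    for resource in resources:
        # Partitioned resources share a group; other resources stand alone.
        canonical = resource.get("group", resource["name"])
        grouped[canonical].append(resource["path"])
    return dict(grouped)


table_paths = paths_by_canonical_table(resources)
```

Here the two partitioned resources end up under one canonical name, while the unpartitioned table keeps its own name.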

pudl.load.metadata.compile_partitions(pkg_settings)

Pull out the partitions from data package settings.

Parameters

pkg_settings (dict) – a dictionary of package settings, containing the top-level elements of the data package JSON descriptor that are specific to this data package.

Returns

Return type

dict

pudl.load.metadata.data_sources_from_tables_pkg(table_names, testing=False)

Look up data sources based on a list of PUDL DB tables.

Parameters
  • table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.

  • testing (bool) – Whether to connect to the test database (True) or the live PUDL database (False).

Returns

The set of data sources for the list of PUDL table names.

Return type

set

pudl.load.metadata.generate_metadata(pkg_settings, tables, pkg_dir, uuid_pkgs='733a1605-eaaf-46c4-b461-471e2aec3a84')

Generate metadata for package tables and validate package.

The metadata for this package is compiled from the pkg_settings and from the “megadata”, which is a JSON file containing the schema for all of the possible PUDL tables. Given a set of tables, this function compiles the metadata and validates both the metadata and the package. This function assumes the data package CSVs have already been generated.

See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/

Parameters
  • pkg_settings (dict) – a dictionary of package settings, containing the top-level elements of the data package JSON descriptor that are specific to this data package, including:

      • name: short package name, e.g. pudl-eia923, ferc1-test, cems_pkg.
      • title: one-line human-readable description.
      • description: a paragraph-long description.
      • keywords: for search purposes.

  • tables (list) – a list of tables that are included in this data package.

  • pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.

  • uuid_pkgs

Todo

Return to (uuid_pkgs)

Returns

a datapackage. See frictionlessdata specs. dict: a validation dictionary containing the validity of the package and any errors that were generated during packaging.

Return type

datapackage.package.Package
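To make the role of pkg_settings more concrete, here is a minimal sketch of how its top-level elements might be copied into a data package descriptor. The helper name top_level_descriptor and the example values are hypothetical; only the four keys (name, title, description, keywords) come from the parameter description above, and the "tabular-data-package" profile value comes from the Frictionless Data specification.

```python
import json

# Example settings; the values are made up for illustration.
pkg_settings = {
    "name": "pudl-example",
    "title": "Example PUDL data package",
    "description": "A paragraph-long description would go here.",
    "keywords": ["electricity", "example"],
}


def top_level_descriptor(pkg_settings, resources):
    """Assemble a descriptor skeleton (hypothetical helper, not PUDL's API)."""
    keys = ("name", "title", "description", "keywords")
    # Copy only the documented top-level elements that are present.
    descriptor = {key: pkg_settings[key] for key in keys if key in pkg_settings}
    descriptor["profile"] = "tabular-data-package"
    descriptor["resources"] = list(resources)
    return descriptor


descriptor = top_level_descriptor(pkg_settings, resources=[])
print(json.dumps(descriptor, indent=2))
```

A real descriptor would also carry one resource entry per table, plus whatever source and licensing metadata the package needs.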

pudl.load.metadata.get_autoincrement_columns(unpartitioned_tables)

Grab the autoincrement columns for pkg tables.

pudl.load.metadata.get_dependent_tables_from_list_pkg(table_names, testing=False)

Given a list of tables, find all the other tables they depend on.

Iterate over a list of input tables, adding them and all of their dependent tables to a set, and return that set. Useful for determining which tables need to be exported together to yield a self-contained subset of the PUDL database.

Parameters
  • table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.

  • testing (bool) – Whether to connect to the test database (True) or the live PUDL database (False).

Returns

The set of all the tables which any of the input tables depends on, via ForeignKey constraints.

Return type

all_the_tables (set)
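The traversal described above can be sketched as follows. This is not the module's implementation: the helper name dependent_tables, the shape of fk_relash (a mapping from each table name to the tables its foreign keys reference), and the table names are all assumptions for illustration.

```python
def dependent_tables(table_names, fk_relash):
    """Collect the seed tables plus every table reachable via foreign keys."""
    found = set()
    to_visit = list(table_names)
    while to_visit:
        table = to_visit.pop()
        if table in found:
            # Already handled; this also guards against circular references.
            continue
        found.add(table)
        to_visit.extend(fk_relash.get(table, []))
    return found


# Hypothetical foreign key relationships: table -> tables it references.
fk_relash = {
    "generators": ["plants"],
    "plants": ["utilities"],
    "utilities": [],
    "fuel_types": [],
}

export_set = dependent_tables(["generators"], fk_relash)
```

Starting from generators, the walk pulls in plants and then utilities, which is the self-contained subset that would need to be exported together.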

pudl.load.metadata.get_dependent_tables_pkg(table_name, fk_relash)

For a given table, get the list of all the other tables it depends on.

Parameters
  • table_name (str) – The table whose dependencies we are looking for.

  • fk_relash –

Todo

Incomplete docstring.

Returns

the set of all the tables the specified table depends upon.

Return type

set

pudl.load.metadata.get_foreign_key_relash_from_pkg(pkg_json)

Generate a dictionary of foreign key relationships from packaging metadata.

This function helps us pull all of the foreign key relationships of all of the tables in the metadata.

Parameters

pkg_json (path-like) – Path to the datapackage.json containing the schema from which the foreign key relationships will be read.

Returns

list of foreign key tables

Return type

dict
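As a sketch of what reading foreign key relationships from a descriptor can look like, the example below walks the foreignKeys entries defined by the Frictionless Table Schema specification. It works on an already-parsed descriptor rather than a file path, and the helper name, the output shape, and the sample tables are assumptions for illustration.

```python
def foreign_key_relationships(descriptor):
    """Map each resource name to the resources its foreign keys reference."""
    relationships = {}
    for resource in descriptor["resources"]:
        foreign_keys = resource.get("schema", {}).get("foreignKeys", [])
        relationships[resource["name"]] = [
            fk["reference"]["resource"] for fk in foreign_keys
        ]
    return relationships


# A tiny hypothetical descriptor with one foreign key relationship.
descriptor = {
    "resources": [
        {"name": "utilities", "schema": {"fields": [{"name": "utility_id"}]}},
        {
            "name": "plants",
            "schema": {
                "fields": [{"name": "plant_id"}, {"name": "utility_id"}],
                "foreignKeys": [
                    {
                        "fields": "utility_id",
                        "reference": {
                            "resource": "utilities",
                            "fields": "utility_id",
                        },
                    }
                ],
            },
        },
    ]
}

fk_relash = foreign_key_relationships(descriptor)
```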

pudl.load.metadata.get_repartitioned_tables(tables, partitions, pkg_settings)

Get the re-partitioned tables.

Parameters
  • tables (list) – a list of tables that are included in this data package.

  • partitions (dict) –

  • pkg_settings (dict) – a dictionary of package settings, containing the top-level elements of the data package JSON descriptor that are specific to this data package.

Returns

list of tables including full groups of

Return type

list

pudl.load.metadata.get_source_metadata(data_sources, pkg_settings)

Grab sources for metadata.

pudl.load.metadata.get_tabular_data_resource(table_name, pkg_dir, partitions=False)

Create a Tabular Data Resource descriptor for a PUDL table.

Based on the information in the database and some additional metadata, this function generates a valid Tabular Data Resource descriptor according to the Frictionless Data specification, which can be found here: https://frictionlessdata.io/specs/tabular-data-resource/

Parameters
  • table_name (string) – table name for which you want to generate a Tabular Data Resource descriptor

  • pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.

Returns

A JSON object containing key information about the selected table

Return type

Tabular Data Resource descriptor
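For orientation, a stripped-down Tabular Data Resource descriptor might look like the sketch below. The property names follow the Frictionless Data specification; the helper name, the data/ subdirectory, and the example fields are assumptions, and the descriptor this function actually produces is likely to carry more information.

```python
def minimal_tabular_resource(table_name, fields):
    """Build a bare-bones descriptor (hypothetical helper, not PUDL's API)."""
    return {
        "profile": "tabular-data-resource",
        "name": table_name,
        # Assumed layout: one CSV per table inside a data/ subdirectory.
        "path": f"data/{table_name}.csv",
        "format": "csv",
        "mediatype": "text/csv",
        "encoding": "utf-8",
        "schema": {"fields": list(fields)},
    }


resource = minimal_tabular_resource(
    "plants_eia",
    fields=[
        {"name": "plant_id", "type": "integer"},
        {"name": "plant_name", "type": "string"},
    ],
)
```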

pudl.load.metadata.get_unpartioned_tables(tables, pkg_settings)

Get the tables without the partitions.

Because the partitioning key will always be the name of the table without whatever element the table is being partitioned by, we can use the partitioning keys to recover the names of all of the un-partitioned tables, which gives a list of tables that is easier to work with.

Parameters
  • tables (iterable) – list of tables that are included in this datapackage.

  • pkg_settings (dictionary) –

Returns

tables_unpartioned is a set of un-partitioned tables

Return type

iterable
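A minimal sketch of this idea follows, assuming that a partitioned table name always starts with its partitioning key. The helper name, the prefix-matching rule, and the table names are assumptions for illustration rather than the module's actual logic.

```python
def unpartitioned_tables(tables, partition_keys):
    """Collapse partitioned table names down to their canonical names."""
    canonical = set()
    for table in tables:
        # Use the matching partitioning key if there is one; otherwise the
        # table is not partitioned and keeps its own name.
        match = next(
            (key for key in partition_keys if table.startswith(key)), table
        )
        canonical.add(match)
    return canonical


tables = [
    "hourly_emissions_epacems_2017",
    "hourly_emissions_epacems_2018",
    "plants_eia",
]
tables_unpartitioned = unpartitioned_tables(tables, ["hourly_emissions_epacems"])
```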

pudl.load.metadata.hash_csv(csv_path)

Calculates a SHA-256 hash of the CSV file for data integrity checking.

Parameters

csv_path (path-like) – Path to the CSV file to hash.

Returns

the hexdigest of the hash, with a ‘sha256:’ prefix.

Return type

str
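A hash of this form can be computed with the standard library alone. The sketch below is one way to do it, not necessarily how hash_csv is written: the function name, the chunked reading, and the temporary example file are choices made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_csv_sketch(csv_path, chunk_size=65536):
    """Return the SHA-256 hexdigest of a file with a 'sha256:' prefix."""
    digest = hashlib.sha256()
    with open(csv_path, "rb") as csv_file:
        # Read in chunks so large CSVs never need to fit in memory at once.
        for chunk in iter(lambda: csv_file.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


# Hash a small throwaway CSV to show the shape of the result.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_csv = Path(tmp_dir) / "example.csv"
    example_csv.write_bytes(b"id,value\n1,10\n")
    csv_hash = hash_csv_sketch(example_csv)
```

The prefixed string can be stored alongside a resource so that a later read of the CSV can be checked against it.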

pudl.load.metadata.package_files_from_table(table, pkg_settings)

Determine which files should exist in a package corresponding to a table.

We want to convert the datapackage tables and any information about package partitioning into a list of expected files. For each table that is partitioned, we want to add the partitions to the end of the table name.
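The conversion from a table to its expected files could look roughly like this. The helper name, the partitions mapping (table name to a list of partition values), the underscore separator, and the .csv suffix are assumptions for illustration; the real function reads its partition information from pkg_settings.

```python
def expected_files(table, partitions):
    """List the CSV file names expected for one table.

    partitions is a hypothetical mapping of table name to partition values.
    """
    values = partitions.get(table)
    if not values:
        # Unpartitioned tables are stored in a single CSV.
        return [f"{table}.csv"]
    # Partitioned tables get one CSV per partition, named with the
    # partition value appended to the end of the table name.
    return [f"{table}_{value}.csv" for value in values]


partitions = {"hourly_emissions_epacems": [2017, 2018]}
epacems_files = expected_files("hourly_emissions_epacems", partitions)
plants_files = expected_files("plants_eia", partitions)
```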

pudl.load.metadata.prep_pkg_bundle_directory(pudl_settings, pkg_bundle_name, clobber=False)

Create (or delete and create) data package directory.

Parameters
  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • debug (bool) – If True, return a dictionary with package names (keys) and a list with the data package metadata and report (values).

  • pkg_bundle_name (string) – name of the directory in which you want the bundle of data packages to live. If this is set to None, the name defaults to the pudl package version.

Returns

path-like

pudl.load.metadata.pull_resource_from_megadata(table_name)

Read a single data resource from the PUDL metadata library.

Parameters

table_name (str) – the name of the table / data resource whose JSON descriptor we are reading.

Returns

a Tabular Data Resource Descriptor, as a JSON object.

Return type

json

Raises

ValueError – If table_name is not found exactly one time in the PUDL metadata library.
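The "exactly one match" rule can be sketched as below. The helper takes an already-parsed metadata dictionary instead of loading the megadata file, and its name and the sample resources are assumptions for illustration.

```python
def pull_resource(megadata, table_name):
    """Return the single resource descriptor named table_name."""
    matches = [
        resource
        for resource in megadata["resources"]
        if resource["name"] == table_name
    ]
    if len(matches) != 1:
        raise ValueError(
            f"{table_name} found {len(matches)} times in the metadata; "
            "expected exactly 1."
        )
    return matches[0]


# Hypothetical metadata with two resources.
megadata = {
    "resources": [
        {"name": "plants", "path": "data/plants.csv"},
        {"name": "utilities", "path": "data/utilities.csv"},
    ]
}

plants_resource = pull_resource(megadata, "plants")
```

Raising on both zero and multiple matches means a misspelled table name and a duplicated metadata entry are caught the same way.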

pudl.load.metadata.test_file_consistency(tables, pkg_settings, pkg_dir)

Test the consistency of tables for packaging.

The purpose of this function is to test that we have the correct list of tables. There are three different ways we could determine which tables are being dumped into packages: the list of tables being generated through the ETL functions, the list of dependent tables, and the list of CSVs in the package directory.

Currently, this function is supposed to be fed the ETL function tables which are tested against the CSVs present in the package directory.

Parameters
  • pkg_name (string) – the name of the data package.

  • tables (list) – a list of table names to be tested.

  • pkg_dir (path-like) – the directory in which to check the consistency of table files

Raises

AssertionError – If the tables in the CSVs and the ETL tables are not exactly the same list of tables.

Todo

Determine what to do with the dependent tables check.
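The comparison between ETL tables and CSVs on disk can be sketched as follows. The helper name, the assumption that each CSV is named after its table, and the temporary directory used for the demonstration are illustrative choices, not the module's actual behavior.

```python
import tempfile
from pathlib import Path


def check_file_consistency(expected_tables, data_dir):
    """Assert that the CSVs in data_dir match the expected table names."""
    csv_tables = sorted(path.stem for path in Path(data_dir).glob("*.csv"))
    expected = sorted(expected_tables)
    assert csv_tables == expected, (
        f"Tables on disk {csv_tables} do not match expected tables {expected}."
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("plants", "utilities"):
        (Path(tmp_dir) / f"{name}.csv").write_text("id\n1\n")

    # Matching lists pass silently, regardless of order.
    check_file_consistency(["utilities", "plants"], tmp_dir)

    # A missing table on the ETL side is reported as an AssertionError.
    try:
        check_file_consistency(["plants"], tmp_dir)
        mismatch_detected = False
    except AssertionError:
        mismatch_detected = True
```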

pudl.load.metadata.validate_save_pkg(pkg_descriptor, pkg_dir)

Validate a data package descriptor and save it to a JSON file.

Parameters
  • pkg_descriptor (dict) –

  • pkg_dir (path-like) –

Returns

report