pudl.load.metadata module
Make me metadata!!!
Lists of dictionaries of dictionaries of lists, forever. This module enables the generation and use of the metadata for tabular data packages. It also saves and validates the datapackage once the metadata is compiled. The module is intended to be used after generating the CSVs via etl.py.
On a basic level, we compile information about the data package based on the settings in pkg_settings and on the tables and sources associated with the data package. For the table metadata, we pull from the megadata (pudl/package_data/meta/datapackage/datapackage.json). Most of the other elements of the metadata are regenerated.
For most tables, this is a relatively straightforward process, but we are attempting to enable partitioning of tables (storing parts of a table in individual CSVs). These partitioned tables are parts of a “group” which can be read by frictionlessdata tools as one table. At each step of the process, this module needs to know whether to deal with the full partitioned table names or the canonical table name.
-
pudl.load.metadata.compile_partitions(pkg_settings)
Pull out the partitions from data package settings.
-
pudl.load.metadata.data_sources_from_tables_pkg(table_names, testing=False)
Look up data sources based on a list of PUDL DB tables.
-
pudl.load.metadata.generate_metadata(pkg_settings, tables, pkg_dir, uuid_pkgs='733a1605-eaaf-46c4-b461-471e2aec3a84')
Generate metadata for package tables and validate package.
The metadata for this package is compiled from the pkg_settings and from the “megadata”, which is a JSON file containing the schema for all of the possible PUDL tables. Given a set of tables, this function compiles the metadata and validates both the metadata and the package. This function assumes the datapackage CSVs have already been generated.
See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/
- Parameters
pkg_settings (dict) – a dictionary of package settings containing the top-level elements of the data package JSON descriptor that are specific to this data package, including:
* name: short package name, e.g. pudl-eia923, ferc1-test, cems_pkg
* title: one-line human-readable description.
* description: a paragraph-long description.
* keywords: for search purposes.
tables (list) – a list of tables that are included in this data package.
pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.
uuid_pkgs –
Todo
Return to (uuid_pkgs)
- Returns
a datapackage. See frictionlessdata specs. dict: a validation dictionary containing the validity of the package and any errors that were generated during packaging.
- Return type
datapackage.package.Package
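As a rough illustration of the top-level descriptor elements listed above, the sketch below assembles a minimal data package descriptor as a plain dictionary and round-trips it through a datapackage.json file. It does not call the PUDL code or the datapackage library. The helper name `build_descriptor`, the `profile` value and the sample settings are assumptions made for this example only.

```python
import json
import tempfile
from pathlib import Path


def build_descriptor(pkg_settings, resources):
    """Assemble a minimal descriptor from package settings (hypothetical helper)."""
    return {
        "profile": "tabular-data-package",
        "name": pkg_settings["name"],
        "title": pkg_settings["title"],
        "description": pkg_settings["description"],
        "keywords": pkg_settings["keywords"],
        "resources": resources,
    }


# Sample settings, invented for illustration only.
settings = {
    "name": "pudl-eia923",
    "title": "Example package",
    "description": "A paragraph-long description would go here.",
    "keywords": ["eia923", "electricity"],
}
descriptor = build_descriptor(settings, resources=[])

# Write the descriptor to a temporary datapackage.json and read it back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "datapackage.json"
    out_path.write_text(json.dumps(descriptor, indent=4))
    reloaded = json.loads(out_path.read_text())
```

In a real package the `resources` list would hold one Tabular Data Resource descriptor per CSV; it is left empty here to keep the sketch short.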
-
pudl.load.metadata.get_autoincrement_columns(unpartitioned_tables)
Grab the autoincrement columns for pkg tables.
-
pudl.load.metadata.get_dependent_tables_from_list_pkg(table_names, testing=False)
Given a list of tables, find all the other tables they depend on.
Iterate over a list of input tables, adding them and all of their dependent tables to a set, and return that set. Useful for determining which tables need to be exported together to yield a self-contained subset of the PUDL database.
- Parameters
table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.
testing (bool) – Whether to connect to the test database (True) or the live PUDL database (False).
- Returns
The set of all the tables which any of the input tables depends on, via ForeignKey constraints.
- Return type
all_the_tables (set)
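The traversal described above amounts to a transitive closure over foreign key relationships. The following is a minimal sketch of that idea, assuming the relationships are already available as a dictionary mapping each table to the tables it references. The function name `dependent_tables` and the table names are hypothetical, and the sketch does not query a database as the documented function does.

```python
def dependent_tables(seed_tables, fk_relash):
    """Return the seeds plus every table reachable through foreign keys."""
    found = set()
    to_visit = list(seed_tables)
    while to_visit:
        table = to_visit.pop()
        if table in found:
            # Already handled; this also guards against circular references.
            continue
        found.add(table)
        to_visit.extend(fk_relash.get(table, ()))
    return found


# Hypothetical relationships: each key depends on the tables in its value.
fk_example = {
    "generation": {"generators"},
    "generators": {"plants"},
    "plants": set(),
    "utilities": set(),
}
closure = dependent_tables(["generation"], fk_example)
```

Here `closure` contains the seed table and both tables it depends on, directly or indirectly, while the unrelated `utilities` table is left out.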
-
pudl.load.metadata.get_dependent_tables_pkg(table_name, fk_relash)
For a given table, get the list of all the other tables it depends on.
- Parameters
table_name (str) – The table whose dependencies we are looking for.
fk_relash –
Todo
Incomplete docstring.
- Returns
the set of all the tables the specified table depends upon.
- Return type
-
pudl.load.metadata.get_foreign_key_relash_from_pkg(pkg_json)
Generate a dictionary of foreign key relationships from packaging metadata.
This function helps us pull all of the foreign key relationships of all of the tables in the metadata.
- Parameters
pkg_json (path-like) – Path to the datapackage.json containing the schema from which the foreign key relationships will be read.
- Returns
list of foreign key tables
- Return type
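In the Frictionless Table Schema, foreign keys are declared per resource under `schema.foreignKeys`, and each entry names the referenced resource under `reference.resource`. The sketch below collects those references from an already-parsed descriptor. The function name `foreign_key_map`, the shape of the returned dictionary and the sample descriptor are assumptions for illustration, not the module's actual implementation.

```python
import json


def foreign_key_map(descriptor):
    """Map each resource name to the set of resources it references."""
    relash = {}
    for resource in descriptor.get("resources", []):
        refs = set()
        # Resources without foreign keys simply get an empty set.
        for fk in resource.get("schema", {}).get("foreignKeys", []):
            refs.add(fk["reference"]["resource"])
        relash[resource["name"]] = refs
    return relash


# A tiny descriptor, invented for illustration.
descriptor = json.loads("""
{
  "resources": [
    {"name": "plants", "schema": {"fields": [{"name": "plant_id"}]}},
    {"name": "generators",
     "schema": {
       "fields": [{"name": "plant_id"}, {"name": "generator_id"}],
       "foreignKeys": [
         {"fields": "plant_id",
          "reference": {"resource": "plants", "fields": "plant_id"}}
       ]}}
  ]
}
""")
fk_map = foreign_key_map(descriptor)
```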
-
pudl.load.metadata.get_repartitioned_tables(tables, partitions, pkg_settings)
Get the re-partitioned tables.
- Parameters
- Returns
list of tables including full groups of
- Return type
-
pudl.load.metadata.get_source_metadata(data_sources, pkg_settings)
Grab sources for metadata.
-
pudl.load.metadata.get_tabular_data_resource(table_name, pkg_dir, partitions=False)
Create a Tabular Data Resource descriptor for a PUDL table.
Based on the information in the database and some additional metadata, this function generates a valid Tabular Data Resource descriptor according to the Frictionless Data specification, which can be found here: https://frictionlessdata.io/specs/tabular-data-resource/
- Parameters
table_name (string) – table name for which you want to generate a Tabular Data Resource descriptor
pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.
- Returns
A JSON object containing key information about the selected table
- Return type
Tabular Data Resource descriptor
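For orientation, a Tabular Data Resource descriptor is a JSON object with a `tabular-data-resource` profile, a name, a path to the data and a table schema. The sketch below builds such an object by hand. The helper name `tabular_resource`, the `data/` path layout and the field list are assumptions for this example; the documented function derives the real values from PUDL's own metadata.

```python
def tabular_resource(table_name, fields):
    """Build a minimal Tabular Data Resource descriptor (hypothetical helper)."""
    return {
        "profile": "tabular-data-resource",
        "name": table_name,
        # Assumed layout: one CSV per table under a data/ subdirectory.
        "path": f"data/{table_name}.csv",
        "format": "csv",
        "mediatype": "text/csv",
        "encoding": "utf-8",
        "schema": {"fields": fields},
    }


# Field definitions invented for illustration.
resource = tabular_resource(
    "plants",
    fields=[
        {"name": "plant_id", "type": "integer"},
        {"name": "plant_name", "type": "string"},
    ],
)
```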
-
pudl.load.metadata.get_unpartioned_tables(tables, pkg_settings)
Get the tables without the partitions.
Because the partitioning key is always the name of the table without the element the table is partitioned by, we can recover the names of the un-partitioned tables, which gives a list of tables that is easier to work with.
- Parameters
tables (iterable) – list of tables that are included in this datapackage.
pkg_settings (dictionary) –
- Returns
tables_unpartioned is a set of un-partitioned tables
- Return type
iterable
-
pudl.load.metadata.hash_csv(csv_path)
Calculates a SHA-256 hash of the CSV file for data integrity checking.
- Parameters
csv_path (path-like) – Path to the CSV file to hash.
- Returns
the hexdigest of the hash, with a ‘sha256:’ prefix.
- Return type
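A hash of this form can be computed with Python's standard `hashlib` module. The following is a minimal sketch rather than the module's own code: the function name `sha256_of_file` and the chunk size are assumptions, and the file is read in chunks so that large CSVs do not have to fit in memory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return 'sha256:' followed by the hex digest of the file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


# Hash a small throwaway CSV to show the shape of the result.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"
    csv_path.write_bytes(b"id,value\n1,a\n")
    file_hash = sha256_of_file(csv_path)
```

The result is the `sha256:` prefix followed by 64 hexadecimal characters.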
-
pudl.load.metadata.package_files_from_table(table, pkg_settings)
Determine which files should exist in a package corresponding to a table.
We want to convert the datapackage tables and any information about package partitioning into a list of expected files. For each table that is partitioned, we want to add the partitions to the end of the table name.
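The naming rule described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper name `expected_files`, the underscore separator, and the idea of passing partitions as a mapping from table name to partition values are all hypothetical, and the real function reads this information from pkg_settings.

```python
def expected_files(table, partitions):
    """List the file stems expected for a table (hypothetical helper).

    `partitions` maps a table name to the values it is split by; tables
    missing from the mapping are stored as a single file.
    """
    if table not in partitions:
        return [table]
    # Assumed naming: the partition value is appended after an underscore.
    return [f"{table}_{part}" for part in partitions[table]]


# Invented example: one table split by year, one stored whole.
parts = {"hourly_emissions": [2017, 2018]}
split = expected_files("hourly_emissions", parts)
whole = expected_files("plants", parts)
```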
-
pudl.load.metadata.prep_pkg_bundle_directory(pudl_settings, pkg_bundle_name, clobber=False)
Create (or delete and create) data package directory.
- Parameters
pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.
debug (bool) – If True, return a dictionary with package names (keys) and a list with the data package metadata and report (values).
pkg_bundle_name (string) – name of the directory in which you want the bundle of data packages to live. If this is set to None, the name defaults to the pudl package version.
- Returns
path-like
-
pudl.load.metadata.pull_resource_from_megadata(table_name)
Read a single data resource from the PUDL metadata library.
- Parameters
table_name (str) – the name of the table / data resource whose JSON descriptor we are reading.
- Returns
a Tabular Data Resource Descriptor, as a JSON object.
- Return type
json
- Raises
ValueError – If table_name is not found exactly one time in the PUDL metadata library.
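The "exactly one match" rule can be illustrated with a small lookup over a parsed descriptor. In this sketch the function name `find_resource` and the in-memory `megadata` dictionary are assumptions standing in for the real metadata file, which the documented function reads itself.

```python
def find_resource(descriptor, table_name):
    """Return the single resource named table_name, else raise ValueError."""
    matches = [r for r in descriptor["resources"] if r["name"] == table_name]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one resource named {table_name!r}, "
            f"found {len(matches)}."
        )
    return matches[0]


# Invented stand-in for the metadata library.
megadata = {"resources": [{"name": "plants"}, {"name": "generators"}]}
plants = find_resource(megadata, "plants")

# A name that is absent (or duplicated) triggers the ValueError.
try:
    find_resource(megadata, "no_such_table")
    missing_raised = False
except ValueError:
    missing_raised = True
```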
-
pudl.load.metadata.test_file_consistency(tables, pkg_settings, pkg_dir)
Test the consistency of tables for packaging.
The purpose of this function is to test that we have the correct list of tables. There are three different ways we could determine which tables are being dumped into packages: the list of tables generated through the ETL functions, the list of dependent tables, and the list of CSVs in the package directory.
Currently, this function is supposed to be fed the ETL function tables which are tested against the CSVs present in the package directory.
- Parameters
pkg_name (string) – the name of the data package.
tables (list) – a list of table names to be tested.
pkg_dir (path-like) – the directory in which to check the consistency of table files
- Raises
AssertionError – If the tables in the CSVs and the ETL tables are not exactly the same list of tables.
Todo
Determine what to do with the dependent tables check.
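The comparison between expected tables and CSVs on disk can be sketched as a set comparison of file stems. This is a simplified illustration: the function name `check_file_consistency` is hypothetical, it takes the directory holding the CSVs directly, and it ignores partitioned file names, which the documented function has to account for via pkg_settings.

```python
import tempfile
from pathlib import Path


def check_file_consistency(expected_names, csv_dir):
    """Raise AssertionError unless the CSV stems match expected_names exactly."""
    found = {p.stem for p in Path(csv_dir).glob("*.csv")}
    expected = set(expected_names)
    if found != expected:
        raise AssertionError(
            f"Missing CSVs: {sorted(expected - found)}; "
            f"unexpected CSVs: {sorted(found - expected)}"
        )


with tempfile.TemporaryDirectory() as tmp:
    # Create two throwaway CSVs to compare against.
    for name in ("plants", "generators"):
        (Path(tmp) / f"{name}.csv").write_text("id\n1\n")
    check_file_consistency(["plants", "generators"], tmp)  # passes silently
    try:
        check_file_consistency(["plants"], tmp)
        mismatch_raised = False
    except AssertionError:
        mismatch_raised = True
```

Reporting both the missing and the unexpected names in the error message makes it easier to see which side of the comparison is wrong.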