pudl.convert.flatten_datapkgs module

This module takes a bundle of datapackages and flattens them.

Because we have enabled the generation of multiple data packages as part of a data package “bundle”, we need to squish those data packages together in order to put all of the PUDL data into one data package. This is especially useful for converting the data package to a SQLite database or any other format.

The module does two main things:
  • squish the CSVs together

  • squish the metadata (datapackage.json) files together

The CSV squishing is pretty simple and is all done in flatten_data_packages_csvs. We assume and enforce that if two data packages include the same dataset, that dataset has the same ETL parameters (years, tables, states, etc.). The metadata is slightly more complicated to compile because each element of the metadata is structured differently. Most of that work is done in flatten_data_package_metadata.
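The matching-parameters rule described above can be illustrated with a minimal sketch. It is not the module's implementation: the function name find_parameter_mismatches is hypothetical, and the descriptor layout (a "sources" list whose entries carry "title" and "parameters" keys) is an assumption made for illustration.

```python
def find_parameter_mismatches(descriptors):
    """Return the titles of sources whose parameters differ between packages.

    Assumes (hypothetically) that each descriptor is a dict with a "sources"
    list, and that each source has a "title" and a "parameters" dict.
    """
    seen = {}           # source title -> parameters from the first package seen
    mismatched = set()  # titles whose parameters disagree somewhere
    for descriptor in descriptors:
        for source in descriptor.get("sources", []):
            title = source["title"]
            params = source.get("parameters", {})
            if title in seen and seen[title] != params:
                mismatched.add(title)
            seen.setdefault(title, params)
    return sorted(mismatched)


# Three toy descriptors: eia923 is loaded with different years in two of them.
pkg_a = {"sources": [{"title": "eia923", "parameters": {"years": [2017, 2018]}}]}
pkg_b = {"sources": [{"title": "eia923", "parameters": {"years": [2018]}},
                     {"title": "ferc1", "parameters": {"years": [2018]}}]}
pkg_c = {"sources": [{"title": "ferc1", "parameters": {"years": [2018]}}]}

mismatches = find_parameter_mismatches([pkg_a, pkg_b, pkg_c])
```

A bundle would only be safe to flatten under this rule when the returned list is empty.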

pudl.convert.flatten_datapkgs.check_for_matching_parameters(pkg_bundle_dir, pkg_name)

Check whether the ETL parameters for each dataset are the same across data packages.

Parameters
  • pkg_bundle_dir (path-like) – the subdirectory where the bundle of data packages lives

  • pkg_name (str) – the name you choose for the flattened data package.

pudl.convert.flatten_datapkgs.compile_data_packages_metadata(pkg_bundle_dir, pkg_name='pudl-all')

Grab the metadata from each of your data packages.

Parameters
  • pkg_bundle_dir (path-like) – the subdirectory where the bundle of data packages lives

  • pkg_name (str) – the name you choose for the flattened data package.

Returns

pkg_descriptor_elements

Return type

dict
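As a rough illustration of what compiling metadata elements could look like, the sketch below reads every datapackage.json in a bundle directory and groups the values of each top-level key. The function name collect_descriptor_elements, the one-subdirectory-per-package layout, and the shape of the returned dictionary are all assumptions; the real pkg_descriptor_elements may be structured differently.

```python
import json
import tempfile
from pathlib import Path


def collect_descriptor_elements(bundle_dir):
    """Group the top-level metadata elements of every package in a bundle.

    Assumes (hypothetically) that each package lives in its own subdirectory
    of bundle_dir and carries a datapackage.json file at its root.
    """
    elements = {}
    for json_path in sorted(Path(bundle_dir).glob("*/datapackage.json")):
        descriptor = json.loads(json_path.read_text())
        for key, value in descriptor.items():
            # One list per element, holding that element from each package.
            elements.setdefault(key, []).append(value)
    return elements


# Build a tiny throwaway bundle so the sketch can be run end to end.
with tempfile.TemporaryDirectory() as tmp:
    for name, title in [("pkg-a", "A"), ("pkg-b", "B")]:
        pkg_dir = Path(tmp) / name
        pkg_dir.mkdir()
        (pkg_dir / "datapackage.json").write_text(
            json.dumps({"name": name, "title": title})
        )
    demo_elements = collect_descriptor_elements(tmp)
```

Grouping by element first makes it possible to merge each element with its own rule afterwards, which fits the note above that each element of the metadata is structured differently.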

pudl.convert.flatten_datapkgs.flatten_data_package_metadata(pkg_bundle_dir, pkg_name='pudl-all')

Convert a bundle of PUDL data package metadata into one file.

Parameters
  • pkg_bundle_dir (path-like) – the subdirectory where the bundle of data packages lives

  • pkg_name (str) – the name you choose for the flattened data package.

Returns

pkg_descriptor

Return type

dict

pudl.convert.flatten_datapkgs.flatten_data_packages_csvs(pkg_bundle_dir, pkg_name='pudl-all')

Copy the CSVs into a new data package directory.

Parameters
  • pkg_bundle_dir (path-like) – the subdirectory where the bundle of data packages lives

  • pkg_name (str) – the name you choose for the flattened data package.
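The copying step might look roughly like the following sketch. It is not the module's code: the function name copy_bundle_csvs is hypothetical, and it assumes each package keeps its CSVs in a data/ subdirectory and that the flattened package is written next to the others inside the bundle directory. When two packages contain a CSV with the same name, the first copy is kept, which is only safe under the matching-parameters assumption stated in the module description.

```python
import shutil
import tempfile
from pathlib import Path


def copy_bundle_csvs(bundle_dir, pkg_name="pudl-all"):
    """Copy every package's CSVs into <bundle_dir>/<pkg_name>/data.

    Assumes (hypothetically) a <package>/data/*.csv layout inside the bundle.
    Returns the names of the files that were copied.
    """
    bundle_dir = Path(bundle_dir)
    out_data_dir = bundle_dir / pkg_name / "data"
    out_data_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_path in sorted(bundle_dir.glob("*/data/*.csv")):
        if csv_path.parent == out_data_dir:
            continue  # skip files already in the flattened package
        target = out_data_dir / csv_path.name
        if not target.exists():  # same dataset in two packages: keep one copy
            shutil.copy(csv_path, target)
            copied.append(csv_path.name)
    return copied


# Throwaway bundle: plants.csv appears in both packages, fuel.csv in one.
with tempfile.TemporaryDirectory() as tmp:
    for pkg, names in [("pkg-a", ["plants.csv"]),
                       ("pkg-b", ["plants.csv", "fuel.csv"])]:
        data_dir = Path(tmp) / pkg / "data"
        data_dir.mkdir(parents=True)
        for name in names:
            (data_dir / name).write_text("id\n1\n")
    copied_names = copy_bundle_csvs(tmp)
    flattened = sorted(p.name for p in (Path(tmp) / "pudl-all" / "data").iterdir())
```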

pudl.convert.flatten_datapkgs.flatten_pudl_datapackages(pudl_settings, pkg_bundle_name, pkg_name='pudl-all')

Combine a collection of PUDL data packages into one.

Parameters
  • pkg_bundle_name (str) – the name of the subdirectory where the bundle of data packages lives. Normally, this name will have been generated in generate_data_packages.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • pkg_name (str) – the name you choose for the flattened data package.

Returns

a dictionary containing the data package validation report.

Return type

dict

pudl.convert.flatten_datapkgs.get_all_sources(pkg_descriptor_elements)

Grab a list of all of the datasets in a data package bundle.

pudl.convert.flatten_datapkgs.get_same_source_meta(pkg_descriptor_elements, title)

Grab the source metadata of the same dataset from all data packages.