pudl.convert.merge_datapkgs module

Functions for merging compatible PUDL datapackages together.

pudl.convert.merge_datapkgs.check_etl_params(dps)[source]

Verify that datapackages to be merged have compatible ETL params.

Given that all of the input data packages come from the same ETL run, which means they will have used the same input data, the only way they should potentially differ is in the ETL parameters which were used to generate them. This function pulls the data source specific ETL params which we store in each datapackage descriptor and checks that within a given data source (e.g. eia923, ferc1) all of the ETL parameters are identical (e.g. the years, states, and tables loaded).

Parameters

dps (iterable) – A list of datapackage.Package objects, representing the datapackages to be merged.

Returns

None

Raises

ValueError – If the PUDL ETL parameters associated with any given data source are not identical across all instances of that data source within the datapackages to be merged, or if the ETL UUIDs of the datapackages to be merged are not all identical.
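
The per-source comparison described above can be illustrated with plain dictionaries. This is a minimal sketch and not the PUDL implementation: the function name check_params_match, the "etl-params" key, and the layout of the parameters (a mapping from data source name to a parameter dictionary) are all assumptions made for the example.

```python
def check_params_match(descriptors, params_key="etl-params"):
    """Raise ValueError if any data source has differing ETL params.

    ``params_key`` and the descriptor layout are hypothetical; they stand in
    for wherever the real descriptor stores its per-source ETL parameters.
    """
    first_seen = {}
    for descriptor in descriptors:
        for source, params in descriptor.get(params_key, {}).items():
            # Remember the first params seen for this source, then compare
            # every later occurrence against them.
            expected = first_seen.setdefault(source, params)
            if params != expected:
                raise ValueError(
                    f"ETL parameters for {source!r} differ: {expected} vs {params}"
                )


pkg_a = {"etl-params": {"eia923": {"years": [2018], "tables": ["fuel"]}}}
pkg_b = {"etl-params": {"eia923": {"years": [2018], "tables": ["fuel"]},
                        "ferc1": {"years": [2018]}}}
pkg_c = {"etl-params": {"eia923": {"years": [2017], "tables": ["fuel"]}}}

check_params_match([pkg_a, pkg_b])  # same eia923 params: no error
```

In this sketch, pkg_a and pkg_c disagree on the eia923 years, so passing them together raises ValueError. A data source that appears in only one package (ferc1 here) has nothing to conflict with.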

pudl.convert.merge_datapkgs.check_identical_vals(dps, required_vals, optional_vals=())[source]

Verify that datapackages to be merged have required identical values.

This only works for elements with simple (hashable) datatypes, which can be added to a set.

Parameters
  • dps (iterable) – a list of tabular datapackage objects, output by PUDL.

  • required_vals (iterable) – A list of strings indicating which top level metadata elements should be compared between the datapackages. All must be present in every datapackage.

  • optional_vals (iterable) – A list of strings indicating top level metadata elements to be compared between the datapackages. They do not need to appear in all datapackages, but if they do appear, they must be identical.

Returns

None

Raises
  • ValueError – if any of the required or optional metadata elements have different values in the different data packages.

  • KeyError – if a required metadata element is not found in any of the datapackages.
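
The set-based comparison can be sketched as follows, which also shows why only hashable values are supported. This is an illustration operating on plain dictionaries rather than datapackage objects; the function name check_identical and the "homepage" element are chosen for the example, and the sketch assumes a KeyError is raised as soon as any one descriptor lacks a required element.

```python
def check_identical(descriptors, required_vals, optional_vals=()):
    """Raise ValueError if any compared element differs between descriptors."""
    for key in required_vals:
        # Indexing raises KeyError when a required element is missing.
        # Building a set raises TypeError for unhashable values (lists, dicts).
        values = {descriptor[key] for descriptor in descriptors}
        if len(values) > 1:
            raise ValueError(f"{key!r} differs between datapackages: {values}")
    for key in optional_vals:
        # Only descriptors that actually define the element are compared.
        values = {d[key] for d in descriptors if key in d}
        if len(values) > 1:
            raise ValueError(f"{key!r} differs between datapackages: {values}")


pkg_a = {"datapkg-bundle-uuid": "1234", "homepage": "https://example.com"}
pkg_b = {"datapkg-bundle-uuid": "1234"}

# "homepage" is optional, so its absence from pkg_b is not an error.
check_identical([pkg_a, pkg_b], ["datapkg-bundle-uuid"], ["homepage"])
```

Treating "homepage" as required instead would raise KeyError in this sketch, because pkg_b does not define it.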

pudl.convert.merge_datapkgs.merge_data(dps, out_path)[source]

Copy the CSV files into the merged datapackage’s data directory.

Iterates through all of the resources in the input datapackages and copies the files they refer to into the data directory associated with the merged datapackage (a directory named “data” inside the out_path directory).

Function assumes that a fresh (empty) data directory has been created. If a file with the same name already exists, it is not overwritten, in order to prevent unnecessary copying of resources which appear in multiple input packages.

Parameters
  • dps (iterable) – A list of datapackage.Package objects, representing the datapackages to be merged.

  • out_path (path-like) – Base directory for the newly created datapackage. The final path element will also be used as the name of the merged data package.

Returns

None
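
The copy-without-overwrite behavior can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PUDL code: copy_resources is a hypothetical helper that takes file paths directly instead of datapackage.Package objects, and unlike the documented function it creates the data directory itself so that the example is self-contained.

```python
import shutil
import tempfile
from pathlib import Path


def copy_resources(resource_paths, out_path):
    """Copy files into out_path/data, skipping names that already exist."""
    data_dir = Path(out_path) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, resource_paths):
        dest = data_dir / src.name
        if dest.exists():
            # Same resource already copied from another input package.
            continue
        shutil.copy(src, dest)
        copied.append(dest)
    return copied


# Two input packages that both contain a resource file named plants.csv.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sources = []
    for pkg in ("pkg_a", "pkg_b"):
        pkg_data = tmp / pkg / "data"
        pkg_data.mkdir(parents=True)
        csv_path = pkg_data / "plants.csv"
        csv_path.write_text(f"written by {pkg}\n")
        sources.append(csv_path)
    copied = copy_resources(sources, tmp / "merged")
    merged_text = (tmp / "merged" / "data" / "plants.csv").read_text()
```

Only the first plants.csv is copied. Skipping the second is safe only because the inputs are assumed to come from the same ETL run, so files with the same name are expected to have the same contents.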

pudl.convert.merge_datapkgs.merge_datapkgs(dps, out_path, clobber=False)[source]

Merge several compatible datapackages into one larger datapackage.

Parameters
  • dps (iterable) – A collection of tabular data package objects that were output by PUDL, to be merged into a single deduplicated datapackage for loading into a database or other storage medium.

  • out_path (path-like) – Base directory for the newly created datapackage. The final path element will also be used as the name of the merged data package.

  • clobber (bool) – Whether to overwrite the output datapackage location if it already exists. If True, the existing location is overwritten; if False, it is not.

Returns

A report containing information about the validity of the merged datapackage.

Return type

dict

Raises

pudl.convert.merge_datapkgs.merge_meta(dps, datapkg_name)[source]

Merge the JSON descriptors of datapackages into one big descriptor.

This function builds up a new tabular datapackage JSON descriptor as a Python dictionary, containing the merged metadata from all of the input datapackages.

The process is complex for two reasons. First, there are several different datatypes in the descriptor that need to be merged, and the processes for each of them are different. Second, what constitutes a “merge” may vary depending on the semantic content of the metadata. For example, the created timestamp is a simple string, but we need to choose one of the several values (the earliest one) for inclusion in the merged datapackage, while many other simple string fields are required to be identical across all of the input data packages (e.g. datapkg-bundle-uuid).

Parameters
  • dps (iterable) – A collection of datapackage objects, whose metadata will be merged to create a single datapackage descriptor representing the union of all the data in the input datapackages.

  • datapkg_name (str) – The name associated with the newly merged datapackage. This should be the same as the name of the directory in which the datapackage is found.

Returns

A Python dictionary representing a tabular datapackage JSON descriptor, containing the merged metadata of the input datapackages.

Return type

dict
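
The different merge rules described above (keep the earliest created timestamp, require an identical datapkg-bundle-uuid, take the union of the data) can be sketched on plain dictionaries. This is a simplified illustration, not the PUDL implementation: merge_descriptors is a hypothetical name, only four descriptor elements are handled, resources are assumed to be deduplicated by their name, and the timestamps are assumed to be ISO 8601 strings in the same format and time zone so that string comparison orders them chronologically.

```python
def merge_descriptors(descriptors, name):
    """Merge a few descriptor elements, each with its own merge rule."""
    # Rule 1: some simple string fields must be identical everywhere.
    uuids = {d["datapkg-bundle-uuid"] for d in descriptors}
    if len(uuids) != 1:
        raise ValueError(f"Expected exactly one bundle UUID, found: {uuids}")

    # Rule 2: list-valued fields are combined; here the first resource seen
    # under each name is kept (assumption: the name identifies a resource).
    resources = {}
    for d in descriptors:
        for res in d.get("resources", []):
            resources.setdefault(res["name"], res)

    return {
        "name": name,
        "datapkg-bundle-uuid": uuids.pop(),
        # Rule 3: of the several created timestamps, keep the earliest.
        "created": min(d["created"] for d in descriptors),
        "resources": list(resources.values()),
    }


pkg_a = {
    "datapkg-bundle-uuid": "1234",
    "created": "2020-01-02T00:00:00Z",
    "resources": [{"name": "plants"}, {"name": "fuel"}],
}
pkg_b = {
    "datapkg-bundle-uuid": "1234",
    "created": "2020-01-01T00:00:00Z",
    "resources": [{"name": "plants"}, {"name": "generators"}],
}
merged = merge_descriptors([pkg_a, pkg_b], "merged-pkg")
```

Here merged["created"] is taken from pkg_b, whose timestamp is earlier, and the shared plants resource appears once in the merged resource list.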