pudl.workspace.datastore

Datastore manages file retrieval for PUDL datasets.

Module Contents

Classes

DatapackageDescriptor

A simple wrapper providing access to datapackage.json contents.

ZenodoFetcher

API for fetching datapackage descriptors and resource contents from Zenodo.

Datastore

Handle connections and downloading of Zenodo Source archives.

ParseKeyValues

Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).

Functions

parse_command_line()

Collect the command line arguments.

_get_pudl_in(→ pathlib.Path)

Figure out what pudl_in path should be used.

_create_datastore(→ Datastore)

Constructs datastore instance.

print_partitions(→ None)

Prints known partition keys and their values for each of the datasets.

validate_cache(→ None)

Validate elements in the datastore cache.

fetch_resources(→ None)

Retrieve all matching resources and store them in the cache.

main()

Cache datasets.

Attributes

pudl.workspace.datastore.logger
pudl.workspace.datastore.PUDL_YML
exception pudl.workspace.datastore.ChecksumMismatch

Bases: ValueError

Resource checksum (md5) does not match.

class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)

A simple wrapper providing access to datapackage.json contents.

get_resource_path(name: str) → str

Returns the Zenodo URL that holds the contents of the given named resource.

_get_resource_metadata(name: str) → dict
get_download_size() → int

Returns the total download size of all the resources in MB.

validate_checksum(name: str, content: str) → bool

Returns True if content matches checksum for given named resource.
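A checksum comparison of this kind can be sketched with the standard library. This is a minimal illustration, not the PUDL implementation: the helper name md5_matches is hypothetical, and the sketch assumes the content is available as bytes and the expected checksum is a hexadecimal MD5 digest.

```python
import hashlib


def md5_matches(content: bytes, expected_md5: str) -> bool:
    """Return True if the MD5 hex digest of content equals expected_md5.

    Hypothetical helper for illustration; not part of pudl.workspace.datastore.
    """
    return hashlib.md5(content).hexdigest() == expected_md5


# Compute a reference digest, then check matching and non-matching content.
payload = b"hello"
expected = hashlib.md5(payload).hexdigest()
matches_original = md5_matches(payload, expected)
matches_modified = md5_matches(b"hello!", expected)
```

A caller that wants a failure to be loud could raise ChecksumMismatch when the comparison returns False.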

_matches(res: dict, **filters: Any)
get_resources(name: str = None, **filters: Any) → collections.abc.Iterator[pudl.workspace.resource_cache.PudlResourceKey]

Returns series of PudlResourceKey identifiers for matching resources.

Parameters:
  • name (str) – if specified, find resource(s) with this name.

  • filters (dict) – if specified, find resource(s) matching these key=value constraints. The constraints are matched against the ‘parts’ field of the resource entry in the datapackage.json.
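The filter matching described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name matches and the sample resource entries are hypothetical, and the real implementation may normalize keys or values before comparing.

```python
from typing import Any


def matches(resource: dict, **filters: Any) -> bool:
    """Return True if every key=value filter is present in resource["parts"].

    Hypothetical stand-in; with no filters, every resource matches.
    """
    parts = resource.get("parts", {})
    return all(parts.get(key) == value for key, value in filters.items())


# Made-up resource entries shaped like datapackage.json resources.
resources = [
    {"name": "example-2019.zip", "parts": {"year": 2019, "state": "co"}},
    {"name": "example-2020.zip", "parts": {"year": 2020, "state": "co"}},
]
selected = [r["name"] for r in resources if matches(r, year=2020)]
```

Because all() over an empty set of filters is True, calling the sketch without filters selects every resource.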

get_partitions(name: str = None) → dict[str, set[str]]

Return mapping of known partition keys to their allowed known values.
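One way to build such a mapping is to walk the ‘parts’ field of each resource entry and gather the values seen for each key. The sketch below assumes that layout; the function name collect_partitions is hypothetical, and converting values to strings is an assumption made to fit the dict[str, set[str]] return type.

```python
from collections import defaultdict


def collect_partitions(resources: list[dict]) -> dict[str, set[str]]:
    """Map each partition key to the set of values seen across resources.

    Hypothetical helper; values are stringified to fit dict[str, set[str]].
    """
    partitions: dict[str, set[str]] = defaultdict(set)
    for resource in resources:
        for key, value in resource.get("parts", {}).items():
            partitions[key].add(str(value))
    return dict(partitions)


# Made-up resource entries for illustration.
example = collect_partitions(
    [
        {"parts": {"year": 2019}},
        {"parts": {"year": 2020, "state": "co"}},
    ]
)
```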

_validate_datapackage(datapackage_json: dict)

Checks the correctness of datapackage.json metadata.

Raises ValueError if invalid.

get_json_string() → str

Exports the underlying json as normalized (sorted, indented) json string.
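Normalizing JSON in this sense (sorted keys, fixed indentation) makes two descriptors with the same content serialize identically, which is useful for caching and comparison. A minimal sketch with the standard library follows; the function name normalized_json and the indent width of 4 are assumptions, not confirmed details of the PUDL code.

```python
import json


def normalized_json(data: dict) -> str:
    """Serialize data with sorted keys and a fixed indent.

    Hypothetical helper; the indent width is an assumption.
    """
    return json.dumps(data, sort_keys=True, indent=4)


# Key order in the input does not affect the output.
first = normalized_json({"b": 1, "a": 2})
second = normalized_json({"a": 2, "b": 1})
```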

class pudl.workspace.datastore.ZenodoFetcher(sandbox: bool = False, timeout: float = 15.0)

API for fetching datapackage descriptors and resource contents from Zenodo.

TOKEN
DOI
API_ROOT
_fetch_from_url(url: str) → requests.Response
_doi_to_url(doi: str) → str

Returns the URL that holds the datapackage for the given DOI.

get_descriptor(dataset: str) → DatapackageDescriptor

Returns DatapackageDescriptor for given dataset.

get_resource_key(dataset: str, name: str) → pudl.workspace.resource_cache.PudlResourceKey

Returns PudlResourceKey for given resource.

get_doi(dataset: str) → str

Returns DOI for given dataset.

get_resource(res: pudl.workspace.resource_cache.PudlResourceKey) → bytes

Given a resource key, retrieve the contents of the file from Zenodo.

get_known_datasets() → list[str]

Returns list of supported datasets.

class pudl.workspace.datastore.Datastore(local_cache_path: Path | None = None, gcs_cache_path: str | None = None, sandbox: bool = False, timeout: float = 15)

Handle connections and downloading of Zenodo Source archives.

get_known_datasets() → list[str]

Returns list of supported datasets.

get_datapackage_descriptor(dataset: str) → DatapackageDescriptor

Fetch datapackage descriptor for dataset either from cache or Zenodo.

get_resources(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → collections.abc.Iterator[tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]

Return content of the matching resources.

Parameters:
  • dataset – name of the dataset to query.

  • cached_only – if True, only retrieve resources that are present in the cache.

  • skip_optimally_cached – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.

  • filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].

Yields:

(PudlResourceKey, bytes) holding the content of each matching resource

remove_from_cache(res: pudl.workspace.resource_cache.PudlResourceKey) → None

Remove given resource from the associated cache.

get_unique_resource(dataset: str, **filters: Any) → bytes

Returns content of a resource assuming there is exactly one that matches.
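The "exactly one match" contract can be sketched as a small helper over any iterable of results. The function name exactly_one is hypothetical, and raising KeyError for zero or multiple matches is an assumption about the failure mode, not a documented guarantee.

```python
from collections.abc import Iterable

_MISSING = object()  # Sentinel distinguishing "no item" from falsy items.


def exactly_one(items: Iterable[bytes]) -> bytes:
    """Return the single item in items, or raise if there are zero or many.

    Hypothetical helper; the KeyError exception type is an assumption.
    """
    iterator = iter(items)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        raise KeyError("no resource matched the filters")
    if next(iterator, _MISSING) is not _MISSING:
        raise KeyError("more than one resource matched the filters")
    return first


content = exactly_one(iter([b"payload"]))
```

Consuming at most two items keeps the check cheap even when the underlying iterator would download each resource lazily.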

get_zipfile_resource(dataset: str, **filters: Any) → zipfile.ZipFile

Retrieves unique resource and opens it as a ZipFile.
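Opening in-memory bytes as a ZipFile needs only the standard library. The sketch below shows the general technique using a small archive built on the spot; the helper name open_zip_from_bytes and the sample file are hypothetical, and the bytes would normally come from a call such as get_unique_resource.

```python
import io
import zipfile


def open_zip_from_bytes(content: bytes) -> zipfile.ZipFile:
    """Wrap raw archive bytes in a seekable buffer and open them as a ZipFile."""
    return zipfile.ZipFile(io.BytesIO(content))


# Build a tiny archive in memory to stand in for a downloaded resource.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("data.csv", "id,value\n1,42\n")

# Read it back without touching the filesystem.
with open_zip_from_bytes(buffer.getvalue()) as archive:
    names = archive.namelist()
    text = archive.read("data.csv").decode("utf-8")
```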

class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)

Bases: argparse.Action

Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).

__call__(parser, namespace, values, option_string=None)

Parses the argument value into dict.
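A custom argparse.Action with this behavior can be sketched as below. The class name KeyValueAction and the --partition option are hypothetical stand-ins chosen for illustration, and the sketch leaves all values as strings, which may differ from what the real class does.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Parse "k1=v1,k2=v2" into {"k1": "v1", "k2": "v2"}.

    Hypothetical stand-in for illustration; values stay as strings.
    """

    def __call__(self, parser, namespace, values, option_string=None):
        parsed = {}
        for pair in values.split(","):
            # partition() splits on the first "=" only, so values may contain "=".
            key, _, value = pair.partition("=")
            parsed[key] = value
        setattr(namespace, self.dest, parsed)


parser = argparse.ArgumentParser()
# "--partition" is an assumed option name, not a documented CLI flag.
parser.add_argument("--partition", action=KeyValueAction, default={})
args = parser.parse_args(["--partition", "year=2020,state=co"])
```

A pair with no "=" ends up as a key mapped to an empty string in this sketch; stricter input handling would call parser.error instead.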

pudl.workspace.datastore.parse_command_line()

Collect the command line arguments.

pudl.workspace.datastore._get_pudl_in(args: dict) → pathlib.Path

Figure out what pudl_in path should be used.

pudl.workspace.datastore._create_datastore(args: argparse.Namespace) → Datastore

Constructs datastore instance.

pudl.workspace.datastore.print_partitions(dstore: Datastore, datasets: list[str]) → None

Prints known partition keys and their values for each of the datasets.

pudl.workspace.datastore.validate_cache(dstore: Datastore, datasets: list[str], args: argparse.Namespace) → None

Validate elements in the datastore cache.

Delete invalid entries from the cache.

pudl.workspace.datastore.fetch_resources(dstore: Datastore, datasets: list[str], args: argparse.Namespace) → None

Retrieve all matching resources and store them in the cache.

pudl.workspace.datastore.main()

Cache datasets.