pudl.workspace.datastore

Datastore manages file retrieval for PUDL datasets.

Module Contents

Classes

DatapackageDescriptor

A simple wrapper providing access to datapackage.json contents.

ZenodoFetcher

API for fetching datapackage descriptors and resource contents from zenodo.

Datastore

Handle connections and downloading of Zenodo Source archives.

ParseKeyValues

Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).

Functions

parse_command_line()

Collect the command line arguments.

_get_pudl_in(args: dict) → pathlib.Path

Figure out what pudl_in path should be used.

_create_datastore(args: dict) → Datastore

Constructs datastore instance.

print_partitions(dstore: Datastore, datasets: List[str]) → None

Prints known partition keys and their values for each of the datasets.

validate_cache(dstore: Datastore, datasets: List[str], args: argparse.Namespace) → None

Validate elements in the datastore cache. Delete invalid entries from the cache.

fetch_resources(dstore: Datastore, datasets: List[str], args: argparse.Namespace) → None

Retrieve all matching resources and store them in the cache.

main()

Cache datasets.

Attributes

logger

PUDL_YML

pudl.workspace.datastore.logger
pudl.workspace.datastore.PUDL_YML
exception pudl.workspace.datastore.ChecksumMismatch

Bases: ValueError

Resource checksum (md5) does not match.

class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)

A simple wrapper providing access to datapackage.json contents.

get_resource_path(self, name: str) → str

Returns zenodo url that holds contents of given named resource.

_get_resource_metadata(self, name: str) → dict
validate_checksum(self, name: str, content: str) → bool

Returns True if content matches checksum for given named resource.
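
As a rough illustration of the check described above, the sketch below compares the MD5 hex digest of some content against an expected value. The helper name md5_matches and the use of bytes for the content are assumptions made for this example; the actual method looks up the expected checksum in the datapackage metadata, which is not reproduced here.

```python
import hashlib


def md5_matches(content: bytes, expected_md5: str) -> bool:
    """Return True if the MD5 hex digest of content equals expected_md5."""
    # Hypothetical helper: hex digests are compared case-insensitively.
    return hashlib.md5(content).hexdigest() == expected_md5.lower()


# Stand-in content; a real caller would pass the downloaded resource bytes.
payload = b"example resource content"
expected = hashlib.md5(payload).hexdigest()

print(md5_matches(payload, expected))         # intact content
print(md5_matches(payload + b"x", expected))  # altered content
```

A caller that wants a hard failure rather than a boolean could raise ChecksumMismatch when such a check returns False.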

_matches(self, res: dict, **filters: Any)
get_resources(self, name: str = None, **filters: Any) → Iterator[pudl.workspace.resource_cache.PudlResourceKey]

Returns series of PudlResourceKey identifiers for matching resources.

Parameters
  • name (str) – if specified, find resource(s) with this name.

  • filters (dict) – if specified, find resource(s) matching these key=value constraints. The constraints are matched against the ‘parts’ field of the resource entry in the datapackage.json.
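
The name and filter matching described above can be sketched as follows. This is a minimal illustration, not the library's code: the resource entries, the helper name matching_names, and the choice to compare values as strings are all assumptions.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

# Hypothetical resource entries, shaped like a datapackage.json "resources" list.
RESOURCES: List[Dict[str, Any]] = [
    {"name": "example-2019.zip", "parts": {"year": 2019}},
    {"name": "example-2020.zip", "parts": {"year": 2020}},
    {"name": "example-2020-co.zip", "parts": {"year": 2020, "state": "co"}},
]


def matching_names(
    resources: Iterable[Dict[str, Any]],
    name: Optional[str] = None,
    **filters: Any,
) -> Iterator[str]:
    """Yield names of resources whose name and 'parts' satisfy the constraints."""
    for res in resources:
        if name is not None and res["name"] != name:
            continue
        parts = res.get("parts", {})
        # Assumption: values are compared as strings, so year="2020" matches 2020.
        if all(k in parts and str(parts[k]) == str(v) for k, v in filters.items()):
            yield res["name"]


print(list(matching_names(RESOURCES, year=2020)))
print(list(matching_names(RESOURCES, year=2020, state="co")))
```

Every filter must match for a resource to be yielded, so adding constraints can only narrow the result.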

get_partitions(self, name: str = None) → Dict[str, Set[str]]

Returns mapping of all known partition keys to the set of its known values.
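
A mapping of this shape could be assembled roughly as shown below. The sample entries and the helper name collect_partitions are hypothetical, and converting values to strings is an assumption based only on the Dict[str, Set[str]] return annotation.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set

# Hypothetical resource entries with a "parts" field, as in datapackage.json.
resources = [
    {"name": "example-2019.zip", "parts": {"year": 2019}},
    {"name": "example-2020.zip", "parts": {"year": 2020}},
    {"name": "example-2020-co.zip", "parts": {"year": 2020, "state": "co"}},
]


def collect_partitions(entries: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each partition key to the set of values seen across all entries."""
    partitions: Dict[str, Set[str]] = defaultdict(set)
    for entry in entries:
        for key, value in entry.get("parts", {}).items():
            partitions[key].add(str(value))
    return dict(partitions)


partitions = collect_partitions(resources)
for key in sorted(partitions):
    print(key, sorted(partitions[key]))
```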

_validate_datapackage(self, datapackage_json: dict)

Checks the correctness of datapackage.json metadata. Raises ValueError if invalid.

get_json_string(self) → str

Exports the underlying json as normalized (sorted, indented) json string.
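
One common way to produce a sorted, indented JSON string is json.dumps with sort_keys and indent, sketched below. The helper name, the indent width of 4, and the sample metadata are assumptions; the source states only that the output is sorted and indented.

```python
import json


def to_normalized_json(obj: dict) -> str:
    """Serialize obj with sorted keys and fixed indentation."""
    # Assumption: an indent of 4 spaces; the actual width is not documented here.
    return json.dumps(obj, sort_keys=True, indent=4)


# Two hypothetical dicts with the same content but different key order.
a = {"name": "example", "doi": "10.0000/example"}
b = {"doi": "10.0000/example", "name": "example"}

print(to_normalized_json(a))
print(to_normalized_json(a) == to_normalized_json(b))
```

Normalizing this way makes the string independent of key insertion order, which is useful when comparing or caching descriptors.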

class pudl.workspace.datastore.ZenodoFetcher(sandbox: bool = False, timeout: float = 15.0)

API for fetching datapackage descriptors and resource contents from zenodo.

TOKEN
DOI
API_ROOT
_fetch_from_url(self, url: str) → requests.Response
_doi_to_url(self, doi: str) → str

Returns url that holds the datapackage for given doi.

get_descriptor(self, dataset: str) → DatapackageDescriptor

Returns DatapackageDescriptor for given dataset.

get_resource_key(self, dataset: str, name: str) → pudl.workspace.resource_cache.PudlResourceKey

Returns PudlResourceKey for given resource.

get_doi(self, dataset: str) → str

Returns DOI for given dataset.

get_resource(self, res: pudl.workspace.resource_cache.PudlResourceKey) → bytes

Given resource key, retrieve contents of the file from zenodo.

get_known_datasets(self) → List[str]

Returns list of supported datasets.

class pudl.workspace.datastore.Datastore(local_cache_path: Optional[pathlib.Path] = None, gcs_cache_path: Optional[str] = None, sandbox: bool = False, timeout: float = 15)

Handle connections and downloading of Zenodo Source archives.

get_known_datasets(self) → List[str]

Returns list of supported datasets.

get_datapackage_descriptor(self, dataset: str) → DatapackageDescriptor

Fetch datapackage descriptor for given dataset either from cache or from zenodo.

get_resources(self, dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → Iterator[Tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]

Return content of the matching resources.

Parameters
  • dataset (str) – name of the dataset to query.

  • cached_only (bool) – if True, only retrieve resources that are present in the cache.

  • skip_optimally_cached (bool) – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.

  • filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].

Yields

(PudlResourceKey, bytes) tuples holding the content of each matching resource

remove_from_cache(self, res: pudl.workspace.resource_cache.PudlResourceKey)

Remove given resource from the associated cache.

get_unique_resource(self, dataset: str, **filters: Any) → bytes

Returns content of a resource assuming there is exactly one that matches.
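
The "exactly one match" assumption can be enforced with a small guard like the one sketched here. The helper name exactly_one and the use of ValueError are assumptions for illustration; how the actual method reports zero or multiple matches is not stated in this reference.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")

_MISSING = object()


def exactly_one(items: Iterable[T]) -> T:
    """Return the single element of items, or raise if there are zero or several."""
    iterator = iter(items)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        raise ValueError("no matching item found")
    # Pull at most one more element, so a long iterator is not fully consumed.
    if next(iterator, _MISSING) is not _MISSING:
        raise ValueError("more than one matching item found")
    return first


print(exactly_one([b"resource bytes"]))
```

Stopping after the second element matters when each element is expensive to produce, for example when iterating triggers a download.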

get_zipfile_resource(self, dataset: str, **filters: Any) → zipfile.ZipFile

Retrieves unique resource and opens it as a ZipFile.
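
Opening in-memory bytes as a ZipFile is typically done by wrapping them in io.BytesIO, as in the sketch below. The helper name open_zip_from_bytes and the archive contents are made up for the example; the archive is built in memory to stand in for resource content fetched by the datastore.

```python
import io
import zipfile


def open_zip_from_bytes(content: bytes) -> zipfile.ZipFile:
    """Open raw bytes as a read-only ZipFile without touching the filesystem."""
    return zipfile.ZipFile(io.BytesIO(content))


# Build a small archive in memory to stand in for downloaded resource content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("data.csv", "id,value\n1,10\n")

zf = open_zip_from_bytes(buffer.getvalue())
names = zf.namelist()
first_line = zf.read("data.csv").decode().splitlines()[0]

print(names)
print(first_line)
```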

class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)

Bases: argparse.Action

Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).

__call__(self, parser, namespace, values, option_string=None)

Parses the argument value into dict.
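
An argparse.Action that turns k1=v1,k2=v2 into a dict might look like the following. This is a sketch under assumptions, not the ParseKeyValues source: the class name KeyValueAction, the --partition flag, and the error handling for malformed pairs are all hypothetical.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Parse a 'k1=v1,k2=v2' string into a dict stored on the namespace."""

    def __call__(self, parser, namespace, values, option_string=None):
        pairs = {}
        for item in values.split(","):
            key, sep, value = item.partition("=")
            if not sep or not key:
                # Assumption: malformed pairs are reported as usage errors.
                parser.error(f"expected key=value, got {item!r}")
            pairs[key] = value
        setattr(namespace, self.dest, pairs)


parser = argparse.ArgumentParser()
# "--partition" is a hypothetical flag name chosen for this example.
parser.add_argument("--partition", action=KeyValueAction, default={})

args = parser.parse_args(["--partition", "year=2019,state=co"])
print(args.partition)
```

Values stay as strings, so any numeric interpretation (for example, treating a year as an integer) is left to the code that consumes the dict.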

pudl.workspace.datastore.parse_command_line()

Collect the command line arguments.

pudl.workspace.datastore._get_pudl_in(args: dict) → pathlib.Path

Figure out what pudl_in path should be used.

pudl.workspace.datastore._create_datastore(args: dict) → Datastore

Constructs datastore instance.

pudl.workspace.datastore.print_partitions(dstore: Datastore, datasets: List[str]) → None

Prints known partition keys and their values for each of the datasets.

pudl.workspace.datastore.validate_cache(dstore: Datastore, datasets: List[str], args: argparse.Namespace) → None

Validate elements in the datastore cache. Delete invalid entries from the cache.

pudl.workspace.datastore.fetch_resources(dstore: Datastore, datasets: List[str], args: argparse.Namespace) → None

Retrieve all matching resources and store them in the cache.

pudl.workspace.datastore.main()

Cache datasets.