pudl.workspace.datastore
Datastore manages file retrieval for PUDL datasets.
Module Contents
Classes
DatapackageDescriptor | A simple wrapper providing access to datapackage.json contents.
ZenodoFetcher | API for fetching datapackage descriptors and resource contents from Zenodo.
Datastore | Handle connections and downloading of Zenodo Source archives.
ParseKeyValues | Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).
Functions
Collect the command line arguments.
Figure out what pudl_in path should be used.
Constructs datastore instance.
Prints known partition keys and their values for each of the datasets.
Validate elements in the datastore cache.
Retrieve all matching resources and store them in the cache.
Cache datasets.
Attributes
- exception pudl.workspace.datastore.ChecksumMismatch
Bases: ValueError
Resource checksum (md5) does not match.
- class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)
A simple wrapper providing access to datapackage.json contents.
- get_resource_path(name: str) -> str
Returns the Zenodo URL that holds the contents of the given named resource.
- validate_checksum(name: str, content: str) -> bool
Returns True if content matches the checksum for the given named resource.
- get_resources(name: str = None, **filters: Any) -> collections.abc.Iterator[pudl.workspace.resource_cache.PudlResourceKey]
Returns a series of PudlResourceKey identifiers for matching resources.
- get_partitions(name: str = None) -> dict[str, set[str]]
Returns a mapping of known partition keys to their known values.
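The filtering and partition-listing behavior described above can be illustrated with a small standalone sketch. It does not use pudl itself: the `DATAPACKAGE` dictionary, its field names (`resources`, `name`, `parts`), and the helper functions `matching_names` and `known_partitions` are all assumptions made for illustration, and the real descriptor may store and compare values differently.

```python
from collections.abc import Iterator
from typing import Any

# Hypothetical stand-in for a parsed datapackage.json. The field names
# ("resources", "name", "parts") are assumptions for this sketch.
DATAPACKAGE = {
    "resources": [
        {"name": "data-2019.zip", "parts": {"year": 2019, "state": "co"}},
        {"name": "data-2020.zip", "parts": {"year": 2020, "state": "co"}},
        {"name": "data-2020-wy.zip", "parts": {"year": 2020, "state": "wy"}},
    ]
}


def matching_names(datapackage: dict, **filters: Any) -> Iterator[str]:
    """Yield names of resources whose parts contain every filter pair."""
    for res in datapackage["resources"]:
        parts = res.get("parts", {})
        # Compare as strings so year=2020 and year="2020" both match.
        if all(str(parts.get(k)) == str(v) for k, v in filters.items()):
            yield res["name"]


def known_partitions(datapackage: dict) -> dict[str, set[str]]:
    """Map each partition key to the set of values seen across resources."""
    partitions: dict[str, set[str]] = {}
    for res in datapackage["resources"]:
        for key, value in res.get("parts", {}).items():
            partitions.setdefault(key, set()).add(str(value))
    return partitions
```

Calling `matching_names(DATAPACKAGE, year=2020)` yields the two 2020 archives, and `known_partitions(DATAPACKAGE)` returns one set of values per partition key, mirroring the `dict[str, set[str]]` return type of `get_partitions`.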
- class pudl.workspace.datastore.ZenodoFetcher(sandbox: bool = False, timeout: float = 15.0)
API for fetching datapackage descriptors and resource contents from Zenodo.
- get_descriptor(dataset: str) -> DatapackageDescriptor
Returns the DatapackageDescriptor for the given dataset.
- get_resource_key(dataset: str, name: str) -> pudl.workspace.resource_cache.PudlResourceKey
Returns the PudlResourceKey for the given resource.
- get_resource(res: pudl.workspace.resource_cache.PudlResourceKey) -> bytes
Given a resource key, retrieves the contents of the file from Zenodo.
- class pudl.workspace.datastore.Datastore(local_cache_path: Path | None = None, gcs_cache_path: str | None = None, sandbox: bool = False, timeout: float = 15)
Handle connections and downloading of Zenodo Source archives.
- get_datapackage_descriptor(dataset: str) -> DatapackageDescriptor
Fetch the datapackage descriptor for a dataset, either from the cache or from Zenodo.
- get_resources(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) -> collections.abc.Iterator[tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]
Return the content of the matching resources.
- Parameters:
dataset – name of the dataset to query.
cached_only – if True, only retrieve resources that are present in the cache.
skip_optimally_cached – if True, only retrieve resources that are not optimally cached. This triggers an attempt to cache those resources optimally.
filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].
- Yields:
(PudlResourceKey, bytes) holding the content of each matching resource
- remove_from_cache(res: pudl.workspace.resource_cache.PudlResourceKey) -> None
Remove the given resource from the associated cache.
- get_unique_resource(dataset: str, **filters: Any) -> bytes
Returns the content of a resource, assuming exactly one resource matches.
- get_zipfile_resource(dataset: str, **filters: Any) -> zipfile.ZipFile
Retrieves a unique resource and opens it as a ZipFile.
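The last two methods combine two simple ideas: insist on exactly one match, then wrap the returned bytes in a ZipFile. The following standalone sketch shows that pattern using only the standard library. The helpers `build_zip`, `exactly_one`, and `open_zip` are hypothetical names, not part of pudl, and raising ValueError on zero or multiple matches is an assumption; this page does not say how the library reports that case.

```python
import io
import zipfile
from collections.abc import Iterable


def build_zip(files: dict[str, str]) -> bytes:
    """Create an in-memory zip archive so the sketch needs no network."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def exactly_one(contents: Iterable[bytes]) -> bytes:
    """Return the single item, or fail if there are zero or several.

    Raising ValueError here is an assumption made for this sketch.
    """
    items = list(contents)
    if len(items) != 1:
        raise ValueError(f"expected exactly 1 matching resource, found {len(items)}")
    return items[0]


def open_zip(content: bytes) -> zipfile.ZipFile:
    """Open raw archive bytes as a ZipFile without writing to disk."""
    return zipfile.ZipFile(io.BytesIO(content))


# Stand-in for the bytes a datastore lookup might return.
payload = build_zip({"plants.csv": "id,name\n1,Comanche\n"})
archive = open_zip(exactly_one([payload]))
```

Wrapping the bytes in `io.BytesIO` lets `zipfile.ZipFile` treat them as a seekable file, so members can be read with `archive.read(...)` or `archive.open(...)` directly from memory.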
- class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)
Bases: argparse.Action
Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).
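A custom argparse action of this kind can be sketched as follows. This is an illustration of the general technique, not the library's source: the class name `KeyValueAction`, the `--partition` flag, and the error handling for malformed pairs are assumptions.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Turn a 'k1=v1,k2=v2' argument into a dict (hypothetical sketch)."""

    def __call__(self, parser, namespace, values, option_string=None):
        pairs = {}
        for item in values.split(","):
            key, sep, value = item.partition("=")
            if not sep or not key:
                # parser.error prints usage and exits with status 2.
                parser.error(f"expected key=value, got {item!r}")
            pairs[key] = value
        # Store the parsed dict under the option's destination name.
        setattr(namespace, self.dest, pairs)


parser = argparse.ArgumentParser()
# "--partition" is an assumed flag name used only for this example.
parser.add_argument("--partition", action=KeyValueAction, default={})
args = parser.parse_args(["--partition", "year=2020,state=co"])
```

All values stay strings after parsing, so any conversion (for example, of a year to an integer) would have to happen wherever the resulting dict is consumed.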
- pudl.workspace.datastore._get_pudl_in(args: dict) -> pathlib.Path
Figure out what pudl_in path should be used.
- pudl.workspace.datastore._create_datastore(args: argparse.Namespace) -> Datastore
Constructs a datastore instance.
- pudl.workspace.datastore.print_partitions(dstore: Datastore, datasets: list[str]) -> None
Prints known partition keys and their values for each of the datasets.
- pudl.workspace.datastore.validate_cache(dstore: Datastore, datasets: list[str], args: argparse.Namespace) -> None
Validate elements in the datastore cache.
Delete invalid entries from the cache.
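The validate-then-delete idea can be shown with a minimal standalone sketch based on md5 checksums, the hash named in ChecksumMismatch above. The in-memory `cache` dict, the `expected` checksum table, and the helpers `md5_hex` and `prune_invalid` are hypothetical; the real function works against a Datastore and its cache layers rather than a plain dictionary.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the hexadecimal md5 digest of the given bytes."""
    return hashlib.md5(content).hexdigest()


def prune_invalid(cache: dict[str, bytes], expected: dict[str, str]) -> list[str]:
    """Delete cache entries whose md5 differs from the expected digest.

    Entries with no expected digest are also treated as invalid, which is
    an assumption made for this sketch.
    """
    removed = []
    # Iterate over a copy of the keys because the dict is mutated below.
    for name in list(cache):
        if md5_hex(cache[name]) != expected.get(name):
            del cache[name]
            removed.append(name)
    return removed


# Hypothetical cache contents: one intact file and one corrupted file.
expected = {"a.csv": md5_hex(b"good"), "b.csv": md5_hex(b"good")}
cache = {"a.csv": b"good", "b.csv": b"corrupted"}
removed = prune_invalid(cache, expected)
```

After the call, only entries whose content still matches the recorded checksum remain, and `removed` lists the names that were dropped so they can be fetched again.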