pudl.workspace.datastore
Datastore manages file retrieval for PUDL datasets.
Module Contents
Classes
DatapackageDescriptor | A simple wrapper providing access to datapackage.json contents.
ZenodoFetcher | API for fetching datapackage descriptors and resource contents from Zenodo.
Datastore | Handle connections and downloading of Zenodo Source archives.
ParseKeyValues | Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).
Functions
Collect the command line arguments.
_get_pudl_in | Figure out what pudl_in path should be used.
_create_datastore | Constructs datastore instance.
print_partitions | Prints known partition keys and their values for each of the datasets.
validate_cache | Validate elements in the datastore cache. Delete invalid entries from cache.
Retrieve all matching resources and store them in the cache.
Cache datasets.
Attributes
- exception pudl.workspace.datastore.ChecksumMismatch
Bases: ValueError
Resource checksum (md5) does not match.
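The md5 comparison behind this exception can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `ChecksumMismatchSketch` and `check_md5` are hypothetical names, and only the facts that the exception derives from ValueError and that the checksum is md5 come from this page.

```python
import hashlib


class ChecksumMismatchSketch(ValueError):
    """Hypothetical stand-in: the md5 of a resource does not match."""


def check_md5(content: bytes, expected_md5: str) -> None:
    """Raise if the md5 hex digest of content differs from expected_md5."""
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_md5:
        raise ChecksumMismatchSketch(
            f"expected md5 {expected_md5}, got {actual}"
        )


payload = b"example resource bytes"
good_digest = hashlib.md5(payload).hexdigest()

# Matching content passes silently.
check_md5(payload, good_digest)

# Altered content raises; because the sketch subclasses ValueError,
# callers can catch either the specific or the general exception type.
mismatch_caught = False
try:
    check_md5(payload + b"!", good_digest)
except ValueError as err:
    mismatch_caught = isinstance(err, ChecksumMismatchSketch)
```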
- class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)
A simple wrapper providing access to datapackage.json contents.
- get_resource_path(self, name: str) -> str
Returns the Zenodo URL that holds the contents of the given named resource.
- validate_checksum(self, name: str, content: str) -> bool
Returns True if content matches the checksum recorded for the given named resource.
- get_resources(self, name: str = None, **filters: Any) -> Iterator[pudl.workspace.resource_cache.PudlResourceKey]
Returns a series of PudlResourceKey identifiers for the matching resources.
- get_partitions(self, name: str = None) -> Dict[str, Set[str]]
Returns a mapping of each known partition key to the set of its known values.
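The filtering and partition-listing behaviour described above can be illustrated on plain dictionaries. The sketch below assumes each resource record carries a "parts" mapping of partition keys to values; the record layout, the sample names, and the helpers `matching_resources` and `partitions` are hypothetical and are not taken from the library.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Set

# Hypothetical resource records; the real datapackage.json layout may differ.
resources: List[Dict[str, Any]] = [
    {"name": "ds-2019.zip", "parts": {"year": 2019, "state": "co"}},
    {"name": "ds-2020.zip", "parts": {"year": 2020, "state": "co"}},
    {"name": "ds-2020-tx.zip", "parts": {"year": 2020, "state": "tx"}},
]


def matching_resources(records: List[Dict[str, Any]], **filters: Any) -> Iterator[str]:
    """Yield names of records whose parts match every key=value filter."""
    for record in records:
        parts = record.get("parts", {})
        # Compare as strings so that year=2020 and year="2020" both match.
        if all(k in parts and str(parts[k]) == str(v) for k, v in filters.items()):
            yield record["name"]


def partitions(records: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each partition key to the set of values seen across records."""
    found: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for key, value in record.get("parts", {}).items():
            found[key].add(str(value))
    return dict(found)


year_2020 = list(matching_resources(resources, year=2020))
known_partitions = partitions(resources)
```

With no filters, `matching_resources` yields every record, which mirrors the idea of filters only narrowing the result.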
- class pudl.workspace.datastore.ZenodoFetcher(sandbox: bool = False, timeout: float = 15.0)
API for fetching datapackage descriptors and resource contents from Zenodo.
- get_descriptor(self, dataset: str) -> DatapackageDescriptor
Returns the DatapackageDescriptor for the given dataset.
- get_resource_key(self, dataset: str, name: str) -> pudl.workspace.resource_cache.PudlResourceKey
Returns the PudlResourceKey for the given resource.
- get_resource(self, res: pudl.workspace.resource_cache.PudlResourceKey) -> bytes
Given a resource key, retrieves the contents of the file from Zenodo.
- class pudl.workspace.datastore.Datastore(local_cache_path: Optional[pathlib.Path] = None, gcs_cache_path: Optional[str] = None, sandbox: bool = False, timeout: float = 15)
Handle connections and downloading of Zenodo Source archives.
- get_datapackage_descriptor(self, dataset: str) -> DatapackageDescriptor
Fetch the datapackage descriptor for the given dataset, either from the cache or from Zenodo.
- get_resources(self, dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) -> Iterator[Tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]
Return the content of the matching resources.
- Parameters
dataset (str) – name of the dataset to query.
cached_only (bool) – if True, only retrieve resources that are present in the cache.
skip_optimally_cached (bool) – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.
filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].
- Yields
(PudlResourceKey, bytes) tuples holding the content of each matching resource, per the signature above.
- remove_from_cache(self, res: pudl.workspace.resource_cache.PudlResourceKey)
Remove the given resource from the associated cache.
- get_unique_resource(self, dataset: str, **filters: Any) -> bytes
Returns the content of a resource, assuming exactly one resource matches the filters.
- get_zipfile_resource(self, dataset: str, **filters: Any) -> zipfile.ZipFile
Retrieves the unique matching resource and opens it as a ZipFile.
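The last two methods combine an "exactly one match" check with opening in-memory bytes as a zip archive. The sketch below shows both ideas with the standard library only. The helpers `unique` and `open_zip` are hypothetical, and raising ValueError on zero or multiple matches is an assumption, since this page does not say how the library reports that case.

```python
import io
import zipfile
from typing import Iterable


def unique(items: Iterable[bytes]) -> bytes:
    """Return the single item; raise ValueError if there is not exactly one."""
    found = list(items)
    if len(found) != 1:
        raise ValueError(f"expected exactly one match, got {len(found)}")
    return found[0]


def open_zip(content: bytes) -> zipfile.ZipFile:
    """Wrap raw bytes in a file-like object so ZipFile can read them."""
    return zipfile.ZipFile(io.BytesIO(content))


# Build a small archive in memory to stand in for a downloaded resource.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("data.csv", "a,b\n1,2\n")
archive_bytes = buffer.getvalue()

content = unique([archive_bytes])
with open_zip(content) as archive:
    names = archive.namelist()
    first_line = archive.read("data.csv").decode().splitlines()[0]
```

Wrapping the bytes in io.BytesIO avoids writing a temporary file, which is convenient when the content already sits in memory after a cache lookup or download.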
- class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)
Bases: argparse.Action
Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).
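A custom argparse.Action of this kind can be sketched as follows. `KeyValueAction` and the `--partition` flag are hypothetical names chosen for illustration; the library's own class may handle edge cases such as malformed pairs differently.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Hypothetical action: turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Split on commas first, then on the first '=' of each pair, so
        # values that themselves contain '=' are preserved.
        pairs = dict(item.split("=", 1) for item in values.split(","))
        setattr(namespace, self.dest, pairs)


parser = argparse.ArgumentParser()
# The flag name is an assumption made for this example.
parser.add_argument("--partition", action=KeyValueAction, default={})

args = parser.parse_args(["--partition", "year=2020,state=co"])
no_args = parser.parse_args([])
```

All parsed values are strings. A caller that needs typed values, such as an integer year, would have to convert them after parsing.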
- pudl.workspace.datastore._get_pudl_in(args: dict) -> pathlib.Path
Figure out what pudl_in path should be used.
- pudl.workspace.datastore._create_datastore(args: dict) -> Datastore
Constructs datastore instance.
- pudl.workspace.datastore.print_partitions(dstore: Datastore, datasets: List[str]) -> None
Prints known partition keys and their values for each of the datasets.
- pudl.workspace.datastore.validate_cache(dstore: Datastore, datasets: List[str], args: argparse.Namespace) -> None
Validate elements in the datastore cache. Delete invalid entries from cache.
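The validate-and-delete pattern can be sketched on an in-memory dictionary. This is only an illustration under stated assumptions: the cache here is a plain dict of name to bytes, `prune_invalid` is a hypothetical helper, and the real function operates on a Datastore and its resource keys instead.

```python
import hashlib
from typing import Dict, List


def prune_invalid(cache: Dict[str, bytes], expected_md5: Dict[str, str]) -> List[str]:
    """Delete entries whose md5 differs from the expected one; return their names."""
    removed = []
    # Iterate over a copy of the keys because entries are deleted in the loop.
    for name in list(cache):
        actual = hashlib.md5(cache[name]).hexdigest()
        if actual != expected_md5.get(name):
            del cache[name]
            removed.append(name)
    return removed


cache = {"a.zip": b"alpha", "b.zip": b"corrupted"}
expected = {
    "a.zip": hashlib.md5(b"alpha").hexdigest(),
    "b.zip": hashlib.md5(b"beta").hexdigest(),
}

removed = prune_invalid(cache, expected)
```

In this sketch an entry with no expected checksum is treated as invalid and removed; whether the library does the same is not stated on this page.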