pudl.workspace.datastore#
Datastore manages file retrieval for PUDL datasets.
Module Contents#
Classes#
DatapackageDescriptor | A simple wrapper providing access to datapackage.json contents.
ZenodoDoiSettings | Digital Object Identifiers pointing to currently used Zenodo archives.
ZenodoFetcher | API for fetching datapackage descriptors and resource contents from Zenodo.
Datastore | Handle connections and downloading of Zenodo Source archives.
ParseKeyValues | Transforms k1=v1,k2=v2,... into dict(k1=v1, k2=v2, ...).
Functions#
parse_command_line | Collect the command line arguments.
print_partitions | Prints known partition keys and their values for each of the datasets.
validate_cache | Validate elements in the datastore cache.
fetch_resources | Retrieve all matching resources and store them in the cache.
main | Cache datasets.
Attributes#
- exception pudl.workspace.datastore.ChecksumMismatchError[source]#
Bases: ValueError
Resource checksum (md5) does not match.
- class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)[source]#
A simple wrapper providing access to datapackage.json contents.
- get_resource_path(name: str) → str [source]#
Returns the Zenodo URL that holds the contents of the given named resource.
- validate_checksum(name: str, content: str) → bool [source]#
Returns True if the content matches the checksum for the given named resource.
- get_resources(name: str = None, **filters: Any) → collections.abc.Iterator[pudl.workspace.resource_cache.PudlResourceKey] [source]#
Returns series of PudlResourceKey identifiers for matching resources.
- Parameters:
name – if specified, find resource(s) with this name.
filters (dict) – if specified, find resource(s) matching these key=value constraints. The constraints are matched against the ‘parts’ field of the resource entry in the datapackage.json.
- get_partitions(name: str = None) → dict[str, set[str]] [source]#
Return mapping of known partition keys to their allowed known values.
- get_partition_filters(**filters: Any) → collections.abc.Iterator[dict[str, str]] [source]#
Returns all known partition mappings.
This can be used to iterate over all resources, because each mapping can be used directly as a filter and should map to a unique resource.
- Parameters:
filters – additional constraints for selecting relevant partitions.
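A minimal usage sketch of the descriptor's partition helpers. It assumes a Datastore instance (documented below) and the hypothetical dataset name "eia860"; actual dataset names and partition keys depend on the Zenodo archives in use.

```python
from pudl.workspace.datastore import Datastore

# Descriptors are normally obtained through a Datastore rather than built by hand.
dstore = Datastore()
desc = dstore.get_datapackage_descriptor("eia860")  # "eia860" is an assumed dataset name

# Mapping of known partition keys to the values they can take.
print(desc.get_partitions())

# Each partition mapping can be fed back in as filters and should select a single resource.
for parts in desc.get_partition_filters():
    for key in desc.get_resources(**parts):
        print(key, parts)
```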
- class pudl.workspace.datastore.ZenodoDoiSettings[source]#
Bases: pydantic.BaseSettings
Digital Object Identifiers pointing to currently used Zenodo archives.
- class pudl.workspace.datastore.ZenodoFetcher(zenodo_dois: ZenodoDoiSettings | None = None, timeout: float = 15.0)[source]#
API for fetching datapackage descriptors and resource contents from Zenodo.
- _descriptor_cache: dict[str, DatapackageDescriptor][source]#
- zenodo_dois: ZenodoDoiSettings[source]#
- _get_token(url: pydantic.HttpUrl) → str [source]#
Return the appropriate read-only Zenodo personal access token.
These tokens are associated with the pudl@catalyst.coop Zenodo account, which owns all of the Catalyst raw data archives.
- _get_url(doi: ZenodoDoi) → pydantic.HttpUrl [source]#
Construct a Zenodo deposition URL based on its Zenodo DOI.
- get_descriptor(dataset: str) → DatapackageDescriptor [source]#
Returns the DatapackageDescriptor for the given dataset.
- get_resource(res: pudl.workspace.resource_cache.PudlResourceKey) → bytes [source]#
Given a resource key, retrieve the contents of the file from Zenodo.
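A sketch of fetching a descriptor and a single resource straight from Zenodo, again using the hypothetical dataset name "eia860"; the real dataset names are those defined by ZenodoDoiSettings.

```python
from pudl.workspace.datastore import ZenodoFetcher

fetcher = ZenodoFetcher(timeout=30.0)
descriptor = fetcher.get_descriptor("eia860")  # assumed dataset name

# Take the first matching resource key and download its raw bytes.
key = next(descriptor.get_resources())
raw_bytes = fetcher.get_resource(key)
print(key, len(raw_bytes))
```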
- class pudl.workspace.datastore.Datastore(local_cache_path: pathlib.Path | None = None, gcs_cache_path: str | None = None, timeout: float = 15.0)[source]#
Handle connections and downloading of Zenodo Source archives.
- get_datapackage_descriptor(dataset: str) → DatapackageDescriptor [source]#
Fetch datapackage descriptor for dataset either from cache or Zenodo.
- get_resources(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → collections.abc.Iterator[tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]] [source]#
Return content of the matching resources.
- Parameters:
dataset – name of the dataset to query.
cached_only – if True, only retrieve resources that are present in the cache.
skip_optimally_cached – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.
filters (key=val) – only return resources whose metadata["parts"] match these key-value constraints.
- Yields:
(PudlResourceKey, bytes) tuples holding the content of each matching resource
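A sketch of retrieving matching resources through a Datastore backed by a local file cache. The dataset name "eia860", the cache path, and the year filter are illustrative assumptions; filter values must match the types recorded in the target datapackage's 'parts' metadata.

```python
from pathlib import Path

from pudl.workspace.datastore import Datastore

dstore = Datastore(local_cache_path=Path("~/pudl-cache").expanduser())
for key, content in dstore.get_resources("eia860", year="2020"):
    print(key, len(content))  # content is the raw bytes of each matching resource
```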
- remove_from_cache(res: pudl.workspace.resource_cache.PudlResourceKey) → None [source]#
Remove given resource from the associated cache.
- get_unique_resource(dataset: str, **filters: Any) → bytes [source]#
Returns the content of a resource, assuming exactly one matches the filters.
- get_zipfile_resource(dataset: str, **filters: Any) → zipfile.ZipFile [source]#
Retrieves a unique resource and opens it as a ZipFile.
- get_zipfile_resources(dataset: str, **filters: Any) → collections.abc.Iterator[tuple[pudl.workspace.resource_cache.PudlResourceKey, zipfile.ZipFile]] [source]#
Iterates over resources that match the filters and opens each as a ZipFile.
- get_zipfile_file_names(zip_file: zipfile.ZipFile)[source]#
Given a zipfile, return a list of the file names in it.
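A sketch of the ZipFile convenience helpers, assuming the dataset distributes its resources as zip archives; the dataset name and filter are again hypothetical.

```python
from pudl.workspace.datastore import Datastore

dstore = Datastore()

# Open the single resource matching the filters as a ZipFile and list its contents.
zf = dstore.get_zipfile_resource("eia860", year="2020")
print(dstore.get_zipfile_file_names(zf))

# Or iterate over every matching archive alongside its resource key.
for key, archive in dstore.get_zipfile_resources("eia860"):
    print(key, dstore.get_zipfile_file_names(archive))
```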
- class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]#
Bases: argparse.Action
Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).
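A sketch of wiring ParseKeyValues into an argparse parser; the "--partition" flag name is illustrative, not part of the documented CLI.

```python
import argparse

from pudl.workspace.datastore import ParseKeyValues

parser = argparse.ArgumentParser()
parser.add_argument("--partition", action=ParseKeyValues, default={})
args = parser.parse_args(["--partition", "year=2020,state=CA"])
print(args.partition)  # expected: {'year': '2020', 'state': 'CA'}
```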
- pudl.workspace.datastore.print_partitions(dstore: Datastore, datasets: list[str]) → None [source]#
Prints known partition keys and their values for each of the datasets.
- pudl.workspace.datastore.validate_cache(dstore: Datastore, datasets: list[str], args: argparse.Namespace) → None [source]#
Validate elements in the datastore cache.
Delete invalid entries from the cache.