pudl.workspace.datastore module
Datastore manages file retrieval for PUDL datasets.
-
exception
pudl.workspace.datastore.
ChecksumMismatch
Bases:
ValueError
Resource checksum (md5) does not match.
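A minimal sketch of how an md5 check of this kind might be performed. The exception class below is a local stand-in that mirrors the documented one, and `validate_md5` is a hypothetical helper, not a function from this module.

```python
import hashlib


class ChecksumMismatch(ValueError):
    """Local stand-in for the documented exception (assumption)."""


def validate_md5(content: bytes, expected_md5: str) -> None:
    """Raise ChecksumMismatch if the md5 hex digest of content differs."""
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_md5:
        raise ChecksumMismatch(f"expected {expected_md5}, got {actual}")


payload = b"example resource bytes"
good_digest = hashlib.md5(payload).hexdigest()
validate_md5(payload, good_digest)  # matching digest: returns without raising
```

Because the exception derives from ValueError, callers that already catch ValueError would also catch a checksum failure.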
-
class
pudl.workspace.datastore.
DatapackageDescriptor
(datapackage_json: dict, dataset: str, doi: str)
Bases:
object
A simple wrapper providing access to datapackage.json contents.
-
get_json_string
() → str
Exports the underlying JSON as a normalized (sorted, indented) JSON string.
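One way to produce a normalized string of this kind is sketched below with the standard `json` module. The helper name `to_normalized_json` and the indent width of 4 are assumptions for illustration; the source only states that the output is sorted and indented.

```python
import json


def to_normalized_json(data: dict) -> str:
    """Serialize with sorted keys and indentation so the output is deterministic."""
    return json.dumps(data, sort_keys=True, indent=4)


# Two dicts with the same content but different key order.
first = to_normalized_json({"name": "eia860", "version": 1})
second = to_normalized_json({"version": 1, "name": "eia860"})
```

Sorting keys means that two descriptors with identical content serialize to identical strings, which makes them easy to compare or diff.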
-
get_partitions
(name: Optional[str] = None) → Dict[str, Set[str]]
Returns a mapping of each known partition key to the set of its known values.
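The shape of the returned mapping can be illustrated with a small sketch. The resource entries, the `collect_partitions` helper, and the conversion of values to strings are all assumptions made for this example; only the idea of per-resource "parts" metadata comes from this page.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set

# Hypothetical resource entries with partition metadata under "parts".
resources: List[Dict[str, Any]] = [
    {"name": "eia860-2018.zip", "parts": {"year": 2018}},
    {"name": "eia860-2019.zip", "parts": {"year": 2019}},
    {"name": "epacems-2019-id.zip", "parts": {"year": 2019, "state": "id"}},
]


def collect_partitions(
    resources: List[Dict[str, Any]], name: Optional[str] = None
) -> Dict[str, Set[str]]:
    """Map each partition key to the set of values seen for it."""
    partitions: Dict[str, Set[str]] = defaultdict(set)
    for res in resources:
        if name is not None and res["name"] != name:
            continue  # restrict to a single named resource when requested
        for key, value in res.get("parts", {}).items():
            partitions[key].add(str(value))
    return dict(partitions)


all_partitions = collect_partitions(resources)
one_resource = collect_partitions(resources, name="eia860-2018.zip")
```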
-
get_resource_path
(name: str) → str
Returns the Zenodo URL that holds the contents of the given named resource.
-
get_resources
(name: Optional[str] = None, **filters: Any) → Iterator[pudl.workspace.resource_cache.PudlResourceKey]
Returns a series of PudlResourceKey identifiers for matching resources.
-
-
class
pudl.workspace.datastore.
Datastore
(local_cache_path: Optional[pathlib.Path] = None, gcs_cache_path: Optional[str] = None, sandbox: bool = False, timeout: float = 15)
Bases:
object
Handle connections and downloading of Zenodo Source archives.
-
get_datapackage_descriptor
(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor
Fetches the datapackage descriptor for the given dataset, either from the cache or from Zenodo.
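The cache-first lookup described here follows a common pattern, sketched below in isolation. `CachedLookup` and its `fetch` callable are hypothetical stand-ins; they do not reproduce how Datastore stores descriptors or talks to Zenodo.

```python
from typing import Callable, Dict


class CachedLookup:
    """Return a cached value when present, otherwise fetch and remember it."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # stand-in for a remote request
        self._cache: Dict[str, dict] = {}
        self.remote_calls = 0

    def get(self, dataset: str) -> dict:
        if dataset not in self._cache:
            self.remote_calls += 1
            self._cache[dataset] = self._fetch(dataset)
        return self._cache[dataset]


lookup = CachedLookup(lambda name: {"name": name, "resources": []})
first = lookup.get("eia860")   # goes to the fetch callable
second = lookup.get("eia860")  # served from the in-memory cache
```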
-
get_resources
(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → Iterator[Tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]
Returns the content of the matching resources.
- Parameters
dataset (str) – name of the dataset to query.
cached_only (bool) – if True, only retrieve resources that are present in the cache.
skip_optimally_cached (bool) – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.
filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].
- Yields
(PudlResourceKey, bytes) tuples holding the content of each matching resource, per the return annotation in the signature.
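The filter semantics described above (keep a resource only if its "parts" metadata agrees with every key=value filter) can be sketched as follows. The resource list and the `matches` and `select` helpers are hypothetical, and comparing values as strings is an assumption made so that `year=2019` and `year="2019"` behave the same.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical resource metadata; only the "parts" key comes from the text above.
resources: List[Dict[str, Any]] = [
    {"name": "epacems-2018-id.zip", "parts": {"year": 2018, "state": "id"}},
    {"name": "epacems-2019-id.zip", "parts": {"year": 2019, "state": "id"}},
    {"name": "epacems-2019-wy.zip", "parts": {"year": 2019, "state": "wy"}},
]


def matches(parts: Dict[str, Any], **filters: Any) -> bool:
    """True when every filter key exists in parts with an equal value."""
    return all(k in parts and str(parts[k]) == str(v) for k, v in filters.items())


def select(resources: List[Dict[str, Any]], **filters: Any) -> Iterator[str]:
    """Yield the names of resources whose parts satisfy all filters."""
    for res in resources:
        if matches(res["parts"], **filters):
            yield res["name"]


selected = list(select(resources, year=2019, state="id"))
everything = list(select(resources))  # no filters: every resource matches
```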
-
get_unique_resource
(dataset: str, **filters: Any) → bytes
Returns the content of a resource, assuming exactly one resource matches.
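An "exactly one match" contract can be enforced on any iterator, as in the sketch below. The helper `exactly_one` and the choice of KeyError are assumptions for illustration; the source does not say which exception is raised when zero or several resources match.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")
_MISSING = object()  # sentinel distinguishing "no item" from a falsy item


def exactly_one(items: Iterable[T]) -> T:
    """Return the single item of an iterable, or raise if the count is not one."""
    iterator = iter(items)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        raise KeyError("no resource matches the given filters")
    if next(iterator, _MISSING) is not _MISSING:
        raise KeyError("more than one resource matches the given filters")
    return first


content = exactly_one([b"only match"])
```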
-
get_zipfile_resource
(dataset: str, **filters: Any) → zipfile.ZipFile
Retrieves the unique matching resource and opens it as a ZipFile.
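Opening raw bytes as a ZipFile generally means wrapping them in an in-memory buffer first. The sketch below uses only the standard library and builds its own tiny archive as a stand-in for downloaded content; the file name `data.csv` is made up for the example.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data.csv", "id,value\n1,42\n")
raw = buffer.getvalue()

# Wrap the raw bytes in BytesIO so ZipFile can seek within them.
archive = zipfile.ZipFile(io.BytesIO(raw))
names = archive.namelist()
text = archive.read("data.csv").decode("utf-8")
```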
-
remove_from_cache
(res: pudl.workspace.resource_cache.PudlResourceKey)
Removes the given resource from the associated cache.
-
-
class
pudl.workspace.datastore.
ParseKeyValues
(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)
Bases:
argparse.Action
Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).
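A custom argparse action with this behavior might look like the sketch below. `KeyValueAction` is a hypothetical re-implementation for illustration, and the `--partition` flag name is an assumption; neither is taken from the PUDL source.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Hypothetical action: turns 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""

    def __call__(self, parser, namespace, values, option_string=None):
        pairs = {}
        for item in values.split(","):
            key, sep, val = item.partition("=")
            if not sep:
                # parser.error prints usage and exits with status 2
                parser.error(f"expected key=value, got {item!r}")
            pairs[key] = val
        setattr(namespace, self.dest, pairs)


parser = argparse.ArgumentParser()
parser.add_argument("--partition", action=KeyValueAction, default={})
args = parser.parse_args(["--partition", "year=2019,state=id"])
```

Values stay as strings in this sketch; any conversion to integers or other types would be a separate step.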
-
class
pudl.workspace.datastore.
ZenodoFetcher
(sandbox: bool = False, timeout: float = 15.0)
Bases:
object
API for fetching datapackage descriptors and resource contents from Zenodo.
-
API_ROOT
= {'production': 'https://zenodo.org/api', 'sandbox': 'https://sandbox.zenodo.org/api'}¶
-
DOI
= {'production': {'censusdp1tract': '10.5281/zenodo.4127049', 'eia860': '10.5281/zenodo.4127027', 'eia860m': '10.5281/zenodo.4540268', 'eia861': '10.5281/zenodo.4127029', 'eia923': '10.5281/zenodo.4127040', 'epacems': '10.5281/zenodo.4660268', 'ferc1': '10.5281/zenodo.4127044', 'ferc714': '10.5281/zenodo.4127101'}, 'sandbox': {'censusdp1tract': '10.5072/zenodo.674992', 'eia860': '10.5072/zenodo.672210', 'eia860m': '10.5072/zenodo.692655', 'eia861': '10.5072/zenodo.687052', 'eia923': '10.5072/zenodo.687071', 'epacems': '10.5072/zenodo.672963', 'ferc1': '10.5072/zenodo.687072', 'ferc714': '10.5072/zenodo.672224'}}¶
-
TOKEN
= {'production': 'KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5', 'sandbox': 'qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco'}¶
-
get_descriptor
(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor
Returns the DatapackageDescriptor for the given dataset.
-
get_resource
(res: pudl.workspace.resource_cache.PudlResourceKey) → bytes
Given a resource key, retrieves the contents of the file from Zenodo.
-
get_resource_key
(dataset: str, name: str) → pudl.workspace.resource_cache.PudlResourceKey
Returns the PudlResourceKey for the given resource.
-
-
pudl.workspace.datastore.
fetch_resources
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None
Retrieves all matching resources and stores them in the cache.
-
pudl.workspace.datastore.
print_partitions
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str]) → None
Prints the known partition keys and their values for each of the datasets.
-
pudl.workspace.datastore.
validate_cache
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None
Validates elements in the datastore cache and deletes invalid entries from the cache.