pudl.workspace.datastore module

Datastore manages file retrieval for PUDL datasets.

exception pudl.workspace.datastore.ChecksumMismatch

Bases: ValueError

Resource checksum (md5) does not match.

class pudl.workspace.datastore.DatapackageDescriptor(datapackage_json: dict, dataset: str, doi: str)

Bases: object

A simple wrapper providing access to datapackage.json contents.

get_json_string() → str

Exports the underlying JSON as a normalized (sorted, indented) JSON string.
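The normalization described above can be illustrated with the standard library. This is a minimal sketch, not the module's implementation: the helper name to_normalized_json and the indent width of 4 are assumptions.

```python
import json


def to_normalized_json(descriptor: dict) -> str:
    """Serialize a dict with sorted keys and fixed indentation (hypothetical helper)."""
    # sort_keys makes the output independent of key insertion order;
    # indent=4 is an assumed width, not one confirmed by the source.
    return json.dumps(descriptor, sort_keys=True, indent=4)


# Two dicts with the same content but different key order
# serialize to the same string.
a = to_normalized_json({"name": "eia860", "id": 1})
b = to_normalized_json({"id": 1, "name": "eia860"})
```

Because the output is deterministic, two descriptors can be compared or diffed as plain strings.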

get_partitions(name: Optional[str] = None) → Dict[str, Set[str]]

Returns a mapping of each known partition key to the set of its known values.
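As an illustration of the returned shape, the sketch below builds such a mapping from a list of resource entries. The function name collect_partitions, the example entries, and the conversion of values to strings are assumptions; only the ‘parts’ field and the Dict[str, Set[str]] return type come from this page.

```python
from collections import defaultdict
from typing import Dict, List, Set


def collect_partitions(resources: List[dict]) -> Dict[str, Set[str]]:
    """Map each partition key to the set of values seen across resources."""
    partitions: Dict[str, Set[str]] = defaultdict(set)
    for res in resources:
        # A resource without a "parts" field contributes nothing.
        for key, value in res.get("parts", {}).items():
            # Values are stringified here to match Set[str] (an assumption).
            partitions[key].add(str(value))
    return dict(partitions)


# Hypothetical resource entries, shaped like datapackage.json resources.
example_resources = [
    {"name": "a.zip", "parts": {"year": 2019}},
    {"name": "b.zip", "parts": {"year": 2020, "state": "co"}},
]
partitions = collect_partitions(example_resources)
```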

get_resource_path(name: str) → str

Returns the Zenodo URL that holds the contents of the given named resource.

get_resources(name: Optional[str] = None, **filters: Any) → Iterator[pudl.workspace.resource_cache.PudlResourceKey]

Returns a series of PudlResourceKey identifiers for matching resources.

Parameters
  • name (str) – if specified, find resource(s) with this name.

  • filters (dict) – if specified, find resource(s) matching these key=value constraints. The constraints are matched against the ‘parts’ field of the resource entry in the datapackage.json.
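The matching rule described by these parameters can be sketched as follows. This is an assumption-laden illustration rather than the module's code: matching_resources is a hypothetical name, the resources are plain dicts instead of PudlResourceKey objects, and comparing values as strings is a guess at how mixed int/str values would be handled.

```python
from typing import Any, Iterator, List, Optional


def matching_resources(
    resources: List[dict], name: Optional[str] = None, **filters: Any
) -> Iterator[dict]:
    """Yield resources whose name and 'parts' satisfy the constraints."""
    for res in resources:
        if name is not None and res.get("name") != name:
            continue
        parts = res.get("parts", {})
        # Every key=value filter must be present in "parts" and equal;
        # values are compared as strings (an assumption).
        if all(k in parts and str(parts[k]) == str(v) for k, v in filters.items()):
            yield res


# Hypothetical resource entries.
resources = [
    {"name": "eia860-2019.zip", "parts": {"year": 2019}},
    {"name": "eia860-2020.zip", "parts": {"year": 2020}},
]
matched = [r["name"] for r in matching_resources(resources, year=2020)]
```

With no name and no filters, every resource matches, which is consistent with both parameters being optional.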

validate_checksum(name: str, content: str) → bool

Returns True if content matches the checksum for the given named resource.
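A minimal sketch of an MD5 comparison of this kind, using hashlib. The helper names md5_matches and require_valid are hypothetical, the ChecksumMismatch class below is a local stand-in, and the sketch takes bytes because hashlib requires it (the signature above types content as str).

```python
import hashlib


class ChecksumMismatch(ValueError):
    """Local stand-in for the exception documented on this page."""


def md5_matches(content: bytes, expected_md5: str) -> bool:
    """Return True if the MD5 hex digest of content equals expected_md5."""
    return hashlib.md5(content).hexdigest() == expected_md5.lower()


def require_valid(content: bytes, expected_md5: str) -> bytes:
    """Return content unchanged, or raise if its checksum differs."""
    if not md5_matches(content, expected_md5):
        raise ChecksumMismatch("Resource checksum (md5) does not match.")
    return content
```

Raising a ValueError subclass lets callers tell a corrupted download apart from a network failure.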

class pudl.workspace.datastore.Datastore(local_cache_path: Optional[pathlib.Path] = None, gcs_cache_path: Optional[str] = None, sandbox: bool = False, timeout: float = 15)

Bases: object

Handle connections and downloading of Zenodo Source archives.

get_datapackage_descriptor(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor

Fetch the datapackage descriptor for the given dataset, either from the cache or from Zenodo.

get_known_datasets() → List[str]

Returns a list of supported datasets.

get_resources(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → Iterator[Tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]]

Return the content of the matching resources.

Parameters
  • dataset (str) – name of the dataset to query.

  • cached_only (bool) – if True, only retrieve resources that are present in the cache.

  • skip_optimally_cached (bool) – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.

  • filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].

Yields

(PudlResourceKey, bytes) tuples holding the content of each matching resource.

get_unique_resource(dataset: str, **filters: Any) → bytes

Returns the content of a resource, assuming exactly one resource matches.
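The "exactly one match" assumption can be enforced with a small helper like the one below. This is only a sketch: exactly_one is a hypothetical name, and raising ValueError when zero or several resources match is an assumption, since this page does not say how the method reacts in that case.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def exactly_one(items: Iterable[T]) -> T:
    """Return the single element of items, or raise if there is not exactly one."""
    found = list(items)
    if len(found) != 1:
        # The exception type is an assumption made for this sketch.
        raise ValueError(f"Expected exactly one match, found {len(found)}")
    return found[0]


# Works on any iterable, including a generator of (key, content) tuples.
key, content = exactly_one(iter([("eia860-2019.zip", b"data")]))
```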

get_zipfile_resource(dataset: str, **filters: Any) → zipfile.ZipFile

Retrieves the unique matching resource and opens it as a ZipFile.
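Opening in-memory bytes as a ZipFile can be done with the standard library, as sketched below. The helper name open_zip_from_bytes and the archive contents are made up for illustration; the sketch does not show how the module itself performs this step.

```python
import io
import zipfile


def open_zip_from_bytes(content: bytes) -> zipfile.ZipFile:
    """Wrap raw bytes in a file-like object and open them as a zip archive."""
    return zipfile.ZipFile(io.BytesIO(content))


# Build a small archive in memory to stand in for downloaded content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plants.csv", "plant_id,name\n1,Example\n")

opened = open_zip_from_bytes(buffer.getvalue())
names = opened.namelist()
```

No temporary file is needed, because ZipFile accepts any seekable file-like object.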

remove_from_cache(res: pudl.workspace.resource_cache.PudlResourceKey)

Remove given resource from the associated cache.

class pudl.workspace.datastore.ParseKeyValues(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)

Bases: argparse.Action

Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).
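A custom argparse.Action performing this kind of transformation might look like the sketch below. The class name KeyValueAction, the --partition option, and the error handling for malformed items are assumptions, not the module's actual code.

```python
import argparse


class KeyValueAction(argparse.Action):
    """Parse 'k1=v1,k2=v2' into a dict and store it on the namespace."""

    def __call__(self, parser, namespace, values, option_string=None):
        pairs = {}
        for item in values.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                # Reject items without '='; this behavior is an assumption.
                parser.error(f"expected key=value, got {item!r}")
            pairs[key] = value
        setattr(namespace, self.dest, pairs)


# Hypothetical option name, used only for this example.
parser = argparse.ArgumentParser()
parser.add_argument("--partition", action=KeyValueAction, default={})
args = parser.parse_args(["--partition", "year=2019,state=co"])
```

All values stay strings, so any type conversion would have to happen where the filters are applied.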

class pudl.workspace.datastore.ZenodoFetcher(sandbox: bool = False, timeout: float = 15.0)

Bases: object

API for fetching datapackage descriptors and resource contents from Zenodo.

API_ROOT = {'production': 'https://zenodo.org/api', 'sandbox': 'https://sandbox.zenodo.org/api'}
DOI = {'production': {'censusdp1tract': '10.5281/zenodo.4127049', 'eia860': '10.5281/zenodo.4127027', 'eia860m': '10.5281/zenodo.4540268', 'eia861': '10.5281/zenodo.4127029', 'eia923': '10.5281/zenodo.4127040', 'epacems': '10.5281/zenodo.4660268', 'ferc1': '10.5281/zenodo.4127044', 'ferc714': '10.5281/zenodo.4127101'}, 'sandbox': {'censusdp1tract': '10.5072/zenodo.674992', 'eia860': '10.5072/zenodo.672210', 'eia860m': '10.5072/zenodo.692655', 'eia861': '10.5072/zenodo.687052', 'eia923': '10.5072/zenodo.687071', 'epacems': '10.5072/zenodo.672963', 'ferc1': '10.5072/zenodo.687072', 'ferc714': '10.5072/zenodo.672224'}}
TOKEN = {'production': 'KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5', 'sandbox': 'qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco'}
get_descriptor(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor

Returns the DatapackageDescriptor for the given dataset.

get_doi(dataset: str) → str

Returns the DOI for the given dataset.
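Given the DOI table above, a lookup keyed by environment and dataset could be sketched as follows. The function lookup_doi is hypothetical, the table is a two-dataset excerpt of the values listed in DOI, and raising KeyError for an unknown dataset is an assumption.

```python
# Excerpt of the DOI table shown above (two datasets only).
DOI = {
    "production": {
        "eia860": "10.5281/zenodo.4127027",
        "ferc1": "10.5281/zenodo.4127044",
    },
    "sandbox": {
        "eia860": "10.5072/zenodo.672210",
        "ferc1": "10.5072/zenodo.687072",
    },
}


def lookup_doi(dataset: str, sandbox: bool = False) -> str:
    """Return the DOI for dataset in the selected Zenodo environment."""
    env = "sandbox" if sandbox else "production"
    try:
        return DOI[env][dataset]
    except KeyError:
        # The error type for unknown datasets is an assumption.
        raise KeyError(f"No DOI known for dataset {dataset!r} in {env}") from None
```

In this sketch the sandbox flag mirrors the constructor argument of ZenodoFetcher and simply selects which half of the table is consulted.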

get_known_datasets() → List[str]

Returns a list of supported datasets.

get_resource(res: pudl.workspace.resource_cache.PudlResourceKey) → bytes

Given a resource key, retrieve the contents of the file from Zenodo.

get_resource_key(dataset: str, name: str) → pudl.workspace.resource_cache.PudlResourceKey

Returns the PudlResourceKey for the given resource.

pudl.workspace.datastore.fetch_resources(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None

Retrieve all matching resources and store them in the cache.

pudl.workspace.datastore.main()

Cache datasets.

pudl.workspace.datastore.parse_command_line()

Collect the command line arguments.

pudl.workspace.datastore.print_partitions(dstore: pudl.workspace.datastore.Datastore, datasets: List[str]) → None

Prints the known partition keys and their values for each of the datasets.

pudl.workspace.datastore.validate_cache(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None

Validate elements in the datastore cache. Delete invalid entries from the cache.
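The validate-then-delete behavior can be sketched against a plain dictionary standing in for the cache. Everything here is an assumption made for illustration: the function drop_invalid_entries, the dict-based cache, and the use of MD5 digests as the validity check (suggested by ChecksumMismatch, but not stated for this function).

```python
import hashlib
from typing import Dict, List


def drop_invalid_entries(
    cache: Dict[str, bytes], expected_md5: Dict[str, str]
) -> List[str]:
    """Delete cache entries whose MD5 digest differs from the expected one."""
    removed = []
    # Iterate over a copy of the keys, since entries are deleted in the loop.
    for name in list(cache):
        if hashlib.md5(cache[name]).hexdigest() != expected_md5.get(name):
            del cache[name]
            removed.append(name)
    return removed


# Hypothetical cache with one intact and one corrupted entry.
cache = {"good.zip": b"intact", "bad.zip": b"corrupted"}
expected = {
    "good.zip": hashlib.md5(b"intact").hexdigest(),
    "bad.zip": hashlib.md5(b"original").hexdigest(),
}
removed = drop_invalid_entries(cache, expected)
```

In this sketch, an entry with no expected checksum at all is also treated as invalid and removed.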