pudl.etl

Dagster definitions for the PUDL ETL and Output tables.

Submodules

Package Contents

Classes

EtlSettings

Main settings validation class.

Functions

epacems_io_manager(→ EpaCemsIOManager)

IO Manager that writes EPA CEMS partitions to individual parquet files.

ferc1_dbf_sqlite_io_manager(→ FercDBFSQLiteIOManager)

Create a SQLiteManager dagster resource for the ferc1 dbf database.

ferc1_xbrl_sqlite_io_manager(→ FercXBRLSQLiteIOManager)

Create a SQLiteManager dagster resource for the ferc1 xbrl database.

pudl_mixed_format_io_manager(→ dagster.IOManager)

Create a SQLiteManager dagster resource for the pudl database.

dataset_settings(→ pudl.settings.DatasetsSettings)

Dagster resource for parameterizing PUDL ETL assets.

datastore(→ pudl.workspace.datastore.Datastore)

Dagster resource to interact with Zenodo archives.

ferc_to_sqlite_settings(...)

Dagster resource for parameterizing the ferc_to_sqlite graph.

asset_check_from_schema(...)

Create a dagster asset check based on the resource schema, if defined.

_get_keys_from_assets(→ list[dagster.AssetKey])

Get a list of asset keys.

create_non_cems_selection(→ dagster.AssetSelection)

Create a selection of assets excluding CEMS and all downstream assets.

load_dataset_settings_from_file(→ dict)

Load dataset settings from a settings file in pudl.package_data.settings.

Attributes

PUDL_PACKAGE

Define a global PUDL package object for use across the entire codebase.

logger

raw_module_groups

core_module_groups

out_module_groups

all_asset_modules

default_assets

default_asset_checks

_package

_asset_keys

default_resources

default_tag_concurrency_limits

default_config

defs

A collection of dagster assets, resources, IO managers, and jobs for the PUDL ETL.

pudl.etl.epacems_io_manager(init_context: dagster.InitResourceContext) → EpaCemsIOManager

IO Manager that writes EPA CEMS partitions to individual parquet files.

pudl.etl.ferc1_dbf_sqlite_io_manager(init_context) → FercDBFSQLiteIOManager

Create a SQLiteManager dagster resource for the ferc1 dbf database.

pudl.etl.ferc1_xbrl_sqlite_io_manager(init_context) → FercXBRLSQLiteIOManager

Create a SQLiteManager dagster resource for the ferc1 xbrl database.

pudl.etl.pudl_mixed_format_io_manager(init_context) → dagster.IOManager

Create a SQLiteManager dagster resource for the pudl database.

pudl.etl.PUDL_PACKAGE

Define a global PUDL package object for use across the entire codebase.

This needs to happen after the definition of the Package class above, and the object is used in some of the class definitions below. Because defining it in the middle of this module is somewhat obscure, it is imported in the __init__.py for this subpackage, and other modules import it from that more prominent location.

pudl.etl.dataset_settings(init_context) → pudl.settings.DatasetsSettings

Dagster resource for parameterizing PUDL ETL assets.

This resource allows us to specify, in the Dagit UI, the years we want to process for each data source.

pudl.etl.datastore(init_context) → pudl.workspace.datastore.Datastore

Dagster resource to interact with Zenodo archives.

pudl.etl.ferc_to_sqlite_settings(init_context) → pudl.settings.FercToSqliteSettings

Dagster resource for parameterizing the ferc_to_sqlite graph.

This resource allows us to specify, in the Dagit UI, the years we want to process for each data source.

class pudl.etl.EtlSettings(_case_sensitive: bool | None = None, _env_prefix: str | None = None, _env_file: pydantic_settings.sources.DotenvType | None = ENV_FILE_SENTINEL, _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_parse_none_str: str | None = None, _secrets_dir: str | pathlib.Path | None = None, **values: Any)

Bases: pydantic_settings.BaseSettings

Main settings validation class.

ferc_to_sqlite_settings: FercToSqliteSettings | None
datasets: DatasetsSettings | None
name: str | None
title: str | None
description: str | None
version: str | None
publish_destinations: list[str] = []
classmethod from_yaml(path: str) → EtlSettings

Create an EtlSettings instance from the path to a YAML file.

Parameters:

path – path to a yaml file; this could be remote.

Returns:

An ETL settings object.

pudl.etl.logger
pudl.etl.raw_module_groups
pudl.etl.core_module_groups
pudl.etl.out_module_groups
pudl.etl.all_asset_modules
pudl.etl.default_assets
pudl.etl.default_asset_checks
pudl.etl.asset_check_from_schema(asset_key: dagster.AssetKey, package: pudl.metadata.classes.Package) → dagster.AssetChecksDefinition | None

Create a dagster asset check based on the resource schema, if defined.

pudl.etl._get_keys_from_assets(asset_def: dagster.AssetsDefinition | dagster.SourceAsset | dagster._core.definitions.cacheable_assets.CacheableAssetsDefinition) → list[dagster.AssetKey]

Get a list of asset keys.

Most assets have one key, which can be retrieved as a list from asset.keys.

Multi-assets have multiple keys, which can also be retrieved as a list from asset.keys.

SourceAssets always have exactly one key and do not have asset.keys, so we look for asset.key and wrap it in a list.

We don’t handle CacheableAssetsDefinitions yet.
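The key-lookup rules above can be illustrated with a minimal sketch. It is not the PUDL implementation: FakeAssetsDefinition, FakeSourceAsset and get_keys are hypothetical stand-ins that only mimic the .keys and .key attributes described here, and the duck-typed hasattr check is an assumption about how the two cases might be told apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeAssetsDefinition:
    """Hypothetical stand-in for an asset or multi-asset (exposes `.keys`)."""

    keys: tuple[str, ...]


@dataclass(frozen=True)
class FakeSourceAsset:
    """Hypothetical stand-in for a source asset (exposes `.key`, no `.keys`)."""

    key: str


def get_keys(asset_def) -> list[str]:
    """Return the asset keys as a list, whichever attribute the object exposes."""
    if hasattr(asset_def, "keys"):
        # Single assets and multi-assets both expose a collection of keys.
        return list(asset_def.keys)
    # Source assets expose exactly one key, so wrap it in a list.
    return [asset_def.key]


# Illustrative asset names only; they are not real PUDL asset keys.
single = FakeAssetsDefinition(keys=("plants",))
multi = FakeAssetsDefinition(keys=("plants", "generators"))
source = FakeSourceAsset(key="raw_plants")
```

In every case the caller gets back a list, so downstream code can iterate over the result without caring which kind of asset definition it started from.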

pudl.etl._package
pudl.etl._asset_keys
pudl.etl.default_resources
pudl.etl.default_tag_concurrency_limits
pudl.etl.default_config
pudl.etl.create_non_cems_selection(all_assets: list[dagster.AssetsDefinition]) → dagster.AssetSelection

Create a selection of assets excluding CEMS and all downstream assets.

Parameters:

all_assets – A list of asset definitions to remove CEMS assets from.

Returns:

An asset selection containing the assets in all_assets, excluding the CEMS assets and all assets downstream of them.
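The idea of excluding an asset together with everything downstream of it can be sketched on a plain dependency graph. This is a conceptual illustration only and does not use dagster's AssetSelection API: exclude_with_downstream, the deps mapping and the asset names are all hypothetical.

```python
from collections import deque


def exclude_with_downstream(deps: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Return every asset in the graph except `roots` and anything downstream.

    `deps` maps each asset name to the set of assets it depends on (upstream).
    """
    # Invert the mapping so we can walk from an asset to its consumers.
    children: dict[str, set[str]] = {name: set() for name in deps}
    for name, upstream in deps.items():
        for parent in upstream:
            children.setdefault(parent, set()).add(name)

    # Breadth-first walk from the excluded roots through their consumers.
    excluded = set(roots)
    queue = deque(roots)
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in excluded:
                excluded.add(child)
                queue.append(child)
    return set(children) - excluded


# Hypothetical graph: "cems" and its consumer "cems_summary" should be dropped.
deps = {
    "raw_eia": set(),
    "core_eia": {"raw_eia"},
    "cems": {"core_eia"},
    "cems_summary": {"cems"},
    "out_eia": {"core_eia"},
}
selection = exclude_with_downstream(deps, {"cems"})
```

Assets upstream of the excluded root, such as core_eia in this example, stay in the selection; only the root and its consumers are removed.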

pudl.etl.load_dataset_settings_from_file(setting_filename: str) → dict

Load dataset settings from a settings file in pudl.package_data.settings.

Parameters:

setting_filename – name of settings file.

Returns:

Dictionary of dataset settings.

pudl.etl.defs: dagster.Definitions

A collection of dagster assets, resources, IO managers, and jobs for the PUDL ETL.