pudl.load

Routines for loading PUDL data into various storage formats.

Module Contents

Functions

dfs_to_sqlite(dfs: Dict[str, pandas.DataFrame], engine: sqlalchemy.engine.Engine, check_foreign_keys: bool = True, check_types: bool = True, check_values: bool = True) → None

Load a dictionary of dataframes into the PUDL SQLite DB.

df_to_parquet(df: pandas.DataFrame, resource_id: str, root_path: Union[str, pathlib.Path], partition_cols: Union[List[str], Literal[None]] = None) → None

Write a PUDL table out to a partitioned Parquet dataset.

Attributes

logger

MINIMUM_SQLITE_VERSION

pudl.load.logger[source]
pudl.load.MINIMUM_SQLITE_VERSION = '3.32.0'[source]
pudl.load.dfs_to_sqlite(dfs: Dict[str, pandas.DataFrame], engine: sqlalchemy.engine.Engine, check_foreign_keys: bool = True, check_types: bool = True, check_values: bool = True) → None[source]

Load a dictionary of dataframes into the PUDL SQLite DB.

Parameters
  • dfs – Dictionary mapping table names to dataframes.

  • engine – PUDL DB connection engine.

  • check_foreign_keys – If True, enforce foreign key constraints.

  • check_types – If True, enforce column data types.

  • check_values – If True, enforce value constraints.
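
A minimal usage sketch follows. The table name "plants_entity_eia", the database path, and the toy column values are illustrative assumptions, not part of this module; dictionary keys must name tables defined in the PUDL metadata for the load to succeed.

import pandas as pd
import sqlalchemy as sa

import pudl.load

# Hypothetical SQLite path; point this at your PUDL output database.
engine = sa.create_engine("sqlite:////tmp/pudl.sqlite")

# Illustrative table name and contents; keys must be real PUDL table names.
dfs = {
    "plants_entity_eia": pd.DataFrame({
        "plant_id_eia": [1, 2],
        "plant_name_eia": ["Plant A", "Plant B"],
    }),
}

# Relax foreign key enforcement for this toy data; leave the checks
# enabled (their defaults) when loading real PUDL tables.
pudl.load.dfs_to_sqlite(dfs, engine, check_foreign_keys=False)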

pudl.load.df_to_parquet(df: pandas.DataFrame, resource_id: str, root_path: Union[str, pathlib.Path], partition_cols: Union[List[str], Literal[None]] = None) → None[source]

Write a PUDL table out to a partitioned Parquet dataset.

Uses the name of the table to look up appropriate metadata and construct a PyArrow schema.

Parameters
  • df – The tabular data to be written to a Parquet dataset.

  • resource_id – Name of the table that’s being written to Parquet.

  • root_path – Top level directory for the partitioned dataset.

  • partition_cols – Columns to use to partition the Parquet dataset. For EPA CEMS we use ["year", "state"].
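
A minimal usage sketch follows. The resource_id "hourly_emissions_epacems" is assumed for illustration and the dataframe is a stand-in; the partition columns follow the EPA CEMS example above, and resource_id must name a table known to the PUDL metadata so the PyArrow schema lookup succeeds.

import pandas as pd

import pudl.load

# Stand-in data containing the partitioning columns from the EPA CEMS example.
df = pd.DataFrame({
    "year": [2020, 2020, 2021],
    "state": ["CO", "TX", "CO"],
    "gross_load_mw": [100.0, 250.0, 75.0],
})

pudl.load.df_to_parquet(
    df,
    resource_id="hourly_emissions_epacems",  # assumed table name
    root_path="/tmp/parquet/epacems",        # top-level output directory
    partition_cols=["year", "state"],
)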