pudl.load.parquet

Load PUDL data into an Apache Parquet dataset.

This module is currently used only for the EPA CEMS hourly dataset, but it is intended to also handle other long tables that are too big for SQLite to handle gracefully.

Module Contents

Functions

epacems_to_parquet(df, root_path)

Write an EPA CEMS dataframe out to a partitioned Parquet dataset.

Attributes

logger

INT_NULLABLE

INT_NOT_NULL

STR_NOT_NULL

TIMESTAMP

FLOAT_NULLABLE

FLOAT_NOT_NULL

DICT_NULLABLE

EPACEMS_ARROW_SCHEMA

Schema defining efficient data types for EPA CEMS Parquet outputs.

pudl.load.parquet.logger
pudl.load.parquet.INT_NULLABLE
pudl.load.parquet.INT_NOT_NULL
pudl.load.parquet.STR_NOT_NULL
pudl.load.parquet.TIMESTAMP
pudl.load.parquet.FLOAT_NULLABLE
pudl.load.parquet.FLOAT_NOT_NULL
pudl.load.parquet.DICT_NULLABLE
pudl.load.parquet.EPACEMS_ARROW_SCHEMA

Schema defining efficient data types for EPA CEMS Parquet outputs.

pudl.load.parquet.epacems_to_parquet(df, root_path)

Write an EPA CEMS dataframe out to a partitioned Parquet dataset.

Parameters
  • df (pandas.DataFrame) – Dataframe containing the data to be output.

  • root_path (path-like) – The top level directory for the partitioned dataset.

Returns

None