pudl.extract.parquet

Extractor for Parquet data.

Module Contents

Classes

ParquetExtractor

Class for extracting dataframes from parquet files.

Attributes

pudl.extract.parquet.logger
class pudl.extract.parquet.ParquetExtractor(ds)

Bases: pudl.extract.extractor.GenericExtractor

Class for extracting dataframes from parquet files.

The extraction logic is invoked by calling the extract() method of this class.

source_filename(page: str, **partition: pudl.extract.extractor.PartitionSelection) → str

Produce the source Parquet file name as it will appear in the archive.

Parameters:
  • page – PUDL name for the dataset contents, e.g. “boiler_generator_assn” or “data”

  • partition – partition to load. Example: {‘year’: 2009}

Returns:

string name of the parquet file
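
As an illustration of how a page name and a partition could be combined into a file name, here is a minimal sketch. The naming scheme (page, then sorted key_value pairs, then a .parquet suffix) is an assumption made for this example; the actual names produced by ParquetExtractor are not specified here and may differ.

```python
def source_filename(page: str, **partition) -> str:
    """Build a hypothetical Parquet file name from a page and a partition.

    The naming scheme is an assumption for illustration only.
    """
    if not partition:
        return f"{page}.parquet"
    # Sort the keys so the name does not depend on keyword order.
    suffix = "-".join(f"{key}_{partition[key]}" for key in sorted(partition))
    return f"{page}-{suffix}.parquet"


# Partition values arrive as keyword arguments, matching **partition.
name = source_filename("boiler_generator_assn", year=2009)
```

Because the partition is accepted as keyword arguments, a partition dictionary such as {‘year’: 2009} can be passed with ** unpacking.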

load_source(page: str, **partition: pudl.extract.extractor.PartitionSelection) → pandas.DataFrame

Produce the dataframe object for the given partition.

This method assumes that the archive includes one unzipped file per partition.

Parameters:
  • page – PUDL name for the dataset contents, e.g. “boiler_generator_assn” or “data”

  • partition – partition to load. Examples: {‘year’: 2009} or {‘year_month’: ‘2020-08’}

Returns:

pd.DataFrame instance containing the Parquet data
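
The one-file-per-partition assumption can be sketched with the standard library alone. In this example the archive is modeled as an in-memory zip file, and the member names and the helper read_partition_bytes are hypothetical; the real archive layout and the internals of load_source are not described in this page. The payloads are placeholder bytes, not real Parquet data.

```python
import io
import zipfile


def read_partition_bytes(archive: zipfile.ZipFile, filename: str) -> bytes:
    """Return the raw bytes of the single archive member for one partition."""
    if filename not in archive.namelist():
        raise KeyError(f"{filename} not found in archive")
    with archive.open(filename) as member:
        return member.read()


# Build a tiny in-memory archive with one member per partition.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data-year_2009.parquet", b"placeholder-2009")
    zf.writestr("data-year_2010.parquet", b"placeholder-2010")

# Look up exactly one member for the requested partition.
with zipfile.ZipFile(buffer) as zf:
    payload = read_partition_bytes(zf, "data-year_2009.parquet")
```

With real Parquet bytes, the payload could then be wrapped in io.BytesIO and passed to pandas.read_parquet to obtain the dataframe, provided a Parquet engine such as pyarrow is installed.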