pudl.extract.csv

Extractor for CSV data.

Module Contents

Classes

CsvExtractor

Class for extracting dataframes from CSV files.

Attributes

pudl.extract.csv.logger
class pudl.extract.csv.CsvExtractor(ds)

Bases: pudl.extract.extractor.GenericExtractor

Class for extracting dataframes from CSV files.

The extraction logic is invoked by calling the extract() method of this class.

READ_CSV_KWARGS: dict[str, Any]

Keyword arguments that are passed to pandas.read_csv().

These allow customization of the CSV parsing process. For example, you can specify the column delimiter, data types, and date parsing. Setting these options can greatly reduce peak memory usage and speed up the extraction process. Note that you must refer to columns by their original names, as they appear in the CSV header.
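As an illustration of the kind of options described above, here is a minimal sketch that passes a dictionary of keyword arguments to pandas.read_csv(). The CSV content, the column names, and the read_csv_kwargs dictionary are all hypothetical and are not taken from any PUDL dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content; the column names are invented for illustration.
RAW_CSV = (
    "Plant ID;Report Date;Net Generation\n"
    "1;2020-08-01;12.5\n"
    "2;2020-08-01;7.25\n"
)

# Hypothetical keyword arguments of the kind READ_CSV_KWARGS could hold.
# Columns are referenced by their original header names.
read_csv_kwargs = {
    "sep": ";",  # column delimiter
    "dtype": {"Plant ID": "Int64", "Net Generation": "float32"},  # data types
    "parse_dates": ["Report Date"],  # date parsing
}

df = pd.read_csv(io.StringIO(RAW_CSV), **read_csv_kwargs)
print(df.dtypes)
```

Declaring compact dtypes up front means pandas does not have to infer them, which is one way such options can lower memory use.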

TODO[zaneselvans] 2024-04-19: it would be useful to be able to specify different CSV reading options for different pages within the same dataset. At the moment the same arguments will be applied to all pages. This still allows some flexibility because some pandas.read_csv() arguments like dtype don’t raise errors if the columns they apply to aren’t present.
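The flexibility mentioned in the TODO can be seen in a small sketch: the same dtype mapping is applied to two pages with different columns, and pandas.read_csv() ignores dtype entries for columns that are absent. The page contents and column names below are hypothetical.

```python
import io

import pandas as pd

# One hypothetical set of arguments shared by every page.
shared_kwargs = {"dtype": {"plant_id": "Int64", "fuel_type": "category"}}

# Hypothetical page with both columns named in the dtype mapping.
page_a = pd.read_csv(io.StringIO("plant_id,fuel_type\n1,coal\n"), **shared_kwargs)

# Hypothetical page without "fuel_type"; the extra dtype entry is ignored.
page_b = pd.read_csv(io.StringIO("plant_id,capacity_mw\n1,500.0\n"), **shared_kwargs)

print(page_a.dtypes)
print(page_b.dtypes)
```

Not every argument is this forgiving, so options that name specific columns should be checked against each page they will be applied to.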

source_filename(page: str, **partition: pudl.extract.extractor.PartitionSelection) → str

Produce the source CSV file name as it will appear in the archive.

Parameters:
  • page – pudl name for the dataset contents, e.g. "boiler_generator_assn" or "data"

  • partition – partition to load. Examples: {"year": 2009} or {"year_month": "2020-08"}

Returns:

string name of the CSV file

load_source(page: str, **partition: pudl.extract.extractor.PartitionSelection) → pandas.DataFrame

Produce the dataframe object for the given partition.

Parameters:
  • page – pudl name for the dataset contents, e.g. "boiler_generator_assn" or "data"

  • partition – partition to load. Examples: {"year": 2009} or {"year_month": "2020-08"}

Returns:

pd.DataFrame instance containing CSV data
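To show how a file-name method and a loading method with these signatures could fit together, here is a self-contained toy sketch. It is not PUDL's implementation: ToyCsvExtractor, build_toy_archive, the "<page>_<year>.csv" naming scheme, and the in-memory zip archive are all assumptions made for illustration.

```python
import io
import zipfile

import pandas as pd


class ToyCsvExtractor:
    """Illustrative stand-in; this is NOT pudl.extract.csv.CsvExtractor."""

    # Hypothetical parsing options applied to every page.
    READ_CSV_KWARGS = {"dtype": {"plant_id": "Int64"}}

    def __init__(self, archive_bytes: bytes):
        self._archive_bytes = archive_bytes

    def source_filename(self, page: str, **partition) -> str:
        # Hypothetical naming scheme: "<page>_<year>.csv".
        return f"{page}_{partition['year']}.csv"

    def load_source(self, page: str, **partition) -> pd.DataFrame:
        # Resolve the file name, then parse that member of the archive.
        filename = self.source_filename(page, **partition)
        with zipfile.ZipFile(io.BytesIO(self._archive_bytes)) as archive:
            with archive.open(filename) as csv_file:
                return pd.read_csv(csv_file, **self.READ_CSV_KWARGS)


def build_toy_archive() -> bytes:
    """Create an in-memory zip archive holding one hypothetical CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "data_2009.csv", "plant_id,net_generation_mwh\n1,12.5\n2,7.25\n"
        )
    return buffer.getvalue()


extractor = ToyCsvExtractor(build_toy_archive())
df = extractor.load_source("data", year=2009)
print(df)
```

The partition is passed as keyword arguments, so a call such as load_source("data", year=2009) corresponds to the {"year": 2009} example above.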