pudl.extract.extractor

Generic functionality for extractors.

Module Contents

Classes

GenericMetadata

Load generic metadata from Python package data.

GenericExtractor

Generic extractor base class.

Functions

concat_pages(→ dict[str, pandas.DataFrame])

Concatenate similar pages of data from different years into single dataframes.

_is_dict_str_strint(→ bool)

partition_extractor_factory(→ dagster.OpDefinition)

Construct a Dagster op that extracts one partition of data, given an extractor.

partitions_from_settings_factory(→ dagster.OpDefinition)

Construct a Dagster op to get target partitions from settings in Dagster context.

raw_df_factory(→ dagster.AssetsDefinition)

Return a dagster graph asset to extract raw DataFrames from CSV or Excel files.

Attributes

pudl.extract.extractor.StrInt[source]
pudl.extract.extractor.PartitionSelection[source]
pudl.extract.extractor.logger[source]
class pudl.extract.extractor.GenericMetadata(dataset_name: str)[source]

Load generic metadata from Python package data.

When a metadata object is instantiated, it is given a ${dataset} name and attempts to load CSV files from the pudl.package_data.${dataset} package.

It expects the following kinds of files:

  • column_map/${page}.csv currently tells us how to translate input column names to standardized pudl names for a given (partition, input_col_name) pair. The relevant page is encoded in the filename.

get_dataset_name() str[source]

Returns the name of the dataset described by this metadata.

_load_csv(package: str, filename: str) pandas.DataFrame[source]

Load metadata from a filename that is found in a package.

_get_partition_selection(partition: dict[str, PartitionSelection]) str[source]

Grab the partition key.

get_all_pages() list[str][source]

Returns list of all known pages.

get_all_columns(page) list[str][source]

Returns list of all pudl columns for a given page across all partitions.

get_column_map(page, **partition)[source]

Return dictionary for renaming columns in a given partition and page.
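
The renaming dictionary described above can be pictured with a small pandas sketch. The CSV layout used here (one row per standardized name, one column per partition) and the helper column_map_for are assumptions made for illustration; they are not taken from the PUDL package data or API.

```python
import io

import pandas as pd

# Hypothetical column_map/<page>.csv content: one row per standardized
# column name, one column per partition. This layout is an assumption.
COLUMN_MAP_CSV = """pudl_name,2009,2010
plant_id,Plant ID,PLANT_CODE
plant_name,Plant Name,PLANT_NAME
fuel_type,,FUEL
"""


def column_map_for(csv_text: str, partition_key: str) -> dict[str, str]:
    """Return {input column name: standardized name} for one partition."""
    table = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)
    # Drop standardized columns that have no source column in this partition.
    mapping = table[partition_key].dropna()
    # Invert so the result can be passed straight to DataFrame.rename().
    return {source: pudl for pudl, source in mapping.items()}


raw = pd.DataFrame({"Plant ID": [1, 2], "Plant Name": ["A", "B"]})
rename_2009 = column_map_for(COLUMN_MAP_CSV, "2009")
renamed = raw.rename(columns=rename_2009)
```

In this toy data, fuel_type has no source column in 2009, so it is left out of the 2009 mapping instead of being mapped from an empty name.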

class pudl.extract.extractor.GenericExtractor(ds)[source]

Bases: abc.ABC

Generic extractor base class.

METADATA: GenericMetadata[source]

Instance of metadata object to use with this extractor.

BLACKLISTED_PAGES = [][source]

List of supported pages that should not be extracted.

abstract source_filename(page: str, **partition: PartitionSelection) str[source]

Produce the source file name as it will appear in the archive.

Parameters:
  • page – pudl name for the dataset contents, e.g. “boiler_generator_assn” or “coal_stocks”

  • partition – partition to load. Examples: {‘year’: 2009} or {‘year_month’: ‘2020-08’}

Returns:

string name of the source file

abstract load_source(page: str, **partition: PartitionSelection) pandas.DataFrame[source]

Produce the source data for the given page and partition(s).

Parameters:
  • page – pudl name for the dataset contents, e.g. “boiler_generator_assn” or “coal_stocks”

  • partition – partition to load. Examples: {‘year’: 2009} or {‘year_month’: ‘2020-08’}

Returns:

pd.DataFrame instance with the source data

process_raw(df: pandas.DataFrame, page: str, **partition: PartitionSelection) pandas.DataFrame[source]

Takes any special steps for processing raw data and renaming columns.

process_renamed(df: pandas.DataFrame, page: str, **partition: PartitionSelection) pandas.DataFrame[source]

Takes any special steps for processing data after columns are renamed.

get_page_cols(page: str, partition_selection: str) pandas.RangeIndex[source]

Get the columns for a particular page and partition key.

validate(df: pandas.DataFrame, page: str, **partition: PartitionSelection)[source]

Check if there are any missing or extra columns.

process_final_page(df: pandas.DataFrame, page: str) pandas.DataFrame[source]

Final processing stage applied to a page DataFrame.

combine(dfs: list[pandas.DataFrame], page: str) pandas.DataFrame[source]

Concatenate dataframes into one and apply any special steps for processing the final page.

extract(**partitions: PartitionSelection) dict[str, pandas.DataFrame][source]

Extracts dataframes.

Returns a dict where keys are page names and values are DataFrames containing data across the given years.

Parameters:

partitions – keyword argument dictionary specifying how the source is partitioned and which particular partitions to extract. Examples: {‘years’: [2009, 2010]}, {‘year_month’: ‘2020-08’}, {‘form’: ‘gas_distribution’, ‘year’: ‘2020’}
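
To make the per-page, per-partition flow easier to picture, here is a toy stand-in that reuses the method names documented above. It is not a GenericExtractor subclass and does not import pudl. The call order (load, rename, post-process, combine), the in-memory FAKE_SOURCE and FAKE_COLUMN_MAP tables, and the report_year column are all assumptions inferred from the method descriptions, not the actual implementation.

```python
import pandas as pd

# Toy stand-in data keyed by (page, year); a real extractor would read files.
FAKE_SOURCE = {
    ("coal_stocks", 2009): pd.DataFrame({"Plant ID": [1], "Tons": [10.0]}),
    ("coal_stocks", 2010): pd.DataFrame({"PLANT_CODE": [1], "TONS": [12.5]}),
}
# Hypothetical rename maps per (page, year).
FAKE_COLUMN_MAP = {
    ("coal_stocks", 2009): {"Plant ID": "plant_id", "Tons": "tons"},
    ("coal_stocks", 2010): {"PLANT_CODE": "plant_id", "TONS": "tons"},
}


class ToyExtractor:
    """Mimics the documented method names; the internals are illustrative."""

    def load_source(self, page: str, **partition) -> pd.DataFrame:
        return FAKE_SOURCE[(page, partition["year"])].copy()

    def process_raw(self, df: pd.DataFrame, page: str, **partition) -> pd.DataFrame:
        # Rename source columns to standardized names for this partition.
        return df.rename(columns=FAKE_COLUMN_MAP[(page, partition["year"])])

    def process_renamed(self, df: pd.DataFrame, page: str, **partition) -> pd.DataFrame:
        # Assumed post-processing step: record the partition each row came from.
        return df.assign(report_year=partition["year"])

    def combine(self, dfs: list[pd.DataFrame], page: str) -> pd.DataFrame:
        return pd.concat(dfs, ignore_index=True)

    def extract(self, years: list[int]) -> dict[str, pd.DataFrame]:
        pages = {}
        for page in ["coal_stocks"]:
            dfs = []
            for year in years:
                df = self.load_source(page, year=year)
                df = self.process_raw(df, page, year=year)
                df = self.process_renamed(df, page, year=year)
                dfs.append(df)
            pages[page] = self.combine(dfs, page)
        return pages


result = ToyExtractor().extract(years=[2009, 2010])
```

The returned dict has one entry per page, and each DataFrame stacks the rows from every requested year under the standardized column names.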

pudl.extract.extractor.concat_pages(paged_dfs: list[dict[str, pandas.DataFrame]]) dict[str, pandas.DataFrame][source]

Concatenate similar pages of data from different years into single dataframes.

Transform a list of dictionaries of dataframes into a single dictionary of dataframes, where each dataframe is the concatenation of dataframes with identical keys from the input list.

For the relatively large EIA930 dataset this is a very memory-intensive operation, so the op is tagged with a high memory-use tag. For all the other datasets which use this op, the time spent concatenating pages is very brief, so this tag should not impact the overall concurrency of the DAG much.

Parameters:

paged_dfs – A list of dictionaries whose keys are page names, and values are extracted DataFrames. Each element of the list corresponds to a single year of the dataset being extracted.

Returns:

A dictionary of DataFrames keyed by page name, where the DataFrame contains that page’s data from all extracted years concatenated together.
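
The list-of-dicts to dict-of-DataFrames transformation described above can be sketched in plain pandas. The function name concat_pages_sketch is hypothetical, and resetting the index with ignore_index=True is an assumption; the real op may handle indexes, Dagster wiring, and memory tagging differently.

```python
from collections import defaultdict

import pandas as pd


def concat_pages_sketch(
    paged_dfs: list[dict[str, pd.DataFrame]],
) -> dict[str, pd.DataFrame]:
    """Group DataFrames by page name, then concatenate each group."""
    by_page: dict[str, list[pd.DataFrame]] = defaultdict(list)
    for pages in paged_dfs:
        for page, df in pages.items():
            by_page[page].append(df)
    # One concatenated DataFrame per page, with a fresh 0..n-1 index.
    return {
        page: pd.concat(dfs, ignore_index=True) for page, dfs in by_page.items()
    }


y2009 = {"plant": pd.DataFrame({"id": [1, 2]}), "generator": pd.DataFrame({"id": [10]})}
y2010 = {"plant": pd.DataFrame({"id": [3]}), "generator": pd.DataFrame({"id": [11, 12]})}
combined = concat_pages_sketch([y2009, y2010])
```

Each list element plays the role of one extracted year, and the result keeps one key per page name.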

pudl.extract.extractor._is_dict_str_strint(_context: dagster.TypeCheckContext, x: Any) bool[source]
pudl.extract.extractor.dagster_dict_str_strint[source]
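
This page gives no description for _is_dict_str_strint. Going only by its name and signature, it plausibly checks that a value is a dict with string keys and string-or-integer values. The sketch below shows that kind of check under that assumption; the name is_dict_str_strint_sketch is hypothetical and the unused first argument stands in for the Dagster type-check context.

```python
from typing import Any


def is_dict_str_strint_sketch(_context: Any, x: Any) -> bool:
    """Return True if x is a dict mapping str keys to str or int values.

    Assumed behavior, inferred from the function name only.
    """
    return isinstance(x, dict) and all(
        isinstance(key, str) and isinstance(value, (str, int))
        for key, value in x.items()
    )


ok = is_dict_str_strint_sketch(None, {"year": 2009, "form": "gas_distribution"})
bad = is_dict_str_strint_sketch(None, {"years": [2009, 2010]})
```

Note that Python treats bool as a subclass of int, so a value such as True would also pass this simplified check.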
pudl.extract.extractor.partition_extractor_factory(extractor_cls: type[GenericExtractor], name: str) dagster.OpDefinition[source]

Construct a Dagster op that extracts one partition of data, given an extractor.

Parameters:
  • extractor_cls – Subclass of GenericExtractor used to extract the data.

  • name – Name of an Excel-based dataset (e.g. “eia860”).

pudl.extract.extractor.partitions_from_settings_factory(name: str) dagster.OpDefinition[source]

Construct a Dagster op to get target partitions from settings in Dagster context.

Parameters:

name – Name of an Excel-based dataset (e.g. “eia860”).

pudl.extract.extractor.raw_df_factory(extractor_cls: type[GenericExtractor], name: str) dagster.AssetsDefinition[source]

Return a dagster graph asset to extract raw DataFrames from CSV or Excel files.

Parameters:
  • extractor_cls – The dataset-specific CSV or Excel extractor used to extract the data. Must correspond to the dataset identified by name.

  • name – Name of a CSV- or Excel-based dataset (e.g. “eia860” or “eia930”).