pudl.extract.excel
#
Load excel metadata CSV files form a python data package.
Module Contents#
Classes#
Load Excel metadata from Python package data. |
|
Logic for extracting |
Functions#
|
Concatenate similar pages of data from different years into single dataframes. |
|
Construct a Dagster op that extracts one year of data, given an extractor class. |
|
Construct a Dagster op to get target years from settings in the Dagster context. |
|
Return a dagster graph asset to extract a set of raw DataFrames from Excel files. |
Attributes#
- class pudl.extract.excel.Metadata(dataset_name: str)[source]#
Load Excel metadata from Python package data.
Excel sheet files may contain many different tables. When we load those into dataframes, metadata tells us how to do this. Metadata generally informs us about the position of a given page in the file (which sheet and which row) and it informs us how to translate excel column names into standardized column names.
When metadata object is instantiated, it is given ${dataset} name and it will attempt to load csv files from pudl.package_data.${dataset} package.
It expects the following kinds of files:
skiprows.csv tells us how many initial rows should be skipped when loading data for given (partition, page).
skipfooter.csv tells us how many bottom rows should be skipped when loading data for given partition (partition, page).
page_map.csv tells us what is the excel sheet name that should be read when loading data for given (partition, page)
column_map/${page}.csv currently informs us how to translate input column names to standardized pudl names for given (partition, input_col_name). Relevant page is encoded in the filename.
- get_sheet_name(page, **partition)[source]#
Return name of Excel sheet containing data for given partition and page.
- get_skiprows(page, **partition)[source]#
Return number of header rows to skip when loading a partition and page.
Return number of footer rows to skip when loading a partition and page.
- get_column_map(page, **partition)[source]#
Return dictionary for renaming columns in a given partition and page.
- get_all_columns(page)[source]#
Returns list of all pudl columns for a given page across all partitions.
- class pudl.extract.excel.GenericExtractor(ds)[source]#
Logic for extracting
pd.DataFrame
from Excel spreadsheets.This class implements the generic dataset agnostic logic to load data from excel spreadsheet simply by using excel Metadata for given dataset.
It is expected that individual datasets wil subclass this code and add custom business logic by overriding necessary methods.
When implementing custom business logic, the following should be modified:
DATASET class attribute controls which excel metadata should be loaded.
2. BLACKLISTED_PAGES class attribute specifies which pages should not be loaded from the underlying excel files even if the metadata is available. This can be used for experimental/new code that should not be run yet.
3. dtypes() should return dict with {column_name: pandas_datatype} if you need to specify which datatypes should be uded upon loading.
4. If data cleanup is necessary, you can apply custom logic by overriding one of the following functions (they all return the modified dataframe):
process_raw() is applied right after loading the excel DataFrame from the disk.
process_renamed() is applied after input columns were renamed to standardized pudl columns.
process_final_page() is applied when data from all available years is merged into single DataFrame for a given page.
5. get_datapackage_resources() if partition is anything other than a year, this method should be overwritten in the dataset-specific extractor.
- add_data_maturity(df: pandas.DataFrame, page, **partition) pandas.DataFrame [source]#
Add data_maturity column to indicate the maturity of partition data.
The three options enumerated here are
final
,provisional
ormonthly_update
(incremental_ytd
is not currently implemented). We determine if a df should be labeled asprovisional
by using the file names because EIA seems to always includeEarly_Release
in the file names. We determine if a df should be labeled asmonthly_update
by checking if theself.dataset_name
iseia860m
.This method adds a column and thus adds
data_maturity
toself.cols_added
.
- static process_renamed(df, page, **partition)[source]#
Transforms dataframe after columns are renamed.
- extract(**partitions)[source]#
Extracts dataframes.
Returns dict where keys are page names and values are DataFrames containing data across given years.
- load_excel_file(page, **partition)[source]#
Produce the ExcelFile object for the given (partition, page).
- Parameters:
page (str) – pudl name for the dataset contents, eg “boiler_generator_assn” or “coal_stocks”
partition – partition to load. (ex: 2009 for year partition or “2020-08” for year_month partition)
- Returns:
pd.ExcelFile instance with the parsed excel spreadsheet frame
- excel_filename(page, **partition)[source]#
Produce the xlsx document file name as it will appear in the archive.
- Parameters:
page – pudl name for the dataset contents, eg “boiler_generator_assn” or “coal_stocks”
partition – partition to load. (ex: 2009 for year partition or “2020-08” for year_month partition)
- Returns:
string name of the xlsx file
- pudl.extract.excel.concat_pages(paged_dfs: list[dict[str, pandas.DataFrame]]) dict[str, pandas.DataFrame] [source]#
Concatenate similar pages of data from different years into single dataframes.
Transform a list of dictionaries of dataframes into a single dictionary of dataframes, where each dataframe is the concatenation of dataframes with identical keys from the input list.
- Parameters:
paged_dfs – A list of dictionaries whose keys are page names, and values are extracted DataFrames. Each element of the list corresponds to a single year of the dataset being extracted.
- Returns:
A dictionary of DataFrames keyed by page name, where the DataFrame contains that page’s data from all extracted years concatenated together.
- pudl.extract.excel.year_extractor_factory(extractor_cls: type[GenericExtractor], name: str) dagster.OpDefinition [source]#
Construct a Dagster op that extracts one year of data, given an extractor class.
- Parameters:
extractor_cls – Class of type
GenericExtractor
used to extract the data.name – Name of an Excel based dataset (e.g. “eia860”).
- pudl.extract.excel.years_from_settings_factory(name: str) dagster.OpDefinition [source]#
Construct a Dagster op to get target years from settings in the Dagster context.
- Parameters:
name – Name of an Excel based dataset (e.g. “eia860”). Currently this must be one of the attributes of
pudl.settings.EiaSettings
- pudl.extract.excel.raw_df_factory(extractor_cls: type[GenericExtractor], name: str) dagster.AssetsDefinition [source]#
Return a dagster graph asset to extract a set of raw DataFrames from Excel files.
- Parameters:
extractor_cls – The dataset-specific Excel extractor used to extract the data. Needs to correspond to the dataset identified by
name
.name – Name of an Excel based dataset (e.g. “eia860”). Currently this must be one of the attributes of
pudl.settings.EiaSettings