pudl.extract.eia923 module¶
Retrieves data from EIA Form 923 spreadsheets for analysis.
This modules pulls data from EIA’s published Excel spreadsheets.
This code is for use analyzing EIA Form 923 data. Currenly only years 2009-2016 work, as they share nearly identical file formatting.
-
pudl.extract.eia923.
extract
(eia923_years, data_dir)[source]¶ Creates a dictionary of DataFrames containing all the EIA 923 tables.
- Parameters
- Returns
A dictionary containing the names of EIA 923 pages (keys) and
pandas.DataFrame
instances filled with the data from each page (values).- Return type
-
pudl.extract.eia923.
get_eia923_column_map
(page, year)[source]¶ Given a year and EIA923 page, returns info needed to slurp it from Excel.
The format of the EIA923 has changed slightly over the years, and so it is not completely straightforward to pull information from the spreadsheets into our analytical framework. This function looks up a map of the various tabs in the spreadsheet by year and page, and returns the information needed to name the data fields in a standardized way, and pull the right cells from each year & page into our database.
- Parameters
- Returns
- A tuple containing:
int: sheet_name (int): An integer indicating which page in the worksheet the data should be pulled from. 0 is the first page, 1 is the second page, etc. For use by
pandas.read_excel()
int: skiprows, an integer indicating how many rows should be skipped at the top of the sheet being read in, before the header row that contains the strings which will be converted into column names in the dataframe which is created by
pandas.read_excel()
int: skiprows, an integer indicating how many rows should be skipped at the top of the sheet being read in, before the header row that contains the strings which will be converted into column names in the dataframe which is created by
pandas.read_excel()
dict: column_map, a dictionary that maps the names of the columns in the year being read in, to the canonical EIA923 column names. This dictionary will be used by
pandas.DataFrame.rename()
. The keys are the column names in the dataframe as read from older years, and the values are the canonmical column names. All should be stripped of leading and trailing whitespace, converted to lower case, and have internal non-alphanumeric characters replaced with underscores.
- Return type
-
pudl.extract.eia923.
get_eia923_file
(yr, data_dir)[source]¶ Construct the appopriate path for a given year’s EIA923 Excel file.
-
pudl.extract.eia923.
get_eia923_page
(page, eia923_xlsx, years=(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018))[source]¶ Reads a table from given years of EIA923 data, returns a DataFrame.
- Parameters
page (str) – The string label indicating which page of the EIA923 we are attempting to read in. The page argument must be one of the strings listed in
pudl.constants.working_pages_eia923()
.eia923_xlsx (
pandas.io.excel.ExcelFile
) – xlsx file of EIA Form 923 for input year(s).years (list) – The set of years to read into the dataframe.
- Returns
A dataframe containing the data from the selected page and selected years from EIA 923.
- Return type