pudl.extract.eia860 module

Retrieve data from EIA Form 860 spreadsheets for analysis.

This modules pulls data from EIA’s published Excel spreadsheets.

This code is for use analyzing EIA Form 860 data.

pudl.extract.eia860.extract(eia860_years, data_dir)[source]

Creates a dictionary of DataFrames containing all the EIA 860 tables.

Parameters
  • eia860_years (list) – a list of data_years

  • data_dir (str) – Top level datastore directory.

Returns

A dictionary of EIA 860 pages (keys) and DataFrames (values)

Return type

dict

pudl.extract.eia860.get_eia860_column_map(page, year)[source]

Given a year and EIA860 page, returns info needed to slurp it from Excel.

The format of the EIA860 has changed slightly over the years, and so it is not completely straightforward to pull information from the spreadsheets into our analytical framework. This function looks up a map of the various tabs in the spreadsheet by year and page, and returns the information needed to name the data fields in a standardized way, and pull the right cells from each year & page into our database.

Parameters
  • page (str) – The string label indicating which page of the EIA860 we are attempting to read in. Must be one of the following: - ‘generation_fuel’ - ‘stocks’ - ‘boiler_fuel’ - ‘generator’ - ‘fuel_receipts_costs’ - ‘plant_frame’

  • year (int) – The year that we’re trying to read data for.

Returns

A tuple containing:
  • int: sheet_name, an integer indicating which page in the worksheet the data should be pulled from. 0 is the first page, 1 is the second page, etc. For use by pandas.read_excel()

  • int: skiprows, an integer indicating how many rows should be skipped at the top of the sheet being read in, before the header row that contains the strings which will be converted into column names in the dataframe which is created by pandas.read_excel()

  • dict: column_map, a dictionary that maps the names of the columns in the year being read in, to the canonical EIA923 column names (i.e. the column names as they are in 2014-2016). This dictionary will be used by DataFrame.rename(). The keys are the column names in the dataframe as read from older years, and the values are the canonmical column names. All should be stripped of leading and trailing whitespace, converted to lower case, and have internal non-alphanumeric characters replaced with underscores.

  • pd.Index: all_columns, the column Index associated with the column map – it includes all of the columns which might be present in all of the years of data, for use in setting the column index of the raw dataframe which is ultimately extracted, so we can ensure that they all have the same columns, even if we’re only loading a limited number of years.

Return type

tuple

pudl.extract.eia860.get_eia860_file(yr, file, data_dir)[source]

Construct the appopriate path for a given EIA860 Excel file.

Parameters
  • year (int) – The year that we’re trying to read data for.

  • file (str) – A string containing part of the file name for a given EIA 860 file (e.g. ‘Generat’)

  • data_dir (str) – Top level datastore directory.

Returns

Path to EIA 860 spreadsheets corresponding to a given year.

Return type

str

Raises

AssertionError – If the requested year is not in the list of working years for EIA 860.

pudl.extract.eia860.get_eia860_page(page, eia860_xlsx, years=(2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018))[source]

Reads a table from several years of EIA860 data, returns a DataFrame.

Parameters
  • page (str) –

    The string label indicating which page of the EIA860 we are attempting to read in. The page argument must be exactly one of the following strings:

    • ’boiler_generator_assn’

    • ’utility’

    • ’plant’

    • ’generator_existing’

    • ’generator_proposed’

    • ’generator_retired’

    • ’ownership’

  • eia860_xlsx (pandas.io.excel.ExcelFile) – xlsx file of EIA Form 860 for input year or years

  • years (list) – The set of years to read into the DataFrame.

Returns

A DataFrame of EIA 860 data from selected page, years.

Return type

pandas.DataFrame

Raises
  • AssertionError – If the page string is not among the list of recognized EIA 860 page strings.

  • AssertionError – If the year is not in the list of years that work for EIA 860.

pudl.extract.eia860.get_eia860_xlsx(years, filename, data_dir)[source]

Read in Excel files to create Excel objects from EIA860 spreadsheets.

Rather than reading in the same Excel files several times, we can just read them each in once (one per year) and use the ExcelFile object to refer back to the data in memory.

Parameters
  • years (list) – The years that we’re trying to read data for.

  • filename (str) – [‘enviro_assn’, ‘utilities’, ‘plants’, ‘generators’]

  • data_dir (path-like) – Path to PUDL input datastore directory.

Returns

xlsx file of EIA Form 860 for input year(s).

Return type

pandas.io.excel.ExcelFile