pudl.init module

The Public Utility Data Liberation (PUDL) project core module.

The PUDL project integrates several public data sets into one well-normalized database, making them easier to access and to use together. This module defines the database tables and initializes them with data from:

  • US Energy Information Administration (EIA):
      - Form 860 (eia860)
      - Form 923 (eia923)

  • US Federal Energy Regulatory Commission (FERC):
      - Form 1 (ferc1)

  • US Environmental Protection Agency (EPA):
      - Continuous Emissions Monitoring System (epacems)

pudl.init.connect_db(pudl_settings, testing=False)

Connect to the PUDL database using global settings from settings.py.

pudl.init.drop_tables(engine)

Drop all the tables and views associated with the PUDL Database.

pudl.init.ingest_static_tables(engine)

Populate static PUDL tables with constants for use as foreign keys.

Many values in the data are essentially constant, but we need to store them for data validation, as the targets of foreign keys. Examples include the list of valid EIA fuel type codes and the state and country codes that indicate a coal delivery’s location of origin. For now these values are stored mainly in a large collection of lists, dictionaries, and dataframes defined in the pudl.constants module. This function uses those data structures to populate a number of small infrastructural tables in the PUDL DB.

Parameters

engine (sqlalchemy.engine) – A database engine with which to connect to the PUDL DB.

Returns: Nothing.
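The pattern described above, a constant held in a Python data structure that is loaded into a small lookup table and then referenced by foreign keys, can be sketched as follows. This is a minimal illustration and not the PUDL implementation: it uses the standard library sqlite3 module rather than a SQLAlchemy engine, and the FUEL_TYPE_CODES dictionary, the table names, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical stand-in for a constant defined in pudl.constants.
# The codes and labels are illustrative, not the real EIA list.
FUEL_TYPE_CODES = {
    "coal": "Coal",
    "gas": "Natural gas",
    "oil": "Petroleum",
}


def ingest_static_fuel_types(conn, codes):
    """Create a small lookup table and fill it from a dictionary."""
    conn.execute(
        "CREATE TABLE fuel_type (code TEXT PRIMARY KEY, label TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO fuel_type (code, label) VALUES (?, ?)",
        sorted(codes.items()),
    )


def make_demo_db():
    """Build an in-memory database with one static table and one data table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    ingest_static_fuel_types(conn, FUEL_TYPE_CODES)
    # The data table may only use codes present in the static table.
    conn.execute(
        "CREATE TABLE fuel_delivery ("
        " id INTEGER PRIMARY KEY,"
        " fuel_code TEXT NOT NULL REFERENCES fuel_type (code))"
    )
    return conn


conn = make_demo_db()
# A code that exists in the static table is accepted.
conn.execute("INSERT INTO fuel_delivery (fuel_code) VALUES ('coal')")
# A code that is missing from the static table violates the foreign key.
try:
    conn.execute("INSERT INTO fuel_delivery (fuel_code) VALUES ('wood')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here the static table does the validation: the second insert fails because 'wood' is not one of the stored codes, which is the role the small infrastructural tables play in the description above.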

pudl.init.init_db(pudl_settings, ferc1_tables=None, ferc1_years=None, eia923_tables=None, eia923_years=None, eia860_tables=None, eia860_years=None, epacems_years=None, epacems_states=None, epaipm_tables=None, pudl_testing=None, debug=None, csvdir=None, keep_csv=None)

Create the PUDL database and fill it up with data.

Parameters
  • ferc1_tables (list) – The list of FERC Form 1 tables to create and ingest. By default, only tables known to be working are ingested; that list is defined in pudl.constants.

  • ferc1_years (list) – The list of years from which to pull FERC Form 1 data.

  • eia923_tables (list) – The list of EIA 923 tables to create and ingest. By default, only tables known to be working are ingested; that list is defined in pudl.constants.

  • eia923_years (iterable) – The list of years from which to pull EIA 923 data.

  • eia860_tables (list) – The list of EIA 860 tables to create and ingest. By default, only tables known to be working are ingested; that list is defined in pudl.constants.

  • eia860_years (iterable) – The list of years from which to pull EIA 860 data.

  • epacems_years (iterable) – The list of years from which to pull EPA CEMS data. Note that there’s only one EPA CEMS table.

  • epacems_states (iterable) – The list of states for which to pull EPA CEMS data. With all states, the ETL takes roughly 8 hours.

  • epaipm_tables (list) – The list of epaipm tables to create and ingest. By default, only tables known to be working are ingested; that list is defined in pudl.constants.

  • debug (bool) – You can ask init_db to ingest any list of tables, but if a requested table is not in the list of tables known to be working, you must set debug=True; otherwise init_db will not ingest it.
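
The interaction between the table lists and the debug flag can be sketched as below. This is an assumption about the behavior described in the parameter list, not the actual init_db code: the select_tables helper, the WORKING_TABLES constant, and the table names are hypothetical, and the real function may report the problem differently.

```python
# Hypothetical stand-in for a list of working tables in pudl.constants.
WORKING_TABLES = ("table_a", "table_b")


def select_tables(requested, working, debug=False):
    """Return the tables to ingest.

    With no request, fall back to the tables known to be working.
    Tables outside that list are only accepted when debug is True.
    """
    if requested is None:
        return list(working)
    untested = [name for name in requested if name not in working]
    if untested and not debug:
        raise ValueError(
            f"Tables not known to be working: {untested}. "
            "Set debug=True to ingest them anyway."
        )
    return list(requested)


# Default: only the known-working tables are selected.
default_tables = select_tables(None, WORKING_TABLES)

# An untested table is accepted only in debug mode.
debug_tables = select_tables(
    ["table_a", "table_x"], WORKING_TABLES, debug=True
)
```

Calling select_tables(["table_x"], WORKING_TABLES) without debug=True raises ValueError in this sketch, which mirrors the restriction described for the debug parameter.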