pudl.convert.datapkg_to_sqlite module

Merge compatible PUDL datapackages and load the result into an SQLite DB.

This script merges a set of compatible PUDL datapackages into a single tabular datapackage, and then loads that package into the PUDL SQLite DB.

The input datapackages must all have been produced in the same ETL run, and share the same datapkg-bundle-uuid value. Any data sources (e.g. ferc1, eia923) that appear in more than one of the datapackages to be merged must also share identical ETL parameters (years, tables, states, etc.), allowing easy deduplication of resources.
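The shared-UUID requirement above can be sketched as a small check over already-parsed datapackage.json descriptors. This is only an illustration, not PUDL's own implementation: the function name check_bundle_uuids is hypothetical, and it assumes that datapkg-bundle-uuid is stored as a top-level key in each descriptor.

```python
def check_bundle_uuids(descriptors):
    """Return the bundle UUID shared by all descriptors.

    ``descriptors`` is a list of dicts, each parsed from a datapackage.json
    file. Assumption: the UUID lives under a top-level
    "datapkg-bundle-uuid" key; the real descriptor layout may differ.
    """
    # Collect the distinct UUID values; a missing key shows up as None.
    uuids = {d.get("datapkg-bundle-uuid") for d in descriptors}
    # Exactly one non-missing UUID means every package is from the same run.
    if len(uuids) != 1 or None in uuids:
        raise ValueError(f"Datapackages do not share one bundle UUID: {uuids}")
    return uuids.pop()
```

A caller would load each datapackage.json with json.load and pass the resulting dicts in as a list; a ValueError signals that the packages came from different ETL runs and should not be merged.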

Loading only a subset of the datapackages from an ETL run into the SQLite database is useful because larger datasets are much easier to work with in columnar formats like Apache Parquet: loading all of EPA CEMS into SQLite can take more than 24 hours. PUDL also provides a separate epacems_to_parquet script that generates a Parquet dataset partitioned by state and year, which can be read directly into pandas or dask dataframes and used alongside the other PUDL data stored in the SQLite DB.

pudl.convert.datapkg_to_sqlite.datapkg_to_sqlite(sqlite_url, out_path, clobber=False, fkeys=False)[source]

Load a PUDL datapackage into an SQLite database.

Parameters
  • sqlite_url (str) – An SQLite database connection URL.

  • out_path (path-like) – Path to the base directory of the datapackage to be loaded into SQLite. Must contain the datapackage.json file.

  • clobber (bool) – If True, replace the existing PUDL DB, if there is one. If False (the default), fail if an existing PUDL DB is found.

  • fkeys (bool) – If True, tell SQLite to check foreign key constraints for the records that are being loaded. Off by default (False).

Returns

None
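To illustrate what the fkeys option affects, the following sketch uses only the standard library sqlite3 module to show how SQLite behaves with foreign key enforcement switched on and off. It does not show how datapkg_to_sqlite applies the flag internally; the table names and the helper orphan_insert_rejected are hypothetical.

```python
import sqlite3


def orphan_insert_rejected(enforce_fkeys):
    """Return True if SQLite rejects a row whose parent record is missing."""
    conn = sqlite3.connect(":memory:")
    # SQLite decides per connection whether foreign keys are checked.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fkeys else 'OFF'}")
    # Hypothetical parent and child tables, not the real PUDL schema.
    conn.execute("CREATE TABLE plants (plant_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE generators ("
        "gen_id INTEGER PRIMARY KEY, "
        "plant_id INTEGER REFERENCES plants(plant_id))"
    )
    conn.execute("INSERT INTO plants VALUES (1)")
    conn.execute("INSERT INTO generators VALUES (10, 1)")  # valid parent
    try:
        # plant_id 999 has no matching row in plants.
        conn.execute("INSERT INTO generators VALUES (11, 999)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    finally:
        conn.close()
    return rejected
```

With enforcement off, the orphaned row is accepted silently, which is why enabling the check can catch inconsistencies between tables at load time, at the cost of extra work per inserted record.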

pudl.convert.datapkg_to_sqlite.main()[source]

Merge PUDL datapackages and save them into an SQLite database.

pudl.convert.datapkg_to_sqlite.parse_command_line(argv)[source]

Parse command line arguments. See the -h option.

Parameters

argv (str) – Command line arguments, including caller filename.

Returns

Dictionary of command line arguments and their parsed values.

Return type

dict
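The documented contract (an argv that includes the caller filename, and a dict of parsed values as the result) can be sketched with argparse as below. This is a stand-in rather than the actual PUDL parser: the function name parse_command_line_sketch and both options (in_paths, --clobber) are hypothetical, so consult the script's -h output for the real ones.

```python
import argparse


def parse_command_line_sketch(argv):
    """Parse argv (including the caller filename) into a plain dict."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser; options here are hypothetical."
    )
    # Hypothetical positional argument: one or more input paths.
    parser.add_argument("in_paths", nargs="+", help="Input datapackage paths.")
    # Hypothetical flag mirroring the clobber parameter documented above.
    parser.add_argument(
        "--clobber",
        action="store_true",
        default=False,
        help="Overwrite an existing database.",
    )
    # argv[0] is the caller filename, so only the remainder is parsed.
    args = parser.parse_args(argv[1:])
    # vars() turns the argparse Namespace into a dictionary.
    return vars(args)
```

A main() function would typically call such a parser with sys.argv and pass the resulting values on to the loading function.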