pudl.convert.datapkg_to_sqlite module¶
Merge compatible PUDL datapackages and load the result into an SQLite DB.
This script merges a set of compatible PUDL datapackages into a single
tabular datapackage that can be loaded into an SQLite database (or potentially
other storage media like Google BigQuery, PostgreSQL, etc.). The input
datapackages must all have been produced in the same ETL run and share the
same datapkg-bundle-uuid value. Any data sources (e.g. ferc1, eia923) that
appear in more than one of the datapackages to be merged must also share
identical ETL parameters (years, tables, states, etc.), allowing easy
deduplication of resources.
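The bundle compatibility requirement described above can be sketched as a small helper; the function name `check_bundle_uuid` and the directory layout are hypothetical, but each datapackage's `datapackage.json` is assumed to carry the `datapkg-bundle-uuid` value mentioned in the text:

```python
import json
from pathlib import Path


def check_bundle_uuid(datapkg_dirs):
    """Verify that all datapackages share a single datapkg-bundle-uuid.

    Hypothetical sketch: each directory is assumed to contain a
    datapackage.json with a top-level "datapkg-bundle-uuid" key, as
    produced by a single PUDL ETL run.
    """
    uuids = set()
    for d in datapkg_dirs:
        with open(Path(d) / "datapackage.json") as f:
            uuids.add(json.load(f)["datapkg-bundle-uuid"])
    if len(uuids) != 1:
        # Datapackages from different ETL runs cannot be merged safely.
        raise ValueError(f"Incompatible datapackages, found UUIDs: {uuids}")
    return uuids.pop()
```

A real implementation would additionally compare the ETL parameters (years, tables, states, etc.) of any data source appearing in more than one datapackage before deduplicating resources.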
Being able to load only a subset of the datapackages from an ETL run into the SQLite database is helpful because larger datasets are much easier to work with via columnar datastores like Apache Parquet. For example, the EPA CEMS hourly emissions table has roughly 1 billion records and takes up about 100 GB of space uncompressed; loading all of EPA CEMS into SQLite can take more than 24 hours. PUDL provides a separate epacems_to_parquet script that generates a Parquet dataset partitioned by state and year, which can be read directly into pandas or dask dataframes for use alongside the other PUDL data stored in the SQLite DB.
- pudl.convert.datapkg_to_sqlite.datapkg_to_sqlite(sqlite_url, out_path, clobber=False)[source]¶
Load a PUDL datapackage into an SQLite database.
- Parameters
sqlite_url (str) – An SQLite database connection URL.
out_path (path-like) – Path to the base directory of the datapackage to be loaded into SQLite. Must contain the datapackage.json file.
clobber (bool) – If True, replace any existing PUDL DB. If False (the default), fail if an existing PUDL DB is found.
- Returns
None
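The clobber behavior documented above can be sketched with only the standard library; the helper name `prepare_sqlite_db` is hypothetical and stands in for the real function's handling of a pre-existing database file, not PUDL's actual implementation:

```python
import sqlite3
from pathlib import Path


def prepare_sqlite_db(db_path, clobber=False):
    """Hypothetical sketch of the clobber logic described above.

    If the database file already exists and clobber is False, fail;
    if clobber is True, remove the old file and start fresh.
    """
    db_path = Path(db_path)
    if db_path.exists():
        if not clobber:
            raise FileExistsError(
                f"{db_path} already exists; pass clobber=True to replace it."
            )
        db_path.unlink()  # replace the existing PUDL DB
    return sqlite3.connect(str(db_path))
```

The real function is driven by an SQLite connection URL (e.g. a string of the form `sqlite:///...`) and the path to the directory containing `datapackage.json`, and returns None after loading the data.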