pudl.convert.datapkg_to_sqlite module¶
Merge compatible PUDL datapackages and load the result into an SQLite DB.
This script merges a set of compatible PUDL datapackages into a single
tabular datapackage that can be loaded into an SQLite database (or potentially
other storage media like Google BigQuery, PostgreSQL, etc.). The input
datapackages must all have been produced in the same ETL run and share the
same datapkg-bundle-uuid value. Any data sources (e.g. ferc1, eia923) that
appear in more than one of the datapackages to be merged must also share
identical ETL parameters (years, tables, states, etc.), allowing easy
deduplication of resources.
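The bundle compatibility requirement described above can be sketched as a small helper; the function name `check_bundle_uuid` and the directory layout are hypothetical, but each datapackage's `datapackage.json` is assumed to carry the `datapkg-bundle-uuid` value mentioned in the text:

```python
import json
from pathlib import Path


def check_bundle_uuid(datapkg_dirs):
    """Verify that all datapackages share a single datapkg-bundle-uuid.

    Hypothetical sketch: each directory is assumed to contain a
    datapackage.json with a top-level "datapkg-bundle-uuid" key, as
    produced by a single PUDL ETL run.
    """
    uuids = set()
    for d in datapkg_dirs:
        with open(Path(d) / "datapackage.json") as f:
            uuids.add(json.load(f)["datapkg-bundle-uuid"])
    if len(uuids) != 1:
        # Datapackages from different ETL runs cannot be merged safely.
        raise ValueError(f"Incompatible datapackages, found UUIDs: {uuids}")
    return uuids.pop()
```

A real implementation would additionally compare the ETL parameters (years, tables, states, etc.) of any data source appearing in more than one datapackage before deduplicating resources.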
Being able to load only a subset of the datapackages from an ETL run into the SQLite database is helpful because larger datasets are much easier to work with via columnar datastores like Apache Parquet. For example, the EPA CEMS hourly emissions table has roughly 1 billion records and takes up about 100 GB of space uncompressed; loading all of EPA CEMS into SQLite can take more than 24 hours. PUDL provides a separate epacems_to_parquet script that generates a Parquet dataset partitioned by state and year, which can be read directly into pandas or dask dataframes for use alongside the other PUDL data stored in the SQLite DB.
- pudl.convert.datapkg_to_sqlite.datapkg_to_sqlite(sqlite_url, out_path, clobber=False)[source]¶
Load a PUDL datapackage into an SQLite database.
- Parameters
sqlite_url (str) – An SQLite database connection URL.
out_path (path-like) – Path to the base directory of the datapackage to be loaded into SQLite. Must contain the datapackage.json file.
clobber (bool) – If True, replace any existing PUDL DB. If False (the default), fail if an existing PUDL DB is found.
- Returns
None
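The clobber behavior documented above can be sketched with only the standard library; the helper name `prepare_sqlite_db` is hypothetical and stands in for the real function's handling of a pre-existing database file, not PUDL's actual implementation:

```python
import sqlite3
from pathlib import Path


def prepare_sqlite_db(db_path, clobber=False):
    """Hypothetical sketch of the clobber logic described above.

    If the database file already exists and clobber is False, fail;
    if clobber is True, remove the old file and start fresh.
    """
    db_path = Path(db_path)
    if db_path.exists():
        if not clobber:
            raise FileExistsError(
                f"{db_path} already exists; pass clobber=True to replace it."
            )
        db_path.unlink()  # replace the existing PUDL DB
    return sqlite3.connect(str(db_path))
```

The real function is driven by an SQLite connection URL (e.g. a string of the form `sqlite:///...`) and the path to the directory containing `datapackage.json`, and returns None after loading the data.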