pudl.convert.datapkg_to_sqlite module
Merge compatible PUDL datapackages and load the result into an SQLite DB.
This script merges a set of compatible PUDL datapackages into a single tabular datapackage, and then loads that package into the PUDL SQLite DB.
The input datapackages must all have been produced in the same ETL run and share the same datapkg-bundle-uuid value. Any data sources (e.g. ferc1, eia923) that appear in more than one of the datapackages to be merged must also share identical ETL parameters (years, tables, states, etc.), allowing easy deduplication of resources.
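The compatibility check described above can be sketched as follows. This is an illustrative helper, not the module's actual implementation; the helper name and the assumption that the UUID appears under a datapkg-bundle-uuid key in each datapackage.json descriptor are hypothetical.

```python
import json
from pathlib import Path


def check_bundle_uuids(datapkg_dirs):
    """Verify that all datapackages share one datapkg-bundle-uuid.

    Hypothetical helper: reads each datapackage.json descriptor and
    collects the (assumed) "datapkg-bundle-uuid" key. Raises ValueError
    if more than one distinct UUID is found.
    """
    uuids = set()
    for pkg_dir in datapkg_dirs:
        descriptor = json.loads((Path(pkg_dir) / "datapackage.json").read_text())
        uuids.add(descriptor.get("datapkg-bundle-uuid"))
    if len(uuids) != 1:
        raise ValueError(f"Incompatible datapackages; found UUIDs: {uuids}")
    return uuids.pop()
```

A mismatch here means the packages came from different ETL runs and cannot be safely merged.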
Being able to load only a subset of the datapackages resulting from an ETL run into the SQLite database is helpful because larger datasets are much easier to work with via columnar datastores like Apache Parquet: loading all of EPA CEMS into SQLite can take more than 24 hours. PUDL also provides a separate epacems_to_parquet script that generates a Parquet dataset partitioned by state and year, which can be read directly into pandas or dask dataframes for use alongside the other PUDL data stored in the SQLite DB.
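Because the Parquet output is partitioned by state and year, a reader can select just the partitions it needs rather than scanning the whole dataset. The sketch below assumes a hive-style directory layout (year=YYYY/state=SS/*.parquet); the actual layout produced by epacems_to_parquet may differ, and the helper name is hypothetical.

```python
from pathlib import Path


def epacems_partition_files(base_dir, years, states):
    """Collect Parquet files for the requested years and states.

    Assumes a hive-style partition layout like
    base_dir/year=2019/state=CO/part-0.parquet; the real layout
    written by epacems_to_parquet may differ.
    """
    files = []
    for year in years:
        for state in states:
            part_dir = Path(base_dir) / f"year={year}" / f"state={state}"
            files.extend(sorted(part_dir.glob("*.parquet")))
    return files
```

The resulting file list could then be handed to pandas.read_parquet or dask.dataframe.read_parquet, which also accept the partitioned directory directly.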
pudl.convert.datapkg_to_sqlite.datapkg_to_sqlite(sqlite_url, out_path, clobber=False, fkeys=False)
Load a PUDL datapackage into an SQLite database.
- Parameters
sqlite_url (str) – An SQLite database connection URL.
out_path (path-like) – Path to the base directory of the datapackage to be loaded into SQLite. Must contain the datapackage.json file.
clobber (bool) – If True, replace an existing PUDL DB if it exists. If False (the default), fail if an existing PUDL DB is found.
fkeys (bool) – If True, tell SQLite to check foreign key constraints for the records that are being loaded. Disabled (False) by default.
- Returns
None
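The loading step can be illustrated with a simplified, self-contained sketch using only the standard library. The real datapkg_to_sqlite uses the PUDL metadata to build properly typed tables via SQLAlchemy; this version just creates one TEXT-columned table per CSV resource, and the function name and schema handling are illustrative assumptions, not the module's API.

```python
import csv
import json
import sqlite3
from pathlib import Path


def load_datapkg_to_sqlite(sqlite_path, pkg_dir, clobber=False, fkeys=False):
    """Simplified sketch: load a tabular datapackage's CSV resources into SQLite.

    Reads datapackage.json, creates one table per resource with TEXT
    columns named after the schema fields, and bulk-inserts the rows.
    Mirrors the clobber/fkeys behavior described above.
    """
    sqlite_path = Path(sqlite_path)
    if sqlite_path.exists():
        if not clobber:
            raise FileExistsError(
                f"{sqlite_path} exists; pass clobber=True to replace it."
            )
        sqlite_path.unlink()

    descriptor = json.loads((Path(pkg_dir) / "datapackage.json").read_text())
    con = sqlite3.connect(str(sqlite_path))
    if fkeys:
        # Ask SQLite to enforce foreign key constraints during the load.
        con.execute("PRAGMA foreign_keys = ON;")

    for resource in descriptor["resources"]:
        name = resource["name"]
        fields = [f["name"] for f in resource["schema"]["fields"]]
        cols = ", ".join(f'"{field}" TEXT' for field in fields)
        con.execute(f'CREATE TABLE "{name}" ({cols});')
        with open(Path(pkg_dir) / resource["path"], newline="") as f:
            rows = [[row[field] for field in fields] for row in csv.DictReader(f)]
        placeholders = ", ".join("?" for _ in fields)
        con.executemany(f'INSERT INTO "{name}" VALUES ({placeholders});', rows)

    con.commit()
    con.close()
```

Note that the real function takes an SQLite connection URL rather than a filesystem path; the path-based signature here is a simplification for the sketch.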