Published Data Packages¶
We’ve chosen tabular data packages as the main distribution format for PUDL because they:
are based on a free and open standard that should work on any platform,
are relatively easy for both humans and computers to understand,
are easy to archive and distribute,
provide rich metadata describing their contents,
do not force users into any particular platform.
We our hope this will allow the data to reach the widest possible audience.
See also
The Frictionless Data software and specifications, a project of the Open Knowledge Foundation
We intend to publish tabular data packages on a quarterly basis, containing the full outputs from the PUDL ETL pipeline via Zenodo, and open data archiving service provided by CERN. The most recent release can always be found through this concept DOI: 10.5281/zenodo.3653158. Each individual version of the data releases will be assigned its own unique DOI.
Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves can download and use the data packages directly. We provide scripts alongside the published data packages that will load them into a local SQLite database or for the larger datasets, convert them into Apache Parquet datasets on disk which can be read directly into Pandas, Dask, or R dataframes.
We archive the original input data and provide a one-line script which should allow users to replicate the entire ETL process, generating byte-for-byte identical outputs. See the documentation published with the data releases for details on how to load or reproduce the data packages.
We also curate the Catalyst Cooperative Community on Zenodo which lists all of the archived products generated by our projects.
The these archives and the DOIs associated with them should be permanently accessible, and are suitable for use as references in academic and other publications.
Using Data Packages¶
Once you’ve downloaded or generated your own tabular data packages you can use them to do analysis on almost any platform. For now, we are primarily using the data packages to populate a local SQLite database.
Open an issue on Github and let us know if you have another example we can add.
SQLite¶
If you want to access the data via SQL, we have provided a script that loads
several data packages into a local sqlite3
database. Note that these
data packages must have all been generated by the same ETL run, or they
will be considered incompatible by the script. For example, to load three
data packages generated by our example ETL configuration into your local SQLite
DB, you could run the following command from within your PUDL workspace:
$ datapkg_to_sqlite \
-o datapkg/pudl-example/pudl-merged \
datapkg/pudl-example/ferc1-example/datapackage.json \
datapkg/pudl-example/eia-example/datapackage.json \
The path after the -o
flag tells the script where to put the merged
data package, and the subsequent paths to the various datapackage.json
files indicate which data packages should be merged and loaded into SQLite.
Apache Parquet¶
The EPA CEMS Hourly data approaches 100 GB in size, which is too large to work with directly in memory on most systems, and take a very very long time to load into SQLite. Instead, we recommend converting the Hourly Emissions table into an Apache Parquet dataset which is stored on disk locally, and either reading in only parts of it using pandas, or using Dask dataframes, to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
If you have generated an EPA CEMS data package, you can use the
epacems_to_parquet
script to convert the hourly emissions table like this:
$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json
The script will automatically generate a Parquet Dataset which is partitioned
by year and state in the parquet/epacems
directory within your workspace.
Run epacems_to_parquet --help
for more details.
Microsoft Access / Excel¶
If you’d rather do spreadsheet based analysis, here’s how you can pull the data packages into Microsoft Access for use with Excel and other Microsoft tools:
Todo
Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel. If you’ve gotten this to work and would like to contribute an example, please let us know!
Other Platforms¶
Because the data packages we’re publishing right now are designed as well normalized relational database tables, pulling them directly into e.g. Pandas or R dataframes for interactive use probably isn’t the most useful thing to do. In the future we intend to generate and publish data packages containing denormalized tables including values derived from analysis of the original data, post-ETL. These packages would be more suitable for direct interactive use.
Want to submit another example? Check out the documentation on contributing. Wish there was an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a Github issue requesting it.