Published Data Packages¶
We’ve chosen tabular data packages as the main distribution format for PUDL because they:
are based on a free and open standard that should work on any platform,
are relatively easy for both humans and computers to understand,
are easy to archive and distribute,
provide rich metadata describing their contents,
do not force users into any particular platform.
We hope this will allow the data to reach the widest possible audience.
See also
The Frictionless Data software and specifications, a project of the Open Knowledge Foundation
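Each tabular data package is described by a datapackage.json descriptor following the Frictionless Data specification, which lists the package’s resources (CSV files) and their table schemas. As a minimal sketch using only the standard library, here is how that metadata can be inspected (the descriptor contents below are illustrative, not actual PUDL output):

```python
import json
import tempfile
from pathlib import Path

# A tiny illustrative descriptor -- real PUDL packages list many more
# resources, each with a full Table Schema describing its columns.
descriptor = {
    "name": "pudl-example",
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "plants_ferc1",
            "path": "data/plants_ferc1.csv",
            "schema": {"fields": [{"name": "plant_id", "type": "integer"}]},
        }
    ],
}

pkg_dir = Path(tempfile.mkdtemp())
(pkg_dir / "datapackage.json").write_text(json.dumps(descriptor))

# Reading the descriptor back tells you which tables the package contains,
# without opening any of the CSV files themselves.
metadata = json.loads((pkg_dir / "datapackage.json").read_text())
resource_names = [r["name"] for r in metadata["resources"]]
print(resource_names)  # ['plants_ferc1']
```

Because the descriptor is plain JSON, the same inspection works from any language, which is part of why the format travels well across platforms.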
Downloading Data Packages¶
Note
Release v0.3.0 of the catalystcoop.pudl package will be used to generate tabular data packages for distribution. You will be able to find them listed on the Catalyst Cooperative Community page on Zenodo.
Our intent is to automate the creation of a standard bundle of data packages containing all of the currently integrated data. Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves, will be able to download and use the data packages directly. Each data release will be issued a DOI and archived at Zenodo, and may be made available in other ways as well.
Zenodo¶
Every PUDL software release is automatically archived and issued a digital object identifier (DOI) by Zenodo through an integration with GitHub. The overarching DOI for the entire PUDL project is 10.5281/zenodo.3404014, and each release will get its own (versioned) DOI.
On a quarterly basis, we will also upload a standard set of data packages to Zenodo alongside the PUDL release that was used to generate them, and the packages will also be issued citeable DOIs so they can be easily referenced in research and other publications. Our goal is to make any analysis that depends on the released code and published data as easy to replicate as possible.
Other Sites?¶
Are there other data archiving and access platforms that you’d like to see the PUDL data packages published to? If so, feel free to create an issue on GitHub to let us know about it, and explain what it would add to the project. Other sites we’ve thought about include:
Using Data Packages¶
Once you’ve downloaded or generated your own tabular data packages, you can use them to do analysis on almost any platform. For now, we are primarily using the data packages to populate a local SQLite database.
Open an issue on GitHub and let us know if you have another example we can add.
SQLite¶
If you want to access the data via SQL, we have provided a script that loads several data packages into a local sqlite3 database. Note that these data packages must all have been generated by the same ETL run, or the script will consider them incompatible. For example, to load three data packages generated by our example ETL configuration into your local SQLite DB, you could run the following command from within your PUDL workspace:
$ datapkg_to_sqlite \
-o datapkg/pudl-example/pudl-merged \
datapkg/pudl-example/ferc1-example/datapackage.json \
datapkg/pudl-example/eia-example/datapackage.json \
datapkg/pudl-example/epaipm-example/datapackage.json
The path after the -o flag tells the script where to put the merged data package, and the subsequent paths to the various datapackage.json files indicate which data packages should be merged and loaded into SQLite.
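Once the merged database exists, it can be queried with any SQLite client, including Python’s built-in sqlite3 module. A minimal sketch follows; note that the pudl.sqlite path in the comment and the table and plant names used here are illustrative assumptions, not actual PUDL output, and the example builds a throwaway database so it is self-contained:

```python
import sqlite3
import tempfile
from pathlib import Path

# In practice you would connect to the database produced by the
# datapkg_to_sqlite script inside your PUDL workspace, e.g. (path is
# an assumption about your workspace layout):
#   conn = sqlite3.connect("sqlite/pudl.sqlite")
# Here we build a tiny throwaway database instead.
db_path = Path(tempfile.mkdtemp()) / "pudl.sqlite"
conn = sqlite3.connect(str(db_path))
conn.execute("CREATE TABLE plants_ferc1 (plant_id INTEGER, plant_name TEXT)")
conn.execute("INSERT INTO plants_ferc1 VALUES (1, 'Comanche')")
conn.commit()

# Listing the tables shows which data package resources were loaded.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['plants_ferc1']

# Ordinary SQL works from here on.
rows = conn.execute("SELECT plant_name FROM plants_ferc1").fetchall()
print(rows)  # [('Comanche',)]
conn.close()
```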
Apache Parquet¶
The EPA CEMS Hourly data approaches 100 GB in size, which is too large to work with directly in memory on most systems, and takes a very long time to load into SQLite. Instead, we recommend converting the hourly emissions table into an Apache Parquet dataset stored on disk locally, and either reading in only parts of it using pandas, or using Dask dataframes to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
If you have generated an EPA CEMS data package, you can use the epacems_to_parquet script to convert the hourly emissions table like this:
$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json
The script will automatically generate a Parquet dataset, partitioned by year and state, in the parquet/epacems directory within your workspace. Run epacems_to_parquet --help for more details.
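The point of partitioning by year and state is that readers can skip entire partitions when you only need a slice of the data, which is what keeps memory use manageable. As a sketch, assuming the script writes hive-style year=…/state=… directories (an assumption about the output layout; adjust the pattern to match your actual tree), you can select just the files you care about with the standard library:

```python
import tempfile
from pathlib import Path

# Build a stand-in for the partitioned layout, with empty placeholder
# files instead of real Parquet data, so this example is self-contained.
root = Path(tempfile.mkdtemp()) / "parquet" / "epacems"
for year in (2017, 2018):
    for state in ("CO", "TX"):
        part = root / f"year={year}" / f"state={state}"
        part.mkdir(parents=True)
        (part / "part-0.parquet").touch()

# Selecting only the 2018 Colorado partition ignores everything else:
wanted = sorted(str(p.relative_to(root))
                for p in root.glob("year=2018/state=CO/*.parquet"))
print(wanted)  # ['year=2018/state=CO/part-0.parquet']
```

Dask understands this layout natively, so with a real dataset you could hand the root directory to dask.dataframe.read_parquet and filter on the partition columns rather than globbing paths yourself.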
Microsoft Access / Excel¶
If you’d rather do spreadsheet based analysis, here’s how you can pull the data packages into Microsoft Access for use with Excel and other Microsoft tools:
Todo
Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel
Other Platforms¶
Because the data packages we’re publishing right now are designed as well-normalized relational database tables, pulling them directly into, e.g., Pandas or R dataframes for interactive use probably isn’t the most useful thing to do. In the future, we intend to generate and publish data packages containing denormalized tables, including values derived from post-ETL analysis of the original data. These packages would be suitable for direct interactive use.
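To illustrate what working with normalized tables involves, here is a hypothetical pandas example (the table and column names are invented for illustration, not real PUDL tables) that joins an entity table back onto a data table to produce the kind of denormalized dataframe that is convenient for interactive use:

```python
import pandas as pd

# Two hypothetical normalized tables: an entity table of plants, and a
# data table that refers to plants only by ID.
plants = pd.DataFrame({
    "plant_id": [1, 2],
    "plant_name": ["Alpha", "Beta"],
})
generation = pd.DataFrame({
    "plant_id": [1, 1, 2],
    "year": [2017, 2018, 2018],
    "net_generation_mwh": [100.0, 110.0, 50.0],
})

# Denormalizing means joining the human-readable attributes back on,
# so each row carries everything you need for analysis or plotting.
denorm = generation.merge(plants, on="plant_id", how="left")
print(denorm[["plant_name", "year", "net_generation_mwh"]])
```

With well-normalized packages, every analysis starts with joins like this one; publishing pre-joined tables would let users skip that step.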
Want to submit another example? Check out the documentation on contributing. Wish there was an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a GitHub issue requesting it.