Published Data Packages¶
We’ve chosen tabular data packages as the main distribution format for PUDL because they:
are based on a free and open standard that should work on any platform,
are relatively easy for both humans and computers to understand,
are easy to archive and distribute,
provide rich metadata describing their contents,
do not force users into any particular platform.
We hope this will allow the data to reach the widest possible audience.
See also
The Frictionless Data software and specifications, a project of the Open Knowledge Foundation
Downloading Data Packages¶
Note
As of catalystcoop.pudl v0.2.0 we have not yet made our first data release. For the moment you still need to generate your own data packages. However, as soon as v0.2.0 is released, we will start working on a data release, and hope to be able to include the DOI and a link to the Zenodo archive here as of v0.2.1.
Our intent is to automate the creation of a standard bundle of data packages containing all of the currently integrated data. Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves, will be able to download and use the data packages directly. Each data release will be issued a DOI and archived at Zenodo, and may be made available in other ways as well.
Zenodo¶
Every PUDL software release is automatically archived and issued a digital object identifier (DOI) by Zenodo through an integration with GitHub. The overarching DOI for the entire PUDL project is 10.5281/zenodo.3404014, and each release will get its own (versioned) DOI.
On a quarterly basis, we will also upload a standard set of data packages to Zenodo alongside the PUDL release that was used to generate them. The packages will also be issued citeable DOIs so they can be easily referenced in research and other publications. Our goal is to make any analysis that depends on the released code and published data as easy to replicate as possible.
Other Sites?¶
Are there other data archiving and access platforms that you’d like to see the PUDL data packages published to? If so, feel free to create an issue on GitHub to let us know about it, and explain what it would add to the project. Other sites we’ve thought about include:
Using Data Packages¶
Once you’ve downloaded or generated your own tabular data packages, you can use them to do analysis on almost any platform. For now, we are primarily using the data packages to populate a local SQLite database.
Open an issue on GitHub and let us know if you have another example we can add.
SQLite¶
If you want to access the data via SQL, we have provided a script that loads a bundle of data packages into a local sqlite3 database, e.g.:
$ datapkg_to_sqlite --pkg_bundle_name pudl-example
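Once the database exists, it can be queried from Python. The following is a minimal sketch using only the standard library sqlite3 module and pandas. The location of the database file and the names of its tables are not stated here, so `db_path` and `table_name` are placeholders you would need to fill in for your own workspace, and the helper functions `list_tables` and `read_table` are hypothetical, not part of PUDL.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def list_tables(db_path):
    """Return the names of all tables in the SQLite database, sorted."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def read_table(db_path, table_name, limit=5):
    """Return the first `limit` rows of one table as a pandas DataFrame.

    `db_path` and `table_name` are assumptions supplied by the caller.
    """
    # Table names cannot be bound as SQL parameters, so the name is checked
    # against the database's own catalog before it is placed in the query.
    if table_name not in list_tables(db_path):
        raise ValueError(f"No table named {table_name!r} in {db_path}")
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql_query(
            f'SELECT * FROM "{table_name}" LIMIT ?', conn, params=(limit,)
        )
```

Calling `list_tables` first is a reasonable way to discover what the loading script actually created before writing any queries against it.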
Python, Pandas, and Jupyter¶
You can read the datapackages into pandas.DataFrame for interactive in-memory use within JupyterLab, or for programmatic use in your own Python modules. Several example Jupyter notebooks are deployed into your PUDL workspace notebook directory by the pudl_setup script.
Todo
Update pudl_intro.ipynb to provide an example of reading the example datapackages directly.
$ jupyter lab notebook/pudl_intro.ipynb
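As a rough illustration of reading a data package into pandas without going through the notebook, the sketch below relies only on the general layout of a tabular data package: a `datapackage.json` descriptor whose `resources` entries each have a `name` and a `path` relative to the descriptor. The function name `load_datapackage` is hypothetical, and this is not necessarily how PUDL itself reads the packages. Note that pandas infers column types here rather than applying the schema stored in the descriptor.

```python
import json
from pathlib import Path

import pandas as pd


def load_datapackage(descriptor_path):
    """Load every single-file CSV resource of a data package into pandas.

    Returns the parsed descriptor (a dict) and a dict mapping each
    resource name to a DataFrame.
    """
    descriptor_path = Path(descriptor_path)
    with descriptor_path.open(encoding="utf-8") as f:
        descriptor = json.load(f)

    frames = {}
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        # Skip resources with inline data or with a list of several files;
        # this sketch only handles one local CSV file per resource.
        if not isinstance(path, str):
            continue
        # Resource paths are interpreted relative to the descriptor file.
        frames[resource["name"]] = pd.read_csv(descriptor_path.parent / path)
    return descriptor, frames
```

The returned descriptor is worth inspecting on its own, since it holds the metadata (titles, descriptions, field definitions) that accompanies the tables.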
If you’re using Python and need to work with larger-than-memory data, especially the EPA CEMS Hourly dataset, we recommend checking out the Dask project, which extends the pandas.DataFrame interface to enable serialized, parallel, and distributed processing tasks. It can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
The R programming language¶
Todo
Get someone who uses R to give us an example here… maybe we can get someone from OKFN to do it?
Microsoft Access / Excel¶
If you’d rather do spreadsheet-based analysis, here’s how you can pull the data packages into Microsoft Access and Excel.
Todo
Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel
Other Platforms¶
Want to submit another example? Check out the documentation on contributing. Wish there was an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a GitHub issue requesting it.