Published Data Packages

We’ve chosen tabular data packages as the main distribution format for PUDL because they:

  • are based on a free and open standard that should work on any platform,

  • are relatively easy for both humans and computers to understand,

  • are easy to archive and distribute,

  • provide rich metadata describing their contents,

  • do not force users into any particular platform.

We hope this will allow the data to reach the widest possible audience.

See also

The Frictionless Data software and specifications, a project of the Open Knowledge Foundation

Downloading Data Packages

After the initial release of the PUDL software, we will automate the creation of a standard bundle of data packages containing all of the currently integrated data. Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves, will be able to download and use the data packages directly. We intend to publish them to the following locations:

Zenodo

Integration between Zenodo and GitHub makes it easy to automatically archive any tagged release and issue a digital object identifier (DOI) for it. On a regular basis, we will also upload a standard set of data packages to Zenodo alongside the PUDL release that was used to generate them. The packages will be issued citable DOIs so they can be easily referenced in research and other publications. Our goal is to make any analysis that depends on the released code and published data as easy to replicate as possible.

Datahub

We also intend to regularly publish new data packages via Datahub.io, an open data portal which natively understands data packages, parses the included metadata, and can help integrate the PUDL data with other open public data.

Other Sites?

Are there other data archiving and access platforms where you’d like to see the PUDL data packages published? If so, feel free to create an issue on GitHub to let us know, and explain what it would add to the project. Other sites we’ve thought about include:

Using Data Packages

Once you’ve downloaded or generated your own tabular data packages, you can use them to do analysis on almost any platform. Below are a few examples. Open an issue on GitHub and let us know if you have another example we can add.

Python, Pandas, and Jupyter

You can read the datapackages into pandas.DataFrame objects for interactive in-memory use within JupyterLab, or for programmatic use in your own Python modules. Several example Jupyter notebooks are deployed into the notebooks directory of your PUDL workspace by the pudl_setup script.

With the pudl conda environment activated you can start up a notebook server and experiment with those notebooks by running the following from within your PUDL workspace:

$ jupyter-lab --notebook-dir=notebooks

Then select the pudl_intro.ipynb notebook from the file browser on the left hand side of the JupyterLab interface.
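Outside of the notebooks, reading a tabular data package into pandas might look roughly like the sketch below. It is an illustration rather than part of the PUDL API: the function names resource_paths and load_package are hypothetical, pandas is assumed to be installed, and each resource in datapackage.json is assumed to have a single "path" pointing at a CSV file relative to the descriptor.

```python
import json
from pathlib import Path


def resource_paths(descriptor_path):
    """Map each resource name in a datapackage.json to its CSV file path.

    Assumes every resource has a single, relative "path" string.
    """
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    return {
        resource["name"]: descriptor_path.parent / resource["path"]
        for resource in descriptor.get("resources", [])
    }


def load_package(descriptor_path):
    """Read every CSV resource into a pandas.DataFrame, keyed by resource name."""
    import pandas as pd  # third-party; imported here so the helper above needs only the stdlib

    return {
        name: pd.read_csv(path)
        for name, path in resource_paths(descriptor_path).items()
    }
```

Calling load_package on the path to a datapackage.json file would return a dictionary of DataFrames, one per table. This sketch ignores the column types declared in the package metadata and lets pandas infer them instead.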

Todo

Update pudl_intro.ipynb to read the example datapackage.

If you’re using Python and need to work with larger-than-memory data, especially the EPA CEMS Hourly dataset, we recommend checking out the Dask project, which extends the pandas.DataFrame interface to support serialized, parallel, and distributed processing. It can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
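A minimal sketch of that approach, assuming Dask is installed and the data is split across several CSV files that share a header, is shown below. The function name total_by_group and its arguments are hypothetical; the file pattern and column names you pass in depend on the data package you are working with.

```python
def total_by_group(csv_glob, group_col, value_col):
    """Sum value_col within each group_col across all CSV files matching csv_glob.

    Dask reads the files lazily in partitions, so the full dataset never has
    to fit in memory at once; only the aggregated result is materialized.
    """
    # Third-party import kept inside the function so this module can still be
    # imported on systems where Dask is not installed.
    import dask.dataframe as dd

    ddf = dd.read_csv(csv_glob)
    # Nothing is computed until .compute() is called, which returns a pandas.Series.
    return ddf.groupby(group_col)[value_col].sum().compute()
```

The result is an ordinary pandas.Series indexed by the grouping column, so it can be used with the rest of your pandas code.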

The R programming language

Todo

Get someone who uses R to give us an example here… maybe we can get someone from OKFN to do it?

SQLite

If you’d rather access the data via SQL, you can easily load the datapackages into a local sqlite3 database.
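Until a dedicated script exists, the idea can be sketched with the Python standard library alone. This is not the PUDL loading script: the function name load_datapackage_to_sqlite and the type mapping are assumptions made for illustration, and the sketch assumes each resource has a single relative CSV "path", a header row, and columns in the same order as the fields in its schema.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Assumed mapping from Table Schema field types to SQLite column types;
# any type not listed here is stored as TEXT.
SQLITE_TYPES = {
    "integer": "INTEGER",
    "year": "INTEGER",
    "number": "REAL",
    "boolean": "INTEGER",
}


def quote_ident(name):
    """Quote a table or column name for safe use in SQL."""
    return '"' + name.replace('"', '""') + '"'


def load_datapackage_to_sqlite(descriptor_path, db_path):
    """Create one SQLite table per resource and fill it from the resource's CSV."""
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        for resource in descriptor["resources"]:
            fields = resource["schema"]["fields"]
            table = quote_ident(resource["name"])
            columns = ", ".join(
                f'{quote_ident(f["name"])} {SQLITE_TYPES.get(f.get("type"), "TEXT")}'
                for f in fields
            )
            conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
            placeholders = ", ".join("?" for _ in fields)
            csv_path = descriptor_path.parent / resource["path"]
            with open(csv_path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                next(reader, None)  # skip the header row
                # Store empty CSV cells as NULL rather than empty strings.
                rows = ([v if v != "" else None for v in row] for row in reader)
                conn.executemany(
                    f"INSERT INTO {table} VALUES ({placeholders})", rows
                )
    return conn
```

The returned connection can then be queried with ordinary SQL, for example conn.execute("SELECT COUNT(*) FROM some_table"), where some_table stands in for one of the resource names in your package. Primary keys, foreign keys, and other constraints described in the package metadata are not applied in this sketch.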

Todo

Write and document datapackage bundle to SQLite script.

Microsoft Access / Excel

If you’d rather do spreadsheet-based analysis, here’s how you can pull the datapackages into Microsoft Access and Excel.

Todo

Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel

Other Platforms

Want to submit another example? Check out the documentation on contributing. Wish there was an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a GitHub issue requesting it.