Published Data Packages

We’ve chosen tabular data packages as the main distribution format for PUDL because they:

  • are based on a free and open standard that should work on any platform,

  • are relatively easy for both humans and computers to understand,

  • are easy to archive and distribute,

  • provide rich metadata describing their contents,

  • do not force users into any particular platform.

We hope this will allow the data to reach the widest possible audience.

See also

The Frictionless Data software and specifications, a project of the Open Knowledge Foundation

Downloading Data Packages

Note

Release v0.3.0 of the catalystcoop.pudl package will be used to generate tabular data packages for distribution. You will be able to find them listed on the Catalyst Cooperative Community page on Zenodo.

Our intent is to automate the creation of a standard bundle of data packages containing all of the currently integrated data. Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves, will be able to download and use the data packages directly. Each data release will be issued a DOI and archived at Zenodo, and may be made available in other ways as well.

Zenodo

Every PUDL software release is automatically archived and issued a digital object identifier (DOI) by Zenodo through an integration with GitHub. The overarching DOI for the entire PUDL project is 10.5281/zenodo.3404014, and each release will get its own versioned DOI.

On a quarterly basis, we will also upload a standard set of data packages to Zenodo alongside the PUDL release that was used to generate them. These packages will be issued citeable DOIs of their own, so they can be easily referenced in research and other publications. Our goal is to make any analyses that depend on the released code and published data as easy to replicate as possible.

Other Sites?

Are there other data archiving and access platforms that you’d like to see the PUDL data packages published to? If so, feel free to create an issue on GitHub to let us know about it, and explain what it would add to the project. There are a few other sites we’ve thought about as well.

Using Data Packages

Once you’ve downloaded or generated your own tabular data packages you can use them to do analysis on almost any platform. For now, we are primarily using the data packages to populate a local SQLite database.

Open an issue on GitHub and let us know if you have another example we can add.

SQLite

If you want to access the data via SQL, we have provided a script that loads several data packages into a local sqlite3 database. Note that these data packages must have all been generated by the same ETL run, or they will be considered incompatible by the script. For example, to load three data packages generated by our example ETL configuration into your local SQLite DB, you could run the following command from within your PUDL workspace:

$ datapkg_to_sqlite \
    -o datapkg/pudl-example/pudl-merged \
    datapkg/pudl-example/ferc1-example/datapackage.json \
    datapkg/pudl-example/eia-example/datapackage.json \
    datapkg/pudl-example/epaipm-example/datapackage.json

The path after the -o flag tells the script where to put the merged data package, and the subsequent paths to the various datapackage.json files indicate which data packages should be merged and loaded into SQLite.
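
Once the merged package has been loaded, you can query the resulting database from any tool that speaks SQL. Here’s a minimal sketch using pandas; the sqlite/pudl.sqlite path and the fuel_ferc1 table name are assumptions based on our example configuration, so adjust them to match your own workspace and data packages:

import sqlite3

import pandas as pd

# Connect to the database produced by datapkg_to_sqlite. We're assuming it
# lives at sqlite/pudl.sqlite within the PUDL workspace; adjust as needed.
conn = sqlite3.connect("sqlite/pudl.sqlite")

# Pull one table into a dataframe. fuel_ferc1 is just an illustrative table
# name -- query whatever tables your data packages actually contain.
fuel_df = pd.read_sql("SELECT * FROM fuel_ferc1;", conn)
conn.close()

print(fuel_df.info())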

Apache Parquet

The EPA CEMS Hourly data approaches 100 GB in size, which is too large to work with directly in memory on most systems, and would take a very long time to load into SQLite. Instead, we recommend converting the hourly emissions table into an Apache Parquet dataset stored on local disk, and then either reading in only parts of it using pandas, or using Dask dataframes to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.

If you have generated an EPA CEMS data package, you can use the epacems_to_parquet script to convert the hourly emissions table like this:

$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json

The script will automatically generate a Parquet dataset partitioned by year and state in the parquet/epacems directory within your workspace. Run epacems_to_parquet --help for more details.
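
Once the dataset exists, you can lazily read just the partitions you need. Here’s a minimal sketch using Dask; the gross_load_mw column name is an assumption about the table schema, so substitute whichever columns you’re interested in:

import dask.dataframe as dd

# Read only the 2018 Texas partition of the hourly emissions data, using the
# year and state partition columns to avoid scanning the whole dataset.
epacems = dd.read_parquet(
    "parquet/epacems",
    filters=[("year", "==", 2018), ("state", "==", "TX")],
)

# Compute a small aggregate without loading the full dataset into memory.
# gross_load_mw is an assumed column name -- check the schema of your output.
print(epacems["gross_load_mw"].mean().compute())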

Microsoft Access / Excel

If you’d rather do spreadsheet-based analysis, here’s how you can pull the data packages into Microsoft Access for use with Excel and other Microsoft tools:

Todo

Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel

Other Platforms

Because the data packages we’re publishing right now are designed as well-normalized relational database tables, pulling them directly into e.g. pandas or R dataframes for interactive use probably isn’t the most useful thing to do. In the future we intend to generate and publish data packages containing denormalized tables, including values derived from post-ETL analysis of the original data. Those packages would be suitable for direct interactive use.
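
That said, if you do want to read an individual resource straight into a dataframe, the Frictionless datapackage Python library makes it straightforward. Here’s a minimal sketch; the datapackage.json path and the fuel_ferc1 resource name are assumptions based on our example configuration:

import pandas as pd
from datapackage import Package

# Open a data package by pointing at its descriptor file, and list the
# tabular resources it contains.
package = Package("datapkg/pudl-example/ferc1-example/datapackage.json")
print(package.resource_names)

# Read one resource into a pandas dataframe. fuel_ferc1 is illustrative --
# pick any resource name printed above.
resource = package.get_resource("fuel_ferc1")
df = pd.DataFrame(resource.read(keyed=True))
print(df.head())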

Want to submit another example? Check out the documentation on contributing. Wish there were an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a GitHub issue requesting it.