Published Data Packages

We’ve chosen tabular data packages as the main distribution format for PUDL because they:

  • are based on a free and open standard that should work on any platform,

  • are relatively easy for both humans and computers to understand,

  • are easy to archive and distribute,

  • provide rich metadata describing their contents,

  • do not force users into any particular platform.

We hope this will allow the data to reach the widest possible audience.

See also

The Frictionless Data software and specifications, a project of the Open Knowledge Foundation

We intend to publish tabular data packages containing the full outputs of the PUDL ETL pipeline on a quarterly basis via Zenodo, an open data archiving service provided by CERN. The most recent release can always be found through this concept DOI: 10.5281/zenodo.3653158. Each individual version of the data releases will be assigned its own unique DOI.

Users who aren’t working with Python, or who don’t want to set up and run the data processing pipeline themselves, can download and use the data packages directly. We provide scripts alongside the published data packages that will load them into a local SQLite database or, for the larger datasets, convert them into Apache Parquet datasets on disk, which can be read directly into Pandas, Dask, or R dataframes.

We archive the original input data and provide a one-line script which should allow users to replicate the entire ETL process, generating byte-for-byte identical outputs. See the documentation published with the data releases for details on how to load or reproduce the data packages.

We also curate the Catalyst Cooperative Community on Zenodo, which lists all of the archived products generated by our projects.

These archives and the DOIs associated with them should be permanently accessible, and are suitable for use as references in academic and other publications.

Using Data Packages

Once you’ve downloaded or generated your own tabular data packages you can use them to do analysis on almost any platform. For now, we are primarily using the data packages to populate a local SQLite database.

Open an issue on GitHub and let us know if you have another example we can add.

SQLite

If you want to access the data via SQL, we have provided a script that loads several data packages into a local sqlite3 database. Note that these data packages must have all been generated by the same ETL run, or they will be considered incompatible by the script. For example, to load three data packages generated by our example ETL configuration into your local SQLite DB, you could run the following command from within your PUDL workspace:

$ datapkg_to_sqlite \
    -o datapkg/pudl-example/pudl-merged \
    datapkg/pudl-example/ferc1-example/datapackage.json \
    datapkg/pudl-example/eia-example/datapackage.json \
    datapkg/pudl-example/epacems-eia-example/datapackage.json

The path after the -o flag tells the script where to put the merged data package, and the subsequent paths to the various datapackage.json files indicate which data packages should be merged and loaded into SQLite.
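
Once the load has finished, you can query the merged database like any other SQLite file. The sketch below uses pandas and SQLAlchemy; the database path (sqlite/pudl.sqlite) and the table name (plants_entity_eia) are assumptions, so substitute the path and table names that actually exist in your workspace.

import pandas as pd
import sqlalchemy as sa

# Path to the SQLite DB inside the PUDL workspace -- an assumed location, adjust as needed.
engine = sa.create_engine("sqlite:///sqlite/pudl.sqlite")

# List the tables that were loaded from the merged data package:
print(sa.inspect(engine).get_table_names())

# Pull one (hypothetical) table into a dataframe for interactive analysis:
plants = pd.read_sql("SELECT * FROM plants_entity_eia", engine)
print(plants.head())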

Apache Parquet

The EPA CEMS Hourly data approaches 100 GB in size, which is too large to work with directly in memory on most systems and takes a very long time to load into SQLite. Instead, we recommend converting the hourly emissions table into an Apache Parquet dataset stored locally on disk, and then either reading in only parts of it with pandas or using Dask dataframes to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
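
As a rough sketch of the Dask approach, the snippet below lazily opens the converted Parquet dataset (produced by the epacems_to_parquet script described below) and computes a simple aggregate without ever holding the full table in memory. The column names used here (plant_id_eia, operating_datetime_utc, gross_load_mw) are assumptions about the hourly emissions table rather than a guaranteed schema.

import dask.dataframe as dd

# Lazily open the partitioned Parquet dataset inside the PUDL workspace,
# selecting only the columns needed for this calculation.
cems = dd.read_parquet(
    "parquet/epacems",
    columns=["plant_id_eia", "operating_datetime_utc", "gross_load_mw"],
)

# Nothing is read until .compute() is called; the work is then split across
# local cores one partition at a time, so the full dataset never sits in memory.
total_gross_load = cems["gross_load_mw"].sum().compute()
print(total_gross_load)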

If you have generated an EPA CEMS data package, you can use the epacems_to_parquet script to convert the hourly emissions table like this:

$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json

The script will automatically generate a Parquet Dataset which is partitioned by year and state in the parquet/epacems directory within your workspace. Run epacems_to_parquet --help for more details.
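
Because the dataset is partitioned by year and state, pandas can read just the slice you need rather than the whole thing. The sketch below assumes the partition columns are named year and state, that pyarrow is installed, and that 2018 Colorado data is present; adjust the filter values to match the years and states you actually processed.

import pandas as pd

# Read a single state-year of hourly emissions; the filters are pushed down to
# pyarrow, so the other partitions on disk are never read.
co_2018 = pd.read_parquet(
    "parquet/epacems",
    engine="pyarrow",
    filters=[("year", "=", 2018), ("state", "=", "CO")],
)
print(co_2018.shape)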

Microsoft Access / Excel

If you’d rather do spreadsheet-based analysis, here’s how you can pull the data packages into Microsoft Access for use with Excel and other Microsoft tools:

Todo

Document process for pulling data packages or datapackage bundles into Microsoft Access / Excel. If you’ve gotten this to work and would like to contribute an example, please let us know!

Other Platforms

Because the data packages we’re publishing right now are designed as well-normalized relational database tables, pulling them directly into e.g. Pandas or R dataframes for interactive use probably isn’t the most useful thing to do. In the future we intend to generate and publish data packages containing denormalized tables, including values derived from post-ETL analysis of the original data. These packages would be more suitable for direct interactive use.
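
That said, if you do want to inspect an individual table from a data package on another platform, the Python datapackage library (part of the Frictionless Data tooling) makes it straightforward. The sketch below loads one resource from the example EIA package into a pandas dataframe; the resource name plants_entity_eia is hypothetical, so check pkg.resource_names for what your package actually contains.

import pandas as pd
from datapackage import Package

# Open the data package via its descriptor file.
pkg = Package("datapkg/pudl-example/eia-example/datapackage.json")
print(pkg.resource_names)

# Read one (hypothetical) resource as a list of dicts keyed by column name
# and build a dataframe from it.
rows = pkg.get_resource("plants_entity_eia").read(keyed=True)
df = pd.DataFrame(rows)
print(df.head())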

Want to submit another example? Check out the documentation on contributing. Wish there was an example here for your favorite data analysis tool, but don’t know what it would look like? Feel free to open a GitHub issue requesting it.