pudl.convert.epacems_to_parquet module

A script for converting the EPA CEMS dataset from gzipped CSVs to Apache Parquet.

The original EPA CEMS data is available as ~12,000 gzipped CSV files, one for each month for each state, from 1995 to the present. On disk they take up about 7.3 GB of space, compressed; uncompressed, the data is closer to 100 GB. That’s too much data to work with in memory.

Apache Parquet is a compressed, columnar datastore format, widely used in Big Data applications. It’s an open standard, and is very fast to read from disk. It works especially well with both Dask dataframes (a parallel / distributed computing extension of pandas) and Apache Spark (a cloud-based Big Data processing pipeline system).

Since pulling 100 GB of data into SQLite takes a long time, and working with that data en masse isn’t particularly pleasant on a laptop, this script can be used to convert the original EPA CEMS data to the more widely usable Apache Parquet format for use with Dask, either on a multi-core workstation or in an interactive cloud computing environment like Pangeo.
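If all you want is to read the converted data back, a minimal sketch with Dask might look like the following. The output path and column names here are illustrative assumptions, not the script's actual defaults:

    import dask.dataframe as dd

    # Read only a few columns, and only the 2019 Colorado partition, from the
    # year/state-partitioned Parquet dataset. Path and columns are assumptions.
    epacems = dd.read_parquet(
        "parquet/epacems",
        columns=["plant_id_eia", "operating_datetime_utc", "gross_load_mw"],
        filters=[("year", "==", 2019), ("state", "==", "CO")],
    )
    co_2019 = epacems.compute()  # materialize just the filtered subset in pandas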

pudl.convert.epacems_to_parquet.create_cems_schema()[source]

Make an explicit Arrow schema for the EPA CEMS data.

Edit this function to change the data types used in the generated Parquet files.

Note that Parquet’s internal representation doesn’t use unsigned integers or 16-bit ints, so keep things simple here and always use int32 and float32 (a short schema sketch follows this entry).

Returns

An Arrow schema for the EPA CEMS data.

Return type

pyarrow.Schema
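For illustration, a schema built along these lines might look like the sketch below. It covers only a handful of CEMS columns and is not the full schema used by this module:

    import pyarrow as pa

    # Illustrative subset of an explicit Arrow schema for EPA CEMS columns,
    # sticking to int32 / float32 as described above. Not the full schema.
    def example_cems_schema():
        return pa.schema([
            pa.field("plant_id_eia", pa.int32(), nullable=False),
            pa.field("unitid", pa.string(), nullable=False),
            pa.field("operating_datetime_utc",
                     pa.timestamp("ms", tz="UTC"), nullable=False),
            pa.field("gross_load_mw", pa.float32()),
            pa.field("co2_mass_tons", pa.float32()),
            pa.field("state", pa.string()),
            pa.field("year", pa.int32()),
        ])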

pudl.convert.epacems_to_parquet.create_in_dtypes()[source]

Create a dictionary of input data types.

This specifies the dtypes of the input columns, which is necessary in cases where, for example, a column is always NaN (an illustrative mapping follows this entry).

Returns

Mapping of column names to pandas data types.

Return type

dict
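As a hedged illustration, such a mapping might look like the sketch below; the column names and dtypes shown are assumptions, not the authoritative CEMS dtype map:

    # Forcing explicit dtypes matters for columns that can be entirely NaN in a
    # given month/state file, where pandas would otherwise infer a useless type.
    example_in_dtypes = {
        "state": "category",
        "plant_id_eia": "int32",
        "unitid": "string",
        "gross_load_mw": "float32",
        "heat_content_mmbtu": "float32",
        "co2_mass_tons": "float32",  # may be all-NaN in some month/state files
    }

    # Usage sketch: pandas.read_csv(some_cems_csv, dtype=example_in_dtypes)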

pudl.convert.epacems_to_parquet.epacems_to_parquet(datapkg_path, epacems_years, epacems_states, out_dir, compression='snappy', partition_cols=('year', 'state'), clobber=False)[source]

Take transformed EPA CEMS dataframes and output them as Parquet files.

We need to do a few additional manipulations of the dataframes after they have been transformed by PUDL to get them ready for output to the Apache Parquet format. Mostly this has to do with ensuring homogeneous data types across all of the dataframes, and downcasting to the most efficient data type possible for each of them. We also add a ‘year’ column so that we can partition the dataset on disk by year as well as state (a minimal write sketch follows this entry). Year partitions follow the CEMS input data, based on local plant time. The operating_datetime_utc identifies time in UTC, so there’s a mismatch of a few hours on December 31 / January 1.

Parameters
  • datapkg_path (path-like) – Path to the datapackage.json file describing the datapackage containing the EPA CEMS data to be converted.

  • epacems_years (list) – list of years from which we are trying to read CEMS data

  • epacems_states (list) – list of states from which we are trying to read CEMS data

  • out_dir (path-like) – The directory in which to output the Parquet files

  • compression (string) – Compression codec to use in the output Parquet files. Defaults to ‘snappy’.

  • partition_cols (tuple) – Columns on which to partition the output Parquet dataset. Defaults to (‘year’, ‘state’).

  • clobber (bool) – If True and a directory with the same name as out_dir already exists, the existing parquet files will be deleted and new ones will be generated in their place.

Raises

AssertionError – Raised if an output directory is not specified.

Todo

Return to
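A minimal sketch of the partitioned-write step, assuming df is a transformed CEMS dataframe that already carries ‘year’ and ‘state’ columns and schema comes from create_cems_schema(). This is illustrative, not the exact implementation:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_cems_partitions(df, schema, out_dir, compression="snappy"):
        # Convert the pandas dataframe to an Arrow table using the explicit
        # schema, then write it out partitioned by year and state.
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        pq.write_to_dataset(
            table,
            root_path=str(out_dir),
            partition_cols=["year", "state"],
            compression=compression,
        )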

pudl.convert.epacems_to_parquet.main()[source]

Convert zipped EPA CEMS Hourly data to Apache Parquet format.

pudl.convert.epacems_to_parquet.parse_command_line(argv)[source]

Parse command line arguments. See the -h option. (A generic argparse sketch follows this entry.)

Parameters

argv (list) – Command line arguments, including the caller filename.

Returns

Dictionary of command line arguments and their parsed values.

Return type

dict
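For orientation only, a generic argparse skeleton for this kind of function might look like the sketch below; the flag shown is a hypothetical placeholder, not one of the script’s real options (use -h for those):

    import argparse
    import sys

    def parse_command_line_sketch(argv):
        parser = argparse.ArgumentParser(description=__doc__)
        # Hypothetical placeholder flag, not an actual option of this script.
        parser.add_argument("--example-option", help="Placeholder option.")
        # Skip argv[0] (the caller filename) and return a plain dict of values.
        return vars(parser.parse_args(argv[1:]))

    if __name__ == "__main__":
        args = parse_command_line_sketch(sys.argv)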