pudl.convert.epacems_to_parquet module

A script for converting the EPA CEMS dataset from gzipped CSVs to Apache Parquet.

The original EPA CEMS data is available as ~12,000 gzipped CSV files, one for each month for each state, from 1995 to the present. Compressed, they take up about 7.3 GB on disk; uncompressed, closer to 100 GB. That’s too much data to work with in memory.

Apache Parquet is a compressed, columnar datastore format, widely used in Big Data applications. It’s an open standard, and is very fast to read from disk. It works especially well with both Dask dataframes (a parallel / distributed computing extension of Pandas) and Apache Spark (a cloud-based Big Data processing pipeline system).

Since pulling 100 GB of data into postgres takes a long time, and working with that data en masse isn’t particularly pleasant on a laptop, this script can be used to convert the original EPA CEMS data to the more widely usable Apache Parquet format for use with Dask, either on a multi-core workstation or in an interactive cloud computing environment like Pangeo.
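
For example, once the conversion has run, the partitioned output can be read lazily with Dask. This is a minimal sketch: the output path and the gross_load_mw column name are assumptions based on typical CEMS fields, not guaranteed to match the actual output:

    import dask.dataframe as dd

    # Hypothetical path to the script's output directory (out_dir):
    epacems_path = "parquet/epacems"

    # The year/state partitioning lets Dask prune files before reading,
    # so a single state-year can be loaded comfortably on a laptop:
    cems = dd.read_parquet(
        epacems_path,
        columns=["operating_datetime", "gross_load_mw"],
        filters=[("year", "==", 2018), ("state", "==", "CO")],
    )

    # Work stays lazy until .compute() is called:
    total_mwh = cems.gross_load_mw.sum().compute()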

For more information on working with these systems, see the Dask and Pangeo project documentation.
pudl.convert.epacems_to_parquet.create_cems_schema()[source]

Make an explicit Arrow schema for the EPA CEMS data.

Make changes in the types of the generated parquet files by editing this function.

Note that parquet’s internal representation doesn’t use unsigned numbers or 16-bit ints, so just keep things simple here and always use int32 and float32.

Returns

An Arrow schema for the EPA CEMS data.

Return type

pyarrow.Schema
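
As an illustration, a schema in this style can be built with pyarrow directly. This is only a sketch; the field names below are illustrative, not the full set of CEMS columns:

    import pyarrow as pa

    # Everything sticks to int32/float32, per the note above:
    sketch_schema = pa.schema([
        pa.field("plant_id_eia", pa.int32(), nullable=False),
        pa.field("operating_datetime", pa.timestamp("ms"), nullable=False),
        pa.field("gross_load_mw", pa.float32()),
        pa.field("state", pa.string()),
        pa.field("year", pa.int32()),
    ])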

pudl.convert.epacems_to_parquet.epacems_to_parquet(epacems_years, epacems_states, data_dir, out_dir, pkg_dir, compression='snappy', partition_cols=('year', 'state'))[source]

Take transformed EPA CEMS dataframes and output them as Parquet files.

We need to do a few additional manipulations of the dataframes after they have been transformed by PUDL to get them ready for output to the Apache Parquet format. Mostly this has to do with ensuring homogeneous data types across all of the dataframes, and downcasting to the most efficient data type possible for each of them. We also add a ‘year’ column so that we can partition the dataset on disk by year as well as state.

Parameters
  • epacems_years (list) – list of years from which we are trying to read CEMS data

  • epacems_states (list) – list of states from which we are trying to read CEMS data

  • data_dir (path-like) – Path to the top directory of the PUDL datastore.

  • out_dir (path-like) – The directory in which to output the Parquet files.

  • pkg_dir (path-like) – The directory in which to output…

  • compression (str) – Compression codec to use for the output Parquet files. Defaults to ‘snappy’.

  • partition_cols (tuple) – Columns on which to partition the output dataset on disk. Defaults to (‘year’, ‘state’).

Raises

AssertionError – Raised if an output directory is not specified.

Todo

Return to
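
A rough sketch of the write step described above, assuming each transformed dataframe gains a ‘year’ column and is written into a dataset partitioned on disk by year and state. The tiny dataframe here is a stand-in for real transformed CEMS data:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in for a transformed CEMS dataframe, downcast to efficient types:
    df = pd.DataFrame({
        "state": ["CO", "CO", "TX"],
        "year": [2018, 2018, 2018],
        "gross_load_mw": [1.5, 2.0, 3.25],
    }).astype({"year": "int32", "gross_load_mw": "float32"})

    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_to_dataset(
        table,
        root_path="parquet/epacems",       # stand-in for out_dir
        partition_cols=["year", "state"],  # one directory level per column
        compression="snappy",
    )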

pudl.convert.epacems_to_parquet.main()[source]

Convert zipped EPA CEMS Hourly data to Apache Parquet format.

pudl.convert.epacems_to_parquet.parse_command_line(argv)[source]

Parse command line arguments. See the -h option.

Parameters

argv (list) – Command line arguments, including caller filename.

Returns

Dictionary of command line arguments and their parsed values.

Return type

dict
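
A minimal sketch of this pattern using argparse; the flags shown are hypothetical (run the script with -h to see the real options):

    import argparse
    import sys

    def parse_command_line(argv):
        """Parse command line arguments into a dictionary."""
        parser = argparse.ArgumentParser(
            description="Convert EPA CEMS data to Apache Parquet.")
        # Hypothetical flags, for illustration only:
        parser.add_argument("--years", nargs="+", type=int,
                            help="Years of CEMS data to convert.")
        parser.add_argument("--states", nargs="+",
                            help="States of CEMS data to convert.")
        # argv[0] is the caller filename, so parsing starts at argv[1]:
        return vars(parser.parse_args(argv[1:]))

    args = parse_command_line(sys.argv)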

pudl.convert.epacems_to_parquet.year_from_operating_datetime(df)[source]

Add a ‘year’ column based on the year in the operating_datetime.

Parameters

df (pandas.DataFrame) – A DataFrame containing EPA CEMS data.

Returns

A DataFrame containing EPA CEMS data with a ‘year’ column.

Return type

pandas.DataFrame
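
A minimal sketch of that operation in pandas, assuming operating_datetime is already a datetime64 column; the int32 cast matches the integer convention noted in create_cems_schema():

    import pandas as pd

    df = pd.DataFrame({
        "operating_datetime": pd.to_datetime(
            ["2018-01-01 01:00", "2019-06-15 12:00"]),
        "gross_load_mw": [1.5, 2.0],
    })

    # Derive the partitioning column from the timestamp:
    df["year"] = df.operating_datetime.dt.year.astype("int32")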