pudl.convert.epacems_to_parquet module

A script for converting the EPA CEMS dataset from gzipped CSVs to Apache Parquet.

The original EPA CEMS data is available as ~12,000 gzipped CSV files, one for each month for each state, from 1995 to the present. Compressed, they take up about 7.3 GB on disk; uncompressed, closer to 100 GB. That’s too much data to work with in memory.

Apache Parquet is a compressed, columnar datastore format, widely used in Big Data applications. It’s an open standard, and is very fast to read from disk. It works especially well with both Dask dataframes (a parallel / distributed computing extension of Pandas) and Apache Spark (a cloud-based Big Data processing pipeline system).

Since pulling 100 GB of data into postgres takes a long time, and working with that data en masse isn’t particularly pleasant on a laptop, this script can be used to convert the original EPA CEMS data to the more widely usable Apache Parquet format for use with Dask, either on a multi-core workstation or in an interactive cloud computing environment like Pangeo.
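
For example, once the conversion has run, the partitioned output can be read lazily with Dask. This is a minimal sketch: the output path and the gross_load_mw column name are assumptions based on typical CEMS fields, not guaranteed to match the actual output:

    import dask.dataframe as dd

    # Hypothetical path to the script's output directory (out_dir):
    epacems_path = "parquet/epacems"

    # The year/state partitioning lets Dask prune files before reading,
    # so a single state-year can be loaded comfortably on a laptop:
    cems = dd.read_parquet(
        epacems_path,
        columns=["operating_datetime", "gross_load_mw"],
        filters=[("year", "==", 2018), ("state", "==", "CO")],
    )

    # Work stays lazy until .compute() is called:
    total_mwh = cems.gross_load_mw.sum().compute()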

For more information on working with these systems, see the Dask and Pangeo project documentation.
pudl.convert.epacems_to_parquet.create_cems_schema()[source]

Make an explicit Arrow schema for the EPA CEMS data.

Make changes in the types of the generated parquet files by editing this function.

Note that parquet’s internal representation doesn’t use unsigned numbers or 16-bit ints, so just keep things simple here and always use int32 and float32.

Returns

An Arrow schema for the EPA CEMS data.

Return type

pyarrow.Schema
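
As an illustration, a schema in this style can be built with pyarrow directly. This is only a sketch; the field names below are illustrative, not the full set of CEMS columns:

    import pyarrow as pa

    # Everything sticks to int32/float32, per the note above:
    sketch_schema = pa.schema([
        pa.field("plant_id_eia", pa.int32(), nullable=False),
        pa.field("operating_datetime", pa.timestamp("ms"), nullable=False),
        pa.field("gross_load_mw", pa.float32()),
        pa.field("state", pa.string()),
        pa.field("year", pa.int32()),
    ])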

pudl.convert.epacems_to_parquet.epacems_to_parquet(epacems_years, epacems_states, data_dir, out_dir, pkg_dir, compression='snappy', partition_cols=('year', 'state'))[source]

Take transformed EPA CEMS dataframes and output them as Parquet files.

We need to do a few additional manipulations of the dataframes after they have been transformed by PUDL to get them ready for output to the Apache Parquet format. Mostly this has to do with ensuring homogeneous data types across all of the dataframes, and downcasting to the most efficient data type possible for each of them. We also add a ‘year’ column so that we can partition the dataset on disk by year as well as state.

Parameters
  • epacems_years (list) – list of years from which we are trying to read CEMS data

  • epacems_states (list) – list of states from which we are trying to read CEMS data

  • data_dir (path-like) – Path to the top directory of the PUDL datastore.

  • out_dir (path-like) – The directory in which to output the Parquet files.

  • pkg_dir (path-like) – The directory in which to output…

  • compression (str) – Compression codec to use for the output Parquet files. Defaults to ‘snappy’.

  • partition_cols (tuple) – Columns on which to partition the output dataset on disk. Defaults to (‘year’, ‘state’).

Raises

AssertionError – Raised if an output directory is not specified.

Todo

Return to
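
A rough sketch of the write step described above, assuming each transformed dataframe gains a ‘year’ column and is written into a dataset partitioned on disk by year and state. The tiny dataframe here is a stand-in for real transformed CEMS data:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in for a transformed CEMS dataframe, downcast to efficient types:
    df = pd.DataFrame({
        "state": ["CO", "CO", "TX"],
        "year": [2018, 2018, 2018],
        "gross_load_mw": [1.5, 2.0, 3.25],
    }).astype({"year": "int32", "gross_load_mw": "float32"})

    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_to_dataset(
        table,
        root_path="parquet/epacems",       # stand-in for out_dir
        partition_cols=["year", "state"],  # one directory level per column
        compression="snappy",
    )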

pudl.convert.epacems_to_parquet.main()[source]

Convert zipped EPA CEMS Hourly data to Apache Parquet format.

pudl.convert.epacems_to_parquet.parse_command_line(argv)[source]

Parse command line arguments. See the -h option.

Parameters

argv (list) – Command line arguments, including caller filename.

Returns

Dictionary of command line arguments and their parsed values.

Return type

dict
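
A minimal sketch of this pattern using argparse; the flags shown are hypothetical (run the script with -h to see the real options):

    import argparse
    import sys

    def parse_command_line(argv):
        """Parse command line arguments into a dictionary."""
        parser = argparse.ArgumentParser(
            description="Convert EPA CEMS data to Apache Parquet.")
        # Hypothetical flags, for illustration only:
        parser.add_argument("--years", nargs="+", type=int,
                            help="Years of CEMS data to convert.")
        parser.add_argument("--states", nargs="+",
                            help="States of CEMS data to convert.")
        # argv[0] is the caller filename, so parsing starts at argv[1]:
        return vars(parser.parse_args(argv[1:]))

    args = parse_command_line(sys.argv)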

pudl.convert.epacems_to_parquet.year_from_operating_datetime(df)[source]

Add a ‘year’ column based on the year in the operating_datetime.

Parameters

df (pandas.DataFrame) – A DataFrame containing EPA CEMS data.

Returns

A DataFrame containing EPA CEMS data with a ‘year’ column.

Return type

pandas.DataFrame
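
A minimal sketch of that operation in pandas, assuming operating_datetime is already a datetime64 column; the int32 cast matches the integer convention noted in create_cems_schema():

    import pandas as pd

    df = pd.DataFrame({
        "operating_datetime": pd.to_datetime(
            ["2018-01-01 01:00", "2019-06-15 12:00"]),
        "gross_load_mw": [1.5, 2.0],
    })

    # Derive the partitioning column from the timestamp:
    df["year"] = df.operating_datetime.dt.year.astype("int32")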