pudl.load.csv module

Functions for loading processed PUDL data tables into CSV files.

Once each set of tables pertaining to a data source has been transformed, we need to output them as CSV files which will become the data underlying tabular data resources. Most of these resources contain an entire table. In the case of larger tables (like EPA CEMS), the data may be partitioned into a collection of gzipped CSV files which are all part of a single resource group.

These functions are designed to pick up where the transform step leaves off, taking a dictionary of dataframes and applying a few last alterations that are necessary only in the context of outputting the data as text-based files. These include converting floatified integer columns into strings that preserve null values, and setting dataframe indexes where needed.
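
As an illustration, here is a minimal sketch of the kind of integer-to-string conversion described above, written in plain pandas rather than with PUDL's internal helpers (the column name is illustrative):

```python
import numpy as np
import pandas as pd

# An integer ID column that picked up NaN values is stored as float64
# ("floatified"), so 1234 would be written to CSV as "1234.0". Round-tripping
# through pandas' nullable Int64 type and then to strings restores the integer
# formatting and leaves missing values as empty fields in the CSV.
df = pd.DataFrame({"plant_id_eia": [1234.0, np.nan, 56.0]})
df["plant_id_eia"] = (
    df["plant_id_eia"]
    .astype("Int64")    # 1234, <NA>, 56
    .astype("string")   # "1234", <NA>, "56"
    .fillna("")         # missing values become empty CSV fields
)
print(df.to_csv(index=False))
```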

pudl.load.csv.clean_columns_dump(df, resource_name, datapkg_dir)[source]

Output cleaned data columns to a CSV file.

Ensures that the id column is set appropriately depending on whether the table has a natural primary key or an autoincremented pseudo-key. Ensures that the set of columns in the dataframe to be output is identical to those in the corresponding metadata definition. Transforms integer columns with NA values into strings for dumping, as appropriate.

Parameters
  • df (pandas.DataFrame) – The dataframe containing the data to be written out into CSV for inclusion in a tabular datapackage.

  • resource_name (str) – The exact name of the tabular resource which the DataFrame df is going to be used to populate. This will be used to name the output CSV file, and must match the corresponding stored metadata template.

  • datapkg_dir (path-like) – Path to the datapackage directory that the CSV will be part of. Assumes CSV files get put in a “data” directory within this directory.

Returns

None
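
A hypothetical usage sketch (the table name, columns, and paths are illustrative; in practice the dataframe's columns must match the resource's stored metadata template):

```python
from pathlib import Path

import pandas as pd

from pudl.load.csv import clean_columns_dump

# The datapackage directory is assumed to contain a "data" subdirectory.
datapkg_dir = Path("datapkg/pudl-example")
(datapkg_dir / "data").mkdir(parents=True, exist_ok=True)

plants_entity_eia = pd.DataFrame({
    "plant_id_eia": [3, 117],
    "plant_name_eia": ["Barry", "Sunrise"],
})

# Writes datapkg/pudl-example/data/plants_entity_eia.csv after checking the
# columns against the metadata for the plants_entity_eia resource.
clean_columns_dump(plants_entity_eia, "plants_entity_eia", datapkg_dir)
```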

pudl.load.csv.csv_dump(df, resource_name, keep_index, datapkg_dir)[source]

Write a dataframe to CSV.

Sets pandas.DataFrame.to_csv() arguments appropriately depending on which data source is being written out, and then writes the file. In practice this means appending .csv to the resource name and, if the resource is part of epacems, appending .gz after that.

Parameters
  • df (pandas.DataFrame) – The DataFrame to be dumped to CSV.

  • resource_name (str) – The exact name of the tabular resource which the DataFrame df is going to be used to populate. This will be used to name the output CSV file, and must match the corresponding stored metadata template.

  • keep_index (bool) – If True, use the “id” column of df as the index and include it in the output.

  • datapkg_dir (path-like) – Path to the top level datapackage directory.

Returns

None
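
A hypothetical call, assuming the same datapackage layout as in the example above (the resource name and columns are illustrative):

```python
from pathlib import Path

import pandas as pd

from pudl.load.csv import csv_dump

datapkg_dir = Path("datapkg/pudl-example")
(datapkg_dir / "data").mkdir(parents=True, exist_ok=True)

fuel_ferc1 = pd.DataFrame({
    "record_id": ["f1_fuel_2019_12_1_0_1"],
    "fuel_qty_burned": [12345.0],
})

# Writes datapkg/pudl-example/data/fuel_ferc1.csv. A resource whose name
# indicates it is part of epacems would instead be written out as a gzipped
# <resource_name>.csv.gz file.
csv_dump(fuel_ferc1, "fuel_ferc1", keep_index=False, datapkg_dir=datapkg_dir)
```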

pudl.load.csv.dict_dump(transformed_dfs, data_source, datapkg_dir)[source]

Wrapper for clean_columns_dump that takes a dictionary of DataFrames.

Parameters
  • transformed_dfs (dict) – A dictionary mapping the names of tables from the data source (keys) to normalized DataFrames of values from those tables (values).

  • data_source (str) – The name of the data source we are working with (eia923, ferc1, etc.)

  • datapkg_dir (path-like) – Path to the top level directory for the datapackage these CSV files are part of. Will contain a “data” directory and a datapackage.json file.

Returns

None
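
A hypothetical end-to-end sketch of the load step (table names and columns are illustrative and would need to match the stored metadata in a real run):

```python
from pathlib import Path

import pandas as pd

from pudl.load.csv import dict_dump

datapkg_dir = Path("datapkg/pudl-example")
(datapkg_dir / "data").mkdir(parents=True, exist_ok=True)

# Output of the transform step: table names mapped to normalized DataFrames.
transformed_dfs = {
    "generation_eia923": pd.DataFrame({
        "plant_id_eia": [3, 3],
        "generator_id": ["1", "2"],
        "net_generation_mwh": [10500.0, 9800.0],
    }),
    "boiler_fuel_eia923": pd.DataFrame({
        "plant_id_eia": [3],
        "boiler_id": ["1A"],
        "fuel_consumed_units": [42000.0],
    }),
}

# Writes one CSV per table into datapkg/pudl-example/data/.
dict_dump(transformed_dfs, "eia923", datapkg_dir)
```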