pudl.load module

A module with functions for loading the PUDL database tables.

class pudl.load.BulkCopy(table_name, engine, buffer=1073741824, csvdir='', keep_csv=False)[source]

Bases: contextlib.AbstractContextManager

Accumulate several DataFrames, then COPY FROM python to postgresql.

NOTE: You should use this class to load one table at a time. To load different tables, use different instances of BulkCopy.

Parameters
  • table_name (str) – The exact name of the database table which the DataFrame df is going to be used to populate. It will be used both to look up an SQLAlchemy table object in the PUDLBase metadata object, and to name the CSV file.

  • engine (sqlalchemy.engine) – SQLAlchemy database engine, which will be used to pull the CSV output into the database.

  • buffer (int) – Size of data to accumulate (in bytes) before actually writing the data into postgresql. (Approximate, because we don’t introspect memory usage ‘deeply’). Default 1 GB.

  • csvdir (str) – Path to the directory into which the CSV file should be saved, if it’s being kept.

  • keep_csv (bool) – True if the CSV output should be saved after the data has been loaded into the database. False if they should be deleted. NOTE: If multiple COPYs are done for the same table_name, only the last will be retained by keep_csv, which may be unsatisfying.

Example

>>> with BulkCopy(my_table, my_engine) as p:
...     for df in df_generator:
...         p.add(df)

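If the intermediate CSV output should be kept for inspection, pass keep_csv=True along with a csvdir (the directory path here is illustrative):

>>> with BulkCopy(my_table, my_engine, csvdir='/tmp/pudl_csvs',
...               keep_csv=True) as p:
...     for df in df_generator:
...         p.add(df)
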
add(df)[source]

Add a DataFrame to the accumulated list.

Parameters

df (pandas.DataFrame) – The DataFrame to add to the accumulated list.

Returns

None

close()[source]

Output the accumulated tabular data to disk.

Called automatically when the with block exits: any DataFrames remaining in the buffer are spilled into postgresql, and the intermediate CSV file is deleted unless keep_csv is True.

spill()[source]

Spill the accumulated DataFrames into postgresql.
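
For orientation, here is a minimal sketch of the kind of COPY flow this performs, written directly against SQLAlchemy and psycopg2 rather than PUDL's actual internals (the helper name and its structure are illustrative, not the real implementation):

import io

import pandas as pd
import sqlalchemy as sa


def copy_dataframe(df: pd.DataFrame, table_name: str, engine: sa.engine.Engine):
    # Illustrative only: stream a DataFrame into postgresql via COPY FROM.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    conn = engine.raw_connection()  # underlying DBAPI (psycopg2) connection
    try:
        with conn.cursor() as cur:
            cols = ', '.join(df.columns)
            cur.copy_expert(
                f'COPY {table_name} ({cols}) FROM STDIN WITH CSV', buf
            )
        conn.commit()
    finally:
        conn.close()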

class pudl.load.BulkCopyPkg(table_name, pkg_dir, buffer=1073741824)[source]

Bases: contextlib.AbstractContextManager

Accumulate several DataFrames, then write them out to a datapackage CSV file.

NOTE: You should use this class to load one table at a time. To load different tables, use different instances of BulkCopyPkg.

Parameters
  • table_name (str) – The exact name of the database table which the DataFrame df is going to be used to populate. It will be used both to look up an SQLAlchemy table object in the PUDLBase metadata object, and to name the CSV file.

  • pkg_dir (path-like) – Path to the directory of the datapackage into which the CSV output will be written.

  • buffer (int) – Size of data to accumulate (in bytes) before actually writing the data out to CSV. (Approximate, because we don’t introspect memory usage ‘deeply’). Default 1 GB.

Example

>>> with BulkCopyPkg(my_table, my_pkg_dir) as p:
...     for df in df_generator:
...         p.add(df)

add(df)[source]

Add a DataFrame to the accumulated list.

Parameters

df (pandas.DataFrame) – The DataFrame to add to the accumulated list.

Returns

None

close()[source]

Output the accumulated tabular data to disk.

Called automatically when the with block exits: any DataFrames remaining in the buffer are spilled out to the datapackage CSV.

spill()[source]

Spill the accumulated DataFrames into the datapackage.
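
Conceptually, this ends in a CSV write rather than a database COPY. A minimal sketch, assuming the accumulated frames share a schema and the conventional data/ subdirectory of a tabular datapackage (the function name is illustrative):

import os

import pandas as pd


def spill_to_datapackage(accumulated_dfs, table_name, pkg_dir):
    # Illustrative only: concatenate accumulated frames and write one CSV.
    df = pd.concat(accumulated_dfs, ignore_index=True)
    out_path = os.path.join(pkg_dir, 'data', f'{table_name}.csv')
    df.to_csv(out_path, index=False)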

pudl.load.clean_columns_dump(table_name, pkg_dir, df)[source]

Output the cleaned columns to a CSV file.

Parameters
  • table_name (str) – The name of the database table the cleaned columns belong to; used to name the output CSV file.

  • pkg_dir (path-like) – Path to the directory of the datapackage into which the CSV file will be written.

  • df (pandas.DataFrame) – The DataFrame whose cleaned columns are to be written out.

Returns

None

pudl.load.csv_dump(df, table_name, keep_index, pkg_dir)[source]

Write a DataFrame out to a CSV file and load it into postgresql using COPY FROM.

The fastest way to load a bunch of records is using the database’s native text file copy function. This function dumps a given dataframe out to a CSV file, and then loads it into the specified table using a sqlalchemy wrapper around the postgresql COPY FROM command, called postgres_copy.

Note that this creates an additional in-memory representation of the data, which takes slightly less memory than the DataFrame itself.

Parameters
  • df (pandas.DataFrame) – The DataFrame which is to be dumped to CSV and loaded into the database. All DataFrame columns must have exactly the same names as the database fields they are meant to populate, and all column data types must be directly compatible with the database fields they are meant to populate. Do any cleanup before you call this function.

  • table_name (str) – The exact name of the database table which the DataFrame df is going to be used to populate. It will be used both to look up an SQLAlchemy table object in the PUDLBase metadata object, and to name the CSV file.

  • keep_index (bool) – Should the output CSV file contain an index?

  • pkg_dir (path-like) – Path to the directory into which the CSV file should be saved, if it’s being kept.

Returns

None
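
A hypothetical call (the DataFrame name and package path are illustrative):

>>> csv_dump(my_df, 'plants_entity_eia', keep_index=False,
...          pkg_dir='/path/to/datapackage')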

pudl.load.dict_dump(transformed_dfs, data_source, need_fix_inting={'coalmine_eia923': ('mine_id_msha', 'county_id_fips'), 'fuel_receipts_costs_eia923': ('mine_id_pudl', ), 'generation_fuel_eia923': ('nuclear_unit_id', ), 'generators_eia860': ('turbines_num', ), 'hourly_emissions_epacems': ('facility_id', 'unit_id_epa'), 'plants_eia860': ('utility_id_eia', ), 'plants_entity_eia': ('zip_code', ), 'plants_hydro_ferc1': ('construction_year', 'installation_year'), 'plants_pumped_storage_ferc1': ('construction_year', 'installation_year'), 'plants_small_ferc1': ('construction_year', 'ferc_license_id'), 'plants_steam_ferc1': ('construction_year', 'installation_year')}, pkg_dir='')[source]

Wrapper for _csv_dump for each data source.

Parameters
  • transformed_dfs (dict) – A dictionary mapping the names of tables from a dataset (keys) to normalized DataFrames of values for those tables (values).

  • data_source (str) – The name of the data source we are working with.

  • need_fix_inting (dict) – A dictionary mapping table names (keys) to tuples of columns (values) that should be integers but contain null values needing cleanup; see the sketch below for why these columns need fixing.

  • pkg_dir (path-like) – Path to the directory into which the CSV file should be saved, if it’s being kept.

Returns

None
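
The need_fix_inting cleanup exists because pandas silently coerces integer columns containing nulls to float. PUDL's actual fix may differ; this sketch just shows the underlying issue and one common remedy, the nullable Int64 extension dtype:

import pandas as pd

# A zip_code column with a missing value gets read in as float:
s = pd.Series([2138, None, 60601])
print(s.dtype)            # float64 -- 2138 has become 2138.0

# Casting to the nullable Int64 dtype restores integers and keeps the null:
print(s.astype('Int64'))  # 2138, <NA>, 60601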

pudl.load.dict_dump_load(transformed_dfs, data_source, pudl_engine, need_fix_inting={'coalmine_eia923': ('mine_id_msha', 'county_id_fips'), 'fuel_receipts_costs_eia923': ('mine_id_pudl', ), 'generation_fuel_eia923': ('nuclear_unit_id', ), 'generators_eia860': ('turbines_num', ), 'hourly_emissions_epacems': ('facility_id', 'unit_id_epa'), 'plants_eia860': ('utility_id_eia', ), 'plants_entity_eia': ('zip_code', ), 'plants_hydro_ferc1': ('construction_year', 'installation_year'), 'plants_pumped_storage_ferc1': ('construction_year', 'installation_year'), 'plants_small_ferc1': ('construction_year', 'ferc_license_id'), 'plants_steam_ferc1': ('construction_year', 'installation_year')}, csvdir='', keep_csv=False)[source]

Wrapper for _csv_dump_load for each data source.

Parameters
  • transformed_dfs (dict) – A dictionary mapping the names of tables from a dataset (keys) to normalized DataFrames of values for those tables (values).

  • data_source (str) – The name of the data source we are working with.

  • pudl_engine (sqlalchemy.engine) – SQLAlchemy database engine, which will be used to pull the CSV output into the database.

  • need_fix_inting (dict) – A dictionary mapping table names (keys) to tuples of columns (values) that should be integers but contain null values needing cleanup (see the sketch under dict_dump() above).

  • csvdir (path-like) – Path to the directory into which the CSV file should be saved, if it’s being kept.

  • keep_csv (bool) – True if the CSV output should be saved after the data has been loaded into the database. False if they should be deleted. NOTE: If multiple COPYs are done for the same table_name, only the last will be retained by keep_csv, which may be unsatisfying.

Returns

None
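
A hypothetical call, with illustrative variable and source names:

>>> dict_dump_load(eia923_transformed_dfs, 'eia923', pudl_engine,
...                csvdir='/tmp/pudl_csvs', keep_csv=True)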