pudl.output.export module

Routines for exporting data from PUDL for use elsewhere.

Function names should be indicative of the format of the thing that’s being exported (e.g. CSV, Excel spreadsheets, parquet files, HDF5).

pudl.output.export.annotated_xlsx(df, notes_dict, tags_dict, first_cols, sheet_name, xlsx_writer)[source]

Outputs an annotated spreadsheet workbook based on compiled dataframes.

Creates annotation tab and header rows for EIA 860, EIA 923, and FERC 1 fields in a dataframe. This is done using an Excel Writer object, which must be created and saved outside the function, thereby allowing multiple sheets and associated annotations to be compiled in the same Excel file.

Parameters
  • df (pandas.DataFrame) – The dataframe for which annotations are being created

  • notes_dict (dict) – dictionary with column names as keys and long annotations as values

  • tags_dict (dict) – dictionary of dictionaries with tag categories as keys for outer dictionary and values are dictionaries with column names as keys and values are tag within the tag category

  • first_cols (list) – ordered list of columns that should come first in outfile

  • sheet_name (string) – name of data sheet in output spreadsheet

  • xlsx_writer (pandas.ExcelWriter) – this is an ExcelWriter object used to accumulate multiple tabs, which must be created outside of function, before calling the first time e.g. “xlsx_writer = pd.ExcelWriter(‘outfile.xlsx’)”

Returns

which must be called outside the function, after final use of function, for writing out to excel: “xlsx_writer.save()”

Return type

xlsx_writer (pandas.ExcelWriter)

pudl.output.export.data_package(pkg_tables, pkg_skeleton, pudl_settings, testing=False, dry_run=False)[source]

Create a data package of requested tables and their dependencies.

See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/

Parameters
  • pkg_tables (iterable) – The names of database tables to include. Each one will be converted into a tabular data resource. Dependent tables will also be added to the data package.

  • pkg_skeleton (dict) – A python dictionary containing several top level elements of the data package JSON descriptor specific to the data package, including: * name: pudl-<datasource> e.g. pudl-eia923, pudl-ferc1 * title: One line human readable description. * description: A paragraph long description. * keywords: For search purposes.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • testing (bool) – Connect to the test database or live PUDL database?

  • dry_run (bool) – Should the function validate tables using goodtables? If True, do no validate; if false, validate.

Returns

an object representing the data package, as defined by the datapackage library.

Return type

data_pkg (Package)

Todo

remove upon removal of pudl_db

pudl.output.export.generate_data_packages(package_settings, pudl_settings, debug=False)[source]

Coordinates the generation of data packages.

For each bundle of packages laid out in the package_settings, this function generates data packages. First, the settings are validated (which runs through each of the settings listed in the package_settings). Then for each of the packages, run through the etl (extract, transform, load) functions, which generates CSVs. Then the metadata for the packages is generated by pulling from the metadata (which is a json file containing the schema for all of the possible pudl tables).

Parameters
  • package_settings (iterable) – a list of dictionaries. Each item in the list corresponds to a data package. Each data package’s dictionary contains the arguements for its ETL function.

  • pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.

  • debug (bool) – If True, return a dictionary with package names (keys) and a list with the data package metadata and report (values).

Returns

A tuple containing generated metadata for the packages laid out in the package_settings.

Return type

tuple

pudl.output.export.generate_metadata(pkg_settings, tables, pkg_dir, uuid_pkgs=UUID('c432ff84-488f-4555-a9c1-d0f28726b57a'))[source]

Generate metadata for package tables and validate package.

The metadata for this package is compiled from the pkg_settings and from the “megadata”, which is a json file containing the schema for all of the possible pudl tables. Given a set of tables, this function compiles metadata and validates the metadata and the package. This function assumes datapackage CSVs have already been generated.

See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/

Parameters
  • pkg_settings (dict) – a dictionary containing package settings containing top level elements of the data package JSON descriptor specific to the data package including: * name: short package name e.g. pudl-eia923, ferc1-test, cems_pkg * title: One line human readable description. * description: A paragraph long description. * keywords: For search purposes.

  • tables (list) – a list of tables that are included in this data package.

  • pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.

  • uuid_pkgs

Todo

Return to (uuid_pkgs)

Returns

a datapackage. See frictionlessdata specs. dict: a valition dictionary containing validity of package and any errors that were generated during packaing.

Return type

datapackage.package.Package

pudl.output.export.get_fields(table)[source]

Generate table schema compatible list of fields from database table.

See: https://frictionlessdata.io/specs/table-schema/

Field attributes which are currently set by the function: * name (same as the database column) * description (taken from the database column ‘comment’ field.) * type (simplified from the SQL Alchemy Column data type) * constraints (only for Enum types)

Todo

constraints other than Enum

Parameters

table (SQL Alchemy Table) – The Table object to generate fields from.

Returns

A list of ‘field’ JSON objects, conforming to the Frictionless Data Table Schema standard.

Return type

list

Todo

Remove upon removal of pudl_db

pudl.output.export.get_foreign_keys(table)[source]

Gets a list of foreignKey objects from an SQLAlchemy Table.

Parameters

table (SQL Alchemy Table) – The Table object to generate a list of missing values from.

Returns

A list of foreignKey object based on the selected SQLAlchemy Table.

Todo

Remove upon removal of pudl_db

pudl.output.export.get_missing_values(table)[source]

Get a list of missing values from an SQLAlchemy Table.

We’ll only really be able to see how this works with some data. For now it just returns the default value: [“”].

Parameters

table (SQL Alchemy Table) – The Table object to generate a list of missing values from.

Returns

a list containing the default value “”

Return type

list

Todo

Remove upon removal of pudl_db

pudl.output.export.get_primary_key(table)[source]

Creates a primaryKey object based on an SQLAlchemy Table.

Parameters

table (SQL Alchemy Table) – The Table object to generate fields from.

Returns

A primaryKey object based on the selected SQLAlchemy Table.

Todo

Remove upon removal of pudl_db

pudl.output.export.get_table(tablename, testing=False)[source]

Retrieves SQLAlchemy Table object corresponding to a PUDL DB table name.

Parameters
  • tablename (str) – the name of the PUDL database table to retrieve

  • testing (bool) – Use the test database (True) or the live database (False)?

Returns

The SQLAlchemy Table object corresponding to the PUDL database tables name selected.

Todo

remove upon removal of pudl_db

pudl.output.export.get_table_schema(table)[source]

Create a Table Schema descriptor from an SQL Alchemy table.

See: https://frictionlessdata.io/specs/table-schema/

There are four possible elements in the Table Schema: * fields (an array of field descriptors) * primaryKey * foreignKeys (an array of foreignKey objects) * missingValues (an array of strings to be interpreted as null)

Parameters

table (SQL Alchemy Table) – The Table object to generate a list of missing values from.

Returns

a dictionary containing the fields, primary keys, foreign keys, and missing values of the table schema

Return type

dict

Todo

Remove upon removal of pudl_db

pudl.output.export.get_tabular_data_resource(table_name, pkg_dir)[source]

Creates a Tabular Data Resource descriptor for a PUDL table.

Based on the information in the database, and some additional metadata this function will generate a valid Tabular Data Resource descriptor, according to the Frictionless Data specification, which can be found here: https://frictionlessdata.io/specs/tabular-data-resource/

Parameters
  • table_name (string) – table name for which you want to generate a Tabular Data Resource descriptor

  • pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.

Returns

A JSON object containing key information about the selected table

Return type

Tabular Data Resource descriptor

pudl.output.export.get_tabular_data_resource_og(tablename, pkg_dir, testing=False)[source]

Creates a Tabular Data Resource descriptor for a PUDL DB table.

Parameters
  • tablename (str) – the name of the PUDL database table to retrieve

  • pkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapackage_dir directory, with the name of the package as the name of the subdirectory.

  • testing (bool) – Use the test database (True) or the live database (False)?

Based on the information in the database, and some additional metadata, stored elsewhere (Where?!?!) this function will generate a valid Tabular Data Resource descriptor, according to the Frictionless Data specification, which can be found here:

https://frictionlessdata.io/specs/tabular-data-resource/

Returns: a Tabular Data Resource descriptor describing the contents of the selected table

Todo

remove upon removal of pudl_db

pudl.output.export.hash_csv(csv_path)[source]

Calculates a SHA-i256 hash of the CSV file for data integrity checking.

Parameters

csv_path (path-like) – Path the CSV file to hash.

Returns

the hexdigest of the hash, with a ‘sha256:’ prefix.

Return type

str

pudl.output.export.simplify_sql_type(sql_type, field_name='')[source]

Convert an SQL Alchemy Type into a string type for use in Table Schema.

See: https://frictionlessdata.io/specs/table-schema/

Parameters
  • sql_type (sqlalchemy.sql.sqltypes.type instance) – The type associated with the column being processed, as extracted from a MetaData object reflecting the database being packaged. Should be taken from the list of Column objects associated with a Table object.

  • field_name (string, optional) – The name of the field, which may offer more context as to the nature of the field (e.g. an integer field whose name ends in _year is a year).

Returns

A string representing a simple data type, allowed in the Table Schema standard.

Return type

string

Todo

Remove upon removal of pudl_db

pudl.output.export.test_file_consistency(pkg_name, tables, out_dir)[source]

Tests the consistency of tables for packaging.

The purpose of this function is to test that we have the correct list of tables. There are three different ways we could determine which tables are being dumped into packages: a list of the tables being generated through the ETL functions, the list of dependent tables and the list of CSVs in package directory.

Currently, this function is supposed to be fed the ETL function tables which are tested against the CSVs present in the package directory.

Parameters
  • pkg_name (string) – the name of the data package.

  • tables (list) – a list of table names to be tested.

  • out_dir (path-like) – the directory in which to check the consistency of table files

Raises

AssertionError – If the tables in the CSVs and the ETL tables are not exactly the same list of tables.

Todo

Determine what to do with the dependent tables check.