pudl.dfc module

Implemenation of DataFrameCollection.

Pudl ETL needs to exchange collections of named tables (pandas.DataFrame) between ETL tasks and the volume of data contained in these tables can far exceed the memory of a single machine.

Prefect framework currently caches task results in-memory and this can lead to out of memory problem, especially when dealing with large datasets (e.g. during the full data release). To alleviate this problem, prefect team recommends passing “references” to actual data that is stored separately.

DataFrameCollection does just this. It keeps lightweight references to named data frames and stores the data either locally or on cloud storage (we use pandas.to_pickle method which supports these various storage backends out of the box).

Think of DataFrameCollection as a dict-like structure backed by a disk.

class pudl.dfc.DataFrameCollection(storage_path: Optional[str] = None, **data_frames: Dict[str, pandas.core.frame.DataFrame])[source]

Bases: object

This class can hold named pandas.DataFrame that are stored on disk or GCS.

This should be used whenever dictionaries of named pandas.DataFrames are passed between prefect tasks. Due to the implicit in-memory caching of task results it is important to keep the in-memory footprint of the exchanged data small.

This wrapper achieves this by maintaining references to tables that themselves are stored on a persistent medium such as local disk of GCS bucket.

This is intended to be used from within prefect flows and new instances can be configured by setting relevant prefect.context variables.

add_reference(name: str, table_id: uuid.UUID)[source]

Adds reference to a named dataframe to this collection.

This assumes that the data is already present on disk.

static from_dict(d: Dict[str, pandas.core.frame.DataFrame])[source]

Constructs new DataFrameCollection from dataframe dictionary.

get(name: str)pandas.core.frame.DataFrame[source]

Returns the content of the named dataframe.

get_table_names()List[str][source]

Returns sorted list of dataframes that are contained in this collection.

items()Iterator[Tuple[str, pandas.core.frame.DataFrame]][source]

Iterates over table names and the corresponding pd.DataFrame objects.

references()Iterator[Tuple[str, uuid.UUID]][source]

Returns a set-like object with (name, table_id) tuples.

store(name: str, data: pandas.core.frame.DataFrame)[source]

Adds named dataframe to collection and stores its contents on disk.

to_dict()Dict[str, pandas.core.frame.DataFrame][source]

Loads the entire collection to memory as a dictionary.

union(*others)[source]

Returns new DataFrameCollection that is union of self and others.

update(other)[source]

Adds references to tables from the other DataFrameCollection.

exception pudl.dfc.TableExistsError[source]

Bases: Exception

The table already exists.

Either the table already exists in the DataFrameCollection when it is added or the file containing the serialized form is found on disk.