pudl.dfc module¶
Implemenation of DataFrameCollection.
Pudl ETL needs to exchange collections of named tables (pandas.DataFrame) between ETL tasks and the volume of data contained in these tables can far exceed the memory of a single machine.
Prefect framework currently caches task results in-memory and this can lead to out of memory problem, especially when dealing with large datasets (e.g. during the full data release). To alleviate this problem, prefect team recommends passing “references” to actual data that is stored separately.
DataFrameCollection does just this. It keeps lightweight references to named data frames and stores the data either locally or on cloud storage (we use pandas.to_pickle method which supports these various storage backends out of the box).
Think of DataFrameCollection as a dict-like structure backed by a disk.
-
class
pudl.dfc.
DataFrameCollection
(storage_path: Optional[str] = None, **data_frames: Dict[str, pandas.core.frame.DataFrame])[source]¶ Bases:
object
This class can hold named pandas.DataFrame that are stored on disk or GCS.
This should be used whenever dictionaries of named pandas.DataFrames are passed between prefect tasks. Due to the implicit in-memory caching of task results it is important to keep the in-memory footprint of the exchanged data small.
This wrapper achieves this by maintaining references to tables that themselves are stored on a persistent medium such as local disk of GCS bucket.
This is intended to be used from within prefect flows and new instances can be configured by setting relevant prefect.context variables.
-
add_reference
(name: str, table_id: uuid.UUID)[source]¶ Adds reference to a named dataframe to this collection.
This assumes that the data is already present on disk.
-
static
from_dict
(d: Dict[str, pandas.core.frame.DataFrame])[source]¶ Constructs new DataFrameCollection from dataframe dictionary.
-
get_table_names
() → List[str][source]¶ Returns sorted list of dataframes that are contained in this collection.
-
items
() → Iterator[Tuple[str, pandas.core.frame.DataFrame]][source]¶ Iterates over table names and the corresponding pd.DataFrame objects.
-
references
() → Iterator[Tuple[str, uuid.UUID]][source]¶ Returns a set-like object with (name, table_id) tuples.
-
store
(name: str, data: pandas.core.frame.DataFrame)[source]¶ Adds named dataframe to collection and stores its contents on disk.
-