pudl.extract.sec10k

Load pre-processed SEC 10-K assets from Google Cloud Storage.

These “raw” tables are generated by the SEC 10-K data extraction pipeline, which lives in this repository: https://github.com/catalyst-cooperative/mozilla-sec-eia

Upstream data is not partitioned by year. To make it possible to extract a subset of the data for testing, the Sec10kSettings specify which years to extract, and the extracted data is filtered to those years before it is returned.

Attributes

Functions

extract(→ pandas.DataFrame)

Extract SEC 10-K data from the datastore.

raw_sec10k_asset_factory(→ dagster.AssetsDefinition)

An asset factory for extracting SEC 10-K data by table.

Module Contents

pudl.extract.sec10k.extract(ds: pudl.workspace.datastore.Datastore, table: str, years: list[int]) → pandas.DataFrame

Extract SEC 10-K data from the datastore.

Allows filtering by year so the pipeline can be tested with a smaller amount of data, like a pseudo-partition. This is necessary because the SEC 10-K data is not partitioned upstream.

Parameters:
  • ds – Initialized PUDL datastore.

  • table – Name of the table to extract; must be one of the valid tables.

  • years – Which years of data to include in the output.

Returns:

A dataframe containing the SEC 10-K data.
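The year-based pseudo-partition described above can be sketched with plain pandas. This is a minimal illustration and not the module's actual implementation: the helper name `filter_by_years`, the column name `report_year`, and the sample rows are all assumptions made for the example.

```python
import pandas as pd


def filter_by_years(
    df: pd.DataFrame, years: list[int], year_col: str = "report_year"
) -> pd.DataFrame:
    """Keep only the rows whose year column is in the requested years.

    ``year_col`` is a hypothetical column name; the real tables may store
    the filing period differently.
    """
    mask = df[year_col].isin(years)
    return df[mask].reset_index(drop=True)


# Made-up rows standing in for an unpartitioned upstream table.
raw = pd.DataFrame(
    {
        "report_year": [2020, 2021, 2022, 2023],
        "company": ["a", "b", "c", "d"],
    }
)

# Request only two years, as a test configuration might.
subset = filter_by_years(raw, years=[2021, 2023])
```

Filtering after loading means the full table is still read before it is reduced, which is the trade-off of a pseudo-partition compared with data that is partitioned at the source.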

pudl.extract.sec10k.raw_sec10k_asset_factory(table) → dagster.AssetsDefinition

An asset factory for extracting SEC 10-K data by table.
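The general shape of a per-table factory can be sketched without Dagster. This is a plain-Python analogue under stated assumptions, not the real function: the actual factory returns a `dagster.AssetsDefinition`, whereas the hypothetical `make_table_extractor` below returns an ordinary callable, and the table names are placeholders.

```python
from typing import Callable


def make_table_extractor(table: str) -> Callable[[list[int]], dict]:
    """Build one extraction callable bound to a single table name."""

    def _extract(years: list[int]) -> dict:
        # Stand-in for the real extraction: only records what was requested.
        return {"table": table, "years": sorted(years)}

    # Give each generated callable a distinct, table-specific name.
    _extract.__name__ = f"extract_{table}"
    return _extract


# Placeholder table names; the real set of valid tables is not listed here.
TABLES = ["table_a", "table_b"]

# One callable per table, generated from the same factory.
extractors = {table: make_table_extractor(table) for table in TABLES}
```

The benefit of the pattern is that adding a table means adding one name to a list instead of writing another near-identical function.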

pudl.extract.sec10k.raw_sec10k_assets