pudl.transform.sec10k

Transformations and standardizations for SEC 10-K data tables.

This module contains routines for reshaping, cleaning, and standardizing the raw SEC 10-K data into normalized core tables. These core tables serve as the building blocks for denormalized output tables.

Attributes

Functions

_year_quarter_to_date(→ pandas.Series)

Convert a year quarter in the format '2024q1' to date type.

_simplify_filename_sec10k(→ pandas.Series)

Strip non-unique path and file type extension from partial SEC 10-K filenames.

_compute_fraction_owned(→ pandas.Series)

Clean percent ownership, convert to float, then convert percent to ratio.

_standardize_taxpayer_id_irs(→ pandas.Series)

Standardize the IRS taxpayer ID number to NN-NNNNNNN format.

_standardize_industrial_classification(→ pandas.DataFrame)

Split industry names and codes into separate columns.

_pivot_info_block(→ pandas.DataFrame)

Select and pivot distinct blocks of company information for further processing.

core_sec10k__quarterly_filings(→ pandas.DataFrame)

Standardize the contents of the SEC 10-K filings table.

core_sec10k__company_info(...)

Reshape data from the raw SEC 10-K company information table.

core_sec10k__quarterly_company_information(...)

Clean and standardize the contents of the SEC 10-K company information table.

core_sec10k__changelog_company_name(→ pandas.DataFrame)

Clean and standardize the SEC 10-K company name changelog table.

core_sec10k__parents_and_subsidiaries(→ pandas.DataFrame)

Standardize the contents of the SEC 10-K parents and subsidiaries table.

core_sec10k__quarterly_exhibit_21_company_ownership(...)

Standardize the contents of the SEC 10-K exhibit 21 company ownership table.

core_sec10k__assn_sec10k_filers_and_eia_utilities(...)

Create an association table between SEC 10-K companies and EIA utilities.

Module Contents

pudl.transform.sec10k.logger[source]
pudl.transform.sec10k._year_quarter_to_date(year_quarter: pandas.Series) pandas.Series[source]

Convert a year quarter in the format ‘2024q1’ to date type.

pudl.transform.sec10k._simplify_filename_sec10k(filename_sec10k: pandas.Series) pandas.Series[source]

Strip non-unique path and file type extension from partial SEC 10-K filenames.

The full source URL can be constructed by prepending https://www.sec.gov/Archives/edgar/data/ and adding back the “.txt” file type extension, as is done for the source_url column in the output tables.

pudl.transform.sec10k._compute_fraction_owned(percent_ownership: pandas.Series) pandas.Series[source]

Clean percent ownership, convert to float, then convert percent to ratio.

pudl.transform.sec10k._standardize_taxpayer_id_irs(taxpayer_id_irs: pandas.Series) pandas.Series[source]

Standardize the IRS taxpayer ID number to NN-NNNNNNN format.

  • Remove all non-numeric characters

  • Set all values that are not 9 digits long or are entirely zeroes to pd.NA

  • Reformat to NN-NNNNNNN format.

pudl.transform.sec10k._standardize_industrial_classification(sic: pandas.Series) pandas.DataFrame[source]

Split industry names and codes into separate columns.

Most values take the form of “Industry description [1234]”, but some are just 4-digit numbers. This function splits the industry name and code into separate columns, and fills in the code column with the value from the standard_industrial_classification column if it is a 4-digit number. Any values that don’t fit either of these two patterns are set to NA. No effort is made to fill in missing industry descriptions based on the industry code.

See e.g. https://www.osha.gov/data/sic-manual for code definitions.”

pudl.transform.sec10k._pivot_info_block(df: pandas.DataFrame, block: str) pandas.DataFrame[source]

Select and pivot distinct blocks of company information for further processing.

Parameters:
  • df – The dataframe containing the SEC 10-K company information.

  • block – The block of associated information to pivot.

pudl.transform.sec10k.core_sec10k__quarterly_filings(raw_sec10k__quarterly_filings: pandas.DataFrame) pandas.DataFrame[source]

Standardize the contents of the SEC 10-K filings table.

pudl.transform.sec10k.core_sec10k__company_info(raw_sec10k__quarterly_company_information: pandas.DataFrame)[source]

Reshape data from the raw SEC 10-K company information table.

Each SEC 10-K filing contains a header block that lists information the filer, and potentially a number of other related companies. Each company level block of information may include the company’s name, physical address, some details about how it files the SEC 10-K, and some other company attributes like taxpayer ID number, industrial classification, etc. It may also include a mailing address that is distinct from the physical address, and the history of names that the company has had over time.

In its original form, this data is mostly contained in two poorly normalized “key” and “value” columns, with the key indicating what kind of data will be contained in the value column. Here we pivot the keys into column names so that each column has a homogeneous type of value.

Each filing may contain information about multiple companies, with each identified by a different filer_count. For each company there are several different potential types of information provide in distinct named “blocks” (business_address, company_data, filing_values, mail_address, and former_company). Each block type needs to be extracted and pivoted independently to avoid column name collisions and because they don’t all have the same set of block_count and filer_count values.

After they’ve been pivoted, the blocks are merged together into two disinct tables:

  • per-filing company and filing information

  • the history of company name changes

This information is used elsewhere for matching SEC 10-K filers to EIA utilities and for identifying the companies that are mentioned in the unstructured Exhibit 21 attachements.

There is a 1-to-1 correspondence between the company_data and filing_values blocks, and virtually all business_address blocks (99.9%), but a substantial number (~5%) of mail_address blocks do not share index values with any company_data or filing_values blocks, and are ultimately dropped. The former_company blocks contain a large amount of duplicative data, but can be associated with a central_index_key value for later association with an individual SEC 10-K filer.

Further processing that standardizes the contents of these tables is deferred to separate downstream core assets.

pudl.transform.sec10k.core_sec10k__quarterly_company_information(_core_sec10k__quarterly_company_information: pandas.DataFrame) pandas.DataFrame[source]

Clean and standardize the contents of the SEC 10-K company information table.

pudl.transform.sec10k.core_sec10k__changelog_company_name(_core_sec10k__changelog_company_name: pandas.DataFrame) pandas.DataFrame[source]

Clean and standardize the SEC 10-K company name changelog table.

Every filing that references a company independently reports the history of its name changes, so after combining them we end up with many copies of the name change histories for some companies. We deduplicate the combined history and keep only a single instance of each reported name change.

There are a handful of cases in which the same name change events seem to have been reported differently in different filings (slightly different company names or dates) which make the results here imperfect, but for the most part it works well.

pudl.transform.sec10k.core_sec10k__parents_and_subsidiaries(raw_sec10k__parents_and_subsidiaries: pandas.DataFrame) pandas.DataFrame[source]

Standardize the contents of the SEC 10-K parents and subsidiaries table.

pudl.transform.sec10k.core_sec10k__quarterly_exhibit_21_company_ownership(raw_sec10k__exhibit_21_company_ownership: pandas.DataFrame) pandas.DataFrame[source]

Standardize the contents of the SEC 10-K exhibit 21 company ownership table.

pudl.transform.sec10k.core_sec10k__assn_sec10k_filers_and_eia_utilities(core_sec10k__parents_and_subsidiaries: pandas.DataFrame) pandas.DataFrame[source]

Create an association table between SEC 10-K companies and EIA utilities.

We are retroactively extracting this simple association table (crosswalk) from the un-normalized parent/subsidiary relationships handed off from the SEC process upstream so it is available as an easy to understand core table.

Because we accept only the single, highest probability match from the Splink record linkage process, we expect a 1-to-1 relationship between these IDs. Note that most companies in this table have no CIK because most records come from the unstructured Exhibit 21 data, and most that do have a CIK have no EIA Utility ID because most companies are not utilities, and most utilities are still unmatched.