The Public Utility Data Liberation Project¶
What is PUDL?¶
The PUDL Project is an open source data processing pipeline that makes US energy data easier to access and use programmatically.
Hundreds of gigabytes of valuable data are published by US government agencies, but it’s often difficult to work with. PUDL takes the original spreadsheets, CSV files, and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation.
What data is available?¶
PUDL currently integrates data from:
EIA Form 860 (2004-2019)
EIA Form 860m (2020-2021)
EIA Form 861 (2001-2019)
EIA Form 923 (2001-2019)
EPA Continuous Emissions Monitoring System (CEMS) (1995-2020)
FERC Form 1 (1994-2019)
FERC Form 714 (2006-2019)
Thanks to support from the Alfred P. Sloan Foundation Energy & Environment Program, from 2021 to 2023 we will be integrating the following data as well:
EIA Form 176 (The Annual Report of Natural and Supplemental Gas Supply and Disposition)
FERC Form 2 (Annual Report of Major Natural Gas Companies)
Machine Readable Specifications of State Clean Energy Standards
Who is PUDL for?¶
The project is focused on serving researchers, activists, journalists, policy makers, and small businesses that might not otherwise be able to afford access to this data from commercial sources and who may not have the time or expertise to do all the data processing themselves from scratch.
We want to make this data accessible and easy to work with for as wide an audience as possible: anyone from grassroots youth climate organizers working with Google Sheets to university researchers with access to scalable cloud computing resources, and everyone in between!
How do I access the data?¶
There are four main ways to access PUDL outputs. For more details you’ll want to check out the complete documentation, but here’s a quick overview:
Datasette¶
We publish a lot of the data on https://data.catalyst.coop using a tool called Datasette that lets us wrap our databases in a relatively friendly web interface. You can browse and query the data, make simple charts and maps, and download portions of the data as CSV files or JSON so you can work with it locally. For a quick introduction to what you can do with the Datasette interface, check out this 17 minute video.
This access mode is good for casual data explorers or anyone who just wants to grab a small subset of the data. It also lets you share links to a particular subset of the data and provides a REST API for querying the data from other applications.
Docker + Jupyter¶
Want access to all the published data in bulk? If you’re familiar with Python and Jupyter Notebooks and are willing to install Docker you can:
Download a PUDL data release from CERN's Zenodo archiving service.
Run the archived Docker image using docker-compose up.
Access the data via the resulting Jupyter Notebook server running on your machine.
If you’d rather work with the PUDL SQLite Databases and Apache Parquet files directly, they are accessible within the same Zenodo archive.
The PUDL Examples repository has more detailed instructions on how to work with the Zenodo data archive and Docker image.
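For example, once you've downloaded and unpacked the Zenodo archive, you can peek at the SQLite database directly. This is a minimal sketch: the pudl.sqlite file name and the plants_entity_eia table name are taken from elsewhere in these docs, but check the archive contents for the exact paths.

import sqlite3
import pandas as pd

# Open the PUDL SQLite DB from the unpacked archive and pull a few rows.
conn = sqlite3.connect("pudl.sqlite")
plants = pd.read_sql("SELECT * FROM plants_entity_eia LIMIT 10;", conn)
print(plants)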
JupyterHub¶
Do you want to use Python and Jupyter Notebooks to access the data but aren’t comfortable setting up Docker? We are working with 2i2c to host a JupyterHub that has the same software and data as the Docker container and Zenodo archive mentioned above, but running in the cloud.
Note: you’ll only have 4-6GB of RAM and 1 CPU to work with on the JupyterHub, so if you need more computing power, you may need to set PUDL up on your own computer. Eventually we hope to offer scalable computing resources on the JupyterHub as well.
The PUDL Development Environment¶
If you’re more familiar with the Python data science stack and are comfortable working
with git, conda
environments, and the Unix command line, then you can set up the
whole PUDL Development Environment on your own computer. This will allow you to run the
full data processing pipeline yourself, tweak the underlying source code, and (we hope!)
make contributions back to the project.
This is by far the most involved way to access the data and isn’t recommended for most users. You should check out the Development section of the main PUDL documentation for more details.
Contributing to PUDL¶
Find PUDL useful? Want to help make it better? There are lots of ways to help!
First, be sure to read our Code of Conduct.
You can file a bug report, make a feature request, or ask questions in the GitHub issue tracker.
Feel free to fork the project and make a pull request with new code, better documentation, or example notebooks.
Make a recurring financial contribution to support our work liberating public energy data.
Hire us to do some custom analysis and allow us to integrate the resulting code into PUDL.
For more information, check out the Contributing section of the PUDL Documentation.
Licensing¶
In general, our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we’ve done.
The PUDL software is released under the MIT License.
The PUDL data and documentation are published under the Creative Commons Attribution License v4.0 (CC-BY-4.0).
Contact Us¶
For user support, bug reports and anything else that could be useful or interesting to other users, please make a GitHub issue.
For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop
If you’d like to get occasional updates about the project sign up for our email list.
Follow us on Twitter: @CatalystCoop
More info on our website: https://catalyst.coop
About Catalyst Cooperative¶
Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
Introduction¶
PUDL is a data processing pipeline created by Catalyst Cooperative that cleans, integrates, and standardizes some of the most widely used public energy datasets in the US. The data serve researchers, activists, journalists, and policy makers who might not have the technical expertise to access it in its raw form, the time to clean and prepare the data for bulk analysis, or the means to purchase it from existing commercial providers.
Available Data¶
Currently, PUDL has cleaned and integrated data from:
EIA Form 860 (including EIA 860m)
EIA Form 861 (preliminary)
FERC Form 714 (preliminary)
In addition, we distribute a SQLite database containing all available years of the raw FERC Form 1 data, and a SQLite version of the US Census DP1 geodatabase.
To get started using PUDL data, visit our Data Access page, or continue reading to learn more about the PUDL data processing pipeline.
Raw Data Archives¶
PUDL depends on “raw” data inputs from sources that are known to occasionally update their data or alter the published format. These changes may be incompatible with the way the data are read and interpreted by PUDL, so, to ensure the integrity of our data processing, we periodically create archives of the raw inputs on Zenodo. Each of the data inputs may have several different versions archived, and all are assigned a unique DOI and made available through the REST API. Each release of the PUDL Python package is embedded with a set of DOIs to indicate which version of the raw inputs it is meant to process. This process helps ensure that our outputs are replicable.
To enable programmatic access to individual partitions of the data (by year, state, etc.), we archive the raw inputs as Frictionless Data Packages. The data packages contain both the raw data in their originally published format (CSVs, Excel spreadsheets, and Visual FoxPro database (DBF) files) and metadata describing how each dataset is partitioned.
The PUDL software will download a copy of the appropriate raw inputs automatically as needed and organize them in a local datastore.
See also
The software that creates and archives the raw inputs can be found in our PUDL Scrapers and PUDL Zenodo Storage repositories on GitHub.
The ETL Process¶
The core of PUDL’s work takes place in the ETL (Extract, Transform, and Load) process.
Extract¶
The Extract step reads the raw data from the original heterogeneous formats into a collection of pandas.DataFrame objects with uniform column names across all years, so that they can be easily processed in bulk. Data distributed as binary database files, such as the DBF files from FERC Form 1, may be converted into a unified SQLite database before individual dataframes are created.
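As a rough illustration of the idea (this is not PUDL's actual extraction code, and the column maps are invented for the example):

import pandas as pd

# Map each year's raw column names onto one uniform schema so that all
# years can be concatenated and processed together.
COLUMN_MAPS = {
    2018: {"Plant ID": "plant_id_eia", "Nameplate Capacity (MW)": "capacity_mw"},
    2019: {"Plant Code": "plant_id_eia", "Nameplate Capacity (MW)": "capacity_mw"},
}

def extract_years(paths_by_year):
    """Read one spreadsheet per year and rename columns to a shared schema."""
    frames = []
    for year, path in paths_by_year.items():
        df = pd.read_excel(path).rename(columns=COLUMN_MAPS[year])
        df["report_year"] = year
        frames.append(df)
    return pd.concat(frames, ignore_index=True)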
See also
Module documentation within the pudl.extract subpackage.
Transform¶
The Transform step is generally broken down into two phases. Phase one focuses on cleaning and organizing data within individual tables, while phase two focuses on the integration and deduplication of data between tables. These tasks are tedious data wrangling toil that imposes a huge amount of overhead on anyone trying to do analysis based on the publicly available data. PUDL implements common data cleaning operations in the hope that we can all work on more interesting problems most of the time. These operations, a few of which are sketched in the example following this list, include:
Standardization of units (e.g. dollars not thousands of dollars)
Standardization of N/A values
Standardization of freeform names and IDs
Use of controlled vocabularies for categorical values like fuel type
Use of more readable codes and column names
Imposition of well defined, rich data types for each column
Converting local timestamps to UTC
Reshaping of data into well normalized tables which minimize data duplication
Inferring Plant IDs which link records across many years of FERC Form 1 data
Inferring linkages between FERC and EIA Plants and Utilities.
Inferring more complete associations between EIA boilers and generators
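To make a few of these concrete, here is a highly simplified sketch of the kinds of per-table cleaning steps listed above. The column names and code mappings are invented for illustration; the real logic lives in the pudl.transform sub-package.

import numpy as np
import pandas as pd

def clean_fuel_table(df):
    # Standardize units: thousands of dollars -> dollars.
    df["fuel_cost_usd"] = df["fuel_cost_thousands_usd"] * 1000.0
    # Standardize N/A values: treat sentinel strings as missing.
    df = df.replace({"": np.nan, ".": np.nan, "N/A": np.nan})
    # Controlled vocabulary for categorical values like fuel type.
    fuel_map = {"BIT": "coal", "SUB": "coal", "NG": "gas", "DFO": "oil"}
    df["fuel_type_code_pudl"] = df["fuel_type_code"].map(fuel_map)
    # Impose well defined, rich data types for each column.
    return df.astype({"plant_id_eia": "Int64", "fuel_type_code_pudl": "category"})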
See also
The module and per-table transform functions in the pudl.transform sub-package have more details on the specific transformations applied to each table.
Many of the original datasets contain large amounts of duplicated data. For instance, the EIA reports the name of each power plant in every table that refers to otherwise unique plant-related data. Similarly, many attributes like plant latitude and longitude are reported separately every year. Often, these reported values are not self-consistent. There may be several different spellings of a plant’s name, or an incorrectly reported latitude in one year.
The transform step attempts to eliminate this kind of inconsistent and duplicate information when normalizing the tables by choosing only the most consistently reported value for inclusion in the final database. If a value which should be static is not consistently reported, it may also be set to N/A.
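A toy version of that "most consistently reported value" idea might look like the following sketch. The function and the consistency threshold are illustrative, not PUDL's actual implementation.

import pandas as pd

def most_consistent(values, min_share=0.7):
    """Return the most common value, or NA if reporting is too inconsistent."""
    counts = values.dropna().value_counts()
    if counts.empty or counts.iloc[0] / counts.sum() < min_share:
        return pd.NA
    return counts.index[0]

# e.g. collapse many years of reported latitudes to one value per plant:
# plant_lat = df.groupby("plant_id_eia")["latitude"].agg(most_consistent)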
See also
Tidy Data by Hadley Wickham, Journal of Statistical Software (2014).
A Simple Guide to the Five Normal Forms in Relational Database Theory by William Kent, Communications of the ACM (1983).
Load¶
At the end of the Transform step, we have collections of DataFrames that correspond to database tables. These are written out to (“loaded” into) platform independent tabular data packages where the data is stored as CSV files and the metadata is stored as JSON. These static, text-based output formats are archive-friendly and can be used to populate a database or read with Python, R, and many other tools. See the PUDL Data Dictionary page for a list of the normalized database tables and their contents.
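Conceptually, the load step boils down to something like this sketch, which writes directly to SQLite rather than going through the data package tooling:

import sqlite3
import pandas as pd

def load_to_sqlite(dfs, db_path="pudl.sqlite"):
    """Write each transformed DataFrame out as a table in a SQLite DB."""
    with sqlite3.connect(db_path) as conn:
        for name, df in dfs.items():
            df.to_sql(name, conn, if_exists="replace", index=False)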
Note
Starting with v0.5.0 of PUDL, we will begin generating SQLite database and Apache Parquet file outputs directly and using those formats to distribute the processed data.
See also
Module documentation within the pudl.load sub-package.
Database & Output Tables¶
Tabular Data Packages are archive friendly and platform independent, but, given the size and complexity of the data within PUDL, this format isn't ideal for day to day interactive use. In practice, we take the clean, processed data in the data packages and use it to populate a local SQLite database. To handle the ~1 billion row EPA CEMS hourly time series, we convert the data package into an Apache Parquet dataset that is partitioned by state and year. For more details on these conversions to SQLite and Parquet formats, see Data Packages.
Denormalized Outputs¶
We normalized the data to make storage more efficient and avoid data integrity issues, but you may want to combine information from more than one of the tables to make the data more readable and readily interpretable. For example, PUDL stores the name that EIA uses to refer to a power plant in the plants_entity_eia table in association with the plant's unique numeric ID. If you are working with data from the fuel_receipts_costs_eia923 table, which records monthly per-plant fuel deliveries, you may want to have the name of the plant alongside the fuel delivery information, since it's more recognizable than the plant ID.
Rather than requiring everyone to write their own SQL SELECT and JOIN statements or do a bunch of pandas.merge() operations to bring together data, PUDL provides a variety of predefined queries as methods of the pudl.output.pudltabl.PudlTabl class. These methods perform common joins to return output tables (pandas DataFrames) that contain all of the useful information in one place. In some cases, like with EIA, the output tables are composed to closely resemble the raw spreadsheet tables you're familiar with.
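For example, here's a minimal sketch of that workflow. The database path is illustrative, and the frc_eia923 method name is an assumption based on the PUDL output module; check the API docs for the current interface.

import sqlalchemy as sa
import pudl

# Connect to a local PUDL SQLite DB (path is illustrative).
pudl_engine = sa.create_engine("sqlite:///pudl.sqlite")
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)

# Denormalized fuel receipts & costs, with plant names and other
# attributes merged in so you don't have to write the joins yourself.
frc = pudl_out.frc_eia923()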
Note
In the future, we intend to replace the simple denormalized output tables with database views that are integrated into the distributed SQLite database directly. This will provide the same convenience without requiring use of the Python software layer.
Analysis Outputs¶
There are several analytical routines built into the pudl.output.pudltabl.PudlTabl output objects for calculating derived values like the heat rate by generation unit (hr_by_unit) or the capacity factor by generator (capacity_factor). We intend to integrate more analytical outputs into the library over time.
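Continuing the earlier sketch, the derived values might be pulled like this. The method names come from the docs above; the freq argument (e.g. "AS" for annual) is an assumption about how aggregation frequency is selected.

import sqlalchemy as sa
import pudl

pudl_engine = sa.create_engine("sqlite:///pudl.sqlite")  # illustrative path
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq="AS")
heat_rates = pudl_out.hr_by_unit()       # heat rate by generation unit
cap_factor = pudl_out.capacity_factor()  # capacity factor by generator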
See also
The PUDL Examples GitHub repo to see how to access the PUDL Database directly, use the output functions, or work with the EPA CEMS data using Dask.
How to Learn Dask in 2021 is a great collection of self-guided resources if you are already familiar with Python, Pandas, and NumPy.
Data Validation¶
We have a growing collection of data validation test cases that we run before publishing a data release to try and avoid publishing data with known issues. Most of these validations are described in the pudl.validate module. They check things like the following (a simplified example follows the list):
The heat content of various fuel types is within expected bounds.
Coal ash, moisture, mercury, sulfur, etc. content is within expected bounds.
Generator heat rates and capacity factors are realistic for the type of prime mover being reported.
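For instance, a bounds check of this kind might be sketched as follows; the bounds and tolerance are invented for illustration, not PUDL's tuned values:

import pandas as pd

def check_heat_content(df, low, high, tol=0.05):
    """Fail if too many fuel heat content values fall outside [low, high]."""
    frac_ok = df["fuel_mmbtu_per_unit"].between(low, high).mean()
    if frac_ok < 1 - tol:
        raise AssertionError(
            f"Only {frac_ok:.1%} of heat content values are in [{low}, {high}]"
        )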
Some data validations are currently only specified within our test suite, including:
The expected number of records within each table
The fact that there are no entirely N/A columns
A variety of database integrity checks are also run either during the ETL process or when the data is loaded into SQLite.
See our Testing PUDL documentation for more information.
Data Access¶
We publish the PUDL pipeline outputs in several ways to serve different users and use cases. We’re always trying to increase accessibility of the PUDL data, so if you have suggestions or questions please open a GitHub issue or email us at pudl@catalyst.coop.
How Should You Access PUDL Data?¶
We provide four primary ways of interacting with PUDL data. Here’s how to find out which one is right for you and your use case.
Access Method | Types of User | Use Cases
---|---|---
Datasette | Curious Explorer, Spreadsheet Analyst, Web Developer | Explore the PUDL database interactively in a web browser. Select data to download as CSVs for local analysis in spreadsheets. Create sharable links to a particular selection of data. Access PUDL data via a REST API.
Zenodo Archives | Researcher, Database User, Notebook Analyst | Use a stable, citable, fully processed version of the PUDL data on your own computer. Use PUDL in Jupyter Notebooks running in a stable, archived Docker container. Access the SQLite DB and Parquet files directly using any toolset.
JupyterHub | New Python User, Notebook Analyst | Work through the PUDL example notebooks without any downloads or setup. Perform your own notebook-based analyses using PUDL data and limited computational resources.
Development Environment | Python Developer, Data Wrangler | Run the PUDL data processing pipeline on your own computer. Edit the PUDL source code and run the software tests and data validations. Integrate a new data source or newly released data from one of the existing sources.
Data Packages | Deprecated | For working with our published data prior to v0.4.0.
Datasette¶
We provide web-based access to the PUDL data via a Datasette deployment at https://data.catalyst.coop.
Datasette is an open source tool that wraps SQLite databases in an interactive front-end. It allows users to browse database tables, select portions of them using dropdown menus, build their own SQL queries, and download data to CSVs. It also creates a REST API allowing the data in the database to be queried programmatically. All the query parameters are stored in the URL so you can also share links to the data you’ve selected.
Note that only data that have been fully integrated into the SQLite databases are available here. Currently this includes the core PUDL database and our concatenation of all historical FERC Form 1 databases.
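For example, Datasette serves each table at a .json endpoint, so you can query it programmatically. The table name and parameters here are illustrative:

import requests

resp = requests.get(
    "https://data.catalyst.coop/pudl/plants_entity_eia.json",
    params={"_size": 5},  # Datasette pagination parameter
)
resp.raise_for_status()
rows = resp.json()["rows"]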
Zenodo Archives¶
We use Zenodo to archive our fully processed data as SQLite databases and Parquet files. We also archive a Docker image that contains the software environment required to use PUDL within Jupyter Notebooks. You can find all our archived data products in the Catalyst Cooperative Community on Zenodo.
The current (beta) version of the archived data and Docker container can be downloaded from this Zenodo archive.
Detailed instructions on how to access the archived PUDL data using a Docker container can be found in our PUDL Examples repository.
The SQLite databases and Parquet files containing the PUDL data, the complete FERC 1 database, and EPA CEMS hourly data are contained in that same archive, if you want to access them directly without using PUDL.
Note
If you’re already familiar with Docker, you can also pull the image we use to run Jupyter directly:
$ docker pull catalystcoop/pudl-jupyter:latest
JupyterHub¶
We’ve set up a JupyterHub in collaboration with 2i2c.org to provide access to all of the processed PUDL data and the software environment required to work with it. You don’t have to download or install anything to use it, but we do need to create an account for you.
Request an account by submitting this form.
Once we've created an account for you, follow this link to log in and open up the first example notebook on the JupyterHub.
You can create your own notebooks and upload, save, and download modest amounts of data on the hub.
We can only offer a small amount of memory (4-6GB) and processing power (1 CPU) per user on the JupyterHub for free. If you need to work with lots of data or do computationally intensive analysis, you may want to look into using the Zenodo Archives option on your own computer. The JupyterHub uses exactly the same data and software environment as the Zenodo Archives. Eventually we also want to offer paid access to the JupyterHub with plenty of computing power.
Development Environment¶
If you want to run the PUDL data processing pipeline yourself from scratch, run the software tests, or make changes to the source code, you’ll need to set up our development environment. This is a bit involved, so it has its own separate documentation.
Most users shouldn’t need to do this, and will probably find working with the pre-processed data via one of the other access modes easier. But if you want to contribute to the project please give it a shot!
Data Packages¶
Note
Prior to v0.4.0 of PUDL, we only published processed data as tabular data packages. As of v0.4.0, we distribute the SQLite databases and Apache Parquet files alongside a set of data packages. As of PUDL v0.5.0, we will generate SQLite and Apache Parquet outputs directly, and will no longer archive tabular data packages as the format of record, so the format conversions described below will no longer be necessary.
Archived Data Packages¶
We periodically publish data packages containing the full outputs from the PUDL ETL pipeline on Zenodo, an open data archiving service provided by CERN. The most recent release can always be found through this concept DOI: 10.5281/zenodo.3653158. Each individual version of the data releases will be assigned its own unique DOI.
All of our archived products can be found in the Catalyst Cooperative Community on Zenodo. These archives and the DOIs associated with them should be permanently accessible and are suitable for use as references in academic and other publications.
Once you’ve downloaded or generated your own tabular data packages you will probably want to convert them into a more analysis-oriented file format. We typically use SQLite for the core FERC and EIA data, and Apache Parquet files for the very long tables like EPA CEMS.
Converting to SQLite¶
If you want to access the data via SQL, we have provided a script that loads several
data packages into a local sqlite3
database. Note that these data packages
must have all been generated by the same ETL run, or they will be considered
incompatible by the script. For example, to load three data packages generated by our
example ETL configuration into your local SQLite DB, you could run the following
command from within your PUDL workspace:
$ datapkg_to_sqlite \
datapkg/pudl-example/ferc1-example/datapackage.json \
datapkg/pudl-example/eia-example/datapackage.json \
datapkg/pudl-example/epacems-eia-example/datapackage.json
Run datapkg_to_sqlite --help for more details.
Converting to Apache Parquet¶
The EPA CEMS Hourly data approaches 100 GB in size uncompressed. This is too large to work with directly in memory on most systems, and it takes a very long time to load into SQLite. Instead, we recommend converting the hourly emissions table into an Apache Parquet dataset stored on disk locally, and then either reading in only parts of it using pandas, or using Dask dataframes to serialize or distribute your analysis tasks. Dask can also speed up processing for in-memory tasks, especially if you have a powerful system with multiple cores, a solid state disk, and plenty of memory.
If you have generated an EPA CEMS data package, you can use the epacems_to_parquet script to convert the hourly emissions table like this:
$ epacems_to_parquet datapkg/pudl-example/epacems-eia-example/datapackage.json
The script will automatically generate a Parquet dataset partitioned by year and state in the parquet/epacems directory within your workspace. Run epacems_to_parquet --help for more details.
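Once the Parquet dataset exists, you can read just the partitions you need. For example, with Dask (the path matches the docs above; the filter values are illustrative):

import dask.dataframe as dd

# Read only one state-year partition of the hourly emissions data.
cems = dd.read_parquet(
    "parquet/epacems",
    filters=[("year", "==", 2019), ("state", "==", "CO")],
)
df = cems.compute()  # materialize the selection as a pandas DataFrame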
Data Sources¶
EIA Form 860¶
Source URL | https://www.eia.gov/electricity/data/eia860/
Source Description | The status of existing electric generating plants and associated equipment in the United States and those scheduled for initial commercial operation within 10 years of the filing.
Respondents | Utilities
Source Format | Microsoft Excel (.xls/.xlsx)
Source Years | 2001-2019
Size (Download) | 413.4 MB
PUDL Code | eia860
Years Liberated | 2004-2019
Records Liberated | ~1 million
Background¶
The Form EIA-860 collects utility, owner, plant, and generator-level data from existing and planned entities with one megawatt or more of capacity. The form also contains information regarding environmental control equipment and construction cost data from 2013-2018.
As of 2019, the EIA-860 Form is organized into the following schedules:
Schedule 1: Identification
Schedule 2: Power plant data
Schedule 3: Generator information
Schedule 4: Ownership of generators
Schedule 6: Boiler information
(Schedule 5 contained generator construction cost information)
Who is required to fill out the form?¶
Respondents include all existing and proposed plants that have a total generator nameplate capacity (sum for generators at a single site) of 1 Megawatt (MW) or greater and are connected to the local or regional electric power grid. Annual responses are due between the beginning of January and the end of February.
Jointly owned plants must be reported only once by their operator or planned operator.
What does the original data look like?¶
Approximately a year after respondents submit their form, the EIA publishes the data in a series of spreadsheets that reflect the thematic contents of the form. These spreadsheets can change year-to-year as the questions in the form are updated or as EIA adopts new formatting standards for their outputs. They are accessible on the EIA website as downloadable ZIP files categorized by year. To gain greater insight into year-to-year nuances of the form, we recommend downloading multiple years of EIA-860 ZIP files and comparing both the Form and the Form Instructions files. See below for our description of notable irregularities in the data.
How much of the data is accessible through PUDL?¶
EIA-860 data stretches back to 2001, and PUDL currently covers all years starting from 2004. The prior years are published as DBF files and need a special process to read and extract. We intend to include these older years as soon as we can.
PUDL does not currently include the files pertaining to specific renewable energy resources or interconnection.
Notable Irregularities¶
In 2012 and 2013, the Form was updated to include specific information about renewable generators. These new data are not included in PUDL.
Prior to 2009, the Generators table was split into two spreadsheets: one for operating and one for proposed generation. In 2007 and before, there was an additional file for proposed changes to existing generation. The latter is excluded from PUDL while the former is combined into a single table during the transformation process.
EIA 860 includes a table in “Schedule 6: Boiler Information” which is an association table between boilers and generators. This association is important because in EIA 923 the net generation is reported by generators and the fuel consumption is reported by boilers, so a good boiler-generator association is crucial for understanding heat rates. Unfortunately, the reported associations are incomplete. We have implemented a methodology that fills in many of the missing links for 2014 and later, covering more than 95% of the net generation reported in the generation_eia923 table. See this blog post and pudl.transform.eia for more information.
PUDL Data Tables¶
We’ve segmented the processed EIA-860 data into the following normalized data tables. Clicking on the links will show you a description of the table as well as the names and descriptions of each of its fields.
Data Dictionary | Browse Online
---|---
boiler_generator_assn_eia860 | https://data.catalyst.coop/pudl/boiler_generator_assn_eia860
generators_eia860 | https://data.catalyst.coop/pudl/generators_eia860
ownership_eia860 | https://data.catalyst.coop/pudl/ownership_eia860
plants_eia860 | https://data.catalyst.coop/pudl/plants_eia860
utilities_eia860 | https://data.catalyst.coop/pudl/utilities_eia860
We’ve also created the following entity tables modeled after EIA data collected from multiple tables.
Data Dictionary | Browse Online
---|---
boilers_entity_eia | https://data.catalyst.coop/pudl/boilers_entity_eia
generators_entity_eia | https://data.catalyst.coop/pudl/generators_entity_eia
plants_entity_eia | https://data.catalyst.coop/pudl/plants_entity_eia
utilities_entity_eia | https://data.catalyst.coop/pudl/utilities_entity_eia
PUDL Data Transformations¶
The PUDL transformation process cleans the input data so that it is adjusted for uniformity, corrected for errors, and ready for bulk programmatic use.
To see the transformations applied to the data in each table, you can read the docstrings of each table's respective transform function in pudl.transform.eia860.
EIA Form 923¶
Source URL | https://www.eia.gov/electricity/data/eia923/
Source Description | Generation, consumption, stocks, receipts
Respondents | Electric, CHP plants, and sometimes fuel transfer terminals with either 1MW+ or the ability to receive and deliver power to the grid.
Source Format | Microsoft Excel (.xls/.xlsx)
Source Years | 2001-2019
Size (Download) | 243.3 MB
PUDL Code | eia923
Years Liberated | 2001-2019
Records Liberated | ~3.6 million
Background¶
Form EIA-923 is known as the Power Plant Operations Report. The data include electric power generation, energy source consumption, end of reporting period fossil fuel stocks, as well as the quality and cost of fossil fuel receipts, at the power plant and prime mover level (with a subset of 10MW+ steam-electric plants reporting at the boiler and generator level). Information is available for non-utility plants starting in 1970 and utility plants beginning in 1999. The Form EIA-923 has evolved over the years, beginning as an environmental add-on in 2007 and ultimately eclipsing the information previously recorded in EIA-906, EIA-920, FERC 423, and EIA-423 by 2008.
As of 2019, the EIA-923 Form is organized into the following schedules:
Schedule 2: fuel receipts and costs
Schedules 3A & 5A: generator data including generation, fuel consumption and stocks
Schedule 4: fossil fuel stocks
Schedules 6 & 7: non-utility source and disposition of electricity
Schedules 8A-F: environmental data
Who is required to fill out the form?¶
Respondents include all electric and CHP plants, and in some cases fuel transfer terminals, that have a total generator nameplate capacity (sum for generators at a single site) of 1 Megawatt (MW) or greater and are connected to the local or regional electric power grid.
Selected plants may be permitted to report schedules 1-4B monthly and 6-8 annually so as to lighten their reporting burden. All other respondents must respond to the Form in its entirety once a year.
What does the original data look like?¶
Once the respondents have submitted their responses, the EIA creates a series of spreadsheets that reflect themes within the form. These spreadsheets have changed over the years as the form itself evolves. They are accessible on the EIA website as downloadable ZIP files categorized by year. The internal data are organized into Excel spreadsheets. To gain greater insight into year-to-year nuances of the form, we recommend downloading multiple years of EIA-923 ZIP files and comparing both the Form and the Form Instructions files.
How much of the data is accessible through PUDL?¶
EIA-923 data stretches back to 1970, and PUDL currently covers all years starting from 2009. Due to a difference in reporting between the older and newer years, the older data will require more time to integrate. Monthly and year-to-date releases are not yet integrated.
In addition, we have not yet integrated tables reporting fuel stocks, data from Puerto Rico, or EIA-923 schedules 6, 7, and 8.
Notable Irregularities¶
File Naming Conventions¶
The naming conventions for the raw files are confusing and difficult to trace year to year. Subtle and not so subtle changes to the form and published spreadsheets make aggregating pre-2009 data difficult from a programmatic standpoint.
Protected Data¶
In accordance with the Freedom of Information Act and the Trade Secrets Act, certain information reported to EIA-923 may remain undisclosed to the public until three months after its collection date. The fields subject to this legislation include: total delivered cost of coal, natural gas, and petroleum received at non-utility power plants and the commodity cost information for all plants (Schedule 2).
Net generation & fuel consumed reported in two separate tables¶
Net generation and fuel consumption are reported in two separate tables in EIA-923: the generation_eia923 and generation_fuel_eia923 tables. The generation_fuel_eia923 table is more complete (the generation_eia923 table includes only ~55% of the reported MWh), but the generation_eia923 table is more granular (it is reported at the generator level).
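As a quick illustration of that trade-off, you can compare the two tables' totals. This sketch assumes a PudlTabl output object as described in Data Access; the gen_eia923 and gf_eia923 method names are assumptions based on the PUDL output module.

import sqlalchemy as sa
import pudl

pudl_out = pudl.output.pudltabl.PudlTabl(
    sa.create_engine("sqlite:///pudl.sqlite")  # illustrative path
)
gen = pudl_out.gen_eia923()  # generator-level, less complete
gf = pudl_out.gf_eia923()    # fuel-level, more complete
share = gen["net_generation_mwh"].sum() / gf["net_generation_mwh"].sum()
print(f"generation_eia923 covers {share:.0%} of the MWh in generation_fuel_eia923")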
Data Estimates¶
Plants that did not respond or that reported unverified data were recorded as estimates, rolled into the state/fuel aggregate values reported under plant ID 99999.
PUDL Database Tables¶
We’ve segmented the processed EIA-923 data into the following normalized data tables. Clicking on the links will show you a description of the table as well as the names and descriptions of each of its fields.
EIA-923 Data Tables¶
These tables contain the bulk data reported in the EIA-923.
Data Dictionary | Browse Online
---|---
boiler_fuel_eia923 | https://data.catalyst.coop/pudl/boiler_fuel_eia923
coalmine_eia923 | https://data.catalyst.coop/pudl/coalmine_eia923
fuel_receipts_costs_eia923 | https://data.catalyst.coop/pudl/fuel_receipts_costs_eia923
generation_eia923 | https://data.catalyst.coop/pudl/generation_eia923
generation_fuel_eia923 | https://data.catalyst.coop/pudl/generation_fuel_eia923
EIA-923 Structural Tables¶
These tables define various codes and abbreviations more fully.
Data Dictionary | Browse Online
---|---
energy_source_eia923 | https://data.catalyst.coop/pudl/energy_source_eia923
fuel_type_aer_eia923 | https://data.catalyst.coop/pudl/fuel_type_aer_eia923
fuel_type_eia923 | https://data.catalyst.coop/pudl/fuel_type_eia923
PUDL Data Transformations¶
The PUDL transformation process cleans the input data so that it is adjusted for uniformity, corrected for errors, and ready for bulk programmatic use.
To see the transformations applied to the data in each table, you can read the function-level documentation in pudl.transform.eia923.
EPA CEMS Hourly¶
Source Description | Hourly CO2, SO2, NOx emissions and gross load
Respondents | Coal and high-sulfur fueled plants
Source Format | Comma Separated Value (.csv)
Source Years | 1995-2020
Size (Download) | 8.7 GB
PUDL Code | epacems
Years Liberated | 1995-2020
Records Liberated | ~1 billion
Background¶
As described by the EPA, Continuous Emissions Monitoring Systems (CEMS) are the “total equipment necessary for the determination of a gas or particulate matter concentration or emission rate.” They are used to determine compliance with EPA emissions standards and are therefore associated with a given “smokestack” and categorized in the raw data by a corresponding unitid. Because point sources of pollution are not always correlated on a one-to-one basis with generation units, the CEMS unitid serves as its own unique grouping. The EPA, in collaboration with the EIA, has developed a crosswalk table that maps the EPA's unitid onto EIA's boiler_id, generator_id, and plant_id_eia. This file has been integrated into the SQL database.
The EPA Clean Air Markets Division (CAMD) has collected emissions data from CEMS units stretching back to 1995. Among the data included in CEMS are hourly SO2, CO2, NOx emission and gross load.
Who is required to install CEMS and report to EPA?¶
Part 75 of the Code of Federal Regulations (CFR), the backbone of the Clean Air Act Title IV and Acid Rain Program, requires coal and other solid-combusting units (see §72.2) to install and use CEMS (see §75.2, §72.6). Certain low-sulfur fueled gas and oil units (see §72.2) may seek exemption or alternative means of monitoring their emissions if desired (see §§75.23, 75.48, 75.66). Once CEMS are installed, Part 75 requires hourly data recording, including during startup, shutdown, and instances of malfunction, as well as quarterly data reporting to the EPA. The regulation further details the protocol for missing data calculations and backup monitoring for instances of CEMS failure (see §§75.31-37).
A plain English explanation of the requirements of Part 75 is available in section 2.0 Overview of Part 75 Monitoring Requirements.
What does the original data look like?¶
EPA CAMD publishes the CEMS data in an online data portal. The files are available in a prepackaged format, accessible via a user interface or FTP site, with each downloadable ZIP file encompassing a year of data.
How much of the data is accessible through PUDL?¶
All of it!
Notable Irregularities¶
CEMS is by far the largest dataset in PUDL at the moment, with hourly records for thousands of plants spanning decades. Note that the ETL process can easily take all day for the full dataset. PUDL also provides a script that converts the raw EPA CEMS data into Apache Parquet files that can be read and queried very efficiently with Dask. Check out the EPA CEMS example notebook in our pudl-examples repository on GitHub for pointers on how to access this big dataset efficiently using dask.
PUDL Data Tables¶
Clicking on the links will show you a description of the table as well as the names and descriptions of each of its fields.
Data Dictionary | Browse Online
---|---
hourly_emissions_epacems | Not Available via Datasette
PUDL Data Transformations¶
The PUDL transformation process cleans the input data so that it is adjusted for uniformity, corrected for errors, and ready for bulk programmatic use.
To see the transformations applied to the data in each table, you can read the documentation in pudl.transform.epacems for the respective transform functions.
Thanks to Karl Dunkle Werner for contributing much of the EPA CEMS Hourly ETL code!
FERC Form 1¶
Source Description | Financial and operational information from electric utilities, licensees, and other entities subject to FERC jurisdiction.
Respondents | Major electric utilities and licensees.
Source Format | FoxPro Database (.DBC/.DBF)
Source Years | 1994-2019
Size (Download) | 1.3 GB
PUDL Code | ferc1
Years Liberated | 1994-2019
Records Liberated | ~12 million (116 raw tables), ~316,000 (7 clean tables)
Background¶
The FERC Form 1, otherwise known as the Electric Utility Annual Report, contains financial and operating data for major utilities and licensees. Much of it is not publicly available anywhere else.
Who is required to fill out the form?¶
As outlined in the Commission’s Uniform System of Accounts Prescribed for Public Utilities and Licensees Subject To the Provisions of The Federal Power Act (18 C.F.R. Part 101), to qualify as a respondent, entities must exceed at least one of the following criteria for three consecutive years prior to reporting:
1 million MWh of total sales
100MWh of annual sales for resale
500MWh of annual power exchanges delivered
500MWh of annual wheeling for others (deliveries plus losses)
Annual responses are due in April of the following year. FERC typically releases the new data in October.
How much of the data is accessible through PUDL?¶
Thus far, we have integrated 7 tables into the full PUDL ETL pipeline. We focused on the tables pertaining to power plants, their capital & operating expenses, and fuel consumption; however, we have the tools required to pull just about any other table in as well.
What does the original data look like?¶
See also
Explore the full FERC Form 1 dataset at: https://data.catalyst.coop/ferc1
The data is published as a collection of Visual FoxPro databases: one per year beginning in 1994. The databases all share a very similar structure and contain a total of 116 data tables and ~8GB of raw data (though 90% of that data is in 3 tables containing binary data). The final release of Visual FoxPro was v9.0 in 2007. Its extended support period ended in 2015. The bridge application which allowed this database to be used in Microsoft Access has been discontinued. FERC’s continued use of this database format creates a significant barrier to data access.
The FERC 1 database is poorly normalized and the data itself does not appear to be subject to much quality control. For more detailed context and documentation on a table-by-table basis, look at FERC Form 1 Data Dictionary.
Notable Irregularities¶
Sadly, the FERC Form 1 database is not particularly… relational. The only foreign key relationships that exist map respondent_id fields in the individual data tables back to f1_respondent_id. In theory, most of the data tables use report_year, respondent_id, row_number, spplmnt_num, and report_prd as a composite primary key.
In practice, there are several thousand records (out of ~12 million), including some in almost every table, that violate the uniqueness constraint on those primary keys. Since there aren't many meaningful foreign key relationships anyway, rather than dropping the records with non-unique natural composite keys, we chose to preserve all of the records and use surrogate auto-incrementing primary keys in the cloned SQLite database.
Lots of the data included in the FERC tables is extraneous and difficult to parse. None of the tables have unique record identifiers, and they sometimes contain multiple rows pertaining to the same plant or portion of a plant. For example, a utility might report values for individual plants as well as their sum total, rendering any aggregation performed on the column inaccurate. Sometimes there are values reported for the total rows and not the individual plants, making the totals difficult to simply remove. Moreover, these duplicate rows are incredibly difficult to identify.
To improve their usability, we have developed a complex system of regional mapping in order to create IDs for each of the plants that can then be compared to PUDL IDs and used for integration with EIA and other data. We also remove many of the duplicate rows and are in the midst of a more thorough review of the extraneous rows.
Over time we will pull in and clean up additional FERC Form 1 tables. If there’s data you need from Form 1 in bulk, you can hire us to liberate it first.
PUDL Data Tables¶
We’ve segmented the processed FERC Form 1 data into the following normalized data tables. Clicking on the links will show you a description of the table as well as the names and descriptions of each of its fields.
Data Dictionary | Browse Online
---|---
fuel_ferc1 | https://data.catalyst.coop/pudl/fuel_ferc1
plant_in_service_ferc1 | https://data.catalyst.coop/pudl/plant_in_service_ferc1
plants_hydro_ferc1 | https://data.catalyst.coop/pudl/plants_hydro_ferc1
plants_pumped_storage_ferc1 | https://data.catalyst.coop/pudl/plants_pumped_storage_ferc1
plants_small_ferc1 | https://data.catalyst.coop/pudl/plants_small_ferc1
plants_steam_ferc1 | https://data.catalyst.coop/pudl/plants_steam_ferc1
purchased_power_ferc1 | https://data.catalyst.coop/pudl/purchased_power_ferc1
PUDL Data Transformations¶
To see the transformations applied to the data in each table, you can read the pudl.transform.ferc1 module documentation, which covers each table's respective transform function.
Work in Progress & Future Datasets¶
Work in Progress¶
Thanks to a grant from the Alfred P. Sloan Foundation Energy & Environment Program, we have support to integrate the following new datasets between April 2021 and March 2023.
There's a huge variety and quantity of data about the US electric utility system available to the public. The data we have integrated is just the beginning! Other data we've heard demand for are listed below. If you're interested in using one of them and would like to add it to PUDL, check out our contribution guidelines. If there are other datasets you think we should be looking at integrating, don't hesitate to open an issue on GitHub requesting the data and explaining why it would be useful.
Census DP1¶
The US Census Demographic Profile 1 (DP1) provides Census tract, county, and state-level demographic information, along with the geometries defining those areas. We use this information in generating historical utility and balancing authority service territories based on FERC 714 and EIA 861 data. Currently, we are distributing the Census DP1 data as a standalone SQLite DB.
EIA Form 861¶
The EIA Form 861, also known as the Annual Electric Power Industry Report, compiles information on load, generation, capacity, sales, revenues, programs, and more. Right now we’ve got all of 861 integrated and are building out our testing and data validation before publishing the data officially.
EIA Form 176¶
EIA Form 176, also known as the Annual Report of Natural and Supplemental Gas Supply and Disposition, describes the origins, suppliers, and disposition of natural gas on a yearly and state by state basis.
FERC Form 714¶
FERC Form 714 includes hourly loads reported by load balancing authorities annually. This is a modestly sized dataset, in the 100s of MB, distributed as CSV files exported from a Visual FoxPro database prior to publication. All of the raw tables are being extracted, and a couple of them have been integrated into the transform process. None are in the PUDL DB yet.
FERC EQR¶
The FERC Electric Quarterly Reports (EQR), also known as FERC Form 920, includes the details of transactions between different utilities and transactions between utilities and merchant generators. It covers ancillary services as well as energy and capacity, time and location of delivery, prices, contract length, etc. It’s one of the few public sources of information about renewable energy power purchase agreements (PPAs). This is a large (~100s of GB) dataset composed of a very large number of relatively clean CSV files, but it requires fuzzy processing to get at some of the interesting and only indirectly reported attributes.
FERC Form 2¶
FERC Form 2 is analogous to FERC Form 1, but it pertains to gas rather than electric utilities. The data paint a detailed picture of the finances of natural gas utilities.
PHMSA Natural Gas Pipelines¶
The PHMSA Natural Gas Annual Report, published by the Pipeline and Hazardous Materials Safety Administration (part of the US Dept. of Transportation), collects data about natural gas gathering, transmission, and distribution systems (including their age, length, diameter, materials, and carrying capacity). PHMSA also has information about natural gas storage facilities and liquefied natural gas shipping facilities.
Machine Readable Clean Energy Standards¶
Renewable Portfolio Standards (RPS) and Clean Energy Standards (CES) have emerged as one of the primary policy tools to decarbonize the US electricity supply. Researchers who model future electricity systems need to include these binding regulations as constraints on their models to ensure that the systems they explore are legally compliant. Unfortunately for modelers, RPS and CES regulations vary from state to state. Sometimes there are carve outs for different types of generation, and sometimes there are different requirements for different types of utilities or distributed resources. Our goal is to compile a programmatically usable database of RPS/CES policies in the US for quick and easy reference by modelers.
Future Data of Interest¶
Transmission and Distribution Systems¶
In order to run electricity system operations models and cost optimizations, you need some kind of model of the interconnections between generation and loads. There doesn’t appear to be a generally accepted, publicly available set of these network descriptions (yet!).
EIA Water Usage¶
EIA Water records water use by thermal generating stations in the US.
MSHA Mines and Production¶
The MSHA Mines & Production dataset describes coal production by mine and operating company along with statistics about labor productivity and safety. This is a smaller dataset (100s of MB) available as relatively clean and well structured CSV files.
Data Dictionaries¶
PUDL Data Dictionary¶
The following data tables have been cleaned and transformed by our ETL process.
assn_gen_eia_unit_epa¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
generator_id | string | Generator identification code. Often numeric, but sometimes includes letters. It's a string!
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
unit_id_epa | string | Smokestack unit monitored by EPA CEMS.
assn_plant_id_eia_epa¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
plant_id_epa | integer | N/A
boiler_fuel_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
ash_content_pct | number | Ash content percentage by weight to the nearest 0.1 percent.
boiler_id | string | Boiler identification code. Alphanumeric.
fuel_consumed_units | number | Consumption of the fuel type in physical units. Note: this is the total quantity consumed for both electricity and, in the case of combined heat and power plants, process steam production.
fuel_mmbtu_per_unit | number | Heat content of the fuel in millions of Btus per physical unit.
fuel_type_code | string | The fuel code reported to EIA. Two or three letter alphanumeric.
fuel_type_code_pudl | string | Standardized fuel codes in PUDL.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
report_date | date | Date reported.
sulfur_content_pct | number | Sulfur content percentage by weight to the nearest 0.01 percent.
boiler_generator_assn_eia860¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
bga_source | string | The source from where the unit_id_pudl is compiled. The unit_id_pudl comes directly from EIA 860, or string association (which looks at all the boilers and generators that are not associated with a unit and tries to find a matching string in the respective collection of boilers or generators), or from a unit connection (where the unit_id_eia is employed to find additional boiler generator connections).
boiler_id | string | EIA-assigned boiler identification code.
generator_id | string | EIA-assigned generator identification code.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
report_date | date | Date reported.
unit_id_eia | string | EIA-assigned unit identification code.
unit_id_pudl | integer | Dynamically assigned PUDL unit id. WARNING: This ID is not guaranteed to be static long term as the input data and algorithm may evolve over time.
boilers_entity_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
boiler_id | string | The EIA-assigned boiler identification code. Alphanumeric.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
prime_mover_code | string | Code for the type of prime mover (e.g. CT, CG).
coalmine_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
county_id_fips | integer | County ID from the Federal Information Processing Standard Publication 6-4.
mine_id_msha | integer | MSHA issued mine identifier.
mine_id_pudl | integer | PUDL issued surrogate key.
mine_name | string | Coal mine name.
mine_type_code | string | Type of mine. P: Preparation plant, U: Underground, S: Surface, SU: Mostly Surface with some Underground, US: Mostly Underground with some Surface.
state | string | Two letter US state abbreviations and three letter ISO-3166-1 country codes for international mines.
energy_source_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
abbr | string | N/A
source | string | N/A
ferc_accounts¶
Account numbers from the FERC Uniform System of Accounts for Electric Plant, which is defined in Code of Federal Regulations (CFR) Title 18, Chapter I, Subchapter C, Part 101. (See e.g. https://www.law.cornell.edu/cfr/text/18/part-101). Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
description | string | Long description of the FERC Account.
ferc_account_id | string | Account number, from FERC's Uniform System of Accounts for Electric Plant. Also includes higher level labeled categories.
ferc_depreciation_lines¶
PUDL assigned FERC Form 1 line identifiers and long descriptions from FERC Form 1 page 219, Accumulated Provision for Depreciation of Electric Utility Plant (Account 108). Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
description | string | Description of the FERC depreciation account, as listed on FERC Form 1, Page 219.
line_id | string | A human readable string uniquely identifying the FERC depreciation account. Used in lieu of the actual line number, as those numbers are not guaranteed to be consistent from year to year.
fuel_ferc1¶
Annual fuel cost and quantity for steam plants with a capacity of 25+ MW, internal combustion and gas-turbine plants of 10+ MW, and all nuclear plants. As reported on page 402 of FERC Form 1 and extracted from the f1_fuel table in FERC's FoxPro Database. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
fuel_cost_per_mmbtu | number | Average cost of fuel consumed in the report year, in nominal USD per mmBTU of fuel heat content.
fuel_cost_per_unit_burned | number | Average cost of fuel consumed in the report year, in nominal USD per reported fuel unit.
fuel_cost_per_unit_delivered | number | Average cost of fuel delivered in the report year, in nominal USD per reported fuel unit.
fuel_mmbtu_per_unit | number | Average heat content of fuel consumed in the report year, in mmBTU per reported fuel unit.
fuel_qty_burned | number | Quantity of fuel consumed in the report year, in terms of the reported fuel units.
fuel_type_code_pudl | string | PUDL assigned code indicating the general fuel type.
fuel_unit | string | PUDL assigned code indicating reported fuel unit of measure.
plant_name_ferc1 | string | Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.
record_id | string | Identifier indicating original FERC Form 1 source record. format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
fuel_receipts_costs_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
ash_content_pct | number | Ash content percentage by weight to the nearest 0.1 percent.
chlorine_content_ppm | number | N/A
contract_expiration_date | date | Date contract expires. Format: MMYY.
contract_type_code | string | Purchase type under which receipts occurred in the reporting month. C: Contract, NC: New Contract, S: Spot Purchase, T: Tolling Agreement.
energy_source_code | string | The fuel code associated with the fuel receipt. Two or three character alphanumeric.
fuel_cost_per_mmbtu | number | All costs incurred in the purchase and delivery of the fuel to the plant in cents per million Btu (MMBtu) to the nearest 0.1 cent.
fuel_group_code | string | Groups the energy sources into fuel groups that are located in the Electric Power Monthly: Coal, Natural Gas, Petroleum, Petroleum Coke.
fuel_group_code_simple | string | Simplified grouping of fuel_group_code, with Coal and Petroleum Coke as well as Natural Gas and Other Gas grouped together.
fuel_qty_units | number | Quantity of fuel received in tons, barrels, or Mcf.
fuel_type_code_pudl | string | Standardized fuel codes in PUDL.
heat_content_mmbtu_per_unit | number | Heat content of the fuel in millions of Btus per physical unit to the nearest 0.01 percent.
id | integer | PUDL issued surrogate key.
mercury_content_ppm | number | Mercury content in parts per million (ppm) to the nearest 0.001 ppm.
mine_id_pudl | integer | PUDL mine identification number.
moisture_content_pct | number | N/A
natural_gas_delivery_contract_type_code | string | Contract type for natural gas delivery service.
natural_gas_transport_code | string | Contract type for natural gas transportation service.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
primary_transportation_mode_code | string | Transportation mode for the longest distance transported.
report_date | date | Date reported.
secondary_transportation_mode_code | string | Transportation mode for the second longest distance transported.
sulfur_content_pct | number | Sulfur content percentage by weight to the nearest 0.01 percent.
supplier_name | string | Company that sold the fuel to the plant or, in the case of natural gas, pipeline owner.
fuel_type_aer_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
abbr | string | N/A
fuel_type | string | N/A
fuel_type_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
abbr | string | N/A
fuel_type | string | N/A
generation_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
generator_id | string | Generator identification code. Often numeric, but sometimes includes letters. It's a string!
net_generation_mwh | number | Net generation for specified period in megawatthours (MWh).
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
report_date | date | Date reported.
generation_fuel_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
fuel_consumed_for_electricity_mmbtu | number | Total consumption of fuel to produce electricity, in MMBtu, year to date.
fuel_consumed_for_electricity_units | number | Consumption for electric generation of the fuel type in physical units.
fuel_consumed_mmbtu | number | Total consumption of fuel in MMBtu, year to date. Note: this is the total quantity consumed for both electricity and, in the case of combined heat and power plants, process steam production.
fuel_consumed_units | number | Consumption of the fuel type in physical units. Note: this is the total quantity consumed for both electricity and, in the case of combined heat and power plants, process steam production.
fuel_mmbtu_per_unit | number | Heat content of the fuel in millions of Btus per physical unit.
fuel_type | string | The fuel code reported to EIA. Two or three character alphanumeric.
fuel_type_code_aer | string | A partial aggregation of the reported fuel type codes into larger categories used by EIA in, for example, the Annual Energy Review (AER). Two or three character alphanumeric.
fuel_type_code_pudl | string | Standardized fuel codes in PUDL.
net_generation_mwh | number | Net generation, year to date, in megawatthours (MWh). This is total electrical output net of station service. In the case of combined heat and power plants, this value is intended to include internal consumption of electricity for the purposes of a production process, as well as power put on the grid.
nuclear_unit_id | integer | For nuclear plants only. This unit ID appears to correspond directly to the generator ID, as reported in the EIA-860. Nuclear plants are the only type of plants for which data are shown explicitly at the generating unit level. Note that nuclear plants only report their fuel consumption and net generation in the generation_fuel_eia923 table and not elsewhere.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
prime_mover_code | string | Type of prime mover.
report_date | date | Date reported.
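The heat content fields above are related by construction: total MMBtu consumed should approximately equal the physical units consumed times the per-unit heat content. A small consistency check, under the same hypothetical pudl.sqlite setup as above:

```python
import numpy as np
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # hypothetical local path
gf = pd.read_sql("generation_fuel_eia923", engine)

# fuel_consumed_mmbtu should be close to units * heat content per unit.
expected = gf["fuel_consumed_units"] * gf["fuel_mmbtu_per_unit"]
mismatch = ~np.isclose(gf["fuel_consumed_mmbtu"], expected, rtol=0.01, equal_nan=True)
print(f"{mismatch.mean():.1%} of rows deviate by more than 1%")
```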
generators_eia860¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
capacity_mw | number | The highest value on the generator nameplate in megawatts rounded to the nearest tenth.
carbon_capture | boolean | Indicates whether the generator uses carbon capture technology.
cofire_fuels | boolean | Can the generator co-fire fuels?
current_planned_operating_date | date | The most recently updated effective date on which the generator is scheduled to start operation.
data_source | string | Source of EIA 860 data. Either the annual EIA 860 or the year-to-date updates from EIA 860M.
deliver_power_transgrid | boolean | Indicates whether the generator can deliver power to the transmission grid.
distributed_generation | boolean | Whether the generator is considered distributed generation.
energy_source_1_transport_1 | string | Primary mode of transportation for energy source 1.
energy_source_1_transport_2 | string | Secondary mode of transportation for energy source 1.
energy_source_1_transport_3 | string | Third mode of transportation for energy source 1.
energy_source_2_transport_1 | string | Primary mode of transportation for energy source 2.
energy_source_2_transport_2 | string | Secondary mode of transportation for energy source 2.
energy_source_2_transport_3 | string | Third mode of transportation for energy source 2.
energy_source_code_1 | string | The code representing the most predominant type of energy that fuels the generator.
energy_source_code_2 | string | The code representing the second most predominant type of energy that fuels the generator.
energy_source_code_3 | string | The code representing the third most predominant type of energy that fuels the generator.
energy_source_code_4 | string | The code representing the fourth most predominant type of energy that fuels the generator.
energy_source_code_5 | string | The code representing the fifth most predominant type of energy that fuels the generator.
energy_source_code_6 | string | The code representing the sixth most predominant type of energy that fuels the generator.
fuel_type_code_pudl | string | Standardized fuel codes in PUDL.
generator_id | string | Generator identification number.
minimum_load_mw | number | The minimum load at which the generator can operate continuously.
multiple_fuels | boolean | Can the generator burn multiple fuels?
nameplate_power_factor | number | The nameplate power factor of the generator.
operational_status | string | The operating status of the generator. This is based on which tab of EIA 860 the generator was listed in.
operational_status_code | string | The operating status of the generator.
other_modifications_date | date | Planned effective date that the generator is scheduled to enter commercial operation after any other planned modification is complete.
other_planned_modifications | boolean | Indicates whether there are other modifications planned for the generator.
owned_by_non_utility | boolean | Whether any part of the generator is owned by a non-utility.
ownership_code | string | Identifies the ownership for each generator.
planned_derate_date | date | Planned effective month that the generator is scheduled to enter operation after the derate modification.
planned_energy_source_code_1 | string | New energy source code for the planned repowered generator.
planned_modifications | boolean | Indicates whether there are any planned capacity uprates/derates, repowering, other modifications, or generator retirements scheduled for the next 5 years.
planned_net_summer_capacity_derate_mw | number | Decrease in summer capacity expected to be realized from the derate modification to the equipment.
planned_net_summer_capacity_uprate_mw | number | Increase in summer capacity expected to be realized from the uprate modification to the equipment.
planned_net_winter_capacity_derate_mw | number | Decrease in winter capacity expected to be realized from the derate modification to the equipment.
planned_net_winter_capacity_uprate_mw | number | Increase in winter capacity expected to be realized from the uprate modification to the equipment.
planned_new_capacity_mw | number | The expected new nameplate capacity for the generator.
planned_new_prime_mover_code | string | New prime mover for the planned repowered generator.
planned_repower_date | date | Planned effective date that the generator is scheduled to enter operation after the repowering is complete.
planned_retirement_date | date | Planned effective date of the scheduled retirement of the generator.
planned_uprate_date | date | Planned effective date that the generator is scheduled to enter operation after the uprate modification.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
reactive_power_output_mvar | number | Reactive power output (MVAr).
report_date | date | Date reported.
retirement_date | date | Date of the scheduled or effected retirement of the generator.
startup_source_code_1 | string | The code representing the first, second, third or fourth start-up and flame stabilization energy source used by the combustion unit(s) associated with this generator.
startup_source_code_2 | string | The code representing the first, second, third or fourth start-up and flame stabilization energy source used by the combustion unit(s) associated with this generator.
startup_source_code_3 | string | The code representing the first, second, third or fourth start-up and flame stabilization energy source used by the combustion unit(s) associated with this generator.
startup_source_code_4 | string | The code representing the first, second, third or fourth start-up and flame stabilization energy source used by the combustion unit(s) associated with this generator.
summer_capacity_estimate | boolean | Whether the summer capacity value was an estimate.
summer_capacity_mw | number | The net summer capacity.
summer_estimated_capability_mw | number | EIA estimated summer capacity (in MW).
switch_oil_gas | boolean | Indicates whether the generator can switch between oil and natural gas.
syncronized_transmission_grid | boolean | Indicates whether standby generators (SB status) can be synchronized to the grid.
technology_description | string | High level description of the technology used by the generator to produce electricity.
time_cold_shutdown_full_load_code | string | The minimum amount of time required to bring the unit to full load from shutdown.
turbines_inverters_hydrokinetics | string | Number of wind turbines, or hydrokinetic buoys.
turbines_num | integer | Number of wind turbines, or hydrokinetic buoys.
uprate_derate_completed_date | date | The date when the uprate or derate was completed.
uprate_derate_during_year | boolean | Was an uprate or derate completed on this generator during the reporting year?
utility_id_eia | integer | EIA-assigned identification number for the company that is responsible for the day-to-day operations of the generator.
winter_capacity_estimate | boolean | Whether the winter capacity value was an estimate.
winter_capacity_mw | number | The net winter capacity.
winter_estimated_capability_mw | number | EIA estimated winter capacity (in MW).
generators_entity_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
associated_combined_heat_power | boolean | Indicates whether the generator is associated with a combined heat and power system.
bypass_heat_recovery | boolean | Can this generator operate while bypassing the heat recovery steam generator?
duct_burners | boolean | Indicates whether the unit has duct-burners for supplementary firing of the turbine exhaust gas.
fluidized_bed_tech | boolean | Indicates whether the generator uses fluidized bed technology.
generator_id | string | Generator identification number.
operating_date | date | Date the generator began commercial operation.
operating_switch | string | Indicates whether the fuel switching generator can switch when operating.
original_planned_operating_date | date | The date the generator was originally scheduled to be operational.
other_combustion_tech | boolean | Indicates whether the generator uses other combustion technologies.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
previously_canceled | boolean | Indicates whether the generator was previously reported as indefinitely postponed or canceled.
prime_mover_code | string | EIA assigned code for the prime mover (i.e. the engine, turbine, water wheel, or similar machine that drives an electric generator).
pulverized_coal_tech | boolean | Indicates whether the generator uses pulverized coal technology.
rto_iso_lmp_node_id | string | The designation used to identify the price node in RTO/ISO Locational Marginal Price reports.
rto_iso_location_wholesale_reporting_id | string | The designation used to report the specific location of the wholesale sales transactions to FERC for the Electric Quarterly Report.
solid_fuel_gasification | boolean | Indicates whether the generator is part of a solid fuel gasification system.
stoker_tech | boolean | Indicates whether the generator uses stoker technology.
subcritical_tech | boolean | Indicates whether the generator uses subcritical technology.
supercritical_tech | boolean | Indicates whether the generator uses supercritical technology.
topping_bottoming_code | string | If the generator is associated with a combined heat and power system, indicates whether the generator is part of a topping cycle or a bottoming cycle.
ultrasupercritical_tech | boolean | Indicates whether the generator uses ultra-supercritical technology.
hourly_emissions_epacems¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
co2_mass_measurement_code | string | Identifies whether the reported value of emissions was measured, calculated, or measured and substitute.
co2_mass_tons | number | Carbon dioxide emissions in short tons.
facility_id | integer | New EPA plant ID.
gross_load_mw | number | Average power in megawatts delivered during time interval measured.
heat_content_mmbtu | number | The energy contained in fuel burned, measured in million Btu.
nox_mass_lbs | number | NOx emissions in pounds.
nox_mass_measurement_code | string | Identifies whether the reported value of emissions was measured, calculated, or measured and substitute.
nox_rate_lbs_mmbtu | number | The average rate at which NOx was emitted during a given time period.
nox_rate_measurement_code | string | Identifies whether the reported value of emissions was measured, calculated, or measured and substitute.
operating_datetime_utc | datetime | Date and time measurement began (UTC).
operating_time_hours | number | Length of time interval measured.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
so2_mass_lbs | number | Sulfur dioxide emissions in pounds.
so2_mass_measurement_code | string | Identifies whether the reported value of emissions was measured, calculated, or measured and substitute.
state | string | State the plant is located in.
steam_load_1000_lbs | number | Total steam pressure produced by a unit during the reported hour.
unit_id_epa | integer | Smokestack unit monitored by EPA CEMS.
unitid | string | Facility-specific unit ID (e.g. Unit 4).
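Because gross_load_mw is an average power over the measurement interval and operating_time_hours is the interval's length, per-MWh emission rates have to be derived rather than read off a column. A sketch, assuming one of the EPA CEMS Apache Parquet outputs has been saved locally (the file name is hypothetical):

```python
import pandas as pd

cems = pd.read_parquet("hourly_emissions_epacems.parquet")  # hypothetical path

# Gross generation (MWh) = average power (MW) * hours in the interval.
cems["gross_generation_mwh"] = cems["gross_load_mw"] * cems["operating_time_hours"]

# CO2 emission rate in short tons per MWh of gross generation.
cems["co2_tons_per_mwh"] = cems["co2_mass_tons"] / cems["gross_generation_mwh"]
```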
ownership_eia860¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
fraction_owned | number | Proportion of generator ownership.
generator_id | string | Generator identification number.
owner_city | string | City of owner.
owner_name | string | Name of owner.
owner_state | string | Two letter US & Canadian state and territory abbreviations.
owner_street_address | string | Street address of owner.
owner_utility_id_eia | integer | EIA-assigned owner's identification number.
owner_zip_code | string | Zip code of owner.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
report_date | date | Date reported.
utility_id_eia | integer | EIA-assigned identification number for the company that is responsible for the day-to-day operations of the generator.
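Since fraction_owned is a proportion, the reported shares for a given generator in a given year should sum to roughly 1.0 wherever ownership is fully reported. A sketch of that sanity check (same hypothetical pudl.sqlite as above):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # hypothetical local path
own = pd.read_sql("ownership_eia860", engine)

# Sum the ownership shares within each generator-year.
shares = own.groupby(["report_date", "plant_id_eia", "generator_id"])["fraction_owned"].sum()

# Flag generator-years whose reported shares are far from 100%.
print(shares[(shares - 1.0).abs() > 0.02])
```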
plant_in_service_ferc1¶
Balances and changes to FERC Electric Plant in Service accounts, as reported on FERC Form 1. Data originally from the f1_plant_in_srvce table in FERC's FoxPro database. Account numbers correspond to the FERC Uniform System of Accounts for Electric Plant, which is defined in Code of Federal Regulations (CFR) Title 18, Chapter I, Subchapter C, Part 101. (See e.g. https://www.law.cornell.edu/cfr/text/18/part-101). Each FERC respondent reports starting and ending balances for each account annually. Balances are organization wide, and are not broken down on a per-plant basis. End of year balance should equal beginning year balance plus the sum of additions, retirements, adjustments, and transfers. Browse or query this table in Datasette.
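The balance identity described above can be checked directly once the table is loaded. A minimal sketch, assuming a local pudl.sqlite (hypothetical path) and the amount_type values named in the amount_type field description below (starting_balance, additions, retirements, adjustments, transfers, ending_balance):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # hypothetical local path
pis = pd.read_sql("plant_in_service_ferc1", engine)

# One row per utility-year-amount_type; pivot so each amount_type is a column.
wide = pis.pivot_table(
    index=["utility_id_ferc1", "report_year"],
    columns="amount_type",
    values="electric_plant_in_service_total",
)

# Ending balance should equal starting balance plus the sum of the flows
# (retirements, adjustments, and transfers are assumed to carry their signs).
flows = ["additions", "retirements", "adjustments", "transfers"]
residual = wide["ending_balance"] - wide["starting_balance"] - wide[flows].sum(axis=1)
print(residual.abs().describe())
```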
Field Name | Type | Description
---|---|---
amount_type | string | String indicating which original FERC Form 1 column the listed amount came from. Each field should have one (potentially NA) value of each type for each utility in each year, and the ending_balance should equal the sum of starting_balance, additions, retirements, adjustments, and transfers.
distribution_acct360_land | number | FERC Account 360: Distribution Plant Land and Land Rights.
distribution_acct361_structures | number | FERC Account 361: Distribution Plant Structures and Improvements.
distribution_acct362_station_equip | number | FERC Account 362: Distribution Plant Station Equipment.
distribution_acct363_storage_battery_equip | number | FERC Account 363: Distribution Plant Storage Battery Equipment.
distribution_acct364_poles_towers | number | FERC Account 364: Distribution Plant Poles, Towers, and Fixtures.
distribution_acct365_overhead_conductors | number | FERC Account 365: Distribution Plant Overhead Conductors and Devices.
distribution_acct366_underground_conduit | number | FERC Account 366: Distribution Plant Underground Conduit.
distribution_acct367_underground_conductors | number | FERC Account 367: Distribution Plant Underground Conductors and Devices.
distribution_acct368_line_transformers | number | FERC Account 368: Distribution Plant Line Transformers.
distribution_acct369_services | number | FERC Account 369: Distribution Plant Services.
distribution_acct370_meters | number | FERC Account 370: Distribution Plant Meters.
distribution_acct371_customer_installations | number | FERC Account 371: Distribution Plant Installations on Customer Premises.
distribution_acct372_leased_property | number | FERC Account 372: Distribution Plant Leased Property on Customer Premises.
distribution_acct373_street_lighting | number | FERC Account 373: Distribution Plant Street Lighting and Signal Systems.
distribution_acct374_asset_retirement | number | FERC Account 374: Distribution Plant Asset Retirement Costs.
distribution_total | number | Distribution Plant Total (FERC Accounts 360-374).
electric_plant_in_service_total | number | Total Electric Plant in Service (FERC Accounts 101, 102, 103 and 106).
electric_plant_purchased_acct102 | number | FERC Account 102: Electric Plant Purchased.
electric_plant_sold_acct102 | number | FERC Account 102: Electric Plant Sold (Negative).
experimental_plant_acct103 | number | FERC Account 103: Experimental Plant Unclassified.
general_acct389_land | number | FERC Account 389: General Land and Land Rights.
general_acct390_structures | number | FERC Account 390: General Structures and Improvements.
general_acct391_office_equip | number | FERC Account 391: General Office Furniture and Equipment.
general_acct392_transportation_equip | number | FERC Account 392: General Transportation Equipment.
general_acct393_stores_equip | number | FERC Account 393: General Stores Equipment.
general_acct394_shop_equip | number | FERC Account 394: General Tools, Shop, and Garage Equipment.
general_acct395_lab_equip | number | FERC Account 395: General Laboratory Equipment.
general_acct396_power_operated_equip | number | FERC Account 396: General Power Operated Equipment.
general_acct397_communication_equip | number | FERC Account 397: General Communication Equipment.
general_acct398_misc_equip | number | FERC Account 398: General Miscellaneous Equipment.
general_acct399_1_asset_retirement | number | FERC Account 399.1: Asset Retirement Costs for General Plant.
general_acct399_other_property | number | FERC Account 399: General Plant Other Tangible Property.
general_subtotal | number | General Plant Subtotal (FERC Accounts 389-398).
general_total | number | General Plant Total (FERC Accounts 389-399.1).
hydro_acct330_land | number | FERC Account 330: Hydro Land and Land Rights.
hydro_acct331_structures | number | FERC Account 331: Hydro Structures and Improvements.
hydro_acct332_reservoirs_dams_waterways | number | FERC Account 332: Hydro Reservoirs, Dams, and Waterways.
hydro_acct333_wheels_turbines_generators | number | FERC Account 333: Hydro Water Wheels, Turbines, and Generators.
hydro_acct334_accessory_equip | number | FERC Account 334: Hydro Accessory Electric Equipment.
hydro_acct335_misc_equip | number | FERC Account 335: Hydro Miscellaneous Power Plant Equipment.
hydro_acct336_roads_railroads_bridges | number | FERC Account 336: Hydro Roads, Railroads, and Bridges.
hydro_acct337_asset_retirement | number | FERC Account 337: Asset Retirement Costs for Hydraulic Production.
hydro_total | number | Hydraulic Production Plant Total (FERC Accounts 330-337).
intangible_acct301_organization | number | FERC Account 301: Intangible Plant Organization.
intangible_acct302_franchises_consents | number | FERC Account 302: Intangible Plant Franchises and Consents.
intangible_acct303_misc | number | FERC Account 303: Miscellaneous Intangible Plant.
intangible_total | number | Intangible Plant Total (FERC Accounts 301-303).
major_electric_plant_acct101_acct106_total | number | Total Major Electric Plant in Service (FERC Accounts 101 and 106).
nuclear_acct320_land | number | FERC Account 320: Nuclear Land and Land Rights.
nuclear_acct321_structures | number | FERC Account 321: Nuclear Structures and Improvements.
nuclear_acct322_reactor_equip | number | FERC Account 322: Nuclear Reactor Plant Equipment.
nuclear_acct323_turbogenerators | number | FERC Account 323: Nuclear Turbogenerator Units.
nuclear_acct324_accessory_equip | number | FERC Account 324: Nuclear Accessory Electric Equipment.
nuclear_acct325_misc_equip | number | FERC Account 325: Nuclear Miscellaneous Power Plant Equipment.
nuclear_acct326_asset_retirement | number | FERC Account 326: Asset Retirement Costs for Nuclear Production.
nuclear_total | number | Total Nuclear Production Plant (FERC Accounts 320-326).
other_acct340_land | number | FERC Account 340: Other Land and Land Rights.
other_acct341_structures | number | FERC Account 341: Other Structures and Improvements.
other_acct342_fuel_accessories | number | FERC Account 342: Other Fuel Holders, Products, and Accessories.
other_acct343_prime_movers | number | FERC Account 343: Other Prime Movers.
other_acct344_generators | number | FERC Account 344: Other Generators.
other_acct345_accessory_equip | number | FERC Account 345: Other Accessory Electric Equipment.
other_acct346_misc_equip | number | FERC Account 346: Other Miscellaneous Power Plant Equipment.
other_acct347_asset_retirement | number | FERC Account 347: Asset Retirement Costs for Other Production.
other_total | number | Total Other Production Plant (FERC Accounts 340-347).
production_total | number | Total Production Plant (FERC Accounts 310-347).
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
rtmo_acct380_land | number | FERC Account 380: RTMO Land and Land Rights.
rtmo_acct381_structures | number | FERC Account 381: RTMO Structures and Improvements.
rtmo_acct382_computer_hardware | number | FERC Account 382: RTMO Computer Hardware.
rtmo_acct383_computer_software | number | FERC Account 383: RTMO Computer Software.
rtmo_acct384_communication_equip | number | FERC Account 384: RTMO Communication Equipment.
rtmo_acct385_misc_equip | number | FERC Account 385: RTMO Miscellaneous Equipment.
rtmo_total | number | Total RTMO Plant (FERC Accounts 380-386).
steam_acct310_land | number | FERC Account 310: Steam Plant Land and Land Rights.
steam_acct311_structures | number | FERC Account 311: Steam Plant Structures and Improvements.
steam_acct312_boiler_equip | number | FERC Account 312: Steam Boiler Plant Equipment.
steam_acct313_engines | number | FERC Account 313: Steam Engines and Engine-Driven Generators.
steam_acct314_turbogenerators | number | FERC Account 314: Steam Turbogenerator Units.
steam_acct315_accessory_equip | number | FERC Account 315: Steam Accessory Electric Equipment.
steam_acct316_misc_equip | number | FERC Account 316: Steam Miscellaneous Power Plant Equipment.
steam_acct317_asset_retirement | number | FERC Account 317: Asset Retirement Costs for Steam Production.
steam_total | number | Total Steam Production Plant (FERC Accounts 310-317).
transmission_acct350_land | number | FERC Account 350: Transmission Land and Land Rights.
transmission_acct352_structures | number | FERC Account 352: Transmission Structures and Improvements.
transmission_acct353_station_equip | number | FERC Account 353: Transmission Station Equipment.
transmission_acct354_towers | number | FERC Account 354: Transmission Towers and Fixtures.
transmission_acct355_poles | number | FERC Account 355: Transmission Poles and Fixtures.
transmission_acct356_overhead_conductors | number | FERC Account 356: Overhead Transmission Conductors and Devices.
transmission_acct357_underground_conduit | number | FERC Account 357: Underground Transmission Conduit.
transmission_acct358_underground_conductors | number | FERC Account 358: Underground Transmission Conductors.
transmission_acct359_1_asset_retirement | number | FERC Account 359.1: Asset Retirement Costs for Transmission Plant.
transmission_acct359_roads_trails | number | FERC Account 359: Transmission Roads and Trails.
transmission_total | number | Total Transmission Plant (FERC Accounts 350-359.1).
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
plant_unit_epa¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_epa | integer | N/A
unit_id_epa | string | Smokestack unit monitored by EPA CEMS.
plants_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
plant_id_pudl | integer | N/A
plant_name_eia | string | N/A
plants_eia860¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
ash_impoundment | string | Is there an ash impoundment (e.g. pond, reservoir) at the plant?
ash_impoundment_lined | string | If there is an ash impoundment at the plant, is the impoundment lined?
ash_impoundment_status | string | If there is an ash impoundment at the plant, the ash impoundment status as of December 31 of the reporting year.
datum | string | N/A
energy_storage | string | Indicates if the facility has energy storage capabilities.
ferc_cogen_docket_no | string | The docket number relating to the FERC qualifying facility cogenerator status.
ferc_exempt_wholesale_generator_docket_no | string | The docket number relating to the FERC qualifying facility exempt wholesale generator status.
ferc_small_power_producer_docket_no | string | The docket number relating to the FERC qualifying facility small power producer status.
liquefied_natural_gas_storage | string | Indicates if the facility has the capability to store natural gas in the form of liquefied natural gas.
natural_gas_local_distribution_company | string | Name of the Local Distribution Company (LDC) connected to natural gas burning power plants.
natural_gas_pipeline_name_1 | string | The name of the owner or operator of the natural gas pipeline that connects directly to this facility or that connects to a lateral pipeline owned by this facility.
natural_gas_pipeline_name_2 | string | The name of the owner or operator of the natural gas pipeline that connects directly to this facility or that connects to a lateral pipeline owned by this facility.
natural_gas_pipeline_name_3 | string | The name of the owner or operator of the natural gas pipeline that connects directly to this facility or that connects to a lateral pipeline owned by this facility.
natural_gas_storage | string | Indicates if the facility has on-site storage of natural gas.
nerc_region | string | NERC region in which the plant is located.
net_metering | string | Did this plant have a net metering agreement in effect during the reporting year? (Only displayed for facilities that report the sun or wind as an energy source.) This field was only reported up until 2015.
pipeline_notes | string | Additional owner or operator of natural gas pipeline.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
regulatory_status_code | string | Indicates whether the plant is regulated or non-regulated.
report_date | date | Date reported.
transmission_distribution_owner_id | string | EIA-assigned code for owner of transmission/distribution system to which the plant is interconnected.
transmission_distribution_owner_name | string | Name of the owner of the transmission or distribution system to which the plant is interconnected.
transmission_distribution_owner_state | string | State location for owner of transmission/distribution system to which the plant is interconnected.
utility_id_eia | integer | EIA-assigned identification number for the company that is responsible for the day-to-day operations of the generator.
water_source | string | Name of water source associated with the plant.
plants_entity_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
balancing_authority_code_eia | string | The plant's balancing authority code.
balancing_authority_name_eia | string | The plant's balancing authority name.
city | string | The plant's city.
county | string | The plant's county.
ferc_cogen_status | string | Indicates whether the plant has FERC qualifying facility cogenerator status.
ferc_exempt_wholesale_generator | string | Indicates whether the plant has FERC qualifying facility exempt wholesale generator status.
ferc_small_power_producer | string | Indicates whether the plant has FERC qualifying facility small power producer status.
grid_voltage_2_kv | number | Plant's grid voltage at point of interconnection to transmission or distribution facilities.
grid_voltage_3_kv | number | Plant's grid voltage at point of interconnection to transmission or distribution facilities.
grid_voltage_kv | number | Plant's grid voltage at point of interconnection to transmission or distribution facilities.
iso_rto_code | string | The code of the plant's ISO or RTO. NA if not reported in that year.
latitude | number | Latitude of the plant's location, in degrees.
longitude | number | Longitude of the plant's location, in degrees.
plant_id_eia | integer | The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.
plant_name_eia | string | Plant name.
primary_purpose_naics_id | number | North American Industry Classification System (NAICS) code that best describes the primary purpose of the reporting plant.
sector_id | number | Plant-level sector number, designated by the primary purpose, regulatory status and plant-level combined heat and power status.
sector_name | string | Plant-level sector name, designated by the primary purpose, regulatory status and plant-level combined heat and power status.
service_area | string | Service area in which the plant is located; for unregulated companies, it's the electric utility with which the plant is interconnected.
state | string | Plant state. Two letter US state and territory abbreviations.
street_address | string | Plant street address.
timezone | string | IANA timezone name.
zip_code | string | Plant zip code.
plants_ferc1¶
Name, utility, and PUDL id for steam plants with a capacity of 25,000+ kW, internal combustion and gas-turbine plants of 10,000+ kW, and all nuclear plants. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_pudl | integer | A manually assigned PUDL plant ID. May not be constant over time.
plant_name_ferc1 | string | Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
plants_hydro_ferc1¶
Generating plant statistics for hydroelectric plants with an installed nameplate capacity of 10+ MW. As reported on FERC Form 1, pages 406-407 and extracted from the f1_hydro table in FERC's FoxPro database. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
asset_retirement_cost | number | Cost of plant: asset retirement costs. Nominal USD.
avg_num_employees | number | Average number of employees.
capacity_mw | number | Total installed (nameplate) capacity, in megawatts.
capex_equipment | number | Cost of plant: equipment. Nominal USD.
capex_facilities | number | Cost of plant: reservoirs, dams, and waterways. Nominal USD.
capex_land | number | Cost of plant: land and land rights. Nominal USD.
capex_per_mw | number | Cost of plant per megawatt of installed (nameplate) capacity. Nominal USD.
capex_roads | number | Cost of plant: roads, railroads, and bridges. Nominal USD.
capex_structures | number | Cost of plant: structures and improvements. Nominal USD.
capex_total | number | Total cost of plant. Nominal USD.
construction_type | string | Type of plant construction ('outdoor', 'semioutdoor', or 'conventional'). Categorized by PUDL based on our best guess of intended value in FERC1 freeform strings.
construction_year | year | Four digit year of the plant's original construction.
installation_year | year | Four digit year in which the last unit was installed.
net_capacity_adverse_conditions_mw | number | Net plant capability under the least favorable operating conditions, in megawatts.
net_capacity_favorable_conditions_mw | number | Net plant capability under the most favorable operating conditions, in megawatts.
net_generation_mwh | number | Net generation, exclusive of plant use, in megawatt hours.
opex_dams | number | Production expenses: maintenance of reservoirs, dams, and waterways. Nominal USD.
opex_electric | number | Production expenses: electric expenses. Nominal USD.
opex_engineering | number | Production expenses: maintenance, supervision, and engineering. Nominal USD.
opex_generation_misc | number | Production expenses: miscellaneous hydraulic power generation expenses. Nominal USD.
opex_hydraulic | number | Production expenses: hydraulic expenses. Nominal USD.
opex_misc_plant | number | Production expenses: maintenance of miscellaneous hydraulic plant. Nominal USD.
opex_operations | number | Production expenses: operation, supervision, and engineering. Nominal USD.
opex_per_mwh | number | Production expenses per net megawatt hour generated. Nominal USD.
opex_plant | number | Production expenses: maintenance of electric plant. Nominal USD.
opex_rents | number | Production expenses: rent. Nominal USD.
opex_structures | number | Production expenses: maintenance of structures. Nominal USD.
opex_total | number | Total production expenses. Nominal USD.
opex_water_for_power | number | Production expenses: water for power. Nominal USD.
peak_demand_mw | number | Net peak demand on the plant (60-minute integration), in megawatts.
plant_hours_connected_while_generating | number | Hours the plant was connected to load while generating.
plant_name_ferc1 | string | Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.
plant_type | string | Kind of plant (Run-of-River or Storage).
project_num | integer | FERC Licensed Project Number.
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
plants_pudl¶
Home table for PUDL assigned plant IDs. These IDs are manually generated each year when new FERC and EIA reporting is integrated, and any newly identified plants are added to the list with a new ID. Each ID maps to a power plant which is reported in at least one FERC or EIA data set. This table is read in from a spreadsheet stored in the PUDL repository: src/pudl/package_data/glue/mapping_eia923_ferc1.xlsx Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_pudl | integer | A manually assigned PUDL plant ID. May not be constant over time.
plant_name_pudl | string | Plant name, chosen arbitrarily from the several possible plant names available in the plant matching process. Included for human readability only.
plants_pumped_storage_ferc1¶
Generating plant statistics for hydroelectric pumped storage plants with an installed nameplate capacity of 10+ MW. As reported on page 408 of FERC Form 1 and extracted from the f1_pumped_storage table in FERC's FoxPro Database. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
asset_retirement_cost | number | Cost of plant: asset retirement costs. Nominal USD.
avg_num_employees | number | Average number of employees.
capacity_mw | number | Total installed (nameplate) capacity, in megawatts.
capex_equipment_electric | number | Cost of plant: accessory electric equipment. Nominal USD.
capex_equipment_misc | number | Cost of plant: miscellaneous power plant equipment. Nominal USD.
capex_facilities | number | Cost of plant: reservoirs, dams, and waterways. Nominal USD.
capex_land | number | Cost of plant: land and land rights. Nominal USD.
capex_per_mw | number | Cost of plant per megawatt of installed (nameplate) capacity. Nominal USD.
capex_roads | number | Cost of plant: roads, railroads, and bridges. Nominal USD.
capex_structures | number | Cost of plant: structures and improvements. Nominal USD.
capex_total | number | Total cost of plant. Nominal USD.
capex_wheels_turbines_generators | number | Cost of plant: water wheels, turbines, and generators. Nominal USD.
construction_type | string | Type of plant construction ('outdoor', 'semioutdoor', or 'conventional'). Categorized by PUDL based on our best guess of intended value in FERC1 freeform strings.
construction_year | year | Four digit year of the plant's original construction.
energy_used_for_pumping_mwh | number | Energy used for pumping, in megawatt-hours.
installation_year | year | Four digit year in which the last unit was installed.
net_generation_mwh | number | Net generation, exclusive of plant use, in megawatt hours.
net_load_mwh | number | Net output for load (net generation - energy used for pumping) in megawatt-hours.
opex_dams | number | Production expenses: maintenance of reservoirs, dams, and waterways. Nominal USD.
opex_electric | number | Production expenses: electric expenses. Nominal USD.
opex_engineering | number | Production expenses: maintenance, supervision, and engineering. Nominal USD.
opex_generation_misc | number | Production expenses: miscellaneous pumped storage power generation expenses. Nominal USD.
opex_misc_plant | number | Production expenses: maintenance of miscellaneous hydraulic plant. Nominal USD.
opex_operations | number | Production expenses: operation, supervision, and engineering. Nominal USD.
opex_per_mwh | number | Production expenses per net megawatt hour generated. Nominal USD.
opex_plant | number | Production expenses: maintenance of electric plant. Nominal USD.
opex_production_before_pumping | number | Total production expenses before pumping. Nominal USD.
opex_pumped_storage | number | Production expenses: pumped storage. Nominal USD.
opex_pumping | number | Production expenses: pumping. Nominal USD.
opex_rents | number | Production expenses: rent. Nominal USD.
opex_structures | number | Production expenses: maintenance of structures. Nominal USD.
opex_total | number | Total production expenses. Nominal USD.
opex_water_for_power | number | Production expenses: water for power. Nominal USD.
peak_demand_mw | number | Net peak demand on the plant (60-minute integration), in megawatts.
plant_capability_mw | number | Net plant capability in megawatts.
plant_hours_connected_while_generating | number | Hours the plant was connected to load while generating.
plant_name_ferc1 | string | Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.
project_num | integer | FERC Licensed Project Number.
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
plants_small_ferc1¶
Generating plant statistics for steam plants with less than 25 MW installed nameplate capacity and for internal combustion plants, gas-turbine plants, conventional hydro plants, and pumped storage plants with less than 10 MW installed nameplate capacity. As reported on FERC Form 1 pages 410-411, and extracted from the FERC FoxPro database table f1_gnrt_plant. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
capacity_mw | number | Name plate capacity in megawatts.
capex_per_mw | number | Plant costs (including asset retirement costs) per megawatt. Nominal USD.
construction_year | year | Original year of plant construction.
ferc_license_id | integer | FERC issued operating license ID for the facility, if available. This value is extracted from the original plant name where possible.
fuel_cost_per_mmbtu | number | Average fuel cost per mmBTU (if applicable). Nominal USD.
fuel_type | string | Kind of fuel. Originally reported to FERC as a freeform string. Assigned a canonical value by PUDL based on our best guess.
net_generation_mwh | number | Net generation excluding plant use, in megawatt-hours.
opex_fuel | number | Production expenses: Fuel. Nominal USD.
opex_maintenance | number | Production expenses: Maintenance. Nominal USD.
opex_total | number | Total plant operating expenses, excluding fuel. Nominal USD.
peak_demand_mw | number | Net peak demand for 60 minutes. Note: in some cases peak demand for other time periods may have been reported instead, if hourly peak demand was unavailable.
plant_name_ferc1 | string | PUDL assigned simplified plant name.
plant_name_original | string | Original plant name in the FERC Form 1 FoxPro database.
plant_type | string | PUDL assigned plant type. This is a best guess based on the fuel type, plant name, and other attributes.
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
total_cost_of_plant | number | Total cost of plant. Nominal USD.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
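The record_id format shared by all of the FERC Form 1 derived tables can be unpacked mechanically. Because the embedded table name itself contains underscores, the split has to run from the right; the example ID below is made up:

```python
def parse_record_id(record_id: str) -> dict:
    """Split a FERC Form 1 record_id into its documented components:
    {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}.
    """
    table_name, year, prd, respondent, spplmnt, row = record_id.rsplit("_", 5)
    return {
        "table_name": table_name,
        "report_year": int(year),
        "report_prd": int(prd),
        "respondent_id": int(respondent),
        "spplmnt_num": int(spplmnt),
        "row_number": int(row),
    }

# Hypothetical example:
# parse_record_id("f1_gnrt_plant_2019_12_123_0_1")
```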
plants_steam_ferc1¶
Generating plant statistics for steam plants with a capacity of 25+ MW, internal combustion and gas-turbine plants of 10+ MW, and all nuclear plants. As reported on page 402 of FERC Form 1 and extracted from the f1_steam table in FERC's FoxPro database. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
asset_retirement_cost | number | Asset retirement cost.
avg_num_employees | number | Average number of plant employees during report year.
capacity_mw | number | Total installed plant capacity in MW.
capex_equipment | number | Capital expense for equipment.
capex_land | number | Capital expense for land and land rights.
capex_per_mw | number | Capital expenses per MW of installed plant capacity.
capex_structures | number | Capital expense for structures and improvements.
capex_total | number | Total capital expenses.
construction_type | string | Type of plant construction ('outdoor', 'semioutdoor', or 'conventional'). Categorized by PUDL based on our best guess of intended value in FERC1 freeform strings.
construction_year | year | Year the plant's oldest still operational unit was built.
installation_year | year | Year the plant's most recently built unit was installed.
net_generation_mwh | number | Net generation (exclusive of plant use) in MWh during report year.
not_water_limited_capacity_mw | number | Plant capacity in MW when not limited by condenser water.
opex_allowances | number | Allowances.
opex_boiler | number | Maintenance of boiler (or reactor) plant.
opex_coolants | number | Cost of coolants and water (nuclear plants only).
opex_electric | number | Electricity expenses.
opex_engineering | number | Maintenance, supervision, and engineering.
opex_fuel | number | Total cost of fuel.
opex_misc_power | number | Miscellaneous steam (or nuclear) expenses.
opex_misc_steam | number | Maintenance of miscellaneous steam (or nuclear) plant.
opex_operations | number | Production expenses: operations, supervision, and engineering.
opex_per_mwh | number | Total operating expenses per MWh of net generation.
opex_plants | number | Maintenance of electrical plant.
opex_production_total | number | Total operating expenses.
opex_rents | number | Rents.
opex_steam | number | Steam expenses.
opex_steam_other | number | Steam from other sources.
opex_structures | number | Maintenance of structures.
opex_transfer | number | Steam transferred (Credit).
peak_demand_mw | number | Net peak demand experienced by the plant in MW in report year.
plant_capability_mw | number | Net continuous plant capability in MW.
plant_hours_connected_while_generating | number | Total number of hours the plant was generating and connected to load during the report year.
plant_id_ferc1 | integer | Algorithmically assigned PUDL FERC Plant ID. WARNING: NOT STABLE BETWEEN PUDL DB INITIALIZATIONS.
plant_name_ferc1 | string | Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.
plant_type | string | Simplified plant type, categorized by PUDL based on our best guess of what was intended based on the freeform string reported to FERC. Unidentifiable types are null.
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
water_limited_capacity_mw | number | Plant capacity in MW when limited by condenser water.
prime_movers_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
abbr | string | N/A
prime_mover | string | N/A
purchased_power_ferc1¶
Purchased Power (Account 555) including power exchanges (i.e. transactions involving a balancing of debits and credits for energy, capacity, etc.) and any settlements for imbalanced exchanges. Reported on pages 326-327 of FERC Form 1. Extracted from the f1_purchased_pwr table in FERC's FoxPro database. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
billing_demand_mw | number | Monthly average billing demand (for requirements purchases, and any transactions involving demand charges). In megawatts.
coincident_peak_demand_mw | number | Average monthly coincident peak (CP) demand (for requirements purchases, and any transactions involving demand charges). Monthly CP demand is the metered demand during the hour (60-minute integration) in which the supplier's system reaches its monthly peak. In megawatts.
delivered_mwh | number | Gross megawatt-hours delivered in power exchanges and used as the basis for settlement.
demand_charges | number | Demand charges. Nominal USD.
energy_charges | number | Energy charges. Nominal USD.
non_coincident_peak_demand_mw | number | Average monthly non-coincident peak (NCP) demand (for requirements purchases, and any transactions involving demand charges). Monthly NCP demand is the maximum metered hourly (60-minute integration) demand in a month. In megawatts.
other_charges | number | Other charges, including out-of-period adjustments. Nominal USD.
purchase_type | string | Categorization based on the original contractual terms and conditions of the service. Must be one of 'requirements', 'long_firm', 'intermediate_firm', 'short_firm', 'long_unit', 'intermediate_unit', 'electricity_exchange', 'other_service', or 'adjustment'. Requirements service is ongoing high reliability service, with load integrated into system resource planning. 'Long term' means 5+ years. 'Intermediate term' is 1-5 years. 'Short term' is less than 1 year. 'Firm' means not interruptible for economic reasons. 'unit' indicates service from a particular designated generating unit. 'exchange' is an in-kind transaction.
purchased_mwh | number | Megawatt-hours shown on bills rendered to the respondent.
received_mwh | number | Gross megawatt-hours received in power exchanges and used as the basis for settlement.
record_id | string | Identifier indicating original FERC Form 1 source record. Format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.
report_year | year | Four-digit year in which the data was reported.
seller_name | string | Name of the seller, or the other party in an exchange transaction.
tariff | string | FERC Rate Schedule Number or Tariff. (Note: may be incomplete if originally reported on multiple lines.)
total_settlement | number | Sum of demand, energy, and other charges. For power exchanges, the settlement amount for the net receipt of energy. If more energy was delivered than received, this amount is negative. Nominal USD.
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
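The total_settlement field is defined as the sum of the demand, energy, and other charges, which makes it a convenient internal consistency check. A sketch under the same hypothetical pudl.sqlite setup as above:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # hypothetical local path
pp = pd.read_sql("purchased_power_ferc1", engine)

# total_settlement should equal demand + energy + other charges.
charges = pp[["demand_charges", "energy_charges", "other_charges"]].sum(axis=1)
off = (pp["total_settlement"] - charges).abs() > 1.0  # allow $1 of rounding slack
print(f"{off.sum()} records where the charges don't add up")
```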
transport_modes_eia923¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
abbr | string | N/A
mode | string | N/A
utilities_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
utility_id_eia | integer | The EIA Utility Identification number.
utility_id_pudl | integer | A manually assigned PUDL utility ID. May not be stable over time.
utility_name_eia | string | The name of the utility.
utilities_eia860¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
address_2 | string | N/A
attention_line | string | N/A
city | string | Name of the city in which the operator/owner is located.
contact_firstname | string | N/A
contact_firstname_2 | string | N/A
contact_lastname | string | N/A
contact_lastname_2 | string | N/A
contact_title | string | N/A
contact_title_2 | string | N/A
entity_type | string | Entity type of principal owner (C = Cooperative, I = Investor-Owned Utility, Q = Independent Power Producer, M = Municipally-Owned Utility, P = Political Subdivision, F = Federally-Owned Utility, S = State-Owned Utility, IND = Industrial, COM = Commercial).
phone_extension_1 | string | Phone extension for contact 1.
phone_extension_2 | string | Phone extension for contact 2.
phone_number_1 | string | Phone number for contact 1.
phone_number_2 | string | Phone number for contact 2.
plants_reported_asset_manager | string | Is the reporting entity an asset manager of power plants reported on Schedule 2 of the form?
plants_reported_operator | string | Is the reporting entity an operator of power plants reported on Schedule 2 of the form?
plants_reported_other_relationship | string | Does the reporting entity have any other relationship to the power plants reported on Schedule 2 of the form?
plants_reported_owner | string | Is the reporting entity an owner of power plants reported on Schedule 2 of the form?
report_date | date | Date reported.
state | string | State of the operator/owner.
street_address | string | Street address of the operator/owner.
utility_id_eia | integer | EIA-assigned identification number for the company that is responsible for the day-to-day operations of the generator.
zip_code | string | Zip code of the operator/owner.
zip_code_4 | string | N/A
utilities_entity_eia¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
utility_id_eia | integer | The EIA Utility Identification number.
utility_name_eia | string | The name of the utility.
utilities_ferc1¶
This table maps the manually assigned PUDL utility ID to a FERC respondent ID, enabling a connection between the FERC and EIA data sets. It also stores the utility name associated with the FERC respondent ID. Those values originate in the f1_respondent_id table in FERC's FoxPro database, which is stored in a file called F1_1.DBF. This table is generated from a spreadsheet stored in the PUDL repository: results/id_mapping/mapping_eia923_ferc1.xlsx Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
utility_id_ferc1 | integer | FERC assigned respondent_id, identifying the reporting entity. Stable from year to year.
utility_id_pudl | integer | A manually assigned PUDL utility ID. May not be stable over time.
utility_name_ferc1 | string | Name of the responding utility, as it is reported in FERC Form 1. For human readability only.
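Because utility_id_pudl appears in both the FERC and EIA utility tables, it acts as the crosswalk between the two reporting worlds. A small sketch of the join this table enables (same hypothetical pudl.sqlite as above):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # hypothetical local path
ferc = pd.read_sql("utilities_ferc1", engine)
eia = pd.read_sql("utilities_eia", engine)

# utility_id_pudl is the glue linking FERC respondents to EIA utilities.
xwalk = ferc.merge(eia, on="utility_id_pudl", how="inner")
print(xwalk[["utility_id_ferc1", "utility_name_ferc1",
             "utility_id_eia", "utility_name_eia"]].head())
```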
utilities_pudl¶
Home table for PUDL assigned utility IDs. These IDs are manually generated each year when new FERC and EIA reporting is integrated, and any newly found utilities are added to the list with a new ID. Each ID maps to a power plant owning or operating entity which is reported in at least one FERC or EIA data set. This table is read in from a spreadsheet stored in the PUDL repository: src/pudl/package_data/glue/mapping_eia923_ferc1.xlsx Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
utility_id_pudl | integer | A manually assigned PUDL utility ID. May not be stable over time.
utility_name_pudl | string | Utility name, chosen arbitrarily from the several possible utility names available in the utility matching process. Included for human readability only.
utility_plant_assn¶
Pending description. Browse or query this table in Datasette.
Field Name | Type | Description
---|---|---
plant_id_pudl | integer | N/A
utility_id_pudl | integer | N/A
FERC Form 1 Data Dictionary¶
We have mapped the Visual FoxPro DBF files to their corresponding FERC Form 1 database tables and provided a short description of the contents of each table here.
Note
The Table Names link to the contents of the database table on our FERC Form 1 Datasette deployment where you can browse and query the raw data yourself or download the SQLite DB in its entirety.
The mapping of File Name to Table Name is consistent across all years of data.
Page numbers correspond to the pages of the FERC Form 1 PDF as it appeared in 2015 and may not be valid for other years.
Many tables without descriptions were discontinued prior to 2015.
The “Freq” column indicates the reporting frequency – A for Annual; Q for Quarterly. A/Q if the data is reported both annually and quarterly.
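Once the SQLite version of the FERC Form 1 database has been downloaded from the Datasette deployment, the raw tables below can be queried directly. A minimal sketch; the file path is hypothetical and the column names are assumptions based on the raw FoxPro schema:

```python
import sqlite3

conn = sqlite3.connect("ferc1.sqlite")  # hypothetical local path

# f1_gnrt_plant (F1_33.DBF) holds the small plants data; respondent_id,
# report_year, and plant_name are assumed raw column names.
for row in conn.execute(
    "SELECT respondent_id, report_year, plant_name FROM f1_gnrt_plant LIMIT 5"
):
    print(row)
```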
| Table Name | File Name | Pages | Freq | Table Description |
|---|---|---|---|---|
| | F1_106_2009.DBF | 106 | A | Information on Formula Rates |
| | F1_106A_2009.DBF | 106 | A | Information on Formula Rates |
| | F1_106B_2009.DBF | 106 | A | Information on Formula Rates |
| | F1_208_ELC_DEP.DBF | 208 | Q | Electric Plant In Service and Accumulated Provision For Depreciation by Function |
| | F1_231_TRN_STDYCST.DBF | 231 | A/Q | Transmission Service and Generation Interconnection Study Costs |
| | F1_324_ELC_EXPNS.DBF | 324 | Q | Electric Production, Other Power Supply Expenses, Transmission and Distribution Expenses |
| | F1_325_ELC_CUST.DBF | 325 | Q | Electric Customer Accounts, Service, Sales, Administration and General Expenses |
| | F1_331_TRANSISO.DBF | 331 | A/Q | Transmission of Electricity by ISO/RTOs |
| | F1_338_DEP_DEPL.DBF | 338 | Q | Depreciation, Depletion and Amortization of Electric Plant (FERC Accounts 403, 403.1, 404, and 405, except Amortization of Acquisition Adjustments) |
| | F1_397_ISORTO_STL.DBF | 397 | A/Q | Amounts Included in ISO/RTO Settlement Statements |
| | F1_398_ANCL_PS.DBF | 398 | A | Purchases and Sales of Ancillary Services |
| | F1_399_MTH_PEAK.DBF | 399 | A/Q | Monthly Peak Loads and Energy Output |
| | F1_400_SYS_PEAK.DBF | 400 | A/Q | Monthly Transmission System Peak Load |
| | F1_400A_ISO_PEAK.DBF | 980, 400a | A/Q | Monthly ISO/RTO Transmission System Peak Load |
| | F1_429_TRANS_AFF.DBF | 429 | A | Transactions with Associated (Affiliated) Companies |
| | F1_2.DBF | 336-337 | A | Depreciation & Amortization of Electric Plant (Basis for Amortization Charges) |
| | F1_3.DBF | 219 | A | Accumulated Provision for Depreciation of Electric Utility Plant (Account 108) |
| | F1_4.DBF | 266-267 | A | Accumulated Deferred Investment Tax Credits |
| | F1_5.DBF | 234-234a | A | Accumulated Deferred Income Taxes (Individual Schedule Lines) |
| | F1_6.DBF | 234-234b | A | Accumulated Deferred Income Taxes (Notes) |
| | F1_7.DBF | 272-273 | A | Accumulated Deferred Income Taxes - Accelerated Amortization Property |
| | F1_8.DBF | 276-277 | A | Accumulated Deferred Income Taxes - Other |
| | F1_9.DBF | 274-275 | A | Accumulated Deferred Income Taxes - Other Property |
| | F1_10.DBF | 228-229 | A | Allowances |
| | F1_ALLOWANCES_NOX.DBF | 230-230a | A | |
| | F1_78.DBF | | | |
| | F1_11.DBF | 112-113 | A/Q | Comparative Balance Sheet (Liabilities & Other Credits) |
| | F1_12.DBF | 250-251 | A | Capital Stock |
| | F1_13.DBF | 120-121 | A/Q | Statement of Cash Flows |
| | F1_14.DBF | 356 | A | Common Utility Plant & Expenses |
| | F1_CMPINC_HEDGE.DBF | 990, 122(a)(b) | A/Q | Statement of Accumulated Comparative Income, Comparative Income, and Hedging Activities |
| | F1_CMPINC_HEDGE_A.DBF | 990 | | |
| | F1_18.DBF | 105 | A | Names, Titles, and Addresses of Directors |
| | F1_76.DBF | | | |
| | F1_79.DBF | | | Descriptive headers for each column in the Form 1. Useful for discerning their semantic content. |
| | F1_15.DBF | 110-111 | A/Q | Comparative Balance Sheet (Assets & Other Debits) |
| | F1_16.DBF | 217 | | Spending on Construction (1994-2002 only) |
| | F1_17.DBF | 102 | A | Control Over Respondent |
| | F1_19.DBF | 254-254b | A | Capital Stock Expense |
| | F1_20.DBF | 252 | | |
| | F1_21.DBF | 336-337 | A | Depreciation & Amortization of Electric Plant (Depreciation & Amortization Charges) |
| | F1_22.DBF | 254 | | |
| | F1_23.DBF | 336-337 | A | Depreciation & Amortization of Electric Plant (Factors Used in Estimating Depreciation Charges) |
| | F1_27.DBF | 320-323 | A | Electric Operation & Maintenance Expenses |
| | F1_26.DBF | 300-301b | A/Q | Electric Operating Revenues (Unbilled Revenues Only) |
| | F1_24.DBF | 401-401a | A | Electric Energy Account |
| | F1_25.DBF | 300-301a | A/Q | Electric Operating Revenues (Individual Schedule Lines) |
| | F1_28.DBF | 429 | | |
| | F1_EMAIL.DBF | | | |
| | F1_29.DBF | 431 | | |
| | F1_30.DBF | 430 | | |
| f1_footnote_data | F1_85.DBF | 450 | A/Q | Footnote Data |
| f1_footnote_tbl | F1_87.DBF | | | |
| f1_freeze | F1_FREEZE.DBF | | | |
| | F1_31.DBF | 402-403b | A | Steam-Electric Generation Plant Statistics - Large Plants (Fuel Details) |
| | F1_32.DBF | 101 | A | General Information |
| | F1_33.DBF | 410-411 | A | Generating Plant Statistics (Small Plants) |
| | F1_86.DBF | 406-407 | A | Hydroelectric Gen Plant Stats (Large Plants) |
| | F1_88.DBF | 1 | A/Q | Identification & Attestation |
| | F1_34.DBF | 108-109 | A/Q | Important Changes During the Quarter/Year |
| | F1_35.DBF | 114-117b | A/Q | Statement of Income (Other Income & Deductions, Interest Charges, Extraordinary Items) |
| | F1_36.DBF | 114-117a | A/Q | Statement of Income |
| | F1_90.DBF | 213 | A | Electric Plant Leased to Others |
| | F1_80.DBF | | | |
| | F1_93.DBF | 256-257 | A | Long-Term Debt |
| | F1_38.DBF | 233 | A | Miscellaneous Deferred Debits |
| | F1_37.DBF | 335 | A | Miscellaneous General Expenses - Electric |
| | F1_39.DBF | 401-401b | A | Monthly Peaks & Output |
| | F1_40.DBF | 227, 228-229 | A | Materials & Supplies |
| | F1_41.DBF | 320 | | |
| | F1_42.DBF | 221 | | |
| f1_note_fin_stmnt | F1_43.DBF | 122-123 | A/Q | Notes to Financial Statements |
| | F1_44.DBF | 202-203 | A | Nuclear Fuel Materials |
| | F1_45.DBF | 104 | A | Officers |
| | F1_46.DBF | 269 | A | Other Deferred Credits |
| | F1_47.DBF | 253 | A | Other Paid-in Capital |
| | F1_48.DBF | 232 | A/Q | Other Regulatory Assets |
| | F1_49.DBF | 278 | A/Q | Other Regulatory Liabilities |
| | F1_50.DBF | 218 | | |
| | F1_51.DBF | 340 | | |
| f1_pins | F1_PINS.DBF | | | |
| | F1_92.DBF | 204, 214 | A | Electric Plant Held for Future Use |
| | F1_52.DBF | 204-207 | A | Electric Plant in Service |
| | F1_81.DBF | | | |
| | F1_53.DBF | 408-409 | A | Pumped Storage Generating Plant Statistics (Large Plants) |
| | F1_54.DBF | 326-327 | A | Purchased Power |
| | F1_59.DBF | 352-353 | A | Research, Development & Demonstration Activities |
| | F1_55.DBF | 261 | A | Reconciliation of Reported Net Income with Taxable Income for Federal Income Taxes |
| | F1_56.DBF | 350-351 | A | Regulatory Commission Expenses |
| | F1_57.DBF | 103 | A | Corporations Controlled by Respondent |
| | F1_1.DBF | | | Respondent ID |
| | F1_58.DBF | 118-119 | A/Q | Statement of Retained Earnings for the Year |
| | F1_RG_TRN_SRV_REV.DBF | 302 | A/Q | Regional Transmission Service Revenues (Account 457.1) |
| | F1_84.DBF | | | Descriptive labels for each numbered row in the Form 1. Useful for identifying semantic content and changes in line numbers from year to year. |
| | F1_S0_CHECKS.DBF | | | |
| | F1_S0_FILING_LOG.DBF | | | |
| | F1_61.DBF | 310-311 | A | Sales for Resale |
| | F1_60.DBF | 304 | A | Sales of Electricity by Rate Schedules |
| | F1_91.DBF | 224-225 | A | Investment in Subsidiary Companies (Account 123.1) |
| | F1_62.DBF | 224-225 | A | Investment in Subsidiary Companies (Total Line for Schedule) |
| | F1_77.DBF | | | |
| | F1_63.DBF | 002-004 | A/Q | List of Schedules |
| | F1_SECURITY.DBF | 106 | | |
| | F1_64.DBF | 106 | | |
| | F1_65.DBF | 354-355 | A | Distribution of Salaries & Wages |
| | F1_89.DBF | 402-403a | A | Steam-Electric Generation Plant Statistics - Large Plants (Plant Information) |
| | F1_66.DBF | 426-427 | A | Substations |
| | F1_82.DBF | | | |
| | F1_67.DBF | 262-263 | A | Taxes Accrued, Prepaid & Charged During Year |
| | F1_83.DBF | | | |
| | F1_68.DBF | 230-230b | A | Unrecovered Plant & Regulatory Study Costs |
| | F1_69.DBF | 200-201 | A/Q | Summary of Utility Plant & Accumulated Provisions for Depreciation, Amortization, & Depletion |
| | F1_70.DBF | 216 | A | Construction Work in Progress - Electric |
| | F1_71.DBF | 424-425 | A | Transmission Lines Added During Year |
| | F1_72.DBF | 332 | A/Q | Transmission of Electricity by Others |
| | F1_73.DBF | 328-330 | A/Q | Transmission of Electricity for Others |
| | F1_74.DBF | 422-423 | A | Transmission Line Statistics |
| | F1_75.DBF | 230-230a | A | Extraordinary Property Losses |
Contributing to PUDL¶
Welcome! We’re excited that you’re interested in contributing to the Public Utility Data Liberation effort! The work is currently being coordinated by the members of the Catalyst Cooperative. PUDL is meant to serve a wide variety of public interests including academic research, climate advocacy, data journalism, and public policy making. This open source project has been supported by a combination of volunteer contributions, grant funding from the Alfred P. Sloan Foundation, and reinvestment of net income from the cooperative’s client projects.
Please make sure you review our code of conduct, which is based on the Contributor Covenant. We want to make the PUDL project welcoming to contributors with different levels of experience and diverse personal backgrounds.
How to Get Involved¶
We welcome just about any kind of contribution to the project. Alone, we’ll never be able to understand every use case or integrate all the available data. The project will serve the community better if other folks get involved.
There are lots of ways to contribute – it’s not all about code!
Ask questions on GitHub using the issue tracker.
Suggest new data and features that would be useful.
File bug reports on GitHub.
Help expand and improve the documentation, or create new example notebooks.
Help us create more and better software test cases.
Give us feedback on overall usability – what's confusing?
Tell us a story about how you're using the data.
Point us at interesting publications related to open energy data, open source energy system modeling, how energy policy can be affected by better data, or open source tools we should check out.
Cite PUDL using DOIs from Zenodo if you use the software or data in your own published work.
Point us toward appropriate grant funding opportunities and meetings where we might present our work.
Share your Jupyter notebooks and other analyses that use PUDL.
Hire Catalyst to do analysis for your organization using the PUDL data – contract work helps us self-fund ongoing open source development.
Contribute code via pull requests. See the developer setup for more details.
And of course… we also appreciate financial contributions.
See also
Development Setup for instructions on how to set up the PUDL development environment.
Find us on GitHub¶
GitHub is the primary platform we use to manage the project, integrate contributions, write and publish documentation, answer user questions, automate testing & deployment, etc. Signing up for a GitHub account (even if you don't intend to write code) will allow you to participate in online discussions and track projects that you're interested in.
Asking (and answering) questions is a valuable contribution! As noted in How to support open-source software and stay sane, it's much more efficient to ask and answer questions in a public forum because then other users and contributors who are having the same problem can find answers without having to re-ask the same question. The forum we're using is our GitHub issues.
Even if you feel like you have a basic question, we want you to feel comfortable asking for help in public – we (Catalyst) only recently came to this data work from being activists and policy wonks – so it’s easy for us to remember when it all seemed frustrating and alien! Sometimes it still does. We want people to use the software and data to do good things in the world. We want you to be able to access it. Using a public forum also enables the community of users to help each other!
Don’t hesitate to open an issue with a feature request, a pointer to energy data that needs liberating, or a reference to documentation that’s out of date, unclear, or missing. Understanding how people are using the software, and how they would like to be using the software, is very valuable and will help us make it more useful and usable.
Development¶
Development Setup¶
This page will walk you through what you need to do if you want to be able to contribute code or documentation to the PUDL project.
These instructions assume that you are working on a Unix-like operating system (MacOS or Linux) and are already familiar with git, GitHub, and the Unix shell.
Warning
While it should be possible to set up the development environment on Windows, we haven't done it. In the future we may create a Docker image that provides the development environment, e.g. for use with VS Code's Containers extension.
Note
If you’re new to git
and GitHub , you’ll want to
check out:
Install conda¶
We use the conda package manager to specify and update our development environment, preferentially installing packages from the community maintained conda-forge distribution channel. We recommend using miniconda rather than the large pre-defined collection of scientific packages bundled together in the Anaconda Python distribution. You may also want to consider using mamba – a faster drop-in replacement for conda written in C++.
After installing a conda package manager, make sure it's configured to use strict channel priority with the following commands:
$ conda update conda
$ conda config --set channel_priority strict
Fork and Clone the PUDL Repository¶
Unless you’re part of the Catalyst Cooperative organization already, you’ll need to fork the PUDL repository This makes a copy of it in your personal (or organizational) account on GitHub that is independent of, but linked to, the original “upstream” project.
Then, clone the repository from your fork to your local computer where you’ll be editing the code or docs. This will download the whole history of the project, including the most recent version, and put it in a local directory where you can make changes.
Create the PUDL Dev Environment¶
Inside the devtools directory of your newly cloned repository, you'll see an environment.yml file that specifies the pudl-dev conda environment. You can create and activate that environment from within the main repository directory by running:
$ conda update conda
$ conda env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev
This environment installs the catalystcoop.pudl package directly using the code in your cloned repository so that it can be edited during development. It also installs all of the software PUDL depends on, some packages for testing and quality control, packages for working with interactive Jupyter Notebooks, and a few Python packages that have binary dependencies which can be easier to satisfy through conda packages.
Getting and Storing an EIA API Key¶
PUDL accesses Energy Information Administration (EIA) datasets via an API, which requires permission from the EIA. New users must register for an API key, which is free, nearly instantaneous, and only requires you to provide an email address.
To make this key accessible to PUDL, store it in an environment variable and reactivate the environment:
$ conda activate pudl-dev
$ conda env config vars set API_KEY_EIA='your_api_key_here'
$ conda activate pudl-dev
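Once the variable is set, any Python code running inside the pudl-dev environment can read it from the process environment. A minimal sketch using only the standard library:

import os

# API_KEY_EIA was set via `conda env config vars set` above.
api_key = os.environ.get("API_KEY_EIA")
if api_key is None:
    raise RuntimeError("API_KEY_EIA is not set; see the instructions above.")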
Updating the PUDL Dev Environment¶
You will need to periodically update your development (pudl-dev) conda environment to get newer versions of existing dependencies and incorporate any changes to the environment specification that have been made by other contributors. The most reliable way to do this is to remove the existing environment and recreate it.
Note
Different development branches within the repository may specify their own slightly different versions of the pudl-dev conda environment. As a result, you may need to update your environment when switching from one branch to another.
If you want to work with the most recent version of the code on a branch named new-feature, then from within the top directory of the PUDL repository you would do:
$ git checkout new-feature
$ git pull
$ conda deactivate
$ conda update conda
$ conda env remove --name pudl-dev
$ conda env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev
If you find yourself recreating the environment frequently and are frustrated by how long it takes conda to solve the dependencies, we recommend using the mamba solver. You'll want to install it in your base conda environment – i.e. with no conda environment activated:
$ conda deactivate
$ conda install mamba
Then the above development environment update process would become:
$ git checkout new-feature
$ git pull
$ conda deactivate
$ mamba update mamba
$ mamba env remove --name pudl-dev
$ mamba env create --name pudl-dev --file devtools/environment.yml
$ conda activate pudl-dev
If you are working with locally processed data and there have been changes to the expectations about that data in the PUDL software, you may also need to regenerate your PUDL SQLite database or other outputs. See Running the ETL Pipeline for more details.
Set Up Code Linting¶
We use several automated tools to apply uniform coding style and formatting across the project codebase. This is known as code linting, and it reduces merge conflicts, makes the code easier to read, and helps catch some types of bugs before they are committed. These tools are part of the pudl-dev conda environment and their configuration files are checked into the GitHub repository. If you've cloned the pudl repo and are working inside the pudl conda environment, they should be installed and ready to go.
Git Pre-commit Hooks¶
Git hooks let you automatically run scripts at various points as you manage your source code. “Pre-commit” hook scripts are run when you try to make a new commit. These scripts can review your code and identify bugs, formatting errors, bad coding habits, and other issues before the code gets checked in. This gives you the opportunity to fix those issues before publishing them.
To make sure they are run before you commit any code, you need to enable the pre-commit hooks scripts with this command:
$ pre-commit install
The scripts that run are configured in the .pre-commit-config.yaml file.
See also
The pre-commit project: A framework for managing and maintaining multi-language pre-commit hooks.
Real Python Code Quality Tools and Best Practices gives a good overview of available linters and static code analysis tools.
Code and Docs Linters¶
Flake8 is a popular Python linting framework with a large selection of plugins. We use it to check the formatting and syntax of the code and docstrings embedded within the PUDL packages.
Doc8 is a lot like flake8, but for Python documentation written in the reStructuredText format and built by Sphinx. This is the de-facto standard for Python documentation. The doc8 tool checks for syntax errors and other formatting issues in the documentation source files under the docs/ directory.
Automatic Formatting¶
Rather than alerting you that there’s a style issue in your Python code, autopep8 tries to fix it for you automatically, applying consistent formatting rules based on PEP 8. Similarly isort automatically groups and orders Python import statements in each module to minimize diffs and merge conflicts.
Linting Within Your Editor¶
If you are using an editor designed for Python development many of these code linting and formatting tools can be run automatically in the background while you write code or documentation. Popular editors that work with the above tools include:
Visual Studio Code, from Microsoft (free)
Atom developed by GitHub (free), and
Sublime Text (paid).
Each of these editors have their own collection of plugins and settings for working with linters and other code analysis tools.
Creating a Workspace¶
PUDL needs to know where to store its big piles of inputs and outputs. It also comes with some example configuration files. The pudl_setup script lets PUDL know where all this stuff should go. We call this a "PUDL workspace":
$ pudl_setup <PUDL_DIR>
Here <PUDL_DIR> is the path to the directory where you want PUDL to do its business – this is where the datastore will be located and where any outputs that are generated end up. The script will also put a configuration file called .pudl.yml in your home directory that records the location of this workspace and uses it by default in the future. If you run pudl_setup with no arguments, it assumes you want to use the current working directory.
The workspace is laid out like this:
| Directory / File | Contents |
|---|---|
| data/ | Raw data, automatically organized by source, year, etc. |
| datapkg/ | Tabular data packages generated by PUDL. |
| parquet/ | Apache Parquet files generated by PUDL. |
| settings/ | Example configuration files for controlling PUDL scripts. |
| sqlite/ | SQLite databases generated by PUDL (e.g. ferc1.sqlite and pudl.sqlite). |
Settings Files¶
Several of the scripts provided as part of PUDL require more arguments than can be easily managed on the command line. It’s also useful to preserve a record of how the data processing pipeline was run in one instance so that it can be re-run in exactly the same way. We have these scripts read their settings from YAML files, examples of which are included in the distribution.
There are two example files that are deployed into a user's workspace with the pudl_setup script (see: Creating a Workspace). The two settings files direct PUDL to process 1 year ("fast") and all years ("full") of data respectively. Each file contains parameters for both the ferc1_to_sqlite and the pudl_etl scripts.
Settings for ferc1_to_sqlite¶
| Parameter | Description |
|---|---|
| ferc1_to_sqlite_refyear | A single 4-digit year to use as the reference for inferring the FERC Form 1 database's structure. Typically the most recent year of available data. |
| ferc1_to_sqlite_years | A list of years to be included in the cloned FERC Form 1 database. You should only use a continuous range of years. 1994 is the earliest year available. |
| ferc1_to_sqlite_tables | A list of strings indicating which tables to load. The list of acceptable tables can be found in the example settings file and corresponds to the table names used in FERC's original FoxPro database. |
Settings for pudl_etl¶
The pudl_etl script requires a YAML settings file. In the repository this example file lives in src/pudl/package_data/settings. This example file (etl_example.yml) is deployed into the settings directory within the PUDL workspace when the pudl_setup script is run. Once this file is in the settings directory, users can copy it and modify it as appropriate for their own use.
This settings file allows users to determine the scope of the data integrated by PUDL. Most datasets can be used to generate stand-alone data packages. If you only want to use FERC Form 1, you can remove the other data package specifications or alter their parameters such that none of their data is processed (e.g. by setting the list of years to an empty list). The settings are verified early on in the ETL process, so if you get something wrong, you should see an assertion error quickly.
While PUDL largely keeps datasets disentangled for ETL purposes (enabling stand-alone ETL), the EPA CEMS and EIA datasets are exceptions. EPA CEMS cannot be loaded without EIA because it relies on IDs that come from EIA 860. Similarly, EIA Forms 860 and 923 are very tightly related. You can load only EIA 860, but the settings verification will automatically add in a few 923 tables that are needed to generate the complete list of plants and generators.
Warning
If you are processing the EIA 860/923 data, we strongly recommend including the same years in both datasets. We only test two combinations of inputs:
That all available years of EIA 860/923 can be processed together, and
That the most recent year of both datasets can be processed together.
Other combinations of years may yield unexpected results.
Structure of the pudl_etl Settings File¶
The general structure of the settings file and the names of the keys of the dictionaries should not be changed, but the values of those dictionaries can be edited. There are two high-level elements of the settings file which pertain to the entire bundle of tabular data packages which will be generated: datapkg_bundle_name and datapkg_bundle_settings. The datapkg_bundle_name determines which directory the data packages are written into. The elements and structure of the datapkg_bundle_settings are described below:
datapkg_bundle_settings
├── name : unique name identifying the data package
│ title : short human readable title for the data package
│ description : a longer description of the data package
│ datasets
│ ├── dataset name
│ │ ├── dataset etl parameter (e.g. states) : list of states
│ │ └── dataset etl parameter (e.g. years) : list of years
│ └── dataset name
│     ├── dataset etl parameter (e.g. states) : list of states
│     └── dataset etl parameter (e.g. years) : list of years
└── another data package...
The dataset names must not be changed. The enabled dataset names are: eia (which includes Forms 860/923 only for now), ferc1, and epacems. Any other dataset name will result in an assertion error. A concrete example of this structure is sketched below.
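To make the structure above concrete, here is a hypothetical bundle specification expressed as a Python dict mirroring the YAML layout. The bundle name, years, and states are arbitrary examples, not a tested configuration; check etl_example.yml for the authoritative parameter names:

datapkg_bundle_settings = [
    {
        "name": "example-bundle",  # unique name identifying the data package
        "title": "Example data package",
        "description": "One recent year of FERC Form 1, plus EPA CEMS for Idaho.",
        "datasets": [
            {"ferc1": {"years": [2019]}},
            {"epacems": {"years": [2019], "states": ["ID"]}},
        ],
    },
]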
Note
We strongly recommend leaving the arguments that specify which database tables are generated unchanged – i.e. always include all of the tables; many analyses require data from multiple tables, and removing a few tables doesn’t change how long the ETL process takes by much.
Dataset ETL parameters (like years, states, and tables) will only register if they are part of the correct dataset. If you put a FERC Form 1 ETL parameter in an EIA dataset specification, FERC Form 1 will not be loaded as part of that dataset. For an exhaustive listing of the available parameters, see the etl_example.yml file.
Running the ETL Pipeline¶
So you want to run the PUDL data processing pipeline? This is the most involved way to get access to PUDL data. It’s only recommended if you want to edit the ETL process or contribute to the code base. Check out the Data Access documentation if you just want to use the processed data.
These instructions assume you have already gone through the development setup (see: Development Setup).
There are four main scripts that are involved in the PUDL processing pipeline:
ferc1_to_sqlite converts the FERC Form 1 DBF files into a single large SQLite database so that the data is easier to extract.
pudl_etl is where the magic happens. This is the main script which coordinates the "Extract, Transform, Load" process that generates Tabular Data Packages.
datapkg_to_sqlite converts the Tabular Data Packages into a SQLite database. We recommend doing this for all of the smaller to medium sized tables, which is currently everything but the hourly EPA CEMS data.
epacems_to_parquet converts the (~1 billion row) EPA CEMS Data Package into Apache Parquet files for fast on-disk querying.
Settings files dictate which datasets, years, tables, or states get run through the processing pipeline. Two example settings files are provided in the settings folder that is created when you run pudl_setup.
See also
Creating a Workspace for more on how to create a PUDL data workspace.
Settings Files for info details on the contents of the settings files.
The Fast ETL¶
Running the fast ETL processes one year of data for each dataset. This is what we do in our software integration tests.
$ ferc1_to_sqlite settings/etl_fast.yml
$ pudl_etl settings/etl_fast.yml
$ datapkg_to_sqlite \
datapkg/pudl-fast/ferc1/datapackage.json \
datapkg/pudl-fast/epacems-eia/datapackage.json
$ epacems_to_parquet --years 2019 --states ID -- \
datapkg/pudl-fast/epacems-eia/datapackage.json
The Full ETL¶
The full ETL setting file includes all the datasets with all of the years and tables with the exception of EPA CEMS. A full ETL for EPA CEMS can take up to 15 hours of processing time, so the example setting here is all years of CEMS for one state (Idaho!) and takes around 20 minutes to process.
$ ferc1_to_sqlite settings/etl_full.yml
$ pudl_etl settings/etl_full.yml
$ datapkg_to_sqlite datapkg/pudl-full/ferc1/datapackage.json \
datapkg/pudl-full/eia/datapackage.json
$ epacems_to_parquet --states ID -- datapkg/pudl-full/epacems-eia/datapackage.json
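Once the Parquet outputs exist, they can be queried efficiently from Python. A minimal sketch assuming pandas with the pyarrow engine installed; the path and partition layout shown here are illustrative, so check your own workspace for the actual directory structure:

import pandas as pd

# Reads all row groups under the given directory. The partitioning shown
# (year/state) is an assumption for illustration.
cems_id = pd.read_parquet("parquet/epacems/year=2019/state=ID")
print(cems_id.shape)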
Additional Notes¶
These commands should produce a bunch of Python logging output describing what the script is doing, along with file outputs in the sqlite, datapkg, and parquet directories within your workspace. When the ETL is complete, you should see new files at sqlite/ferc1.sqlite and sqlite/pudl.sqlite as well as a new directory at datapkg/pudl-fast or datapkg/pudl-full containing several datapackage directories – one for each of the ferc1, eia (Forms 860 and 923), and epacems-eia datasets.
Each of the data packages that are part of the bundle has metadata describing its structure. This metadata is stored in the associated datapackage.json file. The data are stored in a bunch of CSV files (some of which may be gzip compressed) in the data/ directories of each data package.
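Because every data package carries a standard Frictionless datapackage.json, you can inspect its contents with nothing but the standard library. A minimal sketch, assuming the pudl-fast bundle generated above:

import json
from pathlib import Path

descriptor = json.loads(
    Path("datapkg/pudl-fast/ferc1/datapackage.json").read_text()
)
# Every frictionless descriptor lists its resources by name and path.
for resource in descriptor["resources"]:
    print(resource["name"], "->", resource["path"])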
You can use the pudl_etl script to process more or different data by copying and editing either of the settings files and running the script again with your new settings file as an argument. Comments in the example settings file explain the available parameters. Note that these example files are the only configurations that are tested automatically and supported.
If you want to re-run pudl_etl and replace an existing bundle of data packages, you can use --clobber. If you want to generate a new bundle of data packages with a new or modified settings file, you can change the name of the output datapackage bundle in the configuration file.
All of the PUDL scripts have help messages if you want additional information (run script_name --help).
Project Management¶
The people working on PUDL are distributed all over North America. Collaboration takes place online. We make extensive use of GitHub's project management tools as well as ZenHub, which provides additional features for sprint planning, task estimation, and progress reports.
Issues and Project Tracking¶
We use Github issues to track bugs, enhancements, support requests, and just about any other work that goes into the project. Try to make sure that issues have informative tags so we can find them easily.
We use ZenHub Sprints, Epics, and Releases to track our progress. These won't be visible unless you have the ZenHub browser extension installed.
GitHub Workflow¶
We have 2 persistent branches: main and dev.
We create temporary feature branches off of dev and make pull requests to dev throughout our 2 week long sprints.
At the end of each sprint, assuming all the tests are passing, dev is merged into main.
Pull Requests¶
Before making a PR, make sure the tests run and pass locally, including the code linters and pre-commit hooks. See Set Up Code Linting for details.
Don’t forget to merge any new commits to the dev branch into your feature branch before making a PR.
If for some reason the continuous integration tests fail for your PR, try and figure out why and fix it, or ask for help. If the tests fail, we don’t want to merge it into dev. You can see the status of the CI builds in the GitHub Actions for the PUDL repo.
Please don’t decrease the overall test coverage – if you introduce new code, it also needs to be exercised by the tests. See Testing PUDL for details.
Write good docstrings using the Google format.
Pull Requests should update the documentation to reflect changes to the code, especially if it changes something user-facing, like how one of the command line scripts works.
Releases¶
Periodically, we tag a new release on main and upload the packages to the Python Package Index and conda-forge.
Whenever we tag a release on Github, the repository is archived on Zenodo and issued a DOI.
For some software releases we archive processed data on Zenodo along with a Docker container that encapsulates the necessary software environment.
User Support¶
We don’t (yet) have funding to do user support, so it’s currently all community and volunteer based. In order to ensure that others can find the answers to questions that have already been asked, we try to do all support in public using Github issues.
Testing PUDL¶
We use Tox to coordinate our software testing and to manage other build and sanity checking tools. Under the hood, it invokes a variety of other command-line tools in predefined combinations that are described in tox.ini. These include software tests defined using pytest, code linters like flake8, documentation generators like Sphinx, and sanity checks defined as git pre-commit hooks. Each of these tools, or sometimes collections of related tools, can be selected at the command line. They can also be run independently without using Tox, but for the sake of simplicity and standardization, we try to mostly just run them using the predefined settings we have configured in Tox.
The simplest way to test PUDL – which is also how the code is tested automatically by our continuous integration setup – is to just run Tox alone with no arguments. This will typically take 25 minutes to run.
$ tox
Note
If you aren’t familiar with pytest and Tox already, you may want to go peruse their introductory documentation.
Software Tests¶
Our pytest based software tests are all stored under the test/ directory in the main repository. They are organized into 3 broad categories, each with its own subdirectory:
Software Unit Tests (test/unit/) can be run in seconds and don't require any external data. They test the basic functionality of various functions and classes, often using minimal inline data structures that are specified in the test modules themselves.
Software Integration Tests (test/integration/) test larger collections of functionality including the interactions between different parts of the overall software system and in some cases interactions with external systems requiring network connectivity. The main thing our integration tests do is run the full PUDL data processing pipeline for the most recent year of data. This takes around 15 minutes.
Data Validations (test/validate/) sanity check the PUDL outputs generated by the data processing pipeline. This helps us catch issues with the input data as well as more subtle bugs that don't prevent the code from executing but do have unintended or unexpected impacts on the output data. The data validation requires a fully populated PUDL database and is quite different from the other tests.
Running tests with Tox¶
Tox installs the PUDL package in a fresh Python environment, ensuring that the tests only have access to packages which would be installed on a new user's computer. Tox's overall behavior is configured with the tox.ini file in the main repository directory. There are several different "test environments" defined to test different aspects of the software or to perform other actions like building the documentation. We'll go through some of the most common ones below.
Continuous Integration Tests¶
Our default tox test environment is ci – it includes all of the tests that will be run in continuous integration using a GitHub Action. You should run these tests before pushing code to the repository or making a pull request. Because it's the default test environment, it will be run if you call Tox without any arguments:
$ tox
This is equivalent to:
$ tox -e ci
If the PUDL package’s dependencies have been changed (in setup.py
) or you
recently ran the tests while on another branch of the repository with other
dependencies, you may need to tell Tox to recreate the software environment
it uses with the -r
flag. This behavior is turned on by default for the
ci
, full
, and validate
tests since they take a long time to run
and the extra time required to recreate the software environment is short by
comparison.
Note
You will need to register for an EIA API key to run the integration tests which are included as part of the ci tests. We use data from the EIA API to fill in missing monthly fuel costs in the marginal cost of electricity calculations. Once you have the API key, you'll need to store it in an environment variable named API_KEY_EIA within the shell where you are running the tests. You may want to add it to your .bashrc or .zshrc so that it's automatically available to PUDL in the future. There are many tutorials on how to manage environment variables online; here's one tutorial from Digital Ocean.
In addition to running the unit and integration tests, the CI test environment lints the code and documentation input files and uses Sphinx to build the documentation. It also generates a test coverage report. Running the full set of CI tests takes 20-25 minutes and requires a fair amount of data. If you don't already have that data downloaded, it will be downloaded automatically and put in your local datastore.
Note
Locally the tests will run using whatever version of Python is part of your pudl-dev conda environment, but we have our CI set up to test on both Python 3.8 and 3.9 in parallel.
Software Unit and Integration Tests¶
To run the unit or integration tests on their own, use the -e flag to choose those test environments explicitly:
$ tox -e unit
or:
$ tox -e integration
Full ETL Tests¶
As mentioned above, the CI tests process a single year of data. If you would like to more exhaustively test the ETL process without affecting your existing FERC 1 and PUDL databases, you can use the full test environment, which may take close to an hour to run:
$ tox -e full
This will process all years of data for the EIA and FERC datasets and all years of EPA CEMS data for a single state (Idaho). The ETL parameters for this test are defined in test/settings/full-integration-tests.yml.
Running Other Commands with Tox¶
You can run any of the individual test environments that tox -av lists on their own:
$ tox -av
default environments:
ci -> Run all continuous integration (CI) checks & generate test coverage.
additional environments:
flake8 -> Run the full suite of flake8 linters on the PUDL codebase.
pre_commit -> Run git pre-commit hooks not covered by the other linters.
bandit -> Check the PUDL codebase for common insecure code patterns.
linters -> Run the pre-commit, flake8, and bandit linters.
doc8 -> Check the documentation input files for syntactical correctness.
docs -> Remove old docs output and rebuild HTML from scratch with Sphinx
unit -> Run all the software unit tests.
ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
validate -> Run all data validation tests. This requires a complete PUDL DB.
ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
Note that not all of them literally run tests. For instance, to lint and build the documentation you can run:
$ tox -e docs
To run all of the code and documentation linters, but not run any of the other tests:
$ tox -e linters
Each of the test environments defined in tox.ini is just a collection of dependencies and commands. To see what they consist of, you can open the file in your text editor. Each section starts with [testenv:xxxxxx], and the commands entry lists the shell commands that the test environment will run.
Selecting Input Data for Integration Tests¶
The software integration tests need a year’s worth of input data to process. By default they will look in your local PUDL datastore to find it. If the data they need isn’t available locally, they will download it from Zenodo and put it in the local datastore.
However, if you’re editing code that affects how the datastore works, you
probably don’t want to risk contaminating your working datastore. You can
use a disposable temporary datastore instead by having Tox pass the
--tmp-data
flag in to pytest
like this:
$ tox -e integration -- --tmp-data
The floating -- isn't a typo; it tells Tox that you're done giving it command line arguments, and that any additional arguments it gets should be passed through to pytest. We've configured pytest (through the test/conftest.py configuration file) to be on the lookout for the --tmp-data flag and act accordingly.
See also
Development Setup for more on how to set up a PUDL workspace, including a datastore.
Working with the Datastore for more on how to work with the datastore.
Data Validation¶
Given the processed outputs of the PUDL ETL pipeline, we have a collection of tests that can be run to verify that the outputs look correct. We run all available data validations before each data release is archived on Zenodo. It is useful to run the data validation tests prior to making a pull request that makes changes to the ETL process or output functions to ensure that the outputs have not been unintentionally affected.
These data validation tests are organized into datasource specific modules under test/validate. Running the full data validation can take as much as an hour, depending on your computer. These tests require a fully populated PUDL database which contains all available FERC and EIA data, as specified by the src/pudl/package_data/settings/etl_full.yml input file. They are run against the "live" SQLite database in your PUDL workspace at sqlite/pudl.sqlite. To run the full data validation against an existing database:
$ tox -e validate
The data validation cases that pertain to the contents of the data tables are currently stored as part of the pudl.validate module. The expected number of records in each output table is stored in the validation test modules under test/validate as pytest parameterizations.
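A row-count validation of that kind might look roughly like the following sketch. The table names, expected counts, and the pudl_engine fixture are placeholders for illustration, not the actual parameterizations used in test/validate:

import pandas as pd
import pytest

# Hypothetical expected row counts; the real values live in the validation modules.
EXPECTED_ROWS = {
    "utilities_pudl": 10_000,
    "plants_pudl": 20_000,
}

@pytest.mark.parametrize("table,expected", sorted(EXPECTED_ROWS.items()))
def test_row_counts(pudl_engine, table, expected):
    # pudl_engine is assumed to be a fixture providing a DB connection.
    n_rows = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", pudl_engine)["n"].iloc[0]
    assert n_rows == expected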
Data Validation Notebooks¶
We have a collection of Jupyter Notebooks that run the same functions as the data validation. The notebooks also produce some visualizations of the data to make it easier to understand what's wrong when validation fails. These notebooks are stored in test/notebooks. Like the data validations, the notebooks will only run successfully when there's a full PUDL SQLite database available in your PUDL workspace.
Running pytest Directly¶
Running tests directly with pytest gives you the ability to run only tests from a particular test module or even a single individual test case. It's also faster because there's no testing environment to set up. Instead, it just uses your Python environment, which should be the pudl-dev conda environment discussed in Development Setup. This is convenient if you're debugging something specific or developing new test cases, but it's not as robust as using Tox.
Running specific tests¶
To run the software unit tests with pytest directly (the same set of tests that would be run by tox -e unit):
$ pytest test/unit
To run only the unit tests for the Excel spreadsheet extraction module:
$ pytest test/unit/extract/excel_test.py
To run only the unit tests defined by a single test class within that module:
$ pytest test/unit/extract/excel_test.py::TestGenericExtractor
Custom PUDL pytest flags¶
We have defined several custom flags to control pytest’s behavior when running the PUDL tests. They are mostly intended for use internally to specify the behavior we want in the high level Tox test environments.
You can always check to see what custom flags exist by running pytest --help and looking at the custom options section:
custom options:
--live-dbs Use existing PUDL/FERC1 DBs instead of creating temporary ones.
--tmp-data Download fresh input data for use with this test run only.
--etl-settings=ETL_SETTINGS
Path to a non-standard ETL settings file to use.
--gcs-cache-path=GCS_CACHE_PATH
If set, use this GCS path as a datastore cache layer.
--sandbox Use raw inputs from the Zenodo sandbox server.
The main flexibility that these custom options provide is in selecting where the raw input data comes from and what data the tests should be run against. Being able to specify the tests to run and the data to run them against independently simplifies the test suite and keeps the data and tests very clearly separated.
The --live-dbs option lets you use your existing FERC 1 and PUDL databases instead of building a new database at all. This can be useful if you want to test code that only operates on an existing database and has nothing to do with the construction of that database. For example, the output routines:
$ pytest --live-dbs test/integration/fast_output_test.py
We also use this option to run the data validations.
Assuming you do want to run the ETL and build new databases as part of the test you're running, the contents of that database are determined by an ETL settings file. By default, the settings file that's used is test/settings/integration-test.yml, but it's also possible to use a different input file, generating a different database, and then run some tests against that database.
For example, we test that FERC 1 data can be loaded into a PUDL database all by itself by running the ETL tests with a settings file that includes only a couple of FERC 1 tables for a single year. This is the ferc1_solo Tox test environment:
$ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py
Similarly, we use the test/settings/full-integration-test.yml settings file to specify an exhaustive collection of input data, and then we run a test that checks that the database schemas extracted from all historical FERC 1 databases are compatible with each other. This is the ferc1_schema test:
$ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema
The raw input data that all the tests use ultimately comes from our archives on Zenodo. However, you can optionally tell the tests to look in different places for more rapidly accessible caches of that data, or to force the download of a fresh copy (especially useful when you are testing the datastore functionality specifically). By default, the tests will use the datastore that's part of your local PUDL workspace.
For example, to run the ETL portion of the integration tests and download fresh input data to a temporary datastore that’s later deleted automatically:
$ pytest --tmp-data test/integration/etl_test.py
Building the Documentation¶
We use Sphinx and Read The Docs to semi-automatically build and host our documentation.
Sphinx is tightly integrated with the Python programming language and needs to be able to import and parse the source code to do its job. Thus, it also needs to be able to create an appropriate Python environment. This process is controlled by docs/conf.py.
If you are editing the documentation and need to regenerate the outputs as you go to see your changes reflected locally, the most reliable option is to use Tox. Tox will remove the previously generated outputs and regenerate everything from scratch:
$ tox -e docs
If you’re just working on a single page and don’t care about the entire set of documents being regenerated and linked together, you can call Sphinx directly:
$ sphinx-build -b html docs docs/_build/html
This will only update any files that have been changed since the last time the documentation was generated.
To view the HTML documentation that's been output, you'll need to open the docs/_build/html/index.html file within the PUDL repository with a web browser. You may also be able to set up automatic previewing of the rendered documentation in your text editor with appropriate plugins.
Note
Some of the documentation files are dynamically generated. We use the sphinx-apidoc utility to generate RST files from the docstrings embedded in our source code, so you should never edit the files under docs/api. If you create a new module, the corresponding documentation file will also need to be checked in to version control. Similarly, the PUDL Data Dictionary is generated dynamically by the pudl.convert.datapkg_to_rst script that gets run by Tox when it builds the docs.
Working with the Datastore¶
The input data that PUDL processes comes from a variety of US government agencies. However, these agencies typically make the data available on their websites or via FTP without planning for programmatic access. To ensure reproducible, programmatic access, we periodically archive the input files on the Zenodo research archiving service maintained by CERN. (See our pudl-scrapers and pudl-zenodo-storage repositories on GitHub for more information.)
When PUDL needs a data resource, it will attempt to automatically retrieve it from Zenodo and store it locally in a file hierarchy organized by dataset and the versioned DOI of the corresponding Zenodo deposition.
The pudl_datastore script can also be used to pre-download the raw input data in bulk. It uses the routines defined in the pudl.workspace.datastore module. For details on what data is available, for what time periods, and how much of it there is, see the PUDL Data Sources. At present the pudl_datastore script downloads the entire collection of data available for each dataset. For the FERC Form 1 and EPA CEMS datasets, this is several gigabytes.
For example, to download the full EIA Form 860 dataset (covering 2001-present) you would use:
$ pudl_datastore --dataset eia860
For more detailed usage information, see:
$ pudl_datastore --help
The downloaded data will be used by the script to populate a datastore under the data directory in your workspace, organized by data source, form, and date:
data/censusdp1tract/
data/eia860/
data/eia861/
data/eia923/
data/epacems/
data/ferc1/
data/ferc714/
If the download fails to complete successfully, the script can be run repeatedly until all the files are downloaded. It will not try and re-download data which is already present locally.
Adding a new Dataset to the Datastore¶
There are three components necessary to prepare a new dataset for use with the PUDL datastore:
Create a pudl-scraper to download the raw data.
Use pudl-zenodo-storage to upload the data to Zenodo.
Prepare the datastore to retrieve the data from Zenodo.
In the event that data is already available on Zenodo in the appropriate format, it may be possible to skip steps 1 and 2.
Create a scraper¶
Where possible, we use Scrapy to handle data collection. Our scrapy spiders, as well as any custom scripts, are located in our scrapers repo. Familiarize yourself with scrapy, and note the following.
From a scraper, a correct output directory takes the form:
pudl_scrapers.helpers.new_output_dir(self.settings["OUTPUT_DIR"] / "dataset_name")
The pudl_scrapers.settings and pudl_scrapers.helpers modules can be imported outside the context of a Scrapy scraper to achieve the same effect as needed.
To take advantage of the existing file saving pipeline, create a custom item in the items.py collection. Make sure that it inherits from the existing DataFile class, and ensure that your spider yields the new item. See items.py for examples.
If you follow those guidelines, your new scraper should play well with the rest of the environment.
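As a sketch of how those pieces fit together, a new spider might look like the following. The item class, spider name, URL, and item field names are all hypothetical placeholders; only the DataFile inheritance and the new_output_dir() pattern come from the guidelines above:

import scrapy
from pudl_scrapers import helpers, items

class NewDatasetFile(items.DataFile):
    """Hypothetical item for the new dataset; inherits the DataFile pipeline."""

class NewDatasetSpider(scrapy.Spider):
    name = "newdataset"
    start_urls = ["https://example.gov/newdataset/archive.zip"]  # placeholder

    def parse(self, response):
        # Save scraped files into the standard per-run output directory.
        output_dir = helpers.new_output_dir(
            self.settings["OUTPUT_DIR"] / "newdataset"
        )
        # Field names here are placeholders; see items.py for the real ones.
        yield NewDatasetFile(data=response.body, save_path=output_dir)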
Prepare zenodo_store¶
Our zenodo_store script initializes and updates data sources that we maintain on Zenodo. It prepares Frictionless Datapackages from scraped files and uploads them to the appropriate Zenodo archive.
To add a new archive to our Zenodo storage collection:
Update zs.metadata with a UUID and metadata for the new Zenodo archive. These details will be used by Zenodo to identify and describe the archive on the website. The UUID is used to uniquely distinguish the archive prior to the creation of a DOI.
Prepare a new library to handle the frictionless datapackage descriptor of the archive (a sketch follows below):
The library name should take the form frictionless.DATASET_raw.
The library must contain frictionless metadata describing the archive.
The library must contain a datapackager(dfiles) function that:
receives a list of zenodo file descriptors;
converts each to an appropriate frictionless datapackage resource descriptor. Important: the resource descriptor must include an additional descriptor["remote_url"] that contains the zenodo url to download its resource. This will be the same as the descriptor["path"] at this stage. If there are criteria by which you wish to be able to discover or filter specific resources, descriptor["parts"][...] should be used to denote those details. For example, descriptor["parts"]["year"] = 2018 would be appropriate to allow filtering by year;
combines the resource descriptors and frictionless metadata to produce the complete datapackage descriptor as a python dict.
In the bin/zenodo_store.py script:
Import the new frictionless library.
Add the new source to the archive_selection function; follow the format of the existing selectors.
Add the new source name to the help text for the deposition argument in parse_main().
The above steps should be sufficient to allow automatic initialization and updates of the new data source on Zenodo.
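To illustrate the shape of such a library, here is a hypothetical frictionless.newdataset_raw module. All names, metadata values, and zenodo descriptor keys are placeholders sketching the contract described above:

# Frictionless datapackage metadata describing the archive (placeholder values).
NEWDATASET_METADATA = {
    "name": "pudl-newdataset-raw",
    "title": "Raw New Dataset archive",
}

def datapackager(dfiles):
    """Combine resource descriptors and metadata into a datapackage descriptor."""
    resources = []
    for dfile in dfiles:  # each dfile is a zenodo file descriptor (placeholder keys)
        url = dfile["links"]["download"]
        resources.append({
            "name": dfile["filename"],
            "path": url,
            "remote_url": url,  # same as path at this stage
            "hash": dfile["checksum"],
            "parts": {"year": 2018},  # enables filtering, e.g. by year
        })
    return {**NEWDATASET_METADATA, "resources": resources}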
You initialize an archive (preferably starting with the sandbox) by running:
$ zenodo_store.py --initialize --verbose --sandbox
If successful, the DOI and url for your archive will be printed. You will need to visit the url to review and publish the Zenodo archive before it can be used.
If you lose track of the DOI, you can look up the archive on Zenodo using the UUID from zs.metadata.
Prepare the Datastore¶
If you have used a scraper and zenodo_store to prepare a Zenodo archive as above, you can add support for your archive to the datastore by adding the DOI to pudl.workspace.datastore.DOI, under “sandbox” or “production” as appropriate.
If you want to prepare an archive for the datastore separately, the following are required.
The root path must contain a datapackage.json file that conforms to the frictionless datapackage spec.
Each listed resource among the datapackage.json resources must include:
path containing the zenodo download url for the specific file
remote_url with the same url as the path
name of the file
hash with the md5 hash of the file
parts, a set of key / value pairs defining additional attributes that can be used to select a subset of the whole datapackage. For example, the epacems dataset is partitioned by year and state, and "parts": {"year": 2010, "state": "ca"} would indicate that the resource contains data for the state of California in the year 2010. Unpartitioned datasets like ferc714, which includes all years in a single file, would have an empty "parts": {}.
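A quick way to sanity check an independently prepared archive against those requirements is a small script like this sketch (the descriptor path is illustrative):

import json
from pathlib import Path

REQUIRED = {"path", "remote_url", "name", "hash", "parts"}

descriptor = json.loads(Path("datapackage.json").read_text())
for resource in descriptor["resources"]:
    missing = REQUIRED - set(resource)
    if missing:
        print(f"{resource.get('name', '<unnamed>')} is missing: {missing}")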
Cloning the FERC Form 1 DB¶
FERC Form 1 is… special.
The FERC Form 1 is published in a particularly inaccessible format (proprietary binary FoxPro database files), and the data itself is unclean and poorly organized. As a result, very few people are currently able to use it. This means that, while we have not yet integrated the vast majority of the available data into PUDL, it’s useful to just provide programmatic access to the bulk raw data, independent of the cleaner subset of the data included within PUDL.
To provide that access, we've broken the pudl.extract.ferc1 process down into two distinct steps:
Clone the entire FERC Form 1 database from FoxPro into a local file-based sqlite3 database. This includes 116 distinct tables, with thousands of fields, covering the time period from 1994 to the present.
Pull a subset of the data out of that database for further processing and integration into the PUDL data packages and sqlite3 database.
If you want direct access to the original FERC Form 1 database, you can just do the database cloning and connect directly to the resulting database. This has become especially useful since Microsoft recently discontinued the database driver that until late 2018 had allowed users to load the FoxPro database files into Microsoft Access.
In any case, cloning the original FERC database is the first step in the PUDL ETL process. This can be done with the ferc1_to_sqlite script (which is an entrypoint into the pudl.convert.ferc1_to_sqlite module), which is installed as part of the PUDL Python package. It takes its instructions from a YAML file, an example of which is included in the settings directory in your PUDL workspace. Once you've created a datastore, you can try this example:
$ ferc1_to_sqlite settings/etl_full.yml
This should create an SQLite database that you can find in your workspace at sqlite/ferc1.sqlite. By default, the script pulls in all available years of data and all but 3 of the 100+ database tables. The excluded tables (f1_footnote_tbl, f1_footnote_data and f1_note_fin_stmnt) contain unreadable binary data and increase the overall size of the database by a factor of ~10 (to ~8 GB rather than 800 MB). If for some reason you need access to those tables, you can create your own settings file and un-comment those tables in the list of tables that it directs the script to load.
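Because the result is a plain SQLite database, you can explore the cloned tables directly with the Python standard library:

import sqlite3

conn = sqlite3.connect("sqlite/ferc1.sqlite")  # path within your PUDL workspace
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(f"{len(tables)} tables cloned, including:", tables[:5])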
Note
This script pulls all of the FERC Form 1 data into a single database, but FERC distributes a separate database for each year. Virtually all the database tables contain a report_year column that indicates which year they came from, preventing collisions between records in the merged multi-year database. One notable exception is the f1_respondent_id table, which maps respondent_id to the names of the respondents. For that table, we have allowed the most recently reported record to take precedence, overwriting previous mappings if they exist.
Note
There are a handful of respondent_id values that appear in the FERC Form 1 database tables but do not show up in f1_respondent_id. This renders the foreign key relationships between those tables invalid. During the database cloning process we add these respondent_id values to the f1_respondent_id table with a respondent_name indicating that the ID was filled in by PUDL.
Naming Conventions¶
In the PUDL codebase, we aspire to follow the naming and other conventions detailed in PEP 8.
Admittedly we have a lot of… named things in here, and we haven’t been perfect about following conventions everywhere. We’re trying to clean things up as we come across them again in maintaining the code.
Imperative verbs (e.g. connect) should precede the object being acted upon (e.g. connect_db), unless the function returns a simple value (e.g. datadir).
No duplication of information (e.g. form names).
lowercase, with underscores separating words (i.e. snake_case).
Semi-private helper functions (functions used within a single module only and not exposed via the public API) should be preceded by an underscore.
When the object is a table, use the full table name (e.g. ingest_fuel_ferc1).
When dataframe outputs are built from multiple tables, identify the type of information being pulled (e.g. "plants") and the source of the tables (e.g. eia or ferc1). When outputs are built from a single table, simply use the table name (e.g. boiler_fuel_eia923).
Glossary of Abbreviations¶
General Abbreviations¶
Abbreviation | Definition
---|---
abbr | abbreviation
assn | association
avg | average (mean)
bbl | barrel (quantity of liquid fuel)
capex | capital expense
corr | correlation
db | database
df & dfs | dataframe & dataframes
dir | directory
expns | expenses
equip | equipment
info | information
mcf | thousand cubic feet (volume of gas)
mmbtu | million British Thermal Units
mw | Megawatt
mwh | Megawatt Hours
num | number
opex | operating expense
pct | percent
ppm | parts per million
ppb | parts per billion
q | (fiscal) quarter
qty | quantity
util & utils | utility & utilities
us | United States
usd | US Dollars
Data Source Specific Abbreviations¶
Abbreviation | Definition
---|---
frc | Fuel Receipts and Costs (EIA Form 923)
gen | Generation (EIA Form 923)
gf | Generation Fuel (EIA Form 923)
gens | Generators (EIA Form 923)
utils | Utilities (EIA Form 860)
own | Ownership (EIA Form 860)
Data Extraction Functions¶
The lower level namespace uses an imperative verb to identify the action the function performs, followed by the object of extraction (e.g. get_eia860_file). The upper level namespace identifies the dataset where extraction is occurring.
Output Functions¶
When dataframe outputs are built from multiple tables, identify the type of information being pulled (e.g. plants) and the source of the tables (e.g. eia or ferc1). When outputs are built from a single table, simply use the table name (e.g. boiler_fuel_eia923).
Table Names¶
See this article on database naming conventions.
Table names in snake_case.
The data source should follow the thing it applies to, e.g. plant_id_ferc1.
Columns and Field Names¶
total should come at the beginning of the name (e.g. total_expns_production).
Identifiers should be structured type + _id_ + source, where source is the agency or organization that has assigned the ID (e.g. plant_id_eia).
The data source or label (e.g. plant_id_pudl) should follow the thing it is describing.
Units should be appended to field names where applicable (e.g. net_generation_mwh). This includes "per unit" signifiers (e.g. _pct for percent, _ppm for parts per million, or a generic _per_unit when the type of unit varies, as in columns containing a heterogeneous collection of fuels).
Financial values are assumed to be in nominal US dollars.
_id indicates the field contains a (usually numerical) reference to another table, which will not be intelligible without looking up the value in that other table.
The suffix _code indicates the field contains a short abbreviation from a well defined list of values, which probably needs to be looked up if you want to understand what it means.
The suffix _type (e.g. fuel_type) indicates a human readable category from a well defined list of values. Whenever possible we try to use these longer descriptive names rather than codes.
_name indicates a longer human readable name that is likely not well categorized into a small set of acceptable values.
_date indicates the field contains a Date object.
_datetime indicates the field contains a full Datetime object.
_year indicates the field contains a 4-digit integer year.
capacity refers to nameplate capacity (e.g. capacity_mw); other specific types of capacity are annotated.
Regardless of what label utilities are given in the original data source (e.g. operator in EIA or respondent in FERC), we refer to them as utilities in PUDL.
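To make these patterns concrete, here is a small illustrative set of column names that follow the conventions above (assembled for illustration, not drawn from any single PUDL table):

# Column names illustrating the field naming conventions above.
example_columns = [
    "plant_id_eia",            # type + _id_ + source of the ID
    "total_expns_production",  # "total" comes at the beginning
    "net_generation_mwh",      # units appended to the field name
    "fuel_type_code_pudl",     # _code: short abbreviation from a fixed list
    "capacity_mw",             # nameplate capacity, in megawatts
    "report_year",             # a 4-digit integer year
]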
Data and ETL Design Guidelines¶
Here we list some technical norms and expectations that we strive to adhere to and hope that contributors can also follow.
We’re all learning as we go – if you have suggestions for best practices we might want to adopt, let us know!
Input vs. Output Data¶
It’s important to differentiate between the original data we’re attempting to provide easy access to and analyses or data products that are derived from that original data. The original data is meant to be archived and re-used as an alternative to other users re-processing the raw data from various public agencies. For the sake of reproducibility, it’s important that we archive the inputs alongside the outputs, since the reporting agencies often go back and update the data they have published without warning and without version control.
Minimize Data Alteration¶
We are trying to provide a uniform, easy-to-use interface to existing public data. We want to provide access to the original data, insofar as that is possible, while still having it be uniform and easy-to-use. Some alteration is unavoidable and other changes make the data much more usable, but these should be made with care and documentation.
Make sure data is available at its full, original resolution. Don’t aggregate the data unnecessarily when it is brought into PUDL. However, creating tools to aggregate it in derived data products is very useful.
Todo
Need fuller enumeration of data alteration / preservation principles.
Examples of Acceptable Changes¶
Converting all power plant capacities to MW, or all generation to MWh.
Assigning uniform NA values.
Standardizing datetime types.
Re-naming columns to be the same across years and datasets.
Assigning simple fuel type codes when the original data source uses free-form strings that are not programmatically usable.
Examples of Unacceptable Changes¶
Applying an inflation adjustment to a financial variable like fuel cost. There are a variety of possible inflation indices users might want to use, so that transformation should be applied in the output layer that sits on top of the original data.
Aggregating data that has date/time information associated with it into a time series when the individual records do not pertain to unique timesteps. For example, the EIA 923 Fuel Receipts and Costs table lists fuel deliveries by month, but each plant might receive several deliveries from the same supplier of the same fuel type in a month – the individual delivery information should be retained.
Computing heat rates for generators in an original table that contains both fuel heat content and net electricity generation. The heat rate would be a derived value and not part of the original data.
Make Tidy Data¶
The best practices in data organization go by different names in data science, statistics, and database design, but they all try to minimize data duplication and ensure an easy to transform uniform structure that can be used for a wide variety of purposes – at least in the source data (i.e. database tables or the published data packages).
Each column in a table represents a single, homogeneous variable.
Each row in a table represents a single observation – i.e. all of the variables reported in that row pertain to the same case/instance of something.
Don’t store the same value in more than one place – each piece of data should have an authoritative source.
Don’t store derived values in the archived data sources.
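As a sketch of what tidying looks like in practice, here is a made-up "wide" table with one column per year being melted into one row per plant-year (the data and column names are purely illustrative):

import pandas as pd

# A made-up "wide" table: one column per year violates the
# one-variable-per-column rule.
wide = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "net_generation_mwh_2018": [1000.0, 2000.0],
    "net_generation_mwh_2019": [1100.0, 1900.0],
})

# Tidy form: each row is a single plant-year observation.
tidy = wide.melt(
    id_vars="plant_id_eia",
    var_name="report_year",
    value_name="net_generation_mwh",
)
tidy["report_year"] = tidy["report_year"].str[-4:].astype(int)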
Reading on Tidy Data¶
Tidy Data A paper on the benefits of organizing data into single variable, homogeneously typed columns, and complete single observation records. Oriented toward the R programming language, but the ideas apply universally to organizing data. (Hadley Wickham, The Journal of Statistical Software, 2014)
Good enough practices in scientific computing A whitepaper from the organizers of Software and Data Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others! (Greg Wilson et al., PLOS Computational Biology, 2017)
Best practices for scientific computing An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users. (Greg Wilson et al., PLOS Biology, 2014)
A Simple Guide to Five Normal Forms A classic 1983 rundown of database normalization. Concise, informal, and understandable, with a few good illustrative examples. Bonus points for the ASCII art.
Use Simple Data Types¶
The Frictionless Data TableSchema standard includes a modest selection of data types that are meant to be very widely usable in other contexts. Make sure that whatever data type you’re using is included within that specification, but also be as specific as possible within that collection of options.
This is one aspect of a broader “least common denominator” strategy that is common within the open data community. This strategy is also behind our decision to distribute the processed data as CSV files (with metadata stored as JSON).
Use Consistent Units¶
Different data sources often use different units to describe the same type of quantities. Rather than force users to do endless conversions while using the data, we try to convert similar quantities into the same units during ETL. For example, we typically convert all electrical generation to MWh, plant capacities to MW, and heat content to MMBTUs (though, MMBTUs are awful: seriously, M=1000 because Roman numerals? So MM is a million, despite the fact that M/Mega means a million in SI. And a BTU is… the amount of energy required to raise the temperature of one avoirdupois pound of water by 1 degree Fahrenheit?! What century even is this?).
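The conversions themselves are trivial; the value is in applying them uniformly during ETL so users never have to think about it. A sketch, with an input column name assumed for illustration:

import pandas as pd

# Standardize generation reported in kWh to MWh during ETL.
df = pd.DataFrame({"net_generation_kwh": [1.5e6, 2.0e6]})
df["net_generation_mwh"] = df["net_generation_kwh"] / 1000.0
df = df.drop(columns="net_generation_kwh")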
Silo the ETL Process¶
It should be possible to run the ETL process on each data source independently and with any combination of data sources included. This allows users to include only the data they need. In some cases, like the EIA 860 and EIA 923 data, two data sources may be so intertwined that keeping them separate doesn’t really make sense. This should be the exception, however, not the rule.
Separate Data from Glue¶
The glue that relates different data sources to each other should be applied after or alongside the ETL process and not as a mandatory part of ETL. This makes it easy to pull individual data sources in and work with them even when the glue isn’t working or doesn’t yet exist.
Partition Big Data¶
Our goal is for users to be able to run the ETL process on a decent laptop. However, some of the utility datasets are hundreds of gigabytes in size (e.g. EPA CEMS Hourly, FERC EQR). Many users will not need to use the entire dataset for the work they are doing. Partitioning the data allows them to pull in only certain years, certain states, or other sensible partitions of the data so that they don’t run out of memory or disk space or have to wait hours while data they don’t need is being processed.
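For example, with the EPA CEMS data stored as Parquet files partitioned by year and state, a user can read just the slice they need. A sketch, assuming a local directory of partitioned Parquet files (the path and partition column names are illustrative):

import pandas as pd

# Read only 2019 Colorado records from a year/state-partitioned dataset,
# rather than loading hundreds of GB of hourly data into memory.
cems_co_2019 = pd.read_parquet(
    "parquet/epacems",
    filters=[("year", "=", 2019), ("state", "=", "CO")],
)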
Naming Conventions¶
There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
Use Consistent Names¶
If two columns in different tables record the same quantity in the same units, give them the same name. That way, if they end up in the same dataframe for comparison, it’s easy to automatically rename them with suffixes indicating where they came from. For example, net electricity generation is reported to both FERC Form 1 and EIA 923, so we’ve named the columns net_generation_mwh in each of those data sources. Similarly, give non-comparable quantities reported in different data sources different column names. This helps make it clear that the quantities are actually different.
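For instance, because both sources use net_generation_mwh, pandas can disambiguate the columns automatically when the two tables meet. A sketch with made-up values:

import pandas as pd

ferc1 = pd.DataFrame({"plant_id_pudl": [1], "net_generation_mwh": [1000.0]})
eia923 = pd.DataFrame({"plant_id_pudl": [1], "net_generation_mwh": [990.0]})

# Shared names get source suffixes on merge, making comparison easy:
compare = ferc1.merge(eia923, on="plant_id_pudl", suffixes=("_ferc1", "_eia923"))
# -> plant_id_pudl, net_generation_mwh_ferc1, net_generation_mwh_eia923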
Follow Existing Conventions¶
We are trying to use consistent naming conventions for the data tables, columns, data sources, and functions. Generally speaking, PUDL is a collection of subpackages organized by purpose (extract, transform, load, analysis, output, datastore…), containing a module for each data source. Each data source has a short name that is used everywhere throughout the project, composed of the reporting agency and the form number or another identifying abbreviation: ferc1, epacems, eia923, eia861, etc. See the naming conventions document for more details.
Complete, Continuous Time Series¶
Most of the data in PUDL are time series, ranging from hourly to annual in resolution.
Assume and provide contiguous time series. Otherwise there are just too many possible combinations of cases to deal with. E.g. don’t expect things to work if you pull in data from 2009-2010, and then also from 2016-2018, but not 2011-2015.
Assume and provide complete time series. In data that is indexed by date or time, ensure that it is available as a complete time series even if some values are missing (and thus NA). Many time series analyses only work when all the timesteps are present.
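In pandas terms, this means reindexing against a complete date range, so that missing timesteps show up explicitly as NA rather than being silently absent. A minimal sketch with made-up annual data:

import pandas as pd

# Annual values with 2017 missing entirely:
s = pd.Series([1.0, 3.0], index=pd.to_datetime(["2016-01-01", "2018-01-01"]))

# Reindex to a complete annual series; 2017 appears explicitly as NaN:
full = s.reindex(pd.date_range("2016-01-01", "2018-01-01", freq="AS"))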
Packaging and Dependencies¶
In order to distribute a ready-to-use package to others via the Python Package Index and conda-forge, we need to encapsulate it with some metadata and define its dependencies. When we first packaged up PUDL, Python packaging systems were a bit of a mess. Changes to the Python packaging & build system implemented as a result of PEP 517 and PEP 518 have improved the available options, and we should look at using a simpler, more modern setup. The online Python Packages book is a great guide to current best (or at least better) practices.
setup.py¶
The setup.py script in the top level of the repository coordinates the packaging process using setuptools, the de facto standard Python packaging library. setup.py is really just a single function call to setuptools.setup(), and the parameters of that function are metadata related to the Python package. Most of them are relatively self explanatory – like the name of the package, the license it’s being released under, search keywords, etc. – but a few are more arcane:
use_scm_version: Instead of having a hard-coded version that’s stored in the repository somewhere, handed off to the packaging script, and often out of date, pull the version from the source code management (SCM) system, in our case git (and GitHub). To make a release, we will first need to tag a particular revision in git with a version like v0.1.0.
python_requires='>=3.8': Specifies what versions of Python the package is expected to run on. In this case, it’s anything greater than or equal to 3.8.
setup_requires=['setuptools_scm']: What other packages need to be installed in order for the packaging script to run? Because we are obtaining the package version from our SCM (git/GitHub), we need the special package that lets us do that magic: setuptools_scm. This automatically generated version number can then be accessed in the package metadata, as is done in our top-level __init__.py file: __version__ = pkg_resources.get_distribution(__name__).version. This is admittedly convoluted.
install_requires: lists all the other packages that need to be installed before pudl can be installed. These are our package dependencies. This list plays a role similar to the environment.yml file in the main pudl repository, but it depends on pip, not conda – in the packaging system we do not have access to conda. It turns out this makes our lives difficult because of the kinds of Python packages we depend on. More on this below.
extras_require: a dictionary describing optional packages that can be conditionally installed depending on the expected usage of the install. For now, this is mostly used in conjunction with Tox to ensure that the required documentation and testing packages are installed alongside PUDL in the virtual environment.
packages=find_packages('src'): The packages parameter takes a list of all the Python packages to be included in the distribution being packaged. The setuptools.find_packages function automatically searches whatever directories it is given for packages and all of their subpackages. All of the code we want to distribute to users lives under the src directory.
package_dir={'': 'src'}: This tells the packaging to treat any modules or packages found in the src directory as part of the root package of the distribution. This is a vestigial parameter that pertains to distutils, the predecessor to setuptools… but the system still depends on it deep down inside. In our case, we don’t have any modules that aren’t part of any package – everything is within pudl.
include_package_data=True: This tells the packaging system to include any non-Python files that it finds in the directories it has been told to package. In our case, this is all the stuff inside package_data, including example settings files, metadata, glue, etc.
entry_points: This parameter tells the packaging what executable scripts should be installed on the user’s system, and which module:function pairs implement those scripts.
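Putting those parameters together, a stripped-down sketch of the kind of setup() call described above might look like the following. The dependency lists and the entry point target are abbreviated or assumed for illustration; the actual setup.py in the repository is authoritative.

from setuptools import find_packages, setup

setup(
    name="catalystcoop.pudl",
    use_scm_version=True,               # version comes from the git tag
    python_requires=">=3.8",
    setup_requires=["setuptools_scm"],  # needed to read the version from SCM
    install_requires=["pandas", "sqlalchemy"],        # abbreviated
    extras_require={"doc": ["sphinx"], "test": ["tox"]},  # illustrative
    packages=find_packages("src"),      # find pudl and its subpackages
    package_dir={"": "src"},            # the code lives under src/
    include_package_data=True,          # ship non-Python package data too
    entry_points={
        # module:function target is an assumption for illustration
        "console_scripts": ["ferc1_to_sqlite = pudl.convert.ferc1_to_sqlite:main"],
    },
)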
MANIFEST.in¶
In addition to generating a version number automatically based on our git repository, setuptools_scm pulls every single file tracked by the repository, and every other random file sitting in the working repository directory, into the distribution. This is… not what we want. MANIFEST.in allows us to specify in more detail which files should be included and excluded. Mostly, we are just including the Python package and supporting data that exist under the src/pudl directory.
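A sketch of the sort of rules such a MANIFEST.in might contain (the actual file in the repository is authoritative):

# Keep the package source and its data; drop caches and compiled files.
recursive-include src/pudl *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]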
The MIT License¶
Copyright 2017-2019 Catalyst Cooperative and the Climate Policy Initiative
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Catalyst Cooperative Code of Conduct¶
Our Pledge¶
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
Our Standards¶
Examples of behavior that contributes to creating a positive environment include:
Using welcoming and inclusive language
Being respectful of differing viewpoints and experiences
Gracefully accepting constructive criticism
Focusing on what is best for the community
Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
The use of sexualized language or imagery and unwelcome sexual attention or advances
Trolling, insulting/derogatory comments, and personal or political attacks
Public or private harassment
Publishing others’ private information, such as a physical or electronic address, without explicit permission
Other conduct which could reasonably be considered inappropriate in a professional setting
Our Responsibilities¶
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
Scope¶
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
Enforcement¶
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at pudl@catalyst.coop. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.
Attribution¶
This Code of Conduct is adapted from the Contributor Covenant version 1.4, available at http://contributor-covenant.org/version/1/4/
PUDL Release Notes¶
0.4.0 (2021-08-16)¶
This is a ridiculously large update including more than a year and a half’s worth of work.
New Data Coverage¶
EIA Form 860 for 2004-2008 + 2019, plus eia860m through 2020.
EIA Form 923 for 2001-2008 + 2019
EPA CEMS Hourly for 2019-2020
FERC Form 1 for 2019
US Census Demographic Profile (DP1) for 2010
FERC Form 714 for 2006-2019 (experimental)
EIA Form 861 for 2001-2019 (experimental)
Documentation & Data Accessibility¶
We’ve updated and (hopefully) clarified the documentation, and no longer expect most users to perform the data processing on their own. Instead, we are offering several methods of directly accessing already processed data:
Processed data archives on Zenodo that include a Docker container preserving the required software environment for working with the data.
A JupyterHub instance hosted in collaboration with 2i2c
Browsable database access via Datasette at https://data.catalyst.coop
Users who still want to run the ETL themselves will need to set up the PUDL development environment.
Data Cleaning & Integration¶
We now inject placeholder utilities in the cloned FERC Form 1 database when respondent IDs appear in the data tables, but not in the respondent table. This addresses a bunch of unsatisfied foreign key constraints in the original databases published by FERC.
We’re doing much more software testing and data validation, and so hopefully we’re catching more issues early on.
Hourly Electricity Demand and Historical Utility Territories¶
With support from GridLab and in collaboration with researchers at Berkeley’s Center for Environmental Public Policy, we did a bunch of work on spatially attributing hourly historical electricity demand. This work was largely done by @ezwelty and @yashkumar1803 and included:
Semi-programmatic compilation of historical utility and balancing authority service territory geometries based on the counties associated with utilities, and the utilities associated with balancing authorities in the EIA 861 (2001-2019). See e.g. #670 but also many others.
A method for spatially allocating hourly electricity demand from FERC 714 to US states based on the overlapping historical utility service territories described above. See #741
A fast timeseries outlier detection routine for cleaning up the FERC 714 hourly data using correlations between the time series reported by all of the different entities. See #871
Net Generation and Fuel Consumption for All Generators¶
We have developed an experimental methodology to produce net generation and fuel consumption for all generators. The process has known issues and is being actively developed. See #989
Net electricity generation and fuel consumption are reported in multiple ways in the EIA 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.
The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.
The pudl.analysis.allocate_net_gen
module estimates the net electricity
generation and fuel consumption attributable to individual generators based on
the more expansive reporting of the data in the generation_fuel_eia923
table.
Data Management and Archiving¶
We now use a series of web scrapers to collect snapshots of the raw input data that is processed by PUDL. These original data are archived as Frictionless Data Packages on Zenodo, so that they can be accessed reproducibly and programmatically via a REST API. This addresses the problems we were having with the v0.3.x releases, in which the original data on the agency websites was liable to be modified long after its “final” release, rendering it incompatible with our software. These scrapers and the Zenodo archiving scripts can be found in our pudl-scrapers and pudl-zenodo-storage repositories. The archives themselves can be found within the Catalyst Cooperative community on Zenodo.
There’s an experimental caching system that allows these Zenodo archives to work as long-term “cold storage” for citation and reproducibility, with cloud object storage acting as a much faster way to access the same data for day-to-day non-local use, implemented by @rousik.
We’ve decided to shift to producing a combination of relational databases (SQLite files) and columnar data stores (Apache Parquet files) as the primary outputs of PUDL. Tabular Data Packages didn’t end up serving either database or spreadsheet users very well. The CSV file were often too large to access via spreadsheets, and users missed out on the relationships between data tables. Needing to separately load the data packages into SQLite and Parquet was a hassle and generated a lot of overly complicated and fragile code.
Known Issues¶
The EIA 861 and FERC 714 data are not yet integrated into the SQLite database outputs, because we need to overhaul our entity resolution process to accommodate them in the database structure. That work is ongoing, see #639
The EIA 860 and EIA 923 data don’t cover exactly the same range of years. EIA 860 only goes back to 2004, while EIA 923 goes back to 2001. This is because the pre-2004 EIA 860 data is stored in the DBF file format, and we need to update our extraction code to deal with the different format. This means some analyses that require both EIA 860 and EIA 923 data (like the calculation of heat rates) can only be performed as far back as 2004 at the moment. See #848
There are 387 EIA utilities and 228 EIA plants which appear in the EIA 923, but which haven’t yet been assigned PUDL IDs and associated with the corresponding utilities and plants reported in the FERC Form 1. These entities show up in the 2001-2008 EIA 923 data that was just integrated. These older plants and utilities can’t yet be used in conjunction with FERC data. When the EIA 860 data for 2001-2003 has been integrated, we will finish this manual ID assignment process. See #848, #1069
52 of the algorithmically assigned plant_id_ferc1 values found in the plants_steam_ferc1 table are currently associated with more than one plant_id_pudl value (99 PUDL plant IDs are involved), indicating either that the algorithm is making poor assignments, or that the manually assigned plant_id_pudl values are incorrect. This is out of several thousand distinct plant_id_ferc1 values. See #954
The county FIPS codes associated with coal mines reported in the Fuel Receipts and Costs table are being treated inconsistently in terms of their data types, especially in the output functions, so they are currently being output as floating point numbers that have been cast to strings, rather than zero-padded integers that are strings. See #1119
0.3.2 (2020-02-17)¶
The primary changes in this release:
The 2009-2010 data for EIA 860 have been integrated, including updates to the data validation test cases.
Output tables are more uniform and less restrictive in what they include, no longer requiring PUDL Plant & Utility IDs in some tables. This release was used to compile v1.1.0 of the PUDL Data Release, which is archived at Zenodo under this DOI: https://doi.org/10.5281/zenodo.3672068
With this release, the EIA 860 & 923 data now (finally!) cover the same span of time. We do not anticipate integrating any older EIA 860 or 923 data at this time.
0.3.1 (2020-02-05)¶
A couple of minor bugs were found in the preparation of the first PUDL data release:
No maximum version of Python was being specified in setup.py. PUDL currently only works on Python 3.7, not 3.8.
The epacems_to_parquet conversion script was erroneously attempting to verify the availability of the raw input data files, despite the fact that it now relies on the packaged post-ETL EPA CEMS data. We didn’t catch this before because the script was always being run in a context where the original data was lying around… but that’s not the case when someone just downloads the released data packages and tries to load them.
0.3.0 (2020-01-30)¶
This release is mostly about getting the infrastructure in place to do regular data releases via Zenodo, and updating ETL with 2018 data.
Added lots of data validation / quality assurance test cases in anticipation of archiving data. See the pudl.validate module for more details.
New data since v0.2.0 of PUDL:
EIA Form 860 for 2018
EIA Form 923 for 2018
FERC Form 1 for 1994-2003 and 2018 (select tables)
We removed the FERC Form 1 accumulated depreciation table from PUDL because it requires detailed row-mapping in order to be accurate across all the years. It and many other FERC tables will be integrated soon, using new row-mapping methods.
Lots of new plants and utilities integrated into the PUDL ID mapping process, for the earlier years (1994-2003). All years of FERC 1 data should be integrated for all future ferc1 tables.
Command line interfaces of some of the ETL scripts have changed, see their help messages for details.
0.2.0 (2019-09-17)¶
This is the first release of PUDL to generate data packages as the canonical output, rather than loading data into a local PostgreSQL database. The data packages can then be used to generate a local SQLite database, without relying on any software being installed outside of the Python requirements specified for the catalyst.coop package.
This change will enable easier installation of PUDL, as well as archiving and bulk distribution of the data products in a platform independent format.
0.1.0 (2019-09-12)¶
This is the only release of PUDL that will be made that makes use of PostgreSQL as the primary data product. It is provided for reference, in case there are users relying on this setup who need access to a well defined release.
pudl¶
pudl package¶
Subpackages¶
pudl.analysis package¶
Allocate data from generation_fuel_eia923 table to generator level.
Net electricity generation and fuel consumption are reported in multiple ways in the EIA 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.
The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.
This module estimates the net electricity generation and fuel consumption attributable to individual generators based on the more expansive reporting of the data in the generation_fuel_eia923 table. The main coordinating function here is pudl.analysis.allocate_net_gen.allocate_gen_fuel_by_gen().
The algorithm we’re using assumes:
The generation_eia923 table is the authoritative source of information about how much generation is attributable to an individual generator, if it reports in that table.
The generation_fuel_eia923 table is the authoritative source of information about how much generation and fuel consumption is attributable to an entire plant.
The generators_eia860 table provides an exhaustive list of all generators whose generation is being reported in the generation_fuel_eia923 table.
We allocate the net generation reported in the generation_fuel_eia923 table on the basis of plant, prime mover, and fuel type among the generators in each plant that have matching fuel types. Generation is allocated proportional to reported generation if it’s available, and proportional to each generator’s capacity if generation is not available.
In more detail: within each year of data, we split the plants into three groups:
Plants where ALL generators report in the more granular generation_eia923 table.
Plants where NONE of the generators report in the generation_eia923 table.
Plants where only SOME of the generators report in the generation_eia923 table.
In plant-years where ALL generators report more granular generation, the total net generation reported in the generation_fuel_eia923 table is allocated in proportion to the generation each generator reported in the generation_eia923 table. We do this instead of using net_generation_mwh from generation_eia923 because there are some small discrepancies between the total amounts of generation reported in these two tables.
In plant-years where NONE of the generators report more granular generation, we create a generator record for each associated fuel type. Those records are merged with the generation_fuel_eia923 table on plant, prime mover code, and fuel type. Each group of plant, prime mover, and fuel will have some amount of reported net generation associated with it, and one or more generators. The net generation is allocated among the generators within the group in proportion to their capacity. Then the allocated net generation is summed up by generator.
In the hybrid case, where only SOME of of a plant’s generators report the more granular generation data, we use a combination of the two allocation methods described above. First, the total generation reported across a plant in the generation_fuel_eia923 table is allocated between the two categories of generators (those that report fine-grained generation, and those that don’t) in direct proportion to the fraction of the plant’s generation which is reported in the generation_eia923 table, relative to the total generation reported in the generation_fuel_eia923 table.
Note that this methodology does not distinguish between primary and secondary fuel_types for generators. It associates portions of net generation to each fuel type. If two generators in the same plant do not report detailed generation, have the same prime_mover_code, and use the same fuels, but have very different capacity factors in reality, this methodology will allocate generation such that they end up with very similar capacity factors. We imagine this is an uncommon scenario.
This methodology has several potential flaws and drawbacks. Because there is no indicator of what portion of the net generation is attributable to each of the energy_source_codes (i.e. fuel_type), we associate the net generation equally among them. In effect, if a plant had multiple generators with the same prime_mover_code but opposite primary and secondary fuels (e.g. gen 1 has a primary fuel of ‘NG’ and secondary fuel of ‘DFO’, while gen 2 has a primary fuel of ‘DFO’ and a secondary fuel of ‘NG’), the methodology associates the generation_fuel_eia923 records similarly across these two generators. However, the allocated net generation will still be proportional to each generator’s net generation (if it’s reported) or capacity (if generation is not reported).
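As a sketch of how this module is invoked, assuming a local PUDL SQLite database (the path is illustrative):

import sqlalchemy as sa
from pudl.analysis.allocate_net_gen import allocate_gen_fuel_by_gen
from pudl.output.pudltabl import PudlTabl

# Build an annual-frequency PUDL output object from a local database,
# then allocate generation_fuel_eia923 data down to individual generators.
pudl_engine = sa.create_engine("sqlite:///pudl-work/sqlite/pudl.sqlite")
pudl_out = PudlTabl(pudl_engine, freq="AS")
gen_fuel_allocated = allocate_gen_fuel_by_gen(pudl_out)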
-
pudl.analysis.allocate_net_gen.
DATA_COLS
= ['net_generation_mwh', 'fuel_consumed_mmbtu']¶ Data columns from generation_fuel_eia923 that are being allocated.
-
pudl.analysis.allocate_net_gen.
IDX_GENS
= ['plant_id_eia', 'generator_id', 'report_date']¶ Id columns for generators.
-
pudl.analysis.allocate_net_gen.
IDX_PM_FUEL
= ['plant_id_eia', 'prime_mover_code', 'fuel_type', 'report_date']¶ Id columns for plant, prime mover & fuel type records.
-
pudl.analysis.allocate_net_gen.
agg_by_generator
(gen_pm_fuel)[source]¶ Aggregate the allocated gen fuel data to the generator level.
- Parameters
gen_pm_fuel (pandas.DataFrame) – result of allocate_gen_fuel_by_gen_pm_fuel()
-
pudl.analysis.allocate_net_gen.
allocate_gen_fuel_by_gen
(pudl_out)[source]¶ Allocate gen fuel data columns to generators.
The generation_fuel_eia923 table includes net generation and fuel consumption data at the plant/fuel type/prime mover level. The most granular level of plants that PUDL typically uses is at the plant/generator level. This method converts the generation_fuel_eia923 table to the level of plant/generators.
- Parameters
pudl_out (pudl.output.pudltabl.PudlTabl) – An object used to create the tables for EIA and FERC Form 1 analysis.
- Returns
table with columns IDX_GENS and DATA_COLS. The DATA_COLS will be scaled to the level of the IDX_GENS.
- Return type
-
pudl.analysis.allocate_net_gen.
allocate_gen_fuel_by_gen_pm_fuel
(gf, gen, gens, drop_interim_cols=True)[source]¶ Proportionally allocate net gen from gen_fuel table to generators.
- Two main steps here:
associate generation_fuel_eia923 table data w/ generators
allocate generation_fuel_eia923 table data proportionally
The association process happens via associate_generator_tables().
The allocation process (via calc_allocation_fraction()) entails generating a fraction for each record within an IDX_PM_FUEL group. We have two data points for generating this ratio: the net generation in the generation_eia923 table and the capacity from the generators_eia860 table. The end result is a frac column which is unique for each generator/prime_mover/fuel record and is used to allocate the associated net generation from the generation_fuel_eia923 table.
- Args:
gf (pandas.DataFrame): generation_fuel_eia923 table with columns: IDX_PM_FUEL, net_generation_mwh, and fuel_consumed_mmbtu.
gen (pandas.DataFrame): generation_eia923 table with columns: IDX_GENS and net_generation_mwh.
gens (pandas.DataFrame): generators_eia860 table with columns: IDX_GENS, capacity_mw, prime_mover_code, and all of the energy_source_code columns.
drop_interim_cols (boolean): True/False flag for dropping the interim columns which are used to generate the net_generation_mwh column (mostly the frac column and the net generation reported in the original generation_eia923 and generation_fuel_eia923 tables) and which are useful for debugging. Default is True, which drops the interim columns.
- Returns
pandas.DataFrame
-
pudl.analysis.allocate_net_gen.
associate_generator_tables
(gf, gen, gens)[source]¶ Associate the three tables needed to assign net gen to generators.
- Parameters
gf (pandas.DataFrame) – generation_fuel_eia923 table with columns: IDX_PM_FUEL, net_generation_mwh, and fuel_consumed_mmbtu.
gen (pandas.DataFrame) – generation_eia923 table with columns: IDX_GENS and net_generation_mwh.
gens (pandas.DataFrame) – generators_eia860 table with columns: IDX_GENS and all of the energy_source_code columns.
TODO: Convert these groupby/merges into transforms.
-
pudl.analysis.allocate_net_gen.
calc_allocation_fraction
(gen_pm_fuel, drop_interim_cols=True)[source]¶ Make frac column to allocate net gen from the generation fuel table.
- There are four main types of generators:
“all gen”: generators of plants which fully report to the generation_eia923 table.
“some gen”: generators of plants which partially report to the generation_eia923 table.
“gf only”: generators of plants which do not report at all to the generation_eia923 table.
“no pm”: generators that have missing prime movers.
Each different type of generator needs to be treated slightly differently, but all will end up with a frac column that can be used to allocate the net_generation_mwh_gf_tbl.
- Parameters
gen_pm_fuel (pandas.DataFrame) – output of prep_alloction_fraction().
drop_interim_cols (boolean) – True/False flag for dropping the interim columns which are used to generate the frac column (mostly interim frac columns and totals of net generation from various groupings of generators) and which are useful for debugging. Default is True.
-
pudl.analysis.allocate_net_gen.
prep_alloction_fraction
(gen_assoc)[source]¶ Make flags and aggregations to prepare for calc_allocation_fraction().
In calc_allocation_fraction(), we will break the generators out into four types - see the calc_allocation_fraction() docs for details. This function adds flags for splitting the generators.
-
pudl.analysis.allocate_net_gen.
remove_retired_generators
(gen_assoc)[source]¶ Remove the retired generators.
We don’t want to associate net generation to generators that are retired (or proposed! or any other operational_status besides existing).
We do want to keep the generators that retire mid-year and have generator specific data from the generation_eia923 table. Removing the generators that retire mid-report year and don’t report to the generation_eia923 table is not exactly a great assumption. For now, we are removing them. We should employ a strategy that allocates only a portion of the generation to them based on their operational months (or by doing the allocation on a monthly basis).
- Parameters
gen_assoc (pandas.DataFrame) – table of generators with stacked fuel types and broadcasted net generation data from the generation_eia923 and generation_fuel_eia923 tables. Output of associate_generator_tables().
-
pudl.analysis.allocate_net_gen.
stack_generators
(gens, cat_col='energy_source_code_num', stacked_col='fuel_type')[source]¶ Stack the generator table with a set of columns.
- Parameters
gens (pandas.DataFrame) – generators_eia860 table with cols:
IDX_GENS
and all of the energy_source_code columnscat_col (string) – name of category column which will end up having the column names of cols_to_stack
stacked_col (string) – name of column which will end up with the stacked data from cols_to_stack
- Returns
a dataframe with these columns: idx_stack, cat_col, stacked_col
- Return type
A module with functions to aid generating MCOE.
-
pudl.analysis.mcoe.
capacity_factor
(pudl_out, min_cap_fact=0, max_cap_fact=1.5)[source]¶ Calculate the capacity factor for each generator.
Capacity Factor is calculated by using the net generation from eia923 and the nameplate capacity from eia860. The net gen and capacity are pulled into one dataframe, then the dates from that dataframe are pulled out to determine the hours in each period based on the frequency. The number of hours is used in calculating the capacity factor. Then records with capacity factors outside the range specified by min_cap_fact and max_cap_fact are dropped.
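The underlying arithmetic is simple; sketched here with made-up numbers:

# Capacity factor = actual generation / maximum possible generation.
capacity_mw = 100.0
net_generation_mwh = 525_600.0
hours_in_period = 8_760.0  # one non-leap year

capacity_factor = net_generation_mwh / (capacity_mw * hours_in_period)
# -> 0.6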
-
pudl.analysis.mcoe.
fuel_cost
(pudl_out)[source]¶ Calculate fuel costs per MWh on a per generator basis for MCOE.
Fuel costs are reported on a per-plant basis, but we want to estimate them at the generator level. This is complicated by the fact that some plants have several different types of generators, using different fuels. We have fuel costs broken out by type of fuel (coal, oil, gas), and we know which generators use which fuel based on their energy_source_code and reported prime_mover. Coal plants use a little bit of natural gas or diesel to get started, but based on our analysis of the “pure” coal plants, this amounts to only a fraction of a percent of their overall fuel consumption on a heat content basis, so we’re ignoring it for now.
For plants whose generators all rely on the same fuel source, we simply attribute the fuel costs proportional to the fuel heat content consumption associated with each generator.
For plants with more than one type of generator energy source, we need to split out the fuel costs according to fuel type – so the gas fuel costs are associated with generators that have energy_source_code gas, and the coal fuel costs are associated with the generators that have energy_source_code coal.
-
pudl.analysis.mcoe.
heat_rate_by_gen
(pudl_out)[source]¶ Convert per-unit heat rate to by-generator, adding fuel type & count.
Heat rates really only make sense at the unit level, since input fuel and output electricity are commingled at the unit level, but it is useful in many contexts to have that per-unit heat rate associated with each of the underlying generators, since much more information is available about the generators.
To combine the (potentially) more granular temporal information from the per-unit heat rates with annual generator level attributes, we have to do a many-to-many merge. This can’t be done easily with merge_asof(), so we treat the year and month fields as categorical variables, and do a normal inner merge that broadcasts monthly dates in one direction, and generator IDs in the other.
- Returns
with columns report_date, plant_id_eia, unit_id_pudl, generator_id, heat_rate_mmbtu_mwh, fuel_type_code_pudl, fuel_type_count. The output will have a time frequency corresponding to that of the input pudl_out. Output data types are set to their canonical values before returning.
- Return type
- Raises
ValueError – if pudl_out.freq is None.
-
pudl.analysis.mcoe.
heat_rate_by_unit
(pudl_out)[source]¶ Calculate heat rates (mmBTU/MWh) within separable generation units.
Assumes a “good” Boiler Generator Association (bga) i.e. one that only contains boilers and generators which have been completely associated at some point in the past.
The BGA dataframe needs to have the following columns:
report_date (annual)
plant_id_eia
unit_id_pudl
generator_id
boiler_id
The unit_id is associated with generation records based on report_date, plant_id_eia, and generator_id. Analogously, the unit_id is associated with boiler fuel consumption records based on report_date, plant_id_eia, and boiler_id.
Then the total net generation and fuel consumption per unit per time period are calculated, allowing the calculation of a per unit heat rate. That per unit heat rate is returned in a dataframe containing:
report_date
plant_id_eia
unit_id_pudl
net_generation_mwh
fuel_consumed_mmbtu
heat_rate_mmbtu_mwh
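The heat rate itself is just the ratio of those last two columns; with made-up values:

# Heat rate = fuel consumed per unit of electricity generated.
fuel_consumed_mmbtu = 10_500.0
net_generation_mwh = 1_000.0

heat_rate_mmbtu_mwh = fuel_consumed_mmbtu / net_generation_mwh
# -> 10.5 mmBTU/MWh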
-
pudl.analysis.mcoe.
mcoe
(pudl_out, min_heat_rate=5.5, min_fuel_cost_per_mwh=0.0, min_cap_fact=0.0, max_cap_fact=1.5, all_gens=True)[source]¶ Compile marginal cost of electricity (MCOE) at the generator level.
Use data from EIA 923, EIA 860, and (someday) FERC Form 1 to estimate the MCOE of individual generating units. The calculation is performed over the range of times and at the time resolution of the input pudl_out object.
- Parameters
pudl_out (pudl.output.pudltable.PudlTabl) – a PUDL output object specifying the time resolution and date range for which the calculations should be performed.
min_heat_rate (float) – lowest plausible heat rate, in mmBTU/MWh. Any MCOE records with lower heat rates are presumed to be invalid, and are discarded before returning.
min_cap_fact (float) – minimum generator capacity factor. Generator records with a lower capacity factor will be filtered out before returning. This allows the user to exclude generators that aren’t being used enough to have valid MCOE estimates.
max_cap_fact (float) – maximum generator capacity factor. Generator records with a higher capacity factor will be filtered out before returning, allowing the user to exclude records with implausibly high capacity factors.
min_fuel_cost_per_mwh (float) – minimum fuel cost on a per MWh basis that is required for a generator record to be considered valid. For some reason there are now a large number of $0 fuel cost records, which previously would have been NaN.
all_gens (bool) – if True, include attributes of all generators in the generators_eia860 table, rather than just the generators which have records in the derived MCOE values. True by default.
- Returns
a dataframe organized by date and generator, with lots of juicy information about the generators – including fuel cost on a per MWh and MMBTU basis, heat rates, and net generation.
- Return type
Compile historical utility and balancing area territories.
Use the mapping of utilities to counties, and balancing areas to utilities, available within the EIA 861, in conjunction with the US Census geometries for counties, to infer the historical spatial extent of utility and balancing area territories. Output the resulting geometries for use in other applications.
-
pudl.analysis.service_territory.
add_geometries
(df, census_gdf, dissolve=False, dissolve_by=None)[source]¶ Merge census geometries into dataframe on county_id_fips, optionally dissolving.
Merge the US Census county-level geospatial information into the DataFrame df based on the column county_id_fips (in df), which corresponds to the column GEOID10 in census_gdf. Also bring in the population and area of the counties, summing as necessary in the case of dissolved geometries.
- Parameters
df (pandas.DataFrame) – A DataFrame containing a county_id_fips column.
census_gdf (geopandas.GeoDataFrame) – A GeoDataFrame based on the US Census demographic profile (DP1) data at county resolution, with the original column names as published by US Census.
dissolve (bool) – If True, dissolve individual county geometries into larger service territories.
dissolve_by (list) – The columns to group by in the dissolve. For example, dissolve_by=[“report_date”, “utility_id_eia”] might provide annual utility service territories, while [“report_date”, “balancing_authority_id_eia”] would provide annual balancing authority territories.
- Returns
geopandas.GeoDataFrame
-
pudl.analysis.service_territory.
compile_geoms
(pudl_out, census_counties, entity_type, dissolve=False, limit_by_state=True, save=True)[source]¶ Compile all available utility or balancing authority geometries.
- Parameters
pudl_out (pudl.output.pudltabl.PudlTabl) – A PUDL output object, which will be used to extract and cache the EIA 861 tables.
census_counties (geopandas.GeoDataFrame) – A GeoDataFrame containing the county level US Census DP1 data and county geometries.
entity_type (str) – The type of service territory geometry to compile. Must be either “ba” (balancing authority) or “util” (utility).
dissolve (bool) – Whether to dissolve the compiled geometries to the utility/balancing authority level, or leave them as counties.
limit_by_state (bool) – Whether to limit included counties to those with observed EIA 861 data in association with the state and utility/balancing authority.
save (bool) – If True, save the compiled GeoDataFrame as a GeoParquet file before returning. Especially useful in the case of dissolved geometries, as they are computationally expensive.
- Returns
geopandas.GeoDataFrame
-
pudl.analysis.service_territory.
get_all_utils
(pudl_out)[source]¶ Compile IDs and Names of all known EIA Utilities.
Grab all EIA utility names and IDs from both the EIA 861 Service Territory table and the EIA 860 Utility entity table. This is a temporary function that’s only needed because we haven’t integrated the EIA 861 information into the entity harvesting process and PUDL database yet.
- Parameters
pudl_out (pudl.output.pudltabl.PudlTabl) – The PUDL output object which should be used to obtain PUDL data.
- Returns
Having 2 columns: utility_id_eia and utility_name_eia.
- Return type
-
pudl.analysis.service_territory.
get_territory_fips
(ids, assn, assn_col, st_eia861, limit_by_state=True)[source]¶ Compile county FIPS codes associated with an entity’s service territory.
For each entity identified by ids, look up the set of counties associated with that entity on an annual basis. Optionally limit the set of counties to those within states where the selected entities reported activity elsewhere within the EIA 861 data.
- Parameters
ids (iterable of ints) – A collection of EIA utility or balancing authority IDs.
assn (pandas.DataFrame) – Association table, relating report_date, state, and utility_id_eia to each other, as well as the column indicated by assn_col, if it’s not utility_id_eia.
assn_col (str) – Label of the dataframe column in assn that contains the ID of the entities of interest. Should probably be either balancing_authority_id_eia or utility_id_eia.
st_eia861 (pandas.DataFrame) – The EIA 861 Service Territory table.
limit_by_state (bool) – Whether to require that the counties associated with the balancing authority are inside a state that has also been seen in association with the balancing authority and the utility whose service territory contains the county.
- Returns
A table associating the entity IDs with a collection of counties annually, identifying counties both by name and county_id_fips (both state and state_id_fips are included for clarity).
- Return type
-
pudl.analysis.service_territory.
get_territory_geometries
(ids, assn, assn_col, st_eia861, census_gdf, limit_by_state=True, dissolve=False)[source]¶ Compile service territory geometries based on county_id_fips.
Calls get_territory_fips to generate the list of counties associated with each entity identified by ids, and then merges in the corresponding county geometries from the US Census DP1 data passed in via census_gdf.
Optionally, dissolve all of the county level geometries into a single geometry for each combination of entity and year.
Note
Dissolving geometries is a costly operation, and may take half an hour or more if you are processing all entities for all years. Dissolving also means that all the per-county information will be lost, rendering the output inappropriate for use in many analyses. Dissolving is mostly useful for generating visualizations.
- Parameters
ids (iterable of ints) – A collection of EIA balancing authority IDs.
assn (pandas.DataFrame) – Association table, relating report_date, state, and utility_id_eia to each other, as well as the column indicated by assn_col, if it’s not utility_id_eia.
assn_col (str) – Label of the dataframe column in assn that contains the ID of the entities of interest. Should probably be either balancing_authority_id_eia or utility_id_eia.
st_eia861 (pandas.DataFrame) – The EIA 861 Service Territory table.
census_gdf (geopandas.GeoDataFrame) – The US Census DP1 county-level geometries as returned by pudl.output.censusdp1tract.get_layer(“county”).
limit_by_state (bool) – Whether to require that the counties associated with the balancing authority are inside a state that has also been seen in association with the balancing authority and the utility whose service territory contains the county.
dissolve (bool) – If False, each record in the compiled territory will correspond to a single county, with a county-level geometry, and there will be many records enumerating all the counties associated with a given balancing_authority_id_eia in each year. If dissolve=True, all of the county-level geometries for each utility in each year will be merged together (“dissolved”) resulting in a single geometry and record for each balancing_authority-year.
- Returns
geopandas.GeoDataFrame
-
pudl.analysis.service_territory.
main
()[source]¶ Compile historical utility and balancing area territories.
-
pudl.analysis.service_territory.
parse_command_line
(argv)[source]¶ Parse script command line arguments. See the -h option.
-
pudl.analysis.service_territory.
plot_all_territories
(gdf, report_date, respondent_type=('balancing_authority', 'utility'), color='black', alpha=0.25, basemap=True)[source]¶ Plot all of the planning areas of a given type for a given report date.
Todo
This function needs to be made more general purpose, and less entangled with the FERC 714 data.
- Parameters
gdf (geopandas.GeoDataFrame) – GeoDataFrame containing planning area geometries, organized by respondent_id_ferc714 and report_date.
report_date (datetime) – A Datetime indicating what year’s planning areas should be displayed.
respondent_type (str or iterable) – Type of respondent whose planning areas should be displayed. Either “utility” or “balancing_authority” or an iterable collection containing both.
color (str) – Color to use for the planning areas.
alpha (float) – Transparency to use for the planning areas.
basemap (bool) – If true, use the OpenStreetMap tiles for context.
- Returns
matplotlib.axes.Axes
-
pudl.analysis.service_territory.
plot_historical_territory
(gdf, id_col, id_val)[source]¶ Plot all the historical geometries defined for the specified entity.
This is useful for exploring how a particular entity’s service territory has evolved over time, or for identifying individual missing or inaccurate territories.
- Parameters
gdf (geopandas.GeoDataFrame) – A geodataframe containing geometries pertaining electricity planning areas. Can be broken down by county FIPS code, or have a single record containing a geometry for each combination of report_date and the column being used to select planning areas (see below).
id_col (str) – The label of a column in gdf that identifies the planning area to be visualized, like utility_id_eia, balancing_authority_id_eia, or balancing_authority_code_eia.
- Returns
None
Spatial operations for demand allocation.
-
pudl.analysis.spatial.
check_gdf
(gdf: geopandas.geodataframe.GeoDataFrame) → None[source]¶ Check that GeoDataFrame contains (Multi)Polygon geometries with non-zero area.
- Parameters
gdf – GeoDataFrame.
- Raises
TypeError – Object is not a GeoDataFrame.
AttributeError – GeoDataFrame has no geometry.
TypeError – Geometry is not a GeoSeries.
ValueError – Geometry contains null geometries.
ValueError – Geometry contains non-(Multi)Polygon geometries.
ValueError – Geometry contains (Multi)Polygon geometries with zero area.
ValueError – MultiPolygon contains Polygon geometries with zero area.
-
pudl.analysis.spatial.
dissolve
(gdf: geopandas.geodataframe.GeoDataFrame, by: Iterable[str], func: Union[Callable, str, list, dict], how: Union[Literal[union, first], Callable[[geopandas.geoseries.GeoSeries], shapely.geometry.base.BaseGeometry]] = 'union') → geopandas.geodataframe.GeoDataFrame[source]¶ Dissolve layer by aggregating features based on common attributes.
- Parameters
gdf – GeoDataFrame with non-empty (Multi)Polygon geometries.
by – Names of columns to group features by.
func – Aggregation function for data columns (see pd.DataFrame.groupby()).
how – Aggregation function for geometry column. Either ‘union’ (gpd.GeoSeries.unary_union()), ‘first’ (first geometry in group), or a function aggregating multiple geometries into one.
- Returns
GeoDataFrame with dissolved geometry and data columns, and grouping columns set as the index.
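As an illustration of the dissolve concept (not this function’s exact behavior), the equivalent geopandas operation on a toy layer looks like the sketch below; the utility IDs, populations, and squares are made up for illustration.

import geopandas as gpd
from shapely.geometry import box

gdf = gpd.GeoDataFrame(
    {"utility_id_eia": [1, 1, 2], "population": [100, 200, 50]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(3, 0, 4, 1)],
)
# One row per utility: geometries unioned, data columns summed.
merged = gdf.dissolve(by="utility_id_eia", aggfunc="sum")
print(merged)

The pudl.analysis.spatial.dissolve() version generalizes this by letting func and how control the data-column and geometry aggregation separately.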
-
pudl.analysis.spatial.
explode
(gdf: geopandas.geodataframe.GeoDataFrame, ratios: Optional[Iterable[str]] = None) → geopandas.geodataframe.GeoDataFrame[source]¶ Explode MultiPolygon to multiple Polygon geometries.
- Parameters
gdf – GeoDataFrame with non-zero-area (Multi)Polygon geometries.
ratios – Names of columns to rescale by the area fraction of the Polygon relative to the MultiPolygon. If provided, MultiPolygon cannot self-intersect. By default, the original value is used unchanged.
- Raises
ValueError – Geometry contains self-intersecting MultiPolygon.
- Returns
GeoDataFrame with each Polygon as a separate row in the GeoDataFrame. The index is the number of the source row in the input GeoDataFrame.
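A self-contained sketch of the ratios behavior using plain geopandas (assuming geopandas >= 0.10 for the index_parts argument); the population column and geometries are illustrative, not real data.

import geopandas as gpd
from shapely.geometry import MultiPolygon, box

mp = MultiPolygon([box(0, 0, 1, 1), box(2, 0, 5, 1)])  # areas 1 and 3
gdf = gpd.GeoDataFrame({"population": [400.0]}, geometry=[mp])

parts = gdf.explode(index_parts=True)
# Rescale by each Polygon's share of the MultiPolygon's total area.
parts["population"] *= parts.geometry.area / mp.area
print(parts["population"].tolist())  # [100.0, 300.0]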
-
pudl.analysis.spatial.
get_data_columns
(df: pandas.core.frame.DataFrame) → list[source]¶ Return list of columns, ignoring geometry.
-
pudl.analysis.spatial.
overlay
(*gdfs: geopandas.geodataframe.GeoDataFrame, how: Literal[intersection, union, identity, symmetric_difference, difference] = 'intersection', ratios: Optional[Iterable[str]] = None) → geopandas.geodataframe.GeoDataFrame[source]¶ Overlay multiple layers incrementally.
When a feature from one layer overlaps the feature of another layer, the area of overlap is split into two geometrically-identical features: one for each of the original overlapping features. Each split feature contains the attributes of the original feature.
TODO: To identify the source of output features, the user can ensure that each layer contains a column to index by. Alternatively, tuples of indices of the overlapping feature from each layer (null if none) could be returned as the index.
- Parameters
gdfs – GeoDataFrames with non-empty (Multi)Polygon geometries assumed to contain no self-overlaps (see self_union()). Names of (non-geometry) columns cannot be used more than once. Any index columns are ignored.
how – Spatial overlay method (see gpd.overlay()).
ratios – Names of columns to rescale by the area fraction of the split feature relative to the original. By default, the original value is used unchanged.
- Raises
ValueError – Duplicate column names in layers.
- Returns
GeoDataFrame with the geometries and attributes resulting from the overlay.
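A minimal illustration of the underlying overlay semantics using gpd.overlay() directly, with two toy one-feature layers; the attribute names and values are illustrative.

import geopandas as gpd
from shapely.geometry import box

left = gpd.GeoDataFrame({"a": [1]}, geometry=[box(0, 0, 2, 1)])
right = gpd.GeoDataFrame({"b": [10]}, geometry=[box(1, 0, 3, 1)])

# The 1x1 area of overlap carries the attributes of both inputs.
print(gpd.overlay(left, right, how="intersection"))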
-
pudl.analysis.spatial.
polygonize
(geom: shapely.geometry.base.BaseGeometry) → Union[shapely.geometry.polygon.Polygon, shapely.geometry.multipolygon.MultiPolygon][source]¶ Convert geometry to (Multi)Polygon.
- Parameters
geom – Geometry to convert to (Multi)Polygon.
- Returns
Geometry converted to (Multi)Polygon, with all zero-area components removed.
- Raises
ValueError – Geometry has zero area.
-
pudl.analysis.spatial.
self_union
(gdf: geopandas.geodataframe.GeoDataFrame, ratios: Optional[Iterable[str]] = None) → geopandas.geodataframe.GeoDataFrame[source]¶ Calculate the geometric union of a feature layer with itself.
Areas of overlap are split into two or more geometrically-identical features: one for each of the original overlapping features. Each split feature contains the attributes of the original feature.
- Parameters
gdf – GeoDataFrame with non-zero-area MultiPolygon geometries.
ratios – Names of columns to rescale by the area fraction of the split feature relative to the original. By default, the original value is used unchanged.
- Returns
GeoDataFrame representing the union of the input features with themselves. Its index contains tuples of the index of the original overlapping features.
- Raises
NotImplementedError – MultiPolygon geometries are not yet supported.
Predict state-level electricity demand.
-
pudl.analysis.state_demand.
STANDARD_UTC_OFFSETS
: Dict[str, int] = {'America/Anchorage': -9, 'America/Chicago': -6, 'America/Denver': -7, 'America/Halifax': -4, 'America/Los_Angeles': -8, 'America/New_York': -5, 'Pacific/Honolulu': -10}¶ Hour offset from Coordinated Universal Time (UTC) by time zone.
Time zones are canonical names (e.g. ‘America/Denver’) from tzdata (https://www.iana.org/time-zones) mapped to their standard-time UTC offset.
-
pudl.analysis.state_demand.
STATES
: List[Dict[str, Union[int, str]]] = [{'name': 'Alabama', 'code': 'AL', 'fips': '01'}, {'name': 'Alaska', 'code': 'AK', 'fips': '02'}, {'name': 'Arizona', 'code': 'AZ', 'fips': '04'}, {'name': 'Arkansas', 'code': 'AR', 'fips': '05'}, {'name': 'California', 'code': 'CA', 'fips': '06'}, {'name': 'Colorado', 'code': 'CO', 'fips': '08'}, {'name': 'Connecticut', 'code': 'CT', 'fips': '09'}, {'name': 'Delaware', 'code': 'DE', 'fips': '10'}, {'name': 'District of Columbia', 'code': 'DC', 'fips': '11'}, {'name': 'Florida', 'code': 'FL', 'fips': '12'}, {'name': 'Georgia', 'code': 'GA', 'fips': '13'}, {'name': 'Hawaii', 'code': 'HI', 'fips': '15'}, {'name': 'Idaho', 'code': 'ID', 'fips': '16'}, {'name': 'Illinois', 'code': 'IL', 'fips': '17'}, {'name': 'Indiana', 'code': 'IN', 'fips': '18'}, {'name': 'Iowa', 'code': 'IA', 'fips': '19'}, {'name': 'Kansas', 'code': 'KS', 'fips': '20'}, {'name': 'Kentucky', 'code': 'KY', 'fips': '21'}, {'name': 'Louisiana', 'code': 'LA', 'fips': '22'}, {'name': 'Maine', 'code': 'ME', 'fips': '23'}, {'name': 'Maryland', 'code': 'MD', 'fips': '24'}, {'name': 'Massachusetts', 'code': 'MA', 'fips': '25'}, {'name': 'Michigan', 'code': 'MI', 'fips': '26'}, {'name': 'Minnesota', 'code': 'MN', 'fips': '27'}, {'name': 'Mississippi', 'code': 'MS', 'fips': '28'}, {'name': 'Missouri', 'code': 'MO', 'fips': '29'}, {'name': 'Montana', 'code': 'MT', 'fips': '30'}, {'name': 'Nebraska', 'code': 'NE', 'fips': '31'}, {'name': 'Nevada', 'code': 'NV', 'fips': '32'}, {'name': 'New Hampshire', 'code': 'NH', 'fips': '33'}, {'name': 'New Jersey', 'code': 'NJ', 'fips': '34'}, {'name': 'New Mexico', 'code': 'NM', 'fips': '35'}, {'name': 'New York', 'code': 'NY', 'fips': '36'}, {'name': 'North Carolina', 'code': 'NC', 'fips': '37'}, {'name': 'North Dakota', 'code': 'ND', 'fips': '38'}, {'name': 'Ohio', 'code': 'OH', 'fips': '39'}, {'name': 'Oklahoma', 'code': 'OK', 'fips': '40'}, {'name': 'Oregon', 'code': 'OR', 'fips': '41'}, {'name': 'Pennsylvania', 'code': 'PA', 'fips': '42'}, {'name': 'Rhode Island', 'code': 'RI', 'fips': '44'}, {'name': 'South Carolina', 'code': 'SC', 'fips': '45'}, {'name': 'South Dakota', 'code': 'SD', 'fips': '46'}, {'name': 'Tennessee', 'code': 'TN', 'fips': '47'}, {'name': 'Texas', 'code': 'TX', 'fips': '48'}, {'name': 'Utah', 'code': 'UT', 'fips': '49'}, {'name': 'Vermont', 'code': 'VT', 'fips': '50'}, {'name': 'Virginia', 'code': 'VA', 'fips': '51'}, {'name': 'Washington', 'code': 'WA', 'fips': '53'}, {'name': 'West Virginia', 'code': 'WV', 'fips': '54'}, {'name': 'Wisconsin', 'code': 'WI', 'fips': '55'}, {'name': 'Wyoming', 'code': 'WY', 'fips': '56'}, {'name': 'American Samoa', 'code': 'AS', 'fips': '60'}, {'name': 'Guam', 'code': 'GU', 'fips': '66'}, {'name': 'Northern Mariana Islands', 'code': 'MP', 'fips': '69'}, {'name': 'Puerto Rico', 'code': 'PR', 'fips': '72'}, {'name': 'Virgin Islands', 'code': 'VI', 'fips': '78'}]¶ Attributes of US states and territories.
name (str): Full name.
code (str): US Postal Service (USPS) two-letter alphabetic code.
fips (str): Federal Information Processing Standard (FIPS) code.
-
pudl.analysis.state_demand.
UTC_OFFSETS
: Dict[str, int] = {'ADT': -3, 'AKDT': -8, 'AKST': -9, 'AST': -4, 'CDT': -5, 'CST': -6, 'EDT': -4, 'EST': -5, 'HST': -10, 'MDT': -6, 'MST': -7, 'PDT': -7, 'PST': -8}¶ Hour offset from Coordinated Universal Time (UTC) by time zone.
Time zones are either standard or daylight-savings time zone abbreviations (e.g. ‘MST’).
-
pudl.analysis.state_demand.
clean_ferc714_hourly_demand_matrix
(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶ Detect and null anomalous values in FERC 714 hourly demand matrix.
Note
Takes about 10 minutes.
- Parameters
df – FERC 714 hourly demand matrix, as described in load_ferc714_hourly_demand_matrix().
- Returns
Copy of df with nulled anomalous values.
-
pudl.analysis.state_demand.
compare_state_demand
(a: pandas.core.frame.DataFrame, b: pandas.core.frame.DataFrame, scaled: bool = True) → pandas.core.frame.DataFrame[source]¶ Compute statistics comparing predicted and reference demand.
Statistics are computed for each year.
- Parameters
a – Predicted demand with columns utc_datetime and either demand_mwh (if scaled=False) or scaled_demand_mwh (if scaled=True).
b – Reference demand with columns utc_datetime and demand_mwh. Every element in utc_datetime must match the one in a.
- Returns
Dataframe with columns year, rmse (root mean square error), and mae (mean absolute error).
- Raises
ValueError – Datetime columns do not match.
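A minimal sketch of how per-year RMSE and MAE could be computed from the columns described above; this is an illustration, not the function’s actual implementation, and it assumes utc_datetime is already a datetime column.

import numpy as np
import pandas as pd

def yearly_error_stats(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    # Element-wise differences between predicted and reference demand.
    diff = a["demand_mwh"].to_numpy() - b["demand_mwh"].to_numpy()
    frame = pd.DataFrame({"year": a["utc_datetime"].dt.year, "diff": diff})
    return (
        frame.groupby("year")["diff"]
        .agg(
            rmse=lambda d: np.sqrt(np.mean(d ** 2)),
            mae=lambda d: np.mean(np.abs(d)),
        )
        .reset_index()
    )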
-
pudl.analysis.state_demand.
filter_ferc714_hourly_demand_matrix
(df: pandas.core.frame.DataFrame, min_data: int = 100, min_data_fraction: float = 0.9) → pandas.core.frame.DataFrame[source]¶ Filter incomplete years from FERC 714 hourly demand matrix.
Nulls respondent-years with too little data and drops respondents with no data across all years.
- Parameters
df – FERC 714 hourly demand matrix, as described in load_ferc714_hourly_demand_matrix().
min_data – Minimum number of non-null hours in a year.
min_data_fraction – Minimum fraction of non-null hours between the first and last non-null hour in a year.
- Returns
Hourly demand matrix df modified in-place.
-
pudl.analysis.state_demand.
impute_ferc714_hourly_demand_matrix
(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶ Impute null values in FERC 714 hourly demand matrix.
Imputation is performed separately for each year, with only the respondents reporting data in that year.
Note
Takes about 15 minutes.
- Parameters
df – FERC 714 hourly demand matrix, as described in load_ferc714_hourly_demand_matrix().
- Returns
Copy of df with imputed values.
-
pudl.analysis.state_demand.
load_counties
(pudl_out: pudl.output.pudltabl.PudlTabl, pudl_settings: dict) → pandas.core.frame.DataFrame[source]¶ Load county attributes.
- Parameters
pudl_out – PUDL database extractor.
pudl_settings – PUDL settings.
- Returns
Dataframe with columns county_id_fips and population.
-
pudl.analysis.state_demand.
load_eia861_state_total_sales
(pudl_out: pudl.output.pudltabl.PudlTabl) → pandas.core.frame.DataFrame[source]¶ Read and format EIA 861 sales by state and year.
- Parameters
pudl_out – Used to access pudl.output.pudltabl.PudlTabl.sales_eia861().
- Returns
Dataframe with columns state_id_fips, year, demand_mwh.
-
pudl.analysis.state_demand.
load_ferc714_county_assignments
(pudl_out: pudl.output.pudltabl.PudlTabl) → pandas.core.frame.DataFrame[source]¶ Load FERC 714 county assignments.
- Parameters
pudl_out – PUDL database extractor.
- Returns
Dataframe with columns respondent_id_ferc714, report year (int), and county_id_fips.
-
pudl.analysis.state_demand.
load_ferc714_hourly_demand_matrix
(pudl_out: pudl.output.pudltabl.PudlTabl) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶ Read and format FERC 714 hourly demand into matrix form.
- Parameters
pudl_out – Used to access pudl.output.pudltabl.PudlTabl.demand_hourly_pa_ferc714().
- Returns
Hourly demand as a matrix with a datetime row index (e.g. ‘2006-01-01 00:00:00’, …, ‘2019-12-31 23:00:00’) in local time ignoring daylight-savings, and a respondent_id_ferc714 column index (e.g. 101, …, 329). A second Dataframe lists the UTC offset in hours of each respondent_id_ferc714 and reporting year (int).
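A toy sketch of the matrix layout, built by pivoting a long-format table; the respondent IDs and demand values are illustrative.

import pandas as pd

long = pd.DataFrame({
    "respondent_id_ferc714": [101, 101, 329, 329],
    "local_datetime": pd.to_datetime(
        ["2006-01-01 00:00", "2006-01-01 01:00"] * 2),
    "demand_mwh": [1.0, 2.0, 10.0, 20.0],
})
# One row per local timestamp, one column per respondent.
matrix = long.pivot(
    index="local_datetime",
    columns="respondent_id_ferc714",
    values="demand_mwh",
)
print(matrix)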
-
pudl.analysis.state_demand.
load_ventyx_hourly_state_demand
(path: str) → pandas.core.frame.DataFrame[source]¶ Read and format Ventyx hourly state-level demand.
After manual corrections of the listed time zone, ambiguous time zone issues remain. Below is a list of transmission zones (by Transmission Zone ID) with one or more missing timestamps at transitions to or from daylight-savings:
615253 (Indiana)
615261 (Michigan)
615352 (Wisconsin)
615357 (Missouri)
615377 (Saskatchewan)
615401 (Minnesota, Wisconsin)
615516 (Missouri)
615529 (Oklahoma)
615603 (Idaho, Washington)
1836089 (California)
- Parameters
path – Path to the data file (published as ‘state_level_load_2007_2018.csv’).
- Returns
Dataframe with hourly state-level demand.
state_id_fips: FIPS code of US state.
utc_datetime: UTC time of the start of each hour.
demand_mwh: Hourly demand in MWh.
-
pudl.analysis.state_demand.
local_to_utc
(local: pandas.core.series.Series, tz: Iterable, **kwargs: Any) → pandas.core.series.Series[source]¶ Convert local times to UTC.
- Parameters
local – Local times (tz-naive datetime64[ns]).
tz – For each time, a timezone (see DatetimeIndex.tz_localize()) or UTC offset in hours (int or float).
kwargs – Optional arguments to DatetimeIndex.tz_localize().
- Returns
UTC times (tz-naive datetime64[ns]).
Examples
>>> s = pd.Series([pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 1)])
>>> local_to_utc(s, [-7, -6])
0   2020-01-01 07:00:00
1   2020-01-01 06:00:00
dtype: datetime64[ns]
>>> local_to_utc(s, ['America/Denver', 'America/Chicago'])
0   2020-01-01 07:00:00
1   2020-01-01 06:00:00
dtype: datetime64[ns]
-
pudl.analysis.state_demand.
lookup_state
(state: Union[str, int]) → dict[source]¶ Lookup US state by state identifier.
- Parameters
state – State name, two-letter abbreviation, or FIPS code. String matching is case-insensitive.
- Returns
State identifiers.
Examples
>>> lookup_state('alabama')
{'name': 'Alabama', 'code': 'AL', 'fips': '01'}
>>> lookup_state('AL')
{'name': 'Alabama', 'code': 'AL', 'fips': '01'}
>>> lookup_state(1)
{'name': 'Alabama', 'code': 'AL', 'fips': '01'}
-
pudl.analysis.state_demand.
melt_ferc714_hourly_demand_matrix
(df: pandas.core.frame.DataFrame, tz: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶ Melt FERC 714 hourly demand matrix to long format.
- Parameters
df – FERC 714 hourly demand matrix, as described in load_ferc714_hourly_demand_matrix().
tz – FERC 714 respondent time zones, as described in load_ferc714_hourly_demand_matrix().
- Returns
Long-format hourly demand with columns respondent_id_ferc714, report year (int), utc_datetime, and demand_mwh.
-
pudl.analysis.state_demand.
plot_demand_scatter
(a: pandas.core.frame.DataFrame, b: pandas.core.frame.DataFrame, title: Optional[str] = None, path: Optional[str] = None) → None[source]¶ Make a scatter plot comparing predicted and reference demand.
- Parameters
a – Predicted demand with columns utc_datetime and any of demand_mwh (in grey) and scaled_demand_mwh (in orange).
b – Reference demand with columns utc_datetime and demand_mwh. Every element in utc_datetime must match the one in a.
title – Plot title.
path – Plot path. If provided, the figure is saved to file and closed.
- Raises
ValueError – Datetime columns do not match.
-
pudl.analysis.state_demand.
plot_demand_timeseries
(a: pandas.core.frame.DataFrame, b: Optional[pandas.core.frame.DataFrame] = None, window: int = 168, title: Optional[str] = None, path: Optional[str] = None) → None[source]¶ Make a timeseries plot of predicted and reference demand.
- Parameters
a – Predicted demand with columns utc_datetime and any of demand_mwh (in grey) and scaled_demand_mwh (in orange).
b – Reference demand with columns utc_datetime and demand_mwh (in red).
window – Width of window (in rows) to use to compute rolling means, or None to plot raw values.
title – Plot title.
path – Plot path. If provided, the figure is saved to file and closed.
-
pudl.analysis.state_demand.
predict_state_hourly_demand
(demand: pandas.core.frame.DataFrame, counties: pandas.core.frame.DataFrame, assignments: pandas.core.frame.DataFrame, state_totals: Optional[pandas.core.frame.DataFrame] = None, mean_overlaps: bool = False) → pandas.core.frame.DataFrame[source]¶ Predict state hourly demand.
- Parameters
demand – Hourly demand timeseries, with columns respondent_id_ferc714, report year, utc_datetime, and demand_mwh.
counties – Counties, with columns county_id_fips and population.
assignments – County assignments for demand respondents, with columns respondent_id_ferc714, year, and county_id_fips.
state_totals – Total annual demand by state, with columns state_id_fips, year, and demand_mwh. If provided, the predicted hourly demand is scaled to match these totals.
mean_overlaps – Whether to average the demands predicted for a county in cases when a county is assigned to multiple respondents. By default, demands are summed.
- Returns
Dataframe with columns state_id_fips, utc_datetime, demand_mwh, and (if state_totals was provided) scaled_demand_mwh.
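A simplified sketch of population-weighted allocation consistent with the description above; it is not the actual implementation, and it assumes a plain year column alongside the documented columns.

import pandas as pd

def allocate_demand(demand, counties, assignments):
    # Attach county populations to each respondent-year assignment.
    alloc = assignments.merge(counties, on="county_id_fips")
    # Each county receives a share of its respondent's demand in
    # proportion to its population within the respondent's territory.
    total_pop = alloc.groupby(
        ["respondent_id_ferc714", "year"])["population"].transform("sum")
    alloc["weight"] = alloc["population"] / total_pop
    out = demand.merge(alloc, on=["respondent_id_ferc714", "year"])
    out["demand_mwh"] = out["demand_mwh"] * out["weight"]
    # State FIPS is the first two digits of the county FIPS code.
    out["state_id_fips"] = out["county_id_fips"].str[:2]
    return out.groupby(
        ["state_id_fips", "utc_datetime"], as_index=False)["demand_mwh"].sum()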
-
pudl.analysis.state_demand.
utc_to_local
(utc: pandas.core.series.Series, tz: Iterable) → pandas.core.series.Series[source]¶ Convert UTC times to local.
- Parameters
utc – UTC times (tz-naive datetime64[ns] or datetime64[ns, UTC]).
tz – For each time, a timezone (see
DatetimeIndex.tz_localize()
) or UTC offset in hours (int or float).
- Returns
Local times (tz-naive datetime64[ns]).
Examples
>>> s = pd.Series([pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 1)])
>>> utc_to_local(s, [-7, -6])
0   2019-12-31 17:00:00
1   2019-12-31 18:00:00
dtype: datetime64[ns]
>>> utc_to_local(s, ['America/Denver', 'America/Chicago'])
0   2019-12-31 17:00:00
1   2019-12-31 18:00:00
dtype: datetime64[ns]
Screen timeseries for anomalies and impute missing and anomalous values.
The screening methods were originally designed to identify unrealistic data in the electricity demand timeseries reported to EIA on Form 930, and have also been applied to the FERC Form 714, and various historical demand timeseries published by regional grid operators like MISO, PJM, ERCOT, and SPP.
They are adapted from code published and modified by:
Tyler Ruggles <truggles@carnegiescience.edu>
Greg Schivley <greg@carbonimpact.co>
And described at:
The imputation methods were designed for multivariate time series forecasting.
They are adapted from code published by:
Xinyu Chen <chenxy346@gmail.com>
And described at:
-
class
pudl.analysis.timeseries_cleaning.
Timeseries
(x: Union[numpy.ndarray, pandas.core.frame.DataFrame])[source]¶ Bases:
object
Multivariate timeseries for anomaly detection and imputation.
-
xi
¶ Reference to the original values (can be null). Many methods assume that these represent chronological, regular timeseries.
-
flags
¶ Flag label for each value, or null if not flagged.
-
flagged
¶ Running list of flags that have been checked so far.
-
index
¶ Row index.
-
columns
¶ Column names.
-
diff
(shift: int = 1) → numpy.ndarray[source]¶ Values minus the value of their neighbor.
- Parameters
shift – Positions to shift for calculating the difference. Positive values select a preceding (left) neighbor.
-
flag
(mask: numpy.ndarray, flag: str) → None[source]¶ Flag values.
Flags values (if not already flagged) and nulls flagged values.
- Parameters
mask – Boolean mask of the values to flag.
flag – Flag name.
-
flag_anomalous_region
(window: int = 48, threshold: float = 0.15) → None[source]¶ Flag values surrounded by flagged values (ANOMALOUS_REGION).
Original null values are not considered flagged values.
- Parameters
window – Width of regions.
threshold – Fraction of flagged values required for a region to be flagged.
-
flag_double_delta
(iqr_window: int = 240, multiplier: float = 2) → None[source]¶ Flag values very different from their neighbors on either side (DOUBLE_DELTA).
Flags values whose differences to both neighbors on either side exceed a multiplier times the rolling interquartile range (IQR) of neighbor difference.
- Parameters
iqr_window – Number of values in the moving window for the rolling IQR of neighbor difference.
multiplier – Number of times the rolling IQR of neighbor difference the value’s difference to its neighbors must exceed for the value to be flagged.
-
flag_global_outlier
(medians: float = 9) → None[source]¶ Flag values greater or less than n times the global median (GLOBAL_OUTLIER).
- Parameters
medians – Number of times the global median a value must exceed, or be below, to be flagged.
-
flag_global_outlier_neighbor
(neighbors: int = 1) → None[source]¶ Flag values neighboring global outliers (GLOBAL_OUTLIER_NEIGHBOR).
- Parameters
neighbors – Number of neighbors to flag on either side of each outlier.
- Raises
ValueError – Global outliers must be flagged first.
-
flag_identical_run
(length: int = 3) → None[source]¶ Flag the last values in identical runs (IDENTICAL_RUN).
- Parameters
length – Run length to flag. If 3, the third (and subsequent) identical values are flagged.
- Raises
ValueError – Run length must be 2 or greater.
-
flag_local_outlier
(window: int = 48, shifts: Sequence[int] = range(- 240, 241, 24), long_window: int = 480, iqr_window: int = 240, multiplier: Tuple[float, float] = (3.5, 2.5)) → None[source]¶ Flag local outliers (LOCAL_OUTLIER_HIGH, LOCAL_OUTLIER_LOW).
Flags values which are above or below the median_prediction() by more than a multiplier times the rolling_iqr_of_rolling_median_offset().
- Parameters
window – Number of values in the moving window for the local rolling median.
shifts – Positions to shift the local rolling median offset by, for computing its median.
long_window – Number of values in the moving window for the regional (long) rolling median.
iqr_window – Number of values in the moving window for the rolling interquartile range (IQR).
multiplier – Number of times the rolling_iqr_of_rolling_median_offset() the value must be above (HIGH) or below (LOW) the median_prediction() to be flagged.
-
flag_ruggles
() → None[source]¶ Flag values following the method of Ruggles and others (2020).
Assumes values are hourly electricity demand.
-
flag_single_delta
(window: int = 48, shifts: Sequence[int] = range(- 240, 241, 24), long_window: int = 480, iqr_window: int = 240, multiplier: float = 5, rel_multiplier: float = 15) → None[source]¶ Flag values very different from the nearest unflagged value (SINGLE_DELTA).
Flags values whose difference to the nearest unflagged value, with respect to both the value and the relative median prediction, exceeds a multiplier times the rolling interquartile range (IQR) of that difference: multiplier times rolling_iqr_of_diff() and rel_multiplier times iqr_of_diff_of_relative_median_prediction(), respectively.
- Parameters
window – Number of values in the moving window for the rolling median (for the relative median prediction).
shifts – Positions to shift the local rolling median offset by, for computing its median (for the relative median prediction).
long_window – Number of values in the moving window for the long rolling median (for the relative median prediction).
iqr_window – Number of values in the moving window for the rolling IQR of neighbor difference.
multiplier – Number of times the rolling IQR of neighbor difference the value’s difference to its neighbor must exceed for the value to be flagged.
rel_multiplier – Number of times the rolling IQR of relative median prediction the value’s prediction difference to its neighbor must exceed for the value to be flagged.
-
fold_tensor
(x: Optional[numpy.ndarray] = None, periods: int = 24) → numpy.ndarray[source]¶ Fold into a 3-dimensional tensor representation.
Folds the series x (number of observations, number of series) into a 3-d tensor (number of series, number of groups, number of periods), splitting observations into groups of length periods. For example, each group may represent a day and each period the hour of the day.
- Parameters
x – Series array to fold. Uses x by default.
periods – Number of consecutive values in each series to fold into a group.
- Returns
Folded tensor of shape (number of series, number of groups, number of periods).
Examples
>>> x = np.column_stack([[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]])
>>> s = Timeseries(x)
>>> tensor = s.fold_tensor(periods=3)
>>> tensor[0]
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.all(x == s.unfold_tensor(tensor))
True
-
impute
(mask: Optional[numpy.ndarray] = None, periods: int = 24, blocks: int = 1, method: str = 'tubal', **kwargs: Any) → numpy.ndarray[source]¶ Impute null values.
Note
The imputation method requires that nulls be replaced by zeros, so the series cannot already contain zeros.
- Parameters
mask – Boolean mask of values to impute in addition to any null values in x.
periods – Number of consecutive values in each series to fold into a group. See fold_tensor().
blocks – Number of blocks into which to split the series for imputation. This has been found to reduce processing time for method=’tnn’.
method – Imputation method to use (‘tubal’: impute_latc_tubal(), ‘tnn’: impute_latc_tnn()).
kwargs – Optional arguments to method.
- Returns
Array of same shape as x with all null values (and those selected by mask) replaced with imputed values.
- Raises
ValueError – Zero values present. Replace with very small value.
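A hypothetical end-to-end sketch of the flagging-then-imputation workflow, assuming the class behaves as documented above; the random, strictly positive input matrix is purely illustrative (positive so the nulls-as-zeros convention noted above is respected).

import numpy as np
from pudl.analysis.timeseries_cleaning import Timeseries

# One year of hourly values for 3 series.
rng = np.random.default_rng(0)
ts = Timeseries(rng.uniform(50.0, 100.0, size=(8760, 3)))
ts.flag_ruggles()                # flag and null anomalous values
imputed = ts.impute(periods=24)  # fold days into 24-hour groups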
-
iqr_of_diff_of_relative_median_prediction
(shift: int = 1, **kwargs: Any) → numpy.ndarray[source]¶ Interquartile range of the running difference of the relative median prediction.
- Parameters
shift – Positions to shift for calculating the difference. Positive values select a preceding (left) neighbor.
kwargs – Arguments to
relative_median_prediction()
.
-
median_of_rolling_median_offset
(window: int = 48, shifts: Sequence[int] = range(- 240, 241, 24)) → numpy.ndarray[source]¶ Median of the offset from the rolling median.
Calculated by shifting the rolling median offset (
rolling_median_offset()
) by different numbers of values, then taking the median at each position. Estimates the typical local cycle in cyclical data.- Parameters
window – Number of values in the moving window for the rolling median.
shifts – Number of values to shift the rolling median offset by.
-
median_prediction
(window: int = 48, shifts: Sequence[int] = range(- 240, 241, 24), long_window: int = 480) → numpy.ndarray[source]¶ Values predicted from local and regional rolling medians.
Calculated as { local median } + { median of local median offset } * { local median } / { regional median }.
- Parameters
window – Number of values in the moving window for the local rolling median.
shifts – Positions to shift the local rolling median offset by, for computing its median.
long_window – Number of values in the moving window for the regional (long) rolling median.
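A plain pandas sketch of the formula above for a single series, using the documented default window sizes; whether the real implementation centers its rolling windows is an assumption here.

import numpy as np
import pandas as pd

x = pd.Series(10 + np.sin(np.arange(2000) * 2 * np.pi / 24))
local = x.rolling(48, center=True, min_periods=1).median()
regional = x.rolling(480, center=True, min_periods=1).median()
offset = x - local
# Median of the offset across daily shifts estimates the local cycle.
shifted = pd.concat([offset.shift(s) for s in range(-240, 241, 24)], axis=1)
prediction = local + shifted.median(axis=1) * local / regional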
-
plot_flags
(name: Any = 0) → None[source]¶ Plot cleaned series and anomalous values colored by flag.
- Parameters
name – Series to plot, as either an integer index or name in
columns
.
-
relative_median_prediction
(**kwargs: Any) → numpy.ndarray[source]¶ Values divided by their value predicted from medians.
- Parameters
kwargs – Arguments to
median_prediction()
.
-
rolling_iqr_of_diff
(shift: int = 1, window: int = 240) → numpy.ndarray[source]¶ Rolling interquartile range (IQR) of the difference between neighboring values.
- Parameters
shift – Positions to shift for calculating the difference.
window – Number of values in the moving window for the rolling IQR.
-
rolling_iqr_of_rolling_median_offset
(window: int = 48, iqr_window: int = 240) → numpy.ndarray[source]¶ Rolling interquartile range (IQR) of rolling median offset.
Estimates the spread of the local cycles in cyclical data.
- Parameters
window – Number of values in the moving window for the rolling median.
iqr_window – Number of values in the moving window for the rolling IQR.
-
rolling_median
(window: int = 48) → numpy.ndarray[source]¶ Rolling median of values.
- Parameters
window – Number of values in the moving window.
-
rolling_median_offset
(window: int = 48) → numpy.ndarray[source]¶ Values minus the rolling median.
Estimates the local cycle in cyclical data by removing long-term trends.
- Parameters
window – Number of values in the moving window.
-
simulate_nulls
(lengths: Optional[Sequence[int]] = None, padding: int = 1, intersect: bool = False, overlap: bool = False) → numpy.ndarray[source]¶ Find non-null values to null to match a run-length distribution.
- Parameters
lengths – Length of null runs to simulate for each series. By default, uses the run lengths of null values in each series.
padding – Minimum number of non-null values between simulated null runs and between simulated and existing null runs.
intersect – Whether simulated null runs can intersect each other.
overlap – Whether simulated null runs can overlap existing null runs. If True, padding is ignored.
- Returns
Boolean mask of current non-null values to set to null.
- Raises
ValueError – Could not find space for run of length {length}.
Examples
>>> x = np.column_stack([[1, 2, np.nan, 4, 5, 6, 7, np.nan, np.nan]])
>>> s = Timeseries(x)
>>> s.simulate_nulls().ravel()
array([ True, False, False, False,  True,  True, False, False, False])
>>> s.simulate_nulls(lengths=[4], padding=0).ravel()
array([False, False, False,  True,  True,  True,  True, False, False])
-
summarize_flags
() → pandas.core.frame.DataFrame[source]¶ Summarize flagged values by flag, count and median.
-
summarize_imputed
(imputed: numpy.ndarray, mask: numpy.ndarray) → pandas.core.frame.DataFrame[source]¶ Summarize the fit of imputed values to actual values.
Summarizes the agreement between actual and imputed values with the following statistics:
mpe: Mean percent error, (actual - imputed) / actual.
mape: Mean absolute percent error, abs(mpe).
- Parameters
imputed – Series of same shape as x with imputed values. See impute().
mask – Boolean mask of imputed values that were not null in x. See simulate_nulls().
- Returns
Table of imputed value statistics for each series.
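A minimal sketch of the two statistics as defined above, reading the percent error element-wise before averaging; the sample values are illustrative.

import numpy as np

actual = np.array([100.0, 200.0, 400.0])
imputed = np.array([110.0, 190.0, 400.0])
pe = (actual - imputed) / actual   # per-value percent error
mpe = pe.mean()                    # mean percent error
mape = np.abs(pe).mean()           # mean absolute percent error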
-
to_dataframe
(array: Optional[numpy.ndarray] = None, copy: bool = True) → pandas.core.frame.DataFrame[source]¶ Return multivariate timeseries as a pandas.DataFrame.
- Parameters
array – Two-dimensional array to use. If None, uses x.
copy – Whether to use a copy of array.
-
unflag
(flags: Optional[Iterable[str]] = None) → None[source]¶ Unflag values.
Unflags values by restoring their original values and removing their flag.
- Parameters
flags – Flag names. If None, all flags are removed.
-
unfold_tensor
(tensor: numpy.ndarray) → numpy.ndarray[source]¶ Unfold a 3-dimensional tensor representation.
Performs the reverse of
fold_tensor()
.
-
-
pudl.analysis.timeseries_cleaning.
array_diff
(x: numpy.ndarray, periods: int = 1, axis: int = 0, fill: Any = nan) → numpy.ndarray[source]¶ First discrete difference of array elements.
This is a fast numpy implementation of
pd.DataFrame.diff()
.- Parameters
periods – Periods to shift for calculating difference, accepts negative values.
axis – Array axis along which to calculate the difference.
fill – Value to use at the margins where a difference cannot be calculated.
- Returns
Array of same shape and type as x with discrete element differences.
Examples
>>> x = np.random.random((4, 2))
>>> np.all(array_diff(x, 1)[1:] == pd.DataFrame(x).diff(1).values[1:])
True
>>> np.all(array_diff(x, 2)[2:] == pd.DataFrame(x).diff(2).values[2:])
True
>>> np.all(array_diff(x, -1)[:-1] == pd.DataFrame(x).diff(-1).values[:-1])
True
-
pudl.analysis.timeseries_cleaning.
encode_run_length
(x: Union[Sequence, numpy.ndarray]) → Tuple[numpy.ndarray, numpy.ndarray][source]¶ Encode vector with run-length encoding.
- Parameters
x – Vector to encode.
- Returns
Values and their run lengths.
Examples
>>> x = np.array([0, 1, 1, 0, 1])
>>> encode_run_length(x)
(array([0, 1, 0, 1]), array([1, 2, 1, 1]))
>>> encode_run_length(x.astype('bool'))
(array([False,  True, False,  True]), array([1, 2, 1, 1]))
>>> encode_run_length(x.astype('<U1'))
(array(['0', '1', '0', '1'], dtype='<U1'), array([1, 2, 1, 1]))
>>> encode_run_length(np.where(x == 0, np.nan, x))
(array([nan,  1., nan,  1.]), array([1, 2, 1, 1]))
-
pudl.analysis.timeseries_cleaning.
impute_latc_tnn
(tensor: numpy.ndarray, lags: Sequence[int] = [1], alpha: Sequence[float] = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333], rho0: float = 1e-07, lambda0: float = 2e-07, theta: int = 20, epsilon: float = 1e-07, maxiter: int = 300) → numpy.ndarray[source]¶ Impute tensor values with LATC-TNN method by Chen and Sun (2020).
Uses low-rank autoregressive tensor completion (LATC) with truncated nuclear norm (TNN) minimization.
description: https://arxiv.org/abs/2006.10436
code: https://github.com/xinychen/tensor-learning/blob/master/mats
- Parameters
tensor – Observational series in the form (series, groups, periods). Null values are replaced with zeros, so any zeros will be treated as null.
lags –
alpha –
rho0 –
lambda0 –
theta –
epsilon – Convergence criterion. A smaller number will result in more iterations.
maxiter – Maximum number of iterations.
- Returns
Tensor with missing values in tensor replaced by imputed values.
-
pudl.analysis.timeseries_cleaning.
impute_latc_tubal
(tensor: numpy.ndarray, lags: Sequence[int] = [1], rho0: float = 1e-07, lambda0: float = 2e-07, epsilon: float = 1e-07, maxiter: int = 300) → numpy.ndarray[source]¶ Impute tensor values with LATC-Tubal method by Chen, Chen and Sun (2020).
Uses low-tubal-rank autoregressive tensor completion (LATC-Tubal). It is much faster than impute_latc_tnn() for very large datasets, with comparable accuracy.
description: https://arxiv.org/abs/2008.03194
code: https://github.com/xinychen/tensor-learning/blob/master/mats
- Parameters
tensor – Observational series in the form (series, groups, periods). Null values are replaced with zeros, so any zeros will be treated as null.
lags –
rho0 –
lambda0 –
epsilon – Convergence criterion. A smaller number will result in more iterations.
maxiter – Maximum number of iterations.
- Returns
Tensor with missing values in tensor replaced by imputed values.
-
pudl.analysis.timeseries_cleaning.
insert_run_length
(x: Union[Sequence, numpy.ndarray], values: Union[Sequence, numpy.ndarray], lengths: Sequence[int], mask: Optional[Sequence[bool]] = None, padding: int = 0, intersect: bool = False) → numpy.ndarray[source]¶ Insert run-length encoded values into a vector.
- Parameters
x – Vector to insert values into.
values – Values to insert.
lengths – Length of run to insert for each value in values.
mask – Boolean mask, of the same length as x, where values can be inserted. By default, values can be inserted anywhere in x.
padding – Minimum space between inserted runs and, if mask is provided, the edges of masked-out areas.
intersect – Whether to allow inserted runs to intersect each other.
- Raises
ValueError – Padding must be zero or greater.
ValueError – Run length must be greater than zero.
ValueError – Could not find space for run of length {length}.
- Returns
Copy of array x with values inserted.
Example
>>> x = [0, 0, 0, 0]
>>> mask = [True, False, True, True]
>>> insert_run_length(x, values=[1, 2], lengths=[1, 2], mask=mask)
array([1, 0, 2, 2])
If we use unique values for the background and each inserted run, the run length encoding of the result (ignoring the background) is the same as the inserted run, albeit in a different order.
>>> x = np.zeros(10, dtype=int)
>>> values = [1, 2, 3]
>>> lengths = [1, 2, 3]
>>> x = insert_run_length(x, values=values, lengths=lengths)
>>> rvalues, rlengths = encode_run_length(x[x != 0])
>>> order = np.argsort(rvalues)
>>> all(rvalues[order] == values) and all(rlengths[order] == lengths)
True
Null values can be inserted into a vector such that the new null runs match the run length encoding of the existing null runs.
>>> x = [1, 2, np.nan, np.nan, 5, 6, 7, 8, np.nan]
>>> is_nan = np.isnan(x)
>>> rvalues, rlengths = encode_run_length(is_nan)
>>> xi = insert_run_length(
...     x,
...     values=[np.nan] * rvalues.sum(),
...     lengths=rlengths[rvalues],
...     mask=~is_nan
... )
>>> np.isnan(xi).sum() == 2 * is_nan.sum()
True
The same as above, with non-zero padding, yields a unique solution:
>>> insert_run_length(
...     x,
...     values=[np.nan] * rvalues.sum(),
...     lengths=rlengths[rvalues],
...     mask=~is_nan,
...     padding=1
... )
array([nan,  2., nan, nan,  5., nan, nan,  8., nan])
-
pudl.analysis.timeseries_cleaning.
slice_axis
(x: numpy.ndarray, start: Optional[int] = None, end: Optional[int] = None, step: Optional[int] = None, axis: int = 0) → Tuple[source]¶ Return an index that slices an array along an axis.
- Parameters
x – Array to slice.
start – Start index of slice.
end – End index of slice.
step – Step size of slice.
axis – Axis along which to slice.
- Returns
Tuple of
slice
that slices array x along axis axis (x[…, start:stop:step]).
Examples
>>> x = np.random.random((3, 4, 5))
>>> np.all(x[1:] == x[slice_axis(x, start=1, axis=0)])
True
>>> np.all(x[:, 1:] == x[slice_axis(x, start=1, axis=1)])
True
>>> np.all(x[:, :, 1:] == x[slice_axis(x, start=1, axis=2)])
True
Modules providing programmatic analyses that make use of PUDL data.
The pudl.analysis
subpackage is a collection of modules which implement
various systematic analyses using the data compiled by PUDL. Over time this
should grow into a rich library of tools that show how the data can be put to
use. We may also generate post-analysis datapackages for distribution at some
point.
pudl.convert package¶
Convert the US Census DP1 ESRI GeoDatabase into an SQLite Database.
This is a thin wrapper around the GDAL ogr2ogr command line tool. We use it to convert the Census DP1 data, which is distributed as an ESRI GeoDB, into an SQLite DB. The module provides ogr2ogr with the Census DP1 data from the PUDL datastore, and directs the output into the user’s SQLite directory alongside our other SQLite databases (ferc1.sqlite and pudl.sqlite).
Note that the ogr2ogr command line utility must be available on the user’s
system for this to work. This tool is part of the pudl-dev
conda
environment, but if you are using PUDL outside of the conda environment, you
will need to install ogr2ogr separately. On Debian Linux based systems such
as Ubuntu it can be installed with sudo apt-get install gdal-bin
(which
is what we do in our CI setup and Docker images.)
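A hedged sketch of the underlying conversion call, using ogr2ogr’s documented SQLite output driver; the input and output file paths here are illustrative.

import subprocess

# Convert an ESRI GeoDatabase into an SQLite database with ogr2ogr.
subprocess.run(
    ["ogr2ogr", "-f", "SQLite", "censusdp1tract.sqlite", "census_dp1.gdb"],
    check=True,
)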
-
pudl.convert.censusdp1tract_to_sqlite.
censusdp1tract_to_sqlite
(pudl_settings=None, year=2010)[source]¶ Use GDAL’s ogr2ogr utility to convert the Census DP1 GeoDB to an SQLite DB.
The Census DP1 GeoDB is read from the datastore, where it is stored as a zipped archive. This archive is unzipped into a temporary directory so that ogr2ogr can operate on the ESRI GeoDB, and convert it to SQLite. The resulting SQLite DB file is put in the PUDL output directory alongside the ferc1 and pudl SQLite databases.
Module to convert json metadata into rst files.
All of the information about the transformed pudl tables, namely their field types and descriptions, resides in the datapackage metadata. This module makes that information available to users, without duplicating any data, by converting json metadata files into documentation-compatible rst files. The functions extract the field names, field data types, and field descriptions of each pudl table and output them in a manner that automatically updates the Read the Docs documentation.
-
pudl.convert.datapkg_to_rst.
RST_TEMPLATE
= '\n===============================================================================\nPUDL Data Dictionary\n===============================================================================\n\nThe following data tables have been cleaned and transformed by our ETL process.\n\n{% for resource in resources %}\n.. _{{ resource.name }}:\n\n-------------------------------------------------------------------------------\n{{ resource.name }}\n-------------------------------------------------------------------------------\n\n{{ resource.description | wordwrap(78)}}\n`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__\n\n.. list-table::\n :widths: auto\n :header-rows: 1\n\n * - **Field Name**\n - **Type**\n - **Description**{% for field in resource.schema.fields %}\n * - {{ field.name }}\n - {{ field.type }}{% if field.description %}\n - {{ field.description }}{% else %}\n - N/A{% endif %}{% endfor %}\n{% endfor %}\n'¶ A template to map data from a json dictionary into one rst file. Contains multiple tables separated by headers.
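A hypothetical sketch of rendering this template with jinja2 (which provides the wordwrap filter it uses); the resource metadata below is made up for illustration, not real PUDL metadata.

import jinja2
from pudl.convert.datapkg_to_rst import RST_TEMPLATE

resources = [{
    "name": "plants_eia860",
    "description": "Plant-level data reported in EIA Form 860.",
    "schema": {"fields": [
        {"name": "plant_id_eia", "type": "integer",
         "description": "EIA plant ID."},
    ]},
}]
rst = jinja2.Template(RST_TEMPLATE).render(resources=resources)
print(rst)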
-
pudl.convert.datapkg_to_rst.
datapkg2rst
(meta_json, meta_rst, ignore=None)[source]¶ Convert json metadata to a single rst file.
-
pudl.convert.datapkg_to_rst.
logger
= <Logger pudl.convert.datapkg_to_rst (WARNING)>¶ The following templates map json data into one long rst file separated by table titles and document links (RST_TEMPLATE)
It’s important for the templates that the json data do not contain excess white space either at the beginning or the end of each value.
Merge compatible PUDL datapackages and load the result into an SQLite DB.
This script merges a set of compatible PUDL datapackages into a single tabular datapackage, and then loads that package into the PUDL SQLite DB
The input datapackages must all have been produced in the same ETL run, and
share the same datapkg-bundle-uuid
value. Any data sources (e.g. ferc1,
eia923) that appear in more than one of the datapackages to be merged must
also share identical ETL parameters (years, tables, states, etc.), allowing
easy deduplication of resources.
Having the ability to load only a subset of the datapackages resulting from an ETL run into the SQLite database is helpful because larger datasets are much easier to work with via columnar datastores like Apache Parquet – loading all of EPA CEMS into SQLite can take more than 24 hours. PUDL also provides a separate epacems_to_parquet script that can be used to generate a Parquet dataset that is partitioned by state and year, which can be read directly into pandas or dask dataframes, for use in conjunction with the other PUDL data that is stored in the SQLite DB.
-
pudl.convert.datapkg_to_sqlite.
datapkg_to_sqlite
(sqlite_url, out_path, clobber=False, fkeys=False)[source]¶ Load a PUDL datapackage into a sqlite database.
- Parameters
sqlite_url (str) – An SQLite database connection URL.
out_path (path-like) – Path to the base directory of the datapackage to be loaded into SQLite. Must contain the datapackage.json file.
clobber (bool) – If True, replace an existing PUDL DB if it exists. If False (the default), fail if an existing PUDL DB is found.
fkeys (bool) – If true, tell SQLite to check foreign key constraints for the records that are being loaded. Left off by default.
- Returns
None
A script for converting the EPA CEMS dataset from gzip to Apache Parquet.
The original EPA CEMS data is available as ~12,000 gzipped CSV files, one for each month for each state, from 1995 to the present. On disk they take up about 7.3 GB of space, compressed. Uncompressed it is closer to 100 GB. That’s too much data to work with in memory.
Apache Parquet is a compressed, columnar datastore format, widely used in Big Data applications. It’s an open standard, and is very fast to read from disk. It works especially well with both Dask dataframes (a parallel / distributed computing extension of pandas) and Apache Spark (a cloud based Big Data processing pipeline system.)
Since pulling 100 GB of data into SQLite takes a long time, and working with that data en masse isn’t particularly pleasant on a laptop, this script can be used to convert the original EPA CEMS data to the more widely usable Apache Parquet format for use with Dask, either on a multi-core workstation or in an interactive cloud computing environment like Pangeo.
-
pudl.convert.epacems_to_parquet.
create_cems_schema
()[source]¶ Make an explicit Arrow schema for the EPA CEMS data.
Make changes in the types of the generated parquet files by editing this function.
Note that parquet’s internal representation doesn’t use unsigned numbers or 16-bit ints, so just keep things simple here and always use int32 and float32.
- Returns
An Arrow schema for the EPA CEMS data.
- Return type
pyarrow.schema
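A minimal sketch of constructing an explicit Arrow schema along these lines with pyarrow; the field names shown are illustrative, not the full CEMS schema.

import pyarrow as pa

schema = pa.schema([
    pa.field("plant_id_eia", pa.int32(), nullable=False),
    pa.field("operating_datetime_utc", pa.timestamp("s", tz="UTC")),
    pa.field("gross_load_mw", pa.float32()),
    pa.field("state", pa.string()),
])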
-
pudl.convert.epacems_to_parquet.
create_in_dtypes
()[source]¶ Create a dictionary of input data types.
This specifies the dtypes of the input columns, which is necessary for some cases where, e.g., a column is always NaN.
-
pudl.convert.epacems_to_parquet.
epacems_to_parquet
(datapkg_path, epacems_years, epacems_states, out_dir, compression='snappy', partition_cols=('year', 'state'), clobber=False)[source]¶ Take transformed EPA CEMS dataframes and output them as Parquet files.
We need to do a few additional manipulations of the dataframes after they have been transformed by PUDL to get them ready for output to the Apache Parquet format. Mostly this has to do with ensuring homogeneous data types across all of the dataframes, and downcasting to the most efficient data type possible for each of them. We also add a ‘year’ column so that we can partition the dataset on disk by year as well as state. (Year partitions follow the CEMS input data, based on local plant time. The operating_datetime_utc identifies time in UTC, so there’s a mismatch of a few hours on December 31 / January 1.)
- Parameters
datapkg_path (path-like) – Path to the datapackage.json file describing the datapackage containing the EPA CEMS data to be converted.
epacems_years (list) – list of years from which we are trying to read CEMS data
epacems_states (list) – list of states from which we are trying to read CEMS data
out_dir (path-like) – The directory in which to output the Parquet files
compression (string) –
partition_cols (tuple) –
clobber (bool) – If True and there is already a directory with out_dirs name, the existing parquet files will be deleted and new ones will be generated in their place.
- Raises
AssertionError – Raised if an output directory is not specified.
Todo
Return to
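A hedged sketch of writing a year/state-partitioned Parquet dataset with pyarrow, which is conceptually what this script produces; the dataframe contents and output path are illustrative.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "year": [2019, 2019],
    "state": ["CO", "TX"],
    "gross_load_mw": [1.0, 2.0],
})
# One subdirectory per year/state combination, snappy-compressed.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="epacems",
    partition_cols=["year", "state"],
    compression="snappy",
)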
A script for cloning the FERC Form 1 database into SQLite.
This script generates a SQLite database that is a clone/mirror of the original
FERC Form 1 database. We use this cloned database as the starting point for the
main PUDL ETL process. The underlying work in the script is being done in
pudl.extract.ferc1
.
Functions for merging compatible PUDL datapackages together.
-
pudl.convert.merge_datapkgs.
check_etl_params
(dps)[source]¶ Verify that datapackages to be merged have compatible ETL params.
Given that all of the input data packages come from the same ETL run, which means they will have used the same input data, the only way they should potentially differ is in the ETL parameters which were used to generate them. This function pulls the data source specific ETL params which we store in each datapackage descriptor and checks that within a given data source (e.g. eia923, ferc1) all of the ETL parameters are identical (e.g. the years, states, and tables loaded).
- Parameters
dps (iterable) – A list of datapackage.Package objects, representing the datapackages to be merged.
- Returns
None
- Raises
ValueError – If the PUDL ETL parameters associated with any given data source are not identical across all instances of that data source within the datapackages to be merged. Also if the ETL UUIDs for all of the datapackages to be merged are not identical.
-
pudl.convert.merge_datapkgs.
check_identical_vals
(dps, required_vals, optional_vals=())[source]¶ Verify that datapackages to be merged have required identical values.
This only works for elements with simple (hashable) datatypes, which can be added to a set.
- Parameters
dps (iterable) – a list of tabular datapackage objects, output by PUDL.
required_vals (iterable) – A list of strings indicating which top level metadata elements should be compared between the datapackages. All must be present in every datapackage.
optional_vals (iterable) – A list of strings indicating top level metadata elements to be compared between the datapackages. They do not need to appear in all datapackages, but if they do appear, they must be identical.
- Returns
None
- Raises
ValueError – if any of the required or optional metadata elements have different values in the different data packages.
KeyError – if a required metadata element is not found in any of the datapackages.
-
pudl.convert.merge_datapkgs.
merge_data
(dps, out_path)[source]¶ Copy the CSV files into the merged datapackage’s data directory.
Iterates through all of the resources in the input datapackages and copies the files they refer to into the data directory associated with the merged datapackage (a directory named “data” inside the out_path directory).
Function assumes that a fresh (empty) data directory has been created. If a file with the same name already exists, it is not overwritten, in order to prevent unnecessary copying of resources which appear in multiple input packages.
- Parameters
dps (iterable) – A list of datapackage.Package objects, representing the datapackages to be merged.
out_path (path like) – Base directory for the newly created datapackage. The final path element will also be used as the name of the merged data package.
- Returns
None
-
pudl.convert.merge_datapkgs.
merge_datapkgs
(dps, out_path, clobber=False)[source]¶ Merge several compatible datapackages into one larger datapackage.
- Parameters
dps (iterable) – A collection of tabular data package objects that were output by PUDL, to be merged into a single deduplicated datapackage for loading into a database or other storage medium.
out_path (path-like) – Base directory for the newly created datapackage. The final path element will also be used as the name of the merged data package.
clobber (bool) – If the location of the output datapackage already exists, should it be overwritten? If True, yes. If False, no.
- Returns
A report containing information about the validity of the merged datapackage.
- Return type
- Raises
FileNotFoundError – If any of the input datapackage paths do not exist.
FileExistsError – If the output directory exists and clobber is False.
-
pudl.convert.merge_datapkgs.
merge_meta
(dps, datapkg_name)[source]¶ Merge the JSON descriptors of datapackages into one big descriptor.
This function builds up a new tabular datapackage JSON descriptor as a python dictionary, containing the merged metadata from all of the input datapackages.
The process is complex for two reasons. First, there are several different datatypes in the descriptor that need to be merged, and the processes for each of them are different. Second, what constitutes a “merge” may vary depending on the semantic content of the metadata. E.g. the created timestamp is a simple string, but we need to choose one of the several values (the earliest one) for inclusion in the merged datapackage, while many other simple string fields are required to be identical across all of the input data packages (e.g. datapkg-bundle-uuid).
- Parameters
dps (iterable) – A collection of datapackage objects, whose metadata will be merged to create a single datapackage descriptor representing the union of all the data in the input datapackages.
datapkg_name (str) – The name associated with the newly merged datapackage. This should be the same as the name of the directory in which the datapackage is found.
- Returns
a Python dictionary representing a tabular datapackage JSON descriptor, encoded as a python dictionary, containing the merged metadata of the input datapackages.
- Return type
dict
Tools for converting datasets between various formats in bulk.
It’s often useful to be able to convert entire datasets in bulk from one format to another, both independent of and within the context of the ETL pipeline. This subpackage collects those tools together in one place.
Currently the tools use a mix of idioms, referring either to a particular
dataset and a particular format, or two formats. Some of them read from the
original raw data as organized by the pudl.workspace
package (e.g.
pudl.convert.ferc1_to_sqlite
or pudl.convert.epacems_to_parquet
),
and others convert the entire collection of data from an output datapackage
into another format (e.g. pudl.convert.datapkg_to_sqlite
).
pudl.extract package¶
Retrieve data from EIA Form 860 spreadsheets for analysis.
This module pulls data from EIA’s published Excel spreadsheets.
This code is for use analyzing EIA Form 860 data.
-
class
pudl.extract.eia860.
Extractor
(*args, **kwargs)[source]¶ Bases:
pudl.extract.excel.GenericExtractor
Extractor for the excel dataset EIA860.
Retrieve data from EIA Form 860M spreadsheets for analysis.
This module pulls data from EIA’s published Excel spreadsheets.
This code is for use analyzing EIA Form 860M data. EIA 860M is only used in conjunction with EIA 860. This module both extracts EIA 860M and appends the extracted EIA 860M dataframes to the extracted EIA 860 dataframes. Example setup with pre-generated eia860_raw_dfs and datastore as ds:
eia860m_raw_dfs = pudl.extract.eia860m.Extractor(ds).extract(
    pc.working_partitions['eia860m']['year_month'])
eia860_raw_dfs = pudl.extract.eia860m.append_eia860m(
    eia860_raw_dfs=eia860_raw_dfs, eia860m_raw_dfs=eia860m_raw_dfs)
-
class
pudl.extract.eia860m.
Extractor
(*args, **kwargs)[source]¶ Bases:
pudl.extract.excel.GenericExtractor
Extractor for the excel dataset EIA860M.
-
pudl.extract.eia860m.
append_eia860m
(eia860_raw_dfs, eia860m_raw_dfs)[source]¶ Append the extracted EIA 860M dataframes to their EIA 860 counterparts.
- Parameters
eia860_raw_dfs (dictionary) – dictionary of pandas.DataFrame’s from EIA 860 raw tables. Result of pudl.extract.eia860.Extractor().extract()
eia860m_raw_dfs (dictionary) – dictionary of pandas.DataFrame’s from EIA 860M raw tables. Result of pudl.extract.eia860m.Extractor().extract()
- Returns
augmented eia860_raw_dfs dictionary of pandas.DataFrame’s. Each raw page stored in eia860m_raw_dfs is appended to its eia860_raw_dfs counterpart.
- Return type
dictionary
Retrieve data from EIA Form 861 spreadsheets for analysis.
This module pulls data from EIA’s published Excel spreadsheets.
This code is for use analyzing EIA Form 861 data.
-
class
pudl.extract.eia861.
Extractor
(*args, **kwargs)[source]¶ Bases:
pudl.extract.excel.GenericExtractor
Extractor for the excel dataset EIA861.
Retrieves data from EIA Form 923 spreadsheets for analysis.
This module pulls data from EIA’s published Excel spreadsheets.
This code is for use analyzing EIA Form 923 data. Currently only years 2009-2016 work, as they share nearly identical file formatting.
-
class
pudl.extract.eia923.
Extractor
(*args, **kwargs)[source]¶ Bases:
pudl.extract.excel.GenericExtractor
Extractor for EIA form 923.
Retrieve data from EPA CEMS hourly zipped CSVs.
This module pulls data from EPA’s published CSV files.
-
pudl.extract.epacems.
CSV_DTYPES
= {'CO2_MASS': <class 'float'>, 'CO2_MASS (tons)': <class 'float'>, 'CO2_MASS_MEASURE_FLG': StringDtype, 'FAC_ID': Int64Dtype(), 'GLOAD': <class 'float'>, 'GLOAD (MW)': <class 'float'>, 'HEAT_INPUT': <class 'float'>, 'HEAT_INPUT (mmBtu)': <class 'float'>, 'NOX_MASS': <class 'float'>, 'NOX_MASS (lbs)': <class 'float'>, 'NOX_MASS_MEASURE_FLG': StringDtype, 'NOX_RATE': <class 'float'>, 'NOX_RATE (lbs/mmBtu)': <class 'float'>, 'NOX_RATE_MEASURE_FLG': StringDtype, 'OP_DATE': StringDtype, 'OP_HOUR': Int64Dtype(), 'OP_TIME': <class 'float'>, 'ORISPL_CODE': Int64Dtype(), 'SLOAD': <class 'float'>, 'SLOAD (1000 lbs)': <class 'float'>, 'SLOAD (1000lb/hr)': <class 'float'>, 'SO2_MASS': <class 'float'>, 'SO2_MASS (lbs)': <class 'float'>, 'SO2_MASS_MEASURE_FLG': StringDtype, 'STATE': StringDtype, 'UNITID': StringDtype, 'UNIT_ID': Int64Dtype()}¶ A dictionary containing column names (keys) and data types (values) for EPA CEMS.
- Type
dict
-
class
pudl.extract.epacems.
EpaCemsDatastore
(datastore: pudl.workspace.datastore.Datastore)[source]¶ Bases:
object
Helper class to extract EpaCems resources from datastore.
EpaCems resources are identified by a year and a state. Each of these zip files contains monthly zip files that in turn contain CSV files. This class implements a get_data_frame method that concatenates the monthly tables for a given (year, state) partition across all months.
-
get_data_frame
(partition: pudl.extract.epacems.EpaCemsPartition) → pandas.core.frame.DataFrame[source]¶ Constructs a dataframe holding data for a given (year, state) partition.
-
-
class
pudl.extract.epacems.
EpaCemsPartition
(year: str, state: str)[source]¶ Bases:
tuple
Represents an EpaCems partition identifying a unique resource file.
-
get_monthly_file
(month: int) → pathlib.Path[source]¶ Returns the filename (without suffix) that contains the monthly data.
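A minimal usage sketch, assuming ds is an already-initialized pudl.workspace.datastore.Datastore pointing at your local PUDL workspace (the partition values are illustrative):

    from pudl.extract.epacems import EpaCemsDatastore, EpaCemsPartition

    cems = EpaCemsDatastore(ds)
    # Concatenate all twelve monthly CSVs for Colorado in 2019 into one dataframe.
    co_2019 = cems.get_data_frame(EpaCemsPartition(year="2019", state="co"))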
-
-
pudl.extract.epacems.
IGNORE_COLS
= {'CO2_RATE', 'CO2_RATE (tons/mmBtu)', 'CO2_RATE_MEASURE_FLG', 'FACILITY_NAME', 'SO2_RATE', 'SO2_RATE (lbs/mmBtu)', 'SO2_RATE_MEASURE_FLG'}¶ The set of EPA CEMS columns to ignore when reading data.
- Type
set
-
pudl.extract.epacems.
RENAME_DICT
= {'CO2_MASS': 'co2_mass_tons', 'CO2_MASS (tons)': 'co2_mass_tons', 'CO2_MASS_MEASURE_FLG': 'co2_mass_measurement_code', 'FAC_ID': 'facility_id', 'GLOAD': 'gross_load_mw', 'GLOAD (MW)': 'gross_load_mw', 'HEAT_INPUT': 'heat_content_mmbtu', 'HEAT_INPUT (mmBtu)': 'heat_content_mmbtu', 'NOX_MASS': 'nox_mass_lbs', 'NOX_MASS (lbs)': 'nox_mass_lbs', 'NOX_MASS_MEASURE_FLG': 'nox_mass_measurement_code', 'NOX_RATE': 'nox_rate_lbs_mmbtu', 'NOX_RATE (lbs/mmBtu)': 'nox_rate_lbs_mmbtu', 'NOX_RATE_MEASURE_FLG': 'nox_rate_measurement_code', 'OP_DATE': 'op_date', 'OP_HOUR': 'op_hour', 'OP_TIME': 'operating_time_hours', 'ORISPL_CODE': 'plant_id_eia', 'SLOAD': 'steam_load_1000_lbs', 'SLOAD (1000 lbs)': 'steam_load_1000_lbs', 'SLOAD (1000lb/hr)': 'steam_load_1000_lbs', 'SO2_MASS': 'so2_mass_lbs', 'SO2_MASS (lbs)': 'so2_mass_lbs', 'SO2_MASS_MEASURE_FLG': 'so2_mass_measurement_code', 'STATE': 'state', 'UNITID': 'unitid', 'UNIT_ID': 'unit_id_epa'}¶ A dictionary containing EPA CEMS column names (keys) and replacement names to use when reading those columns into PUDL (values).
- Type
dict
-
pudl.extract.epacems.
extract
(epacems_years, states, ds: pudl.workspace.datastore.Datastore)[source]¶ Coordinate the extraction of EPA CEMS hourly DataFrames.
- Parameters
- Yields
dict – a dictionary with a single EPA CEMS tabular data resource name as the key, having the form “hourly_emissions_epacems_YEAR_STATE” where YEAR is a 4 digit number and STATE is a lower case 2-letter code for a US state. The value is a
pandas.DataFrame
containing all the raw EPA CEMS hourly emissions data for the indicated state and year.
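A hedged sketch of consuming the extract() generator (ds as above; the year and state are illustrative):

    import pudl.extract.epacems

    for raw in pudl.extract.epacems.extract(epacems_years=[2019], states=["co"], ds=ds):
        for name, df in raw.items():
            print(name, df.shape)  # e.g. hourly_emissions_epacems_2019_co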
Retrieve data from EPA’s Integrated Planning Model (IPM) v6.
Unlike most of the PUDL data sources, IPM is not an annual timeseries. This file assumes that only v6 will be used as an input, so there are a limited number of files.
This module was written by @gschivley
-
class
pudl.extract.epaipm.
EpaIpmDatastore
(datastore: pudl.workspace.datastore.Datastore)[source]¶ Bases:
object
Helper for extracting EpaIpm dataframes from Datastore.
-
SETTINGS
= (TableSettings(table_name='transmission_single_epaipm', file='table_3-21_annual_transmission_capabilities_of_u.s._model_regions_in_epa_platform_v6_-_2021.xlsx', excel_settings={'skiprows': 3, 'usecols': 'B:F', 'index_col': [0, 1]}), TableSettings(table_name='transmission_joint_epaipm', file='table_3-5_transmission_joint_ipm.csv', excel_settings={}), TableSettings(table_name='load_curves_epaipm', file='table_2-2_load_duration_curves_used_in_epa_platform_v6.xlsx', excel_settings={'skiprows': 3, 'usecols': 'B:AB'}), TableSettings(table_name='plant_region_map_epaipm_active', file='needs_v6_november_2018_reference_case_0.xlsx', excel_settings={'sheet_name': 'NEEDS v6_Active', 'usecols': 'C,I'}), TableSettings(table_name='plant_region_map_epaipm_retired', file='needs_v6_november_2018_reference_case_0.xlsx', excel_settings={'sheet_name': 'NEEDS v6_Retired_Through2021', 'usecols': 'C,I'}))¶
-
get_dataframe
(table_name: str) → pandas.core.frame.DataFrame[source]¶ Retrieve the specified file from the epaipm archive.
- Parameters
table_name – table name, from self.table_filename
pandas_args – pandas arguments for parsing the file
- Returns
Pandas dataframe of EPA IPM data.
-
get_table_settings
(table_name: str) → pudl.extract.epaipm.TableSettings[source]¶ Returns TableSettings for a given table_name.
-
-
class
pudl.extract.epaipm.
TableSettings
(table_name: str, file: str, excel_settings: Dict[str, Any] = {})[source]¶ Bases:
tuple
Contains information for how to access and load EpaIpm dataframes.
-
pudl.extract.epaipm.
extract
(epaipm_tables: List[str], ds: pudl.workspace.datastore.Datastore) → Dict[str, pandas.core.frame.DataFrame][source]¶ Extracts data from IPM files.
- Parameters
epaipm_tables (iterable) – A tuple or list of table names to extract
ds (
EpaIpmDatastore
) – Initialized datastore
- Returns
dictionary of DataFrames with extracted (but not yet transformed) data from each file.
- Return type
dict
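A brief usage sketch of the IPM extraction (ds as in the earlier examples; the table names are taken from the SETTINGS above):

    import pudl.extract.epaipm

    ipm_dfs = pudl.extract.epaipm.extract(
        epaipm_tables=["transmission_single_epaipm", "load_curves_epaipm"],
        ds=ds,
    )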
Load excel metadata CSV files from a python data package.
-
class
pudl.extract.excel.
GenericExtractor
(ds)[source]¶ Bases:
object
Contains logic for extracting pandas.DataFrames from excel spreadsheets.
This class implements generic, dataset-agnostic logic to load data from excel spreadsheets simply by using the excel metadata for a given dataset.
It is expected that individual datasets will subclass this code and add custom business logic by overriding the necessary methods.
When implementing custom business logic, the following should be modified (a minimal subclass sketch follows this list):
1. DATASET class attribute controls which excel metadata should be loaded.
2. BLACKLISTED_PAGES class attribute specifies which pages should not be loaded from the underlying excel files even if the metadata is available. This can be used for experimental/new code that should not be run yet.
3. dtypes() should return a dict of {column_name: pandas_datatype} if you need to specify which datatypes should be used upon loading.
4. If data cleanup is necessary, you can apply custom logic by overriding one of the following functions (they all return the modified dataframe):
process_raw() is applied right after loading the excel DataFrame from the disk.
process_renamed() is applied after input columns were renamed to standardized pudl columns.
process_final_page() is applied when data from all available years is merged into a single DataFrame for a given page.
5. get_datapackage_resources() should be overridden in the dataset-specific extractor if the partition is anything other than a year.
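A minimal, hypothetical subclass illustrating these override points; the dataset name, page name, and cleanup logic are placeholders, and the method signature is inferred from this documentation rather than guaranteed:

    from pudl.extract import excel

    class MyExtractor(excel.GenericExtractor):
        """Hypothetical extractor for a dataset whose excel metadata is named 'mydata'."""

        METADATA = excel.Metadata("mydata")        # which excel metadata to load
        BLACKLISTED_PAGES = ["experimental_page"]  # skip even if metadata exists

        def process_raw(self, df, page, **partition):
            # Cleanup applied right after the raw excel page is loaded from disk.
            return df.dropna(how="all")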
-
BLACKLISTED_PAGES
= []¶ List of supported pages that should not be extracted.
-
METADATA
= None¶ Instance of metadata object to use with this extractor.
-
excel_filename
(page, **partition)[source]¶ Produce the xlsx document file name as it will appear in the archive.
- Parameters
page – pudl name for the dataset contents, eg “boiler_generator_assn” or “coal_stocks”
partition – partition to load. (ex: 2009 for year partition or “2020-08” for year_month partition)
- Returns
string name of the xlsx file
-
extract
(**partitions)[source]¶ Extracts dataframes.
Returns dict where keys are page names and values are DataFrames containing data across given years.
-
load_excel_file
(page, **partition)[source]¶ Produce the ExcelFile object for the given (partition, page).
- Parameters
page (str) – pudl name for the dataset contents, eg “boiler_generator_assn” or “coal_stocks”
partition – partition to load. (ex: 2009 for year partition or “2020-08” for year_month partition)
- Returns
pd.ExcelFile instance with the parsed excel spreadsheet
-
class
pudl.extract.excel.
Metadata
(dataset_name)[source]¶ Bases:
object
Load Excel metadata from Python package data.
Excel sheet files may contain many different tables. When we load those into dataframes, metadata tells us how to do this. Metadata generally informs us about the position of a given page in the file (which sheet and which row) and it informs us how to translate excel column names into standardized column names.
When a metadata object is instantiated, it is given a ${dataset} name and it will attempt to load csv files from the pudl.package_data.meta.xlsx_maps.${dataset} package.
It expects the following kinds of files:
skiprows.csv tells us how many initial rows should be skipped when loading data for given (partition, page).
skipfooter.csv tells us how many bottom rows should be skipped when loading data for a given (partition, page).
tab_map.csv tells us what is the excel sheet name that should be read when loading data for given (partition, page)
column_map/${page}.csv informs us how to translate input column names to standardized pudl names for a given (partition, input_col_name). The relevant page is encoded in the filename.
-
get_all_columns
(page)[source]¶ Returns list of all pudl (standardized) columns for a given page (across all partitions).
-
get_column_map
(page, **partition)[source]¶ Returns the dictionary mapping input columns to pudl columns for given partition and page.
-
get_sheet_name
(page, **partition)[source]¶ Returns name of the excel sheet that contains the data for given partition and page.
-
get_skipfooter
(page, **partition)[source]¶ Returns number of bottom rows to skip when loading given partition and page.
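A brief usage sketch (the page name is borrowed from the examples above, and the year partition keyword is an assumption):

    from pudl.extract.excel import Metadata

    meta = Metadata("eia860")  # loads CSVs from pudl.package_data.meta.xlsx_maps.eia860
    cols = meta.get_all_columns("boiler_generator_assn")
    sheet = meta.get_sheet_name("boiler_generator_assn", year=2019)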
Tools for extracting data from the FERC Form 1 FoxPro database for use in PUDL.
FERC distributes the annual responses to Form 1 as binary FoxPro database files. This format is no longer widely supported, so our first challenge in accessing the Form 1 data is to convert it into a modern format. In addition, FERC distributes one database for each year, and these databases are not explicitly linked together. Over time the structure has changed as new tables and fields have been added. In order to be able to use the data to do analyses across many years, we need to bring all of it into a unified structure. However, it appears that these changes have been purely additive – the most recent versions of the DB contain all the tables and fields that existed in earlier versions.
PUDL uses the most recently released year of data as a template, and infers the structure of the FERC Form 1 database based on the strings embedded within the binary files, pulling out the names of tables and their constituent columns. The structure of the database is also informed by information we found on the FERC website, including a mapping between the table names, DBF file names, and the pages of the Form 1 (add link to file, which should be distributed with the docs) that the data was gathered from, as well as a diagram of the structure of the database as it existed in 2015 (add link/embed image).
Using this inferred structure, PUDL creates an SQLite database mirroring the FERC database using sqlalchemy. Then we use a python package called dbfread to extract the data from the DBF tables, and insert it virtually unchanged into the SQLite database. However, we do compile a master table of all the respondent IDs and respondent names, which all the other tables refer to. Unlike the other tables, this table has no report_year and so it represents a merge of all the years of data. In the event that the name associated with a given respondent ID has changed over time, we retain the most recently reported name.
This SQLite-based compilation of the original FERC Form 1 databases can accommodate all 116 tables from all the published years of data (beginning in 1994). Including all the data through 2018, the database takes up more than 7GB of disk space. However, almost 90% of that “data” is embedded binary files in two tables. If those tables are excluded, the database is less than 800MB in size.
The process of cloning the FERC Form 1 database(s) is coordinated by a script called ferc1_to_sqlite, implemented in pudl.convert.ferc1_to_sqlite, which is controlled by a YAML file. See the example file distributed with the package.
Once the cloned SQLite database has been created, we use it as an input into the PUDL ETL pipeline, and we extract a small subset of the available tables for further processing and integration with other data sources like the EIA 860 and EIA 923.
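A hedged sketch of performing the cloning step from Python rather than via the script; pudl.workspace.setup.get_defaults() is assumed here as the usual way of obtaining pudl_settings, and ds is an initialized Datastore as in the earlier examples:

    import pudl
    from pudl.extract.ferc1 import dbf2sqlite

    pudl_settings = pudl.workspace.setup.get_defaults()
    dbf2sqlite(
        tables=["f1_respondent_id", "f1_fuel", "f1_steam"],
        years=range(1994, 2020),
        refyear=2019,
        pudl_settings=pudl_settings,
        datastore=ds,
    )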
-
class
pudl.extract.ferc1.
FERC1FieldParser
(table, memofile=None)[source]¶ Bases:
dbfread.field_parser.FieldParser
A custom DBF parser to deal with bad FERC Form 1 data types.
-
parseN
(field, data)[source]¶ Augments the Numeric DBF parser to account for bad FERC data.
There are a small number of bad entries in the backlog of FERC Form 1 data. They take the form of leading/trailing zeroes or null characters in supposedly numeric fields, and occasionally a naked ‘.’
Accordingly, this custom parser strips leading and trailing zeros and null characters, and replaces a bare ‘.’ character with zero, allowing all these fields to be cast to numeric values.
- Parameters
field – The DBF field definition being parsed.
data (bytes) – The raw field data to parse.
-
-
class
pudl.extract.ferc1.
Ferc1Datastore
(datastore: pudl.workspace.datastore.Datastore)[source]¶ Bases:
object
Simple datastore wrapper for accessing ferc1 resources.
-
PACKAGE_PATH
= 'pudl.package_data.meta.ferc1_row_maps'¶
-
get_dir
(year: int) → pathlib.Path[source]¶ Returns the path where individual ferc1 files are stored inside the yearly archive.
-
-
pudl.extract.ferc1.
PUDL_RIDS
= {514: 'AEP Texas', 519: 'Upper Michigan Energy Resources Company', 522: 'Luning Energy Holdings LLC, Invenergy Investments', 529: 'Tri-State Generation and Transmission Association', 531: 'Basin Electric Power Cooperative'}¶ Missing FERC 1 Respondent IDs for which we have identified the respondent.
-
pudl.extract.ferc1.
accumulated_depreciation
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of the fields of accumulated_depreciation_ferc1.
- Parameters
- Returns
A DataFrame containing all accumulated_depreciation_ferc1 records.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
add_sqlite_table
(table_name, sqlite_meta, dbc_map, ds, refyear=2019, testing=False, bad_cols=())[source]¶ Adds a new Table to the FERC Form 1 database schema.
Creates a new sa.Table object named table_name and adds it to the database schema contained in sqlite_meta. Uses the information in the dictionary dbc_map to translate between the DBF filenames in the datastore (e.g. F1_31.DBF) and the full name of the table in the FoxPro database (e.g. f1_fuel), and also between the truncated column names extracted from that DBF file and the full column names extracted from the DBC file. Reads the column datatypes out of each DBF file and uses them to define the columns in the new Table object.
- Parameters
table_name (str) – The name of the new table to be added to the database schema.
sqlite_meta (sqlalchemy.schema.MetaData) – The database schema to which the newly defined sqlalchemy.Table will be added.
dbc_map (dict) – A dictionary of dictionaries, of the kind returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.
ds (Ferc1Datastore) – Initialized datastore
testing (bool) – Assume this is a test run; use sandboxes.
bad_cols (iterable of 2-tuples) – A list or other iterable containing pairs of strings of the form (table_name, column_name), indicating columns (and their parent tables) which should not be cloned into the SQLite database for some reason.
- Returns
None
-
pudl.extract.ferc1.
check_ferc1_tables
(refyear)[source]¶ Test each FERC 1 data year for compatibility with reference year schema.
-
pudl.extract.ferc1.
dbf2sqlite
(tables, years, refyear, pudl_settings, bad_cols=(), clobber=False, datastore=None)[source]¶ Clone the FERC Form 1 Database to SQLite.
- Parameters
tables (iterable) – What tables should be cloned?
years (iterable) – Which years of data should be cloned?
refyear (int) – Which database year to use as a template.
pudl_settings (dict) – Dictionary containing paths and database URLs used by PUDL.
bad_cols (iterable of tuples) – A list of (table, column) pairs indicating columns that should be skipped during the cloning process. Both table and column are strings in this case, the names of their respective entities within the database metadata.
datastore (Datastore) – instance of a datastore to access the resources.
- Returns
None
-
pudl.extract.ferc1.
define_sqlite_db
(sqlite_meta, dbc_map, ds, tables={'f1_106_2009': 'F1_106_2009', 'f1_106a_2009': 'F1_106A_2009', 'f1_106b_2009': 'F1_106B_2009', 'f1_208_elc_dep': 'F1_208_ELC_DEP', 'f1_231_trn_stdycst': 'F1_231_TRN_STDYCST', 'f1_324_elc_expns': 'F1_324_ELC_EXPNS', 'f1_325_elc_cust': 'F1_325_ELC_CUST', 'f1_331_transiso': 'F1_331_TRANSISO', 'f1_338_dep_depl': 'F1_338_DEP_DEPL', 'f1_397_isorto_stl': 'F1_397_ISORTO_STL', 'f1_398_ancl_ps': 'F1_398_ANCL_PS', 'f1_399_mth_peak': 'F1_399_MTH_PEAK', 'f1_400_sys_peak': 'F1_400_SYS_PEAK', 'f1_400a_iso_peak': 'F1_400A_ISO_PEAK', 'f1_429_trans_aff': 'F1_429_TRANS_AFF', 'f1_acb_epda': 'F1_2', 'f1_accumdepr_prvsn': 'F1_3', 'f1_accumdfrrdtaxcr': 'F1_4', 'f1_adit_190_detail': 'F1_5', 'f1_adit_190_notes': 'F1_6', 'f1_adit_amrt_prop': 'F1_7', 'f1_adit_other': 'F1_8', 'f1_adit_other_prop': 'F1_9', 'f1_allowances': 'F1_10', 'f1_allowances_nox': 'F1_ALLOWANCES_NOX', 'f1_audit_log': 'F1_78', 'f1_bal_sheet_cr': 'F1_11', 'f1_capital_stock': 'F1_12', 'f1_cash_flow': 'F1_13', 'f1_cmmn_utlty_p_e': 'F1_14', 'f1_cmpinc_hedge': 'F1_CMPINC_HEDGE', 'f1_cmpinc_hedge_a': 'F1_CMPINC_HEDGE_A', 'f1_co_directors': 'F1_18', 'f1_codes_val': 'F1_76', 'f1_col_lit_tbl': 'F1_79', 'f1_comp_balance_db': 'F1_15', 'f1_construction': 'F1_16', 'f1_control_respdnt': 'F1_17', 'f1_cptl_stk_expns': 'F1_19', 'f1_csscslc_pcsircs': 'F1_20', 'f1_dacs_epda': 'F1_21', 'f1_dscnt_cptl_stk': 'F1_22', 'f1_edcfu_epda': 'F1_23', 'f1_elc_op_mnt_expn': 'F1_27', 'f1_elc_oper_rev_nb': 'F1_26', 'f1_elctrc_erg_acct': 'F1_24', 'f1_elctrc_oper_rev': 'F1_25', 'f1_electric': 'F1_28', 'f1_email': 'F1_EMAIL', 'f1_envrnmntl_expns': 'F1_29', 'f1_envrnmntl_fclty': 'F1_30', 'f1_footnote_data': 'F1_85', 'f1_footnote_tbl': 'F1_87', 'f1_fuel': 'F1_31', 'f1_general_info': 'F1_32', 'f1_gnrt_plant': 'F1_33', 'f1_hydro': 'F1_86', 'f1_ident_attsttn': 'F1_88', 'f1_important_chg': 'F1_34', 'f1_incm_stmnt_2': 'F1_35', 'f1_income_stmnt': 'F1_36', 'f1_leased': 'F1_90', 'f1_load_file_names': 'F1_80', 'f1_long_term_debt': 'F1_93', 'f1_misc_dfrrd_dr': 'F1_38', 'f1_miscgen_expnelc': 'F1_37', 'f1_mthly_peak_otpt': 'F1_39', 'f1_mtrl_spply': 'F1_40', 'f1_nbr_elc_deptemp': 'F1_41', 'f1_nonutility_prop': 'F1_42', 'f1_note_fin_stmnt': 'F1_43', 'f1_nuclear_fuel': 'F1_44', 'f1_officers_co': 'F1_45', 'f1_othr_dfrrd_cr': 'F1_46', 'f1_othr_pd_in_cptl': 'F1_47', 'f1_othr_reg_assets': 'F1_48', 'f1_othr_reg_liab': 'F1_49', 'f1_overhead': 'F1_50', 'f1_pccidica': 'F1_51', 'f1_plant': 'F1_92', 'f1_plant_in_srvce': 'F1_52', 'f1_privilege': 'F1_81', 'f1_pumped_storage': 'F1_53', 'f1_purchased_pwr': 'F1_54', 'f1_r_d_demo_actvty': 'F1_59', 'f1_reconrpt_netinc': 'F1_55', 'f1_reg_comm_expn': 'F1_56', 'f1_respdnt_control': 'F1_57', 'f1_respondent_id': 'F1_1', 'f1_retained_erng': 'F1_58', 'f1_rg_trn_srv_rev': 'F1_RG_TRN_SRV_REV', 'f1_row_lit_tbl': 'F1_84', 'f1_s0_checks': 'F1_S0_CHECKS', 'f1_s0_filing_log': 'F1_S0_FILING_LOG', 'f1_sale_for_resale': 'F1_61', 'f1_sales_by_sched': 'F1_60', 'f1_sbsdry_detail': 'F1_91', 'f1_sbsdry_totals': 'F1_62', 'f1_sched_lit_tbl': 'F1_77', 'f1_schedules_list': 'F1_63', 'f1_security': 'F1_SECURITY', 'f1_security_holder': 'F1_64', 'f1_slry_wg_dstrbtn': 'F1_65', 'f1_steam': 'F1_89', 'f1_substations': 'F1_66', 'f1_sys_error_log': 'F1_82', 'f1_taxacc_ppchrgyr': 'F1_67', 'f1_unique_num_val': 'F1_83', 'f1_unrcvrd_cost': 'F1_68', 'f1_utltyplnt_smmry': 'F1_69', 'f1_work': 'F1_70', 'f1_xmssn_adds': 'F1_71', 'f1_xmssn_elc_bothr': 'F1_72', 'f1_xmssn_elc_fothr': 'F1_73', 'f1_xmssn_line': 'F1_74', 'f1_xtraordnry_loss': 'F1_75'}, refyear=2019, 
bad_cols=())[source]¶ Defines a FERC Form 1 DB structure in a given SQLAlchemy MetaData object.
Given a template from an existing year of FERC data, and a list of target tables to be cloned, convert that information into table and column names, and data types, stored within a SQLAlchemy MetaData object. Use that MetaData object (which is bound to the SQLite database) to create all the tables to be populated later.
- Parameters
sqlite_meta (sa.MetaData) – A SQLAlchemy MetaData object which is bound to the FERC Form 1 SQLite database.
dbc_map (dict of dicts) – A dictionary of dictionaries, of the kind returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.
ds (Ferc1Datastore) – Initialized Ferc1Datastore
tables (iterable of strings) – List or other iterable of FERC database table names that should be included in the database being defined, e.g. ‘f1_fuel’ and ‘f1_steam’.
refyear (integer) – The year of the FERC Form 1 DB to use as a template for creating the overall multi-year database schema.
bad_cols (iterable of 2-tuples) – A list or other iterable containing pairs of strings of the form (table_name, column_name), indicating columns (and their parent tables) which should not be cloned into the SQLite database for some reason.
- Returns
the effects of the function are stored inside sqlite_meta
- Return type
None
-
pudl.extract.ferc1.
drop_tables
(engine)[source]¶ Drop all FERC Form 1 tables from the SQLite database.
Creates an sa.schema.MetaData object reflecting the structure of the database that the passed-in engine refers to, and uses that schema to drop all existing tables.
Todo
Treat DB connection as a context manager (with/as).
- Parameters
engine (
sqlalchemy.engine.Engine
) – A DB Engine pointing at an existing SQLite database to be deleted.
- Returns
None
-
pudl.extract.ferc1.
extract
(ferc1_tables=('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1'), ferc1_years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), pudl_settings=None)[source]¶ Coordinates the extraction of all FERC Form 1 tables into PUDL.
- Parameters
ferc1_tables (iterable of strings) – List of the FERC 1 database tables to be loaded into PUDL. These are the names of the tables in the PUDL database, not the FERC Form 1 database.
ferc1_years (iterable of ints) – List of years for which FERC Form 1 data should be loaded into PUDL. Note that not all years for which FERC data is available may have been integrated into PUDL yet.
- Returns
A dictionary of pandas DataFrames, with the names of PUDL database tables as the keys. These are the raw unprocessed dataframes, reflecting the data as it is in the FERC Form 1 DB, for passing off to the data tidying and cleaning functions found in the pudl.transform.ferc1 module.
- Return type
dict
- Raises
ValueError – If the year is not in the list of years for which FERC data is available
ValueError – If the year is not in the list of working FERC years
ValueError – If the FERC table requested is not integrated into PUDL
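For example, a sketch of pulling just the fuel and steam plant tables (pudl_settings as in the cloning example above):

    import pudl.extract.ferc1

    raw_dfs = pudl.extract.ferc1.extract(
        ferc1_tables=("fuel_ferc1", "plants_steam_ferc1"),
        ferc1_years=range(1994, 2020),
        pudl_settings=pudl_settings,
    )
    raw_dfs["fuel_ferc1"].head()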
-
pudl.extract.ferc1.
fuel
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_fuel table records with plant names, >0 fuel.
- Parameters
- Returns
A DataFrame containing f1_fuel records that have plant_names and non-zero fuel amounts.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
get_dbc_map
(ds, year, min_length=4)[source]¶ Extract names of all tables and fields from a FERC Form 1 DBC file.
Read the DBC file associated with the FERC Form 1 database for the given year, and extract all printable strings longer than min_length. Select those strings that appear to be database table names and their associated field names, for use in renaming the truncated column names extracted from the corresponding DBF files (those names are limited to only 10 characters).
- Parameters
ds (Ferc1Datastore) – Initialized datastore
year – The year of data from which the database table and column names are to be extracted. Typically this is expected to be the most recently available year of FERC Form 1 data.
- Returns
a dictionary whose keys are the long table names extracted from the DBC file, and whose values are lists of pairs of values, the first of which is the full name of each field in the table with the same name as the key, and the second of which is the truncated (<=10 character) name of that field as found in the DBF file.
- Return type
dict
-
pudl.extract.ferc1.
get_ferc1_meta
(ferc1_engine)[source]¶ Grab the FERC Form 1 DB metadata and check that tables exist.
Connects to the FERC Form 1 SQLite database and reads in its metadata (table schemas, types, etc.) by reflecting the database. Checks to make sure the DB is not empty, and returns the metadata object.
- Parameters
ferc1_engine (sqlalchemy.engine.Engine) – SQL Alchemy database connection engine for the PUDL FERC 1 DB.
- Returns
A SQL Alchemy metadata object (sqlalchemy.MetaData) containing the definition of the DB structure.
- Raises
ValueError – If there are no tables in the SQLite Database.
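A short sketch, assuming pudl_settings contains a “ferc1_db” URL for the cloned database (that key name follows common PUDL usage and may differ in your version):

    import sqlalchemy as sa
    import pudl.extract.ferc1

    ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
    ferc1_meta = pudl.extract.ferc1.get_ferc1_meta(ferc1_engine)
    print(sorted(ferc1_meta.tables))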
-
pudl.extract.ferc1.
get_fields
(filedata)[source]¶ Produce the expected table names and fields from a DBC file.
- Parameters
filedata – Contents of the DBC file from which to extract.
- Returns
A dictionary mapping each table_name to a list of its fields.
- Return type
dict
-
pudl.extract.ferc1.
get_raw_df
(ds, table, dbc_map, years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019))[source]¶ Combine several years of a given FERC Form 1 DBF table into a dataframe.
- Parameters
ds (Ferc1Datastore) – Initialized datastore
table (string) – The name of the FERC Form 1 table from which data is read.
dbc_map (dict of dicts) – A dictionary of dictionaries, of the kind returned by get_dbc_map(), describing the table and column names stored within the FERC Form 1 FoxPro database files.
min_length (int) – The minimum number of consecutive printable
years (list) – Range of years to be combined into a single DataFrame.
- Returns
A DataFrame containing several years of FERC Form 1 data for the given table.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
missing_respondents
(reported, observed, identified)[source]¶ Fill in missing respondents for the f1_respondent_id table.
- Parameters
reported (iterable) – Respondent IDs appearing in f1_respondent_id.
observed (iterable) – Respondent IDs appearing anywhere in the ferc1 DB.
identified (dict) – A {respondent_id: respondent_name} mapping for those observed but not reported respondent IDs which we have been able to identify based on circumstantial evidence. See also: pudl.extract.ferc1.PUDL_RIDS
- Returns
A list of dictionaries representing minimal f1_respondent_id table records, of the form {“respondent_id”: ID, “respondent_name”: NAME}. These records are generated only for unreported respondents. Identified respondents get the values passed in through identified, and the other observed but unidentified respondents are named “Missing Respondent ID”.
- Return type
list
-
pudl.extract.ferc1.
observed_respondents
(ferc1_engine)[source]¶ Compile the set of all observed respondent IDs found in the FERC 1 database.
A significant number of FERC 1 respondent IDs appear in the data tables, but not in the f1_respondent_id table. In order to construct a self-consistent database, we need to find all of those missing respondent IDs and inject them into the table when we clone the database.
- Parameters
ferc1_engine (sqlalchemy.engine.Engine) – An engine for connecting to the FERC 1 database.
- Returns
Every respondent ID reported in any of the FERC 1 DB tables.
- Return type
set
-
pudl.extract.ferc1.
plant_in_service
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of the fields of plant_in_service_ferc1.
- Parameters
- Returns
A DataFrame containing all plant_in_service_ferc1 records.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
plants_hydro
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_hydro for records that have plant names.
- Parameters
- Returns
A DataFrame containing f1_hydro records that have plant names.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
plants_pumped_storage
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_plants_pumped_storage records with plant names.
- Parameters
- Returns
A DataFrame containing f1_plants_pumped_storage records that have plant names.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
plants_small
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of f1_small for records with minimum data criteria.
- Parameters
- Returns
A DataFrame containing f1_small records that have plant names and non zero demand, generation, operations, maintenance, and fuel costs.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
plants_steam
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Create a pandas.DataFrame containing valid raw f1_steam records.
Selected records must indicate a plant capacity greater than 0, and include a non-null plant name.
- Parameters
- Returns
A DataFrame containing f1_steam records that have plant names and non-zero capacities.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
purchased_power
(ferc1_meta, ferc1_table, ferc1_years)[source]¶ Creates a DataFrame of the fields of purchased_power_ferc1.
- Parameters
- Returns
A DataFrame containing all purchased_power_ferc1 records.
- Return type
pandas.DataFrame
-
pudl.extract.ferc1.
show_dupes
(table, dbc_map, data_dir, years=(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), pk=('respondent_id', 'report_year', 'report_prd', 'row_number', 'spplmnt_num'))[source]¶ Identify duplicate primary keys by year within a given FERC Form 1 table.
- Parameters
table (str) – Name of the original FERC Form 1 table to identify duplicate records in.
years (iterable) – a list or other iterable containing the years that should be searched for duplicate records. By default it is all available years of FERC Form 1 data.
pk (list) – A list of strings identifying the columns in the FERC Form 1 table that should be treated as a composite primary key. By default this includes: respondent_id, report_year, report_prd, row_number, and spplmnt_num.
- Returns
None
Routines used for extracting the raw FERC 714 data.
-
pudl.extract.ferc714.
TABLE_ENCODING
= {'adjacency_ba_ferc714': 'iso-8859-1', 'demand_forecast_pa_ferc714': None, 'demand_hourly_pa_ferc714': None, 'demand_monthly_ba_ferc714': None, 'description_pa_ferc714': 'iso-8859-1', 'gen_plants_ba_ferc714': 'iso-8859-1', 'id_certification_ferc714': 'iso-8859-1', 'interchange_ba_ferc714': 'iso-8859-1', 'lambda_description_ferc714': 'iso-8859-1', 'lambda_hourly_ba_ferc714': None, 'net_energy_load_ba_ferc714': None, 'respondent_id_ferc714': None}¶ Dictionary describing the character encodings of the FERC 714 CSV files.
-
pudl.extract.ferc714.
TABLE_FNAME
= {'adjacency_ba_ferc714': 'Part 2 Schedule 4 - Adjacent Balancing Authorities.csv', 'demand_forecast_pa_ferc714': 'Part 3 Schedule 3 - Planning Area Forecast Demand.csv', 'demand_hourly_pa_ferc714': 'Part 3 Schedule 2 - Planning Area Hourly Demand.csv', 'demand_monthly_ba_ferc714': 'Part 2 Schedule 2 - Balancing Authority Monthly Demand.csv', 'description_pa_ferc714': 'Part 3 Schedule 1 - Planning Area Description.csv', 'gen_plants_ba_ferc714': 'Part 2 Schedule 1 - Balancing Authority Generating Plants.csv', 'id_certification_ferc714': 'Part 1 Schedule 1 - Identification Certification.csv', 'interchange_ba_ferc714': 'Part 2 Schedule 5 - Balancing Authority Interchange.csv', 'lambda_description_ferc714': 'Part 2 Schedule 6 - System Lambda Description.csv', 'lambda_hourly_ba_ferc714': 'Part 2 Schedule 6 - Balancing Authority Hourly System Lambda.csv', 'net_energy_load_ba_ferc714': 'Part 2 Schedule 3 - Balancing Authority Net Energy For Load.csv', 'respondent_id_ferc714': 'Respondent IDs.csv'}¶ Dictionary mapping PUDL tables to filenames within the FERC 714 zipfile.
-
pudl.extract.ferc714.
extract
(tables=('respondent_id_ferc714', 'id_certification_ferc714', 'gen_plants_ba_ferc714', 'demand_monthly_ba_ferc714', 'net_energy_load_ba_ferc714', 'adjacency_ba_ferc714', 'interchange_ba_ferc714', 'lambda_hourly_ba_ferc714', 'lambda_description_ferc714', 'description_pa_ferc714', 'demand_forecast_pa_ferc714', 'demand_hourly_pa_ferc714'), pudl_settings=None, ds=None)[source]¶ Extract the raw FERC Form 714 dataframes from their original CSV files.
- Parameters
- Returns
A dictionary of dataframes, with raw FERC 714 table names as the keys, and minimally processed pandas.DataFrame instances as the values.
- Return type
dict
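A usage sketch (pudl_settings and ds as in the earlier examples):

    import pudl.extract.ferc714

    raw714 = pudl.extract.ferc714.extract(
        tables=("respondent_id_ferc714", "demand_hourly_pa_ferc714"),
        pudl_settings=pudl_settings,
        ds=ds,
    )
    raw714["demand_hourly_pa_ferc714"].info()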
Modules implementing the “Extract” step of the PUDL ETL pipeline.
Each module in this subpackage implements data extraction for a single data source from the PUDL Data Sources. This process begins with the original data as retrieved by the pudl.workspace subpackage, and ends with a dictionary of “raw” pandas.DataFrames that have been minimally altered from the original data and are ready for normalization and data cleaning by the data source specific modules in the pudl.transform subpackage.
pudl.glue package¶
Extract, clean, and normalize the EPA-EIA crosswalk.
This module defines functions that read the raw EPA-EIA crosswalk file, clean up the column names, and separate it into three distinct normalized tables for integration into the database. There are many gaps in the mapping of EIA plant and generator ids to EPA plant and unit ids, so for the time being these tables are sparse.
The EPA, in conjunction with the EIA, plans to release a crosswalk with fewer gaps at the beginning of 2021. Until then, this module reads and cleans the currently available crosswalk.
The raw crosswalk file was obtained from Greg Schivley. His methods for filling in some of the gaps are not included in this version of the module. https://github.com/grgmiller/EPA-EIA-Unit-Crosswalk
-
pudl.glue.eia_epacems.
grab_clean_split
()[source]¶ Clean raw crosswalk data, drop nans, and return split tables.
- Returns
a dictionary of three normalized DataFrames comprised of the data in the original crosswalk file. EPA plant id to EPA unit id; EPA plant id to EIA plant id; and EIA plant id to EIA generator id to EPA unit id.
- Return type
dict
-
pudl.glue.eia_epacems.
grab_n_clean_epa_orignal
()[source]¶ Retrieve and clean column names for the original EPA-EIA crosswalk file.
- Returns
a version of the EPA-EIA crosswalk containing only relevant columns. Column names are clear and programmatically accessible.
- Return type
pandas.DataFrame
-
pudl.glue.eia_epacems.
split_tables
(df)[source]¶ Split the cleaned EIA-EPA crosswalk table into three normalized tables.
- Parameters
pandas.DataFrame – a DataFrame of relevant, readable columns from the EIA-EPA crosswalk. Output of grab_n_clean_epa_orignal().
- Returns
a dictionary of three normalized DataFrames comprised of the data in the original crosswalk file. EPA plant id to EPA unit id; EPA plant id to EIA plant id; and EIA plant id to EIA generator id to EPA unit id. Includes no nan values.
- Return type
dict
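Putting the crosswalk helpers together (note that grab_n_clean_epa_orignal is the function’s actual, if misspelled, name):

    import pudl.glue.eia_epacems

    crosswalk = pudl.glue.eia_epacems.grab_n_clean_epa_orignal()
    tables = pudl.glue.eia_epacems.split_tables(crosswalk)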
Extract and transform glue tables between FERC Form 1 and EIA 860/923.
FERC1 and EIA report on many of the same plants and utilities, but have no embedded connection. We have combed through the FERC and EIA plants and utilities to generate IDs which can connect these datasets. The resulting fields in the PUDL tables are plant_id_pudl and utility_id_pudl, respectively. This was done by hand in a spreadsheet which is in the package_data/glue directory. When mapping plants, we considered a plant to be a co-located collection of electricity generation equipment. If a coal plant was converted to a natural gas unit, our aim was to consider this the same plant. This module simply reads in the mapping spreadsheet and converts it to a dictionary of dataframes.
Because these mappings were done by hand and for every one of FERC Form 1’s thousands of reported plants, we know there are probably some incorrect or incomplete mappings. If you see a plant_id_pudl or utility_id_pudl mapping that you think is incorrect, please open an issue on our Github!
Note that the PUDL IDs may change over time. They are not guaranteed to be stable. If you need to find a particular plant or utility reliably, you should use its plant_id_eia, utility_id_eia, or utility_id_ferc1.
Another note about these IDs: they map our definition of plants, which is not the most granular level of plant unit. The generators are typically the smaller, more interesting unit. FERC does not typically report in units (although it sometimes does), but it does often break up gas units from coal units. EIA reports on the generator and boiler level. When trying to use these PUDL IDs, consider the granularity that you desire and the potential implications of using a co-located set of plant infrastructure as an id.
-
pudl.glue.ferc1_eia.
get_db_plants_eia
(pudl_engine)[source]¶ Get a list of all EIA plants appearing in the PUDL DB.
This list of plants is used to determine which plants need to be added to the FERC 1 / EIA plant mappings, where we assign PUDL Plant IDs. Unless a new year’s worth of data has been added to the PUDL DB, but the plants have not yet been mapped, all plants in the PUDL DB should also appear in the plant mappings. It only makes sense to run this with a connection to a PUDL DB that has all the EIA data in it.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – A database connection engine for connecting to a PUDL SQLite database.
- Returns
A DataFrame with plant_id_eia, plant_name_eia, and state columns, for addition to the FERC 1 / EIA plant mappings.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_db_plants_ferc1
(pudl_settings, years)[source]¶ Pull a dataframe of all plants in the FERC Form 1 DB for the given years.
This function looks in the f1_steam, f1_gnrt_plant, f1_hydro and f1_pumped_storage tables, and generates a dataframe containing every unique combination of respondent_id (utility_id_ferc1) and plant_name it finds. Also included is the capacity of the plant in MW (as reported in the raw FERC Form 1 DB), the respondent_name (utility_name_ferc1) and a column indicating which of the plant tables the record came from. Plant and utility names are translated to lowercase, with leading and trailing whitespace stripped and repeating internal whitespace compacted to a single space.
This function is primarily meant for use generating inputs into the manual mapping of FERC to EIA plants with PUDL IDs.
- Parameters
pudl_settings (dict) – Dictionary containing various paths and database URLs used by PUDL.
years (iterable) – Years for which plants should be compiled.
- Returns
A dataframe containing columns utility_id_ferc1, utility_name_ferc1, plant_name, capacity_mw, and plant_table. Each row is a unique combination of utility_id_ferc1 and plant_name.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_db_utils_eia
(pudl_engine)[source]¶ Get a list of all EIA Utilities appearing in the PUDL DB.
-
pudl.glue.ferc1_eia.
get_lost_plants_eia
(pudl_engine)[source]¶ Identify any EIA plants which were mapped, but then lost from the DB.
-
pudl.glue.ferc1_eia.
get_lost_utils_eia
(pudl_engine)[source]¶ Get a list of all mapped EIA Utilities not found in the PUDL DB.
-
pudl.glue.ferc1_eia.
get_mapped_plants_eia
()[source]¶ Get a list of all EIA plants that have been assigned PUDL Plant IDs.
Read in the list of already mapped EIA plants from the FERC 1 / EIA plant and utility mapping spreadsheet kept in the package_data.
- Parameters
None –
- Returns
A DataFrame listing the plant_id_eia and plant_name_eia values for every EIA plant which has already been assigned a PUDL Plant ID.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_mapped_plants_ferc1
()[source]¶ Generate a dataframe containing all previously mapped FERC 1 plants.
Many plants are reported in FERC Form 1 with different versions of the same name in different years. Because FERC provides no unique ID for plants, these names must be used as part of their identifier. We manually curate a list of all the versions of plant names which map to the same actual plant. In order to identify new plants each year, we have to compare the new plant names and respondent IDs against this raw mapping, not the contents of the PUDL data, since within PUDL we use one canonical name for the plant. This function pulls that list of various plant names and their corresponding utilities (both name and ID) for use in identifying which plants have yet to be mapped when we are integrating new data.
- Parameters
None –
- Returns
A DataFrame with three columns: plant_name, utility_id_ferc1, and utility_name_ferc1. Each row represents a unique combination of utility_id_ferc1 and plant_name.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_mapped_utils_eia
()[source]¶ Get a list of all the EIA Utilities that have PUDL IDs.
-
pudl.glue.ferc1_eia.
get_mapped_utils_ferc1
()[source]¶ Read in the list of manually mapped utilities for FERC Form 1.
Unless a new utility has appeared in the database, this should be identical to the full list of utilities available in the FERC Form 1 database.
- Parameters
None –
- Returns
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_unmapped_plants_eia
(pudl_engine)[source]¶ Identify any as-of-yet unmapped EIA Plants.
-
pudl.glue.ferc1_eia.
get_unmapped_plants_ferc1
(pudl_settings, years)[source]¶ Generate a DataFrame of all unmapped FERC plants in the given years.
Pulls all plants from the FERC Form 1 DB for the given years, and compares that list against the already mapped plants. Any plants found in the database but not in the list of mapped plants are returned.
- Parameters
pudl_settings (dict) – Dictionary containing various paths and database URLs used by PUDL.
years (iterable) – Years for which plants should be compiled from the raw FERC Form 1 DB.
- Returns
A dataframe containing five columns: utility_id_ferc1, utility_name_ferc1, plant_name, capacity_mw, and plant_table. Each row is a unique combination of utility_id_ferc1 and plant_name, which appears in the FERC Form 1 DB, but not in the list of manually mapped plants.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_unmapped_utils_eia
(pudl_engine)[source]¶ Get a list of all the EIA Utilities in the PUDL DB without PUDL IDs.
-
pudl.glue.ferc1_eia.
get_unmapped_utils_ferc1
(ferc1_engine)[source]¶ Generate a list of as-of-yet unmapped utilities from the FERC Form 1 DB.
Find any utilities which do exist in the cloned FERC Form 1 DB, but which do not show up in the already mapped FERC respondents.
- Parameters
ferc1_engine (sqlalchemy.engine.Engine) – A database connection engine for the cloned FERC Form 1 DB.
- Returns
A DataFrame with columns “utility_id_ferc1” and “utility_name_ferc1”.
- Return type
pandas.DataFrame
-
pudl.glue.ferc1_eia.
get_unmapped_utils_with_plants_eia
(pudl_engine)[source]¶ Get all EIA Utilities that lack PUDL IDs but have plants/ownership.
-
pudl.glue.ferc1_eia.
glue
(ferc1=False, eia=False)[source]¶ Generates a dictionary of dataframes for glue tables between FERC1, EIA.
That data is primarily stored in the plant_output and utility_output tabs of package_data/glue/mapping_eia923_ferc1.xlsx in the repository. There are a total of seven relations described in this data:
utilities: Unique id and name for each utility for use across the PUDL DB.
plants: Unique id and name for each plant for use across the PUDL DB.
utilities_eia: EIA operator ids and names attached to a PUDL utility id.
plants_eia: EIA plant ids and names attached to a PUDL plant id.
utilities_ferc: FERC respondent ids & names attached to a PUDL utility id.
plants_ferc: A combination of FERC plant names and respondent ids, associated with a PUDL plant ID. This is necessary because FERC does not provide plant ids, so the unique plant identifier is a combination of the respondent id and plant name.
utility_plant_assn: An association table which describes which plants have relationships with which utilities. If a record exists in this table, then the combination of PUDL utility id & PUDL plant id does have an association of some kind. The nature of that association is somewhat fluid, and more scrutiny will likely be required for use in analysis.
Presently, the ‘glue’ tables are a very basic piece of infrastructure for the PUDL DB, because they contain the primary key fields for utilities and plants in FERC1.
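A sketch of generating these glue tables directly; the exact dictionary keys are assumed to match the relation names listed above:

    import pudl.glue.ferc1_eia

    glue_dfs = pudl.glue.ferc1_eia.glue(ferc1=True, eia=True)
    glue_dfs["utility_plant_assn"].head()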
Tools for integrating & reconciling different PUDL datasets with each other.
Many of the datasets integrated by PUDL report related information, but it’s often not easy to programmatically relate the datasets to each other. The glue subpackage provides tools for doing so, making all of the individual datasets more useful, and enabling richer analyses.
In this subpackage there are two basic types of modules:
those that implement general tools for connecting datasets together (like the pudl.glue.zipper module, which links two tabular datasets based on a set of mutually reported variables with no common IDs), and
those that implement a connection between two specific datasets (like the pudl.glue.ferc1_eia module).
In general we try to enable each dataset to be processed independently, and optionally apply the glue to connect them to each other when both datasets for which glue exists are being processed together.
pudl.load package¶
Functions for loading processed PUDL data tables into CSV files.
Once each set of tables pertaining to a data source has been transformed, we need to output them into CSV files which will become the data underlying tabular data resources. Most of these resources contain an entire table. In the case of larger tables (like EPA CEMS) the data may be partitioned into a collection of gzipped CSV files which are all part of a single resource group.
These functions are designed to pick up where the transform step leaves off, taking a dictionary of dataframes and applying a few last alterations that are necessary only in the context of outputting the data as text based files. These include converting floatified integer columns into strings with null values, and appropriately indexing the dataframes as needed.
-
pudl.load.csv.
clean_columns_dump
(df, resource_name, datapkg_dir)[source]¶ Output cleaned data columns to a CSV file.
Ensures that the id column is set appropriately depending on whether the table has a natural primary key or an autoincremented pseudo-key. Ensures that the set of columns in the dataframe to be output are identical to those in the corresponding metadata definition. Transforms integer columns with NA values into strings for dumping, as appropriate.
- Parameters
resource_name (str) – The exact name of the tabular resource which the DataFrame df is going to be used to populate. This will be used to name the output CSV file, and must match the corresponding stored metadata template.
datapkg_dir (path-like) – Path to the datapackage directory that the CSV will be part of. Assumes CSV files get put in a “data” directory within this directory.
df (pandas.DataFrame) – The dataframe containing the data to be written out into CSV for inclusion in a tabular datapackage.
- Returns
None
-
pudl.load.csv.
csv_dump
(df, resource_name, keep_index, datapkg_dir)[source]¶ Write a dataframe to CSV.
Set pandas.DataFrame.to_csv() arguments appropriately depending on what data source we’re writing out, and then write it out. In practice this means adding a .csv to the end of the resource name, and then, if it’s part of epacems, adding a .gz after that.
- Parameters
df (pandas.DataFrame) – The DataFrame to be dumped to CSV.
resource_name (str) – The exact name of the tabular resource which the DataFrame df is going to be used to populate. This will be used to name the output CSV file, and must match the corresponding stored metadata template.
keep_index (bool) – if True, use the “id” column of df as the index and output it.
datapkg_dir (path-like) – Path to the top level datapackage directory.
- Returns
None
-
pudl.load.csv.
dict_dump
(transformed_dfs, data_source, datapkg_dir)[source]¶ Wrapper for clean_columns_dump that takes a dictionary of DataFrames.
- Parameters
transformed_dfs (dict) – A dictionary of DataFrame objects in which tables from datasets (keys) correspond to normalized DataFrames of values from that table (values)
data_source (str) – The name of the data source we are working with (eia923, ferc1, etc.)
datapkg_dir (path-like) – Path to the top level directory for the datapackage these CSV files are part of. Will contain a “data” directory and a datapackage.json file.
- Returns
None
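A sketch of the load step, assuming transformed_dfs is the dictionary of dataframes produced by the transform step and that the (hypothetical) datapackage directory already contains a “data” subdirectory:

    import pudl.load.csv

    pudl.load.csv.dict_dump(
        transformed_dfs,
        data_source="ferc1",
        datapkg_dir="datapkg/pudl-example",
    )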
Routines for generating PUDL tabular data package and resource metadata.
This module enables the generation and use of the metadata for tabular data packages. It also saves and validates the datapackage once the metadata is compiled. In general the routines in this module can only be used after the referenced CSVs have been generated by the top level PUDL ETL module and written out to the datapackage data directory by the pudl.load.csv module.
The metadata comes from three basic sources: the datapkg_settings that are read in from the YAML file specifying the datapackage or bundle of datapackages to be generated, the CSV files themselves (their names, sizes, and hash values) and the stored metadata template which ultimately determines the structure of the relational database that these output tabular data packages represent, and encodes field specific table schemas. See the “megadata” which is stored in src/pudl/package_data/meta/datapkg/datapackage.json.
For unpartitioned tables which are contained in a single tabular data resource this is a relatively straightforward process. However, larger tables that have been partitioned into smaller tabular data resources that are part of a resource group (e.g. EPA CEMS) have additional complexities. We have tried to say “resource” when referring to an individual output CSV that has its own metadata entry, and “table” when referring to whole tables which typically contain only a single resource, but may be composed of hundreds or even thousands of individual resources.
See https://frictionlessdata.io for more details on the tabular data package standards.
In addition, we have included PUDL specific metadata fields that document the ETL parameters which were used to process the data, temporal and spatial coverage for each resource, Zenodo DOIs if appropriate, UUIDs to identify the individual data packages as well as co-generated bundles of data packages that can be used together to instantiate a single database, etc.
-
pudl.load.metadata.
compile_keywords
(data_sources)[source]¶ Compile the set of all keywords associated with given data sources.
The list of keywords we associate with each data source is stored in the pudl.constants.keywords_by_data_source dictionary.
- Parameters
data_sources (iterable) – List of data source codes (eia923, ferc1, etc.) from which to gather keywords.
- Returns
the set of all unique keywords associated with any of the input data sources.
- Return type
set
-
pudl.load.metadata.
compile_partitions
(datapkg_settings)[source]¶ Given a datapackage settings dictionary, extract dataset partitions.
Iterates through all the datasets enumerated in the datapackage settings, and compiles a dictionary indicating which datasets should be partitioned and on what basis when they are output as tabular data resources. Currently this only applies to the epacems dataset. Datapackage settings must be validated because currently we inject EPA CEMS partitioning variables (epacems_years, epacems_states) during the validation process.
- Parameters
datapkg_settings (dict) – a dictionary containing validated datapackage settings, mostly read in from a PUDL ETL settings file.
- Returns
Uses table name (e.g. hourly_emissions_epacems) as keys, and lists of partition variables (e.g. [“epacems_years”, “epacems_states”]) as the values. If no datasets within the datapackage are being partitioned, this is an empty dictionary.
- Return type
dict
-
pudl.load.metadata.
data_sources_from_tables
(table_names)[source]¶ Look up data sources used by the given list of PUDL database tables.
- Parameters
table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.
- Returns
The set of data sources for the list of PUDL table names.
- Return type
set
-
pudl.load.metadata.
generate_metadata
(datapkg_settings, datapkg_resources, datapkg_dir, datapkg_bundle_uuid=None, datapkg_bundle_doi=None)[source]¶ Generate metadata for package tables and validate package.
The metadata for this package is compiled from the pkg_settings and from the “megadata”, which is a json file containing the schema for all of the possible pudl tables. Given a set of tables, this function compiles metadata and validates the metadata and the package. This function assumes datapackage CSVs have already been generated.
See Frictionless Data for the tabular data package specification: http://frictionlessdata.io/specs/tabular-data-package/
- Parameters
datapkg_settings (dict) – a dictionary of package settings containing the top level elements of the data package JSON descriptor specific to the data package, including: * name: short, unique package name e.g. pudl-eia923, ferc1-test * title: One line human readable description. * description: A paragraph long description. * version: the version of the data package being published. * keywords: For search purposes.
datapkg_resources (list) – The names of tabular data resources that are included in this data package.
datapkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapkg_dir directory, with the name of the package as the name of the subdirectory.
datapkg_bundle_uuid – A type 4 UUID identifying the ETL run which generated the data package – this indicates that the data packages are compatible with each other.
datapkg_bundle_doi – A digital object identifier (DOI) that will be used to archive the bundle of mutually compatible data packages. Needs to be provided by an archiving service like Zenodo. This field may also be added after the data package has been generated.
- Returns
a Python dictionary representing a valid tabular data package descriptor.
- Return type
dict
-
pudl.load.metadata.
get_autoincrement_columns
(unpartitioned_tables)[source]¶ Grab the autoincrement columns for pkg tables.
-
pudl.load.metadata.
get_datapkg_fks
(datapkg_json)[source]¶ Get a dictionary of foreign key relationships from datapackage metadata.
- Parameters
datapkg_json (path-like) – Path to the datapackage.json containing the schema from which the foreign key relationships will be read.
- Returns
table names (keys) with lists of table names (values) with which the key table has foreign key relationships.
- Return type
dict
-
pudl.load.metadata.
get_dependent_tables
(table_name, fk_relash)[source]¶ For a given table, get the list of all the other tables it depends on.
-
pudl.load.metadata.
get_dependent_tables_from_list
(table_names)[source]¶ Given a list of tables, find all the other tables they depend on.
Iterate over a list of input tables, adding them and all of their dependent tables to a set, and return that set. Useful for determining which tables need to be exported together to yield a self-contained subset of the PUDL database.
- Parameters
table_names (iterable) – a list of names of ‘seed’ tables, whose dependencies we are seeking to find.
- Returns
All tables with which any of the input tables have ForeignKey relations.
- Return type
set
-
pudl.load.metadata.
get_tabular_data_resource
(resource_name, datapkg_dir, datapkg_settings, partitions=False)[source]¶ Create a Tabular Data Resource descriptor for a PUDL table.
Based on the information in the database, and some additional metadata this function will generate a valid Tabular Data Resource descriptor, according to the Frictionless Data specification, which can be found here: https://frictionlessdata.io/specs/tabular-data-resource/
- Parameters
resource_name (string) – name of the tabular data resource for which you want to generate a Tabular Data Resource descriptor. This is the resource name, rather than the database table name, because we partition large tables into resource groups consisting of many files.
datapkg_dir (path-like) – The location of the directory for this package. The data package directory will be a subdirectory in the datapkg_dir directory, with the name of the package as the name of the subdirectory.
datapkg_settings (dict) – Python dictionary representing the ETL parameters read in from the settings file, pertaining to the tabular datapackage this resource is part of.
partitions (dict) – A dictionary with PUDL database table names as the keys (e.g. hourly_emissions_epacems), and lists of partition variables (e.g. [“epacems_years”, “epacems_states”]) as the values.
- Returns
A Python dictionary representing a tabular data resource descriptor that complies with the Frictionless Data specification.
- Return type
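For orientation, a minimal Tabular Data Resource descriptor of the kind this function returns looks roughly like the sketch below. The field names follow the Frictionless Data specification linked above, but the resource name, path, and columns here are purely hypothetical, and the descriptors PUDL actually emits carry additional metadata (hashes, coverage, sources):

```python
descriptor = {
    "name": "fuel_ferc1",                # hypothetical resource name
    "path": "data/fuel_ferc1.csv",       # hypothetical relative path
    "profile": "tabular-data-resource",
    "format": "csv",
    "mediatype": "text/csv",
    "schema": {
        "fields": [
            {"name": "record_id", "type": "string"},
            {"name": "fuel_qty_burned", "type": "number"},
        ],
        "primaryKey": ["record_id"],
    },
}
```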
-
pudl.load.metadata.
get_unpartitioned_tables
(resources, datapkg_settings)[source]¶ Generate a list of database table names from a list of data resources.
In the case of EPA CEMS and potentially other large datasets, we are partitioning a single table into many tabular data resources that are part of a resource group. However, in some contexts we want to refer to the list of corresponding database tables, rather than the list of resources.
The partition key in the datapackage settings is the name of the table without the partition elements, and so in the case of partitioned tables we use that key as the name of the table. Otherwise we just use the name of the resource.
- Parameters
resources (iterable) – A list of tabular data resource names. They are expected to appear in the datapackage specified by datapkg_settings.
datapkg_settings (dict) – a dictionary containing validated datapackage settings, mostly read in from a PUDL ETL settings file.
- Returns
The names of the database tables corresponding to the tabular datapackage resource names that were passed in.
- Return type
-
pudl.load.metadata.
hash_csv
(csv_path)[source]¶ Calculates a SHA-256 hash of the CSV file for data integrity checking.
- Parameters
csv_path (path-like) – Path to the CSV file to hash.
- Returns
the hexdigest of the hash, with a ‘sha256:’ prefix.
- Return type
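A minimal sketch of the hashing logic described here, reading the file in binary chunks so that large CSVs need not fit in memory; the 'sha256:' prefix matches the documented return value. This is an illustration, not the library code:

```python
import hashlib

def hash_csv_sketch(csv_path):
    """Return 'sha256:' plus the hex digest of the file's contents."""
    sha = hashlib.sha256()
    with open(csv_path, "rb") as f:
        # Read in 8 MiB chunks so large CSVs need not fit in memory.
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            sha.update(chunk)
    return "sha256:" + sha.hexdigest()
```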
-
pudl.load.metadata.
pull_resource_from_megadata
(resource_name)[source]¶ Read metadata for a given data resource from the stored PUDL megadata.
- Parameters
resource_name (str) – the name of the tabular data resource whose JSON descriptor we are reading.
- Returns
A Python dictionary containing the resource descriptor portion of a data package descriptor, not expected to be valid or complete.
- Return type
- Raises
ValueError – If resource_name is not found exactly once in the PUDL metadata library.
-
pudl.load.metadata.
spatial_coverage
(resource_name)[source]¶ Extract spatial coverage (country and state) for a given source.
- Parameters
resource_name (str) – The name of the (potentially partitioned) resource for which we are enumerating the spatial coverage. Currently this is the only place we are able to access the partitioned spatial coverage after the ETL process has completed.
- Returns
A dictionary containing country and potentially state level spatial coverage elements. Country keys are “country” for the full name of country, “iso_3166-1_alpha-2” for the 2-letter ISO code, and “iso_3166-1_alpha-3” for the 3-letter ISO code. State level elements are “state” (a two letter ISO code for sub-national jurisdiction) and “iso_3166-2” for the combined country-state code conforming to that standard.
- Return type
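As a concrete (hypothetical) example, the coverage dictionary for a resource partitioned to Colorado would contain the keys described above:

```python
coverage = {
    "country": "United States of America",
    "iso_3166-1_alpha-2": "US",
    "iso_3166-1_alpha-3": "USA",
    "state": "CO",          # two-letter sub-national code
    "iso_3166-2": "US-CO",  # combined country-state code
}
```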
-
pudl.load.metadata.
temporal_coverage
(resource_name, datapkg_settings)[source]¶ Extract start and end dates from ETL parameters for a given source.
- Parameters
resource_name (str) – The name of the (potentially partitioned) resource for which we are enumerating the temporal coverage. Currently this is the only place we are able to access the partitioned temporal coverage after the ETL process has completed.
datapkg_settings (dict) – Python dictionary representing the ETL parameters read in from the settings file, pertaining to the tabular datapackage this resource is part of.
- Returns
A dictionary of two items, keys “start_date” and “end_date” with values in ISO 8601 YYYY-MM-DD format, indicating the extent of the time series data contained within the resource. If the resource does not contain time series data, the dates are null.
- Return type
-
pudl.load.metadata.
validate_save_datapkg
(datapkg_descriptor, datapkg_dir)[source]¶ Validate datapackage descriptor, save it, and validate some sample data.
- Parameters
datapkg_descriptor (dict) – A Python dictionary representation of a (hopefully valid) tabular datapackage descriptor.
datapkg_dir (path-like) – Directory into which the datapackage.json file containing the tabular datapackage descriptor should be written.
- Returns
A dictionary containing the goodtables datapackage validation report. Note that this is only returned if there are no errors; otherwise the report is output as part of the error message.
- Return type
- Raises
ValueError – if the datapackage descriptor passed in is invalid, or if any of the tables has a data validation error.
Tools for handling the load set in pudl ETL.
pudl.output package¶
Functions for reading data out of the Census DP1 SQLite Database.
-
pudl.output.censusdp1tract.
get_layer
(layer: Literal["state", "county", "tract"], pudl_settings=None) → geopandas.geodataframe.GeoDataFrame[source]¶ Select one layer from the Census DP1 database.
Uses information within the Census DP1 database to set the coordinate reference system and to identify the column containing the geometry. The geometry column is renamed to “geom” as that’s the default within GeoPandas. No other column names or types are altered.
- Parameters
- Returns
geopandas.GeoDataFrame
Functions for pulling data primarily from the EIA’s Form 860.
-
pudl.output.eia860.
assign_cc_unit_ids
(gens_df)[source]¶ Assign PUDL Unit IDs for combined cycle generation units.
This applies only to combined cycle units reported as a combination of CT and CA prime movers. All CT and CA generators within a plant that do not already have a unit_id_pudl assigned will be given the same unit ID. The bga_source column is set to one of several flags indicating what type of arrangement was found:
orphan_ct (zero CA gens, 1+ CT gens)
orphan_ca (zero CT gens, 1+ CA gens)
one_ct_one_ca_inferred (1 CT, 1 CA)
one_ct_many_ca_inferred (1 CT, 1+ CA)
many_ct_one_ca_inferred (1+ CT, 1 CA)
many_ct_many_ca_inferred (1+ CT, 1+ CA)
Orphaned generators are still assigned a unit_id_pudl so that they can potentially be associated with other generators in the same unit across years. It’s likely that these orphans are a result of mislabeled or missing generators. Note that as generators are added or removed over time, the flags associated with each generator may change, even though it remains part of the same inferred unit.
- Returns
pandas.DataFrame
-
pudl.output.eia860.
assign_prime_fuel_unit_ids
(gens_df, prime_mover_code, fuel_type_code_pudl)[source]¶ Assign a PUDL Unit ID to all generators with a given prime mover and fuel.
Within each plant, assign a Unit ID to all generators that don’t have one, and that share the same fuel_type_code_pudl and prime_mover_code. This is especially useful for differentiating between different types of steam turbine generators, as there are so many different kinds of steam turbines, and the only characteristic we have to differentiate between them in this context is the fuel they consume. E.g. nuclear, geothermal, solar thermal, natural gas, diesel, and coal can all run steam turbines, but it doesn’t make sense to lump those turbines together into a single unit just because they are located at the same plant.
This routine only assigns a PUDL Unit ID to generators that have a consistently reported value of fuel_type_code_pudl across all of the years of data in gens_df. This consistency is important because otherwise the prime-fuel based unit assignment could put the same generator into different units in different years, which is currently not compatible with our concept of “units.”
- Parameters
gens_df (pandas.DataFrame) – A collection of EIA generator records. Must include the plant_id_eia, generator_id, prime_mover_code and unit_id_pudl columns.
prime_mover_code (str) – The prime mover code for which we are attempting to assign simple Unit IDs.
fuel_type_code_pudl (str) – If not None, then limit the records assigned a unit_id to those that have the specified fuel_type_code_pudl (e.g. “coal”, “gas”, “oil”, “nuclear”)
- Returns
- Return type
-
pudl.output.eia860.
assign_single_gen_unit_ids
(gens_df, prime_mover_codes, fuel_type_code_pudl=None, label_prefix='single')[source]¶ Assign a unique PUDL Unit ID to each generator of a given prime mover type.
Calculate the maximum pre-existing PUDL Unit ID within each plant, and assign each as-yet unidentified distinct generator within each plant an incrementing integer unit_id_pudl, beginning with 1 + the previous maximum unit_id_pudl found in that plant. Mark that generator with a label in the bga_source column consisting of label_prefix + the prime mover code.
If fuel_type_code_pudl is not None, then only assign new Unit IDs to those generators having the specified fuel type code, and use that fuel type code as the label prefix, e.g. “coal_st” for a coal-fired steam turbine.
Only generators having NA unit_id_pudl will be assigned a new ID.
- Parameters
gens_df (pandas.DataFrame) – A collection of EIA generator records. Must include the plant_id_eia, generator_id, prime_mover_code and unit_id_pudl columns.
prime_mover_codes (list) – List of prime mover codes for which we are attempting to assign simple Unit IDs.
fuel_type_code_pudl (str, None) – If not None, then limit the records assigned a unit_id to those that have the specified fuel_type_code_pudl (e.g. “coal”, “gas”, “oil”, “nuclear”)
label_prefix (str) – String to use in labeling records as to how their unit_id_pudl was set. Will be concatenated with the prime mover code.
- Returns
A new dataframe with the same rows and columns as were passed in, but with the unit_id_pudl and bga_source columns updated to reflect the newly assigned Unit IDs.
- Return type
-
pudl.output.eia860.
assign_unit_ids
(gens_df)[source]¶ Group generators into operational units using various heuristics.
Splits a few columns off from the big generator dataframe and uses several heuristic functions to fill in missing unit_id_pudl values beyond those that are generated in the boiler generator association process. Then merges the new unit ID values back in to the generators dataframe.
- Parameters
gens_df (pandas.DataFrame) – An EIA generator table. Must contain at least the columns: report_date, plant_id_eia, generator_id, unit_id_pudl, bga_source, fuel_type_code_pudl, prime_mover_code.
- Returns
Returned dataframe should only vary from the input in that some NA values in the unit_id_pudl and bga_source columns have been filled in with real values.
- Return type
- Raises
ValueError – If the input dataframe is missing required columns.
ValueError – If any generator is associated with more than one unit_id_pudl.
AssertionError – If row or column indices are changed.
AssertionError – If pre-existing unit_id_pudl or bga_source values are altered.
AssertionError – If contents of any other columns are altered at all.
-
pudl.output.eia860.
boiler_generator_assn_eia860
(pudl_engine, start_date=None, end_date=None)[source]¶ Pull all fields from the EIA 860 boiler generator association table.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all the fields from the EIA 860 boiler generator association table.
- Return type
-
pudl.output.eia860.
fill_unit_ids
(gens_df)[source]¶ Back and forward fill Unit IDs for each plant / gen combination.
This routine assumes that the mapping of generators to units is constant over time, and extends those mappings into years where no boilers have been reported – since in the BGA we can only connect generators to each other if they are both connected to a boiler.
Prior to 2014, combined cycle units didn’t report any “boilers” but in later years, they have been given “boilers” that correspond to their generators, so that all of their fuel consumption is recorded alongside that of other types of generators.
The bga_source field is set to “bfill_units” for those that were backfilled, and “ffill_units” for those that were forward filled.
Note: We could back/forward fill the boiler IDs prior to the BGA process and we ought to get consistent units across all the years that are the same as what we fill in here. We could also back/forward fill boiler IDs and Unit IDs after the fact, and we should get the same result. This will address many currently “boilerless” CCNG units that use generator ID as boiler ID in later years. We could try and apply this more generally, but in cases of generator IDs that haven’t been used as boiler IDs, it would break the foreign key relationship with the boiler table, unless we added them there too, which seems like too much deep muddling.
- Parameters
gens_df (pandas.DataFrame) – A generators_eia860 dataframe, which must contain columns: report_date, plant_id_eia, generator_id, unit_id_pudl, bga_source.
- Returns
A dataframe with the same columns as the input dataframe, but having some NA values filled in for both the unit_id_pudl and bga_source columns.
- Return type
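A hedged sketch of the back/forward filling described above, using a per-plant, per-generator pandas groupby; it assumes the column names documented here and is not the library implementation:

```python
import pandas as pd

def fill_unit_ids_sketch(gens_df):
    """Illustrative back/forward fill of unit_id_pudl per plant/generator.

    Assumes the columns documented above (report_date, plant_id_eia,
    generator_id, unit_id_pudl, bga_source); not the library code.
    """
    out = gens_df.sort_values("report_date").copy()
    group = out.groupby(["plant_id_eia", "generator_id"])["unit_id_pudl"]
    bfilled = group.bfill()  # fill gaps from later years first
    ffilled = group.ffill()  # then from earlier years
    was_na = out["unit_id_pudl"].isna()
    # Label filled records the way the docstring describes.
    out.loc[was_na & bfilled.notna(), "bga_source"] = "bfill_units"
    out.loc[was_na & bfilled.isna() & ffilled.notna(), "bga_source"] = "ffill_units"
    out["unit_id_pudl"] = bfilled.fillna(ffilled)
    return out
```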
-
pudl.output.eia860.
generators_eia860
(pudl_engine, start_date=None, end_date=None, unit_ids=False)[source]¶ Pull all fields reported in the generators_eia860 table.
Merge in other useful fields including the latitude & longitude of the plant that the generators are part of, canonical plant & operator names and the PUDL IDs of the plant and operator, for merging with other PUDL data sources.
Fill in data for adjacent years if requested, but never fill in earlier than the earliest working year of data for EIA923, and never add more than one year beyond the reported data (since there should be at most a one year lag between EIA923 and EIA860 reporting).
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
unit_ids (bool) – If True, use several heuristics to assign individual generators to functional units. EXPERIMENTAL.
- Returns
A DataFrame containing all the fields of the EIA 860 Generators table.
- Return type
-
pudl.output.eia860.
max_unit_id_by_plant
(gens_df)[source]¶ Identify the largest unit ID associated with each plant so newly assigned IDs don’t overlap.
The PUDL Unit IDs are sequentially assigned integers. To assign a new ID, we need to know the largest existing Unit ID within a plant. This function calculates that largest existing ID, or uses zero, if no Unit IDs are set within the plant.
Note that this calculation depends on having all of the pre-existing generators and units still available in the dataframe!
- Parameters
gens_df (pandas.DataFrame) – A generators_eia860 dataframe containing at least the columns plant_id_eia and unit_id_pudl.
- Returns
A dataframe having two columns: plant_id_eia and max_unit_id_pudl, in which each row should be unique.
- Return type
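A minimal sketch of this calculation in pandas, assuming only the documented plant_id_eia and unit_id_pudl columns; not the library code:

```python
import pandas as pd

def max_unit_id_by_plant_sketch(gens_df):
    """Largest existing unit_id_pudl per plant, or zero if none are set."""
    return (
        gens_df.groupby("plant_id_eia")["unit_id_pudl"]
        .max()        # NaN for plants with no unit IDs at all
        .fillna(0)    # use zero where no IDs exist yet
        .astype(int)
        .rename("max_unit_id_pudl")
        .reset_index()
    )
```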
-
pudl.output.eia860.
ownership_eia860
(pudl_engine, start_date=None, end_date=None)[source]¶ Pull a useful set of fields related to the ownership_eia860 table.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing a useful set of fields related to the EIA 860 Ownership table.
- Return type
-
pudl.output.eia860.
plants_eia860
(pudl_engine, start_date=None, end_date=None)[source]¶ Pull all fields from the EIA Plants tables.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all the fields of the EIA 860 Plants table.
- Return type
-
pudl.output.eia860.
plants_utils_eia860
(pudl_engine, start_date=None, end_date=None)[source]¶ Create a dataframe of plant and utility IDs and names from EIA 860.
Returns a pandas dataframe with the following columns:
report_date (in which data was reported)
plant_name_eia (from EIA entity)
plant_id_eia (from EIA entity)
plant_id_pudl
utility_id_eia (from EIA860)
utility_name_eia (from EIA860)
utility_id_pudl
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing plant and utility IDs and names from EIA 860.
- Return type
-
pudl.output.eia860.
utilities_eia860
(pudl_engine, start_date=None, end_date=None)[source]¶ Pull all fields from the EIA860 Utilities table.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all the fields of the EIA 860 Utilities table.
- Return type
Functions for pulling EIA 923 data out of the PUDL DB.
-
pudl.output.eia923.
FUEL_COST_CATEGORIES_EIAAPI
= [41696, 41762, 41740]¶ The category ids for fuel costs by fuel for electricity for coal, gas and oil.
Each category id is a piece of a query to EIA’s API. Each query here contains a set of state-level child series which contain fuel cost data.
- See EIA’s query browser here:
-
pudl.output.eia923.
boiler_fuel_eia923
(pudl_engine, freq=None, start_date=None, end_date=None)[source]¶ Pull records from the boiler_fuel_eia923 table in a given date range.
Optionally, aggregate the records over some timescale – monthly, yearly, quarterly, etc. as well as by fuel type within a plant.
If the records are not being aggregated, all of the database fields are available. If they’re being aggregated, then we preserve the following fields. Per-unit values are re-calculated based on the aggregated totals. Totals are summed across whatever time range is being used, within a given plant and fuel type.
fuel_consumed_units (sum)
fuel_mmbtu_per_unit (weighted average)
fuel_consumed_mmbtu (sum)
sulfur_content_pct (weighted average)
ash_content_pct (weighted average)
In addition, plant and utility names and IDs are pulled in from the EIA 860 tables.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
freq (str) – a pandas timeseries offset alias. The original data is reported monthly, so the best time frequencies to use here are probably month start (freq=’MS’) and year start (freq=’YS’).
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all records from the EIA 923 Boiler Fuel table.
- Return type
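The “weighted average” aggregations listed above weight each per-unit value by the fuel quantity it applies to. A minimal sketch of that pattern for the fuel_mmbtu_per_unit column, assuming the column names documented here and a datetime report_date column; the real function handles more columns and edge cases:

```python
import pandas as pd

def agg_boiler_fuel_sketch(bf_df, freq="YS"):
    """Sum fuel totals and compute a fuel-weighted average heat content."""
    df = bf_df.copy()
    # Weight per-unit heat content by the quantity of fuel consumed.
    df["_wtd_heat"] = df["fuel_mmbtu_per_unit"] * df["fuel_consumed_units"]
    grouped = df.groupby(
        ["plant_id_eia", "fuel_type_code_pudl",
         pd.Grouper(key="report_date", freq=freq)]
    )
    out = grouped[["fuel_consumed_units", "fuel_consumed_mmbtu", "_wtd_heat"]].sum()
    out["fuel_mmbtu_per_unit"] = out["_wtd_heat"] / out["fuel_consumed_units"]
    return out.drop(columns="_wtd_heat").reset_index()
```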
-
pudl.output.eia923.
convert_cost_json_to_df
(response_fuel_state_annual)[source]¶ Convert a fuel-type/state response into a clean dataframe.
- Parameters
response_fuel_state_annual (api response) – an EIA API response which contains state-level series including monthly fuel cost data.
- Returns
a dataframe containing state-level monthly fuel cost. The table contains the following columns, some of which are reference columns: ‘report_date’, ‘fuel_cost_per_unit’, ‘state’, ‘fuel_type_code_pudl’, ‘units’ (ref), ‘series_id’ (ref), ‘name’ (ref).
- Return type
-
pudl.output.eia923.
fuel_receipts_costs_eia923
(pudl_engine, freq=None, start_date=None, end_date=None, fill=False, roll=False)[source]¶ Pull records from the fuel_receipts_costs_eia923 table in a given date range.
Optionally, aggregate the records at a monthly or longer timescale, as well as by fuel type within a plant, by setting freq to something other than the default None value.
If the records are not being aggregated, then all of the fields found in the PUDL database are available. If they are being aggregated, then the following fields are preserved, and appropriately summed or re-calculated based on the specified aggregation. In both cases, new total values are calculated, for total fuel heat content and total fuel cost.
plant_id_eia
report_date
fuel_type_code_pudl (formerly energy_source_simple)
fuel_qty_units (sum)
fuel_cost_per_mmbtu (weighted average)
total_fuel_cost (sum)
fuel_consumed_mmbtu (sum)
heat_content_mmbtu_per_unit (weighted average)
sulfur_content_pct (weighted average)
ash_content_pct (weighted average)
moisture_content_pct (weighted average)
mercury_content_ppm (weighted average)
chlorine_content_ppm (weighted average)
In addition, plant and utility names and IDs are pulled in from the EIA 860 tables.
Optionally fill in missing fuel costs based on monthly state averages which are pulled from the EIA’s open data API, and/or use a rolling average to fill in gaps in the fuel costs. These behaviors are controlled by the fill and roll parameters. If you set fill=True you need to ensure that you have stored your API key in an environment variable named API_KEY_EIA. You can register for a free EIA API key here: https://www.eia.gov/opendata/register.php
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
freq (str) – a pandas timeseries offset alias. The original data is reported monthly, so the best time frequencies to use here are probably month start (freq=’MS’) and year start (freq=’YS’).
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
fill (boolean) – if set to True, fill in missing coal, gas and oil fuel cost per mmbtu from EIA’s API. This fills with monthly state-level averages.
roll (boolean) – if set to True, apply a rolling average to a subset of output table’s columns (currently only ‘fuel_cost_per_mmbtu’ for the frc table).
- Returns
A DataFrame containing all records from the EIA 923 Fuel Receipts and Costs table.
- Return type
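A hedged usage sketch of this function, with a hypothetical local database path and a placeholder API key; the signature and the API_KEY_EIA requirement are as documented above:

```python
import os
import sqlalchemy as sa
from pudl.output.eia923 import fuel_receipts_costs_eia923

# Hypothetical path to a local PUDL SQLite database.
engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")

# fill=True requires an EIA API key in this environment variable.
os.environ["API_KEY_EIA"] = "your-eia-api-key"  # placeholder

frc = fuel_receipts_costs_eia923(
    engine,
    freq="MS",                # aggregate to month start
    start_date="2017-01-01",
    end_date="2019-12-31",
    fill=True,                # fill missing costs from EIA state averages
    roll=True,                # smooth remaining gaps with a rolling average
)
```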
-
pudl.output.eia923.
generation_eia923
(pudl_engine, freq=None, start_date=None, end_date=None)[source]¶ Pull records from the generation_eia923 table in a given date range.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
freq (str) – a pandas timeseries offset alias. The original data is reported monthly, so the best time frequencies to use here are probably month start (freq=’MS’) and year start (freq=’YS’).
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all records from the EIA 923 Generation table.
- Return type
-
pudl.output.eia923.
generation_fuel_eia923
(pudl_engine, freq=None, start_date=None, end_date=None)[source]¶ Pull records from the generation_fuel_eia923 table in given date range.
Optionally, aggregate the records over some timescale – monthly, yearly, quarterly, etc. as well as by fuel type within a plant.
If the records are not being aggregated, all of the database fields are available. If they’re being aggregated, then we preserve the following fields. Per-unit values are re-calculated based on the aggregated totals. Totals are summed across whatever time range is being used, within a given plant and fuel type.
plant_id_eia
report_date
fuel_type_code_pudl
fuel_consumed_units
fuel_consumed_for_electricity_units
fuel_mmbtu_per_unit
fuel_consumed_mmbtu
fuel_consumed_for_electricity_mmbtu
net_generation_mwh
In addition, plant and utility names and IDs are pulled in from the EIA 860 tables.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – SQLAlchemy connection engine for the PUDL DB.
freq (str) – a pandas timeseries offset alias. The original data is reported monthly, so the best time frequencies to use here are probably month start (freq=’MS’) and year start (freq=’YS’).
start_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
end_date (date-like) – date-like object, including a string of the form ‘YYYY-MM-DD’ which will be used to specify the date range of records to be pulled. Dates are inclusive.
- Returns
A DataFrame containing all records from the EIA 923 Generation Fuel table.
- Return type
-
pudl.output.eia923.
get_fuel_cost_avg_eiaapi
(fuel_cost_cat_ids)[source]¶ Get a dataframe of state-level average fuel costs from EIA’s API.
- Parameters
fuel_cost_cat_ids (list) – list of category ids. Known/testing working ids are stored in FUEL_COST_CATEGORIES_EIAAPI.
- Returns
a dataframe containing state-level monthly fuel cost. The table contains the following columns, some of which are reference columns: ‘report_date’, ‘fuel_cost_per_unit’, ‘state’, ‘fuel_type_code_pudl’, ‘units’ (ref), ‘series_id’ (ref), ‘name’ (ref).
- Return type
-
pudl.output.eia923.
grab_fuel_state_monthly
(cat_id)[source]¶ Grab an API response for monthly fuel costs for one fuel category.
The data we want from EIA is in monthly, state-level series for each fuel type. For each fuel category, there are at least 51 embedded child series. This function compiles one fuel type’s child categories into one request. The resulting API response should contain a list of series responses from each state which we can convert into a pandas.DataFrame using convert_cost_json_to_df.
- Parameters
cat_id (int) – category id for one fuel type. Known working ids are stored in FUEL_COST_CATEGORIES_EIAAPI.
-
pudl.output.eia923.
make_url_cat_eiaapi
(category_id)[source]¶ Generate a url for a category from EIA’s API.
Requires an environment variable named API_KEY_EIA to be set, containing a valid EIA API key, which you can obtain from: https://www.eia.gov/opendata/register.php
Routines that provide user-friendly access to the partitioned EPA CEMS dataset.
-
pudl.output.epacems.
get_plant_states
(plant_ids, pudl_out)[source]¶ Determine what set of states a given set of EIA plant IDs are within.
If you only want to select data about a particular set of power plants from the EPA CEMS data, this is useful for identifying which partitions of the Parquet dataset you will need to search.
- Parameters
plant_ids (iterable) – A collection of integers representing valid plant_id_eia values within the PUDL DB.
pudl_out (pudl.output.pudltabl.PudlTabl) – A PudlTabl output object to use to access the PUDL DB.
- Returns
A list containing the 2-letter state abbreviations for any state that was found in association with one or more of the plant_ids.
- Return type
-
pudl.output.epacems.
get_plant_years
(plant_ids, pudl_out)[source]¶ Determine which years a given set of EIA plant IDs appear in.
If you only want to select data about a particular set of power plants from the EPA CEMS data, this is useful for identifying which partitions of the Parquet dataset you will need to search.
NOTE: the EIA-860 and EIA-923 data which are used here don’t cover as many years as the EPA CEMS, so this is probably of limited utility – you may want to simply include all years, or manually specify the years of interest instead.
- Parameters
plant_ids (iterable) – A collection of integers representing valid plant_id_eia values within the PUDL DB.
pudl_out (pudl.output.pudltabl.PudlTabl) – A PudlTabl output object to use to access the PUDL DB.
- Returns
A list containing the 4-digit integer years found in association with one or more of the plant_ids.
- Return type
-
pudl.output.epacems.
year_state_filter
(years=(), states=())[source]¶ Create filters to read given years and states from partitioned parquet dataset.
A subset of an Apache Parquet dataset can be read in more efficiently if files which don’t need to be queried are avoided. Some datasets are partitioned based on the values of columns to make this easier. The EPA CEMS dataset which we publish is partitioned by state and report year.
However, the way the filters are specified can be unintuitive. They use DNF (disjunctive normal form). See this blog post for more details:
https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce
This function takes a set of years, and a set of states, and returns a list of lists of tuples, appropriate for use with the read_parquet() methods of pandas and dask dataframes. The filter will include all combinations of the specified years and states. E.g. if years=(2018, 2019) and states=(“CA”, “CO”) then the filter would result in getting 2018 and 2019 data for CO, as well as 2018 and 2019 data for CA.
- Parameters
years (iterable) – 4-digit integers indicating the years of data you would like to read. By default it includes all years.
states (iterable) – 2-letter state abbreviations indicating what states you would like to include. By default it includes all states.
- Returns
A list of lists of tuples, suitable for use as a filter in the read_parquet method of pandas and dask dataframes.
- Return type
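To make the DNF structure concrete, here is a sketch of the kind of filter this function builds and how it might be passed to pandas; the partition column names (“year”, “state”) and the dataset path are assumptions for illustration:

```python
import pandas as pd

years = (2018, 2019)
states = ("CA", "CO")

# One AND-group (inner list) per year/state combination, OR-ed together
# by the outer list -- disjunctive normal form.
filters = [
    [("year", "=", year), ("state", "=", state)]
    for year in years
    for state in states
]
# [[('year', '=', 2018), ('state', '=', 'CA')], ... four groups in all]

epacems = pd.read_parquet(
    "/path/to/hourly_emissions_epacems",  # hypothetical dataset path
    filters=filters,
)
```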
Functions for pulling FERC Form 1 data out of the PUDL DB.
-
pudl.output.ferc1.
fuel_by_plant_ferc1
(pudl_engine, thresh=0.5)[source]¶ Summarize FERC fuel data by plant for output.
This is mostly a wrapper around
pudl.transform.ferc1.fuel_by_plant_ferc1()
which calculates some summary values on a per-plant basis (as indicated byutility_id_ferc1
andplant_name_ferc1
) related to fuel consumption.- Parameters
pudl_engine (sqlalchemy.engine.Engine) – Engine for connecting to the PUDL database.
thresh (float) – Minimum fraction of fuel (cost and mmbtu) required in order for a plant to be assigned a primary fuel. Must be between 0.5 and 1.0. The default value is 0.5.
- Returns
A DataFrame with fuel use summarized by plant.
- Return type
-
pudl.output.ferc1.
fuel_ferc1
(pudl_engine)[source]¶ Pull a useful dataframe related to FERC Form 1 fuel information.
This function pulls the FERC Form 1 fuel data, and joins in the name of the reporting utility, as well as the PUDL IDs for that utility and the plant, allowing integration with other PUDL tables.
Useful derived values include:
fuel_consumed_mmbtu
(total fuel heat content consumed)fuel_consumed_total_cost
(total cost of that fuel)
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – Engine for connecting to the PUDL database.
- Returns
A DataFrame containing useful FERC Form 1 fuel information.
- Return type
-
pudl.output.ferc1.
plant_in_service_ferc1
(pudl_engine)[source]¶ Pull a dataframe of FERC Form 1 Electric Plant in Service data.
-
pudl.output.ferc1.
plants_hydro_ferc1
(pudl_engine)[source]¶ Pull a useful dataframe related to the FERC Form 1 hydro plants.
-
pudl.output.ferc1.
plants_pumped_storage_ferc1
(pudl_engine)[source]¶ Pull a dataframe of FERC Form 1 Pumped Storage plant data.
-
pudl.output.ferc1.
plants_small_ferc1
(pudl_engine)[source]¶ Pull a useful dataframe related to the FERC Form 1 small plants.
-
pudl.output.ferc1.
plants_steam_ferc1
(pudl_engine)[source]¶ Select and join some useful fields from the FERC Form 1 steam table.
Select the FERC Form 1 steam plant table entries, add in the reporting utility’s name, and the PUDL ID for the plant and utility for readability and integration with other tables that have PUDL IDs.
Also calculates capacity_factor (based on net_generation_mwh & capacity_mw).
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – Engine for connecting to the PUDL database.
- Returns
A DataFrame containing useful fields from the FERC Form 1 steam table.
- Return type
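As a rough illustration of the derived value mentioned above, capacity factor is the generation actually produced divided by what the nameplate capacity could have produced over the period. This sketch assumes annual records and approximates a year as 8760 hours, whereas the library accounts for actual period lengths:

```python
def capacity_factor_sketch(steam_df):
    """Illustrative capacity factor from net_generation_mwh and capacity_mw."""
    hours_per_year = 8760  # approximation; ignores leap years
    return steam_df["net_generation_mwh"] / (steam_df["capacity_mw"] * hours_per_year)
```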
-
pudl.output.ferc1.
plants_utils_ferc1
(pudl_engine)[source]¶ Build a dataframe of useful FERC Plant & Utility information.
- Parameters
pudl_engine (sqlalchemy.engine.Engine) – Engine for connecting to the PUDL database.
- Returns
A DataFrame containing useful FERC Form 1 Plant and Utility information.
- Return type
Functions & classes for compiling derived aspects of the FERC Form 714 data.
-
pudl.output.ferc714.
ASSOCIATIONS
: List[Dict[str, Any]] = [{'id': 56669, 'from': 2011, 'to': [2009, 2010]}, {'id': 59504, 'from': 2014, 'to': [2006, 2009], 'exclude': ['NE']}, {'id': 59504, 'from': 2014, 'to': [2010, 2013]}, {'id': 11249, 'from': 2014, 'to': [2006, 2013]}, {'id': 12506, 'from': 2012, 'to': [2013, 2013]}, {'id': 829, 'from': 2008, 'to': [2009, 2013]}, {'id': 14725, 'from': 2011, 'to': [2006, 2010]}, {'id': 16534, 'from': 2013, 'to': [2012, 2012]}, {'id': 17718, 'from': 2010, 'to': [2006, 2009]}, {'id': 13407, 'from': 2009, 'to': [2006, 2008]}, {'id': 13407, 'from': 2013, 'to': [2014, 2019]}]¶ Adjustments to balancing authority-utility associations from EIA 861.
The changes are applied locally to EIA 861 tables.
id (int): EIA balancing authority identifier (balancing_authority_id_eia).
from (int): Reference year, to use as a template for target years.
to (List[int]): Target years, in the closed interval format [minimum, maximum]. Rows in balancing_authority_eia861 are added (if missing) for every target year with the attributes from the reference year. Rows in balancing_authority_assn_eia861 are added (or replaced, if existing) for every target year with the utility associations from the reference year. Rows in service_territory_eia861 are added (if missing) for every target year with the nearest year’s associated utilities’ counties.
exclude (Optional[List[str]]): Utilities to exclude, by state (two-letter code). Rows are excluded from balancing_authority_assn_eia861 with target year and state.
-
class
pudl.output.ferc714.
Respondents
(pudl_out, pudl_settings=None, ba_ids=None, util_ids=None, priority='balancing_authority', limit_by_state=True)[source]¶ Bases:
object
A class coordinating compilation of data related to FERC 714 Respondents.
The FERC 714 Respondents themselves are not complex as they are reported, but various ambiguities and the need to associate service territories with them mean there are a lot of different derived aspects related to them which we repeatedly need to compile in a self consistent way. This class allows you to choose several parameters for that compilation, and then easily access the resulting derived tabular outputs.
Some of these derived attributes are computationally expensive, and so they are cached internally. You can force a new computation in most cases by using
update=True
in the access methods. However, this functionality isn’t totally implemented because we’re still depending on the interim ETL processes for the FERC 714 and EIA 861 data, and we don’t want to trigger whole new ETL runs every time a derived value is updated.-
pudl_out
¶ The PUDL output object which should be used to obtain PUDL data.
-
pudl_settings
¶ A dictionary of settings indicating where data related to PUDL can be found. Needed to obtain US Census DP1 data which has the county geometries.
-
ba_ids
¶ EIA IDs that should be treated as referring to balancing authorities in respondent categorization process. If None, all known values of
balancing_authority_id_eia
will be used.- Type
ordered collection or None
-
util_ids
¶ EIA IDs that should be treated as referring to utilities in respondent categorization process. If None, all known values of
utility_id_eia
will be used.- Type
ordered collection or None
-
priority
Which type of entity should take priority in the categorization of FERC 714 respondents. Must be either utility or balancing_authority. The default is balancing_authority.
- Type
-
limit_by_state
¶ Whether to limit respondent service territories to the states where they have documented activity in the EIA 861. Currently this is only implemented for Balancing Authorities.
- Type
-
annualize
(update=False)[source]¶ Broadcast respondent data across all years with reported demand.
The FERC 714 Respondent IDs and names are reported in their own table, without any reference to individual years, but much of the information we are associating with them varies annually. This method creates an annualized version of the respondent table, with each respondent having an entry corresponding to every year in which hourly demand was reported in the FERC 714 dataset as a whole – this necessarily means that many of the respondents will end up having entries for years in which they reported no demand, and that’s fine. They can be filtered later.
Modified balancing_authority_assn_eia861 table.
Modified balancing_authority_eia861 table.
-
categorize
(update=False)[source]¶ Annualized respondents with
respondent_type
assigned if possible.Categorize each respondent as either a
utility
or abalancing_authority
using the parameters stored in the instance of the class. While categorization can also be done without annualizing, this function annualizes as well, since we are adding therespondent_type
in order to be able to compile service territories for the respondent, which vary annually.
-
fipsify
(update=False)[source]¶ Annual respondents with the county FIPS IDs for their service territories.
Given the respondent_type associated with each respondent (either utility or balancing_authority) compile a list of counties that are part of their service territory on an annual basis, and merge those into the annualized respondent table. This results in a very long dataframe, since there are thousands of counties and many of them are served by more than one entity.
Currently respondents categorized as utility will include any county that appears in the service_territory_eia861 table in association with that utility ID in each year, while for balancing_authority respondents, some counties can be excluded based on state (if self.limit_by_state==True).
-
georef_counties
(update=False)[source]¶ Annual respondents with all associated county-level geometries.
Given the county FIPS codes associated with each respondent in each year, pull in associated geometries from the US Census DP1 dataset, so we can do spatial analyses. This keeps each county record independent – so there will be many records for each respondent in each year. This is fast, and still good for mapping, and retains all of the FIPS IDs so you can also still do ID based analyses.
-
georef_respondents
(update=False)[source]¶ Annual respondents with a single all-encompassing geometry for each year.
Given the county FIPS codes associated with each respondent in each year, compile a geometry for the respondent’s entire service territory annually. This results in just a single record per respondent per year, but is computationally expensive and you lose the information about which counties are associated with the respondent in that year. But it’s useful for merging in other annual data like total demand, so you can see which respondent-years have both reported demand and decent geometries, calculate their areas to see if something changed from year to year, etc.
-
property
service_territory_eia861
¶ Modified service_territory_eia861 table.
-
summarize_demand
(update=False)[source]¶ Compile annualized, categorized respondents and summarize values.
Calculated summary values include:
Total reported electricity demand per respondent (demand_annual_mwh)
Reported per-capita electricity demand (demand_annual_per_capita_mwh)
Population density (population_density_km2)
Demand density (demand_density_mwh_km2)
These metrics are helpful in identifying suspicious changes in the compiled annual geometries for the planning areas.
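The summary metrics above reduce to simple ratios. A sketch, assuming (hypothetically) that population and area_km2 columns have already been merged in from the census-based service territory geometries:

```python
def demand_metrics_sketch(df):
    """Illustrative per-capita and density metrics for annual demand."""
    out = df.copy()
    out["demand_annual_per_capita_mwh"] = out["demand_annual_mwh"] / out["population"]
    out["population_density_km2"] = out["population"] / out["area_km2"]
    out["demand_density_mwh_km2"] = out["demand_annual_mwh"] / out["area_km2"]
    return out
```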
-
-
pudl.output.ferc714.
UTILITIES
: List[Dict[str, Any]] = [{'id': 14328, 'reassign': True}, {'id': 16609, 'reassign': True}, {'id': 4922, 'reassign': True}, {'id': 4254}]¶ Balancing authorities to treat as utilities in associations from EIA 861.
The changes are applied locally to EIA 861 tables.
id (int): EIA balancing authority (BA) identifier (balancing_authority_id_eia). Rows for id are removed from balancing_authority_eia861.
reassign (Optional[bool]): Whether to reassign utilities to parent BAs. Rows for id as BA in balancing_authority_assn_eia861 are removed. Utilities assigned to id for a given year are reassigned to the BAs for which id is an associated utility.
replace (Optional[bool]): Whether to remove rows where id is a utility in balancing_authority_assn_eia861. Applies only if reassign=True.
-
pudl.output.ferc714.
add_dates
(rids_ferc714, report_dates)[source]¶ Broadcast respondent data across dates.
- Parameters
rids_ferc714 (pandas.DataFrame) – A simple FERC 714 Respondent ID dataframe, without any date information.
report_dates (ordered collection of datetime) – Dates for which each respondent should be given a record.
- Raises
ValueError – if a report_date column exists in rids_ferc714.
- Returns
Dataframe having all the same columns as the input rids_ferc714 with the addition of a report_date column, but with all records associated with each respondent_id_ferc714 duplicated on a per-date basis.
- Return type
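The broadcast described here is effectively a cross join between respondents and report dates. A minimal sketch, not the library implementation, matching the documented ValueError behavior:

```python
import pandas as pd

def add_dates_sketch(rids_ferc714, report_dates):
    """Give every respondent one record per report date."""
    if "report_date" in rids_ferc714.columns:
        raise ValueError("rids_ferc714 must not already have a report_date column.")
    dates = pd.DataFrame({"report_date": pd.to_datetime(list(report_dates))})
    # Cross join: every respondent paired with every report date.
    return rids_ferc714.merge(dates, how="cross")
```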
-
pudl.output.ferc714.
categorize_eia_code
(eia_codes, ba_ids, util_ids, priority='balancing_authority')[source]¶ Categorize FERC 714
eia_codes
as either balancing authority or utility IDs.Most FERC 714 respondent IDs are associated with an
eia_code
which refers to either abalancing_authority_id_eia
or autility_id_eia
but no indication as to which type of ID each one is. This is further complicated by the fact that EIA uses the same numerical ID to refer to the same entity in most but not all cases, when that entity acts as both a utility and as a balancing authority.This function associates a
respondent_type
ofutility
,balancing_authority
orpandas.NA
with each inputeia_code
using the following rules: * If aeia_code
appears only inutil_ids
therespondent_type
will beutility
. * Ifeia_code
appears only inba_ids
therespondent_type
will be assignedbalancing_authority
. * Ifeia_code
appears in neither set of IDs,respondent_type
will be assignedpandas.NA
. * Ifeia_code
appears in both sets of IDs, then whicheverrespondent_type
has been selected with thepriority
flag will be assigned.Note that the vast majority of
balancing_authority_id_eia
values also show up asutility_id_eia
values, but only a small subset of theutility_id_eia
values are associated with balancing authorities. If you usepriority="utility"
you should probably also be specifically compiling the list of Utility IDs because you know they should take precedence. If you use utility priority with all utility IDs- Parameters
eia_codes (ordered collection of ints) – A collection of IDs which may be either associated with EIA balancing authorities or utilities, to be categorized.
ba_ids (ordered collection of ints) – A collection of IDs which should be interpreted as belonging to EIA Balancing Authorities.
util_ids (ordered collection of ints) – A collection of IDs which should be interpreted as belonging to EIA Utilities.
priority (str) – Which respondent_type to give priority to if the eia_code shows up in both util_ids and ba_ids. Must be one of “utility” or “balancing_authority”. The default is “balancing_authority”.
- Returns
A dataframe containing 2 columns: eia_code and respondent_type.
- Return type
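The four categorization rules above can be restated in a few lines of Python. This illustrative sketch mirrors the documented behavior but is not the library code:

```python
import pandas as pd

def categorize_sketch(eia_codes, ba_ids, util_ids, priority="balancing_authority"):
    """Assign a respondent_type to each eia_code per the rules above."""
    ba_ids, util_ids = set(ba_ids), set(util_ids)
    records = []
    for code in eia_codes:
        in_ba, in_util = code in ba_ids, code in util_ids
        if in_ba and in_util:
            rtype = priority          # both: the priority flag decides
        elif in_ba:
            rtype = "balancing_authority"
        elif in_util:
            rtype = "utility"
        else:
            rtype = pd.NA             # in neither set of IDs
        records.append({"eia_code": code, "respondent_type": rtype})
    return pd.DataFrame(records)
```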
This module provides a class enabling tabular compilations from the PUDL DB.
Many of our potential users are comfortable using spreadsheets, not databases, so we are creating a collection of tabular outputs that contain the most useful core information from the PUDL data packages, including additional keys and human readable names for the objects (utilities, plants, generators) being described in the table.
These tabular outputs can be joined with each other using those keys, and used as a data source within Microsoft Excel, Access, R Studio, or other data analysis packages that folks may be familiar with. They aren’t meant to completely replicate all the data and relationships contained within the full PUDL database, but should serve as a generally usable set of PUDL data products.
The PudlTabl class can also provide access to complex derived values, like the generator and plant level marginal cost of electricity (MCOE), which are defined in the analysis module.
In the long run, this is probably a kind of prototype for pre-packaged API outputs or data products that we might want to be able to provide to users a la carte.
Todo
Return to for update arg and returns values in functions below
-
class
pudl.output.pudltabl.
PudlTabl
(pudl_engine, ds=None, freq=None, start_date=None, end_date=None, fill_fuel_cost=False, roll_fuel_cost=False, fill_net_gen=False)[source]¶ Bases:
object
A class for compiling common useful tabular outputs from the PUDL DB.
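A hedged usage sketch: create a PudlTabl against a local PUDL SQLite database (the path here is hypothetical) and pull a couple of the outputs documented below. The constructor arguments match the signature above:

```python
import sqlalchemy as sa
from pudl.output.pudltabl import PudlTabl

# Hypothetical path to a local PUDL SQLite database.
engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")

pudl_out = PudlTabl(
    engine,
    freq="YS",               # aggregate monthly data to annual
    start_date="2015-01-01",
    end_date="2019-12-31",
)

gens = pudl_out.gens_eia860()  # cached after the first call
mcoe = pudl_out.mcoe()         # derived generator-level MCOE
```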
-
bf_eia923
(update=False)[source]¶ Pull EIA 923 boiler fuel consumption data.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
bga_eia860
(update=False)[source]¶ Pull a dataframe of boiler-generator associations from EIA 860.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
capacity_factor
(update=False, min_cap_fact=None, max_cap_fact=None)[source]¶ Calculate and return generator level capacity factors.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
etl_eia861
(update=False)[source]¶ A single function that runs the temporary EIA 861 ETL and sets all DFs.
This is an interim solution that provides a (somewhat) standard way of accessing the EIA 861 data prior to its being fully integrated into the PUDL database. If any one of the dataframes is accessed, all of them are populated. Only the tables that have actual transform functions are included, and as new transform functions are completed, they would need to be added to the list below. Surely there is a way to do this automatically / magically but that’s beyond my knowledge right now.
- Parameters
update (bool) – Whether to overwrite the existing dataframes if they exist.
-
etl_ferc714
(update=False)[source]¶ A single function that runs the temporary FERC 714 ETL and sets all DFs.
This is an interim solution, so that we can have a (relatively) standard way of accessing the FERC 714 data prior to getting it integrated into the PUDL DB. Some of these are not yet cleaned up, but there are dummy transform functions which pass through the raw DFs with some minor alterations, so all the data is available as it exists right now.
An attempt to access any of the dataframes results in all of them being populated, since generating all of them is almost the same amount of work as generating one of them.
- Parameters
update (bool) – Whether to overwrite the existing dataframes if they exist.
-
fbp_ferc1
(update=False)[source]¶ Summarize FERC Form 1 fuel usage by plant.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
frc_eia923
(update=False)[source]¶ Pull EIA 923 fuel receipts and costs data.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
fuel_cost
(update=False)[source]¶ Calculate and return generator level fuel costs per MWh.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
fuel_ferc1
(update=False)[source]¶ Pull the FERC Form 1 steam plants fuel consumption data.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
gen_allocated_eia923
(update=False)[source]¶ Net generation from gen fuel table allocated to generators.
-
gen_eia923
(update=False)[source]¶ Pull EIA 923 net generation data by generator.
Net generation is reported in two separate tables in EIA 923: the generation_eia923 and generation_fuel_eia923 tables. While the generation_fuel_eia923 table is more complete (the generation_eia923 table includes only ~55% of the reported MWhs), the generation_eia923 table is more granular (it is reported at the generator level).
This method either grabs the generation_eia923 table that is reported by generator, or allocates net generation from the generation_fuel_eia923 table to the generator level.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
gen_original_eia923
(update=False)[source]¶ Pull the original EIA 923 net generation data by generator.
-
gens_eia860
(update=False, unit_ids=False)[source]¶ Pull a dataframe describing generators, as reported in EIA 860.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
gf_eia923
(update=False)[source]¶ Pull EIA 923 generation and fuel consumption data.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
hr_by_gen
(update=False)[source]¶ Calculate and return generator level heat rates (mmBTU/MWh).
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
hr_by_unit
(update=False)[source]¶ Calculate and return generation unit level heat rates.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
mcoe
(update=False, min_heat_rate=5.5, min_fuel_cost_per_mwh=0.0, min_cap_fact=0.0, max_cap_fact=1.5, all_gens=True)[source]¶ Calculate and return generator level MCOE based on EIA data.
Eventually this calculation will include non-fuel operating expenses as reported in FERC Form 1, but for now only the fuel costs reported to EIA are included. They are attributed based on the unit-level heat rates and fuel costs.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
min_heat_rate – lowest plausible heat rate, in mmBTU/MWh. Any MCOE records with lower heat rates are presumed to be invalid, and are discarded before returning.
min_cap_fact – minimum generator capacity factor. Generator records with a lower capacity factor will be filtered out before returning. This allows the user to exclude generators that aren’t being used enough to have valid operational data.
min_fuel_cost_per_mwh – minimum fuel cost on a per MWh basis that is required for a generator record to be considered valid. For some reason there are now a large number of $0 fuel cost records, which previously would have been NaN.
max_cap_fact – maximum generator capacity factor. Generator records with a higher capacity factor will be filtered out before returning. This allows the user to exclude records with implausibly high reported utilization.
all_gens (bool) – Controls whether the output contains records for all generators in the generators_eia860 table, or only those generators with associated MCOE data. True by default.
- Returns
a compilation of generator attributes, including fuel costs per MWh.
- Return type
-
own_eia860
(update=False)[source]¶ Pull a dataframe of generator level ownership data from EIA 860.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plant_in_service_ferc1
(update=False)[source]¶ Pull the FERC Form 1 Plant in Service Table.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plants_eia860
(update=False)[source]¶ Pull a dataframe of plant level info reported in EIA 860.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plants_hydro_ferc1
(update=False)[source]¶ Pull the FERC Form 1 Hydro Plants Table.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plants_pumped_storage_ferc1
(update=False)[source]¶ Pull the FERC Form 1 Pumped Storage Table.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plants_small_ferc1
(update=False)[source]¶ Pull the FERC Form 1 Small Plants Table.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
plants_steam_ferc1
(update=False)[source]¶ Pull the FERC Form 1 steam plants data.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
pu_eia860
(update=False)[source]¶ Pull a dataframe of EIA plant-utility associations.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
pu_ferc1
(update=False)[source]¶ Pull a dataframe of FERC plant-utility associations.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
-
purchased_power_ferc1
(update=False)[source]¶ Pull the FERC Form 1 Purchased Power Table.
- Parameters
update (bool) – If true, re-calculate the output dataframe, even if a cached version exists.
- Returns
a denormalized table for interactive use.
- Return type
Useful post-processing and denormalized outputs based on PUDL.
The datapackages which are output by the PUDL ETL pipeline are well normalized and suitable for use as relational database tables. This minimizes data duplication and helps avoid many kinds of data corruption and the potential for internal inconsistency. However, that’s not always the easiest kind of data to work with. Sometimes we want all the names and IDs in a single dataframe or table, for human readability. Sometimes you want the useful derived values.
This subpackage compiles a bunch of outputs we found we were commonly generating, so that they can be done automatically and uniformly. They are encapsulated within the pudl.output.pudltabl.PudlTabl class.
pudl.transform package¶
Code for transforming EIA data that pertains to more than one EIA Form.
This module helps normalize EIA datasets and infers additional connections between EIA entities (i.e. utilities, plants, units, generators…). This includes:
compiling a master list of plant, utility, boiler, and generator IDs that appear in any of the EIA 860 or 923 tables.
inferring more complete boiler-generator associations.
differentiating between static and time varying attributes associated with the EIA entities, storing the static fields with the entity table, and the variable fields in an annual table.
The boiler generator association inference (bga) takes the associations provided by the EIA 860 and expands on them using several methods which can be found in pudl.transform.eia._boiler_generator_assn().
pudl.transform.eia.harvesting(entity, eia_transformed_dfs, entities_dfs, eia860_ytd=False, debug=False)[source]¶ Compiles consistent records for various entities.
For each entity (plants, generators, boilers, utilities), this function finds all the harvestable columns from any table in which the entity appears. It then determines how consistent the records are and keeps the values that are mostly consistent. It compiles those consistent records into one normalized table.
There are a few things to note here. First, we are not expecting the outcome to be perfect! We choose to pull the most consistent record as reported across all the EIA tables and years, but we also require a “strictness” level of 70% (currently a hard-coded argument to _occurrence_consistency). That means at least 70% of the records must agree for us to use that value; if values for an entity haven’t been reported 70% consistently, the value will show up as null. We built in the ability to add special cases for columns where we want to apply a different method, but the only ones we added were for latitude and longitude, because they are by far the dirtiest. (A toy sketch of the consistency rule follows the Todo list below.)
We have determined which columns should be considered “static” or “annual”. These can be found in constants in the entities dictionary. Static means the value should not change over time; annual means there is annual variability. This distinction was made in part by testing the consistency and in part by an understanding of how the entities and columns relate in the real world.
- Parameters
entity (str) – plants, generators, boilers, utilities
eia_transformed_dfs (dict) – A dictionary of tbl names (keys) and transformed dfs (values)
entities_dfs (dict) – A dictionary of entity table names (keys) and entity dfs (values)
eia860_ytd (boolean) – if True, the ETL run is attempting to include year-to-date updates from EIA 860M.
debug (bool) – If True, this function will also return an additional dictionary of dataframes that includes the pre-deduplicated compiled records, with the number of occurrences of each entity and record, so the consistency of reported values can be inspected.
- Returns
A tuple containing:
eia_transformed_dfs (dict): dictionary of table names (keys) and transformed dfs (values)
entity_dfs (dict): dictionary of entity table names (keys) and entity dfs (values)
- Return type
tuple
- Raises
AssertionError – If the consistency of any record value is <90%.
Todo
Return to role of debug.
Determine what to do with null records
Determine how to treat mostly static records
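A toy illustration of the 70% consistency rule described above (made-up data, not PUDL’s actual implementation):

import pandas as pd

# Values reported for one generator attribute across several tables and years:
reported = pd.Series(["coal", "coal", "coal", "gas"])
shares = reported.value_counts(normalize=True)
most_common, share = shares.index[0], shares.iloc[0]

# Keep the modal value only if at least 70% of the reports agree;
# otherwise the harvested value is null.
harvested = most_common if share >= 0.70 else pd.NA
print(harvested)  # "coal" -- 75% of the reports agree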
pudl.transform.eia.transform(eia_transformed_dfs, eia860_years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia923_years=(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), eia860_ytd=False, debug=False)[source]¶ Creates DataFrames for EIA Entity tables and modifies EIA tables.
This function coordinates two main actions: generating the entity tables via harvesting() and generating the boiler generator associations via _boiler_generator_assn(). There is also some removal of tables that are no longer needed after the entity harvesting is finished.
- Parameters
eia_transformed_dfs (dict) – a dictionary of table names (keys) and transformed dataframes (values).
eia860_years (list) – a list of years for EIA 860, must be continuous, and only include working years.
eia923_years (list) – a list of years for EIA 923, must be continuous, and include only working years.
eia860_ytd (boolean) – if True, the ETL run is attempting to include year-to-date updates from EIA 860M.
debug (bool) – if True, informational columns will be added into boiler_generator_assn.
- Returns
two dictionaries having table names as keys and dataframes as values: one for the entity tables and one for the transformed EIA dataframes
- Return type
tuple
Module to perform data cleaning functions on EIA860 data tables.
pudl.transform.eia860.OWNERSHIP_PLANT_GEN_ID_DUPES = [(56032, '1')]¶ EIA Plant IDs which have duplicate generators within the ownership table due to the removal of leading zeroes from the generator IDs.
- Type
list
pudl.transform.eia860.boiler_generator_assn(eia860_dfs, eia860_transformed_dfs)[source]¶ Pull and transform the boiler generator association table.
Transformations include:
Drop non-data rows with EIA notes.
Drop duplicate rows.
- Parameters
eia860_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA860 form, as reported in the Excel spreadsheets they distribute.
eia860_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia860_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia860.generators(eia860_dfs, eia860_transformed_dfs)[source]¶ Pull and transform the generators table.
There are three tabs that the generator records come from (proposed, existing, retired). Pre-2009, the existing and retired data are lumped together in a single generator file with one tab. We pull each tab into one dataframe and include an operational_status column to indicate which tab the record came from. We use operational_status to parse the pre-2009 files as well.
Transformations include:
Replace . values with NA.
Update operational_status_code to reflect plant status as either proposed, existing or retired.
Drop values with NA for plant and generator id.
Replace 0 values with NA where appropriate.
Convert Y/N/X values to boolean True/False (a toy sketch follows this entry).
Convert U/Unknown values to NA.
Map full spelling onto code values.
Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.
- Parameters
eia860_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA860 form, as reported in the Excel spreadsheets they distribute.
eia860_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to a normalized DataFrame of values from that page (values).
- Returns
eia860_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
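A toy sketch of the Y/N/X boolean conversion listed above (illustrative only; treating "X" as False here is an assumption, not necessarily PUDL’s exact mapping):

import pandas as pd

s = pd.Series(["Y", "N", "X", "U", None])
# Unmapped values ("U"/Unknown, missing) become NA via the nullable boolean dtype.
as_bool = s.map({"Y": True, "N": False, "X": False}).astype("boolean")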
pudl.transform.eia860.ownership(eia860_dfs, eia860_transformed_dfs)[source]¶ Pull and transform the ownership table.
Transformations include:
Replace . values with NA.
Convert pre-2012 ownership percentages to proportions to match post-2012 reporting.
- Parameters
eia860_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA860 form, as reported in the Excel spreadsheets they distribute.
eia860_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia860_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia860.plants(eia860_dfs, eia860_transformed_dfs)[source]¶ Pull and transform the plants table.
Much of the static plant information is reported repeatedly, and scattered across several different pages of EIA 860. The data frame which this function uses is assembled from those many different pages, and passed in via the same dictionary of dataframes that all the other ingest functions use for uniformity.
Transformations include:
Replace . values with NA.
Homogenize spelling of county names.
Convert Y/N/X values to boolean True/False.
- Parameters
eia860_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA860 form, as reported in the Excel spreadsheets they distribute.
eia860_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia860_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia860.transform(eia860_raw_dfs, eia860_tables=('boiler_generator_assn_eia860', 'utilities_eia860', 'plants_eia860', 'generators_eia860', 'ownership_eia860'))[source]¶ Transform EIA 860 DataFrames.
- Parameters
- Returns
A dictionary of DataFrame objects in which pages from the EIA860 form (keys) correspond to normalized DataFrames of values from those pages (values).
- Return type
dict
pudl.transform.eia860.utilities(eia860_dfs, eia860_transformed_dfs)[source]¶ Pull and transform the utilities table.
Transformations include:
Replace . values with NA.
Fix typos in state abbreviations, convert to uppercase.
Drop address_3 field (all NA).
Combine phone number columns into one field and set values that don’t mimic real US phone numbers to NA.
Convert Y/N/X values to boolean True/False.
Map full spelling onto code values.
- Parameters
eia860_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA860 form, as reported in the Excel spreadsheets they distribute.
eia860_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia860_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA860 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
Module to perform data cleaning functions on EIA861 data tables.
All transformations include:
Replace . values with NA.
pudl.transform.eia861.advanced_metering_infrastructure(tfr_dfs)[source]¶ Transform the EIA 861 Advanced Metering Infrastructure table.
Transformations include:
Tidy data by customer class (a toy sketch follows this entry).
Drop the total_meters column (it is calculable from other fields).
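A toy sketch of what “tidying by customer class” means: reshaping one row with a column per class into one row per utility-class pair (the column names here are illustrative, not PUDL’s exact ones):

import pandas as pd

wide = pd.DataFrame({
    "utility_id_eia": [123],
    "customers_residential": [1000],
    "customers_commercial": [200],
})
tidy = wide.melt(
    id_vars="utility_id_eia",
    var_name="customer_class",
    value_name="customers",
)
tidy["customer_class"] = tidy["customer_class"].str.replace(
    "customers_", "", regex=False
)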
pudl.transform.eia861.balancing_authority(tfr_dfs)[source]¶ Transform the EIA 861 Balancing Authority table.
Transformations include:
Fill in balancing authority IDs based on date, utility ID, and BA Name.
Backfill balancing authority codes based on BA ID.
Fix BA code and ID typos.
pudl.transform.eia861.balancing_authority_assn(tfr_dfs)[source]¶ Compile a balancing authority, utility, state association table.
For the years up through 2012, the only BA-Util information that’s available comes from the balancing_authority_eia861 table, and it does not include any state-level information. However, there is utility-state association information in the sales_eia861 and other data tables.
For the years from 2013 onward, there’s explicit BA-Util-State information in the data tables (e.g. sales_eia861). These observed associations can be compiled to give us a picture of which BA-Util-State associations exist. However, we need to merge in the balancing authority IDs since the data tables only contain the balancing authority codes.
- Parameters
tfr_dfs (dict) – A dictionary of transformed EIA 861 dataframes. This must include any dataframes from which we want to compile BA-Util-State associations, which means this function has to be called after all the basic transform functions that depend only on a single raw table.
- Returns
a dictionary of transformed dataframes. This function both compiles the association table, and finishes the normalization of the balancing authority table. It may be that once the harvesting process incorporates the EIA 861, some or all of this functionality should be pulled into the phase-2 transform functions.
- Return type
dict
pudl.transform.eia861.demand_response(tfr_dfs)[source]¶ Transform the EIA 861 Demand Response table.
Transformations include:
Fill in NA balancing authority codes with UNK (because it’s part of the primary key).
Tidy subset of the data by customer class.
Drop duplicate rows based on primary keys.
Convert 1000s of dollars into dollars.
pudl.transform.eia861.demand_side_management(tfr_dfs)[source]¶ Transform the EIA 861 Demand Side Management table.
In 2013, the EIA changed the contents of the 861 form so that information pertaining to demand side management was no longer housed in a single table, but rather in two separate ones pertaining to energy efficiency and demand response. While the pre- and post-2013 tables contain similar information, a column in the pre-2013 demand side management table may not have an obvious equivalent in the post-2013 energy efficiency or demand response data. We’ve addressed this by keeping the demand side management, energy efficiency, and demand response tables separate. Use the DSM table for pre-2013 data and the EE / DR tables for post-2013 data. Despite the uncertainty of comparing across these years, the data are similar, and we hope to eventually provide a cohesive dataset combining all years and comparable columns.
Transformations include:
Clean up NERC codes and ensure one per row.
Remove demand_side_management and data_observed columns (they are all the same).
Tidy subset of the data by customer class.
Convert Y/N columns to booleans.
Convert 1000s of dollars into dollars.
pudl.transform.eia861.distributed_generation(tfr_dfs)[source]¶ Transform the EIA 861 Distributed Generation table.
Transformations include:
Map full spelling onto code values.
Convert pre-2010 percent values into MW values.
Remove total columns calculable with other fields.
Tidy subset of the data by tech class.
Tidy subset of the data by fuel class.
pudl.transform.eia861.distribution_systems(tfr_dfs)[source]¶ Transform the EIA 861 Distribution Systems table.
Transformations include:
No additional transformations.
pudl.transform.eia861.dynamic_pricing(tfr_dfs)[source]¶ Transform the EIA 861 Dynamic Pricing table.
Transformations include:
Tidy subset of the data by customer class.
Convert Y/N columns to booleans.
pudl.transform.eia861.energy_efficiency(tfr_dfs)[source]¶ Transform the EIA 861 Energy Efficiency table.
Transformations include:
Tidy subset of the data by customer class.
Drop website column (almost no valid information).
Convert 1000s of dollars into dollars.
pudl.transform.eia861.green_pricing(tfr_dfs)[source]¶ Transform the EIA 861 Green Pricing table.
Transformations include:
Tidy subset of the data by customer class.
Convert 1000s of dollars into dollars.
pudl.transform.eia861.mergers(tfr_dfs)[source]¶ Transform the EIA 861 Mergers table.
Transformations include:
Map full spelling onto code values.
Retain leading zeros in the zip code field.
pudl.transform.eia861.net_metering(tfr_dfs)[source]¶ Transform the EIA 861 Net Metering table.
Transformations include:
Remove rows with utility ids 99999.
Tidy subset of the data by customer class.
Tidy subset of the data by tech class.
pudl.transform.eia861.non_net_metering(tfr_dfs)[source]¶ Transform the EIA 861 Non-Net Metering table.
Transformations include:
Remove rows with utility ids 99999.
Drop duplicate rows.
Tidy subset of the data by customer class.
Tidy subset of the data by tech class.
pudl.transform.eia861.normalize_balancing_authority(tfr_dfs)[source]¶ Finish the normalization of the balancing_authority_eia861 table.
The balancing_authority_assn_eia861 table depends on information that is only available in the un-normalized form of the balancing_authority_eia861 table, and also on having access to a number of transformed data tables, so that it can compile the observed combinations of report dates, balancing authorities, states, and utilities. This means that we have to hold off on the final normalization of the balancing_authority_eia861 table until the rest of the transform process is over.
pudl.transform.eia861.operational_data(tfr_dfs)[source]¶ Transform the EIA 861 Operational Data table.
Transformations include:
Remove rows with utility ids 88888.
Remove rows with NA utility id.
Clean up NERC codes and ensure one per row.
Convert data_observed field I/O into boolean.
Tidy subset of the data by revenue class.
Convert 1000s of dollars into dollars.
pudl.transform.eia861.reliability(tfr_dfs)[source]¶ Transform the EIA 861 Reliability table.
Transformations include:
Tidy subset of the data by reliability standard.
Convert Y/N columns to booleans.
Map full spelling onto code values.
Drop duplicate rows.
pudl.transform.eia861.sales(tfr_dfs)[source]¶ Transform the EIA 861 Sales table.
Transformations include:
Remove rows with utility ids 88888 and 99999.
Tidy data by customer class.
Drop primary key duplicates.
Convert 1000s of dollars into dollars.
Convert data_observed field I/O into boolean.
Map full spelling onto code values.
pudl.transform.eia861.service_territory(tfr_dfs)[source]¶ Transform the EIA 861 utility service territory table.
Transformations include:
Homogenize spelling of county names.
Add field for state/county FIPS code (a toy sketch follows this entry).
- Parameters
tfr_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA861 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
a dictionary of pandas.DataFrame objects in which pages from the EIA861 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
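A toy sketch of attaching county FIPS codes, using the addfips package (shown for illustration as an assumption; not necessarily how PUDL implements this step):

import addfips
import pandas as pd

af = addfips.AddFIPS()
df = pd.DataFrame({"state": ["FL", "CA"], "county": ["Sarasota", "Alameda"]})
df["county_id_fips"] = [
    af.get_county_fips(county, state=state)
    for county, state in zip(df["county"], df["state"])
]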
pudl.transform.eia861.transform(raw_dfs, eia861_tables=('service_territory_eia861', 'balancing_authority_eia861', 'sales_eia861', 'advanced_metering_infrastructure_eia861', 'demand_response_eia861', 'demand_side_management_eia861', 'distributed_generation_eia861', 'distribution_systems_eia861', 'dynamic_pricing_eia861', 'energy_efficiency_eia861', 'green_pricing_eia861', 'mergers_eia861', 'net_metering_eia861', 'non_net_metering_eia861', 'operational_data_eia861', 'reliability_eia861', 'utility_data_eia861'))[source]¶ Transform EIA 861 DataFrames.
- Parameters
- Returns
A dictionary of DataFrame objects in which pages from the EIA 861 form (keys) correspond to normalized DataFrames of values from those pages (values).
- Return type
dict
pudl.transform.eia861.utility_assn(tfr_dfs)[source]¶ Harvest a Utility-Date-State Association Table.
pudl.transform.eia861.utility_data(tfr_dfs)[source]¶ Transform the EIA 861 Utility Data table.
Transformations include:
Remove rows with utility ids 88888.
Clean up NERC codes and ensure one per row.
Tidy subset of the data by NERC region.
Tidy subset of the data by RTO.
Convert Y/N columns to booleans.
Module to perform data cleaning functions on EIA923 data tables.
pudl.transform.eia923.boiler_fuel(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the boiler_fuel_eia923 table.
Transformations include:
Remove fields implicated elsewhere.
Drop values with plant and boiler id values of NA.
Replace . values with NA.
Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.
Combine year and month columns into a single date column (a toy sketch follows this entry).
- Parameters
eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from the EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
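A toy sketch of combining year and month columns into a single date column (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({"report_year": [2019, 2019], "report_month": [1, 7]})
df["report_date"] = pd.to_datetime(
    {"year": df["report_year"], "month": df["report_month"], "day": 1}
)
df = df.drop(columns=["report_year", "report_month"])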
pudl.transform.eia923.coalmine(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the coalmine_eia923 table.
Transformations include:
Remove fields implicated elsewhere.
Drop duplicates with MSHA ID.
- Parameters
eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia923.fuel_receipts_costs(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the fuel_receipts_costs_eia923 dataframe.
Transformations include:
Remove fields implicated elsewhere.
Replace . values with NA.
Standardize code values.
Fix dates.
Replace invalid mercury content values with NA.
Convert fuel costs from cents to dollars (fuel cost is reported in cents per mmbtu).
- Parameters
eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia923.generation(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the generation_eia923 table.
Transformations include:
Drop rows with NA for generator id.
Remove fields implicated elsewhere.
Replace . values with NA.
Drop generator-date row duplicates (all have no data).
- Parameters
eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia923.generation_fuel(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the generation_fuel_eia923 table.
Transformations include:
Remove fields implicated elsewhere.
Replace . values with NA.
Remove rows with utility ids 99999.
Create a fuel_type_code_pudl field that organizes fuel types into clean, distinguishable categories.
Combine year and month columns into a single date column.
- Parameters
eia923_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia923.plants(eia923_dfs, eia923_transformed_dfs)[source]¶ Transforms the plants_eia923 table.
Much of the static plant information is reported repeatedly, and scattered across several different pages of EIA 923. The data frame that this function uses is assembled from those many different pages, and passed in via the same dictionary of dataframes that all the other ingest functions use for uniformity.
Transformations include:
Map full spelling onto code values.
Convert Y/N columns to booleans.
Remove excess white space around values.
Drop duplicate rows.
- Parameters
eia923_dfs (dictionary of pandas.DataFrame) – Each entry in this dictionary of DataFrame objects corresponds to a page from the EIA 923 form, as reported in the Excel spreadsheets they distribute.
eia923_transformed_dfs (dict) – A dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Returns
eia923_transformed_dfs, a dictionary of DataFrame objects in which pages from EIA923 form (keys) correspond to normalized DataFrames of values from that page (values).
- Return type
dict
pudl.transform.eia923.transform(eia923_raw_dfs, eia923_tables=('generation_fuel_eia923', 'boiler_fuel_eia923', 'generation_eia923', 'coalmine_eia923', 'fuel_receipts_costs_eia923'))[source]¶ Transforms all the EIA 923 tables.
- Parameters
- Returns
A dictionary of DataFrames with table names as keys and pandas.DataFrame objects as values, where the contents of the DataFrames correspond to cleaned and normalized PUDL database tables, ready for loading.
- Return type
dict
Module to perform data cleaning functions on EPA CEMS data tables.
pudl.transform.epacems.add_facility_id_unit_id_epa(df)[source]¶ Harmonize columns that are added later.
The datapackage validation checks for consistent column names, and these two columns aren’t present before August 2008, so this adds them in.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame guaranteed to have int facility_id and unit_id_epa cols.
- Return type
pandas.DataFrame
pudl.transform.epacems.correct_gross_load_mw(df)[source]¶ Fix values of gross load that are wrong by orders of magnitude.
- Parameters
df (pandas.DataFrame) – A CEMS dataframe
- Returns
The same DataFrame with corrected gross load values.
- Return type
pandas.DataFrame
pudl.transform.epacems.fix_up_dates(df, plant_utc_offset)[source]¶ Fix the dates for the CEMS data.
Transformations include:
Account for timezone differences with offset from UTC (a toy sketch follows this entry).
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
plant_utc_offset (pandas.DataFrame) – A dataframe of plants’ timezones.
- Returns
The same data, with an op_datetime_utc column added and the op_date and op_hour columns removed.
- Return type
pandas.DataFrame
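A toy sketch of the UTC conversion described above (not PUDL’s exact code; the offset value is illustrative):

import pandas as pd

df = pd.DataFrame({"op_date": ["2019-01-01"], "op_hour": [5]})
utc_offset = pd.Timedelta(hours=-5)  # e.g. a plant reporting local time at UTC-5

# local time = UTC + offset, so UTC = local - offset
local = pd.to_datetime(df["op_date"]) + pd.to_timedelta(df["op_hour"], unit="h")
df["op_datetime_utc"] = local - utc_offset
df = df.drop(columns=["op_date", "op_hour"])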
pudl.transform.epacems.harmonize_eia_epa_orispl(df)[source]¶ Harmonize the ORISPL code to match the EIA data – NOT YET IMPLEMENTED.
The EIA plant IDs and CEMS ORISPL codes almost match, but not quite. EPA has compiled a crosswalk that maps one set of IDs to the other, but we haven’t integrated it yet. It can be found at:
https://github.com/USEPA/camd-eia-crosswalk
Note that this transformation needs to be run before fix_up_dates, because fix_up_dates uses the plant ID to look up timezones.
- Parameters
df (pandas.DataFrame) – A CEMS hourly dataframe for one year-month-state.
- Returns
The same data, with the ORISPL plant codes corrected to match the EIA plant IDs.
- Return type
pandas.DataFrame
Todo
Actually implement the function…
Module to perform data cleaning functions on EPA IPM data tables.
pudl.transform.epaipm.load_curves(epaipm_dfs, epaipm_transformed_dfs)[source]¶ Transform the load curve table from wide to tidy format.
- Parameters
epaipm_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from EPA’s IPM, as reported in the Excel spreadsheets they distribute.
epaipm_transformed_dfs (dict) – A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Returns
A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Return type
dict
pudl.transform.epaipm.plant_region_map(epaipm_dfs, epaipm_transformed_dfs)[source]¶ Transforms the map of plant ids to IPM regions for all plants.
- Parameters
epaipm_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from EPA’s IPM, as reported in the Excel spreadsheets they distribute.
epaipm_transformed_dfs (dict) – A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Returns
A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Return type
dict
pudl.transform.epaipm.transform(epaipm_raw_dfs, epaipm_tables=('transmission_single_epaipm', 'transmission_joint_epaipm', 'load_curves_epaipm', 'plant_region_map_epaipm'))[source]¶ Transform EPA IPM DataFrames.
- Parameters
- Returns
A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Return type
dict
pudl.transform.epaipm.transmission_joint(epaipm_dfs, epaipm_transformed_dfs)[source]¶ Transforms transmission constraints between multiple inter-regional links.
- Parameters
epaipm_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from EPA’s IPM, as reported in the Excel spreadsheets they distribute.
epaipm_transformed_dfs (dict) – A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Returns
A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Return type
dict
pudl.transform.epaipm.transmission_single(epaipm_dfs, epaipm_transformed_dfs)[source]¶ Transforms the transmission constraints between individual regions.
- Parameters
epaipm_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from EPA’s IPM, as reported in the Excel spreadsheets they distribute.
epaipm_transformed_dfs (dict) – A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Returns
A dictionary of DataFrame objects in which tables from EPA IPM (keys) correspond to normalized DataFrames of values from that table (values)
- Return type
dict
Routines for transforming FERC Form 1 data before loading into the PUDL DB.
This module provides a variety of functions that are used in cleaning up the FERC Form 1 data prior to loading into our database. This includes adopting standardized units and column names, standardizing the formatting of some string values, and correcting data entry errors which we can infer based on the existing data. It may also include removing bad data, or replacing it with the appropriate NA values.
pudl.transform.ferc1.CONSTRUCTION_TYPE_STRINGS
= {'conventional': ['conventional', 'conventional', 'conventional boiler', 'conv-b', 'conventionall', 'convention', 'conventional', 'coventional', 'conven full boiler', 'c0nventional', 'conventtional', 'conventialunderground', 'conventional bulb', 'conventrional', '*conventional', 'convential', 'convetional', 'conventioanl', 'conventioinal', 'conventaional', 'indoor construction', 'convenional', 'conventional steam', 'conventinal', 'convntional', 'conventionl', 'conventionsl', 'conventiional', 'convntl steam plants', 'indoor const.', 'full indoor', 'indoor', 'indoor automatic', 'indoor boiler', '(peak load) indoor', 'conventionl,indoor', 'conventionl, indoor', 'conventional, indoor', 'comb. cycle indoor', '3 indoor boiler', '2 indoor boilers', '1 indoor boiler', '2 indoor boiler', '3 indoor boilers', 'fully contained', 'conv - b', 'conventional/boiler', 'cnventional', 'comb. cycle indooor', 'sonventional', 'ind enclosures'], 'outdoor': ['outdoor', 'outdoor boiler', 'full outdoor', 'outdoor boiler', 'outdoor boilers', 'outboilers', 'fuel outdoor', 'full outdoor', 'outdoors', 'outdoor', 'boiler outdoor& full', 'boiler outdoor&full', 'outdoor boiler& full', 'full -outdoor', 'outdoor steam', 'outdoor boiler', 'ob', 'outdoor automatic', 'outdoor repower', 'full outdoor boiler', 'fo', 'outdoor boiler & ful', 'full-outdoor', 'fuel outdoor', 'outoor', 'outdoor', 'outdoor boiler&full', 'boiler outdoor &full', 'outdoor boiler &full', 'boiler outdoor & ful', 'outdoor-boiler', 'outdoor - boiler', 'outdoor const.', '4 outdoor boilers', '3 outdoor boilers', 'full outdoor', 'full outdoors', 'full oudoors', 'outdoor (auto oper)', 'outside boiler', 'outdoor boiler&full', 'outdoor hrsg', 'outdoor hrsg', 'outdoor-steel encl.', 'boiler-outdr & full', 'con.& full outdoor', 'partial outdoor', 'outdoor (auto. oper)', 'outdoor (auto.oper)', 'outdoor construction', '1 outdoor boiler', '2 outdoor boilers', 'outdoor enclosure', '2 outoor boilers', 'boiler outdr.& full', 'boiler outdr. & full', 'ful outdoor', 'outdoor-steel enclos', 'outdoor (auto oper.)', 'con. & full outdoor', 'outdore', 'boiler & full outdor', 'full & outdr boilers', 'outodoor (auto oper)', 'outdoor steel encl.', 'full outoor', 'boiler & outdoor ful', 'otdr. blr. & f. otdr', 'f.otdr & otdr.blr.', 'oudoor (auto oper)', 'outdoor constructin', 'f. otdr. & otdr. blr', 'outdoor boiler & fue'], 'semioutdoor': ['more than 50% outdoo', 'more than 50% outdos', 'over 50% outdoor', 'over 50% outdoors', 'semi-outdoor', 'semi - outdoor', 'semi outdoor', 'semi-enclosed', 'semi-outdoor boiler', 'semi outdoor boiler', 'semi- outdoor', 'semi - outdoors', 'semi -outdoorconven & semi-outdr', 'conv & semi-outdoor', 'conv & semi- outdoor', 'convent. semi-outdr', 'conv. semi outdoor', 'conv(u1)/semiod(u2)', 'conv u1/semi-od u2', 'conv-one blr-semi-od', 'convent semioutdoor', 'conv. u1/semi-od u2', 'conv - 1 blr semi od', 'conv. ui/semi-od u2', 'conv-1 blr semi-od', 'conven. semi-outdoor', 'conv semi-outdoor', 'u1-conv./u2-semi-od', 'u1-conv./u2-semi -od', 'convent. semi-outdoo', 'u1-conv. / u2-semi', 'conven & semi-outdr', 'semi -outdoor', 'outdr & conventnl', 'conven. full outdoor', 'conv. & outdoor blr', 'conv. & outdoor blr.', 'conv. & outdoor boil', 'conv. & outdr boiler', 'conv. & out. boiler', 'convntl,outdoor blr', 'outdoor & conv.', '2 conv., 1 out. boil', 'outdoor/conventional', 'conv. boiler outdoor', 'conv-one boiler-outd', 'conventional outdoor', 'conventional outdor', 'conv. 
outdoor boiler', 'conv.outdoor boiler', 'conventional outdr.', 'conven,outdoorboiler', 'conven full outdoor', 'conven,full outdoor', '1 out boil, 2 conv', 'conv. & full outdoor', 'conv. & outdr. boilr', 'conv outdoor boiler', 'convention. outdoor', 'conv. sem. outdoor', 'convntl, outdoor blr', 'conv & outdoor boil', 'conv & outdoor boil.', 'outdoor & conv', 'conv. broiler outdor', '1 out boilr, 2 conv', 'conv.& outdoor boil.', 'conven,outdr.boiler', 'conven,outdr boiler', 'outdoor & conventil', '1 out boilr 2 conv', 'conv & outdr. boilr', 'conven, full outdoor', 'conven full outdr.', 'conven, full outdr.', 'conv/outdoor boiler', "convnt'l outdr boilr", '1 out boil 2 conv', 'conv full outdoor', 'conven, outdr boiler', 'conventional/outdoor', 'conv&outdoor boiler', 'outdoor & convention', 'conv & outdoor boilr', 'conv & full outdoor', 'convntl. outdoor blr', 'conv - ob', "1conv'l/2odboilers", "2conv'l/1odboiler", 'conv-ob', 'conv.-ob', '1 conv/ 2odboilers', '2 conv /1 odboilers', 'conv- ob', 'conv -ob', 'con sem outdoor', 'cnvntl, outdr, boilr', 'less than 50% outdoo', 'under 50% outdoor', 'under 50% outdoors', '1cnvntnl/2odboilers', '2cnvntnl1/1odboiler', 'con & ob', 'combination (b)', 'indoor & outdoor', 'conven. blr. & full', 'conv. & otdr. blr.', 'combination', 'indoor and outdoor', 'conven boiler & full', "2conv'l/10dboiler", '4 indor/outdr boiler', '4 indr/outdr boilerr', '4 indr/outdr boiler', 'indoor & outdoof'], 'unknown': ['', 'automatic operation', 'comb. turb. installn', 'comb. turb. instaln', 'com. turb. installn', 'n/a', 'for detailed info.', 'for detailed info', 'combined cycle', 'na', 'not applicable', 'gas', 'heated individually', 'metal enclosure', 'pressurized water', 'nuclear', 'jet engine', 'gas turbine', 'storage/pipelines', '0', 'during 1994', 'peaking - automatic', 'gas turbine/int. cm', '2 oil/gas turbines', 'wind', 'package', 'mobile', 'auto-operated', 'steam plants', 'other production', 'all nuclear plants', 'other power gen.', 'automatically operad', 'automatically operd', 'circ fluidized bed', 'jet turbine', 'gas turbne/int comb', 'automatically oper.', 'retired 1/1/95', 'during 1995', '1996. plant sold', 'reactivated 7/1/96', 'gas turbine/int comb', 'portable', 'head individually', 'automatic opertion', 'peaking-automatic', 'cycle', 'full order', 'circ. fluidized bed', 'gas turbine/intcomb', '0.0000', 'none', '2 oil / gas', 'block & steel', 'and 2000', 'comb.turb. instaln', 'automatic oper.', 'pakage', '---', 'n/a (ct)', 'comb turb instain', 'ind encloures', '2 oil /gas turbines', 'combustion turbine', '1970', 'gas/oil turbines', 'combined cycle steam', 'pwr', '2 oil/ gas', '2 oil / gas turbines', 'gas / oil turbines', 'no boiler', 'internal combustion', 'gasturbine no boiler', 'boiler', 'tower -10 unit facy', 'gas trubine', '4 gas/oil trubines', '2 oil/ 4 gas/oil tur', '5 gas/oil turbines', 'tower 16', '2 on 1 gas turbine', 'tower 23', 'tower -10 unit', 'tower - 101 unit', '3 on 1 gas turbine', 'tower - 10 units', 'tower - 165 units', 'wind turbine', 'fixed tilt pv', 'tracking pv', 'o', 'wind trubine', 'subcritical', 'sucritical', 'simple cycle', 'simple & reciprocat']}¶ A dictionary of construction types (keys) and lists of construction type strings associated with each type (values) from FERC Form 1.
There are many strings that weren’t categorized, including crosses between conventional and outdoor, PV, wind, combined cycle, and internal combustion. The lists are broken out into the two types specified in Form 1: conventional and outdoor. These lists are inclusive so that variants of conventional (e.g. “conventional full”) and outdoor (e.g. “outdoor full” and “outdoor hrsg”) are included.
- Type
dict
class pudl.transform.ferc1.FERCPlantClassifier(min_sim=0.75, plants_df=None)[source]¶ Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
A classifier for identifying FERC plant time series in FERC Form 1 data.
We want to be able to give the classifier a FERC plant record, and get back the group of records (or the ID of the group of records) that it ought to be part of.
There are hundreds of different groups of records, and we can only know what they are by looking at the whole dataset ahead of time. This is the “fitting” step, in which the groups of records resulting from a particular set of model parameters (e.g. the weights that are attributes of the class) are generated.
Once we have that set of record categories, we can test how well the classifier performs by checking it against test / training data which we have already classified by hand. The test / training set is a list of lists of unique FERC plant record IDs (each record ID is the concatenation of: report year, respondent id, supplement number, and row number). It could also be stored as a dataframe where each column is associated with a year of data (some of which could be empty). Not sure what the best structure would be.
If it’s useful, we can assign each group a unique ID that is the time ordered concatenation of each of the constituent record IDs. Need to understand what the process for checking the classification of an input record looks like.
To score a given classifier, we can look at what proportion of the records in the test dataset are assigned to the same group as in our manual classification of those records. There are much more complicated ways to do the scoring too… but for now let’s just keep it as simple as possible.
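A minimal sketch of the pairwise cosine-similarity step described above, using toy feature vectors in place of the real vectorized plant features:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy feature vectors standing in for vectorized FERC plant records.
X = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.5],
    [0.0, 1.0, 0.0],
])
sim = cosine_similarity(X)  # (n_records, n_records) similarity matrix
linked = sim >= 0.75        # pairs similar enough to be grouped into one time series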
fit(X, y=None)[source]¶ Use weighted FERC plant features to group records into time series.
The fit method takes the vectorized, normalized, weighted FERC plant features (X) as input, calculates the pairwise cosine similarity matrix between all records, and groups the records into their best time series. The similarity matrix and best time series are stored as data members in the object for later use in scoring & predicting.
This isn’t quite the way a fit method would normally work.
- Parameters
X – a sparse matrix of size n_samples x n_features.
y – ignored (present for scikit-learn API compatibility).
- Returns
- Return type
Todo
Zane revisit args and returns
predict(X, y=None)[source]¶ Identify time series of similar records to input record_ids.
Given a one-dimensional dataframe X, containing FERC record IDs, return a dataframe in which each row corresponds to one of the input record_id values (ordered as the input was ordered), with each column corresponding to one of the years worth of data. Values in the returned dataframe are the FERC record_ids of the record most similar to the input record within that year. Some of them may be null, if there was no sufficiently good match.
Row index is the seed record IDs. Column index is years.
Todo
This method is hideously inefficient. It should be vectorized.
There’s a line that throws a FutureWarning that needs to be fixed.
score(X, y=None)[source]¶ Scores a collection of FERC plant categorizations.
For every record ID in X, predict its record group and calculate a metric of similarity between the prediction and the “ground truth” group that was passed in for that value of X.
- Parameters
X (pandas.DataFrame) – an n_samples x 1 pandas dataframe of FERC Form 1 record IDs.
y (pandas.DataFrame) – a dataframe of “ground truth” FERC Form 1 record groups, corresponding to the list of record IDs in X
- Returns
The average of all the similarity metrics as the score.
- Return type
float
pudl.transform.ferc1.FUEL_STRINGS
= {'coal': ['coal', 'coal-subbit', 'lignite', 'coal(sb)', 'coal (sb)', 'coal-lignite', 'coke', 'coa', 'lignite/coal', 'coal - subbit', 'coal-subb', 'coal-sub', 'coal-lig', 'coal-sub bit', 'coals', 'ciak', 'petcoke', 'coal.oil', 'coal/gas', 'bit coal', 'coal-unit #3', 'coal-subbitum', 'coal tons', 'coal mcf', 'coal unit #3', 'pet. coke', 'coal-u3', 'coal&coke', 'tons'], 'gas': ['gas', 'gass', 'methane', 'natural gas', 'blast gas', 'gas mcf', 'propane', 'prop', 'natural gas', 'nat.gas', 'nat gas', 'nat. gas', 'natl gas', 'ga', 'gas`', 'syngas', 'ng', 'mcf', 'blast gaa', 'nat gas', 'gac', 'syngass', 'prop.', 'natural', 'coal.gas', 'n. gas', 'lp gas', 'natuaral gas', 'coke gas', 'gas #2016', 'propane**', '* propane', 'propane **', 'gas expander', 'gas ct', '# 6 gas', '#6 gas', 'coke oven gas'], 'hydro': [], 'nuclear': ['nuclear', 'grams of uran', 'grams of', 'grams of ura', 'grams', 'nucleur', 'nulear', 'nucl', 'nucleart', 'nucelar', 'gr.uranium', 'grams of urm', 'nuclear (9)', 'nulcear', 'nuc', 'gr. uranium', 'nuclear mw da', 'grams of ura'], 'oil': ['oil', '#6 oil', '#2 oil', 'fuel oil', 'jet', 'no. 2 oil', 'no.2 oil', 'no.6& used', 'used oil', 'oil-2', 'oil (#2)', 'diesel oil', 'residual oil', '# 2 oil', 'resid. oil', 'tall oil', 'oil/gas', 'no.6 oil', 'oil-fuel', 'oil-diesel', 'oil / gas', 'oil bbls', 'oil bls', 'no. 6 oil', '#1 kerosene', 'diesel', 'no. 2 oils', 'blend oil', '#2oil diesel', '#2 oil-diesel', '# 2 oil', 'light oil', 'heavy oil', 'gas.oil', '#2', '2', '6', 'bbl', 'no 2 oil', 'no 6 oil', '#1 oil', '#6', 'oil-kero', 'oil bbl', 'biofuel', 'no 2', 'kero', '#1 fuel oil', 'no. 2 oil', 'blended oil', 'no 2. oil', '# 6 oil', 'nno. 2 oil', '#2 fuel', 'oill', 'oils', 'gas/oil', 'no.2 oil gas', '#2 fuel oil', 'oli', 'oil (#6)', 'oil/diesel', '2 oil', '#6 hvy oil', 'jet fuel', 'diesel/compos', 'oil-8', 'oil {6}', 'oil-unit #1', 'bbl.', 'oil.', 'oil #6', 'oil (6)', 'oil(#2)', 'oil-unit1&2', 'oil-6', '#2 fue oil', 'dielel oil', 'dielsel oil', '#6 & used', 'barrels', 'oil un 1 & 2', 'jet oil', 'oil-u1&2', 'oiul', 'pil', 'oil - 2', '#6 & used', 'oial'], 'solar': [], 'unknown': ['steam', 'purch steam', 'all', 'tdf', 'n/a', 'purch. steam', 'other', 'composite', 'composit', 'mbtus', 'total', 'avg', 'avg.', 'blo', 'all fuel', 'comb.', 'alt. fuels', 'na', 'comb', '/#=2\x80â\x91?', 'kã\xadgv¸\x9d?', "mbtu's", 'gas, oil', 'rrm', '3\x9c', 'average', 'furfural', '0', 'watson bng', 'toal', 'bng', '# 6 & used', 'combined', 'blo bls', 'compsite', '*', 'compos.', 'gas / oil', 'mw days', 'g', 'c', 'lime', 'all fuels', 'at right', '20', '1', 'comp oil/gas', 'all fuels to', 'the right are', 'c omposite', 'all fuels are', 'total pr crk', 'all fuels =', 'total pc', 'comp', 'alternative', 'alt. fuel', 'bio fuel', 'total prairie', ''], 'waste': ['tires', 'tire', 'refuse', 'switchgrass', 'wood waste', 'woodchips', 'biomass', 'wood', 'wood chips', 'rdf', 'tires/refuse', 'tire refuse', 'waste oil', 'waste', 'woodships', 'tire chips'], 'wind': []}¶ A mapping a canonical fuel name to a list of strings which are used to represent that fuel in the FERC Form 1 Reporting. Case is ignored, as all fuel strings are converted to a lower case in the data set.
- Type
dict
pudl.transform.ferc1.FUEL_UNIT_STRINGS
= {'bbl': ['barrel', 'bbls', 'bbl', 'barrels', 'bbrl', 'bbl.', 'bbls.', 'oil 42 gal', 'oil-barrels', 'barrrels', 'bbl-42 gal', 'oil-barrel', 'bb.', 'barrells', 'bar', 'bbld', 'oil- barrel', 'barrels .', 'bbl .', 'barels', 'barrell', 'berrels', 'bb', 'bbl.s', 'oil-bbl', 'bls', 'bbl:', 'barrles', 'blb', 'propane-bbl', 'barriel', 'berriel', 'barrile', '(bbl.)', 'barrel *(4)', '(4) barrel', 'bbf', 'blb.', '(bbl)', 'bb1', 'bbsl', 'barrrel', 'barrels 100%', 'bsrrels', "bbl's", '*barrels', 'oil - barrels', 'oil 42 gal ba', 'bll', 'boiler barrel', 'gas barrel', '"boiler" barr', '"gas" barrel', '"boiler"barre', '"boiler barre', 'barrels .', 'bariel', 'brrels', 'oil barrel'], 'btu': ['btus', 'btu'], 'gal': ['gallons', 'gal.', 'gals', 'gals.', 'gallon', 'gal', 'galllons'], 'gramsU': ['gram', 'grams', 'gm u', 'grams u235', 'grams u-235', 'grams of uran', 'grams: u-235', 'grams:u-235', 'grams:u235', 'grams u308', 'grams: u235', 'grams of', 'grams - n/a', 'gms uran', 's e uo2 grams', 'gms uranium', 'grams of urm', 'gms. of uran', 'grams (100%)', 'grams v-235', 'se uo2 grams'], 'kgU': ['kg of uranium', 'kg uranium', 'kilg. u-235', 'kg u-235', 'kilograms-u23', 'kg', 'kilograms u-2', 'kilograms', 'kg of', 'kg-u-235', 'kilgrams', 'kilogr. u235', 'uranium kg', 'kg uranium25', 'kilogr. u-235', 'kg uranium 25', 'kilgr. u-235', 'kguranium 25', 'kg-u235', 'kgm'], 'kgal': ['oil(1000 gal)', 'oil(1000)', 'oil (1000)', 'oil(1000', 'oil(1000ga)'], 'klbs': ['k lbs.', 'k lbs'], 'mcf': ['mcf', "mcf's", 'mcfs', 'mcf.', 'gas mcf', '"gas" mcf', 'gas-mcf', 'mfc', 'mct', ' mcf', 'msfs', 'mlf', 'mscf', 'mci', 'mcl', 'mcg', 'm.cu.ft.', 'kcf', '(mcf)', 'mcf *(4)', 'mcf00', 'm.cu.ft..'], 'mmbtu': ['mmbtu', 'mmbtus', 'mbtus', '(mmbtu)', "mmbtu's", 'nuclear-mmbtu', 'nuclear-mmbt', 'mmbtul'], 'mwdth': ['mwd therman', 'mw days-therm', 'mwd thrml', 'mwd thermal', 'mwd/mtu', 'mw days', 'mwdth', 'mwd', 'mw day', 'dth', 'mwdaysthermal', 'mw day therml', 'mw days thrml', 'nuclear mwd', 'mmwd', 'mw day/thermlmw days/therm', 'mw days (th', 'ermal)'], 'mwhth': ['mwh them', 'mwh threm', 'nwh therm', 'mwhth', 'mwh therm', 'mwh', 'mwh therms.', 'mwh term.uts', 'mwh thermal', 'mwh thermals', 'mw hr therm', 'mwh therma', 'mwh therm.uts'], 'ton': ['toms', 'taons', 'tones', 'col-tons', 'toncoaleq', 'coal', 'tons coal eq', 'coal-tons', 'ton', 'tons', 'tons coal', 'coal-ton', 'tires-tons', 'coal tons -2 ', 'oil-tons', 'coal tons 200', 'ton-2000', 'coal tons', 'coal tons -2', 'coal-tone', 'tire-ton', 'tire-tons', 'ton coal eqv', 'tos', 'coal tons - 2', 'c. t.', 'c.t.', 'toncoalequiv'], 'unknown': ['', '1265', 'mwh units', 'composite', 'therms', 'n/a', 'mbtu/kg', 'uranium 235', 'oil', 'ccf', '2261', 'uo2', '(7)', 'oil #2', 'oil #6', '\x99å\x83\x90?"', 'dekatherm', '0', 'mw day/therml', 'nuclear', 'gas', '62,679', 'mw days/therm', 'na', 'uranium', 'oil/gas', 'thermal', '(thermal)', 'se uo2', '181679', '83', '3070', '248', '273976', '747', '-', 'are total', 'pr. creek', 'decatherms', 'uramium', '.', 'total pr crk', '>>>>>>>>', 'all', 'total', 'alternative-t', 'oil-mcf', '3303671', '929', '7182175', '319', '1490442', '10881', '1363663', '7171', '1726497', '4783', '7800', '12559', '2398', 'creek fuels', 'propane-barre', '509', 'barrels/mcf', 'propane-bar', '4853325', '4069628', '1431536', '708903', 'mcf/oil (1000']}¶ A dictionary linking fuel units (keys) to lists of various strings representing those fuel units (values)
- Type
dict
pudl.transform.ferc1.PLANT_KIND_STRINGS
= {'combined_cycle': ['Combined cycle', 'combined cycle', 'combined', 'gas & steam turbine', 'gas turb. & heat rec', 'combined cycle', 'com. cyc', 'com. cycle', 'gas turb-combined cy', 'combined cycle ctg', 'combined cycle - 40%', 'com cycle gas turb', 'combined cycle oper', 'gas turb/comb. cyc', 'combine cycle', 'cc', 'comb. cycle', 'gas turb-combined cy', 'steam and cc', 'steam cc', 'gas steam', 'ctg steam gas', 'steam comb cycle', 'gas/steam comb. cycl', 'steam (comb. cycle)gas turbine/steam', 'steam & gas turbine', 'gas trb & heat rec', 'steam & combined ce', 'st/gas turb comb cyc', 'gas tur & comb cycl', 'combined cycle (a,b)', 'gas turbine/ steam', 'steam/gas turb.', 'steam & comb cycle', 'gas/steam comb cycle', 'comb cycle (a,b)', 'igcc', 'steam/gas turbine', 'gas turbine / steam', 'gas tur & comb cyc', 'comb cyc (a) (b)', 'comb cycle', 'comb cyc', 'combined turbine', 'combine cycle oper', 'comb cycle/steam tur', 'cc / gas turb', 'steam (comb. cycle)', 'steam & cc', 'gas turbine/steam', 'gas turb/cumbus cycl', 'gas turb/comb cycle', 'gasturb/comb cycle', 'gas turb/cumb. cyc', 'igcc/gas turbine', 'gas / steam', 'ctg/steam-gas', 'ctg/steam -gas', 'gas fired cc turbine', 'combinedcycle', 'comb cycle gas turb', 'combined cycle opern', 'comb. cycle gas turb'], 'combustion_turbine': ['combustion turbine', 'gt', 'gas turbine', 'gas turbine # 1', 'gas turbine', 'gas turbine (note 1)', 'gas turbines', 'simple cycle', 'combustion turbine', 'comb.turb.peak.units', 'gas turbine', 'combustion turbine', 'com turbine peaking', 'gas turbine peaking', 'comb turb peaking', 'combustine turbine', 'comb. turine', 'conbustion turbine', 'combustine turbine', 'gas turbine (leased)', 'combustion tubine', 'gas turb', 'gas turbine peaker', 'gtg/gas', 'simple cycle turbine', 'gas-turbine', 'gas turbine-simple', 'gas turbine - note 1', 'gas turbine #1', 'simple cycle', 'gasturbine', 'combustionturbine', 'gas turbine (2)', 'comb turb peak units', 'jet engine', 'jet powered turbine', '*gas turbine', 'gas turb.(see note5)', 'gas turb. (see note', 'combutsion turbine', 'combustion turbin', 'gas turbine-unit 2', 'gas - turbine', 'comb turbine peaking', 'gas expander turbine', 'jet turbine', 'gas turbin (lease', 'gas turbine (leased', 'gas turbine/int. cm', 'comb.turb-gas oper.', 'comb.turb.gas/oil op', 'comb.turb.oil oper.', 'jet', 'comb. turbine (a)', 'gas turb.(see notes)', 'gas turb(see notes)', 'comb. turb-gas oper', 'comb.turb.oil oper', 'gas turbin (leasd)', 'gas turbne/int comb', 'gas turbine (note1)', 'combution turbin', '* gas turbine', 'add to gas turbine', 'gas turbine (a)', 'gas turbinint comb', 'gas turbine (note 3)', 'resp share gas note3', 'gas trubine', '*gas turbine(note3)', 'gas turbine note 3,6', 'gas turbine note 4,6', 'gas turbine peakload', 'combusition turbine', 'gas turbine (lease)', 'comb. turb-gas oper.', 'combution turbine', 'combusion turbine', 'comb. turb. oil oper', 'combustion burbine', 'combustion and gas', 'comb. turb.', 'gas turbine (lease', 'gas turbine (leasd)', 'gas turbine/int comb', '*gas turbine(note 3)', 'gas turbine (see nos', 'i.c.e./gas turbine', 'gas turbine/intcomb', 'cumbustion turbine', 'gas turb, int. comb.', 'gas turb, diesel', 'gas turb, int. comb', 'i.c.e/gas turbine', 'diesel turbine', 'comubstion turbine', 'i.c.e. 
/gas turbine', 'i.c.e/ gas turbine', 'i.c.e./gas tubine'], 'geothermal': ['steam - geothermal', 'steam_geothermal', 'geothermal'], 'internal_combustion': ['ic', 'internal combustion', 'internal comb.', 'internl combustiondiesel turbine', 'int combust (note 1)', 'int. combust (note1)', 'int.combustine', 'comb. cyc', 'internal comb', 'diesel', 'diesel engine', 'internal combustion', 'int combust - note 1', 'int. combust - note1', 'internal comb recip', 'reciprocating engine', 'comb. turbine', 'internal combust.', 'int. combustion (1)', '*int combustion (1)', "*internal combust'n", 'internal', 'internal comb.', 'steam internal comb', 'combustion', 'int. combustion', 'int combust (note1)', 'int. combustine', 'internl combustion', '*int. combustion (1)'], 'nuclear': ['nuclear', 'nuclear (3)', 'steam(nuclear)', 'nuclear(see note4)nuclear steam', 'nuclear turbine', 'nuclear - steam', 'nuclear (a)(b)(c)', 'nuclear (b)(c)', '* nuclear', 'nuclear (b) (c)', 'nuclear (see notes)', 'steam (nuclear)', '* nuclear (note 2)', 'nuclear (note 2)', 'nuclear (see note 2)', 'nuclear(see note4)', 'nuclear steam', 'nuclear(see notes)', 'nuclear-steam', 'nuclear (see note 3)'], 'photovoltaic': ['solar photovoltaic', 'photovoltaic', 'solar', 'solar project'], 'solar_thermal': ['solar thermal'], 'steam': ['coal', 'steam', 'steam units 1 2 3', 'steam units 4 5', 'steam fossil', 'steam turbine', 'steam a', 'steam 100', 'steam units 1 2 3', 'steams', 'steam 1', 'steam retired 2013', 'stream', 'steam units 1,2,3', 'steam units 4&5', 'steam units 4&6', 'steam conventional', 'unit total-steam', 'unit total steam', '*resp. share steam', 'resp. share steam', 'steam (see note 1,', 'steam (see note 3)', 'mpc 50%share steam', '40% share steamsteam (2)', 'steam (3)', 'steam (4)', 'steam (5)', 'steam (6)', 'steam (7)', 'steam (8)', 'steam units 1 and 2', 'steam units 3 and 4', 'steam (note 1)', 'steam (retired)', 'steam (leased)', 'coal-fired steam', 'oil-fired steam', 'steam/fossil', 'steam (a,b)', 'steam (a)', 'stean', 'steam-internal comb', 'steam (see notes)', 'steam units 4 & 6', 'resp share stm note3', 'mpc50% share steam', 'mpc40%share steam', 'steam - 64%', 'steam - 100%', 'steam (1) & (2)', 'resp share st note3', 'mpc 50% shares steam', 'steam-64%', 'steam-100%', 'steam (see note 1)', 'mpc 50% share steam', 'steam units 1, 2, 3', 'steam units 4, 5', 'steam (2)', 'steam (1)', 'steam 4, 5', 'steam - 72%', 'steam (incl i.c.)', 'steam- 72%', 'steam;retired - 2013', "respondent's sh.-st.", "respondent's sh-st", '40% share steam', 'resp share stm note3', 'mpc50% share steam', 'resp share st note 3', '\x02steam (1)'], 'unknown': ['', 'n/a', 'see pgs 402.1-402.3', 'see pgs 403.1-403.9', "respondent's share", '--', '(see note 7)', 'other', 'not applicable', 'peach bottom', 'none.', 'fuel facilities', '0', 'not in service', 'none', 'common expenses', 'expenses common to', 'retired in 1981', 'retired in 1978', 'na', 'unit total (note3)', 'unit total (note2)', 'resp. share (note2)', 'resp. share (note8)', 'resp. share (note 9)', 'resp. share (note11)', 'resp. share (note4)', 'resp. share (note6)', 'conventional', 'expenses commom to', 'not in service in', 'unit total (note 3)', 'unit total (note 2)', 'resp. share (note 8)', 'resp. share (note 3)', 'resp. share note 11', 'resp. share (note 4)', 'resp. share (note 6)', '(see note 5)', 'resp. 
share (note 2)', 'package', '(left blank)', 'common', '0.0000', 'other generation', 'resp share (note 11)', 'retired', 'storage/pipelines', 'sold april 16, 1999', 'sold may 07, 1999', 'plants sold in 1999', 'gas', 'not applicable.', 'resp. share - note 2', 'resp. share - note 8', 'resp. share - note 9', 'resp share - note 11', 'resp. share - note 4', 'resp. share - note 6', 'plant retired- 2013', 'retired - 2013', 'resp share - note 5', 'resp. share - note 7', 'non-applicable', 'other generation plt', 'combined heat/power', 'oil'], 'wind': ['wind', 'wind energy', 'wind turbine', 'wind - turbine', 'wind generation']}¶ A mapping from canonical plant kinds (keys) to the associated freeform strings (values) identified as being associated with that kind of plant in the FERC Form 1 raw data. There are many strings that weren’t categorized, Solar and Solar Project were not classified as these do not indicate if they are solar thermal or photovoltaic. Variants on Steam (e.g. “steam 72” and “steam and gas”) were classified based on additional research of the plants on the Internet.
- Type
dict
pudl.transform.ferc1.accumulated_depreciation(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 depreciation data for loading into PUDL.
This information is organized by FERC account, with each line of the FERC Form 1 having a different descriptive identifier like ‘balance_end_of_year’ or ‘transmission’.
pudl.transform.ferc1.cols_to_cats(df, cat_name, col_cats)[source]¶ Turn top-level MultiIndex columns into a categorical column.
In some cases FERC Form 1 data comes with many different types of related values interleaved in the same table – e.g. current year and previous year income – this can result in DataFrames that are hundreds of columns wide, which is unwieldy. This function takes those top level MultiIndex labels and turns them into categories in a single column, which can be used to select a particular type of report.
- Parameters
df (pandas.DataFrame) – the dataframe to be simplified.
cat_name (str) – the label of the column to be created indicating what MultiIndex label the values came from.
col_cats (dict) – a dictionary with top level MultiIndex labels as keys, and the category to which they should be mapped as values.
- Returns
A re-shaped/re-labeled dataframe with one fewer levels of MultiIndex in the columns, and an additional column containing the assigned labels.
- Return type
pandas.DataFrame
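For illustration, a toy version of the reshaping described above (a sketch of the idea, not PUDL's implementation):

    import pandas as pd

    # A miniature wide table whose top column level interleaves report periods.
    df = pd.DataFrame({
        ("current_year", "income"): [10.0, 20.0],
        ("previous_year", "income"): [8.0, 15.0],
    })
    col_cats = {"current_year": "current", "previous_year": "previous"}

    long = (
        df.stack(level=0)                        # top column level -> inner row index
        .rename_axis(index=[None, "report_type"])
        .reset_index("report_type")              # index level -> ordinary column
    )
    long["report_type"] = long["report_type"].map(col_cats).astype("category")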
-
pudl.transform.ferc1.
fuel
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 fuel data for loading into PUDL Database.
This process includes converting some columns into our preferred units, like MWh and mmbtu instead of kWh and btu. Plant names are also standardized (stripped and lower-cased). Fuel and fuel unit strings are also standardized using our cleanstrings() function and the string cleaning dictionaries found above (FUEL_STRINGS, etc.).
-
pudl.transform.ferc1.
fuel_by_plant_ferc1
(fuel_df, thresh=0.5)[source]¶ Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.
Each record in the FERC Form 1 corresponds to a particular type of fuel. Many plants – especially coal plants – use more than one fuel, with gas and/or diesel serving as startup fuels. In order to be able to classify the type of plant based on relative proportions of fuel consumed or fuel costs it is useful to aggregate these per-fuel records into a single record for each plant.
Fuel cost (in nominal dollars) and fuel heat content (in mmBTU) are calculated for each fuel based on the cost and heat content per unit, and the number of units consumed, and then summed by fuel type (there can be more than one record for a given type of fuel in each plant because we are simplifying the fuel categories). The per-fuel records are then pivoted to create one column per fuel type. The total is summed and stored separately, and the individual fuel costs & heat contents are divided by that total, to yield fuel proportions. Based on those proportions and a minimum threshold that’s passed in, a “primary” fuel type is then assigned to the plant-year record and given a string label.
- Parameters
fuel_df (pandas.DataFrame) – Pandas DataFrame resembling the post-transform result for the fuel_ferc1 table.
thresh (float) – A value between 0.5 and 1.0 indicating the minimum fraction of overall heat content that must have been provided by a fuel in a plant-year for it to be considered the “primary” fuel for the plant in that year. Default value: 0.5.
- Returns
A DataFrame with a single record for each plant-year, including the columns required to merge it with the plants_steam_ferc1 table/DataFrame (report_year, utility_id_ferc1, and plant_name) as well as totals for fuel mmbtu consumed in that plant-year, and the cost of fuel in that year, the proportions of heat content and fuel costs for each fuel in that year, and a column that labels the plant’s primary fuel for that year.
- Return type
pandas.DataFrame
- Raises
AssertionError – If the DataFrame input does not have the columns required to run the function.
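The core of the calculation can be sketched with pandas as follows (the column names and toy records are chosen for illustration):

    import pandas as pd

    fuel = pd.DataFrame({
        "report_year": [2019] * 3,
        "utility_id_ferc1": [1] * 3,
        "plant_name": ["alpha"] * 3,
        "fuel_type_code_pudl": ["coal", "gas", "gas"],
        "fuel_mmbtu": [800.0, 150.0, 50.0],
    })
    keys = ["report_year", "utility_id_ferc1", "plant_name"]
    # Sum per fuel type, pivot to one column per fuel, then normalize.
    by_fuel = fuel.pivot_table(index=keys, columns="fuel_type_code_pudl",
                               values="fuel_mmbtu", aggfunc="sum", fill_value=0.0)
    fractions = by_fuel.div(by_fuel.sum(axis=1), axis=0)
    thresh = 0.5
    primary = fractions.idxmax(axis=1).where(fractions.max(axis=1) >= thresh, "unknown")
    result = fractions.assign(primary_fuel=primary)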
-
pudl.transform.ferc1.
make_ferc1_clf
(plants_df, ngram_min=2, ngram_max=10, min_sim=0.75, plant_name_ferc1_wt=2.0, plant_type_wt=2.0, construction_type_wt=1.0, capacity_mw_wt=1.0, construction_year_wt=1.0, utility_id_ferc1_wt=1.0, fuel_fraction_wt=1.0)[source]¶ Create a FERC Plant Classifier using several weighted features.
Given a FERC steam plants dataframe plants_df, which also includes fuel consumption information, transform a selection of useful columns into features suitable for use in calculating inter-record cosine similarities. Individual features are weighted according to the keyword arguments.
Features include:
plant_name (via TF-IDF, with ngram_min and ngram_max as parameters)
plant_type (OneHot encoded categorical feature)
construction_type (OneHot encoded categorical feature)
capacity_mw (MinMax scaled numerical feature)
construction_year (OneHot encoded categorical feature)
utility_id_ferc1 (OneHot encoded categorical feature)
fuel_fraction_mmbtu (several MinMax scaled numerical columns, which are normalized and treated as a single feature.)
This feature matrix is then used to instantiate a FERCPlantClassifier.
The ColumnTransformer and FERCPlantClassifier are combined in an sklearn Pipeline, which is returned by the function.
- Parameters
ngram_min (int) – the minimum n-gram length to consider in the vectorization of the plant_name feature.
ngram_max (int) – the maximum n-gram length to consider in the vectorization of the plant_name feature.
min_sim (float) – the minimum cosine similarity between two records that can be considered a “match” (a number between 0.0 and 1.0).
plant_name_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
plant_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
construction_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
capacity_mw_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
construction_year_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
utility_id_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
fuel_fraction_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.
- Returns
an sklearn Pipeline that performs preprocessing and classification with a FERCPlantClassifier object.
- Return type
sklearn.pipeline.Pipeline
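A minimal sketch of the kind of weighted feature matrix described above, using scikit-learn (the toy columns and weights are illustrative; only ngram_range mirrors the documented parameters):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    features = ColumnTransformer(
        transformers=[
            ("plant_name", TfidfVectorizer(analyzer="char", ngram_range=(2, 10)), "plant_name"),
            ("plant_type", OneHotEncoder(handle_unknown="ignore"), ["plant_type"]),
            ("capacity_mw", MinMaxScaler(), ["capacity_mw"]),
        ],
        # Analogous to the *_wt keyword arguments: scale each feature
        # before the record vectors are compared.
        transformer_weights={"plant_name": 2.0, "plant_type": 2.0, "capacity_mw": 1.0},
    )
    plants = pd.DataFrame({
        "plant_name": ["comanche", "comanche 2"],
        "plant_type": ["steam", "steam"],
        "capacity_mw": [325.0, 335.0],
    })
    sim = cosine_similarity(features.fit_transform(plants))  # threshold at min_sim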
-
pudl.transform.ferc1.
plant_in_service
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 Plant in Service data for loading into PUDL.
Re-organizes the original FERC Form 1 Plant in Service data by unpacking the rows as needed on a year by year basis, to organize them into columns. The “columns” in the original FERC Form 1 denote starting balance, ending balance, additions, retirements, adjustments, and transfers – these categories are turned into labels in a column called “amount_type”. Because each row in the transformed table is composed of many individual records (rows) from the original table, row_number can’t be part of the record_id, which means the record_id values are no longer unique. To infer exactly what record a given piece of data came from, the record_id and the row_map (found in the PUDL package_data directory) can be used.
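For illustration, a toy version of this unpacking (the column names are made up; this is not PUDL's implementation):

    import pandas as pd

    wide = pd.DataFrame({
        "record_id": ["r1"],
        "starting_balance": [100.0],
        "additions": [10.0],
        "ending_balance": [110.0],
    })
    # The balance "columns" become labels in an amount_type column.
    long = wide.melt(id_vars="record_id", var_name="amount_type", value_name="amount")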
-
pudl.transform.ferc1.
plants_hydro
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 plant_hydro data for loading into PUDL Database.
Standardizes plant names (stripping whitespace and converting to title case). Also converts values into our preferred units of MW and MWh.
-
pudl.transform.ferc1.
plants_pumped_storage
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 pumped storage data for loading into PUDL.
Standardizes plant names (stripping whitespace and converting to title case). Also converts values into our preferred units of MW and MWh.
-
pudl.transform.ferc1.
plants_small
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 plant_small data for loading into PUDL Database.
This FERC Form 1 table contains information about a large number of small plants, including many small hydroelectric and other renewable generation facilities. Unfortunately the data is not well standardized, and so the plants have been categorized manually, with the results of that categorization stored in an Excel spreadsheet. This function reads in the plant type data from the spreadsheet and merges it with the rest of the information from the FERC DB based on record number, FERC respondent ID, and report year. When possible the FERC license number for small hydro plants is also manually extracted from the data.
This categorization will need to be renewed with each additional year of FERC data we pull in. As of v0.1 the small plants have been categorized for 2004-2015.
-
pudl.transform.ferc1.
plants_steam
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 plant_steam data for loading into PUDL Database.
This includes converting to our preferred units of MWh and MW, as well as standardizing the strings describing the kind of plant and construction.
- Parameters
- Returns
A dictionary of transformed dataframes, including the newly transformed plants_steam_ferc1 dataframe.
- Return type
dict
-
pudl.transform.ferc1.
plants_steam_validate_ids
(ferc1_steam_df)[source]¶ Tests that the plant_id_ferc1 time series includes one record per year.
- Parameters
ferc1_steam_df (pandas.DataFrame) – A DataFrame of the data from the FERC 1 Steam table.
- Returns
None
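The check amounts to asserting per-year uniqueness, as in this sketch (the column names come from the PUDL steam table; the toy data is illustrative):

    import pandas as pd

    def check_one_record_per_year(steam_df: pd.DataFrame) -> None:
        # Each (plant_id_ferc1, report_year) pair should appear exactly once.
        dupes = steam_df.duplicated(subset=["plant_id_ferc1", "report_year"], keep=False)
        if dupes.any():
            raise AssertionError(f"{dupes.sum()} duplicate plant-year records found")

    check_one_record_per_year(
        pd.DataFrame({"plant_id_ferc1": [1, 1], "report_year": [2018, 2019]})
    )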
-
pudl.transform.ferc1.
purchased_power
(ferc1_raw_dfs, ferc1_transformed_dfs)[source]¶ Transforms FERC Form 1 purchased power data for loading into PUDL.
This table has data about inter-utility power purchases, including how much electricity was purchased, how much it cost, and who it was purchased from. Unfortunately the field describing which other utility the power was being bought from is poorly standardized, making it difficult to correlate with other data. It will need to be categorized by hand or with some fuzzy matching eventually.
-
pudl.transform.ferc1.
transform
(ferc1_raw_dfs, ferc1_tables=('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1'))[source]¶ Transforms the raw FERC Form 1 dataframes.
- Parameters
- Returns
A dictionary of the transformed DataFrames.
- Return type
dict
-
pudl.transform.ferc1.
unpack_table
(ferc1_df, table_name, data_cols, data_rows)[source]¶ Normalize a row-and-column based FERC Form 1 table.
Pulls the named database table from the FERC Form 1 DB and uses the corresponding ferc1_row_map to unpack the row_number coded data.
- Parameters
ferc1_df (pandas.DataFrame) – Raw FERC Form 1 DataFrame from the DB.
table_name (str) – Original name of the FERC Form 1 DB table.
data_cols (list) – List of strings corresponding to the original FERC Form 1 database table column labels – these are the columns of data that we are extracting (it can be a subset of the columns which are present in the original database).
data_rows (list) – List of row_names to extract, as defined in the FERC 1 row maps. Set to slice(None) if you want all rows.
- Returns
pandas.DataFrame
pudl.transform.ferc714 module¶
Transformation of the FERC Form 714 data.
-
pudl.transform.ferc714.
BAD_RESPONDENTS
= [319, 99991, 99992, 99993, 99994, 99995]¶ Fake respondent IDs for database test entities.
-
pudl.transform.ferc714.
EIA_CODE_FIXES
= {125: 2775, 134: 5416, 203: 12341, 257: 59504, 292: 20382, 295: 40229, 301: 14725, 302: 14725, 303: 14725, 304: 14725, 305: 14725, 306: 14725, 307: 14379, 309: 12427, 315: 56090, 323: 58790, 324: 58791, 329: 39347}¶ Overrides of FERC 714 respondent IDs with wrong or missing EIA Codes
-
pudl.transform.ferc714.
OFFSET_CODES
= {'AKDT': Timedelta('-1 days +15:00:00'), 'AKST': Timedelta('-1 days +15:00:00'), 'CDT': Timedelta('-1 days +18:00:00'), 'CST': Timedelta('-1 days +18:00:00'), 'EDT': Timedelta('-1 days +19:00:00'), 'EST': Timedelta('-1 days +19:00:00'), 'HST': Timedelta('-1 days +14:00:00'), 'MDT': Timedelta('-1 days +17:00:00'), 'MST': Timedelta('-1 days +17:00:00'), 'PDT': Timedelta('-1 days +16:00:00'), 'PST': Timedelta('-1 days +16:00:00')}¶ A mapping of timezone offset codes to Timedelta offsets from UTC.
Note that the FERC 714 instructions state that all hourly demand is to be reported in STANDARD time for whatever timezone is being used. Even though many respondents use daylight savings / standard time abbreviations, a large majority do appear to conform to using a single UTC offset throughout the year. There are 6 instances in which the timezone associated with reporting changed from one year to the next, and these result in duplicate records, which are dropped.
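Converting reported local standard times to UTC with a fixed offset can be sketched as follows (a minimal example; the real transform draws its offsets from the OFFSET_CODES mapping above):

    import pandas as pd

    offsets = {"EST": pd.Timedelta(hours=-5), "CST": pd.Timedelta(hours=-6)}
    df = pd.DataFrame({
        "local_time": pd.to_datetime(["2019-06-01 00:00", "2019-06-01 01:00"]),
        "utc_offset_code": ["EST", "CST"],
    })
    # offset = local - UTC, so UTC = local - offset.
    df["utc_datetime"] = df["local_time"] - df["utc_offset_code"].map(offsets)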
-
pudl.transform.ferc714.
TZ_CODES
= {'AKDT': 'America/Anchorage', 'AKST': 'America/Anchorage', 'CDT': 'America/Chicago', 'CST': 'America/Chicago', 'EDT': 'America/New_York', 'EST': 'America/New_York', 'HST': 'Pacific/Honolulu', 'MDT': 'America/Denver', 'MST': 'America/Denver', 'PDT': 'America/Los_Angeles', 'PST': 'America/Los_Angeles'}¶ Mapping between standardized time offset codes and canonical timezones.
-
pudl.transform.ferc714.
demand_hourly_pa
(tfr_dfs)[source]¶ Transform the hourly demand time series by Planning Area.
Transformations include (see the sketch after this list):
Clean UTC offset codes.
Replace UTC offset codes with UTC offset and timezone.
Drop 25th hour rows.
Set records with 0 UTC code to 0 demand.
Drop duplicate rows.
Flip negative signs for reported demand.
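A toy sketch of two of these cleanups (the column names are illustrative, not PUDL's):

    import pandas as pd

    demand = pd.DataFrame({"hour": [24, 25, 1], "demand_mwh": [50.0, 60.0, -100.0]})
    demand = demand[demand["hour"] <= 24].copy()        # drop 25th-hour rows
    demand["demand_mwh"] = demand["demand_mwh"].abs()   # flip negative demand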
-
pudl.transform.ferc714.
respondent_id
(tfr_dfs)[source]¶ Transform the FERC 714 respondent IDs, names, and EIA utility IDs.
This consists primarily of dropping test respondents and manually assigning EIA utility IDs to a few FERC Form 714 respondents that report planning area demand, but which don’t have their corresponding EIA utility IDs provided by FERC for some reason (including PacifiCorp).
-
pudl.transform.ferc714.
transform
(raw_dfs, tables=('respondent_id_ferc714', 'id_certification_ferc714', 'gen_plants_ba_ferc714', 'demand_monthly_ba_ferc714', 'net_energy_load_ba_ferc714', 'adjacency_ba_ferc714', 'interchange_ba_ferc714', 'lambda_hourly_ba_ferc714', 'lambda_description_ferc714', 'description_pa_ferc714', 'demand_forecast_pa_ferc714', 'demand_hourly_pa_ferc714'))[source]¶ Transform the raw FERC 714 dataframes into datapackage-ready outputs.
- Parameters
raw_dfs (dict) – A dictionary of raw pandas.DataFrame objects, as read out of the original FERC 714 CSV files. Generated by the pudl.extract.ferc714.extract() function.
tables (iterable) – The set of PUDL tables within FERC 714 that we should process. Typically set to all of them, unless
- Returns
A dictionary of pandas.DataFrame objects that are ready to be output in a data package / database table.
- Return type
dict
pudl.transform package¶
Modules implementing the “Transform” step of the PUDL ETL pipeline.
Each module in this subpackage transforms the tabular data associated with a single data source from the PUDL data sources. This process begins with a dictionary of “raw” pandas.DataFrame objects produced by the corresponding data source specific routines from the pudl.extract subpackage, and ends with a dictionary of pandas.DataFrame objects that are fully normalized, cleaned, and congruent with the tabular datapackage metadata – i.e. they are ready to be exported by the pudl.load module.
Inputs to the transform functions are a dictionary of dataframes, each of which represents a concatenation of records with common column names from across some set of years of reported data. The names of those columns are determined by the xlsx_maps metadata associated with the given dataset in PUDL’s package_metadata.
This raw data is transformed in 3 main steps:
Structural transformations that re-shape / tidy the data and turn it into rows that represent a single observation, and columns that represent a single variable. These transformations should not require knowledge of or access to the contents of the data, which may or may not yet be usable at this point, depending on the true data type and how much cleaning has to happen. One exception to this that may come up is the need to clean up columns that are part of the primary composite key, since you can’t usefully index on NA values. Alternatively this might mean removing rows that have invalid key values.
Data type compatibility: whatever massaging of the data is required to ensure that it can be cast to the appropriate data type, including identifying NA values and assigning them to an appropriate type-specific NA value. At the end of this you can assign all the columns their (preferably nullable) types. Note that because some of the columns that exist at this point may not end up in the final database table, you may need to set them individually, rather than using the systemwide dictionary of column data types.
Value based data cleaning: At this point every column should have a known, homogeneous type, allowing it to be reliably manipulated as a Series, so we can move on to cleaning up the values themselves. This includes re-coding freeform string fields to impose a controlled vocabulary, converting column units (e.g. kWh to MWh) and renaming the columns appropriately, as well as correcting clear data entry errors.
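As a concrete example of the third step, a unit conversion with its accompanying rename might look like this (the column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"net_generation_kwh": [1_500_000.0]})
    df["net_generation_kwh"] /= 1000.0  # kWh -> MWh
    df = df.rename(columns={"net_generation_kwh": "net_generation_mwh"})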
At the end of the main coordinating transform() function, every column that remains in each of the transformed dataframes should correspond to a column that will exist in the database and be associated with the EIA datasets, which means it is also part of the EIA column namespace. It’s important that you make sure these column names match the naming conventions that are being used, and if any of the columns exist in other tables, that they have exactly the same name and datatype.
If you find that you need to rename a column for it to conform to those requirements, in many cases that should happen in the xlsx_map metadata, so that column renamings can be kept to a minimum and only used for real semantic transformations of a column (like a unit conversion).
At the end of this step, it should be easy to categorize every column in every dataframe as to whether it is a “data” column (containing data unique to the table it is found in) or whether it is part of the primary key for the table (the minimal set of columns whose values are required to uniquely specify a record), and/or whether it is a “denormalized” column whose home table is really elsewhere in the database. Note that denormalized columns may also be part of the primary key. This information is important for the step after the intra-table transformations during which the collection of EIA tables is normalized as a whole.
pudl.workspace package¶
Datastore manages file retrieval for PUDL datasets.
-
exception
pudl.workspace.datastore.
ChecksumMismatch
[source]¶ Bases:
ValueError
Resource checksum (md5) does not match.
-
class
pudl.workspace.datastore.
DatapackageDescriptor
(datapackage_json: dict, dataset: str, doi: str)[source]¶ Bases:
object
A simple wrapper providing access to datapackage.json contents.
-
get_json_string
() → str[source]¶ Exports the underlying json as a normalized (sorted, indented) JSON string.
-
get_partitions
(name: Optional[str] = None) → Dict[str, Set[str]][source]¶ Returns a mapping of all known partition keys to the sets of their known values.
-
get_resource_path
(name: str) → str[source]¶ Returns zenodo url that holds contents of given named resource.
-
get_resources
(name: Optional[str] = None, **filters: Any) → Iterator[pudl.workspace.resource_cache.PudlResourceKey][source]¶ Returns series of PudlResourceKey identifiers for matching resources.
-
-
class
pudl.workspace.datastore.
Datastore
(local_cache_path: Optional[pathlib.Path] = None, gcs_cache_path: Optional[str] = None, sandbox: bool = False, timeout: float = 15)[source]¶ Bases:
object
Handle connections and downloading of Zenodo Source archives.
-
get_datapackage_descriptor
(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor[source]¶ Fetch datapackage descriptor for given dataset either from cache or from zenodo.
-
get_resources
(dataset: str, cached_only: bool = False, skip_optimally_cached: bool = False, **filters: Any) → Iterator[Tuple[pudl.workspace.resource_cache.PudlResourceKey, bytes]][source]¶ Return content of the matching resources.
- Parameters
dataset (str) – name of the dataset to query.
cached_only (bool) – if True, only retrieve resources that are present in the cache.
skip_optimally_cached (bool) – if True, only retrieve resources that are not optimally cached. This triggers an attempt to optimally cache these resources.
filters (key=val) – only return resources that match the key-value mapping in their metadata["parts"].
- Yields
(PudlResourceKey, bytes) pairs holding the content of each matching resource
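A usage sketch based on the signatures documented here (the year and state filter keys are assumptions about the resources' "parts" metadata, and the cache path is a placeholder):

    import pathlib
    from pudl.workspace.datastore import Datastore

    ds = Datastore(local_cache_path=pathlib.Path("~/.pudl-cache").expanduser())
    for key, content in ds.get_resources("epacems", year=2019, state="co"):
        print(key, len(content), "bytes")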
-
get_unique_resource
(dataset: str, **filters: Any) → bytes[source]¶ Returns content of a resource assuming there is exactly one that matches.
-
get_zipfile_resource
(dataset: str, **filters: Any) → zipfile.ZipFile[source]¶ Retrieves unique resource and opens it as a ZipFile.
-
remove_from_cache
(res: pudl.workspace.resource_cache.PudlResourceKey)[source]¶ Remove given resource from the associated cache.
-
-
class
pudl.workspace.datastore.
ParseKeyValues
(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]¶ Bases:
argparse.Action
Transforms k1=v1,k2=v2,… into dict(k1=v1, k2=v2, …).
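A self-contained sketch of an argparse action with these semantics (it mirrors the described behavior; it is not the PUDL implementation):

    import argparse

    class ParseKeyValuesSketch(argparse.Action):
        def __call__(self, parser, namespace, values, option_string=None):
            # Accumulate k=v pairs into a dict on the namespace.
            d = dict(getattr(namespace, self.dest, None) or {})
            for pair in values.split(","):
                key, _, val = pair.partition("=")
                d[key] = val
            setattr(namespace, self.dest, d)

    parser = argparse.ArgumentParser()
    parser.add_argument("--partition", action=ParseKeyValuesSketch, default={})
    print(parser.parse_args(["--partition", "year=2019,state=co"]).partition)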
-
class
pudl.workspace.datastore.
ZenodoFetcher
(sandbox: bool = False, timeout: float = 15.0)[source]¶ Bases:
object
API for fetching datapackage descriptors and resource contents from zenodo.
-
API_ROOT
= {'production': 'https://zenodo.org/api', 'sandbox': 'https://sandbox.zenodo.org/api'}¶
-
DOI
= {'production': {'censusdp1tract': '10.5281/zenodo.4127049', 'eia860': '10.5281/zenodo.4127027', 'eia860m': '10.5281/zenodo.4540268', 'eia861': '10.5281/zenodo.4127029', 'eia923': '10.5281/zenodo.4127040', 'epacems': '10.5281/zenodo.4660268', 'ferc1': '10.5281/zenodo.4127044', 'ferc714': '10.5281/zenodo.4127101'}, 'sandbox': {'censusdp1tract': '10.5072/zenodo.674992', 'eia860': '10.5072/zenodo.672210', 'eia860m': '10.5072/zenodo.692655', 'eia861': '10.5072/zenodo.687052', 'eia923': '10.5072/zenodo.687071', 'epacems': '10.5072/zenodo.672963', 'ferc1': '10.5072/zenodo.687072', 'ferc714': '10.5072/zenodo.672224'}}¶
-
TOKEN
= {'production': 'KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5', 'sandbox': 'qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco'}¶
-
get_descriptor
(dataset: str) → pudl.workspace.datastore.DatapackageDescriptor[source]¶ Returns DatapackageDescriptor for given dataset.
-
get_resource
(res: pudl.workspace.resource_cache.PudlResourceKey) → bytes[source]¶ Given resource key, retrieve contents of the file from zenodo.
-
get_resource_key
(dataset: str, name: str) → pudl.workspace.resource_cache.PudlResourceKey[source]¶ Returns PudlResourceKey for given resource.
-
-
pudl.workspace.datastore.
fetch_resources
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None[source]¶ Retrieve all matching resources and store them in the cache.
-
pudl.workspace.datastore.
print_partitions
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str]) → None[source]¶ Prints known partition keys and their values for each of the datasets.
-
pudl.workspace.datastore.
validate_cache
(dstore: pudl.workspace.datastore.Datastore, datasets: List[str], args: argparse.Namespace) → None[source]¶ Validate elements in the datastore cache. Delete invalid entries from cache.
Implementations of datastore resource caches.
-
class
pudl.workspace.resource_cache.
AbstractCache
(read_only: bool = False)[source]¶ Bases:
abc.ABC
Defines the interface for the generic resource caching layer.
-
abstract
add
(resource: pudl.workspace.resource_cache.PudlResourceKey, content: bytes) → None[source]¶ Adds resource to the cache and sets the content.
-
abstract
contains
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bool[source]¶ Returns True if the resource is present in the cache.
-
abstract
delete
(resource: pudl.workspace.resource_cache.PudlResourceKey) → None[source]¶ Removes the resource from cache.
-
abstract
get
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bytes[source]¶ Retrieves content of given resource or throws KeyError.
-
-
class
pudl.workspace.resource_cache.
GoogleCloudStorageCache
(gcs_path: str, **kwargs: Any)[source]¶ Bases:
pudl.workspace.resource_cache.AbstractCache
Implements file cache backed by Google Cloud Storage bucket.
-
add
(resource: pudl.workspace.resource_cache.PudlResourceKey, value: bytes)[source]¶ Adds (or updates) resource to the cache with given value.
-
contains
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bool[source]¶ Returns True if resource is present in the cache.
-
delete
(resource: pudl.workspace.resource_cache.PudlResourceKey)[source]¶ Deletes resource from the cache.
-
get
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bytes[source]¶ Retrieves value associated with given resource.
-
-
class
pudl.workspace.resource_cache.
LayeredCache
(*caches: List[pudl.workspace.resource_cache.AbstractCache], **kwargs: Any)[source]¶ Bases:
pudl.workspace.resource_cache.AbstractCache
Implements a multi-layered system of caches.
The idea is that you can have faster local caches that fall back to more remote or expensive caches when content is missing.
Only the closest layer is written to (add, delete), while all remaining layers are read-only (get).
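A construction sketch based on the classes documented in this module (the cache paths are placeholders):

    import pathlib
    from pudl.workspace.resource_cache import (
        GoogleCloudStorageCache, LayeredCache, LocalFileCache)

    # Writable local cache in front, read-only GCS fallback behind it.
    cache = LayeredCache(
        LocalFileCache(pathlib.Path("~/.pudl-cache").expanduser()),
        GoogleCloudStorageCache("gs://example-bucket/pudl"),
    )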
-
add
(resource: pudl.workspace.resource_cache.PudlResourceKey, value)[source]¶ Adds (or replaces) resource into the cache with given value.
-
add_cache_layer
(cache: pudl.workspace.resource_cache.AbstractCache)[source]¶ Adds a caching layer. Its priority is below all existing layers.
-
contains
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bool[source]¶ Returns True if resource is present in the cache.
-
delete
(resource: pudl.workspace.resource_cache.PudlResourceKey)[source]¶ Removes resource from the cache if the cache is not in the read_only mode.
-
get
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bytes[source]¶ Returns content of a given resource.
-
is_optimally_cached
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bool[source]¶ Returns true if the resource is contained in the closest write-enabled layer.
-
-
class
pudl.workspace.resource_cache.
LocalFileCache
(cache_root_dir: pathlib.Path, **kwargs: Any)[source]¶ Bases:
pudl.workspace.resource_cache.AbstractCache
Simple key-value store mapping PudlResourceKeys to bytes contents.
-
add
(resource: pudl.workspace.resource_cache.PudlResourceKey, content: bytes)[source]¶ Adds (or updates) resource to the cache with given value.
-
contains
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bool[source]¶ Returns True if resource is present in the cache.
-
delete
(resource: pudl.workspace.resource_cache.PudlResourceKey)[source]¶ Deletes resource from the cache.
-
get
(resource: pudl.workspace.resource_cache.PudlResourceKey) → bytes[source]¶ Retrieves value associated with a given resource.
-
Tools for setting up and managing PUDL workspaces.
-
pudl.workspace.setup.
deploy
(pkg_path, deploy_dir, ignore_files, clobber=False)[source]¶ Deploy all files from a package_data directory into a workspace.
- Parameters
pkg_path (str) – Dotted module path to the subpackage inside of package_data containing the resources to be deployed.
deploy_dir (os.PathLike) – Directory on the filesystem to which the files within pkg_path should be deployed.
ignore_files (iterable) – List of filenames (strings) that may be present in the pkg_path subpackage, but that should be ignored.
clobber (bool) – if True, replace existing copies of the files that are being deployed from pkg_path to deploy_dir. If False, do not replace existing files.
- Returns
None
-
pudl.workspace.setup.
derive_paths
(pudl_in, pudl_out)[source]¶ Derive PUDL paths based on given input and output paths.
If no configuration file path is provided, attempt to read in the user configuration from a file called .pudl.yml in the user’s HOME directory. Presently the only values we expect are pudl_in and pudl_out, directories that store the files PUDL either depends on or creates.
- Parameters
pudl_in (os.PathLike) – Path to the directory containing the PUDL input files, most notably the data directory which houses the raw data downloaded from public agencies by the pudl.workspace.datastore tools. pudl_in may be the same directory as pudl_out.
pudl_out (os.PathLike) – Path to the directory where PUDL should write the outputs it generates. These will be organized into directories according to the output format (sqlite, datapackage, etc.).
- Returns
A dictionary containing common PUDL settings, derived from those read out of the YAML file. Mostly paths for inputs & outputs.
- Return type
dict
-
pudl.workspace.setup.
get_defaults
()[source]¶ Read paths to default PUDL input/output dirs from user’s $HOME/.pudl.yml.
- Parameters
None –
- Returns
The contents of the user’s PUDL settings file, with keys pudl_in and pudl_out defining their default PUDL workspace. If the $HOME/.pudl.yml file does not exist, these paths are set to None.
- Return type
dict
-
pudl.workspace.setup.
init
(pudl_in, pudl_out, clobber=False)[source]¶ Set up a new PUDL working environment based on the user settings.
- Parameters
pudl_in (os.PathLike) – Path to the directory containing the PUDL input files, most notably the data directory which houses the raw data downloaded from public agencies by the pudl.workspace.datastore tools. pudl_in may be the same directory as pudl_out.
pudl_out (os.PathLike) – Path to the directory where PUDL should write the outputs it generates. These will be organized into directories according to the output format (sqlite, datapackage, etc.).
clobber (bool) – if True, replace existing files. If False (the default) do not replace existing files.
- Returns
None
-
pudl.workspace.setup.
set_defaults
(pudl_in, pudl_out, clobber=False)[source]¶ Set default user input and output locations in $HOME/.pudl.yml.
Create a user settings file, for future reference, that defines the default PUDL input and output directories. If this file already exists, behavior depends on the clobber parameter, which is False by default. If it’s True, the existing file is replaced. If False, the existing file is not changed.
- Parameters
pudl_in (os.PathLike) – Path to be used as the default input directory for PUDL – this is where pudl.workspace.datastore will look to find the data directory, full of data from public agencies.
pudl_out (os.PathLike) – Path to the default output directory for PUDL, where results of data processing will be organized.
clobber (bool) – If True and a user settings file exists, overwrite it. If False, do not alter the existing file. Defaults to False.
- Returns
None
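A usage sketch tying these functions together (the paths are placeholders):

    from pudl.workspace import setup

    setup.set_defaults("/data/pudl_in", "/data/pudl_out", clobber=False)
    defaults = setup.get_defaults()          # reads $HOME/.pudl.yml
    setup.init(defaults["pudl_in"], defaults["pudl_out"])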
Set up a well-organized PUDL data management workspace.
This script creates a well-defined directory structure for use by the PUDL package, and copies several example settings files and Jupyter notebooks into it to get you started. If the command is run without any arguments, it will create this workspace in your current directory.
The script will also create a file named .pudl.yml, describing the location of your PUDL workspace. The PUDL package will refer to this location in the future to know where it should look for raw data, where to put its outputs, etc. This file can be edited to change the default input and output directories if you wish. However, make sure those workspaces are set up using this script.
It’s also possible to specify different input and output directories, which is useful if you want to use a single PUDL data store (which may contain many GB of data) to support several different workspaces. See the --pudl_in and --pudl_out options.
By default the script will not overwrite existing files. If you want it to replace existing files (including your .pudl.yml file which defines your default PUDL workspace) use the --clobber option.
The directory structure set up for PUDL looks like this:
PUDL_IN
└── data
    ├── censusdp1tract
    ├── eia860
    ├── eia860m
    ├── eia861
    ├── eia923
    ├── epacems
    ├── ferc1
    ├── ferc714
    └── tmp
PUDL_OUT
├── datapkg
├── parquet
├── settings
└── sqlite
Initially, the directories in the data store will be empty. The pudl_datastore or pudl_etl commands will download data from public sources and organize it for you there by source. The PUDL_OUT directories are organized by the type of file they contain.
Tools for acquiring PUDL’s original input data and organizing it locally.
The datastore subpackage takes care of downloading original data from various public sources, organizing it locally, and providing a programmatic interface to that collection of raw inputs, which we refer to as the PUDL datastore.
These tools are available both as a library module and via a command line interface installed as an entrypoint script called pudl_datastore. For full reproducibility of PUDL’s ETL pipeline outputs, the datastore should be archived alongside the PUDL release which was used and the resulting datapackage outputs.
Submodules¶
pudl.cli module¶
A command line interface (CLI) to the main PUDL ETL functionality.
This script generates datapackages based on the datapackage settings enumerated in the settings_file which is given as an argument to this script. If the settings have empty datapackage parameters (meaning there are no years or tables included), no datapackages will be generated. If the settings include a datapackage that has empty parameters, the other valid datapackages will be generated, but not the empty one. If there are invalid parameters (meaning a partition that is not included in pudl.constants.working_partitions), the build will fail early on in the process.
The datapackages will be stored in “PUDL_OUT” in the “datapackage” subdirectory. Currently, this function only uses default directories for “PUDL_IN” and “PUDL_OUT” (meaning those stored in $HOME/.pudl.yml). To set up your default PUDL directories, see the pudl_setup script (pudl_setup --help for more details).
pudl.constants module¶
A warehouse for constant values required to initialize the PUDL Database.
This constants module stores and organizes a bunch of constant values which are used throughout PUDL to populate static lists within the data packages or for data cleaning purposes.
-
pudl.constants.
TRANSIT_TYPE_DICT
= {'CV': 'conveyer', 'PL': 'pipeline', 'RR': 'railroad', 'TK': 'truck', 'UN': 'unknown', 'WA': 'water'}¶ A dictionary mapping fuel transport mode codes (keys) to transport mode descriptions (values).
- Type
dict
-
pudl.constants.
aer_coal_strings
= ['col', 'woc', 'pc']¶ A list of EIA 923 AER fuel type strings associated with coal.
- Type
list
-
pudl.constants.
aer_fuel_type_strings
= {'coal': ['col', 'woc', 'pc'], 'gas': ['mlg', 'ng', 'oog'], 'hydro': ['hps', 'hyc'], 'nuclear': ['nuc'], 'oil': ['dfo', 'rfo', 'woo'], 'other': ['geo', 'orw', 'oth'], 'solar': ['sun'], 'waste': ['www'], 'wind': ['wnd']}¶ A dictionary mapping EIA 923 AER fuel types (keys) to lists of strings associated with that fuel type (values).
- Type
dict
-
pudl.constants.
aer_gas_strings
= ['mlg', 'ng', 'oog']¶ A list of EIA 923 AER fuel type strings associated with gas.
- Type
list
-
pudl.constants.
aer_hydro_strings
= ['hps', 'hyc']¶ A list of EIA 923 AER fuel type strings associated with hydro power.
- Type
list
-
pudl.constants.
aer_nuclear_strings
= ['nuc']¶ A list of EIA 923 AER fuel type strings associated with nuclear power.
- Type
list
-
pudl.constants.
aer_oil_strings
= ['dfo', 'rfo', 'woo']¶ A list of EIA 923 AER fuel type strings associated with oil.
- Type
list
-
pudl.constants.
aer_other_strings
= ['geo', 'orw', 'oth']¶ A list of EIA 923 AER fuel type strings associated with other fuel.
- Type
list
-
pudl.constants.
aer_solar_strings
= ['sun']¶ A list of EIA 923 AER fuel type strings associated with solar power.
- Type
list
-
pudl.constants.
aer_waste_strings
= ['www']¶ A list of EIA 923 AER fuel type strings associated with waste.
- Type
list
-
pudl.constants.
aer_wind_strings
= ['wnd']¶ A list of EIA 923 AER fuel type strings associated with wind power.
- Type
list
-
pudl.constants.
base_data_urls
= {'eia860': 'https://www.eia.gov/electricity/data/eia860', 'eia861': 'https://www.eia.gov/electricity/data/eia861/zip', 'eia923': 'https://www.eia.gov/electricity/data/eia923', 'epacems': 'ftp://newftp.epa.gov/dmdnload/emissions/hourly/monthly', 'epaipm': 'https://www.epa.gov/sites/production/files/2019-03', 'ferc1': 'ftp://eforms1.ferc.gov/f1allyears', 'ferc714': 'https://www.ferc.gov/docs-filing/forms/form-714/data', 'ferceqr': 'ftp://eqrdownload.ferc.gov/DownloadRepositoryProd/BulkNew/CSV', 'msha': 'https://arlweb.msha.gov/OpenGovernmentData/DataSets', 'pudl': 'https://catalyst.coop/pudl/'}¶ A dictionary containing data sources (keys) and their base data URLs (values).
- Type
dict
-
pudl.constants.
canada_prov_terr
= {'AB': 'Alberta', 'BC': 'British Columbia', 'CN': 'Canada', 'MB': 'Manitoba', 'NB': 'New Brunswick', 'NL': 'Newfoundland and Labrador', 'NS': 'Nova Scotia', 'NT': 'Northwest Territories', 'NU': 'Nunavut', 'ON': 'Ontario', 'PE': 'Prince Edwards Island', 'QC': 'Quebec', 'SK': 'Saskatchewan', 'YT': 'Yukon Territory'}¶ A dictionary containing Canadian provinces’ and territories’ abbreviations (keys) and names (values)
- Type
dict
-
pudl.constants.
cems_states
= {'AL': 'Alabama', 'AR': 'Arkansas', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia', 'IA': 'Iowa', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'MA': 'Massachusetts', 'MD': 'Maryland', 'ME': 'Maine', 'MI': 'Michigan', 'MN': 'Minnesota', 'MO': 'Missouri', 'MS': 'Mississippi', 'MT': 'Montana', 'NC': 'North Carolina', 'ND': 'North Dakota', 'NE': 'Nebraska', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NV': 'Nevada', 'NY': 'New York', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VA': 'Virginia', 'VT': 'Vermont', 'WA': 'Washington', 'WI': 'Wisconsin', 'WV': 'West Virginia', 'WY': 'Wyoming'}¶ A dictionary containing US state abbreviations (keys) and names (values) that are present in the CEMS dataset
- Type
dict
-
pudl.constants.
census_region
= {'ENC': 'East North Central', 'ESC': 'East South Central', 'MAT': 'Middle Atlantic', 'MTN': 'Mountain', 'NEW': 'New England', 'PACC': 'Pacific Contiguous (OR, WA, CA)', 'PACN': 'Pacific Non-Contiguous (AK, HI)', 'SAT': 'South Atlantic', 'WNC': 'West North Central', 'WSC': 'West South Central'}¶ A dictionary mapping Census Region abbreviations (keys) to Census Region names (values).
- Type
dict
-
pudl.constants.
coalmine_country_eia923
= {'AU': 'AUS', 'CL': 'COL', 'CN': 'CAN', 'IM': 'unknown', 'IS': 'IDN', 'OT': 'other_country', 'PL': 'POL', 'RS': 'RUS', 'UK': 'GBR', 'VZ': 'VEN'}¶ A dictionary mapping coal mine country codes (keys) to ISO-3166-1 three letter country codes (values).
- Type
dict
-
pudl.constants.
coalmine_type_eia923
= {'P': 'Preparation Plant', 'S': 'Surface', 'SU': 'Both an underground and surface mine with most coal extracted from surface', 'U': 'Underground', 'US': 'Both an underground and surface mine with most coal extracted from underground'}¶ A dictionary mapping EIA 923 coal mine type codes (keys) to descriptions (values).
- Type
dict
-
pudl.constants.
contract_type_eia923
= {'C': 'Contract - Fuel received under a purchase order or contract with a term of one year or longer. Contracts with a shorter term are considered spot purchases ', 'N': 'New Contract - see NC code. This abbreviation existed only in 2008 before being replaced by NC.', 'NC': 'New Contract - Fuel received under a purchase order or contract with duration of one year or longer, under which deliveries were first made during the reporting month', 'S': 'Spot Purchase', 'T': 'Tolling Agreement – Fuel received under a tolling agreement (bartering arrangement of fuel for generation)'}¶ A dictionary mapping EIA 923 contract codes (keys) to contract descriptions (values) for each month in the Fuel Receipts and Costs table.
- Type
dict
-
pudl.constants.
contributors
= {'alana-wilson': {'email': 'alana.wilson@catalyst.coop', 'organization': 'Catalyst Cooperative', 'role': 'contributor', 'title': 'Alana Wilson'}, 'catalyst-cooperative': {'email': 'pudl@catalyst.coop', 'organization': 'Catalyst Cooperative', 'path': 'https://catalyst.coop/', 'role': 'publisher', 'title': 'Catalyst Cooperative'}, 'christina-gosnell': {'email': 'christina.gosnell@catalyst.coop', 'organization': 'Catalyst Cooperative', 'role': 'contributor', 'title': 'Christina Gosnell'}, 'greg-schivley': {'role': 'contributor', 'title': 'Greg Schivley'}, 'karl-dunkle-werner': {'email': 'karldw@berkeley.edu', 'organization': 'UC Berkeley', 'path': 'https://karldw.org/', 'role': 'contributor', 'title': 'Karl Dunkle Werner'}, 'steven-winter': {'email': 'steven.winter@catalyst.coop', 'organization': 'Catalyst Cooperative', 'role': 'contributor', 'title': 'Steven Winter'}, 'zane-selvans': {'email': 'zane.selvans@catalyst.coop', 'organization': 'Catalyst Cooperative', 'path': 'https://amateurearthling.org/', 'role': 'wrangler', 'title': 'Zane Selvans'}}¶ A dictionary of dictionaries containing organization names (keys) and their attributes (values).
- Type
dict
-
pudl.constants.
contributors_by_source
= {'eia860': ['catalyst-cooperative', 'zane-selvans', 'christina-gosnell', 'steven-winter', 'alana-wilson'], 'eia923': ['catalyst-cooperative', 'zane-selvans', 'christina-gosnell', 'steven-winter'], 'epacems': ['catalyst-cooperative', 'karl-dunkle-werner', 'zane-selvans'], 'epaipm': ['greg-schivley'], 'ferc1': ['catalyst-cooperative', 'zane-selvans', 'christina-gosnell', 'steven-winter', 'alana-wilson'], 'pudl': ['catalyst-cooperative', 'zane-selvans', 'christina-gosnell', 'steven-winter', 'alana-wilson', 'karl-dunkle-werner']}¶ A dictionary of data sources (keys) and lists of contributors (values).
- Type
dict
-
pudl.constants.
data_source_info
= {'eia860': {'path': 'https://www.eia.gov/electricity/data/eia860/', 'title': 'EIA Form 860'}, 'eia861': {'path': 'https://www.eia.gov/electricity/data/eia861/', 'title': 'EIA Form 861'}, 'eia923': {'path': 'https://www.eia.gov/electricity/data/eia923/', 'title': 'EIA Form 923'}, 'eiawater': {'path': 'https://www.eia.gov/electricity/data/water/', 'title': 'EIA Water Use for Power'}, 'epacems': {'path': 'https://ampd.epa.gov/ampd/', 'title': 'EPA Air Markets Program Data'}, 'epaipm': {'path': 'https://www.epa.gov/airmarkets/national-electric-energy-data-system-needs-v6', 'title': 'EPA Integrated Planning Model'}, 'ferc1': {'path': 'https://www.ferc.gov/docs-filing/forms/form-1/data.asp', 'title': 'FERC Form 1'}, 'ferc714': {'path': 'https://www.ferc.gov/docs-filing/forms/form-714/data.asp', 'title': 'FERC Form 714'}, 'ferceqr': {'path': 'https://www.ferc.gov/docs-filing/eqr.asp', 'title': 'FERC Electric Quarterly Report'}, 'msha': {'path': 'https://www.msha.gov/mine-data-retrieval-system', 'title': 'Mining Safety and Health Administration'}, 'phmsa': {'path': 'https://www.phmsa.dot.gov/data-and-statistics/pipeline/data-and-statistics-overview', 'title': 'Pipelines and Hazardous Materials Safety Administration'}, 'pudl': {'email': 'pudl@catalyst.coop', 'path': 'https://catalyst.coop/pudl/', 'title': 'The Public Utility Data Liberation Project (PUDL)'}}¶ A dictionary of dictionaries containing datasources (keys) and associated attributes (values)
- Type
dict
-
pudl.constants.
data_sources
= ('eia860', 'eia861', 'eia923', 'epacems', 'epaipm', 'ferc1', 'ferc714')¶ A tuple containing the data sources we are able to pull into PUDL.
- Type
tuple
-
pudl.constants.
data_years
= {'eia860': (2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), 'eia861': (1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), 'eia923': (2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), 'epacems': (1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020), 'epaipm': (None,), 'ferc1': (1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019), 'ferc714': (None,)}¶ A dictionary of data sources (keys) and tuples containing the years that we expect to be able to download for each data source (values).
- Type
dict
-
pudl.constants.
dbf_typemap
= {'+': 'XXX', '0': <class 'sqlalchemy.sql.sqltypes.Integer'>, '@': 'XXX', 'B': 'XXX', 'C': <class 'sqlalchemy.sql.sqltypes.String'>, 'D': <class 'sqlalchemy.sql.sqltypes.Date'>, 'F': <class 'sqlalchemy.sql.sqltypes.Float'>, 'G': 'XXX', 'I': <class 'sqlalchemy.sql.sqltypes.Integer'>, 'L': <class 'sqlalchemy.sql.sqltypes.Boolean'>, 'M': <class 'sqlalchemy.sql.sqltypes.Text'>, 'N': <class 'sqlalchemy.sql.sqltypes.Float'>, 'O': 'XXX', 'T': <class 'sqlalchemy.sql.sqltypes.DateTime'>}¶ A dictionary mapping field types in the DBF objects (keys) to the corresponding generic SQLAlchemy Column types.
- Type
dict
-
pudl.constants.
eia860_pudl_tables
= ('boiler_generator_assn_eia860', 'utilities_eia860', 'plants_eia860', 'generators_eia860', 'ownership_eia860')¶ A tuple enumerating EIA 860 tables for which PUDL’s ETL works.
- Type
tuple
-
pudl.constants.
eia923_pudl_tables
= ('generation_fuel_eia923', 'boiler_fuel_eia923', 'generation_eia923', 'coalmine_eia923', 'fuel_receipts_costs_eia923')¶ A tuple containing the EIA923 tables that can be successfully integrated into PUDL.
- Type
tuple
-
pudl.constants.
energy_source_eia923
= {'ANT': 'Anthracite Coal', 'BFG': 'Blast Furnace Gas', 'BIT': 'Bituminous Coal', 'BM': 'Biomass', 'DFO': 'Distillate Fuel Oil. Including diesel, No. 1, No. 2, and No. 4 fuel oils.', 'JF': 'Jet Fuel', 'KER': 'Kerosene', 'LIG': 'Lignite Coal', 'NG': 'Natural Gas', 'OG': 'Other Gas', 'PC': 'Petroleum Coke', 'PG': 'Gaseous Propone', 'RC': 'Refined Coal', 'RFO': 'Residual Fuel Oil. Including No. 5 & 6 fuel oils and bunker C fuel oil.', 'SC': 'Coal-based Synfuel. Including briquettes, pellets, or extrusions, which are formed by binding materials or processes that recycle materials.', 'SG': 'Synthesis Gas from Petroleum Coke', 'SGP': 'Petroleum Coke Derived Synthesis Gas', 'SUB': 'Subbituminous Coal', 'WC': 'Waste/Other Coal. Including anthracite culm, bituminous gob, fine coal, lignite waste, waste coal.', 'WO': 'Waste/Other Oil. Including crude oil, liquid butane, liquid propane, naphtha, oil waste, re-refined moto oil, sludge oil, tar oil, or other petroleum-based liquid wastes.'}¶ A dictionary mapping fuel codes (keys) to fuel descriptions (values) for each fuel receipt from the EIA 923 Fuel Receipts and Costs table.
- Type
dict
-
pudl.constants.
energy_source_eia_simple_map
= {'coal': ['ANT', 'BIT', 'LIG', 'PC', 'SUB', 'WC', 'RC'], 'gas': ['BFG', 'LFG', 'NG', 'OBG', 'OG', 'PG', 'SG', 'SGC', 'SGP'], 'hydro': ['WAT'], 'nuclear': ['NUC'], 'oil': ['DFO', 'JF', 'KER', 'RFO', 'WO'], 'other': ['GEO', 'MWH', 'OTH', 'PUR', 'WH'], 'solar': ['SUN'], 'waste': ['AB', 'BLQ', 'MSW', 'OBL', 'OBS', 'SLW', 'TDF', 'WDL', 'WDS'], 'wind': ['WND']}¶ A dictionary mapping EIA fuel types (keys) to fuel codes (values).
- Type
dict
-
pudl.constants.
entities
= {'boilers': [['plant_id_eia', 'boiler_id'], ['prime_mover_code'], [], {}], 'generators': [['plant_id_eia', 'generator_id'], ['prime_mover_code', 'duct_burners', 'operating_date', 'topping_bottoming_code', 'solid_fuel_gasification', 'pulverized_coal_tech', 'fluidized_bed_tech', 'subcritical_tech', 'supercritical_tech', 'ultrasupercritical_tech', 'stoker_tech', 'other_combustion_tech', 'bypass_heat_recovery', 'rto_iso_lmp_node_id', 'rto_iso_location_wholesale_reporting_id', 'associated_combined_heat_power', 'original_planned_operating_date', 'operating_switch', 'previously_canceled'], ['capacity_mw', 'fuel_type_code_pudl', 'multiple_fuels', 'ownership_code', 'owned_by_non_utility', 'deliver_power_transgrid', 'summer_capacity_mw', 'winter_capacity_mw', 'summer_capacity_estimate', 'winter_capacity_estimate', 'minimum_load_mw', 'distributed_generation', 'technology_description', 'reactive_power_output_mvar', 'energy_source_code_1', 'energy_source_code_2', 'energy_source_code_3', 'energy_source_code_4', 'energy_source_code_5', 'energy_source_code_6', 'energy_source_1_transport_1', 'energy_source_1_transport_2', 'energy_source_1_transport_3', 'energy_source_2_transport_1', 'energy_source_2_transport_2', 'energy_source_2_transport_3', 'startup_source_code_1', 'startup_source_code_2', 'startup_source_code_3', 'startup_source_code_4', 'time_cold_shutdown_full_load_code', 'syncronized_transmission_grid', 'turbines_num', 'operational_status_code', 'operational_status', 'planned_modifications', 'planned_net_summer_capacity_uprate_mw', 'planned_net_winter_capacity_uprate_mw', 'planned_new_capacity_mw', 'planned_uprate_date', 'planned_net_summer_capacity_derate_mw', 'planned_net_winter_capacity_derate_mw', 'planned_derate_date', 'planned_new_prime_mover_code', 'planned_energy_source_code_1', 'planned_repower_date', 'other_planned_modifications', 'other_modifications_date', 'planned_retirement_date', 'carbon_capture', 'cofire_fuels', 'switch_oil_gas', 'turbines_inverters_hydrokinetics', 'nameplate_power_factor', 'uprate_derate_during_year', 'uprate_derate_completed_date', 'current_planned_operating_date', 'summer_estimated_capability_mw', 'winter_estimated_capability_mw', 'retirement_date', 'utility_id_eia', 'data_source'], {}], 'plants': [['plant_id_eia'], ['balancing_authority_code_eia', 'balancing_authority_name_eia', 'city', 'county', 'ferc_cogen_status', 'ferc_exempt_wholesale_generator', 'ferc_small_power_producer', 'grid_voltage_2_kv', 'grid_voltage_3_kv', 'grid_voltage_kv', 'iso_rto_code', 'latitude', 'longitude', 'service_area', 'plant_name_eia', 'primary_purpose_naics_id', 'sector_id', 'sector_name', 'state', 'street_address', 'zip_code'], ['ash_impoundment', 'ash_impoundment_lined', 'ash_impoundment_status', 'datum', 'energy_storage', 'ferc_cogen_docket_no', 'water_source', 'ferc_exempt_wholesale_generator_docket_no', 'ferc_small_power_producer_docket_no', 'liquefied_natural_gas_storage', 'natural_gas_local_distribution_company', 'natural_gas_storage', 'natural_gas_pipeline_name_1', 'natural_gas_pipeline_name_2', 'natural_gas_pipeline_name_3', 'nerc_region', 'net_metering', 'pipeline_notes', 'regulatory_status_code', 'transmission_distribution_owner_id', 'transmission_distribution_owner_name', 'transmission_distribution_owner_state', 'utility_id_eia'], {}], 'utilities': [['utility_id_eia'], ['utility_name_eia'], ['street_address', 'city', 'state', 'zip_code', 'entity_type', 'plants_reported_owner', 'plants_reported_operator', 'plants_reported_asset_manager', 
'plants_reported_other_relationship', 'attention_line', 'address_2', 'zip_code_4', 'contact_firstname', 'contact_lastname', 'contact_title', 'contact_firstname_2', 'contact_lastname_2', 'contact_title_2', 'phone_extension_1', 'phone_extension_2', 'phone_number_1', 'phone_number_2'], {'utility_id_eia': 'int64'}]}¶ A dictionary containing table name strings (keys) and lists of columns to keep for those tables (values).
- Type
dict
-
pudl.constants.
entity_tables
= ['utilities_entity_eia', 'plants_entity_eia', 'generators_entity_eia', 'boilers_entity_eia', 'regions_entity_epaipm']¶ A list of PUDL entity tables.
- Type
list
-
pudl.constants.
epacems_tables
= 'hourly_emissions_epacems'¶ The name of the table of EPA CEMS data pulled into PUDL.
- Type
str
-
pudl.constants.
epaipm_pudl_tables
= ('transmission_single_epaipm', 'transmission_joint_epaipm', 'load_curves_epaipm', 'plant_region_map_epaipm')¶ A tuple containing the EPA IPM tables that can be successfully integrated into PUDL.
- Type
tuple
-
pudl.constants.
epaipm_region_aggregations
= {'ISONE': ['NENG_CT', 'NENGREST', 'NENG_ME'], 'MISO': ['MIS_AR', 'MIS_IL', 'MIS_INKY', 'MIS_IA', 'MIS_MIDA', 'MIS_LA', 'MIS_LMI', 'MIS_MNWI', 'MIS_D_MS', 'MIS_MO', 'MIS_MAPP', 'MIS_AMSO', 'MIS_WOTA', 'MIS_WUMS'], 'NYISO': ['NY_Z_A', 'NY_Z_B', 'NY_Z_C&E', 'NY_Z_D', 'NY_Z_F', 'NY_Z_G-I', 'NY_Z_J', 'NY_Z_K'], 'PJM': ['PJM_AP', 'PJM_ATSI', 'PJM_COMD', 'PJM_Dom', 'PJM_EMAC', 'PJM_PENE', 'PJM_SMAC', 'PJM_WMAC'], 'SPP': ['SPP_NEBR', 'SPP_N', 'SPP_SPS', 'SPP_WEST', 'SPP_KIAM', 'SPP_WAUE'], 'WECC_NW': ['WECC_CO', 'WECC_ID', 'WECC_MT', 'WECC_NNV', 'WECC_PNW', 'WECC_UT', 'WECC_WY']}¶ A dictionary containing EPA IPM regions (keys) and lists of their associated abbreviations (values).
- Type
dict
-
pudl.constants.
epaipm_region_names
= ['ERC_PHDL', 'ERC_REST', 'ERC_FRNT', 'ERC_GWAY', 'ERC_WEST', 'FRCC', 'NENG_CT', 'NENGREST', 'NENG_ME', 'MIS_AR', 'MIS_IL', 'MIS_INKY', 'MIS_IA', 'MIS_MIDA', 'MIS_LA', 'MIS_LMI', 'MIS_MNWI', 'MIS_D_MS', 'MIS_MO', 'MIS_MAPP', 'MIS_AMSO', 'MIS_WOTA', 'MIS_WUMS', 'NY_Z_A', 'NY_Z_B', 'NY_Z_C&E', 'NY_Z_D', 'NY_Z_F', 'NY_Z_G-I', 'NY_Z_J', 'NY_Z_K', 'PJM_West', 'PJM_AP', 'PJM_ATSI', 'PJM_COMD', 'PJM_Dom', 'PJM_EMAC', 'PJM_PENE', 'PJM_SMAC', 'PJM_WMAC', 'S_C_KY', 'S_C_TVA', 'S_D_AECI', 'S_SOU', 'S_VACA', 'SPP_NEBR', 'SPP_N', 'SPP_SPS', 'SPP_WEST', 'SPP_KIAM', 'SPP_WAUE', 'WECC_AZ', 'WEC_BANC', 'WECC_CO', 'WECC_ID', 'WECC_IID', 'WEC_LADW', 'WECC_MT', 'WECC_NM', 'WEC_CALN', 'WECC_NNV', 'WECC_PNW', 'WEC_SDGE', 'WECC_SCE', 'WECC_SNV', 'WECC_UT', 'WECC_WY', 'CN_AB', 'CN_BC', 'CN_NL', 'CN_MB', 'CN_NB', 'CN_NF', 'CN_NS', 'CN_ON', 'CN_PE', 'CN_PQ', 'CN_SK']¶ A list of EPA IPM region names.
- Type
list
-
pudl.constants.
epaipm_url_ext
= {'load_curves_epaipm': 'table_2-2_load_duration_curves_used_in_epa_platform_v6.xlsx', 'plant_region_map_epaipm': 'needs_v6_november_2018_reference_case_0.xlsx', 'transmission_single_epaipm': 'table_3-21_annual_transmission_capabilities_of_u.s._model_regions_in_epa_platform_v6_-_2021.xlsx'}¶ A dictionary of EPA IPM tables and associated URLs extensions for downloading that table’s data.
- Type
dict
-
pudl.constants.
ferc1_data_tables
= ('f1_acb_epda', 'f1_accumdepr_prvsn', 'f1_accumdfrrdtaxcr', 'f1_adit_190_detail', 'f1_adit_190_notes', 'f1_adit_amrt_prop', 'f1_adit_other', 'f1_adit_other_prop', 'f1_allowances', 'f1_bal_sheet_cr', 'f1_capital_stock', 'f1_cash_flow', 'f1_cmmn_utlty_p_e', 'f1_comp_balance_db', 'f1_construction', 'f1_control_respdnt', 'f1_co_directors', 'f1_cptl_stk_expns', 'f1_csscslc_pcsircs', 'f1_dacs_epda', 'f1_dscnt_cptl_stk', 'f1_edcfu_epda', 'f1_elctrc_erg_acct', 'f1_elctrc_oper_rev', 'f1_elc_oper_rev_nb', 'f1_elc_op_mnt_expn', 'f1_electric', 'f1_envrnmntl_expns', 'f1_envrnmntl_fclty', 'f1_fuel', 'f1_general_info', 'f1_gnrt_plant', 'f1_important_chg', 'f1_incm_stmnt_2', 'f1_income_stmnt', 'f1_miscgen_expnelc', 'f1_misc_dfrrd_dr', 'f1_mthly_peak_otpt', 'f1_mtrl_spply', 'f1_nbr_elc_deptemp', 'f1_nonutility_prop', 'f1_note_fin_stmnt', 'f1_nuclear_fuel', 'f1_officers_co', 'f1_othr_dfrrd_cr', 'f1_othr_pd_in_cptl', 'f1_othr_reg_assets', 'f1_othr_reg_liab', 'f1_overhead', 'f1_pccidica', 'f1_plant_in_srvce', 'f1_pumped_storage', 'f1_purchased_pwr', 'f1_reconrpt_netinc', 'f1_reg_comm_expn', 'f1_respdnt_control', 'f1_retained_erng', 'f1_r_d_demo_actvty', 'f1_sales_by_sched', 'f1_sale_for_resale', 'f1_sbsdry_totals', 'f1_schedules_list', 'f1_security_holder', 'f1_slry_wg_dstrbtn', 'f1_substations', 'f1_taxacc_ppchrgyr', 'f1_unrcvrd_cost', 'f1_utltyplnt_smmry', 'f1_work', 'f1_xmssn_adds', 'f1_xmssn_elc_bothr', 'f1_xmssn_elc_fothr', 'f1_xmssn_line', 'f1_xtraordnry_loss', 'f1_hydro', 'f1_steam', 'f1_leased', 'f1_sbsdry_detail', 'f1_plant', 'f1_long_term_debt', 'f1_106_2009', 'f1_106a_2009', 'f1_106b_2009', 'f1_208_elc_dep', 'f1_231_trn_stdycst', 'f1_324_elc_expns', 'f1_325_elc_cust', 'f1_331_transiso', 'f1_338_dep_depl', 'f1_397_isorto_stl', 'f1_398_ancl_ps', 'f1_399_mth_peak', 'f1_400_sys_peak', 'f1_400a_iso_peak', 'f1_429_trans_aff', 'f1_allowances_nox', 'f1_cmpinc_hedge_a', 'f1_cmpinc_hedge', 'f1_rg_trn_srv_rev')¶ A tuple containing the FERC Form 1 tables that have the same composite primary keys: [respondent_id, report_year, report_prd, row_number, spplmnt_num].
- Type
tuple
-
pudl.constants.
ferc1_dbf2tbl
= {'F1_1': 'f1_respondent_id', 'F1_10': 'f1_allowances', 'F1_106A_2009': 'f1_106a_2009', 'F1_106B_2009': 'f1_106b_2009', 'F1_106_2009': 'f1_106_2009', 'F1_11': 'f1_bal_sheet_cr', 'F1_12': 'f1_capital_stock', 'F1_13': 'f1_cash_flow', 'F1_14': 'f1_cmmn_utlty_p_e', 'F1_15': 'f1_comp_balance_db', 'F1_16': 'f1_construction', 'F1_17': 'f1_control_respdnt', 'F1_18': 'f1_co_directors', 'F1_19': 'f1_cptl_stk_expns', 'F1_2': 'f1_acb_epda', 'F1_20': 'f1_csscslc_pcsircs', 'F1_208_ELC_DEP': 'f1_208_elc_dep', 'F1_21': 'f1_dacs_epda', 'F1_22': 'f1_dscnt_cptl_stk', 'F1_23': 'f1_edcfu_epda', 'F1_231_TRN_STDYCST': 'f1_231_trn_stdycst', 'F1_24': 'f1_elctrc_erg_acct', 'F1_25': 'f1_elctrc_oper_rev', 'F1_26': 'f1_elc_oper_rev_nb', 'F1_27': 'f1_elc_op_mnt_expn', 'F1_28': 'f1_electric', 'F1_29': 'f1_envrnmntl_expns', 'F1_3': 'f1_accumdepr_prvsn', 'F1_30': 'f1_envrnmntl_fclty', 'F1_31': 'f1_fuel', 'F1_32': 'f1_general_info', 'F1_324_ELC_EXPNS': 'f1_324_elc_expns', 'F1_325_ELC_CUST': 'f1_325_elc_cust', 'F1_33': 'f1_gnrt_plant', 'F1_331_TRANSISO': 'f1_331_transiso', 'F1_338_DEP_DEPL': 'f1_338_dep_depl', 'F1_34': 'f1_important_chg', 'F1_35': 'f1_incm_stmnt_2', 'F1_36': 'f1_income_stmnt', 'F1_37': 'f1_miscgen_expnelc', 'F1_38': 'f1_misc_dfrrd_dr', 'F1_39': 'f1_mthly_peak_otpt', 'F1_397_ISORTO_STL': 'f1_397_isorto_stl', 'F1_398_ANCL_PS': 'f1_398_ancl_ps', 'F1_399_MTH_PEAK': 'f1_399_mth_peak', 'F1_4': 'f1_accumdfrrdtaxcr', 'F1_40': 'f1_mtrl_spply', 'F1_400A_ISO_PEAK': 'f1_400a_iso_peak', 'F1_400_SYS_PEAK': 'f1_400_sys_peak', 'F1_41': 'f1_nbr_elc_deptemp', 'F1_42': 'f1_nonutility_prop', 'F1_429_TRANS_AFF': 'f1_429_trans_aff', 'F1_43': 'f1_note_fin_stmnt', 'F1_44': 'f1_nuclear_fuel', 'F1_45': 'f1_officers_co', 'F1_46': 'f1_othr_dfrrd_cr', 'F1_47': 'f1_othr_pd_in_cptl', 'F1_48': 'f1_othr_reg_assets', 'F1_49': 'f1_othr_reg_liab', 'F1_5': 'f1_adit_190_detail', 'F1_50': 'f1_overhead', 'F1_51': 'f1_pccidica', 'F1_52': 'f1_plant_in_srvce', 'F1_53': 'f1_pumped_storage', 'F1_54': 'f1_purchased_pwr', 'F1_55': 'f1_reconrpt_netinc', 'F1_56': 'f1_reg_comm_expn', 'F1_57': 'f1_respdnt_control', 'F1_58': 'f1_retained_erng', 'F1_59': 'f1_r_d_demo_actvty', 'F1_6': 'f1_adit_190_notes', 'F1_60': 'f1_sales_by_sched', 'F1_61': 'f1_sale_for_resale', 'F1_62': 'f1_sbsdry_totals', 'F1_63': 'f1_schedules_list', 'F1_64': 'f1_security_holder', 'F1_65': 'f1_slry_wg_dstrbtn', 'F1_66': 'f1_substations', 'F1_67': 'f1_taxacc_ppchrgyr', 'F1_68': 'f1_unrcvrd_cost', 'F1_69': 'f1_utltyplnt_smmry', 'F1_7': 'f1_adit_amrt_prop', 'F1_70': 'f1_work', 'F1_71': 'f1_xmssn_adds', 'F1_72': 'f1_xmssn_elc_bothr', 'F1_73': 'f1_xmssn_elc_fothr', 'F1_74': 'f1_xmssn_line', 'F1_75': 'f1_xtraordnry_loss', 'F1_76': 'f1_codes_val', 'F1_77': 'f1_sched_lit_tbl', 'F1_78': 'f1_audit_log', 'F1_79': 'f1_col_lit_tbl', 'F1_8': 'f1_adit_other', 'F1_80': 'f1_load_file_names', 'F1_81': 'f1_privilege', 'F1_82': 'f1_sys_error_log', 'F1_83': 'f1_unique_num_val', 'F1_84': 'f1_row_lit_tbl', 'F1_85': 'f1_footnote_data', 'F1_86': 'f1_hydro', 'F1_87': 'f1_footnote_tbl', 'F1_88': 'f1_ident_attsttn', 'F1_89': 'f1_steam', 'F1_9': 'f1_adit_other_prop', 'F1_90': 'f1_leased', 'F1_91': 'f1_sbsdry_detail', 'F1_92': 'f1_plant', 'F1_93': 'f1_long_term_debt', 'F1_ALLOWANCES_NOX': 'f1_allowances_nox', 'F1_CMPINC_HEDGE': 'f1_cmpinc_hedge', 'F1_CMPINC_HEDGE_A': 'f1_cmpinc_hedge_a', 'F1_EMAIL': 'f1_email', 'F1_RG_TRN_SRV_REV': 'f1_rg_trn_srv_rev', 'F1_S0_CHECKS': 'f1_s0_checks', 'F1_S0_FILING_LOG': 'f1_s0_filing_log', 'F1_SECURITY': 'f1_security'}¶ A dictionary mapping FERC Form 1 DBF files(w / o .DBF file 
extension) (keys) to database table names (values).
- Type
dict
-
pudl.constants.
ferc1_huge_tables
= {'f1_footnote_data', 'f1_footnote_tbl', 'f1_note_fin_stmnt'}¶ A set containing large FERC Form 1 tables.
- Type
set
-
pudl.constants.
ferc1_power_purchase_type
= {'AD': 'adjustment', 'EX': 'electricity_exchange', 'IF': 'intermediate_firm', 'IU': 'intermediate_unit', 'LF': 'long_firm', 'LU': 'long_unit', 'OS': 'other_service', 'RQ': 'requirement', 'SF': 'short_firm'}¶ A dictionary of abbreviations (keys) and types (values) for power purchase agreements from FERC Form 1.
- Type
dict
-
pudl.constants.
ferc1_pudl_tables
= ('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1')¶ A tuple containing the FERC Form 1 tables that can be successfully integrated into PUDL.
- Type
tuple
-
pudl.constants.
ferc1_tbl2dbf
= {'f1_106_2009': 'F1_106_2009', 'f1_106a_2009': 'F1_106A_2009', 'f1_106b_2009': 'F1_106B_2009', 'f1_208_elc_dep': 'F1_208_ELC_DEP', 'f1_231_trn_stdycst': 'F1_231_TRN_STDYCST', 'f1_324_elc_expns': 'F1_324_ELC_EXPNS', 'f1_325_elc_cust': 'F1_325_ELC_CUST', 'f1_331_transiso': 'F1_331_TRANSISO', 'f1_338_dep_depl': 'F1_338_DEP_DEPL', 'f1_397_isorto_stl': 'F1_397_ISORTO_STL', 'f1_398_ancl_ps': 'F1_398_ANCL_PS', 'f1_399_mth_peak': 'F1_399_MTH_PEAK', 'f1_400_sys_peak': 'F1_400_SYS_PEAK', 'f1_400a_iso_peak': 'F1_400A_ISO_PEAK', 'f1_429_trans_aff': 'F1_429_TRANS_AFF', 'f1_acb_epda': 'F1_2', 'f1_accumdepr_prvsn': 'F1_3', 'f1_accumdfrrdtaxcr': 'F1_4', 'f1_adit_190_detail': 'F1_5', 'f1_adit_190_notes': 'F1_6', 'f1_adit_amrt_prop': 'F1_7', 'f1_adit_other': 'F1_8', 'f1_adit_other_prop': 'F1_9', 'f1_allowances': 'F1_10', 'f1_allowances_nox': 'F1_ALLOWANCES_NOX', 'f1_audit_log': 'F1_78', 'f1_bal_sheet_cr': 'F1_11', 'f1_capital_stock': 'F1_12', 'f1_cash_flow': 'F1_13', 'f1_cmmn_utlty_p_e': 'F1_14', 'f1_cmpinc_hedge': 'F1_CMPINC_HEDGE', 'f1_cmpinc_hedge_a': 'F1_CMPINC_HEDGE_A', 'f1_co_directors': 'F1_18', 'f1_codes_val': 'F1_76', 'f1_col_lit_tbl': 'F1_79', 'f1_comp_balance_db': 'F1_15', 'f1_construction': 'F1_16', 'f1_control_respdnt': 'F1_17', 'f1_cptl_stk_expns': 'F1_19', 'f1_csscslc_pcsircs': 'F1_20', 'f1_dacs_epda': 'F1_21', 'f1_dscnt_cptl_stk': 'F1_22', 'f1_edcfu_epda': 'F1_23', 'f1_elc_op_mnt_expn': 'F1_27', 'f1_elc_oper_rev_nb': 'F1_26', 'f1_elctrc_erg_acct': 'F1_24', 'f1_elctrc_oper_rev': 'F1_25', 'f1_electric': 'F1_28', 'f1_email': 'F1_EMAIL', 'f1_envrnmntl_expns': 'F1_29', 'f1_envrnmntl_fclty': 'F1_30', 'f1_footnote_data': 'F1_85', 'f1_footnote_tbl': 'F1_87', 'f1_fuel': 'F1_31', 'f1_general_info': 'F1_32', 'f1_gnrt_plant': 'F1_33', 'f1_hydro': 'F1_86', 'f1_ident_attsttn': 'F1_88', 'f1_important_chg': 'F1_34', 'f1_incm_stmnt_2': 'F1_35', 'f1_income_stmnt': 'F1_36', 'f1_leased': 'F1_90', 'f1_load_file_names': 'F1_80', 'f1_long_term_debt': 'F1_93', 'f1_misc_dfrrd_dr': 'F1_38', 'f1_miscgen_expnelc': 'F1_37', 'f1_mthly_peak_otpt': 'F1_39', 'f1_mtrl_spply': 'F1_40', 'f1_nbr_elc_deptemp': 'F1_41', 'f1_nonutility_prop': 'F1_42', 'f1_note_fin_stmnt': 'F1_43', 'f1_nuclear_fuel': 'F1_44', 'f1_officers_co': 'F1_45', 'f1_othr_dfrrd_cr': 'F1_46', 'f1_othr_pd_in_cptl': 'F1_47', 'f1_othr_reg_assets': 'F1_48', 'f1_othr_reg_liab': 'F1_49', 'f1_overhead': 'F1_50', 'f1_pccidica': 'F1_51', 'f1_plant': 'F1_92', 'f1_plant_in_srvce': 'F1_52', 'f1_privilege': 'F1_81', 'f1_pumped_storage': 'F1_53', 'f1_purchased_pwr': 'F1_54', 'f1_r_d_demo_actvty': 'F1_59', 'f1_reconrpt_netinc': 'F1_55', 'f1_reg_comm_expn': 'F1_56', 'f1_respdnt_control': 'F1_57', 'f1_respondent_id': 'F1_1', 'f1_retained_erng': 'F1_58', 'f1_rg_trn_srv_rev': 'F1_RG_TRN_SRV_REV', 'f1_row_lit_tbl': 'F1_84', 'f1_s0_checks': 'F1_S0_CHECKS', 'f1_s0_filing_log': 'F1_S0_FILING_LOG', 'f1_sale_for_resale': 'F1_61', 'f1_sales_by_sched': 'F1_60', 'f1_sbsdry_detail': 'F1_91', 'f1_sbsdry_totals': 'F1_62', 'f1_sched_lit_tbl': 'F1_77', 'f1_schedules_list': 'F1_63', 'f1_security': 'F1_SECURITY', 'f1_security_holder': 'F1_64', 'f1_slry_wg_dstrbtn': 'F1_65', 'f1_steam': 'F1_89', 'f1_substations': 'F1_66', 'f1_sys_error_log': 'F1_82', 'f1_taxacc_ppchrgyr': 'F1_67', 'f1_unique_num_val': 'F1_83', 'f1_unrcvrd_cost': 'F1_68', 'f1_utltyplnt_smmry': 'F1_69', 'f1_work': 'F1_70', 'f1_xmssn_adds': 'F1_71', 'f1_xmssn_elc_bothr': 'F1_72', 'f1_xmssn_elc_fothr': 'F1_73', 'f1_xmssn_line': 'F1_74', 'f1_xtraordnry_loss': 'F1_75'}¶ A dictionary mapping database table names (keys) to FERC Form 1 
DBF files (without the .DBF file extension) (values).
- Type
dict
-
pudl.constants.
ferc_accumulated_depreciation
= row_number ... ferc_account_description 0 1 ... Balance Beginning of Year 1 3 ... (403) Depreciation Expense 2 4 ... (403.1) Depreciation Expense for Asset Retirem... 3 5 ... (413) Exp. of Elec. Plt. Leas. to Others 4 6 ... Transportation Expenses-Clearing 5 7 ... Other Clearing Accounts 6 8 ... Other Accounts (Specify, details in footnote): 7 9 ... Other Charges: 8 10 ... TOTAL Deprec. Prov for Year (Enter Total of li... 9 11 ... Net Charges for Plant Retired: 10 12 ... Book Cost of Plant Retired 11 13 ... Cost of Removal 12 14 ... Salvage (Credit) 13 15 ... TOTAL Net Chrgs. for Plant Ret. (Enter Total o... 14 16 ... Other Debit or Cr. Items (Describe, details in... 15 17 ... Other Charges 2 16 18 ... Book Cost or Asset Retirement Costs Retired 17 19 ... Balance End of Year (Enter Totals of lines 1, ... 18 20 ... Steam Production 19 21 ... Nuclear Production 20 22 ... Hydraulic Production-Conventional 21 23 ... Hydraulic Production-Pumped Storage 22 24 ... Other Production 23 25 ... Transmission 24 26 ... Distribution 25 27 ... Regional Transmission and Market Operation 26 28 ... General 27 29 ... TOTAL (Enter Total of lines 20 thru 28) [28 rows x 3 columns]¶ A list of tuples containing row numbers, FERC account IDs, and FERC account descriptions from FERC Form 1 page 219, Accumulated Provision for Depreciation of electric utility plant(Account 108).
- Type
pandas.DataFrame
-
pudl.constants.
ferc_electric_plant_accounts
= row_number ... ferc_account_description 0 2.0 ... Intangible: Organization 1 3.0 ... Intangible: Franchises and consents 2 4.0 ... Intangible: Miscellaneous intangible plant 3 5.0 ... Subtotal: Intangible Plant 4 8.0 ... Steam production: Land and land rights .. ... ... ... 92 100.0 ... Electric plant in service (Major only) 93 101.0 ... Electric plant purchased 94 102.0 ... Electric plant sold 95 103.0 ... Experimental plant unclassified 96 104.0 ... TOTAL Electric Plant in Service [97 rows x 3 columns]¶ A list of tuples containing row numbers, FERC account IDs, and FERC account descriptions from FERC Form 1 pages 204 - 207, Electric Plant in Service.
- Type
pandas.DataFrame
-
pudl.constants.
files_dict_epaipm
= {'load_curves_epaipm': '*table_2-2_*', 'plant_region_map_epaipm': '*needs_v6*', 'transmission_joint_epaipm': '*transmission_joint_ipm*', 'transmission_single_epaipm': '*table_3-21*'}¶ A dictionary mapping EPA IPM table names (keys) to glob patterns matching the filenames that contain those tables (values).
- Type
dict
-
pudl.constants.
fuel_group_eia923
= ('coal', 'natural_gas', 'petroleum', 'petroleum_coke', 'other_gas')¶ A tuple containing EIA 923 fuel groups.
- Type
tuple
-
pudl.constants.
fuel_group_eia923_simple_map
= {'coal': ['coal', 'petroleum coke'], 'gas': ['natural gas', 'other gas'], 'oil': ['petroleum']}¶ A dictionary mapping simplified EIA 923 fuel types (“coal”, “gas”, “oil”) (keys) to lists of EIA 923 fuel groups (values).
- Type
dict
-
pudl.constants.
fuel_type_aer_eia923
= {'COL': 'Coal', 'DFO': 'Distillate Petroleum', 'GEO': 'Geothermal', 'HPS': 'Hydroelectric Pumped Storage', 'HYC': 'Hydroelectric Conventional', 'MLG': 'Biogenic Municipal Solid Waste and Landfill Gas', 'NG': 'Natural Gas', 'NUC': 'Nuclear', 'OOG': 'Other Gases', 'ORW': 'Other Renewables', 'OTH': 'Other (including nonbiogenic MSW)', 'PC': 'Petroleum Coke', 'RFO': 'Residual Petroleum', 'SUN': 'Solar PV and thermal', 'WND': 'Wind', 'WOC': 'Waste Coal', 'WOO': 'Waste Oil', 'WWW': 'Wood and Wood Waste'}¶ A dictionary mapping EIA 923 AER fuel type codes (keys) to descriptions of those fuel types (values).
- Type
dict
-
pudl.constants.
fuel_type_eia860_coal_strings
= ['ant', 'bit', 'cbl', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc', 'coal', 'petroleum coke', 'col', 'woc']¶ A list of strings from EIA 860 associated with fuel type coal.
- Type
list
-
pudl.constants.
fuel_type_eia860_gas_strings
= ['bfg', 'lfg', 'mlg', 'ng', 'obg', 'og', 'pg', 'sgc', 'sgp', 'natural gas', 'other gas', 'oog', 'sg']¶ A list of strings from EIA 860 associated with fuel type gas.
- Type
list
-
pudl.constants.
fuel_type_eia860_hydro_strings
= ['wat', 'hyc', 'hps', 'hydro']¶ A list of strings from EIA 860 associated with hydro power.
- Type
list
-
pudl.constants.
fuel_type_eia860_nuclear_strings
= ['nuc', 'nuclear']¶ A list of strings from EIA 860 associated with nuclear power.
- Type
list
-
pudl.constants.
fuel_type_eia860_oil_strings
= ['dfo', 'jf', 'ker', 'rfo', 'wo', 'woo', 'petroleum']¶ A list of strings from EIA 860 associated with fuel type oil.
- Type
list
-
pudl.constants.
fuel_type_eia860_other_strings
= ['mwh', 'oth', 'pur', 'wh', 'geo', 'none', 'orw', 'other']¶ A list of strings from EIA 860 associated with fuel type other.
- Type
list
-
pudl.constants.
fuel_type_eia860_simple_map
= {'coal': ['ant', 'bit', 'cbl', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc', 'coal', 'petroleum coke', 'col', 'woc'], 'gas': ['bfg', 'lfg', 'mlg', 'ng', 'obg', 'og', 'pg', 'sgc', 'sgp', 'natural gas', 'other gas', 'oog', 'sg'], 'hydro': ['wat', 'hyc', 'hps', 'hydro'], 'nuclear': ['nuc', 'nuclear'], 'oil': ['dfo', 'jf', 'ker', 'rfo', 'wo', 'woo', 'petroleum'], 'other': ['mwh', 'oth', 'pur', 'wh', 'geo', 'none', 'orw', 'other'], 'solar': ['sun', 'solar'], 'waste': ['ab', 'blq', 'bm', 'msb', 'msn', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds', 'biomass', 'msw', 'www'], 'wind': ['wnd', 'wind', 'wt']}¶ A dictionary mapping EIA 860 fuel types (keys) to lists of strings associated with that fuel type (values).
- Type
dict
-
pudl.constants.
fuel_type_eia860_solar_strings
= ['sun', 'solar']¶ A list of strings from EIA 860 associated with solar power.
- Type
list
-
pudl.constants.
fuel_type_eia860_waste_strings
= ['ab', 'blq', 'bm', 'msb', 'msn', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds', 'biomass', 'msw', 'www']¶ A list of strings from EIA 860 associated with fuel type waste.
- Type
list
-
pudl.constants.
fuel_type_eia860_wind_strings
= ['wnd', 'wind', 'wt']¶ A list of strings from EIA 860 associated with wind power.
- Type
list
-
pudl.constants.
fuel_type_eia923
= {'AB': 'Agricultural By-Products', 'ANT': 'Anthracite Coal', 'BFG': 'Blast Furnace Gas', 'BIT': 'Bituminous Coal', 'BLQ': 'Black Liquor', 'CBL': 'Coal, Blended', 'DFO': 'Distillate Fuel Oil. Including diesel, No. 1, No. 2, and No. 4 fuel oils.', 'GEO': 'Geothermal', 'JF': 'Jet Fuel', 'KER': 'Kerosene', 'LFG': 'Landfill Gas', 'LIG': 'Lignite Coal', 'MSB': 'Biogenic Municipal Solid Waste', 'MSN': 'Non-biogenic Municipal Solid Waste', 'MSW': 'Municipal Solid Waste', 'MWH': 'Electricity used for energy storage', 'NG': 'Natural Gas', 'NUC': 'Nuclear. Including Uranium, Plutonium, and Thorium.', 'OBG': 'Other Biomass Gas. Including digester gas, methane, and other biomass gases.', 'OBL': 'Other Biomass Liquids', 'OBS': 'Other Biomass Solids', 'OG': 'Other Gas', 'OTH': 'Other Fuel', 'PC': 'Petroleum Coke', 'PG': 'Gaseous Propane', 'PUR': 'Purchased Steam', 'RC': 'Refined Coal', 'RFO': 'Residual Fuel Oil. Including No. 5 & 6 fuel oils and bunker C fuel oil.', 'SC': 'Coal-based Synfuel. Including briquettes, pellets, or extrusions, which are formed by binding materials or processes that recycle materials.', 'SGC': 'Coal-Derived Synthesis Gas', 'SGP': 'Synthesis Gas from Petroleum Coke', 'SLW': 'Sludge Waste', 'SUB': 'Subbituminous Coal', 'SUN': 'Solar', 'TDF': 'Tire-derived Fuels', 'WAT': 'Water at a Conventional Hydroelectric Turbine and water used in Wave Buoy Hydrokinetic Technology, current Hydrokinetic Technology, Tidal Hydrokinetic Technology, and Pumping Energy for Reversible (Pumped Storage) Hydroelectric Turbines.', 'WC': 'Waste/Other Coal. Including anthracite culm, bituminous gob, fine coal, lignite waste, waste coal.', 'WDL': 'Wood Waste Liquids, excluding Black Liquor. Including red liquor, sludge wood, spent sulfite liquor, and other wood-based liquids.', 'WDS': 'Wood/Wood Waste Solids. Including paper pellets, railroad ties, utility polies, wood chips, bark, and other wood waste solids.', 'WH': 'Waste Heat not directly attributed to a fuel source', 'WND': 'Wind', 'WO': 'Waste/Other Oil. Including crude oil, liquid butane, liquid propane, naphtha, oil waste, re-refined moto oil, sludge oil, tar oil, or other petroleum-based liquid wastes.'}¶ A dictionary mapping EIA 923 fuel type codes (keys) and fuel type names / descriptions (values).
- Type
dict
-
pudl.constants.
fuel_type_eia923_boiler_fuel_coal_strings
= ['ant', 'bit', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc']¶ A list of strings from EIA 923 Boiler Fuel associated with fuel type coal.
- Type
list
-
pudl.constants.
fuel_type_eia923_boiler_fuel_gas_strings
= ['bfg', 'lfg', 'ng', 'og', 'obg', 'pg', 'sgc', 'sgp']¶ A list of strings from EIA 923 Boiler Fuel associated with fuel type gas.
- Type
list
-
pudl.constants.
fuel_type_eia923_boiler_fuel_oil_strings
= ['dfo', 'rfo', 'wo', 'jf', 'ker']¶ A list of strings from EIA 923 Boiler Fuel associated with fuel type oil.
- Type
list
-
pudl.constants.
fuel_type_eia923_boiler_fuel_other_strings
= ['oth', 'pur', 'wh']¶ A list of strings from EIA 923 Boiler Fuel associated with fuel type other.
- Type
list
-
pudl.constants.
fuel_type_eia923_boiler_fuel_simple_map
= {'coal': ['ant', 'bit', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc'], 'gas': ['bfg', 'lfg', 'ng', 'og', 'obg', 'pg', 'sgc', 'sgp'], 'oil': ['dfo', 'rfo', 'wo', 'jf', 'ker'], 'other': ['oth', 'pur', 'wh'], 'waste': ['ab', 'blq', 'msb', 'msn', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds']}¶ A dictionary mapping EIA 923 Boiler Fuel fuel types (keys) to lists of strings associated with that fuel type (values).
- Type
dict
-
pudl.constants.
fuel_type_eia923_boiler_fuel_waste_strings
= ['ab', 'blq', 'msb', 'msn', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds']¶ A list of strings from EIA 923 Boiler Fuel associated with fuel type waste.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_coal_strings
= ['ant', 'bit', 'cbl', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc']¶ The list of EIA 923 Generation Fuel strings associated with coal fuel.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_gas_strings
= ['bfg', 'lfg', 'ng', 'og', 'obg', 'pg', 'sgc', 'sgp']¶ The list of EIA 923 Generation Fuel strings associated with gas fuel.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_hydro_strings
= ['wat']¶ The list of EIA 923 Generation Fuel strings associated with hydro power.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_nuclear_strings
= ['nuc']¶ The list of EIA 923 Generation Fuel strings associated with nuclear power.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_oil_strings
= ['dfo', 'rfo', 'wo', 'jf', 'ker']¶ The list of EIA 923 Generation Fuel strings associated with oil fuel.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_other_strings
= ['geo', 'mwh', 'oth', 'pur', 'wh']¶ The list of EIA 923 Generation Fuel strings associated with fuel type other.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_simple_map
= {'coal': ['ant', 'bit', 'cbl', 'lig', 'pc', 'rc', 'sc', 'sub', 'wc'], 'gas': ['bfg', 'lfg', 'ng', 'og', 'obg', 'pg', 'sgc', 'sgp'], 'hydro': ['wat'], 'nuclear': ['nuc'], 'oil': ['dfo', 'rfo', 'wo', 'jf', 'ker'], 'other': ['geo', 'mwh', 'oth', 'pur', 'wh'], 'solar': ['sun'], 'waste': ['ab', 'blq', 'msb', 'msn', 'msw', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds'], 'wind': ['wnd']}¶ A dictionary mapping EIA 923 Generation Fuel fuel types (keys) to lists of strings associated with that fuel type (values).
- Type
dict
-
pudl.constants.
fuel_type_eia923_gen_fuel_solar_strings
= ['sun']¶ The list of EIA 923 Generation Fuel strings associated with solar power.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_waste_strings
= ['ab', 'blq', 'msb', 'msn', 'msw', 'obl', 'obs', 'slw', 'tdf', 'wdl', 'wds']¶ The list of EIA 923 Generation Fuel strings associated with solid waste fuel.
- Type
list
-
pudl.constants.
fuel_type_eia923_gen_fuel_wind_strings
= ['wnd']¶ The list of EIA 923 Generation Fuel strings associated with wind power.
- Type
list
-
pudl.constants.
fuel_units_eia923
= {'barrels': 'Barrels (for liquids)', 'mcf': 'Thousands of cubic feet (for gases)', 'short_tons': 'Short tons (for solids)'}¶ A dictionary mapping EIA 923 fuel units (keys) to fuel unit descriptions (values).
- Type
dict
-
pudl.constants.
glue_pudl_tables
= ('plants_eia', 'plants_ferc', 'plants', 'utilities_eia', 'utilities_ferc', 'utilities', 'utility_plant_assn')¶ A tuple containing the glue tables that link plants and utilities across the FERC Form 1 and EIA datasets in PUDL.
- Type
tuple
-
pudl.constants.
keywords_by_data_source
= {'eia860': ['electricity', 'electric', 'boiler', 'generator', 'plant', 'utility', 'fuel', 'coal', 'natural gas', 'prime mover', 'eia860', 'retirement', 'capacity', 'planned', 'proposed', 'energy', 'hydro', 'solar', 'wind', 'nuclear', 'form 860', 'eia', 'annual', 'gas', 'ownership', 'steam', 'turbine', 'combustion', 'combined cycle', 'eia', 'energy information administration'], 'eia923': ['fuel', 'boiler', 'generator', 'plant', 'utility', 'cost', 'price', 'natural gas', 'coal', 'eia923', 'energy', 'electricity', 'form 923', 'receipts', 'generation', 'net generation', 'monthly', 'annual', 'gas', 'fuel consumption', 'MWh', 'energy information administration', 'eia', 'mercury', 'sulfur', 'ash', 'lignite', 'bituminous', 'subbituminous', 'heat content'], 'epacems': ['epa', 'us', 'emissions', 'pollution', 'ghg', 'so2', 'co2', 'sox', 'nox', 'load', 'utility', 'electricity', 'plant', 'generator', 'unit', 'generation', 'capacity', 'output', 'power', 'heat content', 'mmbtu', 'steam', 'cems', 'continuous emissions monitoring system', 'hourlyenvironmental protection agency', 'ampd', 'air markets program data'], 'epaipm': ['epaipm', 'integrated planning'], 'ferc1': ['electricity', 'electric', 'utility', 'plant', 'steam', 'generation', 'cost', 'expense', 'price', 'heat content', 'ferc', 'form 1', 'federal energy regulatory commission', 'capital', 'accounting', 'depreciation', 'finance', 'plant in service', 'hydro', 'coal', 'natural gas', 'gas', 'opex', 'capex', 'accounts', 'investment', 'capacity'], 'ferc714': ['electricity', 'electric', 'utility', 'planning area', 'form 714', 'balancing authority', 'demand', 'system lambda', 'ferc', 'federal energy regulatory commission', 'hourly', 'generation', 'interchange', 'forecast', 'load', 'adjacency', 'plants'], 'pudl': ['us', 'electricity']}¶ A dictionary of datasets (keys) and keywords (values).
- Type
dict
-
pudl.constants.
licenses
= {'cc-by-4.0': {'name': 'CC-BY-4.0', 'path': 'https://creativecommons.org/licenses/by/4.0/', 'title': 'Creative Commons Attribution 4.0'}, 'us-govt': {'name': 'other-pd', 'path': 'http://www.usa.gov/publicdomain/label/1.0/', 'title': 'U.S. Government Work'}}¶ A dictionary of dictionaries containing license types and their attributes.
- Type
dict
-
pudl.constants.
need_fix_inting
= {'hourly_emissions_epacems': ('facility_id', 'unit_id_epa'), 'plants_hydro_ferc1': ('construction_year', 'installation_year'), 'plants_pumped_storage_ferc1': ('construction_year', 'installation_year'), 'plants_small_ferc1': ('construction_year', 'ferc_license_id'), 'plants_steam_ferc1': ('construction_year', 'installation_year')}¶ A dictionary mapping table names (keys) to tuples of column names (values) for integer-type columns whose null values need fixing.
- Type
dict
-
pudl.constants.
nerc_region
= {'ASCC': 'Alaska Systems Coordinating Council', 'FRCC': 'Florida Reliability Coordinating Council', 'HICC': 'Hawaiian Islands Coordinating Council', 'MRO': 'Midwest Reliability Organization', 'NPCC': 'Northeast Power Coordinating Council', 'RFC': 'Reliability First Corporation', 'SERC': 'SERC Reliability Corporation', 'SPP': 'Southwest Power Pool', 'TRE': 'Texas Regional Entity', 'WECC': 'Western Electricity Coordinating Council'}¶ A dictionary mapping NERC Region abbreviations (keys) to NERC Region names (values).
- Type
dict
-
pudl.constants.
output_formats
= ['sqlite', 'parquet', 'datapkg']¶ A list of types of PUDL output formats.
- Type
list
-
pudl.constants.
prime_movers
= ['steam_turbine', 'gas_turbine', 'hydro', 'internal_combustion', 'solar_pv', 'wind_turbine']¶ A list of the types of prime movers.
- Type
list
-
pudl.constants.
prime_movers_eia923
= {'BA': 'Energy Storage, Battery', 'BT': 'Turbines Used in a Binary Cycle. Including those used for geothermal applications', 'CA': 'Combined-Cycle -- Steam Part', 'CC': 'Combined-Cycle, Total Unit', 'CE': 'Energy Storage, Compressed Air', 'CP': 'Energy Storage, Concentrated Solar Power', 'CS': 'Combined-Cycle Single-Shaft Combustion Turbine and Steam Turbine share of single', 'CT': 'Combined-Cycle Combustion Turbine Part', 'ES': 'Energy Storage, Other (Specify on Schedule 9, Comments)', 'FC': 'Fuel Cell', 'FW': 'Energy Storage, Flywheel', 'GT': 'Combustion (Gas) Turbine. Including Jet Engine design', 'HA': 'Hydrokinetic, Axial Flow Turbine', 'HB': 'Hydrokinetic, Wave Buoy', 'HK': 'Hydrokinetic, Other', 'HY': 'Hydraulic Turbine. Including turbines associated with delivery of water by pipeline.', 'IC': 'Internal Combustion (diesel, piston, reciprocating) Engine', 'OT': 'Other', 'PS': 'Energy Storage, Reversible Hydraulic Turbine (Pumped Storage)', 'PV': 'Photovoltaic', 'ST': 'Steam Turbine. Including Nuclear, Geothermal, and Solar Steam (does not include Combined Cycle).', 'WS': 'Wind Turbine, Offshore', 'WT': 'Wind Turbine, Onshore'}¶ A dictionary mapping EIA 923 prime mover codes (keys) and prime mover names / descriptions (values).
- Type
dict
-
pudl.constants.
pudl_tables
= {'eia860': ('boiler_generator_assn_eia860', 'utilities_eia860', 'plants_eia860', 'generators_eia860', 'ownership_eia860'), 'eia861': ('service_territory_eia861', 'balancing_authority_eia861', 'sales_eia861', 'advanced_metering_infrastructure_eia861', 'demand_response_eia861', 'demand_side_management_eia861', 'distributed_generation_eia861', 'distribution_systems_eia861', 'dynamic_pricing_eia861', 'energy_efficiency_eia861', 'green_pricing_eia861', 'mergers_eia861', 'net_metering_eia861', 'non_net_metering_eia861', 'operational_data_eia861', 'reliability_eia861', 'utility_data_eia861'), 'eia923': ('generation_fuel_eia923', 'boiler_fuel_eia923', 'generation_eia923', 'coalmine_eia923', 'fuel_receipts_costs_eia923'), 'epacems': 'hourly_emissions_epacems', 'epaipm': ('transmission_single_epaipm', 'transmission_joint_epaipm', 'load_curves_epaipm', 'plant_region_map_epaipm'), 'ferc1': ('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1'), 'ferc714': ('respondent_id_ferc714', 'id_certification_ferc714', 'gen_plants_ba_ferc714', 'demand_monthly_ba_ferc714', 'net_energy_load_ba_ferc714', 'adjacency_ba_ferc714', 'interchange_ba_ferc714', 'lambda_hourly_ba_ferc714', 'lambda_description_ferc714', 'description_pa_ferc714', 'demand_forecast_pa_ferc714', 'demand_hourly_pa_ferc714'), 'glue': ('plants_eia', 'plants_ferc', 'plants', 'utilities_eia', 'utilities_ferc', 'utilities', 'utility_plant_assn')}¶ A dictionary containing data sources (keys) and the list of associated tables from that datasource that can be pulled into PUDL (values).
- Type
dict
-
pudl.constants.
rto_iso
= {'CAISO': 'California ISO', 'ERCOT': 'Electric Reliability Council of Texas', 'ISO-NE': 'ISO New England', 'MISO': 'Midcontinent ISO', 'NYISO': 'New York ISO', 'PJM': 'PJM Interconnection', 'SPP': 'Southwest Power Pool'}¶ A dictionary containing ISO/RTO abbreviations (keys) and names (values)
- Type
dict
-
pudl.constants.
sector_eia
= {'1': 'Electric Utility', '2': 'NAICS-22 Non-Cogen', '3': 'NAICS-22 Cogen', '4': 'Commercial NAICS Non-Cogen', '5': 'Commercial NAICS Cogen', '6': 'Industrial NAICS Non-Cogen', '7': 'Industrial NAICS Cogen'}¶ A dictionary mapping EIA numeric codes (keys) to EIA’s internal consolidated NAICS sectors (values).
- Type
dict
-
pudl.constants.
state_tz_approx
= {'AB': 'America/Edmonton', 'AK': 'US/Alaska', 'AL': 'US/Central', 'AR': 'US/Central', 'AS': 'Pacific/Pago_Pago', 'AZ': 'US/Arizona', 'BC': 'America/Vancouver', 'CA': 'US/Pacific', 'CO': 'US/Mountain', 'CT': 'US/Eastern', 'DC': 'US/Eastern', 'DE': 'US/Eastern', 'FL': 'US/Eastern', 'GA': 'US/Eastern', 'GU': 'Pacific/Guam', 'HI': 'US/Hawaii', 'IA': 'US/Central', 'ID': 'US/Mountain', 'IL': 'US/Central', 'IN': 'US/Eastern', 'KS': 'US/Central', 'KY': 'US/Eastern', 'LA': 'US/Central', 'MA': 'US/Eastern', 'MB': 'America/Winnipeg', 'MD': 'US/Eastern', 'ME': 'US/Eastern', 'MI': 'America/Detroit', 'MN': 'US/Central', 'MO': 'US/Central', 'MP': 'Pacific/Saipan', 'MS': 'US/Central', 'MT': 'US/Mountain', 'NB': 'America/Moncton', 'NC': 'US/Eastern', 'ND': 'US/Central', 'NE': 'US/Central', 'NH': 'US/Eastern', 'NJ': 'US/Eastern', 'NL': 'America/St_Johns', 'NM': 'US/Mountain', 'NS': 'America/Halifax', 'NT': 'America/Yellowknife', 'NU': 'America/Iqaluit', 'NV': 'US/Pacific', 'NY': 'US/Eastern', 'OH': 'US/Eastern', 'OK': 'US/Central', 'ON': 'America/Toronto', 'OR': 'US/Pacific', 'PA': 'US/Eastern', 'PE': 'America/Halifax', 'PR': 'America/Puerto_Rico', 'QC': 'America/Montreal', 'RI': 'US/Eastern', 'SC': 'US/Eastern', 'SD': 'US/Central', 'SK': 'America/Regina', 'TN': 'US/Central', 'TX': 'US/Central', 'UT': 'US/Mountain', 'VA': 'US/Eastern', 'VI': 'America/Puerto_Rico', 'VT': 'US/Eastern', 'WA': 'US/Pacific', 'WI': 'US/Central', 'WV': 'US/Eastern', 'WY': 'US/Mountain', 'YT': 'America/Whitehorse'}¶ A dictionary containing US and Canadian state/territory abbreviations (keys) and timezones (values)
- Type
dict
-
pudl.constants.
table_map_ferc1_pudl
= {'fuel_ferc1': 'f1_fuel', 'plant_in_service_ferc1': 'f1_plant_in_srvce', 'plants_hydro_ferc1': 'f1_hydro', 'plants_pumped_storage_ferc1': 'f1_pumped_storage', 'plants_small_ferc1': 'f1_gnrt_plant', 'plants_steam_ferc1': 'f1_steam', 'purchased_power_ferc1': 'f1_purchased_pwr'}¶ A dictionary mapping PUDL table names (keys) to the corresponding FERC Form 1 DBF table names.
- Type
dict
-
pudl.constants.
transport_modes_eia923
= {'GL': 'Great Lakes: Shipments of coal moved to consumers via the Great Lakes. These shipments are moved via the Great Lakes coal loading docks, which are identified by name and location as follows: Conneaut Coal Storage & Transfer, Conneaut, Ohio; NS Coal Dock (Ashtabula Coal Dock), Ashtabula, Ohio; Sandusky Coal Pier, Sandusky, Ohio; Toledo Docks, Toledo, Ohio; KCBX Terminals Inc., Chicago, Illinois; Superior Midwest Energy Terminal, Superior, Wisconsin', 'PL': 'Pipeline: Shipments of fuel moved to consumers by pipeline', 'RR': 'Rail: Shipments of fuel moved to consumers by rail (private or public/commercial). Included is coal hauled to or away from a railroad siding by truck if the truck did not use public roads.', 'RV': 'River: Shipments of fuel moved to consumers via river by barge. Not included are shipments to Great Lakes coal loading docks, tidewater piers, or coastal ports.', 'SP': 'Slurry Pipeline: Shipments of coal moved to consumers by slurry pipeline.', 'TC': 'Tramway/Conveyor: Shipments of fuel moved to consumers by tramway or conveyor.', 'TP': 'Tidewater Piers and Coastal Ports: Shipments of coal moved to Tidewater Piers and Coastal Ports for further shipments to consumers via coastal water or ocean. The Tidewater Piers and Coastal Ports are identified by name and location as follows: Dominion Terminal Associates, Newport News, Virginia; McDuffie Coal Terminal, Mobile, Alabama; IC Railmarine Terminal, Convent, Louisiana; International Marine Terminals, Myrtle Grove, Louisiana; Cooper/T. Smith Stevedoring Co. Inc., Darrow, Louisiana; Seward Terminal Inc., Seward, Alaska; Los Angeles Export Terminal, Inc., Los Angeles, California; Levin-Richmond Terminal Corp., Richmond, California; Baltimore Terminal, Baltimore, Maryland; Norfolk Southern Lamberts Point P-6, Norfolk, Virginia; Chesapeake Bay Piers, Baltimore, Maryland; Pier IX Terminal Company, Newport News, Virginia; Electro-Coal Transport Corp., Davant, Louisiana', 'TR': 'Truck: Shipments of fuel moved to consumers by truck. Not included is fuel hauled to or away from a railroad siding by truck on non-public roads.', 'WT': 'Water: Shipments of fuel moved to consumers by other waterways.', 'tr': 'Truck: Shipments of fuel moved to consumers by truck. Not included is fuel hauled to or away from a railroad siding by truck on non-public roads.'}¶ A dictionary mapping primary and secondary transportation mode codes (keys) to descriptions (values).
- Type
dict
-
pudl.constants.
us_states
= {'AK': 'Alaska', 'AL': 'Alabama', 'AR': 'Arkansas', 'AS': 'American Samoa', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia', 'GU': 'Guam', 'HI': 'Hawaii', 'IA': 'Iowa', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'MA': 'Massachusetts', 'MD': 'Maryland', 'ME': 'Maine', 'MI': 'Michigan', 'MN': 'Minnesota', 'MO': 'Missouri', 'MP': 'Northern Mariana Islands', 'MS': 'Mississippi', 'MT': 'Montana', 'NA': 'National', 'NC': 'North Carolina', 'ND': 'North Dakota', 'NE': 'Nebraska', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NV': 'Nevada', 'NY': 'New York', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'PR': 'Puerto Rico', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VA': 'Virginia', 'VI': 'Virgin Islands', 'VT': 'Vermont', 'WA': 'Washington', 'WI': 'Wisconsin', 'WV': 'West Virginia', 'WY': 'Wyoming'}¶ A dictionary containing US state abbreviations (keys) and names (values)
- Type
dict
-
pudl.constants.
working_partitions
= {'eia860': {'years': (2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)}, 'eia860m': {'year_month': '2020-11'}, 'eia861': {'years': (2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)}, 'eia923': {'years': (2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)}, 'epacems': {'states': ('AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'), 'years': (1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)}, 'ferc1': {'years': (1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)}, 'ferc714': {}}¶ A dictionary mapping data sources (keys) to dictionaries (values) that map partition types (sub-keys) to the available partitions (sub-values), such as the tuple of years for each data source that can be ingested into PUDL.
- Type
dict
-
pudl.constants.
xlsx_maps_pkg
= 'pudl.package_data.meta.xlsx_maps'¶ The location of the xlsx maps within the PUDL package data.
- Type
string
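The constants above are plain Python containers, so they can be used directly for lookups. A minimal illustration, using values taken from the definitions above:

    import pudl.constants as pc

    pc.rto_iso["MISO"]                            # 'Midcontinent ISO'
    pc.ferc1_power_purchase_type["LF"]            # 'long_firm'
    pc.working_partitions["ferc1"]["years"][-1]   # 2019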
pudl.dfc module¶
Implementation of DataFrameCollection.
PUDL ETL needs to exchange collections of named tables (pandas.DataFrame) between ETL tasks, and the volume of data contained in these tables can far exceed the memory of a single machine.
The Prefect framework currently caches task results in memory, which can lead to out-of-memory problems, especially when dealing with large datasets (e.g. during a full data release). To alleviate this problem, the Prefect team recommends passing “references” to the actual data, which is stored separately.
DataFrameCollection does just this. It keeps lightweight references to named dataframes and stores the data either locally or on cloud storage (we use the pandas.to_pickle method, which supports these various storage backends out of the box).
Think of DataFrameCollection as a dict-like structure backed by disk.
-
class
pudl.dfc.
DataFrameCollection
(storage_path: Optional[str] = None, **data_frames: Dict[str, pandas.core.frame.DataFrame])[source]¶ Bases:
object
This class can hold named pandas.DataFrames that are stored on disk or in GCS.
It should be used whenever dictionaries of named pandas.DataFrames are passed between prefect tasks. Due to the implicit in-memory caching of task results it is important to keep the in-memory footprint of the exchanged data small.
This wrapper achieves this by maintaining references to tables that are themselves stored on a persistent medium such as a local disk or a GCS bucket.
This is intended to be used from within prefect flows, and new instances can be configured by setting relevant prefect.context variables.
-
add_reference
(name: str, table_id: uuid.UUID)[source]¶ Adds a reference to a named dataframe to this collection.
This assumes that the data is already present on disk.
-
static
from_dict
(d: Dict[str, pandas.core.frame.DataFrame])[source]¶ Constructs a new DataFrameCollection from a dictionary of dataframes.
-
get_table_names
() → List[str][source]¶ Returns a sorted list of the names of the dataframes contained in this collection.
-
items
() → Iterator[Tuple[str, pandas.core.frame.DataFrame]][source]¶ Iterates over table names and the corresponding pd.DataFrame objects.
-
references
() → Iterator[Tuple[str, uuid.UUID]][source]¶ Returns a set-like object with (name, table_id) tuples.
-
store
(name: str, data: pandas.core.frame.DataFrame)[source]¶ Adds named dataframe to collection and stores its contents on disk.
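A minimal usage sketch based on the methods documented above, assuming a writable local storage_path (the path and example tables are hypothetical):

    import pandas as pd
    from pudl.dfc import DataFrameCollection

    # Create a collection backed by a local directory; table contents are
    # pickled to storage and only lightweight references stay in memory.
    dfc = DataFrameCollection(
        storage_path="/tmp/pudl-dfc",  # hypothetical scratch location
        fuel_ferc1=pd.DataFrame({"plant_id": [1, 2]}),
    )

    # Add another named table; its contents are written out immediately.
    dfc.store("plants_eia", pd.DataFrame({"plant_id_eia": [3, 4]}))

    print(dfc.get_table_names())  # ['fuel_ferc1', 'plants_eia']

    # Iterate over (name, DataFrame) pairs, reading each table back in.
    for name, df in dfc.items():
        print(name, len(df))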
-
pudl.etl module¶
Run the PUDL ETL Pipeline.
The PUDL project integrates several different public datasets into well normalized data packages, allowing easier access and interaction across datasets. This module coordinates the extract/transform/load process for data from:
US Energy Information Administration (EIA): - Form 860 (eia860) - Form 923 (eia923)
US Federal Energy Regulatory Commission (FERC): - Form 1 (ferc1)
US Environmental Protection Agency (EPA): - Continuous Emissions Monitoring System (epacems) - Integrated Planning Model (epaipm)
-
pudl.etl.
etl
(datapkg_settings, output_dir, pudl_settings, ds_kwargs)[source]¶ Run ETL process for data package specified by datapkg_settings dictionary.
This is the coordinating function for generating all of the CSVs for a data package. For each of the datasets enumerated in the datapkg_settings, this function runs the dataset-specific ETL function. Along the way, we accumulate which tables have been loaded. This is useful for generating the metadata associated with the package.
- Parameters
datapkg_settings (dict) – Validated ETL parameters for a single datapackage, originally read in from the PUDL ETL input file.
output_dir (path-like) – The individual datapackage directory, which will contain the datapackage.json file and the data directory.
pudl_settings (dict) – a dictionary describing paths to various resources and outputs.
ds_kwargs (dict) – named-arguments to pass to Datastore constructor when creating new instance. This contains values derived from command-line flags that control how caching layers operate.
- Returns
The names of the tables included in the output datapackage.
- Return type
list
-
pudl.etl.
generate_datapkg_bundle
(datapkg_bundle_settings, pudl_settings, datapkg_bundle_name, datapkg_bundle_doi=None, clobber=False, use_local_cache: bool = True, gcs_cache_path: Optional[str] = None)[source]¶ Coordinate the generation of data packages.
For each bundle of packages laid out in the package_settings, this function generates data packages. First, the settings are validated (which runs through each of the settings listed in the package_settings). Then, for each of the packages, we run through the ETL (extract, transform, load) functions, which generate CSVs. Then the metadata for the packages is generated by pulling from the metadata (which is a json file containing the schema for all of the possible pudl tables).
- Parameters
datapkg_bundle_settings (iterable) – a list of dictionaries. Each item in the list corresponds to a data package. Each data package’s dictionary contains the arguments for its ETL function.
pudl_settings (dict) – a dictionary filled with settings that mostly describe paths to various resources and outputs.
datapkg_bundle_name (str) – name of the directory where you want the bundle of data packages to live.
clobber (bool) – If True and there is already a directory with data packages with the datapkg_bundle_name, the existing data packages will be deleted and new data packages will be generated in their place.
use_local_cache (bool) – controls whether datastore should be using local file cache.
gcs_cache_path (str) – controls whether datastore should be using Google Cloud Storage based cache.
- Returns
A dictionary with datapackage names as the keys, and Python dictionaries representing tabular datapackage resource descriptors as the values, one per datapackage that was generated as part of the bundle.
- Return type
dict
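For orientation, a hypothetical invocation might look like the sketch below. The contents of datapkg_bundle_settings are illustrative only (real settings normally come from a validated YAML settings file), and pudl.workspace.setup.get_defaults() is assumed here as the usual way to load workspace paths:

    import pudl.workspace.setup
    from pudl.etl import generate_datapkg_bundle

    # Illustrative bundle settings: one datapackage with two years of FERC 1.
    datapkg_bundle_settings = [
        {
            "name": "ferc1-example",
            "title": "Example FERC Form 1 datapackage",
            "datasets": [{"ferc1": {"ferc1_years": [2018, 2019]}}],
        }
    ]

    # Assumed helper: loads input/output paths from the PUDL workspace config.
    pudl_settings = pudl.workspace.setup.get_defaults()

    metadata = generate_datapkg_bundle(
        datapkg_bundle_settings,
        pudl_settings,
        datapkg_bundle_name="example-bundle",
        clobber=True,
    )
    print(list(metadata))  # names of the generated datapackages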
-
pudl.etl.
get_flattened_etl_parameters
(datapkg_bundle_settings)[source]¶ Compile flattened etl parameters.
The datapkg_bundle_settings is a list of dictionaries with the specific etl parameters for each dataset nested inside the dictionary. This function extracts the years, states, tables, etc. from the list of datapackage settings and compiles them into one dictionary.
- Parameters
datapkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.
- Returns
dictionary of etl parameters with etl parameter names (keys) (i.e. ferc1_years, eia923_years) and etl parameters (values) (i.e. a list of years for ferc1_years)
- Return type
dict
-
pudl.etl.
validate_params
(datapkg_bundle_settings, pudl_settings)[source]¶ Enforce validity of ETL parameters found in datapackage bundle settings.
For each enumerated data package in the datapkg_bundle_settings, this function checks to ensure the input parameters for each of the datasets are consistent with the known input options. Most of those options are enumerated in pudl.constants. For each dataset, the years, states, tables, etc. are checked to ensure that they are valid and present. If parameters are not valid, assertions will be raised.
Some options have defaults or are hard-coded during validation. Tables will typically default to all of the available tables if they aren’t set. CEMS is always going to be partitioned by year and state. This means we have functionally removed the option to not partition, or to partition another way.
- Parameters
datapkg_bundle_settings (iterable) – a list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.
pudl_settings (dict) – a dictionary describing paths to various resources and outputs.
- Returns
validated list of data package parameters, with each element of the list being a dictionary specifying the data to be packaged.
- Return type
iterable
pudl.helpers module¶
General utility functions that are used in a variety of contexts.
The functions in this module are used in various stages of the ETL and post-etl processes. They are usually not dataset specific, but not always. If a function is designed to be used as a general purpose tool, applicable in multiple scenarios, it should probably live here. There are lots of transform-type functions in here that help with cleaning and restructuring dataframes.
-
pudl.helpers.
add_fips_ids
(df, state_col='state', county_col='county', vintage=2015)[source]¶ Add State and County FIPS IDs to a dataframe.
-
pudl.helpers.
clean_eia_counties
(df, fixes, state_col='state', county_col='county')[source]¶ Replace non-standard county names with county names from the US Census.
-
pudl.helpers.
clean_merge_asof
(left, right, left_on='report_date', right_on='report_date', by={})[source]¶ Merge two dataframes having different time report_date frequencies.
We often need to bring together data which is reported on a monthly basis, and entity attributes that are reported on an annual basis. The
pandas.merge_asof()
is designed to do this, but requires that dataframes are sorted by the merge keys (left_on
,right_on
, andby.keys()
here). We also need to make sure that all merge keys have identical data types in the two dataframes (e.g.plant_id_eia
needs to be a nullable integer in both dataframes, not a python int in one, and a nullablepandas.Int64Dtype()
in the other). Note thatpandas.merge_asof()
performs a left merge, so the higher frequency dataframe must be the left dataframe.We also force both
left_on
andright_on
to be a Datetime usingpandas.to_datetime()
to allow merging dataframes having integer years with those having datetime columns.Because
pandas.merge_asof()
searches backwards for the first matching date, this function only works if the less granular dataframe uses the convention of reporting the first date in the time period for which it reports. E.g. annual dataframes need to have January 1st as the date. This is what happens by defualt if only a year or year-month are provided topandas.to_datetime()
as strings.- Parameters
left (pandas.DataFrame) – The higher frequency “data” dataframe. Typically monthly in our use cases. E.g.
generation_eia923
. Must containreport_date
and any columns specified in theby
argument.right (pandas.DataFrame) – The lower frequency “attribute” dataframe. Typically annual in our uses cases. E.g.
generators_eia860
. Must containreport_date
and any columns specified in theby
argument.left_on (str) – Column in
left
to merge on using merge_asof. Default isreport_date
. Must be convertible to a Datetime usingpandas.to_datetime()
right_on (str) – Column in
right
to merge on using merge_asof. Default isreport_date
. Must be convertible to a Datetime usingpandas.to_datetime()
by (dict) – A dictionary enumerating any columns to merge on other than
report_date
. Typically ID columns likeplant_id_eia
,generator_id
orboiler_id
. The keys of the dictionary are the names of the columns, and the values are their data source, as defined inpudl.constants
(e.g.ferc1
oreia
). The data source is used to look up the column’s canonical data type.
- Returns
Merged contents of left and right input dataframes. Will be sorted by
left_on
and any columns specified inby
. See documentation forpandas.merge_asof()
to understand how this kind of merge works.- Return type
- Raises
ValueError – if
left_on
orright_on
columns are missing from their respective input dataframes.ValueError – if any of the labels referenced in
by
are missing from either the left or right dataframes.
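A small, self-contained sketch of the merge described above, using toy monthly and annual dataframes (the column values are made up; plant_id_eia is mapped to the eia data source via by):

    import pandas as pd
    import pudl.helpers

    # Higher frequency "data" dataframe (monthly) goes on the left.
    left = pd.DataFrame({
        "report_date": pd.to_datetime(["2019-01-01", "2019-02-01"]),
        "plant_id_eia": pd.Series([1, 1], dtype="Int64"),
        "net_generation_mwh": [100.0, 110.0],
    })

    # Lower frequency "attribute" dataframe (annual) goes on the right.
    right = pd.DataFrame({
        "report_date": pd.to_datetime(["2019-01-01"]),
        "plant_id_eia": pd.Series([1], dtype="Int64"),
        "capacity_mw": [50.0],
    })

    # The annual capacity_mw value is broadcast onto each monthly record.
    merged = pudl.helpers.clean_merge_asof(
        left, right, by={"plant_id_eia": "eia"}
    )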
-
pudl.helpers.
cleanstrings
(df, columns, stringmaps, unmapped=None, simplify=True)[source]¶ Consolidate freeform strings in several dataframe columns.
This function will consolidate freeform strings found in columns into simplified categories, as defined by stringmaps. This is useful when a field contains many different strings that are really meant to represent a finite number of categories, e.g. a type of fuel. It can also be used to create simplified categories that apply to similar attributes that are reported in various data sources from different agencies that use their own taxonomies.
The function takes and returns a pandas.DataFrame, making it suitable for use with the pandas.DataFrame.pipe() method in a chain.
- Parameters
df (pandas.DataFrame) – the DataFrame containing the string columns to be cleaned up.
columns (list) – a list of string column labels found in the column index of df. These are the columns that will be cleaned.
stringmaps (list) – a list of dictionaries. The keys of these dictionaries are strings, and the values are lists of strings. Each dictionary in the list corresponds to a column in columns. The keys of the dictionaries are the values with which every string in the list of values will be replaced.
unmapped (str, None) – the value with which strings not found in the stringmap dictionary will be replaced. Typically the null string ‘’. If None, then strings found in the columns but not in the stringmap will be left unchanged.
simplify (bool) – If true, strip whitespace, remove duplicate whitespace, and force lower-case on both the string map and the values found in the columns to be cleaned. This can reduce the overall number of string values that need to be tracked.
- Returns
The function returns a new DataFrame containing the cleaned strings.
- Return type
pandas.DataFrame
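A toy illustration of the consolidation described above (the fuel strings and categories are invented for the example):

    import pandas as pd
    import pudl.helpers

    df = pd.DataFrame({"fuel": ["Sub-Bituminous", "lignite ", "LIG", "??"]})
    fuel_map = {"coal": ["sub-bituminous", "lig", "lignite"]}

    # With simplify=True, both the map and the column values are lower-cased
    # and whitespace-compacted before matching, so the three coal-ish strings
    # collapse to "coal" and the unrecognized "??" becomes the empty string.
    clean = pudl.helpers.cleanstrings(df, ["fuel"], [fuel_map], unmapped="")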
-
pudl.helpers.
cleanstrings_series
(col, str_map, unmapped=None, simplify=True)[source]¶ Clean up the strings in a single column/Series.
- Parameters
col (pandas.Series) – A pandas Series, typically a single column of a dataframe, containing the freeform strings that are to be cleaned.
str_map (dict) – A dictionary of lists of strings, in which the keys are the simplified canonical strings, with which each string found in the corresponding list will be replaced.
unmapped (str) – A value with which to replace any string found in col that is not found in one of the lists of strings in map. Typically the null string ‘’. If None, these strings will not be replaced.
simplify (bool) – If True, strip and compact whitespace, and lowercase all strings in both the list of values to be replaced, and the values found in col. This can reduce the number of strings that need to be kept track of.
- Returns
The cleaned up Series / column, suitable for replacing the original messy column in a pandas.DataFrame.
- Return type
pandas.Series
-
pudl.helpers.
cleanstrings_snake
(df, cols)[source]¶ Clean the strings in the given columns of a dataframe, converting them to snake_case.
- Parameters
df (panda.DataFrame) – original dataframe.
cols (list) – list of columns in df to apply snake case to.
-
pudl.helpers.
convert_cols_dtypes
(df, data_source, name=None)[source]¶ Convert the data types for a dataframe.
This function will convert a PUDL dataframe’s columns to the correct data type. It uses a dictionary in constants.py called column_dtypes to assign the right type. Within a given data source (e.g. eia923, ferc1) each column name is assumed to always have the same data type whenever it is found.
Boolean type conversions created a special problem, because null values in boolean columns get converted to True (which is bonkers!)… we generally want to preserve the null values and definitely don’t want them to be True, so we are keeping those columns as objects and performing a simple mask for the boolean columns.
The other exception in here is with the utility_id_eia column. It is often an object column of strings. All of the strings are numbers, so it should be possible to convert to pandas.Int32Dtype() directly, but pandas requires us to convert to int first. There will probably be other columns that have this problem… and hopefully pandas will eventually enable this direct conversion.
- Parameters
df (pandas.DataFrame) – dataframe with columns that appear in the PUDL tables.
data_source (str) – the name of the datasource (eia, ferc1, etc.)
name (str) – name of the table (for logging only!)
- Returns
a dataframe with columns as specified by the pudl.constants column_dtypes dictionary.
- Return type
pandas.DataFrame
-
pudl.helpers.
convert_dfs_dict_dtypes
(dfs_dict, data_source)[source]¶ Convert the data types of a dictionary of dataframes.
This is a wrapper for pudl.helpers.convert_cols_dtypes() which loops over an entire dictionary of dataframes, assuming they are all from the specified data source, and appropriately assigns data types to each column based on the data source specific type map stored in pudl.constants.
-
pudl.helpers.
convert_to_date
(df, date_col='report_date', year_col='report_year', month_col='report_month', day_col='report_day', month_value=1, day_value=1)[source]¶ Convert specified year, month or day columns into a datetime object.
If the input date_col already exists in the input dataframe, then no conversion is applied, and the original dataframe is returned unchanged. Otherwise the constructed date is placed in that column, and the columns which were used to create the date are dropped.
- Parameters
df (pandas.DataFrame) – dataframe to convert
date_col (str) – the name of the column you want in the output.
year_col (str) – the name of the year column in the original table.
month_col (str) – the name of the month column in the original table.
day_col – the name of the day column in the original table.
month_value (int) – generated month if no month exists.
day_value (int) – generated day if no day exists.
- Returns
A DataFrame in which the year, month, day columns values have been converted into datetime objects.
- Return type
pandas.DataFrame
Todo
Update docstring.
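A minimal sketch of the conversion described above (toy data):

    import pandas as pd
    import pudl.helpers

    df = pd.DataFrame({
        "report_year": [2019, 2020],
        "report_month": [1, 7],
        "data": [1.0, 2.0],
    })

    # The year/month columns are combined into a single report_date column
    # (with day_value=1 supplying the missing day) and then dropped.
    out = pudl.helpers.convert_to_date(df)
    # out now has a datetime64 report_date column; report_year and
    # report_month are gone.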
-
pudl.helpers.
count_records
(df, cols, new_count_col_name)[source]¶ Count the number of unique records in group in a dataframe.
- Parameters
df (panda.DataFrame) – dataframe you would like to groupby and count.
cols (iterable) – list of columns to group and count by.
new_count_col_name (string) – the name that will be assigned to the column that will contain the count.
- Returns
dataframe containing only cols and new_count_col_name.
- Return type
pandas.DataFrame
-
pudl.helpers.
download_zip_url
(url, save_path, chunk_size=128)[source]¶ Download and save a Zipfile locally.
Useful for acquiring and storing non-PUDL data locally.
- Parameters
url (str) – The URL from which to download the Zipfile
save_path (pathlib.Path) – The location to save the file.
chunk_size (int) – Data chunk in bytes to use while downloading.
- Returns
None
-
pudl.helpers.
drop_tables
(engine, clobber=False)[source]¶ Drops all tables from a SQLite database.
Creates an sa.schema.MetaData object reflecting the structure of the database that the passed in engine refers to, and uses that schema to drop all existing tables.
Todo
Treat DB connection as a context manager (with/as).
- Parameters
engine (sa.engine.Engine) – An SQL Alchemy SQLite database Engine pointing at an existing SQLite database to be deleted.
- Returns
None
-
pudl.helpers.
fillna_w_rolling_avg
(df_og, group_cols, data_col, window=12, **kwargs)[source]¶ Fill NaN values with a rolling average.
Imputes null values in a dataframe using a rolling monthly average. Note that this was designed to work with the PudlTabl object’s tables.
- Parameters
df_og (pandas.DataFrame) – Original dataframe. Must have group_cols column, a data_col column and a ‘report_date’ column.
group_cols (iterable) – a list of columns to groupby.
data_col (str) – the name of the data column.
window (int) – window from pandas.Series.rolling.
kwargs – Additional arguments to pass to pandas.Series.rolling.
- Returns
dataframe with nulls filled in.
- Return type
pandas.DataFrame
-
pudl.helpers.
find_timezone
(*, lng=None, lat=None, state=None, strict=True)[source]¶ Find the timezone associated with a specified input location.
Note that this function requires named arguments. The names are lng, lat, and state. lng and lat must be provided, but they may be NA. state isn’t required, and isn’t used unless lng/lat are NA or timezonefinder can’t find a corresponding timezone.
Timezones based on states are imprecise, so it’s far better to use lng/lat if possible. If strict is True, state will not be used. More on state-to-timezone conversion here: https://en.wikipedia.org/wiki/List_of_time_offsets_by_U.S._state_and_territory
- Parameters
- Returns
The timezone (as an IANA string) for that location.
- Return type
str
Todo
Update docstring.
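A brief sketch (the coordinates are near Boulder, Colorado; timezonefinder resolves them to an IANA timezone string):

    import pudl.helpers

    tz = pudl.helpers.find_timezone(lng=-105.3, lat=40.0, state=None, strict=True)
    # tz == "America/Denver"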
-
pudl.helpers.
fix_eia_na
(df)[source]¶ Replace common ill-posed EIA NA spreadsheet values with np.nan.
Currently replaces empty string, single decimal points with no numbers, and any single whitespace character with np.nan.
- Parameters
df (pandas.DataFrame) – The DataFrame to clean.
- Returns
The cleaned DataFrame.
- Return type
pandas.DataFrame
-
pudl.helpers.
fix_int_na
(df, columns, float_na=nan, int_na=-1, str_na='')[source]¶ Convert NA containing integer columns from float to string.
Numpy doesn’t have a real NA value for integers. When pandas stores integer data which has NA values, it thus upcasts integers to floating point values, using np.nan values for NA. However, in order to dump some of our dataframes to CSV files for use in data packages, we need to write out integer formatted numbers, with empty strings as the NA value. This function replaces np.nan values with a sentinel value, converts the column to integers, and then to strings, finally replacing the sentinel value with the desired NA string.
This is an interim solution – now that pandas extension arrays have been implemented, we need to go back through and convert all of these integer columns that contain NA values to Nullable Integer types like Int64.
- Parameters
df (pandas.DataFrame) – The dataframe to be fixed. This argument allows method chaining with the pipe() method.
columns (iterable of strings) – A list of DataFrame column labels indicating which columns need to be reformatted for output.
float_na (float) – The floating point value to be interpreted as NA and replaced in col.
int_na (int) – Sentinel value to substitute for float_na prior to conversion of the column to integers.
str_na (str) – String value to substitute for int_na after the column has been converted to strings.
- Returns
a new DataFrame, with the selected columns converted to strings that look like integers, compatible with the postgresql COPY FROM command.
- Return type
pandas.DataFrame
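A minimal sketch of the float-to-string conversion described above:

    import numpy as np
    import pandas as pd
    import pudl.helpers

    df = pd.DataFrame({"utility_id": [1.0, np.nan, 3.0]})

    # The float column becomes integer-looking strings, with NaN replaced
    # by the empty string, ready for CSV output.
    out = pudl.helpers.fix_int_na(df, columns=["utility_id"])
    # out["utility_id"] -> ["1", "", "3"]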
-
pudl.helpers.
fix_leading_zero_gen_ids
(df)[source]¶ Remove leading zeros from EIA generator IDs which are numeric strings.
If the DataFrame contains a column named generator_id then that column will be cast to a string, and any all-numeric value with leading zeroes will have the leading zeroes removed. This is necessary because in some but not all years of data, some of the generator IDs are treated as integers in the Excel spreadsheets published by EIA, so the same generator may show up with the ID “0001” and “1” in different years.
Alphanumeric generator IDs with leading zeroes are not affected, as we found no instances in which an alphanumeric generator ID appeared both with and without leading zeroes.
- Parameters
df (pandas.DataFrame) – DataFrame, presumably containing a column named generator_id (otherwise no action will be taken.)
- Returns
pandas.DataFrame
-
pudl.helpers.
generate_rolling_avg
(df, group_cols, data_col, window, **kwargs)[source]¶ Generate a rolling average.
For a given dataframe with a report_date column, generate a monthly rolling average and use this rolling average to impute missing values.
- Parameters
df (pandas.DataFrame) – Original dataframe. Must have group_cols columns, a data_col column and a report_date column.
group_cols (iterable) – a list of columns to groupby.
data_col (str) – the name of the data column.
window (int) – window from pandas.Series.rolling().
kwargs – Additional arguments to pass to pandas.Series.rolling().
- Returns
pandas.DataFrame
-
pudl.helpers.
get_pudl_dtype
(col, data_source)[source]¶ Look up a column’s canonical data type based on its PUDL data source.
-
pudl.helpers.
get_pudl_dtypes
(col_source_dict)[source]¶ Look up canonical PUDL data types for columns based on data sources.
-
pudl.helpers.
is_doi
(doi)[source]¶ Determine if a string is a valid digital object identifier (DOI).
Function simply checks whether the offered string matches a regular expression – it doesn’t check whether the DOI is actually registered with the relevant authority.
-
pudl.helpers.
iterate_multivalue_dict
(**kwargs)[source]¶ Make dicts from dict with main dict key and one value of main dict.
-
pudl.helpers.
merge_dicts
(list_of_dicts)[source]¶ Merge multiple dictionaries together.
Given any number of dicts, shallow copy and merge into a new dict, precedence goes to key value pairs in latter dicts.
- Parameters
list_of_dicts (list) – a list of dictionaries.
- Returns
dict
-
pudl.helpers.
month_year_to_date
(df)[source]¶ Convert all pairs of year/month fields in a dataframe into Date fields.
This function finds all column names within a dataframe that match the regular expression ‘_month$’ and ‘_year$’, and looks for pairs that have identical prefixes before the underscore. These fields are assumed to describe a date, accurate to the month. The two fields are used to construct a new _date column (having the same prefix) and the month/year columns are then dropped.
Todo
This function needs to be combined with convert_to_date, and improved: * find and use a _day$ column as well * allow specification of default month & day values, if none are found. * allow specification of lists of year, month, and day columns to be combined, rather than automatically finding all the matching ones. * Do the Right Thing when invalid or NA values are encountered.
- Parameters
df (pandas.DataFrame) – The DataFrame in which to convert year/month fields to Date fields.
- Returns
A DataFrame in which the year/month fields have been converted into Date fields.
- Return type
pandas.DataFrame
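Example
The equivalent operation for a single hypothetical year/month pair, in plain pandas:
import pandas as pd

df = pd.DataFrame({"operating_year": [2019, 2020], "operating_month": [3, 11]})

parts = pd.DataFrame({
    "year": df["operating_year"],
    "month": df["operating_month"],
    "day": 1,  # dates are only accurate to the month, so pin the day to 1
})
df["operating_date"] = pd.to_datetime(parts)
df = df.drop(columns=["operating_year", "operating_month"])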
-
pudl.helpers.oob_to_nan(df, cols, lb=None, ub=None)[source]¶
Set non-numeric values and those outside of a given range to NaN.
- Parameters
df (pandas.DataFrame) – The dataframe containing values to be altered.
cols (iterable) – Labels of the columns whose values are to be changed.
lb (number) – Lower bound, below which values are set to NaN. If None, don't use a lower bound.
ub (number) – Upper bound, above which values are set to NaN. If None, don't use an upper bound.
- Returns
The altered DataFrame.
- Return type
pandas.DataFrame
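Example
The same cleanup in plain pandas (bound inclusivity in the real implementation may differ):
import pandas as pd

df = pd.DataFrame({"heat_content": ["1.02", "bad", "250.0"]})

vals = pd.to_numeric(df["heat_content"], errors="coerce")
# Non-numeric values are already NaN; now NaN out the out-of-bounds values.
df["heat_content"] = vals.where((vals >= 0.1) & (vals <= 30.0))
# [1.02, NaN, NaN]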
-
pudl.helpers.organize_cols(df, cols)[source]¶
Organize columns into key ID & name fields & alphabetical data columns.
For readability, it’s nice to group a few key columns at the beginning of the dataframe (e.g. report_year or report_date, plant_id…) and then put all the rest of the data columns in alphabetical order.
- Parameters
df (pandas.DataFrame) – The DataFrame to be re-organized.
cols – The columns to put first, in their desired output ordering.
- Returns
A dataframe with the same columns as the input DataFrame df, but with cols first, in the same order as they were passed in, and the remaining columns sorted alphabetically.
- Return type
pandas.DataFrame
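Example
A sketch matching the documented behavior (organize_cols_sketch is a hypothetical stand-in):
import pandas as pd

def organize_cols_sketch(df, cols):
    """Put cols first, in the given order, then the rest alphabetically."""
    data_cols = sorted(c for c in df.columns if c not in cols)
    return df[list(cols) + data_cols]

df = pd.DataFrame(columns=["net_generation_mwh", "plant_id", "capacity_mw", "report_date"])
list(organize_cols_sketch(df, ["report_date", "plant_id"]).columns)
# ['report_date', 'plant_id', 'capacity_mw', 'net_generation_mwh']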
-
pudl.helpers.prep_dir(dir_path, clobber=False)[source]¶
Create (or delete and recreate) a directory.
- Parameters
dir_path (path-like) – path to the directory that you are trying to clean and prepare.
clobber (bool) – If True and dir_path exists, it will be removed and replaced with a new, empty directory.
- Raises
FileExistsError – if a file or directory already exists at dir_path.
- Returns
Path to the created directory.
- Return type
pathlib.Path
-
pudl.helpers.simplify_columns(df)[source]¶
Simplify column labels for use as snake_case database fields.
All columns will be re-labeled by:
* Replacing all non-alphanumeric characters with spaces.
* Forcing all letters to be lower case.
* Compacting internal whitespace to a single " ".
* Stripping leading and trailing whitespace.
* Replacing all remaining whitespace with underscores.
- Parameters
df (pandas.DataFrame) – The DataFrame to clean.
- Returns
The cleaned DataFrame.
- Return type
pandas.DataFrame
Todo
Update docstring.
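Example
The same chain of operations expressed directly on a pandas Index (illustrative):
import pandas as pd

df = pd.DataFrame(columns=["Plant Name (EIA)", "  Net   Generation MWh "])
df.columns = (
    df.columns.str.replace(r"[^0-9a-zA-Z]+", " ", regex=True)  # non-alphanumeric -> spaces
    .str.strip()                           # trim leading/trailing whitespace
    .str.lower()                           # lower case
    .str.replace(r"\s+", "_", regex=True)  # remaining whitespace -> underscores
)
# ['plant_name_eia', 'net_generation_mwh']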
-
pudl.helpers.simplify_strings(df, columns)[source]¶
Simplify the strings contained in a set of dataframe columns.
Performs several operations to simplify strings for comparison and parsing purposes. These include removing Unicode control characters, stripping leading and trailing whitespace, using lowercase characters, and compacting all internal whitespace to a single space.
Leaves null values unaltered. Casts other values with astype(str).
- Parameters
df (pandas.DataFrame) – DataFrame whose columns are being cleaned up.
columns (iterable) – The labels of the string columns to be simplified.
- Returns
The whole DataFrame that was passed in, with the string columns cleaned up.
- Return type
pandas.DataFrame
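Example
An illustrative version of the cleanup for one column, leaving nulls untouched:
import pandas as pd

df = pd.DataFrame({"utility_name": ["  Pacific   Gas &\tElectric ", None]})

col = df["utility_name"]
mask = col.notna()
df.loc[mask, "utility_name"] = (
    col[mask].astype(str)
    .str.replace(r"[\x00-\x1f\x7f-\x9f]", "", regex=True)  # Unicode control chars
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)  # compact internal whitespace
)
# ['pacific gas & electric', None]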
-
pudl.helpers.zero_pad_zips(zip_series, n_digits)[source]¶
Retain prefix zeros in zipcodes.
- Parameters
zip_series (pd.Series) – series containing the zipcode values.
n_digits (int) – zipcode length (likely 4 or 5 digits).
- Returns
a series containing zipcodes with their prefix zeros intact and invalid zipcodes rendered as NA.
- Return type
pandas.Series
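Example
An illustrative equivalent for n_digits=5 using plain pandas:
import pandas as pd

zips = pd.Series(["501", "2134", "90210", "ABCDE", None])

# Keep only all-numeric entries, then left-pad with zeros to 5 digits;
# everything else becomes NA.
valid = zips.str.isdigit().fillna(False).astype(bool)
padded = zips.where(valid).str.zfill(5)
# ['00501', '02134', '90210', NaN, NaN]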
pudl.validate module¶
PUDL data validation functions and test case specifications.
- What defines a data validation?
What data are we checking?
* What table or output does it come from?
* What selection criteria do we apply to that table or output?
What are we checking it against?
* Itself (helps validate that the tests themselves are working)
* A processed version of itself (aggregation or derived values)
* A hard-coded external standard (e.g. heat rates, fuel heat content)
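Example
The validation cases below are plain dictionaries. A hypothetical, minimal runner for a bounds-style case might look like this, using pudl.validate.weighted_quantile() (documented later in this module); the real vs_bounds() and plot_vs_bounds() functions are more complete:
import pudl.validate as pv

def check_case(df, case):
    """Assert that a low weighted quantile of data_col stays above a hard bound."""
    sub = df.query(case["query"])
    low = pv.weighted_quantile(sub[case["data_col"]], sub[case["weight_col"]], case["low_q"])
    if low < case["low_bound"]:
        raise ValueError(f"{case['title']}: {low} < {case['low_bound']}")

case = {
    "title": "Lignite heat content (tails)",
    "query": "fuel_type_code=='LIG'",
    "low_q": 0.05,
    "low_bound": 10.0,
    "data_col": "fuel_mmbtu_per_unit",
    "weight_col": "fuel_consumed_units",
}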
-
pudl.validate.bf_eia923_agg
= [{'title': 'Coal ash content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.2, 'mid_q': 0.7, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Coal sulfur content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': False, 'mid_q': False, 'hi_q': False, 'data_col': 'sulfur_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Coal heat content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Petroleum heat content', 'query': "fuel_type_code_pudl=='oil'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Gas heat content', 'query': "fuel_type_code_pudl=='gas'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ EIA923 Boiler Fuel data validation against aggregated data.
-
pudl.validate.bf_eia923_coal_ash_content
= [{'title': 'Bituminous coal ash content (middle)', 'query': "fuel_type_code=='BIT'", 'low_q': 0.5, 'low_bound': 6.0, 'hi_q': 0.5, 'hi_bound': 15.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Sub-bituminous coal ash content (middle)', 'query': "fuel_type_code=='SUB'", 'low_q': 0.5, 'low_bound': 4.5, 'hi_q': 0.5, 'hi_bound': 7.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Lignite ash content (middle)', 'query': "fuel_type_code=='LIG'", 'low_q': 0.5, 'low_bound': 7.0, 'hi_q': 0.5, 'hi_bound': 30.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'All coal ash content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 4.0, 'hi_q': 0.5, 'hi_bound': 20.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}]¶ Valid coal ash content (%). Based on historical reporting in EIA 923.
-
pudl.validate.bf_eia923_coal_heat_content
= [{'title': 'Bituminous coal heat content (middle)', 'query': "fuel_type_code=='BIT'", 'low_q': 0.5, 'low_bound': 20.5, 'hi_q': 0.5, 'hi_bound': 26.5, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Bituminous coal heat content (tails)', 'query': "fuel_type_code=='BIT'", 'low_q': 0.05, 'low_bound': 17.0, 'hi_q': 0.95, 'hi_bound': 30.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Sub-bituminous coal heat content (middle)', 'query': "fuel_type_code=='SUB'", 'low_q': 0.5, 'low_bound': 16.5, 'hi_q': 0.5, 'hi_bound': 18.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Sub-bituminous coal heat content (tails)', 'query': "fuel_type_code=='SUB'", 'low_q': 0.05, 'low_bound': 15.0, 'hi_q': 0.95, 'hi_bound': 20.5, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Lignite heat content (middle)', 'query': "fuel_type_code=='LIG'", 'low_q': 0.5, 'low_bound': 12.0, 'hi_q': 0.5, 'hi_bound': 14.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Lignite heat content (tails)', 'query': "fuel_type_code=='LIG'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 15.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'All coal heat content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 10.0, 'hi_q': 0.5, 'hi_bound': 30.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid coal (bituminous, sub-bituminous, and lignite) heat content values.
-
pudl.validate.bf_eia923_coal_sulfur_content
= [{'title': 'Coal sulfur content (tails)', 'query': "fuel_type_code_pudl=='coal'", 'hi_q': 0.95, 'hi_bound': 4.0, 'low_q': 0.05, 'low_bound': 0.15, 'data_col': 'sulfur_content_pct', 'weight_col': 'fuel_consumed_units'}]¶ Valid coal sulfur content values.
Based on historically reported values in EIA 923 Fuel Receipts and Costs.
-
pudl.validate.bf_eia923_gas_heat_content
= [{'title': 'Natural Gas heat content (middle)', 'query': "fuel_type_code_pudl=='gas'", 'hi_q': 0.5, 'hi_bound': 1.036, 'low_q': 0.5, 'low_bound': 1.018, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Natural Gas heat content (tails)', 'query': "fuel_type_code_pudl=='gas'", 'hi_q': 0.99, 'hi_bound': 1.15, 'low_q': 0.01, 'low_bound': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid natural gas heat content values.
Based on historically reported values in EIA 923 Fuel Receipts and Costs. May fail because of a population of bad data around 0.1 mmbtu/unit. This appears to be an off-by-10x error, possibly due to reporting error in units used.
-
pudl.validate.bf_eia923_oil_heat_content
= [{'title': 'Diesel Fuel Oil heat content (tails)', 'query': "fuel_type_code=='DFO'", 'low_q': 0.05, 'low_bound': 5.5, 'hi_q': 0.95, 'hi_bound': 6.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Diesel Fuel Oil heat content (middle)', 'query': "fuel_type_code=='DFO'", 'low_q': 0.5, 'low_bound': 5.75, 'hi_q': 0.5, 'hi_bound': 5.85, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'All petroleum heat content (tails)', 'query': "fuel_type_code_pudl=='oil'", 'low_q': 0.05, 'low_bound': 5.0, 'hi_q': 0.95, 'hi_bound': 6.6, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid petroleum based fuel heat content values.
Based on historically reported values in EIA 923 Fuel Receipts and Costs.
-
pudl.validate.bf_eia923_self
= [{'title': 'Bituminous coal ash content', 'query': "fuel_type_code=='BIT'", 'low_q': 0.05, 'mid_q': 0.25, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Subbituminous coal ash content', 'query': "fuel_type_code=='SUB'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Lignite coal ash content', 'query': "fuel_type_code=='LIG'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_consumed_units'}, {'title': 'Bituminous coal heat content', 'query': "fuel_type_code=='BIT'", 'low_q': 0.07, 'mid_q': 0.5, 'hi_q': 0.98, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Subbituminous coal heat content', 'query': "fuel_type_code=='SUB'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.9, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Lignite heat content', 'query': "fuel_type_code=='LIG'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Diesel Fuel Oil heat content', 'query': "fuel_type_code=='DFO'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ EIA923 Boiler Fuel data validation against itself.
-
pudl.validate.bounds_histogram(df, data_col, weight_col, query, low_q, hi_q, low_bound, hi_bound, title='')[source]¶
Plot a weighted histogram showing acceptable bounds/actual values.
-
pudl.validate.check_date_freq(df1, df2, mult)[source]¶
Verify an expected relationship between time frequencies of two dataframes.
Identify all distinct values of report_date in each of the input dataframes and check that the number of distinct report_date values in df2 is mult times the number of report_date values in df1, across only those years which appear in both dataframes. This is primarily aimed at comparing annual and monthly dataframes, but should also work with e.g. annual (df1) and quarterly (df2) frequency data using mult=4.
Note that the function assumes a dataframe with sub-annual frequency will cover the entire year it's part of. If you have a partial year of monthly data in one dataframe that overlaps with annual data in another dataframe you'll probably get unexpected behavior.
We use this method rather than attempting to infer a frequency from the observed values because often we have only a single year of data, and you need at least 3 values in a DatetimeIndex to infer the frequency.
- Parameters
df1 (pandas.DataFrame) – A dataframe with a column named report_date which contains dates.
df2 (pandas.DataFrame) – A dataframe with a column named report_date which contains dates.
mult (int) – A multiplicative factor indicating the expected ratio between the number of distinct date values found in df1 and df2. E.g. if df1 is annual and df2 is monthly, mult should be 12.
- Returns
None
- Raises
AssertionError – if the number of distinct report_date values in df2 is not mult times the number of distinct report_date values in df1.
ValueError – if either df1 or df2 does not have a column named report_date.
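Example
A minimal usage sketch, assuming PUDL is installed:
import pandas as pd
import pudl.validate as pv

annual = pd.DataFrame({"report_date": pd.to_datetime(["2019-01-01", "2020-01-01"])})
monthly = pd.DataFrame(
    {"report_date": pd.date_range("2019-01-01", "2020-12-01", freq="MS")}
)

# 24 monthly dates vs. 2 annual dates: passes with mult=12,
# raises AssertionError for any other ratio.
pv.check_date_freq(annual, monthly, mult=12)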
-
pudl.validate.check_max_rows(df, expected_rows=inf, margin=0.05, df_name='')[source]¶
Validate that a dataframe has less than a maximum number of rows.
-
pudl.validate.check_min_rows(df, expected_rows=0, margin=0.05, df_name='')[source]¶
Validate that a dataframe has a certain minimum number of rows.
-
pudl.validate.check_unique_rows(df, subset=None, df_name='')[source]¶
Test whether dataframe has unique records within a subset of columns.
- Parameters
df (pandas.DataFrame) – DataFrame to check for duplicate records.
subset (iterable or None) – Columns to consider in checking for dupes.
df_name (str) – Name of the dataframe, to aid in debugging/logging.
- Returns
The same DataFrame as was passed in, for use in DataFrame.pipe().
- Return type
pandas.DataFrame
- Raises
ValueError – If there are duplicate records in the subset of selected columns.
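Example
Designed for use with DataFrame.pipe():
import pandas as pd
import pudl.validate as pv

df = pd.DataFrame({"plant_id": [1, 1, 2], "report_year": [2019, 2020, 2019]})

# Passes: each (plant_id, report_year) pair appears only once.
df = df.pipe(pv.check_unique_rows, subset=["plant_id", "report_year"], df_name="plants")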
-
pudl.validate.frc_eia923_ag_byproduct_heat_content
= [{'title': 'Agricultural byproduct heat content (tails)', 'query': "energy_source_code=='AB'", 'low_q': 0.05, 'low_bound': 7.0, 'hi_q': 0.95, 'hi_bound': 18.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable agricultural byproduct heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_agg
= [{'title': 'Coal ash content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.2, 'mid_q': 0.7, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Coal chlorine content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': False, 'mid_q': False, 'hi_q': False, 'data_col': 'chlorine_content_ppm', 'weight_col': 'fuel_qty_units'}, {'title': 'Coal fuel costs', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'fuel_qty_units'}, {'title': 'Coal sulfur content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': False, 'mid_q': False, 'hi_q': False, 'data_col': 'sulfur_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Gas heat content', 'query': "fuel_type_code_pudl=='gas'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Gas fuel costs', 'query': "fuel_type_code_pudl=='gas'", 'low_q': False, 'mid_q': 0.5, 'hi_q': False, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'fuel_qty_units'}, {'title': 'Petroleum fuel cost', 'query': "fuel_type_code_pudl=='oil'", 'low_q': False, 'mid_q': 0.5, 'hi_q': False, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'fuel_qty_units'}, {'title': 'Petroleum heat content', 'query': "fuel_type_code_pudl=='oil'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ EIA923 fuel receipts & costs data validation against aggregated data.
-
pudl.validate.frc_eia923_biomass_gas_heat_content
= [{'title': 'Other biomass gas heat content (tails)', 'query': "energy_source_code=='OBG'", 'low_q': 0.05, 'low_bound': 0.36, 'hi_q': 0.95, 'hi_bound': 1.6, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable other biomass gas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_biomass_liquids_heat_content
= [{'title': 'Other biomass liquids heat content (tails)', 'query': "energy_source_code=='OBL'", 'low_q': 0.05, 'low_bound': 3.5, 'hi_q': 0.95, 'hi_bound': 4.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable other biomass liquids heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_biomass_solids_heat_content
= [{'title': 'Other biomass solids heat content (tails)', 'query': "energy_source_code=='OBS'", 'low_q': 0.05, 'low_bound': 8.0, 'hi_q': 0.95, 'hi_bound': 25.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable other biomass solids heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_black_liquor_heat_content
= [{'title': 'Black liquor heat content (tails)', 'query': "energy_source_code=='BLQ'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 14.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable black liquor heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_blast_furnace_gas_heat_content
= [{'title': 'Blast furnace gas heat content (tails)', 'query': "energy_source_code=='BFG'", 'low_q': 0.05, 'low_bound': 0.07, 'hi_q': 0.95, 'hi_bound': 0.12, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable blast furnace gas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_ant_heat_content
= [{'title': 'Anthracite coal heat content (middle)', 'query': "energy_source_code=='ANT'", 'low_q': 0.5, 'low_bound': 20.5, 'hi_q': 0.5, 'hi_bound': 26.5, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Anthracite coal heat content (tails)', 'query': "energy_source_code=='ANT'", 'low_q': 0.05, 'low_bound': 22.0, 'hi_q': 0.95, 'hi_bound': 29.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable anthracite coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_ash_content
= [{'title': 'Bituminous coal ash content (middle)', 'query': "energy_source_code=='BIT'", 'low_q': 0.5, 'low_bound': 6.0, 'hi_q': 0.5, 'hi_bound': 15.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Sub-bituminous coal ash content (middle)', 'query': "energy_source_code=='SUB'", 'low_q': 0.5, 'low_bound': 4.5, 'hi_q': 0.5, 'hi_bound': 7.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite ash content (middle)', 'query': "energy_source_code=='LIG'", 'low_q': 0.5, 'low_bound': 7.0, 'hi_q': 0.5, 'hi_bound': 30.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'All coal ash content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 4.0, 'hi_q': 0.5, 'hi_bound': 20.0, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}]¶ Valid coal ash content (%). Based on historical reporting in EIA 923.
-
pudl.validate.frc_eia923_coal_bit_heat_content
= [{'title': 'Bituminous coal heat content (middle)', 'query': "energy_source_code=='BIT'", 'low_q': 0.5, 'low_bound': 20.5, 'hi_q': 0.5, 'hi_bound': 26.5, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Bituminous coal heat content (tails)', 'query': "energy_source_code=='BIT'", 'low_q': 0.05, 'low_bound': 18.0, 'hi_q': 0.95, 'hi_bound': 29.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable bituminous coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_cc_heat_content
= [{'title': 'Refined coal heat content (tails)', 'query': "energy_source_code=='RC'", 'low_q': 0.05, 'low_bound': 6.5, 'hi_q': 0.95, 'hi_bound': 16.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable refined coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_lig_heat_content
= [{'title': 'Lignite heat content (middle)', 'query': "energy_source_code=='LIG'", 'low_q': 0.5, 'low_bound': 12.0, 'hi_q': 0.5, 'hi_bound': 14.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite heat content (tails)', 'query': "energy_source_code=='LIG'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 15.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable lignite coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_mercury_content
= [{'title': 'Coal mercury content (upper tail)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': False, 'low_bound': False, 'hi_q': 0.95, 'hi_bound': 0.125, 'data_col': 'mercury_content_ppm', 'weight_col': 'fuel_qty_units'}, {'title': 'Coal mercury content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 0.0, 'hi_q': 0.5, 'hi_bound': 0.1, 'data_col': 'mercury_content_ppm', 'weight_col': 'fuel_qty_units'}]¶ Valid coal mercury content limits.
Based on USGS FS095-01: https://pubs.usgs.gov/fs/fs095-01/fs095-01.html Upper tail may fail because of a population of extremely high mercury content coal (9.0ppm) which is likely a reporting error.
-
pudl.validate.frc_eia923_coal_moisture_content
= [{'title': 'Bituminous coal moisture content (middle)', 'query': "energy_source_code=='BIT'", 'low_q': 0.5, 'low_bound': 5.0, 'hi_q': 0.5, 'hi_bound': 16.5, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Sub-bituminous coal moisture content (middle)', 'query': "energy_source_code=='SUB'", 'low_q': 0.5, 'low_bound': 15.0, 'hi_q': 0.5, 'hi_bound': 32.5, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite moisture content (middle)', 'query': "energy_source_code=='LIG'", 'low_q': 0.5, 'low_bound': 25.0, 'hi_q': 0.5, 'hi_bound': 45.0, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'All coal moisture content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 5.0, 'hi_q': 0.5, 'hi_bound': 40.0, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}]¶ Valid coal moisture content, based on historical EIA 923 reporting.
-
pudl.validate.frc_eia923_coal_sub_heat_content
= [{'title': 'Sub-bituminous coal heat content (middle)', 'query': "energy_source_code=='SUB'", 'low_q': 0.5, 'low_bound': 16.5, 'hi_q': 0.5, 'hi_bound': 18.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Sub-bituminous coal heat content (tails)', 'query': "energy_source_code=='SUB'", 'low_q': 0.05, 'low_bound': 15.0, 'hi_q': 0.95, 'hi_bound': 20.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable Sub-bituminous coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_coal_sulfur_content
= [{'title': 'Coal sulfur content (tails)', 'query': "fuel_type_code_pudl=='coal'", 'hi_q': 0.95, 'hi_bound': 4.0, 'low_q': 0.05, 'low_bound': 0.15, 'data_col': 'sulfur_content_pct', 'weight_col': 'fuel_qty_units'}]¶ Valid coal sulfur content values.
Based on historically reported values in EIA 923 Fuel Receipts and Costs.
-
pudl.validate.frc_eia923_coal_wc_heat_content
= [{'title': 'Waste coal heat content (tails)', 'query': "energy_source_code=='WC'", 'low_q': 0.05, 'low_bound': 6.5, 'hi_q': 0.95, 'hi_bound': 16.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable waste coal heat content.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_gas_sgc_heat_content
= [{'title': 'Coal syngas heat content (tails)', 'query': "energy_source_code=='SGC'", 'low_q': 0.05, 'low_bound': 0.2, 'hi_q': 0.95, 'hi_bound': 0.3, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable coal syngas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_landfill_gas_heat_content
= [{'title': 'Landfill gas heat content (tails)', 'query': "energy_source_code=='LFG'", 'low_q': 0.05, 'low_bound': 0.3, 'hi_q': 0.95, 'hi_bound': 0.6, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable landfill gas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_muni_solids_heat_content
= [{'title': 'Municipal solid waste heat content (tails)', 'query': "energy_source_code=='MSW'", 'low_q': 0.05, 'low_bound': 9.0, 'hi_q': 0.95, 'hi_bound': 12.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable municipal solid waste heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_natural_gas_heat_content
= [{'title': 'Natural gas heat content (tails)', 'query': "energy_source_code=='NG'", 'low_q': 0.05, 'low_bound': 0.8, 'hi_q': 0.95, 'hi_bound': 1.2, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable natural gas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_oil_dfo_heat_content
= [{'title': 'Diesel Fuel Oil heat content (tails)', 'query': "energy_source_code=='DFO'", 'low_q': 0.05, 'low_bound': 5.5, 'hi_q': 0.95, 'hi_bound': 6.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Diesel Fuel Oil heat content (middle)', 'query': "energy_source_code=='DFO'", 'low_q': 0.5, 'low_bound': 5.75, 'hi_q': 0.5, 'hi_bound': 5.85, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable diesel fuel oil heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_oil_jf_heat_content
= [{'title': 'Jet fuel heat content (tails)', 'query': "energy_source_code=='JF'", 'low_q': 0.05, 'low_bound': 5.0, 'hi_q': 0.95, 'hi_bound': 6.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable jet fuel heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_oil_ker_heat_content
= [{'title': 'Kerosene heat content (tails)', 'query': "energy_source_code=='KER'", 'low_q': 0.05, 'low_bound': 5.4, 'hi_q': 0.95, 'hi_bound': 6.1, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable kerosene heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_other_gas_heat_content
= [{'title': 'Other gas heat content (tails)', 'query': "energy_source_code=='OG'", 'low_q': 0.05, 'low_bound': 0.07, 'hi_q': 0.95, 'hi_bound': 3.3, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable other gas heat contents.
Based on values given in the EIA 923 instructions, but with the lower bound set by the expected lower bound of heat content for blast furnace gas, since the data contained "other" gases with heat contents lower than the expected 0.32: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_petcoke_heat_content
= [{'title': 'Petroleum coke heat content (tails)', 'query': "energy_source_code=='PC'", 'low_q': 0.05, 'low_bound': 24.0, 'hi_q': 0.95, 'hi_bound': 30.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable petroleum coke heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_petcoke_syngas_heat_content
= [{'title': 'Petcoke syngas heat content (tails)', 'query': "energy_source_code=='SGP'", 'low_q': 0.05, 'low_bound': 0.2, 'hi_q': 0.95, 'hi_bound': 1.1, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable petcoke syngas heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_propane_heat_content
= [{'title': 'Propane heat content (tails)', 'query': "energy_source_code=='PG'", 'low_q': 0.05, 'low_bound': 2.5, 'hi_q': 0.95, 'hi_bound': 2.75, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable propane heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_rfo_heat_content
= [{'title': 'Residual fuel oil heat content (tails)', 'query': "energy_source_code=='RFO'", 'low_q': 0.05, 'low_bound': 5.7, 'hi_q': 0.95, 'hi_bound': 6.9, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable residual fuel oil heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_self
= [{'title': 'Bituminous coal ash content', 'query': "energy_source_code=='BIT'", 'low_q': 0.05, 'mid_q': 0.25, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Subbituminous coal ash content', 'query': "energy_source_code=='SUB'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite coal ash content', 'query': "energy_source_code=='LIG'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'ash_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Bituminous coal heat content', 'query': "energy_source_code=='BIT'", 'low_q': 0.07, 'mid_q': 0.5, 'hi_q': 0.98, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Subbituminous coal heat content', 'query': "energy_source_code=='SUB'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.9, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite heat content', 'query': "energy_source_code=='LIG'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Diesel Fuel Oil heat content', 'query': "energy_source_code=='DFO'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}, {'title': 'Bituminous coal moisture content', 'query': "energy_source_code=='BIT'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Subbituminous coal moisture content', 'query': "energy_source_code=='SUB'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}, {'title': 'Lignite moisture content', 'query': "energy_source_code=='LIG'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 1.0, 'data_col': 'moisture_content_pct', 'weight_col': 'fuel_qty_units'}]¶ EIA923 fuel receipts & costs data validation against itself.
-
pudl.validate.frc_eia923_sludge_heat_content
= [{'title': 'Sludge waste heat content (tails)', 'query': "energy_source_code=='SLW'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 16.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable sludge waste heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_waste_oil_heat_content
= [{'title': 'Waste oil heat content (tails)', 'query': "energy_source_code=='WO'", 'low_q': 0.05, 'low_bound': 3.0, 'hi_q': 0.95, 'hi_bound': 5.9, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable waste oil heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_wood_liquids_heat_content
= [{'title': 'Wood waste liquids heat content (tails)', 'query': "energy_source_code=='WDL'", 'low_q': 0.05, 'low_bound': 8.0, 'hi_q': 0.95, 'hi_bound': 14.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable wood waste liquids heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.frc_eia923_wood_solids_heat_content
= [{'title': 'Wood solids heat content (tails)', 'query': "energy_source_code=='WDS'", 'low_q': 0.05, 'low_bound': 7.0, 'hi_q': 0.95, 'hi_bound': 18.0, 'data_col': 'heat_content_mmbtu_per_unit', 'weight_col': 'fuel_qty_units'}]¶ Check for reasonable wood solids heat contents.
Based on values given in the EIA 923 instructions: https://www.eia.gov/survey/form/eia_923/instructions.pdf
-
pudl.validate.gf_eia923_agg
= [{'title': 'Coal heat content', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Petroleum heat content', 'query': "fuel_type_code_pudl=='oil'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Gas heat content', 'query': "fuel_type_code_pudl=='gas'", 'low_q': 0.1, 'mid_q': 0.5, 'hi_q': 0.95, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ EIA923 Generation Fuel data validation against aggregated data.
-
pudl.validate.gf_eia923_coal_heat_content
= [{'title': 'All coal heat content (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 10.0, 'hi_q': 0.5, 'hi_bound': 30.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid coal heat content values (all coal types).
The Generation Fuel table does not break different coal types out separately, so we can only test the validity of the entire suite of coal records.
-
pudl.validate.gf_eia923_gas_heat_content
= [{'title': 'All gas heat content (middle)', 'query': "fuel_type_code_pudl=='gas'", 'low_q': 0.5, 'low_bound': 0.975, 'hi_q': 0.5, 'hi_bound': 1.075, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'All gas heat content (middle)', 'query': "fuel_type_code_pudl=='gas'", 'low_q': 0.2, 'low_bound': 0.95, 'hi_q': 0.9, 'hi_bound': 1.1, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid natural gas heat content values.
Focuses on natural gas proper. Lower bound excludes other types of gaseous fuels intentionally.
-
pudl.validate.gf_eia923_oil_heat_content
= [{'title': 'Diesel Fuel Oil heat content (tails)', 'query': "fuel_type_code_aer=='DFO'", 'low_q': 0.05, 'low_bound': 5.5, 'hi_q': 0.95, 'hi_bound': 6.0, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'Diesel Fuel Oil heat content (middle)', 'query': "fuel_type_code_aer=='DFO'", 'low_q': 0.5, 'low_bound': 5.75, 'hi_q': 0.5, 'hi_bound': 5.85, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}, {'title': 'All petroleum heat content (tails)', 'query': "fuel_type_code_pudl=='oil'", 'low_q': 0.05, 'low_bound': 5.0, 'hi_q': 0.95, 'hi_bound': 6.6, 'data_col': 'fuel_mmbtu_per_unit', 'weight_col': 'fuel_consumed_units'}]¶ Valid petroleum based fuel heat content values.
Based on historically reported values in EIA 923 Fuel Receipts and Costs.
-
pudl.validate.historical_distribution(df, data_col, weight_col, quantile)[source]¶
Calculate a historical distribution of weighted values of a column.
In order to know what a "reasonable" value of a particular column is in the PUDL data, we can use this function to see what value that column has taken, at a given quantile, in each of the years of data we have on hand. This population of values can then be used to set boundaries on acceptable data distributions in the aggregated and processed data.
- Parameters
df (pandas.DataFrame) – a dataframe containing historical data, with a column named either report_date or report_year.
data_col (str) – Label of the column containing the data of interest.
weight_col (str) – Label of the column containing the weights to be used in scaling the data.
quantile (float) – The quantile (between 0 and 1) at which to evaluate the weighted data in each year.
- Returns
The weighted quantiles of data, for each of the years found in the historical data of df.
- Return type
-
pudl.validate.historical_histogram(orig_df, test_df, data_col, weight_col, query='', low_q=0.05, mid_q=0.5, hi_q=0.95, low_bound=None, hi_bound=None, title='')[source]¶
Weighted histogram comparing distribution with historical subsamples.
-
pudl.validate.intersect_indexes(indexes)[source]¶
Calculate the intersection of a collection of pandas Indexes.
- Parameters
indexes (iterable of pandas.Index objects) – The indexes whose intersection is to be calculated.
- Returns
The intersection of all values found in the input indexes.
- Return type
pandas.Index
-
pudl.validate.mcoe_coal_capacity_factor
= [{'title': 'Coal Capacity Factor (middle)', 'query': "fuel_type_code_pudl=='coal' and capacity_factor!=0.0", 'low_q': 0.6, 'low_bound': 0.5, 'hi_q': 0.6, 'hi_bound': 0.9, 'data_col': 'capacity_factor', 'weight_col': 'capacity_mw'}, {'title': 'Coal Capacity Factor (tails)', 'query': "fuel_type_code_pudl=='coal' and capacity_factor!=0.0", 'low_q': 0.1, 'low_bound': 0.04, 'hi_q': 0.95, 'hi_bound': 0.95, 'data_col': 'capacity_factor', 'weight_col': 'capacity_mw'}]¶ Static constraints on coal fired generator capacity factors.
-
pudl.validate.mcoe_coal_heat_rate
= [{'title': 'Coal Unit Heat Rates (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 10.0, 'hi_q': 0.5, 'hi_bound': 11.0, 'data_col': 'heat_rate_mmbtu_mwh', 'weight_col': 'net_generation_mwh'}, {'title': 'Coal Unit Heat Rates (tails)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'low_bound': 9.0, 'hi_q': 0.95, 'hi_bound': 12.5, 'data_col': 'heat_rate_mmbtu_mwh', 'weight_col': 'net_generation_mwh'}]¶ Static constraints on coal fired generator heat rates.
-
pudl.validate.mcoe_fuel_cost_per_mmbtu
= [{'title': 'Coal Fuel Costs (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 1.5, 'hi_q': 0.5, 'hi_bound': 3.0, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'total_mmbtu'}, {'title': 'Coal Fuel Costs (tails)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'low_bound': 1.2, 'hi_q': 0.95, 'hi_bound': 4.5, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'total_mmbtu'}, {'title': 'Natural Gas Fuel Costs (middle, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.5, 'low_bound': 2.0, 'hi_q': 0.5, 'hi_bound': 4.0, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'total_mmbtu'}, {'title': 'Natural Gas Fuel Costs (tails, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.05, 'low_bound': 1.75, 'hi_q': 0.95, 'hi_bound': 6.7, 'data_col': 'fuel_cost_per_mmbtu', 'weight_col': 'total_mmbtu'}]¶ Static constraints on fuel costs per mmbtu of fuel consumed.
-
pudl.validate.mcoe_fuel_cost_per_mwh
= [{'title': 'Coal Fuel Costs (middle)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.5, 'low_bound': 18.0, 'hi_q': 0.5, 'hi_bound': 27.0, 'data_col': 'fuel_cost_per_mwh', 'weight_col': 'net_generation_mwh'}, {'title': 'Coal Fuel Costs (tails)', 'query': "fuel_type_code_pudl=='coal'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 50.0, 'data_col': 'fuel_cost_per_mwh', 'weight_col': 'net_generation_mwh'}, {'title': 'Natural Gas Fuel Costs (middle, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.5, 'low_bound': 20.0, 'hi_q': 0.5, 'hi_bound': 30.0, 'data_col': 'fuel_cost_per_mwh', 'weight_col': 'net_generation_mwh'}, {'title': 'Natural Gas Fuel Costs (tails, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.05, 'low_bound': 10.0, 'hi_q': 0.95, 'hi_bound': 55.0, 'data_col': 'fuel_cost_per_mwh', 'weight_col': 'net_generation_mwh'}]¶ Static constraints on fuel costs per MWh net generation.
-
pudl.validate.mcoe_gas_capacity_factor
= [{'title': 'Natural Gas Capacity Factor (middle, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01' and capacity_factor!=0.0", 'low_q': 0.65, 'low_bound': 0.4, 'hi_q': 0.65, 'hi_bound': 0.7, 'data_col': 'capacity_factor', 'weight_col': 'capacity_mw'}, {'title': 'Natural Gas Capacity Factor (tails, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01' and capacity_factor!=0.0", 'low_q': 0.15, 'low_bound': 0.01, 'hi_q': 0.95, 'hi_bound': 0.95, 'data_col': 'capacity_factor', 'weight_col': 'capacity_mw'}]¶ Static constraints on natural gas generator capacity factors.
-
pudl.validate.mcoe_gas_heat_rate
= [{'title': 'Natural Gas Unit Heat Rates (middle, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.5, 'low_bound': 7.0, 'hi_q': 0.5, 'hi_bound': 7.5, 'data_col': 'heat_rate_mmbtu_mwh', 'weight_col': 'net_generation_mwh'}, {'title': 'Natural Gas Unit Heat Rates (tails, 2015+)', 'query': "fuel_type_code_pudl=='gas' and report_date>='2015-01-01'", 'low_q': 0.05, 'low_bound': 6.5, 'hi_q': 0.95, 'hi_bound': 13.0, 'data_col': 'heat_rate_mmbtu_mwh', 'weight_col': 'net_generation_mwh'}]¶ Static constraints on gas fired generator heat rates.
-
pudl.validate.no_null_cols(df, cols='all', df_name='')[source]¶
Check that a dataframe has no all-NaN columns.
Occasionally in the concatenation / merging of dataframes we get a label wrong, and it results in a fully NaN column… which should probably never actually happen. This is a quick verification.
- Parameters
df (pandas.DataFrame) – DataFrame to check for null columns.
cols (iterable or "all") – The labels of columns to check for all-null values. If “all” check all columns.
df_name (str) – Name of the dataframe, to aid in debugging/logging.
- Returns
The same DataFrame as was passed in, for use in DataFrame.pipe().
- Return type
pandas.DataFrame
- Raises
ValueError – If any completely NaN / Null valued columns are found.
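Example
A usage sketch, assuming PUDL is installed:
import numpy as np
import pandas as pd
import pudl.validate as pv

df = pd.DataFrame({"plant_id": [1, 2], "capacity_mw": [np.nan, np.nan]})

# Raises ValueError: capacity_mw is entirely null, suggesting a bad merge.
df.pipe(pv.no_null_cols, cols="all", df_name="example")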
-
pudl.validate.no_null_rows(df, cols='all', df_name='', thresh=0.9)[source]¶
Check for rows filled with NA values indicating bad merges.
Sum up the number of NA values in each row within the columns specified by cols. If the NA values make up more than thresh of the columns overall, the row is considered null and the check fails.
- Parameters
df (pandas.DataFrame) – DataFrame to check for null rows.
cols (iterable or "all") – The labels of columns to check for all-null values. If "all" check all columns.
df_name (str) – Name of the dataframe, to aid in debugging/logging.
thresh (float) – The fraction of NA values above which a row is considered null.
- Returns
The input DataFrame, for use with DataFrame.pipe().
- Return type
pandas.DataFrame
- Raises
ValueError – If the fraction of NA values in any row is greater than thresh.
-
pudl.validate.plot_vs_agg(orig_df, agg_df, validation_cases)[source]¶
Validate a bunch of distributions against aggregated versions.
-
pudl.validate.plot_vs_bounds(df, validation_cases)[source]¶
Run through a data validation based on absolute bounds.
-
pudl.validate.plot_vs_self(df, validation_cases)[source]¶
Validate a bunch of distributions against themselves.
-
pudl.validate.vs_bounds(df, data_col, weight_col, query='', title='', low_q=False, low_bound=False, hi_q=False, hi_bound=False)[source]¶
Test a distribution against an upper bound, lower bound, or both.
-
pudl.validate.vs_historical(orig_df, test_df, data_col, weight_col, query='', low_q=0.05, mid_q=0.5, hi_q=0.95, title='')[source]¶
Validate aggregated distributions against original data.
-
pudl.validate.vs_self(df, data_col, weight_col, query='', title='', low_q=0.05, mid_q=0.5, hi_q=0.95)[source]¶
Test a distribution against its own historical range.
This is a special case of the pudl.validate.vs_historical() function, in which both the orig_df and test_df are the same. Mostly it helps ensure that the test itself is valid for the given distribution.
-
pudl.validate.weighted_quantile(data, weights, quantile)[source]¶
Calculate the weighted quantile of a Series or DataFrame column.
This function allows us to take two columns from a pandas.DataFrame: one contains an observed value (data), like heat content per unit of fuel, and the other (weights) contains a quantity, like the quantity of fuel delivered, which should be used to scale the importance of the observed value in the overall distribution. It then calculates the value that the scaled distribution takes at the given quantile.
- Parameters
data (pandas.Series) – A series containing numeric data.
weights (pandas.Series) – Weights to use in scaling the data. Must have the same length as data.
quantile (float) – A number between 0 and 1, representing the quantile at which we want to find the value of the weighted data.
- Returns
the value in the weighted data corresponding to the given quantile. If there are no values in the data, return numpy.nan.
- Return type
float
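Example
A standalone sketch of one common weighted-quantile convention (interpolating on centered cumulative weights); the exact interpolation here may differ in detail from the PUDL implementation:
import numpy as np
import pandas as pd

def weighted_quantile_sketch(data, weights, quantile):
    """Weighted quantile via interpolation on centered cumulative weights."""
    df = (
        pd.DataFrame({"data": data, "weights": weights})
        .dropna()
        .sort_values("data")
    )
    if df.empty:
        return np.nan
    # Place each observation at the midpoint of its own weight "block"
    # within the total weight, then interpolate.
    cum = (df["weights"].cumsum() - 0.5 * df["weights"]) / df["weights"].sum()
    return np.interp(quantile, cum, df["data"])

# Two fuels with heat contents 20 and 25, the second delivered 10x as much:
weighted_quantile_sketch(pd.Series([20.0, 25.0]), pd.Series([1.0, 10.0]), 0.5)
# ~24.5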
Module contents¶
The Public Utility Data Liberation (PUDL) Project.