pudl.transform.ferc1 module

Routines for transforming FERC Form 1 data before loading into the PUDL DB.

This module provides a variety of functions that are used in cleaning up the FERC Form 1 data prior to loading into our database. This includes adopting standardized units and column names, standardizing the formatting of some string values, and correcting data entry errors which we can infer based on the existing data. It may also include removing bad data, or replacing it with the appropriate NA values.

pudl.transform.ferc1.CONSTRUCTION_TYPE_STRINGS = {'conventional': ['conventional', 'conventional', 'conventional boiler', 'conv-b', 'conventionall', 'convention', 'conventional', 'coventional', 'conven full boiler', 'c0nventional', 'conventtional', 'conventialunderground', 'conventional bulb', 'conventrional', '*conventional', 'convential', 'convetional', 'conventioanl', 'conventioinal', 'conventaional', 'indoor construction', 'convenional', 'conventional steam', 'conventinal', 'convntional', 'conventionl', 'conventionsl', 'conventiional', 'convntl steam plants', 'indoor const.', 'full indoor', 'indoor', 'indoor automatic', 'indoor boiler', '(peak load) indoor', 'conventionl,indoor', 'conventionl, indoor', 'conventional, indoor', 'comb. cycle indoor', '3 indoor boiler', '2 indoor boilers', '1 indoor boiler', '2 indoor boiler', '3 indoor boilers', 'fully contained', 'conv - b', 'conventional/boiler', 'cnventional', 'comb. cycle indooor', 'sonventional', 'ind enclosures'], 'outdoor': ['outdoor', 'outdoor boiler', 'full outdoor', 'outdoor boiler', 'outdoor boilers', 'outboilers', 'fuel outdoor', 'full outdoor', 'outdoors', 'outdoor', 'boiler outdoor& full', 'boiler outdoor&full', 'outdoor boiler& full', 'full -outdoor', 'outdoor steam', 'outdoor boiler', 'ob', 'outdoor automatic', 'outdoor repower', 'full outdoor boiler', 'fo', 'outdoor boiler & ful', 'full-outdoor', 'fuel outdoor', 'outoor', 'outdoor', 'outdoor  boiler&full', 'boiler outdoor &full', 'outdoor boiler &full', 'boiler outdoor & ful', 'outdoor-boiler', 'outdoor - boiler', 'outdoor const.', '4 outdoor boilers', '3 outdoor boilers', 'full outdoor', 'full outdoors', 'full oudoors', 'outdoor (auto oper)', 'outside boiler', 'outdoor boiler&full', 'outdoor hrsg', 'outdoor hrsg', 'outdoor-steel encl.', 'boiler-outdr & full', 'con.& full outdoor', 'partial outdoor', 'outdoor (auto. oper)', 'outdoor (auto.oper)', 'outdoor construction', '1 outdoor boiler', '2 outdoor boilers', 'outdoor enclosure', '2 outoor boilers', 'boiler outdr.& full', 'boiler outdr. & full', 'ful outdoor', 'outdoor-steel enclos', 'outdoor (auto oper.)', 'con. & full outdoor', 'outdore', 'boiler & full outdor', 'full & outdr boilers', 'outodoor (auto oper)', 'outdoor steel encl.', 'full outoor', 'boiler & outdoor ful', 'otdr. blr. & f. otdr', 'f.otdr & otdr.blr.', 'oudoor (auto oper)', 'outdoor constructin', 'f. otdr. & otdr. blr', 'outdoor boiler & fue'], 'semioutdoor': ['more than 50% outdoo', 'more than 50% outdos', 'over 50% outdoor', 'over 50% outdoors', 'semi-outdoor', 'semi - outdoor', 'semi outdoor', 'semi-enclosed', 'semi-outdoor boiler', 'semi outdoor boiler', 'semi- outdoor', 'semi - outdoors', 'semi -outdoorconven & semi-outdr', 'conv & semi-outdoor', 'conv & semi- outdoor', 'convent. semi-outdr', 'conv. semi outdoor', 'conv(u1)/semiod(u2)', 'conv u1/semi-od u2', 'conv-one blr-semi-od', 'convent semioutdoor', 'conv. u1/semi-od u2', 'conv - 1 blr semi od', 'conv. ui/semi-od u2', 'conv-1 blr semi-od', 'conven. semi-outdoor', 'conv semi-outdoor', 'u1-conv./u2-semi-od', 'u1-conv./u2-semi -od', 'convent. semi-outdoo', 'u1-conv. / u2-semi', 'conven & semi-outdr', 'semi -outdoor', 'outdr & conventnl', 'conven. full outdoor', 'conv. & outdoor blr', 'conv. & outdoor blr.', 'conv. & outdoor boil', 'conv. & outdr boiler', 'conv. & out. boiler', 'convntl,outdoor blr', 'outdoor & conv.', '2 conv., 1 out. boil', 'outdoor/conventional', 'conv. boiler outdoor', 'conv-one boiler-outd', 'conventional outdoor', 'conventional outdor', 'conv. outdoor boiler', 'conv.outdoor boiler', 'conventional outdr.', 'conven,outdoorboiler', 'conven full outdoor', 'conven,full outdoor', '1 out boil, 2 conv', 'conv. & full outdoor', 'conv. & outdr. boilr', 'conv outdoor boiler', 'convention. outdoor', 'conv. sem. outdoor', 'convntl, outdoor blr', 'conv & outdoor boil', 'conv & outdoor boil.', 'outdoor & conv', 'conv. broiler outdor', '1 out boilr, 2 conv', 'conv.& outdoor boil.', 'conven,outdr.boiler', 'conven,outdr boiler', 'outdoor & conventil', '1 out boilr 2 conv', 'conv & outdr. boilr', 'conven, full outdoor', 'conven full outdr.', 'conven, full outdr.', 'conv/outdoor boiler', "convnt'l outdr boilr", '1 out boil 2 conv', 'conv full outdoor', 'conven, outdr boiler', 'conventional/outdoor', 'conv&outdoor boiler', 'outdoor & convention', 'conv & outdoor boilr', 'conv & full outdoor', 'convntl. outdoor blr', 'conv - ob', "1conv'l/2odboilers", "2conv'l/1odboiler", 'conv-ob', 'conv.-ob', '1 conv/ 2odboilers', '2 conv /1 odboilers', 'conv- ob', 'conv -ob', 'con sem outdoor', 'cnvntl, outdr, boilr', 'less than 50% outdoo', 'under 50% outdoor', 'under 50% outdoors', '1cnvntnl/2odboilers', '2cnvntnl1/1odboiler', 'con & ob', 'combination (b)', 'indoor & outdoor', 'conven. blr. & full', 'conv. & otdr. blr.', 'combination', 'indoor and outdoor', 'conven boiler & full', "2conv'l/10dboiler", '4 indor/outdr boiler', '4 indr/outdr boilerr', '4 indr/outdr boiler', 'indoor & outdoof'], 'unknown': ['', 'automatic operation', 'comb. turb. installn', 'comb. turb. instaln', 'com. turb. installn', 'n/a', 'for detailed info.', 'for detailed info', 'combined cycle', 'na', 'not applicable', 'gas', 'heated individually', 'metal enclosure', 'pressurized water', 'nuclear', 'jet engine', 'gas turbine', 'storage/pipelines', '0', 'during 1994', 'peaking - automatic', 'gas turbine/int. cm', '2 oil/gas turbines', 'wind', 'package', 'mobile', 'auto-operated', 'steam plants', 'other production', 'all nuclear plants', 'other power gen.', 'automatically operad', 'automatically operd', 'circ fluidized bed', 'jet turbine', 'gas turbne/int comb', 'automatically oper.', 'retired 1/1/95', 'during 1995', '1996. plant sold', 'reactivated 7/1/96', 'gas turbine/int comb', 'portable', 'head individually', 'automatic opertion', 'peaking-automatic', 'cycle', 'full order', 'circ. fluidized bed', 'gas turbine/intcomb', '0.0000', 'none', '2 oil / gas', 'block & steel', 'and 2000', 'comb.turb. instaln', 'automatic oper.', 'pakage', '---', 'n/a (ct)', 'comb turb instain', 'ind encloures', '2 oil /gas turbines', 'combustion turbine', '1970', 'gas/oil turbines', 'combined cycle steam', 'pwr', '2 oil/ gas', '2 oil / gas turbines', 'gas / oil turbines', 'no boiler', 'internal combustion', 'gasturbine no boiler', 'boiler', 'tower -10 unit facy', 'gas trubine', '4 gas/oil trubines', '2 oil/ 4 gas/oil tur', '5 gas/oil turbines', 'tower 16', '2 on 1 gas turbine', 'tower 23', 'tower -10 unit', 'tower - 101 unit', '3 on 1 gas turbine', 'tower - 10 units', 'tower - 165 units', 'wind turbine', 'fixed tilt pv', 'tracking pv', 'o', 'wind trubine', 'subcritical', 'sucritical', 'simple cycle', 'simple & reciprocat']}

A dictionary of construction types (keys) and lists of construction type strings associated with each type (values) from FERC Form 1.

There are many strings that weren’t categorized, including crosses between conventional and outdoor, PV, wind, combined cycle, and internal combustion. The lists are broken out into the two types specified in Form 1: conventional and outdoor. These lists are inclusive so that variants of conventional (e.g. “conventional full”) and outdoor (e.g. “outdoor full” and “outdoor hrsg”) are included.

Type

dict

class pudl.transform.ferc1.FERCPlantClassifier(min_sim=0.75, plants_df=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

A classifier for identifying FERC plant time series in FERC Form 1 data.

We want to be able to give the classifier a FERC plant record, and get back the group of records(or the ID of the group of records) that it ought to be part of.

There are hundreds of different groups of records, and we can only know what they are by looking at the whole dataset ahead of time. This is the “fitting” step, in which the groups of records resulting from a particular set of model parameters(e.g. the weights that are attributes of the class) are generated.

Once we have that set of record categories, we can test how well the classifier performs, by checking it against test / training data which we have already classified by hand. The test / training set is a list of lists of unique FERC plant record IDs(each record ID is the concatenation of: report year, respondent id, supplement number, and row number). It could also be stored as a dataframe where each column is associated with a year of data(some of which could be empty). Not sure what the best structure would be.

If it’s useful, we can assign each group a unique ID that is the time ordered concatenation of each of the constituent record IDs. Need to understand what the process for checking the classification of an input record looks like.

To score a given classifier, we can look at what proportion of the records in the test dataset are assigned to the same group as in our manual classification of those records. There are much more complicated ways to do the scoring too… but for now let’s just keep it as simple as possible.

fit(X, y=None)[source]

Use weighted FERC plant features to group records into time series.

The fit method takes the vectorized, normalized, weighted FERC plant features (X) as input, calculates the pairwise cosine similarity matrix between all records, and groups the records in their best time series. The similarity matrix and best time series are stored as data members in the object for later use in scoring & predicting.

This isn’t quite the way a fit method would normally work.

Parameters
  • () (y) – a sparse matrix of size n_samples x n_features.

  • ()

Returns

Return type

pandas.DataFrame

Todo

Zane revisit args and returns

predict(X, y=None)[source]

Identify time series of similar records to input record_ids.

Given a one-dimensional dataframe X, containing FERC record IDs, return a dataframe in which each row corresponds to one of the input record_id values (ordered as the input was ordered), with each column corresponding to one of the years worth of data. Values in the returned dataframe are the FERC record_ids of the record most similar to the input record within that year. Some of them may be null, if there was no sufficiently good match.

Row index is the seed record IDs. Column index is years.

TODO: * This method is hideously inefficient. It should be vectorized. * There’s a line that throws a FutureWarning that needs to be fixed.

score(X, y=None)[source]

Scores a collection of FERC plant categorizations.

For every record ID in X, predict its record group and calculate a metric of similarity between the prediction and the “ground truth” group that was passed in for that value of X.

Parameters
  • X (pandas.DataFrame) – an n_samples x 1 pandas dataframe of FERC Form 1 record IDs.

  • y (pandas.DataFrame) – a dataframe of “ground truth” FERC Form 1 record groups, corresponding to the list record IDs in X

Returns

The average of all the similarity metrics as the score.

Return type

numpy.ndarray

transform(X, y=None)[source]

Passthrough transform method – just returns self.

pudl.transform.ferc1.FUEL_STRINGS = {'coal': ['coal', 'coal-subbit', 'lignite', 'coal(sb)', 'coal (sb)', 'coal-lignite', 'coke', 'coa', 'lignite/coal', 'coal - subbit', 'coal-subb', 'coal-sub', 'coal-lig', 'coal-sub bit', 'coals', 'ciak', 'petcoke', 'coal.oil', 'coal/gas', 'bit coal', 'coal-unit #3', 'coal-subbitum', 'coal tons', 'coal mcf', 'coal unit #3', 'pet. coke', 'coal-u3', 'coal&coke', 'tons'], 'gas': ['gas', 'gass', 'methane', 'natural gas', 'blast gas', 'gas mcf', 'propane', 'prop', 'natural  gas', 'nat.gas', 'nat gas', 'nat. gas', 'natl gas', 'ga', 'gas`', 'syngas', 'ng', 'mcf', 'blast gaa', 'nat  gas', 'gac', 'syngass', 'prop.', 'natural', 'coal.gas', 'n. gas', 'lp gas', 'natuaral gas', 'coke gas', 'gas #2016', 'propane**', '* propane', 'propane **', 'gas expander', 'gas ct', '# 6 gas', '#6 gas', 'coke oven gas'], 'hydro': [], 'nuclear': ['nuclear', 'grams of uran', 'grams of', 'grams of  ura', 'grams', 'nucleur', 'nulear', 'nucl', 'nucleart', 'nucelar', 'gr.uranium', 'grams of urm', 'nuclear (9)', 'nulcear', 'nuc', 'gr. uranium', 'nuclear mw da', 'grams of ura'], 'oil': ['oil', '#6 oil', '#2 oil', 'fuel oil', 'jet', 'no. 2 oil', 'no.2 oil', 'no.6& used', 'used oil', 'oil-2', 'oil (#2)', 'diesel oil', 'residual oil', '# 2 oil', 'resid. oil', 'tall oil', 'oil/gas', 'no.6 oil', 'oil-fuel', 'oil-diesel', 'oil / gas', 'oil bbls', 'oil bls', 'no. 6 oil', '#1 kerosene', 'diesel', 'no. 2 oils', 'blend oil', '#2oil diesel', '#2 oil-diesel', '# 2  oil', 'light oil', 'heavy oil', 'gas.oil', '#2', '2', '6', 'bbl', 'no 2 oil', 'no 6 oil', '#1 oil', '#6', 'oil-kero', 'oil bbl', 'biofuel', 'no 2', 'kero', '#1 fuel oil', 'no. 2  oil', 'blended oil', 'no 2. oil', '# 6 oil', 'nno. 2 oil', '#2 fuel', 'oill', 'oils', 'gas/oil', 'no.2 oil gas', '#2 fuel oil', 'oli', 'oil (#6)', 'oil/diesel', '2 oil', '#6 hvy oil', 'jet fuel', 'diesel/compos', 'oil-8', 'oil {6}', 'oil-unit #1', 'bbl.', 'oil.', 'oil #6', 'oil (6)', 'oil(#2)', 'oil-unit1&2', 'oil-6', '#2 fue oil', 'dielel oil', 'dielsel oil', '#6 & used', 'barrels', 'oil un 1 & 2', 'jet oil', 'oil-u1&2', 'oiul', 'pil', 'oil - 2', '#6 & used', 'oial'], 'solar': [], 'unknown': ['steam', 'purch steam', 'all', 'tdf', 'n/a', 'purch. steam', 'other', 'composite', 'composit', 'mbtus', 'total', 'avg', 'avg.', 'blo', 'all fuel', 'comb.', 'alt. fuels', 'na', 'comb', '/#=2\x80â\x91?', 'kã\xadgv¸\x9d?', "mbtu's", 'gas, oil', 'rrm', '3\x9c', 'average', 'furfural', '0', 'watson bng', 'toal', 'bng', '# 6 & used', 'combined', 'blo bls', 'compsite', '*', 'compos.', 'gas / oil', 'mw days', 'g', 'c', 'lime', 'all fuels', 'at right', '20', '1', 'comp oil/gas', 'all fuels to', 'the right are', 'c omposite', 'all fuels are', 'total pr crk', 'all fuels =', 'total pc', 'comp', 'alternative', 'alt. fuel', 'bio fuel', 'total prairie', ''], 'waste': ['tires', 'tire', 'refuse', 'switchgrass', 'wood waste', 'woodchips', 'biomass', 'wood', 'wood chips', 'rdf', 'tires/refuse', 'tire refuse', 'waste oil', 'waste', 'woodships', 'tire chips'], 'wind': []}

A mapping a canonical fuel name to a list of strings which are used to represent that fuel in the FERC Form 1 Reporting. Case is ignored, as all fuel strings are converted to a lower case in the data set.

Type

dict

pudl.transform.ferc1.FUEL_UNIT_STRINGS = {'bbl': ['barrel', 'bbls', 'bbl', 'barrels', 'bbrl', 'bbl.', 'bbls.', 'oil 42 gal', 'oil-barrels', 'barrrels', 'bbl-42 gal', 'oil-barrel', 'bb.', 'barrells', 'bar', 'bbld', 'oil- barrel', 'barrels    .', 'bbl .', 'barels', 'barrell', 'berrels', 'bb', 'bbl.s', 'oil-bbl', 'bls', 'bbl:', 'barrles', 'blb', 'propane-bbl', 'barriel', 'berriel', 'barrile', '(bbl.)', 'barrel *(4)', '(4) barrel', 'bbf', 'blb.', '(bbl)', 'bb1', 'bbsl', 'barrrel', 'barrels 100%', 'bsrrels', "bbl's", '*barrels', 'oil - barrels', 'oil 42 gal ba', 'bll', 'boiler barrel', 'gas barrel', '"boiler" barr', '"gas" barrel', '"boiler"barre', '"boiler barre', 'barrels .', 'bariel', 'brrels', 'oil barrel'], 'btu': ['btus', 'btu'], 'gal': ['gallons', 'gal.', 'gals', 'gals.', 'gallon', 'gal', 'galllons'], 'gramsU': ['gram', 'grams', 'gm u', 'grams u235', 'grams u-235', 'grams of uran', 'grams: u-235', 'grams:u-235', 'grams:u235', 'grams u308', 'grams: u235', 'grams of', 'grams - n/a', 'gms uran', 's e uo2 grams', 'gms uranium', 'grams of urm', 'gms. of uran', 'grams (100%)', 'grams v-235', 'se uo2 grams'], 'kgU': ['kg of uranium', 'kg uranium', 'kilg. u-235', 'kg u-235', 'kilograms-u23', 'kg', 'kilograms u-2', 'kilograms', 'kg of', 'kg-u-235', 'kilgrams', 'kilogr. u235', 'uranium kg', 'kg uranium25', 'kilogr. u-235', 'kg uranium 25', 'kilgr. u-235', 'kguranium 25', 'kg-u235', 'kgm'], 'kgal': ['oil(1000 gal)', 'oil(1000)', 'oil (1000)', 'oil(1000', 'oil(1000ga)'], 'klbs': ['k lbs.', 'k lbs'], 'mcf': ['mcf', "mcf's", 'mcfs', 'mcf.', 'gas mcf', '"gas" mcf', 'gas-mcf', 'mfc', 'mct', ' mcf', 'msfs', 'mlf', 'mscf', 'mci', 'mcl', 'mcg', 'm.cu.ft.', 'kcf', '(mcf)', 'mcf *(4)', 'mcf00', 'm.cu.ft..'], 'mmbtu': ['mmbtu', 'mmbtus', 'mbtus', '(mmbtu)', "mmbtu's", 'nuclear-mmbtu', 'nuclear-mmbt', 'mmbtul'], 'mwdth': ['mwd therman', 'mw days-therm', 'mwd thrml', 'mwd thermal', 'mwd/mtu', 'mw days', 'mwdth', 'mwd', 'mw day', 'dth', 'mwdaysthermal', 'mw day therml', 'mw days thrml', 'nuclear mwd', 'mmwd', 'mw day/thermlmw days/therm', 'mw days (th', 'ermal)'], 'mwhth': ['mwh them', 'mwh threm', 'nwh therm', 'mwhth', 'mwh therm', 'mwh', 'mwh therms.', 'mwh term.uts', 'mwh thermal', 'mwh thermals', 'mw hr therm', 'mwh therma', 'mwh therm.uts'], 'ton': ['toms', 'taons', 'tones', 'col-tons', 'toncoaleq', 'coal', 'tons coal eq', 'coal-tons', 'ton', 'tons', 'tons coal', 'coal-ton', 'tires-tons', 'coal tons -2 ', 'oil-tons', 'coal tons 200', 'ton-2000', 'coal tons', 'coal tons -2', 'coal-tone', 'tire-ton', 'tire-tons', 'ton coal eqv', 'tos', 'coal tons - 2', 'c. t.', 'c.t.', 'toncoalequiv'], 'unknown': ['', '1265', 'mwh units', 'composite', 'therms', 'n/a', 'mbtu/kg', 'uranium 235', 'oil', 'ccf', '2261', 'uo2', '(7)', 'oil #2', 'oil #6', '\x99å\x83\x90?"', 'dekatherm', '0', 'mw day/therml', 'nuclear', 'gas', '62,679', 'mw days/therm', 'na', 'uranium', 'oil/gas', 'thermal', '(thermal)', 'se uo2', '181679', '83', '3070', '248', '273976', '747', '-', 'are total', 'pr. creek', 'decatherms', 'uramium', '.', 'total pr crk', '>>>>>>>>', 'all', 'total', 'alternative-t', 'oil-mcf', '3303671', '929', '7182175', '319', '1490442', '10881', '1363663', '7171', '1726497', '4783', '7800', '12559', '2398', 'creek fuels', 'propane-barre', '509', 'barrels/mcf', 'propane-bar', '4853325', '4069628', '1431536', '708903', 'mcf/oil (1000']}

A dictionary linking fuel units (keys) to lists of various strings representing those fuel units (values)

Type

dict

pudl.transform.ferc1.PLANT_KIND_STRINGS = {'combined_cycle': ['Combined cycle', 'combined cycle', 'combined', 'gas & steam turbine', 'gas turb. & heat rec', 'combined cycle', 'com. cyc', 'com. cycle', 'gas turb-combined cy', 'combined cycle ctg', 'combined cycle - 40%', 'com cycle gas turb', 'combined cycle oper', 'gas turb/comb. cyc', 'combine cycle', 'cc', 'comb. cycle', 'gas turb-combined cy', 'steam and cc', 'steam cc', 'gas steam', 'ctg steam gas', 'steam comb cycle', 'gas/steam comb. cycl', 'steam (comb. cycle)gas turbine/steam', 'steam & gas turbine', 'gas trb & heat rec', 'steam & combined ce', 'st/gas turb comb cyc', 'gas tur & comb cycl', 'combined cycle (a,b)', 'gas turbine/ steam', 'steam/gas turb.', 'steam & comb cycle', 'gas/steam comb cycle', 'comb cycle (a,b)', 'igcc', 'steam/gas turbine', 'gas turbine / steam', 'gas tur & comb cyc', 'comb cyc (a) (b)', 'comb cycle', 'comb cyc', 'combined turbine', 'combine cycle oper', 'comb cycle/steam tur', 'cc / gas turb', 'steam (comb. cycle)', 'steam & cc', 'gas turbine/steam', 'gas turb/cumbus cycl', 'gas turb/comb cycle', 'gasturb/comb cycle', 'gas turb/cumb. cyc', 'igcc/gas turbine', 'gas / steam', 'ctg/steam-gas', 'ctg/steam -gas', 'gas fired cc turbine', 'combinedcycle', 'comb cycle gas turb', 'combined cycle opern', 'comb. cycle gas turb'], 'combustion_turbine': ['combustion turbine', 'gt', 'gas turbine', 'gas turbine # 1', 'gas turbine', 'gas turbine (note 1)', 'gas turbines', 'simple cycle', 'combustion turbine', 'comb.turb.peak.units', 'gas turbine', 'combustion turbine', 'com turbine peaking', 'gas turbine peaking', 'comb turb peaking', 'combustine turbine', 'comb. turine', 'conbustion turbine', 'combustine turbine', 'gas turbine (leased)', 'combustion tubine', 'gas turb', 'gas turbine peaker', 'gtg/gas', 'simple cycle turbine', 'gas-turbine', 'gas turbine-simple', 'gas turbine - note 1', 'gas turbine #1', 'simple cycle', 'gasturbine', 'combustionturbine', 'gas turbine (2)', 'comb turb peak units', 'jet engine', 'jet powered turbine', '*gas turbine', 'gas turb.(see note5)', 'gas turb. (see note', 'combutsion turbine', 'combustion turbin', 'gas turbine-unit 2', 'gas - turbine', 'comb turbine peaking', 'gas expander turbine', 'jet turbine', 'gas turbin (lease', 'gas turbine (leased', 'gas turbine/int. cm', 'comb.turb-gas oper.', 'comb.turb.gas/oil op', 'comb.turb.oil oper.', 'jet', 'comb. turbine (a)', 'gas turb.(see notes)', 'gas turb(see notes)', 'comb. turb-gas oper', 'comb.turb.oil oper', 'gas turbin (leasd)', 'gas turbne/int comb', 'gas turbine (note1)', 'combution turbin', '* gas turbine', 'add to gas turbine', 'gas turbine (a)', 'gas turbinint comb', 'gas turbine (note 3)', 'resp share gas note3', 'gas trubine', '*gas turbine(note3)', 'gas turbine note 3,6', 'gas turbine note 4,6', 'gas turbine peakload', 'combusition turbine', 'gas turbine (lease)', 'comb. turb-gas oper.', 'combution turbine', 'combusion turbine', 'comb. turb. oil oper', 'combustion burbine', 'combustion and gas', 'comb. turb.', 'gas turbine (lease', 'gas turbine (leasd)', 'gas turbine/int comb', '*gas turbine(note 3)', 'gas turbine (see nos', 'i.c.e./gas turbine', 'gas turbine/intcomb', 'cumbustion turbine', 'gas turb, int. comb.', 'gas turb, diesel', 'gas turb, int. comb', 'i.c.e/gas turbine', 'diesel turbine', 'comubstion turbine', 'i.c.e. /gas turbine', 'i.c.e/ gas turbine', 'i.c.e./gas tubine'], 'geothermal': ['steam - geothermal', 'steam_geothermal', 'geothermal'], 'internal_combustion': ['ic', 'internal combustion', 'internal comb.', 'internl combustiondiesel turbine', 'int combust (note 1)', 'int. combust (note1)', 'int.combustine', 'comb. cyc', 'internal comb', 'diesel', 'diesel engine', 'internal combustion', 'int combust - note 1', 'int. combust - note1', 'internal comb recip', 'reciprocating engine', 'comb. turbine', 'internal combust.', 'int. combustion (1)', '*int combustion (1)', "*internal combust'n", 'internal', 'internal comb.', 'steam internal comb', 'combustion', 'int. combustion', 'int combust (note1)', 'int. combustine', 'internl combustion', '*int. combustion (1)'], 'nuclear': ['nuclear', 'nuclear (3)', 'steam(nuclear)', 'nuclear(see note4)nuclear steam', 'nuclear turbine', 'nuclear - steam', 'nuclear (a)(b)(c)', 'nuclear (b)(c)', '* nuclear', 'nuclear (b) (c)', 'nuclear (see notes)', 'steam (nuclear)', '* nuclear (note 2)', 'nuclear (note 2)', 'nuclear (see note 2)', 'nuclear(see note4)', 'nuclear steam', 'nuclear(see notes)', 'nuclear-steam', 'nuclear (see note 3)'], 'photovoltaic': ['solar photovoltaic', 'photovoltaic', 'solar', 'solar project'], 'solar_thermal': ['solar thermal'], 'steam': ['coal', 'steam', 'steam units 1 2 3', 'steam units 4 5', 'steam fossil', 'steam turbine', 'steam a', 'steam 100', 'steam units 1 2 3', 'steams', 'steam 1', 'steam retired 2013', 'stream', 'steam units 1,2,3', 'steam units 4&5', 'steam units 4&6', 'steam conventional', 'unit total-steam', 'unit total steam', '*resp. share steam', 'resp. share steam', 'steam (see note 1,', 'steam (see note 3)', 'mpc 50%share steam', '40% share steamsteam (2)', 'steam (3)', 'steam (4)', 'steam (5)', 'steam (6)', 'steam (7)', 'steam (8)', 'steam units 1 and 2', 'steam units 3 and 4', 'steam (note 1)', 'steam (retired)', 'steam (leased)', 'coal-fired steam', 'oil-fired steam', 'steam/fossil', 'steam (a,b)', 'steam (a)', 'stean', 'steam-internal comb', 'steam (see notes)', 'steam units 4 & 6', 'resp share stm note3', 'mpc50% share steam', 'mpc40%share steam', 'steam - 64%', 'steam - 100%', 'steam (1) & (2)', 'resp share st note3', 'mpc 50% shares steam', 'steam-64%', 'steam-100%', 'steam (see note 1)', 'mpc 50% share steam', 'steam units 1, 2, 3', 'steam units 4, 5', 'steam (2)', 'steam (1)', 'steam 4, 5', 'steam - 72%', 'steam (incl i.c.)', 'steam- 72%', 'steam;retired - 2013', "respondent's sh.-st.", "respondent's sh-st", '40% share steam', 'resp share stm note3', 'mpc50% share steam', 'resp share st note 3', '\x02steam (1)'], 'unknown': ['', 'n/a', 'see pgs 402.1-402.3', 'see pgs 403.1-403.9', "respondent's share", '--', '(see note 7)', 'other', 'not applicable', 'peach bottom', 'none.', 'fuel facilities', '0', 'not in service', 'none', 'common expenses', 'expenses common to', 'retired in 1981', 'retired in 1978', 'na', 'unit total (note3)', 'unit total (note2)', 'resp. share (note2)', 'resp. share (note8)', 'resp. share (note 9)', 'resp. share (note11)', 'resp. share (note4)', 'resp. share (note6)', 'conventional', 'expenses commom to', 'not in service in', 'unit total (note 3)', 'unit total (note 2)', 'resp. share (note 8)', 'resp. share (note 3)', 'resp. share note 11', 'resp. share (note 4)', 'resp. share (note 6)', '(see note 5)', 'resp. share (note 2)', 'package', '(left blank)', 'common', '0.0000', 'other generation', 'resp share (note 11)', 'retired', 'storage/pipelines', 'sold april 16, 1999', 'sold may 07, 1999', 'plants sold in 1999', 'gas', 'not applicable.', 'resp. share - note 2', 'resp. share - note 8', 'resp. share - note 9', 'resp share - note 11', 'resp. share - note 4', 'resp. share - note 6', 'plant retired- 2013', 'retired - 2013', 'resp share - note 5', 'resp. share - note 7', 'non-applicable', 'other generation plt', 'combined heat/power', 'oil'], 'wind': ['wind', 'wind energy', 'wind turbine', 'wind - turbine', 'wind generation']}

A mapping from canonical plant kinds (keys) to the associated freeform strings (values) identified as being associated with that kind of plant in the FERC Form 1 raw data. There are many strings that weren’t categorized, Solar and Solar Project were not classified as these do not indicate if they are solar thermal or photovoltaic. Variants on Steam (e.g. “steam 72” and “steam and gas”) were classified based on additional research of the plants on the Internet.

Type

dict

pudl.transform.ferc1.accumulated_depreciation(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 depreciation data for loading into PUDL.

This information is organized by FERC account, with each line of the FERC Form 1 having a different descriptive identifier like ‘balance_end_of_year’ or ‘transmission’.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.cols_to_cats(df, cat_name, col_cats)[source]

Turn top-level MultiIndex columns into a categorial column.

In some cases FERC Form 1 data comes with many different types of related values interleaved in the same table – e.g. current year and previous year income – this can result in DataFrames that are hundreds of columns wide, which is unwieldy. This function takes those top level MultiIndex labels and turns them into categories in a single column, which can be used to select a particular type of report.

Parameters
  • df (pandas.DataFrame) – the dataframe to be simplified.

  • cat_name (str) – the label of the column to be created indicating what MultiIndex label the values came from.

  • col_cats (dict) – a dictionary with top level MultiIndex labels as keys, and the category to which they should be mapped as values.

Returns

A re-shaped/re-labeled dataframe with one fewer levels of MultiIndex in the columns, and an additional column containing the assigned labels.

Return type

pandas.DataFrame

pudl.transform.ferc1.fuel(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 fuel data for loading into PUDL Database.

This process includes converting some columns to be in terms of our preferred units, like MWh and mmbtu instead of kWh and btu. Plant names are also standardized (stripped & lower). Fuel and fuel unit strings are also standardized using our cleanstrings() function and string cleaning dictionaries found above (FUEL_STRINGS, etc.)

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.fuel_by_plant_ferc1(fuel_df, thresh=0.5)[source]

Calculates useful FERC Form 1 fuel metrics on a per plant-year basis.

Each record in the FERC Form 1 corresponds to a particular type of fuel. Many plants – especially coal plants – use more than one fuel, with gas and/or diesel serving as startup fuels. In order to be able to classify the type of plant based on relative proportions of fuel consumed or fuel costs it is useful to aggregate these per-fuel records into a single record for each plant.

Fuel cost (in nominal dollars) and fuel heat content (in mmBTU) are calculated for each fuel based on the cost and heat content per unit, and the number of units consumed, and then summed by fuel type (there can be more than one record for a given type of fuel in each plant because we are simplifying the fuel categories). The per-fuel records are then pivoted to create one column per fuel type. The total is summed and stored separately, and the individual fuel costs & heat contents are divided by that total, to yield fuel proportions. Based on those proportions and a minimum threshold that’s passed in, a “primary” fuel type is then assigned to the plant-year record and given a string label.

Parameters
  • fuel_df (pandas.DataFrame) – Pandas DataFrame resembling the post-transform result for the fuel_ferc1 table.

  • thresh (float) – A value between 0.5 and 1.0 indicating the minimum fraction of overall heat content that must have been provided by a fuel in a plant-year for it to be considered the “primary” fuel for the plant in that year. Default value: 0.5.

Returns

A DataFrame with a single record for each plant-year, including the columns required to merge it with the plants_steam_ferc1 table/DataFrame (report_year, utility_id_ferc1, and plant_name) as well as totals for fuel mmbtu consumed in that plant-year, and the cost of fuel in that year, the proportions of heat content and fuel costs for each fuel in that year, and a column that labels the plant’s primary fuel for that year.

Return type

pandas.DataFrame

Raises

AssertionError – If the DataFrame input does not have the columns required to run the function.

pudl.transform.ferc1.make_ferc1_clf(plants_df, ngram_min=2, ngram_max=10, min_sim=0.75, plant_name_ferc1_wt=2.0, plant_type_wt=2.0, construction_type_wt=1.0, capacity_mw_wt=1.0, construction_year_wt=1.0, utility_id_ferc1_wt=1.0, fuel_fraction_wt=1.0)[source]

Create a FERC Plant Classifier using several weighted features.

Given a FERC steam plants dataframe plants_df, which also includes fuel consumption information, transform a selection of useful columns into features suitable for use in calculating inter-record cosine similarities. Individual features are weighted according to the keyword arguments.

Features include:

  • plant_name (via TF-IDF, with ngram_min and ngram_max as parameters)

  • plant_type (OneHot encoded categorical feature)

  • construction_type (OneHot encoded categorical feature)

  • capacity_mw (MinMax scaled numerical feature)

  • construction year (OneHot encoded categorical feature)

  • utility_id_ferc1 (OneHot encoded categorical feature)

  • fuel_fraction_mmbtu (several MinMax scaled numerical columns, which are normalized and treated as a single feature.)

This feature matrix is then used to instantiate a FERCPlantClassifier.

The combination of the ColumnTransformer and FERCPlantClassifier are combined in a sklearn Pipeline, which is returned by the function.

Parameters
  • ngram_min (int) – the minimum lengths to consider in the vectorization of the plant_name feature.

  • ngram_max (int) – the maximum n-gram lengths to consider in the vectorization of the plant_name feature.

  • min_sim (float) – the minimum cosine similarity between two records that can be considered a “match” (a number between 0.0 and 1.0).

  • plant_name_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • plant_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_type_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • capacity_mw_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • construction_year_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • utility_id_ferc1_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

  • fuel_fraction_wt (float) – weight used to determine the relative importance of each of the features in the feature matrix used to calculate the cosine similarity between records. Used to scale each individual feature before the vectors are normalized.

Returns

an sklearn Pipeline that performs reprocessing and classification with a FERCPlantClassifier object.

Return type

sklearn.pipeline.Pipeline

pudl.transform.ferc1.plant_in_service(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 Plant in Service data for loading into PUDL.

Re-organizes the original FERC Form 1 Plant in Service data by unpacking the rows as needed on a year by year basis, to organize them into columns. The “columns” in the original FERC Form 1 denote starting balancing, ending balance, additions, retirements, adjustments, and transfers – these categories are turned into labels in a column called “amount_type”. Because each row in the transformed table is composed of many individual records (rows) from the original table, row_number can’t be part of the record_id, which means they are no longer unique. To infer exactly what record a given piece of data came from, the record_id and the row_map (found in the PUDL package_data directory) can be used.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.plants_hydro(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_hydro data for loading into PUDL Database.

Standardizes plant names (stripping whitespace and Using Title Case). Also converts into our preferred units of MW and MWh.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_pumped_storage(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 pumped storage data for loading into PUDL.

Standardizes plant names (stripping whitespace and Using Title Case). Also converts into our preferred units of MW and MWh.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_small(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_small data for loading into PUDL Database.

This FERC Form 1 table contains information about a large number of small plants, including many small hydroelectric and other renewable generation facilities. Unfortunately the data is not well standardized, and so the plants have been categorized manually, with the results of that categorization stored in an Excel spreadsheet. This function reads in the plant type data from the spreadsheet and merges it with the rest of the information from the FERC DB based on record number, FERC respondent ID, and report year. When possible the FERC license number for small hydro plants is also manually extracted from the data.

This categorization will need to be renewed with each additional year of FERC data we pull in. As of v0.1 the small plants have been categorized for 2004-2015.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of transformed dataframes.

Return type

dict

pudl.transform.ferc1.plants_steam(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 plant_steam data for loading into PUDL Database.

This includes converting to our preferred units of MWh and MW, as well as standardizing the strings describing the kind of plant and construction.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

of transformed dataframes, including the newly transformed plants_steam_ferc1 dataframe.

Return type

dict

pudl.transform.ferc1.plants_steam_validate_ids(ferc1_steam_df)[source]

Tests that plant_id_ferc1 times series includes one record per year.

Parameters

ferc1_steam_df (pandas.DataFrame) – A DataFrame of the data from the FERC 1 Steam table.

Returns

None

pudl.transform.ferc1.purchased_power(ferc1_raw_dfs, ferc1_transformed_dfs)[source]

Transforms FERC Form 1 pumped storage data for loading into PUDL.

This table has data about inter-utility power purchases into the PUDL DB. This includes how much electricty was purchased, how much it cost, and who it was purchased from. Unfortunately the field describing which other utility the power was being bought from is poorly standardized, making it difficult to correlate with other data. It will need to be categorized by hand or with some fuzzy matching eventually.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database.

  • ferc1_transformed_dfs (dict) – A dictionary of DataFrames to be transformed.

Returns

The dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.transform(ferc1_raw_dfs, ferc1_tables=('fuel_ferc1', 'plants_steam_ferc1', 'plants_small_ferc1', 'plants_hydro_ferc1', 'plants_pumped_storage_ferc1', 'purchased_power_ferc1', 'plant_in_service_ferc1'))[source]

Transforms FERC 1.

Parameters
  • ferc1_raw_dfs (dict) – Each entry in this dictionary of DataFrame objects corresponds to a table from the FERC Form 1 DBC database

  • ferc1_tables (tuple) – A tuple containing the set of tables which have been successfully integrated into PUDL

Returns

A dictionary of the transformed DataFrames.

Return type

dict

pudl.transform.ferc1.unpack_table(ferc1_df, table_name, data_cols, data_rows)[source]

Normalize a row-and-column based FERC Form 1 table.

Pulls the named database table from the FERC Form 1 DB and uses the corresponding ferc1_row_map to unpack the row_number coded data.

Parameters
  • ferc1_df (pandas.DataFrame) – Raw FERC Form 1 DataFrame from the DB.

  • table_name (str) – Original name of the FERC Form 1 DB table.

  • data_cols (list) – List of strings corresponding to the original FERC Form 1 database table column labels – these are the columns of data that we are extracting (it can be a subset of the columns which are present in the original database).

  • data_rows (list) – List of row_names to extract, as defined in the FERC 1 row maps. Set to slice(None) if you want all rows.

Returns

pandas.DataFrame