Presented by:
Zachary H. Draper, James J. Dukarm, and Claude Beauchemin
Delta-X Research, Inc. and TJH2b Analytical Services
TechCon 2022
The main results for the comparative study of dissolved-gas analysis (DGA) fault severity methods are published in CIGRE Canada 2021 and IEEE PES 2022. The CIGRE conference paper compared methods derived from IEEE, IEC, and Reliability-based DGA and found that Reliability-based DGA performed the best at identifying transformers at risk of near-term failure. The same data and toolset of comparative methods were later used in the IEEE conference paper to illustrate how the interpretive method of IEEE C57.104-2019 could be improved by formally defining an “Extreme” DGA category. Here we present the underlying tools (statistics and optimization curves) that were used to perform this comparative analysis so that more advanced DGA fault severity interpretation methods can be built in an objective and scientific manner.
Introduction
The implicit promise behind DGA is to allow electrical power system operators to monitor the health of power transformers and other critical assets in the grid. The ideal scenario is that DGA will detect a problem in a transformer prior to near-term failure so that it can be removed from service for maintenance before a forced outage or catastrophic failure can occur. Transformer failure is relatively rare, but the cost associated with failure for large critical assets warrants the cost of routine monitoring. Since the gases in the transformer oil do not represent the true “condition” of the transformer but are rather a symptom of a problem, the correct interpretation of the data is crucial to realize this ideal.
Beyond a limited set of case studies, however, the capability of DGA fault severity interpretation methods to predict near-term failure in a fleet of transformers has not been formally tested. This is important because, for any given method, if the balance between false positives and true positives on operational transformers is not good enough, the method may not be cost-effective or useful. Currently, most methods of DGA fault severity interpretation are based on statistical analysis of a large database of DGA results where the corresponding condition of the transformer is unknown. Therefore, these methods use statistical limits (high percentiles) representing atypical gas concentrations, deltas, or rates of increase rather than limits based on actual failures or known faults.
What is crucial in order to make progress in DGA fault severity interpretation is knowing what we are actually looking for. In other words, what does failure actually look like? Based on that, we can develop algorithms using DGA to identify failing transformers. Then, by using a set of mathematical criteria, we can objectively determine which algorithms perform better at classifying transformers at risk of near-term failure.
Failure Data
Collecting data related to transformers that have failed is fundamental to being able to predict failure in the future. Not all failures will be predictable by DGA, but any insulation failure should show some symptoms of degradation if the transformer was routinely sampled. When assets fail, an internal investigation of the root cause of failure can create a trove of valuable information. A full root cause analysis would be ideal, but even minor inspections or physical testing to determine if the transformer can continue to operate can help remove cases that are clearly not DGA-related, such as vandalism or wildlife intrusion. Such investigative data are crucial, as we will see below, because they represent the “True Condition” that we are attempting to predict or test for. It should also be noted that “failure” is not necessarily the only condition we may try to predict with DGA. For example, cooling system insufficiency indicated by carbon oxide gassing may not result in near-term failure but rather in long-term degradation of the insulation system. Therefore, test methods may be adapted or the condition being tested for may vary, but the statistical tools described in this paper can be used to evaluate the relative effectiveness of any of these methods.
The data set used in this study contained 15,239 operational transformers and 307 transformers that failed in service. For each failure case, we required at least 3 samples and that the sample prior to failure was taken within 2 years of the date of failure. Post-failure samples were excluded because they are not predictive of the failure but rather a result of a severe fault. In most cases, transformers were sampled on a yearly cadence.
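As an illustration, the following is a minimal Python sketch of how such selection criteria could be applied to a table of sample histories. The column names (serial, sample_date, failure_date) are illustrative only and are not the actual schema used in the study.

```python
# Minimal sketch of the failure-case screening criteria described above.
# Assumes one row per DGA sample, with hypothetical columns:
#   serial, sample_date, failure_date (all dates as pandas datetimes).
import pandas as pd

def select_failure_cases(samples: pd.DataFrame) -> pd.DataFrame:
    """Keep failed transformers with >= 3 pre-failure samples, the latest
    of which was taken within 2 years of the failure date."""
    # Drop post-failure samples; they reflect the fault, not a prediction of it.
    pre = samples[samples["sample_date"] < samples["failure_date"]]
    stats = pre.groupby("serial").agg(
        n_samples=("sample_date", "size"),
        last_sample=("sample_date", "max"),
        failure_date=("failure_date", "first"),
    )
    recent = (stats["failure_date"] - stats["last_sample"]) <= pd.Timedelta(days=730)
    keep = stats[(stats["n_samples"] >= 3) & recent]
    return pre[pre["serial"].isin(keep.index)]
```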
Confusion Matrix
In essence, DGA fault severity interpretation is a classification problem: identify faulty transformers and pass those that neither pose a risk of near-term failure nor require extra attention. A confusion matrix is the primary basis for screening test statistics such as the true positive rate and the false negative rate. These statistics provide an objective basis for assessing the relative performance of classification methods (see Table 1).
The following section describes in boolean terms the variables in the confusion matrix used in this study. The descriptions refer to a fleet of transformers, DGA screening tests, and incipient failure. C+ (“condition positive”) denotes a transformer with a condition leading to near-term failure; ideally, such a condition is detectable by DGA before failure. C− (“condition negative”) denotes a healthy transformer operating normally without an incipient failure. T+ (“test positive”) denotes a positive DGA interpretation result indicating the presence of an incipient fault; usually this means the DGA gases are above some limits, depending on the interpretation method. T− (“test negative”) denotes a DGA interpretation that does not indicate incipient failure; usually it designates DGA results below some limits, again depending on the method. Note that the analysis process presented here has been adapted to DGA but is not limited to DGA. The statistics presented here could be applied to any other equipment, operating condition, and/or screening test.
Subjects
The subjects are a fleet of N transformers that are in service and are periodically screened by DGA to select which transformers to investigate as possibly having a near-term failure. For discussion purposes, condition (C) is an impending failure, but it could be any other definable state of practical interest. The transformers are referred to as “cases” or “subjects”.
Condition
The set of all “condition positive” cases, i.e., transformers having condition C (near-term failure), is denoted by C+. The set of all “condition negative” cases, i.e., those without condition C (healthy transformers), is denoted by C−. Obviously |C+| + |C−| = N, the population size. For the purpose of this comparative analysis, the sets C+ and C− must be known to a high degree of certainty, beyond the capabilities of the screening tests being compared. Typically a nearly infallible “gold standard” test is available, but it may be very expensive or impractical for wide application. For instance, electrical tests or internal inspection performed with the transformer de-energized would be the gold standard method of determining whether a transformer in service has any internal defect or problem with its insulation. In practice, however, those tests cannot be applied as routinely as a DGA screening test. A forced outage or catastrophic failure can be considered a retrospective gold-standard confirmation of the impending failure in previous data.
Test result
A DGA test (T) has a “test positive” result associated with the prediction of near-term failure and a “test negative” result associated with the prediction of a healthy transformer. The sets of transformers whose latest screening test results are positive or negative are, respectively, T+ and T−, and their sizes are |T+| and |T−|, with |T+|+|T−| = N. The DGA screening test is used on the entire population. It is desirable that the DGA test result should match the actual condition in a large majority of cases. The intersection T+ ∩ C+ consists of all the transformers that are both DGA test positive and condition positive for near-term failure.

Table 1
Table showing the boolean states the transformer fleet can have versus the possible DGA test predictions. Every transformer in the fleet is evaluated with a binary score (i.e., good or bad) and tallied in one of the four categories in the matrix. Row and column sub-totals for each DGA outcome and each condition are calculated. It is possible to have a larger confusion matrix (N×N) for a ranking system featuring multiple score values, like 1 to 4, or even for ranges of a continuous risk value (i.e., an individualized score).
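To make the tally concrete, the short Python sketch below counts the four cells of Table 1 from two boolean arrays: the known condition (C+/C−) and the DGA test outcome (T+/T−). The variable names and example data are illustrative.

```python
# Minimal sketch of tallying the 2x2 confusion matrix of Table 1.
import numpy as np

def confusion_counts(condition_pos: np.ndarray, test_pos: np.ndarray) -> dict:
    tp = int(np.sum(test_pos & condition_pos))    # T+ and C+  (true positive)
    fp = int(np.sum(test_pos & ~condition_pos))   # T+ but C-  (false positive)
    fn = int(np.sum(~test_pos & condition_pos))   # T- but C+  (false negative)
    tn = int(np.sum(~test_pos & ~condition_pos))  # T- and C-  (true negative)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Example: 10 transformers, 2 of which are condition positive.
condition = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
test      = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
print(confusion_counts(condition, test))  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 7}
```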
Prevalence: The prevalence of condition C (near-term failure) is the probability
Prevalence = |C+| / N   (Eq. 1)
or the fraction of the transformer population that will have a near-term failure. This is also the Bayesian prior probability of a randomly selected transformer having a near-term failure before its DGA test result is known.
True Positive Rate: The true positive rate (TPR) of a screening test, also known as sensitivity or recall, is the fraction of transformers with near-term failure whose DGA test is positive. A test with high sensitivity is positive for a large majority of condition-positive subjects.
TPR = |T+ ∩ C+| / |C+|   (Eq. 2)
False Positive Rate: The fraction of healthy transformers that test positive during a DGA screening is the false positive rate (FPR). Generally, an ideal DGA screening test should have a low false positive rate (equivalently, a high true negative rate, a.k.a. specificity), but if follow-up testing is not too expensive, or the cost of an unexpected failure in service is extremely high, then it might be preferable to accept a higher false positive rate in order to have a lower false negative rate.
FPR = |T+ ∩ C−| / |C−|   (Eq. 3)
Positive Predictive Value: Positive predictive value (PPV) is also known as precision. PPV is the Bayesian posterior probability of near-term failure, given a positive DGA test result. When the prevalence of near-term failure is very low, as it is for transformers, even a highly sensitive screening test can have a low PPV, implying that a positive test result requires further investigation to confirm the diagnosis.
PPV = |T+ ∩ C+| / |T+|   (Eq. 4)
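The statistics defined above (Eqs. 1–4) follow directly from the confusion-matrix counts, as in this short Python sketch (division-by-zero guards are omitted for brevity; the example counts continue the toy fleet from the previous sketch).

```python
# Screening statistics computed from confusion-matrix counts.
def screening_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    n = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / n,   # |C+| / N           (Eq. 1)
        "tpr": tp / (tp + fn),         # sensitivity/recall (Eq. 2)
        "fpr": fp / (fp + tn),         # false positive rate (Eq. 3)
        "ppv": tp / (tp + fp),         # precision          (Eq. 4)
    }

print(screening_stats(tp=1, fp=1, fn=1, tn=7))
```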
Other statistics that can be used to evaluate classifiers, but which are not used for the optimization curves below, are presented in the Appendix for brevity.
ROC Curve
Receiver Operating Characteristic (ROC) curves were first used in WWII by electrical engineers to optimize radar systems for distinguishing true targets from spurious returns. The tool was soon adapted to many other classification problems across industries. More recently, the machine learning community has adopted it as a fundamental tool for assessing the performance of various classification methods.

Figure 1: Graphic showing example ROC curves. More area under the curve generally represents a better classifier. The left image shows a ROC curve for a continuous variable; the right image shows discrete status levels typical of most DGA screening tests.
The general idea, shown in Figure 1, is that the true positive rate (Eq. 2) and the false positive rate (Eq. 3) can be plotted on scales of zero to one. A perfect classifier has 100% true positives and 0% false positives, placing it in the upper left corner of the chart. Any method that is equivalent to a fair coin flip (i.e., produces purely random results unrelated to the condition) will have a true positive rate equal to its false positive rate, represented by the red dashed diagonal line. Anything at or below that line means the classifier is either anti-correlated or has no diagnostic value for the condition being tested for.
Generally speaking, a curve with a large area under it represents a better overall classifier than a curve with a smaller area under it. This can be reduced to a single number by computing the area under the curve. A special case is where the curves for two different methods intersect; then a combination of the two methods may be a better method overall. For a method with multiple status codes, such as IEEE C57.104-2019 with status codes 1 to 3 (plus extreme), each status code can be considered to have its own true positive and false positive rates, providing a series of points on a ROC curve. Joining those points with straight line segments (starting at (0,0) and finishing at (1,1)) generates the ROC curve. Similarly, any method with a continuous risk variable, or health index number, can provide multiple points (up to one per transformer) on a ROC curve by recording all true positives and false positives corresponding to various levels of the variable.
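For illustration, the following Python sketch builds a ROC curve for a hypothetical method with discrete status codes (higher code = higher risk), producing one point per threshold of the form “status ≥ k”, and computes the area under the curve with the trapezoid rule. The example data are made up.

```python
# Sketch: ROC points and area under the curve for a discrete-status method.
import numpy as np

def roc_points(status: np.ndarray, condition_pos: np.ndarray):
    pts = [(0.0, 0.0)]
    for k in sorted(np.unique(status))[::-1]:   # strictest threshold first
        test_pos = status >= k
        tpr = np.sum(test_pos & condition_pos) / np.sum(condition_pos)
        fpr = np.sum(test_pos & ~condition_pos) / np.sum(~condition_pos)
        pts.append((float(fpr), float(tpr)))
    pts.append((1.0, 1.0))
    return pts

def auc(points):
    # Trapezoid rule over the points sorted by false positive rate.
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

status    = np.array([4, 2, 3, 1, 1, 1, 2, 1, 1, 1])
condition = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
print(roc_points(status, condition), auc(roc_points(status, condition)))
```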
Precision-Recall Curve
Another curve that can be used to compare classification methods is called a Precision-Recall curve (Positive Predictive Value – True Positive Rate curve). As with the ROC curve, a large area under the P-R curve generally indicates a better classification method. Precision is another name for positive predictive value (Eq. 4). Recall is just another name for true positive rate or sensitivity (Eq. 2). As with ROC curves, points on a P-R curve represent individual “levels” (such as status codes) of a classification method, but unlike the ROC curve, linear interpolation between points is not appropriate. A P-R curve has some advantage over a ROC curve when there is a strong imbalance between positive and negative conditions, such as when the condition tested for is rare, which is the case for transformer failures.
Since transformer reliability is high, the prevalence of failure is typically 0.25-1% per year for all causes, i.e., the number of condition positives for near-term failure is very low. A P-R curve can highlight a method that has good predictive value for extremely rare high-risk cases but may not perform well in general as a classifier of risk. For example, in Figure 2, Method A has a high positive predictive value (precision, PPV) for cases where it has low sensitivity (TPR), but lower PPV for most of the cases where it has better sensitivity. In other words, if the transformer is flagged positive based on very high limits (which cannot detect problems occurring at lower gas levels), you might strongly believe the condition to be true, but positive results based on lower limits may often be false positives.
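The sketch below, again for a hypothetical method with discrete status codes, computes the P-R points and summarizes them with a step-wise (non-linear) area, which is one common way to avoid the optimistic bias of linear interpolation on P-R curves. The example data are made up.

```python
# Sketch: precision-recall points and a step-wise area summary.
import numpy as np

def pr_points(status, condition_pos):
    pts = []
    for k in sorted(np.unique(status))[::-1]:   # strictest threshold first
        test_pos = status >= k
        tp = np.sum(test_pos & condition_pos)
        recall = tp / np.sum(condition_pos)
        precision = tp / np.sum(test_pos) if np.sum(test_pos) else 1.0
        pts.append((float(recall), float(precision)))
    return pts

def stepwise_area(points):
    # Precision at each threshold times the recall gained at that threshold.
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

status    = np.array([4, 2, 3, 1, 1, 1, 2, 1, 1, 1])
condition = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
print(pr_points(status, condition), stepwise_area(pr_points(status, condition)))
```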

Figure 2: Graphic showing example Precision-Recall curves where more area under the curve represents a better classifier (A is better than B). Linear interpolation is not appropriate for the P-R curve; special interpolation between points is required.
Conclusions
By using the classifier assessment methods presented here, it is possible to objectively and scientifically compare DGA interpretation methods. This comparison allows incremental improvements to algorithms that show better performance. Examples of this are provided in the CIGRE Canada 2021 paper, which compares several different DGA interpretation methods. In the IEEE PES 2022 paper, IEEE C57.104-2019 was shown to be improved by adding a level for extreme DGA. The classifier assessment tools were also used effectively to optimize the choice of limits for the “Extreme” DGA level. Multiplying the status code 3 limits by a factor of 7 maximized the diagnostic odds ratio (Eq. 14) and the area under the optimization curves, obtaining the greatest diagnostic effect at flagging transformers at risk of near-term failure.
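As a rough illustration of that optimization step, a multiplier scan such as the following Python sketch could be used. Here counts_for_multiplier() is a hypothetical placeholder for re-scoring the fleet with the scaled “Extreme” limits; it is not part of any published toolset, and the DOR formula used is the standard (TP·TN)/(FP·FN).

```python
# Hedged sketch: pick the limit multiplier that maximizes the diagnostic odds ratio.
def diagnostic_odds_ratio(tp, fp, fn, tn):
    # DOR = (TP * TN) / (FP * FN); treat zero denominators as "infinitely good".
    return (tp * tn) / (fp * fn) if fp and fn else float("inf")

def best_multiplier(counts_for_multiplier, candidates=range(2, 11)):
    # counts_for_multiplier(m) is a hypothetical callback returning (TP, FP, FN, TN)
    # after re-scoring the fleet with status code 3 limits scaled by m.
    return max(candidates,
               key=lambda m: diagnostic_odds_ratio(*counts_for_multiplier(m)))
```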
References:
- Z.H. Draper and J.J. Dukarm. “Forecasting Near-Term Failure of Transformers Using Reliability Statistics on Dissolved Gas Analysis”. In: 2021 CIGRE Canada Conference. Number 408. Toronto, ON, Oct. 2021.
- Z.H. Draper, J.J. Dukarm, and C. Beauchemin. “How to Improve IEEE C57.104-2019 DGA Fault Severity Interpretation”. In: 2022 IEEE PES Transmission and Distribution Conference and Exposition (T&D 2022). In press. New Orleans, LA, Apr. 2022.
- L. Maxim, R. Niebo, and M. Utell. “Screening tests: a review with examples”. In: Inhalation Toxicology 26 (2014), pp. 811–828.
- Alaa Tharwat. “Classification assessment methods”. In: Applied Computing and Informatics (2020), ahead of print. ISSN: 2210-8327. DOI: 10.1016/j.aci.2018.08.003. URL: https://doi.org/10.1016/j.aci.2018.08.003.
- Afina S. Glas et al. “The diagnostic odds ratio: a single indicator of test performance”. In: Journal of Clinical Epidemiology 56.11 (2003), pp. 1129–1135.
- P.M. Woodward. Probability and information theory with applications to radar. Pergamon Press, 1953.
- Tom Fawcett. “An introduction to ROC analysis”. In: Pattern Recognition Letters 27.8 (2006), pp. 861–874. ISSN: 0167-8655. DOI: 10.1016/j.patrec.2005.10.010. URL: http://www.sciencedirect.com/science/article/pii/S016786550500303X.
- Jesse Davis and Mark Goadrich. “The Relationship between Precision-Recall and ROC Curves”. In: Proceedings of the 23rd International Conference on Machine Learning (ICML ’06). Pittsburgh, PA, USA: Association for Computing Machinery, 2006, pp. 233–240. ISBN: 1595933832. DOI: 10.1145/1143844.1143874. URL: https://doi.org/10.1145/1143844.1143874.
- cmglee, MartinThoma. Roc-draft-xkcd-style.svg. CC BY-SA 4.0. [Online; accessed December 15, 2021]. URL: https://commons.wikimedia.org/w/index.php?curid=109730045.
- Takaya Saito and Marc Rehmsmeier. “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets”. In: PLoS ONE 10.3 (2015), e0118432.
- Saito and Rehmsmeier. Precision-recall curves for multiple models. [Online; accessed December 15, 2021]. URL: https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/two-precision-recall-curves.png.