BIG DATA … where is all the data, and is it useful?

Presented By:
Brian Sparling, SMIEEE
Dynamic Ratings
Dr. Victoria M. Catterson
Senior Consultant to Dynamic Ratings
TechCon 2017


Much has been made of Big Data and its potential usefulness in many applications in the electric power industry. It has certainly proved useful in domains such as banking and logistics, giving insight into previously ‘hidden’ aspects of the data that can and do escape the human eye.

The application of Big Data tools, and data analytics more generally, within electric power utilities should consider the specific features of the world of asset management of major assets such as transformers, switchgear and battery systems. In particular, challenges arise with regard to the quality and timeliness of the data records of the equipment’s past life, not to mention that much of that data has gone missing.

For those engaged in the development of condition (or health) indices, this can prove to be a major challenge. Data quality and timeliness determine the confidence one can place in the evaluations made.

This paper will address this issue, and demonstrate that there are methods to detect poor-quality data. It will provide suggestions on how to address the timeliness of the data with regard to the confidence level that is necessary to make judgments and decisions in the asset management field.


Health Indexing is widely used as a tool for planning, refurbishment or replacement of critical HV electrical equipment [1]. The health index is a weighted combination of various factors and parameters influencing the equipment’s condition, which in turn can be derived from a combination of inspection, off-line testing and on-line continuous monitoring data [2].
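As a minimal sketch of the weighted combination described above, the following Python function computes a health index from normalized condition scores. The parameter names, the 0 to 1 scoring scale and the weights are illustrative assumptions, and are not values taken from [1] or [2].

```python
def health_index(scores, weights):
    """Weighted average of condition scores (each assumed to lie in 0..1)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical inputs: 1.0 = as new, 0.0 = end of life.
example_scores = {"dga": 0.5, "oil_quality": 1.0, "bushings": 0.75}
example_weights = {"dga": 2.0, "oil_quality": 1.0, "bushings": 1.0}
example_hi = health_index(example_scores, example_weights)  # 0.6875
```

Any real scheme would need weights and scoring rules agreed by domain experts; the point here is only that every input is assumed to be present and current, which is the assumption examined in the rest of this paper.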

A lot of research has considered appropriate techniques for integrating this data into an overall health index [1]. These approaches generally assume that the data presents a reliable picture of current transformer health, meaning that the test record is complete and up-to-date. In an operational context the data may not be as reliable as is often assumed, and yet asset managers must still make decisions about maintenance and replacement based on it.

Big Data and Data Quality

More and more data is being collected by utilities, leading to an interest in tools and methods for handling so-called Big Data. Big Data is generally categorized according to the Gartner “3Vs” model, meaning it has high volume, high velocity, and high variety (of source and type). Other industries such as banking and logistics have reached the realm of Big Data faster than the power industry, and specific processing platforms and technologies have been developed for handling Big Data.

But before utilities adopt these platforms, the application domain and unique requirements of the power industry should be considered. While large historical data sets may be available for analysis, the quality of this data will have a significant impact on the quality of decisions that can be made.

A wealth of research throughout the 1990s and 2000s considered ways of characterizing the quality of data, with the aim of determining whether it could be used as a sound basis for decision making. Early work considered a very wide range of facets of data quality such as [3]:

  • Accuracy, precision, and reliability
  • Currentness, age, and timeliness
  • Completeness, and duplication
  • Consistency and integrity

Over time, these concepts have been condensed and defined as five dimensions of quality [4], which are:

  • Completeness; are all the records or fields present?
  • Timeliness; is the data up-to-date?
  • Validity; does the data conform to formatting and domain rules? For example, age cannot be a negative number.
  • Consistency; can related records be compared without conflict?
  • Accuracy; is the data a true reflection of the situation being recorded?

In some cases, a further dimension is added [5]:

  • Uniqueness; is the data recorded without repetition? For example, does the measured and/or calculated value from an IED or sensor stay constant regardless of changing operating conditions?
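Some of these dimensions lend themselves to simple automated checks. The sketch below, in Python, tests one DGA record for completeness and validity, and a sequence of records for suspicious repetition. The field names in GASES and the function names are assumptions made for illustration; accuracy and consistency generally need a reference to compare against and are not covered here.

```python
GASES = ("h2", "ch4", "c2h2", "c2h4", "c2h6", "co", "co2")  # assumed field names


def check_record(record):
    """Return a list of data-quality issues found in one DGA record (a dict)."""
    issues = []
    for gas in GASES:
        value = record.get(gas)
        if value is None:
            issues.append(f"completeness: {gas} missing")
        elif value < 0:
            # A gas concentration cannot be negative (domain rule).
            issues.append(f"validity: {gas} is negative")
    return issues


def find_repeated(records):
    """Return indices of records identical to the record before them.

    Identical gas values on consecutive samples are treated here as a
    possible uniqueness problem, not as proof of one.
    """
    flagged = []
    for index in range(1, len(records)):
        previous = [records[index - 1].get(gas) for gas in GASES]
        current = [records[index].get(gas) for gas in GASES]
        if previous == current:
            flagged.append(index)
    return flagged
```

A flagged record is a prompt for an engineer to look more closely, not an automatic rejection.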

With respect to transformer data, there is scope for poor data quality along a number of these axes. Manual data entry and record-keeping introduce opportunities for invalid or inaccurate results to be stored. The skill level of the technician performing the test can also impact the accuracy of the result. Sending DGA samples to different labs for analysis may lead to inconsistencies due to differences in calibration and/or processing of the samples, which make it more difficult to compare results against each other.

An example of this occurrence is given in Figure 1, showing the record for one of a fleet of GSU transformers with an Independent Power Producer (IPP). The engineer on site drew the conclusion that this transformer was in perfect health, as their training indicated that “No gas” is a good result. However, this record immediately suggests poor data quality, and should not be considered a reliable indicator of transformer health.

Figure 1 Example DGA record showing poor accuracy and uniqueness

Another example from the same fleet is shown in Figure 2. This unit tripped off-line due to a Buchholz operation. In this case again, the owner was under the impression that “No gas” is a good result. The owner was following an appropriate methodology, by performing DGA testing every six months with the same personnel drawing the samples from the units. However, the records highlighted in pink show suspiciously low levels of fault gases, which are more likely to be poor data quality than a true reflection of transformer health.

The suspicion is that the process used by laboratory ‘A’ was inconsistent, to say the least. The unit tripped off-line due to a Buchholz relay operation on Oct 9th. Multiple samples were taken at that time and, on this occasion, sent to three different laboratories. Significant levels of gas were measured by two of the three laboratories, but the results from ‘A’ were still inconsistent with those of the other two.

It was interesting to see that laboratory ‘A’ was selected to continue the DGA program, with the same inconsistent results. The data quality of the historical record is therefore suspect. Even though the last sample gives a far more accurate reading of the gases, as properly processed by a competent laboratory, the suspect records mean that very little information can be drawn about trends over time. This makes asset management decision-making more difficult than if the data quality were higher.

Figure 2 Example DGA record showing poor accuracy and consistency
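When the same sampling event is analyzed by several laboratories, as happened after the trip, a simple cross-check can highlight a laboratory that disagrees with the rest. The Python sketch below compares each result with the median of all results for one gas. The function name, the factor-of-three tolerance and the 1 ppm offset are assumptions for illustration, and the example values are hypothetical; they are not the values from Figure 2.

```python
from statistics import median


def inconsistent_labs(results, max_ratio=3.0):
    """Flag labs whose result disagrees with the median of all labs.

    results maps lab name -> measured ppm of one gas for the same
    sampling event. max_ratio is an assumed tolerance.
    """
    reference = median(results.values())
    flagged = []
    for lab, value in results.items():
        low, high = sorted((value, reference))
        # Add 1 ppm to both sides so that zero readings do not divide by zero.
        if (high + 1.0) / (low + 1.0) > max_ratio:
            flagged.append(lab)
    return sorted(flagged)


# Hypothetical hydrogen results (ppm) from three labs for one sampling event.
example_flags = inconsistent_labs({"A": 2.0, "B": 410.0, "C": 385.0})
```

With only two or three laboratories the median is a weak reference, so a flag of this kind indicates that the results need review; it does not establish which laboratory is correct.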

The Impact of ISO 55000 on Data Quality

ISO 55000 is a series of standards governing asset management processes (including data collection and retention) within organizations [6]. It was originally developed by the Institute of Asset Management (IAM) and published by the British Standards Institution in 2004 as Publicly Available Specification 55 (PAS55), before being adopted internationally in 2014 as ISO 55000. It is not specific to the power sector, although the network utilities in the United Kingdom have adopted ISO 55000.

Within the standard, the requirements on documented information such as test results are discussed. With regard to data quality, three key sections of the standard require that:

  • “The organization shall include consideration of… the impact of quality, availability and management of information on organizational decision making”
  • “The organization shall determine… how and when information is to be collected, analyzed and evaluated”
  • “When determining its information requirements, the organization should consider… its ability to maintain the appropriate quality and timeliness of the information”.

Fundamentally, ISO 55000 places a duty on the organization to consider what data is needed for the decision-making process. It does not specify details such as the level of quality or timeliness that is appropriate. The responsibility is placed entirely on the company to justify that their data collection strategy is appropriate for their operational needs. Therefore, if data is potentially out of date, it is not indicative of a poor asset management strategy or non-conformance to ISO 55000, as there can be strong operational reasons why particular tests may not be performed at the usual intervals.

In short, ISO 55000 provides no explicit guidance on the handling of data quality.

Data Timeliness as a Facet of Data Management

While accuracy and consistency of data are clearly important, the timeliness of test records is also critical to good decision-making. Researchers have proposed different approaches to quantifying and dealing with the effects of poor timeliness. It should be noted that all data quality measures are highly dependent on the business process the data is feeding.

For example, one test may still be considered timely two years after it was taken, while another may be out of date at this interval. Therefore, domain knowledge is needed to determine the threshold for whether a test result is timely enough (or accurate enough, valid enough, etc.) for the purposes of decision-making.

With these thresholds in place, it may be useful to automate the calculation of the overall quality of the data on a given transformer. Some approaches to this include taking a weighted average of the five axes of data quality, or taking the lowest score as the overall quality score [7]. This work has close parallels with the transformer health index itself, in terms of whether a transformer’s score should be an average of the health of its subcomponents, or whether the health is equal to that of the poorest condition component.
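The two aggregation approaches mentioned above can be compared with a short Python sketch. The 0 to 1 scale for each dimension, the equal weights and the example scores are assumptions for illustration only.

```python
DIMENSIONS = ("completeness", "timeliness", "validity", "consistency", "accuracy")


def quality_weighted(scores, weights):
    """Overall data quality as a weighted average of the five dimensions."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total


def quality_minimum(scores):
    """Overall data quality taken as the poorest of the five dimensions."""
    return min(scores[d] for d in DIMENSIONS)


# Hypothetical scores for one transformer's records (1.0 = best).
example_scores = {
    "completeness": 1.0,
    "timeliness": 0.25,
    "validity": 1.0,
    "consistency": 0.75,
    "accuracy": 0.5,
}
equal_weights = {d: 1.0 for d in DIMENSIONS}
averaged = quality_weighted(example_scores, equal_weights)
poorest = quality_minimum(example_scores)
```

In this example the average is considerably higher than the minimum, because four acceptable dimensions mask one poor one. Which behavior is preferable depends on the decision the score is feeding.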

It should be recognized that an old test report is not inherently out of date. It only becomes out of date if the underlying circumstances change so that the test result no longer reflects the current status of the transformer. However, it is generally impossible to determine this without repeating the test. Therefore, the assumption is made that over time, the test result becomes less and less likely to represent the current health status of the transformer.

Bouzeghoub and Peralta introduced terminology for describing these different aspects of timeliness [8]. They differentiate between currency (the time since the measurement was made) and obsolescence (how much the underlying value has changed since the measurement was taken). With regard to the whole database, they define a freshness rate, which is the percentage of records which are not obsolete. Finally, they define a timeliness period as being an application-specific length of time in which a piece of data may be considered not obsolete.

Applying this terminology to the transformer case, there is uncertainty about whether or not a test result is obsolete. The currency can be calculated from the known date of a given test, but the challenge lies in determining whether that test result is obsolete given the time elapsed since it was taken. The timeliness period (as defined by Bouzeghoub and Peralta) can be used to help assess the likelihood of obsolescence.
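These terms translate directly into a calculation. In the Python sketch below, currency is computed from the test date, and a record is counted as fresh if its currency is within the timeliness period. This is an approximation: true obsolescence cannot be observed without retesting, so the timeliness period stands in for it. The function names, dates and the one-year period are hypothetical.

```python
from datetime import date


def currency_days(test_date, today):
    """Currency: time elapsed since the measurement was made, in days."""
    return (today - test_date).days


def freshness_rate(test_dates, today, timeliness_period_days):
    """Percentage of records whose currency is within the timeliness period."""
    if not test_dates:
        raise ValueError("no records supplied")
    fresh = sum(
        1 for d in test_dates
        if currency_days(d, today) <= timeliness_period_days
    )
    return 100.0 * fresh / len(test_dates)


# Hypothetical test dates for four transformers, assessed on 1 Jan 2017.
assessment_day = date(2017, 1, 1)
last_tests = [date(2016, 10, 1), date(2015, 6, 1), date(2016, 3, 1), date(2013, 1, 1)]
example_rate = freshness_rate(last_tests, assessment_day, timeliness_period_days=365)
```

Here two of the four records fall within the assumed one-year period, so the freshness rate is 50%.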

As an example, two DGA tests a year apart may show low absolute values of key gases and very little change within the year. In this case, it would be reasonable to assume that the test report is unlikely to become obsolete within the next year, and that the likelihood of the test result becoming obsolete increases with each subsequent year. However, the timeliness of a report is highly dependent on the test type and test result. If the two DGA reports show a rapid rate of change over the year, the timeliness period would shorten significantly, and the chance of obsolescence one year later is high.
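One way to make the timeliness period depend on the rate of change is sketched below in Python. It extrapolates the observed rate linearly until an action limit is reached, and caps the result at a maximum period. The linear extrapolation, the limit, the two-year cap and the gas values are all assumptions for illustration; they are not taken from any standard or from the references.

```python
def timeliness_period_days(previous_ppm, latest_ppm, days_between, limit_ppm,
                           max_period_days=730):
    """Days for which the latest result is assumed to stay representative."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    if latest_ppm >= limit_ppm:
        return 0  # already at or above the assumed limit: retest now
    rate = (latest_ppm - previous_ppm) / days_between  # ppm per day
    if rate <= 0:
        return max_period_days  # stable or falling gas level
    return min(max_period_days, int((limit_ppm - latest_ppm) / rate))


# Hypothetical key-gas values one year apart, with an assumed 100 ppm limit.
stable_period = timeliness_period_days(10.0, 12.0, 365, limit_ppm=100.0)
rapid_period = timeliness_period_days(10.0, 70.0, 365, limit_ppm=100.0)
```

The slowly changing case is capped at the maximum period, while the rapidly changing case gives roughly six months, which matches the qualitative argument above.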

Summary and Discussion

When calculating a transformer health index based on test reports and condition parameters, there is clearly a question of how best to integrate data of varying ages. The field of data management contains some terminology and definitions which help to quantify the precise problem.

  • The challenge is to recognize and handle data obsolescence. An older test report is not automatically wrong (obsolete), but it can be difficult to assess how likely it is that the parameter has changed since the test was taken.
  • The period in which data can be considered timely is application specific, since some measurements may be obsolete within minutes while others remain static for months. Domain expertise is needed to judge reasonable periods of timeliness of data for transformer condition assessment. Timeliness may also depend on the parameter’s value and rate of change, as in the case of gassing.

Current practice within the industry was reviewed to try to assess how transformer owners tend to handle this problem. The most detailed information relates to the Distribution Network Operators (DNOs) within the UK, who propose a number of possible solutions:

1. Discard older test results and use default parameter values in the health index [9]. This has the advantage of being very simple to implement and understand. It has the disadvantage of ignoring any information contained within the older data, as a default value may over- or underestimate true aging.

2. Estimate the current value of the parameter, by using expert judgment to project forward from the older test value. If performed in an ad hoc manner, this may result in a subjective outcome based on the experience of the expert. A more rigorous approach is to formalize a general process by eliciting knowledge from an expert, with the caveat that elicitation can be time-consuming and validation may be difficult.

3. Estimate the current value of the parameter, by using an aging model to project forward from the older test value [9]. The aging models may be either a single function for all parameters, or may be custom-built for each parameter. In the first case, the algorithm is relatively simple to implement and understand, although the models may not capture all expected parameter changes. The second case may give more accurate results, at the expense of a longer development process. In both cases, the resulting models may be difficult to validate.

4. Discard older test results and use population statistics to allocate a health index. This suited the particular needs of a UK utility for risk profiling as a subset of their wood pole asset base [10], but is not likely to give accurate results when applied to individual transformers.

5. Mandate that full testing of assets must be performed at a given interval, to remove the possibility of poor data timeliness. This has been adopted by another UK utility [11]; however, it is not always practical from an operational point of view.
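Option 3 above, in its single-function form, might be sketched as follows in Python. The exponential form, the function name project_forward, the aging rate and the optional ceiling are illustrative assumptions. They are not the models defined in [9], and any real aging rate would need to be justified and validated for each parameter.

```python
import math


def project_forward(old_value, years_elapsed, aging_rate_per_year, ceiling=None):
    """Estimate the current value of a parameter from an older test result.

    Assumes the parameter worsens exponentially with time, which is only
    one possible choice of aging model.
    """
    if years_elapsed < 0:
        raise ValueError("years_elapsed must not be negative")
    projected = old_value * math.exp(aging_rate_per_year * years_elapsed)
    if ceiling is not None:
        projected = min(projected, ceiling)
    return projected


# Hypothetical: a condition score of 2.0 measured five years ago,
# with an assumed aging rate of 0.1 per year and a score ceiling of 10.
estimated_now = project_forward(2.0, 5, 0.1, ceiling=10.0)
```

A custom-built model for each parameter would replace the single exponential with a function chosen for that parameter, at the cost of the longer development process noted above.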

By looking more widely to other domains, it was found that the timeliness of data can be presented along with the results, so that the user can make a judgment about the reliability of the output. The timeliness can be presented in a few different ways:

6. Icons indicating the accuracy and timeliness of each piece of data as in [12]. For transformers, this would mean mapping out the health index calculation and highlighting which inputs are of suspect timeliness. The asset manager can then trace through the impact of any older results on the final index. This has the benefit of using all available information in the calculation. Its disadvantage is that it makes no attempt to estimate changes in the older results, and may give a subjective outcome.

7. Comparing a metric for data quality or data timeliness against a trust threshold as in [13]. Based on the timeliness (or full data quality) of all inputs to the health index calculation, a separate data timeliness (quality) metric is calculated. If it is below a set threshold, the data is considered too poor for deriving sound decisions. The development of the quality metric can be an intensive and difficult process, and relies on significant expert judgment. It is also not clear how to deal with a result based on poor data quality.

8. Performing a sensitivity analysis on the health index as in [14]. Estimates are made of the possible current values of the potentially-obsolete condition parameter (e.g. by making a best case/worst case prediction from the time the test was taken). If the final health index does not change very much, the timeliness of the test is not of high importance. If there is a high sensitivity to the older test, the process has at least generated defined bounds on the possible values of the health index.
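The sensitivity analysis in item 8 can be sketched in Python as below, assuming a health index that is a weighted average of scores between 0 and 1. The parameter names, weights and the best-case and worst-case values for the older test are hypothetical. Because a weighted average is linear in each input, evaluating the two extremes is enough to bound the index; a non-linear index would need more evaluation points.

```python
def health_index(scores, weights):
    """Weighted average of condition scores (assumed health index form)."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total


def sensitivity_bounds(scores, weights, stale_name, best_case, worst_case):
    """Return (lowest, highest) health index over the assumed range of one input."""
    results = []
    for candidate in (best_case, worst_case):
        trial = dict(scores)  # copy so the original scores are not modified
        trial[stale_name] = candidate
        results.append(health_index(trial, weights))
    return min(results), max(results)


# Hypothetical: the oil quality test is old, so its score may have fallen
# from 1.0 to as low as 0.5 since it was taken.
scores = {"dga": 0.5, "oil_quality": 1.0, "bushings": 0.75}
weights = {"dga": 2.0, "oil_quality": 1.0, "bushings": 1.0}
low, high = sensitivity_bounds(scores, weights, "oil_quality", 1.0, 0.5)
spread = high - low
```

If the spread is small compared with the decision thresholds applied to the health index, the age of that test matters little. If it is large, the bounds at least show how far the index could move.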

Developing confidence in any health or condition assessment of equipment is a concern to those involved. As such, thought needs to be put into the assessment criteria, to include not only current test results and visual observations, but also older test data that may not be obsolete.

This requires the use of expert judgment to identify a data timeliness period or criteria for each test or condition parameter. Any data outside of these criteria is considered potentially obsolete, and should be treated with caution.


[1] B. Sparling and J. Aubin, “Determination of Health Index for Aging Transformers in View of Substation Asset Optimization,” in TechCon 2010, Mar. 2010.

[2] B. Sparling, “Health Index Performed … Now What?” in TechCon 2014, Feb. 2014.

[3] C. Fox, A. Levitin, and T. Redman, “The notion of data and its quality dimensions,” Information Processing and Management, vol. 30, no. 1, pp. 9–19, Jan. 1994.

[4] K. Yin, S. Wang, Z. Liu, Q. Yu, and B. Zhou, “Research and development on data quality assessment management system,” in 2nd International Conference on Systems and Informatics (ICSAI), Nov. 2014, pp. 992–997.

[5] N. Askham, D. Cook, M. Doyle, H. Fereday, M. Gibson, U. Landbeck, R. Lee, C. Maynard, G. Palmer, and J. Schwarzenbach, “The Six Primary Dimensions for Data Quality Assessment: Defining Data Quality Dimensions,” The Data Management Association (DAMA) UK, Tech. Rep., Oct. 2013.

[6] Project Committee 251, “Asset management,” ISO 55000, International Standard, Mar. 2014.

[7] L. L. Pipino, Y. W. Lee, and R. Y. Wang, “Data quality assessment,” Communications of the ACM, vol. 45, no. 4, pp. 211–218, 2002.

[8] M. Bouzeghoub and V. Peralta, “A framework for analysis of data freshness,” in Proceedings of the 2004 International Workshop on Information Quality in Information Systems (IQIS’04), 2004, pp. 59–67.

[9] R. Wakelen et al., “DNO Common Network Asset Indices Methodology,” Network Asset Indices Methodology (AIM) Working Group, draft version 4, Dec. 2015.

[10] PA Consulting, “Scottish Power Energy Networks Asset Management Health Index Reporting Assurance,” Scottish Power Energy Networks, Annex, Jun. 2013.

[11] Energypeople, “SSEPD asset management and non-load related proposals for RIIO-ED1,” Scottish and Southern Energy Power Distribution, Report, Jun. 2013.

[12] M. L. Matthews, L. Rehak, A.-L. S. Lapinski, and S. McFadden, “Improving the maritime surface picture with a visualization aid to provide rapid situation awareness of information uncertainty,” IEEE Toronto International Conference on Science and Technology for Humanity (TIC-STH), Sep. 2009, pp. 533–538.

[13] A. H. Azadnia, M. Z. M. Saman, K. Y. Wong, and A. R. Hemdi, “Integration model of Fuzzy C means clustering algorithm and TOPSIS Method for Customer Lifetime Value Assessment,” IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), Dec. 2011, pp. 16–20.

[14] M. Mishra, D. Sidoti, D. F. Martinez Ayala, X. Han, G. V. Avvari, L. Zhang, K. R. Pattipati, W. An, J. A. Hansen, and D. L. Kleinman, “Dynamic resource management and information integration for proactive decision support and planning,” 18th International Conference on Information Fusion, Jul. 2015, pp. 295–302.
