Asset Management, System Intelligence
As asset managers, we are always looking for better ways to use the data we have to make data-driven decisions. We live in an unprecedented time for data availability, but we often lack data integration. Furthermore, as we seek to integrate these data sources, we are met with security hurdles in accessing and sharing them. At times we question whether it is even possible to bring this information onto a common platform efficiently enough to support the decision-making process. Is it possible to integrate these disparate data sources in this environment? If we could, what doors would it open for compound analytics and machine learning to further enhance our ability to prioritize maintenance activities?
Your manager walks in and asks you to provide a list of your top 50 transformers to be replaced, and she needs it in two days to support the five-year budgeting process. You are responsible for 4,758 transformers, up from 2,550 last year due to attrition and a hiring freeze. Time to get to work. You know some of your worst actors from experience; that's a start. You have condition assessment data like Dissolved Gas Analysis (DGA) and electrical testing. There's also work order history, outage, and Supervisory Control and Data Acquisition (SCADA) information, as well as the notebook you've been maintaining for years and a few dozen spreadsheets of various purposes that support your decision-making process. Each of these sources lives in its own software and may list assets by legacy asset IDs, serial numbers, and/or current asset IDs. Over the course of multiple mergers and software upgrades, the equipment IDs were changed in some systems and not in others. Navigating this maze will be a task in and of itself. If only all of this information could be brought together in one spot…if only.
Duke Energy's journey toward data integration set out with the goal of bringing many disparate data sources together onto a common platform, incorporating engineering analytics and machine learning to drive data-driven decisions. Multiple data sources were targeted as critical for asset management, including DGA, electrical testing, work history, outages, and SCADA. Additionally, Duke had a growing number of online monitors that were targeted for integration onto the platform.
This was to be an enterprise solution and would need to incorporate the four business regions within Duke. These regions were the relic of previous mergers, and each region had previously been a different company. While all regions had converged on the same CMMS and had largely standardized data requirements within it, the other data sources proved to be far more siloed. As each data source was explored, key differences were noted in the way data was handled. Some of the common issues encountered were:
- Equipment IDs: some sources used a legacy ID, some used the serial number, and some used the CMMS ID.
- SCADA naming conventions varied by region, with no standard on abbreviations across regions.
- The outage-information database ID was controlled by the operations group and implemented regionally.
So how was this issue addressed? The first step was to pick the master system of record. For our efforts, the CMMS was determined to be the system of record, and the CMMS ID would be the unique key to which the other sources would be linked. Because of the importance of the CMMS, a commitment was made to clean up the data and populate fields that asset subject matter experts (SMEs) had identified as critical or highly desired. If an asset was missing critical fields, it would not be passed into the HRM platform; for this reason, data cleanup was a priority of the project efforts. Additionally, the extra data would provide more features on which the machine learning algorithms could train.
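The completeness gate described above can be sketched as a simple filter. This is an illustrative sketch only; the field names are hypothetical, not Duke's actual CMMS schema.

```python
# Hypothetical sketch: gate assets on critical-field completeness before
# loading them into the analytics platform. Field names are illustrative.

CRITICAL_FIELDS = ["cmms_id", "serial_number", "manufacturer", "install_year"]

def is_loadable(asset: dict) -> bool:
    """An asset passes only if every critical field is present and non-empty."""
    return all(asset.get(f) not in (None, "") for f in CRITICAL_FIELDS)

assets = [
    {"cmms_id": "T-1001", "serial_number": "SN-88",
     "manufacturer": "AcmeCo", "install_year": 1998},
    {"cmms_id": "T-1002", "serial_number": "",          # missing critical field
     "manufacturer": "AcmeCo", "install_year": 2004},
]

loadable = [a for a in assets if is_loadable(a)]
rejected = [a for a in assets if not is_loadable(a)]   # queued for data cleanup
```

Assets that fail the gate become the cleanup backlog rather than silently entering the platform with gaps.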
The next step was to review each source, noting regional differences, to see how it matched up to the system-of-record ID. The best-case scenario was that the source already used the CMMS ID as its key, but this was inconsistent between regions and between sources. Several strategies were employed to deal with these shortcomings:
Clean up the data automatically in the source
An example would be to write a program that matches the serial number used as the key in the electrical-testing database to the serial-number field of our CMMS, the system of record, and then updates the ID field within the electrical-test database. For some sources, this wasn't an option.
- Pro: Can be done relatively quickly compared to manual updating.
- Con: Updating the ID can affect the field technicians performing the tests, because the asset ID they are used to seeing changes.
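The automated cleanup described above can be sketched as a lookup-and-backfill pass. This is a minimal illustration with in-memory dictionaries standing in for the two databases; the record layout is assumed.

```python
# Illustrative sketch: match the serial number keying the electrical-testing
# records to the CMMS serial-number field, then back-fill the CMMS ID.

cmms = {  # serial_number -> cmms_id (CMMS is the system of record)
    "SN-88": "T-1001",
    "SN-91": "T-1002",
}

test_records = [
    {"serial_number": "SN-88", "cmms_id": None},
    {"serial_number": "SN-91", "cmms_id": None},
    {"serial_number": "SN-99", "cmms_id": None},  # no match in the CMMS
]

unmatched = []
for rec in test_records:
    cmms_id = cmms.get(rec["serial_number"])
    if cmms_id is None:
        unmatched.append(rec)      # flag for manual review
    else:
        rec["cmms_id"] = cmms_id   # update the ID field in the test database
```

Records with no serial-number match are flagged rather than guessed at, which is where the manual-cleanup strategy below picks up.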
Brute force: clean up the data manually in the source
This approach was used when small amounts of data needed to be processed, but it was not practical when tens of thousands of fields had to be entered.
- Pro: Convenient for small quantities of data.
- Con: Slow, with increased potential for human error; not suited to large quantities of data.
Extract, Transform, Load (ETL) rules
These are programming rules that transform the data after it is extracted from the source but before it is loaded into the datamart. An example transformation is removing special characters from serial numbers in source system A, because source system B does not allow special characters and they had already been removed from that source. Without the transformation, many serial-number matches would be lost.
- Pro: Performed automatically every time data is pulled; ETL is a convenient, standard protocol for transferring data.
- Con: ETLs are typically managed by IT and lack transparency to other organizations.
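A transformation rule like the one above amounts to normalizing keys before matching. A hedged sketch, with the normalization rule assumed to be "letters and digits only, uppercased":

```python
import re

# Sketch of the ETL transformation described above: strip special characters
# from serial numbers in source system A so they match source system B,
# which never allowed them.

def normalize_serial(serial: str) -> str:
    """Keep only letters and digits, uppercased, so keys compare cleanly."""
    return re.sub(r"[^A-Za-z0-9]", "", serial).upper()

# Without the transform these two keys would not match; with it they do.
a = normalize_serial("sn-1234/7")   # source system A, with special characters
b = normalize_serial("SN12347")     # source system B, already stripped
```

The same function would run inside the transform step on every extract, so the datamart only ever sees normalized keys.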
Cross-reference table in the datamart
A table could be created in our HRM datamart that houses the CMMS asset ID alongside the source asset ID.
- Pro: With the data in a datamart table, SQL queries can conveniently join the CMMS to the source.
- Con: Lacks transparency for the business because the tables are managed by IT; would likely require a user interface (UI) maintained by someone other than IT.
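The cross-reference join can be illustrated with an in-memory SQLite database. Table and column names here are assumptions for illustration, not the actual datamart schema.

```python
import sqlite3

# Minimal sketch of the cross-reference table approach: a datamart table maps
# each source-system asset ID to the CMMS ID, so a SQL join ties source
# records back to the system of record.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE xref (source_id TEXT, cmms_id TEXT);
    CREATE TABLE dga_results (source_id TEXT, h2_ppm REAL);
    INSERT INTO xref VALUES ('LEG-42', 'T-1001');
    INSERT INTO dga_results VALUES ('LEG-42', 57.0);
""")

rows = con.execute("""
    SELECT x.cmms_id, d.h2_ppm
    FROM dga_results d
    JOIN xref x ON x.source_id = d.source_id
""").fetchall()
```

Every source table with its own legacy key gets the same treatment: one join through the cross-reference table lands it on the CMMS ID.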
Cross-reference specification in the CMMS
A field could be added to the CMMS that contains the unique key used for that asset in the source system.
- Pro: Can be managed by the Transmission organization rather than IT.
- Con: Requires a special process to upload large quantities of data.
The next step was to address data security concerns, particularly for SCADA load information and the condition-based monitors, which report back through the substation SCADA RTUs. For illustrative purposes, we will describe a common means of providing protective layers within a network. From a network standpoint, the RTUs communicate with the Operations Center in a Classified Militarized Zone (CMZ). A data historian in the Demilitarized Zone (DMZ) is configured to receive data from the Operations Center through a firewall interface with limited open ports. That historian passes approved data to a data historian on the LAN side, which acts as a server providing process measurements to Duke engineers. VLANs can also be used to further segregate the network. HRM interfaces with the LAN-side historian server to pull back the cross-referenced tags and tie them to a CMMS ID.
Now that the data is loaded into our datamart, it is passed to our software vendor. With the data in hand, engineered analytics were used to create health indices around asset condition, and these indices were rolled up into overall health scores. Metrics such as rates of change, percent deviation from baseline, heavily loaded days, health-scoring indexes, and criticality-scoring indexes are brought together to provide an engineered-analytic risk score. Asset subject matter experts were involved from the beginning and were crucial in setting up the health and criticality indexes, and the way they roll up into the risk score, so that it mimicked the logic they typically used but never had the time to apply to all assets when asked to rank them manually. These scores are consistent across all regions and allow assets to be ranked on overall health condition.
Analytics were also configured around metrics related to the criticality of the asset. Metrics like customer count, substation type, environmental factors, etc., roll up into an overall criticality score for the asset.
Finally, the health score is brought together with the criticality score to provide an overall risk score for the asset. Rolling up the metrics in this manner allows for reporting and ranking on specific health analytics or on the overall risk, health, and criticality.
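The rollup described above can be sketched as weighted aggregation. The specific indices, weights, and the health-criticality combination rule below are hypothetical stand-ins; the actual scoring logic was configured by the asset SMEs.

```python
# Illustrative rollup, assuming simple weighted averages: condition indices
# roll up into a health score, criticality metrics into a criticality score,
# and the two combine into an overall risk score. All weights are hypothetical.

def weighted_score(metrics: dict, weights: dict) -> float:
    total_w = sum(weights[k] for k in metrics)
    return sum(metrics[k] * weights[k] for k in metrics) / total_w

health = weighted_score(                        # 0-100, higher = healthier
    {"dga_index": 80.0, "electrical_test_index": 60.0, "loading_index": 90.0},
    {"dga_index": 0.5, "electrical_test_index": 0.3, "loading_index": 0.2},
)
criticality = weighted_score(                   # 0-100, higher = more critical
    {"customer_count_index": 70.0, "substation_type_index": 50.0},
    {"customer_count_index": 0.6, "substation_type_index": 0.4},
)

# One possible combination rule: risk grows with poor health AND criticality.
risk = (100.0 - health) * criticality / 100.0
```

Keeping the intermediate health and criticality scores, rather than only the final risk number, is what enables reporting and ranking at each level of the rollup.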
To further enhance the engineered analytics and provide additional insights, machine learning models were implemented. We opted for a supervised machine learning model built on top of the engineered analytics. In this method, a sample set of assets was assembled, some of which were failed assets (with condition assessments available) and some of which were healthy. Data scientists ran this data through machine learning algorithms that looked for patterns indicative of the failed assets.
It was important that a variety of makes, models, and vintages were provided for both healthy and failed assets. Overrepresentation in the sample set leads to false patterns being recognized. For instance, if all failed assets provided were of a specific make and model, the machine learning algorithm would see that make and model as a prime indicator of failure, especially if no healthy assets of that make and model were provided. Failed-asset information was moved to a separate repository to preserve history and to facilitate standardized NATF failure codes. The accuracy of this information was also critical to pattern recognition by the machine learning techniques.
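A simple sanity check for the overrepresentation problem is to look for makes/models that appear among failures but never among healthy assets. This sketch is illustrative; the make/model strings are invented.

```python
from collections import Counter

# Hypothetical check for the overrepresentation issue described above: if a
# make/model dominates the failed class with no healthy counterparts, the
# model may learn the make itself as a failure indicator.

samples = [
    ("AcmeCo X1", "failed"), ("AcmeCo X1", "failed"),
    ("AcmeCo X1", "healthy"),
    ("GammaInc Q2", "failed"),     # no healthy GammaInc Q2 in the sample set
    ("BetaCorp Z9", "healthy"),
]

failed = Counter(m for m, label in samples if label == "failed")
healthy = Counter(m for m, label in samples if label == "healthy")

# Makes that appear among failures but never among healthy assets.
suspect = [m for m in failed if healthy[m] == 0]
```

Suspect makes would prompt either gathering healthy examples of that make or down-weighting it before training.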
Once the models were trained, they were applied to the asset data in HRM yielding a machine learning health score. To measure the success of the machine learning model Precision and Recall metrics were used. Before going further, we need to define some terms:
- True Positive: an asset predicted to fail within the two-year time frame that did fail.
- True Negative: an asset predicted not to fail within two years that did not fail.
- False Positive: an asset predicted to fail within the two-year time frame that did not fail.
- False Negative: an asset predicted not to fail within two years that did fail.
Precision measures how often the model's failure predictions are correct: mathematically, it is the number of True Positives divided by the quantity of True Positives plus False Positives. Recall measures how many of the actual failures the model catches without missing any in the sample set: mathematically, it is the number of True Positives divided by the quantity of True Positives plus False Negatives. To evaluate these values, the model was backtested on a wider set of historic data, including failed assets. The model continues to be improved as our data set grows.
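The two definitions can be made concrete with a small worked example. The prediction/outcome values below are hypothetical, not actual backtest results.

```python
# Worked example of the precision and recall definitions above, applied to a
# hypothetical backtest of predictions against known outcomes.

def precision_recall(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))          # hit
    fp = sum(p and not a for p, a in zip(predicted, actual))      # false alarm
    fn = sum(a and not p for p, a in zip(predicted, actual))      # miss
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# True = predicted (or observed) to fail within two years.
predicted = [True, True, True, False, False, False]
actual    = [True, True, False, True, False, False]

p, r = precision_recall(predicted, actual)
# 2 hits, 1 false alarm, 1 miss: precision = 2/3, recall = 2/3.
```

Note the trade-off: flagging more assets raises recall but tends to lower precision, which is why both metrics were tracked.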
Improving the workflow for our subject matter experts was another primary goal. With the necessary data integrated into a common platform, the next item on the agenda was how to identify at-risk equipment and raise it to the attention of the SMEs. To accomplish this goal, an alert system was put in place. SMEs identified specific condition-assessment results that would typically trigger them to review the assessment with greater scrutiny. The alerts trigger a state change for the asset to 'Risk Identified' in HRM. This allows an SME to filter the assets within their area of responsibility for 'Risk Identified' status and know when an asset has had a new condition assessment containing a predefined anomaly.
The SME then reviews the asset within HRM, where they have access to other condition assessments, work order history, outage information, SCADA, and electrical testing to cross-reference and make a comprehensive decision on the condition of the asset. If the asset is at risk, the state is changed to 'Monitor' and a special work order, classified as a Watchlist Work Order, is generated to take appropriate action on the asset based on the assessment. The Watchlist classification, along with the priority, communicates the urgency of the work order to our work management organization. If, on the other hand, the SME determines that the 'Risk Identified' condition within HRM is not indicative of further concern, then the asset can be taken to a state of 'Stable.' The asset will remain in this state until another condition assessment triggers an alert.
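The review workflow amounts to a small state machine. The state names come from the text above; the transition table and event names are illustrative assumptions.

```python
# Hedged sketch of the review workflow: an alert moves an asset to
# 'Risk Identified'; the SME review sends it to 'Monitor' (with a Watchlist
# work order) or back to 'Stable'. Event names are hypothetical.

TRANSITIONS = {
    ("Stable", "alert"): "Risk Identified",
    ("Risk Identified", "sme_confirms_risk"): "Monitor",
    ("Risk Identified", "sme_clears"): "Stable",
    ("Monitor", "alert"): "Risk Identified",
}

def next_state(state: str, event: str) -> str:
    # Unknown event/state pairs leave the asset where it is.
    return TRANSITIONS.get((state, event), state)

state = "Stable"
state = next_state(state, "alert")              # new anomaly in an assessment
state = next_state(state, "sme_confirms_risk")  # SME confirms, Watchlist WO cut
```

Encoding the transitions explicitly makes the filterable 'Risk Identified' queue a direct query on asset state.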