24 November 2014
Title: Do I need to fix a failed component now, or can I wait until tomorrow?
Ideally, in systems where failures are monitored and sensed, an engineer would fix a failure immediately. But this might not be possible because of limited resources or physical distance to a device. So how does an engineer prioritise and make the best use of their resources, while still ensuring the service operates within acceptable levels of risk of failure?
We hypothesise that predictive event-based modelling and reasoning with a stochastic temporal logic can inform decision making when failures occur. We show, with a real industrial case study, that by relating the status of assets to service behaviour in a continuous-time Markov chain (CTMC) model, we can quantify the risk of service failure, both now and for a range of time frames, future failure rates, and interventions. We reason both about how the system is designed to meet service requirements and about how it actually meets them, when the models are calibrated with rates derived from historical field data.
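As a rough illustration of the kind of question such a model can answer, the sketch below computes a time-bounded failure probability for a toy CTMC. It is not the model from the case study: the three-state structure (two redundant assets, service down when both have failed), the rates, and the function names `transient` and `risk_of_service_failure` are all assumptions made for illustration. The numerical method is uniformisation, a standard technique for transient CTMC analysis, chosen here only because it needs nothing beyond the Python standard library.

```python
import math

def transient(Q, pi0, t, eps=1e-12, max_terms=10_000):
    """State distribution of a CTMC at time t, computed by uniformisation.

    Q is the generator matrix (rows sum to zero), pi0 the initial distribution.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))      # uniformisation rate
    if q == 0.0 or t == 0.0:
        return list(pi0)
    # Uniformised discrete-time chain: P = I + Q / q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                # Poisson(q*t) probability of 0 jumps
    acc = weight
    v = list(pi0)                            # distribution after k jumps
    result = [weight * x for x in v]
    for k in range(1, max_terms):
        if 1.0 - acc <= eps:                 # remaining Poisson mass is negligible
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k                  # Poisson probability of k jumps
        acc += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

def risk_of_service_failure(lam, mu, t):
    """Probability that the service goes down within time t, starting degraded.

    Hypothetical toy model (an assumption, not the case-study model):
      state 0: both assets up
      state 1: one asset failed (service degraded but running)
      state 2: both assets failed (service down; made absorbing so that the
               result is the probability of reaching it within t)
    lam is the per-asset failure rate, mu the repair rate (mu = 0: no repair).
    """
    Q = [
        [-2.0 * lam, 2.0 * lam, 0.0],
        [mu, -(mu + lam), lam],
        [0.0, 0.0, 0.0],
    ]
    return transient(Q, [0.0, 1.0, 0.0], t)[2]

if __name__ == "__main__":
    lam = 0.01   # assumed failures per hour, illustrative only
    horizon = 24.0
    wait = risk_of_service_failure(lam, 0.0, horizon)     # leave it until tomorrow
    fix_now = risk_of_service_failure(lam, 0.5, horizon)  # assumed repair rate
    print(f"risk within {horizon:.0f} h if we wait:     {wait:.4f}")
    print(f"risk within {horizon:.0f} h if repair starts: {fix_now:.4f}")
```

Comparing the two numbers for different horizons and repair rates is one simple way to frame the "now or tomorrow" decision. In a stochastic temporal logic such as CSL, the same quantity would typically be written as a time-bounded reachability property; calibrating `lam` and `mu` from field data rather than design figures would change the answer, which is the distinction the abstract draws.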