The research group "Dependability of Computing Systems", headed by Prof. Dr. Klaus Echtle, works in the field of fault-tolerant parallel and distributed systems. Complete processors, including their RAM, serve as redundant units that replace each other in case of failure. The redundancy is activated by fault tolerance techniques such as majority voting, interactive consistency, backward error recovery based on checkpointing, forward error recovery by exception handling, reconfiguration, and others. Each technique is implemented by appropriate protocols that are executed during normal operation and on fault occurrence.

Current research aims at increasing the efficiency of fault tolerance protocols in order to reduce execution time and resource overhead. Expensive redundancy must not decrease performance too much by slowing down the interactions among processors. The main approaches to achieving both reliability and performance are reductions in the number of redundant processors and exchanged messages, concurrency among the messages, prevention of critical timeouts, use of digital signatures, and, last but not least, careful adaptation of the fault model to real fault scenarios and certification requirements.

The design of efficient protocols as countermeasures against faults usually leads to significantly increased system complexity, which causes severe problems for system verification and validation. Solutions are developed in our second research area: testing fault tolerance protocols by fault injection. Based on the analysis of a Petri-net-like system model, we determine the set of faults to be inserted into the communication system by a software-implemented fault injector.

Research in both fields, the design of efficient fault tolerance protocols and testing by fault injection, requires implementations for computing systems with modern inter-processor communication. In this way the effectiveness of our algorithms and heuristics can be evaluated experimentally. A parallel computing system enables experiments on closely coupled multiprocessor systems and can also serve as a testbed emulating the properties of a loosely coupled distributed system.
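
As a minimal illustration of one of the techniques named above, majority voting over the results of redundant processors might be sketched as follows. This is not the group's protocol. The function name `majority_vote` and the convention of returning `None` when no strict majority exists are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of redundant units.

    Hypothetical helper: `results` holds one result per redundant processor.
    Returns None if the input is empty or no value has more than half the votes.
    """
    if not results:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(results).most_common(1)[0]
    # A strict majority needs more than half of all votes.
    return value if count > len(results) // 2 else None
```

With three redundant units, a single faulty result is outvoted: `majority_vote([42, 42, 7])` yields `42`, while three mutually different results yield `None`, signalling that the fault cannot be masked by voting alone.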
Keywords:
The development of our research since the early eighties is presented in an article of our interest group on fault tolerance and dependability (in German).