Fields of Research

Dependability of Computing Systems
Verlässlichkeit von Rechensystemen

The research group "Dependability of Computing Systems", headed by Prof. Dr. Klaus Echtle, works in the field of fault-tolerant parallel and distributed systems. Complete processors, including their RAM, serve as redundant units that replace each other in case of failure. The redundancy is activated by fault tolerance techniques such as majority voting, interactive consistency, backward error recovery based on checkpointing, forward error recovery by exception handling, reconfiguration, and others. Each technique is implemented by appropriate protocols that are executed during normal operation and on fault occurrence.
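As an illustration of one of the techniques named above, majority voting over the results of redundant processors might be sketched as follows. This is a minimal sketch and not the group's protocol: the function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas.

    Hypothetical helper: returns None if the input is empty or if no
    value is reported by more than half of the replicas.
    """
    if not results:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None
```

With three replicas, a single faulty result is masked: `majority_vote([42, 42, 7])` yields the value reported by the two agreeing replicas, while three mutually different results yield `None`, which a caller could treat as a detected but unmasked fault.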

Current research aims at increasing the efficiency of fault tolerance protocols in order to reduce execution time and resource overhead. Expensive redundancy must not decrease performance too much by slowing down the interactions among processors. The main approaches to achieving both reliability and performance include reducing the number of redundant processors and exchanged messages, sending messages concurrently, preventing critical timeouts, using digital signatures, and, last but not least, carefully adapting the fault model to real fault scenarios and certification requirements.

The design of efficient protocols as countermeasures against faults usually leads to significantly increased system complexity, which causes severe problems for system verification and validation. Solutions are developed in our second research area: testing fault tolerance protocols by fault injection. Based on the analysis of a Petri-net-like system model, we determine the set of faults to be inserted into the communication system by a software-implemented fault injector.
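The idea of a software-implemented fault injector acting on the communication system can be sketched roughly as below. This is only an illustration under stated assumptions: the names `Fault` and `FaultInjector`, the two fault kinds ("omit" and "corrupt"), and the selection of messages by send order are all hypothetical, and the derivation of the fault set from the system model is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Fault:
    message_index: int  # 0-based position of the message in send order
    kind: str           # assumed fault kinds: "omit" or "corrupt"


class FaultInjector:
    """Hypothetical wrapper that sits between a sender and the channel."""

    def __init__(self, deliver: Callable[[bytes], None],
                 faults: Iterable[Fault]) -> None:
        self._deliver = deliver
        self._faults: Dict[int, str] = {f.message_index: f.kind for f in faults}
        self._sent = 0

    def send(self, payload: bytes) -> None:
        kind = self._faults.get(self._sent)
        self._sent += 1
        if kind == "omit":
            return  # message is silently dropped
        if kind == "corrupt" and payload:
            # Flip all bits of the first byte to model a corrupted message.
            payload = bytes([payload[0] ^ 0xFF]) + payload[1:]
        self._deliver(payload)
```

In a test run, the protocol under test would send through `FaultInjector.send` instead of the real channel, so that a chosen set of faults is applied deterministically and the protocol's reaction can be observed.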

Research in both fields, the design of efficient fault tolerance protocols and testing by fault injection, requires implementations for computing systems with modern inter-processor communication. In this way, the effectiveness of our algorithms and heuristics can be evaluated experimentally. A parallel computing system enables experiments on closely coupled multiprocessor systems and can also serve as a testbed that emulates the properties of a loosely coupled distributed system.

Keywords:

  • Fault Tolerance
  • Fault-Tolerant Distributed Systems
  • Fault-Tolerant Automation Systems for Safety-Critical Applications
  • Software-Implemented Fault Tolerance
  • Efficient Fault-Tolerant Communication Protocols
  • Fault-Masking and Byzantine Agreement Protocols
  • Analysis by Modeling and Exploration of the State Space
  • Partial Reachability Analysis Focused on Violations of Fault Tolerance Properties
  • Analysis by Software-Implemented Fault Injection into the Processor
  • Automatic Design of Fault Tolerance Techniques by Genetic Algorithms

The development of our research since the early 1980s is presented in an article by our interest group on fault tolerance and dependability (in German).