OptSCORE

Selbstoptimierte Effizienz, Sicherheit und Robustheist für aktive Replikation

von 2015 bis 2021

Förderung

DFG HA 2207/10-1 und 10-2

OptSCORE ist ein Forschungsprojekt des Instituts für Verteilte Systeme und der Juniorprofessur für Sicherheit in Informationssystemen an der Universität Passau.

Vernetzte IT-Systeme haben sehr häufig hohe Anforderungen an Zuverlässigkeit, Verfügbarkeit und Sicherheit. Replikation von Diensten ist ein grundlegender Mechanismus, diese Anforderungen zu erfüllen. Um gleichzeitig Skalierbarkeit zu erzielen, gibt es vor allem im Bereich von Speicherdiensten viele Ansätze, die mit abgeschwächten Konsistenzanforderungen auskommen. Manche Systemdienste benötigen jedoch stärkere Konsistenzgarantien, z.B. Koordinierungsdienste wie ZooKeeper, den Namenode von HDFS oder Dienste zum Identity Management. Für byzantinische Fehler sind schwache Konsistenzmodelle ebenfalls ungeeignet, da divergente Werte nicht von fehlerhaften unterscheidbar sind. Das OptSCORE Projekt konzentriert sich daher auf State-Machine-Replication (SMR), ein Replikationsverfahren für Dienste mit starken Konsistenzanforderungen. SMR basiert auf verteilter Einigung sowie auf deterministischen Ausführungen. Die oft sequentielle und damit deterministische Ausführung ist mit der zunehmenden Verbreitung von Mehrkernsystemen inakzeptabel, was zur Entwicklung deterministischer Scheduler geführt hat.

In der Praxis entsteht ein großer Raum konfigurierbarer Parameter angefangen von der Auswahl unterschiedlicher Protokolle bis zur Festlegung von Timeout-Werten. Wir werden zunächst Art und Umfang des Einflusses von Protokollen, Algorithmen und Parametern auf der einen und von Umgebungsbedingungen und Anwendungsverhalten auf der anderen Seite auf das Gesamtverhalten eines SMR-Systems zusammentragen. Wir konzentrieren uns dabei auf die Gruppenkommunikation und das deterministische Scheduling nebenläufiger Ausführungen. Als zweiter Schritt sollen Konzepte entworfen werden, die Parameter automatisch abzuleiten und ständig zu optimieren.

Eine Prototypimplementierung für ein rekonfigurierbares und selbstadaptierendes Gruppenkommunikationssystem sowie für einen selbstoptimierenden deterministischen Scheduler soll entworfen und in ein Framework für replizierte Dienste integriert werden, mit dem schließlich praxisnahe Evaluationen möglich werden. Wir erwarten, dass sich selbstadaptierende Systeme entsprechend besser verhalten als starr konfigurierte oder nicht konfigurierbare Systeme. Das Vorhaben ist damit ein Grundlagenbeitrag um letztendlich SMR-basierte Systeme näher an die Praxis heranzuführen.

Zugehörige Publikationen

2023

Köstler, J., Reiser, H.P., Hauck, F.J. and Habiger, G. 2023. Fluidity: location-awareness in replicated state machines. 38th ACM/SIGAPP Symp. on Appl. Comp. – SAC (Mar. 2023).
In planetary-scale replication systems, the overall response delay is greatly influenced by the geographical distances between client and server nodes. Current systems define the replica locations statically during startup time. However, the selected locations might be suboptimal for the clients, and the client request origin distribution may change over time, so a different replica placement may provide lower overall request latencies. In this work, we propose a locationaware replicated state machine that is able to adapt the geographic location of its replicas dynamically during runtime to locations geographically closer to client request origins. Our prototype is able to observe emerging optimization potentials and to reduce the overall request latency for the majority of clients by adapting its replica locations to the time-dependent optimum placement during real-world use case evaluations, whereby the absolute performance gain is dependent on the respective usage scenario.

2022

Berger, C., Eichhammer, P., Reiser, H.P., Domaschka, J., Hauck, F.J. and Habiger, G. 2022. A survey on resilience in the IoT: taxonomy, classification, and discussion of resilience mechanisms. ACM Comp. Surv. 54, 7 (2022), 147:1-147:39.
Internet-of-Things (IoT) ecosystems tend to grow both in scale and complexity, as they consist of a variety of heterogeneous devices that span over multiple architectural IoT layers (e.g., cloud, edge, sensors). Further, IoT systems increasingly demand the resilient operability of services, as they become part of critical infrastructures. This leads to a broad variety of research works that aim to increase the resilience of these systems. In this article, we create a systematization of knowledge about existing scientific efforts of making IoT systems resilient. In particular, we first discuss the taxonomy and classification of resilience and resilience mechanisms and subsequently survey state-of-the-art resilience mechanisms that have been proposed by research work and are applicable to IoT. As part of the survey, we also discuss questions that focus on the practical aspects of resilience, e.g., which constraints resilience mechanisms impose on developers when designing resilient systems by incorporating a specific mechanism into IoT systems.

2021

Köstler, J., Reiser, H.P., Habiger, G. and Hauck, F.J. 2021. SmartStream: towards Byzantine resilient data streaming. 36th Ann. ACM Symp. on Appl. Comp. – SAC (Virtual Event, Republic of Korea, Mar. 2021), 213–222.
Data streaming platforms connect heterogeneous services through the publish-subscribe paradigm. Currently available platforms provide protection against crash faults, but are not resistant against Byzantine faults like arbitrary hardware faults and intrusions. State machine replication can provide this protection, but the higher resource requirements and the more elaborated communication primitives usually result in a higher overall complexity and a non-negligible performance degradation. This is especially true for data streaming if the default textbook approach of integrating the service into a replicated state machine is followed without further adaptions. The standard state management with state logs and snapshots and without any partitioning scheme limits both performance and scalability in a way those systems become unusable in practice. That is why we propose SmartStream, a topic-based Byzantine fault-tolerant data streaming platform that harmonizes the competing concepts of both systems and leverages the specific characteristics of data streaming, namely the append-only semantics of the application state and its partitionable structure. We show its effectiveness in a prototype implementation and evaluate its performance. The evaluation results show a moderate drop in system throughput when compared to state-of-the-art data streaming platforms like Apache Kafka, but reasonable overall performance considering the stronger resilience guarantees.

2020

Habiger, G., Hauck, F.J., Reiser, H.P. and Köstler, J. 2020. Self-optimising Application-agnostic Multithreading for Replicated State Machines. Int. Symp. on Rel. Distr. Sys. – SRDS (2020), 165–174.
State-machine replication (SMR) is a well-known approach for fault-tolerant services demanding fast recovery. It is not easy, however, to parallelise SMR in order to exploit modern multicore architectures. Two main approaches have been extensively studied; one focusing on request-level concurrency using prior knowledge, the other utilising application-agnostic and lock-level deterministic scheduling. We show that significant performance improvements for the latter approach require deterministic scheduler configurations to be dynamically adapted to the current application load during runtime. First, we summarise current research on parallel SMR execution. Second, an analysis of obstacles in lock-level deterministic multithreading approaches shows how static scheduler configurations can lead to poor performance when load on the system varies over time. Third, we present a simple yet effective automatic adaptation solution, which provides significantly better overall system behaviour compared to static configurations. This is demonstrated by evaluations using a full system setup.

2019

Habiger, G. and Hauck, F.J. 2019. Systems support for efficient state-machine replication. Tagungsband des FB-SYS Herbsttreffens 2019 (Osnabrück, 2019).

2018

Habiger, G., Hauck, F.J., Köstler, J. and Reiser, H.P. 2018. Resource-Efficient State-Machine Replication with Multithreading and Vertical Scaling. Proc. of the 14th Eur. Dep. Comp. Conf. (EDCC) (Iaşi, Romania, Sep. 2018).
State-machine replication (SMR) enables transparent and delayless masking of node faults. It can tolerate crash faults and malicious misbehavior, but usually comes with high resource costs, not only by requiring multiple active replicas, but also by providing the replicas with enough resources for the expected peak load. This paper presents a vertical resource-scaling solution for SMR systems in virtualized environments, which can dynamically adapt the number of available cores to current load. In similar approaches, benefits of CPU core scaling are usually small due to the inherent sequential execution of SMR systems in order to achieve determinism. In our approach, we utilize sophisticated deterministic multithreading to avoid this bottleneck and experimentally demonstrate that core scaling then allows SMR systems to effectively tailor resources to service load, dramatically reducing service provider costs.

2016

Hauck, F.J., Habiger, G. and Domaschka, J. 2016. UDS: a novel and flexible scheduling algorithm for deterministic multithreading. Proc. of the 35th Int. Symp. on Reliable Distrib. Sys. (SRDS) (Budapest, Hungry, Sep. 2016).
Hauck, F.J. and Domaschka, J. 2016. UDS: a unified approach to determinisitic multithreading. 36th Int. Conf. on Distrib. Comp. Sys. (ICDCS) (Nara, Japan, Jun. 2016).
Habiger, G., Hauck, F.J., Köstler, J. and Reiser, H.P. 2016. Vertikale Skalierung für aktiv replizierte Dienste in Cloud-Infrastrukturen.