Data Management & Engineering

Data Management is one of the central challenges in developing modern software systems. The need for sophisticated Data Management is even more pronounced in the current era of Artificial Intelligence and Big Data-based systems, whose data requirements are more demanding than those traditional Data Management had to consider.

Data Engineering

In Data Engineering, we focus on preparing data for deployment or use in complex AI- and data-driven systems. This covers, for example, discovering, cleaning, and transforming data, as well as integrating data from heterogeneous sources. There is also a focus on (domain-specific) metadata creation and management. Furthermore, data biases and the societal issues that can arise from them, such as misrepresentation and unfairness, become a focus area. Data Engineering topics are often seen in the context of their application domains, such as Digital Humanities and medicine, but also business applications like banking.

Scalable Data Management

In Scalable Data Management, the focus is on how to cope with the ever-increasing demand for storage and processing power by scaling data operations. This covers, for example, methods for stream processing, but also flexible distribution schemes and the deployment of scalable AI models.


  • Amalur - Next-generation Data Integration in Data Lakes

    With the Amalur project, we believe that this is the right moment to revisit all the components of classic data integration (DI) systems and to see how these fit into modern data lakes that are meant to support linear algebra as a first-class citizen.

  • Valentine - Schema Matching for Data Discovery

    Valentine is an extensible open-source project to execute and organize large-scale automated matching processes on tabular data, either for experimentation or for deployment on real-world data. Valentine was published at ICDE 2021 and demoed at VLDB 2021.

  • Clonos - Consistent Causal Recovery for Highly-Available Streaming Dataflows

    Clonos is a fault-tolerance approach that achieves fast operator recovery with exactly-once guarantees and high availability by instantly switching to passive standby operators. Clonos enforces causally consistent recovery, including output deduplication, by tracking nondeterminism within the system through causal logging. Clonos was presented in a SIGMOD 2021 paper.

  • Transactions on Stateful Functions-as-a-Service

    This project deals with executing transactions (two-phase commit and SAGAs) on Stateful Functions-as-a-Service systems such as Apache Flink's Statefun. This work received the Best Paper Award at ACM DEBS 2021.

  • Optimizing ML Inference Queries under Constraints

    Optimizing ML inference queries is hard, especially when constraints (e.g., on accuracy or execution time) have to be satisfied and the complexity of the inference query increases. This project aims to tackle the constraint-based ML inference query optimization problem. The proposed optimizer aims at high effectiveness and can navigate a large search space to find optimal query plans on various model zoos.
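
To give a flavor of the constraint-based ML inference query optimization problem, the following is a minimal sketch and not the project's optimizer. It assumes a query with several inference steps, each served by its own model zoo, and it picks one model per step so that total cost is minimal while an accuracy constraint holds. The `Model` class, the `cheapest_plan` function, the example model names and numbers, and the assumption that plan accuracy is the product of per-model accuracies are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Model:
    """Hypothetical model-zoo entry with estimated quality and cost."""
    name: str
    accuracy: float  # estimated accuracy in [0, 1]
    cost_ms: float   # estimated execution time per input item


def cheapest_plan(
    zoos: Sequence[Sequence[Model]], min_accuracy: float
) -> Optional[Tuple[Model, ...]]:
    """Return the lowest-cost plan (one model per step) meeting the constraint.

    Assumption: step errors are independent, so the plan accuracy is the
    product of the per-model accuracies. Exhaustive enumeration is used
    here; it is only feasible for very small search spaces.
    """
    best_plan = None
    best_cost = float("inf")
    for plan in product(*zoos):
        accuracy = 1.0
        for model in plan:
            accuracy *= model.accuracy
        cost = sum(model.cost_ms for model in plan)
        if accuracy >= min_accuracy and cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan


# Two illustrative model zoos (made-up numbers).
detectors = [Model("small_detector", 0.80, 5.0), Model("large_detector", 0.95, 40.0)]
classifiers = [Model("small_classifier", 0.85, 3.0), Model("large_classifier", 0.98, 25.0)]

plan = cheapest_plan([detectors, classifiers], min_accuracy=0.75)
print([model.name for model in plan] if plan else "no feasible plan")
```

The search space in this sketch grows exponentially with the number of steps, which is one reason a dedicated optimizer is needed once queries and model zoos become larger.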