With the increased adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) systems and ML communities has emerged. Steps involved in an ML pipeline, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle, can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing and querying ML artifacts. Moreover, in many use cases pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible. In the opposite direction, ML techniques are explored in core components of database systems, e.g., query optimization, indexing and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems requiring extensive human supervision, like DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Topics of particular interest in the workshop include, but are not limited to:

  • Data collection and preparation for ML applications
  • Declarative machine learning on databases, data warehouses or data lakes
  • Hybrid optimization techniques for databases and machine learning
  • Model-aware data discovery, cleaning, and transformation
  • Benchmarking ML-oriented data management systems (data augmentation, data cleaning, etc.)
  • Data management during the life cycle of ML models
  • Novel data management systems for accelerating training and inference of ML models
  • DB-inspired techniques for modeling, storage and provenance of ML artifacts
  • Learned database design, configuration and tuning
  • Machine learning for query optimization
  • Applied machine learning/deep learning for data integration
  • ML-enabled data exploration and discovery in data lakes
  • ML functionality inside DBMS


The workshop will accept both regular papers and short papers (work in progress, vision/outrageous ideas). All submissions must be prepared in accordance with the IEEE template available here. The following are the page limits (excluding references):

Regular papers: 8 pages
Short papers: 4 pages

All papers (in PDF format) should be submitted via EasyChair.


All deadlines are 11:59PM PST.

Submission deadline: 27 January 2022 (extended from 14 January 2022)
Author notification: 25 February 2022 (extended from 22 February 2022)
Camera-ready version: 10 March 2022 (extended from 8 March 2022)
Workshop day: 9 May 2022


Martin Grohe

Towards a Theory of Vector Embeddings of Graphs and Relational Data

Martin Grohe, RWTH Aachen University, Germany

ABSTRACT. Vector representations of graphs and relational data, whether hand-crafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to these forms of structured data. A wide range of methods for generating such vector embeddings has been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.

The first part of my talk will be devoted to embedding algorithms in practice. Starting with a brief overview of common embedding techniques, I will speak about a new embedding algorithm for dynamically changing relational data. In the second part of my talk, I will discuss theoretical ideas that have proved useful for analysing and designing vector embeddings and that may help us to develop a more principled view on the area.

ABOUT. Martin Grohe is a computer scientist known for his research on parameterized complexity, mathematical logic, finite model theory, the logic of graphs, database theory, and descriptive complexity theory. He is a University Professor of Computer Science at RWTH Aachen University, where he holds the Chair for Logic and Theory of Discrete Systems. Grohe won the Heinz Maier-Leibnitz Prize awarded by the German Research Foundation in 1999. He was elected as an ACM Fellow in 2017 for "contributions to logic in computer science, database theory, algorithms, and computational complexity".

Paul Groth

Data Curation and Debugging for Data Centric AI

Paul Groth, University of Amsterdam, The Netherlands

ABSTRACT. It is increasingly recognized that data is a central challenge for AI systems, whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is a need for new tools that help data teams create, curate and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research, which both takes advantage of ML to improve datasets and uses core database techniques for debugging in such complex ML pipelines.

ABOUT. Paul Groth is Professor of Algorithmic Data Science at the University of Amsterdam where he leads the Intelligent Data Engineering Lab (INDElab). He holds a Ph.D. in Computer Science from the University of Southampton (2007) and has done research at the University of Southern California, the Vrije Universiteit Amsterdam and Elsevier Labs. His research focuses on intelligent systems for dealing with large amounts of diverse contextualized knowledge with a particular focus on web and science applications. This includes research in data provenance, data integration and knowledge sharing.

Paul is scientific director of the UvA’s Data Science Center. Additionally, he is co-scientific director of two Innovation Center for Artificial Intelligence (ICAI) labs: The AI for Retail (AIR) Lab - a collaboration between UvA and Ahold Delhaize; and the Discovery Lab - a collaboration between Elsevier, the University of Amsterdam and VU University Amsterdam.


Jyoti Leeka

Query Optimizer as a Service: An Idea Whose Time Has Come!

Jyoti Leeka, Senior Research Scientist, Microsoft Research

ABSTRACT. Query optimization is a critical technology needed by all modern data processing systems. However, it is traditionally implemented in silos and is deeply embedded in different systems. Furthermore, over the years, query optimizers have become less understood and rarely touched pieces of code that are brittle to changes and very expensive to maintain, thus slowing down the pace of innovation. In this talk, I will argue that it is time to design the query optimizer as a service in modern cloud architectures. Such a design will help build a set of well-maintained optimizations that are externalized from the query engines and that could be learned (and improved) using the large workloads present in modern clouds. I will present a reference architecture for our query optimizer as a service, explaining the details of the intra-query and inter-query optimizations performed. A key enabler for the externalization of intra-query optimization is the plethora of recent machine learning-based techniques developed to improve query optimizer components, such as cardinality estimation, the cost model, and the query planner. On the other hand, externalization of inter-query optimization, also known as multi-query optimization, is motivated by numerous efforts on view materialization, physical layouts (e.g., partitioning), and most recently by Pipemizer, a data pipeline-aware optimization effort at Microsoft. Finally, I will describe our success in deploying an early version of query optimizer as a service in Cosmos at Microsoft.

ABOUT. Jyoti Leeka is a Senior Scientist currently focusing on improving the performance of Microsoft’s large-scale data-intensive production analytics clusters. These clusters comprise 300k servers running hundreds of thousands of production analytic jobs on a daily basis, written by thousands of developers, processing several exabytes of data per day, and involving several hundred petabytes of I/O. The main focus of this work is to develop algorithms to find optimal/approximate physical designs for Microsoft’s production job pipelines. Before joining GSL, Jyoti was a postdoctoral researcher at MSR for two years. Her focus was on query optimization for distributed systems.

Jie Yang

ARCH: Know What Your Machine Doesn’t Know

Jie Yang, Assistant Professor, Delft University of Technology

ABSTRACT. Despite their impressive performance, machine learning systems remain prohibitively unreliable in safety-, trust-, and ethically sensitive domains. Recent discussions in different sub-fields of AI have reached a consensus on the need for knowledge in machine learning; few discussions, however, have touched upon the diagnosis of what knowledge is needed. In this talk, I will present our ongoing work on ARCH, a knowledge-driven, human-centered, and reasoning-based tool for diagnosing the unknowns of a machine learning system. ARCH leverages human intelligence to create the domain knowledge required for a given task and to describe the internal behavior of a machine learning system; it infers the missing or incorrect knowledge of the system with its built-in probabilistic, abductive reasoning engine. ARCH is a generic tool that can be applied to machine learning in different contexts. In the talk, I will present several applications and domains in which ARCH is currently being developed and tested, including health, finance, and transport.

ABOUT. Jie Yang is an assistant professor in the Web Information Systems (WIS) group at TU Delft. He co-leads the Kappa research line on Crowd Computing & Human-Centered AI in the WIS group and the Delft AI Lab Design@Scale at the university. Previously, he was a machine learning scientist at Alexa Shopping, Amazon Research, based in Seattle, and a senior researcher at the eXascale Infolab, University of Fribourg, Switzerland. He works on human-in-the-loop approaches for reliable and trustworthy machine learning. His research contributes a new set of human-in-the-loop methods and tools for the development and evaluation of, and the interaction with, machine learning systems.


Accepted papers:

  • Datastack: Unification of Heterogeneous Machine Learning Dataset Interfaces
    Max Lübbering, Maren Pielka, Ilhamcengiz Henk and Rafet Sifa
  • Evaluating the Lottery Ticket Hypothesis to Sparsify Neural Networks for Time Series Classification
    Georg Stefan Schlake, Jan David Hüwel, Fabian Berns and Christian Beecks
  • GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
    Till Döhmen, Madelon Hulsebos, Christian Beecks and Sebastian Schelter
  • Sample-based Kernel Structure Learning with Deep Neural Networks for Automated Structure Discovery
    Alexander Graß, Till Döhmen and Christian Beecks
  • Join Path Based Data Augmentation for Decision Trees
    Andra Ionescu, Rihan Hai, Marios Fragkoulis and Asterios Katsifodimos

Malaysia (MYT) | Amsterdam (CEST) | Activity | Title | Presenter
--- | --- | --- | --- | ---
14:00 - 14:10 | 8:00 - 8:10 | Opening 🎉 | |
14:10 - 14:55 | 8:10 - 8:55 | Keynote 1 | Towards a Theory of Vector Embeddings of Graphs and Relational Data | Martin Grohe (RWTH Aachen University)
14:55 - 15:30 | 8:55 - 9:30 | Invited Talk 1 | Query Optimizer as a Service: An Idea Whose Time Has Come! | Jyoti Leeka (Microsoft Research)
15:30 - 15:40 | 9:30 - 9:40 | Coffee break ☕ | |
15:40 - 16:25 | 9:40 - 10:25 | Keynote 2 | Data Curation and Debugging for Data Centric AI | Paul Groth (University of Amsterdam)
16:25 - 16:45 | 10:25 - 10:45 | Research Talk 1 | GitSchemas: A Dataset for Automating Relational Data Preparation Tasks | Till Döhmen
16:45 - 17:05 | 10:45 - 11:05 | Research Talk 2 | Join Path Based Data Augmentation for Decision Trees | Andra Ionescu
17:05 - 17:20 | 11:05 - 11:20 | Coffee break ☕ | |
17:20 - 17:55 | 11:20 - 11:55 | Invited Talk 2 | ARCH: Know What Your Machine Doesn’t Know | Jie Yang (Delft University of Technology)
17:55 - 18:15 | 11:55 - 12:15 | Research Talk 3 | Sample-based Kernel Structure Learning with Deep Neural Networks for Automated Structure Discovery | Alexander Graß
18:15 - 18:35 | 12:15 - 12:35 | Research Talk 4 | Datastack: Unification of Heterogeneous Machine Learning Dataset Interfaces | Max Lübbering
18:35 - 18:55 | 12:35 - 12:55 | Research Talk 5 | Evaluating the Lottery Ticket Hypothesis to Sparsify Neural Networks for Time Series Classification | Georg Schlake
18:55 - 19:00 | 12:55 - 13:00 | Closing | |


Program committee:

  • Hazar Harmouch - Hasso Plattner Institute, Germany
  • Roee Shraga - Technion - Israel Institute of Technology, Israel
  • Syed Muhammad Fawad Ali - Poznan University of Technology, Poland
  • Rana Alotaibi - University of California San Diego, USA
  • Christos Koutras - Delft University of Technology, The Netherlands
  • Zoi Kaoudi - Qatar Computing Research Institute, Qatar
  • Marios Fragkoulis - University of Ioannina, Greece
  • Nikolaos Vasiloglou - relationalAI
  • Stefan Manegold - CWI, The Netherlands

Attendance Support

We are very happy to announce attendance support opportunities for students to attend DBML 2022, which allow free workshop registration for virtual attendees. Due to the limited funding opportunities, there is a strong focus on universities in developing countries (as listed by ACM).

Who can apply

First authors of accepted DBML 2022 papers who are full-time students (graduate or undergraduate) affiliated with a university can apply.

How to apply

After the paper notification, please send your paper number and student certificate to Dr. Hai (R.Hai@tudelft.nl).