3rd International Workshop on Databases and Machine Learning

in conjunction with ICDE 2024 | May 13 2024


After the increased adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) systems and ML communities emerged. Steps involved in ML pipelines, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.

In the opposite direction, ML techniques are explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems with high human supervision like DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Information on the previous workshops is available at DBML 2023 and DBML 2022.

For any questions regarding the workshop, please contact: dbml24chairs@gmail.com

Topics of particular interest for the workshop include, but are not limited to, the following two categories:

  • ML for Data Management and DBMS
      • Learned data discovery, cleaning, and transformation
      • ML-enabled data exploration and discovery in data lakes
      • Learned database design, configuration, and tuning
      • ML for query optimization, indexing, partitioning
      • Natural language enablement (e.g., queries, result summarization, chatbot interfaces)
      • Pretrained models (e.g., Large Language Models) for databases and data management
      • Representation learning for data cleaning, preprocessing, and management
      • Benchmarking ML-oriented data management (data augmentation, data cleaning, etc.) or DBMSs
  • Data Management for ML
      • Data collection and preparation for ML applications
      • Data quality and provenance for ML
      • Novel data management systems for accelerating training and inference of ML models
      • Data and metadata management for the ML lifecycle
      • DB-inspired techniques for modeling, storage, and provenance of ML artifacts


All deadlines are 11:59PM PST.

Submission deadline: 02 February 2024 (extended from 26 January 2024)
Author notification: 29 February 2024 (extended from 22 February 2024)
Camera-ready version: 15 March 2024 (extended from 08 March 2024)
Workshop day: 13 May 2024


Accepted papers:

Keynote Presentations:

  • The slides for the first keynote by Renata Borovica-Gajic, "Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees", can be found here
  • The slides for the second keynote by Paolo Papotti, "SQL and Large Language Models: A Marriage Made in Heaven?", can be found here
  • The slides for the third keynote by Fatemeh Nargesian, "Data Acquisition for AI", can be found here

Netherlands (CEST) | Activity | Title | Presenter
9:00 - 9:10 | Opening | |
9:10 - 10:00 | Keynote | Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees | Renata Borovica-Gajic
10:00 - 10:30 | Coffee Break ☕ | |
Benchmarking and Evaluation
10:30 - 11:00 | Research Talk | Evaluating Ambiguous Questions in Semantic Parsing | Simone Papicchio, Paolo Papotti and Luca Cagliero
 | Research Talk | Will Sharing Metadata Leak Privacy? | Danning Zhan and Rihan Hai
ML for data processing
11:00 - 12:00 | Research Talk | ChimeraTL: Transfer Learning in DBMS with Fewer Samples | Tatsuhiro Nakamori, Shohei Matsuura, Takashi Miyazaki, Sho Nakazono, Taiki Sato, Takashi Hoshino and Hideyuki Kawashima
 | Research Talk | OPTWIN: Drift Identification with Optimal Sub-Windows | Mauro Dalle Lucca Tosi and Martin Theobald
12:00 - 13:30 | Lunch 🍱 | |
13:30 - 14:15 | Keynote | SQL and Large Language Models: A Marriage Made in Heaven? | Paolo Papotti
14:15 - 15:00 | Roundtable discussions | Future trends in DBML research | Collaborative Session
15:00 - 15:30 | Coffee Break ☕ | |
15:30 - 17:00 | Keynote | Data Acquisition for AI | Fatemeh Nargesian
Learned Data Wrangling
 | Research Talk | ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines | Mohamed Abdelaal, Anil Bora Yayak, Kai Klede and Harald Schoening
 | Research Talk | Directions Towards Efficient and Automated Data Wrangling with Large Language Models | Zeyu Zhang, Paul Groth, Iacer Calixto and Sebastian Schelter
 | Research Talk | Relationalizing Tables with Large Language Models: The Promise and Challenges | Zezhou Huang and Eugene Wu


Dr Renata Borovica-Gajic

Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees

Renata Borovica-Gajic, University of Melbourne

ABSTRACT. Optimizing physical database design is pivotal for achieving prompt query responses, a critical aspect of database setup. However, existing commercial solutions primarily rely on manual intervention by database administrators (DBAs) to identify and furnish suitable training workloads. This approach is becoming increasingly impractical as workloads evolve into more ad hoc patterns, exacerbating challenges, particularly with the prevalence of mixed OLTP and OLAP (HTAP) workloads. In this talk, I will discuss a novel self-driving method for real-time physical design tuning, circumventing the need for DBAs and query optimizers. Our approach involves strategic exploration and direct performance observation to discern optimal structures, treating the problem as sequential decision-making under uncertainty, and using the multi-armed bandit (MAB) framework to solve it. By balancing exploration and exploitation, DBA Bandits offer reliable performance, even in the face of unpredictable ad hoc and HTAP workloads, while providing long sought-after statistical guarantees on the efficacy of proposed design structures.

ABOUT. Dr Renata Borovica-Gajic holds the position of Senior Lecturer in Data Analytics and is an ARC DECRA Fellow at the School of Computing and Information Systems (CIS) at the University of Melbourne. Additionally, she serves as the Associate Dean (Diversity and Inclusion) within the Faculty of Engineering and IT. Her research focuses on the convergence of database systems, machine learning, artificial intelligence, and data-driven optimization and analytics. Her scholarly contributions are regularly featured in esteemed data management outlets such as SIGMOD, VLDB, and ICDE conferences, as well as in journals including VLDBJ, TKDE, and CSUR. Notable recognitions include the esteemed L'Oréal-UNESCO For Women in Science Award in 2023, the Test of Time Award at SIGMOD 2022, and the Google Award for Research Inclusion in 2021. Moreover, she has been honoured with Research Excellence Awards in 2022 and 2023, alongside Excellence in Teaching and Learning Awards in 2018 and 2020.

Dr Paolo Papotti

SQL and Large Language Models: A Marriage Made in Heaven?

Paolo Papotti, EURECOM

ABSTRACT. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of documents. However, for data-intensive tasks over structured data, relational DBs and SQL queries are at the core of countless applications. While these two technologies may appear distant, in this talk we will see that they can interact effectively and with promising results. LLMs can help users express SQL queries (Semantic Parsing), and SQL queries can in turn be used to evaluate LLMs (Benchmarking). Their combination can be further advanced, with opportunities to query both LLMs and DBs through a unified SQL interface. We present recent results on these topics and then conclude with an overview of the research challenges in effectively leveraging the combined power of SQL and LLMs.

ABOUT. Paolo Papotti has been an Associate Professor at EURECOM, France, since 2017. He received his PhD from Roma Tre University (Italy) in 2007 and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data management and, more recently, on NLP. He has authored more than 140 publications, and his work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), three best demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and two Google Faculty Research Awards (2016, 2020).

Dr Fatemeh Nargesian

Data Acquisition for AI

Fatemeh Nargesian, University of Rochester

ABSTRACT. Data science is increasingly reliant on the discovery and integration of data from diverse sources such as open data portals and data marketplaces. With a massive collection of data like a data lake, dataset discovery involves searching for relevant datasets to downstream data science tasks. For multiple disjoint data sources, data acquisition streamlines the integration of sources to compile a dataset that meets specific schema and distribution requirements. In this talk, I will first describe how to develop efficient algorithms for dataset discovery and data enrichment based on the join operation. I will then present a method to construct a navigational structure over data lakes, offering an alternative discovery approach to the conventional keyword search. Next, we will see how to perform distribution-aware discovery in order to tailor a dataset with a desired distribution from multiple sources, aiming to address group representation issues. Finally, I will conclude with a discussion on the challenges of developing data acquisition systems that support AI-based analytics.

ABOUT. Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in dataset discovery, distribution-aware data integration and data selection, and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE and received the best demo award of VLDB 2017.


Program committee:

  • Syed Fawad Ali - Accenture Germany
  • Matthias Boehm - TU Berlin
  • Zhiwei Fan - Meta
  • Hazar Harmouch - University of Amsterdam
  • Zezhou Huang - Columbia University
  • Madelon Hulsebos - University of California Berkeley
  • Bojan Karlaš - Harvard Medical School
  • Christos Koutras - TU Delft
  • Yao Lu - National University of Singapore
  • Manisha Luthra - TU Darmstadt
  • Ryan Marcus - University of Pennsylvania
  • Pedro Pedreira - Meta
  • Ibrahim Sabek - University of Southern California
  • Sebastian Schelter - University of Amsterdam
  • Tuo Shi - City University of Hong Kong
  • Roee Shraga - Worcester Polytechnic Institute
  • Utku Sirin - Harvard University
  • Peter Triantafillou - University of Warwick
  • Zhihui Yang - Zhejiang University
  • Chao Zhang - Tsinghua University