DBML 2024

ABOUT

After the increased adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) systems and ML communities emerged. Steps involved in ML pipelines, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.

In the opposite direction, ML techniques are explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems with high human supervision like DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Information on the previous workshops is available at DBML 2023 and DBML 2022.

For any questions regarding the workshop please contact: dbml24chairs@gmail.com

Topics of particular interest for the workshop include, but are not limited to, the following two categories:

ML for Data Management and DBMS
Learned data discovery, cleaning, and transformation
ML-enabled data exploration and discovery in data lakes
Learned database design, configuration, and tuning
ML for query optimization, indexing, partitioning
Natural language enablement (e.g., queries, result summarization, chatbot interfaces)
Pretrained models for databases and data management (e.g., Large Language Models)
Representation learning for data cleaning, preprocessing, and management
Benchmarking ML-oriented data management (data augmentation, data cleaning, etc.) or DBMSs
Data Management for ML
Data collection and preparation for ML applications
Data quality and provenance for ML
Novel data management systems for accelerating training and inference of ML models
Data and metadata management for the ML lifecycle
DB-inspired techniques for modeling, storage, and provenance of ML artifacts

IMPORTANT DATES

All deadlines are 11:59PM PST.

Submission deadline:	~~26 January 2024~~ 02 February 2024
Author notification:	~~22 February 2024~~ 29 February 2024
Camera-ready version:	~~08 March 2024~~ 15 March 2024
Workshop day:	13 May 2024

PROGRAM

Accepted papers:

Relationalizing Tables with Large Language Models: The Promise and Challenges
Zezhou Huang and Eugene Wu.
The slides for the presentation are available here.
Evaluating Ambiguous Questions in Semantic Parsing
Simone Papicchio, Paolo Papotti and Luca Cagliero.
The slides for the presentation are available here.
OPTWIN: Drift Identification with Optimal Sub-Windows
Mauro Dalle Lucca Tosi and Martin Theobald.
The slides for the presentation are available here.
ChimeraTL: Transfer Learning in DBMS with Fewer Samples
Tatsuhiro Nakamori, Shohei Matsuura, Takashi Miyazaki, Sho Nakazono, Taiki Sato, Takashi Hoshino and Hideyuki Kawashima.
The slides for the presentation are available here.
Will Sharing Metadata Leak Privacy?
Danning Zhan and Rihan Hai.
The slides for the presentation are available here.
ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines
Mohamed Abdelaal, Anil Bora Yayak, Kai Klede and Harald Schoening.
The slides for the presentation are available here.
Directions Towards Efficient and Automated Data Wrangling with Large Language Models
Zeyu Zhang, Paul Groth, Iacer Calixto and Sebastian Schelter.
The slides for the presentation are available here.

Keynote Presentations:

The slides for the first keynote by Renata Borovica-Gajic on Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees can be found here
The slides for the second keynote by Paolo Papotti on SQL and Large Language Models: A Marriage Made in Heaven? can be found here
The slides for the third keynote by Fatemeh Nargesian on Data Acquisition for AI can be found here

Netherlands (CEST)	Activity	Title	Presenter
9:00 - 9:10	Opening
9:10 - 10:00	Keynote	Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees	Renata Borovica-Gajic
10:00 - 10:30	Coffee Break ☕
Benchmarking and Evaluation
10:30 - 11:00	Research Talk	Evaluating Ambiguous Questions in Semantic Parsing	Simone Papicchio, Paolo Papotti and Luca Cagliero
	Research Talk	Will Sharing Metadata Leak Privacy?	Danning Zhan and Rihan Hai
ML for data processing
11:00 - 12:00	Research Talk	ChimeraTL: Transfer Learning in DBMS with Fewer Samples	Tatsuhiro Nakamori, Shohei Matsuura, Takashi Miyazaki, Sho Nakazono, Taiki Sato, Takashi Hoshino and Hideyuki Kawashima
	Research Talk	OPTWIN: Drift Identification with Optimal Sub-Windows	Mauro Dalle Lucca Tosi and Martin Theobald
12:00 - 13:30	Lunch 🍱
13:30 - 14:15	Keynote	SQL and Large Language Models: A Marriage Made in Heaven?	Paolo Papotti
14:15 - 15:00	Roundtable discussions	Future trends in DBML research	Collaborative Session
15:00 - 15:30	Coffee Break ☕
15:30 - 17:00	Keynote	Data Acquisition for AI	Fatemeh Nargesian
Learned Data Wrangling
	Research Talk	ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines	Mohamed Abdelaal, Anil Bora Yayak, Kai Klede and Harald Schoening
	Research Talk	Directions Towards Efficient and Automated Data Wrangling with Large Language Models	Zeyu Zhang, Paul Groth, Iacer Calixto and Sebastian Schelter
	Research Talk	Relationalizing Tables with Large Language Models: The Promise and Challenges	Zezhou Huang and Eugene Wu

KEYNOTES

Physical database design tuning with Multi-Armed Bandits: Reaching the holy grail of performance guarantees

Renata Borovica-Gajic, University of Melbourne

ABSTRACT. Optimizing physical database design is pivotal for achieving prompt query responses, a critical aspect of database setup. However, existing commercial solutions primarily rely on manual intervention by database administrators (DBAs) to identify and furnish suitable training workloads. This approach is becoming increasingly impractical as workloads evolve into more ad hoc patterns, exacerbating challenges, particularly with the prevalence of mixed OLTP and OLAP (HTAP) workloads. In this talk, I will discuss a novel self-driving method for real-time physical design tuning, circumventing the need for DBAs and query optimizers. Our approach involves strategic exploration and direct performance observation to discern optimal structures, treating the problem as sequential decision-making under uncertainty, and using the multi-armed bandit (MAB) framework to solve it. By balancing exploration and exploitation, DBA Bandits offer reliable performance, even in the face of unpredictable ad hoc and HTAP workloads, while providing long sought-after statistical guarantees on the efficacy of proposed design structures.

ABOUT. Dr Renata Borovica-Gajic holds the position of Senior Lecturer in Data Analytics and is an ARC DECRA Fellow at the School of Computing and Information Systems (CIS) at the University of Melbourne. Additionally, she serves as the Associate Dean (Diversity and Inclusion) within the Faculty of Engineering and IT. Her research focuses on the convergence of database systems, machine learning, artificial intelligence, and data-driven optimization and analytics. Her scholarly contributions are regularly featured in esteemed data management outlets such as SIGMOD, VLDB, and ICDE conferences, as well as in journals including VLDBJ, TKDE, and CSUR. Notable recognitions include the esteemed L'Oréal-UNESCO For Women in Science Award in 2023, the Test of Time Award at SIGMOD 2022, and the Google Award for Research Inclusion in 2021. Moreover, she has been honoured with Research Excellence Awards in 2022 and 2023, alongside Excellence in Teaching and Learning Awards in 2018 and 2020.

SQL and Large Language Models: A Marriage Made in Heaven?

Paolo Papotti, EURECOM

ABSTRACT. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of documents. However, for data-intensive tasks over structured data, relational DBs and SQL queries are at the core of countless applications. While these two technologies may appear distant, in this talk we will see that they can interact effectively and with promising results. LLMs can help users express SQL queries (Semantic Parsing), but SQL queries can be used to evaluate LLMs (Benchmarking). Their combination can be further advanced, with opportunities to query with a unified SQL interface both LLMs and DBs. We present recent results on these topics and then conclude with an overview of the research challenges in effectively leveraging the combined power of SQL and LLMs.

ABOUT. Paolo Papotti has been an Associate Professor at EURECOM, France, since 2017. He got his PhD from Roma Tre University (Italy) in 2007 and had research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data management and, more recently, on NLP. He has authored more than 140 publications, and his work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), three best demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and two Google Faculty Research Awards (2016, 2020).

Data Acquisition for AI

Fatemeh Nargesian, University of Rochester

ABSTRACT. Data science is increasingly reliant on the discovery and integration of data from diverse sources such as open data portals and data marketplaces. With a massive collection of data like a data lake, dataset discovery involves searching for datasets relevant to downstream data science tasks. For multiple disjoint data sources, data acquisition streamlines the integration of sources to compile a dataset that meets specific schema and distribution requirements. In this talk, I will first describe how to develop efficient algorithms for dataset discovery and data enrichment based on the join operation. I will then present a method to construct a navigational structure over data lakes, offering an alternative discovery approach to the conventional keyword search. Next, we will see how to perform distribution-aware discovery in order to tailor a dataset with a desired distribution from multiple sources, aiming to address group representation issues. Finally, I will conclude with a discussion on the challenges of developing data acquisition systems that support AI-based analytics.

ABOUT. Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in dataset discovery, distribution-aware data integration and data selection, and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE and received the best demo award of VLDB 2017.

ORGANISATION

Ziyu Li

TU Delft

Workshop Chair

Gerardo Vitagliano

Hasso-Plattner-Institut

Workshop Chair

Carsten Binnig

TU Darmstadt

Workshop Chair

Danning Zhan

TU Delft

Publicity Chair

Program committee:

Syed Fawad Ali - Accenture Germany
Matthias Boehm - TU Berlin
Zhiwei Fan - Meta
Hazar Harmouch - University of Amsterdam
Zezhou Huang - Columbia University
Madelon Hulsebos - University of California Berkeley
Bojan Karlaš - Harvard Medical School
Christos Koutras - TU Delft
Yao Lu - National University of Singapore
Manisha Luthra - TU Darmstadt
Ryan Marcus - University of Pennsylvania
Pedro Pedreira - Meta
Ibrahim Sabek - University of Southern California
Sebastian Schelter - University of Amsterdam
Tuo Shi - City University of Hong Kong
Roee Shraga - Worcester Polytechnic Institute
Utku Sirin - Harvard University
Peter Triantafillou - University of Warwick
Zhihui Yang - Zhejiang University
Chao Zhang - Tsinghua University

3rd International Workshop on Databases and Machine Learning

in conjunction with ICDE 2024 | May 13 2024
