DBML 2025

ABOUT

After the increased adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) systems and ML communities emerged. Steps involved in ML pipelines, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.

In the opposite direction, ML techniques are explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems with high human supervision like DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims at bringing together re-searchers and practitioners in the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Information of the previous workshops can be accessed and seen at DBML 2024, DBML 2023 and DBML 2022.

For any questions regarding the workshop please contact: dbml25@googlegroups.com

Topics of particular interest for the workshop include, but are not limited to topics along the following two categories:

ML for Data Management and DBMS
Learned data discovery, cleaning, and transformation
ML-enabled data exploration and discovery in data lakes
Learned database design, configuration, and tuning
ML for query optimization, indexing, partitioning
Natural language enablement (e.g., queries, result summarization, chatbot interfaces, etc.)
Pretrained models for databases and data management, e.g. (Large Language Models).
Representation learning for data cleaning, preprocessing, and management
Benchmarking ML-oriented data management (data augmentation, data cleaning, etc) or DBMSs
Data Management for ML
Data collection and preparation for ML applications
Data quality and provenance for ML
Novel data management systems for accelerating training and inference of ML models
Data and metadata management for the ML lifecycle
DB-inspired techniques for modeling, storage, and provenance of ML artifacts

SPEAKERS

On the Generalization of Temporal Graph Learning: Theoretical Insights and Simple Algorithms

Jian Kang, University of Rochester

ABSTRACT. Real-world graphs often evolve over time. The de facto methods to extract spatial-temporal information in temporal graphs are recurrent neural networks (RNNs) and self-attention mechanism (SAM). Despite empirically good performance, their theoretical foundations largely remain uncharted. In this talk, I will share our works on improving the generalization of temporal graph learning. I will first present a unified theoretical framework to examine the generalization of three fundamental types of temporal graph learning methods. Guided by the theoretical understanding, I will introduce two simple yet effective temporal graph learning models that enjoy small generalization error, smooth optimization landscape and better empirical performance. We hope these results motivate researchers to rethink the importance of simple model architectures.

ABOUT. Jian Kang is an Assistant Professor in the Department of Computer Science at the University of Rochester. His research focuses on machine learning on graphs to advance knowledge understanding and scientific discovery. He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign. He was recognized as Rising Star in Data Science by The University of Chicago and a Mavis Future Faculty Fellow by the University of Illinois Urbana-Champaign. He also serves as Guest Editor of Generative Search and Recommendation of Frontiers in Big Data, Proceedings Chair of CIKM 2025, and Web Chair of KDD 2024.

Data Acquisition for Domain Adaptation of Closed-Box Models

Contributing authors: Yiwei Liu, Xiaohui Yu, Nick Koudas

ABSTRACT. Machine learning (ML) marketplaces, pivotal for numerous industries, often offer models to customers as "closed boxes". These models, when deployed in new domains, might experience lower performance due to distributional shifts. Our paper proposes a framework designed to enhance closed-box classification models. This framework allows customers, upon detecting performance gaps on their validation datasets, to gather additional data for creating an auxiliary "padding" model. This model assists the original closed-box model in addressing classification weaknesses in the target domain. The framework includes a "weakness detector" that identifies areas where the model falls short and an Augmented Ensemble method that combines the original and padding models to improve accuracy and expand the diversity of the ML marketplace. Extensive experiments on several popular benchmark datasets confirm the superiority of our proposed framework over baseline approaches.

ABOUT. Xiaohui Yu is a Professor and the Graduate Program Director in the School of Information Technology, York University, Canada. He obtained his PhD degree from the University of Toronto. His research interests lie in the broad area of data science, with a particular focus on the intersection of data management and machine learning (ML). The results of his research have been published in top data science journals and conferences, such as SIGMOD, VLDB, ICDE, and TKDE. He regularly serves on the program committees of leading conferences and is an Associate/Area Editor for the IEEE Transactions on Knowledge and Data Engineering (TKDE), the ACM Transactions on Knowledge Discovery in Data (TKDD), and Information Systems. He is a General Co-Chair for the KDD 2025 conference. He has collaborated regularly with industry partners, and some research results have been incorporated into large-scale production systems.

Advancing Domain-Specific Intelligence with LLMs & KGs: A Five-Year Journey Toward the Future

Essam Mansour , Concordia University

ABSTRACT. In this talk, I will share my five-year journey advancing AI, Knowledge Graphs (KGs), and data science automation to build cutting-edge systems. I developed an AI-enabled KG engine that optimizes AI infrastructure by bridging Graph DBs and Graph ML frameworks. Our system introduces novel training and inference accelerators. I also created an LLM-powered chatbot platform for KGs to enhance domain-specific question answering by using LLMs for understanding and linking tasks. Additionally, I advanced data science with semantic abstraction and KG-powered automation via graph neural networks. My work has led to around 10 top-tier publications (SIGMOD, PVLDB, ICDE) and open-sourced systems. I have collaborated with industry leaders like Google and IBM, Canadian banks like RBC and NBC, and research institutions like the National Research Council Canada. I will also present my vision for LLM-powered generative AI in specialized domains. My talk will showcase how LLMs could enhance and reshape domain-specific intelligence while addressing their strengths and limitations. My group’s recent work explores LLMs as benchmark creators, data scientists, and security analysts. These examples demonstrate the industry impact of LLMs. Finally, I will introduce my framework for LLM-powered domain-specific intelligence and its role in the future of generative AI for specialized domains.

ABOUT. Dr. Essam Mansour is an associate professor in the Department of Computer Science and Software Engineering at Concordia University in Montreal and the head of the Cognitive Data Science lab (CoDS). Over the past decade, he has led pioneering research in AI for databases, AI infrastructure optimization, knowledge graphs (KGs), large language models (LLMs), graph neural networks, and distributed/parallel data systems. In the last five years, Dr. Mansour has developed a promising research program in linked data science for federated and heterogeneous datasets. This program has achieved significant milestones, securing over $750K in federal and industry funding and forming strategic research projects with industry leaders, such as Google, IBM, RBC, and National Bank of Canada (NBC). His group is developing AI-powered systems optimized for scalability on supercomputers and cloud platforms. His research has resulted in over 30 publications in top-tier conferences and journals, including SIGMOD, PVLDB, and ICDE. Dr. Mansour is a regular reviewer for prestigious journals such as ACM TODS, VLDB Journal, and IEEE TKDE, and has served on the program committees of PVLDB, SIGMOD, and ICDE.

Towards Foundation Database Models

Johannes Wehrstein , TU Darmstadt

ABSTRACT. Recently, machine learning models have been utilized to realize many database tasks in academia and industry. To solve such internal tasks of database systems, the state-of-the-art is one-off models that need to be trained individually per task and even per dataset, which causes extremely high training overheads. In this talk, we argue that a new learning paradigm is needed that moves away from such one-off models towards generalizable models that can be used with only minimal overhead for an unseen dataset on a wide spectrum of tasks. While recently, several advances towards more generalizable models have been made, still, no model exists that can generalize across both datasets and tasks. As such, we propose a new direction which we call foundation models for databases, which is pre-trained in both task-agnostic and dataset-agnostic manner, which makes it possible to use the model with low overhead to solve a wide spectrum of downstream tasks on unseen datasets. In this vision talk, we propose an architecture for such a foundation database model, describe a promising feasibility study with a first prototype of such a model, and discuss the research roadmap to address the open challenges.

ABOUT. Johannes Wehrstein is a Doctoral Researcher at the Systems Group @ TU Darmstadt, specializing in learned database components. His research focuses on leveraging AI to enhance database performance, covering areas such as cost estimation, query optimization, advisory systems, as well as foundational AI research on query-plan representation learning. Previously, he was a Student Researcher at Systems Research @ Google, where he worked on foundation database models — models that can generalize across tasks, workloads, and databases. He was recently awarded the CIDR’25 best paper award for his work on foundation database models.

Unstructured Data Management for Machine Learning

Dong Deng , Rutgers University

ABSTRACT. In this talk, I will present our recent works on unstructured data management for machine learning. Large-scale unstructured data, such as vectors and texts, are ubiquitous nowadays due to the rapid development of deep learning and large language models (LLMs). Many machine learning models have been developed to effectively represent real-world objects as high-dimensional feature vectors. Meanwhile, real-world objects (e.g., products, video frames) are often associated with structured attributes (e.g., price, timestamp). In many scenarios, both the feature vectors and the structured attributes of these objects need to be jointly queried. To address this challenge, I will introduce our recent work for multi-modal approximate nearest neighbor search, which retrieves the approximate nearest neighbors of a query vector while satisfying attribute-based constraints. I will also discuss our work on near-duplicate text alignment, which identifies similar snippets between a short query and long text documents. This is a computationally intensive task with important applications in areas such as bioinformatics, copyright protection, and deduplication. I will report on our recent progress in scaling near-duplicate alignment to large text corpora and demonstrate how it can be used to detect unintended memorization in LLMs.

ABOUT. Dong Deng is an Assistant Professor in the Department of Computer Science at Rutgers University. Before joining Rutgers, he was a postdoc in the Database Group at MIT. He obtained his Ph.D. in Computer Science from Tsinghua University. His research interests include large-scale data management, vector databases, unstructured data management, data science, data curation, and database systems. He has published over 50 research papers in top database venues, including SIGMOD, VLDB, and ICDE. His research is supported by the National Science Foundation, Adobe, and The ASSISTments Foundation.

PROGRAM

Hong Kong (GMT+8)	Activity	Title	Presenter
14:00 - 14:10	Opening
14:10 - 14:50	Invited Talk	Advancing Domain-Specific Intelligence with LLMs & KGs: A Five-Year Journey Toward the Future	Essam Mansour
14:50 - 15:10	Contributed Paper	Data Acquisition for Domain Adaptation of Closed-Box Models Yiwei Liu (York University), Xiaohui Yu (York University), Nick Koudas (University of Toronto)	Xiaohui Yu
15:10 - 15:30	Invited Paper Presentation	Towards Foundation Database Models	Johannes Wehrstein
15:30 - 16:00	Coffee Break ☕
16:00 - 16:40	Invited Talk	Unstructured Data Management for Machine Learning	Dong Deng
16:40 - 17:20	Invited Talk	On the Generalization of Temporal Graph Learning: Theoretical Insights and Simple Algorithms	Jian Kang
17:20 - 17:30	Closing

IMPORTANT DATES

All deadlines are 11:59PM AoE.

Submission deadline:	~~January 25~~^th February 28^th 2025
Author notification:	~~March 30~~^thApril 11^th 2025
Camera-ready version:	~~April 8~~^thApril 28^th 2025
Workshop day:	May 19^th 2025

SUBMISSION AND AUTHOR GUIDELINES

Papers should be submitted using the Conference Management Tool. Papers must be prepared in accordance with the available IEEE format. Papers must not exceed 6 pages including the references. No appendix is allowed. Only electronic submissions in PDF format will be considered. Submissions will be reviewed in a single-blind manner.

ORGANISATION

Fatemeh Nargesian

University of Rochester

Workshop Chair

Lise Stork

University of Amsterdam

Workshop Chair

Aditya Shankar

TU Delft

Publicity chair

PROGRAM COMMITTEE

Amine Mhedhbi - Polytechnique Montréal
Gerardo Vitagliano - MIT CSAIL
Roee Shraga - WPI
Yiming Lin - UC Berkeley
Yuyu Luo - HKUST (GZ)
Steven Whang - KAIST
Daphne Miedema - University of Amsterdam
Sebastian Schelter - TU Berlin
Jan-Christoph Kalo - University of Amsterdam
Julien Romero - Telecom SudParis
Antonios Georgakopoulos - University of Amsterdam
Zhi Zhang - University of Amsterdam
Madelon Hulsebos - CWI Amsterdam

4th International Workshop on Databases and Machine Learning

in conjunction with ICDE 2025 | May 19^th-23^rd 2025

ABOUT