4th International Workshop on Databases and Machine Learning

in conjunction with ICDE 2025 | May 19th-23rd 2025

ABOUT

After the increased adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) systems and ML communities emerged. Steps involved in ML pipelines, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle, can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases, pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.

In the opposite direction, ML techniques are being explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems requiring heavy human supervision, such as DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Information on the previous workshops is available at DBML 2024, DBML 2023, and DBML 2022.

For any questions regarding the workshop please contact: dbml25@googlegroups.com

Topics of particular interest for the workshop include, but are not limited to, topics in the following two categories:

  • ML for Data Management and DBMS
      • Learned data discovery, cleaning, and transformation
      • ML-enabled data exploration and discovery in data lakes
      • Learned database design, configuration, and tuning
      • ML for query optimization, indexing, partitioning
      • Natural language enablement (e.g., queries, result summarization, chatbot interfaces)
      • Pretrained models for databases and data management (e.g., Large Language Models)
      • Representation learning for data cleaning, preprocessing, and management
      • Benchmarking ML-oriented data management (data augmentation, data cleaning, etc.) or DBMSs
  • Data Management for ML
      • Data collection and preparation for ML applications
      • Data quality and provenance for ML
      • Novel data management systems for accelerating training and inference of ML models
      • Data and metadata management for the ML lifecycle
      • DB-inspired techniques for modeling, storage, and provenance of ML artifacts

SPEAKERS

Dr Jian Kang

On the Generalization of Temporal Graph Learning: Theoretical Insights and Simple Algorithms

Jian Kang, University of Rochester

ABSTRACT. Real-world graphs often evolve over time. The de facto methods to extract spatial-temporal information in temporal graphs are recurrent neural networks (RNNs) and self-attention mechanisms (SAM). Despite empirically good performance, their theoretical foundations largely remain uncharted. In this talk, I will share our work on improving the generalization of temporal graph learning. I will first present a unified theoretical framework to examine the generalization of three fundamental types of temporal graph learning methods. Guided by the theoretical understanding, I will introduce two simple yet effective temporal graph learning models that enjoy small generalization error, a smooth optimization landscape, and better empirical performance. We hope these results motivate researchers to rethink the importance of simple model architectures.

ABOUT. Jian Kang is an Assistant Professor in the Department of Computer Science at the University of Rochester. His research focuses on machine learning on graphs to advance knowledge understanding and scientific discovery. He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign. He was recognized as a Rising Star in Data Science by The University of Chicago and as a Mavis Future Faculty Fellow by the University of Illinois Urbana-Champaign. He also serves as Guest Editor of Generative Search and Recommendation of Frontiers in Big Data, Proceedings Chair of CIKM 2025, and Web Chair of KDD 2024.

Dr Xiaohui Yu

Data Acquisition for Domain Adaptation of Closed-Box Models

Contributing authors: Yiwei Liu, Xiaohui Yu, Nick Koudas

ABSTRACT. Machine learning (ML) marketplaces, pivotal for numerous industries, often offer models to customers as "closed boxes". These models, when deployed in new domains, might experience lower performance due to distributional shifts. Our paper proposes a framework designed to enhance closed-box classification models. This framework allows customers, upon detecting performance gaps on their validation datasets, to gather additional data for creating an auxiliary "padding" model. This model assists the original closed-box model in addressing classification weaknesses in the target domain. The framework includes a "weakness detector" that identifies areas where the model falls short and an Augmented Ensemble method that combines the original and padding models to improve accuracy and expand the diversity of the ML marketplace. Extensive experiments on several popular benchmark datasets confirm the superiority of our proposed framework over baseline approaches.

ABOUT. Xiaohui Yu is a Professor and the Graduate Program Director in the School of Information Technology, York University, Canada. He obtained his PhD degree from the University of Toronto. His research interests lie in the broad area of data science, with a particular focus on the intersection of data management and machine learning (ML). The results of his research have been published in top data science journals and conferences, such as SIGMOD, VLDB, ICDE, and TKDE. He regularly serves on the program committees of leading conferences and is an Associate/Area Editor for the IEEE Transactions on Knowledge and Data Engineering (TKDE), the ACM Transactions on Knowledge Discovery in Data (TKDD), and Information Systems. He is a General Co-Chair for the KDD 2025 conference. He has collaborated regularly with industry partners, and some research results have been incorporated into large-scale production systems.

Dr Essam Mansour

Essam Mansour, Concordia University

Johannes Wehrstein

Towards Foundation Database Models

Johannes Wehrstein, TU Darmstadt

ABSTRACT. Recently, machine learning models have been utilized to realize many database tasks in academia and industry. To solve such internal tasks of database systems, the state of the art is one-off models that need to be trained individually per task and even per dataset, which causes extremely high training overheads. In this talk, we argue that a new learning paradigm is needed that moves away from such one-off models towards generalizable models that can be used with only minimal overhead for an unseen dataset on a wide spectrum of tasks. While several advances towards more generalizable models have recently been made, no model yet exists that can generalize across both datasets and tasks. As such, we propose a new direction which we call foundation models for databases: models that are pre-trained in both a task-agnostic and a dataset-agnostic manner, making it possible to use them with low overhead to solve a wide spectrum of downstream tasks on unseen datasets. In this vision talk, we propose an architecture for such a foundation database model, describe a promising feasibility study with a first prototype of such a model, and discuss the research roadmap to address the open challenges.

ABOUT. Johannes Wehrstein is a Doctoral Researcher at the Systems Group @ TU Darmstadt, specializing in learned database components. His research focuses on leveraging AI to enhance database performance, covering areas such as cost estimation, query optimization, advisory systems, as well as foundational AI research on query-plan representation learning. Previously, he was a Student Researcher at Systems Research @ Google, where he worked on foundation database models — models that can generalize across tasks, workloads, and databases. He was recently awarded the CIDR’25 best paper award for his work on foundation database models.

Dong Deng

Dong Deng, Rutgers University

PROGRAM

Hong Kong (GMT+8) | Activity | Title | Presenter
14:00 - 14:10 | Opening
14:10 - 14:50 | Invited Talk | To be announced | Essam Mansour
14:50 - 15:10 | Contributed Paper | Data Acquisition for Domain Adaptation of Closed-Box Models. Yiwei Liu (York University), Xiaohui Yu (York University), Nick Koudas (University of Toronto) | Xiaohui Yu
15:10 - 15:30 | Invited Paper Presentation | Towards Foundation Database Models | Johannes Wehrstein
15:30 - 16:00 | Coffee Break ☕
16:00 - 16:40 | Invited Talk | To be announced | Dong Deng
16:40 - 17:20 | Invited Talk | On the Generalization of Temporal Graph Learning: Theoretical Insights and Simple Algorithms | Jian Kang
17:20 - 17:30 | Closing

IMPORTANT DATES

All deadlines are 11:59PM AoE.

Submission deadline: February 28th 2025 (previously January 25th)
Author notification: April 11th 2025 (previously March 30th)
Camera-ready version: April 28th 2025 (previously April 8th)
Workshop day: May 19th 2025

SUBMISSION AND AUTHOR GUIDELINES

Papers should be submitted using the Conference Management Tool. Papers must be prepared in accordance with the available IEEE format. Papers must not exceed 6 pages including the references. No appendix is allowed. Only electronic submissions in PDF format will be considered. Submissions will be reviewed in a single-blind manner.

ORGANISATION

PROGRAM COMMITTEE

  • Amine Mhedhbi - Polytechnique Montréal
  • Gerardo Vitagliano - MIT CSAIL
  • Roee Shraga - WPI
  • Yiming Lin - UC Berkeley
  • Yuyu Luo - HKUST (GZ)
  • Steven Whang - KAIST
  • Daphne Miedema - University of Amsterdam
  • Sebastian Schelter - TU Berlin
  • Jan-Christoph Kalo - University of Amsterdam
  • Julien Romero - Telecom SudParis
  • Antonios Georgakopoulos - University of Amsterdam
  • Zhi Zhang - University of Amsterdam
  • Madelon Hulsebos - CWI Amsterdam