DBML 2023

ABOUT

With the increasing adoption of machine learning (ML) across applications and disciplines, a synergy between the database (DB) systems and ML communities has emerged. Steps involved in an ML pipeline, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle, can benefit from research conducted by the data management community. For example, the management of the ML lifecycle requires mechanisms for modeling, storing and querying ML artifacts. Moreover, in many use cases, pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.

In the opposite direction, ML techniques are being explored in core components of database systems, e.g., query optimization, indexing and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems that require heavy human supervision, such as DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches addressing challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.

Topics of particular interest in the workshop include, but are not limited to:

Data collection and preparation for ML applications
Declarative machine learning on databases, data warehouses or data lakes
Hybrid optimization techniques for databases and machine learning
Model-aware data discovery, cleaning, and transformation
Benchmarking ML-oriented data management systems (data augmentation, data cleaning, etc.)
Data management during the lifecycle of ML models
Novel data management systems for accelerating training and inference of ML models
DB-inspired techniques for modeling, storage and provenance of ML artifacts
Learned database design, configuration and tuning
Machine learning for query optimization
Applied machine learning/deep learning for data integration
ML-enabled data exploration and discovery in data lakes
ML functionality inside DBMS

SUBMISSION GUIDELINES

The workshop will accept both regular papers and short papers (work in progress, vision/outrageous ideas). All submissions must be prepared in accordance with the IEEE template available here. The workshop follows the same rules of Conflicts of Interest (COI) as ICDE 2023. The following are the page limits (excluding references):

Regular papers:	6 pages
Short papers:	4 pages

All submissions (in PDF format) should be sent to Easychair.

IMPORTANT DATES

All deadlines are 11:59PM PST.

Submission deadline:	~~04 January 2023~~ (extended) 11 January 2023
Author notification:	~~01 February 2023~~ (extended) 04 February 2023
Camera-ready version:	13 February 2023
Workshop day:	3 April 2023

KEYNOTES

How Databases and Machine Learning Systems Can Benefit from Each Other: A Perspective from Product and Research

Matteo Interlandi, Gray Systems Lab, Microsoft

ABSTRACT. As machine learning (ML) continues to gain prominence in today's world, it is becoming increasingly clear that databases and ML systems are two faces of the same coin. Drawing on my experience in both product and research teams, I will provide three different perspectives on why I think that databases and machine learning systems are deeply connected. The talk will be structured around three main topics: execution, optimizations, and abstractions. The audience will discover how classical machine learning runtimes are closely related to query processing, and how ML and database operations can be co-optimized. Finally, I will showcase how to turn relational algebra into tensor operations. Overall, this talk will demonstrate that databases and machine learning systems are fundamentally intertwined, and that recognizing this connection can foster exciting advancements in both fields.

ABOUT. Matteo Interlandi is a Principal Scientist at the Gray Systems Lab (GSL) within Microsoft. His expertise lies at the intersection of Machine Learning and Database Systems, and his research has earned him numerous accolades, including a best demo award at VLDB 2022, an honorable mention at SIGMOD 2021, and a “Best of VLDB 2016”. Prior to joining Microsoft, Matteo was a Postdoctoral Scholar at the University of California, Los Angeles, and a Research Associate at the Qatar Computing Research Institute. Matteo earned his Ph.D. from the University of Modena and Reggio Emilia, Italy.

Towards AI-Generated Database Management Systems

Immanuel Trummer, Cornell University

ABSTRACT. The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects, aimed at exploiting advanced language processing for tasks such as parsing a database manual to support automated tuning, or mining data for patterns, described in natural language. Finally, I discuss our recent and ongoing research, aimed at synthesizing code for SQL processing in general-purpose programming languages, while enabling customization via natural language commands.

ABOUT. Immanuel Trummer is an assistant professor at Cornell University. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. He received the NSF CAREER Award and multiple Google Faculty Research Awards.

Optimizing Communications and Data for Large-scale Learning

Ce Zhang, ETH Zurich

ABSTRACT. The rapid progress of machine learning in the last decade has been fueled by the increasing scale of data and compute. However, this ever-increasing scale has created significant challenges for machine learning, which center around two fundamental bottlenecks: data movement (communications) and data quality. To alleviate these bottlenecks, one must jointly optimize and analyze data and learning. In this talk, I will share some of our research in this direction, focusing on optimizing data movements to enable large-scale, distributed and decentralized learning.

ABOUT. Ce is an Associate Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible, cost-efficient, and trustworthy for everyone who wants to use them to make our world a better place. He believes in a systems approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable.

INVITED TALKS

Relational AutoDiff

Mahmoud Abo Khamis, Senior Computer Scientist, RelationalAI

ABSTRACT. Modern database systems have been progressively expanding their use cases far outside traditional bookkeeping and data analytics, and into artificial intelligence workloads like machine learning and mathematical optimization. This in turn motivates the need for native in-database automatic differentiation to better support these use cases. In this talk, we present RelationalAD (RAD), our framework for automatic differentiation at RelationalAI (RAI). Rel, the modeling language underlying RelationalAI, is declarative and can be viewed as a generalization of Datalog with infinite relations (e.g. arithmetic), aggregation, and relational abstraction. The input to RAD is a Rel program that defines a (set of) relational views and the output is another Rel program that defines new views that are the derivatives with respect to some given input relations. We show that performing AutoDiff inside a high-level database language like Rel allows us to evaluate derivatives while enjoying many features offered by the underlying database engine like factorization, query optimization and compilation, as well as support for higher-order derivatives. We present several examples covering recursive Rel programs, matrix calculus, neural network training, and gradient descent among others. We conclude with some challenges, design issues, as well as open problems.

ABOUT. Mahmoud Abo Khamis has been a Senior Computer Scientist at RelationalAI since 2017. He received his Ph.D. in Computer Science and Engineering from the State University of New York at Buffalo in 2016. He also worked as a Senior Database Engineer at Infor from 2015 until 2017. His research interests include database systems and theory, in-database machine learning, query optimization and evaluation, information theory, and beyond worst-case analysis. His work has received two PODS Best Paper Awards in 2016 and 2022, two SIGMOD Research Highlight Awards in 2016 and 2022, and the Best CSE Dissertation Award 2016 from SUNY Buffalo. His work also received several invitations to the Journal of the ACM, the ACM TODS, and the ACM STOC. He served on the program committees of PODS 2019, PODS 2021, and ICDT 2022, and he is also a reviewer for the VLDB Journal and the ACM TODS among others.

Applying Compressed Data Direct Computing from Database to ML Workloads

Feng Zhang, Associate Professor, Renmin University, China

ABSTRACT. The rapid growth of data volume poses challenges for modern database systems in terms of space and time. Compressed data direct computing, a solution that combines the space savings of data compression with the efficiency gains of direct computing, has proven to be a promising research direction in the database field. We find that the core of compressed data direct computing is data reuse, and that it can be extended to ML workloads, which are also concerned with data size and computational complexity. In this talk, we introduce Deep Reuse, a concrete implementation of data reuse in ML workloads. It shows great potential for inference latency reduction on popular neural networks such as CNNs. Inspired by Deep Reuse, we further carry out research in three aspects: 1) applying it to optimized operators such as the fast convolution algorithm Winograd, 2) embedding it into neural networks to address non-determinism and low accuracy problems through a consistent training process, and 3) extending the application to resource-constrained IoT devices.

ABOUT. Feng Zhang is an Associate Professor at Renmin University of China. He received his PhD from Tsinghua University in 2017, and was a visiting scholar at NCSU in 2016 and at NUS in 2018. His research interests include databases and high-performance computing. He mainly studies high-performance direct computing on compressed data in data analytics and management. His papers have been published in prestigious international conferences and journals including SIGMOD, VLDB, SC, USENIX ATC, ASPLOS, and NeurIPS. He received the ACM SIGHPC China Rising Star Award and the TPDS Best Paper Award. He has provided consulting services to numerous IT companies in China, including Alibaba, Tencent, and Ant Company.

PROGRAM

Accepted papers:

Provenance-based Explanations for Machine Learning (ML) Models
Justin Turnau, Nkechi Akwari, Seokki Lee and Dwarkesh Rajput
Privacy-preserving Data Federation for Trainable, Queryable and Actionable Data
Stavroula Iatropoulou, Theodora Anastasiou, Sophia Karagiorgou, Petros Petrou, Dimitrios Alexandrou and Thanassis Bouras
Imputation of Missing Values in Training Data using Variational Autoencoder
Xuerui Hong and Shuang Hao
COAX: Correlation-Aware Indexing
Ali Hadian, Behzad Ghaffari, Taiyi Wang and Thomas Heinis
Efficient Index Learning via Model Reuse and Fine-tuning
Guanli Liu, Jianzhong Qi, Lars Kulik, Kazuya Soga, Renata Borovica-Gajic and Benjamin I. P. Rubinstein
A Fast Hybrid Spatial Index with External Memory Support
Xinyu Su, Jianzhong Qi and Egemen Tanin
Optimizing Machine Learning Inference Queries for Multiple Objectives
Ziyu Li, Mariette Schönfeld, Rihan Hai, Alessandro Bozzon and Asterios Katsifodimos
Unsupervised Intra-Domain Adaptation for Recommendation via Uncertainty Minimization
Chenghao Chen, Jie Xiao, Jin Liu, Jie Zhang, Jia Jia and Ning Hu

California (PST)	Activity	Title	Presenter
8:30 - 8:40	Opening
8:40 - 9:25	Keynote 1	How Databases and Machine Learning Systems Can Benefit from Each Other: A Perspective from Product and Research	Matteo Interlandi (Gray Systems Lab, Microsoft)
9:25 - 9:45	Research talk 1	Imputation of Missing Values in Training Data using Variational Autoencoder (Online)
9:45 - 10:05	Research talk 2	Unsupervised Intra-Domain Adaptation for Recommendation via Uncertainty Minimization (Online)
10:05 - 10:30	Coffee break ☕
10:30 - 11:15	Keynote 2	Optimizing Communications and Data for Large-scale Learning	Ce Zhang (ETH Zurich)
11:15 - 11:50	Invited talk 1	Relational AutoDiff	Mahmoud Abo Khamis (RelationalAI)
11:50 - 12:05	Research talk 3	Optimizing Machine Learning Inference Queries for Multiple Objectives
12:05 - 12:20	Research talk 4	Provenance-based Explanations for Machine Learning (ML) Models
12:20 - 12:35	Research talk 5	Privacy-preserving Data Federation for Trainable, Queryable and Actionable Data (Online)
12:35 - 14:00	Lunch ☕
14:00 - 14:45	Keynote 3	Towards AI-Generated Database Management Systems	Immanuel Trummer (Cornell University)
14:45 - 15:05	Research talk 6	Efficient Index Learning via Model Reuse and Fine-tuning
15:05 - 15:25	Research talk 7	A Fast Hybrid Spatial Index with External Memory Support (Online)
15:25 - 15:40	Research talk 8	COAX: Correlation-Aware Indexing (Online)
15:40 - 16:00	Coffee break
16:00 - 16:35	Invited talk 2	Applying Compressed Data Direct Computing from Database to ML Workloads	Feng Zhang (Renmin University of China)
16:35 - 17:35	Panel	Data Lakes, AI, and Cloud
17:35	Ending

PANEL: Data Lakes, AI, and Cloud

Location: Platinum 2

Rihan Hai

TU Delft

Host

Yannis Papakonstantinou

Databricks

Matteo Interlandi

Gray Systems Lab, Microsoft

Asterios Katsifodimos

TU Delft & Amazon

Ce Zhang

ETH Zurich

Davit Buniatyan

Activeloop

ORGANISATION

Rihan Hai

TU Delft

Workshop Chair

Nantia Makrynioti

RelationalAI

Workshop Chair

Kwanghyun Park

Yonsei University

Workshop Chair

Chengliang Chai

Beijing Institute of Technology

Workshop Chair

Andra Ionescu

TU Delft

Workshop Chair

Wenbo Sun

TU Delft

Publicity Chair

Program committee:

Dan Olteanu - University of Zurich, Switzerland
Asterios Katsifodimos - TU Delft, The Netherlands
Fatemeh Nargesian - University of Rochester, USA
Rana Alotaibi - Microsoft Gray Systems Lab
Zoi Kaoudi - TU Berlin, Germany
Xiaoou Ding - Harbin Institute of Technology, China
Marios Fragkoulis - DeliveryHero, Greece
Christos Koutras - TU Delft, The Netherlands
Chi Zhang - Brandeis University, USA
Hazar Harmouch - Hasso-Plattner-Institut, Germany
Zhiwei Fan - Meta
Tuo Shi - Tianjin University, China
Syed Ali - Accenture, Germany
Xuanhe Zhou - Tsinghua University, China
Lampros Flokas - Columbia University, USA
Gerardo Vitagliano - Hasso-Plattner-Institut, Germany
Roee Shraga - Northeastern University, USA
Chao Zhang - Tsinghua University, China
Shuang Hao - Beijing Jiaotong University, China
Yuyu Luo - Tsinghua University, China
Zezhou Huang - Columbia University, USA
Utku Sirin - Harvard University, USA
Bojan Karlaš - ETH Zurich, Switzerland
Madelon Hulsebos - University of Amsterdam, The Netherlands

International workshop on databases and machine learning

in conjunction with ICDE 2023 | April 3 2023
