International Workshop on Databases and Machine Learning

in conjunction with ICDE 2023 | April 3, 2023

ABOUT

With the increasing adoption of machine learning (ML) in various applications and disciplines, a synergy between the database (DB) and ML communities has emerged. Steps in an ML pipeline, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle, can benefit from research conducted by the data management community. For example, managing the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases, pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration of the two algebras is possible.

In the opposite direction, ML techniques are being explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, or problems requiring substantial human supervision, such as DB administration, may benefit more from learning algorithms than from rule-based or cost-based approaches.

The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches that address challenges encountered in each of the two areas. In particular, we welcome new research topics that combine the strengths of both fields.

Topics of particular interest in the workshop include, but are not limited to:

  • Data collection and preparation for ML applications
  • Declarative machine learning on databases, data warehouses or data lakes
  • Hybrid optimization techniques for databases and machine learning
  • Model-aware data discovery, cleaning, and transformation
  • Benchmarking ML-oriented data management systems (data augmentation, data cleaning, etc.)
  • Data management during the lifecycle of ML models
  • Novel data management systems for accelerating training and inference of ML models
  • DB-inspired techniques for modeling, storage and provenance of ML artifacts
  • Learned database design, configuration and tuning
  • Machine learning for query optimization
  • Applied machine learning/deep learning for data integration
  • ML-enabled data exploration and discovery in data lakes
  • ML functionality inside DBMS

SUBMISSION GUIDELINES

The workshop will accept both regular papers and short papers (work in progress, vision/outrageous ideas). All submissions must be prepared in accordance with the IEEE template available here. The workshop follows the same Conflict of Interest (COI) rules as ICDE 2023. The page limits (excluding references) are as follows:

Regular papers: 6 pages
Short papers: 4 pages

All submissions (in PDF format) should be submitted via EasyChair.

IMPORTANT DATES

All deadlines are 11:59 PM PST.

Submission deadline: 11 January 2023 (extended from 04 January 2023)
Author notification: 04 February 2023 (extended from 01 February 2023)
Camera-ready version: 13 February 2023
Workshop day: 3 April 2023

KEYNOTES

Matteo Interlandi

How Databases and Machine Learning Systems Can Benefit from Each Other: A Perspective from Product and Research

Matteo Interlandi, Gray Systems Lab, Microsoft

ABSTRACT. As machine learning (ML) continues to gain prominence in today's world, it is becoming increasingly clear that databases and ML systems are two faces of the same coin. Drawing on my experience in both product and research teams, I will provide three different perspectives on why I think that databases and machine learning systems are deeply connected. The talk will be structured around three main topics: execution, optimizations, and abstractions. The audience will discover how classical machine learning runtimes are closely related to query processing, and how ML and database operations can be co-optimized. Finally, I will showcase how to turn relational algebra into tensor operations. Overall, this talk will demonstrate that databases and machine learning systems are fundamentally intertwined, and that recognizing this connection can foster exciting advancements in both fields.
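
To make the last point concrete for readers, the sketch below shows how two relational operators (a filter and a group-by aggregate) can be expressed as plain tensor kernels in PyTorch. It is an illustrative toy, not the speaker's system; the relation and all names are invented for the example.

```python
# Illustrative sketch: relational operators as tensor operations.
# The relation R(dept_id, salary) and all names are invented here.
import torch

# Store the relation column-wise as dense tensors.
dept_id = torch.tensor([1, 1, 2, 2, 3, 3])               # int64 column
salary  = torch.tensor([90., 100., 70., 80., 60., 65.])  # float column

# Selection (WHERE salary > 75) becomes a boolean mask, not a row loop.
mask = salary > 75.0

# SELECT SUM(salary) WHERE salary > 75: masking plus reduction are
# single tensor kernels, so the query can run on a GPU end to end.
total = (salary * mask).sum()

# SELECT dept_id, SUM(salary) GROUP BY dept_id via scatter-add,
# a standard tensor primitive.
sums = torch.zeros(int(dept_id.max()) + 1)
sums.scatter_add_(0, dept_id, salary)

print(total.item())   # 270.0
print(sums.tolist())  # [0.0, 190.0, 150.0, 125.0]
```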

ABOUT. Matteo Interlandi is a Principal Scientist at the Gray Systems Lab (GSL) within Microsoft. His expertise lies at the intersection of Machine Learning and Database Systems, and his research has earned him numerous accolades, including a Best Demo Award at VLDB 2022, an honorable mention at SIGMOD 2021, and a "Best of VLDB 2016" selection. Prior to joining Microsoft, Matteo was a Postdoctoral Scholar at the University of California, Los Angeles, and a Research Associate at the Qatar Computing Research Institute. Matteo earned his Ph.D. from the University of Modena and Reggio Emilia, Italy.

Immanuel Trummer

Towards AI-Generated Database Management Systems

Immanuel Trummer, Cornell University

ABSTRACT. The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects aimed at exploiting advanced language processing for tasks such as parsing a database manual to support automated tuning, or mining data for patterns described in natural language. Finally, I discuss our recent and ongoing research aimed at synthesizing code for SQL processing in general-purpose programming languages, while enabling customization via natural language commands.
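
As a deliberately simplified illustration of the last direction, the sketch below shows how one might prompt a language model to synthesize SQL-processing code, with customization passed in as a natural-language instruction. The `generate` function is a hypothetical stand-in for any text-completion model, not a real API, and the prompt is invented for the example.

```python
# Illustrative sketch: LLM-assisted synthesis of query-processing code.
# `generate` is a hypothetical placeholder, not a real library call.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a text-completion model here")

PROMPT_TEMPLATE = """Translate the SQL query into plain Python over CSV files.
Additional instruction: {customization}
Query: {query}
Python code:"""

def synthesize_processor(query: str, customization: str = "none") -> str:
    # The natural-language `customization` steers the generated code,
    # e.g. asking for chunked processing or extra logging.
    prompt = PROMPT_TEMPLATE.format(customization=customization, query=query)
    return generate(prompt)

# Example call (the returned string would be Python source code):
# synthesize_processor(
#     "SELECT AVG(price) FROM orders WHERE year = 2022",
#     customization="Process the input in chunks of 10,000 rows.",
# )
```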

ABOUT. Immanuel Trummer is an assistant professor at Cornell University. His papers have been selected for "Best of VLDB" and "Best of SIGMOD", for the ACM SIGMOD Research Highlight Award, and for publication in CACM as a CACM Research Highlight. He received the NSF CAREER Award and multiple Google Faculty Research Awards.

Ce Zhang

Optimizing Communications and Data for Large-scale Learning

Ce Zhang, ETH Zurich

ABSTRACT. The rapid progress of machine learning in the last decade has been fueled by the increasing scale of data and compute. However, this ever-increasing scale has created significant challenges for machine learning, which center around two fundamental bottlenecks: data movement (communications) and data quality. To alleviate these bottlenecks, one must jointly optimize and analyze data and learning. In this talk, I will share some of our research in this direction, focusing on optimizing data movements to enable large-scale, distributed and decentralized learning.
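
As one concrete example of what optimizing data movement can mean in practice, the sketch below shows top-k gradient sparsification, a standard trick for shrinking the gradients exchanged between workers. It is a generic illustration of the communication bottleneck the talk addresses, not a description of the speaker's specific systems.

```python
# Generic illustration: top-k gradient sparsification for cheaper
# communication in distributed training. Not the speaker's system.
import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries (values + indices);
    real systems accumulate the dropped residual locally."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx, vals, shape):
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals = topk_sparsify(grad, k=10_000)
# ~2 * 10k numbers cross the network instead of ~1M: roughly 50x less
# traffic, at the cost of an approximate gradient.
approx = densify(idx, vals, grad.shape)
```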

ABOUT. Ce Zhang is an Associate Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible, cost-efficient, and trustworthy for everyone who wants to use them to make our world a better place. He believes in a systems approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable.

INVITED TALKS

Mahmoud Abo Khamis

Relational AutoDiff

Mahmoud Abo Khamis, Senior Computer Scientist, RelationalAI

ABSTRACT. Modern database systems have been progressively expanding their use cases far beyond traditional bookkeeping and data analytics, into artificial intelligence workloads like machine learning and mathematical optimization. This in turn motivates the need for native in-database automatic differentiation to better support these use cases. In this talk, we present RelationalAD (RAD), our framework for automatic differentiation at RelationalAI (RAI). Rel, the modeling language underlying RelationalAI, is declarative and can be viewed as a generalization of Datalog with infinite relations (e.g. arithmetic), aggregation, and relational abstraction. The input to RAD is a Rel program that defines a (set of) relational views, and the output is another Rel program that defines new views that are the derivatives with respect to some given input relations. We show that performing AutoDiff inside a high-level database language like Rel allows us to evaluate derivatives while enjoying many features offered by the underlying database engine, like factorization, query optimization and compilation, as well as support for higher-order derivatives. We present several examples covering recursive Rel programs, matrix calculus, neural network training, and gradient descent, among others. We conclude with some challenges, design issues, and open problems.
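
To give a feel for the core observation, here is a minimal sketch in Python (standing in for Rel, which we do not attempt to reproduce): the derivative of an aggregate view is itself another aggregate over the same relation, so the engine's usual query machinery applies to it unchanged. The relation and all names are invented for the example.

```python
# Illustrative sketch: the derivative of an aggregate view is itself
# another aggregate over the same relation. Toy data, invented names.
data = [  # relation R(x, y), roughly y = 2x
    (1.0, 2.0), (2.0, 3.9), (3.0, 6.1),
]

def loss_view(w):
    # Rel-style view: loss(w) = SUM over R of (y - w*x)^2
    return sum((y - w * x) ** 2 for x, y in data)

def dloss_dw_view(w):
    # The "derivative view": d/dw SUM (y - w*x)^2 = SUM -2*x*(y - w*x).
    # Differentiation pushed through the aggregate yields another
    # aggregate, which the engine can optimize like any other query.
    return sum(-2.0 * x * (y - w * x) for x, y in data)

# Gradient descent driven entirely by the two views.
w = 0.0
for _ in range(100):
    w -= 0.01 * dloss_dw_view(w)
print(f"fitted slope ~ {w:.3f}")  # converges near 2.0
```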

ABOUT. Mahmoud Abo Khamis has been a Senior Computer Scientist at RelationalAI since 2017. He received his Ph.D. in Computer Science and Engineering from the State University of New York at Buffalo in 2016, and worked as a Senior Database Engineer at Infor from 2015 until 2017. His research interests include database systems and theory, in-database machine learning, query optimization and evaluation, information theory, and beyond-worst-case analysis. His work has received two PODS Best Paper Awards (2016 and 2022), two SIGMOD Research Highlight Awards (2016 and 2022), and the Best CSE Dissertation Award 2016 from SUNY Buffalo. His work has also received several invitations to the Journal of the ACM, ACM TODS, and ACM STOC. He served on the program committees of PODS 2019, PODS 2021, and ICDT 2022, and is a reviewer for the VLDB Journal and ACM TODS, among others.

Feng Zhang

Applying Compressed Data Direct Computing from Database to ML Workloads

Feng Zhang, Associate Professor, Renmin University of China

ABSTRACT. The rapid growth of data volume poses challenges for modern database systems in terms of space and time. Compressed data direct computing, a solution that combines the space savings of data compression with the efficiency gains of direct computing, has proven to be a promising research direction in the database field. We find that the core of compressed data direct computing is data reuse, and that it can be extended to ML workloads, which are likewise concerned with data size and computational complexity. In this talk, we introduce Deep Reuse, a concrete implementation of data reuse for ML workloads. It shows great potential for reducing inference latency on popular neural networks such as CNNs. Inspired by Deep Reuse, we further carry out research in three directions: 1) applying it to optimized operators such as the fast convolution algorithm Winograd, 2) embedding it into neural networks to address non-determinism and low-accuracy problems through a consistent training process, and 3) extending the application to resource-constrained IoT devices.
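
For readers unfamiliar with the reuse idea, the sketch below is a toy version: cluster similar activation rows with a cheap LSH signature, run the expensive matrix multiplication once per cluster centroid, and let every row reuse its cluster's result. It is a simplified rendering of the idea, not the published Deep Reuse algorithm; all names are invented for the example.

```python
# Toy sketch of computation reuse in an ML kernel; a simplified
# rendering of the idea, not the published Deep Reuse algorithm.
import numpy as np

def reuse_matmul(X: np.ndarray, W: np.ndarray, n_bits: int = 8):
    rng = np.random.default_rng(0)
    # Cheap LSH signature: sign bits of a few random projections.
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = ((X - X.mean(axis=0)) @ planes > 0).astype(np.int64)
    codes = bits @ (1 << np.arange(n_bits))
    # Rows with the same signature share one centroid.
    uniq, inverse = np.unique(codes, return_inverse=True)
    k = uniq.size
    centroids = np.zeros((k, X.shape[1]))
    np.add.at(centroids, inverse, X)
    centroids /= np.bincount(inverse, minlength=k)[:, None]
    # The expensive matmul runs on k centroids, not all rows ...
    Y_small = centroids @ W
    # ... and every original row reuses its cluster's result.
    return Y_small[inverse]

X = np.random.rand(10_000, 64)  # e.g. a batch of activation rows
W = np.random.rand(64, 128)     # a layer's weight matrix
Y_approx = reuse_matmul(X, W)   # at most 256 matmul rows, not 10,000
```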

ABOUT. Feng Zhang is an Associate Professor at Renmin University of China. He received his PhD from Tsinghua University in 2017, and was a visiting scholar at NCSU in 2016 and NUS in 2018. His research interests include databases and high-performance computing. He mainly studies high-performance direct computing on compressed data in data analytics and management. His papers have been published in prestigious international conferences and journals, including SIGMOD, VLDB, SC, USENIX ATC, ASPLOS, and NeurIPS. He received the ACM SIGHPC China Rising Star Award and a TPDS Best Paper Award. He has provided consulting services to numerous IT companies in China, including Alibaba, Tencent, and Ant Group.

PROGRAM

Accepted papers:

  • Provenance-based Explanations for Machine Learning (ML) Models
    Justin Turnau, Nkechi Akwari, Seokki Lee and Dwarkesh Rajput
  • Privacy-preserving Data Federation for Trainable, Queryable and Actionable Data
    Stavroula Iatropoulou, Theodora Anastasiou, Sophia Karagiorgou, Petros Petrou, Dimitrios Alexandrou and Thanassis Bouras
  • Imputation of Missing Values in Training Data using Variational Autoencoder
    Xuerui Hong and Shuang Hao
  • COAX: Correlation-Aware Indexing
    Ali Hadian, Behzad Ghaffari, Taiyi Wang and Thomas Heinis
  • Efficient Index Learning via Model Reuse and Fine-tuning
    Guanli Liu, Jianzhong Qi, Lars Kulik, Kazuya Soga, Renata Borovica-Gajic and Benjamin I. P. Rubinstein
  • A Fast Hybrid Spatial Index with External Memory Support
    Xinyu Su, Jianzhong Qi and Egemen Tanin
  • Optimizing Machine Learning Inference Queries for Multiple Objectives
    Ziyu Li, Mariette Schönfeld, Rihan Hai, Alessandro Bozzon and Asterios Katsifodimos
  • Unsupervised Intra-Domain Adaptation for Recommendation via Uncertainty Minimization
    Chenghao Chen, Jie Xiao, Jin Liu, Jie Zhang, Jia Jia and Ning Hu

California (PST) | Activity | Title | Presenter
8:30 - 8:40 | Opening
8:40 - 9:25 | Keynote 1 | How Databases and Machine Learning Systems Can Benefit from Each Other: A Perspective from Product and Research | Matteo Interlandi (Gray Systems Lab, Microsoft)
9:25 - 9:45 | Research talk 1 | Imputation of Missing Values in Training Data using Variational Autoencoder (Online)
9:45 - 10:05 | Research talk 2 | Unsupervised Intra-Domain Adaptation for Recommendation via Uncertainty Minimization (Online)
10:05 - 10:30 | Coffee break ☕
10:30 - 11:15 | Keynote 2 | Optimizing Communications and Data for Large-scale Learning | Ce Zhang (ETH Zurich)
11:15 - 11:50 | Invited talk 1 | Relational AutoDiff | Mahmoud Abo Khamis (RelationalAI)
11:50 - 12:05 | Research talk 3 | Optimizing Machine Learning Inference Queries for Multiple Objectives
12:05 - 12:20 | Research talk 4 | Provenance-based Explanations for Machine Learning (ML) Models
12:20 - 12:35 | Research talk 5 | Privacy-preserving Data Federation for Trainable, Queryable and Actionable Data (Online)
12:35 - 14:00 | Lunch
14:00 - 14:45 | Keynote 3 | Towards AI-Generated Database Management Systems | Immanuel Trummer (Cornell University)
14:45 - 15:05 | Research talk 6 | Efficient Index Learning via Model Reuse and Fine-tuning
15:05 - 15:25 | Research talk 7 | A Fast Hybrid Spatial Index with External Memory Support (Online)
15:25 - 15:40 | Research talk 8 | COAX: Correlation-Aware Indexing (Online)
15:40 - 16:00 | Coffee break ☕
16:00 - 16:35 | Invited talk 2 | Applying Compressed Data Direct Computing from Database to ML Workloads | Feng Zhang (Renmin University of China)
16:35 - 17:35 | Panel | Data Lakes, AI and the Cloud
17:35 | Ending

ORGANISATION

Program committee:

  • Dan Olteanu - University of Zurich, Switzerland
  • Asterios Katsifodimos - TU Delft, The Netherlands
  • Fatemeh Nargesian - University of Rochester, USA
  • Rana Alotaibi - Microsoft Gray Systems Lab
  • Zoi Kaoudi - TU Berlin, Germany
  • Xiaoou Ding - Harbin Institute of Technology, China
  • Marios Fragkoulis - DeliveryHero, Greece
  • Christos Koutras - TU Delft, The Netherlands
  • Chi Zhang - Brandeis University, USA
  • Hazar Harmouch - Hasso-Plattner-Institut, Germany
  • Zhiwei Fan - Meta
  • Tuo Shi - Tianjin University, China
  • Syed Ali - Accenture, Germany
  • Xuanhe Zhou - Tsinghua University, China
  • Lampros Flokas - Columbia University, USA
  • Gerardo Vitagliano - Hasso-Plattner-Institut, Germany
  • Roee Shraga - Northeastern University, USA
  • Chao Zhang - Tsinghua University, China
  • Shuang Hao - Beijing Jiaotong University, China
  • Yuyu Luo - Tsinghua University, China
  • Zezhou Huang - Columbia University, USA
  • Utku Sirin - Harvard University, USA
  • Bojan Karlaš - ETH Zurich, Switzerland
  • Madelon Hulsebos - University of Amsterdam, The Netherlands