With the growing adoption of machine learning (ML) across applications and disciplines, a synergy between the database (DB) systems and ML communities has emerged. Steps involved in an ML pipeline, such as data preparation and cleaning, feature engineering, and management of the ML lifecycle, can benefit from research conducted by the data management community. For example, managing the ML lifecycle requires mechanisms for modeling, storing, and querying ML artifacts. Moreover, in many use cases pipelines require a mixture of relational and linear algebra operators, raising the question of whether a seamless integration between the two algebras is possible.
In the opposite direction, ML techniques are being explored in core components of database systems, e.g., query optimization, indexing, and monitoring. Traditionally hard problems in databases, such as cardinality estimation, and problems that require heavy human supervision, such as DB administration, might benefit more from learning algorithms than from rule-based or cost-based approaches.
The workshop aims to bring together researchers and practitioners at the intersection of DB and ML research, providing a forum for DB-inspired or ML-inspired approaches that address challenges encountered in each of the two areas. In particular, we welcome new research topics combining the strengths of both fields.
Topics of particular interest in the workshop include, but are not limited to:
The workshop will accept both regular papers and short papers (work in progress, vision/outrageous ideas). All submissions must be prepared in accordance with the IEEE template available here. The workshop follows the same Conflict of Interest (COI) rules as ICDE 2023. The page limits (excluding references) are as follows:
|Regular papers:||6 pages|
|Short papers:||4 pages|
All submissions (in PDF format) should be sent to Easychair.
All deadlines are 11:59PM PST.
|Camera-ready version:||13 February 2023|
|Workshop day:||3 April 2023|
ABSTRACT. As machine learning (ML) continues to gain prominence in today's world, it is becoming increasingly clear that databases and ML systems are two faces of the same coin. Drawing on my experience in both product and research teams, I will provide three different perspectives on why I think that databases and machine learning systems are deeply connected. The talk will be structured around three main topics: execution, optimizations, and abstractions. The audience will discover how classical machine learning runtimes are closely related to query processing, and how ML and database operations can be co-optimized. Finally, I will showcase how to turn relational algebra into tensor operations. Overall, this talk will demonstrate that databases and machine learning systems are fundamentally intertwined, and that recognizing this connection can foster exciting advancements in both fields.
ABOUT. Matteo Interlandi is a Principal Scientist at the Gray Systems Lab (GSL) within Microsoft. His expertise lies at the intersection of Machine Learning and Database Systems, and his research has earned him numerous accolades, including a best demo award at VLDB 2022, an honorable mention at SIGMOD 2021, and a “Best of VLDB 2016” selection. Prior to joining Microsoft, Matteo was a Postdoctoral Scholar at the University of California, Los Angeles, and a Research Associate at the Qatar Computing Research Institute. Matteo earned his Ph.D. from the University of Modena and Reggio Emilia, Italy.
ABSTRACT. The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects, aimed at exploiting advanced language processing for tasks such as parsing a database manual to support automated tuning, or mining data for patterns, described in natural language. Finally, I discuss our recent and ongoing research, aimed at synthesizing code for SQL processing in general-purpose programming languages, while enabling customization via natural language commands.
ABOUT. Immanuel Trummer is an assistant professor at Cornell University. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as a CACM Research Highlight. He received the NSF CAREER Award and multiple Google Faculty Research Awards.
ABSTRACT. The rapid progress of machine learning in the last decade has been fueled by the increasing scale of data and compute. However, this ever-increasing scale has created significant challenges for machine learning, which center around two fundamental bottlenecks: data movement (communications) and data quality. To alleviate these bottlenecks, one must jointly optimize and analyze data and learning. In this talk, I will share some of our research in this direction, focusing on optimizing data movements to enable large-scale, distributed and decentralized learning.
ABOUT. Ce is an Associate Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible while being cost-efficient and trustworthy to everyone who wants to use them to make our world a better place. He believes in a system approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable.
ABSTRACT. Modern database systems have been progressively expanding their use cases far outside traditional bookkeeping and data analytics, and into artificial intelligence workloads like machine learning and mathematical optimization. This in turn motivates the need for native in-database automatic differentiation to better support these use cases. In this talk, we present RelationalAD (RAD), our framework for automatic differentiation at RelationalAI (RAI). Rel, the modeling language underlying RelationalAI, is declarative and can be viewed as a generalization of Datalog with infinite relations (e.g. arithmetic), aggregation, and relational abstraction. The input to RAD is a Rel program that defines a (set of) relational views and the output is another Rel program that defines new views that are the derivatives with respect to some given input relations. We show that performing AutoDiff inside a high-level database language like Rel allows us to evaluate derivatives while enjoying many features offered by the underlying database engine like factorization, query optimization and compilation, as well as support for higher-order derivatives. We present several examples covering recursive Rel programs, matrix calculus, neural network training, and gradient descent among others. We conclude with some challenges, design issues, as well as open problems.
ABOUT. Mahmoud Abo Khamis has been a Senior Computer Scientist at RelationalAI since 2017. He received his Ph.D. in Computer Science and Engineering from the State University of New York at Buffalo in 2016. He also worked as a Senior Database Engineer at Infor from 2015 until 2017. His research interests include database systems and theory, in-database machine learning, query optimization and evaluation, information theory, and beyond worst-case analysis. His work has received two PODS Best Paper Awards in 2016 and 2022, two SIGMOD Research Highlight Awards in 2016 and 2022, and the Best CSE Dissertation Award 2016 from SUNY Buffalo. His work also received several invitations to the Journal of the ACM, the ACM TODS, and the ACM STOC. He served on the program committees of PODS 2019, PODS 2021, and ICDT 2022, and he is also a reviewer for the VLDB Journal and the ACM TODS among others.
ABSTRACT. The rapid growth of data volume poses challenges for modern database systems in terms of space and time. Compressed data direct computing, a solution that combines the space savings of data compression with the efficiency gains of direct computing, has proven to be a promising research direction in the database field. We find that the core of compressed data direct computing is data reuse, and that it can be extended to ML workloads, which are also concerned with data size and computational complexity. In this talk, we introduce Deep Reuse, a concrete implementation of data reuse for ML workloads. It shows great potential for reducing inference latency on popular neural networks such as CNNs. Inspired by Deep Reuse, we further carry out research in three aspects: 1) applying it to optimized operators such as the fast convolution algorithm Winograd, 2) embedding it into neural networks to address non-determinism and low accuracy problems through a consistent training process, and 3) extending the application to resource-constrained IoT devices.
ABOUT. Feng Zhang is an Associate Professor at Renmin University of China. He received his PhD from Tsinghua University in 2017, and was a visiting scholar at NCSU in 2016 and at NUS in 2018. His research interests include databases and high-performance computing. He mainly studies high-performance direct computing on compression in data analytics and management. His papers have been published in prestigious international conferences and journals including SIGMOD, VLDB, SC, USENIX ATC, ASPLOS, and NeurIPS. He received the ACM SIGHPC China Rising Star Award and the TPDS Best Paper Award. He has provided consulting services to numerous IT companies in China, including Alibaba, Tencent, and Ant Company.
|8:30 - 8:40||Opening|
|8:40 - 9:25||Keynote 1||How Databases and Machine Learning Systems Can Benefit from Each Other: A Perspective from Product and Research||Matteo Interlandi (Gray Systems Lab, Microsoft)|
|9:25 - 9:45||Research talk 1||Imputation of Missing Values in Training Data using Variational Autoencoder (Online)|
|9:45 - 10:05||Research talk 2||Unsupervised Intra-Domain Adaptation for Recommendation via Uncertainty Minimization (Online)|
|10:05 - 10:30||Coffee break ☕|
|10:30 - 11:15||Keynote 2||Optimizing Communications and Data for Large-scale Learning||Ce Zhang (ETH Zurich)|
|11:15 - 11:50||Invited talk 1||Relational AutoDiff||Mahmoud Abo Khamis (RelationalAI)|
|11:50 - 12:05||Research talk 3||Optimizing Machine Learning Inference Queries for Multiple Objectives|
|12:05 - 12:20||Research talk 4||Provenance-based Explanations for Machine Learning (ML) Models|
|12:20 - 12:35||Research talk 5||Privacy-preserving Data Federation for Trainable, Queryable and Actionable Data (Online)|
|12:35 - 14:00||Lunch ☕|
|14:00 - 14:45||Keynote 3||Towards AI-Generated Database Management Systems||Immanuel Trummer (Cornell University)|
|14:45 - 15:05||Research talk 6||Efficient Index Learning via Model Reuse and Fine-tuning|
|15:05 - 15:25||Research talk 7||A Fast Hybrid Spatial Index with External Memory Support (Online)|
|15:25 - 15:40||Research talk 8||COAX: Correlation-Aware Indexing (Online)|
|15:40 - 16:00||Coffee break|
|16:00 - 16:35||Invited talk 2||Applying Compressed Data Direct Computing from Database to ML Workloads||Feng Zhang (Renmin University of China)|
|16:35 - 17:35||Panel||Datalakes, AI and the Cloud|