Thesis projects

The WIS group is open for students who want to do their thesis on subjects in the wider area of web information systems and information architecture.

At the WIS group we encourage students to contact prof. Geert-Jan Houben to discuss possible topics for a thesis project (and literature survey). In such discussions students are free to suggest their own topic, and a concrete thesis (or literature survey) topic is then defined together.

Master's students who would like to specialise in WIS are encouraged to select Web Science & Engineering, Information Retrieval, Crowd Computing, and Seminar Web Information Systems (possibly in the second MSc year). In any case, for help in composing a program, students can always contact prof. Houben for advice.

As a rough indication of possible subjects for thesis projects, below we list some subjects of projects that have been running, are currently running, or are open for new students. Many of these topics can be approached in projects within the research lab, in industry, or in a collaboration between academia and industry. Industry here includes organisations that are public or private, large or small, national or international. Examples of organisations that have hosted some of our thesis students: Adyen, Crowdsense, Exact Software, Capgemini, Gemeente Amsterdam, Greetinq, KPMG, IBM, ICTU, IDS Scheer, ING, Innopay, Isaac, ISD, Logica, Sanoma, Tamtam, TNO, and Truvo.

Thesis topics at Sigma on Social Computing Systems

The mission of the Sigma-Lab is to understand, design, and build social computing systems that process social data to improve Web-based applications and systems. The Sigma-Lab team is supervised by Alessandro Bozzon, and it includes the post-docs Achilleas Psyllidis, Pavel Kucherbaev, and Andrea Mauri; the PhD students Jasper Oosterman, Jie Yang, Sepideh Mesbah, Vincent Gong, and Shahin Sharifi; and the research engineers Carlo van der Valk and Ioannis Protonotarios.

The research conducted in the Sigma-Lab aims at answering questions such as: How can humans and machines better collaborate in the creation, analysis, and sense-making of social data? How can the knowledge creation process be controlled and accelerated at scale? How can social data be systematically and reliably exploited in urban analytics? How can social data be effectively and efficiently injected into existing WIS to achieve pre-defined data-driven business goals?

In the last five years, the Sigma-Lab has supervised more than 20 MSc theses, and it is looking for students interested in one of the following topics.

  • Urban Data Analytics for Smart Cities: in the context of the SocialGlass project, we are looking for master students with a passion for data science and an interest in improving the quality of life in our cities. In SocialGlass we develop new urban data science methods that can help address issues in domains such as transportation, crowdedness in the city, responsible energy consumption, urban planning, and business attractiveness. Examples of available MSc thesis topics include:
    • Developing (Deep) Machine Learning models to quantify and predict safety / crime rates in urban neighborhoods, using combinations of StreetView, social media, and socio-economic data
    • Developing (Deep) Machine Learning models to quantify and predict quality-of-life aspects (e.g., segregation, deprivation) in cities, using satellite imagery and social media data
    • Developing models and implementing systems to recommend new POI locations in cities
  • Analysing Individual Energy Consumption Behaviour using Social Media Data: Currently, energy consumption data are primarily gathered by (smart) energy meters at the household level. While such data is highly reliable and temporally complete, its acquisition requires access to the energy infrastructure; moreover, such data is semantically poor. The aim of this project is to explore the potential of social media as an alternative source of data about individuals' energy consumption behaviour. We focus on four components of energy lifestyle, namely dwelling, mobility, food consumption, and leisure. The output of this project will be a social media analysis pipeline that collects and classifies energy-related social media posts (e.g. tweets) and generates an energy consumption profile for social media users; a minimal sketch of such a classification pipeline is given after this list. Relevant MSc courses: Information Retrieval, Web Science and Engineering.
  • Chatbots Able to Learn New Skills: There are chatbots that serve a purpose of retrieving information (e.g. “when is the next train to Amsterdam?”) or performing a transaction (e.g. “purchase one ticket to the Escher museum in The Hague”). Such chatbots are usually designed for a specific narrow use case and their functionality is hardcoded; extending it requires a software developer to intervene in the chatbot's codebase. We envision a chatbot system that can extend its functionality by learning from users, crowd workers, experts, or even automatically. Think about Wikipedia: years ago it lacked articles on many topics, but with the contributions of thousands of people around the world it is now hard to find a topic that is not covered. Similarly, if thousands of people teach new skills to such a chatbot, it will be able to effectively serve millions of users in a wide range of domains.
  • Generating Chatbots based on APIs and DB schemas. Currently it is possible to develop a chatbot semi-automatically based on a Q/A dataset or an API. We believe that a logical next step is to construct a chatbot automatically based on a database schema or a REST API. This research will help to understand how to map a database schema or an API endpoint tree onto a conversation tree, and to allow fast creation of chatbots; a minimal schema-to-questions sketch is given after this list.
  • Human Aided Bots - Dialogue Management. When we purchase a coffee, the conversation we have with the barista is quite standard. In contrast, at work the conversation with a colleague about solving a unique complex problem is not predefined, and we adapt along the way. Similarly, chatbots usually manage to follow a dialogue quite well within the predefined domain they were designed for, but fail to do so in more complex and less predictable conversation scenarios. We aim to address this issue by designing methods and tools for modelling both fixed and open dialogues. A special interest is understanding dialogues on the fly, even when the chatbot was not initially designed for them.
  • Forecasting of financial status of SMEs with Social Data: the goal of this MSc project is to combine social media data (e.g., from Twitter and LinkedIn) with financial data and macroeconomic data from public sources (e.g. Google Trends, Yahoo Finance, business magazine articles), towards forecasting the financial status of small and medium enterprises. The work will be performed in collaboration with the newly founded Exact Data Science core team. Your work will allow customers of Exact (i.e., the entrepreneurs) to leverage the power of big data for better data-driven decision making. As an intern at Exact, you will experience hands-on all phases of the typical data scientist's work: from data collection and transformation to feature engineering, and from training predictive models to deploying them in actual production code.
  • Enterprise Crowdsourcing: While machine learning and artificial intelligence applications are gaining popularity, enterprises are devoting more and more attention to enterprise crowdsourcing as an effective technique to capitalize on their available human resources and to incorporate in-house human-generated data. The aim of this thesis project, to be performed in collaboration with IBM Netherlands, is to advance the state of the art in enterprise crowdsourcing by studying how different task designs and participation incentives affect the quality and reliability of the employees' work.
  • Music Recommendation Based on User Context. It is known that people listen to different music at work and at home. People listen to different music when having breakfast alone on a working day than when having dinner with friends on Saturday. In all these contexts, music preferences also differ from person to person. Going to the app and manually choosing a different playlist every time is so 20th century. We envision a system that learns from the user and automatically plays the music the user wants to listen to at that moment, depending on activity, location, weather, mood, physical status, and other context features. In this project we need to model the user's context relevant for music preferences, develop methods to detect this context, and develop a recommender system mapping contexts to listening preferences.
  • Extracting Domain-Specific Entities and Relations from Text (Web pages): Extracting entities of interest (e.g., dataset, method, evaluation metrics) and their relations (e.g., isUsedBy, ComparedWith, ...) from massive text corpora (e.g., ClueWeb) is important for enhancing semantic search and for linking information across different sources. The aim of this project is to devise methods to automatically extract the entities of interest and the relations between them. Relevant MSc courses: Information Retrieval, Pattern Recognition.
  • Long-Tail Named Entity Extraction: This engineering-heavy MSc thesis focuses on implementing a framework for named entity recognition and extraction from natural text, with a focus on rare entities. In collaboration with our team, novel NER and NEE methods are developed, implemented, and evaluated on scientific publication corpora; a minimal NER sketch is given after this list. The final result is released as a well-documented open source project.
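As referenced in the energy-lifestyle topic above, a pipeline of this kind would collect social media posts and classify them into lifestyle categories. The sketch below is a minimal, hypothetical illustration using scikit-learn; the category labels, example tweets, and the idea of building a per-user profile from predicted categories are assumptions for illustration, not project code.

```python
# Minimal sketch: classify tweets into hypothetical energy-lifestyle categories.
# Labels, example data, and names are illustrative assumptions, not project code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative training set (real work would use a labelled tweet corpus).
tweets = [
    "Just biked to work instead of taking the car",
    "Cooking a big pasta dinner for the family tonight",
    "Finally installed solar panels on the roof",
    "Watching movies at home all weekend",
]
labels = ["mobility", "food", "dwelling", "leisure"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(tweets, labels)

# An energy-consumption "profile" could then simply be the per-user
# distribution of predicted categories over that user's recent posts.
print(pipeline.predict(["Took the train to Rotterdam this morning"]))
```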
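For the chatbot-generation topic above, one naive starting point is to derive intents and question templates directly from table and column names. The sketch below is a toy illustration under that assumption; the schema, the intent naming scheme, and the generate_slot_questions helper are all hypothetical.

```python
# Minimal sketch: derive conversational prompts from a toy database schema.
# The schema, templates, and function names are hypothetical illustrations.
toy_schema = {
    "trains": {"columns": ["departure_time", "destination", "platform"],
               "key": "destination"},
    "tickets": {"columns": ["event", "price", "date"],
                "key": "event"},
}

def generate_slot_questions(schema):
    """For each table, generate one question template per non-key column."""
    questions = {}
    for table, meta in schema.items():
        key = meta["key"]
        for column in meta["columns"]:
            if column == key:
                continue
            intent = f"ask_{table}_{column}"
            questions[intent] = (
                f"What is the {column.replace('_', ' ')} "
                f"of the {table[:-1]} for {{{key}}}?"
            )
    return questions

for intent, template in generate_slot_questions(toy_schema).items():
    print(intent, "->", template)
```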
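For the two entity-extraction topics above, a common baseline is to run an off-the-shelf NER model and observe where it fails on long-tail, domain-specific entities such as dataset names, methods, and metrics. The sketch below uses spaCy's pretrained English model as such a baseline; the example sentence is made up, and the thesis itself would go well beyond this.

```python
# Minimal sketch: run an off-the-shelf NER model over a sentence from a paper.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("We evaluate our method on the ClueWeb corpus and report "
        "precision, recall, and F1 against a BiLSTM-CRF baseline.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

# A generic model will miss or mislabel long-tail, domain-specific entities
# (datasets, methods, evaluation metrics); developing and evaluating NER/NEE
# methods for exactly these entities is what the thesis would address.
```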

Thesis topics at Lambda on Information Retrieval and Data Science in MOOCs

The Lambda-Lab has two broad research lines: information retrieval (conversational search, deep learning approaches to ad-hoc retrieval, reproducibility in IR, search as learning) and data science in the context of Massive Open Online Courses. Get in touch with Claudia Hauff to discuss possible options.

Thesis projects at Epsilon on explanations and human interaction

The E(psilon)-lab is a new lab (formed April 2017) within the Web Information Systems group and is concerned with human interaction with artificial advice givers, and specifically explanations to support decision making. To have a concrete idea of what this means, take a look at the introduction to the special issue on human interaction with artificial advice givers.

The E-lab takes a user-centered approach to research, and evaluates the quality of human decision making to drive both interface and algorithm design. The research is currently driven by two applied challenges: 1) How to deal with filter-bubbles and confirmation bias; and 2) How to support decision making for sequences of items (in addition to individual items). To address these challenges, the E-lab has two research lines and accommodates MSc thesis topics in both. Beyond what is mentioned here as examples, you are also welcome to use your own ideas as long as they fit into the research lines.

  • Explainable algorithms: When a system provides advice, such as an item to try or buy in a recommender system, it is not always clear how this conclusion was reached. This line investigates how advice can be explained in a way that supports users in making good decisions. It is currently focused on how to construct recommendation sequences that are diverse, while maintaining user satisfaction and considering trade-offs between different types of preferences in domains such as music and tourism. One example project could be to develop a playlist recommender system that considers the ordering and diversity of the tracks (a minimal re-ranking sketch is given after this list). Another is to automatically generate travel itineraries for tourists. Relevant MSc courses to have followed for this line are Multimedia Search and Recommendation, Fundamentals of Data Analytics, and Information Retrieval.
  • Novel interfaces and interactions for explanations: Recent developments in AI have enabled better artificial advice giving that supports and even augments human capabilities. As these advice-giving systems increase in complexity, their designers have also come to realize that a standard graphical user interface (GUI) is often not sufficient to harness their power. This line investigates methods for supporting interaction with AAGs (e.g., natural language, visualization, and argumentation). It is currently focused on interfaces that help users understand and explore their blind spots and discover novel and relevant content. Example projects would develop and evaluate argumentation interfaces with users, develop interactive explanation interfaces, or develop novel explanation visualizations. Relevant MSc courses to have followed for this line include (depending on the project) Artificial Intelligence Techniques, Human-Agent/Robot Teamwork, Affective Computing, and Data Visualization.
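As mentioned in the explainable-algorithms line above, one way to trade off diversity against relevance in a recommendation sequence is a greedy, maximal-marginal-relevance-style re-ranking. The sketch below is a minimal illustration of that heuristic; the relevance scores, similarity values, and the rerank helper are hypothetical and would come from an upstream recommender in a real system.

```python
# Minimal sketch: greedy re-ranking of candidate tracks that trades off
# predicted relevance against similarity to tracks already in the playlist
# (a maximal-marginal-relevance-style heuristic). Scores and similarities
# are assumed to come from an upstream recommender; the values are made up.
def rerank(candidates, relevance, similarity, k=3, lam=0.7):
    """Pick k items maximizing lam*relevance - (1-lam)*max similarity to chosen."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        def score(item):
            redundancy = max((similarity[item][c] for c in chosen), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

relevance = {"a": 0.9, "b": 0.85, "c": 0.4}
similarity = {"a": {"b": 0.95, "c": 0.1},
              "b": {"a": 0.95, "c": 0.2},
              "c": {"a": 0.1, "b": 0.2}}
print(rerank(["a", "b", "c"], relevance, similarity))
```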

Thesis Projects on Human-Centered Query Processing

The Kappa-Lab, supervised by Christoph Lofi, focuses on topics related to the structural representation of information, with a specific focus on querying that information from a user's perspective. As such, it bridges database research and the areas of knowledge engineering, natural language processing, and user modelling.
Currently, the following thesis topics are available:

  • Smart Access to Open Educational Resources: This graduation project is hosted at TU Delft library. More information can be found here.
  • Long-Tail Named Entity Extraction: This engineering-heavy MSc thesis focuses on implementing a framework for named entity recognition and extraction from natural text, with a focus on rare entities. In collaboration with our team, novel NER and NEE methods are developed, implemented and evaluated. The final result is released as a well-documented open source project.
  • Extracting Domain-Specific Entities and Relations from Text (Web pages): Extracting entities of interest (e.g., dataset, method, evaluation metrics) and their relations (e.g., isUsedBy, ComparedWith, ...) from massive text corpora (e.g., ClueWeb) is important for enhancing semantic search and for linking information across different sources. The aim of this project is to devise methods to automatically extract the entities of interest and the relations between them. Relevant MSc courses: Information Retrieval, Pattern Recognition.

Thesis Topics on Big (Streaming) Data Management and Analysis

The thesis subjects below are to be advised by Dr. Asterios Katsifodimos, Assistant Professor with the Web Information Systems Group. Asterios works in the broad area of data management, with a focus on scalable batch and streaming analytics. The thesis subjects below are defined in a high-level fashion so that students can steer the subject to their liking (more on systems, or theory), level (Bachelor or Master), and skill-set. If you are interested in any of the subjects below, or want to propose one that would match Asterios' style of research, please get in contact with him!

    Internship opportunities: Many of the theses below are relevant to real-life problems and (depending on the motivation of the student and the quality of their results) have the potential of an internship opportunity with organisations like KPMG in Amsterdam, the SAP Innovation Center in Berlin, KTH in Stockholm, TU Berlin, or the Delft Data Science Platform (under development).

    • Bridging Linear and Relational Algebra for Scalable Data Science
      Linear algebra operations are at the core of many Machine Learning (ML) pipelines. At the same time, a considerable amount of the effort of solving data analytics problems is spent on data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets, in which case they are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets; efficient linear algebra operators, however, use block-partitioned matrices. The goal of this thesis is to optimize data science pipelines by applying ideas from database optimizers to large Big Data pipelines that include both linear and relational algebra operations (a minimal sketch of such a mixed pipeline is given after this list). The thesis can focus on the theory side (how can we represent “query plans” so that we can optimize them?) and/or on the practical side (how can we design novel physical operators to use in scalable ML pipelines?).
    • Executing Transactions in Modern Stream Processors
      Stream processors such as Apache Flink, Storm, or Apex are emerging in industry as tools for performing both analytical workloads (e.g., monitoring log files, sensors, and micro-services) and mission-critical services such as fraud detection in credit-card transactions. Modern streaming systems are now in an arms race to provide first-class support for application state with strong state consistency guarantees in the presence of failures. At the same time, there is a growing need for executing high-throughput transactions directly on the stream processor, rather than on a traditional database system. The goal of this thesis is to investigate ways of executing transactions on the application state of modern stream processing systems (e.g., Apache Flink).
    • Dataset Versioning For Social Data Science
      Version control is a very important part of every development process. Developers typically branch from a version of a software system, apply their own changes, and then merge their changes into a master branch. Various tools and systems exist, the most famous and successful being Git (and the GitHub website). Git, however, is designed for the development process of software, not data. This thesis should create tools and a platform, very similar to the ideas behind GitHub, but for very large datasets. There are many challenges associated with dataset versioning, most of which stem from the sheer volume of datasets, which can be in the order of terabytes. It is evident that retrieving, comparing (and creating deltas of), and storing data of such a volume is a non-trivial task. This thesis will investigate current techniques for version management of massive datasets and propose changes to those techniques in order to tackle the challenges mentioned above (a minimal chunk-hashing sketch for dataset deltas is given after this list).
    • Data Lakes
      The aim of this thesis work is to understand the state of the art in technology for data lakes. More specifically, the student will work on implementing novel data processing functionality and services into an existing data-lake platform.
    • Scalable Inference with Deep-Learning Models in the GPU Clouds
      Nowadays we witness the proliferation of solutions for scalable Machine Learning inference (e.g., Google Cloud Machine Learning, SAP Leonardo ML Foundation, Amazon’s AI on AWS). In these platforms, a specialized model is first trained and then used to respond to users’ requests, such as image recognition, where the user sends an image to a running service and receives a set of objects that are found in that image. TensorFlow and the Inception deep-learning model are typical examples of technologies used for such ML inference. However, such inference is very slow on CPUs, and cloud companies typically use GPUs to perform inference at scale. In Software-as-a-Service (SaaS) offerings, the objective of a service provider is to allocate and de-allocate resources (CPU, GPU, memory, and network) to satisfy its SLAs while minimizing its operational costs. Since costs are directly associated with the amount of resources a provider is utilizing, service providers typically achieve higher utilization and profits by multiplexing workloads from different users. The goal of this thesis is to design a solution for multiplexing ML inference workloads on GPUs, in order to increase resource utilization and adhere to users' SLAs.
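As a small-scale illustration of the mixed relational / linear-algebra pipelines described in the first topic above, the sketch below joins two toy tables with pandas and then fits a least-squares model with NumPy. The data, column names, and the choice of a normal-equations fit are illustrative assumptions; a real pipeline of this kind would run on a dataflow engine such as Spark or Flink over much larger data.

```python
# Minimal sketch of a mixed relational / linear-algebra pipeline of the kind
# the thesis would optimize. Data and column names are made up.
import numpy as np
import pandas as pd

# (i) Relational part: join two input tables on a shared key.
users = pd.DataFrame({"user_id": [1, 2, 3], "age": [23, 35, 41]})
clicks = pd.DataFrame({"user_id": [1, 2, 3], "clicks": [10, 3, 7],
                       "bought": [1, 0, 1]})
joined = users.merge(clicks, on="user_id")

# (ii) UDF part: feature extraction / vectorization.
X = joined[["age", "clicks"]].to_numpy(dtype=float)
X = np.hstack([np.ones((len(X), 1)), X])        # add bias column
y = joined["bought"].to_numpy(dtype=float)

# (iii) Linear-algebra part: least-squares model fit.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("learned weights:", w)
```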
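For the dataset-versioning topic above, a simple baseline for computing deltas between dataset versions is fixed-size chunking with content hashing, so that only changed chunks need to be stored. The sketch below illustrates that idea; the chunk size, file names, and helper functions are hypothetical, and production systems typically use content-defined chunking instead.

```python
# Minimal sketch of content-addressed chunking for dataset deltas:
# split a file into fixed-size chunks, hash each chunk, and keep only the
# chunks that changed between two versions of a dataset.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, an illustrative choice

def chunk_hashes(path):
    """Return the ordered list of SHA-256 digests of a file's chunks."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def delta(old_path, new_path):
    """Indices of chunks in the new version that are absent from the old one."""
    old = set(chunk_hashes(old_path))
    return [i for i, h in enumerate(chunk_hashes(new_path)) if h not in old]

# Usage (hypothetical files): only the changed chunks of v2 need to be stored.
# print(delta("dataset_v1.csv", "dataset_v2.csv"))
```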