The WIS group is open to students who want to do their thesis on subjects in the wider area of web information systems and information architecture.
At the WIS group we encourage students to contact prof. Geert-Jan Houben (make an appointment via the secretary Esther van Seters at e.e.vanrooijen@tudelft.nl) to discuss possible topics for a thesis project (and literature survey). In such discussions students are free to suggest their own topic, and together a concrete thesis (or literature survey) subject will then be defined.
As a rough indication of possible subjects for thesis projects, below we give some subjects of projects that have been running, that are running or that are open for new students. We note that many of these topics can be approached in projects inside the research lab, inside industry, or in a collaboration between university and industry. Industry here includes organizations that are public or private, large or small, national or international. As illustration, some organizations hosting recent student projects: Adyen, Greetinq, KPMG, IDS Scheer, ING, Innopay, Isaac, ISD, Tamtam, TNO, Truvo, Aanmelder.nl, Exact Software, Logica, Cap Gemini, ICTU.
Here is a list of subjects as inspiration:
Personalization in applications is usually based on user or context models that represent relevant aspects of the user or context, e.g. the interests in films, film genres or actors in the case of a movie recommender application. Obtaining the data for these user models is often not an easy task, especially when the user starts using the application, the so-called cold start. Importing and using data from other sites can therefore help. This project aims to investigate how information from social network sites can be used to fill user models for a given application, e.g. in the domain of movies or TV programs, by studying the mapping between user models, and by implementing a configurable software tool that extracts data from social networking sites and transforms it into a user model for the given application.
Tagging is an easy way that many new web (2.0) applications apply for allowing users to express their opinion about given resources, e.g. pictures in Flickr or videos in YouTube. Most information systems in practice, certainly most commercial websites, use a concept-based approach where via a schema or an ontology some structure is identified in the data. This structure is needed to efficiently organize and store the data in databases. Obviously the freedom of the tag-based approach and the structure of the concept-based approach require some "glue" if you want to use them together in an application, for example when an existing information system is extended with a tagging interface. The goal of this project is to study ways to effectively connect tags and concepts and to create a software tool that generates such a connection between tags and concepts.
With social networking applications such as Facebook, Flickr, etc. millions of individuals and organizations together create online data and share information within networks (of friends). Meanwhile, a large amount of that data is now published on the Web in RDF. Projects like Linking Open Data build a mechanism to connect and share such data and knowledge. However, related to social network applications there is also an increasing amount of privacy abuses such as unwanted exposure, distortion, identity theft and reputation damage. A new concept of trust is needed, especially when it comes to personal data. The goal of this thesis project is to create a trust model for the linked open data. The trust representation could be based on RDF or FOAF, an RDF vocabulary that represents personal data and interpersonal relationships for the Semantic Web. On top of that representation, trust computation and propagation can then be defined in the trust model.
Ontologies are structures which encode the knowledge relevant for a certain domain. While in principle they constitute resources which are independent of a certain natural language, for several applications it is necessary to enrich them with information related to the way that we talk about the concepts and relations modeled in the ontology. However, there is a lack of knowledge on how to include lexical information in an ontology. This thesis will involve the creation of a model to include lexical information in an ontology, and the development of a tool to populate an ontology with lexical information. Strong modeling skills are required for this project.
Just like there are many XML schemas on the web for different but related applications which have some kind of overlap, there are many different ontologies on the semantic web that are related and could be combined. In fact, for many web applications it could be interesting to combine their own ontologies with external ontologies such as user ontologies from social web sites or product ontologies from producers. However, current ontology repositories do not often scale very well when they are used to combine ontologies, since the rules that specify the relationship between the ontologies make it harder to automatically reason over the ontologies. The goal of this thesis is to (1) investigate the scalability of existing ontology repositories and especially, (2) determine what the typical features are of integrated ontologies that determine whether they scale well or not for large instances and (3) determine which subset of ontology features and integration rules would already be sufficient for many practical applications and implement those based on a relational database.
A trust network is a graph-based network composed of trust relationships. A key point of a trust network is that it has the ability to expand itself, which is called “propagation”, by deriving new trust relationships from existing ones. The goal of this thesis is to investigate a propagation algorithm for a trust network. We adopt some open datasets of a trust network, e.g. Advogato, for testing and verifying the algorithm.
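As a rough illustration of what propagation can mean, the sketch below derives new trust values by multiplying trust weights along a path and keeping the best value found for each node. This is only one of many possible propagation rules and is not the algorithm to be investigated in the thesis; the function name `propagate_trust`, the weights in [0, 1] and the `max_depth` cut-off are all assumptions made for the example.

```python
def propagate_trust(edges, source, max_depth=3):
    """Derive trust from `source` to reachable nodes.

    Assumed rule (illustrative only): trust along a path is the product
    of its edge weights, and the best path wins. Weights lie in [0, 1].
    """
    # Build an adjacency map: truster -> {trustee: weight}
    graph = {}
    for truster, trustee, weight in edges:
        graph.setdefault(truster, {})[trustee] = weight

    best = {source: 1.0}       # best known trust per node
    frontier = {source: 1.0}   # nodes improved in the previous round
    for _ in range(max_depth):
        next_frontier = {}
        for node, trust in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                derived = trust * weight
                if derived > best.get(neighbour, 0.0):
                    best[neighbour] = derived
                    next_frontier[neighbour] = derived
        frontier = next_frontier

    del best[source]  # the source's trust in itself is not a derived fact
    return best
```

With the hypothetical edges `[("alice", "bob", 0.9), ("bob", "carol", 0.5), ("alice", "carol", 0.3)]`, the derived trust of alice in carol via bob is higher than the direct value, so the propagated value replaces it.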
In command and control (C&C) rooms for the Dutch police a specific information management system called "Het Geïntegreerd Meldkamer Systeem (GMS)" is used. In addition C&C rooms use the ARBI, a specialized telephone system that helps the officers to contact the required emergency workers and their organizations. The goal of this assignment is to make an overview of the information need and design an information architecture that ensures this need is met at reasonable costs. The analysis of the information need will include making an overview of which data is needed and what the required quality of the information is. The work will include close cooperation with a C&C room in The Netherlands and therefore the student should be proficient in Dutch.
The MOSAIC project (Multi-Officer System of Agents for Informed Crisis Control) is aimed at creating so-called super situation awareness by retrieving relevant information from a wide range of heterogeneous data sources while at the same time avoiding information overload. The assignment concerns the implementation of a system that will support the commanders in a Command and Control room by filtering and generating messages based on incoming messages and background information about the crisis situation. The system will be designed in close cooperation with a PhD student who is currently working on this project, and there is the possibility for the student to be hired by the DECIS lab for the duration of the thesis.
Payment Service Providers focus on providing one online payment interface for online shops such that they can support multiple payment methods, like credit cards and local payment methods like iDeal, bank transfers, etcetera. The payment service provider takes care of all the communication and money flows with all the payment method providers (acquirers) and offers its merchants just one type of reporting and money flow. The assignment consists of developing a logging system that reports on the performance of the platform, monitors the applications and hardware components, and sends notifications when suspicious or erroneous situations occur. This automatic process should instantly detect suspicious and erroneous situations at both a high (platform) and a low (transaction) level.
Many providers of web applications, such as online registration systems, provide their functionality through a distributed system that is hosted in several different data centers and cloud-like systems. This means that for a particular transaction many steps may have to be performed at different locations. The assignment concerns the analysis and design of such a message-based load distribution and replication system. The exact extent of the assignment will depend on the company that provides the web service and manages the distributed system.
Livetagging is an innovative way to tag live events like a meeting or a lecture. By attaching tags to time indices of a streamed or recorded event, livetagging enables users to see at one glance what this event is about. Furthermore, users can find the parts of the event that are about a particular topic by just clicking on a tag. In this project, you create a collaborative livetagging application, which allows spectators or participants of an event to tag this event and observe the tag cloud as it develops.
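The core data involved can be sketched in a few lines: each tag is attached to a time index of the event, a tag cloud is a count of labels, and clicking a tag amounts to looking up its time indices. The names `LiveTag`, `tag_cloud` and `moments_for` are hypothetical and only serve the illustration; a real collaborative application would add storage, streaming and concurrency.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveTag:
    seconds: int   # time index into the streamed or recorded event
    label: str     # the tag text
    user: str      # who added the tag


def tag_cloud(tags):
    """Count how often each label was used; the counts drive the cloud."""
    return Counter(tag.label for tag in tags)


def moments_for(tags, label):
    """Return the sorted, de-duplicated time indices carrying this label."""
    return sorted({tag.seconds for tag in tags if tag.label == label})
```

The tag cloud gives the at-a-glance overview of the event, and `moments_for` returns the positions a player could jump to when a tag is clicked.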
Many existing applications contain their data in traditional formats such as the ones we know from relational databases. Publishing their data as Linked Open Data has many advantages, as the data can be integrated and connected to other applications or background knowledge (e.g. encyclopedias, maps, dictionaries), so that the data can be enriched and added value created. To realize these advantages, a method and supporting tooling are needed that turn legacy data, such as relational data, into RDF data that is published as linked data.
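To give an impression of the simplest form such a conversion could take, the sketch below turns one relational row into N-Triples lines: the primary key becomes part of the subject URI and every other column becomes a predicate with a plain literal value. The function name `row_to_ntriples`, the `http://example.org/` base URI and the URI layout are assumptions for illustration only; a real method would also have to deal with foreign keys, datatypes and links to external vocabularies.

```python
def row_to_ntriples(table, key, row, base="http://example.org/"):
    """Map one relational row (a dict) to a list of N-Triples lines.

    Assumed, illustrative URI scheme:
      subject   = <base + table + "/" + key value>
      predicate = <base + table + "#" + column name>
    """
    subject = f"<{base}{table}/{row[key]}>"
    lines = []
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is encoded in the subject; NULLs yield no triple
        predicate = f"<{base}{table}#{column}>"
        # Escape backslashes, quotes and newlines for an N-Triples literal
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'{subject} {predicate} "{literal}" .')
    return lines
```

All values are emitted as plain string literals here; choosing proper datatypes and turning foreign keys into links between resources is where the actual research questions start.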
With the evolution of the semantic web, many applications and knowledge collections have become available that can be integrated to create new applications. This process of data and knowledge integration in the semantic web asks for an engineering approach that follows a proven method to produce high quality applications. From research in (traditional) web engineering, e.g. Hera, OO-HDM, WebML, we know how a model-driven approach can be effective. In this project, a new method and associated tooling is created for model-driven engineering in the semantic web.
In order to turn existing documents and textual information into semantically enriched information, often the first step that is needed is to recognize certain concepts in a text and turn them into concepts that are connected with background knowledge or other applications. Take, for example, the identification of geographical locations in text for the purpose of linking the locations to common background knowledge from a geo-repository. This kind of concept extraction combines parsing techniques and various information extraction techniques with semantic web technology.
Tagging is a popular activity that the majority of web users have experienced in many social networking applications, tagging photos, bookmarks, etc. Creating tags is usually followed by the use of tags in user interfaces to retrieve and access information, with tag clouds and several other user interface paradigms. The application of tags can be studied in several different scenarios and projects, for example to improve access by lay people to information produced by professionals (visitors & museums, citizens & government, etc.), to co-create cultural experiences (festivals, art collections, etc.), or to create collaboration in communities.
Browsing on the semantic web, the Web of Linked Data, is different from the browsing we know from the traditional web, the Web of Documents, and therefore there exist special browsers, such as Tabulator or Explorator. It is interesting to investigate how these tools can effectively be used (in applications) for end-user access to the Web of Linked Data, and how they can be improved. For example, the creation of views and the manipulation of data (other than simple "reading") can be added to the current tooling. These projects can contain both theoretical and development activities.
The popularity of immersive simulated environments for experiential learning is growing; they will be part of tomorrow’s learning technologies in the key area of adult training. The major challenge is to effectively align the learning experience in the simulated environment with the real world context and day-to-day job practice. For the personalization and the user modeling in such environments semantic technology and ontological reasoning can be used to seamlessly link the simulated learning experience and real-world job-related experiences, thus creating augmented simulated experiential learning.
Adaptive applications need to know the user in order to be able to adapt to the user. There are several ways to explicitly ask or import relevant user knowledge, but in many cases there is a lack of proper support to verify with the user whether the application's assumptions are correct and valid. Interactive dialogue-based tools can enhance an application to obtain higher quality user knowledge through a carefully designed communication between application and end-user. For several scenarios, the design and development of such an interactive dialogue-based tool is an interesting project subject.
With RDF becoming more and more the default data format for publishing data on the Web, ideally with a query interface, it has become essential that RDF stores scale well for large amounts of data. One way of achieving this in traditional database systems has been the use of index structures, either general ones or specialized ones for specific types of queries. Together with the Eindhoven University of Technology the WIS group is developing new indexing techniques for RDF. In this assignment the efficiency of state-of-the-art indexing in existing RDF stores is studied and compared to indexes that were recently proposed in the literature, including those that were locally developed. Based on this, new index structures will be developed and it will be studied how query engines can efficiently make use of them.
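A common baseline idea in RDF indexing is to store each triple under several orderings of subject, predicate and object, so that a triple pattern can be answered from whichever ordering starts with a bound term. The in-memory sketch below illustrates this with three orderings (SPO, POS, OSP); the class name `TripleIndex` and its methods are hypothetical, and it is not a description of the indexes developed by the WIS group or of any particular RDF store.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory triple store with three permutation indexes."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # s -> p -> {o}
        self.pos = defaultdict(lambda: defaultdict(set))  # p -> o -> {s}
        self.osp = defaultdict(lambda: defaultdict(set))  # o -> s -> {p}

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            # Subject bound: walk the SPO index
            for p2, objects in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objects:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            # Predicate bound, subject free: walk the POS index
            for o2, subjects in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjects:
                        yield (s2, p, o2)
        elif o is not None:
            # Only the object bound: walk the OSP index
            for s2, predicates in self.osp.get(o, {}).items():
                for p2 in predicates:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan
            for s2, by_predicate in self.spo.items():
                for p2, objects in by_predicate.items():
                    for o2 in objects:
                        yield (s2, p2, o2)
```

The price of this scheme is that every triple is stored three times, which is exactly the kind of space versus query-time trade-off that an index comparison would need to measure.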
One of the essential features of linked data on the Web is that data is published in a distributed fashion at different sites but at the same time is linked because the datasets refer to each other. This makes the Web of linked data like a big distributed database and raises the issue of how to efficiently query this database. This is essential for building data mash-ups and enriching applications with data available in the Web of linked data. The assignment will investigate query evaluation techniques that allow the execution of queries, either formulated in SPARQL or another RDF query language, over this Web of linked data.
For the purpose of the analysis and transformation of large data-sets the RDFGears tool has been developed to design and express visually complex data-transformation workflows that combine, integrate and transform both internal and external datasets. The focus of the tool is on RDF datasets, but it can also deal with other data formats that are found in a Big Data setting. To allow for the processing of large datasets an implementation is being designed where the RDFGears workflows are translated to a MapReduce framework for execution. The research will focus on finding efficient mappings and applying optimizations known from database research to obtain the most efficient way of executing workflows.
In this project we focus on building a general system for the execution of transformations of big data, for the purpose of data integration, data enrichment or data analysis. The implementation will focus on efficiency and will use techniques from the area of database research and (functional) programming language research to apply state-of-the-art optimization and evaluation techniques as used in NoSQL DBMSs and interpreters for functional programming languages. To achieve extreme scalability (sometimes called "scaling out" on this scale) the system will be based on cloud-computing techniques such as the MapReduce framework.
Cloud computing is a rapidly growing technology that has created many new possibilities for offering web services and applications in a cost-effective and scalable way, especially as a technique for supporting Software-as-a-Service. However, the use of cloud technology by ICT departments, either by using external clouds or local clouds, also creates many new issues. For example, if a local on-premises cloud-based infrastructure is introduced, how do we integrate this with the existing ICT infrastructure? Or, how do we integrate external cloud services with local cloud services in such a way that we preserve the robustness, dependability and security of the total system? An important issue is for example how to ensure the correctness of a transaction that involves multiple internal and external services. In this assignment it will be studied how these issues are currently addressed and how we can improve on that.
Designing adaptive hypermedia and web applications in general can be a very complex task. Therefore researchers in the WIS group have developed a language to specify these applications at a high abstraction level such that the navigation structure can be specified in a clear way, and from this the application can be generated. The language specifies how this structure is generated from a database that contains the content that is to be presented, and can take into account extra context information about the user in order to personalize the navigational structure. The assignment will consist of extending the language and its implementation such that it can be used to describe attractive and state-of-the-art adaptive web applications.
Some concepts are more closely related than others. The concepts 'elections' and 'voting', for example, can be considered to be closely related. This so-called semantic distance can be used in a number of applications. In a retrieval application, a search for elections can be expanded with a search for voting. In a browsing application, hyperlinks could be created from documents tagged or annotated with 'elections' to those tagged as 'voting'. Semantic distance can also be used in query performance prediction: the distance (or rather, proximity) between results in a result list says something about how well the result list matches a query. The question we address here is how to determine semantic distance. Several measures have been proposed in the fields of Information Retrieval and Semantic Web. The project involves comparing the value of existing measures in various application scenarios, and developing new measures targeted towards a specific application.
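One very simple way to make the notion concrete is a co-occurrence measure: two concepts are close when they annotate largely the same documents. The sketch below uses the Jaccard distance over the sets of documents tagged with each concept. It is just one illustrative measure, chosen here for brevity and not taken from the project description; the function name `jaccard_distance` and the document identifiers are hypothetical.

```python
def jaccard_distance(docs_a, docs_b):
    """Distance in [0, 1] between two concepts, based on shared documents.

    0.0 means the concepts annotate exactly the same documents,
    1.0 means they never co-occur (or neither is used at all).
    """
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        return 1.0  # no evidence at all: treat as maximally distant
    return 1.0 - len(a & b) / len(union)
```

For example, if 'elections' annotates documents {1, 2, 3} and 'voting' annotates {2, 3, 4}, they share two of four documents and the distance is 0.5. Measures based on the structure of a thesaurus or ontology would give different values, which is what a comparison across application scenarios would examine.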
Governments at local, national and EU level publish large amounts of data: political data such as notes of city council or parliament meetings and the voting behavior of their members, but also public data like crime statistics, applications for permits, the civil registry, etc. By law a large portion of this data is openly available. However, the data is not always easily accessible. Governments are increasingly interested in semantic web techniques to publish their data on the Web as linked open data. The challenges lie not only in publishing the data, but also in the creation of applications that open up the data to the public in new ways. New combinations of existing data are especially fruitful. Several projects around the topic of publishing, use and re-combination of government data are available.
Ontologies are structures which encode the knowledge relevant for a certain domain. Mappings between ontologies are becoming increasingly important as they enable the combination of datasets, domains and applications. The creation of tools that automatically find mappings between concepts from different ontologies is an active area of research. Most mapping tools currently available are generic mappers, i.e. they can map any ontology to any other. In this project, we try to improve on those tools by adding domain-specific knowledge. The challenge is to identify exactly what knowledge is needed to find high quality mappings: knowledge of the structure of the ontology, the content of the ontology, the lexical form of the textual labels, etc. This project will consist of the development and testing of a mapping tool that allows for manual input of domain knowledge.
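A minimal sketch of how manually supplied domain knowledge could help a purely lexical mapper is given below. Labels are compared with the string similarity from Python's standard `difflib` module, after first being normalised through a hand-made synonym table. The function names, the synonym table and the threshold of 0.8 are all assumptions for illustration; a real mapping tool would also use the structure and content of the ontologies.

```python
from difflib import SequenceMatcher


def normalise(label, synonyms):
    """Lower-case a label and replace it by its preferred domain term."""
    label = label.strip().lower().replace("_", " ")
    return synonyms.get(label, label)


def map_concepts(source_labels, target_labels, synonyms=None, threshold=0.8):
    """Return (source, target, score) for the best match above the threshold.

    `synonyms` is the manually entered domain knowledge: a dict from a
    lower-case label to the preferred lower-case term.
    """
    synonyms = synonyms or {}
    mappings = []
    for src in source_labels:
        best, best_score = None, 0.0
        for tgt in target_labels:
            score = SequenceMatcher(
                None, normalise(src, synonyms), normalise(tgt, synonyms)
            ).ratio()
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            mappings.append((src, best, best_score))
    return mappings
```

Without the synonym table the labels "Writer" and "Author" are lexically too far apart to be mapped; with the entry `{"writer": "author"}` they are matched, which shows in miniature how domain knowledge can recover mappings that a generic mapper misses.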
In cooperation with TomTom there are three related assignments available: (1) Rich Snippet Generation - Automatically generate a short summary from the website of a point of interest. (Job ID: 20142) (2) Place Extraction from Crawled Content - Automatically detect the name of a POI from a website. (Job ID: 20140) (3) Social Media Opportunities - Investigate the possibilities to use social media as platforms to stimulate community interaction with our location data. (Job ID: 20143).
Locating a leak or contamination in a water distribution network can take days and is very expensive. However, it has been shown in the past that the use of social media and data mash-ups can sometimes lead surprisingly quickly to important insights into the behavior of the network. For example, during the shutdown of pumping station Hoofddorp as a consequence of a power supply interruption on December 27th 2010, a customer created a map of the affected region, showing the tweets (Twitter messages) related to the interruption of the water supply, and published it online within thirty minutes of the shutdown. This research aims to make a proof of concept of the human sensor. The intended end-users are customers visiting the company’s website during calamities, operators and field maintenance personnel. Relevant tweets and complaints of customers are used as input for a distribution network model. Using a backtrace function, the location of the leak or contamination is determined in real time, i.e. within minutes. The graphical output of the model is the estimated location of the source of the problem encircled by a bandwidth of uncertainty, which shrinks when more input enters the model.
Knowledge is only valuable if it can be shared, therefore creators of the semantic models should be able to publish these models online, on a publication platform. The goal of this platform is to allow users to share their knowledge with one another and offer mechanisms for indexing and searching the content of these models. Apart from maybe user credentials and user rights, all information about the published models, should be stored within the models themselves.
Areas of interest during the development, and possibly implementation, of the publication platform are:
Relevant subjects:
The project will be executed with the company Semmtech.
The goal of this research project is to be able to model basic mathematical operations and formulas within a semantic model. Examples of the types of calculations are cost calculations of activities, or geometrical calculations for finding areas and volumes of physical objects. After conceptualizing these formulas, the modeled calculations can be automatically performed, using the concepts described by the model. Apart from formulas, the quantities of the various characteristics of classes and individuals, as well as their units of measure, need to be modeled in a computer interpretable fashion. These conceptualized quantities can be used as the input of the mathematical operations described by the formulas.
This subject is targeted at students interested in formal modeling languages and/or information and knowledge management. Some of the subjects relevant for this project are:
The project will be executed with the company Semmtech.
In the international OpenPhacts programme (http://www.openphacts.org/), a large number of companies and institutes have joined efforts to integrate all publicly available pharmaceutical data and make it accessible in RDF for subsequent data mining and knowledge extraction. In the Netherlands, the goal is to build on this by adding knowledge bases centred on biotechnology. Biotechnologists study and improve micro-organisms, such as yeast and bacteria, for the production of foods, fine chemicals, biofuels, pharmaceuticals, etc. The conversion of existing biotechnological databases is a big challenge that requires sophisticated and powerful tools for converting the data. The main goal of this research will be to investigate how RDFGears, an RDF transformation and integration tool, can be extended such that it can specify and execute such conversions efficiently and in a user-friendly way. As an initial feasibility study, we will consider the conversion of the contents of the Saccharomyces Genome Database (http://www.yeastgenome.org) to RDF, but other databases will also be considered.
Recently there has been a surge in semantic tools for annotating and enriching text, and their quality has been steadily improving, also within business environments. The Dutch language area, however, is small and Dutch grammar has distinct characteristics. How well do these tools perform for this language? This assignment will be done within the context of the Newz project and with the company Dayon.
Possible steps:
Newz, a major project in the Dutch media, is building a large ontology. The system learns from the subsequent enrichment of thousands of Dutch newspaper articles. How do we maintain such a growing ontology and how do we recognize and correct errors in the ontology? This assignment will be done within the company Dayon.
Possible steps:
Newz is a platform for digital products of the major Dutch newspapers. The metadata is stored in a triple store and is enriched with Linked Data. A domain ontology specifically designed for Newz would allow more effective enrichment and application of this data for client applications. Such an ontology has not yet been developed for Newz. This assignment will be done within the company Dayon.
Possible steps:
Newz is a platform for digital products of the major Dutch newspapers. To this end, various applications are built. For the news domain, the temporal aspect is crucial for presenting relevant information to end users. This aspect has not yet been addressed in the platform. An alternative would be the aspect of geographical location. This assignment will be done within the company Dayon.
Possible steps:
M-Industries is a startup company from Delft, developing software for clients in the industrial sector. It focuses on designing systems that support complex business processes by using its in-house developed data-modeling language and software development platform, based on several years of research and development. The company has special interests in domain-specific declarative languages, functional programming, data transformation, static code analysis, browser-based applications and Node.js.
In this project the student will investigate existing Graph Databases to see how they can efficiently support such business data-models as developed by M-Industries. Conventionally such applications are based on classical relational databases, but since modern data models have become more graph-like, including the way these data models are defined and accessed by applications, there is reason to believe that it is both more convenient and more efficient to implement them on top of Graph Databases. To test this hypothesis, this assignment will consist of building and investigating a prototype data access layer for the data model of M-Industries.
For more ideas or inspiration, you can of course also have a look at the research interests of the Web Information Systems group members. We repeat that projects can run inside the research lab, inside industry, or in a collaboration between university and industry. Industry here includes organizations that are public or private, large or small, national or international. As illustration, some organizations hosting recent student projects: Adyen, Greetinq, KPMG, IDS Scheer, ING, Innopay, Isaac, ISD, Tamtam, TNO, Truvo, Aanmelder.nl, Exact Software, Logica, Cap Gemini, ICTU.
More information can be obtained from prof.dr.ir. Geert-Jan Houben.