π-lab: Software Analytics

Recent and upcoming activities

    • May 2017 4 papers involving π-lab researchers to be presented at ICSE 2017 and MSR 2017, in Buenos Aires
    • May 2017 We are organizing a software analytics meetup in the context of Delft Data Science
    • Feb 2017 The CodeFeedr project started.
    • Feb 2017 Joseph Hejderup joined the WIS group and will be working on the CodeFeedr project.
    • Jan 2017 The π-lab was started as part of the Web Information Systems group.

    Introduction

    Modern software projects are more than just the code that comprises them: teams follow specific development processes; the code runs on servers or mobile phones and produces runtime logs; users talk about the software in forums like StackOverflow and GitHub and rate the product in app stores. The software is part of a collection of similar applications and depends on external code or service APIs to deliver its functionality. Modern software teams need data to make informed decisions that enable continuous, feedback-driven improvement.

    As π-lab, we work to make software analytics a core asset for software development teams. To this end, we design tools and processes that turn software analytics into a core feedback loop for software projects.

    Our research touches on topics such as computer-supported collaborative work (CSCW), big data systems and algorithms, software engineering processes, software analysis and data science. Currently, we focus on the following three research lines, although we are always open to new ideas:

    • Engineering for (software) analytics: creating platforms for data ingestion, integration and querying in a streaming fashion
    • Distributed collaboration on software development: optimising code review and integration processes across millions of projects and developers
    • Software ecosystems: analysing the fragility, security and robustness properties of package managers

    The following slides give a high-level overview of our recent work.

    Researchers

    Several researchers contribute to the software analytics research line:

    • Joseph Hejderup works on systems for high-performance software analytics, funded by the CodeFeedr project
    • Dominik Safaric works on streaming software analytics systems
    • Georgios Gousios is an assistant professor at WIS and the leader of the π-lab
    • Geert-Jan Houben is the head of WIS and provides expertise in Web Systems

    as well as the following Master's students:

    • Rik Nijssen: Pull request process optimization
    • Herman Banken: Program comprehension for reactive programming

      Tools and Datasets

      Here is a collection of tools and datasets we have developed through the years. We are always looking for motivated students to work on extending and updating them!

      • GHTorrent: All data from GitHub, in MySQL and MongoDB formats, also on BigQuery
      • TravisTorrent: Combined data from GitHub and TravisCI, suitable for CI research (developed in cooperation with the TestRoots project)
      • Pourquoi: Pull request analytics and prioritization

      Publications

      1. R. Kikas, G. Gousios, M. Dumas, and D. Pfahl, “Structure and Evolution of Package Dependency Networks,” in Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.
      2. M. Beller, G. Gousios, and A. Zaidman, “Oops, My Tests Broke the Build: An Explorative Analysis of Travis CI with GitHub,” in Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.
      3. M. Beller, G. Gousios, and A. Zaidman, “TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration,” in Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.

      Master thesis topics

      Interested in one of the following topics? Please contact Georgios Gousios.

      Merging GHTorrent with Software Heritage

      Software Heritage is building the largest source code repository in existence, initially populated with all projects from GitHub. The GHTorrent project collects and archives data from the GitHub API, including issues, teams, pull requests and commits.
      The proposed project aims to integrate the construction processes of the two datasets. The target is to allow the two projects to be updated independently, while also creating a fusion point where updates from either project's database are integrated into a centralised, queryable archive in a streaming fashion.

      The project is done in co-operation with Stefano Zacchiroli and may optionally involve a 2-3 month paid internship at INRIA Paris.

      Streaming software security

      The aggregation of both projects and deployment configurations on GitHub has made those projects particularly vulnerable to sensitive data leaks. Whether for ease of use or through plain negligence and mistakes, it is quite common for GitHub users to push passwords, database connection strings, cloud provider one-time passwords, environment variables and private SSH keys to public repositories. Once this information is made public, it is impossible to retract, because projects such as GHTorrent and GitHub Archive archive it, while GitHub's real-time event stream makes it easy for adversaries to attack the exposed systems almost immediately. The aim of the proposed project is to explore this phenomenon and propose effective counter-measures.
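
To give a feel for the problem, the following is a minimal sketch of how leaked credentials could be spotted in a stream of pushed changes. It is only an illustration under stated assumptions: the event shape (a dict with "repo" and "added_text" keys) is hypothetical and does not correspond to GitHub's actual event format, and the three regular expressions are illustrative examples, not a vetted rule set.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Finding(NamedTuple):
    repo: str
    kind: str
    line_no: int


# Illustrative patterns only (assumption): a real detector would need a much
# larger, carefully tuned rule set to keep false positives manageable.
PATTERNS = {
    "private_ssh_key": re.compile(
        r"-----BEGIN (?:RSA |DSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "db_connection_string": re.compile(
        r"\b(?:mysql|postgres(?:ql)?|mongodb)://[^\s:@/]+:[^\s@/]+@"
    ),
    "password_assignment": re.compile(
        r"\bpassword\s*[=:]\s*['\"][^'\"]{4,}['\"]", re.IGNORECASE
    ),
}


def scan_events(events: Iterable[dict]) -> Iterator[Finding]:
    """Lazily scan a stream of push events and yield suspected leaks.

    Each event is assumed (hypothetically) to carry the repository name
    and the text added by the push.
    """
    for event in events:
        lines = event["added_text"].splitlines()
        for line_no, line in enumerate(lines, start=1):
            for kind, pattern in PATTERNS.items():
                if pattern.search(line):
                    yield Finding(event["repo"], kind, line_no)


if __name__ == "__main__":
    sample = [
        {"repo": "alice/app", "added_text": "DEBUG = True\npassword = 'hunter22'\n"},
        {"repo": "carol/lib", "added_text": "print('hello')"},
    ]
    for finding in scan_events(sample):
        print(finding)
```

Because scan_events is a generator, it can consume an unbounded event source one event at a time, which matches the streaming setting described above. An adversary could run the same kind of scan, which is why detection alone is not a counter-measure.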

      Streaming cascading aggregations

      Cascading aggregations work by specifying a set of key metrics, a set of thresholds for those and a set of functions that can extract interesting pieces of information or combine two other functions. To react efficiently to current events, aggregation functions always work on data streams. Insights can be generated by linking metric threshold violations to aggregation functions; this creates a graph of aggregations, which, when topologically sorted, can lead to the generation of summarized information. What we are interested in is a language to specify cascading aggregations and a (stream-based) processor that will generate automated data summaries that read like this (e.g. when applied to software engineering data):

      "Version 1.2.1 (commit a223b) of app Foo is receiving negative feedback (sentiment ratio: 0.45%) on app store. Users are complaining about frequent crashes. Top exceptions in app crash log: NullPointerException (88%), increased 95% in version 1.2.1. Static analysis on commit a223b indicates possible uninitialised variable x in Bar.java, line 75. Commit a223b is 85% bigger than average. Code review passed with 3 comments and 2 thumbs up."
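
The mechanism described above can be sketched roughly as follows. This is only one possible reading of the idea, not the proposed language or processor: the metric name, the threshold value, the event fields and the three aggregation functions are all hypothetical. A threshold violation selects a target aggregation, its dependencies form a graph, and the graph is evaluated in topological order using Python's standard graphlib module.

```python
from graphlib import TopologicalSorter


def top_exception(window, results):
    """Most frequent exception among crash events in the window."""
    counts = {}
    for event in window:
        if event["type"] == "crash":
            counts[event["exception"]] = counts.get(event["exception"], 0) + 1
    return max(counts, key=counts.get) if counts else None


def crash_share(window, results):
    """Fraction of crashes caused by the top exception."""
    crashes = [e for e in window if e["type"] == "crash"]
    if not crashes:
        return 0.0
    top = results["top_exception"]
    return sum(e["exception"] == top for e in crashes) / len(crashes)


def summary(window, results):
    """Combine the two upstream aggregations into a readable sentence."""
    return (f"Top exception in crash log: {results['top_exception']} "
            f"({results['crash_share']:.0%})")


# Hypothetical aggregation graph: name -> (dependencies, function).
AGGREGATIONS = {
    "top_exception": (set(), top_exception),
    "crash_share": ({"top_exception"}, crash_share),
    "summary": ({"top_exception", "crash_share"}, summary),
}

# Hypothetical trigger: metric -> (threshold, aggregation to run when the
# metric value falls below the threshold).
TRIGGERS = {"sentiment_ratio": (0.5, "summary")}


def required(target):
    """Return the target plus every aggregation it transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(AGGREGATIONS[name][0])
    return needed


def process_window(metrics, window):
    """Evaluate triggered aggregations for one window of the event stream."""
    summaries = []
    for metric, (threshold, target) in TRIGGERS.items():
        if metrics[metric] >= threshold:
            continue  # no violation, nothing to report
        graph = {name: AGGREGATIONS[name][0] for name in required(target)}
        results = {}
        # static_order() yields dependencies before their dependents.
        for name in TopologicalSorter(graph).static_order():
            results[name] = AGGREGATIONS[name][1](window, results)
        summaries.append(results[target])
    return summaries
```

In a streaming deployment, process_window would be called once per window of incoming events, so that summaries are produced only when a metric crosses its threshold. The sketch hard-codes the graph in Python dictionaries; the proposed project would instead express it in a dedicated specification language.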