1. Abstract

Our ambition is to understand how people behave on the Social Web. In previous work, we analyzed the tagging activities people perform on Social Web systems such as Flickr, Delicious and StumbleUpon, and studied the impact of cross-system user modeling on personalization. In this paper, we study the characteristics of Twitter-based profiles and investigate how Twitter activities can be leveraged for personalization. For this purpose, we developed a library for aggregating individual Twitter activities and enriching their semantics. Building on this preliminary work, we conduct an in-depth analysis of a large Twitter dataset of more than 10 million tweets published by more than 20,000 users over a period of 3 months. We study user modeling on Twitter and answer research questions concerning the temporal evolution of individual user profiles inferred from Twitter activities.

2. Datasets

Tweets: Over a period of more than three months (starting at the end of October 2010), we crawled the Twitter information streams of more than 20,000 users. Together, these users published more than 10 million tweets.

News: To link tweets with news articles and enrich them semantically, we also monitored more than 60 RSS feeds of prominent news media such as the BBC, CNN and the New York Times, and aggregated the content of 77,544 news articles.

Semantics: Given the content of Twitter messages and news articles, we extract entities and topics to better understand the semantics of Twitter activities. We use OpenCalais for this extraction (see the sementics*.sql.gz files below).
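
To illustrate how entity assignments of this kind could feed into user modeling, the following minimal Python sketch aggregates (user, entity) pairs into per-user profiles of relative entity frequencies. It is not the library described above: the function name build_entity_profiles, the input format, and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_entity_profiles(assignments):
    """Aggregate (user_id, entity) pairs into per-user relative entity frequencies.

    Assumption: each pair stands for one entity extracted from one tweet
    of that user. The actual column layout of the dumps is not assumed here.
    """
    counts = defaultdict(Counter)
    for user_id, entity in assignments:
        counts[user_id][entity] += 1

    profiles = {}
    for user_id, entity_counts in counts.items():
        total = sum(entity_counts.values())
        # Normalize so that the weights of each profile sum to 1.
        profiles[user_id] = {
            entity: n / total for entity, n in entity_counts.items()
        }
    return profiles


# Hypothetical sample data, not taken from the dataset.
sample = [
    ("user_1", "London"),
    ("user_1", "London"),
    ("user_1", "Football"),
    ("user_2", "Elections"),
]
profiles = build_entity_profiles(sample)
```

Relative frequencies are only one possible weighting. Restricting the input pairs to a time window before aggregating would give one profile per period, which is one way to look at how a profile changes over time.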

The datasets will grow further, as we are currently preparing a second, extended version of our enriched Twitter/News datasets.

| name | size | number of records | description |
| --- | --- | --- | --- |
| tweets.sql.gz | 643 MB | | tweets posted by the users |
| news.sql.gz | 73 MB | | news articles monitored from 62 news media websites |
| sementicsTweetsEntity.sql.gz | 71 MB | | entity assignments extracted from tweets; 709,245 distinct entities (categorized in 39 types) |
| sementicsTweetsTopic.sql.gz | 15 MB | | topic assignments (18 distinct topics) extracted from tweets |
| sementicsNewsEntity.sql.gz | 40 MB | | entity assignments extracted from news; 170,577 distinct entities (categorized in 39 types) |
| sementicsNewsTopic.sql.gz | 603 KB | | topic assignments (18 distinct topics) extracted from news |
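
Before importing one of these files into a database, it can help to inspect it without decompressing it to disk. The following Python sketch streams a gzipped dump line by line and counts the INSERT statements per table. It assumes the dumps are plain-text SQL in which each statement starts with "INSERT INTO" at the beginning of a line; the table names and schema are not documented here, so nothing about them is assumed. The function name count_insert_statements is hypothetical.

```python
import gzip
import re
from collections import Counter

# Assumption: statements look like "INSERT INTO `table` ..." or "INSERT INTO table ...".
INSERT_RE = re.compile(r"^INSERT INTO\s+`?(\w+)`?", re.IGNORECASE)


def count_insert_statements(path):
    """Count INSERT statements per table in a gzipped SQL dump.

    The file is read as a text stream, so memory use stays low for
    ordinary dumps. A single statement may insert many rows, so the
    result counts statements, not records.
    """
    counts = Counter()
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = INSERT_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Example call, assuming the file has been downloaded to the working directory:
# print(count_insert_statements("tweets.sql.gz"))
```

If the dumps were produced with multi-row inserts, the counts will be much smaller than the number of records, and the reliable way to obtain record counts is to import the dump and query the resulting tables.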