About

This site provides supplemental material and information about the paper Analyzing User Modeling on Twitter for Personalized News Recommendations. Available online: http://link.springer.com/chapter/10.1007%2F978-3-642-22362-4_1#page-1


1. Datasets

Tweets: Over a period of more than two months (from the end of October to the beginning of January) we crawled the Twitter information streams of more than 20,000 users, who together published more than 10 million tweets. As we were also interested in analyzing the temporal characteristics of the user profiles, we created a sample of 1,619 users who contributed at least 20 tweets in total and at least one tweet in each month of our observation period. This sample dataset contains 2,316,204 tweets (see umap-2011-tweets.sql.gz).

News: To allow tweets to be linked with news articles, we also monitored more than 60 RSS feeds of prominent news media such as BBC, CNN, and the New York Times, and aggregated the content of 77,544 news articles (see umap-2011-news.sql.gz).

Semantics: Given the content of the Twitter messages and news articles, we extracted entities and topics to better understand the semantics of the Twitter activities. For this we utilized OpenCalais (see the umap-2011-sementics*.sql.gz files listed below).

name | number of records | description
umap-2011-tweets.sql.gz (643MB) | 2,316,204 | tweets posted by 1,619 users
umap-2011-news.sql.gz (73MB) | 77,544 | news articles monitored from 62 news media websites
umap-2011-sementicsTweetsEntity.sql.gz (71MB) | 1,896,328 | entity assignments extracted from tweets (1,051,524); 709,245 distinct entities (categorized into 39 types)
umap-2011-sementicsTweetsTopic.sql.gz (15MB) | 1,112,538 | topic assignments (18 distinct topics) extracted from tweets (731,486)
umap-2011-sementicsNewsEntity.sql.gz (40MB) | 1,216,570 | entity assignments extracted from news articles (63,140); 170,577 distinct entities (categorized into 39 types)
umap-2011-sementicsNewsTopic.sql.gz (603KB) | 86,368 | topic assignments (18 distinct topics) extracted from news articles (62,909)
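
To get a first impression of the data after importing the dumps, a simple JDBC query can be run against the resulting database. The following is a minimal sketch and not part of the released framework; the JDBC URL, the credentials, and the column name userId are assumptions, so please consult the database schema that ships with the dumps for the actual names.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DatasetStats {
    // Connect to the database into which the SQL dumps were imported and
    // print the ten most active users. The database name "umap2011" and the
    // column name "userId" are hypothetical; check the shipped schema.
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/umap2011"; // hypothetical database
        try (Connection con = DriverManager.getConnection(url, "dbUser", "dbPassword");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT userId, COUNT(*) AS numTweets FROM tweets "
                   + "GROUP BY userId ORDER BY numTweets DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("userId") + "\t" + rs.getLong("numTweets"));
            }
        }
    }
}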

2. Further Findings

To better understand the temporal dynamics of entity-based profiles, we analyzed the deviation of the timestamps of the tweets that mention a specific entity. The higher the standard deviation for a certain entity, the more consistently the entity occurs in the tweets posted by the user; entities for which we detect a low standard deviation rather appear at one specific point in time. In the figure below we plot the average standard deviation for different entity types. It can be seen that entities in the categories country or technology have a higher standard deviation than entities in the category movie. For more than 50% of the movies the standard deviation of the timestamps is even 0, i.e. a user mentions the movie just once. Regarding persons, approximately 40% of the persons mentioned in the tweets of a user are mentioned quite regularly, while the others are only mentioned within a certain time frame and do not appear in any further tweets.
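
For clarity, the measure can be made concrete with a small sketch: given the timestamps of all tweets of a user that mention a particular entity, we compute the standard deviation of those timestamps, here expressed in days. The helper class below is an illustrative sketch and not part of the released framework.

import java.sql.Timestamp;
import java.util.List;

public class TemporalSpread {
    // Standard deviation (in days) of the timestamps of the tweets in which a
    // user mentions a given entity. A value of 0 means the entity occurs at a
    // single point in time only; large values indicate that the user mentions
    // the entity consistently over the observation period.
    public static double stdDevInDays(List<Timestamp> mentionTimes) {
        if (mentionTimes.isEmpty()) return 0;
        double msPerDay = 1000.0 * 60 * 60 * 24;
        double mean = 0;
        for (Timestamp t : mentionTimes) mean += t.getTime() / msPerDay;
        mean /= mentionTimes.size();
        double variance = 0;
        for (Timestamp t : mentionTimes) {
            double d = t.getTime() / msPerDay - mean;
            variance += d * d;
        }
        variance /= mentionTimes.size();
        return Math.sqrt(variance);
    }
}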

3. Twitter-based User Modeling Framework

We release our Twitter-based user modeling framework as an alpha version for Java developers who would like to test the user modeling strategies discussed in our paper: Twitter-based User Modeling Framework (alpha version)

The framework allows developers to (1) enrich the semantics of tweets, (2) link them to news articles and (3) generate user profiles.

3.1 Quick start

Given the full path of a JSON file that contains user tweets monitored via the Twitter Streaming API, the framework first parses the file and adds the tweets to a database (the tweets table of the complete database schema, which stores the tweets, semantics, and (optionally) news). Parse tweets:

java -jar um-twitter-enrichment.jar add fullPathToJSONFile databaseLocation databaseUsername databasePassword
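
As a rough illustration of the expected input, the following sketch (not part of the framework) reads such a Streaming API file line by line and prints basic fields of the first few tweets. It assumes the org.json library is on the classpath; the framework's own parser may store additional attributes.

import org.json.JSONObject;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TweetFilePreview {
    // Print basic fields of the first five tweets in a Streaming API JSON
    // file (one JSON object per line).
    public static void main(String[] args) throws IOException {
        String path = args[0]; // full path to the JSON file
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int shown = 0;
            while ((line = reader.readLine()) != null && shown < 5) {
                if (line.trim().isEmpty()) continue;
                JSONObject tweet = new JSONObject(line);
                if (!tweet.has("text")) continue; // skip delete notices etc.
                long tweetId = tweet.getLong("id");
                long userId = tweet.getJSONObject("user").getLong("id");
                String createdAt = tweet.getString("created_at");
                System.out.println(tweetId + "\t" + userId + "\t" + createdAt
                        + "\t" + tweet.getString("text").replace('\n', ' '));
                shown++;
            }
        }
    }
}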

(1) Enrichment: The framework enriches the tweets stored in the tweets table by extracting entities and topics via the OpenCalais Web service (requires an API key). The entities and topics extracted from the tweets are stored in the semanticsTweetsEntity and semanticsTweetsTopic tables, respectively.

java -jar um-twitter-enrichment.jar enrich opencalaisAPIKey databaseLocation databaseUsername databasePassword

(2) Linkage: If you crawled news articles and stored them in the news table, the framework allows you to explicitly link tweets to the corresponding news articles. It fills the so-called nas table using two strategies:

  1. follow the links posted in tweets
  2. compare entities and timestamps of tweets and news articles, i.e. if a news article talks about entities that are mentioned in a tweet and the publishing dates of the tweet and the news article are close to each other, then the tweet and the news article are linked (see the sketch below the command)

Command:

java -jar um-twitter-enrichment.jar link databaseLocation databaseUsername databasePassword
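
The second strategy can be sketched as follows. Representing tweets and news articles as entity sets and using a two-day time window are assumptions made for this illustration; the actual thresholds used by the framework may differ.

import java.sql.Timestamp;
import java.util.Set;

public class EntityBasedLinkage {
    // Sketch of the entity-and-time-based linkage: a tweet and a news article
    // are considered related if they share at least one entity and were
    // published within a given time window of each other.
    static final long MAX_DIFF_MS = 2L * 24 * 60 * 60 * 1000; // assumed two-day window

    public static boolean shouldLink(Set<String> tweetEntities, Timestamp tweetTime,
                                     Set<String> newsEntities, Timestamp newsTime) {
        long diff = Math.abs(tweetTime.getTime() - newsTime.getTime());
        if (diff > MAX_DIFF_MS) return false;
        for (String entity : tweetEntities) {
            if (newsEntities.contains(entity)) return true; // at least one shared entity
        }
        return false;
    }
}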

(3) Generation of User Profiles: Using the framework, user profiles can be generated with basically one line of code once a user modeling strategy is defined. For example, to create a topic-based profile for a certain time period one could use the following code snippet:

import java.sql.Timestamp;

//Create a topic-based profile for a certain time period:
Timestamp profileFrom = Timestamp.valueOf("2010-11-15 00:00:00");
Timestamp profileTo = Timestamp.valueOf("2010-12-29 00:00:00");

//a. create the configuration for your strategy
UMConfiguration umConf = new UMConfiguration(
        "my first UM strategy",           // name of the strategy
        UM_Type.Topic_based,              // topic-based profile dimensions
        UM_Source.Twitter_and_News_based, // exploit tweets and linked news articles
        profileFrom, profileTo,           // observation period of the profile
        UM_TimeSlot.All, 1, null);

//b. instantiate the user modeling strategy
UserModelingStrategy um = UserModelingFactory.getUMStrategy(umConf);

//c. get the profile vector for a user via her Twitter ID (here: 1234)
um.getProfileVector(1234);
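
In our paper, such profile vectors are the basis for recommending news articles via cosine similarity between a user profile and the profile of a candidate article. The following sketch shows that comparison; representing a profile as a Map<String, Double> of concept weights is an assumption made for this illustration and not necessarily the return type of getProfileVector in the alpha release.

import java.util.Map;

public class ProfileSimilarity {
    // Cosine similarity between two weighted concept vectors, e.g. a user
    // profile and the profile of a news article (both assumed to be maps
    // from concept identifier to weight).
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}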