Radio Silence Over: Updates, Mahaya, TimeSpace, Moscow

Ayman has been on my case, and for a good reason this time. We kind of neglected you, good readers of our blog. It’s been a long and winding few months. We both fully intend to write more, but for now, here’s a quick update from the Naaman half. And it’s exciting (at least for me).

The quick update, for those who don’t know, is that I have co-founded a company called Mahaya, which aims to organize the world’s memories: make sense of the world’s stories and events as they are shared on social media. We are currently beta testing a new product called Seen, which makes it fun and simple to see what happened anywhere, automatically. This week, The New York Times announced that Mahaya will be one of the three companies in the inaugural run of the TimeSpace program (whoever named it should receive the Pulitzer).

In related news, next week in Moscow I will be giving a keynote at ECIR 2013, talking about how the work we have done over the last 8 or so years has informed the vision (and technology) for Mahaya. The motivation for the talk is below. I will try to post the full notes after I give the talk (Ayman, keep me honest here).

Time for Events.

In the last 8 years, my work and research have focused on the ways in which social media reflects and interacts with “the real world”, by which I mean actual occurrences: atoms clashing, people performing acts that are tied to a specific location and, often, to a time.

2005 was the onset of location-based social media as we know it. Flickr got popular (and got acquired by Yahoo). In 2006, Flickr formally introduced geotagging by supporting geo-metadata and providing a map interface; they thus created an easy way for people to associate location data with content, at scale. Almost immediately, we had… lots of dots on a map! Surely, we thought, these dots can tell us more about the world than where photos were taken. Can they tell us *what* the most interesting places and landmarks are, instead?

Tag Maps was our attempt at Yahoo! Research Berkeley to do that. For any world region, at any zoom level, we extracted (using fairly simple IR tools) the most salient and important topics for that area; we built an interactive prototype that exposed this information, a video of which you can find here (see if you can spot Yoda!). We realized (read more here) that one could extract from social media data a strong signal about the real world: about people, their geographic activities, and their interests.

Tag Maps / World Explorer Demo from Mor on Vimeo.

We then noticed a funny entry on the Paris Tag Map. It read “Les Blogs”; the explanation can be found here: a bunch of bloggers at a conference posted Flickr photos until our algorithm decided this was the main descriptor for that area (and for Paris). In other words, events started showing up on our map. That got us thinking: can we do a better job modeling, identifying, and presenting the data that is specifically associated with events?

tagmaps paris

At SIGIR 2007 we showed that the answer is yes. With Tye and Nathan, we described a system that discovers real-world events from Flickr geotagged data, including hyper-local events such as BYOBW (an old favorite of mine to show in talks, and an event I literally learned about from our results). The takeaway? Social media can reflect real-world events, via content created by a collective of mostly uncoordinated contributors.

After Tahrir Square, these “discoveries” seem rather obvious, but that was not the case in 2007, before Facebook and Twitter gained mainstream popularity, and well before the iPhone popularized mobile media and location (the iPhone 3G was released in July 2008).

In my talk, I am going to address the challenges in developing event technologies, show some of the solutions and technologies we developed in my research, complain that in 2013 the problem is still not solved commercially (case in point: the link I had to use for BYOBW above), and give a demo of Mahaya’s recent product, Seen, where we start solving the “event problem”. I’ll also talk about social media as the next step in the evolution of information systems, and what it means for Information Retrieval.

Come and say hi if you are in Moscow next week!

Taking R to the Zoo

Feeling pretty good about now? So far we’ve just played in the garden; there are problems when you enter the real world. Let’s start by looking at this dataset of Taco Bell tweets. It’s about 10,000 tweets. So still pretty small, but the 3.1MB of deliciousness can cause us some problems. First, let’s read it in. We’ll do so from the URL loader.

> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> system.time(f <- read.csv(u))
   user  system elapsed
  1.331   0.137   9.010

Here, we create a URL file object and then pass it to our read.csv function; upon completion, read.csv quietly closes the URL file object for you. The load will take a few seconds; you can wrap any command in system.time(…) to see how long it takes. Now let’s look at what we have:

> dim(f)  # how many rows and columns?
[1] 9413    9
> class(f)
[1] "data.frame"
> class(f$s)
[1] "factor"
> class(f$Source)
[1] "factor"
> class(f$Title)
[1] "factor"

The class ‘factor’ is a nominal variable, and R loves it. It’s good if you have distinct categories to specify, but not so much for dates or tweets.

> f$s[1]
[1] 01/31/11 04:21 AM
2011 Levels:        01/24/11 01:04 PM 01/24/11 01:06 PM ... 01/31/11 12:16 AM

The “2011 Levels” line tells us there are that many distinct timestamps in the dataset. We need dates to be, well, dates, and tweets to be text. We can convert a vector or variable by wrapping it in a conversion function like:

> f$Body[1]
[1] I really want Taco Bell. I dont care if its fake meat!
8612 Levels:  ... ????Taco Bell???????????????????????????????????????????????????????????
> as.character(f$Body[1])
[1] "I really want Taco Bell. I dont care if its fake meat!"

But really, the best way to do this is to make sure the reader pulls in the right data class when the file loads. This is specified in the read.csv call.

> types <- c("character", "factor", "factor", "character", "character", "character", "character", "character", "character")
> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> f <- read.csv(u, colClasses=types)
> class(f$s)
[1] "character"

Great! The c(…) function made a vector of strings, one for each column in the file; each entry names the class for that column, and we pass the whole thing to the read.csv(…) function. Next, we want just the tweets with @ symbols. In R, we can grep in a string like so:

> grep("@", "this is a test")
integer(0)
> grep("@", "this is @ test")
[1] 1
> grep("@", c("this is not a test", "this is @ test"))
[1] 2

That 1 is an array index, not a truth value. Watch: let’s check for an @ symbol in the first 5 rows of our dataset.

> grep("@", f$Body[1:5])
[1] 3 4 5

So, rows 3, 4, and 5 have an @ symbol. Oh hey, that’s a nice little index vector into the csv file! So, if we want to make a new variable containing just the tweets with @ symbols, it’s easy: just ask for all of those rows by passing that vector in as the row indices.

> dim(f)
[1] 9413    9
> ats <- f[grep("@", f$Body), ]
> dim(ats)
[1] 4031    9
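
If you want the exact fraction rather than eyeballing those dims, nrow() counts the rows of a data.frame, so a quick division does it:

> nrow(ats) / nrow(f)
[1] 0.4282375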

So roughly 43% of the dataset has @ symbols. Now we’ll need the zoo package. Go get it.

> install.packages("zoo")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/zoo_1.7-6.tgz'
Content type 'application/x-gzip' length 1396545 bytes (1.3 Mb)
opened URL
==================================================
downloaded 1.3 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpcbF2jV/downloaded_packages

Now, we still need the first column to be timestamps, not character strings.

> ?strptime
> strptime(ats[1,1], format="%m/%d/%y %I:%M %p")
[1] "2011-01-31 04:20:00"
> class(strptime(ats[1,1], format="%m/%d/%y %I:%M %p"))
[1] "POSIXlt" "POSIXt"

strptime(…) lets us convert strings to timestamps with a specified format. The ?strptime command will tell you what to use for formatting, as it’s different from other languages you might know. Great: we can do this against the whole column and make a “zoo”, or Z’s Ordered Observations.

> library(zoo)

Attaching package: ‘zoo’

The following object(s) are masked from ‘package:base’:

    as.Date, as.Date.numeric

> ?zoo
> z <- zoo(ats$Title, order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
Warning message:
In zoo(ats$Title, order.by = strptime(ats[, 1], format = "%m/%d/%y %I:%M %p")) :
  some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique

Just ignore the warning for now. What we are doing is ordering our data set (in this case, the Titles) by timestamp. The strptime(…) command is applied to the whole first column of the dataset (remember how R distributes a function across a vector?). Really, we are just using the zoo as an intermediate data structure. Now we aggregate, counting the number of tweets per minute.

> ats.length <- aggregate(z, format(index(z), "%m-%d %H:%M"), length)
> summary(ats.length)
         Index        ats.length
 01-24 05:00:   1   Min.   : 1.000
 01-24 05:01:   1   1st Qu.: 1.000
 01-24 05:03:   1   Median : 2.000
 01-24 05:05:   1   Mean   : 2.803
 01-24 05:06:   1   3rd Qu.: 3.000
 01-24 05:07:   1   Max.   :29.000
 (Other)    :1432

The aggregate(…) function takes the zoo and collects its entries using the specified function. In this case we chose length, so this is the total number of tweets per minute (the length of the vector for that minute, not the length of the tweets). We can easily aggregate by the hour by changing the time format:

> ats.length.H <- aggregate(z, format(index(z), "%m-%d %H"), length)
> summary(ats.length.H)
      Index      ats.length.H
 01-24 05:  1   Min.   : 1.00
 01-24 06:  1   1st Qu.:17.00
 01-24 07:  1   Median :31.00
 01-24 08:  1   Mean   :29.64
 01-24 09:  1   3rd Qu.:42.00
 01-24 10:  1   Max.   :70.00
 (Other) :130

We could even calculate the mean, if the zoo contained numeric data (like follower counts), by changing the function passed to aggregate. Plotting this is easy too… but instead of plotting to the screen, let’s save two PNGs.

> png("byminute.png")
> barplot(ats.length)
> dev.off()
null device
          1
> png("byhour.png")
> barplot(ats.length.H)
> dev.off()
null device
          1
>

The png(…) function opens a PNG file for writing. Any plotting command is then written to disk (and not displayed) until you call dev.off(). Our two plots look like this (minute on the left, hour on the right):

What’s great is that you can save a vector PDF too, by using the pdf(…) function just like the png(…) one. Next time, we’ll talk about dealing with something really, really big, data-wise.

PS: Years ago, I asked how to do this kind of aggregation on StackOverflow, which is a great resource for R help (or just about any other programming language).

PPS: Bonus points for doing this again, but computing the average tweet length by minute.
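
If you want to try the bonus, here is one way it might look (a sketch I haven’t run against the live file; the z.len and ats.avg.len names are just mine). nchar() gives the character length of each tweet, and mean replaces length in the aggregate call:

> z.len <- zoo(nchar(as.character(ats$Body)), order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
> ats.avg.len <- aggregate(z.len, format(index(z.len), "%m-%d %H:%M"), mean)   # average tweet length per minute
> summary(ats.avg.len)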

R 2D data and simple Map Plotting

So by now you may have noticed I’m focused on the basics of how R represents numbers and vectors. That is the general point of this tutorial… not to show you how to type cor.test(…) and get a number out, but rather how to manipulate data and data structures to work for you. In Computer Science, one thing you’re taught early on is that the more sophisticated the data structure, the simpler the code will (generally) be. R is no exception… except for one big exception, which I’ll get to later on. For now, on to the second dimension.

To do this, we’re going to read in a file rather than make it as we’ve done in the past. First, you’re going to have to change your working directory.  If you’re running the R GUI console, you can do this in the menubar under the Misc->Change Working Directory… command.  Or, if you are like me and your idea of a GUI is a VT220, you can use the command prompt:

> getwd()
[1] "/Users/shamma"
> setwd("/Users/shamma/tmp")
> getwd()
[1] "/Users/shamma/tmp"
>

Change into a new working directory somewhere, then make this simple little file and call it sample.csv:

CA,CB,CC
11,12,13
21,22,23
31,32,33
41,42,43

Ok, so if your working directory is set right, we should be able to read the file in easily.

> s <- read.csv("sample.csv", header=TRUE, sep=",")
> class(s)
[1] "data.frame"
> s
  CA CB CC
1 11 12 13
2 21 22 23
3 31 32 33
4 41 42 43

I’m specifying the comma delimiter, but it defaults to a comma already… so feel free to leave it out. We also told the read.csv function that this data has a header row. The numbers 1 to 4 you see in the first column are just row numbers, for your viewing pleasure. To get stuff out of this “data.frame” (and we’ll worry about what that is later): like most scientific languages, R thinks in “column-major order”, meaning complex structures are collections of columns. However, we access it row-then-column, like so:

> s[1, 2]
[1] 12

Gets us row 1, column 2. Nice, ya? Let’s look at some other examples. I’m going to put comments after each command to explain what’s happening inline with the code.

> s[1] # col 1, don't do this because it looks odd.
  CA
1 11
2 21
3 31
4 41
> s[1, ] # row 1 (note the blank where the col is to be given)
  CA CB CC
1 11 12 13
> s[ ,2] # col 2 (same trick as before with the blank)
[1] 12 22 32 42
> s[3, -2] # row 3, No col 2
  CA CC
3 31 33
> s[-2, -2] # no 2nd row or col
  CA CC
1 11 13
3 31 33
4 41 43

Pretty simple, and it follows what we learned last time. Hey, remember that header row? We can access columns by name. This is pretty handy and will keep you from counting which column was which in your dataset.

> s$CA
[1] 11 21 31 41
> s$CA[2:4]
[1] 21 31 41
> s$CB[2:4]
[1] 22 32 42

Great!  We know how to read something in and how to pick out exactly what we need.  Let’s do something real.  First, make this file and call it cities.csv.

name,long,lat
Newcastle,-1.6917,55.0375
Austin,-97.7,30.3
Cairo,31.25,30.05

Next, read it in like so:

> cities <- read.csv("cities.csv", header=TRUE)
> plot(cities$long, cities$lat, pch=20, col="blue", cex=.9)

And you should see a very useless plot window like:

Great…so we need a map to make this, well, intelligible.  To do this, we’ll need our first package.  You can install these little puppies from the menubar somewhere under Packages & Data.  I prefer to use the keyboard (mice carry diseases).  However you want to do it, get the maps package.

> install.packages("maps")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/maps_2.2-5.tgz'
Content type 'application/x-gzip' length 2104264 bytes (2.0 Mb)
opened URL
==================================================
downloaded 2.0 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpkxL1RA/downloaded_packages
>

Great! Now we are going to plot it again, but this time put the points on a world map by longitude and latitude. First we load the package. Display the map plot with map(…). Then add the points (remember from the first tutorial: the call points(…) adds dots to an existing plot).

> library(maps)
> map(database="world", col="grey")
> points(cities$long, cities$lat, pch=20, col="blue")

You should get something like:

Better? Finally, let’s color the regions. This is where we start to dive into package magic. We can ask the package where these places are, then fill those map regions, then plot our points. Notice how we are using variable and column names to make our code human-readable.

> cities
       name     long     lat
1 Newcastle  -1.6917 55.0375
2    Austin -97.7000 30.3000
3     Cairo  31.2500 30.0500
> places <- map.where(x=cities$long, y=cities$lat)
> places
[1] "UK:Great Britain" "USA"              "Egypt"
> map(database="world", col="grey")
> map(database="world", col="grey", regions=places, fill=TRUE, add=TRUE)
> points(cities$long, cities$lat, pch=20, col="blue")

And bingo! The map package found the three countries. The first call to map(…) displays the world map. The second call to map(…) adds our regions to the plot (see the add=TRUE parameter) by filling in the countries. Then points(…) adds our three city dots via their long and lat. Our first kinda-real thing, and we have a nice geo plot! This is how I did the geo plots in the Statler Inauguration demo. In the next installment, we’ll enter the zoo and I’ll explain why you’ll love and hate the data.frame object.
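
One optional extra (not part of the original demo, just a handy base-graphics trick): text() adds labels to an existing plot, so you can tag each dot with its city name; pos=3 puts the label just above the point.

> map(database="world", col="grey")
> points(cities$long, cities$lat, pch=20, col="blue")
> text(cities$long, cities$lat, labels=as.character(cities$name), pos=3, cex=0.8)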


Indexing Things in R

As Naaman pointed out, I took a couple of things for granted in my last tutorial. I assumed you know what a variable is, what a function is, and that you are comfortable typing into a command-line console (oh, and that you knew what R is). For our next tutorial, I will still make those assumptions. Now let’s say you did everything in the previous tutorial post and you’re looking at that flashing cursor and you wonder… what did I set already? The function ls(…) will List Objects currently loaded in memory.

> ls()
[1] "myline.fit" "x"          "x2"         "y"          "y2"

See? There’s everything we defined in the past session. Now, if we could only remember what these things are… there’s a function for that too, called class(…):

> class(x)
[1] "numeric"
> class(y)
[1] "numeric"
> class(myline.fit)
[1] "lm"

Here we see that x and y are of the class “numeric” and myline.fit is an “lm”, or linear model. Notice that if you just have a number, that’s also of class “numeric”:

> class(9)
[1] "numeric"

So, R doesn’t really make a strong distinction between a number and a list of numbers; let’s call it a vector, because a list is technically something different in R. This is because R will distribute operations across the whole vector if the thing that is “numeric” has more than one element. Take a look at this:

> a <- 5
> a - 1
[1] 4
> x
[1]  1  3  6  9 12
> x - 1
[1]  0  2  5  8 11

For the variable a, subtracting 1 gives us 4. However, subtracting 1 from x, where x is a vector, actually subtracts 1 from every element in the vector. If you’re an old school LISP hack like me, you’ll be very excited, but I’m getting a little ahead of myself. So, what if you just want an individual number from the vector? R uses a standard ‘array index’ scheme except, unlike every other computer language you’ve likely seen… it starts counting at 1 and not 0. Check it:

> x
[1]  1  3  6  9 12
> x[0]
numeric(0)
> x[1]
[1] 1
> x[2]
[1] 3

We see that x[0] is numeric(0), which is basically an empty value (a numeric vector with nothing in it). x[1] is the first element. x[2] is the second. We can also see how many items are in there, and notice we get an NA when we exceed the right boundary.

> length(x)
[1] 5
> x[6]
[1] NA

NA means ‘Not Available’. Now be careful: if you think a negative value is out of range, you’re mistaken. For example, x[-1] means show me x EXCEPT for the first element. Looky here:

> x
[1]  1  3  6  9 12
> x[-1]
[1]  3  6  9 12
> x[-2]
[1]  1  6  9 12
> x[-6]
[1]  1  3  6  9 12
> x[-10]
[1]  1  3  6  9 12

Yes, I’d call that not obvious. Notice -6 and -10 don’t change the vector, as there is no 6th or 10th element to remove. If we start to think of things as vectors of stuff, it gets neat. If you want the first three elements, you can specify a range with startingNumber:endingNumber.

> x[1:3]
[1] 1 3 6
> x[3:5]
[1]  6  9 12

And if you want say just the 2nd and 4th elements, you can just put a numeric vector in there:

> x[c(2,4)]
[1] 3 9

Remember our friend c(…)? It returns a vector of numbers. We can simply pass that into the array index and get the 2nd and 4th elements. And you can mix and match. This is because the c(…) function expands the range when it is evaluated:

> c(1:3, 5)
[1] 1 2 3 5
> x[c(1:3, 5)]
[1]  1  3  6 12

Things can get messy fast, but R won’t let you mix negative with non-negative indices:

> x
[1]  1  3  6  9 12
> y
[1]  1.5  2.0  7.0  8.0 15.0
> c(x, y)
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z <- c(x, y)
> z
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z[c(3:5, 8)]
[1]  6  9 12  7
> z[c(1, 3:5, 8:9)]
[1]  1  6  9 12  7  8
> z[c(-1, 3:5, 8:9)]
Error in z[c(-1, 3:5, 8:9)] :
  only 0's may be mixed with negative subscripts

Whew… our first error message. Ok, so let’s make an empty vector, then add stuff to it, leaving some blanks:

> v <- vector()
> v
logical(0)
> v[1] <- 2
> v[2] <- 4
> v
[1] 2 4
> v[6] <- 12
> v
[1]  2  4 NA NA NA 12

See how R just padded some NAs in there so it could set the 6th element.

> c(1,2,3,4,5) -> a
> a
[1] 1 2 3 4 5
> a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

Notice we can also pass in TRUE or FALSE as a ‘switch’ for whether each element shows up; a quick sketch of why that’s handy is below. Next time, we’ll throw in an extra dimension… just to make things interesting.
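
For example, with our x from above, a comparison produces that TRUE/FALSE switch vector for you, and you can index with it directly:

> x > 5
[1] FALSE FALSE  TRUE  TRUE  TRUE
> x[x > 5]
[1]  6  9 12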

R is calling…will you pick up the phone?

Over the past few years, my work has become rather quantastic. This is possibly due to the so-called big data world we live in: a result of storage becoming cheap and computing becoming ubiquitous. Naaman generally won’t geek out with me anymore, as I’ve grown fond of the letter R. I’ve dragged an intern or two through it, and others have been asking me for a good ‘get started’ guide. In fact, there isn’t one; there are several. However, R’s difficulty comes partly from the packages you want to use, and partly from just knowing its structure and how to select things. Years ago, while teaching studio art, I devised a Photoshop tutorial that was all based on selection with the marquee and magic wand tools. I told the students “if you can select it, then you can do anything… try not to get excited about the plugins so much.” It’s about time I shared a simple R tutorial written with the same philosophy (oh ya: go get R first and run it; you should be looking at a console window)… We’ll start with stupid R tricks, but after a few posts we’ll be knee-deep in making stuff happen.

> x <- c(1,3,6,9,12)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.0     6.0     6.2     9.0    12.0
> c(1.5,2,7,8,15) -> y
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.5     2.0     7.0     6.7     8.0    15.0

Here we call the c(…) function to combine some numbers into an array/vector and store it. You can use the = operator but really you want to use <- or ->.  Think of it as funneling something into the variable rather than overwriting it.  The summary(…) function will try to give you a quick glimpse into a variable you might have lying around.  So, now we can call some simple stats stuff.

> mean(x)
  [1] 6.2
> median(x)
  [1] 6
> sd(x)
  [1] 4.438468
> var(x)
  [1] 19.7

This gets good when you have more data than Excel would like to hold (you know, like over 10,000 rows)… we’ll see later that reading something from disk is super easy, and kinda tricky. So now that we have two vectors, x and y, let’s find a correlation like so:

> cor(x, y)
  [1] 0.965409
> ?cor
> cor.test(x, y)
    Pearson's product-moment correlation
  data:  x and y
  t = 6.413, df = 3, p-value = 0.007683
  alternative hypothesis: true correlation is not equal to 0
  95 percent confidence interval:
   0.5608185 0.9978007
  sample estimates:
  cor
    0.965409

Pretty simple stuff. Notice that calling ?cor brings up info about the cor(…) function in a new window. So let’s go ahead and plot it.

> plot(x,y)

Lines… let’s fit a line to the plot. The function call lm(…) fits a linear model. We need to express y as a function of x; this is done with the ~, oddly enough. We’ll call the result myline.fit, which is a nicer variable name than a non-expressive letter:

> myline.fit <- lm(y ~ x)
> summary(myline.fit)
  Call:
  lm(formula = y ~ x)
  Residuals:
        1       2       3       4       5
   0.9898 -0.8909  0.5381 -2.0330  1.3959
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  -0.6802     1.3665  -0.498  0.65285
  x             1.1904     0.1856   6.413  0.00768 **
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  Residual standard error: 1.648 on 3 degrees of freedom
  Multiple R-squared: 0.932,   Adjusted R-squared: 0.9094
  F-statistic: 41.13 on 1 and 3 DF,  p-value: 0.007683

Then we make the plot, add the fitted line to it, and throw in some new points in green just for good measure.

> plot(x,y)
> abline(myline.fit)
> x2 <- c(0.5, 3, 5, 8, 12)
> y2 <- c(0.8, 1, 2, 4, 6)
> points(x2, y2, col="green")

Not too pretty, but a plot should be visible; we can worry about pretty later.
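
If you can’t wait for pretty, base plot() already accepts a few styling arguments. Here’s a tiny optional sketch (the colors and labels are just placeholders, nothing official):

> plot(x, y, pch=19, col="blue", xlab="x", ylab="y", main="A first fit")   # solid blue dots with labels
> abline(myline.fit, col="red", lwd=2)   # the fitted line, red and a bit thicker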

Perhaps we can make it pretty after we read some data in and start getting real.  Next time!


Cheer Up! Some Holiday Hacking

With my star undergrads Ian and Abe, and backend support from Ziad, we put together this mashup for the holidays! We use the data from the Twitter streaming crawler we built (for our NSF-funded work) to get Instagram photos posted on Twitter that have the word Christmas in the tweet, and where the photo location is available on Twitter. We then add the Google Streetview of the photo location and, well, mash them all together.

Cheerbeat Screenshot

The result is an interesting juxtaposition (as one comment on my Facebook post captured well) of the “small instagram-style photos (typically close-up, indoors) against the backdrop of the (typically distant, outdoors) Google street views”. As such, the Street View gives context to the Instagram photo and maybe provides the setting in which the activity in the photo is taking place: another dimension of understanding, often much stronger than the text of the tweet itself.

Cheerbeat Screenshot

The app is also an interesting (and mostly unintended) statement about privacy: I don’t know how these users would feel knowing their environment is exposed to all, and not just in a default, bland, zoomed-in map format.

Cheerbeat Screenshot

The Cheerbeat application (Instacheer was our first name choice but, perhaps not amazingly, already taken by another Instagram Christmas mashup!) mostly runs as JavaScript in the browser. We continuously crawl Twitter data using the streaming API on our server. When the app loads, it grabs from our server a .json file with the latest 250 tweets that contain “Christmas” and “Instagr.am” and whose geo coordinates are not empty. We then (in the browser) use the Google Street View API to check which of these insta-tweets’ locations have Street View available. The app then rotates through the tweets/photos, showing the tweet, picture, location, time, and Street View of each.

As a side note, after all this filtering, surprisingly little data satisfied all these criteria, mostly (I suspect) because Twitter requires specific user authorization before location information is posted in tweets. In other words, even though many (most?) Instagram photos have location data, a lot of them will not have that data available when posted to Twitter.

There are extra features coming for this app (e.g., choosing your own keywords), but more on that later.

Happy holidays and enjoy the beat!


Putting on a SMILe (Plus: Winners!)

They say academia is the art of becoming world-renowned without appearing to be self-promoting. Sometimes, however, you gotta make some noise. In our case, we (that’s my team and I; don’t blame Ayman) have recently launched a new lab, the Social Media Information Lab. We thought we’d like to get the word out, especially as we are looking for new PhD students (and maybe postdocs) to join our ranks.

As the CHI 2011 conference is the most popular conference matching our research area, we decided to do something for it. It also helps that CHI has traditionally been a very playful gathering, with people allowing their badges to be decorated with all manner of additions (formal and informal): stickers, puppets, and various other household items. Love the CHI academics. We decided to have a little game.

With our convenient lab-name acronym, SMIL (perhaps not accidental), we zeroed in on a smile theme pretty quickly. We picked four exceptionally smiley CHI luminaries as our SMILe ambassadors: Ben Shneiderman, Judy Olson, Elizabeth Churchill, and Ed Chi. The fantastically talented Funda Kivran-Swaine turned their regular smiley pictures into monochromatic images (Ed now carries his proudly on his Twitter profile), which we printed on some 1,000 stickers using the wonderful-yet-pricey Zazzle service. Of course, the stickers included the URL of the SMIL website.

From left: Judy Olson, Ed Chi, Elizabeth Churchill, Ben Shneiderman

We devised a conference game with very simple mechanics: collect all four heroes on your badge, post it on Flickr/Twitter (#chismil), and you have a chance to win a CHI-SMIL t-shirt. We also made it somewhat difficult: different team members (and friends) distributed different stickers, and Ed’s sticker was the rarest, with access to it tightly controlled by Funda alone.

Did it work? We think it did. Soon enough, people I didn’t know approached me begging for “a Judy Olson” (or some other sticker), and a rumor started that there was a secret fifth member.

The luminaries themselves were also great sports, and seemed to enjoy the commotion and exchanges around the stickers. They each had a roll of their own sticker, except for Ed of course (access controlled to the end!).

In addition, people went to our website and commented on it to me (and perhaps to others).

And, finally, many people labored to collect all four stickers! (partial set of images). We put names in a hat, drew them out, and now have five lucky winners. There you go, people. T-shirts are coming. You’re welcome.

Stay tuned for CHI 2012. Who knows what games will be played.

Talk with Me (a.k.a. Wake me Up)

If you are reading this and live in the same great city as my good friend Dr. Naaman, you should go to the opening of the Talk to Me show at the MoMA on July 24th, 2011. From their blog, they say:

Talk to Me is an exhibition on the communication between people and objects…It will feature a wide range of objects from all over the world, from interfaces and products to diagrams, visualizations, perhaps even vehicles and furniture, by bona-fide designers, students, scientists, all designed in the past few years or currently under development.

A year ago, I had the good fortune of meeting Paola Antonelli, the curator of Architecture and Design at the NY MoMA. She described this show to me when it was still in its infancy. So I’m excited to see it actually open, and terribly sad that I won’t be able to make the opening. We chatted for a little bit about the semantic difference between “Talk to Me” and “Talk with Me” (my research is focused more on the latter). Quite a few months later, someone told me this quote by Ben Shneiderman: “the old computing is about what computers can do, the new computing is about what people can do.”

Recently, thinking about technology that people talk with, my friend Jeffery Bennett and I entered a Web-of-Things hack-a-thon, part of Pervasive Computing. Our idea was simple: can we enable an everyday object to reuse the asynchronous status update on Facebook and Twitter to connect with someone in a meaningful, real-time way? Enter The REAWAKENING.


We thought to call it 'Sleeper Cell' too.

Quite simply, The REAWAKENING is a socially connected alarm clock. We used an old skool Chumby (quite possibly one of the best prototyping tools ever made) to make our clock, which is tied into the Facebook and Twitter platforms. The REAWAKENING works like any other alarm clock: you set it and you go to sleep. When the alarm goes off, you can turn it off and wake up. But seriously, who does that? So, the alarm goes off, you hit snooze, and you go back to bed. The snooze button gives you an extra 8.5 minutes of sleep; at the same time, The REAWAKENING posts your snooze to Facebook and Twitter:



If five (5) of your friends follow the link from the snooze post, the alarm will fire again on the clock, preempting your 8.5-minute snooze. And this cycle can continue if you hit snooze again. When you do finally wake up and turn off the alarm, your friends are notified:



There are plenty of places for The REAWAKENING to go: shaking the clock could message your friends back to tell them to stop, or an ‘auto alarm’ could wake you up when your nearby friends are waking up. Don’t be surprised if you see it in an app store near you. More importantly, as we continue to invent and build out a connected world, let’s continue to expand the people and things we talk to, and who we talk with.

Using Sociology(!) to Explain Unfollows on Twitter

What gives, @ayman is no longer following me on Twitter!

Well, he still does, not least because he knows I will send roadkill to his office address if he stops. But surely, people stop following one another on Twitter all the time. Right? Right? Yes, right, as we show in our recent paper (caution, PDF), with my PhD students Funda Kivran-Swaine and Priya Govindan, to be published at CHI 2011.

Many studies, in academia and industry, in computer science and sociology (this one too), examine the creation of new ties in social networks, but very few examine tie breaks and persistence. Why? One reason is that, in computer science, models of tie creation have immediate consequences for systems (e.g., recommending new contacts). Another reason is that tie breaks are rare, or hard to detect/define, in many social networks, especially the networks studied by sociologists (when does Naaman’s tie with Ayman break? After 3 years of not communicating? 20?). Ron Burt‘s work is an exception, but Ron is always an exception, isn’t he.

Enter Twitter, where we can witness a dynamic social system, and where ties are created and broken for all to observe. Op-por-tu-ni-ty! Can we shed some light on the tie-break phenomenon on Twitter? How widespread is this phenomenon, and what are the factors that can help predict tie breaks?

We started with a random set of 715 Twitter users, and the 245,586 Twitter users that “follow” them at Time 1 (July 2009). We looked at these users and followers again after nine months (April 2010, Time 2). Did these follow edges still exist? How many dropped over that period? The image below captures one of our 715 users and the network around them at Time 1. The users that had stopped following our user by Time 2 (the “unfollowing” users) and their connections are marked in blue. Now it’s time to pause: what do you think the overall “unfollow” rate is in our data? 5%? 15%? 25%? 75%? OK, scroll down.
Unfollowing on Twitter.
Turns out, over nine months, 30% of the follow edges disappeared. On average, a single user lost about 39% of their followers over that period. How come it’s not 30%? Because the 39% is an average of per-user averages, probably pulled up by the fact that people with a large number of followers (of which there are fewer) lost a smaller portion of their followers, but still a large number. Do more followers mean relatively fewer unfollowers? I’ll come back to that in a second.

For this work, we were mainly interested in whether well-known sociological processes are at play on Twitter with respect to unfollowing activity. So we did our lit review, and found that strength of ties, embeddedness within networks, and power/status are some of the key related sociological concepts (the paper explains those in detail, of course). The question then was: can we look at the network structure alone and, based on these theories, see if there are network factors that are highly correlated with unfollows?

The details of the dataset are in the paper, but for now, just imagine that for each “follow” relationship we had the complete network graph of both nodes. So if “@ayman following @informor” was one of the edges we looked at, we could get the entire network neighborhood of @informor and of @ayman. (This network data is presented to you courtesy of Kwak et al.) What properties of @informor’s network, and of the network around @informor and @ayman, correlate with a higher probability that @ayman would stop following me?

We calculated a bunch of variables. For each of our 715 initial users (let’s call them “seeds”), these included, for example:

  • The seed’s number of followers.
  • The seed’s clustering coefficient: how connected their followers are.
  • The seed’s reciprocity rate: what portion of the people following them do they follow back?
  • The seed’s follow-back rate: what portion of the people they follow, follow them back?
  • The seed’s follower-to-followee ratio.

And for each seed and follower pair in our data, we computed aspects of their relationship:

  • How many connections do they have in common (i.e., users that both the seed and the follower connect to)?
  • What is the difference in prestige between the two (in terms of number of followers)?
  • Does the seed reciprocate the connection to the follower?

So, which factors correlated most with unfollow activity? We ran quite a sophisticated analysis (multi-level logistic regression), but I’ll keep it simple here with a basic look at the factors our analysis showed to contribute to the probability that a follower will unfollow a seed. For the more “scientific” study, check out the paper.

First, what did *NOT* have an impact: the number of followers a seed had at Time 1 had very limited impact on the probability of unfollows for that seed, and that impact was mitigated by other factors. A figure (limited to seeds who had fewer than 500 followers) demonstrates this.
num followers

So what played a major role? Reciprocity, for one, did. Do you follow someone who follows you? If you do, they are much less likely to unfollow you. Remember our 245,586 connections? Half of them were reciprocated (the seed also followed their follower). When the relationship was reciprocated, 16% of the followers unfollowed. When it wasn’t, a whopping 45% did. Before I throw a figure in, an important note about causality: we don’t know the causality. For example, pairs of users who are closer in real life (“strong ties”) are likely to have a reciprocated relationship and, of course, their connection is not likely to break (because they are close). A deeper examination is needed to show whether the act of reciprocating *alone* helps in maintaining the tie, although the analysis in the paper suggests that it contributes more than other factors that typically signify strong relationships.

reciprocated

We can even look at a user’s tendency to reciprocate follow relationships, and its effect on the percentage of followers they lose:
reciprocity

Here’s one more thing to think about: a user’s follow-back rate was highly correlated with a lower rate of unfollows, but the ratio of followers to followees wasn’t. The follow-back rate is the portion of the people a user follows who follow them back. For example, I may have 15 followers and 10 followees (people I follow) on Twitter. Out of the people I follow, 8 follow me back. So my follow-back rate is 80%, and my follower-to-followee ratio is 1.5. Both of these metrics are potential measures of “importance” on Twitter, but the fact that only one of them, the follow-back rate, affects the rate at which people stop following me hints that the follow-back rate might be a better measure of importance and success on Twitter. Makes sense, Ayman? What’s your follow-back rate?

Unfollowing on Twitter: followback rate.
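
If you want to compute your own, here is a toy version of the two metrics in R (a sketch with made-up usernames, not our actual data):

> followers <- c("ayman", "funda", "priya", "ed", "judy")   # people who follow me
> followees <- c("ayman", "funda", "priya", "liz")          # people I follow
> sum(followees %in% followers) / length(followees)         # follow-back rate
[1] 0.75
> length(followers) / length(followees)                     # follower-to-followee ratio
[1] 1.25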

What else? Embeddedness is the last thing I will touch on; you can read the paper for more (it’s only a 4-pager, don’t be too easy on yourself). And by embeddedness I do not mean the number of YouTube videos you post in your Twitter stream, but the sociological concept that captures the set of relationships that exist between the individuals in a relationship through third parties (i.e., common friends). More common friends? Your relationship is presumed to be stronger. It is not a surprise, then, that the larger the number of common neighbors two Twitter users have, the less likely one is to unfollow the other. From our data, this figure shows, for each number of common neighbors a “follow” relationship had, what percent of those follows became “unfollows”. For example, of all follow relationships that had no common neighbors at Time 1, 78% did not exist at Time 2; one common neighbor was enough to drop that number to 46% (and it keeps dropping; I stopped at 15 because you get the idea).

common neighbors

What didn’t we look at? Pretty much everything else! We relied on network structure alone to investigate these unfollows, as a first step. But there’s a lot more: how often do you tweet (or not)? How interesting are your posts? How similar are your topics to those of the people following you? We are now exploring these factors and additional variables. Stay tuned.

[update: slideshare presentation here].