
Online again

This blog, this Ayman Naaman Show, was started by two young researchers as a way to collaborate and post in a shared thought space. Over the years, our activity waned and moved to likes and retweets on various other social media platforms. Eventually, a bad config halted the WordPress activity until one of us had the time to fix it. Well, it's fixed now (thank you, Dreamhost support). Feel relieved, Naaman? There's some backlog of content…maybe we can uncage it?

Taking R to the Zoo

Feeling pretty good about now? So far we've just played in the garden; there are problems when you enter the real world. Let's start by looking at this dataset of Taco Bell tweets. It's about 10,000 tweets. So still pretty small, but the 3.1MB of deliciousness can cause us some problems. First, let's read it in. We'll do so straight from the URL.

> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> system.time(f <- read.csv(u))
   user  system elapsed
  1.331   0.137   9.010

Here, we create a URL connection object and then pass it to our read.csv function; when it finishes, read.csv quietly closes that connection for you. This will take a few seconds to load; you can wrap any command in system.time(…) to see how long it takes. Now let's look at what we have:

> dim(f)  # how many rows and columns?
[1] 9413    9
> class(f)
[1] "data.frame"
> class(f$s)
[1] "factor"
> class(f$Source)
[1] "factor"
> class(f$Title)
[1] "factor"

The class 'factor' is R's nominal (categorical) variable type, and R loves it. It's good if you have distinct categories to specify, but not so much for dates or tweets.

> f$s[1]
[1] 01/31/11 04:21 AM
2011 Levels:        01/24/11 01:04 PM 01/24/11 01:06 PM ... 01/31/11 12:16 AM

The 2011 levels tell us there are that many distinct timestamps in the dataset. We need dates to be, well, dates, and tweets to be text. We can convert the arrays or variables by wrapping them in a converting function like:

> f$Body[1]
[1] I really want Taco Bell. I dont care if its fake meat!
8612 Levels:  ... ????Taco Bell???????????????????????????????????????????????????????????
> as.character(f$Body[1])
[1] "I really want Taco Bell. I dont care if its fake meat!"

But really, the best way to do this is to make sure the reader pulls in the right data classes when the file loads. This is specified in the read.csv(…) call.

> types <- c("character", "factor", "factor", "character", "character", "character", "character", "character", "character")
> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> f <- read.csv(u, colClasses=types)
> class(f$s)
[1] "character"

Great!  The c(…) function made a vector of strings, one for each column in the file; each entry is the name of the class for that column and we pass it in to the read.csv(…) function.  Next, we want just the tweets with @ symbols.  In R, we can grep in a string like so:

> grep("@", "this is a test")
integer(0)
> grep("@", "this is @ test")
[1] 1
> grep("@", c("this is not a test", "this is @ test"))
[1] 2

That 1 is an array index, not a truth value. Watch: let's check for an @ symbol in the first 5 rows of our dataset.

> grep("@", f$Body[1:5])
[1] 3 4 5

So, rows 3, 4, and 5 have an @ symbol. Oh hey, that's a nice little index vector into the data frame! So, if we want to make a new variable holding just the tweets with @ symbols, it's easy: just ask for all those rows by passing that vector in as the row indices.

> dim(f)
[1] 9413    9
> ats <- f[grep("@", f$Body), ]
> dim(ats)
[1] 4031    9

So roughly 43% of the dataset has @ symbols. Now we'll need the zoo package. Go get it.

> install.packages("zoo")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/zoo_1.7-6.tgz'
Content type 'application/x-gzip' length 1396545 bytes (1.3 Mb)
opened URL
==================================================
downloaded 1.3 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpcbF2jV/downloaded_packages

Now, we still need the first column as timestamps, not character strings.

> ?strptime
> strptime(ats[1,1], format="%m/%d/%y %I:%M %p")
[1] "2011-01-31 04:20:00"
> class(strptime(ats[1,1], format="%m/%d/%y %I:%M %p"))
[1] "POSIXlt" "POSIXt"

strptime(…) lets us convert strings to timestamps with a specified format. The ?strptime command will tell you what to use for formatting, as it's different from other languages you might know. Great, now we can do this against the whole column and make a "zoo", or Z's Ordered Observations.

> library(zoo)

Attaching package: ‘zoo’

The following object(s) are masked from ‘package:base’:

    as.Date, as.Date.numeric

> ?zoo
> z <- zoo(ats$Title, order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
Warning message:
In zoo(ats$Title, order.by = strptime(ats[, 1], format = "%m/%d/%y %I:%M %p")) :
  some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique

Just ignore the warning for now. What we are doing is ordering our dataset (in this case the Titles) by timestamp. The strptime(…) command is applied to the whole first column of the dataset (remember how R distributes a function across a vector?). Really, we are going to use the zoo as an intermediate data structure. Now we aggregate, and we will do this by the number of tweets per minute.

> ats.length <- aggregate(z, format(index(z), "%m-%d %H:%M"), length)
> summary(ats.length)
         Index        ats.length
 01-24 05:00:   1   Min.   : 1.000
 01-24 05:01:   1   1st Qu.: 1.000
 01-24 05:03:   1   Median : 2.000
 01-24 05:05:   1   Mean   : 2.803
 01-24 05:06:   1   3rd Qu.: 3.000
 01-24 05:07:   1   Max.   :29.000
 (Other)    :1432

The aggregate(…) function takes the zoo and collects the entries using the specified function. In this case we chose length, so this is the total number of tweets per minute (the number of observations in that minute, not the length of the tweets). We can easily aggregate by the hour by changing the time format:

> ats.length.H <- aggregate(z, format(index(z), "%m-%d %H"), length)
> summary(ats.length.H)
      Index      ats.length.H
 01-24 05:  1   Min.   : 1.00
 01-24 06:  1   1st Qu.:17.00
 01-24 07:  1   Median :31.00
 01-24 08:  1   Mean   :29.64
 01-24 09:  1   3rd Qu.:42.00
 01-24 10:  1   Max.   :70.00
 (Other) :130

Or we could calculate the mean instead, if the zoo contained numeric data (like follower counts), just by changing the function passed to aggregate. Plotting this is easy too…but instead of plotting to the screen, let's save two PNGs.

> png("byminute.png")
> barplot(ats.length)
> dev.off()
null device
          1
> png("byhour.png")
> barplot(ats.length.H)
> dev.off()
null device
          1
>

The png(…) function opens a PNG file for writing. Then any plotting command will be written to disk (and not displayed) until you call dev.off(). Our two plots look like this (minute on the left, hour on the right):

What's great is that you can save a vector PDF too by using the pdf(…) function just like the png(…) one (a quick sketch below). Next time, we'll talk about dealing with something really, really big, data-wise.
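For example (my sketch, not from the original transcript), the hourly plot as a PDF would go like this:

> pdf("byhour.pdf")
> barplot(ats.length.H)
> dev.off()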

PS: Years ago, I asked how to do this kind of aggregation on StackOverflow, which is a great resource for R help, or for just about any other programming language.

PPS: Bonus points for doing this but computing the average tweet length by minute (one possible approach is sketched below).
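Here is one way that bonus could go; this is my sketch, not from the original post, so treat it as a starting point. nchar(…) gives the character count of each tweet, and mean replaces length in the aggregate call (you'll get the same non-unique index warning as before; ignore it again):

> z.len <- zoo(nchar(ats$Body), order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
> ats.avglen <- aggregate(z.len, format(index(z.len), "%m-%d %H:%M"), mean)
> png("avglength.png")
> barplot(ats.avglen)
> dev.off()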

R 2D data and simple Map Plotting

So by now you may have noticed I'm focused on the basics of how R represents numbers and vectors. That is the general point of this tutorial…not to show you how to type cor.test(…) and get a number out, but rather how to manipulate data and data structures to make them work for you. In Computer Science, one thing you're taught early on is: the more sophisticated the data structure, the simpler the code will (generally) be. R is no exception…except for one big exception which I'll get to later on. For now, on to the second dimension.

To do this, we’re going to read in a file rather than make it as we’ve done in the past. First, you’re going to have to change your working directory.  If you’re running the R GUI console, you can do this in the menubar under the Misc->Change Working Directory… command.  Or, if you are like me and your idea of a GUI is a VT220, you can use the command prompt:

> getwd()
[1] "/Users/shamma"
> setwd("/Users/shamma/tmp")
> getwd()
[1] "/Users/shamma/tmp"
>

Change your working directory into a new directory somewhere and make this simple little file and call it sample.csv:

CA,CB,CC
11,12,13
21,22,23
31,32,33
41,42,43

Ok, so if your working directory is set right, we should be able to read the file in easily.

> s <- read.csv("sample.csv", header=TRUE, sep=",")
> class(s)
[1] "data.frame"
> s
  CA CB CC
1 11 12 13
2 21 22 23
3 31 32 33
4 41 42 43

I'm specifying the comma delimiter, but it defaults to a comma already…so feel free to leave it out. We also told the read.csv function that this data has a header row. In the first column of the printout, the numbers 1 to 4 are just row numbers for your viewing pleasure. To get stuff out of this "data.frame" (and we'll worry about what that is later), R uses "column-major order" like most scientific languages, meaning complex structures are collections of columns. However, we access it row then column, like so:

> s[1, 2]
[1] 12

Gets us row 1, column 2. Nice, ya? Let's look at some other examples. I'm going to put comments after each command to explain what's happening inline with the code.

> s[1] # col 1, don't do this because it looks odd.
  CA
1 11
2 21
3 31
4 41
> s[1, ] # row 1 (note the blank where the col is to be given)
  CA CB CC
1 11 12 13
> s[ ,2] # col 2 (same trick as before with the blank)
[1] 12 22 32 42
> s[3, -2] # row 3, No col 2
  CA CC
3 31 33
> s[-2, -2] # no 2nd row or col
  CA CC
1 11 13
3 31 33
4 41 43

Pretty simple, and it follows what we learned last time. Hey, remember that header row? We can access columns by name. This is pretty handy and will keep you from counting out which column was which in your dataset.

> s$CA
[1] 11 21 31 41
> s$CA[2:4]
[1] 21 31 41
> s$CB[2:4]
[1] 22 32 42

Great!  We know how to read something in and how to pick out exactly what we need.  Let’s do something real.  First, make this file and call it cities.csv.

name,long,lat
Newcastle,-1.6917,55.0375
Austin,-97.7,30.3
Cairo,31.25,30.05

Next, read it in like so:

> cities <- read.csv("cities.csv", header=TRUE)
> plot(cities$long, cities$lat, pch=20, col="blue", cex=.9)

And you should see a very useless plot window: just three lonely blue dots.

Great…so we need a map to make this, well, intelligible.  To do this, we’ll need our first package.  You can install these little puppies from the menubar somewhere under Packages & Data.  I prefer to use the keyboard (mice carry diseases).  However you want to do it, get the maps package.

> install.packages("maps")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/maps_2.2-5.tgz'
Content type 'application/x-gzip' length 2104264 bytes (2.0 Mb)
opened URL
==================================================
downloaded 2.0 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpkxL1RA/downloaded_packages
>

Great! Now we are going to plot it again, but this time put the points on a world map by their longitude and latitude. First we load the package. Then display the map plot with map(…). Then add the points (remember from the first tutorial, the call points(…) adds dots to an existing plot).

> library(maps)
> map(database="world", col="grey")
> points(cities$long, cities$lat, pch=20, col="blue")

You should get something like:

Better? Finally, let's color the regions. This is where we start to dive into package magic. We can ask the package where these places are, then fill in those map regions, then plot our points. Notice how we are using variable and column names to make our code human readable.

> cities
       name     long     lat
1 Newcastle  -1.6917 55.0375
2    Austin -97.7000 30.3000
3     Cairo  31.2500 30.0500
> places <- map.where(x=cities$long, y=cities$lat)
> places
[1] "UK:Great Britain" "USA"              "Egypt"
> map(database="world", col="grey")
> map(database="world", col="grey", regions=places, fill=TRUE, add=TRUE)
> points(cities$long, cities$lat, pch=20, col="blue")

And bingo! The maps package found the three countries. The first call to map(…) displays the world map. The second call to map(…) adds our regions to the plot (see the add=TRUE parameter) by filling in the countries. Then points(…) adds our three city dots via their long and lat. Our first kinda real thing, and we have a nice geo plot! This is how I did the geo plots in the Statler Inauguration demo. In the next installment, we'll enter the zoo and I'll explain why you'll love and hate the data.frame object.
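One small extra that isn't in the original walkthrough (just a sketch, using plain base graphics): you can label the dots with the city names from the same data frame, where pos=3 places each label just above its point:

> text(cities$long, cities$lat, labels=cities$name, pos=3, cex=0.8)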


Indexing Things in R

As Naaman pointed out, I took a couple of things for granted in my last tutorial. I assumed you knew what a variable is, what a function is, that you are comfortable typing into a command-line console…oh, and that you knew what R is. For our next tutorial, I will still make those assumptions. Now let's say you did everything in the previous tutorial post and you're looking at that flashing cursor and you wonder…what did I set already? The function ls(…) will List Objects currently loaded in memory.

> ls()
[1] "myline.fit" "x"          "x2"         "y"          "y2"

See?  There’s everything we defined in the past session.  Now, if we could only remember what these things are…there’s a function for that too called class(…):

> class(x)
[1] "numeric"
> class(y)
[1] "numeric"
> class(myline.fit)
[1] "lm"

Here we see that x and y are of the class “numeric” and myline.fit is a “lm” or linear model. Notice if you just have a number, that’s also of class “numeric”:

> class(9)
[1] "numeric"

So, R doesn't really make a strong distinction between a number and a list of numbers; let's call it a vector, because a list is technically different in R. This is because R will distribute operations across the whole vector if the thing that is "numeric" has more than one element. Take a look at this:

> a <- 5
> a - 1
[1] 4
> x
[1]  1  3  6  9 12
> x - 1
[1]  0  2  5  8 11

For the variable a, subtracting 1 gives us 4. However, subtracting 1 from x, where x is a vector, actually subtracts 1 from every element in the vector. If you're an old school LISP hack like me, then you'll be very excited, but I'm getting a little ahead of myself. So, what if you just want an individual number from the vector? R uses a standard 'array index' scheme except, unlike every other computer language you've likely seen…it starts counting at 1 and not 0. Check it:

> x
[1]  1  3  6  9 12
> x[0]
numeric(0)
> x[1]
[1] 1
> x[2]
[1] 3

We see that x[0] is numeric(0) which is basically an empty value (a placeholder for a number but with no value stored there).  x[1] is the first element.  x[2] is the second.  We can also see how many items are in there and notice we get an NA when we exceed the right boundary.

> length(x)
[1] 5
> x[6]
[1] NA

NA means ‘Not Available‘.  Now be careful because if you think a negative value is out of range, you’re mistaken.  For example, x[-1] means show me x EXCEPT for the first element.  Looky here:

> x
[1]  1  3  6  9 12
> x[-1]
[1]  3  6  9 12
> x[-2]
[1]  1  6  9 12
> x[-6]
[1]  1  3  6  9 12
> x[-10]
[1]  1  3  6  9 12

Yes, I'd call that not obvious. Notice -6 and -10 don't change the vector, as there is no 6th or 10th element to remove. If we start to think of things as vectors of stuff, it gets neat. If you want the first three elements, you can ask for a range with startingNumber:endingNumber.

> x[1:3]
[1] 1 3 6
> x[3:5]
[1]  6  9 12

And if you want say just the 2nd and 4th elements, you can just put a numeric vector in there:

> x[c(2,4)]
[1] 3 9

Remember our friend c(…)? It returns a vector of numbers. We can simply pass that into the array index and get the 2nd and 4th elements. And you can mix and match. This is because the c(…) function expands the range when it is evaluated:

> c(1:3, 5)
[1] 1 2 3 5
> x[c(1:3, 5)]
[1]  1  3  6 12

Things can get messy fast, but R won't let you mix negative with non-negative indices:

> x
[1]  1  3  6  9 12
> y
[1]  1.5  2.0  7.0  8.0 15.0
> c(x, y)
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z <- c(x, y)
> z
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z[c(3:5, 8)]
[1]  6  9 12  7
> z[c(1, 3:5, 8:9)]
[1]  1  6  9 12  7  8
> z[c(-1, 3:5, 8:9)]
Error in z[c(-1, 3:5, 8:9)] :
  only 0's may be mixed with negative subscripts

Whew…our first error message. Ok, so let's make an empty vector, then add stuff to it, leaving some blanks:

> v <- vector()
> v
logical(0)
> v[1] <- 2
> v[2] <- 4
> v
[1] 2 4
> v[6] <- 12
> v
[1]  2  4 NA NA NA 12

See how R just padded some NAs in there so it could set the 6th element? One more trick:

> c(1,2,3,4,5) -> a
> a
[1] 1 2 3 4 5
> a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

Notice we can also pass in a vector of TRUEs and FALSEs as a 'switch' that selects which elements to show. Next time, we'll throw in an extra dimension…just to make things interesting.
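One last bonus that isn't in the original post (a quick sketch): a comparison produces exactly this kind of TRUE/FALSE vector, so you can filter a vector by a condition:

> a > 2
[1] FALSE FALSE  TRUE  TRUE  TRUE
> a[a > 2]
[1] 3 4 5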

R is calling…will you pick up the phone?

Over the past few years, my work has become rather quantastic. This is possibly due to the so-called big data world we live in: a result of storage becoming cheap and computing becoming ubiquitous. Naaman generally won't geek out with me anymore as I've grown fond of the letter R. I've dragged an intern or two through it. Others have been asking me for a good 'get started' guide. In fact, there isn't one; there are several. However, R's difficulty partially comes from the packages you want to use. Part of it comes from just knowing its structure and how to select things. Years ago, while teaching studio art, I devised a Photoshop tutorial that was all based on selection with the marquee and magic wand tools. I told the students, "if you can select it, then you can do anything…try not to get excited about the plugins so much." It's about time I shared this simple R tutorial, which is written with the same philosophy (oh ya, go get R first and run it; you should be looking at a console window)…we'll start now with stupid R tricks but after a few posts be knee deep in making stuff happen.

> x <- c(1,3,6,9,12)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.0     6.0     6.2     9.0    12.0
> c(1.5,2,7,8,15) -> y
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.5     2.0     7.0     6.7     8.0    15.0

Here we call the c(…) function to combine some numbers into an array/vector and store it. You can use the = operator but really you want to use <- or ->.  Think of it as funneling something into the variable rather than overwriting it.  The summary(…) function will try to give you a quick glimpse into a variable you might have lying around.  So, now we can call some simple stats stuff.

> mean(x)
  [1] 6.2
> median(x)
  [1] 6
> sd(x)
  [1] 4.438468
> var(x)
  [1] 19.7

This gets good when you have more data than Excel would like to hold (you know, like over 10,000 rows)…we'll see later that reading something in from disk is super easy and kinda tricky at the same time. So now that we have two vectors x and y, let's find a correlation like so:

> cor(x, y)
  [1] 0.965409
> ?cor
> cor.test(x, y)
    Pearson's product-moment correlation
  data:  x and y
  t = 6.413, df = 3, p-value = 0.007683
  alternative hypothesis: true correlation is not equal to 0
  95 percent confidence interval:
   0.5608185 0.9978007
  sample estimates:
  cor
    0.965409

Pretty simple stuff. Notice calling ?cor brings up info about the cor(…) function in a new window. So let's go ahead and plot it.

> plot(x,y)

Lines…let's fit a line to the plot. The function call lm(…) fits a linear model. We need to express y as a function of x; this is done with the ~, oddly enough. We'll call the result myline.fit, which is a nicer variable name than just a non-expressive letter:

> myline.fit <- lm(y ~ x)
> summary(myline.fit)
  Call:
  lm(formula = y ~ x)
  Residuals:
        1       2       3       4       5
   0.9898 -0.8909  0.5381 -2.0330  1.3959
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  -0.6802     1.3665  -0.498  0.65285
  x             1.1904     0.1856   6.413  0.00768 **
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  Residual standard error: 1.648 on 3 degrees of freedom
  Multiple R-squared: 0.932,   Adjusted R-squared: 0.9094
  F-statistic: 41.13 on 1 and 3 DF,  p-value: 0.007683
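A quick aside that isn't in the original post: if you just want the intercept and slope out of that wall of text, coef(…) pulls them from the fitted model as a named vector; the numbers should match the Estimate column above (roughly -0.68 and 1.19). The abline(…) call below uses exactly those two values to draw the line.

> coef(myline.fit)  # intercept and slope as a named vector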

Then we make the plot and then add a line to it and some new points in green just for good measure.

> plot(x,y)
> abline(myline.fit)
> x2 <- c(0.5, 3, 5, 8, 12)
> y2 <- c(0.8, 1, 2, 4, 6)
> points(x2, y2, col="green")

Not too pretty, but a plot should be visible; we can worry about pretty later.

Perhaps we can make it pretty after we read some data in and start getting real.  Next time!


Talk with Me (a.k.a. Wake me Up)

If you are reading this and live in the same great city as my good friend Dr. Naaman, you should go to the opening of the Talk to Me show at the MOMA, July 24th 2011. From their blog, they say:

Talk to Me is an exhibition on the communication between people and objects…It will feature a wide range of objects from all over the world, from interfaces and products to diagrams, visualizations, perhaps even vehicles and furniture, by bona-fide designers, students, scientists, all designed in the past few years or currently under development.

A year ago, I had the good fortune of meeting Paola Antonelli, the curator of Architecture and Design at the NY MOMA. She was describing to me this show, which was in its infancy at the time. So I’m excited to see it actually open and terribly sad I won’t be able to make the opening. We chatted for a little bit about the semantic difference between “Talk to Me” and “Talk with Me” (my research is focused more so on the latter). Quite a few months later, someone told me this quote by Ben Shneiderman: “the old computing is about what computers can do, the new computing is about what people can do.”

Recently, thinking about technology that people talk with, my friend Jeffery Bennett and I entered a Web-of-Things Hack-a-thon, part of the Pervasive Computing conference. Our idea was simple: can we enable an everyday object to reuse the asynchronous status updates on Facebook and Twitter to connect with someone in a meaningful, real-time way? Enter The REAWAKENING.


We thought to call it 'Sleeper Cell' too.

Quite simply, The REAWAKENING is a socially connected alarm clock. We used an old skool Chumby (quite possibly one of the best prototyping tools ever made) to build our clock, which ties into the Facebook and Twitter platforms. The REAWAKENING works like any other alarm clock: you set it and you go to sleep. When the alarm goes off, you can turn it off and wake up. But seriously, who does that? So, the alarm goes off, you hit snooze, and you go back to bed. The snooze button gives you an extra 8.5 minutes of sleep; at the same time, The REAWAKENING posts your snooze to Facebook and Twitter:



If five (5) of your friends follow the link from the snooze post, the alarm will fire again on the clock, preempting your 8.5 minute snooze. And this cycle can continue if you hit snooze again. When you do finally wake up and turn off the alarm, your friends are notified:



There are plenty of places for The REAWAKENING to go: shaking the clock could message your friends back to stop, or an 'auto alarm' could wake you up when your nearby friends are waking up; don't be surprised if you see it in an app store near you. More importantly, as we continue to invent and build out a connected world, let's continue to expand the people and things we talk to and who we talk with.

Earthquake@Home

Last month, John Forsythe and I made a Chumby app called ShakeMe. The basic idea is like the Folding@Home or Seti@Home projects, where people lend their CPU cycles to scientific research. The major difference is that we don't want CPU cycles; we collect sensor data from accelerometers to build a sensor mesh of seismographic activity. We submitted the idea to Freescale Electronics' "Sense the world" contest.

If you dig this idea, vote for us! This is a two step process on Facebook.

  1. Like Freescale on Facebook here.
  2. Like our video on Facebook here.

I hear Naaman voted three times. Voting closes December 10th, so vote soon!

Twitter Sentiment Dataset Online

Late last year, Nick Diakopoulos and I analysed the sentiment of tweets to characterize the Presidential debates. You can read about it in this paper. For this work, we collected sentiment judgements on 3,238 tweets from the first 2008 Presidential debate.

Today, we’ve decided to post the data online for everyone. Just a few notes before we do:

  1. Twitter users own their tweets.
  2. The sentiment judgements are free for non-commercial, educational, artistic, and academic usage.
  3. The tweets were all publicly posted.
  4. This data was collected via Twitter's search API in 2008; read this paper for details on how.
  5. Sentiment judgements were fetched from Mechanical Turkers; read this other paper for details.
  6. Be responsible in your work using this data.

We are releasing this under a Creative Commons license. Dataset for Characterizing Debate Performance via Aggregated Twitter Sentiment by Nicholas Diakopoulos and David A. Shamma is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The data set is available as a compressed tab-separated file [here’s the ZIP download link]; give us a shout here as a comment if you use it somewhere. Enjoy!

UPDATE:

This dataset is now on InfoChimps.

Speaking in ML

Back to school, Naaman? It has been a long summer. I had the pleasure of working with Jude Yew (you will enjoy his stylish cartoon drawing of himself) from the School of Information in Ann Arbor, Michigan. We began the summer thinking about social networks and media sharing. We decided not to look at Twitter. Instead we looked back at Yahoo! Zync. We began to examine videos that were shared over IM in sync: how they were watched, and when people scrubbed. This became rather interesting and led us to ask questions about how we watch, consume, and perceive videos.

To back up some, we started by looking at videos just from YouTube: how they were classified, and how we could predict classification based on a video's metadata. It turns out…it's hard. We had a small dataset (under 2,000 videos), and getting a bigger crawl and throwing the data in the cloud was…well…just gonna take a little time. I get a little impatient.

We were using Naive Bayes to predict whether a video was Comedy, Music, Entertainment, Film, or News. The YouTube metadata had three features: the video length, the number of views, and the 5-star rating. We wondered about how people rate movies. Some B and even C movies are cult classics; they belong to a class of like media. That doesn't say a particular B movie isn't as good as a particular A movie. If this is in fact the case, the set of 4.2-rated YouTube videos could be fit by a polynomial anywhere; in effect, they do not need to sit after 4.0 and before 4.5. Technically put, the ratings of 0.0 to 5.0 can be transformed from an interval variable into factors (a rough sketch of that transformation follows below). With factorization, Naive Bayes has more freedom to fit polynomials to the probability distributions.
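As an illustration of that interval-to-factor idea (my sketch, not the code from the paper, and the half-point bin width is made up), R's cut(…) turns numeric ratings into nominal levels:

> ratings <- c(4.2, 3.9, 4.7, 1.0, 5.0)
> ratings.f <- cut(ratings, breaks=seq(0, 5, by=0.5), include.lowest=TRUE)
> class(ratings.f)
[1] "factor"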

Only when we nominally factor the ratings can we classify videos on YouTube using only three features. Compared to random predictions with the YouTube data (21% accurate), we attained a mediocre 33% accuracy in predicting video genres using a conventional Naive Bayes approach. However, the accuracy significantly improves by nominal factoring of the data features. By factoring the ratings of the videos in the dataset, the classifier was able to accurately predict the genres of 75% of the videos.

The patterns of social activity found in the metadata are not just meaningful in their own right, but are indicative of the meaning of the shared video content. This was our first step this summer in investigating the potential meaning and significance of social metadata and its relation to the media experience. We’ll be presenting the paper Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification (pdf) at IS&T/SPIE in January. Stop by and say hi if you see one of us there!