Last month, John Forsythe and I made a Chumby app called ShakeMe. The basic idea is like the Folding@home or SETI@home projects, where people lend their CPU cycles to scientific research. The major difference is that we don't want CPU cycles; we collect sensor data from accelerometers to make a sensor mesh of seismographic activity. We submitted the idea to Freescale Semiconductor's "Sense the World" contest.

If you dig this idea, vote for us! It's a two-step process on Facebook.

  1. Like Freescale on Facebook here.
  2. Like our video on Facebook here.

I hear Naaman voted three times. Voting closes December 10th, so vote soon!

Twitter Sentiment Dataset Online

Late last year, Nick Diakopoulos and I analyzed the sentiment of tweets to characterize the Presidential debates. You can read about it in this paper. For this work, we collected sentiment judgements on 3,238 tweets from the first 2008 Presidential debate.

Today, we’ve decided to post the data online for everyone. Just a few notes before we do:

  1. Twitter owners own their tweets.
  2. The sentiment judgements are free for non-commercial, educational, artistic, and academic usage.
  3. The tweets were all publicly posted.
  4. This data was collected via Twitter's search API in 2008; read this paper for details on how.
  5. Sentiment judgements were collected from Mechanical Turk workers; read this other paper for details.
  6. Be responsible in your work using this data.

We are releasing this under a Creative Commons license. The Dataset for Characterizing Debate Performance via Aggregated Twitter Sentiment by Nicholas Diakopoulos and David A. Shamma is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The data set is available as a compressed tab-separated file [here's the ZIP download link]; give us a shout in the comments here if you use it somewhere. Enjoy!
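If you grab the archive, here's a minimal sketch of how you might read it with Python's standard library. The member filename and columns in the example are hypothetical; check the actual archive for its real schema:

```python
import csv
import io
import zipfile

def load_tsv_from_zip(zip_path, member=None):
    """Load rows from a tab-separated file inside a ZIP archive.

    zip_path may be a path or a file-like object. If `member` is not
    given, the first file in the archive is used. Returns a list of
    rows, each a list of column strings.
    """
    with zipfile.ZipFile(zip_path) as zf:
        name = member or zf.namelist()[0]
        with zf.open(name) as f:
            text = io.TextIOWrapper(f, encoding="utf-8")
            return list(csv.reader(text, delimiter="\t"))
```

From there it's one call to get the rows into memory for whatever analysis you have in mind.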


This dataset is now on InfoChimps.

Speaking in ML

Back to school, Naaman? It has been a long summer. I had the pleasure of working with Jude Yew (you will enjoy the stylish cartoon drawing of himself) from the School of Information in Ann Arbor, Michigan. We began the summer thinking about social networks and media sharing. We decided not to look at Twitter. Instead, we looked back at Yahoo! Zync. We began to examine videos that were shared over IM in sync: how they were watched and when people scrubbed. This became rather interesting and led us to ask questions about how we watch, consume, and perceive videos.

To back up some, we started to look at videos just from YouTube: how they were classified, and how we could predict classification based on a video's metadata. It turns out…it's hard. We had a small dataset (under 2,000 videos), and getting a bigger crawl and throwing the data in the cloud was…well…just gonna take a little time. I get a little impatient.

We were using Naive Bayes to predict whether a video was Comedy, Music, Entertainment, Film, or News. The YouTube metadata had three features: the video length, the number of views, and the 5-star rating. We wondered about how people rate movies. Some B and even C movies are cult classics. They belong to a class of like media; it doesn't mean a particular B movie isn't as good as a particular A movie. If this is in fact the case, the set of 4.2-rated YouTube videos could be fit by a polynomial anywhere; in effect, they do not need to sit between 4.0 and 4.5. Technically put, the ratings of 0.0 to 5.0 could be transformed from an interval scale to nominal factors. With factors, Naive Bayes has more freedom to fit polynomials to the probability distributions.

Only when we nominally factor the ratings can we classify videos on YouTube using only three features. Compared to random predictions with the YouTube data (21% accurate), we attained a mediocre 33% accuracy in predicting video genres using a conventional Naive Bayes approach. However, the accuracy significantly improves by nominal factoring of the data features. By factoring the ratings of the videos in the dataset, the classifier was able to accurately predict the genres of 75% of the videos.
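To illustrate the factoring trick, here is a toy sketch (not our actual pipeline or data): star ratings are binned into nominal labels, and a small count-based Naive Bayes then treats each bin as an unordered category rather than a point on an interval scale:

```python
from collections import Counter, defaultdict

def to_factor(rating, width=0.5):
    """Map a 0.0-5.0 star rating onto a nominal bin label, e.g. 4.2 -> '4.0-4.5'."""
    lo = min(int(rating / width) * width, 5.0 - width)
    return f"{lo:.1f}-{lo + width:.1f}"

class CategoricalNB:
    """Toy Naive Bayes over nominal features with add-one smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = Counter(y)
        # (class, feature index) -> counts of feature values seen in training
        self.counts = defaultdict(Counter)
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.counts[(c, i)][v] += 1
        self.n = len(y)
        return self

    def predict(self, xs):
        def score(c):
            s = self.priors[c] / self.n
            for i, v in enumerate(xs):
                cnt = self.counts[(c, i)]
                # Add-one smoothing, with one extra slot for unseen values.
                s *= (cnt[v] + 1) / (sum(cnt.values()) + len(cnt) + 1)
            return s
        return max(self.classes, key=score)
```

Because each bin label is its own category, the classifier can assign the "4.0-4.5" videos whatever class distribution the data supports, independent of the neighboring bins.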

The patterns of social activity found in the metadata are not just meaningful in their own right, but are indicative of the meaning of the shared video content. This was our first step this summer in investigating the potential meaning and significance of social metadata and its relation to the media experience. We’ll be presenting the paper Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification (pdf) at IS&T/SPIE in January. Stop by and say hi if you see one of us there!

Who, What, When, Where: The Semantic Web is Alive and Well (and on Facebook)

I have killed the Semantic Web before (at least in my provocative title), but pointed out that the future of semantics is light-weight semantics created by programmers, users, or individual companies. And here it comes: the future of the Semantic Web (and by that I also mean the Web, life, and the Universe) is now owned by Facebook.

A recent Yahoo! patent, dug up by SEO by the Sea, reminded me of the work I've been involved with at Yahoo!, driven by the vision of Marc Davis: being able to semantically connect the four most important dimensions of Web objects (Who, What, When, and Where) directly to the user experience on Yahoo!. But while Yahoo! dragged its feet, Facebook is making real steps toward becoming the true W4 platform for the Web. The identity (Who) war seems to have been won, at least for the time being; for most people, their real identity on the Web is the one they expose on Facebook. Controlling the Who has immediate consequences (e.g., being the de-facto communication platform for people trying to reach contacts), but has also allowed Facebook to expand into the When (Events), What (Pages), and now Where (Places). And as I am doing the linking here, I notice the Facebook title for the Places page — interesting…

Facebook W4

In other words, the Facebook W4 network allows people to connect their experiences to well-defined concepts that “live” in the Facebook objectverse. This is one of Facebook’s greatest successes, and greatest leverage going forward.

Going forward means allowing other developers and companies to build on the Facebook W4 semantics. Yahoo! only partially succeeded in doing that with "Where", using the Yahoo! Geo platform. Facebook now allows Websites and applications to connect via the Who (Facebook identity). Increasingly, Facebook will make its "What" and "When" useful for other applications as well. The Places feature, cleverly, launched with integration from various companies (e.g., FourSquare) that can use the Facebook Places platform. There is no reason why this platform won't soon be opened to (and used by) many other developers, giving Facebook ownership of Who and Where on the Web.

Going forward also means improving the capabilities of the Facebook platform in connecting and mashing the various entities. For example, being able to record the fact that "this picture was taken at the event Elvis Perkins in Dearland at Governor's Island with my friend Kathleen". Seems like that may be coming! Many other applications are of course possible (e.g., "all the Statuses ever posted from this classroom").

And where is Twitter? With the less specific "annotation" feature, and lagging behind in the Who space, Twitter is struggling in the objectverse, despite a strong geo bent and a major push last year.

Interaction, movement, and dance at DIS 2010

Denmark. Århus. DIS 2010. I was particularly excited to be presenting the first detailed paper on Graffiti Dance (an art performance I co-organized last year with Renata Sheppard and Jürgen Schible). Unfortunately, Naaman wasn't there; it's fun for the two of us to storm into a distant country…hilarity ensues. The conference itself was spectacular. With all-time lows for acceptance rates (I believe full papers were at 15% and short papers somewhere north of 21%; 2008 had about a 34% acceptance rate), the talks covered everything from prototypes to rich qualitative studies. Aaron Houssian liveblogged all three days in case you need to catch up: [Day 1, Day 2, Day 3]. I spoke on Day 3, the morning after we built a nail gun sculpture.

Now, with any good talk you present, you should have some new insight into your work. In this case, I decided not to present what's in the published article, which covers some theory, the design process, and the system, concluding with an informal exit interview with the audience and the dancers. You should check out the video describing the performance on Vimeo. Instead, I presented the provenance of the idea: how three artists far apart from each other made this happen.

First, as was pointed out to me, nothing new was really created to make this installation happen. There were system components from other performances that we reused to make something completely unique. The Computer Scientist in me appreciated this deeply. Sometimes, in particular with art, we fight for novelty. Henri de Toulouse-Lautrec put it best:

In our time there are many artists who do something because it is new… they see their value and their justification in this newness. They are deceiving themselves… novelty is seldom the essential. This has to do with one thing only… making a subject better from its intrinsic nature.

Second, this piece takes a group painting and stencil image session and maps the on-screen movement (created by the scurry of four mouse cursors and brushes scrambling to create an image) to movement in the audience (facilitated through dancers). Why not map the dancers to the drawn image, rather than to the movement of the cursors? It occurs to me (after a few discussions with Renata) that most approaches proxy movement through audio cues, drawn images, or time of day. Our performances are about connected action between people. Motion tied to motion is a much stronger link than an image tied to motion. Movement is not a proxy. This relates to a responsive dress Renata and I made last year; the lights in the dress respond to the dancers' movements.

Light Dress

Finally, this performance carries the larger research agenda of mine: how do we build for connected synchronized action? For this embodiment that is this performance, that’s worth a longer journal paper.

[Note: once the ACM Digital Library hosts the proceedings, I'll add a link to the published paper here]

iSticks iSteelpan iTaiko and iMan

Hey Naaman? You get one of those shiny new iPad things? Ever since I saw them…I thought there was something there. Such a nice big screen. So many colors. It’s stunning. Makes me want to hit it with something.

Apple, well, Mr. Steve, seems to dislike the idea of input devices aside from your hand. No pens. No stylus. Use it naturally. I think there's something to that mantra, but then again, we do a lot as humans with tools and instruments. The X-Acto knife, a spatula, a paintbrush…all of these things let us manipulate and create things around us. Touching is great for interacting, but we tend to create with instruments.

So, when I thought to myself that I wanted to poke and hit an iPad, I had a problem. I had no iPad. As fortune would have it, I borrowed one for one month from a friend in exchange for a box of fancy chocolates.

The second issue arose when I remembered the touch screen is capacitive. Hit it all day long with a stick; nothing. It needs to carry a charge and feel like a relatively fatty finger. I immediately thought of modern conductive fabric; much less greasy than a Korean sausage, though not as tasty.

Armed with a metal dowel, conductive fabric, textured cotton, and some string, I showed up at Music Hackday in SF one Saturday morning and made some drumsticks. You can see how I built the sticks on Instructables:

iSticks: How to make a drumstick for an iPad.

Now…with sticks in hand, I built my second-ever iPhone app: a Taiko drum, just to test the idea out. Not wanting to make another R8 drum kit on my borrowed iPad, I thought of a more esoteric instrument: a steel pan drum! Once I built the steel drum, I realized I didn't know how to play it. So I made a tutorial that acts like a whack-a-mole game and teaches you how to play "Twinkle, Twinkle, Little Star". The app won two awards at the San Francisco Music Hack Day.

Currently, iSteelPan and iTaiko are free in the App Store, which took some doing (initially, Apple said I had some trademark infringements around the tutorial). Distribution of apps…someone should run a workshop on that. Oh right, Henriette Cramer is; the deadline's in two days…good luck!

The Secret Life of (One) Professor: Two Years In

Matt Welsh of Harvard recently wrote on the Secret Lives of Professors, a post that stirred a lot of discussion and struck a chord with a somewhat less experienced professor (that would be me; two years on the job vs. Matt's seven). I found myself nodding at many of Matt's well-framed observations.

Matt’s main “surprises” and lessons that he offers to grad students in his post include:

Lots of time spent on funding requests. I have had a similar experience, because (like Matt) I enjoy working with, and leading, a large group of researchers. Of course, the batting averages are low for funding requests (Matt downplays his success rate, but I bet it's better than average). In my first two years, I submitted 3 NSF proposals, 2 of which were declined and one is outstanding (a good sign); I am currently working on two more. Each of these took significant effort, in one case at least (an estimated) two full months of my time. In addition, I submitted a number of smaller-scale proposals, most of them quick and easy to write, and was fortunate enough to get a Google Research Award (thanks again, Goog!) and to be assigned as a faculty mentor to a superstar two-year postdoc, Nick Diakopoulos. Together with some other odds and ends (thanks SC&I!), I feel pretty happy after two years about the group and resources I have amassed; but the cost in time is still substantial. On the bright side, as Sam Madden points out in the comments to Matt's article, some of the grant proposal process is actually helpful in thinking about future work and research agendas, even if the specific proposal does not get funded.

The job is never done. Even as I write this, I could (and feel that I should!) be editing a paper, or looking at some data, or catching up on email, or working on one of the two aforementioned proposals. Matt admits:

For years I would leave the office in the evening and sit down at my laptop to keep working as soon as I got home.

I can't say my experience is far from that, although I still insist on taking good vacations. And a 2-year-old kid certainly makes for a compelling reason to stop working at any time.

Can't get to "hack". True enough, most of the interesting work is delegated to students, as Matt complains that he doesn't find time to write code. However, that is partially a decision that Matt (and I) knowingly made when we decided to work with (and try to fund) a large group of students. Managing fewer or no students might allow more individual research work, which is certainly a path taken by some faculty who skip the funding requests and the resultant student meetings. However, I am no Ayman; I do not miss writing code, and am happy to farm that out to students. I do enjoy thinking about the intellectual and research issues, and often get to do that with the students. I would like to have fewer meetings and less email, but unlike Matt, I feel involved enough in the intellectual work, at least so far. Nevertheless, I can't dive into it like the grad students, who indeed "have it good".

Working with students. Matt writes:

The main reason to be an academic is… to train the next generation.

I see it the same way (the intellectual pursuit is also up there, but it could be claimed that you can perform similar intellectual pursuits in other settings like research labs). The students are why I am in academia, and advising is by far my favorite activity. From solving someone else's problems (e.g., a student not sure how to approach X or Y) to, more substantially, showing students a path from first-year confusion to being an experienced researcher who understands how to ask (and answer) research questions and communicate them effectively. Well, I am clearly not quite there yet, having just recently started doing it (and just started funding my first PhD student). But I am enjoying it already. Like Matt, for me it is not just working with the PhD and Masters students; the undergrads play a big role. I started working with several star undergrads; some of them had never SSH'ed into a server before, and most of them had never seen how research is done. Their wide-eyed excitement is an energy source, an inspiration, and a cause of constant enjoyment.

So, the bottom line?

It is certainly not for everybody. It remains to be seen if it is even for me.

I will buy that, Matt. At the end of the day, for me, it's the students, and the freedom to carve my own path. This summer I am lucky enough to be working with my group at SC&I, consisting of one postdoc, 2-3 PhD students, 3 Masters students, and 1-3 undergrads (at any given time). With teaching (more on this topic later) out of the way, I spend two full days a week with this gang talking about research, writing papers or grants, having other "good" meetings, or playing Rock Band on our Wii. It's definitely one of the best work summers I have had, much like my summers at Yahoo! Research Berkeley, where we had most of our fantastic interns join in on the fun.

Speaking of the defunct Y!RB, and regarding that path-carving freedom: I feel a lot less constrained in academia compared to industry research. I had a fantastic experience at Yahoo!, and was lucky to have a great team at the Berkeley lab. However, starting my own project at Yahoo!, one that followed my own personal vision and involved multiple people, would have taken a lot of convincing (and would need to be ultimately tied to the corporate agenda). I know Ayman does not agree, so maybe this is just a false sense that I have, that moving a bunch of people toward a vision I choose and craft is easier in academia. To do that with the students might be, as Matt put it, "the coin of the realm".

Apple Does Migrations (Almost) Perfectly

Just got a new MacBook Pro. I've been on Mac for about 5 years now, and the single most impressive feature to me is the migration. As someone lucky enough to be in a place with a fantastic IT department (yes, I know that's unlikely, but our IT people are superstars), it means just dropping off my old Mac and, voilà! A few hours later I have all the setup I had before (down to the browser history items), reproduced on a lovely new machine.

Just a few things went wrong, most of which are Apple’s fault, and some of which are quite annoying.

First, the Mac didn't recognize the iPhone. Luckily, I was clever enough to think of checking for a Mac software update, and sure enough, the only update available was a fix for this bug. +1 point, Apple.

But it got worse once the iPhone was recognized. Soon enough I got this notice right here:

OK, a little scary, and totally wrong (not getting into a DRM discussion here), but not so bad as a user experience — the dialog allowed me to continue and gave me options; I can live with that (but why didn't the migration carry forward my authorization?). Anyway, I asked to authorize, only to get another prompt: something like "Sorry, you already have 5 authorized computers." This time, I was offered no way out other than acknowledging that lovely, yet curious fact (which 5 machines had I authorized? Ayman certainly didn't get my permission for any content!). I was too shocked to take a screen grab of that pesky dialog. Still, this wasn't a big deal, because I knew what to do: de-authorize all my computers (the only one I knew I had authorized was not with me — I migrated from it, see — so I couldn't just de-authorize it). But that's wrong, Mr. Jobs. Why would a "normal" (i.e., not 6'8″) user know how to de-authorize their other computers? Instead, I would like to have seen this process:

1. “Hey, it seems like you already reached the maximum number of computers allowed to access your licensed content! Would you like to fix that?”

Options: But of course! / No, I’ll just curl up in the corner and cry

2. “Here are the details of your 5 authorized computers. Which one(s) would you like to de-authorize?”

Options: Select any number of computers to de-authorize.

3. Done!

Easy, Steve? -gazillion points, Apple!

Another thing that didn't migrate properly was my screensaver (although my desktop picture preferences were kept). I guess that's because in Snow Leopard you need to use iPhoto albums to choose screensaver photos. But why would the desktop background work and the screensaver break? Slightly bizarre.

The wifi was also a mild annoyance, forgetting all my preferences (but at least remembering the networks’ credentials for secure networks).

Finally (geek/grad student topic alert), I lost my LaTeX (MacTeX) installation in the migration to the new Mac. I mean, the files were still there, but the migration broke a few symbolic links and tampered with the folder structure just enough that my various LaTeX editors couldn't find the MacTeX installation. MacTeX has a several-step solution, but you know me, I take my shortcuts (I just upgraded to MacTeX 2009), which fixed all these issues.

So, Apple could have come really close to a perfect game, but allowed a couple of walks late in the innings, just to have Naaman complain. Well, what would I do without them?

Conversation Shadows and Social Media

If you find yourself at ICWSM this week, say hi to us. I know I've been introduced to Naaman at least twice so far; I believe he still writes here. So far it's been a nice mix, from standard social network analysis to S. Craig Watkins's talk on Investigating What's Social about Social Media (he's from UT Austin's Radio-Television-Film department and gives a great perspective on personal motivations and behaviors). Yahoo!'s Jake Hofman gave a great tutorial on Large-scale social media analysis with Hadoop.

Tonight, I'll be presenting my work on Conversational Shadows. In this work, we look at how people tweeted during the inauguration and show some analytical methods for discovering what was important in the event, all based on the shadow their Twitter activity casts upon the referent event. Let me give a clear example.

Ever go to a movie? Have you noticed that people chat with their friends through the previews? Once the lights go down and the movie starts, they stop chatting. Sure, they might say "this will be good" or "yay", but the conversation stops. I began to wonder: shouldn't this occur on Twitter while people are watching something on TV? Does the conversation slow down at that moment of onset, when the show starts?

During Obama's Inauguration, we sampled about 600 tweets per minute from a Twitter push stream. The volume by minute varied insignificantly. However, "a conversation" on Twitter is exhibited via the @mention convention. The mention is highlighted; it calls for attention from the recipient. Our dataset averaged about 160 tweets per minute with an @ symbol. Curiously, there were 3 consecutive minutes where the number of @ symbols dropped significantly, to about 35 @s per minute. We still sampled about 600 tweets; there was just a general loss of @s. People hushed their conversation. Perhaps even gasped. Here's a graph to give you a better feel:

During those minutes where the @ symbols dropped, Obama's hand hit the Lincoln Bible and the swearing-in took place. People were still shouting "Hurray!", but they weren't calling to others via the @ symbol. Following the human-centered insight (as we found by studying video watching behaviors), we can examine the @ symbols to find the moment of event onset. We call this a conversational shadow: the event has a clear interaction with the social behaviors found in the Twitter stream. We've found other shadows too; come by the poster session tonight to see them or, if you can't attend, check out my paper.
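As a rough sketch of the idea (not the method from the paper; the z-score threshold and the counts in the example are illustrative), you could flag the hushed minutes by looking for per-minute @-counts that fall well below the typical rate:

```python
from statistics import mean, stdev

def shadow_minutes(at_counts, z=2.0):
    """Flag minutes whose @-mention volume drops well below the typical rate.

    at_counts: per-minute counts of tweets containing an @-mention.
    Returns the indices of minutes more than `z` standard deviations
    below the mean, i.e. candidate "conversational shadow" moments.
    """
    mu, sigma = mean(at_counts), stdev(at_counts)
    return [i for i, c in enumerate(at_counts) if c < mu - z * sigma]
```

On a stream that hums along at ~160 @-tweets per minute, a three-minute dip to ~35 stands far outside two standard deviations, so those minutes pop out as the onset of the event.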