Category Archives: R

Taking R to the Zoo

Feeling pretty good about now?   So far we’ve just played in the garden; there are problems when you enter the real world.   Let’s start by looking at this dataset of TacoBell tweets.  It’s about 10,000 tweets.   So still pretty small, but the 3.1MB of deliciousness can cause us some problems.    First, lets read it in.  We’ll do so from the URL loader.

> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> system.time(f <- read.csv(u))
   user  system elapsed
  1.331   0.137   9.010

Here, we create a URL file object then pass it to our read.csv function.  Upon completion, you won’t notice it closes the URL file object.    This will take a few seconds to load, you can wrap any command by system.time(…) to see how long it takes.   Now lets look at what we have:

> dim(f)  # how many rows and columns?
[1] 9413    9
> class(f)
[1] "data.frame"
> class(f$s)
[1] "factor"
> class(f$Source)
[1] "factor"
> class(f$Title)
[1] "factor"

The class ‘factor’ is a nominal variable and R loves it.  It’s good if you have distinct types to specify, but not so much for dates or tweets.

> f$s[1]
[1] 01/31/11 04:21 AM
2011 Levels:        01/24/11 01:04 PM 01/24/11 01:06 PM ... 01/31/11 12:16 AM

The 2011 levels tells us there are this many distinct timestamps in the dataset.  We need dates to be, well, dates.  And tweets to be text.  We can convert the arrays or variables by wrapping them in a converting function like:

> f$Body[1]
[1] I really want Taco Bell. I dont care if its fake meat!
8612 Levels:  ... ????Taco Bell???????????????????????????????????????????????????????????
> as.character(f$Body[1])
[1] "I really want Taco Bell. I dont care if its fake meat!"

But really, the best way to do this is to make sure the reader pulls in the right data class when the file loads.  This is specified in the read.csv file.

> types <- c("character", "factor", "factor", "character", "character", "character", "character", "character", "character")
> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> f <- read.csv(u, colClasses=types)
> class(f$s)
[1] "character"

Great!  The c(…) function made a vector of strings, one for each column in the file; each entry is the name of the class for that column and we pass it in to the read.csv(…) function.  Next, we want just the tweets with @ symbols.  In R, we can grep in a string like so:

> grep("@", "this is a test")
integer(0)
> grep("@", "this is @ test")
[1] 1
> grep("@", c("this is not a test", "this is @ test"))
[1] 2

That 1 is an array index, not a truth value.  Watch, lets check for an @ symbol in the first 5 rows of our dataset.

> grep("@", f$Body[1:5])
[1] 3 4 5

So, rows 3, 4, and 5 have an @ symbol.  Oh hey, that’s a nice little index vector into the csv file!  So, if we want to make a new variable which is just the @ symbols, it’s easy, just say give us all those rows by passing that vector in as the row indices.

> dim(f)
[1] 9413    9
> ats <- f[grep("@", f$Body), ]
> dim(ats)
[1] 4031    9

So roughly 42% of the dataset has @ symbols. Now we’ll need the zoo package.  Go get it.

> install.packages("zoo")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/zoo_1.7-6.tgz'
Content type 'application/x-gzip' length 1396545 bytes (1.3 Mb)
opened URL
==================================================
downloaded 1.3 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpcbF2jV/downloaded_packages

Now, we still need the first column as timestamps, not character arrays.

> ?strptime
> strptime(ats[1,1], format="%m/%d/%y %I:%M %p")
[1] "2011-01-31 04:20:00"
> class(strptime(ats[1,1], format="%m/%d/%y %I:%M %p"))
[1] "POSIXlt" "POSIXt"

strptime(…) lets us convert strings to timestamps with a specified format.  The ?strptime command will tell you what to use for formatting as its different from other languages you might know.  Great, we can do this against the whole column and make a “zoo” or Z’s Ordered Observations.

> library(zoo)

Attaching package: ‘zoo’

The following object(s) are masked from ‘package:base’:

    as.Date, as.Date.numeric

> ?zoo
> z <- zoo(ats$Title, order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
Warning message:
In zoo(ats$Title, order.by = strptime(ats[, 1], format = "%m/%d/%y %I:%M %p")) :
  some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique

Just ignore the warnings right now.  What we are doing is ordering our data set (in this case the Titles) by the timestamp.  The strptime(…) command is applied to the first column of the dataset (remember how R distributes a function across a vector?).  Really we are going to use the zoo as an intermediate data structure.  Now we aggregate them, and we will do this by the number of tweets by minute.

> ats.length <- aggregate(z, format(index(z), "%m-%d %H:%M"), length)
> summary(ats.length)
         Index        ats.length
 01-24 05:00:   1   Min.   : 1.000
 01-24 05:01:   1   1st Qu.: 1.000
 01-24 05:03:   1   Median : 2.000
 01-24 05:05:   1   Mean   : 2.803
 01-24 05:06:   1   3rd Qu.: 3.000
 01-24 05:07:   1   Max.   :29.000
 (Other)    :1432

The aggregate(…) function takes the zoo and collectes them by the specified function. In this case, we chose length, so this is the total number of tweets per minute (the length of the vector at that time aggregate, not the length of the tweets).  We can easily aggregate by the hour by changing the time format:

> ats.length.H <- aggregate(z, format(index(z), "%m-%d %H"), length)
> summary(ats.length.H)
      Index      ats.length.H
 01-24 05:  1   Min.   : 1.00
 01-24 06:  1   1st Qu.:17.00
 01-24 07:  1   Median :31.00
 01-24 08:  1   Mean   :29.64
 01-24 09:  1   3rd Qu.:42.00
 01-24 10:  1   Max.   :70.00
 (Other) :130

Or even calculate the mean if the zoo contained numeric data (like follower counts) by changing the function specified to aggregate.  Plotting this is easy too…but instead of plotting to the screen, lets save two PNGs.

> png("byminute.png")
> barplot(ats.length)
> dev.off()
null device
          1
> png("byhour.png")
> barplot(ats.length.H)
> dev.off()
null device
          1
>

The png(…) function opens a PNG file for writing.  Then any plotting command will be written to disk (and not displayed) until you call dev.off().  Our two plots look like (minute on the left, hour on the right):

What’s great is, you can save a vector PDF too by using the pdf(…) function just like the png(…) one.  Next time, we’ll talk about dealing with something really really big data wise.

PS: years ago, I asked how to do this kind of aggregation on StackOverflow, which is a great resource for R help or just about any other programming language.

PPS: Bonus points for doing this but computing the average tweet length by minute.

R 2D data and simple Map Plotting

So by now you may have noticed I’m focused on the basics of how R represents numbers and vectors.  That is the general point of this tutorial…not to show you how to type cor.test(…) and get a number out, but rather how to manipulate data and data structures to work for you.  In Computer Science, one thing you’re taught early on is: the more sophisticated a data structure, the more simplified the code will (generally) be.  R is no exception…except for one big exception which I’ll get to later on.  For now, on to the second dimension.

To do this, we’re going to read in a file rather than make it as we’ve done in the past. First, you’re going to have to change your working directory.  If you’re running the R GUI console, you can do this in the menubar under the Misc->Change Working Directory… command.  Or, if you are like me and your idea of a GUI is a VT220, you can use the command prompt:

> getwd()
[1] "/Users/shamma"
> setwd("/Users/shamma/tmp")
> getwd()
[1] "/Users/shamma/tmp"
>

Change your working directory into a new directory somewhere and make this simple little file and call it sample.csv:

CA,CB,CC
11,12,13
21,22,23
31,32,33
41,42,43

Ok, so if your working directory is set right, we should be able to read the file in easily.

> s <- read.csv("sample.csv", header=TRUE, sep=",")
> class(s)
[1] "data.frame"
> s
  CA CB CC
1 11 12 13
2 21 22 23
3 31 32 33
4 41 42 43

I’m specifying the comma delimiter, but it defaults to a comma already…so feel free to leave it out.   We also told the read.csv function that this data has a header row.  In the first column you see there then numbers 1 to 4 are just row numbers for your viewing pleasure.  To get stuff out of this “data.frame” (and we’ll worry about what that is later), R does something called “column-major order” like most scientific languages. Meaning, complex structures are collections of columns.  However, we access it row then column like so:

> s[1, 2]
[1] 12

Gets us row 1 column 2. Nice ya?  Lets look at some other examples.  I’m going to put comments after each command to explain what’s happing inline of the code.

> s[1] # col 1, don't do this because it looks odd.
  CA
1 11
2 21
3 31
4 41
> s[1, ] # row 1 (note the blank where the col is to be given)
  CA CB CC
1 11 12 13
> s[ ,2] # col 2 (same trick as before with the blank)
[1] 12 22 32 42
> s[3, -2] # row 3, No col 2
  CA CC
3 31 33
> s[-2, -2] # no 2nd row or col
  CA CC
1 11 13
3 31 33
4 41 43

Pretty simple and follows what we learned last time.  Hey, remember that header row?  We can access columns by name.  This is pretty handy and will keep you from counting which column that was in your dataset.

> s$CA
[1] 11 21 31 41
> s$CA[2:4]
[1] 21 31 41
> s$CB[2:4]
[1] 22 32 42

Great!  We know how to read something in and how to pick out exactly what we need.  Let’s do something real.  First, make this file and call it cities.csv.

name,long,lat
Newcastle,-1.6917,55.0375
Austin,-97.7,30.3
Cairo,31.25,30.05

Next, read it in like so:

> cities <- read.csv("cities.csv", header=TRUE)
> plot(cities$long, cities$lat, pch=20, col="blue", cex=.9)

And you should see a very useless plot window like:

Great…so we need a map to make this, well, intelligible.  To do this, we’ll need our first package.  You can install these little puppies from the menubar somewhere under Packages & Data.  I prefer to use the keyboard (mice carry diseases).  However you want to do it, get the maps package.

> install.packages("maps")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/maps_2.2-5.tgz'
Content type 'application/x-gzip' length 2104264 bytes (2.0 Mb)
opened URL
==================================================
downloaded 2.0 Mb

The downloaded packages are in
	/var/folders/un/unv7BK-CG2qWofD8jLjha+++Q3I/-Tmp-//RtmpkxL1RA/downloaded_packages
>

Great! Now we are going to plot it again, but this time put the points on a world map as long and lat.  First we load the package.  Display the map plot with map(…).  Then add the points (remember from the first tutorial, the call points(…) adds dots to an existing plot).

> library(maps)
> map(database="world", col="grey")
> points(cities$long, cities$lat, pch=20, col="blue")

You should get something like:

Better?  Finally, lets color the regions.  This is where we start to dive into package magic.  We can ask the package, where are these places. Then fill the map regions.  Then plot our points. Notice how we are using variable and column names to make our code human readable.

> cities
       name     long     lat
1 Newcastle  -1.6917 55.0375
2    Austin -97.7000 30.3000
3     Cairo  31.2500 30.0500
> places <- map.where(x=cities$long, y=cities$lat)
> places
[1] "UK:Great Britain" "USA"              "Egypt"
> map(database="world", col="grey")
> map(database="world", col="grey", regions=places, fill=TRUE, add=TRUE)
> points(cities$long, cities$lat, pch=20, col="blue")

And bingo!  The map package found the three countries.  The first call to map(…) [line 9] displays the world map.  The second call to map(…) [line 10] adds our regions to the plot (see the add=TRUE parameter) by filling in the countries.  Then points(…) adds our three city dots via their long and lat. Our first kinda real thing and we have a nice geo plot!  This is how I did the geo plots in the Statler Inauguration demo. In the next installment, we’ll enter the zoo and I’ll explain why you’ll love and hate the data.frame object.

 

Indexing Things in R

As Naaman pointed out, I took a couple of things for granted in my last tutorial. I assumed you know what a variable is, what a function is, and that you are comfortable typing into a command-line console oh and that you new what R is. For our next tutorial, I will still make those assumptions. Now lets say you did everything in the previous tutorial post and you’re looking at that flashing cursor and you wonder…what did I set already? The function ls(…) will List Objects currently loaded in memory.

> ls()
[1] "myline.fit" "x"          "x2"         "y"          "y2"

See?  There’s everything we defined in the past session.  Now, if we could only remember what these things are…there’s a function for that too called class(…):

> class(x)
[1] "numeric"
> class(y)
[1] "numeric"
> class(myline.fit)
[1] "lm"

Here we see that x and y are of the class “numeric” and myline.fit is a “lm” or linear model. Notice if you just have a number, that’s also of class “numeric”:

> class(9)
[1] "numeric"

So, R doesn’t really make a strong distinction between a number and a list of numbers; let’s call it a vector because a list is technically different in R.  This is is because R will distribute operations across the whole vector if the thing that is “numeric” has more than one element.  Take a look at this:

> a <- 5
> a - 1
[1] 4
> x
[1]  1  3  6  9 12
> x - 1
[1]  0  2  5  8 11

For the variable a, subtracting 1 gives us 4.  However, when we simply subtract 1 from x, where x is a vector, actually subtracts 1 from every element in the vector.  If you’re an old school LISP hack like me, then you’ll be very excited, but I’m getting a little ahead of myself.  So, what if you just want an individual number from the vector?  R uses a standard ‘array index’ scheme except, unlike every other computer language you’ve likely seen…it starts counting at 1 and not 0.  Check it:

> x
[1]  1  3  6  9 12
> x[0]
numeric(0)
> x[1]
[1] 1
> x[2]
[1] 3

We see that x[0] is numeric(0) which is basically an empty value (a placeholder for a number but with no value stored there).  x[1] is the first element.  x[2] is the second.  We can also see how many items are in there and notice we get an NA when we exceed the right boundary.

> length(x)
[1] 5
> x[6]
[1] NA

NA means ‘Not Available‘.  Now be careful because if you think a negative value is out of range, you’re mistaken.  For example, x[-1] means show me x EXCEPT for the first element.  Looky here:

> x
[1]  1  3  6  9 12
> x[-1]
[1]  3  6  9 12
> x[-2]
[1]  1  6  9 12
> x[-6]
[1]  1  3  6  9 12
> x[-10]
[1]  1  3  6  9 12

Yes, I’d call that not obvious.  Notice -6 and -10 don’t change the vector as there is no 6th or 10th element to remove.  If we start to think of things as vectors of stuff, it gets neat.  If you want the first three elements, you can call a range by startingNumber:endingNumber.

> x[1:3]
[1] 1 3 6
> x[3:5]
[1]  6  9 12

And if you want say just the 2nd and 4th elements, you can just put a numeric vector in there:

> x[c(2,4)]
[1] 3 9

Remember our friend c(…)?  It returns a vector of numbers.  We can simply pass that into the array index and get the 2nd and 4th elements.  And you can mix and match.  This is because the c(…) function expands the range when it is evalutated:

> c(1:3, 5)
[1] 1 2 3 5
> x[c(1:3, 5)]
[1]  1  3  6 12

Things can get messy fast but it wont let you mix negatives with non-negative indecies:

> x
[1]  1  3  6  9 12
> y
[1]  1.5  2.0  7.0  8.0 15.0
> c(x, y)
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z <- c(x, y)
> z
 [1]  1.0  3.0  6.0  9.0 12.0  1.5  2.0  7.0  8.0 15.0
> z[c(3:5, 8)]
[1]  6  9 12  7
> z[c(1, 3:5, 8:9)]
[1]  1  6  9 12  7  8
> z[c(-1, 3:5, 8:9)]
Error in z[c(-1, 3:5, 8:9)] :
  only 0's may be mixed with negative subscripts

Whew…our first error message.  Ok, so lets make an empty vector then add stuff to it, leaving some blanks:

> v <- vector()
> v
logical(0)
> v[1] <- 2
> v[2] <- 4
> v
[1] 2 4
> v[6] <- 12
> v
[1]  2  4 NA NA NA 12

See how R just padded some NAs in there so it could set the 6th element.

> c(1,2,3,4,5) -> a
> a
[1] 1 2 3 4 5
> a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

Notice we can also pass in true or false as a ‘switch’ to show that array index.  Next time, we’ll throw in an extra dimension…just to make things interesting.

R is calling…will you pick up the phone?

Over the past few years, my work has become rather quantastic.  This is possibly due to the so called big data world we live in: a result of storage becoming cheap and computing becoming ubiquitous.   Naaman generally won’t geek out with me anymore as I’ve grown fond of the letter R.  I’ve dragged an intern or two through it. Others have been asking me for a good ‘get started’ guide.  In fact, there isn’t one; there are several.  However, R’s difficulty partially comes in the packages you want to use.  Part of it comes in just knowing its structure and how to select things.  Years ago, while teaching studio art, I devised a Photoshop tutorial that was all based on selection with the marquee and magic wand tools.  I told the students “if you can select it, then you can do anything…try not to get excited about the plugins so much.”  It’s about time I shared this simple R tutorial which is written with the same philosophy (oh ya – go get R first and run it.  you should be looking at a console window)…we’ll start now with stupid R tricks but after a few posts be knee deep in making stuff happen.

> x <- c(1,3,6,9,12)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.0     6.0     6.2     9.0    12.0
> c(1.5,2,7,8,15) -> y
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.5     2.0     7.0     6.7     8.0    15.0

Here we call the c(…) function to combine some numbers into an array/vector and store it. You can use the = operator but really you want to use <- or ->.  Think of it as funneling something into the variable rather than overwriting it.  The summary(…) function will try to give you a quick glimpse into a variable you might have lying around.  So, now we can call some simple stats stuff.

> mean(x)
  [1] 6.2
> median(x)
  [1] 6
> sd(x)
  [1] 4.438468
> var(x)
  [1] 19.7

This gets good when you have more data than Excel would like to hold (you know like over 10,000 rows)…we’ll see later it’s super easy and kinda tricky to read something from disk later on.  So now we have two vectors x and y, lets find a correlation like so:.

> cor(x, y)
  [1] 0.965409
> ?cor
> cor.test(x, y)
    Pearson's product-moment correlation
  data:  x and y
  t = 6.413, df = 3, p-value = 0.007683
  alternative hypothesis: true correlation is not equal to 0
  95 percent confidence interval:
   0.5608185 0.9978007
  sample estimates:
  cor
    0.965409

Pretty simple stuff.  Notice calling ?cor brings up info about the cor(…) function in a new window.  So lets go ahead and lets plot it.

> plot(x,y)

Lines…let fit a line to the plot.  The function call lm(…) fits a linear model.  We need to express y as a function of x this is done with the ~ oddly enough…we’ll call this myline.fit which is a nicer variable name than just a non-expressive letter:

> myline.fit <- lm(y ~ x)
> summary(myline.fit)
  Call:
  lm(formula = y ~ x)
  Residuals:
        1       2       3       4       5
   0.9898 -0.8909  0.5381 -2.0330  1.3959
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  -0.6802     1.3665  -0.498  0.65285
  x             1.1904     0.1856   6.413  0.00768 **
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  Residual standard error: 1.648 on 3 degrees of freedom
  Multiple R-squared: 0.932,   Adjusted R-squared: 0.9094
  F-statistic: 41.13 on 1 and 3 DF,  p-value: 0.007683

Then we make the plot and then add a line to it and some new points in green just for good measure.

> plot(x,y)
> abline(myline.fit)
> x2 <- c(0.5, 3, 5, 8, 12)
> y2 <- c(0.8, 1, 2, 4, 6)
> points(x2, y2, col="green")

Not too pretty but a plot should be visible, we can worry about pretty later.

Perhaps we can make it pretty after we read some data in and start getting real.  Next time!