R, e.g.: New Year’s Eve Tweets

May 5, 2012

First, simple plotting with ggplot.

New Year’s Eve is an interesting time to look at Twitter volumes because the event moves from time zone to time zone over a 24-hour period.  Below are hourly counts of New Year-related tweets, plotted by UTC hour.

The three major humps in volume correspond to the Americas, Europe and Asia.
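
To make the time zone sweep concrete, here is a small sketch that converts local midnight on January 1, 2012 into UTC for one city per region. The three zones are my own illustrative picks, not something taken from the data set.

```r
# Representative zones (illustrative choices, not from the post's data)
zones <- c("Asia/Tokyo", "Europe/London", "America/New_York")

# For each zone, build local midnight on 2012-01-01 and print it in UTC
midnight_utc <- sapply(zones, function(z) {
  local_midnight <- as.POSIXct("2012-01-01 00:00:00", tz = z)
  format(local_midnight, "%Y-%m-%d %H:%M", tz = "UTC")
})

print(midnight_utc)
# Asia/Tokyo       -> "2011-12-31 15:00"
# Europe/London    -> "2012-01-01 00:00"
# America/New_York -> "2012-01-01 05:00"
```

Midnight in these three cities is spread across 14 hours of UTC time, which is consistent with seeing separate humps rather than one spike.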

Not too surprisingly, comparing 2010 and 2011 shows Twitter usage growing in most time zones. In the 2010 vs. 2011 scatter plot there are three outliers for growth.

(A basic skill to add here: exclude the outliers from the linear regression in some R-ish way rather than manually, then use the fit to estimate volumes from the remaining points and output the difference.)
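
One possible way to do that, sketched on made-up numbers rather than the real counts: fit once on all hours, flag points whose residual sits far from the median residual (more than 3 scaled MADs, a threshold I picked for illustration), refit without them, and report observed minus expected. The data frame `hours`, its columns, and the spike positions are all hypothetical stand-ins.

```r
# Synthetic stand-in for 24 hourly counts (NOT the post's data)
idx <- 1:24
count_2010 <- seq(100000, 560000, by = 20000)
count_2011 <- 1.5 * count_2010 + 3000 * sin(idx)   # steady growth plus a small wiggle
spikes <- c(6, 16, 23)                             # hypothetical outlier hours
count_2011[spikes] <- count_2011[spikes] + 300000
hours <- data.frame(count_2010, count_2011)

# First pass: fit on every hour, then flag residuals far from the median residual
fit_all <- lm(count_2011 ~ count_2010, data = hours)
res <- residuals(fit_all)
is_outlier <- abs(res - median(res)) > 3 * mad(res)

# Second pass: refit using only the hours that were not flagged
fit_clean <- lm(count_2011 ~ count_2010, data = hours, subset = !is_outlier)

# Estimate every hour from the clean fit and report observed minus expected
hours$expected <- predict(fit_clean, newdata = hours)
hours$excess <- hours$count_2011 - hours$expected
print(hours[is_outlier, ])
```

A robust fit such as `MASS::rlm` would be another option; the two-pass version is used here only because it makes the excluded points explicit.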

#!/usr/bin/env Rscript
#
# Plot time series volume data

# install.packages("ggplot2", dependencies = TRUE )
# install.packages("gridExtra", dependencies = TRUE )

library(ggplot2)
library(gridExtra)

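# read the hourly counts (columns used below: time, ts, count)
X <- read.csv("./twitter.timeline.csv.byhour.csv", header = TRUE)
# parse datetime strings; tz = "UTC" assumes the time column is recorded in UTC,
# which keeps the plotted axis consistent with the GMT labels below
X$date <- as.POSIXct(X$time, tz = "UTC")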

summary(X)

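# ts is a Unix-style timestamp (seconds since 1970-01-01 UTC);
# each window spans 24 hours starting at 17:00:03 UTC on Dec 31
sub0 <- X[X$ts > 1293814803 & X$ts < 1293901203, ]  # 2010-12-31 to 2011-01-01
sub1 <- X[X$ts > 1325350803 & X$ts < 1325437203, ]  # 2011-12-31 to 2012-01-01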

summary(sub0)
summary(sub1)

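# one bar per hourly count; qplot(..., stat="identity") no longer works in current ggplot2,
# so use ggplot() + geom_col(). The shared y limits keep the two years comparable.
p0 <- ggplot(sub0, aes(x = date, y = count)) + geom_col() +
    labs(x = "2010:  12/31 - 1/1 GMT", y = "tweets/hr") +
    scale_y_continuous(limits = c(0, 1100000))
p1 <- ggplot(sub1, aes(x = date, y = count)) + geom_col() +
    labs(x = "2011:  12/31 - 1/1 GMT", y = "tweets/hr") +
    scale_y_continuous(limits = c(0, 1100000))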

png(filename = "./NYTweets-World-2011.png", width = 600, height = 600, units = 'px')
print(
    p1
)
dev.off()

png(filename = "./NYTweets-World-2011v2010.png", width = 600, height = 600, units = 'px')
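# grid.arrange() draws straight to the open device, so no print() is needed;
# the title argument is `top` in gridExtra >= 2.0 (older releases used `main`)
grid.arrange(p0, p1, ncol = 1, top = "Compare 2010 vs 2011")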
dev.off()

png(filename = "./NYTweets-Growth-2011v2010.png", width = 600, height = 600, units = 'px')
print(
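# hours are paired by position, so sub0 and sub1 must have the same number of rows
ggplot(data.frame(y2010 = sub0$count, y2011 = sub1$count), aes(x = y2010, y = y2011)) +
     geom_point() +
     geom_smooth(method = "lm", se = FALSE) +
     labs(x = "2010 tweets/hr", y = "2011 tweets/hr") +
     scale_y_continuous(limits = c(0, 1100000)) +
     scale_x_continuous(limits = c(0, 1100000))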
)
dev.off()

Data and Source Gist.
