Maybe this is a counterexample. The outcome of this activity illustrates one of the points from my introductory “R, For Example” post: the format of the data is nearly everything. Here’s the backstory of the data. I’ll wait. Okay, here’s the R problem…

The goal was to create a stacked chart of aligned time series plots like this:

The plots have different y-scales, and this is where the trouble starts. ggplot does a great job of generating appropriate y-tick marks and giving everything that ggplot prettiness. But if you do this plot naively, you get a plot where the x-ranges are identical but the panels are not aligned. Here was my original graph:

The arrow shows the alignment problem. This is because printing the scales takes up different space, depending on the magnitude (decade, really) of the tick labels.

I imagined the easy way to fix this was a custom formatter, so I wrote a quick bit of code to format the y-tick labels with the right number of leading spaces.

```
# Fix it by manually formatting the y-tick labels
lalb <- function(x) {
  lx <- sprintf('%04d', x)       # Zero-pad each label to four digits
  lx <- gsub('^000', '   ', lx)  # Replace three leading zeros with three spaces
  lx <- gsub('^00', '  ', lx)    # Replace two leading zeros with two spaces
  lx <- gsub('^0', ' ', lx)      # Replace one leading zero with a single space
  lx                             # Return the padded labels
}
```

Quick and easy, right? Nope. The tick labels use proportionally spaced fonts, so this hack gets us closer, but not quite there. In the end, I had to add some artificial padding to the tick label. I ended up adding “8” to the plot with the largest scale, and this pushed everything far enough right to get it pretty close. The final result is shown in the first graph of the post.
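For what it's worth, the padding that `lalb` does can probably be written in one step by asking `sprintf` for a space-padded field instead of a zero-padded one. This is only a sketch, assuming the tick values are whole numbers; `pad_labels` is a hypothetical name that is not in the original script, and it is subject to the same proportional-font limitation described above.

```r
# Hypothetical helper (not in the original script): right-justify
# whole-number tick labels in a fixed-width field, padding with spaces.
pad_labels <- function(x, width = 4) {
  # Build a format such as "%4d"; as.integer() is needed because ggplot
  # hands the breaks over as doubles.
  fmt <- paste0("%", width, "d")
  sprintf(fmt, as.integer(round(x)))
}

labels <- pad_labels(c(5, 50, 500, 5000))
```

It would plug into a plot the same way, with `scale_y_continuous(labels = pad_labels)`.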

Lesson: A better way to do this is to load the data in column form and make each publisher a type (in a single column) instead of making each publisher its own column. R likes tall data frames. Then a ggplot-generated grid of plots would be aligned perfectly.
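To make the lesson concrete, here is a rough sketch of the tall layout and the faceted plot it allows. The data below is simulated (Poisson noise, not the real earthquake counts), the start time is arbitrary, and the names `wide`, `tall`, `publisher` and `count` are my own assumptions for illustration; only the `twcount`, `wpcccount`, `wpoccount` and `ngcount` column names come from the script.

```r
# Simulated stand-in for the wide data frame (NOT the real counts):
# one column per publisher, one row per timestamp.
set.seed(1)
n <- 30
wide <- data.frame(
  date      = seq(as.POSIXct("2012-03-20 18:00:00", tz = "UTC"),
                  by = "10 min", length.out = n),
  twcount   = rpois(n, 2000),
  wpcccount = rpois(n, 30),
  wpoccount = rpois(n, 10),
  ngcount   = rpois(n, 5)
)

# Tall form: one row per timestamp per publisher, publisher in one column.
tall <- rbind(
  data.frame(date = wide$date, publisher = "Twitter",
             count = wide$twcount),
  data.frame(date = wide$date, publisher = "Wordpress Comments",
             count = wide$wpcccount + wide$wpoccount),
  data.frame(date = wide$date, publisher = "Newsgator",
             count = wide$ngcount)
)
# Fix the panel order explicitly instead of relying on alphabetical order.
tall$publisher <- factor(tall$publisher,
  levels = c("Twitter", "Wordpress Comments", "Newsgator"))

# One ggplot call; every panel shares the x-axis and panel width,
# while scales = "free_y" lets each panel keep its own y-range.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(tall, aes(date, count)) +
    geom_line(color = "red") +
    facet_grid(publisher ~ ., scales = "free_y") +
    xlab("Time (UTC)") + ylab("")
  # print(p) draws the stacked panels; ggsave() would write them to a file.
}
```

Because the panels belong to a single plot, ggplot lays them out on a common grid, so no label padding is needed.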

Here’s the code:

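```
#!/usr/bin/env Rscript
library(ggplot2)
library(gridExtra)
library(scales)

# Data in columns (one count column per publisher), joined on the date field.
# Assumes x is a data frame that has already been read in, with a time column
# plus twcount, wpcccount, wpoccount and ngcount.
x$date <- as.POSIXct(x$time)
summary(x)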

# Make some time series plots
p1 <- ggplot(data = x) +
  geom_line(aes(date, twcount), color = "red") +
  ggtitle("Twitter") +
  theme(axis.title.x = element_blank())
p3 <- ggplot(data = x) +
  geom_line(aes(date, wpcccount + wpoccount), color = "red") +
  ggtitle("Wordpress Comments") +
  theme(axis.title.x = element_blank())
p4 <- ggplot(data = x) +
  geom_line(aes(date, ngcount), color = "red") +
  ggtitle("Newsgator") + xlab("Time (UTC)")

png(filename = "./timelines_noalign.png", width = 550, height = 750, units = "px", res = 100)
grid.arrange(p1, p3, p4, ncol = 1,
  top = grid::textGrob("Earthquake - Mexico - 20 March 2012",
                       gp = grid::gpar(fontsize = 14, lineheight = 18)))
dev.off()

# Fix it by manually formatting the y-tick labels
lalb <- function(x) {
  lx <- sprintf("%04d", x)       # Zero-pad each label to four digits
  lx <- gsub("^000", "   ", lx)  # Replace three leading zeros with three spaces
  lx <- gsub("^00", "  ", lx)    # Replace two leading zeros with two spaces
  lx <- gsub("^0", " ", lx)      # Replace one leading zero with a single space
  lx                             # Return the padded labels
}

# Unfortunately, due to proportionally spaced fonts (?), spaces don't quite do it.
# Add a manual adjustment.
p1 <- ggplot(data = x) +
  geom_line(aes(date, twcount), color = "red") +
  ggtitle("Twitter") +
  theme(axis.title.x = element_blank(), axis.title.y = element_text(hjust = 8)) +
  ylab("") + scale_y_continuous(labels = lalb)
p3 <- ggplot(data = x) +
  geom_line(aes(date, wpcccount + wpoccount), color = "red") +
  ggtitle("Wordpress Comments") +
  theme(axis.title.x = element_blank()) +
  ylab("") + scale_y_continuous(labels = lalb)
p4 <- ggplot(data = x) +
  geom_line(aes(date, ngcount), color = "red") +
  ggtitle("Newsgator") + xlab("Time (UTC)") +
  ylab("") + scale_y_continuous(labels = lalb)

png(filename = "./timelines_align.png", width = 550, height = 750, units = "px", res = 100)
grid.arrange(p1, p3, p4, ncol = 1,
  top = grid::textGrob("Earthquake - Mexico - 20 March 2012",
                       gp = grid::gpar(fontsize = 14, lineheight = 18)))
dev.off()
```

Data and code on GitHub.

2 Comments
July 7, 2012 3:19 pm

Great blog.
Late comment, but I’ve just stumbled upon this post. Usually, I deal with this sort of thing by making ‘fake’ panel data with melt(), then using ggplot’s faceting to plot count ~ time | source, with the y-scale “free.” Is that a less painful solution to your problem? (That futzing with label spacing seems excruciating.) It also scales relatively well, since an arbitrary # of plots are made with 1 call to melt + 1 call to ggplot, instead of multiple ggplot calls + grid.arrange().


2. July 7, 2012 4:01 pm

Carl-

Thanks for the comment. I didn’t know about shaping data with melt; thanks for pointing that out. I find that if the data in R is set up right, the rest is nearly trivial, so I have a command line script that adds a factor column to csv data when I concatenate it. That usually gets me to the point of using gridExtra for layout as you explain, but by shaping the data before I import it.

DrS
