R is a fairly intuitive language. It generally does not take long to learn the different variable types, how to import data, the control statements, etc. However, when it comes to making graphics, there is so much syntax involved that it can be a bit overwhelming.
What is ggplot2? It is a package for R that contains tools for producing various graphics. I’ll go further into what it is and how to use it in another post. The purpose of this post is to show how using the plot() function in R can get unnecessarily complicated, and that the ggplot2 package is a superior alternative.
One of the issues I’ve run into more than a few times is the compromise between flexibility and detail. Let me illustrate with an example that uses a dataset bundled with ggplot2.
library(ggplot2)
# use the 'diamonds' dataset, which comes with the ggplot2 package
rawData <- diamonds
This data shows various characteristics for a sample of nearly 54k diamonds. The characteristics that we’ll use are carat, price, clarity, and cut. The example that I am about to go into closely follows one found in Hadley Wickham’s ggplot2: Elegant Graphics for Data Analysis, an excellent resource on the subject.
Now we’ll take a sample of 200 observations:
set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds), 200), ]
Let’s start by plotting price vs. carat using the standard plot() function:
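The call itself is missing from this copy of the post. A minimal sketch, assuming it used the same formula interface as the later examples and left every other argument at its default, would be something like:

```r
library(ggplot2)  # only needed here for the 'diamonds' dataset

# Recreate the 200-row sample so the block runs on its own
set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds), 200), ]

# Formula interface: price on the y-axis, carat on the x-axis.
# With no xlab/ylab given, the axis titles default to the full expressions.
plot(sampleData$price ~ sampleData$carat)
```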
Which gives us:
The above plot is not very pretty, but it’s not a bad start. The axis titles need to be changed, and we might add some gridlines depending on preference, but all in all, not bad.
plot(sampleData$price ~ sampleData$carat,
     xlab = "carat",
     ylab = "price")
grid()
Gives a better looking plot:
Still not publication quality, but it’s getting there. Now let’s switch to a function in the ggplot2 package called qplot() and try to reproduce the plot:
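The qplot() call is also missing from this copy. Judging by the later calls in this post, it was probably the same thing without any extra aesthetics, so treat the following as a reconstruction rather than the original line. One caveat: qplot() has been deprecated since ggplot2 3.4.0, so recent versions will run it with a warning.

```r
library(ggplot2)

# Recreate the 200-row sample so the block runs on its own
set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds), 200), ]

# carat on the x-axis, price on the y-axis; points are the default geom
# when both x and y are supplied
p <- qplot(carat, price, data = sampleData)
p
```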
Besides having a slightly different grid pattern and colour, it’s pretty much the same thing. I’ll explain the syntax in another post, but right now it doesn’t seem like there’s a huge difference between plot() and qplot(). However, let’s say you show someone your data and they ask you to include the diamond cut variable in your graph.
With ggplot2, this is pretty simple:
qplot(carat,price,data=sampleData, colour = cut)
We can now see the relationship between diamond cut and the other 2 variables. The legend is also added automatically.
Okay, so how would we make the same plot using the plot() function? Well, it’s a bit more involved. First we have to create a palette of colours (one colour for each level of cut), then we index that palette by the cut variable so that every point gets the colour of its cut, and then we pass the result to the plot as the point colours. Since a legend isn’t generated automatically, we’ll have to do some similar sleight of hand for that too. Here’s the code:
# determine the number of levels of cut in the data
nColours <- nlevels(sampleData$cut)
# create a palette with one colour per level of cut
colourVector <- rainbow(nColours)
plot(sampleData$price ~ sampleData$carat,
     xlab = "carat",
     ylab = "price",
     col = colourVector[sampleData$cut],
     pch = 16)
grid()
legend("topleft",
       legend = levels(sampleData$cut),
       pch = 16,
       col = colourVector)
So this is essentially the same thing that qplot() produced, but it took a lot more work. Now let’s say that you show it to the same person and they then ask you to switch the colour scheme to match the clarity of the diamond, and the shape of the points to match the cut of the diamond. With qplot(), this is also pretty easy:
qplot(carat, price, data=sampleData, colour=clarity, shape=cut)
Although this graph is suffering from information overload, and it would be better to split the data into multiple graphs, it shows the flexibility of using ggplot2. Doing the same thing with the plot() function would require the code below:
# determine the number of levels of clarity in the data
nColours <- nlevels(sampleData$clarity)
# create a palette with one colour per level of clarity
colourVector <- rainbow(nColours)
# determine the number of levels of cut in the data
nCuts <- nlevels(sampleData$cut)
# create a vector of point shapes, one per level of cut
cutVector <- c(1:nCuts)
plot(sampleData$price ~ sampleData$carat,
     xlab = "carat",
     ylab = "price",
     col = colourVector[sampleData$clarity],
     pch = cutVector[sampleData$cut])
grid()
legend("topleft",
       legend = levels(sampleData$clarity),
       pch = 16,
       col = colourVector)
legend("left",
       legend = levels(sampleData$cut),
       pch = cutVector)
The wheels are starting to come off the wagon a bit. The code is getting more cumbersome, the risk of making a mistake somewhere has also gone up, and it would take a lot more work to get it up to publication quality. The code that was used in this example is similar to what I used to make the stacked barplots in the income inequality posts.
At the start of this post, I mentioned the compromise between flexibility and detail. With the plot() function, as you fine-tune the various attributes, the code gets bigger and more complicated. Once the graph finally reaches publication quality, the prospect of changing anything is a nightmare. With ggplot2, we have seen that making big changes does not require nearly the same amount of work.
Let’s finish by supposing that once more you show your graph to the person mentioned above and they like it but were also hoping that you could show the distribution of carat size for the various cuts.
I’m not exactly sure how to do this with the plot() function. You’d probably have to split the data by the ‘cut’ variable, use the density() function on each piece, store the results, and then pass them to the plot() function. This would take a bit of time.
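For what it’s worth, here is one way that approach might look in base R. It is a rough sketch rather than a polished solution: the names densities, xRange, yMax and colourVector are my own, the rainbow() palette is borrowed from the earlier examples, and density() is left at its default bandwidth.

```r
library(ggplot2)  # only needed here for the 'diamonds' dataset

# Split carat by cut and compute one kernel density estimate per level
densities <- lapply(split(diamonds$carat, diamonds$cut), density)

# Axis limits wide enough to hold every curve
xRange <- range(sapply(densities, function(d) range(d$x)))
yMax <- max(sapply(densities, function(d) max(d$y)))

# One colour per level of cut
colourVector <- rainbow(length(densities))

# Draw an empty frame first (type = "n"), then add one line per cut
plot(xRange, c(0, yMax), type = "n", xlab = "carat", ylab = "density")
for (i in seq_along(densities)) {
  lines(densities[[i]], col = colourVector[i])
}
legend("topright", legend = names(densities), lty = 1, col = colourVector)
```

The axis limits have to be worked out by hand before anything is drawn, because lines() only adds to an existing plot and will not rescale it.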
With qplot() it would be:
qplot(carat, data=diamonds, geom="density", colour = cut)
So it looks like ggplot2 is the way to go for building plots in R. The question now is: how do you use it? Although the syntax is usually simple, the concepts behind the functions are different from most other graphics packages in R. I will go into this in the next post on the subject.