R Code: why use ggplot2 instead of plot()?

R is a fairly intuitive language.  It generally does not take long to learn the different variable types, how to import data, the control statements, etc.  However, when it comes to making graphics, there is so much syntax involved that it can be a bit overwhelming.

What is ggplot2? It is a package for R that contains tools for producing various graphics.  I’ll go further into what it is and how to use it in another post.  The purpose of this post is to show how using the plot() function in R can get unnecessarily complicated, and why the ggplot2 package is a superior alternative.

One of the issues I’ve run into more than a few times is the compromise between flexibility and detail.  Let me illustrate with an example using some of R’s built-in data.

library(ggplot2)
#use the 'diamonds' dataset, comes with the ggplot2 package
rawData <- diamonds

This data shows various characteristics for a sample of nearly 54k diamonds.  The characteristics that we’ll use are carat, price, clarity, and cut.  The example that I am about to go into closely follows one found in Hadley Wickham’s ggplot2: Elegant Graphics for Data Analysis, an excellent resource on the subject.

Now we’ll take a sample of 200 observations:

set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds),200),]

Let’s start by plotting price vs. carat using the standard plot() function:

plot(sampleData$price~sampleData$carat)

Which gives us:

Fig1

The above plot is not very pretty, but it’s not a bad start.  The axis titles need to be changed, and gridlines could be added depending on your preference, but all in all, not bad.

plot(sampleData$price~sampleData$carat,
     xlab="carat",
     ylab="price"
)
grid()

This gives a better-looking plot:

Fig2

Still not publication quality, but it’s getting there.  Now let’s switch to a function in the ggplot2 package called qplot() and try to reproduce the plot:

qplot(carat,price,data=sampleData)

Which gives:

Fig3

Besides having a slightly different grid pattern and colour, it’s pretty much the same thing.  I’ll explain the syntax in another post, but right now it doesn’t seem like there’s a huge difference between plot() and qplot().  However, let’s say you show someone your data and they request that you include the diamond cut variable in your graph.

With ggplot2, this is pretty simple:

qplot(carat,price,data=sampleData, colour = cut)

Fig4

We can now see the relationship between diamond cut and the other two variables.  The legend is also added automatically.
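A note for readers on a recent ggplot2 release: qplot() has been marked as deprecated since version 3.4.0, so the call above may print a warning.  A rough equivalent using the package’s main ggplot() interface might look like the sketch below.  It rebuilds sampleData so that it runs on its own, and the variable name p is only there for illustration.

```r
library(ggplot2)

# rebuild the same 200-row sample used throughout the post
set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds), 200), ]

# map carat and price to the axes and cut to the point colour;
# the legend is still generated automatically
p <- ggplot(sampleData, aes(x = carat, y = price, colour = cut)) +
  geom_point()
print(p)
```

The result should be comparable to the qplot() version; only the way the plot is specified changes.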

Okay, so how would we make the same plot using the plot() function? Well, it’s a bit more involved.  First we have to create a palette of colours (one colour for each level of cut), then we create a vector of colours to match to the cut variable in our data, then we set the colour of the points in the graph to the vector of colours.  Since a legend isn’t generated automatically, we’ll have to do some similar sleight of hand for that too.  Here’s the code:

#determine number of different levels of cuts in data
nColours <- nlevels(sampleData$cut)
#create a vector of colours for the data
colourVector <- rainbow(nColours)

plot(sampleData$price~sampleData$carat,
xlab = "carat",
ylab="price",
col = colourVector[sampleData$cut],
pch=16
)
grid()
legend("topleft",
legend=levels(sampleData$cut),
pch=16,
col=colourVector)

Fig5
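The step that does the real work in the code above is colourVector[sampleData$cut].  It relies on the fact that a factor used as an index is treated as its underlying integer codes, so each point picks up the colour belonging to its level.  A small sketch with made-up values (cutToy and paletteToy are hypothetical names, not part of the diamonds data) shows the idea:

```r
# toy factor standing in for sampleData$cut (values are made up)
cutToy <- factor(c("Good", "Ideal", "Fair", "Ideal"),
                 levels = c("Fair", "Good", "Ideal"))

# one colour per level, in the same order as levels(cutToy)
paletteToy <- c("red", "green", "blue")

# the integer codes behind the factor: 2 3 1 3
codes <- as.integer(cutToy)

# indexing with the factor uses those codes,
# giving "green" "blue" "red" "blue"
pointColours <- paletteToy[cutToy]
print(pointColours)
```

This is also why the legend call can simply pair levels(sampleData$cut) with colourVector: both are in level order.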

So this is essentially the same thing that qplot() produced, but it took a lot more work.  Now let’s say that you show it to the same person and they then ask you to switch the colour scheme to match the clarity of the diamond, and the shape of the points to match the cut of the diamond.  With qplot(), this is also pretty easy:

qplot(carat, price, data=sampleData, colour=clarity, shape=cut)

Fig6

Although this graph is suffering from information overload, and it would be better to split the data into multiple graphs, it shows the flexibility of using ggplot2.  Doing the same thing with the plot() function would require the below code:


#determine number of different levels of clarity in data
nColours <- nlevels(sampleData$clarity)
#create a vector of colours for the data
colourVector <- rainbow(nColours)

#determine number of different levels of cuts in data
nCuts <- nlevels(sampleData$cut)
#create a vector of shapes for the data
cutVector <- seq_len(nCuts)

plot(sampleData$price~sampleData$carat,
xlab="carat",
ylab="price",
col = colourVector[sampleData$clarity],
pch=cutVector[sampleData$cut]
)
grid()
legend("topleft",
legend=levels(sampleData$clarity),
pch=16,
col=colourVector)

legend("left",
legend=levels(sampleData$cut),
pch=cutVector)

Fig7

The wheels are starting to come off the wagon a bit.  The code is getting more cumbersome, the risk of making a mistake somewhere has also gone up, and it would take a lot more work to get it up to publication quality.  The code that was used in this example is similar to what I used to make the stacked barplots in the income inequality posts.

At the start of this post, I mentioned the compromise between flexibility and detail.  With the plot() function, as you fine-tune the various attributes, the code gets bigger and more complicated.  Once one finally gets to the point where the graph is deemed publication quality, the prospect of changing anything is a nightmare.  With ggplot2, we have seen that making big changes does not require nearly the same amount of work.
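As one more illustration, the earlier suggestion of splitting the crowded clarity-and-cut graph into multiple graphs is also a small change in ggplot2.  The sketch below is one possible way to do it, assuming the ggplot() interface with facet_wrap() rather than qplot(); pFacet is just an illustrative name.

```r
library(ggplot2)

# rebuild the same 200-row sample used throughout the post
set.seed(1000)
sampleData <- diamonds[sample(nrow(diamonds), 200), ]

# one panel per cut, so colour only has to carry clarity
pFacet <- ggplot(sampleData, aes(x = carat, y = price, colour = clarity)) +
  geom_point() +
  facet_wrap(~ cut)
print(pFacet)
```

Doing the same with plot() would mean setting up a panel layout by hand and repeating the plotting code for each level of cut.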

Let’s finish by supposing that once more you show your graph to the person mentioned above and they like it but were also hoping that you could show the distribution of carat size for the various cuts.

I’m not exactly sure how to do this with the plot() function.  You’d probably have to split the data by the ‘cut’ variable, use the density() function on the split data, store the results in vectors, and then pass the vectors to the plot() function.  This would take a bit of time.
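For what it’s worth, here is a rough sketch of how those steps might look in base R.  It is only one possible approach: it assumes an empty plot with the curves overlaid by lines(), it keeps the density() results in a list instead of separate vectors, and names such as caratByCut and densities are my own.  ggplot2 is loaded only to get the diamonds data.

```r
library(ggplot2)  # only needed here for the 'diamonds' dataset

# split the carat values by cut, then estimate a density for each group
caratByCut <- split(diamonds$carat, diamonds$cut)
densities <- lapply(caratByCut, density)

# find axis limits that fit every curve
xRange <- range(sapply(densities, function(d) range(d$x)))
yMax <- max(sapply(densities, function(d) max(d$y)))

# one colour per level of cut
colourVector <- rainbow(length(densities))

# draw an empty plot, then add one density curve per cut
plot(xRange, c(0, yMax), type = "n", xlab = "carat", ylab = "density")
for (i in seq_along(densities)) {
  lines(densities[[i]], col = colourVector[i])
}
legend("topright",
       legend = names(densities),
       lty = 1,
       col = colourVector)
```

Even in this compact form, the axis limits, the colours, and the legend all have to be managed by hand.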

With qplot() it would be:

qplot(carat, data=diamonds, geom="density", colour = cut)

Fig8

So it looks like ggplot2 is the way to go for building plots in R.  The question now is: how do you use it?  Although the syntax is usually simple, the concepts behind the functions are different from most other graphics packages in R.  I will go into this in the next post on the subject.
