Income inequality in Canada: Part 2 – First crack at data

I thought it best to begin my journey by taking as broad a perspective as possible.  I would rather include too much data and narrow it down later than start with too little.

Data:

After looking through the StatsCan website, I was able to find a dataset that contained the distributions of market incomes for different family types in Canada.  Market income is total income less income from government sources.  The dataset goes back to 1976 and you can find it here: http://www5.statcan.gc.ca/cansim/a26?lang=eng&retrLang=eng&id=2020201&tabMode=dataTable&srchLan=-1&p1=-1&p2=9

If you go to the website and open the add/remove data tab, you should be able to get the file below in csv format.  The copy attached below is an xls file because this site won’t let me upload csvs, so if you download it, convert it to csv first, since that is the format I used.

cansim-2020201-eng-1014797951932281941

You need to clean up this file a bit before you can import it into R.  After removing some of the descriptions and unnecessary data, you should have a file which looks like this:

cansim_202_0201_1

The csv version of the above file is what I used to produce the below graphs.
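I did the clean-up by hand, but the same kind of trimming can also be done in R. The below is only a rough sketch under stated assumptions: the toy text, its layout, and the names rawText, cleanText, cleanData and outFile are all made up for illustration, and the real StatsCan export may well be laid out differently.

```r
# Toy stand-in for a raw export: two description lines, a header row,
# two data rows, and a footnote. The layout and values are made up.
rawText <- c(
  "Table title (hypothetical)",
  "Survey description (hypothetical)",
  "Income group,1976,1977,1978",
  "Under $10000,40,35,30",
  "$10000 and over,60,65,70",
  "Footnote: illustrative values only"
)

# Keep only the header row and the data rows (lines 3 to 5 in this toy).
cleanText <- rawText[3:5]

# Parse the kept lines. check.names = FALSE keeps "1976" as a column name
# instead of converting it to "X1976".
cleanData <- read.csv(text = paste(cleanText, collapse = "\n"),
                      check.names = FALSE, stringsAsFactors = FALSE)

# Write the cleaned table to a temporary csv file.
outFile <- tempfile(fileext = ".csv")
write.csv(cleanData, outFile, row.names = FALSE)
```

For a real file you would have to open it first and work out which lines to keep, so the 3:5 range above is specific to this toy.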

Methodology:

With data in hand, I thought the best way to show how the distribution of income changes over time was with stacked barplots.  Fortunately, R has a built-in function for barplots that I was able to use.  There is also a very useful package for R called ggplot2, which I generally find to be a superior way of creating plots.  The ggplot2 package is a little more complicated to use, and as there are not many good online sources for learning it, I may post a small overview/tutorial before coding with it.  Although ggplot2 is more complicated in terms of its plotting concepts, I find its features to be better contained and much more flexible.  Outside of ggplot2, there are many other functions and packages that you can use for your plots.  The syntax for these plots is extensive and is almost a programming language on its own.
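To show what a stacked barplot looks like in base R before getting to the real data, here is a minimal sketch. The numbers and the names toyData and barMidpoints are invented for illustration and are not taken from the StatsCan dataset.

```r
# Made-up percentages: rows are income groups, columns are years.
# matrix() fills column by column, so each group of three values is one year.
toyData <- matrix(c(50, 30, 20,
                    45, 33, 22,
                    40, 35, 25),
                  nrow = 3,
                  dimnames = list(c("Low", "Middle", "High"),
                                  c("1976", "1977", "1978")))

# Draw to a throwaway pdf file so the sketch runs without a display.
pdf(tempfile(fileext = ".pdf"))

# With beside = FALSE (the default), each column becomes one stacked bar.
# barplot() returns the x position of the midpoint of each bar.
barMidpoints <- barplot(toyData, col = rainbow(nrow(toyData)),
                        legend.text = rownames(toyData))

dev.off()
```

Each column of the matrix becomes one bar and each row becomes one coloured segment, which is why the data needs to be arranged with income groups in rows and years in columns.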

Below is the R code I used to generate the plots.

#if you decide to use the commented-out functions to create the colourVector,
#you'll need to install the colorRamps package and uncomment the below line
#library(colorRamps)


#import data from csv file
rawData <- read.csv("C:/Users/Business/Dropbox/Economics Research/Income inequality/cansim_202_0201_1.csv",header=FALSE)

#remove first row of data which contains headers, will add custom headers later
rawData <- rawData[-1,]

#create a colour gradient for the graph
#one colour for each level of income

numLevels <- dim(rawData)[1]
numYears <- dim(rawData)[2]-1
colourVector <- rainbow(numLevels)

#below are other colour vectors you can use to replace the one above
#colourVector <- matlab.like(numLevels)
#colourVector <- matlab.like2(numLevels)
#colourVector <- cyan2yellow(numLevels)
#colourVector <- blue2red(numLevels)


#take subset of data to be plotted, convert to matrix form
#to use R's barplot function, the data needs to be in matrix form
#the below code drops the first column since that just contains the column names
#the first column is later used as a name vector for the legend

plotData <- as.matrix(rawData[,-1])

#add column names to the matrix
colNames <- c(1976:2011)
colnames(plotData)<-colNames

#take subset of data to be used for the legend
legendVector <- rawData[,1]

#now we start the actual plotting

#everything between this line and the dev.off() line gets written to the below pdf file
pdf("C:/Users/Business/Dropbox/Economics Research/Income inequality/Fig_1a.pdf")

#the par function is used to either set or query graphical parameters
#in this case we use the par function to set the margins and to allow plotting in
#the figure region (instead of only the plot region). We use the mar parameter to
#adjust the bottom, left, top, and right margins respectively. The units are in
#number of lines of margin - the default values are (5.1,4.1,4.1,2.1). By setting xpd to
#TRUE, plotting is clipped to the figure region; if it were set to FALSE, the plotting
#would be clipped to the plot region.
#the wider right margin (9.1) leaves room for the legend

par(mar=c(5.1,4.1,4.1,9.1),xpd=TRUE)

#the barplot function makes our graph. we send it the plot data and our colourVector
#and we set most of the other parameters to FALSE or NULL as it is sometimes easier to
#add things like the axes separately.
#barplot returns the bar midpoints, which we save in plot1 to position the x axis ticks

plot1<- barplot(plotData, col=colourVector, names.arg = NULL, axes = FALSE, xaxt='n', space=0.75)

#add the y axis, the axis will be a sequence of 0 to 100 with increments of 10.
#the tcl parameter controls the tick length
#the cex.axis parameter controls the size of the text

axis(2, at = seq(from = 0, to = 100, by = 10), tcl = -0.5, cex.axis = 0.6)

#add the x axis
axis(1, at = plot1, tcl = -0.5, labels = seq(from = 1976, to = 2011, by = 1), las = 2, cex.axis = 0.6)

#add the title to the figure, note the \n in the middle of the title - this forces a new
#line. otherwise, the text would run off the figure

title(main="Figure 1: Distribution of market income for all family types\n in 2011 constant dollars in Canada from 1976 to 2011", font.main = 4)

#add a legend
#pch = 15 gives filled squares, we match the fill by setting the col parameter equal
#to the colourVector used.

legend("topright", inset=c(-0.385,0), legend=legendVector, pch = 15, col = colourVector, cex = 0.7)

dev.off()
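Since the y axis above is fixed at 0 to 100, the plot only makes sense if every column of the matrix is numeric and adds up to about 100. The below is a small sanity check one could run before plotting. It is a sketch, not part of my original script: toyPlotData is a made-up stand-in for plotData, and the 0.5 tolerance is an arbitrary choice.

```r
# Hypothetical stand-in for plotData: 3 income groups by 2 years,
# stored as text the way a csv import can sometimes leave it.
toyPlotData <- matrix(c("25.5", "50.0", "24.5",
                        "24.0", "51.0", "25.0"),
                      nrow = 3,
                      dimnames = list(NULL, c("1976", "1977")))

# barplot() needs numbers, so convert the mode if needed.
# Changing the mode keeps the matrix dimensions and column names.
if (!is.numeric(toyPlotData)) {
  mode(toyPlotData) <- "numeric"
}

# Each column should add up to roughly 100 if the values are percentages.
# The tolerance of 0.5 is an arbitrary allowance for rounding.
columnTotals <- colSums(toyPlotData)
columnsLookRight <- abs(columnTotals - 100) < 0.5
print(columnTotals)
print(all(columnsLookRight))
```

If a column fails the check, the most likely culprits are a stray header or total row left over from the clean-up step.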

Results:

I could not decide which colour scale made for the best graph, so I created three copies of the same figure.  The only difference in the below plots is the colour scale.  The first plot seems to give the best contrast in the upper and lower income ranges.  The second plot gives more contrast in the middle income ranges.  The third plot does not give very good differentiation outside a narrow band in the middle.  The reason I included the last plot is to illustrate how changing the colour scale can shift the way we interpret the results.
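One way to compare colour scales before committing to a full figure is to draw each one as a plain strip. The below sketch uses only base R palettes as stand-ins. They are not necessarily the three scales used in my figures, and numLevels = 8, the palettes list and the name blueToRed are hypothetical choices for illustration.

```r
numLevels <- 8  # hypothetical number of income groups

# Three base R colour scales, each with one colour per income group.
palettes <- list(
  rainbow   = rainbow(numLevels),
  heat      = heat.colors(numLevels),
  blueToRed = colorRampPalette(c("blue", "red"))(numLevels)
)

# Draw to a throwaway pdf file so the sketch runs without a display.
pdf(tempfile(fileext = ".pdf"), width = 6, height = 3)

# One row per palette, with a wide left margin for the palette name.
par(mfrow = c(length(palettes), 1), mar = c(0.5, 6, 0.5, 0.5))
for (paletteName in names(palettes)) {
  # A row of equal-height bars with no gaps shows the scale as a strip.
  barplot(rep(1, numLevels), col = palettes[[paletteName]],
          space = 0, border = NA, axes = FALSE)
  mtext(paletteName, side = 2, las = 1, line = 1, cex = 0.7)
}

dev.off()
```

Neighbouring colours that are hard to tell apart in the strip will be just as hard to tell apart in the stacked bars.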

The below plots are a good starting point, but they do not tell us a lot.  In my opinion, there do not appear to be any obvious trends in our data.  The main problem with these plots is that there’s too much information.  In my next post on this topic, I will work on simplifying the data.

Fig_1a

Fig_1b

Fig_1c
