R vs Excel

After taking some time away from the blog with our first child, and preparing materials for a course, this blog is starting back up again.

In all the posts I’ve made, I’ve used R for data analysis rather than Excel.  There are a lot of people who don’t like using Excel, but I actually like Excel quite a bit, and I use it all the time for financial models.  In my opinion, for any sort of accounting-like modelling where you want to display several interconnected tables of data simultaneously, Excel is the way to go.  However, the same attributes that make Excel easy-to-use and flexible also make it a very weak tool for analyzing data.

Excel is a workbench, R is a process.  With Excel, you have your tools and data and you manually put everything together.  Each cell is treated independently, logic can flow in any direction, and nobody sees how it was put together – they just see the final product.  Modifications to an Excel project require someone to take it apart, make the modification, and put it back together.  Mistakes are easy to make, and hard to detect.

With R, the user gives the computer a set of instructions, and the computer tries to carry out the instructions.  Unless the analysis is extremely simple, it is much faster to write out a set of instructions rather than doing everything yourself in a spreadsheet.  In addition to being faster, R is also more reliable.  R scripts have a very linear logic – “first do this, and then this, and then this…”.  Well-commented R code will be much more transparent than an Excel sheet, this makes it is easier to detect logic errors.  Finally, with all of R’s libraries (over 10,000), R give you tools to do analysis and problem solving that just is not possible with Excel.

There are other data science programming languages out there – why use R over another language like Python?  That answer depends on many factors and I’ll save that for another post.




Correlations of provincial house prices in Canada 1990-2014

A common opinion that I come across is that there is no such thing as a “Canadian housing market”.  Instead, the various housing markets across the country behave independently of one another.  I used the CMHC webcrawler to gather data on annual median provincial prices of absorbed houses from 1990-2014.  I then used the corrgram package in R to determine the degree of linear association of house prices between provinces and the 95% confidence interval.  The results show a very high correlation across all provinces suggesting to me that there is in fact a “Canadian housing market”.

Continue reading

Web crawling with Python: Part 2, Navigation

In the last post, I described how to get setup with Python, Scrapy, Selenium, and Firebug in order in order to start programming web crawlers.  In this post I will describe how to program a scrapy web crawler to navigate the CMHC website and locate data to be retrieved.  In the next post, I will show how to scrape and store the website data.

Start by opening a terminal window and navigating to the directory where you want to store the web crawler. Enter the following command in the terminal:

Continue reading

Web crawling with Python: Part 1, Setup

It has been awhile since my last post, I have been working on an app for the past few months which consumed all the behind-a-screen time I could muster, but now it’s time to get back to things.

In addition to the R graphs that I usually do, I will be writing more about data mining. If you can get your data from StatsCan, then you’re probably good to go since you’re able to customize a lot of their reports and there are several formats to choose from for downloading. A lot of data is not so easily attainable, and in the past, an analyst would manually copy and paste data into excel sheets, or if they were lucky, they might have been able to use a web query from their spreadsheet to link to the data. Now, there are far superior options available which not only help retrieve data but also open up a whole new world of perspective.

Continue reading