Web crawling with Python: Part 2, Navigation

In the last post, I described how to get set up with Python, Scrapy, Selenium, and Firebug in order to start programming web crawlers.  In this post I will describe how to program a Scrapy web crawler to navigate the CMHC website and locate the data to be retrieved.  In the next post, I will show how to scrape and store the website data.

Start by opening a terminal window and navigating to the directory where you want to store the web crawler. Enter the following command in the terminal:

Continue reading

Web crawling with Python: Part 1, Setup

It has been a while since my last post. I have been working on an app for the past few months, which consumed all the behind-a-screen time I could muster, but now it’s time to get back to things.

In addition to the R graphs that I usually do, I will be writing more about data mining. If you can get your data from StatsCan, then you’re probably good to go, since you’re able to customize a lot of their reports and there are several formats to choose from for downloading. A lot of data is not so easily obtained, though. In the past, an analyst would manually copy and paste data into Excel sheets or, if they were lucky, use a web query from their spreadsheet to link to the data. Now there are far superior options available, which not only help retrieve data but also open up a whole new world of perspective.

Continue reading