Web crawling with Python: Part 2, Navigation

In the last post, I described how to get set up with Python, Scrapy, Selenium, and Firebug in order to start programming web crawlers. In this post, I will describe how to program a Scrapy web crawler to navigate the CMHC website and locate the data to be retrieved. In the next post, I will show how to scrape and store the website data.

Start by opening a terminal window and navigating to the directory where you want to store the web crawler. Enter the following command in the terminal:

scrapy startproject cmhc1
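
This command scaffolds a project folder with everything we need to start building. The layout looks roughly like this (the exact files vary a little between Scrapy versions):

cmhc1/
    scrapy.cfg            #project configuration file
    cmhc1/
        __init__.py
        items.py          #item definitions (we’ll use these in the next post)
        pipelines.py
        settings.py
        spiders/          #our spider files go in here
            __init__.py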

We will now start to code the spider (the file that will navigate and scrape data from the domain). Go into the spiders folder that was created, make a new file called cmhc_spider1.py, and add the following code:

#bring in necessary modules
import scrapy
from selenium import webdriver

class cmhc_spider1(scrapy.Spider):
	name = 'cmhc_spider1'
	allowed_domains = ['cmhc-schl.gc.ca']
	start_urls = ['https://www03.cmhc-schl.gc.ca/hmiportal/en/']

	#Python uses the first parameter passed to the __init__ method to refer to the object created.
	#You don't have to use 'self', you can use any word, but 'self' is the convention.
	def __init__(self):
		#create an instance of the selenium webdriver using Firefox and assign it to the 'driver' instance variable
		self.driver = webdriver.Firefox()

	def parse(self, response):
		#Print the response received by the Spider
		#A lot of text gets returned to the terminal which can make it difficult to spot the print output.
		#It helps to have starting and trailing asterisks
		print "***** RESPONSE: %s *****" % response

		#create a new variable 'driver' and assign 'self.driver' to it, this will make the code cleaner
		#going forward.
		driver = self.driver
		#pass the response url to the web driver.  this should create a firefox window at the start url
		driver.get(response.url)

So there’s a bit to discuss here; let’s go through it. First, we import the necessary modules for Selenium and Scrapy. We then create a new class called ‘cmhc_spider1’, which is a subclass of scrapy.Spider. Every subclass of scrapy.Spider needs three things:

  1. name: A unique name to identify the spider.  For the sake of simplicity, I tend to use the same name for the file, the class, and the name attribute.
  2. start_urls: A list of urls from which the crawler will start crawling.  In this case we start on the CMHC data portal.
  3. parse(): Each start url triggers a response, which is passed to the parse() method.  The parse() method navigates through the response and scrapes the data from the page.

The Scrapy spider works well for navigating through a website’s HTML, but it isn’t so great when the website has a lot of dynamic content created with JavaScript.  To handle the dynamic content, we bring in the Selenium webdriver to help out the spider.  The class initializer method creates a webdriver using the Firefox browser.  Our webdriver is able to find and select tabs and buttons on a website, so we’ll use it to steer our way to the data we’re searching for.
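
As a preview of how the two tools work together (we won’t actually need this until the next post), a common pattern is to let Selenium render the page and then feed the rendered HTML back into a Scrapy selector. Here’s a minimal sketch, assuming the driver has already loaded a page:

from scrapy.selector import Selector

#let Firefox render the JavaScript, then hand the finished HTML back to scrapy
sel = Selector(text=driver.page_source)
#query the rendered page with xpaths as usual
links = sel.xpath('//a/@href').extract()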

Run this spider by typing the following command in the terminal window, from inside the project directory:

scrapy crawl cmhc_spider1

This should open a Firefox window showing the page below:

[Screenshot: the CMHC Housing Market Information Portal start page]

So this is a decent start. Now we can work on adding more commands to our spider to take us to the page that contains the data. First, we need it to check the box to accept the terms and conditions, and then press the ‘Get Started’ button.  To figure out how to do this, we use Firebug to identify these elements, and then tell the webdriver to click them. Right-click an element and inspect it with Firebug.  You should see the screen below.

[Screenshot: Firebug inspector highlighting the terms and conditions checkbox element]

We can see the element is an input tag with an id of “iAccept”. We will use something called an XPath to have the webdriver select the element. An XPath is a shorthand way of describing the location of an HTML tag, and it’s worth investing some time to learn the syntax. If the XPath were //input, the spider would select every input tag on the page. We add specificity with //input[@id=’iAccept’], which selects only input tags whose id is “iAccept”. Since there is likely only one input tag with that id, we can be pretty confident that this XPath will get the right element.

We continue using Firebug to navigate the site, taking note of the XPaths we need, and then have our webdriver select and click those elements.

One last important thing!  Often the element you want to select doesn’t load immediately with the page. If your spider tries to select an element before it has loaded, it’s going to crash. The easiest way to handle this is to import the time module and use the sleep function to pause the spider for a few seconds while the page finishes loading (a sturdier alternative using Selenium’s explicit waits is sketched after the code). Here is the rest of the code:

import scrapy
from selenium import webdriver
import time
from scrapy.selector import Selector   #not used yet; we'll need it to scrape in the next post


class cmhc_spider2(scrapy.Spider):
	name = 'cmhc_spider2'
	allowed_domains = ['cmhc-schl.gc.ca']
	start_urls = ['https://www03.cmhc-schl.gc.ca/hmiportal/en/']

	#This is the initializer method that gets called when the cmhc_spider2 object gets created.
	#In Python, you see that 'self' gets passed around a lot.  Like in other languages, 'self' refers
	#to the object itself.  Unlike in some other languages, you always have to declare it explicitly in Python.
	#Python uses the first parameter passed to the __init__ method to refer to the object created.
	#You don't have to use 'self', you can use any word, but 'self' is the convention.
	def __init__(self):
		#create an instance of the selenium webdriver using Firefox and assign it to the 'driver' instance variable
		self.driver = webdriver.Firefox()

	def parse(self, response):
		#Print the response received by the Spider
		#A lot of text gets returned to the terminal which can make it difficult to spot the print output.
		#It helps to have starting and trailing asterisks
		print "***** RESPONSE: %s *****" % response

		#assign 'self.driver' to a local variable 'driver' so we don't have to type
		#'self.driver' every time; this keeps the code cleaner going forward
		driver = self.driver
		#pass the response url to the web driver.  this should create a firefox window at the start url
		driver.get(response.url)
		#locate the checkbox by its xpath and select it
		driver.find_element_by_xpath("//input[@id='iAccept']").click()
		#locate the 'Accept' button and click it to continue on to the site
		driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/Main/DoNotShowIntro']").click()
		#locate the 'Tables' button and click it
		driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/TableMapChart?id=1&t=1']").click()
		#the 'New Housing Construction' link doesn't get loaded right away, so pause for a couple of
		#seconds while the page finishes loading
		time.sleep(2)
		#locate the New Housing data.  this selects a hyperlink whose text contains 'New Housing Con'
		driver.find_element_by_partial_link_text('New Housing Con').click()
		#select the dataset we want
		driver.find_element_by_partial_link_text('Starts (Actual)').click()
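
A quick aside on that time.sleep(2): a fixed pause either waits longer than necessary or not long enough. Selenium also ships explicit waits, which poll the page until a condition is met. A minimal sketch using the same driver and link text as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#wait up to 10 seconds for the link to become clickable, checking periodically
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'New Housing Con'))).click()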

Running this spider should take us to the page below:

[Screenshot: the Starts (Actual) table page on the CMHC portal]

This is the page that contains the table data we want to scrape.  In the next post, we will create a scrapy Item object to hold the data, and use the spider to cycle through the various tables on this page and extract their contents.
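
One last housekeeping note: the spider opens a Firefox window and never closes it. Scrapy calls a closed() method on the spider, if one is defined, when the crawl finishes, so adding a small method like this to the class tidies things up:

	def closed(self, reason):
		#scrapy calls this automatically when the crawl finishes
		#quit the webdriver so the firefox window doesn't hang around
		self.driver.quit()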
