In the last post, I described how to get set up with Python, Scrapy, Selenium, and Firebug in order to start programming web crawlers. In this post, I will describe how to program a Scrapy web crawler to navigate the CMHC website and locate the data to be retrieved. In the next post, I will show how to scrape and store the website data.
Start by opening a terminal window and navigating to the directory where you want to store the web crawler. Enter the following command in the terminal:
scrapy startproject cmhc1
This will create a folder with the necessary files to start building our program. We will now write the spider (the file that navigates the domain and scrapes data from it). Go to the spiders folder that was created, make a new file called cmhc_spider1.py, and add the following code:
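For reference, the startproject command generates a layout along these lines (the exact files vary a bit by Scrapy version, so treat this as a sketch):

    cmhc1/
        scrapy.cfg          # deploy configuration file
        cmhc1/              # the project's Python module
            __init__.py
            items.py        # item definitions (we'll use these in a later post)
            pipelines.py    # item pipelines
            settings.py     # project settings
            spiders/        # folder where our spider files go
                __init__.py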
    #bring in necessary modules
    import scrapy
    from selenium import webdriver

    class cmhc_spider1(scrapy.Spider):
        name = 'cmhc_spider1'
        allowed_domains = ['cmhc-schl.gc.ca/']
        start_urls = ['https://www03.cmhc-schl.gc.ca/hmiportal/en/']

        #Python uses the first parameter passed to the __init__ method to refer to the object created.
        #You don't have to use 'self', you can use any word, but 'self' is the convention.
        def __init__(self):
            #create an instance of the selenium webdriver using Firefox and assign it to the 'driver' instance variable
            self.driver = webdriver.Firefox()

        def parse(self, response):
            #Print the response received by the spider.
            #A lot of text gets returned to the terminal, which can make it difficult to spot the print output;
            #it helps to have leading and trailing asterisks.
            print "***** RESPONSE: %s *****" % response
            #create a new variable 'driver' and assign 'self.driver' to it; this will make the code cleaner going forward
            driver = self.driver
            #pass the response url to the web driver. this should open a firefox window at the start url
            driver.get(response.url)
So there’s a bit to discuss; let’s go through it. First, we import the necessary modules for Selenium and Scrapy. We then create a new class called ‘cmhc_spider1’, which is a subclass of scrapy.Spider. Every subclass of scrapy.Spider needs three things:
- name: A unique name to identify the spider. For the sake of simplicity, I tend to use the same name for the file, the class, and the name attribute.
- start_urls: A list of urls from which the crawler will start crawling. In this case we start on the CMHC data portal.
- parse(): Each start url triggers a response which is sent to the parse() method. The parse method navigates through the response and scrapes the data from the url.
Run this spider by typing the following command in the terminal window, from inside the project directory:
scrapy crawl cmhc_spider1
This should open up a Firefox window to the below page:
So this is a decent start. Now we can work on adding more commands to our spider to take us to the page that contains the data. First, we need it to check the box to accept the terms and conditions, and then press the ‘Get Started’ button. To figure out how to do this, we use Firebug to identify these elements, and then tell the webdriver to click them. Right-click an element and inspect it with Firebug. You should see the below screen.
We can see the element is an input tag with an id of “iAccept”. We will use something called an XPath to have the webdriver select the element. An XPath is a query expression that describes how to locate elements in an HTML (or XML) document, and it’s worth investing some time to learn the syntax. If the XPath were //input, the spider would select all input tags on the page. We add further specificity by using //input[@id='iAccept'], which selects only input tags whose id is “iAccept”. Since an id should be unique on a page, we can be confident that this XPath will get the right element.

We continue with Firebug to navigate the site and take note of the required XPaths, then get our web driver to select these elements.

One last important thing! Often the element you want to select doesn’t load immediately with the page. If your spider tries to select an element before it has loaded, it will crash. The easiest way to handle this is to import the time module and use the sleep function to pause the spider for a few seconds while the page finishes loading. Here is the rest of the code:
    import scrapy
    from selenium import webdriver
    import time
    from scrapy.selector import HtmlXPathSelector

    class cmhc_spider2(scrapy.Spider):
        name = 'cmhc_spider2'
        allowed_domains = ['cmhc-schl.gc.ca/']
        start_urls = ['https://www03.cmhc-schl.gc.ca/hmiportal/en/']

        #This is the initializer method that gets called when the cmhc_spider object gets created.
        #In Python, you see that 'self' gets passed around a lot. Like other languages, 'self' refers
        #to the object itself. Unlike other languages, you always have to explicitly declare it in Python.
        #Python uses the first parameter passed to the __init__ method to refer to the object created.
        #You don't have to use 'self', you can use any word, but 'self' is the convention.
        def __init__(self):
            #create an instance of the selenium webdriver using Firefox and assign it to the 'driver' instance variable
            self.driver = webdriver.Firefox()

        def parse(self, response):
            #Print the response received by the spider.
            #A lot of text gets returned to the terminal, which can make it difficult to spot the print output;
            #it helps to have leading and trailing asterisks.
            print "***** RESPONSE: %s *****" % response
            #assign 'self.driver' to a local variable 'driver'; this keeps the code cleaner
            #going forward than always writing 'self.driver'
            driver = self.driver
            #pass the response url to the web driver. this should open a firefox window at the start url
            driver.get(response.url)
            #locate the checkbox by its xpath and select it
            driver.find_element_by_xpath("//input[@id='iAccept']").click()
            #locate the 'Get Started' button and click it to continue on to the site
            driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/Main/DoNotShowIntro']").click()
            #locate the 'Tables' button and click it
            driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/TableMapChart?id=1&t=1']").click()
            #locate the New Housing Construction link. This next element doesn't get loaded right away,
            #so we need the driver to wait a few seconds before proceeding.
            time.sleep(2)
            #this selects any HTML hyperlink tag whose text contains 'New Housing Con'
            driver.find_element_by_partial_link_text('New Housing Con').click()
            #select the dataset we want
            driver.find_element_by_partial_link_text('Starts (Actual)').click()
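A quick aside on that time.sleep(2) call: it is the simplest fix, but it wastes time when the element loads quickly and still crashes when the page is slower than expected. Conceptually, a more robust approach is to poll until a condition holds. Here is a minimal, hypothetical sketch of that polling idea in plain Python (the wait_for name and parameters are my own invention; Selenium ships the same idea built-in as WebDriverWait):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value,
    or raise RuntimeError once `timeout` seconds have passed."""
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for condition")
        time.sleep(poll)

# usage sketch: instead of time.sleep(2), the spider could poll for the link,
# e.g. wait_for(lambda: driver.find_elements_by_partial_link_text('New Housing Con'))
```

The advantage over a fixed sleep is that the spider proceeds as soon as the element appears, and failures surface as a clear timeout error rather than a crash on a missing element.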
Running this spider should take us to the below page:
This is the page that contains the table data we want to scrape. In the next post we will create a scrapy Item object to contain the data, and use the spider to cycle through the various tables on this page and extract the data.
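If you want to practice the XPath predicate syntax from earlier before the next post, you don’t need a browser at all. Python’s standard library xml.etree.ElementTree supports a limited XPath subset, enough for tag and attribute predicates (Selenium and Scrapy use much fuller XPath engines). Here is a small sandbox, with made-up markup standing in for the real page:

```python
import xml.etree.ElementTree as ET

# made-up, well-formed markup standing in for the CMHC page
html = """<html><body>
<input id="iAccept" type="checkbox" />
<input id="search" type="text" />
<a href="/hmip-pimh/en/Main/DoNotShowIntro">Get Started</a>
</body></html>"""

root = ET.fromstring(html)

# .//input matches every input tag anywhere under the root
all_inputs = root.findall(".//input")
# the [@id='iAccept'] predicate narrows the match to one specific element
accept_box = root.findall(".//input[@id='iAccept']")

print(len(all_inputs))            # 2
print(len(accept_box))            # 1
print(accept_box[0].get("type"))  # checkbox
```

Note that ElementTree requires well-formed XML, so it won’t parse real-world HTML the way a browser does; it is only useful here as a quick way to see how adding a predicate narrows a match.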