Web crawling with Python: Part 3, scraping data to CSV

In parts 1 and 2, I described how to get set up with Scrapy and Selenium and start navigating a website with a lot of dynamic content.  In this part, I wrap things up by extracting data from the website to a CSV file. Here is the website:

[Screenshot of the CMHC Housing Market Information Portal data table]

Let’s start by creating a scrapy item called Cmhc1Item to contain the extracted data:

import scrapy


class Cmhc1Item(scrapy.Item):
    # define the fields for your item here like:
    geo = scrapy.Field()
    single = scrapy.Field()
    semiDetached = scrapy.Field()
    row = scrapy.Field()
    apartment = scrapy.Field()
    total = scrapy.Field()
    year = scrapy.Field()

The Item class works like a flexible dictionary: it contains objects called fields that we'll use to store the extracted data. In this case, we create a field for each variable that we'll extract.  You should already have an items.py file in your Scrapy project folder; this class goes there.
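
As a quick aside, an Item instance really does behave like a dictionary. The snippet below is only an illustration, not part of the spider:

item = Cmhc1Item()
item["geo"] = "Toronto"      #fields are set and read with dictionary syntax
print item["geo"]            #prints: Toronto
#item["oops"] = 1            #would raise KeyError: only declared fields are allowed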

The website contains several tables that are generated through a top menu. The spider must be able to access the menu, generate the various tables, and extract their data. Continuing the code from part 2, we get the web crawler to open the menu item for the table, select annual data, and refresh the table:

driver.find_element_by_xpath("//a[@id='timePeriodLink']").click()
#we want annual data
driver.find_element_by_xpath("//input[@id='toggleAnnual']").click()
driver.find_element_by_xpath("//div[@class='refresh-table']/button").click()
time.sleep(2)

Once this is done, we create a scrapy selector and pass it the page source from the web driver. We then use the selector to extract a list of the years that are available:

hxs = HtmlXPathSelector(text=driver.page_source)
#use the selector to extract all the years available and store the information in variable 'years'
years = hxs.xpath("//div[@class='time-period-caption']/div/text()").extract()
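
One optional cleanup, not in the original code: the extracted strings are Unicode and can carry stray whitespace, so you may want to strip each entry before using it:

years = [y.strip() for y in years]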

Now we want a loop that cycles through the various tables. We start by making a new variable called 'index' which we'll use to keep track of the loop iterations. For each year, we update the selector's page source, select all the table rows, and for each row we create a Cmhc1Item to store the values of the cells in that row. We then append the Cmhc1Item to a growing list. Here's the code:

index = 0
for i in years:
	index += 1
	hxs = HtmlXPathSelector(text=driver.page_source)
	data = hxs.xpath("//div[@class='table-section']//tbody//tr")

	for datum in data:
		item = Cmhc1Item()
		item["geo"] = datum.xpath('./th[1]/text()').extract()
		item["single"] = datum.xpath('./td[1]/text()').extract()
		item["semiDetached"] = datum.xpath('./td[2]/text()').extract()
		item["row"] = datum.xpath('./td[3]/text()').extract()
		item["apartment"] = datum.xpath('./td[4]/text()').extract()
		item["total"] = datum.xpath('./td[5]/text()').extract()
		item["year"] = hxs.xpath("//a[@id='timePeriodLink']/text()").extract()

		print "*** PRINTING ITEMS ***"
		print item
		print "*** DONE PRINTING ITEMS ***"

		items.append(item)

	time.sleep(2)
	driver.find_element_by_xpath("//a[@id='timePeriodLink']").click()

	#the menu resets to the most recent year each time it opens, so click
	#prev-year once per completed iteration to reach the next year down
	for n in range(index):
		driver.find_element_by_xpath("//a[@class='prev-year']").click()

	driver.find_element_by_xpath("//div[@class='refresh-table']/button").click()
	time.sleep(2)

return items
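
Note that extract() returns a list of matching strings, so each field above ends up holding a list (usually with one element) rather than a plain string. If you'd rather store clean strings, a small helper along these lines would work (first_or_blank is my own hypothetical addition, not part of the spider):

def first_or_blank(values):
	#return the first match stripped of whitespace, or an empty string if there was no match
	return values[0].strip() if values else ''

item["geo"] = first_or_blank(datum.xpath('./th[1]/text()').extract())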

At the end of each iteration, the web driver opens the time period link and clicks the prev-year button to move to the next year. The number of prev-year clicks equals 'index', which is incremented on each iteration. We have to do this because each time the driver opens the time period link, the menu gets set back to the most recent year.  Also note the sleep calls: when using a web driver, it's important to make sure the page is fully loaded before trying to access its elements. Use sleep as a handbrake so the web driver doesn't get ahead of itself.
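
Fixed sleeps either waste time or break on slow loads, so it's worth knowing that Selenium also supports explicit waits that block until an element is actually ready. Here is a minimal sketch of that alternative (not what the spider above uses):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#wait up to 10 seconds for the refresh button to become clickable, instead of sleeping a fixed 2 seconds
wait = WebDriverWait(driver, 10)
button = wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='refresh-table']/button")))
button.click()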

Here is the code for the entire spider:

#bring in necessary modules
import scrapy
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver
import time
from cmhc1.items import Cmhc1Item


class cmhc_spider3(scrapy.Spider):
	name = 'cmhc_spider3'
	allowed_domains = ['cmhc-schl.gc.ca']
	start_urls = ['https://www03.cmhc-schl.gc.ca/hmiportal/en/']

	#This is the initializer method that gets called when the cmhc_spider3 object gets created.
	#In Python, you'll see that 'self' gets passed around a lot.  Like in other languages, 'self' refers
	#to the object itself.  Unlike other languages, you always have to explicitly declare it in Python.
	#Python uses the first parameter passed to the __init__ method to refer to the object created.
	#You don't have to use 'self', you can use any word, but 'self' is the convention.
	def __init__(self):
		#create an instance of the Selenium webdriver using Firefox and assign it to the 'driver' instance variable
		self.driver = webdriver.Firefox()

	def parse(self, response):
		#where we will store the spider items
		items = []

		#Print the response received by the Spider
		#A lot of text gets returned to the terminal which can make it difficult to spot the print output.
		#It helps to have starting and trailing asterisks
		print "***** RESPONSE: %s *****" % response

		#Instance variables must be accessed through 'self', so we assign 'self.driver' to a local
		#variable 'driver'.  This makes the code going forward cleaner than writing 'self.driver' everywhere.
		driver = self.driver
		#pass the response url to the web driver.  This should open a Firefox window at the start url
		driver.get(response.url)
		#locate the checkbox by its xpath and select it
		driver.find_element_by_xpath("//input[@id='iAccept']").click()
		#locate the 'Accept' button and click it to continue on to the site
		driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/Main/DoNotShowIntro']").click()
		#locate the 'Tables' button and click it
		driver.find_element_by_xpath("//a[@href='/hmip-pimh/en/TableMapChart?id=1&t=1']").click()
		#locate the New Housing data.  This selects any hyperlink whose visible text contains 'New Housing Con'
		#the next element doesn't get loaded right away, so we need the driver to wait a few seconds before proceeding
		time.sleep(2)
		driver.find_element_by_partial_link_text('New Housing Con').click()
		#select the dataset we want
		driver.find_element_by_partial_link_text('Starts (Actual)').click()
		#Hurray we're finally at a datatable!
		time.sleep(2)
		driver.find_element_by_xpath("//a[@id='timePeriodLink']").click()
		#we want annual data
		driver.find_element_by_xpath("//input[@id='toggleAnnual']").click()
		driver.find_element_by_xpath("//div[@class='refresh-table']/button").click()
		time.sleep(2)

		#now it is finally time to let Scrapy do some work.  Create a Scrapy selector and pass it the page source
		#from the web driver
		hxs = HtmlXPathSelector(text=driver.page_source)
		#use the selector to extract all the years available and store the information in variable 'years'
		years = hxs.xpath("//div[@class='time-period-caption']/div/text()").extract()
		#print the years retrieved.  You'll see a 'u' prefix on each element; that's just Python indicating
		#the text is Unicode, it's not actually part of the string.
		print "*****YEARS %s *****" % years
		for i in years:
			print i

		index = 0
		for i in years:
			index += 1
			hxs = HtmlXPathSelector(text=driver.page_source)
			data = hxs.xpath("//div[@class='table-section']//tbody//tr")

			for datum in data:
				item = Cmhc1Item()
				item["geo"] = datum.xpath('./th[1]/text()').extract()
				item["single"] = datum.xpath('./td[1]/text()').extract()
				item["semiDetached"] = datum.xpath('./td[2]/text()').extract()
				item["row"] = datum.xpath('./td[3]/text()').extract()
				item["apartment"] = datum.xpath('./td[4]/text()').extract()
				item["total"] = datum.xpath('./td[5]/text()').extract()
				item["year"] = hxs.xpath("//a[@id='timePeriodLink']/text()").extract()

				print "*** PRINTING ITEMS ***"
				print item
				print "*** DONE PRINTING ITEMS ***"

				items.append(item)

			time.sleep(2)
			driver.find_element_by_xpath("//a[@id='timePeriodLink']").click()

			#the menu resets to the most recent year each time it opens, so click
			#prev-year once per completed iteration to reach the next year down
			for n in range(index):
				driver.find_element_by_xpath("//a[@class='prev-year']").click()

			driver.find_element_by_xpath("//div[@class='refresh-table']/button").click()
			time.sleep(2)

		return items

Use the following command to run the spider and save its output to a CSV file:

scrapy crawl cmhc_spider3 -o results.csv -t csv

This should create a file named results.csv in your spider folder. Attached is the Excel version of the results that was generated when I ran the spider on my machine.
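
To sanity-check the output programmatically, you can read the file back with Python's csv module. A quick sketch; the column names come from the Item fields, assuming Scrapy's default header row:

import csv

with open('results.csv', 'rb') as f:   #'rb' mode for the csv module on Python 2
	reader = csv.DictReader(f)
	for row in reader:
		print row['geo'], row['total']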


That concludes this three-part Scrapy and Selenium tutorial.  Feel free to send me any comments.  Happy crawling!
