Web crawling with Python: Part 1, Setup

It has been a while since my last post. I have been working on an app for the past few months, which consumed all the behind-a-screen time I could muster, but now it’s time to get back to things.

In addition to the R graphs that I usually do, I will be writing more about data mining. If you can get your data from StatsCan, you’re probably good to go, since you can customize a lot of their reports and choose from several download formats. A lot of data is not so easily attainable, though. In the past, an analyst would manually copy and paste data into Excel sheets or, if they were lucky, use a web query from their spreadsheet to link to the data. Now there are far superior options available, which not only help retrieve data but also open up a whole new world of perspective.

Web crawling and data mining are exploding fields, and there are many open-source tools available which allow anyone to build their own programs to systematically retrieve information from the web. I hope to write a lot on this topic, especially as I continue to learn more about it.

My first series of posts on this subject will deal with Scrapy, a Python framework for scraping web data. Scrapy is one of the most popular frameworks out there, and you don’t have to write a lot of code to create a web crawler. That said, it can take some time to get comfortable with the syntax, especially the XPath bits. I will be using Scrapy in conjunction with the Selenium web driver, also for Python. Scrapy is great for getting data from websites and navigating most of them, but it doesn’t work so well when you have to interact with any JavaScript on a page (e.g. having to click a button to generate a table). This is where Selenium comes in handy, since it can interact with JavaScript. It won’t always be necessary to use Selenium, and in fact it’s better not to use it if you don’t have to, since it will really slow down the program.
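
To give a sense of the division of labour, here is a minimal sketch of using Selenium to click a button that generates a table. The URL and the button id are made up for illustration:

    from selenium import webdriver

    # Selenium drives a real browser, so the page's JavaScript actually runs
    driver = webdriver.Firefox()
    driver.get("http://example.com/report")  # hypothetical page

    # Click the (hypothetical) button that triggers the JavaScript
    # which builds the table; Scrapy alone can't do this
    driver.find_element_by_id("generate-table").click()

    # The rendered table is now in the page source, ready to parse
    html = driver.page_source
    driver.quit()

Once the JavaScript has run, you can hand the rendered page source off to Scrapy or lxml for the actual parsing.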

Okay, time to get set up.  Here’s what you’ll need:

  1. Install Python
  2. Install pip (used to manage Python packages) by running: python get-pip.py (or sudo python get-pip.py)
  3. Install lxml using pip
  4. Install OpenSSL (only necessary if you’re using Windows)
  5. With the above four items installed, you can now use pip to install Scrapy.
  6. Use pip to install Selenium for Python (the pip commands for steps 3, 5, and 6 are shown just after this list).
  7. Download Firebug (you’ll use this to identify the HTML elements for the web crawler)
  8. Finally, download Sublime Text to use as your editor (if you don’t already have a Python editor)
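
For reference, the pip commands for steps 3, 5, and 6 look something like this (you may need to prefix them with sudo on Mac/Linux):

    pip install lxml
    pip install Scrapy
    pip install selenium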

If you’re new to Python or HTML, there are some free courses at Codecademy which I found helpful.

Getting everything set up takes a bit of time. A lot is done from the command line, which always takes some jiggering.  In my next post on this subject, I will go through creating a web crawler from scratch and using it to collect data.
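
In the meantime, a quick sanity check that the two main pieces installed correctly: the following should print version numbers rather than errors.

    scrapy version
    python -c "import selenium; print(selenium.__version__)"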
