It has been a while since my last post. I have been working on an app for the past few months, which consumed all the behind-a-screen time I could muster, but now it’s time to get back to things.
In addition to the R graphs that I usually do, I will be writing more about data mining. If you can get your data from StatsCan, you’re probably good to go, since you can customize a lot of their reports and download them in several formats. A lot of data is not so easily attainable, though. In the past, an analyst would manually copy and paste data into Excel sheets, or, if they were lucky, use a web query from their spreadsheet to link to the data. Now there are far superior options available, which not only help retrieve data but also open up a whole new world of perspective.
Web crawling and data mining are exploding fields, and there are many open-source tools available which allow anyone to build their own programs to systematically retrieve information from the web. I hope to write a lot on this topic, especially as I continue to learn more about it.
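To make the idea concrete, here is a minimal sketch of what these tools automate: parse a page’s HTML and systematically pull out the pieces you want. This uses only Python’s standard library (the HTML snippet is made up for illustration); tools like Scrapy do the same thing at scale, with fetching, queuing, and export built in.

```python
from html.parser import HTMLParser

# Collect every link (href) found in a page's <a> tags.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A stand-in page; in practice you would fetch this from the web.
page = '<html><body><a href="/data.csv">Data</a><a href="/about">About</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/data.csv', '/about']
```

A real crawler would feed each extracted link back into a queue of pages to visit, which is exactly the loop that Scrapy manages for you.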
Okay, time to get set up. Here’s what you’ll need:
- Install Python
- Install pip (used to manage Python packages) by downloading get-pip.py and running: python get-pip.py (or sudo python get-pip.py)
- Install lxml using pip
- Install OpenSSL (only necessary if you’re using Windows)
- With the above 4 items installed, you can now use pip to install Scrapy.
- Use pip to install Selenium for Python.
- Download Firebug (you’ll use this to identify the HTML elements for the web crawler)
- Finally, download Sublime Text to use as your editor (if you don’t already have a Python editor)
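Once you’ve worked through the list above, a quick sanity check can save some head-scratching later. This little script only assumes the package names match what pip installs; it reports which of the scraping packages Python can actually find, without crashing if one is missing.

```python
import importlib.util

# Report which of the scraping packages from the setup list are importable.
packages = ("lxml", "scrapy", "selenium")
status = {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}
for pkg, installed in status.items():
    print(pkg, "OK" if installed else "MISSING")
```

If anything shows up as MISSING, re-run the corresponding pip install step before moving on.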
If you’re new to Python or HTML, there are some free courses at Codecademy that I found helpful.
Getting everything set up takes a bit of time. A lot is done from the command line, which always takes some jiggering. In my next post on this subject, I will go through creating a web crawler from scratch and using it to collect data.