One of my favorite things to program is bots and scrapers. I think the first scraper I saw was on IRC, where a bot posted the latest news. This was amazing. You no longer had to do it manually – a program could do it for you.
Outside of IRC I discovered more bots and scrapers: Google or IMDb results, the weather, prices.
My first scraper
My first scraper was a script for eggdrop, which is an IRC bot. It was written in Tcl, which people probably don’t use anymore. If somebody wrote “!google <term>” in the chat, the bot would search for the term and return the first matched URL. Super basic – 22 lines of Tcl including whitespace – but it was extremely cool.
Since then I’ve written a lot more scrapers for different purposes, and with different techniques – from writing HTTP requests by hand to now using Scrapy.
What is this thing?
Yesterday, I started a new project in which I needed a scraper. So, I finally decided to take a deeper look at Scrapy.
My first impression was that there’s too much boilerplate.
from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
I have to define item classes? What is this, django? Later I found out that you actually start projects like in django, too. This is what you get when you start a new project:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
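That whole skeleton, by the way, comes from a single command, very much like django’s startproject:

scrapy startproject tutorial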
Seeing all that, I thought about dropping Scrapy and just using urllib and BeautifulSoup again, which had worked fine in the past. But then I looked around a bit for opinions about Scrapy, and one person wrote:
Scrapy is a mature framework with full unicode, redirection handling, gzipped responses, odd encodings, integrated http cache, etc.
Wait. Full unicode? I don’t have to care about encodings? The bane of my existence. Writing a scraper in 20 minutes and taking 2 hours to get the freaking encodings right. Sold.
You got me at encodings
So once again I was on my way to writing my first scraper, this time in Scrapy. As it turns out, it’s not really like django after all – it’s a bit simpler, and you can still write scrapers pretty fast.
I followed the tutorial, but with my own project, and it actually wasn’t that hard. I finally took a serious look at XPath, which Scrapy uses besides CSS selectors for extraction, and it’s also not that hard. It took me about 30 – 60 minutes to write my first scraper and scrape the first results. I was very pleased by its interactive shell, which is like IPython: you can fetch a page and then figure out your XPaths interactively. Especially cool is that it shows you all matches directly. This is great.
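For illustration, here’s roughly what such a first spider looks like – a minimal sketch using today’s Scrapy API, reusing the TorrentItem from above. The URL and the XPath expressions are made-up assumptions about some torrent listing page:

import scrapy

from tutorial.items import TorrentItem


class TorrentSpider(scrapy.Spider):
    name = "torrents"
    # Hypothetical listing page, purely for illustration.
    start_urls = ["http://example.com/torrents"]

    def parse(self, response):
        # The XPath expressions below are assumptions about the page layout.
        for row in response.xpath("//div[@class='torrent']"):
            item = TorrentItem()
            item["url"] = response.urljoin(row.xpath(".//a/@href").get())
            item["name"] = row.xpath(".//a/text()").get()
            item["description"] = row.xpath(".//p/text()").get()
            item["size"] = row.xpath(".//span[@class='size']/text()").get()
            yield item

The shell works the same way: scrapy shell "http://example.com/torrents" drops you into the interactive prompt, where you can try response.xpath(...) expressions until they match.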
After looking around a bit in its docs, I found that it has lots of features and middleware. Scrapy seems to thrive if you continuously scrape / crawl the web, and you can even use it for testing.
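The “integrated http cache” from the quote above, for instance, is just a settings switch. A minimal sketch of what goes into settings.py – both settings exist in Scrapy, the expiration value is an arbitrary example:

# settings.py
# Cache responses on disk so repeated runs during development
# don't re-download every page.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # example: consider pages stale after an hour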
One of my favorite things, besides the unicode & odd-encodings support, is the JSON export. If you run your crawler you can just add -o data.json -t json and everything will be neatly formatted and saved as JSON. You can also work with pipelines, i.e. directly transforming the data and saving it, for instance into a SQL database – a minimal sketch of that follows below.
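A pipeline is just a class with a process_item method that you register in settings.py. Here’s a sketch, assuming SQLite as the database and the TorrentItem fields from above – the class name and table are made up:

# pipelines.py
import sqlite3


class SqlitePipeline:
    # Hypothetical pipeline: writes each scraped item into SQLite.

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS torrents "
            "(url TEXT, name TEXT, description TEXT, size TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO torrents VALUES (?, ?, ?, ?)",
            (item["url"], item["name"], item["description"], item["size"]),
        )
        return item  # pass the item on to any further pipelines

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

And in settings.py you enable it (the number controls the order in which pipelines run):

ITEM_PIPELINES = {"tutorial.pipelines.SqlitePipeline": 300}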
It was all less daunting than I initially thought, and it’s less stressful thanks to the encoding handling. Great piece of software.