Scraping with Scrapy

One of my favorite things to program is bots and scrapers. I think the first scraper I saw was in IRC, where a bot posted the latest news. This was amazing. You no longer had to do it manually – a program could do it for you.

Outside of IRC I discovered more bots and scrapers: Google or IMDb results, the weather, prices.

My first scraper

My first scraper was a script for eggdrop, which is an IRC bot. It was written in Tcl, which people probably don’t use anymore. If somebody wrote “!google <term>” in the chat, the bot would search for the term and return the first matching URL. Super basic, 22 lines of Tcl including whitespace, but it was extremely cool.

Later I wrote a lot more scrapers for different purposes, and with different techniques: from writing HTTP requests by hand to now using Scrapy.

What is this thing?

Yesterday, I started a new project in which I needed a scraper. So, I finally decided to take a deeper look at Scrapy.

My first impression was that there’s too much boilerplate.

from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

I have to define item classes? What is this, Django? Later I found out that you actually start projects like in Django. After I saw what happens when you start a new project…

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

… I thought about dropping Scrapy and just using urllib and BeautifulSoup again, which worked fine in the past. But then I looked around a bit, found opinions about Scrapy, and one person wrote:

Scrapy is a mature framework with full unicode, redirection handling, gzipped responses, odd encodings, integrated http cache, etc.

Wait. Full Unicode? I don’t have to care about encodings? The bane of my existence. Writing a scraper in 20 minutes and then taking 2 hours to get the freaking encodings right. Sold.

You got me at encodings

Again I was on my way to writing my first scraper in Scrapy. Apparently, it’s not like Django but a bit easier, and you can still write scrapers pretty fast.

I followed the tutorial but with my own project, and it wasn’t actually that hard. I finally took a serious look at XPath, which Scrapy uses besides CSS selectors for extraction, and it’s also not that hard. It took me about 30 – 60 minutes to write my first scraper and scrape the first results. I was very pleased by its interactive shell, which is like IPython, so you can scrape and then figure out your XPaths. Especially cool is that it shows you all matches directly. This is great.
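
For illustration, a minimal spider ends up looking roughly like this (a sketch only – the spider name, URL and XPath expressions are placeholders, and the exact imports depend on your Scrapy version):

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import TorrentItem  # the item class defined above

class TorrentSpider(Spider):
    name = "torrents"
    start_urls = ["http://www.example.com/torrents"]  # placeholder start page

    def parse(self, response):
        sel = Selector(response)
        # placeholder XPath: one table row per torrent
        for row in sel.xpath("//table[@class='list']/tr"):
            item = TorrentItem()
            item['url'] = row.xpath(".//a/@href").extract()
            item['name'] = row.xpath(".//a/text()").extract()
            yield item

The interactive shell is what makes this pleasant: you try the XPath expressions there first and only then put them into parse().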

After I looked around a bit in its docs I found that it has lots of features and middleware. Scrapy seems to thrive if you continuously scrape / crawl the web, and you can even use it for testing.

One of my favorite things, besides the Unicode and odd-encodings support, is the JSON export. If you run your crawler you can just add -o data.json -t json and everything will be neatly formatted and saved as JSON. You can also work with pipelines, i.e. directly transforming the data and saving it, for example into a SQL database.
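
A pipeline is just a class with a process_item method. A minimal sketch that writes items into SQLite could look like this (the class, database and table names here are made-up examples, not part of my actual project):

# pipelines.py – hypothetical example: store each scraped item in SQLite
import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect("torrents.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS torrents (url TEXT, name TEXT, description TEXT, size TEXT)")

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO torrents VALUES (?, ?, ?, ?)",
            (item.get('url'), item.get('name'), item.get('description'), item.get('size')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

To enable it you would list the class in the ITEM_PIPELINES setting in settings.py.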

It was less daunting than I initially thought and is less stressful because of the encoding stuff. Great piece of software.

Setting up a VPS on yourserver.se

I was looking for a cheap VPS to host some small apps and play around a bit more with Python & CGI. After I created the MLG soundboard with Bootstrap, my aversion to HTML diminished.

On a Bitcoin wiki I found a company called yourserver.se which offers a VPS for 2 Euro per month. Yes, TWO EURO ($2.75). You get 256 MB RAM, 5 GB SSD(!) disk and unlimited transfer. You can choose between CentOS, Debian and Ubuntu and pay by PayPal or Bitcoin. You also get a free IP address, which is absolutely crazy for this price.

Getting started

After I decided on the distribution I just paid, was immediately logged in, and my VPS was ready. Super easy, so uncomplicated. Somehow the locales were a bit broken. However, this should fix them (for Debian-based distributions):

export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
locale-gen en_US.UTF-8
dpkg-reconfigure locales

Afterwards I just did the usual stuff: setting up users, importing config files, securing the server, etc. You can check out Linode’s guide, which is pretty good though a bit outdated.

I quickly installed zsh, which is my favorite shell. Its auto-completion features alone are worth the install. I also run cron jobs, therefore I installed bsd-mailx and postfix so that the system can send me mail if errors happen. If you want to set up your own MTA there are many tutorials out there and it’s also covered in the Linode guide. However, it can be a pain in the ass. A lot of people use Google Apps or other services for handling email.

If you work via SSH you should also install screen. It’s basically a window manager for the shell. I’m connecting from a Mac, and each time I pressed tab, screen answered “Wuff ---- Wuff”. You can get rid of this by setting TERM to rxvt. Just include this in your .zshrc or .bashrc:

export TERM=rxvt

The next installs were pip, which is a Python package manager, and SQLite. SQLite is a great database which is basically enough for most people; it doesn’t suck up as many resources as MySQL or Postgres and is super easy to use. I mainly use it from Python. You can use it for web apps without problems – especially if you use an ORM.

SQLite usually will work great as the database engine for low to medium traffic websites (which is to say, 99.9% of all websites). The amount of web traffic that SQLite can handle depends, of course, on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic (Appropriate Uses For SQLite)

So yeah, if you have less than 100K hits a day you are fine. Which is more than I have in a year…

These are my basic tools. Then I started playing a bit with lighttpd (I want a lean server), which was surprisingly easy to install and configure and works pretty well.

I felt quite comfortable doing all this stuff, although the last time I used Linux was about 5 years ago. But somehow it came back to me.

Speed test

# Amsterdam
wget -O /dev/null http://lg.amsterdam.fdcservers.net/100MBtest.zip
2014-05-10 09:17:23 (4.13 MB/s) - ‘/dev/null’ saved [104857600/104857600]

# Dallas
wget -O /dev/null http://speedtest.dal01.softlayer.com/downloads/test100.zip
2014-05-10 09:10:10 (4.22 MB/s) - ‘/dev/null’ saved [104874307/104874307]

# Washington DC
wget -O /dev/null http://speedtest.wdc01.softlayer.com/downloads/test100.zip
2014-05-10 09:10:49 (4.11 MB/s) - ‘/dev/null’ saved [104874307/104874307]

# Seattle
wget -O /dev/null http://speedtest.wdc01.softlayer.com/downloads/test100.zip
2014-05-10 09:11:19 (4.71 MB/s) - ‘/dev/null’ saved [104874307/104874307]

# Tokyo
wget -O /dev/null http://speedtest.tokyo.linode.com/100MB-tokyo.bin
2014-05-10 09:13:25 (2.24 MB/s) - ‘/dev/null’ saved [104857600/104857600]

# London
wget -O /dev/null http://speedtest.london.linode.com/100MB-london.bin
2014-05-10 09:14:07 (6.09 MB/s) - ‘/dev/null’ saved [104857600/104857600]

# Sweden (altushost)
wget -O /dev/null http://31.3.153.125/100mb.test
2014-05-10 09:20:24 (10.9 MB/s) - ‘/dev/null’ saved [104857600/104857600]

So you have a solid 30 – 40 Mbit/s connection and a fantastic connection to Sweden, where the server is hosted.

VPN?

You can use OpenVPN, for which you have to enable the TUN interface. I tried it yesterday and it worked pretty well. If you want to use PPTP instead, you just write them a support ticket and they will enable the module. So you have both options available.

Conclusion

Yourserver.se is incredibly cheap, fast and easy. You can even pay by Bitcoin. I don’t make any money recommending them, and I didn’t even want to because they are so cheap. I sincerely feel a bit bad. Seriously, grab one before they raise the prices.

Writing a reddit bot using PRAW

Yesterday, I sat around and noticed a video posted on reddit which I knew had already been posted earlier on the same sub. Therefore I decided to finally write a reddit bot.

I started using PRAW which is super easy to use. You can find the docs on the site.

The first step is always figuring out the goal and the process.

What should the bot do? I wanted it to find reposts of posts in a specific subreddit and post a comment listing all reposts.

How should the bot do it? I just started manually trying things out. I used the search function to find the same URL; then I noticed different URLs for the same video. For YouTube videos it worked best if I extracted the video ID and searched for it.

My first step was to create an account and get the newest posts.

import praw

# log in with the old, password-based PRAW API
r = praw.Reddit(user_agent="USER AGENT")
r.login('username', 'password')

# fetch the newest submissions from the subreddit
new_sub = r.get_subreddit('SUBREDDIT').get_new()

It’s super straightforward. Then I look up in a SQLite database whether I have already checked a post. If it’s not already checked, I look at the domain. If the domain is youtube.com, I extract the video ID. Currently, I’ve only seen two formats, which need different handling.

The first format is /watch?v=VIDEOID&... In this case the video ID is easily extracted using urlparse. The second format is a bit different. It mainly shows up if people want to track attributions and looks like this: /attribution_link?a=ATTRIBUTIONID&u=%2Fwatch%3Fv%3DVIDEOID%26feature%3Dshare. Again I extract the query using urlparse and then parse the embedded '/watch?v=...' part of the query a second time. This gives you the video ID in that case.
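
Roughly, the extraction looks like this (a sketch of the idea, not the bot’s actual code; written against Python 3’s urllib.parse – in Python 2 the same functions live in the urlparse module):

from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the YouTube video ID for both URL formats, or None."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if parsed.path == '/watch':
        # format 1: /watch?v=VIDEOID&...
        return query.get('v', [None])[0]
    if parsed.path == '/attribution_link':
        # format 2: the u parameter is a percent-encoded /watch URL;
        # parse_qs already decodes it, so parse it a second time
        inner = query.get('u', [None])[0]
        if inner:
            return parse_qs(urlparse(inner).query).get('v', [None])[0]
    return None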

If the domain isn’t youtube.com I just use the submitted URL.

Now I use reddit’s search function with the url: parameter, which searches only in the submitted URL. For YouTube videos the query is "url: VIDEOID", otherwise "url: URL". Then I parse each result and compare its ID to the post I’m actually checking, to avoid false positives.
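
Continuing the sketch from above (a rough outline using the old PRAW API from earlier; depending on the PRAW version the search call may live on the subreddit object instead, and the names are placeholders):

# search for earlier submissions of the same video / URL
query = 'url: %s' % (video_id if video_id else post.url)
candidates = r.search(query, subreddit='SUBREDDIT')

# drop the post we are currently checking to avoid false positives
reposts = [other for other in candidates if other.id != post.id]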

The next step is using ago, which translates time differences into readable text (e.g. “20 minutes ago” or “4 months ago”), to indicate how old a repost actually is. The last step is to add a comment to the post which lists each previous submission including the time, a permalink, the title, and up and down votes. Then I add the post ID to my SQLite database so that it won’t be checked again.
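
Formatting one line of that comment and marking the post as done might look roughly like this (again a sketch; the helper names, the comment layout and the table are made up):

import datetime
import sqlite3
from ago import human

conn = sqlite3.connect("checked.db")
conn.execute("CREATE TABLE IF NOT EXISTS checked (post_id TEXT PRIMARY KEY)")

def format_repost(other):
    # "4 months ago" etc., based on the submission's UTC timestamp
    age = human(datetime.datetime.utcnow() - datetime.datetime.utcfromtimestamp(other.created_utc))
    return "* [%s](%s) (%s, %d up / %d down)" % (other.title, other.permalink, age, other.ups, other.downs)

def mark_checked(post_id):
    # remember the post so the bot never comments on it twice
    conn.execute("INSERT OR IGNORE INTO checked (post_id) VALUES (?)", (post_id,))
    conn.commit()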

I’m currently testing the bot; if it works well enough I will probably open-source it.

 

Explanation vs. Prediction

Again, this is an older post I had lying around. Nonetheless, the topic is still relevant.

I recently read a great paper named To Explain or To Predict? by Galit Shmueli. She explains the differences between “old-school” explanatory statistics and predictive statistics. I have made many of her observations myself.

One consequence is that predictions are often regarded as unscientific, and therefore there is a bit of a lack of good literature – lately the situation has become better with the rise of machine learning. Nonetheless, most students don’t learn how to make predictions, and you see people using R^2 to validate models.

Sure, there are some departments that teach how to predict, but they are still in the minority. Of course, there’s this other trend with Big Data. Personally, I’m not really excited by Big Data but rather by data in general.

More Info: http://galitshmueli.com/explain-predict

I wrote this post more than 2 years ago. Since then machine learning has become a kind of commodity on a smaller level, and something strange happened. Some of the people who work with data but never learned good statistical techniques started trying to explain data, which is pretty terrible. It even seems that they are trying to reinvent statistics. Yesterday I read a post called Why big data is in trouble: they forgot about applied statistics which captured this pretty nicely.

The table at the bottom is just unbelievable. It lists different fields and the application of “big data” or “data science”. It also claims that in 2012 these finally started to enter fields like biology, economics, engineering, etc., which is more sad than hilarious. So yeah, I didn’t expect this turn.

Furthermore, I saw more and more “data science” boot camps / programs popping up, still neglecting statistical foundations and resulting in even more terrible studies. This trend will probably follow the Gartner Hype Cycle. As far as I can tell the peak has already been reached; now comes the disappointment, and in a few years it will actually reach its plateau. Here is the latest Hype Cycle from July 2013:

[Figure: Gartner Hype Cycle, July 2013]

I saw the term “prescriptive analytics” on there and just looked it up. It’s astonishing that people keep inventing new terms for the same stuff and it still works. Even business intelligence is basic statistics, then came predictive analytics (still statistics), data science (hey, statistics), now prescriptive analytics (still statistics).

I just have to quote one of my favorite quotes on this topic:

Someone (can’t recall the source, sorry) recently defined “data scientist” as “a data analyst who lives in California.” —baconner