# Rereading Hacker News, April 2009 – Sep 2011

April 2009 to September 2011 was my first period away from tech. During this time I was mostly occupied with economics and politics. I want to make up the leeway by reading the good articles I missed back then.

I decided to use Hacker News as my source in the hope that most good articles were posted there.

The first thing that surprised me was how few concrete tech articles are among the top ones. So far, most of the stuff is gimmicks or ‘feels’ stories. There are also a ton of news articles, which isn’t surprising. The most surprising fact was that the best articles, in my opinion, were the mildly popular ones. I read about 1500 headlines, skimmed about 150 articles, and these were the most interesting things I found:

This is a beautiful post which shows how important randomization is in cryptography.

Jerry didn’t seem to care. I was confused. I was showing him technology that extracted the maximum value from search traffic, and he didn’t care? I couldn’t tell whether I was explaining it badly, or he was just very poker faced.

n.c.

I remember telling David Filo in late 1998 or early 1999 that Yahoo should buy Google, because I and most of the other programmers in the company were using it instead of Yahoo for search. He told me that it wasn’t worth worrying about. Search was only 6% of our traffic, and we were growing at 10% a month. It wasn’t worth doing better.

n.c.

But they had the most opaque obstacle in the world between them and the truth: money. As long as customers were writing big checks for banner ads, it was hard to take search seriously.

This is a great insight. Some call it the golden cage. You feel so good that you don’t care about possible threats anymore.

That’s why Yahoo as a company has never had a sharply defined identity. The worst consequence of trying to be a media company was that they didn’t take programming seriously enough. Microsoft (back in the day), Google, and Facebook have all had hacker-centric cultures. But Yahoo treated programming as a commodity. At Yahoo, user-facing software was controlled by product managers and designers. The job of programmers was just to take the work of the product managers and designers the final step, by translating it into code.

n.c.

In the software business, you can’t afford not to have a hacker-centric culture.

Your usefulness as a developer is only indirectly related to your ability to code. There are bona fide geniuses working in poverty and obscurity, there are utterly mediocre programmers doing amazingly useful and important work. Github is overflowing with brilliant, painstaking solutions to problems that just don’t matter. Your most important skill as a developer is choosing what to work on. It doesn’t matter how lousy a programmer you are, you can still transform an industry by simple merit of being the first person to seriously write software for it. Don’t spend good years of your life writing the next big social network or yet another bloody blogging engine. Don’t be That Guy.

Perfect.

Nifty trick. It could be especially interesting in combination with longer articles.

Lots of threads about the decline of HN:

I talked with Ward Cunningham once about this – any good community helps you to grow, but after a time you outgrow it. Perhaps they should be like a book – you read them, and then after that you dip into them for a bit, then less often, and eventually you don’t pick them up again.

I have experienced that myself quite a few times. You just outgrow communities over time. Some are great for beginners, but after a while you get annoyed that you aren’t progressing, and then you switch to another community.

But there are also many valueless comments that nevertheless get upvotes, and there’s definitely a more snarky feel. I’ve felt myself being dragged into that at times and had to pull myself back. That didn’t used to happen.

See Reddit, or any public space in general.

Upvotes need to be weighed by karma, and karma of exemplary members of the community needs to be seeded by you (and other exemplary members). This way cliques of mean/non-insightful users can upvote each other to their heart’s content without making any appreciable difference in their karma value.

I found this an interesting idea, and wondered whether you could rank comments by something like PageRank. It would be interesting to implement and test that.
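A minimal sketch of how that could look, assuming every upvote is an edge from the voter to the comment’s author and that only a few hand-picked seed users receive the teleport weight (basically personalized PageRank). The user names, the damping factor of 0.85 and the function name `trust_rank` are my own illustrative assumptions, not anything HN actually does.

```python
def trust_rank(upvotes, seeds, damping=0.85, iterations=50):
    """Personalized PageRank over an upvote graph.

    upvotes: list of (voter, author) pairs, one per upvote.
    seeds:   users trusted by hand; only they receive teleport weight.
    """
    users = {u for pair in upvotes for u in pair} | set(seeds)
    out_edges = {u: [] for u in users}
    for voter, author in upvotes:
        out_edges[voter].append(author)

    teleport = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in users}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) * teleport[u] for u in users}
        for voter, authors in out_edges.items():
            if authors:
                # Split the voter's trust evenly over everyone they upvoted.
                share = damping * rank[voter] / len(authors)
                for author in authors:
                    new_rank[author] += share
            else:
                # Users who never vote hand their weight back to the seeds.
                for u in users:
                    new_rank[u] += damping * rank[voter] * teleport[u]
        rank = new_rank
    return rank


# Hypothetical data: a seed user upvotes alice, while mallory and
# trent only upvote each other.
votes = [("pg", "alice"), ("alice", "bob"),
         ("mallory", "trent"), ("trent", "mallory")]
scores = trust_rank(votes, seeds=["pg"])
```

In this toy graph the clique of mallory and trent never receives any trust, because nobody connected to a seed ever upvoted them. A comment’s score would then be the sum of its upvoters’ trust instead of a raw count.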

Your only chance, really, is to build something which can spread like a virus after being announced on a col-de-sac party. Something utterly addictive, unusual and truly amazing. A great self-selling, self-propagating viral-on-steroids idea (assuming you can code) is your only chance to succeed.

If you want to make it big fast I agree. My soundboard is an example of that. It blew up and now it gets around 160K visits per month. I didn’t advertise it but just hit a nerve. There was a trend, there was demand and I supplied it.

In Breakthrough Advertising Eugene Schwartz wrote about a similar thing: riding waves. It’s a great book, and I have my notes lying around somewhere. However, it isn’t the only way to success.

Comments:

jdietrich wrote:

You seem to have been terribly misled. Only very rarely do products sell themselves. 99% of the time, the product is largely incidental to the sales process. Your idea doesn’t matter one jot, what matters is how well you can connect to customers and really sell to them.

And here’s the other side. Hard work but it can pay off.

Interesting article about using perceptual hashes for image recognition. I love the simplicity of the average hash algorithm. The author also talks about pHash, which was used to crack Google’s captcha.
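For illustration, here is a minimal sketch of the average hash idea as I understand it: shrink the image to 8×8, compare each cell with the mean brightness, and pack the results into 64 bits. To stay dependency-free it assumes the image is already a grayscale list of rows; a real implementation would load and resize the image with an imaging library. The function names are my own.

```python
def average_hash(pixels, size=8):
    """Average hash of a grayscale image given as a list of rows (0-255).

    Assumes the image is at least `size` x `size` pixels.
    """
    height, width = len(pixels), len(pixels[0])
    # Step 1: shrink to size x size by averaging each block of pixels.
    small = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            small.append(sum(block) / len(block))
    # Step 2: one bit per cell, set if the cell is brighter than the mean.
    mean = sum(small) / len(small)
    bits = 0
    for value in small:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of differing bits; small distances mean similar images."""
    return bin(a ^ b).count("1")


# Hypothetical 16x16 test images: a horizontal gradient, a slightly
# brighter copy, and an inverted copy.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[value + 10 for value in row] for row in gradient]
inverted = [[255 - value for value in row] for row in gradient]

same = hamming_distance(average_hash(gradient), average_hash(brighter))
opposite = hamming_distance(average_hash(gradient), average_hash(inverted))
```

Brightening the whole image shifts the mean along with the pixels, so the hash stays the same, while the inverted image flips every bit.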

Programmers are most effective when they avoid writing code. They may realize the problem they’re being asked to solve doesn’t need to be solved, that the client doesn’t actually want what they’re asking for. They may know where to find reusable or re-editable code that solves their problem. They may cheat. But just when they are being their most productive, nobody says “Wow! You were just 100x more productive than if you’d done this the hard way. You deserve a raise.” At best they say “Good idea!” and go on.

Good insight. That’s something that can’t be seen normally because there’s only one outcome. I wonder how you can visualize that added value.

Tiktaalik wrote:

Blizzard for example had mild success with Rock n’ Roll Racing and Lost Vikings prior to Warcraft 2. […]

Nintendo made lots of arcade games since 1973, many being blatant clones of successful titles, before striking gold with Donkey Kong in 1981. Some of these may have sold fairly well, but the titles are ignored today so they couldn’t have been all that good. […]

Pokemon developer Game Freak seems to have had it pretty rough prior to hitting the big time with Pokemon. The company has existed since 1989 and they put out a number of relatively unknown games before Pokemon in ’96. Pokemon wasn’t a strong seller at the beginning either.

Batsu wrote:

Harmonix (creators of Guitar Hero, which they sold, and Rock Band) has a similar story. They created a handful of games over a decade or so, all music based, that never really caught on. When they released Guitar Hero and a few karaoke games, they did a little better than breaking even, and with the release of Guitar Hero 2 sales exploded.

Yup, it can take quite long.

# Scraping with Scrapy

Bots and scrapers are among my favorite things to program. I think the first scraper I saw was on IRC, where a bot posted the latest news. This was amazing. You no longer had to do it manually; a program could do it for you.

Outside of IRC I discovered more bots and scrapers: Google or IMDb results, the weather, prices.

## My first scraper

My first scraper was a script for eggdrop, which is an IRC bot. It was written in Tcl, which people probably don’t use anymore. When somebody wrote “`!google <term>`” in the chat, the bot would search for the term and return the first matching URL. Super basic, 22 lines of Tcl including whitespace, but it was extremely cool.

Later I wrote a lot more scrapers for different purposes, using different techniques, from writing HTTP requests by hand to now using Scrapy.

## What is this thing?

Yesterday, I started a new project in which I needed a scraper. So, I finally decided to take a deeper look at Scrapy.

My first impression was that there’s too much boilerplate.

I have to define item classes? What is this, Django? Later I found out that you actually start projects like in Django. After I saw what happens if you start a new project…

… I thought about dropping Scrapy and just using urllib and BeautifulSoup again, which had worked fine in the past. But then I looked around a bit and found opinions about Scrapy, and one person wrote:

Scrapy is a mature framework with full unicode, redirection handling, gzipped responses, odd encodings, integrated http cache, etc.

Wait. Full unicode? I don’t have to care about encodings? The bane of my existence. Writing a scraper in 20 minutes and taking 2 hours to get the freaking encodings right. Sold.

## You got me at encodings

Again I was on my way to writing my first scraper in Scrapy. Apparently, it’s not like Django but a bit easier, and you can still write scrapers pretty fast.

I followed the tutorial but with my own project, and it wasn’t actually that hard. I finally took a serious look at XPath, which Scrapy uses alongside CSS selectors for extraction. And it’s also not that hard. It took me about 30–60 minutes to write my first scraper and scrape the first results. I was very pleased by its interactive shell, which is like IPython. You can fetch a page and then figure out your XPaths interactively. Especially cool is that it shows you all matches directly. This is great.
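To give a feeling for what such an XPath looks like, here is a minimal sketch using Python’s standard library instead of Scrapy. Note the assumptions: `xml.etree.ElementTree` only supports a limited subset of XPath, it needs well-formed markup (real pages rarely are, which is part of what Scrapy’s selectors handle for you), and the HTML snippet is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page.
page = """
<html><body>
  <div class="story"><a href="/item?id=1">First post</a></div>
  <div class="story"><a href="/item?id=2">Second post</a></div>
  <div class="ad"><a href="/buy">Buy now</a></div>
</body></html>
"""

root = ET.fromstring(page)
# Every <a> that sits directly inside a <div class="story">.
links = root.findall(".//div[@class='story']/a")
titles = [a.text for a in links]
urls = [a.get("href") for a in links]
```

The expression skips the ad because its `div` has a different class; trying expressions like this one after another is exactly what the interactive shell makes pleasant.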

After looking around a bit in its docs I found that it has lots of features and middleware. Scrapy seems to thrive if you continuously scrape or crawl the web, and you can even use it for testing.

One of my favorite things, besides the Unicode and odd-encodings support, is the JSON export. When you run your crawler you can just add `-o data.json -t json` and everything will be neatly formatted and saved as JSON. You can also work with pipelines, i.e. directly transforming the data and saving it, for example in a SQL database.
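A pipeline is just a class with a `process_item` method, plus optional `open_spider` and `close_spider` hooks. Here is a minimal sketch of one that cleans a field and writes to SQLite. The table layout, the item fields `url` and `title`, and the file name are my own assumptions, and the hook signatures may differ between Scrapy versions, so check the docs for yours.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline sketch: clean each item and store it in SQLite."""

    def __init__(self, path="scraped.db"):
        self.path = path  # hypothetical database file name

    def open_spider(self, spider):
        self.db = sqlite3.connect(self.path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Transform first (strip whitespace), then save.
        title = item["title"].strip()
        self.db.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], title),
        )
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()


# Driving the pipeline by hand, outside of Scrapy, with a made-up item.
pipeline = SQLitePipeline(":memory:")
pipeline.open_spider(spider=None)
pipeline.process_item({"url": "http://example.com", "title": "  Example  "}, spider=None)
rows = pipeline.db.execute("SELECT url, title FROM pages").fetchall()
pipeline.close_spider(spider=None)
```

In a real project you would not call the hooks yourself; you enable the class through the `ITEM_PIPELINES` setting and Scrapy calls them for every scraped item.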

It was less daunting than I initially thought and is less stressful because of the encoding stuff. Great piece of software.

# Setting up a VPS on yourserver.se

I was looking for a cheap VPS to host some small apps and play around a bit more with Python and CGI. After I created the MLG soundboard with Bootstrap my aversion to HTML diminished.

On a Bitcoin wiki I found a company called yourserver.se which offers a VPS for 2 Euro per month. Yes, TWO EURO (\$2.75). You get 256 MB RAM, a 5 GB SSD(!) disk and unlimited transfer. You can choose between CentOS, Debian and Ubuntu and pay by PayPal or Bitcoin. You also get a free IP address, which is absolutely crazy for this price.

## Getting started

After I decided on the distribution I just paid, was immediately logged in, and my VPS was ready. Super easy, so uncomplicated. Somehow the locales were a bit broken. However, this should fix them (for Debian-based distributions):

Afterwards I just did the usual stuff: setting up users, importing config files, securing the server, etc. You can check out Linode’s guide, which is pretty good though a bit outdated.

I quickly installed zsh, which is my favorite shell. Its auto-completion features alone are worth installing it. I also run cron jobs, so I installed bsd-mailx and postfix so that the system can send me mail if errors happen. If you want to set up your own MTA there are many tutorials out there, and it’s also covered in the Linode guide. However, it can be a pain in the ass. A lot of people use Google Apps or other services for handling email.

If you work via ssh you should also install screen. It’s basically a window manager for the shell. I’m connecting from a Mac, and each time I pressed tab screen answered “Wuff ---- Wuff”. You can get rid of this by setting TERM to rxvt. Just include `export TERM=rxvt` in your .zshrc or .bashrc.

The next installs were pip, which is a Python package manager, and SQLite. SQLite is a great database which is basically enough for most people. It doesn’t suck up as many resources as MySQL or Postgres and is super easy to use. I mainly use it from Python. You can use it for web apps without problems, especially if you use an ORM.

SQLite usually will work great as the database engine for low to medium traffic websites (which is to say, 99.9% of all websites). The amount of web traffic that SQLite can handle depends, of course, on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic (Appropriate Uses For SQLite)

So yeah, if you have fewer than 100K hits a day you are fine. Which is more than I have in a year…
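To show how little ceremony SQLite needs from Python, here is a minimal sketch of a page-hit counter using the built-in `sqlite3` module. The table and the paths are made up for illustration.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a web app would pass a file path.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (path TEXT PRIMARY KEY, count INTEGER NOT NULL)")


def record_hit(path):
    """Count one page view, creating the row on first sight."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT OR IGNORE INTO hits (path, count) VALUES (?, 0)", (path,))
        db.execute("UPDATE hits SET count = count + 1 WHERE path = ?", (path,))


for path in ["/", "/soundboard", "/"]:
    record_hit(path)

top = db.execute("SELECT path, count FROM hits ORDER BY count DESC").fetchall()
```

There is no server to install or configure; the whole database is one file (or, here, a chunk of memory).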

These are my basic tools. Then I started playing a bit with lighttpd (I want a lean server), which was surprisingly easy to install and configure and works just fine.

I felt quite comfortable doing all this stuff although the last time I used Linux was about 5 years ago. Somehow it all came back to me.

## Speed test

So you get a solid 30–40 Mbit/s and a fantastic connection to Sweden, where the server is hosted.

## VPN?

You can use OpenVPN, for which you have to enable the TUN interface. I tried it yesterday and it worked fine. If you want to use PPTP instead, you just write them a support ticket and they will enable the module. So you have both options available.

## Conclusion

Yourserver.se is incredibly cheap, fast and easy. You can even pay by Bitcoin. I don’t make any money recommending them, and I wouldn’t even want to because they are so cheap. I sincerely feel a bit bad. Seriously, grab one before they raise the prices.

# Explanation vs. Prediction

Again, an older post I had lying around. Nonetheless, the topic is still relevant.

I recently read a great paper named To Explain or To Predict? by Galit Shmueli. She explains the differences between “old-school” explanatory statistics and predictive statistics. I have seen many of her observations myself.

That means predictions are often regarded as unscientific, and therefore good literature is somewhat lacking, although the situation has improved lately with the rise of machine learning.
Nonetheless, most students don’t learn how to make predictions, and you still see people using $R^2$ to validate models.
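To make the problem with in-sample $R^2$ concrete, here is a minimal sketch with simulated data. The “model” simply memorizes the training points, so it explains the training data perfectly and still predicts new data worse than that. The data-generating process (a line plus Gaussian noise), the seed and the 50/50 split are arbitrary assumptions of mine.

```python
import random


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def memorizer(train_x, train_y):
    """'Model' that returns the y of the closest training point."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + random.gauss(0, 5) for x in xs]  # true signal plus noise

train_x, train_y = xs[:100], ys[:100]
test_x, test_y = xs[100:], ys[100:]

model = memorizer(train_x, train_y)
r2_train = r_squared(train_y, [model(x) for x in train_x])  # in-sample
r2_test = r_squared(test_y, [model(x) for x in test_x])     # held-out data
```

The in-sample $R^2$ is a perfect 1.0 because the model has memorized the noise; only the held-out value says anything about how well it predicts.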

Sure, there are some departments that teach how to predict, but they are still in the minority. Of course, there’s this other trend with Big Data. Personally, I’m not that excited by Big Data but rather by data in general.

More Info: http://galitshmueli.com/explain-predict

I wrote this post more than two years ago. Since then machine learning has become something of a commodity on a smaller scale, and something strange happened. Some of the people who work with data but never learned good statistical techniques started trying to explain data, which is pretty terrible. It even seems that they are trying to reinvent statistics. Yesterday I read a post called “Why big data is in trouble: they forgot about applied statistics”, which captured this pretty nicely.

The table at the bottom is just unbelievable. It lists different fields and their applications of “big data” or “data science”. It also claims that in 2012 they finally started to enter fields like biology, economics, engineering, etc., which is more sad than hilarious. So yeah, I didn’t expect this turn.

Furthermore, I have seen more and more “data science” boot camps and programs popping up that still neglect statistical foundations, resulting in even more terrible studies. This trend will probably follow the Gartner Hype Cycle. As far as I can tell the peak has already been reached; now disappointment will set in, and in a few years it will actually reach its plateau. Here is the latest Hype Cycle from July 2013:

I see the term “prescriptive analytics” on there and just looked it up. It’s astonishing that people keep inventing new terms for the same stuff and it still works. Business intelligence is basic statistics, then came predictive analysis (still statistics), data science (hey, statistics), and now prescriptive analytics (still statistics).

I just have to quote one of my favorite quotes on this topic:

Someone (can’t recall the source, sorry) recently defined “data scientist” as “a data analyst who lives in California.” —baconner