Cracking Google’s reCaptcha

Recently, I checked some Defcon presentations and stumbled upon this beauty: a presentation about cracking Google’s voice captcha by the guys from Defcon Group 949.

You can get more information, the code, the corpus, etc. on their project page.
The video isn’t from one of the Defcons directly but from LayerOne.

Let’s start with the summary:

  • words were distinguishable because their frequencies differed from those of the background noise
  • they collected about 50k samples and labeled them by hand
  • Google used only 58 words
  • Two primary methods were used:
    • pHash: provides similar hashes for similar “media” files
    • Neural networks with lots of input nodes
  • Different NNs and pHash were combined, and the best-performing ensemble consisted of about 12 methods
  • The audio captchas were phonetic-based instead of spelling-based (e.g. blu and blue are treated the same)
  • This allowed for mashing:
    • Four and Fork => Fourk matches both
    • Seven and Oven => Soven
  • Then they wrote an automatic merge finder which found dozens of mashings
    • the finder took two random words, generated a candidate string, and kept it if the Levenshtein distance to both parent words was small (see the sketch after this list)
  • Afterwards, they applied contextual merging, based on probabilities, to the top words from the NN
  • Solving one captcha takes about 2 seconds; the biggest bottleneck was internet speed
  • A human needs at least the 8 seconds of the audio alone
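The merge finder is easy to sketch. Below is a minimal Python version of the idea as I understand it from the talk; the prefix+suffix candidate generation and the distance threshold are my own assumptions, not their actual code.

```python
from itertools import product

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def find_mashes(words, max_dist=2):
    """For each word pair, try prefix+suffix combinations and keep
    candidates that stay close to *both* parents (four + fork => fourk)."""
    mashes = set()
    for w1, w2 in product(words, repeat=2):
        if w1 == w2:
            continue
        for i in range(1, len(w1) + 1):
            for j in range(len(w2)):
                cand = w1[:i] + w2[j:]
                if (levenshtein(cand, w1) <= max_dist and
                        levenshtein(cand, w2) <= max_dist):
                    mashes.add(cand)
    return mashes

print(find_mashes(["four", "fork", "seven", "oven"]))  # contains 'fourk' and 'soven'
```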

There were two systems: an old one, which could be activated by disabling JS and which used 2 voices and 10 digits, and the new one with one voice and 58 words. There was prior research on the old system by Stanford and CMU: Stanford achieved about 1.3% accuracy, CMU about 58%. These guys achieved 99.1% accuracy on the newer system. Just amazing! However, Google changed the system a few hours before their presentation and their accuracy dropped to 0%.

There were about 20–25 million audio captchas, i.e. if you solve enough of them you get duplicates. They created a lookup table which provided 61% accuracy in about 0.005 seconds.
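Since the pool repeats, the lookup table is conceptually just a dictionary keyed by a perceptual hash of the audio. Here is a minimal sketch of the idea; the toy hashing below is a stand-in I made up for illustration, not their actual pHash implementation:

```python
import hashlib

def perceptual_key(samples, bands=64):
    """Toy stand-in for pHash: quantize a coarse energy envelope so
    that near-identical audio maps to the same key."""
    chunk = max(1, len(samples) // bands)
    envelope = tuple(
        int(sum(abs(s) for s in samples[i:i + chunk]) / chunk) // 8
        for i in range(0, chunk * bands, chunk)
    )
    return hashlib.md5(repr(envelope).encode()).hexdigest()

table = {}  # perceptual key -> known answer, filled from already-solved captchas

def remember(samples, answer):
    table[perceptual_key(samples)] = answer

def solve_by_lookup(samples):
    # O(1) dictionary lookup, which is where the ~0.005 s figure comes from
    return table.get(perceptual_key(samples))  # None on a cache miss
```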

The countermeasure by Google consisted of the following:

  • Same frequencies for words and background noise => makes it harder to split words
  • 10 instead of 5 words per captcha
  • 25 seconds instead of 8 seconds in length
  • added new words
  • the background noise now consists of actual English words instead of reversed radio broadcasts

The big problem with this countermeasure is that humans only achieved about a 30% success rate. Reminds me of Rapidshare’s infamous cat captchas.

Great talk, extremely interesting. Especially interesting is that they show once again that it doesn’t really matter whether you use NNs, SVMs or RBMs for prediction; the work before that (labeling by hand, feature extraction and learning about the system, e.g. mashing) and after that (creating ensembles) is much more important than using the latest method.
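The ensemble step itself is conceptually tiny. A hedged sketch of majority voting over per-word predictions; the methods list here is hypothetical (in the talk it would be the ~12 pHash/NN variants):

```python
from collections import Counter

def ensemble_predict(word_audio, methods):
    """Each method maps an audio segment to a candidate word;
    the ensemble answer is the majority vote."""
    votes = Counter(m(word_audio) for m in methods)
    return votes.most_common(1)[0][0]
```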

#1/25: Information Rules

I want to try out a new format which you could call “book commentary”. I’ll quote some text passages and write a short comment about each passage.

Technology changes. Economic laws do not. If you are struggling to comprehend what the Internet means for you and your business, you can learn a great deal from the advent of the telephone system a hundred years ago.

This is great advice, and I can recommend to anybody reading old books about business and economics, especially case studies. I already covered some old business books myself here on the blog and I’m always receptive to recommendations.

We think that content owners tend to be too conservative with respect to the management of their intellectual property. The history of the video industry is a good example. Hollywood was petrified by the advent of videotape recorders. The TV industry filed suits to prevent home copying of TV programs, and Disney attempted to distinguish video sales and rentals through licensing arrangements. All of these attempts failed. Ironically, Hollywood now makes more from video than from theater presentations for most productions. The video sales and rental market, once so feared, has become a giant revenue source for Hollywood.

Interestingly enough, Hollywood is still trying to fight piracy. The next step would probably be to offer cheap versions as a stream (à la Netflix). However, people don’t want to pay too much for a video stream.
I think it may be comparable to the automotive industry at the beginning of the 20th century. There were lots of car manufacturers that produced really high-quality cars which were really expensive; most people couldn’t afford a car at that time. Then came the Ford Model T, which wasn’t as fancy as these other cars, but it was cheap and good enough, and people bought it.
Maybe Hollywood should think about producing movies which don’t cost $200m but instead only $20m.

In competing to become the standard, or at least to achieve critical mass, consumer expectations are critical. In a very real sense, the product that is expected to become the standard will become the standard. Self-fulfilling expectations are one manifestation of positive-feedback economics and bandwagon effects.

A very interesting observation with great consequences. This makes PR much more important than I thought it would be, especially if you apply it to startups. Signals like funding and investors become stronger. And it isn’t so much about the product as about connections, strategic networks, PR and marketing.

The dominant component of the fixed costs of producing information are sunk costs, costs that are not recoverable if production is halted. If you invest in a new office building and you decide you don’t need it, you can recover part of your costs by selling the building. But if your film flops, there isn’t much of a resale market for its script. […] Sunk costs generally have to be paid up front, before commencing production.

We’ve seen some movies which ran horribly in the cinema but did great in the DVD market, so there is some recoverability. However, the movie can still flop. One method to cover the costs is upfront investment: Kickstarter basically enables this for a mass market, and some game studios took this approach to produce games which wouldn’t have been backed by a publisher (see Doublefine Adventures).

The key to reducing average cost in information markets is to increase sales volume. Think of how a TV show is marketed. It’s sold once for prime time play in the United States. Then it’s sold again for reruns during the summer. If it is a hot product, it’s sold abroad and syndicated to local stations. The same good can be sold dozens of times. The most watched TV show in the world is Baywatch, which is available in 110 countries and has more than 1 billion viewers. […] The shows are cheap to produce, have universal appeal, and are highly reusable.

Basically the Hollywood argument I made above. Lower the production costs but produce more variety and stimulate more innovation.

With information you usually produce the high-quality version first, and then subtract value from it to get the low-quality version.

This is really important for the customer: you don’t want to feel that you paid the normal price for an inferior product. One example is games which ship with less content in the normal version but still cost $50-60. Don’t do that.

The coupons are worthwhile only if they segment the market. A coupon says “I’m a price-sensitive consumer. You know that’s true since I went to all this trouble to collect the coupons.” Economists say that a coupon is a credible signal of willingness to pay. […] What does this have to do with information pricing? Well, suppose that information technology lowers search costs so that everyone can “costlessly” find the lowest price. This means that sales are no longer a very good way to segment the market. Or suppose that software agents can costlessly search the net for cents-off coupons. In this case, the coupons serve no useful function.

I found this passage quite interesting. Basically, sites like Groupon are too easy to use, so people don’t segment themselves that well. Furthermore, there are lots of sites which offer coupon codes, so today a sale is probably more appropriate for most online shops.

The rights management strategy is a twist on the versioning strategy described in Chapter 3. There we argued that you should offer a whole product line of information goods. The cheap versions (which can even be free) serve as advertisements for the high-priced versions.

Freemium, described over 12 years ago. Interestingly enough, McAfee has used a freemium model since 1993, and before that they used a “pay what you think”-model, which was also quite revolutionary for its time.

Of course, a new brand can emerge that is easy to learn, thus reducing switching costs. Indeed, one strategy for breaking into a market with significant brand-specific customer training is to imitate existing brands or otherwise develop a product that is easy to learn. Borland tried this with Quattro Pro, aimed at Lotus 1-2-3 users, and Microsoft Word has built-in, specially designed help for (former!) WordPerfect users.

We’ve seen this in the online market quite recently, e.g. with WordPress and tumblr. I wonder if we’ll see a better word processor in the future.

What happens when perfect competition meets lock-in? […] Think about the extreme case in which you face fierce competition from equally capable rivals to attract customers in the first place. Both you and your rivals know that each customer will be locked into whatever vendor he or she selects. The result is that competition indeed wrings excess profits out of the market, but only on a life-cycle basis. The inescapable conclusion: firms will lose money (invest) in attracting customers, and (just) recoup these investments from profitable sales to locked-in customers.

Normally, you would assume that lock-in leads to excess profits in a market, but it doesn’t. You can talk about quasi-profits: the lock-in requires a negative investment at the start, and once you have locked in a customer, he returns the investment costs over his lifetime. That is, if you want to make excess profits, you still have to rely on product differentiation and/or cost leadership.
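A toy calculation makes the quasi-profit point concrete (all numbers invented for illustration):

```python
# Under perfect competition, vendors bid the up-front acquisition
# subsidy up until life-cycle profit is roughly zero.
acquisition_cost = 120   # e.g. discounts/free period to win the customer
margin_per_year = 40     # profit extracted from the locked-in customer
expected_lifetime = 3    # years until the customer churns

lifecycle_profit = expected_lifetime * margin_per_year - acquisition_cost
print(lifecycle_profit)  # 0 -- the lock-in rents were competed away up front
```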

If you give your product away, anticipating juicy follow-on sales based on consumer loyalty/switching costs, you are in for a rude surprise if those switching costs turn out to be modest.

That’s when freemium goes wrong. If you are in a market with low or non-existent lock-in costs, e.g. throwaway e-mail providers or image-uploading sites, freemium probably won’t work.

Another approach is to rely on versioning by offering long-standing customers enhanced services or functionality. Extra information makes a great gift: it is cheap to offer, and long-standing customers are likely to place a relatively high value on enhancements.

I really like this idea. Often you see introductory offers, like 20% off the subscription, but after this period you either don’t care about the price, feel ripped off because you have to pay more, or cancel your current account and get another introductory offer.
However, if you reward long-term customers with some useful add-ons, they have no incentive to do the latter and will probably appreciate the extra add-on. Varian and Shapiro discuss this in the book at greater length.

The beautiful if frightening implication: success and failure are driven as much by consumer expectations and luck as by the underlying value of the product. A nudge in the right direction, at the right time, can make all the difference. Marketing strategy designed to influence consumer expectations is critical in network markets.

See the quote above. Early adopters are really important, and early press coverage can greatly increase the probability of success.

The revolution strategy involves brute force: offer a product so much better than what people are using that enough users will bear the pain of switching to it. […] The revolution strategy is inherently risky. It cannot work on a small scale and usually requires powerful allies.

The authors quote Grove’s 10X as a revolutionary metric. I have two nice examples which fit this quote.
Firstly, Google+, which didn’t offer a 10X improvement and wasn’t an evolution of Facebook either. Furthermore, the group in the beta phase was too small.
Another example is open-source clones of proprietary products. More often than not, free source code or enhanced privacy aren’t a 10X improvement.

In addition to launching your product early, you need to be aggressive early on to build an installed base of customers. Find the “pioneers” who are most keen to try new technology and sign them up swiftly.

Really important, see Crossing the Chasm.

All in all, I really liked this book. I think it’s probably a must-read for internet entrepreneurs. What I personally found really interesting was to see which people endorsed this book; one of them is Eric Schmidt (Google’s ex-CEO), who said in an interview about Google+ and its chances to beat Facebook: “It’s very hard to beat a fast-moving incumbent in exactly the same game in technology because it changes so quickly.”

If you are interested in more detail about the content of the book, its website offers free presentation material for college courses. Great book, great writing.

Intro to Data Science (UCB)

An hour ago someone posted on Hacker News about this course at UC Berkeley.
You can find the slides and the videos from last year, or the slides only from this year. The material looks pretty basic but covers data preparation over two weeks, which is quite rare but really important.
Coming from a university that basically ignored everything that wasn’t academic, two things stand out.
Firstly, there are guest lectures by people from Google, Optimizely, Yahoo, etc., and they are generally quite interesting.
Secondly, the freedom in choosing the final projects is awesome. You can freely choose data sets that interest you and play with them. There was a wide variety of data, ranging from YouTube and Last.fm to basketball and Yelp.

Generally, I think this is a pretty good intro course to the topic. Most universities try to over-theorize such basic courses and, in my opinion, talk too much about maths and too little about data gathering and EDA.

Also, there was a particularly good comment in the HN thread:

Someone (can’t recall the source, sorry) recently defined “data scientist” as “a data analyst who lives in California.” —baconner

Stamina and Simplicity

Yesterday I read another chapter of Founders at Work: chapter 8, about Evan Williams, founder of Blogger.com and later Twitter.
The story of Blogger.com was really tough, especially after all the employees and his co-founder left the startup because they had run out of money. But Evan stayed and kept the server and the service up. In 2001 Evan started adding some paid services to Blogger, and two years later Google acquired Blogger.com.

“Simplicity is powerful.” – Evan Williams

He showed this with both Blogger and Twitter.
I think these are two powerful principles: stamina and simplicity. Often things don’t have to be complex. Complexity has some bad side effects; I think everyone who has worked on a bigger project, codebase or organization knows this.

“Il semble que la perfection soit atteinte non quand il n’y a plus rien à ajouter, mais quand il n’y a plus rien à retrancher.” -Antoine de Saint-Exupéry

Translation: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” -Antoine de Saint-Exupéry

This could be a way to arrive at simple and great (business) ideas. Take your first sketch and try to make it simpler. If it remains too complex, maybe you can’t put it into practice; then try another idea or think harder about how to make your old one simpler.

There are several pros of simple ideas.

  • Your idea is less fragile in the face of small changes
  • You can manage it without 90 people in administration
  • More people will understand (and use) it
  • The chance to survive is higher

“Must I be an inventor?”

The idea is not that important. Google did not invent search engines. Apple did not invent mobile mp3 players. Microsoft did not invent operating systems. So you should/must be an innovator.

Google said “Hey, search engines are great, but the results suck. We need a better search algorithm.”
Apple said “Hey, mp3 players are great, but they are not stylish and easy to use, make them look cool and easy to use.”
Microsoft said “Hey, IBM is looking for a new operating system for their new personal computer, we know where to buy one.” Oops, bad example.