Recently, I checked some Defcon presentations and stumbled upon this beauty. It’s a presentation about cracking Google’s voice captcha by the guys of the Defcon Group 949.
First of all: you can find more information, the code, the corpus, etc. on their project page.
The video isn’t directly from one of the Defcons but from LayerOne.
Let’s start with the summary:
words were distinguishable because their frequencies differed from those of the background noise
they collected about 50k samples and labeled them by hand
Google used only 58 words
Two primary methods were used:
pHash: provides similar hashes for similar “media” files
Neural networks with lots of input nodes
Different NNs and pHash were combined; the best-performing ensemble consisted of about 12 methods
The audio captchas were phonetic-based instead of spelling-based (e.g. blu and blue are the same)
This allowed for mashing:
Four and Fork => Fourk matches both
Seven and Oven => Soven
Then they wrote an automatic merge finder which found dozens of mashings
the finder took two random words and created a new string with a small Levenshtein distance to both parent strings
Afterwards, they applied contextual merging, based on probability, on top of the words predicted by the NN
Solving one captcha took about 2 seconds; the biggest bottleneck was internet speed
A human needs at least 8 seconds, the length of the audio alone
There were two systems. An old one, which could be activated if JS was disabled, which included 2 voices and 10 digits and the new one with one voice and 58 words. There was some research done on the old system by Stanford and CMU. Stanford achieved about 1.3% accuracy, CMU about 58%. These guys achieved, on the newer system, 99.1% accuracy. Just amazing! However, Google changed the system a few hours before their presentation and their accuracy dropped down to 0%.
There were about 20 – 25 million audio captchas, i.e. if you solve enough of them you start to get duplicates. They created a lookup table which provided 61% accuracy in about 0.005 seconds.
The countermeasure by Google consisted of the following:
Same frequencies for words and background noise => makes it harder to split words
10 instead of 5 words per captcha
25 seconds instead of 8 seconds in length
added new words
the background noise now consists of actual English words instead of reversed radio broadcasts
The big problem with these countermeasures is that humans now only have about a 30% success rate. Reminds me of Rapidshare’s infamous cat captchas.
Great talk, extremely interesting. Especially interesting is that they show, once again, that it doesn’t really matter whether you use NNs, SVMs or RBMs for prediction. The work before that, i.e. labeling by hand, feature extraction and learning about the system (mashing), and after that, i.e. creating ensembles, is much more important than using the latest method.
Aaron Patzer, founder of mint.com, talks about how he got from an idea to $170 million in just three years.
What can I learn?
Don’t be secretive: Aaron’s first idea was goal-setting software. Instead of building it in stealth mode, he decided to talk to people. Only about one in eighty of them found the idea appealing, so he discarded it. He stayed open later on, too: he showed mint’s UI to lots of people and optimized it over time.
Build a prototype: If you are raising money, looking for customers or hiring people, a prototype is a real advantage. People can actually see your idea in action, click around and feel the experience. Would you rather listen to a 10-minute sales pitch or play around with a new piece of software for 10 minutes?
Leverage your success: If you are in the news once, try to stay there. Mint had a pretty clever idea. They gave away free mint mojitos at the Techcrunch 40 and were eventually voted people’s choice. Then they asked other journalists whether they wouldn’t want to interview the people’s choice winner. And so on.