Recently, I checked some Defcon presentations and stumbled upon this beauty. It’s a presentation about cracking Google’s voice captcha by the guys of Defcon Group 949 (DC949).
First of all: you can find more information, the code, the corpus, etc. on their project page.
The video isn’t directly from one of the Defcons but from LayerOne.
Let’s start with the summary:
- words were distinguishable because they occupied different frequencies than the background noise
- they collected about 50k samples and labeled them by hand
- Google used only 58 words
- Two primary methods were used:
- pHash: provides similar hashes for similar “media” files (a toy sketch follows this list)
- Neural networks with lots of input nodes
- Different NNs and pHash were combined, and the best-performing ensemble consisted of about 12 methods (a voting sketch follows this list)
- The audio captchas were phonetic-based instead of spelling-based (e.g. blu and blue count as the same answer)
- This allowed for mashing:
- Four and Fork => Fourk matches both
- Seven and Oven => Soven
- Then they wrote an automatic merge finder, which found dozens of mashings
- the finder took two random words and generated a merged string with a small Levenshtein distance to both parent strings (see the sketch after this list)
- Afterwards, they applied contextual merging, based on probabilities, to the top words from the NN
- Solving one captcha took about 2 seconds; the biggest bottleneck was internet speed
- A human, in contrast, needs 8 seconds just to listen to the audio
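
To make the pHash idea concrete, here is a toy sketch. This is not the actual pHash library they used; the energy-based fingerprint below is my own stand-in for the concept: similar-sounding segments should yield hashes with a small Hamming distance, so a segment can be labeled by its nearest reference hash.

```python
# Toy illustration of the perceptual-hashing idea (NOT the real pHash
# library the team used; the fingerprint below is my own stand-in).
import numpy as np

def toy_audio_phash(samples: np.ndarray, n_bits: int = 64) -> int:
    """Reduce an audio segment to a coarse energy fingerprint: one bit per
    chunk, set when the chunk's energy exceeds the median energy."""
    chunks = np.array_split(samples.astype(float), n_bits)
    energies = np.array([np.mean(c ** 2) for c in chunks])
    bits = energies > np.median(energies)
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(h1: int, h2: int) -> int:
    """Few differing bits => perceptually similar segments."""
    return bin(h1 ^ h2).count("1")

def classify(segment: np.ndarray, references: dict) -> str:
    """Label a segment with the word whose reference hash is closest."""
    h = toy_audio_phash(segment)
    return min(references, key=lambda word: hamming(h, references[word]))
```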
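The ensemble part could look roughly like this. How they actually weighted the different NNs against pHash isn’t stated, so a plain majority vote over per-word predictions is assumed here:

```python
# Sketch of combining several solvers by majority vote. How DC949 actually
# weighted the NNs against pHash isn't stated, so a plain vote is assumed.
from collections import Counter
from typing import Callable, Sequence

Solver = Callable[[bytes], str]  # each method maps an audio segment to a word

def ensemble_predict(solvers: Sequence[Solver], segment: bytes) -> str:
    """Every method votes on the word; the most common answer wins."""
    votes = Counter(solver(segment) for solver in solvers)
    return votes.most_common(1)[0][0]
```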
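And the merge finder, sketched. The talk doesn’t spell out how the candidate strings were generated, so the sketch below uses the shortest common supersequence of the two parent words as the candidate and keeps it only if its Levenshtein distance to both parents is small, which reproduces the “Fourk” example:

```python
# Sketch of the automatic merge finder. Candidate generation via shortest
# common supersequence is my assumption; the Levenshtein check is from the talk.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def scs(a: str, b: str) -> str:
    """Shortest string containing both a and b as subsequences."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: SCS of a[i:], b[j:]
    for i in range(m):
        dp[i][n] = a[i:]
    for j in range(n):
        dp[m][j] = b[j:]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = a[i] + dp[i + 1][j + 1]
            elif len(dp[i + 1][j]) <= len(dp[i][j + 1]):
                dp[i][j] = a[i] + dp[i + 1][j]
            else:
                dp[i][j] = b[j] + dp[i][j + 1]
    return dp[0][0]

def find_mash(w1: str, w2: str, max_dist: int = 1):
    """Keep the merged string only if it stays close to both parents."""
    cand = scs(w1, w2)
    if levenshtein(cand, w1) <= max_dist and levenshtein(cand, w2) <= max_dist:
        return cand
    return None

print(find_mash("four", "fork"))   # -> "fourk", distance 1 to each parent
```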
There were two systems: an old one, which could be activated if JS was disabled and which used 2 voices and 10 digits, and the new one with one voice and 58 words. There was some prior research on the old system by Stanford and CMU: Stanford achieved about 1.3% accuracy, CMU about 58%. These guys achieved 99.1% accuracy on the newer system. Just amazing! However, Google changed the system a few hours before their presentation and their accuracy dropped to 0%.
There were only about 20 – 25 million audio captchas in total, i.e. if you solve enough of them, you start getting duplicates. They built a lookup table which by itself provided 61% accuracy in about 0.005 seconds per captcha (sketched below).
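A minimal sketch of such a lookup table, assuming captchas are keyed by a digest of their raw audio bytes (the keying is my assumption; the talk only mentions the table and its hit rate):

```python
# Sketch of the duplicate-lookup table. Keying on a digest of the raw audio
# bytes is an assumption; the talk only mentions the table and its hit rate.
import hashlib
from typing import Dict, Optional

solved: Dict[str, str] = {}   # digest of audio bytes -> known transcription

def remember(audio: bytes, answer: str) -> None:
    """Cache a captcha once the ensemble has solved it."""
    solved[hashlib.sha1(audio).hexdigest()] = answer

def lookup(audio: bytes) -> Optional[str]:
    """O(1) hit if this exact captcha was seen before; on a miss,
    fall back to the pHash/NN ensemble."""
    return solved.get(hashlib.sha1(audio).hexdigest())
```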
The countermeasure by Google consisted of the following:
- Same frequencies for words and background noise => makes it harder to split words
- 10 instead of 5 words per captcha
- 25 seconds instead of 8 seconds in length
- added new words
- the background noise now consists of actual English words instead of reversed radio broadcasts
The big problem with this countermeasure is that humans only achieve about a 30% success rate. Reminds me of Rapidshare’s infamous cat captchas.
Great talk, extremely interesting. Especially interesting is that they show once again that it doesn’t really matter whether you use NNs, SVMs, or RBMs for prediction; the work before that (labeling by hand, feature extraction, and learning about the system, e.g. mashing) and after that (building ensembles) is much more important than using the latest method.