Tuesday, January 04, 2011

Google and the limits of tweaking

Late last year, several observers wrote about what they see as a deterioration in the quality of Google's search results:
Content providers have been trying to "game" Google's results ever since Google became a serious search engine player, but Google has always been able to adapt its algorithms to keep the best results coming up at the top. Now, however, it looks like the gamers are winning, and that's opening the door for other search engines.

This may not be a perfect, or even a relevant, analogy, but it may help explain what Google is facing. Twenty-five years ago, Kurzweil was the only company whose machines could read virtually any typeface and convert it to ASCII (optical character recognition, or OCR). They did it by having the machine operator scan in examples of the material to be converted, and then individually identify each character ('this is an "L"... this is an "I"... this is a lower-case "i"') until the reader could understand the test set. Then the operator could scan in the complete set of documents, and the Kurzweil device would read and convert them. However, there were always characters that it still couldn't read, and the operator would have to stop and correct the mistakes. These corrections would further train the system.

The Kurzweil system could only recognize a limited number of typefaces at a time; given too many, it began to confuse them. Over time, more training and corrections actually led to lower accuracy, as the system could no longer distinguish between similar characters such as "e", "o" and "q"; "E" and "F"; "D" and "O"; or "I", "i", "L", "l" and "1". Early systems relied on character shapes alone and didn't use dictionaries or context checks. As a result, at some point the operator had to discard the training set and train the device all over again.
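
To make that missing piece concrete, here is a minimal Python sketch of the kind of dictionary check those early systems lacked. It is not how any Kurzweil or Calera product actually worked. The confusion groups come from the look-alike characters listed above, while the tiny word list and the function names (candidates, dictionary_correct) are made up purely for illustration.

```python
from itertools import product

# Hypothetical confusion groups, taken from the look-alike characters above.
CONFUSABLE = [set("eoq"), set("EF"), set("DO"), set("IiLl1")]

# Toy word list standing in for a real dictionary (an assumption).
DICTIONARY = {"limit", "line", "fill", "done"}


def candidates(ch):
    """Return every character a shape-only reader might confuse with ch."""
    for group in CONFUSABLE:
        if ch in group:
            return sorted(group)
    return [ch]


def dictionary_correct(raw_word):
    """Return the dictionary words reachable by swapping look-alike characters."""
    options = [candidates(ch) for ch in raw_word]
    spellings = ("".join(combo) for combo in product(*options))
    return sorted({s for s in spellings if s in DICTIONARY})
```

With this toy setup, dictionary_correct("1imit") returns ["limit"]: shape alone cannot tell the "1" from an "l", but only one of the possible spellings is a word. A word that matches nothing in the list returns an empty result, which is the point at which a real system would presumably still have to ask the operator.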

True algorithmic recognition systems from Palantir/Calera later solved the problem and could read the vast majority of typefaces without any training. Through acquisitions and mergers, the technologies of Kurzweil and Palantir/Calera ended up under one roof at ScanSoft, and are currently sold as OmniPage 17 by Nuance.

My point is that the training technology of Kurzweil eventually reached its limit. Even after the company added the best fixes it could think of, its technology was supplanted by algorithmically based shape recognition, augmented with dictionaries and context analysis. Google could now face the same challenge. Having tweaked and augmented its search algorithms for years, it may no longer be able to keep up with attempts to game its system. To truly fix the problem, Google may have to either switch to a fundamentally different search and filtering technology, or bolt on a radically different approach, such as social searching.

As the Kurzweil case suggests, technologies have limits, and once those limits are reached, it may take radical, not just incremental, changes to the technologies in order to either get further improvements or to avoid going backward.
