Google uses a modified Tesseract, which they have released as free software. What works for them is that they have a huge dataset. You see, showing an OCR engine 100 examples of a letter makes it better; showing it 2,000 can sometimes make it worse, because it starts paying attention to details that are not meaningful; but showing it 100 billion examples makes it near-godlike. Another thing is that they have a lot of skilled engineers who tweak the engine and who use linguistic knowledge to make it more accurate. If the OCR cannot decide between three interpretations of a word, you see, it chooses the one that exists in a dictionary and makes grammatical sense in that place in the text. This is at least how it looks from the perspective of a long-time OCR user. Remember, however, that Google Books had already been OCR-ed and corrected by humans via reCAPTCHA (this is apparently what Google used it for).
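To make the dictionary idea concrete, here is a toy sketch of choosing among OCR candidate readings of a word. The word list, frequencies, and candidates are all invented for illustration; real engines like Tesseract combine character confidences with much richer language models, not a lookup this crude.

```python
# Hypothetical corpus frequencies; a real system would use a large
# dictionary plus grammatical context, not a three-word table.
WORD_FREQ = {"modern": 120, "modem": 15, "madern": 0}

def pick_word(candidates, freq=WORD_FREQ):
    """Prefer candidates found in the dictionary; break ties by frequency."""
    in_dict = [w for w in candidates if freq.get(w, 0) > 0]
    pool = in_dict or candidates  # fall back to raw candidates if none match
    return max(pool, key=lambda w: freq.get(w, 0))

# Given three visually similar readings, the in-dictionary, most
# frequent one wins:
print(pick_word(["modem", "modern", "madern"]))
```

The same scoring idea extends naturally to grammar: replace raw word frequency with the probability of the word in its sentence context.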
You also see, we have a good idea of the best way to improve OCR accuracy by working with the data: you don't want to change the training sets so much as give them lots of additional examples. That is exactly why the researchers at Google decided to release the engine as open source instead of waiting until they had the perfect way to do things; they released it the way a scientist would. Google's ability to adapt its algorithms has been one of the most impressive and unexpected things in this field. With many millions of examples on hand, it could keep improving the OCR algorithms it used until they were near-godlike. What is still missing is the best way to do it.