I’ve always used the acronym OCR to mean “Optical Character Recognition.” It’s an industry term, so you’ve no doubt heard it before.

Some Portable Document Format (PDF) files contain one or more pictures, one per page. Very often, each picture shows a page of text. An OCR application examines each shape within a user-defined area of the picture and estimates its identity, on the underlying assumption that each shape represents a letter of the alphabet or a punctuation mark.

For the purpose of this answer, we may classify PDF files into two categories: (a) those with embedded OCR, and (b) those without embedded OCR. A PDF without embedded OCR is a set of one or more images written as one file. A PDF with embedded OCR stores the recognized text along with those images, which is what makes it searchable.

If your eyes are glazing over, I don’t blame you. Here is a video summary (YouTube). The discussion of embedded OCR text begins at 0:15.

OCR PDF: All You Need to Know
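
If you would rather check a file yourself, the two categories above can be guessed at with a few lines of Python. This is only a rough sketch and not how any particular OCR product works: the function name `classify_pdf` and its return labels are made up for illustration, and it assumes that a text layer shows up as a `/Font` entry in the raw file bytes while a scanned page shows up as an image object. Many modern PDFs compress those entries, so the answer can come back "unknown" even for a perfectly ordinary file; a proper PDF library is the reliable route.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Crude guess at whether raw PDF bytes carry an embedded text layer.

    Assumption: text on a page needs a font, so an uncompressed "/Font"
    entry hints at embedded text; an image object with no font hints at
    a scan with no embedded OCR. Compressed object streams hide both.
    """
    has_font = re.search(rb"/Font\b", data) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_font:
        return "embedded text likely"
    if has_image:
        return "image-only likely"
    return "unknown"
```

To try it, read a file in binary mode and pass the bytes in, for example `classify_pdf(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. A quicker test with no code at all: open the PDF and try to select or search for a word. If you can, the text is embedded.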

