OCR (Optical Character Recognition)
OCR is a technology that converts images of text, such as scanned documents, photos, or image-only PDFs, into editable, searchable text that computers can process.
How OCR works
OCR engines analyze the shapes of characters in an image, match them to known letter patterns, and output digital text. Modern OCR engines use machine learning models trained on millions of text samples, which keeps them accurate across dozens of fonts, languages, and document layouts.
The typical OCR pipeline has four stages: preprocessing (deskewing, denoising, binarization), layout analysis (finding columns, paragraphs, tables), character recognition (identifying individual letters), and postprocessing (dictionary correction, formatting recovery).
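The stages above can be sketched with a deliberately tiny example. This is a toy under stated assumptions, not how Tesseract or any production engine works: the 3x3 glyph templates, the threshold of 128, and the function names `binarize`, `recognize`, and `correct` are all made up for illustration. Layout analysis is skipped because the input is a single glyph.

```python
# Toy illustration only: real engines use trained models,
# not fixed 3x3 templates.

# Hypothetical 3x3 glyph templates (1 = ink, 0 = background).
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def binarize(gray, threshold=128):
    """Preprocessing: map grayscale pixels (0-255) to ink (1) or background (0)."""
    return tuple(tuple(1 if px < threshold else 0 for px in row) for row in gray)


def recognize(glyph):
    """Character recognition: pick the template with the fewest differing pixels."""
    def distance(template):
        return sum(a != b
                   for row_g, row_t in zip(glyph, template)
                   for a, b in zip(row_g, row_t))
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))


def correct(word, dictionary):
    """Postprocessing: snap a word to a same-length dictionary entry one character away."""
    if word in dictionary:
        return word
    for entry in dictionary:
        if len(entry) == len(word) and sum(a != b for a, b in zip(entry, word)) == 1:
            return entry
    return word


# A noisy scan of the letter "L": low values are dark ink.
scan = [
    [30, 240, 250],
    [20, 110, 245],  # 110 is a smudge that binarizes to ink
    [25, 35, 40],
]
letter = recognize(binarize(scan))
print(letter)
print(correct("TIIT", ["TILT", "LILT"]))
```

The smudged pixel survives binarization, yet the glyph still lands on the closest template. Recognition tolerates some noise in the same way at much larger scale, and the dictionary step then cleans up whatever single-character mistakes remain.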
When you need OCR
- Converting scanned paper documents into searchable PDFs
- Extracting text from photographs of books, receipts, or whiteboards
- Making old document archives searchable
- Reading text from screenshots for accessibility
- Digitizing forms and invoices for data entry
Accuracy and language support
Modern OCR achieves 95-99% accuracy on clean printed text. Accuracy drops with poor scan quality, unusual fonts, handwriting, or complex layouts. Tesseract 5 (the open-source engine Konomic uses) supports 100+ languages across scripts including Latin, Cyrillic, CJK, Arabic, and Hebrew.
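One common way to put a number on OCR quality is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that metric. The text above does not say which measure its percentages use, and the function names and sample strings are illustrative only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Fraction of reference characters recovered correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# Classic OCR confusions: the letter "l" and the digit "1" swapped.
reference = "Invoice total: 100"
ocr_output = "Invoice tota1: l00"
accuracy = character_accuracy(reference, ocr_output)
print(f"{accuracy:.1%}")
```

Two wrong characters out of 18 already pull this short string below 90%, which is why small per-character error rates still matter for fields like invoice totals.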
OCR vs. text extraction: the difference
A digitally created PDF already contains text that can be selected and copied, so no OCR is needed. OCR is required only when the text exists solely as an image (scanned pages, photos). If you can't select text in a PDF by clicking and dragging, you probably need OCR.
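The click-and-drag test can be approximated in code: extract each page's text layer and flag pages where almost nothing comes back. The sketch below is a heuristic under assumptions, not a definitive check. The 20-character cutoff and both function names are arbitrary choices for illustration, and the per-page text is assumed to have been pulled out beforehand by a PDF library (that step is not shown).

```python
def page_needs_ocr(extracted_text, min_chars=20):
    """Return True when a page's text layer is empty or too thin to be real content."""
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars


def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages that likely need OCR."""
    return [number
            for number, text in enumerate(page_texts, start=1)
            if page_needs_ocr(text, min_chars)]


# Example input: text already extracted from each page of a mixed PDF.
page_texts = [
    "Quarterly report. Revenue grew in all regions during the period.",
    "",           # scanned page: no text layer at all
    "  \n 3 \n",  # scanned page where only a page number is real text
]
to_ocr = pages_needing_ocr(page_texts)
print(to_ocr)
```

A character threshold, rather than a plain empty-string check, catches scanned pages that carry a stray digital page number or stamp. It can misfire on pages that are genuinely almost blank, so treat the result as a hint.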