Optical Character Recognition sounds like it should be a solved problem by now. Point a library at an image, get text back. Done.
In practice, OCR is fussier than that. The same library that perfectly reads a crisp screenshot will return garbage when fed a photograph of a document. Lighting, resolution, skew, and font choice all matter more than you'd expect.
This guide covers the practical workflow for OCR in Python—not just the happy path, but the preprocessing steps that make the difference between usable output and nonsense.
## The Tools
Tesseract is the dominant open-source OCR engine. Originally developed by HP in the 1980s, it was open-sourced in 2005 and developed by Google for over a decade; today it's community-maintained. It's not the only option, but it's free, well-documented, and good enough for most use cases.
pytesseract is the Python wrapper around Tesseract. It doesn't do the OCR itself—it calls the Tesseract binary and returns the results.
Pillow and OpenCV handle image loading and preprocessing. You'll use one or both depending on how much image manipulation you need.
## Installation
Tesseract is a system dependency, not a Python package. Install it first.
```bash
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

# Windows - download installer from GitHub
# https://github.com/UB-Mannheim/tesseract/wiki
```

Then install the Python packages:

```bash
pip install pytesseract pillow opencv-python
```

Verify Tesseract is accessible:

```bash
tesseract --version
```

If pytesseract can't find Tesseract, you may need to specify the path explicitly in your code. On Windows especially, the default installation path isn't always in the system PATH.
```python
import pytesseract

# Windows example - adjust path as needed
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```

## Basic OCR: The Happy Path
When the stars align—clean image, clear text, good contrast—OCR is straightforward:
```python
from PIL import Image
import pytesseract

# Load image
image = Image.open('document.png')

# Extract text
text = pytesseract.image_to_string(image)
print(text)
```

That's it for simple cases. A screenshot of a webpage, a scanned document at 300 DPI, a PDF converted to image—these typically work without fuss.
But most real-world images aren't this cooperative.
## Why OCR Fails (And How to Fix It)
OCR engines expect black text on a white background, clearly separated characters, minimal noise, and proper orientation. When images deviate from this ideal, accuracy drops.
The fix is preprocessing—transforming the image before OCR to match what the engine expects.
### Grayscale Conversion
Color information is noise for text recognition. Convert to grayscale first.
```python
import cv2

image = cv2.imread('document.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
```

### Thresholding (Binarization)
Convert grayscale to pure black and white. This eliminates gradients and makes text edges crisp.
```python
# Simple threshold
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Otsu's method - automatically finds optimal threshold
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```

Otsu's method works well when there's a clear distinction between text and background. For images with uneven lighting, adaptive thresholding is better.
```python
# Adaptive threshold - handles uneven lighting
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    11,  # block size
    2    # constant subtracted from mean
)
```

### Noise Removal
Scanned documents often have specks and artifacts. Median blur removes salt-and-pepper noise while preserving edges.
```python
denoised = cv2.medianBlur(gray, 3)
```

For heavier noise, morphological operations can help:
```python
# Kernel must be larger than 1x1, or the operation does nothing
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```

### Deskewing
Tilted text confuses OCR engines. If your document is rotated, fix it first.
```python
import numpy as np

def deskew(image):
    """Rotate a binary image so its text lines are horizontal."""
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]

    # minAreaRect reports angles in a 90-degree range;
    # map to the smallest rotation that levels the text
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90

    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        image, matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )
    return rotated
```

### Rescaling
Tesseract works best with text height around 30-40 pixels. Too small, and characters blur together. Too large, and the engine may not recognize them as text.
```python
def rescale_for_ocr(image, target_height=40):
    """Scale image so text is approximately target_height pixels tall."""
    # This is a rough heuristic - adjust based on your documents
    scale = 2.0  # or calculate based on detected text size
    width = int(image.shape[1] * scale)
    height = int(image.shape[0] * scale)
    return cv2.resize(image, (width, height), interpolation=cv2.INTER_CUBIC)
```

For scanned documents, 300 DPI is the sweet spot. Lower resolution loses detail; higher resolution increases processing time without improving accuracy.
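To make the DPI arithmetic concrete, here's a small helper (the function name and numbers below are illustrative, not part of any library) that computes the pixel dimensions needed to bring a scan up to a target DPI:

```python
def target_size(width, height, source_dpi, target_dpi=300):
    """Pixel dimensions needed to bring an image from source_dpi to target_dpi."""
    scale = target_dpi / source_dpi
    return round(width * scale), round(height * scale)

# A US Letter page scanned at 150 DPI is 1275x1650 pixels;
# reaching 300 DPI means doubling both dimensions
print(target_size(1275, 1650, source_dpi=150))  # (2550, 3300)
```

Feed the result to `cv2.resize` with `cv2.INTER_CUBIC`, as in the rescaling snippet above.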
## A Complete Preprocessing Pipeline
Combining these techniques into a reusable function:
```python
import cv2
import numpy as np
from PIL import Image
import pytesseract

def preprocess_for_ocr(image_path):
    """
    Preprocess an image for optimal OCR results.
    Returns a PIL Image ready for pytesseract.
    """
    # Load image
    image = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.medianBlur(gray, 3)

    # Apply adaptive thresholding
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        11, 2
    )

    # Convert back to PIL Image for pytesseract
    return Image.fromarray(binary)

def extract_text(image_path):
    """Extract text from an image with preprocessing."""
    processed = preprocess_for_ocr(image_path)
    text = pytesseract.image_to_string(processed)
    return text.strip()

# Usage
text = extract_text('receipt.jpg')
print(text)
```

Not every image needs every preprocessing step. A clean screenshot needs nothing. A photo of a whiteboard needs all of it. Experiment with your specific images.
## Configuration Options
Tesseract has configuration options that affect recognition. The most useful:
### Page Segmentation Modes (PSM)
Tells Tesseract what kind of content to expect:
| PSM | Description | Use Case |
|---|---|---|
| 3 | Fully automatic | Default, works for most documents |
| 4 | Single column of variable-sized text | Articles, letters |
| 6 | Single uniform block of text | Paragraphs |
| 7 | Single line of text | Headers, captions |
| 8 | Single word | Labels, buttons |
| 11 | Sparse text, no particular order | Receipts, forms |
| 13 | Raw line, treat as single line | When other modes fail |
```python
# Single line of text
text = pytesseract.image_to_string(image, config='--psm 7')

# Sparse text (like a receipt)
text = pytesseract.image_to_string(image, config='--psm 11')
```

### Language Selection
Tesseract supports 100+ languages. Install language packs and specify which to use:
```bash
# Install additional languages
sudo apt install tesseract-ocr-fra tesseract-ocr-deu
```

```python
# French
text = pytesseract.image_to_string(image, lang='fra')

# Multiple languages
text = pytesseract.image_to_string(image, lang='eng+fra')
```

### Character Whitelisting
If you know the text contains only certain characters, restrict recognition:
```python
# Only digits
config = '--psm 7 -c tessedit_char_whitelist=0123456789'
text = pytesseract.image_to_string(image, config=config)

# Only uppercase letters and digits
config = '--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
text = pytesseract.image_to_string(image, config=config)
```

## Extracting Structured Data
Sometimes you need more than raw text. pytesseract can return bounding boxes, confidence scores, and structured data.
### Bounding Boxes
```python
# Get bounding boxes for each character
boxes = pytesseract.image_to_boxes(image)

# Get detailed data including confidence
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# data contains:
# - 'text': recognized text for each element
# - 'conf': confidence score (0-100, -1 for non-text)
# - 'left', 'top', 'width', 'height': bounding box
# - 'level': hierarchy (page, block, paragraph, line, word)
```

### Filtering by Confidence
Low-confidence results are often errors. Filter them out:
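The filtering logic is independent of Tesseract, so it can be shown on a hand-made stand-in for an `image_to_data` result (the words and scores below are invented for illustration):

```python
# Mocked image_to_data(..., output_type=Output.DICT) result:
# low scores flag the garbled '#' and '1O23'
data = {
    'text': ['Invoice', '#', '1O23', 'Total:', '$42.50'],
    'conf': [96, 41, 38, 91, 88],
}

min_confidence = 60
kept = [word for word, conf in zip(data['text'], data['conf'])
        if int(conf) >= min_confidence]

print(' '.join(kept))  # Invoice Total: $42.50
```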
```python
def extract_high_confidence_text(image, min_confidence=60):
    """Extract only text with confidence above threshold."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = []
    for i, conf in enumerate(data['conf']):
        if int(conf) >= min_confidence:
            words.append(data['text'][i])
    return ' '.join(words)
```

## Working with PDFs
PDFs aren't images, so you need an extra step. Convert each page to an image, then OCR.
```bash
pip install pdf2image
```

pdf2image requires Poppler:

```bash
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt install poppler-utils
```

```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Extract text from all pages of a PDF."""
    # Convert PDF to images
    pages = convert_from_path(pdf_path, dpi=300)

    all_text = []
    for i, page in enumerate(pages):
        text = pytesseract.image_to_string(page)
        all_text.append(f"--- Page {i + 1} ---\n{text}")

    return '\n\n'.join(all_text)

text = ocr_pdf('document.pdf')
print(text)
```

If the PDF contains searchable text (not scanned), use PyPDF2 or pdfplumber instead. They extract embedded text directly, which is faster and more accurate than OCR.
```python
import pdfplumber

def extract_pdf_text(pdf_path):
    """Extract text from a PDF with embedded text."""
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text() or ''
    return text
```

## Handling Common Scenarios
### Receipts
Receipts are challenging: thermal paper fades, text is small, layouts vary.
```python
def ocr_receipt(image_path):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Increase contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # Aggressive denoising
    denoised = cv2.fastNlMeansDenoising(enhanced, h=30)

    # Threshold
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # OCR with sparse text mode
    text = pytesseract.image_to_string(
        Image.fromarray(binary),
        config='--psm 11'
    )
    return text
```

### Screenshots
Screenshots are usually clean. Minimal preprocessing needed.
```python
def ocr_screenshot(image_path):
    image = Image.open(image_path)
    # Often works without any preprocessing
    return pytesseract.image_to_string(image)
```

### Handwriting
Tesseract is designed for printed text. Handwriting recognition is a different problem.
For handwriting, consider:
- Google Cloud Vision API — handles handwriting reasonably well
- AWS Textract — good for forms with mixed print and handwriting
- Microsoft Azure Computer Vision — another cloud option
These are paid services, but handwriting OCR is hard enough that the trade-off is often worthwhile.
## When to Use Cloud OCR
Tesseract is good. It's not perfect. Cloud OCR services are sometimes worth the cost:
| Scenario | Recommendation |
|---|---|
| Clean documents, high volume | Tesseract (free, runs locally) |
| Mixed quality, moderate volume | Tesseract with good preprocessing |
| Handwriting | Cloud API |
| Complex layouts (tables, forms) | Cloud API or specialized tools |
| Compliance requirements (data stays local) | Tesseract |
Google Cloud Vision, AWS Textract, and Azure Computer Vision all offer free tiers. Test them against your specific documents before committing.
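The table above reduces to a simple decision rule. As a toy sketch (the function and its flags are invented here purely to make the logic explicit):

```python
def recommend_engine(handwriting=False, complex_layout=False,
                     data_must_stay_local=False):
    """Map the scenarios from the table above to a recommendation."""
    if data_must_stay_local:
        return 'tesseract'  # compliance wins regardless of content
    if handwriting or complex_layout:
        return 'cloud'      # Tesseract struggles with these
    return 'tesseract'      # clean or fixable input: keep it local and free

print(recommend_engine(handwriting=True))                             # cloud
print(recommend_engine(handwriting=True, data_must_stay_local=True))  # tesseract
```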
## Putting It Together
A minimal but complete script that handles most common cases:
```python
import cv2
import numpy as np
from PIL import Image
import pytesseract
from pathlib import Path

def preprocess(image_path):
    """Standard preprocessing pipeline."""
    img = cv2.imread(str(image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    return Image.fromarray(binary)

def ocr(image_path, preprocess_image=True, lang='eng', psm=3):
    """
    Extract text from an image.

    Args:
        image_path: Path to the image file
        preprocess_image: Whether to apply preprocessing
        lang: Tesseract language code
        psm: Page segmentation mode

    Returns:
        Extracted text as a string
    """
    if preprocess_image:
        image = preprocess(image_path)
    else:
        image = Image.open(image_path)

    config = f'--psm {psm}'
    text = pytesseract.image_to_string(image, lang=lang, config=config)
    return text.strip()

if __name__ == '__main__':
    import sys

    if len(sys.argv) < 2:
        print("Usage: python ocr.py <image_path>")
        sys.exit(1)

    image_path = sys.argv[1]
    text = ocr(image_path)
    print(text)
```

## Final Thoughts
OCR accuracy depends more on image quality and preprocessing than on the OCR engine itself. A well-preprocessed image fed to Tesseract will often outperform a raw image fed to expensive cloud services.
The best OCR is the one you don't need. If you can get structured data from the source instead of extracting it from images, do that.
When you do need OCR, start simple. Try the basic approach first. Add preprocessing steps one at a time until accuracy is acceptable. Resist the urge to build a complex pipeline before you've confirmed the simple approach doesn't work.
Most OCR failures aren't algorithmic—they're image quality problems. Better lighting, higher resolution, and cleaner source documents will improve results more than any amount of code optimization.