
A Practical Guide to OCR in Python

Extract text from images and PDFs using Tesseract, with preprocessing techniques that actually matter

Sathyan · 10 min read

Optical Character Recognition sounds like it should be a solved problem by now. Point a library at an image, get text back. Done.

In practice, OCR is fussier than that. The same library that perfectly reads a crisp screenshot will return garbage when fed a photograph of a document. Lighting, resolution, skew, and font choice all matter more than you'd expect.

This guide covers the practical workflow for OCR in Python—not just the happy path, but the preprocessing steps that make the difference between usable output and nonsense.

The Tools

Tesseract is the dominant open-source OCR engine. Originally developed by HP in the 1980s, it was open-sourced in 2005, developed at Google for over a decade, and is now community-maintained. It's not the only option, but it's free, well-documented, and good enough for most use cases.

pytesseract is the Python wrapper around Tesseract. It doesn't do the OCR itself—it calls the Tesseract binary and returns the results.

Pillow and OpenCV handle image loading and preprocessing. You'll use one or both depending on how much image manipulation you need.

Installation

Tesseract is a system dependency, not a Python package. Install it first.

# macOS
brew install tesseract
 
# Ubuntu/Debian
sudo apt install tesseract-ocr
 
# Windows - download installer from GitHub
# https://github.com/UB-Mannheim/tesseract/wiki

Then install the Python packages:

pip install pytesseract pillow opencv-python

Verify Tesseract is accessible:

tesseract --version

If pytesseract can't find Tesseract, you may need to specify the path explicitly in your code. On Windows especially, the default installation path isn't always in the system PATH.

import pytesseract
 
# Windows example - adjust path as needed
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Basic OCR: The Happy Path

When the stars align—clean image, clear text, good contrast—OCR is straightforward:

from PIL import Image
import pytesseract
 
# Load image
image = Image.open('document.png')
 
# Extract text
text = pytesseract.image_to_string(image)
 
print(text)

That's it for simple cases. A screenshot of a webpage, a scanned document at 300 DPI, a PDF converted to image—these typically work without fuss.

But most real-world images aren't this cooperative.

Why OCR Fails (And How to Fix It)

OCR engines expect black text on a white background, clearly separated characters, minimal noise, and proper orientation. When images deviate from this ideal, accuracy drops.

The fix is preprocessing—transforming the image before OCR to match what the engine expects.

Grayscale Conversion

Color information is noise for text recognition. Convert to grayscale first.

import cv2
 
image = cv2.imread('document.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

Thresholding (Binarization)

Convert grayscale to pure black and white. This eliminates gradients and makes text edges crisp.

# Simple threshold
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)
 
# Otsu's method - automatically finds optimal threshold
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

Otsu's method works well when there's a clear distinction between text and background. For images with uneven lighting, adaptive thresholding is better.

# Adaptive threshold - handles uneven lighting
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    11,  # block size (must be odd)
    2    # constant subtracted from the weighted mean
)

Noise Removal

Scanned documents often have specks and artifacts. Median blur removes salt-and-pepper noise while preserving edges.

denoised = cv2.medianBlur(gray, 3)

For heavier noise, morphological operations can help:

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))  # a 1x1 kernel would be a no-op
cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

Deskewing

Tilted text confuses OCR engines. If your document is rotated, fix it first.

import numpy as np
 
def deskew(image):
    """Straighten a binarized image. Expects text as the white (nonzero)
    foreground on black, e.g. the output of cv2.THRESH_BINARY_INV."""
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]

    # minAreaRect reports angles differently across OpenCV versions;
    # normalize to the smallest rotation that straightens the text
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        image, matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )
    return rotated

Rescaling

Tesseract works best with text height around 30-40 pixels. Too small, and characters blur together. Too large, and the engine may not recognize them as text.

def rescale_for_ocr(image, scale=2.0):
    """Upscale so small text reaches a height Tesseract handles well.
    The factor is a rough heuristic - tune it for your documents."""
    width = int(image.shape[1] * scale)
    height = int(image.shape[0] * scale)
    return cv2.resize(image, (width, height), interpolation=cv2.INTER_CUBIC)

For scanned documents, 300 DPI is the sweet spot. Lower resolution loses detail; higher resolution increases processing time without improving accuracy.
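As a sanity check on resolution, page dimensions times DPI give the pixel size a scan should have: a US Letter page at 300 DPI comes out to 2550 × 3300 pixels, so a much smaller image was scanned at a lower resolution. A small helper (the function name is mine, not part of any library):

```python
def pixels_for_dpi(width_in, height_in, dpi=300):
    """Expected pixel dimensions for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

# A US Letter (8.5 x 11 inch) scan at 300 DPI:
print(pixels_for_dpi(8.5, 11))  # (2550, 3300)
```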

A Complete Preprocessing Pipeline

Combining these techniques into a reusable function:

import cv2
import numpy as np
from PIL import Image
import pytesseract
 
def preprocess_for_ocr(image_path):
    """
    Preprocess an image for optimal OCR results.
    Returns a PIL Image ready for pytesseract.
    """
    # Load image
    image = cv2.imread(image_path)
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Remove noise
    denoised = cv2.medianBlur(gray, 3)
    
    # Apply adaptive thresholding
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        11, 2
    )
    
    # Convert back to PIL Image for pytesseract
    return Image.fromarray(binary)
 
def extract_text(image_path):
    """Extract text from an image with preprocessing."""
    processed = preprocess_for_ocr(image_path)
    text = pytesseract.image_to_string(processed)
    return text.strip()
 
# Usage
text = extract_text('receipt.jpg')
print(text)

Not every image needs every preprocessing step. A clean screenshot needs nothing. A photo of a whiteboard needs all of it. Experiment with your specific images.

Configuration Options

Tesseract has configuration options that affect recognition. The most useful:

Page Segmentation Modes (PSM)

Tells Tesseract what kind of content to expect:

PSM   Description                            Use Case
3     Fully automatic                        Default, works for most documents
4     Single column of variable-sized text   Articles, letters
6     Single uniform block of text           Paragraphs
7     Single line of text                    Headers, captions
8     Single word                            Labels, buttons
11    Sparse text, no particular order       Receipts, forms
13    Raw line, treat as single line         When other modes fail

# Single line of text
text = pytesseract.image_to_string(image, config='--psm 7')
 
# Sparse text (like a receipt)
text = pytesseract.image_to_string(image, config='--psm 11')

Language Selection

Tesseract supports 100+ languages. Install language packs and specify which to use:

# Install additional languages
sudo apt install tesseract-ocr-fra tesseract-ocr-deu

# French
text = pytesseract.image_to_string(image, lang='fra')
 
# Multiple languages
text = pytesseract.image_to_string(image, lang='eng+fra')

Character Whitelisting

If you know the text contains only certain characters, restrict recognition:

# Only digits
config = '--psm 7 -c tessedit_char_whitelist=0123456789'
text = pytesseract.image_to_string(image, config=config)
 
# Only uppercase letters and digits
config = '--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
text = pytesseract.image_to_string(image, config=config)
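Whitelisting constrains the engine but doesn't guarantee clean output, so it's worth validating the result afterward. A sketch for digit-only fields like totals (the helper is mine, not part of pytesseract):

```python
import re

def looks_like_amount(text):
    """True if the OCR result matches a plain number like '42' or '19.99'."""
    return bool(re.fullmatch(r'\d+(\.\d{2})?', text.strip()))

print(looks_like_amount('19.99'))  # True
print(looks_like_amount('19,99'))  # False - a comma slipped through
```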

Extracting Structured Data

Sometimes you need more than raw text. pytesseract can return bounding boxes, confidence scores, and structured data.

Bounding Boxes

# Get bounding boxes for each character
boxes = pytesseract.image_to_boxes(image)
 
# Get detailed data including confidence
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
 
# data contains:
# - 'text': recognized text for each element
# - 'conf': confidence score (0-100, -1 for non-text)
# - 'left', 'top', 'width', 'height': bounding box
# - 'level': hierarchy (page, block, paragraph, line, word)
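Unlike image_to_data, image_to_boxes returns plain text: one line per character in the form `char left bottom right top page`, with coordinates measured from the bottom-left corner of the image. A minimal parser for that format (the function name is mine):

```python
def parse_boxes(box_string):
    """Parse Tesseract box output into (char, left, bottom, right, top) tuples."""
    boxes = []
    for line in box_string.strip().splitlines():
        parts = line.split()
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((char, left, bottom, right, top))
    return boxes

# Sample output from image_to_boxes (coordinates are illustrative)
sample = "H 10 12 30 40 0\ni 34 12 44 40 0"
print(parse_boxes(sample))  # [('H', 10, 12, 30, 40), ('i', 34, 12, 44, 40)]
```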

Filtering by Confidence

Low-confidence results are often errors. Filter them out:

def extract_high_confidence_text(image, min_confidence=60):
    """Extract only text with confidence above threshold."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    
    words = []
    for i, conf in enumerate(data['conf']):
        # conf can be an int or a numeric string depending on the
        # pytesseract version; -1 marks non-text elements
        if float(conf) >= min_confidence and data['text'][i].strip():
            words.append(data['text'][i])
    
    return ' '.join(words)
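To make the dict layout concrete, here is the same filtering idea run against a hand-built sample in the shape `Output.DICT` returns (the words and scores are invented for illustration):

```python
def filter_by_confidence(data, min_confidence=60):
    """Keep words whose confidence clears the threshold."""
    return ' '.join(
        word for word, conf in zip(data['text'], data['conf'])
        if float(conf) >= min_confidence and word.strip()
    )

# Invented sample mimicking pytesseract.Output.DICT structure
sample = {
    'text': ['Total:', '19.99', 'l|I1'],
    'conf': [95, 88, 12],  # real output may use strings; float() handles both
}
print(filter_by_confidence(sample))  # Total: 19.99
```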

Working with PDFs

PDFs aren't images, so you need an extra step. Convert each page to an image, then OCR.

pip install pdf2image

pdf2image requires Poppler:

# macOS
brew install poppler
 
# Ubuntu/Debian
sudo apt install poppler-utils

from pdf2image import convert_from_path
import pytesseract
 
def ocr_pdf(pdf_path):
    """Extract text from all pages of a PDF."""
    # Convert PDF to images
    pages = convert_from_path(pdf_path, dpi=300)
    
    all_text = []
    for i, page in enumerate(pages):
        text = pytesseract.image_to_string(page)
        all_text.append(f"--- Page {i + 1} ---\n{text}")
    
    return '\n\n'.join(all_text)
 
text = ocr_pdf('document.pdf')
print(text)

If the PDF contains searchable text (not scanned), use pypdf (the successor to PyPDF2) or pdfplumber instead. They extract embedded text directly, which is faster and more accurate than OCR.

import pdfplumber
 
def extract_pdf_text(pdf_path):
    """Extract text from a PDF with embedded text."""
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text() or ''
    return text
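A common pattern is to try embedded text first and fall back to OCR only when a page yields little or nothing, since scanned PDFs have no text layer. Here is the decision logic sketched with the two extractors passed in as callables, so it can wrap `extract_pdf_text` and `ocr_pdf` above or any other pair (the function name and `min_chars` threshold are my own choices):

```python
def extract_text_with_fallback(pdf_path, extract_embedded, ocr_pages, min_chars=50):
    """Try embedded text first; fall back to OCR if too little comes back."""
    text = extract_embedded(pdf_path)
    if text and len(text.strip()) >= min_chars:
        return text
    return ocr_pages(pdf_path)

# Stub extractors to show the control flow
embedded = lambda path: ''                # scanned PDF: no text layer
ocr = lambda path: 'recovered via OCR'
print(extract_text_with_fallback('scan.pdf', embedded, ocr))  # recovered via OCR
```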

Handling Common Scenarios

Receipts

Receipts are challenging: thermal paper fades, text is small, layouts vary.

def ocr_receipt(image_path):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Increase contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    
    # Aggressive denoising
    denoised = cv2.fastNlMeansDenoising(enhanced, h=30)
    
    # Threshold
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    
    # OCR with sparse text mode
    text = pytesseract.image_to_string(
        Image.fromarray(binary),
        config='--psm 11'
    )
    return text

Screenshots

Screenshots are usually clean. Minimal preprocessing needed.

def ocr_screenshot(image_path):
    image = Image.open(image_path)
    # Often works without any preprocessing
    return pytesseract.image_to_string(image)

Handwriting

Tesseract is designed for printed text. Handwriting recognition is a different problem.

For handwriting, consider:

  • Google Cloud Vision API — handles handwriting reasonably well
  • AWS Textract — good for forms with mixed print and handwriting
  • Microsoft Azure Computer Vision — another cloud option

These are paid services, but handwriting OCR is hard enough that the trade-off is often worthwhile.

When to Use Cloud OCR

Tesseract is good. It's not perfect. Cloud OCR services are sometimes worth the cost:

Scenario                                     Recommendation
Clean documents, high volume                 Tesseract (free, runs locally)
Mixed quality, moderate volume               Tesseract with good preprocessing
Handwriting                                  Cloud API
Complex layouts (tables, forms)              Cloud API or specialized tools
Compliance requirements (data stays local)   Tesseract

Google Cloud Vision, AWS Textract, and Azure Computer Vision all offer free tiers. Test them against your specific documents before committing.

Putting It Together

A minimal but complete script that handles most common cases:

import cv2
import numpy as np
from PIL import Image
import pytesseract
from pathlib import Path
 
def preprocess(image_path):
    """Standard preprocessing pipeline."""
    img = cv2.imread(str(image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    return Image.fromarray(binary)
 
def ocr(image_path, preprocess_image=True, lang='eng', psm=3):
    """
    Extract text from an image.
    
    Args:
        image_path: Path to the image file
        preprocess_image: Whether to apply preprocessing
        lang: Tesseract language code
        psm: Page segmentation mode
    
    Returns:
        Extracted text as a string
    """
    if preprocess_image:
        image = preprocess(image_path)
    else:
        image = Image.open(image_path)
    
    config = f'--psm {psm}'
    text = pytesseract.image_to_string(image, lang=lang, config=config)
    return text.strip()
 
if __name__ == '__main__':
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: python ocr.py <image_path>")
        sys.exit(1)
    
    image_path = sys.argv[1]
    text = ocr(image_path)
    print(text)

Final Thoughts

OCR accuracy depends more on image quality and preprocessing than on the OCR engine itself. A well-preprocessed image fed to Tesseract will outperform a raw image fed to expensive cloud services.

The best OCR is the one you don't need. If you can get structured data from the source instead of extracting it from images, do that.

When you do need OCR, start simple. Try the basic approach first. Add preprocessing steps one at a time until accuracy is acceptable. Resist the urge to build a complex pipeline before you've confirmed the simple approach doesn't work.

Most OCR failures aren't algorithmic—they're image quality problems. Better lighting, higher resolution, and cleaner source documents will improve results more than any amount of code optimization.
