OCR basics - Reading a business card

OCR (Optical Character Recognition) is used everywhere now, from scanning expense reports to digitizing old documents to reading license plates on parking cameras. With Python and Tesseract, getting started takes only a few lines of code.

Installation

Ubuntu / Debian

sudo apt-get install tesseract-ocr tesseract-ocr-eng \
    libtiff5-dev libjpeg8-dev zlib1g-dev libfreetype6-dev

Fedora / RHEL

sudo dnf install tesseract tesseract-langpack-eng \
    libtiff-devel libjpeg-devel

macOS

brew install tesseract

Python dependencies

python3 -m venv venv
source venv/bin/activate
pip install pytesseract Pillow
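
pytesseract is only a wrapper: it shells out to the tesseract executable, so the system package has to be on your PATH. Here is a minimal sketch for checking that before going further. The helper names tesseract_available and tesseract_version are made up for this post and are not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is missing."""
    if not tesseract_available(binary):
        return None
    result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    # Some Tesseract releases print the version banner to stderr instead of stdout
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If this reports that tesseract is missing, revisit the system install step for your platform before debugging any Python code.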

The code

The simplest possible example — point it at an image, get text back:

#!/usr/bin/env python3
"""
OCR demo using pytesseract and Pillow.
Works on any image with printed text.
"""
from PIL import Image
import pytesseract


def read_image(filename):
    img = Image.open(filename)
    text = pytesseract.image_to_string(img)
    return text.strip()


if __name__ == '__main__':
    import sys
    filename = sys.argv[1] if len(sys.argv) > 1 else 'card.png'
    print(read_image(filename))

Improving accuracy with preprocessing

Raw photos often have poor contrast, skew, or noise that trips up Tesseract. A quick Pillow preprocessing pass (grayscale, contrast boost, sharpen) often improves results, though it does not correct skew:

from PIL import Image, ImageFilter, ImageEnhance
import pytesseract


def preprocess(img):
    # Convert to grayscale
    img = img.convert('L')
    # Increase contrast
    img = ImageEnhance.Contrast(img).enhance(2.0)
    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)
    return img


def read_image(filename):
    img = Image.open(filename)
    img = preprocess(img)
    # PSM 6: assume a single uniform block of text
    # OEM 3: use the default OCR engine for the installed Tesseract
    config = '--psm 6 --oem 3'
    return pytesseract.image_to_string(img, config=config).strip()


if __name__ == '__main__':
    import sys
    filename = sys.argv[1] if len(sys.argv) > 1 else 'card.png'
    print(read_image(filename))
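
Two more steps that are sometimes worth trying on small or low-resolution images are upscaling and a hard black/white threshold. The sketch below shows one way to do both with Pillow. The function names, the scale factor of 2, and the threshold of 128 are illustrative assumptions rather than recommended values, and since Tesseract does its own binarization internally, the threshold step may or may not help on a given image.

```python
from PIL import Image


def upscale(img, factor=2):
    """Enlarge the image by an integer factor using Lanczos resampling."""
    width, height = img.size
    return img.resize((width * factor, height * factor), Image.LANCZOS)


def binarize(img, threshold=128):
    """Convert to grayscale, then map every pixel to pure black or pure white."""
    gray = img.convert("L")
    # Pixels brighter than the threshold become white, the rest black
    return gray.point(lambda p: 255 if p > threshold else 0)


if __name__ == "__main__":
    import sys
    source = Image.open(sys.argv[1] if len(sys.argv) > 1 else "card.png")
    binarize(upscale(source)).save("card_preprocessed.png")
```

Saving the intermediate image, as the main block does, makes it easy to eyeball what Tesseract will actually see.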

Extracting structured data

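For a business card specifically you usually want to pull out individual fields rather than a raw text dump. image_to_data gives you bounding boxes and confidence scores for each word. The DataFrame output used here requires pandas (pip install pandas), which the install step above does not include: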

import pytesseract
from PIL import Image
import pandas as pd


def extract_words(filename):
    img = Image.open(filename)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DATAFRAME)
    # Keep words with decent confidence; layout-only rows have conf == -1
    words = data[data.conf > 60][['text', 'left', 'top', 'conf']]
    # Drop rows where no text was recognized
    return words.dropna()


if __name__ == '__main__':
    df = extract_words('card.png')
    print(df.to_string())
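
Once you have text, pulling out fields is mostly pattern matching. Below is a rough sketch using regular expressions on the plain OCR text. The patterns and the extract_fields helper are assumptions made for illustration: they are deliberately loose, will miss plenty of real-world formats, and do not try to tell a name from a job title.

```python
import re

# Loose patterns for illustration only; real cards will need more care
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_fields(text):
    """Pick the first email and phone number; collect all other lines."""
    fields = {"email": None, "phone": None, "other": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        email = EMAIL_RE.search(line)
        phone = PHONE_RE.search(line)
        if email and fields["email"] is None:
            fields["email"] = email.group()
        elif phone and fields["phone"] is None:
            fields["phone"] = phone.group()
        else:
            # Anything unrecognized (name, title, address) is kept as-is
            fields["other"].append(line)
    return fields


if __name__ == "__main__":
    sample = "Jane Smith\nSenior Engineer\njane@example.com\n+1 (555) 867-5309\n"
    print(extract_fields(sample))
```

OCR errors (an "l" read as "1", a dropped "@") will break patterns like these, so treat the result as a first guess to be confirmed by a human.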

Quick test image

Don't have a business card handy? Generate a test image with Pillow. The default font Pillow draws with is small, so recognition of this image may be imperfect:

from PIL import Image, ImageDraw

def make_test_card():
    img = Image.new('RGB', (400, 200), color='white')
    draw = ImageDraw.Draw(img)
    draw.text((20, 20),  "Jane Smith",         fill='black')
    draw.text((20, 50),  "Senior Engineer",    fill='black')
    draw.text((20, 80),  "jane@example.com",   fill='black')
    draw.text((20, 110), "+1 (555) 867-5309",  fill='black')
    img.save('test_card.png')
    return 'test_card.png'

if __name__ == '__main__':
    import pytesseract
    fname = make_test_card()
    print(pytesseract.image_to_string(Image.open(fname)))
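
Because you know exactly what text went into the generated card, you can put a number on how well the OCR did. Here is a small sketch using difflib from the standard library. The normalize and similarity helpers are names invented for this post, and the ratio is a rough similarity measure, not a formal character error rate.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return " ".join(text.split()).lower()


def similarity(expected, actual):
    """Return a ratio between 0.0 and 1.0; 1.0 means the normalized texts match."""
    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


if __name__ == "__main__":
    expected = "Jane Smith Senior Engineer jane@example.com +1 (555) 867-5309"
    # Replace this string with the text returned by pytesseract.image_to_string
    recognized = "Jane Srnith Senior Engineer jane@example.com +1 (555) 867-5309"
    print(f"similarity: {similarity(expected, recognized):.2f}")
```

Running this before and after a preprocessing change gives a quick sense of whether the change helped.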

From here you can feed the extracted text into a contacts API, a spreadsheet, or whatever you're building. Happy hacking!