OCR (Optical Character Recognition) is used everywhere now, from scanning expense reports to digitizing old documents to reading license plates on parking cameras. With Python and Tesseract, getting started takes only a few lines of code.
Installation
Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng \
    libtiff5-dev libjpeg8-dev zlib1g-dev libfreetype6-dev
Fedora / RHEL
sudo dnf install tesseract tesseract-langpack-eng \
    libtiff-devel libjpeg-devel
macOS
brew install tesseract
Python dependencies
python3 -m venv venv
source venv/bin/activate
pip install pytesseract Pillow
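pytesseract is only a wrapper: it shells out to the tesseract binary, so that binary has to be reachable on your PATH. Here is a minimal sketch of a pre-flight check using only the standard library. The helper name find_tesseract is hypothetical, not part of pytesseract.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; re-check the install step for your OS")
    else:
        print(f"tesseract found at {path}")
```

If the binary lives somewhere unusual, you can point pytesseract at it by setting pytesseract.pytesseract.tesseract_cmd to the full path.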
The code
The simplest possible example — point it at an image, get text back:
#!/usr/bin/env python3
"""
OCR demo using pytesseract and Pillow.
Works on any image with printed text.
"""
from PIL import Image
import pytesseract


def read_image(filename):
    img = Image.open(filename)
    text = pytesseract.image_to_string(img)
    return text.strip()


if __name__ == '__main__':
    import sys
    filename = sys.argv[1] if len(sys.argv) > 1 else 'card.png'
    print(read_image(filename))
Improving accuracy with preprocessing
Raw photos often have poor contrast, skew, or noise that trips up Tesseract. A quick Pillow preprocessing pass for contrast and sharpness helps a lot (it does not correct skew):
from PIL import Image, ImageFilter, ImageEnhance
import pytesseract


def preprocess(img):
    # Convert to grayscale
    img = img.convert('L')
    # Increase contrast
    img = ImageEnhance.Contrast(img).enhance(2.0)
    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)
    return img


def read_image(filename):
    img = Image.open(filename)
    img = preprocess(img)
    # PSM 6: assume a uniform block of text
    config = '--psm 6 --oem 3'
    return pytesseract.image_to_string(img, config=config).strip()


if __name__ == '__main__':
    import sys
    filename = sys.argv[1] if len(sys.argv) > 1 else 'card.png'
    print(read_image(filename))
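The config string is worth a closer look, because the page segmentation mode (PSM) often matters as much as the preprocessing. Below is a rough sketch of a helper that builds the string; build_config and PSM_MODES are hypothetical names, and the table lists only a handful of the modes that tesseract --help-psm prints. The character whitelist uses Tesseract's tessedit_char_whitelist variable, whose behavior depends on the Tesseract version and engine, so treat that part as an assumption to verify on your install.

```python
# A few page segmentation modes, as described by `tesseract --help-psm`.
PSM_MODES = {
    3: "fully automatic page segmentation (Tesseract's default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    7: "single text line",
    11: "sparse text, in no particular order",
}


def build_config(psm=6, oem=3, whitelist=None):
    """Build a config string suitable for pytesseract's `config` argument."""
    if psm not in PSM_MODES:
        raise ValueError(f"PSM {psm} is not in this sketch's table")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        # The config string is split on whitespace, so a whitelist
        # containing spaces would be cut apart; reject it here.
        if any(ch.isspace() for ch in whitelist):
            raise ValueError("whitelist must not contain whitespace")
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_config())                                # same string as above
    print(build_config(psm=7, whitelist="0123456789"))   # one line of digits
```

For a business card with scattered text, PSM 11 or the default PSM 3 may work better than PSM 6; it is worth trying more than one mode on your own images.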
Extracting structured data
For a business card specifically, you usually want to pull out individual fields rather than a raw text dump. image_to_data gives you a bounding box and a confidence score for each word. The DATAFRAME output type used here needs pandas, so run pip install pandas first:
import pytesseract
from PIL import Image
import pandas as pd  # not used directly, but Output.DATAFRAME requires pandas


def extract_words(filename):
    img = Image.open(filename)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DATAFRAME)
    # Filter to words with decent confidence.
    # conf runs from 0 to 100; rows that are not words have conf -1.
    words = data[data.conf > 60][['text', 'left', 'top', 'conf']]
    # Drop rows whose text is missing
    return words.dropna()


if __name__ == '__main__':
    df = extract_words('card.png')
    print(df.to_string())
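Individual words are rarely what you want on their own; fields on a card tend to be whole lines. image_to_data also reports block_num, par_num, and line_num for every word, so words can be regrouped into lines. Here is a minimal sketch that works on the plain-dict form you get from output_type=pytesseract.Output.DICT, which avoids the pandas dependency. The function name group_into_lines and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_into_lines(data, min_conf=60):
    """Rebuild text lines from image_to_data output in dict-of-lists form.

    `data` must have equal-length lists under the keys
    text, conf, block_num, par_num, line_num, and left.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        word = str(word).strip()
        # Skip empty entries and low-confidence words (conf may be str or number)
        if not word or float(data["conf"][i]) <= min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))
    # Order lines by (block, paragraph, line) and words left to right
    return [
        " ".join(w for _, w in sorted(words))
        for _, words in sorted(lines.items())
    ]


if __name__ == "__main__":
    # Hand-written sample in the same shape as Output.DICT (values are invented)
    sample = {
        "text": ["", "Jane", "Smith", "Senior", "Engineer", "x"],
        "conf": [-1, 96, 95, 91, 93, 12],
        "block_num": [1, 1, 1, 1, 1, 1],
        "par_num": [1, 1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 2, 2, 2],
        "left": [0, 20, 60, 20, 70, 130],
    }
    print(group_into_lines(sample))  # ['Jane Smith', 'Senior Engineer']
```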
Quick test image
Don't have a business card handy? Generate a test image with Pillow:
from PIL import Image, ImageDraw


def make_test_card():
    img = Image.new('RGB', (400, 200), color='white')
    draw = ImageDraw.Draw(img)
    # No font is passed, so Pillow draws with its small built-in default font
    draw.text((20, 20), "Jane Smith", fill='black')
    draw.text((20, 50), "Senior Engineer", fill='black')
    draw.text((20, 80), "jane@example.com", fill='black')
    draw.text((20, 110), "+1 (555) 867-5309", fill='black')
    img.save('test_card.png')
    return 'test_card.png'


if __name__ == '__main__':
    import pytesseract
    fname = make_test_card()
    print(pytesseract.image_to_string(Image.open(fname)))
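Once text comes back from the test card, a small regex pass can pull out the fields. This is a rough sketch, not a robust parser: the patterns are deliberately loose, parse_card is a hypothetical helper, and treating the first non-empty line as the name is just a guess that happens to fit this card layout.

```python
import re

# Loose patterns for illustration; real-world cards will need more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")


def parse_card(text):
    """Pull a name, email, and phone number out of OCR text from a card."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        # Assumption: the name is the first non-empty line
        "name": lines[0] if lines else None,
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


if __name__ == "__main__":
    # The text drawn on the test card, as ideal OCR output
    ocr_text = "Jane Smith\nSenior Engineer\njane@example.com\n+1 (555) 867-5309\n"
    print(parse_card(ocr_text))
```

Real OCR output is noisier than this ideal string (a misread "@" or a dropped digit is common), so expect some fields to come back as None.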
From here you can feed the extracted text into a contacts API, a spreadsheet, or whatever you're building. Happy hacking!