Scripts · Apr 29, 2026 · 3 min read

Tesseract OCR — Open Source Text Recognition Engine for 100+ Languages

Tesseract is an open-source OCR engine, originally developed at HP and sponsored by Google for many years, that supports over 100 languages. It converts images and scanned documents into machine-readable text and can write the result in several output formats. Accuracy is generally high on clean, printed input.

Introduction

Tesseract is among the most widely used open-source OCR engines. It was developed at HP between 1985 and 1994, released as open source by HP in 2005, and sponsored by Google from 2006. Since version 4, its default recognizer is an LSTM neural network. Trained models are available for over 100 languages, and the engine can be trained on custom fonts or scripts.

What Tesseract Does

  • Extracts text from images (PNG, JPEG, TIFF, BMP, GIF); PDFs must first be converted to images, since Tesseract does not read PDF input directly
  • Recognizes 100+ languages including right-to-left and CJK scripts
  • Outputs plain text, hOCR, TSV, ALTO XML, or searchable PDF
  • Performs page layout analysis to handle multi-column documents
  • Supports custom training for specialized fonts or domain-specific text
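
The output formats listed above can be requested from the command line by appending config names after the output base name. The sketch below only builds the argument list and the file names one would expect; the helper name `output_command` and the paths are hypothetical, and it assumes a `tesseract` binary recent enough to ship the `alto` config.

```python
from pathlib import Path

# Config names accepted by the tesseract CLI, mapped to the extension each one writes.
OUTPUT_CONFIGS = {
    "txt": ".txt",
    "hocr": ".hocr",
    "tsv": ".tsv",
    "alto": ".xml",
    "pdf": ".pdf",
}


def output_command(image, out_base, formats):
    """Return (argv, expected_files) for one run that emits several formats."""
    unknown = [name for name in formats if name not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Tesseract appends the extension itself, so out_base carries no suffix.
    argv = ["tesseract", str(image), str(out_base), *formats]
    expected = [Path(f"{out_base}{OUTPUT_CONFIGS[name]}") for name in formats]
    return argv, expected
```

Passing the returned list to `subprocess.run(argv, check=True)` would run the recognition once and write one file per requested format.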

Architecture Overview

Tesseract 5.x uses an LSTM-based neural network pipeline. Input images are first binarized (Otsu thresholding by default, with adaptive methods available in 5.x) and then passed to page segmentation, which identifies text blocks, lines, and words. The LSTM engine recognizes character sequences within each text line and applies language-specific dictionaries for correction. The legacy engine from Tesseract 3.x is still available for simpler use cases, provided the installed language data includes the legacy model.
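
To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method, a classic global thresholding technique. It is a simplified stand-in for illustration, not Tesseract's actual implementation, and the function names are hypothetical. It assumes the image is already a flat list of 8-bit grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates pixels into <= t and > t."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no pixels at or below this level yet
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor); larger means cleaner split.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels):
    """Map each pixel to 0 (dark) or 1 (light) using the Otsu threshold."""
    threshold = otsu_threshold(pixels)
    return [1 if value > threshold else 0 for value in pixels]
```

A global threshold like this works well on evenly lit scans; unevenly lit photographs are where locally adaptive methods tend to help.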

Self-Hosting & Configuration

  • Install via system packages: apt install tesseract-ocr or brew install tesseract
  • Download additional language packs: apt install tesseract-ocr-deu tesseract-ocr-fra
  • Set page segmentation mode with --psm (e.g., --psm 6 for a single block of text)
  • Choose OCR engine mode with --oem (0=legacy only, 1=LSTM only, 2=legacy and LSTM combined, 3=default, based on what the installed language data supports)
  • Integrate via C API, or use Python bindings like pytesseract or tesserocr
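
The flags above combine into a single command. The following sketch assembles one with basic range checks and runs it through `subprocess`; the helper names and the example file name are hypothetical, and it assumes the `tesseract` binary is on `PATH` with the requested language data installed.

```python
import subprocess


def build_ocr_command(image, lang="eng", psm=3, oem=3):
    """Assemble a tesseract command that prints recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if oem not in (0, 1, 2, 3):
        raise ValueError("oem must be 0, 1, 2, or 3")
    # "stdout" as the output base makes tesseract write to standard output.
    return [
        "tesseract", str(image), "stdout",
        "-l", lang,            # join several languages with "+", e.g. "deu+fra"
        "--psm", str(psm),
        "--oem", str(oem),
    ]


def run_ocr(image, **options):
    """Run tesseract and return the recognized text (raises if the command fails)."""
    command = build_ocr_command(image, **options)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_ocr("invoice.png", lang="deu+fra", psm=6, oem=1)` would treat the page as a single block of German and French text and use only the LSTM engine.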

Key Features

  • LSTM neural network engine with significantly improved accuracy over legacy mode
  • Supports training custom models with tesstrain for niche fonts or languages
  • Multiple output formats including hOCR with bounding box coordinates
  • Configurable page segmentation for documents, receipts, license plates, and more
  • Active community, with pre-trained models for 100+ languages published on GitHub in the tessdata, tessdata_best, and tessdata_fast repositories

Comparison with Similar Tools

  • EasyOCR — Python-first with GPU support; easier API but fewer output format options
  • PaddleOCR — Strong on CJK languages with detection and recognition pipelines built in
  • Surya — Newer deep-learning OCR; better on complex layouts but less mature ecosystem
  • Amazon Textract — Managed cloud service with table extraction; not self-hostable
  • Google Cloud Vision — Higher accuracy on challenging inputs; requires API access and billing

FAQ

Q: How do I improve accuracy on low-quality scans? A: Pre-process images with tools like ImageMagick to increase contrast, remove noise, and deskew before passing them to Tesseract.
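
As a rough sketch of such a pre-processing step, the snippet below builds an ImageMagick command that converts to greyscale, stretches contrast, removes speckle noise, and deskews. The helper names are hypothetical, the option values are starting points rather than tuned settings, and it assumes ImageMagick is installed (`magick` in version 7, `convert` in version 6).

```python
import shutil
import subprocess


def build_preprocess_command(src, dst, binary="magick"):
    """Assemble an ImageMagick command that cleans up a scan before OCR."""
    return [
        binary, str(src),
        "-colorspace", "Gray",   # drop colour information
        "-normalize",            # stretch contrast to the full range
        "-despeckle",            # reduce speckle noise
        "-deskew", "40%",        # straighten slightly rotated pages
        str(dst),
    ]


def preprocess(src, dst):
    """Run the clean-up, preferring the ImageMagick 7 binary when present."""
    binary = "magick" if shutil.which("magick") else "convert"
    subprocess.run(build_preprocess_command(src, dst, binary), check=True)
```

The cleaned file can then be passed to Tesseract in place of the original scan. Comparing results with and without each option on a few sample pages is the safest way to decide which steps actually help.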

Q: Can Tesseract detect text location in an image? A: Yes. Use hOCR or TSV output to get bounding box coordinates for each word, line, or block.
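
The TSV output is straightforward to parse with the standard library. The sketch below extracts word-level boxes (rows with level 5) and filters by confidence. The function name is hypothetical, and `SAMPLE_TSV` is a hand-written illustration of the column layout, not real Tesseract output.

```python
import csv
import io

COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
           "left", "top", "width", "height", "conf", "text"]

# Hand-written sample in the TSV column layout (illustrative values only).
SAMPLE_ROWS = [
    COLUMNS,
    ["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "36", "92", "60", "24", "96.5", "Hello"],
    ["5", "1", "1", "1", "1", "2", "104", "92", "70", "24", "41.2", "wor1d"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def word_boxes(tsv_text, min_conf=0.0):
    """Return word-level entries as dicts with text, conf, and (x0, y0, x1, y1)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        text = (row["text"] or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page, block, paragraph, and line rows
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        box = (left, top, left + int(row["width"]), top + int(row["height"]))
        boxes.append({"text": text, "conf": conf, "box": box})
    return boxes
```

Calling `word_boxes(SAMPLE_TSV, min_conf=50)` keeps only the first word, since the second falls below the confidence cut-off.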

Q: Does Tesseract support handwriting recognition? A: Limited. It works best on printed text. For handwriting, consider training a custom LSTM model or using a dedicated handwriting engine.

Q: How do I use Tesseract from Python? A: Install the Tesseract binary first, then install pytesseract (pip install pytesseract), which wraps the command-line tool, and call pytesseract.image_to_string(image) with a PIL Image object.
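
A minimal sketch of that call, assuming `pytesseract`, Pillow, and the Tesseract binary are all installed. The helper names `ocr_config` and `image_to_text` are hypothetical; the third-party imports sit inside the function so the module still loads where those packages are missing.

```python
def ocr_config(psm=3, oem=3):
    """Build the extra command-line flags that pytesseract forwards to tesseract."""
    return f"--oem {oem} --psm {psm}"


def image_to_text(path, lang="eng", psm=3, oem=3):
    """Open an image with Pillow and return the text pytesseract recognizes."""
    # Imported lazily: both packages are optional third-party dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=ocr_config(psm, oem)
        )
```

For instance, `image_to_text("receipt.png", psm=6)` would read the file as a single uniform block of text.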
