
Skills · Mar 31, 2026 · 2 min read

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. It offers text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR, and benchmarks favorably against cloud OCR services.

Script Depot · Community

Agent-ready

Agent-ready installation

This asset can be installed after choosing the runtime, verifying the plan, and running the appropriate command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent

Type: Skill

Installation: Single

Trust: Established

Entry point

Surya — Document OCR for 90+ Languages

Direct install command

npx -y tokrepo@latest install 66bc0630-1be7-4da3-b227-f1fdb1faa065 --target codex

Run after confirming the plan with a dry run.

TL;DR

Surya performs text recognition, layout analysis, table detection, and reading order extraction for 90+ languages.

§01

What it is

Surya is an open-source document OCR toolkit that performs text recognition in over 90 languages. It goes beyond basic character recognition by providing layout analysis, table detection, reading order extraction, and LaTeX OCR. The project has accumulated over 19,000 GitHub stars and benchmarks favorably against cloud OCR services.

Surya is built for developers and researchers who need to extract structured text from scanned documents, PDFs, or images. It runs locally, which matters for teams handling sensitive documents that cannot be sent to third-party cloud APIs.

§02

How it saves time or tokens

Without Surya, extracting text from complex documents typically requires a cloud OCR service (Google Cloud Vision, AWS Textract) or manual preprocessing pipelines. Surya consolidates OCR, layout analysis, and table detection into a single local toolkit.

For AI pipelines, Surya's structured output means cleaner text going into LLMs. When you feed an LLM poorly extracted text with broken tables and scrambled reading order, you waste tokens on confused context. Surya's layout-aware extraction preserves document structure, reducing downstream token waste.
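
As a rough illustration of why reading order matters before text reaches an LLM, the sketch below sorts recognized lines top to bottom, then left to right, and joins them into a prompt-ready string. The `Line` type, its `bbox` layout, and the `row_tolerance` parameter are assumptions made for this example, not Surya's output classes, and the sort is a naive single-column heuristic rather than Surya's reading-order model.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one recognized text line."""
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def to_prompt_text(lines: list[Line], row_tolerance: float = 5.0) -> str:
    """Join lines in a naive reading order: top to bottom, then left to right."""
    # Bucket the top edge so lines on the same visual row sort by x position.
    ordered = sorted(
        lines,
        key=lambda ln: (round(ln.bbox[1] / row_tolerance), ln.bbox[0]),
    )
    return "\n".join(ln.text for ln in ordered)


scrambled = [
    Line("Total: 42.00", (50, 200, 300, 220)),
    Line("Invoice #1001", (50, 20, 300, 40)),
    Line("Widget x2", (50, 100, 300, 120)),
]
print(to_prompt_text(scrambled))
```

Multi-column pages and tables would need a real layout model; this only shows the shape of the cleanup step.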

§03

How to use

Install Surya:

pip install surya-ocr

Run OCR on a document image:

surya_ocr image.png

For specific tasks, use the dedicated commands:

# Detect text lines
surya_detect image.png

# Analyze layout (tables, headers, images)
surya_layout image.png

# Table recognition
surya_table image.png
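
To run several of these commands over the same file from a script, one option is a thin wrapper around `subprocess`. This is a minimal sketch: the command names come from the list above, while the task labels (`"ocr"`, `"detect"`, and so on) and the helper names are invented for illustration. It passes only the image path and assumes no additional flags.

```python
import shutil
import subprocess
from pathlib import Path

# Command names as listed above; the dictionary keys are labels chosen for this sketch.
COMMANDS = {
    "ocr": "surya_ocr",
    "detect": "surya_detect",
    "layout": "surya_layout",
    "table": "surya_table",
}


def build_command(task: str, image: Path) -> list[str]:
    """Return the argv list for one Surya CLI task on one image."""
    if task not in COMMANDS:
        raise ValueError(f"unknown task: {task!r}")
    return [COMMANDS[task], str(image)]


def run_tasks(image: Path, tasks: list[str]) -> None:
    """Run each task in order, skipping commands that are not installed."""
    for task in tasks:
        argv = build_command(task, image)
        if shutil.which(argv[0]) is None:
            print(f"skipping {argv[0]}: not found on PATH")
            continue
        subprocess.run(argv, check=True)


print(build_command("layout", Path("image.png")))
```

Calling `run_tasks(Path("image.png"), ["detect", "layout"])` would then execute the two commands in sequence if they are on the PATH.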

§04

Example

Using Surya in a Python script to extract text and export results:

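# These imports follow an older Surya API (before the predictor classes were
# introduced). Module paths have changed across releases, so check them
# against the README of your installed version.
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open('invoice.png')
langs = ['en']  # expected languages for this image

# Load the detection and recognition models together with their processors
det_model, det_processor = load_det_model(), load_det_processor()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

# run_ocr takes one language list per input image
results = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)

# Print the recognized text of every line on the first (and only) image
for line in results[0].text_lines:
    print(line.text)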

This prints every text line recognized in the invoice image, with English specified as the expected language.
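
To export the lines rather than print them, something along these lines could work. It is a sketch under assumptions: `TextLine` here is a hypothetical stand-in for the entries of `results[0].text_lines`, and the `confidence` field and `min_confidence` threshold are illustrative rather than confirmed by the example above.

```python
import json
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical stand-in for one entry of results[0].text_lines."""
    text: str
    confidence: float  # assumed field, not confirmed by the example above


def export_lines(lines: list[TextLine], min_confidence: float = 0.0) -> str:
    """Serialize lines at or above the confidence threshold as JSON text."""
    kept = [
        {"text": ln.text, "confidence": ln.confidence}
        for ln in lines
        if ln.confidence >= min_confidence
    ]
    return json.dumps(kept, ensure_ascii=False, indent=2)


sample = [TextLine("Invoice #1001", 0.98), TextLine("T0tal", 0.41)]
print(export_lines(sample, min_confidence=0.5))
```

Filtering low-confidence lines before export is one way to keep obviously misread text out of a downstream pipeline; the right threshold depends on the documents.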

§05

Related on TokRepo

Document processing tools -- more tools for document parsing and extraction
RAG tools -- combine OCR output with retrieval-augmented generation

§06

Common pitfalls

Surya's models require GPU for reasonable speed on large documents. CPU inference works but is significantly slower for batch processing.
Language detection is not automatic. You need to specify the expected languages via the langs parameter for best accuracy.
Table detection and OCR are separate steps. Running surya_table identifies table regions, but you still need surya_ocr to extract the cell text.
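
Because table structure and text come from separate steps, the two outputs have to be joined afterwards. One simple way, sketched below, is to assign each OCR line to the cell that contains the center of its bounding box. The `(x0, y0, x1, y1)` box layout and all function names are assumptions for this example, not Surya's output format.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def center(box: Box) -> tuple[float, float]:
    """Return the midpoint of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(cell: Box, point: tuple[float, float]) -> bool:
    """Check whether a point lies inside (or on the edge of) a cell box."""
    x, y = point
    return cell[0] <= x <= cell[2] and cell[1] <= y <= cell[3]


def fill_cells(cells: list[Box], lines: list[tuple[str, Box]]) -> list[str]:
    """Return one string per cell, joining lines whose center falls inside it."""
    texts: list[list[str]] = [[] for _ in cells]
    for text, box in lines:
        point = center(box)
        for i, cell in enumerate(cells):
            if contains(cell, point):
                texts[i].append(text)
                break  # each line is assigned to at most one cell
    return [" ".join(parts) for parts in texts]


cells = [(0, 0, 100, 20), (100, 0, 200, 20)]
lines = [("Item", (5, 2, 40, 18)), ("Price", (110, 2, 160, 18))]
print(fill_cells(cells, lines))
```

Lines whose center falls outside every cell are dropped here; a real pipeline would probably want to log or keep them.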

Frequently asked questions

How does Surya compare to Google Cloud Vision or AWS Textract?

Surya benchmarks favorably against cloud OCR services for many languages and document types, according to its project documentation. The key difference is that Surya runs locally, so there are no API costs, no data leaves your machine, and there is no rate limiting.

What languages does Surya support?

Surya supports over 90 languages for text recognition. You specify the expected languages when running OCR to improve accuracy. The full language list is maintained in the project repository.

Can Surya handle handwritten text?

Surya is primarily designed for printed text in documents. Handwritten text recognition depends on the handwriting quality and may produce lower accuracy compared to printed text. For heavily handwritten documents, specialized HTR models may be more appropriate.

Does Surya require a GPU?

Surya works on CPU but runs significantly faster with a GPU. For production pipelines processing many documents, a CUDA-compatible GPU is recommended. Single-page OCR on CPU is feasible for occasional use.

Can Surya extract tables as structured data?

Yes. Surya provides table detection via surya_table, which identifies table regions and cell boundaries. Combined with OCR, you can reconstruct table content programmatically. The output includes bounding boxes for rows and cells.
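
Once each cell has text and a grid position, rebuilding the table is mostly bookkeeping. The sketch below turns a list of cells into CSV text. The `row`, `col`, and `text` keys are hypothetical names chosen for this example and would need to be mapped from whatever the table output actually provides.

```python
import csv
import io


def cells_to_csv(cells: list[dict]) -> str:
    """Build CSV text from cells carrying hypothetical 'row', 'col', 'text' keys."""
    if not cells:
        return ""
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    # Start from an empty grid so missing cells become empty strings.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


table_cells = [
    {"row": 0, "col": 0, "text": "Item"},
    {"row": 0, "col": 1, "text": "Price"},
    {"row": 1, "col": 0, "text": "Widget"},
    {"row": 1, "col": 1, "text": "21.00"},
]
print(cells_to_csv(table_cells))
```

Merged or spanning cells are not handled; they would need extra fields beyond a single row and column index.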

Sources cited (3)

Surya GitHub: Surya OCR toolkit with 19K+ GitHub stars
Surya README: OCR benchmarks and supported languages
Google Cloud Document AI: Document AI and OCR best practices


Source and acknowledgments

Created by Vik Paruchuri. Code: GPL, Models: AI Pubs Open Rail-M. VikParuchuri/surya — 19,500+ GitHub stars

