Configs · May 15, 2026 · 2 min read

Langextract — Structured Extraction from Text Using LLMs by Google

A Python library from Google for extracting structured information from unstructured text using large language models with precise source grounding and interactive visualization.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 29/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: Established
Entry: Langextract Overview

Universal CLI command:
npx tokrepo install 44648187-5079-11f1-9bc6-00163e2b0d79

Introduction

Langextract is a Python library developed by Google for extracting structured data from unstructured text using large language models. It provides a schema-driven API that returns typed objects with source grounding, so every extracted field links back to the specific text span that supports it.

What Langextract Does

  • Extracts structured records from free-form text using LLM reasoning
  • Supports user-defined schemas using Python dataclasses or dictionaries
  • Provides source grounding linking each extracted field to its origin span
  • Offers interactive visualization of extraction results and provenance
  • Works with Gemini models and supports custom LLM backends

Architecture Overview

Langextract wraps LLM calls in a schema-aware prompt-construction layer. User-defined schemas are compiled into extraction instructions that guide the model to produce structured JSON output. A post-processing layer validates types, resolves references, and computes character-level grounding spans. The library uses the Gemini API by default but exposes hooks for plugging in alternative model providers.
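That pipeline can be sketched with plain dataclasses. All names below (`compile_instructions`, `postprocess`, `Person`) are hypothetical, and Langextract's internals differ; this only shows the shape of schema-to-prompt compilation plus type validation and grounding.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    birth_year: int

def compile_instructions(schema) -> str:
    """Render a dataclass schema as a JSON-output instruction for the model."""
    field_desc = ", ".join(f'"{f.name}": <{f.type.__name__}>' for f in fields(schema))
    return f"Return a JSON object of the form {{{field_desc}}} extracted from the text."

def find_span(source: str, value: str):
    """Character-level grounding: locate the value in the source text."""
    i = source.find(value)
    return (i, i + len(value)) if i != -1 else None

def postprocess(raw_json: str, schema, source: str):
    """Validate field types against the schema and attach grounding spans."""
    data = json.loads(raw_json)
    record = schema(**{f.name: f.type(data[f.name]) for f in fields(schema)})
    spans = {f.name: find_span(source, str(data[f.name])) for f in fields(schema)}
    return record, spans

text = "Alan Turing was born in 1912."
model_output = '{"name": "Alan Turing", "birth_year": 1912}'  # simulated LLM reply
person, spans = postprocess(model_output, Person, text)
```

The validation step coerces each field to its declared type and fails loudly on malformed model output, which is the practical benefit of a schema-driven design.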

Self-Hosting & Configuration

  • Install via pip with Python 3.10 or later
  • Set the GOOGLE_API_KEY environment variable for Gemini access
  • Define extraction schemas as Python dataclasses or typed dicts
  • Configure model parameters like temperature and max tokens
  • Optional visualization server runs locally for debugging extractions
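The first two setup steps above amount to something like the following (the key value is a placeholder):

```shell
pip install langextract                        # requires Python 3.10+
export GOOGLE_API_KEY="<your-gemini-api-key>"  # enables Gemini access
```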

Key Features

  • Schema-driven extraction with automatic type validation
  • Source grounding traces every field back to its supporting text
  • Batch processing for extracting from multiple documents efficiently
  • Interactive visualization dashboard for reviewing extraction quality
  • Supports nested schemas and list-valued fields
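Nested schemas and list-valued fields can be pictured with plain dataclasses. This is an illustrative sketch; `Invoice` and `LineItem` are made-up names, not part of Langextract's API.

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    amount: float

@dataclass
class Invoice:
    vendor: str
    items: list[LineItem]  # a list-valued field of nested records

# Simulated structured output from the model, parsed into typed records.
raw = {
    "vendor": "Acme Corp",
    "items": [
        {"description": "Widgets", "amount": 19.99},
        {"description": "Shipping", "amount": 4.50},
    ],
}
invoice = Invoice(
    vendor=raw["vendor"],
    items=[LineItem(**item) for item in raw["items"]],
)
```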

Comparison with Similar Tools

  • Instructor — structured LLM outputs via Pydantic; Langextract adds source grounding
  • Outlines — grammar-constrained generation; Langextract operates at a higher schema level
  • BAML — type-safe AI functions; Langextract focuses specifically on extraction with provenance
  • Docling — document parsing; Langextract converts unstructured text into structured data

FAQ

Q: Which models does it support? A: Gemini models are supported by default. Custom backends can be configured via the provider interface.

Q: What is source grounding? A: Each extracted field includes character offsets pointing to the exact text span that the model used as evidence.

Q: Can it process PDFs or HTML? A: Langextract operates on plain text. Pair it with a document parser like Docling for PDF input.

Q: Is it suitable for production workloads? A: Yes, it includes batch processing, retry logic, and structured error handling for production use.
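The retry behavior mentioned here can be sketched generically. `extract_with_retry` below is a hypothetical helper, not Langextract's API; it simply retries a flaky extraction call with exponential backoff, a common pattern for production LLM pipelines.

```python
import time

def extract_with_retry(extract_fn, doc, max_retries=3, backoff=0.01):
    """Call an extraction function, retrying transient errors
    with exponential backoff. Re-raises on the final attempt."""
    for attempt in range(max_retries):
        try:
            return extract_fn(doc)
        except RuntimeError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)

# A stand-in extractor that fails once, then succeeds.
calls = {"n": 0}
def flaky_extract(doc):
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient error")
    return {"doc": doc, "fields": {}}

result = extract_with_retry(flaky_extract, "some text")
```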

