# Langextract: Structured Extraction from Text Using LLMs by Google

> A Python library from Google for extracting structured information from unstructured text using large language models, with precise source grounding and interactive visualization.

## Quick Use

```bash
pip install langextract
```

```python
import langextract as lx

# A few-shot example tells the model which classes to extract
examples = [lx.data.ExampleData(
    text="Lunch is on May 2 at noon in the cafe.",
    extractions=[
        lx.data.Extraction(extraction_class="date", extraction_text="May 2"),
        lx.data.Extraction(extraction_class="time", extraction_text="noon"),
        lx.data.Extraction(extraction_class="location", extraction_text="the cafe"),
    ],
)]
result = lx.extract(text_or_documents="The meeting is on June 5 at 3pm in Room 201.",
                    prompt_description="Extract the date, time, and location.",
                    examples=examples, model_id="gemini-2.5-flash")
print(result.extractions)
```

## Introduction

Langextract is a Python library developed by Google for extracting structured data from unstructured text using large language models. You describe the task with a prompt and a few worked examples, and the library returns extractions with source grounding, so every extracted value links back to the specific text span that supports it.

## What Langextract Does

- Extracts structured records from free-form text using LLM reasoning
- Defines the extraction task through a prompt description and few-shot examples
- Provides source grounding linking each extraction to its origin span
- Offers interactive visualization of extraction results and provenance
- Works with Gemini models and supports custom LLM backends

## Architecture Overview

Langextract wraps LLM calls with a prompt construction layer. The prompt description and few-shot examples are compiled into extraction instructions that guide the model to produce structured JSON output. A post-processing layer validates the parsed output, resolves references, and computes character-level grounding spans. The library uses the Gemini API by default but exposes hooks for plugging in alternative model providers.

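The character-level grounding step described above can be illustrated with a minimal sketch. It assumes exact substring matching, which is a simplification and not necessarily how Langextract aligns extractions internally. `GroundedSpan` and `ground_extractions` are hypothetical names used only for illustration; they are not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """Hypothetical record: an extracted value plus its source offsets."""
    label: str
    text: str
    start: Optional[int]  # inclusive character offset, None if not found
    end: Optional[int]    # exclusive character offset, None if not found


def ground_extractions(source: str, extractions: list[tuple[str, str]]) -> list[GroundedSpan]:
    """Locate each (label, text) pair in source by exact substring search."""
    spans = []
    cursor = 0  # search forward first so repeated values map to later occurrences
    for label, text in extractions:
        start = source.find(text, cursor)
        if start == -1:
            start = source.find(text)  # fall back to searching from the beginning
        if start == -1:
            # The value does not appear verbatim, so it stays ungrounded
            spans.append(GroundedSpan(label, text, None, None))
            continue
        end = start + len(text)
        spans.append(GroundedSpan(label, text, start, end))
        cursor = end
    return spans


source = "The meeting is on June 5 at 3pm in Room 201."
spans = ground_extractions(
    source,
    [("date", "June 5"), ("time", "3pm"), ("location", "Room 201"), ("host", "Alice")],
)
for span in spans:
    print(span)
```

Slicing `source[span.start:span.end]` recovers the supporting text for any grounded span, while a value the model produced that does not appear in the source (here, the made-up `host`) is left with `None` offsets so it can be flagged for review.
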
## Self-Hosting & Configuration

- Install via pip with Python 3.10 or later
- Set the `LANGEXTRACT_API_KEY` environment variable for Gemini access
- Define extraction tasks with a prompt description and example extractions
- Configure model parameters such as temperature and max tokens
- Optionally generate an interactive HTML visualization locally for debugging extractions

## Key Features

- Example-driven extraction, with output structure enforced on models that support controlled generation
- Source grounding traces every extraction back to its supporting text
- Batch processing for extracting from multiple documents efficiently
- Interactive HTML visualization for reviewing extraction quality
- Supports nested and list-valued fields

## Comparison with Similar Tools

- **Instructor**: structured LLM outputs via Pydantic; Langextract adds source grounding
- **Outlines**: grammar-constrained generation; Langextract operates at a higher level of abstraction
- **BAML**: type-safe AI functions; Langextract focuses specifically on extraction with provenance
- **Docling**: document parsing; Langextract turns unstructured text into structured data

## FAQ

**Q: Which models does it support?**
A: Gemini models are supported by default. Custom backends can be configured via the provider interface.

**Q: What is source grounding?**
A: Each extraction includes character offsets pointing to the exact text span that the model used as evidence.

**Q: Can it process PDFs or HTML?**
A: Langextract operates on plain text. Pair it with a document parser like Docling for PDF input.

**Q: Is it suitable for production workloads?**
A: Yes, it includes batch processing, retry logic, and structured error handling for production use.

## Sources

- https://github.com/google/langextract

---

Source: https://tokrepo.com/en/workflows/asset-44648187
Author: AI Open Source