# Langextract — Structured Extraction from Text Using LLMs by Google

> A Python library from Google for extracting structured information from unstructured text using large language models with precise source grounding and interactive visualization.

## Quick Use
```bash
pip install langextract
```
```python
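import langextract as lx
# The API takes a prompt plus at least one worked example (needs a Gemini API key).
example = lx.data.ExampleData(
    text="Lunch is on May 2 at noon in the cafe.",
    extractions=[lx.data.Extraction(extraction_class=c, extraction_text=t)
                 for c, t in [("date", "May 2"), ("time", "noon"), ("location", "the cafe")]])
result = lx.extract(text_or_documents="The meeting is on June 5 at 3pm in Room 201.",
                    prompt_description="Extract the date, time, and location.",
                    examples=[example], model_id="gemini-2.5-flash")
print(result.extractions)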
```

## Introduction
Langextract is a Python library developed by Google for extracting structured data from unstructured text using large language models. It provides a schema-driven API that returns typed objects with source grounding, so every extracted field links back to the specific text span that supports it.

## What Langextract Does
- Extracts structured records from free-form text using LLM reasoning
- Supports user-defined schemas using Python dataclasses or dictionaries
- Provides source grounding linking each extracted field to its origin span
- Offers interactive visualization of extraction results and provenance
- Works with Gemini models and supports custom LLM backends

## Architecture Overview
Langextract wraps LLM calls with a schema-aware prompt construction layer. User-defined schemas are compiled into extraction instructions that guide the model to produce structured JSON output. A post-processing layer validates types, resolves references, and computes character-level grounding spans. The library uses the Gemini API by default but exposes hooks for plugging in alternative model providers.
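
The grounding step of that post-processing layer can be illustrated without any LLM call. The sketch below is a simplified stand-in, not Langextract's actual implementation: it assumes the model returned a flat JSON object of string values, and the hypothetical helper `ground_fields` locates each value in the source text by exact substring match to obtain character offsets.

```python
import json

def ground_fields(text: str, model_json: str) -> dict:
    """Map each extracted field to its value and (start, end) offsets.

    Hypothetical helper. Exact substring search is an assumption made
    for illustration; it cannot align paraphrased or reformatted values.
    """
    fields = json.loads(model_json)
    grounded = {}
    for name, value in fields.items():
        start = text.find(value)
        # find() returns -1 when the value does not occur in the text.
        span = (start, start + len(value)) if start != -1 else None
        grounded[name] = {"value": value, "span": span}
    return grounded

text = "The meeting is on June 5 at 3pm in Room 201."
model_json = '{"date": "June 5", "time": "3pm", "location": "Room 201"}'
grounded = ground_fields(text, model_json)
for name, info in grounded.items():
    print(name, info["span"])
```

A field whose value cannot be found gets a span of `None`, which is a cheap signal that the model may have produced text that is not in the source.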

## Self-Hosting & Configuration
- Install via pip with Python 3.10 or later
- Set the LANGEXTRACT_API_KEY environment variable for Gemini access
- Define extraction schemas as Python dataclasses or typed dicts
- Configure model parameters like temperature and max tokens
- Optional visualization server runs locally for debugging extractions
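
One way to handle the configuration items above is to check them once at startup, so a missing key fails fast instead of surfacing in the middle of a run. This is a minimal sketch under stated assumptions: `ExtractionConfig` and `load_config` are hypothetical names that are not part of Langextract, the environment variable name is passed in by the caller, and the 0.0 to 2.0 temperature range is an assumed bound.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical settings bundle; not part of Langextract."""
    api_key: str
    temperature: float = 0.0
    max_tokens: int = 1024

def load_config(key_var: str, env=os.environ, **overrides) -> ExtractionConfig:
    """Read the API key from env[key_var] and validate model parameters."""
    api_key = env.get(key_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {key_var} environment variable first")
    config = ExtractionConfig(api_key=api_key, **overrides)
    # Assumed bounds for illustration; check your model's documentation.
    if not 0.0 <= config.temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if config.max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return config

# Demo with an in-memory mapping instead of the real process environment.
demo_env = {"EXAMPLE_API_KEY": "not-a-real-key"}
config = load_config("EXAMPLE_API_KEY", env=demo_env, temperature=0.2)
print(config.temperature, config.max_tokens)
```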

## Key Features
- Schema-driven extraction with automatic type validation
- Source grounding traces every field back to its supporting text
- Batch processing for extracting from multiple documents efficiently
- Interactive visualization dashboard for reviewing extraction quality
- Supports nested schemas and list-valued fields
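
To make the batch-processing idea concrete, here is a generic sketch of running an extractor over several documents concurrently while keeping results aligned with their inputs. It does not call Langextract: `stub_extract` is a hypothetical regex-based stand-in for an LLM call, and `extract_batch` is an illustrative helper name.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def stub_extract(text: str) -> dict:
    """Stand-in for an LLM call: pulls 'Room NNN' with a regex."""
    match = re.search(r"Room \d+", text)
    return {"location": match.group(0) if match else None}

def extract_batch(documents: list[str], max_workers: int = 4) -> list[dict]:
    """Run the extractor over many documents, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the inputs.
        return list(pool.map(stub_extract, documents))

docs = [
    "The meeting is on June 5 at 3pm in Room 201.",
    "Standup moved to Room 14 tomorrow.",
    "No venue has been announced yet.",
]
results = extract_batch(docs)
print(results)
```

Threads suit this workload because real extraction calls spend most of their time waiting on the network rather than on the CPU.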

## Comparison with Similar Tools
- **Instructor** — structured LLM outputs via Pydantic; Langextract adds source grounding
- **Outlines** — grammar-constrained generation; Langextract operates at a higher schema level
- **BAML** — type-safe AI functions; Langextract focuses specifically on extraction with provenance
- **Docling** — document parsing; Langextract turns the resulting plain text into structured data

## FAQ
**Q: Which models does it support?**
A: Gemini models are supported by default. Custom backends can be configured via the provider interface.

**Q: What is source grounding?**
A: Each extracted field includes character offsets pointing to the exact text span that the model used as evidence.

**Q: Can it process PDFs or HTML?**
A: Langextract operates on plain text. Pair it with a document parser like Docling for PDF input.
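
For HTML, a small pre-processing step is often enough. The sketch below uses only the Python standard library to reduce a page to visible text before extraction; `TextOnly` and `html_to_text` are hypothetical helper names, and the whitespace handling is deliberately simple.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><style>p{color:red}</style><body><h1>Agenda</h1>"
        "<p>Meet in <b>Room 201</b> at 3pm</p></body></html>")
plain = html_to_text(page)
print(plain)
```

Any character offsets produced downstream refer to this plain text, not to positions in the original HTML.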

**Q: Is it suitable for production workloads?**
A: Yes, it includes batch processing, retry logic, and structured error handling for production use.
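
For readers who want an extra safety net around their own calls, retrying with exponential backoff might look like the sketch below. This is a caller-side illustration, not the library's internal retry logic: `with_retries` and `flaky_extract` are hypothetical names, and treating `ConnectionError` as the retriable error is an assumption.

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 0.01, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles after each failure: 1x, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))

# Demo: a stub that fails twice before succeeding.
calls = {"count": 0}

def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return {"date": "June 5"}

outcome = with_retries(flaky_extract)
print(outcome, calls["count"])
```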

## Sources
- https://github.com/google/langextract

---
Source: https://tokrepo.com/en/workflows/asset-44648187
Author: AI Open Source