ScrapeGraphAI — AI-Powered Web Scraping
Python scraping library powered by LLMs. Describe what you want to extract in natural language, get structured data back. Handles dynamic pages. 23K+ stars.
What it is
ScrapeGraphAI is a Python web scraping library that uses large language models to extract structured data from websites. Instead of writing CSS selectors or XPath queries, you describe what you want in natural language and the LLM figures out how to extract it.
ScrapeGraphAI targets developers who need data extraction from websites without building custom scrapers for each site. It works with OpenAI, Anthropic, Google, and local models via Ollama.
How it saves time or tokens
Traditional scraping requires writing and maintaining selectors that break when sites change their HTML structure. ScrapeGraphAI abstracts this away -- the LLM adapts to different page layouts without code changes. A typical extraction run consumes roughly 500 tokens.
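For contrast, here is what the selector-based approach looks like. This sketch uses Python's standard-library XML parser on an inline snippet; the element paths and class names are hypothetical -- the point is that the extraction logic is welded to the site's current markup and breaks the moment the structure changes.

```python
import xml.etree.ElementTree as ET

# A hypothetical snippet of a news site's markup.
html = """<body>
<div class="athing"><span class="titleline"><a>Example title</a></span></div>
<div class="athing"><span class="titleline"><a>Another title</a></span></div>
</body>"""

root = ET.fromstring(html)
# The selector path is hard-coded; renaming 'athing' or nesting the span
# differently would silently return an empty list.
titles = [a.text for a in root.findall("./div[@class='athing']/span/a")]
print(titles)  # ['Example title', 'Another title']
```

With ScrapeGraphAI, the equivalent intent is expressed once as a natural-language prompt, and layout changes are absorbed by the LLM rather than by your code.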
How to use
- Install ScrapeGraphAI and Playwright:
pip install scrapegraphai
playwright install
- Create a SmartScraperGraph with your prompt and target URL.
- Call .run() to get structured data back as a Python dictionary.
Example
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt='Extract all article titles and their authors',
    source='https://news.ycombinator.com',
    config={'llm': {'model': 'openai/gpt-4o', 'api_key': 'sk-...'}}
)
result = graph.run()
print(result)
# e.g. {'articles': [{'title': '...', 'author': '...'}, ...]}
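Because run() returns plain Python data, post-processing needs no extra tooling. A minimal sketch using a hypothetical result shape (the actual keys depend on your prompt and the page being scraped):

```python
import csv
import io

# Hypothetical result shape -- stand-in for what graph.run() might return.
result = {'articles': [
    {'title': 'Show HN: ...', 'author': 'alice'},
    {'title': 'Ask HN: ...', 'author': 'bob'},
]}

# Write the extracted records to CSV, a common next step after extraction.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'author'])
writer.writeheader()
writer.writerows(result['articles'])
print(buf.getvalue())
```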
Related on TokRepo
- Web Scraping Tools -- More web scraping and data extraction solutions
- AI Tools for Research -- Research automation tools powered by AI
Common pitfalls
- ScrapeGraphAI requires Playwright for dynamic JavaScript-rendered pages. Without it, only static HTML is parsed.
- LLM token costs can add up when scraping many pages. Use local models via Ollama for high-volume extraction to reduce API costs.
- The quality of extraction depends heavily on the prompt specificity. Vague prompts like 'get everything' produce inconsistent results.
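For the local-model route mentioned above, the config swaps the OpenAI entry for an Ollama one. A sketch under stated assumptions: the key names follow the pattern of the OpenAI example, but the 'base_url' key and the model tag are assumptions to verify against the ScrapeGraphAI docs for your installed version.

```python
# Hypothetical Ollama config -- 'base_url' and the model tag are assumptions;
# check the ScrapeGraphAI documentation for your version.
ollama_config = {
    'llm': {
        'model': 'ollama/llama3',
        'base_url': 'http://localhost:11434',
    }
}

# Used the same way as the OpenAI config:
# graph = SmartScraperGraph(prompt=..., source=..., config=ollama_config)
print(ollama_config['llm']['model'])
```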
Frequently Asked Questions
Which LLM providers does ScrapeGraphAI support?
ScrapeGraphAI supports OpenAI, Anthropic, Google, Groq, and local models via Ollama. You configure the provider and model in the config dictionary when creating a graph instance.
Can it scrape JavaScript-rendered pages?
Yes. ScrapeGraphAI uses Playwright under the hood to render dynamic pages before extraction. You need to run 'playwright install' to set up the browser binaries.
How does it differ from traditional scraping libraries?
Traditional libraries like BeautifulSoup and Scrapy require you to write CSS selectors or XPath. ScrapeGraphAI uses natural language prompts instead, letting the LLM determine how to locate and extract the target data.
Can I run it with local models?
Yes. ScrapeGraphAI integrates with Ollama for local model inference. This is useful for high-volume scraping where API costs would be prohibitive. Set the model to an Ollama endpoint in your config.
What graph types are available?
ScrapeGraphAI offers SmartScraperGraph for single-page extraction, SearchGraph for search-engine-based extraction, and SpeechGraph for audio-to-text extraction. SmartScraperGraph is the most commonly used.
Citations (3)
- ScrapeGraphAI GitHub -- ScrapeGraphAI is an AI-powered web scraping library with 23K+ GitHub stars
- Playwright Documentation -- Playwright enables automated browser interaction for dynamic page rendering
- Ollama GitHub -- Ollama enables running LLMs locally for cost-effective inference
Source & Thanks
Created by ScrapeGraphAI. Licensed under MIT. ScrapeGraphAI/Scrapegraph-ai -- 23,000+ GitHub stars