ScrapeGraphAI — AI-Powered Web Scraping
Python scraping library powered by LLMs. Describe what you want to extract in natural language, get structured data back. Handles dynamic pages. 23K+ stars.
Agent-ready installation
This asset can be installed after choosing the runtime, checking the plan, and running the appropriate command.
npx -y tokrepo@latest install d34e3181-e3f5-4853-871e-83acafe0c60e --target codex
Run after confirming the plan in a dry run.
What it is
ScrapeGraphAI is a Python web scraping library that uses large language models to extract structured data from websites. Instead of writing CSS selectors or XPath queries, you describe what you want in natural language and the LLM figures out how to extract it.
ScrapeGraphAI targets developers who need data extraction from websites without building custom scrapers for each site. It works with OpenAI, Anthropic, Google, and local models via Ollama.
How it saves time or tokens
Traditional scraping requires writing and maintaining selectors that break when sites change their HTML structure. ScrapeGraphAI abstracts this away: the LLM adapts to different page layouts without code changes. The token_estimate for this workflow is approximately 500 tokens per extraction run.
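As a rough illustration of what the per-run estimate means at volume, the sketch below multiplies it out. The 500-token default comes from the figure above; the price per million tokens is a made-up placeholder, not a quote from any provider, and actual usage depends on page size and model.

```python
def estimate_tokens(pages: int, tokens_per_run: int = 500) -> int:
    """Total tokens for `pages` extraction runs, assuming one run per page."""
    if pages < 0 or tokens_per_run < 0:
        raise ValueError("pages and tokens_per_run must be non-negative")
    return pages * tokens_per_run


def estimate_cost_usd(pages: int, usd_per_million_tokens: float,
                      tokens_per_run: int = 500) -> float:
    """Rough cost; the price per million tokens is a placeholder you supply."""
    total = estimate_tokens(pages, tokens_per_run)
    return total / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # 10,000 pages at ~500 tokens each, with a hypothetical price of 2 USD
    # per million tokens.
    print(estimate_tokens(10_000))         # 5000000
    print(estimate_cost_usd(10_000, 2.0))  # 10.0
```

Swapping in your provider's real price and a measured tokens-per-run figure gives a more useful number than the defaults here.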
How to use
- Install ScrapeGraphAI and Playwright:
pip install scrapegraphai
playwright install
- Create a SmartScraperGraph with your prompt and target URL.
- Call .run() to get structured data back as a Python dictionary.
Example
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt='Extract all article titles and their authors',
    source='https://news.ycombinator.com',
    config={'llm': {'model': 'openai/gpt-4o', 'api_key': 'sk-...'}},
)
result = graph.run()
print(result)
# [{'title': '...', 'author': '...'}, ...]
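The same prompt can be applied to several pages in a loop. The sketch below is one way to do that, not part of the library: scrape_many and run_smart_scraper are hypothetical helper names, and the only ScrapeGraphAI call used is the SmartScraperGraph(...).run() pattern from the example above. The scraper is passed in as a function so the loop can be tried with a stub before spending tokens.

```python
from typing import Any, Callable, Dict, Iterable


def run_smart_scraper(prompt: str, url: str, config: dict) -> Any:
    """Run one SmartScraperGraph, mirroring the example above."""
    # Imported lazily so this module loads even where scrapegraphai is absent.
    from scrapegraphai.graphs import SmartScraperGraph
    return SmartScraperGraph(prompt=prompt, source=url, config=config).run()


def scrape_many(urls: Iterable[str], prompt: str, config: dict,
                run_one: Callable[[str, str, dict], Any] = run_smart_scraper
                ) -> Dict[str, Any]:
    """Apply the same prompt to each URL; record errors instead of stopping."""
    results: Dict[str, Any] = {}
    for url in urls:
        try:
            results[url] = run_one(prompt, url, config)
        except Exception as exc:  # one failing page should not abort the batch
            results[url] = {"error": str(exc)}
    return results
```

Calling scrape_many(urls, prompt, config) with the default run_one needs the library, Playwright, and a valid API key; passing a stub for run_one lets you check the loop offline.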
Related on TokRepo
- Web Scraping Tools -- More web scraping and data extraction solutions
- AI Tools for Research -- Research automation tools powered by AI
Common pitfalls
- ScrapeGraphAI requires Playwright for dynamic JavaScript-rendered pages. Without it, only static HTML is parsed.
- LLM token costs can add up when scraping many pages. Use local models via Ollama for high-volume extraction to reduce API costs.
- The quality of extraction depends heavily on the prompt specificity. Vague prompts like 'get everything' produce inconsistent results.
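One way to keep prompts specific is to assemble them from an explicit item type and field list. The helper below is a hypothetical illustration (build_prompt is not a ScrapeGraphAI function), and the wording it produces is only one reasonable phrasing.

```python
from typing import Sequence


def build_prompt(item: str, fields: Sequence[str]) -> str:
    """Compose a specific extraction prompt from an item type and field names."""
    if not fields:
        raise ValueError("name at least one field to extract")
    field_list = ", ".join(fields)
    return (f"Extract every {item} on the page. "
            f"For each one, return these fields: {field_list}. "
            "Use null for any field that is missing.")


print(build_prompt("article", ["title", "author"]))
```

The resulting string can be passed as the prompt argument when creating a graph.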
Frequently asked questions
ScrapeGraphAI supports OpenAI, Anthropic, Google, Groq, and local models via Ollama. You configure the provider and model in the config dictionary when creating a graph instance.
ScrapeGraphAI handles JavaScript-rendered pages: it uses Playwright under the hood to render dynamic pages before extraction. You need to run 'playwright install' to set up the browser binaries.
Traditional libraries like BeautifulSoup and Scrapy require you to write CSS selectors or XPath. ScrapeGraphAI uses natural language prompts instead, letting the LLM determine how to locate and extract the target data.
ScrapeGraphAI can run on local models: it integrates with Ollama for local model inference. This is useful for high-volume scraping where API costs would be prohibitive. Point the model setting in your config at an Ollama-served model.
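Switching between a hosted provider and a local model comes down to the config dictionary. The sketch below builds that dictionary in the shape used by the example on this page. make_llm_config is a hypothetical helper, and the 'ollama/llama3' model name, the 'base_url' key, and the localhost address are assumptions to check against the ScrapeGraphAI and Ollama documentation for your versions.

```python
from typing import Optional


def make_llm_config(model: str, api_key: Optional[str] = None,
                    base_url: Optional[str] = None) -> dict:
    """Build the config dict passed to a graph, in the shape of the example."""
    llm = {"model": model}
    if api_key is not None:
        llm["api_key"] = api_key
    if base_url is not None:
        llm["base_url"] = base_url  # assumed key for a local Ollama server
    return {"llm": llm}


# Hosted provider, as in the example on this page.
hosted = make_llm_config("openai/gpt-4o", api_key="sk-...")

# Local model via Ollama; model name and URL are assumptions.
local = make_llm_config("ollama/llama3", base_url="http://localhost:11434")
```

Either dictionary would then be passed as the config argument when creating a graph instance.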
ScrapeGraphAI offers SmartScraperGraph for single-page extraction, SearchGraph for search-engine-based extraction, and SpeechGraph, which extracts information from a page and generates an audio file from the result. SmartScraperGraph is the most commonly used.
Sources cited (3)
- ScrapeGraphAI GitHub: ScrapeGraphAI is an AI-powered web scraping library with 23K+ GitHub stars
- Playwright Documentation: Playwright enables automated browser interaction for dynamic page rendering
- Ollama GitHub: Ollama enables running LLMs locally for cost-effective inference
Source and acknowledgements
Created by ScrapeGraphAI. Licensed under MIT. ScrapeGraphAI/Scrapegraph-ai — 23,000+ GitHub stars
Similar assets
Scrapy — Fast High-Level Web Crawling Framework for Python
Scrapy is the most battle-tested web scraping framework for Python. It handles concurrency, retries, throttling, cookies, and export pipelines — letting you write spiders that scale from one page to millions with the same code.
Deck.gl — GPU-Powered Geospatial Visualization Framework
A WebGL2-powered framework for large-scale data visualization, specializing in geospatial layers, 3D rendering, and composable layer architecture.
Sanic — Async Python Web Framework Built for Speed
Sanic is an async Python web framework built for speed. Native async/await from the ground up, HTTP/1.1 and HTTP/2, WebSocket, streaming, and auto-generated API docs. Designed to be fast, flexible, and easy to use.
Flask — The Python Micro Web Framework
Flask is a lightweight WSGI web application framework for Python. Designed to make getting started quick and easy, with the ability to scale up to complex applications. The minimalist counterpart to Django, trusted by Netflix, LinkedIn, and Pinterest.