ScrapeGraphAI — AI-Powered Web Scraping
Python scraping library powered by LLMs. Describe what you want to extract in natural language, get structured data back. Handles dynamic pages. 23K+ stars.
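Agent-ready install
This asset is installable. The agent first selects the current runtime, reviews the install plan, and then runs the matching command.
npx -y tokrepo@latest install d34e3181-e3f5-4853-871e-83acafe0c60e --target codex
Do a dry run first to confirm the install plan, then run this command.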
What it is
ScrapeGraphAI is a Python web scraping library that uses large language models to extract structured data from websites. Instead of writing CSS selectors or XPath queries, you describe what you want in natural language and the LLM figures out how to extract it.
ScrapeGraphAI targets developers who need data extraction from websites without building custom scrapers for each site. It works with OpenAI, Anthropic, Google, and local models via Ollama.
How it saves time or tokens
Traditional scraping requires writing and maintaining selectors that break when sites change their HTML structure. ScrapeGraphAI abstracts this away -- the LLM adapts to different page layouts without code changes. This listing estimates roughly 500 tokens per extraction run; actual usage varies with page size and the model used.
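As a rough budgeting aid, the per-run figure above can be scaled to a batch of pages. This is a minimal sketch, not a measurement: the function names are hypothetical, the 500-token default is simply the estimate quoted in this listing, and the price per 1,000 tokens is a placeholder you would replace with your provider's actual rate.

```python
def estimate_total_tokens(pages: int, tokens_per_run: int = 500) -> int:
    """Rough token budget, assuming one extraction run per page."""
    if pages < 0 or tokens_per_run < 0:
        raise ValueError("pages and tokens_per_run must be non-negative")
    return pages * tokens_per_run


def estimate_cost(pages: int, price_per_1k_tokens: float,
                  tokens_per_run: int = 500) -> float:
    """Cost in whatever currency price_per_1k_tokens is quoted in.

    price_per_1k_tokens is a placeholder; look up your provider's real rate.
    """
    return estimate_total_tokens(pages, tokens_per_run) / 1000 * price_per_1k_tokens
```

Real pages are often far larger than the estimate suggests, so treat the result as a lower bound and re-measure on a small sample before a large crawl.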
How to use
- Install ScrapeGraphAI and Playwright:
  pip install scrapegraphai
  playwright install
- Create a SmartScraperGraph with your prompt and target URL.
- Call .run() to get structured data back as a Python dictionary.
Example
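from scrapegraphai.graphs import SmartScraperGraph

# Describe the data you want in plain language and point at the page.
graph = SmartScraperGraph(
    prompt='Extract all article titles and their authors',
    source='https://news.ycombinator.com',
    # Provider and model are set under the 'llm' key.
    config={'llm': {'model': 'openai/gpt-4o', 'api_key': 'sk-...'}}
)

# Fetch the page, run the LLM extraction, and return the structured result.
result = graph.run()
print(result)
# Illustrative only; the exact shape depends on the prompt and model, e.g.:
# {'articles': [{'title': '...', 'author': '...'}, ...]}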
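The example above hard-codes the API key. One possible variation, sketched below, reads the key from an environment variable and wraps the call in a small function. The helper names `build_openai_config` and `scrape` and the `OPENAI_API_KEY` variable name are assumptions made for illustration; only `SmartScraperGraph`, its `prompt`, `source`, and `config` arguments, and `.run()` come from the example itself.

```python
import os


def build_openai_config(model: str = "openai/gpt-4o",
                        env_var: str = "OPENAI_API_KEY") -> dict:
    """Build the config dict from the example, reading the key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"llm": {"model": model, "api_key": api_key}}


def scrape(prompt: str, source: str, config: dict):
    """Run a single-page extraction with SmartScraperGraph."""
    # Imported lazily so the config helper works even without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=source, config=config)
    return graph.run()
```

With the key exported in your shell, a call would look like `scrape('Extract all article titles and their authors', 'https://news.ycombinator.com', build_openai_config())`.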
Related on TokRepo
- Web Scraping Tools -- More web scraping and data extraction solutions
- AI Tools for Research -- Research automation tools powered by AI
Common pitfalls
- ScrapeGraphAI requires Playwright for dynamic JavaScript-rendered pages. Without it, only static HTML is parsed.
- LLM token costs can add up when scraping many pages. Use local models via Ollama for high-volume extraction to reduce API costs.
- Extraction quality depends heavily on how specific the prompt is. Vague prompts like 'get everything' produce inconsistent results.
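The cost pitfall above suggests switching to a local Ollama model for high-volume jobs. A minimal sketch of that switch follows. Treat the details as assumptions to check against the documentation for your installed version: the `ollama/` model prefix, the `llama3` model name, and the `base_url` key are believed to follow the project's conventions but are not confirmed by this page, and the 100-page threshold and function names are purely hypothetical. The URL shown is Ollama's default local address.

```python
def build_ollama_config(model: str = "ollama/llama3",
                        base_url: str = "http://localhost:11434") -> dict:
    """Config for a locally served model (key names are assumptions; no API key)."""
    return {"llm": {"model": model, "base_url": base_url}}


def choose_config(page_count: int, hosted_config: dict, local_config: dict,
                  local_threshold: int = 100) -> dict:
    """Use the local model once the job is large enough that API costs matter.

    local_threshold is an arbitrary illustrative cutoff, not a recommendation.
    """
    return local_config if page_count >= local_threshold else hosted_config
```

The returned dictionary would be passed as the `config` argument when creating a graph, in place of the hosted-provider config.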
FAQ
ScrapeGraphAI supports OpenAI, Anthropic, Google, Groq, and local models via Ollama. You configure the provider and model in the config dictionary when creating a graph instance.
Yes. ScrapeGraphAI uses Playwright under the hood to render dynamic pages before extraction. You need to run 'playwright install' to set up the browser binaries.
Traditional libraries like BeautifulSoup and Scrapy require you to write CSS selectors or XPath. ScrapeGraphAI uses natural language prompts instead, letting the LLM determine how to locate and extract the target data.
Yes. ScrapeGraphAI integrates with Ollama for local model inference. This is useful for high-volume scraping where API costs would be prohibitive. In your config, set the model to one served by your local Ollama instance.
ScrapeGraphAI offers SmartScraperGraph for single-page extraction, SearchGraph for extraction from search-engine results, and SpeechGraph, which extracts content from a page and generates an audio file from it. SmartScraperGraph is the most commonly used.
Sources (3)
- ScrapeGraphAI GitHub -- ScrapeGraphAI is an AI-powered web scraping library with 23K+ GitHub stars
- Playwright Documentation -- Playwright enables automated browser interaction for dynamic page rendering
- Ollama GitHub -- Ollama enables running LLMs locally for cost-effective inference
Source and credits
Created by ScrapeGraphAI. Licensed under MIT. ScrapeGraphAI/Scrapegraph-ai — 23,000+ GitHub stars