Introduction
Colly provides a clean Go interface for web scraping and crawling. It handles concurrency, request delays, caching, and cookie management so you can focus on data extraction logic. Its callback-based design makes it straightforward to build both simple scrapers and complex multi-site crawlers.
What Colly Does
- Provides a declarative callback API for HTML element matching and extraction
- Manages request concurrency with configurable parallelism and delays
- Persists cookies across requests via a cookie jar and lets you set headers and basic authentication per request
- Supports distributed scraping via Redis or other shared storage backends
- Caches responses to avoid redundant network calls during development
Architecture Overview
Colly's Collector is the central object that manages HTTP requests, response parsing, and callback dispatch. When you call Visit(), the collector fetches the page, parses the HTML with goquery, and triggers your registered callbacks (OnRequest before the fetch, then OnResponse and OnHTML). The collector tracks visited URLs to avoid refetching, can be configured to honor robots.txt (it ignores it by default; set IgnoreRobotsTxt to false to enable checks), and can be extended with custom storage backends for visited-URL tracking and cookies, plus a separate queue package for persistent request queues.
Self-Hosting & Configuration
- Add Colly to your project: go get github.com/gocolly/colly/v2
- Create a Collector with options like AllowedDomains, MaxDepth, and UserAgent
- Register callbacks: OnHTML for CSS selectors, OnResponse for raw bytes
- Set rate limiting with Limit() rules per domain
- For distributed scraping, configure a Redis storage backend
Key Features
- Automatic parallelism with goroutine-safe collector instances
- Built-in robots.txt support (opt-in) and configurable crawl delays
- Response caching for faster development iteration cycles
- Extension ecosystem including proxy rotation and queue management
- Small dependency footprint with no CGO requirements
Comparison with Similar Tools
- Scrapy (Python) — full framework with pipelines and middlewares; Colly is more minimal and leverages Go's native concurrency
- chromedp — drives a real browser; Colly works at HTTP level without browser overhead
- goquery — HTML parsing only; Colly adds HTTP fetching, rate limiting, and crawl management
- Ferret — declarative query language for scraping; Colly offers programmatic Go control
- Rod — browser automation; Colly is faster for static HTML scraping at scale
FAQ
Q: Can Colly handle JavaScript-rendered pages? A: Not directly. For SPAs, pair Colly with chromedp or a headless browser for JS rendering, then pass HTML to Colly for extraction.
Q: How do I avoid getting blocked? A: Use Colly's built-in rate limiting, rotate user agents, and add proxy support via extensions.
Q: Does Colly support pagination? A: Yes. In your OnHTML callback, detect the next-page link and call e.Request.Visit() on its href; relative URLs are resolved against the current page.
Q: Is Colly suitable for large-scale crawling? A: Yes. With Redis-backed storage and distributed collectors, Colly handles millions of pages.