Scripts · May 14, 2026 · 1 min read

Katana — Fast and Configurable Web Crawler by ProjectDiscovery

Katana is a command-line web crawler written in Go by ProjectDiscovery, designed for security researchers and developers who need fast, configurable crawling with JavaScript rendering support.

Introduction

Katana fills the gap between simple link extractors and heavyweight browser automation tools. It provides configurable crawling in both standard HTTP and headless browser modes, outputs clean structured data, and integrates well with other command-line security tools. It is built in Go for speed and portability.

What Katana Does

  • Crawls websites using standard HTTP mode or headless Chromium for JavaScript-rendered pages
  • Extracts URLs from HTML, JavaScript files, inline scripts, CSS, robots.txt, and sitemap.xml
  • Supports scope control with domain, subdomain, and regex-based filters to stay within target boundaries
  • Outputs results in plain text, JSON, or JSONL format for easy pipeline integration
  • Handles authentication via custom headers, cookies, and form-based login
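Because each JSONL line is a self-contained JSON object, the output is easy to consume from any language. A minimal Python sketch of parsing one record; the field names ("request", "endpoint") are assumptions for illustration, and the real output shape may differ:

```python
import json

# Hypothetical line of Katana's JSONL output (field names assumed).
sample = '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}'

record = json.loads(sample)
endpoint = record["request"]["endpoint"]
print(endpoint)  # https://example.com/login
```

In a real pipeline you would read lines from Katana's stdout instead of a hardcoded string, extracting one field per line for the next tool in the chain.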

Architecture Overview

Katana uses a concurrent crawler engine with configurable parallelism and rate limiting. In standard mode, it makes HTTP requests and parses responses with a custom HTML parser optimized for link extraction. In headless mode, it launches a Chromium instance via the Rod library and captures network requests, DOM mutations, and dynamically generated URLs. A deduplication layer prevents re-crawling the same endpoints.
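The deduplication layer described above can be pictured as URL normalization plus a "seen" set: two URLs that normalize to the same key are treated as one endpoint. A minimal Python sketch of the idea (not Katana's actual implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def dedup_key(url: str) -> str:
    """Normalize a URL into a deduplication key: drop the fragment and
    any trailing slash so trivially different spellings collapse together."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))

seen = set()

def should_crawl(url: str) -> bool:
    """Return True the first time a normalized URL is encountered."""
    key = dedup_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True

print(should_crawl("https://example.com/a/"))   # True
print(should_crawl("https://example.com/a#x"))  # False — same key as above
```

A production crawler would also normalize query-parameter order and guard the set with a lock for concurrent workers, but the core idea is the same.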

Self-Hosting & Configuration

  • Install via go install, download pre-built binaries from GitHub releases, or use the Docker image
  • Configure crawl depth, concurrency, rate limit, and timeout via CLI flags or a YAML config file
  • Set scope rules with -cs (crawl scope) and -fs (field scope) to control what gets crawled and extracted
  • Use -H for custom headers and -proxy for routing through an HTTP or SOCKS5 proxy
  • Pipe output directly into other tools like httpx, nuclei, or grep for security workflows
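As a rough illustration of the YAML-based configuration mentioned above, a hypothetical config file mirroring the CLI flags; the key names here are assumptions for illustration, not Katana's verified schema:

```yaml
# Hypothetical Katana config fragment (key names assumed, not verified).
depth: 3          # maximum crawl depth
concurrency: 10   # parallel fetch workers
rate-limit: 150   # max requests per second
timeout: 10       # per-request timeout in seconds
```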

Key Features

  • Dual crawling modes: fast HTTP parsing and full headless browser with JavaScript execution
  • Automatic form filling and submission for discovering authenticated endpoints
  • Passive extraction from JavaScript files, detecting API endpoints and hardcoded URLs
  • Built-in field extraction for URLs, paths, query parameters, emails, and custom regex patterns
  • Seamless integration with the ProjectDiscovery ecosystem (subfinder, httpx, nuclei)
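Field extraction of this kind amounts to splitting each discovered URL into its components and emitting only the requested piece. A small Python illustration using the standard library (not Katana's code):

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.com/search?q=katana&page=2"

parts = urlsplit(url)            # scheme, host, path, query, fragment
params = parse_qs(parts.query)   # query string -> {name: [values]}

print(parts.path)      # /search
print(sorted(params))  # ['page', 'q']
```

Extracting just paths or parameter names like this is useful for building wordlists for later fuzzing.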

Comparison with Similar Tools

  • Scrapy — Python-based framework focused on data extraction; Katana is a Go CLI focused on URL discovery and security reconnaissance
  • Crawlee — Node.js crawling library for scraping at scale; Katana is lighter and designed for security workflows
  • gospider — similar Go-based crawler; Katana has headless support and better scope control
  • Burp Spider — built into Burp Suite; commercial and GUI-based while Katana is free and CLI-first
  • wget --spider — basic link checker; Katana extracts from JavaScript and supports headless rendering

FAQ

Q: When should I use headless mode? A: Use headless mode (-headless) for JavaScript-heavy single-page applications where content is rendered client-side. Standard mode is faster and sufficient for server-rendered sites.

Q: Can Katana handle authentication? A: Yes. Pass cookies via -H "Cookie: ...", use custom headers for token-based auth, or enable automatic form detection with -aff for form-based login.

Q: How do I limit the crawl scope? A: Use -cs with a regex pattern to restrict crawling to specific domains or paths. The -d flag controls maximum crawl depth.
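A regex-based crawl scope works as a predicate applied to every candidate URL before it is queued. A Python sketch of the idea, using a hypothetical pattern that allows example.com and its subdomains:

```python
import re

# Anchor at the start so a lookalike path segment cannot match the host.
scope = re.compile(r"^https://([a-z0-9-]+\.)*example\.com/")

urls = [
    "https://example.com/app",          # in scope
    "https://api.example.com/v1",       # subdomain, in scope
    "https://evil.com/example.com/",    # host mismatch, out of scope
]
for url in urls:
    print(url, bool(scope.match(url)))
```

Anchoring the pattern to the scheme and host, as above, is the important detail: an unanchored pattern like `example\.com` would also match URLs that merely contain the string in their path.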

Q: Does Katana respect robots.txt? A: By default Katana does not enforce robots.txt restrictions, as it is designed for security testing where full coverage is important. Use scope filters to restrict targets manually.
