Introduction
Katana fills the gap between simple link extractors and heavyweight browser automation tools. It provides configurable crawling with both standard HTTP and headless browser modes, outputs clean structured data, and integrates well with other command-line security tools. It is built in Go for speed and portability.
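A minimal first run looks like this; the target URL is a placeholder, and the flags come from katana's CLI help:

```bash
# Crawl the target to depth 2 and print each discovered URL on its own line
katana -u https://example.com -d 2 -silent
```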
What Katana Does
- Crawls websites using standard HTTP mode or headless Chromium for JavaScript-rendered pages
- Extracts URLs from HTML, JavaScript files, inline scripts, CSS, robots.txt, and sitemap.xml
- Supports scope control with domain, subdomain, and regex-based filters to stay within target boundaries
- Outputs results in plain text, JSON, or JSONL format for easy pipeline integration (see the example after this list)
- Handles authentication via custom headers, cookies, and form-based login
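A sketch combining several of these capabilities; example.com is a placeholder, and -jc, -kf, and -jsonl are flags taken from katana's help output:

```bash
# Parse endpoints out of JavaScript files (-jc), fetch robots.txt and
# sitemap.xml (-kf all), and write one JSON object per result (-jsonl)
katana -u https://example.com -jc -kf all -jsonl -o results.jsonl
```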
Architecture Overview
Katana uses a concurrent crawler engine with configurable parallelism and rate limiting. In standard mode, it makes HTTP requests and parses responses with a custom HTML parser optimized for link extraction. In headless mode, it launches a Chromium instance via the Rod library and captures network requests, DOM mutations, and dynamically generated URLs. A deduplication layer prevents re-crawling the same endpoints.
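Both engines expose the same tuning knobs on the command line; a sketch with a placeholder target:

```bash
# Standard engine with explicit concurrency, parallelism, and rate limiting
katana -u https://example.com -c 20 -p 20 -rl 150 -timeout 10

# Headless engine: launches Chromium via Rod to capture JS-generated URLs
katana -u https://example.com -headless
```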
Self-Hosting & Configuration
- Install via go install, download pre-built binaries from GitHub releases, or use the Docker image
- Configure crawl depth, concurrency, rate limit, and timeout via CLI flags or a YAML config file
- Set scope rules with -cs (crawl scope) and -fs (field scope) to control what gets crawled and extracted
- Use -H for custom headers and -proxy for routing through an HTTP or SOCKS5 proxy
- Pipe output directly into other tools like httpx, nuclei, or grep for security workflows (see the examples after this list)
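A few illustrative invocations; the install path matches the project's README, while the proxy address and header value are placeholders:

```bash
# Install the latest release with Go, or grab a binary from GitHub releases
go install github.com/projectdiscovery/katana/cmd/katana@latest

# Route the crawl through a local intercepting proxy with a custom header
katana -u https://example.com -proxy http://127.0.0.1:8080 -H "X-Api-Key: REPLACE_ME"

# Feed discovered URLs straight into httpx to probe which ones respond
katana -u https://example.com -silent | httpx -silent
```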
Key Features
- Dual crawling modes: fast HTTP parsing and full headless browser with JavaScript execution
- Automatic form filling and submission for discovering authenticated endpoints
- Passive extraction from JavaScript files, detecting API endpoints and hardcoded URLs
- Built-in field extraction for URLs, paths, query parameters, emails, and custom regex patterns
- Seamless integration with the ProjectDiscovery ecosystem (subfinder, httpx, nuclei), as shown below
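A sketch of the pipeline-style usage these features enable; targets are placeholders, and -f qurl is the field selector for URLs that carry query strings:

```bash
# Extract only parameterized URLs and hand them to nuclei for scanning
katana -u https://example.com -f qurl -silent | nuclei -silent

# Fill and submit forms automatically while crawling in headless mode
katana -u https://example.com -headless -aff
```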
Comparison with Similar Tools
- Scrapy — Python-based framework focused on data extraction; Katana is a Go CLI focused on URL discovery and security reconnaissance
- Crawlee — Node.js crawling library for scraping at scale; Katana is lighter and designed for security workflows
- gospider — similar Go-based crawler; Katana has headless support and better scope control
- Burp Spider — built into Burp Suite; commercial and GUI-based while Katana is free and CLI-first
- wget --spider — basic link checker; Katana extracts from JavaScript and supports headless rendering
FAQ
Q: When should I use headless mode?
A: Use headless mode (-headless) for JavaScript-heavy single-page applications where content is rendered client-side. Standard mode is faster and sufficient for server-rendered sites.
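For example, against a hypothetical single-page app:

```bash
# Render client-side JavaScript in headless Chromium before extracting links
katana -u https://spa.example.com -headless -d 2
```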
Q: Can Katana handle authentication?
A: Yes. Pass cookies via -H "Cookie: ...", use custom headers for token-based auth, or enable automatic form detection with -aff for form-based login.
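Two sketches, with placeholder credential values:

```bash
# Reuse an existing session cookie
katana -u https://example.com -H "Cookie: session=REPLACE_ME"

# Token-based auth via a custom header
katana -u https://example.com -H "Authorization: Bearer REPLACE_ME"
```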
Q: How do I limit the crawl scope?
A: Use -cs with a regex pattern to restrict crawling to specific domains or paths. The -d flag controls maximum crawl depth.
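For example, with a placeholder target:

```bash
# Restrict crawling to the target domain (regex) and stop at depth 3
katana -u https://example.com -cs "example\.com" -d 3
```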
Q: Does Katana respect robots.txt?
A: By default, Katana does not enforce robots.txt restrictions, as it is designed for security testing where full coverage is important. Use scope filters to restrict targets manually.
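For instance, -cos (crawl out-scope, the counterpart to -cs) can exclude paths you would rather not touch; the pattern below is illustrative:

```bash
# Skip logout and destructive endpoints instead of relying on robots.txt
katana -u https://example.com -cos "logout|signout|delete"
```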