Scripts · May 14, 2026 · 1 min read

Katana — Fast and Configurable Web Crawler by ProjectDiscovery

Katana is a command-line web crawler written in Go by ProjectDiscovery, designed for security researchers and developers who need fast, configurable crawling with JavaScript rendering support.

Introduction

Katana fills the gap between simple link extractors and heavyweight browser automation tools. It provides configurable crawling in both standard HTTP and headless browser modes, outputs clean structured data, and integrates well with other command-line security tools. It is built in Go for speed and portability.

What Katana Does

  • Crawls websites using standard HTTP mode or headless Chromium for JavaScript-rendered pages
  • Extracts URLs from HTML, JavaScript files, inline scripts, CSS, robots.txt, and sitemap.xml
  • Supports scope control with domain, subdomain, and regex-based filters to stay within target boundaries
  • Outputs results in plain text, JSON, or JSONL format for easy pipeline integration
  • Handles authentication via custom headers, cookies, and form-based login
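Because each JSONL line is a self-contained JSON object, the output is easy to consume from any language. A minimal Python sketch of parsing one record; the field names ("request", "endpoint") are assumptions for illustration, and the real output shape may differ:

```python
import json

# Hypothetical line of Katana's JSONL output (field names assumed).
sample = '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}'

record = json.loads(sample)
endpoint = record["request"]["endpoint"]
print(endpoint)  # https://example.com/login
```

In a real pipeline you would read lines from Katana's stdout instead of a hardcoded string, extracting one field per line for the next tool in the chain.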

Architecture Overview

Katana uses a concurrent crawler engine with configurable parallelism and rate limiting. In standard mode, it makes HTTP requests and parses responses with a custom HTML parser optimized for link extraction. In headless mode, it launches a Chromium instance via the Rod library and captures network requests, DOM mutations, and dynamically generated URLs. A deduplication layer prevents re-crawling the same endpoints.
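The deduplication layer described above can be pictured as URL normalization plus a "seen" set: two URLs that normalize to the same key are treated as one endpoint. A minimal Python sketch of the idea (not Katana's actual implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def dedup_key(url: str) -> str:
    """Normalize a URL into a deduplication key: drop the fragment and
    any trailing slash so trivially different spellings collapse together."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))

seen = set()

def should_crawl(url: str) -> bool:
    """Return True the first time a normalized URL is encountered."""
    key = dedup_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True

print(should_crawl("https://example.com/a/"))   # True
print(should_crawl("https://example.com/a#x"))  # False — same key as above
```

A production crawler would also normalize query-parameter order and guard the set with a lock for concurrent workers, but the core idea is the same.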

Self-Hosting & Configuration

  • Install via go install, download pre-built binaries from GitHub releases, or use the Docker image
  • Configure crawl depth, concurrency, rate limit, and timeout via CLI flags or a YAML config file
  • Set scope rules with -cs (crawl scope) and -fs (field scope) to control what gets crawled and extracted
  • Use -H for custom headers and -proxy for routing through an HTTP or SOCKS5 proxy
  • Pipe output directly into other tools like httpx, nuclei, or grep for security workflows
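As a rough illustration of the YAML-based configuration mentioned above, a hypothetical config file mirroring the CLI flags; the key names here are assumptions for illustration, not Katana's verified schema:

```yaml
# Hypothetical Katana config fragment (key names assumed, not verified).
depth: 3          # maximum crawl depth
concurrency: 10   # parallel fetch workers
rate-limit: 150   # max requests per second
timeout: 10       # per-request timeout in seconds
```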

Key Features

  • Dual crawling modes: fast HTTP parsing and full headless browser with JavaScript execution
  • Automatic form filling and submission for discovering authenticated endpoints
  • Passive extraction from JavaScript files, detecting API endpoints and hardcoded URLs
  • Built-in field extraction for URLs, paths, query parameters, emails, and custom regex patterns
  • Seamless integration with the ProjectDiscovery ecosystem (subfinder, httpx, nuclei)
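Field extraction of this kind amounts to splitting each discovered URL into its components and emitting only the requested piece. A small Python illustration using the standard library (not Katana's code):

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.com/search?q=katana&page=2"

parts = urlsplit(url)            # scheme, host, path, query, fragment
params = parse_qs(parts.query)   # query string -> {name: [values]}

print(parts.path)      # /search
print(sorted(params))  # ['page', 'q']
```

Extracting just paths or parameter names like this is useful for building wordlists for later fuzzing.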

Comparison with Similar Tools

  • Scrapy — Python-based framework focused on data extraction; Katana is a Go CLI focused on URL discovery and security reconnaissance
  • Crawlee — Node.js crawling library for scraping at scale; Katana is lighter and designed for security workflows
  • gospider — similar Go-based crawler; Katana has headless support and better scope control
  • Burp Spider — built into Burp Suite; commercial and GUI-based while Katana is free and CLI-first
  • wget --spider — basic link checker; Katana extracts from JavaScript and supports headless rendering

FAQ

Q: When should I use headless mode? A: Use headless mode (-headless) for JavaScript-heavy single-page applications where content is rendered client-side. Standard mode is faster and sufficient for server-rendered sites.

Q: Can Katana handle authentication? A: Yes. Pass cookies via -H "Cookie: ...", use custom headers for token-based auth, or enable automatic form detection with -aff for form-based login.

Q: How do I limit the crawl scope? A: Use -cs with a regex pattern to restrict crawling to specific domains or paths. The -d flag controls maximum crawl depth.
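A regex-based crawl scope works as a predicate applied to every candidate URL before it is queued. A Python sketch of the idea, using a hypothetical pattern that allows example.com and its subdomains:

```python
import re

# Anchor at the start so a lookalike path segment cannot match the host.
scope = re.compile(r"^https://([a-z0-9-]+\.)*example\.com/")

urls = [
    "https://example.com/app",          # in scope
    "https://api.example.com/v1",       # subdomain, in scope
    "https://evil.com/example.com/",    # host mismatch, out of scope
]
for url in urls:
    print(url, bool(scope.match(url)))
```

Anchoring the pattern to the scheme and host, as above, is the important detail: an unanchored pattern like `example\.com` would also match URLs that merely contain the string in their path.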

Q: Does Katana respect robots.txt? A: By default Katana does not enforce robots.txt restrictions, as it is designed for security testing where full coverage is important. Use scope filters to restrict targets manually.
