# Katana — Fast and Configurable Web Crawler by ProjectDiscovery

> Katana is a command-line web crawler written in Go by ProjectDiscovery, designed for security researchers and developers who need fast, configurable crawling with JavaScript rendering support.

## Quick Use

```bash
# Install with Go
go install github.com/projectdiscovery/katana/cmd/katana@latest

# Basic crawl
katana -u https://example.com

# Crawl with headless browser rendering
katana -u https://example.com -headless

# Crawl and output only JavaScript files
katana -u https://example.com -ef png,jpg,gif -f url | grep "\.js$"
```

## Introduction

Katana fills the gap between simple link extractors and heavyweight browser automation tools. It provides configurable crawling with both standard HTTP and headless browser modes, outputs clean structured data, and integrates well with other command-line security tools. It is built in Go for speed and portability.

## What Katana Does

- Crawls websites using standard HTTP mode or headless Chromium for JavaScript-rendered pages
- Extracts URLs from HTML, JavaScript files, inline scripts, CSS, robots.txt, and sitemap.xml
- Supports scope control with domain, subdomain, and regex-based filters to stay within target boundaries
- Outputs results in plain text, JSON, or JSONL format for easy pipeline integration
- Handles authentication via custom headers, cookies, and form-based login

## Architecture Overview

Katana uses a concurrent crawler engine with configurable parallelism and rate limiting. In standard mode, it makes HTTP requests and parses responses with a custom HTML parser optimized for link extraction. In headless mode, it launches a Chromium instance via the Rod library and captures network requests, DOM mutations, and dynamically generated URLs. A deduplication layer prevents re-crawling the same endpoints.
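The engine knobs described above map to CLI flags. A minimal sketch (flag names as documented in recent katana releases; the target URL and values are illustrative), followed by the same deduplication idea applied by hand when merging URL lists from separate runs:

```shell
# Tune the crawler engine (illustrative values):
#   -d        maximum crawl depth
#   -c        concurrent fetchers
#   -rl       max requests per second
#   -timeout  per-request timeout in seconds
# katana -u https://example.com -d 3 -c 10 -rl 150 -timeout 10 -o run1.txt

# Katana deduplicates endpoints internally during a crawl; the same idea
# applies when combining the output of several runs afterwards:
printf 'https://example.com/a\nhttps://example.com/b\nhttps://example.com/a\n' \
  | sort -u
```

The `sort -u` step is not part of Katana itself; it is the standard way to collapse duplicate URLs when aggregating multiple output files in a pipeline.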
## Self-Hosting & Configuration

- Install via `go install`, download pre-built binaries from GitHub releases, or use the Docker image
- Configure crawl depth, concurrency, rate limit, and timeout via CLI flags or a YAML config file
- Set scope rules with `-cs` (crawl scope) and `-fs` (field scope) to control what gets crawled and extracted
- Use `-H` for custom headers and `-proxy` for routing through an HTTP or SOCKS5 proxy
- Pipe output directly into other tools like httpx, nuclei, or grep for security workflows

## Key Features

- Dual crawling modes: fast HTTP parsing and a full headless browser with JavaScript execution
- Automatic form filling and submission for discovering authenticated endpoints
- Passive extraction from JavaScript files, detecting API endpoints and hardcoded URLs
- Built-in field extraction for URLs, paths, query parameters, emails, and custom regex patterns
- Seamless integration with the ProjectDiscovery ecosystem (subfinder, httpx, nuclei)

## Comparison with Similar Tools

- **Scrapy** — Python-based framework focused on data extraction; Katana is a Go CLI focused on URL discovery and security reconnaissance
- **Crawlee** — Node.js crawling library for scraping at scale; Katana is lighter and designed for security workflows
- **gospider** — similar Go-based crawler; Katana has headless support and better scope control
- **Burp Spider** — built into Burp Suite; commercial and GUI-based, while Katana is free and CLI-first
- **wget --spider** — basic link checker; Katana extracts from JavaScript and supports headless rendering

## FAQ

**Q: When should I use headless mode?**
A: Use headless mode (`-headless`) for JavaScript-heavy single-page applications where content is rendered client-side. Standard mode is faster and sufficient for server-rendered sites.

**Q: Can Katana handle authentication?**
A: Yes.
Pass cookies via `-H "Cookie: ..."`, use custom headers for token-based auth, or enable automatic form detection with `-aff` for form-based login.

**Q: How do I limit the crawl scope?**
A: Use `-cs` with a regex pattern to restrict crawling to specific domains or paths. The `-d` flag controls maximum crawl depth.

**Q: Does Katana respect robots.txt?**
A: By default, Katana does not enforce robots.txt restrictions, as it is designed for security testing where full coverage is important. Use scope filters to restrict targets manually.

## Sources

- https://github.com/projectdiscovery/katana
- https://docs.projectdiscovery.io/tools/katana/overview

---

Source: https://tokrepo.com/en/workflows/asset-adb11ee8
Author: Script Depot