# Magika: Google AI File Type Detection Tool

> Google's deep learning file type detector with 99%+ accuracy. Magika identifies 200+ file types using AI instead of magic bytes, ideal for security scanning and content processing.

## Install

Install the package from PyPI:

```bash
pip install magika
```

## Quick Use

```bash
# CLI usage
magika document.pdf image.png script.py unknown_file

# Example output (illustrative; the exact format depends on the Magika version):
# document.pdf: PDF document (confidence: 99.9%)
# image.png: PNG image (confidence: 99.8%)
# script.py: Python source (confidence: 99.5%)
# unknown_file: JavaScript source (confidence: 98.2%)
```

```python
from pathlib import Path

from magika import Magika

# Field names below (ct_label, score, ...) follow the API used throughout this
# page; other Magika releases may name them differently (e.g. `label`).
m = Magika()  # loads the model once; reuse this instance for later calls
result = m.identify_path(Path("unknown_file"))

print(f"Type: {result.output.ct_label}")    # "javascript"
print(f"Group: {result.output.group}")      # "code"
print(f"MIME: {result.output.mime_type}")   # "application/javascript"
print(f"Score: {result.output.score:.2%}")  # "98.20%"
```

## What is Magika?

Magika is Google's AI-powered file type identification tool. Instead of relying on file extensions or magic bytes (like the Unix `file` command), Magika uses a trained deep learning model to identify 200+ file types with 99%+ accuracy. It is especially good at distinguishing similar types (JavaScript vs TypeScript, JSON vs JSONL) and at detecting misnamed or obfuscated files, which is critical for security scanning.

**Answer-Ready**: Magika is Google's AI file type detector. Its deep learning model identifies 200+ file types with 99%+ accuracy and outperforms magic bytes on similar types and obfuscated files. Used in Gmail and Google Drive security. Available as a Python library and a CLI. 8k+ GitHub stars.

**Best for**: Security scanning, content processing pipelines, and file validation.

**Works with**: Python, CLI, any pipeline.

**Setup time**: Under 1 minute.

## Core Features

### 1. 200+ File Types

| Category | Types |
|----------|-------|
| Code | Python, JavaScript, TypeScript, Rust, Go, Java, C++, ... |
| Documents | PDF, DOCX, XLSX, PPTX, Markdown, LaTeX |
| Data | JSON, CSV, XML, YAML, TOML, Parquet |
| Media | PNG, JPEG, GIF, WebP, MP3, MP4, WebM |
| Archives | ZIP, TAR, GZIP, RAR, 7Z |
| Executable | ELF, PE, Mach-O, Shell scripts |
| Web | HTML, CSS, SVG, WASM |

### 2. Batch Processing

```python
from pathlib import Path

# `m` is the Magika instance created in Quick Use.
paths = [
    Path("file1.txt"),
    Path("file2.dat"),
    Path("file3.bin"),
]
results = m.identify_paths(paths)

# Results come back in the same order as the input paths.
for path, result in zip(paths, results):
    print(f"{path}: {result.output.ct_label} ({result.output.score:.0%})")
```

### 3. Bytes Detection (No File Needed)

```python
# Identify content that is already in memory, e.g. an upload buffer.
content = b'{"name": "test", "value": 42}'
result = m.identify_bytes(content)
print(result.output.ct_label)  # "json"
```

### 4. Security Use Cases

```python
from pathlib import Path

# Detect file type mismatches
uploaded_file = Path("profile_photo.jpg")
result = m.identify_path(uploaded_file)

if result.output.ct_label != "jpeg":
    print(f"WARNING: File claims to be JPEG but is actually {result.output.ct_label}")
    # Could be a disguised executable or script
```

## Magika vs Traditional Tools

| Feature | Magika | file (libmagic) | python-magic |
|---------|--------|-----------------|--------------|
| Method | Deep learning | Magic bytes | Magic bytes |
| Accuracy | 99%+ | ~90% | ~90% |
| Similar types | Excellent | Poor | Poor |
| Obfuscated files | Good | Poor | Poor |
| Speed | Fast (a few ms/file) | Very fast | Very fast |
| Types | 200+ | 1000+ | 1000+ |

## FAQ

**Q: Is it fast enough for production?**
A: Yes. After the one-time model load, inference takes a few milliseconds per file (the project README cites about 5 ms on a single CPU). Batch mode handles many files in a single call.

**Q: Does it work on binary files?**
A: Yes, it identifies executables, archives, media, and other binary formats among its supported types.

**Q: How is it used at Google?**
A: Magika powers file type detection in Gmail (attachment scanning) and Google Drive (content safety).

## Source & Thanks

> Created by [Google](https://github.com/google). Licensed under Apache 2.0.
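>
> [google/magika](https://github.com/google/magika), 8k+ stars

## Quick Use

```bash
pip install magika
magika unknown_file
```

AI deep learning file type detection with 99%+ accuracy.

## What is Magika?

Google's AI file type detection tool. A deep learning model identifies 200+ file types, more accurately than magic bytes.

**One-line summary**: Google AI file type detection; a deep learning model covering 200+ types with 99%+ accuracy; distinguishes similar types and disguised files; used for Gmail and Drive security; 8k+ stars.

**Best for**: Security scanning, content processing pipelines, and file validation.

## Core Features

### 1. 200+ File Types: code, documents, data, media, executables

### 2. Bytes Detection: no file needed, detect content directly

### 3. Security Detection: spot disguised files

## FAQ

**Q: Is it production-ready?**
A: Yes. Inference takes a few milliseconds per file after model loading, and batch mode handles many files in a single call.

## Source & Thanks

> [google/magika](https://github.com/google/magika), 8k+ stars, Apache 2.0

---

Source: https://tokrepo.com/en/workflows/82c22f07-4787-47d7-9643-8f3a5ec40706
Author: Prompt Lab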
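
The JPEG mismatch check under Security Use Cases can be generalized to several extensions. The following is a minimal sketch, not part of Magika itself: `EXPECTED_LABELS`, `check_claimed_type`, and `scan_uploads` are hypothetical names introduced here, and the label strings in the mapping are assumptions that should be verified against the labels your installed Magika version reports. The comparison logic is kept separate from the Magika call so it can be exercised without the model.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Hypothetical policy: file extension -> labels we accept for it.
# The label strings are assumptions; confirm them against your Magika version.
EXPECTED_LABELS = {
    ".jpg": {"jpeg"},
    ".jpeg": {"jpeg"},
    ".png": {"png"},
    ".pdf": {"pdf"},
    ".json": {"json"},
}


def check_claimed_type(filename: str, detected_label: str) -> Optional[str]:
    """Return a warning if the detected label contradicts the extension, else None."""
    suffix = Path(filename).suffix.lower()
    expected = EXPECTED_LABELS.get(suffix)
    if expected is None:
        return None  # no policy defined for this extension
    if detected_label in expected:
        return None  # extension and detected content agree
    return (
        f"{filename}: extension {suffix} suggests {sorted(expected)}, "
        f"but content was detected as {detected_label}"
    )


def scan_uploads(paths: Iterable[str]) -> List[str]:
    """Run Magika over `paths` and collect mismatch warnings."""
    # Imported lazily so check_claimed_type works without magika installed.
    from magika import Magika

    path_list = [Path(p) for p in paths]
    results = Magika().identify_paths(path_list)

    warnings = []
    for path, result in zip(path_list, results):
        # `output.ct_label` is the field used elsewhere on this page;
        # other Magika releases may expose the label under a different name.
        warning = check_claimed_type(str(path), result.output.ct_label)
        if warning is not None:
            warnings.append(warning)
    return warnings
```

A caller would pass upload paths to `scan_uploads` and reject or quarantine anything that produces a warning. Extensions missing from the mapping are skipped rather than flagged, which is a design choice: a stricter policy could treat unknown extensions as suspicious instead.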