Scripts · Apr 8, 2026 · 2 min read

Magika — Google AI File Type Detection Tool

Google's deep learning file type detector with 99%+ accuracy. Magika identifies 200+ file types using AI instead of magic bytes, ideal for security scanning and content processing.

Prompt Lab · Community
Quick Use

Use it first, then decide how deep to go


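pip install magika

# CLI usage
magika document.pdf image.png script.py unknown_file
# Illustrative output (the exact format depends on the Magika version and flags):
# document.pdf: PDF document (confidence: 99.9%)
# image.png: PNG image (confidence: 99.8%)
# script.py: Python source (confidence: 99.5%)
# unknown_file: JavaScript source (confidence: 98.2%)

# Python usage
from pathlib import Path

from magika import Magika

m = Magika()  # loads the model once; reuse this instance for every call

# A Path object is accepted by both older and newer Magika releases.
result = m.identify_path(Path("unknown_file"))
# These attribute names match Magika 0.5.x; releases from 0.6 expose
# result.output.label and result.score instead.
print(f"Type: {result.output.ct_label}")      # "javascript"
print(f"Group: {result.output.group}")          # "code"
print(f"MIME: {result.output.mime_type}")        # "application/javascript"
print(f"Score: {result.output.score:.2%}")       # "98.20%"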

What is Magika?

Magika is Google's AI-powered file type identification tool. Instead of relying on file extensions or magic bytes (like the Unix file command), Magika uses a trained deep learning model to identify 200+ file types with 99%+ accuracy. It is especially good at distinguishing similar types (JavaScript vs TypeScript, JSON vs JSONL) and detecting misnamed or obfuscated files — critical for security scanning.
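To see why signature matching struggles with text formats, here is a minimal sketch of a magic-bytes check. The `SIGNATURES` table and the `sniff_magic` helper are hypothetical illustrations, not part of Magika or libmagic, and real signature databases are far larger.

```python
# Hypothetical, tiny signature table for illustration only.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


def sniff_magic(data: bytes) -> str:
    """Return a label if the data starts with a known signature, else 'unknown'."""
    for prefix, label in SIGNATURES.items():
        if data.startswith(prefix):
            return label
    return "unknown"


# Binary formats carry a fixed header, so prefix matching works.
print(sniff_magic(b"%PDF-1.7\n..."))               # pdf
# Text formats have no header: JSON and JavaScript are both "just text" here.
print(sniff_magic(b'{"name": "test"}'))            # unknown
print(sniff_magic(b"function f() { return 1; }"))  # unknown
```

A content-based model is useful precisely where the prefix table has nothing to match on: source code, configuration files, and other plain-text formats.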

Answer-Ready: Magika is Google's AI file type detector. Its deep learning model identifies 200+ file types with 99%+ accuracy and handles similar types and obfuscated files better than magic-byte matching. It is used for security in Gmail and Google Drive, ships as a Python library and a CLI, and has 8k+ GitHub stars.

Best for: Security scanning, content processing pipelines, and file validation.
Works with: Python, CLI, any pipeline.
Setup time: Under 1 minute.

Core Features

1. 200+ File Types

| Category | Types |
| --- | --- |
| Code | Python, JavaScript, TypeScript, Rust, Go, Java, C++, ... |
| Documents | PDF, DOCX, XLSX, PPTX, Markdown, LaTeX |
| Data | JSON, CSV, XML, YAML, TOML, Parquet |
| Media | PNG, JPEG, GIF, WebP, MP3, MP4, WebM |
| Archives | ZIP, TAR, GZIP, RAR, 7Z |
| Executable | ELF, PE, Mach-O, Shell scripts |
| Web | HTML, CSS, SVG, WASM |

2. Batch Processing

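from pathlib import Path

# Reuses the Magika instance `m` created in Quick Use.
paths = [
    Path("file1.txt"),
    Path("file2.dat"),
    Path("file3.bin"),
]
# Results come back in the same order as the input paths.
results = m.identify_paths(paths)
for path, result in zip(paths, results):
    print(f"{path}: {result.output.ct_label} ({result.output.score:.0%})")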
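Batch results are often more useful as a summary than as a per-file listing. The sketch below tallies labels for every file under a directory. The helper names (`tally_labels`, `scan_directory`, `magika_labels`) are hypothetical, and `magika_labels` assumes the `identify_paths` / `output.ct_label` API used in this article.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable, List


def tally_labels(labels: Iterable[str]) -> Counter:
    """Count how many files were assigned each content-type label."""
    return Counter(labels)


def scan_directory(root: Path, identify: Callable[[List[Path]], List[str]]) -> Counter:
    """Collect regular files under `root` and tally the labels `identify` returns."""
    paths = sorted(p for p in root.rglob("*") if p.is_file())
    return tally_labels(identify(paths))


def magika_labels(paths: List[Path]) -> List[str]:
    """Label files with Magika (assumes the API shown in this article)."""
    from magika import Magika  # imported lazily so the helpers above work without it

    m = Magika()
    return [r.output.ct_label for r in m.identify_paths(paths)]


# Usage (requires magika to be installed):
#   for label, count in scan_directory(Path("uploads"), magika_labels).most_common():
#       print(f"{label}: {count}")
```

Passing the identifier in as a function keeps the directory walk testable without loading the model.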

3. Bytes Detection (No File Needed)

content = b'{"name": "test", "value": 42}'
result = m.identify_bytes(content)
print(result.output.ct_label)  # "json"

4. Security Use Cases

# Detect file type mismatches (reuses `m` and `Path` from Quick Use)
uploaded_file = Path("profile_photo.jpg")
result = m.identify_path(uploaded_file)
if result.output.ct_label != "jpeg":
    print(f"WARNING: File claims to be JPEG but is actually {result.output.ct_label}")
    # Could be a disguised executable or script
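The single-file check above generalizes to an allow-list that maps each accepted extension to the labels it may contain. This is a minimal sketch under stated assumptions: `EXPECTED_LABELS`, `check_upload`, and `magika_detect` are hypothetical names, and the label strings should be verified against what your Magika version actually emits.

```python
from pathlib import Path
from typing import Callable, Dict, Optional, Set

# Hypothetical allow-list: extension -> labels accepted for that extension.
EXPECTED_LABELS: Dict[str, Set[str]] = {
    ".jpg": {"jpeg"},
    ".jpeg": {"jpeg"},
    ".png": {"png"},
    ".pdf": {"pdf"},
    ".json": {"json"},
}


def check_upload(filename: str, content: bytes, detect: Callable[[bytes], str]) -> Optional[str]:
    """Return a warning if the detected type contradicts the extension, else None."""
    ext = Path(filename).suffix.lower()
    expected = EXPECTED_LABELS.get(ext)
    if expected is None:
        return f"{filename}: extension {ext or '(none)'} is not on the allow-list"
    label = detect(content)
    if label not in expected:
        return f"{filename}: claims {ext} but content looks like {label}"
    return None


def magika_detect(content: bytes) -> str:
    """Detect from bytes (assumes the identify_bytes / output.ct_label API shown here)."""
    from magika import Magika  # lazy import; a real service would reuse one instance

    return Magika().identify_bytes(content).output.ct_label


# Usage (requires magika): warning = check_upload(name, data, magika_detect)
```

Rejecting unknown extensions first means the detector only has to confirm a claim, never to decide on its own what is safe.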

Magika vs Traditional Tools

| Feature | Magika | file (libmagic) | python-magic |
| --- | --- | --- | --- |
| Method | Deep learning | Magic bytes | Magic bytes |
| Accuracy | 99%+ | ~90% | ~90% |
| Similar types | Excellent | Poor | Poor |
| Obfuscated files | Good | Poor | Poor |
| Speed | Fast (~1 ms/file) | Very fast | Very fast |
| Types | 200+ | 1000+ | 1000+ |

FAQ

Q: Is it fast enough for production? A: Yes, ~1ms per file after model loading. Batch mode processes thousands of files per second.

Q: Does it work on binary files? A: Yes, it identifies executables, archives, media, and any binary format.

Q: How is it used at Google? A: Magika powers file type detection in Gmail (attachment scanning) and Google Drive (content safety).
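To check the per-file latency figure from the first answer on your own hardware, a small timing helper is enough. The sketch below is illustrative: `mean_latency_ms` is a hypothetical helper, and the commented usage assumes the `identify_bytes` method shown earlier.

```python
import time
from typing import Callable, Sequence


def mean_latency_ms(identify: Callable[[bytes], object], samples: Sequence[bytes]) -> float:
    """Average wall-clock time per call in milliseconds, excluding any setup cost."""
    if not samples:
        raise ValueError("need at least one sample")
    start = time.perf_counter()
    for data in samples:
        identify(data)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(samples)


# Usage (requires magika): construct the model once, then time only the calls.
#   from magika import Magika
#   m = Magika()                      # model load happens here, once
#   samples = [b'{"a": 1}'] * 1000
#   print(f"{mean_latency_ms(m.identify_bytes, samples):.2f} ms per sample")
```

Creating the `Magika` instance outside the timed loop matters, because the quoted figure applies after model loading.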


Source & Thanks

Created by Google. Licensed under Apache 2.0.

google/magika — 8k+ stars
