Scripts · May 8, 2026 · 4 min read

Grok 2 Vision — Image Understanding API for Apps

Grok-2 Vision accepts images through the OpenAI-compatible chat.completions endpoint. Pass each image as a URL or a base64 data URI. Typical uses: UI critique, screenshot QA, OCR, and chart reading.

xAI
xAI · Community
Universal CLI command
npx tokrepo install 3ac4e1a5-2129-498e-920b-c46a9c17839c
Introduction

Grok-2 Vision (model grok-2-vision-latest) accepts images in the standard OpenAI chat.completions message format: add an image_url content part that holds either a public URL or a base64 data URI. The model replies with text grounded in the image. Best for: UI screenshot critique, chart and dashboard reading, document OCR with semantic understanding, content moderation, and accessibility alt-text generation. Works with: openai-python, openai-node, and any other OpenAI-compatible client. Setup time: 2 minutes.


Public URL

import os

from openai import OpenAI

# The endpoint is OpenAI-compatible, so the stock client works with a custom base_url.
client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Critique this dashboard. Mention spacing, color contrast, info hierarchy."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)

Base64 (for local files / private URLs)

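# Reuses `client` from the Public URL example.
import base64, mimetypes

def to_data_uri(path: str) -> str:
    # Guess the MIME type from the file extension; fall back to PNG.
    mime = mimetypes.guess_type(path)[0] or "image/png"
    # Read inside a context manager so the file handle is closed promptly.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"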

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items as JSON: description, qty, unit_price, total"},
            {"type": "image_url", "image_url": {"url": to_data_uri("invoice.png")}},
        ],
    }],
    response_format={"type": "json_object"},
)
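
The request above asks for JSON, but the reply still arrives as a string in resp.choices[0].message.content. Below is a minimal sketch of decoding and checking that string. The helper name parse_line_items and the top-level key line_items are assumptions for illustration; the prompt on this page does not fix the shape of the object the model returns.

```python
import json

REQUIRED_FIELDS = ("description", "qty", "unit_price", "total")

def parse_line_items(raw: str, key: str = "line_items") -> list[dict]:
    """Decode a JSON reply and return the rows that carry every required field.

    `key` is an assumed name for the list inside the top-level object;
    a bare top-level list is accepted as well.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    rows = data.get(key, []) if isinstance(data, dict) else data
    if not isinstance(rows, list):
        raise ValueError(f"expected a list of line items, got {type(rows).__name__}")
    return [
        row for row in rows
        if isinstance(row, dict) and all(field in row for field in REQUIRED_FIELDS)
    ]

# Stand-in for resp.choices[0].message.content (hand-written, not real model output).
sample = (
    '{"line_items": ['
    '{"description": "Widget", "qty": 2, "unit_price": 4.5, "total": 9.0},'
    '{"description": "Incomplete row"}'
    ']}'
)
items = parse_line_items(sample)
print(items)
```

Rows missing a field are dropped silently here; a stricter pipeline might raise instead, so that a partial extraction cannot pass unnoticed.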

Multi-image comparison

content = [{"type": "text", "text": "Spot 3 differences between these UI mockups."}]
for url in ["https://example.com/v1.png", "https://example.com/v2.png"]:
    content.append({"type": "image_url", "image_url": {"url": url}})

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{"role": "user", "content": content}],
)
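
The loop above can be wrapped in a small helper that refuses a list longer than the per-message limit given under Constraints on this page. This is a sketch only: build_image_content is a hypothetical name, and the limit is copied from this page rather than queried from the API.

```python
MAX_IMAGES_PER_MESSAGE = 10  # value taken from the Constraints section of this page

def build_image_content(prompt: str, image_urls: list[str]) -> list[dict]:
    """Return a chat.completions `content` list: one text part, then one part per image."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    if len(image_urls) > MAX_IMAGES_PER_MESSAGE:
        raise ValueError(
            f"{len(image_urls)} images given, limit is {MAX_IMAGES_PER_MESSAGE} per message"
        )
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return content

mockup_content = build_image_content(
    "Spot 3 differences between these UI mockups.",
    ["https://example.com/v1.png", "https://example.com/v2.png"],
)
print(len(mockup_content))
```

The returned list can be passed as the content of a single user message, exactly as in the example above.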

Constraints

Limit                    Value
Max images per message   10
Max image size           20 MB
Supported formats        PNG, JPEG, WebP, non-animated GIF
Context window           32,768 tokens
Max long edge            8,192 px (resized server-side)
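
Two of the limits above, file size and format, can be checked locally before a file is encoded and uploaded. The sketch below uses only the standard library. check_image is a hypothetical helper, the 20 MB limit is read as 20 × 1024 × 1024 bytes (an assumption), and the format check looks only at the file extension, so it cannot detect an animated GIF or measure pixel dimensions.

```python
import os
import tempfile

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # "20 MB" interpreted as MiB; an assumption
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

def check_image(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passed these checks."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_IMAGE_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_IMAGE_BYTES}")
    return problems

# Demo with throwaway files; the bytes are placeholders, not real images.
with tempfile.TemporaryDirectory() as tmp:
    ok_path = os.path.join(tmp, "shot.png")
    bad_path = os.path.join(tmp, "scan.bmp")
    for p in (ok_path, bad_path):
        with open(p, "wb") as fh:
            fh.write(b"\x00" * 16)
    ok_result = check_image(ok_path)
    bad_result = check_image(bad_path)

print(ok_result, bad_result)
```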

FAQ

Q: Grok-2 Vision vs GPT-4o vision vs Claude 3.5 Sonnet vision?
A: The three are comparable on most benchmarks. Grok-2 is the fastest end-to-end and the cheapest, at roughly $2 per million input tokens; GPT-4o is best at fine OCR; Claude is best at structured extraction with long instructions. Pick by primary task.

Q: Does it do bounding boxes?
A: No. Grok-2 Vision returns text descriptions, not coordinates. For object detection with bounding boxes, use Gemini 2.5 (detect_objects tool) or YOLO. The two approaches combine well: YOLO for the boxes, Grok for semantic interpretation of the cropped regions.
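
The detector-plus-Grok combination can be sketched as a function that turns already-cropped regions into one request payload each. Everything about the input shape is an assumption: the region dict layout ({"label", "png_bytes"}) and the helper name region_messages are hypothetical, and the cropping itself is left to whatever detector and image library is in use.

```python
import base64

def region_messages(regions: list[dict], question: str) -> list[list[dict]]:
    """Build one chat `messages` list per cropped region.

    Each region is a hypothetical dict {"label": str, "png_bytes": bytes},
    where png_bytes is the PNG-encoded crop produced by an external detector.
    """
    batches = []
    for region in regions:
        b64 = base64.b64encode(region["png_bytes"]).decode("ascii")
        batches.append([{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Detected object: {region['label']}. {question}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }])
    return batches

# Placeholder bytes stand in for a real PNG crop.
demo = region_messages(
    [{"label": "button", "png_bytes": b"fake-png"}],
    "Describe its visual state.",
)
print(demo[0][0]["content"][0]["text"])
```

Each inner list can be passed as messages to client.chat.completions.create with model="grok-2-vision-latest", one call per region.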

Q: Can I fine-tune on images?
A: Not yet. As of May 2026, xAI has not opened image fine-tuning. Workarounds: put few-shot examples in the system prompt, written as text-only descriptions of the image type, or run a downstream classifier on the model's text output.
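
The system-prompt workaround might look like the sketch below, which builds the messages list without calling the API. The helper name few_shot_messages, the wording of the system prompt, and the example labels are all illustrative assumptions.

```python
def few_shot_messages(
    examples: list[tuple[str, str]], task: str, image_url: str
) -> list[dict]:
    """Put text-only (description, label) examples in the system prompt, then ask about one image."""
    lines = ["You label images. Follow the pattern in these examples:"]
    for i, (description, label) in enumerate(examples, start=1):
        lines.append(f"Example {i}: image shows {description} -> label: {label}")
    return [
        {"role": "system", "content": "\n".join(lines)},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]

msgs = few_shot_messages(
    [
        ("a login form with a red error banner", "error_state"),
        ("a login form with empty fields", "default_state"),
    ],
    "Label this screenshot with one of the labels above.",
    "https://example.com/screenshot.png",
)
print(msgs[0]["content"])
```

Because the examples are plain text, they count against the context window as ordinary tokens rather than as images.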


Quick Use

  1. Use model="grok-2-vision-latest" with the openai SDK, pointed at base_url="https://api.x.ai/v1"
  2. Pass images via image_url (public URL or base64 data URI)
  3. Combine with response_format={"type": "json_object"} for structured extraction

Source & Thanks

Built by xAI. Vision docs at docs.x.ai/docs/guides/image-understanding.

Public SDK: xai-org
