Quick Use
- Use `model="grok-2-vision-latest"` with the openai SDK
- Pass images via `image_url` (URL or data URI)
- Combine with `response_format={"type": "json_object"}` for structured extraction
Intro
Grok-2 Vision (model grok-2-vision-latest) accepts images in the standard OpenAI chat.completions message format — pass image_url with a public URL or base64 data URI. Output is text reasoning grounded in the image. Best for: UI screenshot critique, chart and dashboard reading, document OCR with semantic understanding, content moderation, accessibility alt-text generation. Works with: openai-python, openai-node, any OpenAI-compatible client. Setup time: 2 minutes.
Public URL
```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])
resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Critique this dashboard. Mention spacing, color contrast, info hierarchy."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```
Base64 (for local files / private URLs)
```python
import base64, mimetypes

def to_data_uri(path: str) -> str:
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items as JSON: description, qty, unit_price, total"},
            {"type": "image_url", "image_url": {"url": to_data_uri("invoice.png")}},
        ],
    }],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```
Multi-image comparison
```python
content = [{"type": "text", "text": "Spot 3 differences between these UI mockups."}]
for url in ["https://example.com/v1.png", "https://example.com/v2.png"]:
    content.append({"type": "image_url", "image_url": {"url": url}})

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{"role": "user", "content": content}],
)
```
Constraints
| Limit | Value |
|---|---|
| Max images per message | 10 |
| Max image size | 20 MB |
| Supported formats | PNG, JPEG, WebP, non-animated GIF |
| Context window | 32,768 tokens |
| Max long edge | 8,192 px (resized server-side) |
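Hitting these limits produces API errors at request time, so it can be cheaper to check locally first. A minimal pre-flight sketch, assuming the limits in the table above; the helper name `validate_images` and the constants are illustrative, not part of the xAI SDK:

```python
import os

# Limits copied from the constraints table above; adjust if xAI updates them.
MAX_IMAGE_BYTES = 20 * 1024 * 1024
MAX_IMAGES_PER_MESSAGE = 10
SUPPORTED_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

def validate_images(paths: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks sendable."""
    problems = []
    if len(paths) > MAX_IMAGES_PER_MESSAGE:
        problems.append(f"too many images: {len(paths)} > {MAX_IMAGES_PER_MESSAGE}")
    for p in paths:
        ext = os.path.splitext(p)[1].lower()
        if ext not in SUPPORTED_EXTS:
            problems.append(f"{p}: unsupported format {ext or '(none)'}")
        elif os.path.exists(p) and os.path.getsize(p) > MAX_IMAGE_BYTES:
            problems.append(f"{p}: exceeds 20 MB")
    return problems
```

Run it over your file list before base64-encoding; encoding adds ~33% overhead, so a file near 20 MB on disk is worth flagging early.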
FAQ
Q: Grok-2 Vision vs GPT-4o vision vs Claude 3.5 Sonnet vision?
A: Comparable on most benchmarks. Grok-2 is fastest end-to-end and cheapest at ~$2/M input; GPT-4o is best at fine OCR; Claude is best at structured extraction with long instructions. Pick by primary task.
Q: Does it do bounding boxes?
A: No — Grok-2 Vision returns text descriptions, not coordinates. For object detection with bboxes use Gemini 2.5 (detect_objects tool) or YOLO. Combine: YOLO for boxes, Grok for semantic interpretation of cropped regions.
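The detector-plus-Grok combination above can be sketched as message assembly: a detector supplies the boxes, a crop step turns each box into a data URI, and Grok interprets the crops. Everything here is an assumption for illustration — `crop_to_data_uri` stands in for whatever cropping you use (e.g. Pillow crop plus the base64 helper shown earlier), and the boxes would come from YOLO or similar:

```python
from typing import Callable

def messages_for_regions(
    question: str,
    boxes: list[tuple[int, int, int, int]],
    crop_to_data_uri: Callable[[tuple[int, int, int, int]], str],
) -> list[dict]:
    """Build one user message: the question, then a label + cropped image per box."""
    content: list[dict] = [{"type": "text", "text": question}]
    for i, box in enumerate(boxes):
        content.append({"type": "text", "text": f"Region {i} at {box}:"})
        content.append({"type": "image_url", "image_url": {"url": crop_to_data_uri(box)}})
    return [{"role": "user", "content": content}]
```

Labeling each crop with its index and coordinates lets you map Grok's textual answers back to the detector's boxes.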
Q: Can I fine-tune on images?
A: Not yet — xAI has not opened image fine-tuning as of May 2026. Workaround: embed examples in the system prompt (few-shot with text-only descriptions of the image type), or use a downstream classifier on the textual output.
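The few-shot workaround can be sketched as plain prompt assembly: describe each labeled example in text and fold them into the system prompt. The example descriptions, labels, and helper name below are all illustrative:

```python
# Illustrative examples: (text description of the image, desired label).
FEW_SHOT_EXAMPLES = [
    ("Screenshot of a login form with a red error banner", "error_state"),
    ("Dashboard with three green KPI cards and a line chart", "healthy_state"),
]

def build_few_shot_system_prompt(task: str, examples: list[tuple[str, str]]) -> str:
    """Fold text-only descriptions of labeled images into a system prompt."""
    lines = [task, "", "Examples (image described in text -> expected label):"]
    for desc, label in examples:
        lines.append(f"- {desc} -> {label}")
    return "\n".join(lines)
```

Pass the result as the `system` message alongside the usual `image_url` user message; the model then has labeled precedents to anchor its answer without any fine-tuning.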
Source & Thanks
Built by xAI. Vision docs at docs.x.ai/docs/guides/image-understanding.
Public SDK: xai-org