Quick Use
- Use `model="grok-2-vision-latest"` with the openai SDK
- Pass images via `image_url` (URL or data URI)
- Combine with `response_format={"type": "json_object"}` for structured extraction
Intro
Grok-2 Vision (model grok-2-vision-latest) accepts images in the standard OpenAI chat.completions message format — pass image_url with a public URL or base64 data URI. Output is text reasoning grounded in the image. Best for: UI screenshot critique, chart and dashboard reading, document OCR with semantic understanding, content moderation, accessibility alt-text generation. Works with: openai-python, openai-node, any OpenAI-compatible client. Setup time: 2 minutes.
Public URL
```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])
resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Critique this dashboard. Mention spacing, color contrast, info hierarchy."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```
Base64 (for local files / private URLs)
```python
import base64, mimetypes

def to_data_uri(path: str) -> str:
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items as JSON: description, qty, unit_price, total"},
            {"type": "image_url", "image_url": {"url": to_data_uri("invoice.png")}},
        ],
    }],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```
Multi-image comparison
```python
content = [{"type": "text", "text": "Spot 3 differences between these UI mockups."}]
for url in ["https://example.com/v1.png", "https://example.com/v2.png"]:
    content.append({"type": "image_url", "image_url": {"url": url}})

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{"role": "user", "content": content}],
)
```
Constraints
| Limit | Value |
|---|---|
| Max images per message | 10 |
| Max image size | 20 MB |
| Supported formats | PNG, JPEG, WebP, non-animated GIF |
| Context window | 32,768 tokens |
| Max long edge | 8,192 px (resized server-side) |
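Hitting these limits produces API errors at request time, so it can be cheaper to check locally first. A minimal pre-flight sketch, assuming the limits in the table above; the helper name `validate_images` and the constants are illustrative, not part of the xAI SDK:

```python
import os

# Limits copied from the constraints table above; adjust if xAI updates them.
MAX_IMAGE_BYTES = 20 * 1024 * 1024
MAX_IMAGES_PER_MESSAGE = 10
SUPPORTED_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

def validate_images(paths: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks sendable."""
    problems = []
    if len(paths) > MAX_IMAGES_PER_MESSAGE:
        problems.append(f"too many images: {len(paths)} > {MAX_IMAGES_PER_MESSAGE}")
    for p in paths:
        ext = os.path.splitext(p)[1].lower()
        if ext not in SUPPORTED_EXTS:
            problems.append(f"{p}: unsupported format {ext or '(none)'}")
        elif os.path.exists(p) and os.path.getsize(p) > MAX_IMAGE_BYTES:
            problems.append(f"{p}: exceeds 20 MB")
    return problems
```

Run it over your file list before base64-encoding; encoding adds ~33% overhead, so a file near 20 MB on disk is worth flagging early.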
FAQ
Q: Grok-2 Vision vs GPT-4o vision vs Claude 3.5 Sonnet vision?
A: Comparable on most benchmarks. Grok-2 is fastest end-to-end and cheapest at ~$2/M input; GPT-4o is best at fine OCR; Claude is best at structured extraction with long instructions. Pick by primary task.
Q: Does it do bounding boxes?
A: No — Grok-2 Vision returns text descriptions, not coordinates. For object detection with bboxes use Gemini 2.5 (detect_objects tool) or YOLO. Combine: YOLO for boxes, Grok for semantic interpretation of cropped regions.
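The detector-plus-Grok combination above can be sketched as message assembly: a detector supplies the boxes, a crop step turns each box into a data URI, and Grok interprets the crops. Everything here is an assumption for illustration — `crop_to_data_uri` stands in for whatever cropping you use (e.g. Pillow crop plus the base64 helper shown earlier), and the boxes would come from YOLO or similar:

```python
from typing import Callable

def messages_for_regions(
    question: str,
    boxes: list[tuple[int, int, int, int]],
    crop_to_data_uri: Callable[[tuple[int, int, int, int]], str],
) -> list[dict]:
    """Build one user message: the question, then a label + cropped image per box."""
    content: list[dict] = [{"type": "text", "text": question}]
    for i, box in enumerate(boxes):
        content.append({"type": "text", "text": f"Region {i} at {box}:"})
        content.append({"type": "image_url", "image_url": {"url": crop_to_data_uri(box)}})
    return [{"role": "user", "content": content}]
```

Labeling each crop with its index and coordinates lets you map Grok's textual answers back to the detector's boxes.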
Q: Can I fine-tune on images?
A: Not yet — xAI has not opened image fine-tuning as of May 2026. Workaround: embed examples in the system prompt (few-shot with text-only descriptions of the image type), or use a downstream classifier on the textual output.
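The few-shot workaround can be sketched as plain prompt assembly: describe each labeled example in text and fold them into the system prompt. The example descriptions, labels, and helper name below are all illustrative:

```python
# Illustrative examples: (text description of the image, desired label).
FEW_SHOT_EXAMPLES = [
    ("Screenshot of a login form with a red error banner", "error_state"),
    ("Dashboard with three green KPI cards and a line chart", "healthy_state"),
]

def build_few_shot_system_prompt(task: str, examples: list[tuple[str, str]]) -> str:
    """Fold text-only descriptions of labeled images into a system prompt."""
    lines = [task, "", "Examples (image described in text -> expected label):"]
    for desc, label in examples:
        lines.append(f"- {desc} -> {label}")
    return "\n".join(lines)
```

Pass the result as the `system` message alongside the usual `image_url` user message; the model then has labeled precedents to anchor its answer without any fine-tuning.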
Source & Thanks
Built by xAI. Vision docs at docs.x.ai/docs/guides/image-understanding.
Public SDK: xai-org