# Grok 2 Vision — Image Understanding API for Apps

> Grok-2 Vision accepts images through the OpenAI-compatible `chat.completions` API. Pass a public URL or a base64 data URI. Typical uses: UI critique, screenshot QA, OCR, chart reading.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

1. Use `model="grok-2-vision-latest"` with the `openai` SDK, with `base_url` set to `https://api.x.ai/v1`
2. Pass images via `image_url` (public URL or base64 data URI)
3. Combine with `response_format={"type": "json_object"}` for structured extraction

---

## Intro

Grok-2 Vision (model `grok-2-vision-latest`) accepts images in the standard OpenAI `chat.completions` message format: pass an `image_url` content part containing either a public URL or a base64 data URI. The output is text reasoning grounded in the image. Best for: UI screenshot critique, chart and dashboard reading, document OCR with semantic understanding, content moderation, and accessibility alt-text generation. Works with: openai-python, openai-node, or any OpenAI-compatible client. Setup time: 2 minutes.

---

### Public URL

```python
import os

from openai import OpenAI

# Point the OpenAI client at xAI's OpenAI-compatible endpoint.
client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Critique this dashboard. Mention spacing, color contrast, info hierarchy."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```

### Base64 (for local files / private URLs)

```python
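import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    # Fall back to PNG when the file extension is not recognized.
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Reuses `client` from the Public URL example above.
resp = client.chat.completions.create(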
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items as JSON: description, qty, unit_price, total"},
            {"type": "image_url", "image_url": {"url": to_data_uri("invoice.png")}},
        ],
    }],
    response_format={"type": "json_object"},
)
```
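
The request above returns the JSON as a string in `resp.choices[0].message.content`, so it still has to be parsed and checked. The following is a minimal sketch of that step, not part of the xAI API: `parse_line_items` and `REQUIRED_KEYS` are hypothetical names, and the code assumes the model wraps the line items in an object under some list-valued key, since the prompt does not fix the key name.

```python
import json

# Keys requested in the extraction prompt above.
REQUIRED_KEYS = {"description", "qty", "unit_price", "total"}

def parse_line_items(raw: str) -> list:
    """Parse the model's JSON text and keep only well-formed line items."""
    data = json.loads(raw)
    if isinstance(data, dict):
        # Assumption: the items sit under the first list-valued key.
        items = next((v for v in data.values() if isinstance(v, list)), [])
    elif isinstance(data, list):
        items = data
    else:
        items = []
    # Drop entries that are not objects or that miss a requested key.
    return [
        item for item in items
        if isinstance(item, dict) and REQUIRED_KEYS <= item.keys()
    ]
```

Typical use would be `items = parse_line_items(resp.choices[0].message.content)`. Dropping malformed entries silently is one design choice; raising an error may be preferable when every line item must be accounted for.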

### Multi-image comparison

```python
content = [{"type": "text", "text": "Spot 3 differences between these UI mockups."}]
for url in ["https://example.com/v1.png", "https://example.com/v2.png"]:
    content.append({"type": "image_url", "image_url": {"url": url}})

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{"role": "user", "content": content}],
)
```

### Constraints

| Limit | Value |
|---|---|
| Max images per message | 10 |
| Max image size | 20 MB |
| Supported formats | PNG, JPEG, WebP, non-animated GIF |
| Context window | 32,768 tokens |
| Max long edge | 8,192 px (resized server-side) |
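
The limits in the table can be checked locally before any request is sent. The sketch below is illustrative only: `check_images` is a hypothetical helper, it assumes "20 MB" means 20 × 1024 × 1024 bytes, and it judges format by file extension, so it cannot tell an animated GIF from a static one.

```python
import os

MAX_IMAGES = 10                      # max images per message (from the table)
MAX_BYTES = 20 * 1024 * 1024         # assumption: "20 MB" read as 20 MiB
ALLOWED_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

def check_images(paths: list) -> list:
    """Return a list of problems; an empty list means the batch looks OK."""
    problems = []
    if len(paths) > MAX_IMAGES:
        problems.append(f"{len(paths)} images exceeds the limit of {MAX_IMAGES}")
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext not in ALLOWED_EXTS:
            problems.append(f"{path}: unsupported extension {ext!r}")
        elif not os.path.isfile(path):
            problems.append(f"{path}: file not found")
        elif os.path.getsize(path) > MAX_BYTES:
            problems.append(f"{path}: larger than {MAX_BYTES} bytes")
    return problems
```

Calling `check_images(["invoice.png"])` before building the message turns an avoidable API error into a local, readable one.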

---

### FAQ

**Q: Grok-2 Vision vs GPT-4o vision vs Claude 3.5 Sonnet vision?**
A: Comparable on most benchmarks. Grok-2 is the fastest end-to-end and the cheapest, at about $2 per million input tokens; GPT-4o is best at fine-grained OCR; Claude is best at structured extraction with long instructions. Pick by primary task.

**Q: Does it do bounding boxes?**
A: No. Grok-2 Vision returns text descriptions, not coordinates. For object detection with bounding boxes, use a detector such as YOLO, or a model that can return box coordinates when prompted, such as Gemini 2.5. A common combination: YOLO for boxes, Grok for semantic interpretation of the cropped regions.
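
The detector-plus-Grok combination can be sketched as follows. Everything here is hypothetical glue code: `Detection`, `pad_and_clamp`, and `region_message` are illustrative names, the detector output format is assumed to be pixel boxes `(x1, y1, x2, y2)`, and the actual cropping and encoding (for example with an imaging library) is left out.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical detector output: a label plus a pixel box (x1, y1, x2, y2)."""
    label: str
    box: tuple

def pad_and_clamp(box, width, height, pad=8):
    """Grow the box by `pad` pixels per side, clamped to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(width, x2 + pad), min(height, y2 + pad))

def region_message(det, crop_data_uri):
    """Build one user message asking for an interpretation of a cropped region."""
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"A detector labeled this crop '{det.label}'. Describe what it shows."},
            {"type": "image_url", "image_url": {"url": crop_data_uri}},
        ],
    }
```

Padding the box slightly gives the model some surrounding context; each resulting message can be sent as its own `chat.completions` request with `model="grok-2-vision-latest"`.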

**Q: Can I fine-tune on images?**
A: Not yet. xAI has not opened image fine-tuning as of May 2026. Workarounds: embed examples in the system prompt (few-shot, using text-only descriptions of the image type), or run a downstream classifier on the textual output.
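
A minimal sketch of the few-shot workaround, assuming a classification task: `build_few_shot_messages` is a hypothetical helper, and the example descriptions and labels are placeholders to be replaced with real ones.

```python
def build_few_shot_messages(examples, image_url, task):
    """Put text-only (description, label) examples in the system prompt,
    then attach the real image in the user turn."""
    lines = [task, "", "Examples:"]
    for description, label in examples:
        lines.append(f"- Image: {description} -> {label}")
    system = {"role": "system", "content": "\n".join(lines)}
    user = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image using the same labels."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
    return [system, user]
```

The returned list can be passed as `messages=` to `client.chat.completions.create(...)`. Keeping the label set small and spelled identically across examples makes the textual output easier to feed into a downstream classifier or a simple string match.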

---

## Source & Thanks

> Built by [xAI](https://x.ai). Vision docs at [docs.x.ai/docs/guides/image-understanding](https://docs.x.ai/docs/guides/image-understanding).
>
> Public SDK: [xai-org](https://github.com/xai-org)


---
Source: https://tokrepo.com/en/workflows/grok-2-vision-image-understanding-api-for-apps
Author: xAI