What is Grok 2 Vision — Image Understanding API for Apps?

Grok-2 Vision accepts images through the OpenAI-compatible chat.completions endpoint. Pass each image as a public URL or a base64 data URI. Typical uses include UI critique, screenshot QA, OCR, and chart reading.

Is Grok 2 Vision — Image Understanding API for Apps free to use?

Yes. Grok 2 Vision — Image Understanding API for Apps is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Grok 2 Vision — Image Understanding API for Apps?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Grok 2 Vision — Image Understanding API for Apps

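Introduction

Grok-2 Vision (model grok-2-vision-latest) accepts images in the standard OpenAI chat.completions message format: pass image_url as either a public URL or a base64 data URI. The output is text reasoning grounded in the image. It suits UI screenshot review, chart and dashboard reading, document OCR with semantic understanding, content moderation, and accessibility alt-text generation. It works with openai-python, openai-node, and any OpenAI-compatible client. Setup takes about 2 minutes.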

Public URL

import os

from openai import OpenAI

# xAI exposes an OpenAI-compatible endpoint; the API key is read from the environment.
client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Review this dashboard. Comment on spacing, contrast, and information hierarchy."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)

Base64 (local file / private URL)

import base64
import mimetypes

def to_data_uri(path: str) -> str:
    # Guess the MIME type from the file extension; fall back to PNG.
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

# Reuses the client created in the Public URL example.
resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items as JSON: description / qty / unit_price / total"},
            {"type": "image_url", "image_url": {"url": to_data_uri("invoice.png")}},
        ],
    }],
    response_format={"type": "json_object"},
)
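With response_format set to json_object, the reply arrives as a JSON string in resp.choices[0].message.content. The sketch below shows one way to parse and sanity-check it. The top-level key "items", the helper name check_line_items, and the sample payload are all assumptions for illustration; the actual shape depends on the prompt and on what the model returns.

```python
import json

def check_line_items(raw: str, tolerance: float = 0.01) -> list[dict]:
    """Parse a JSON reply and flag rows where qty * unit_price differs from total."""
    data = json.loads(raw)
    # Assumption: rows sit under an "items" key; a bare list is also accepted.
    items = data["items"] if isinstance(data, dict) else data
    checked = []
    for row in items:
        expected = float(row["qty"]) * float(row["unit_price"])
        consistent = abs(expected - float(row["total"])) <= tolerance
        checked.append({**row, "consistent": consistent})
    return checked

# Hypothetical reply text standing in for resp.choices[0].message.content.
sample = (
    '{"items": ['
    '{"description": "Widget", "qty": 2, "unit_price": 4.5, "total": 9.0}, '
    '{"description": "Cable", "qty": 1, "unit_price": 3.0, "total": 5.0}]}'
)

for row in check_line_items(sample):
    print(row["description"], row["consistent"])
```

Rows flagged as inconsistent are good candidates for a second pass or manual review, since OCR-style extraction can misread digits.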

Multi-image comparison

# One text part followed by one image_url part per image.
content = [{"type": "text", "text": "Find 3 differences between these two UI mockups."}]
for url in ["https://example.com/v1.png", "https://example.com/v2.png"]:
    content.append({"type": "image_url", "image_url": {"url": url}})

resp = client.chat.completions.create(
    model="grok-2-vision-latest",
    messages=[{"role": "user", "content": content}],
)
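When a comparison mixes hosted images with local files, the two input styles shown above can be combined in one content list. The following is a minimal sketch; image_part and build_content are hypothetical helper names, not part of any SDK, and the URL-versus-path test is a simple prefix check.

```python
import base64
import mimetypes
from pathlib import Path

def image_part(source: str) -> dict:
    """Return an image_url content part for a public URL or a local file path."""
    if source.startswith(("http://", "https://", "data:")):
        url = source  # already a URL or data URI, pass through unchanged
    else:
        # Treat anything else as a local path and inline it as a data URI.
        mime = mimetypes.guess_type(source)[0] or "image/png"
        b64 = base64.b64encode(Path(source).read_bytes()).decode()
        url = f"data:{mime};base64,{b64}"
    return {"type": "image_url", "image_url": {"url": url}}

def build_content(prompt: str, sources: list[str]) -> list[dict]:
    """One text part followed by one image part per source."""
    return [{"type": "text", "text": prompt}] + [image_part(s) for s in sources]
```

The returned list can be passed as the "content" value of a user message, for example build_content("Find 3 differences.", ["https://example.com/v1.png", "v2.png"]).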

Limits

Limit	Value
Max images per message	10
Max size per image	20 MB
Supported formats	PNG / JPEG / WebP / non-animated GIF
Context window	32,768 tokens
Max long edge	8,192 px (resized server-side)
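The limits above can be checked on the client before sending a request. This is a rough sketch only: preflight and fit_long_edge are hypothetical helpers, 20 MB is assumed to mean 20 * 1024 * 1024 bytes of raw file size, the format check looks at the extension alone (it cannot tell an animated GIF from a static one), and the resize estimate assumes the server preserves aspect ratio.

```python
import os

MAX_IMAGES = 10                   # images per message, from the limits table
MAX_BYTES = 20 * 1024 * 1024      # assumption: binary megabytes, raw file size
ALLOWED_EXT = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
MAX_LONG_EDGE = 8192              # pixels

def preflight(paths: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks fine."""
    problems = []
    if len(paths) > MAX_IMAGES:
        problems.append(f"{len(paths)} images exceeds the limit of {MAX_IMAGES}")
    for p in paths:
        ext = os.path.splitext(p)[1].lower()
        if ext not in ALLOWED_EXT:
            problems.append(f"{p}: unsupported extension {ext or '(none)'}")
        elif os.path.getsize(p) > MAX_BYTES:
            problems.append(f"{p}: larger than 20 MB")
    return problems

def fit_long_edge(width: int, height: int, limit: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Estimate dimensions after scaling so the long edge is at most `limit`."""
    long_edge = max(width, height)
    if long_edge <= limit:
        return width, height
    scale = limit / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, fit_long_edge(16384, 4096) gives (8192, 2048), which helps estimate how much detail survives the resize of a very wide screenshot.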

FAQ

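Q: Grok-2 Vision vs GPT-4o vision vs Claude 3.5 Sonnet vision? A: They are roughly on par on most benchmarks. Grok-2 is the fastest end to end and the cheapest, at about $2 per million input tokens; GPT-4o is strongest at fine-grained OCR; Claude is strongest at structured extraction under long instructions. Choose by your primary task.

Q: Can it return bounding boxes? A: No. Grok-2 Vision returns text descriptions, not coordinates. For object detection with bounding boxes, use Gemini 2.5 (detect_objects tool) or YOLO. A workable combination: YOLO produces the boxes, and Grok interprets the cropped regions semantically.

Q: Can I fine-tune on images? A: Not yet. As of May 2026, xAI does not offer image fine-tuning. Workarounds: embed examples in the system prompt (few-shot, using text descriptions of the image types), or run a downstream classifier on the text output.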
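The system-prompt workaround from the last answer can be sketched as follows. The helper few_shot_messages, the instruction wording, and the example labels are assumptions for illustration; only the message structure follows the chat.completions format used earlier on this page.

```python
def few_shot_messages(examples, question, image_url):
    """Build messages with text-only few-shot examples in the system prompt.

    examples: list of (image description, label) pairs written by hand.
    """
    lines = ["Classify the image. Examples, described in text:"]
    for description, label in examples:
        lines.append(f"- Image: {description} -> Label: {label}")
    return [
        {"role": "system", "content": "\n".join(lines)},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]

# Hypothetical examples and URL.
messages = few_shot_messages(
    [("a form with email and password fields", "login"),
     ("a grid of charts with a date filter", "dashboard")],
    "Which label fits this screenshot? Answer with the label only.",
    "https://example.com/screen.png",
)
```

The resulting list can be passed as the messages argument of client.chat.completions.create. Constraining the answer to a single label keeps the output easy to feed into a downstream classifier or a simple string match.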

Related assets

Grok Live Search Tool — Real-Time Web Grounding via API

grok-cli — Terminal Coding Agent for Grok API

xAI Grok API Quickstart — OpenAI-Compatible Frontier Model

Jackett — Unified Torrent Indexer API for Media Automation