Skills · May 13, 2026 · 1 min read

OmniParser — Screen Parsing Toolkit for AI Agents

OmniParser by Microsoft Research converts screenshots into structured data that AI agents can understand and act upon, enabling vision-based GUI automation across desktop and web applications.

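Agent Ready

This asset can be read and installed directly by an agent.

TokRepo also provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an agent can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allowed
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry point: OmniParser Overview
Universal CLI install command: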
npx tokrepo install edfd1172-4ea3-11f1-9bc6-00163e2b0d79

Introduction

OmniParser is a screen parsing library from Microsoft Research that extracts structured information from UI screenshots. It identifies interactive elements, labels, and layout structure so that vision-language models can plan and execute actions on any graphical interface.

What OmniParser Does

  • Detects clickable UI elements (buttons, links, inputs) in screenshots
  • Extracts text labels and associates them with detected regions
  • Outputs structured bounding-box data for downstream agent use
  • Works across web pages, desktop apps, and mobile UIs
  • Integrates with vision-language models for end-to-end automation
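To make the hand-off to an agent concrete, here is a minimal sketch of how structured bounding-box data might be consumed downstream. The `UIElement` record, its field names, and both helper functions are assumptions made for illustration; they are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """Hypothetical record for one parsed element (not OmniParser's real schema)."""
    label: str
    bbox: Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels
    interactable: bool


def click_point(el: UIElement) -> Tuple[int, int]:
    """Return the center of the element's box, a common choice of click target."""
    x1, y1, x2, y2 = el.bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def describe(elements: List[UIElement]) -> str:
    """Render elements as a numbered text list that a language model could read."""
    lines = []
    for i, el in enumerate(elements):
        kind = "interactable" if el.interactable else "text"
        lines.append(f"[{i}] {kind} '{el.label}' at {el.bbox}")
    return "\n".join(lines)


elements = [
    UIElement("Search", (10, 10, 110, 40), True),
    UIElement("Welcome back", (10, 60, 200, 80), False),
]
print(describe(elements))
print(click_point(elements[0]))  # (60, 25)
```

A model can then refer to an element by its index, and the agent resolves that index to a screen coordinate before clicking.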

Architecture Overview

OmniParser combines a fine-tuned object detection model for UI element localization with an OCR module for text extraction. The detection model identifies interactable regions while the OCR module reads labels and content. Results are merged into a unified structured output that maps each element to its screen coordinates and semantic label.
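The merge step can be sketched roughly as follows. This is only an illustration of the idea, assuming boxes are `(x1, y1, x2, y2)` pixel tuples and that an OCR span belongs to a detected region when most of its area lies inside that region. The function names, the `min_overlap` parameter, and the output dictionary keys are all hypothetical, and OmniParser's own merging logic may differ.

```python
def overlap_ratio(inner, outer):
    """Fraction of the `inner` box area that lies inside the `outer` box."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def merge(detections, ocr_spans, min_overlap=0.5):
    """Attach OCR text to detected regions; keep leftover text as plain elements."""
    merged = []
    used = set()
    for det in detections:
        texts = []
        for i, (text, box) in enumerate(ocr_spans):
            if overlap_ratio(box, det) >= min_overlap:
                texts.append(text)
                used.add(i)
        merged.append({"bbox": det, "label": " ".join(texts), "interactable": True})
    # OCR spans that matched no detected region become non-interactable text.
    for i, (text, box) in enumerate(ocr_spans):
        if i not in used:
            merged.append({"bbox": box, "label": text, "interactable": False})
    return merged


detections = [(0, 0, 100, 40)]  # one detected button region
ocr_spans = [("Submit", (10, 10, 90, 30)), ("Terms apply", (0, 100, 120, 120))]
result = merge(detections, ocr_spans)
for element in result:
    print(element)
```

In this example the "Submit" text falls entirely inside the detected region and becomes its label, while "Terms apply" overlaps nothing and is kept as a standalone text element.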

Self-Hosting & Configuration

  • Clone the repo and install Python dependencies
  • Download pre-trained model weights via the provided script
  • Run the demo server for interactive testing
  • Configure detection thresholds via command-line arguments
  • GPU recommended for real-time parsing; CPU mode available
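The exact command-line argument names are not listed here, so the sketch below only illustrates what a detection threshold does in general: detections scoring below the cutoff are dropped. The `score` field and the `filter_by_confidence` helper are assumptions for illustration, not part of OmniParser.

```python
def filter_by_confidence(detections, threshold):
    """Keep only detections whose confidence score meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Made-up detections with hypothetical confidence scores.
detections = [
    {"bbox": (0, 0, 50, 20), "score": 0.92},
    {"bbox": (60, 0, 90, 20), "score": 0.31},
    {"bbox": (0, 30, 50, 50), "score": 0.05},
]

strict = filter_by_confidence(detections, 0.5)   # fewer, more certain boxes
lenient = filter_by_confidence(detections, 0.1)  # more boxes, more false positives
print(len(strict), len(lenient))  # 1 2
```

A higher threshold tends to suit cluttered screens where spurious boxes confuse the agent, while a lower one can help when small icons are being missed.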

Key Features

  • Generalizes across different UI frameworks and platforms
  • Provides a bounding box in screen coordinates for each detected element
  • Combines detection and OCR in a single pipeline
  • Open weights released; license terms differ per model, so check each model card before use
  • Benchmarked on ScreenSpot and other UI grounding datasets

Comparison with Similar Tools

  • SeeClick — click prediction model; OmniParser provides full element parsing
  • CogAgent — end-to-end GUI agent; OmniParser is a modular parsing component
  • Ferret-UI — Apple multimodal model; OmniParser focuses on structured output
  • Set-of-Mark — visual prompting overlay; OmniParser directly detects elements

FAQ

Q: What image formats does OmniParser accept? A: Standard formats including PNG, JPEG, and BMP are supported.

Q: Does it require a GPU? A: GPU is recommended for speed but CPU inference is supported at lower throughput.

Q: Can OmniParser handle dynamic web pages? A: It parses one static screenshot at a time, so you capture the current state and parse it. For dynamic content, capture and parse a fresh screenshot after each state change.
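One way to handle changing content is a capture loop that re-parses only when the screen has changed. The sketch below assumes two caller-supplied functions, `capture` (returns screenshot bytes) and `parse` (returns parsed elements). Both are placeholders for illustration and not OmniParser functions.

```python
import hashlib
from typing import Callable, List


def parse_on_change(
    capture: Callable[[], bytes],
    parse: Callable[[bytes], list],
    max_steps: int,
) -> List[list]:
    """Capture up to `max_steps` screenshots and parse each one that differs from the last."""
    results = []
    last_digest = None
    for _ in range(max_steps):
        image = capture()
        digest = hashlib.sha256(image).hexdigest()
        if digest != last_digest:  # skip frames identical to the previous one
            results.append(parse(image))
            last_digest = digest
    return results


# Stand-ins for real screenshot capture and parsing.
frames = iter([b"page-loading", b"page-loading", b"page-ready"])
parsed = parse_on_change(lambda: next(frames), lambda img: [img.decode()], max_steps=3)
print(parsed)  # [['page-loading'], ['page-ready']]
```

Exact byte hashing is a strict test: a blinking cursor or an animation will register as a change, so a real setup might compare downscaled images or add a short delay between captures.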

Q: How accurate is element detection? A: It reports strong results on UI grounding benchmarks such as ScreenSpot, but accuracy varies with UI complexity.
