Scripts · May 21, 2026 · 1 min read

UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance

Open-source multimodal AI agent stack connecting vision models with desktop automation, browser control, and tool use through a single desktop application.

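Agent Ready

This asset can be read and installed directly by an agent.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an agent can assess fit, risk, and next steps.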

Needs Confirmation · 66/100 · Policy: needs confirmation
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: UI-TARS Desktop
Universal CLI install command:
npx tokrepo install 0f676ee5-5530-11f1-9bc6-00163e2b0d79

Introduction

UI-TARS Desktop is an open-source desktop application from ByteDance that brings multimodal AI agents to your screen. It connects vision-language models (VLMs) to real GUI interactions so that agents can observe the screen, reason about what they see, and operate desktop applications and browsers autonomously.

What UI-TARS Desktop Does

  • Provides a desktop shell that lets vision-language models observe and interact with any GUI application
  • Supports browser automation through built-in Chromium integration and Model Context Protocol (MCP) server connectivity
  • Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
  • Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
  • Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models
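
The screenshot-to-action step above implies a coordinate mapping whenever the screenshot sent to the model is smaller than the real display. The following is a minimal sketch of that mapping, assuming the model reports pixel coordinates in screenshot space. The names `PixelSize`, `PixelPoint`, and `toScreenPoint` are hypothetical and are not taken from UI-TARS Desktop.

```typescript
// Hypothetical types for illustration; not part of UI-TARS Desktop.
interface PixelSize { width: number; height: number; }
interface PixelPoint { x: number; y: number; }

// Map a point the model reported in screenshot pixels to display pixels.
// Assumes the screenshot is a scaled copy of the full display.
function toScreenPoint(p: PixelPoint, screenshot: PixelSize, display: PixelSize): PixelPoint {
  if (screenshot.width <= 0 || screenshot.height <= 0) {
    throw new RangeError("screenshot size must be positive");
  }
  const x = Math.round((p.x * display.width) / screenshot.width);
  const y = Math.round((p.y * display.height) / screenshot.height);
  // Clamp so a slightly out-of-range prediction never leaves the display.
  return {
    x: Math.min(Math.max(x, 0), display.width - 1),
    y: Math.min(Math.max(y, 0), display.height - 1),
  };
}

const target = toScreenPoint(
  { x: 640, y: 360 },
  { width: 1280, height: 720 },
  { width: 2560, height: 1440 },
);
console.log(target); // { x: 1280, y: 720 }
```

Clamping is a design choice in this sketch: an out-of-range prediction is pulled to the nearest edge rather than rejected, which may or may not match how a real executor should behave.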

Architecture Overview

UI-TARS Desktop is built on Electron with a modular agent core. The perception layer captures screenshots and feeds them to a configurable VLM backend. The planning module breaks user goals into sub-tasks, and the action executor translates model outputs into native OS events (clicks, keystrokes, scrolls). An MCP server bridge allows external tools to plug in, while a replay and debugging panel records every agent step for inspection and improvement.
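
The flow described above (capture, decide, execute, record) can be sketched as a small loop. This is an illustration under stated assumptions, not UI-TARS source code: every name (`AgentAction`, `ReplayStep`, `AgentDeps`, `runAgent`) is hypothetical, and the model call is synchronous here only to keep the example self-contained; a real VLM call would be asynchronous.

```typescript
// All names are hypothetical; this mirrors the described flow, not the real implementation.
type AgentAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "scroll"; dy: number }
  | { kind: "done" };

interface ReplayStep { index: number; screenshotId: string; action: AgentAction; }

interface AgentDeps {
  capture: () => string; // perception: returns an id for the captured screenshot
  decide: (goal: string, screenshotId: string, history: ReplayStep[]) => AgentAction; // VLM backend
  execute: (action: AgentAction) => void; // action executor: native OS events
}

// Run observe -> decide -> act until the model reports "done" or the step budget runs out.
function runAgent(goal: string, deps: AgentDeps, maxSteps = 20): ReplayStep[] {
  const replay: ReplayStep[] = []; // every step is recorded for later inspection
  for (let index = 0; index < maxSteps; index++) {
    const screenshotId = deps.capture();
    const action = deps.decide(goal, screenshotId, replay);
    replay.push({ index, screenshotId, action });
    if (action.kind === "done") break;
    deps.execute(action);
  }
  return replay;
}

// Usage with stubbed dependencies: click once, then stop.
const executed: AgentAction[] = [];
const replayLog = runAgent("open settings", {
  capture: () => "shot-" + executed.length,
  decide: (_goal, _shot, history) =>
    history.length === 0 ? { kind: "click", x: 10, y: 20 } : { kind: "done" },
  execute: (a) => { executed.push(a); },
});
console.log(replayLog.length, executed.length); // 2 1
```

The `maxSteps` budget is an assumption added for safety in the sketch, so that a model that never reports completion cannot loop forever.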

Self-Hosting & Configuration

  • Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
  • Configure your VLM backend by setting API keys in the settings panel (supports OpenAI, Anthropic, and local endpoints)
  • Adjust screenshot resolution, action delay, and safety guards in the config file
  • MCP servers can be registered in the settings to extend tool capabilities
  • GPU is not required on the desktop side; inference runs against remote or local API endpoints
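
To make the backend configuration concrete, here is a hedged sketch of the request a client might send to an OpenAI-compatible Chat Completions endpoint with a screenshot attached as a data URL. The `VlmSettings` shape, the `buildChatRequest` helper, the model name, and the local URL are assumptions for illustration; they are not UI-TARS Desktop's actual settings keys or code.

```typescript
// Hypothetical settings shape; field names are illustrative only.
interface VlmSettings {
  baseUrl: string; // e.g. a local OpenAI-compatible server, usually ending in /v1
  apiKey: string;
  model: string;
}

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: { model: string; messages: Array<{ role: "user"; content: ContentPart[] }> };
}

// Build a Chat Completions request carrying the instruction plus a PNG screenshot.
function buildChatRequest(settings: VlmSettings, instruction: string, pngBase64: string): ChatRequest {
  return {
    // Strip trailing slashes so "…/v1" and "…/v1/" produce the same URL.
    url: settings.baseUrl.replace(/\/+$/, "") + "/chat/completions",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${settings.apiKey}`,
    },
    body: {
      model: settings.model,
      messages: [{
        role: "user",
        content: [
          { type: "text", text: instruction },
          { type: "image_url", image_url: { url: `data:image/png;base64,${pngBase64}` } },
        ],
      }],
    },
  };
}

const req = buildChatRequest(
  { baseUrl: "http://localhost:8000/v1/", apiKey: "sk-local", model: "my-vlm" },
  "Click the Save button.",
  "iVBORw0KGgo=", // placeholder bytes, not a real screenshot
);
console.log(req.url); // http://localhost:8000/v1/chat/completions
```

The result could then be sent with `fetch(req.url, { method: "POST", headers: req.headers, body: JSON.stringify(req.body) })`. Keeping the builder free of network calls makes it easy to inspect what would leave the machine before any screenshot is transmitted.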

Key Features

  • Multimodal perception: the agent works from screenshots of the screen and reasons about the UI elements it sees
  • Cross-platform desktop automation without brittle selectors or accessibility APIs
  • Built-in browser agent mode for web tasks alongside native app control
  • Step-by-step replay and debugging to audit every decision the agent made
  • Extensible via MCP for adding custom tools and data sources
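
As an illustration of what MCP extensibility involves on the wire, MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below builds such a message. The helper `toolCall` and the tool name `read_file` with its arguments are made up for the example; how UI-TARS Desktop itself registers and calls MCP servers is not shown here.

```typescript
// JSON-RPC 2.0 envelope as used by MCP; the tool name and arguments below are invented.
interface JsonRpcRequest<P> {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: P;
}

interface ToolCallParams {
  name: string;                       // tool name as advertised by the server
  arguments: Record<string, unknown>; // tool-specific input
}

let nextId = 1; // request ids let responses be matched to requests

function toolCall(name: string, args: Record<string, unknown>): JsonRpcRequest<ToolCallParams> {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const msg = toolCall("read_file", { path: "notes.txt" });
console.log(JSON.stringify(msg));
```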

Comparison with Similar Tools

  • Browser Use — browser-only automation; UI-TARS handles native desktop apps too
  • OpenHands — focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
  • Anthropic Computer Use — similar vision-action loop but closed-source; UI-TARS is fully open
  • LaVague — web-focused large action model; UI-TARS combines web and desktop in one stack
  • Stagehand — browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

FAQ

Q: Do I need a local GPU to run UI-TARS Desktop? A: No. The desktop app sends screenshots to a VLM API endpoint instead of running the model itself. You need a GPU only if you self-host the vision-language model locally.

Q: Which operating systems are supported? A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

Q: Can I use open-weight models instead of commercial APIs? A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.

Q: Is it safe to let an AI agent control my desktop? A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.
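
To show the kind of check a safety guard could perform, here is a purely hypothetical sketch that classifies a proposed action as allowed, needing confirmation, or blocked. None of the names (`ProposedAction`, `GuardPolicy`, `reviewAction`) or rules come from UI-TARS Desktop; its real security settings may work quite differently.

```typescript
// Hypothetical guard; not UI-TARS Desktop's actual safety mechanism.
type Decision = "allow" | "confirm" | "block";

interface ProposedAction {
  kind: "click" | "type" | "hotkey";
  text?: string;   // used by "type"
  keys?: string[]; // used by "hotkey", e.g. ["Alt", "F4"]
}

interface GuardPolicy {
  blockedHotkeys: string[][]; // key combinations that are never executed
  confirmPatterns: RegExp[];  // typed text matching any of these needs user approval
}

// Normalize a key combination so ["Alt", "F4"] and ["f4", "alt"] compare equal.
const comboId = (keys: string[]): string =>
  keys.map((k) => k.toLowerCase()).sort().join("+");

function reviewAction(action: ProposedAction, policy: GuardPolicy): Decision {
  if (action.kind === "hotkey") {
    const id = comboId(action.keys ?? []);
    if (policy.blockedHotkeys.some((keys) => comboId(keys) === id)) return "block";
  }
  if (action.kind === "type") {
    const text = action.text ?? "";
    if (policy.confirmPatterns.some((re) => re.test(text))) return "confirm";
  }
  return "allow";
}

const policy: GuardPolicy = {
  blockedHotkeys: [["Alt", "F4"]],
  confirmPatterns: [/rm\s+-rf/i, /password/i],
};
console.log(reviewAction({ kind: "hotkey", keys: ["f4", "alt"] }, policy)); // block
console.log(reviewAction({ kind: "type", text: "rm -rf build" }, policy));  // confirm
console.log(reviewAction({ kind: "click" }, policy));                       // allow
```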
