Cette page est affichée en anglais. Une traduction française est en cours.
ScriptsMay 21, 2026·3 min de lecture

UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance

Open-source multimodal AI agent stack connecting vision models with desktop automation, browser control, and tool use through a single desktop application.

Prêt pour agents

Cet actif peut être lu et installé directement par les agents

TokRepo expose une commande CLI universelle, un contrat d'installation, le metadata JSON, un plan selon l'adaptateur et le contenu raw pour aider les agents à juger l'adaptation, le risque et les prochaines actions.

Needs Confirmation · 66/100Policy : confirmer
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Established
Point d'entrée
UI-TARS Desktop
Commande CLI universelle
npx tokrepo install 0f676ee5-5530-11f1-9bc6-00163e2b0d79

Introduction

UI-TARS Desktop is an open-source desktop application from ByteDance that brings multimodal AI agents to your screen. It connects cutting-edge vision-language models with real GUI interactions so agents can see, reason about, and operate desktop applications and browsers autonomously.

What UI-TARS Desktop Does

  • Provides a desktop shell that lets vision-language models observe and interact with any GUI application
  • Supports browser automation through built-in Chromium integration and MCP server connectivity
  • Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
  • Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
  • Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models

Architecture Overview

UI-TARS Desktop is built on Electron with a modular agent core. The perception layer captures screenshots and feeds them to a configurable VLM backend. The planning module breaks user goals into sub-tasks, and the action executor translates model outputs into native OS events (clicks, keystrokes, scrolls). An MCP server bridge allows external tools to plug in, while a replay and debugging panel records every agent step for inspection and improvement.

Self-Hosting & Configuration

  • Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
  • Configure your VLM backend by setting API keys in the settings panel (supports OpenAI, Anthropic, local endpoints)
  • Adjust screenshot resolution, action delay, and safety guards in the config file
  • MCP servers can be registered in the settings to extend tool capabilities
  • GPU is not required on the desktop side; inference runs against remote or local API endpoints

Key Features

  • Multimodal perception: the agent literally sees the screen and reasons about UI elements
  • Cross-platform desktop automation without brittle selectors or accessibility APIs
  • Built-in browser agent mode for web tasks alongside native app control
  • Step-by-step replay and debugging to audit every decision the agent made
  • Extensible via MCP protocol for adding custom tools and data sources

Comparison with Similar Tools

  • Browser Use — browser-only automation; UI-TARS handles native desktop apps too
  • OpenHands — focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
  • Anthropic Computer Use — similar vision-action loop but closed-source; UI-TARS is fully open
  • LaVague — web-focused large action model; UI-TARS combines web and desktop in one stack
  • Stagehand — browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

FAQ

Q: Do I need a local GPU to run UI-TARS Desktop? A: No. The desktop app sends screenshots to a remote VLM API. You only need a GPU if you self-host the vision-language model locally.

Q: Which operating systems are supported? A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

Q: Can I use open-weight models instead of commercial APIs? A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.

Q: Is it safe to let an AI agent control my desktop? A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.

Sources

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires