ScriptsMay 21, 2026·3 min read

UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance

Open-source multimodal AI agent stack connecting vision models with desktop automation, browser control, and tool use through a single desktop application.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Needs Confirmation · 66/100Policy: confirm
Agent surface
Any MCP/CLI agent
Kind
Skill
Install
Single
Trust
Trust: Established
Entrypoint
UI-TARS Desktop
Universal CLI install command
npx tokrepo install 0f676ee5-5530-11f1-9bc6-00163e2b0d79

Introduction

UI-TARS Desktop is an open-source desktop application from ByteDance that brings multimodal AI agents to your screen. It connects cutting-edge vision-language models with real GUI interactions so agents can see, reason about, and operate desktop applications and browsers autonomously.

What UI-TARS Desktop Does

  • Provides a desktop shell that lets vision-language models observe and interact with any GUI application
  • Supports browser automation through built-in Chromium integration and MCP server connectivity
  • Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
  • Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
  • Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models

Architecture Overview

UI-TARS Desktop is built on Electron with a modular agent core. The perception layer captures screenshots and feeds them to a configurable VLM backend. The planning module breaks user goals into sub-tasks, and the action executor translates model outputs into native OS events (clicks, keystrokes, scrolls). An MCP server bridge allows external tools to plug in, while a replay and debugging panel records every agent step for inspection and improvement.

Self-Hosting & Configuration

  • Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
  • Configure your VLM backend by setting API keys in the settings panel (supports OpenAI, Anthropic, local endpoints)
  • Adjust screenshot resolution, action delay, and safety guards in the config file
  • MCP servers can be registered in the settings to extend tool capabilities
  • GPU is not required on the desktop side; inference runs against remote or local API endpoints

Key Features

  • Multimodal perception: the agent literally sees the screen and reasons about UI elements
  • Cross-platform desktop automation without brittle selectors or accessibility APIs
  • Built-in browser agent mode for web tasks alongside native app control
  • Step-by-step replay and debugging to audit every decision the agent made
  • Extensible via MCP protocol for adding custom tools and data sources

Comparison with Similar Tools

  • Browser Use — browser-only automation; UI-TARS handles native desktop apps too
  • OpenHands — focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
  • Anthropic Computer Use — similar vision-action loop but closed-source; UI-TARS is fully open
  • LaVague — web-focused large action model; UI-TARS combines web and desktop in one stack
  • Stagehand — browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

FAQ

Q: Do I need a local GPU to run UI-TARS Desktop? A: No. The desktop app sends screenshots to a remote VLM API. You only need a GPU if you self-host the vision-language model locally.

Q: Which operating systems are supported? A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

Q: Can I use open-weight models instead of commercial APIs? A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.

Q: Is it safe to let an AI agent control my desktop? A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.

Sources

Discussion

Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.