How do I install UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance

Introduction

UI-TARS Desktop is an open-source desktop application from ByteDance that brings multimodal AI agents to your screen. It connects cutting-edge vision-language models with real GUI interactions so agents can see, reason about, and operate desktop applications and browsers autonomously.

What UI-TARS Desktop Does

Provides a desktop shell that lets vision-language models observe and interact with any GUI application
Supports browser automation through built-in Chromium integration and MCP server connectivity
Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models

Architecture Overview

UI-TARS Desktop is built on Electron with a modular agent core. The perception layer captures screenshots and feeds them to a configurable VLM backend. The planning module breaks user goals into sub-tasks, and the action executor translates model outputs into native OS events (clicks, keystrokes, scrolls). An MCP server bridge allows external tools to plug in, while a replay and debugging panel records every agent step for inspection and improvement.

Self-Hosting & Configuration

Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
Configure your VLM backend by setting API keys in the settings panel (supports OpenAI, Anthropic, local endpoints)
Adjust screenshot resolution, action delay, and safety guards in the config file
MCP servers can be registered in the settings to extend tool capabilities
GPU is not required on the desktop side; inference runs against remote or local API endpoints

Key Features

Multimodal perception: the agent literally sees the screen and reasons about UI elements
Cross-platform desktop automation without brittle selectors or accessibility APIs
Built-in browser agent mode for web tasks alongside native app control
Step-by-step replay and debugging to audit every decision the agent made
Extensible via MCP protocol for adding custom tools and data sources

Comparison with Similar Tools

Browser Use — browser-only automation; UI-TARS handles native desktop apps too
OpenHands — focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
Anthropic Computer Use — similar vision-action loop but closed-source; UI-TARS is fully open
LaVague — web-focused large action model; UI-TARS combines web and desktop in one stack
Stagehand — browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

FAQ

Q: Do I need a local GPU to run UI-TARS Desktop? A: No. The desktop app sends screenshots to a remote VLM API. You only need a GPU if you self-host the vision-language model locally.

Q: Which operating systems are supported? A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

Q: Can I use open-weight models instead of commercial APIs? A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.

Q: Is it safe to let an AI agent control my desktop? A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.