# UI-TARS Desktop — Multimodal AI Agent Stack by ByteDance

> Open-source multimodal AI agent stack connecting vision models with desktop automation, browser control, and tool use through a single desktop application.

## Quick Use
```bash
# Clone and install
git clone https://github.com/bytedance/UI-TARS-desktop.git
cd UI-TARS-desktop
npm install
npm run build
npm run start
```

## Introduction
UI-TARS Desktop is an open-source desktop application from ByteDance that runs multimodal AI agents against your own screen. It connects vision-language models (VLMs) to real GUI interactions so that agents can see, reason about, and operate desktop applications and browsers autonomously.

## What UI-TARS Desktop Does
- Provides a desktop shell that lets vision-language models observe and interact with any GUI application
- Supports browser automation through built-in Chromium integration and MCP server connectivity
- Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
- Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
- Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models

## Architecture Overview
UI-TARS Desktop is built on Electron with a modular agent core. The perception layer captures screenshots and feeds them to a configurable VLM backend. The planning module breaks user goals into sub-tasks, and the action executor translates model outputs into native OS events (clicks, keystrokes, scrolls). An MCP server bridge allows external tools to plug in, while a replay and debugging panel records every agent step for inspection and improvement.
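The perceive, decide, act cycle described above can be sketched as a small loop. This is a minimal illustration and not the project's actual code: the `Action`, `Step`, `AgentDeps`, and `runAgent` names are hypothetical, and the three dependencies stand in for the perception layer, the VLM backend, and the action executor.

```typescript
// Hypothetical types for illustration; none of these names come from the
// UI-TARS Desktop codebase.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "scroll"; dy: number }
  | { kind: "done" };

// One recorded agent step, kept so a run can be replayed and inspected later.
interface Step {
  index: number;
  screenshot: string; // e.g. a base64-encoded PNG
  action: Action;
}

interface AgentDeps {
  // Perception layer: capture the current screen.
  capture: () => Promise<string>;
  // VLM backend: choose the next action from the goal, the screen, and history.
  decide: (goal: string, screenshot: string, history: Step[]) => Promise<Action>;
  // Action executor: turn an action into a native OS event.
  execute: (action: Action) => Promise<void>;
}

async function runAgent(
  goal: string,
  deps: AgentDeps,
  maxSteps = 20
): Promise<Step[]> {
  const history: Step[] = [];
  for (let index = 0; index < maxSteps; index++) {
    const screenshot = await deps.capture();
    const action = await deps.decide(goal, screenshot, history);
    history.push({ index, screenshot, action }); // record before executing
    if (action.kind === "done") break; // the model signals completion
    await deps.execute(action);
  }
  return history;
}
```

Passing the three layers in as functions keeps the loop testable with scripted fakes, and the returned `history` is the kind of per-step record a replay panel could display. The `maxSteps` cap is an assumed safeguard against a model that never reports completion.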

## Self-Hosting & Configuration
- Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
- Configure your VLM backend by setting API keys in the settings panel (OpenAI, Anthropic, and local endpoints are supported)
- Adjust screenshot resolution, action delay, and safety guards in the config file
- MCP servers can be registered in the settings to extend tool capabilities
- GPU is not required on the desktop side; inference runs against remote or local API endpoints
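To illustrate how settings such as screenshot resolution, action delay, and safety guards might be merged and validated, here is a sketch under stated assumptions. The `AgentConfig` shape, its field names, and the default values are all hypothetical; consult the repository for the application's real configuration schema.

```typescript
// Hypothetical config shape; the field names are illustrative and are not
// the application's real schema.
interface AgentConfig {
  vlmBaseUrl: string;          // remote or local API endpoint
  screenshotMaxWidth: number;  // pixels; wider captures are downscaled
  actionDelayMs: number;       // pause between consecutive actions
  confirmBeforeActions: boolean; // safety guard: ask before acting
}

// Assumed defaults, chosen only for this example.
const defaults: AgentConfig = {
  vlmBaseUrl: "http://localhost:8000/v1",
  screenshotMaxWidth: 1280,
  actionDelayMs: 500,
  confirmBeforeActions: true,
};

function resolveConfig(overrides: Partial<AgentConfig>): AgentConfig {
  const config: AgentConfig = { ...defaults, ...overrides };
  const errors: string[] = [];

  if (!/^https?:\/\/\S+$/.test(config.vlmBaseUrl)) {
    errors.push("vlmBaseUrl must be an http(s) URL");
  }
  if (!Number.isInteger(config.screenshotMaxWidth) || config.screenshotMaxWidth <= 0) {
    errors.push("screenshotMaxWidth must be a positive integer");
  }
  if (config.actionDelayMs < 0) {
    errors.push("actionDelayMs must not be negative");
  }

  // Report every problem at once rather than failing on the first one.
  if (errors.length > 0) {
    throw new Error("Invalid config: " + errors.join("; "));
  }
  return config;
}
```

Validating at load time means a mistyped value fails immediately instead of surfacing as odd agent behavior partway through a task.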

## Key Features
- Multimodal perception: the agent works from screenshots and reasons about the UI elements it sees
- Cross-platform desktop automation without brittle selectors or accessibility APIs
- Built-in browser agent mode for web tasks alongside native app control
- Step-by-step replay and debugging to audit every decision the agent made
- Extensible via the Model Context Protocol (MCP) for adding custom tools and data sources

## Comparison with Similar Tools
- **Browser Use** — browser-only automation; UI-TARS handles native desktop apps too
- **OpenHands** — focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
- **Anthropic Computer Use** — similar vision-action loop but closed-source; UI-TARS is fully open
- **LaVague** — web-focused large action model; UI-TARS combines web and desktop in one stack
- **Stagehand** — browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

## FAQ
**Q: Do I need a local GPU to run UI-TARS Desktop?**
A: No. The desktop app sends screenshots to a VLM API endpoint, which can be remote. You only need a GPU if you host the vision-language model yourself.

**Q: Which operating systems are supported?**
A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

**Q: Can I use open-weight models instead of commercial APIs?**
A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.
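As a rough sketch of what a request to an OpenAI-compatible endpoint looks like, the helper below builds a chat completion request carrying an instruction and a screenshot. The `buildVisionRequest` function and `ChatRequest` type are hypothetical names for this example. The message layout follows the OpenAI chat completions format, and whether a given local model accepts image input depends on the model you serve.

```typescript
// Hypothetical helper; it only builds the request and does not send it.
interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildVisionRequest(
  baseUrl: string,          // e.g. "http://localhost:8000/v1" for a local server
  model: string,            // whatever model name your server exposes
  apiKey: string,           // local servers often accept any placeholder value
  instruction: string,
  screenshotBase64: string  // PNG bytes, base64-encoded
): ChatRequest {
  // Strip trailing slashes so the path is not doubled.
  const url = baseUrl.replace(/\/+$/, "") + "/chat/completions";
  const payload = {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: instruction },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
  };
  return {
    url,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be passed to `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })`. By default, vLLM's OpenAI-compatible server listens at `http://localhost:8000/v1` and Ollama exposes its compatible API at `http://localhost:11434/v1`.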

**Q: Is it safe to let an AI agent control my desktop?**
A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.
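One way to picture a safety guard with confirmation prompts is a policy check that runs before every action. The sketch below is an assumption about how such a guard could work, not a description of UI-TARS internals: `GuiAction`, `GuardPolicy`, `Verdict`, and `evaluateAction` are hypothetical names.

```typescript
// Hypothetical types; not taken from the UI-TARS Desktop codebase.
type GuiAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "hotkey"; keys: string[] };

type Verdict = "allow" | "confirm" | "block";

interface GuardPolicy {
  blockedHotkeys: string[];      // e.g. "Alt+F4"
  // Typed text matching any pattern needs user confirmation.
  // Avoid the "g" flag here: it makes RegExp.test stateful.
  confirmTextPatterns: RegExp[];
}

// Normalize a key combination so "Alt+F4" and ["F4", "alt"] compare equal.
function normalizeCombo(keys: string[]): string {
  return keys.map((k) => k.trim().toLowerCase()).sort().join("+");
}

function evaluateAction(action: GuiAction, policy: GuardPolicy): Verdict {
  if (action.kind === "hotkey") {
    const combo = normalizeCombo(action.keys);
    const blocked = policy.blockedHotkeys.map((h) => normalizeCombo(h.split("+")));
    return blocked.indexOf(combo) !== -1 ? "block" : "allow";
  }
  if (action.kind === "type") {
    const risky = policy.confirmTextPatterns.some((p) => p.test(action.text));
    return risky ? "confirm" : "allow";
  }
  // Plain clicks pass through in this sketch; a stricter policy could
  // require confirmation for clicks inside sensitive screen regions.
  return "allow";
}
```

A caller would execute the action on `"allow"`, show a prompt on `"confirm"`, and skip the action on `"block"`.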

## Sources
- https://github.com/bytedance/UI-TARS-desktop
- https://github.com/bytedance/UI-TARS-desktop/blob/main/README.md

---
Source: https://tokrepo.com/en/workflows/asset-0f676ee5
Author: Script Depot