# UI-TARS Desktop: Multimodal AI Agent Stack by ByteDance

> Open-source multimodal AI agent stack connecting vision models with desktop automation, browser control, and tool use through a single desktop application.

## Install

Save the commands under Quick Use as a script file and run it.

## Quick Use

```bash
# Clone and install
git clone https://github.com/bytedance/UI-TARS-desktop.git
cd UI-TARS-desktop
npm install
npm run build
npm run start
```

## Introduction

UI-TARS Desktop is an open-source desktop application from ByteDance that brings multimodal AI agents to your screen. It connects vision-language models (VLMs) with real GUI interactions so agents can see, reason about, and operate desktop applications and browsers autonomously.

## What UI-TARS Desktop Does

- Provides a desktop shell that lets vision-language models observe and interact with any GUI application
- Supports browser automation through built-in Chromium integration and MCP server connectivity
- Offers a screenshot-to-action pipeline where the model sees the screen and generates mouse/keyboard actions
- Includes an agent orchestration layer with planning, reflection, and tool-use capabilities
- Ships pre-built connectors for multiple VLM backends including GPT-4o, Claude, and open-weight models

## Architecture Overview

UI-TARS Desktop is built on Electron with a modular agent core:

- The perception layer captures screenshots and feeds them to a configurable VLM backend.
- The planning module breaks user goals into sub-tasks.
- The action executor translates model outputs into native OS events (clicks, keystrokes, scrolls).
- An MCP server bridge allows external tools to plug in.
- A replay and debugging panel records every agent step for inspection and improvement.
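
The perceive, plan, act cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the project's actual code: every name here (`Action`, `Step`, `AgentDeps`, `runAgent`) is hypothetical, and the model call is kept synchronous for brevity even though a real VLM request would be asynchronous.

```typescript
// Hypothetical types for illustration; none of these names come from the
// UI-TARS Desktop codebase.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "scroll"; dy: number }
  | { kind: "done" };

// One recorded step: what the agent saw and what it decided to do.
interface Step {
  screenshot: string; // e.g. a base64-encoded PNG
  action: Action;
}

interface AgentDeps {
  captureScreen: () => string; // perception layer
  decide: (goal: string, screenshot: string, history: Step[]) => Action; // VLM + planner
  execute: (action: Action) => void; // action executor (native OS events)
}

// Runs the loop until the model reports "done" or the step budget is spent.
// Every step is appended to the history so it can be replayed or audited.
function runAgent(goal: string, deps: AgentDeps, maxSteps = 20): Step[] {
  const history: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const screenshot = deps.captureScreen();
    const action = deps.decide(goal, screenshot, history);
    history.push({ screenshot, action });
    if (action.kind === "done") break;
    deps.execute(action);
  }
  return history;
}

// Usage with stubbed dependencies: a scripted "model" that clicks, types,
// and then finishes.
const scripted: Action[] = [
  { kind: "click", x: 100, y: 200 },
  { kind: "type", text: "hello" },
  { kind: "done" },
];
const executed: Action[] = [];
const trace = runAgent("Type hello into the search box", {
  captureScreen: () => "fake-screenshot",
  decide: (_goal, _shot, history) => scripted[history.length] ?? { kind: "done" },
  execute: (action) => {
    executed.push(action);
  },
});
```

Injecting the three dependencies as plain functions keeps the loop testable without a screen or a model, and the returned history is the kind of record a replay panel could display.
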
## Self-Hosting & Configuration

- Requires Node.js 18+ and npm; builds for macOS, Windows, and Linux via Electron
- Configure your VLM backend by setting API keys in the settings panel (supports OpenAI, Anthropic, local endpoints)
- Adjust screenshot resolution, action delay, and safety guards in the config file
- MCP servers can be registered in the settings to extend tool capabilities
- GPU is not required on the desktop side; inference runs against remote or local API endpoints

## Key Features

- Multimodal perception: the agent sees the screen and reasons about UI elements
- Cross-platform desktop automation without brittle selectors or accessibility APIs
- Built-in browser agent mode for web tasks alongside native app control
- Step-by-step replay and debugging to audit every decision the agent made
- Extensible via MCP for adding custom tools and data sources

## Comparison with Similar Tools

- **Browser Use**: browser-only automation; UI-TARS handles native desktop apps too
- **OpenHands**: focuses on coding tasks in a sandboxed environment; UI-TARS targets general GUI automation
- **Anthropic Computer Use**: similar vision-action loop but closed-source; UI-TARS is open-source
- **LaVague**: web-focused large action model; UI-TARS combines web and desktop in one stack
- **Stagehand**: browser automation SDK; UI-TARS provides a full desktop application with built-in agent loop

## FAQ

**Q: Do I need a local GPU to run UI-TARS Desktop?**
A: No. The desktop app sends screenshots to a VLM API endpoint rather than running inference itself. You only need a GPU if you self-host the vision-language model locally.

**Q: Which operating systems are supported?**
A: macOS, Windows, and Linux. The app is packaged with Electron and builds for all three platforms.

**Q: Can I use open-weight models instead of commercial APIs?**
A: Yes. Point the VLM backend to any OpenAI-compatible endpoint, including locally hosted models via vLLM or Ollama.

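
To illustrate what "OpenAI-compatible endpoint" means in practice, the sketch below builds the kind of chat completions request a screenshot-driven agent might send. It is an assumption-laden example, not how UI-TARS Desktop issues requests internally: `buildVlmRequest`, the base URL, and the model name `my-local-vlm` are all hypothetical. Only the message shape (a text part plus an `image_url` part carrying a base64 data URL) follows the OpenAI-style chat format.

```typescript
// Hypothetical helper, not part of UI-TARS Desktop. The body follows the
// OpenAI-style chat completions format with one text part and one image part.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  messages: Array<{ role: "user"; content: ContentPart[] }>;
}

function buildVlmRequest(
  baseUrl: string, // e.g. a local server's ".../v1" base URL (assumed)
  model: string, // whatever model name the server exposes (assumed)
  instruction: string,
  screenshotBase64: string
): { url: string; body: ChatRequest } {
  // Strip trailing slashes so the path is joined exactly once.
  const url = baseUrl.replace(/\/+$/, "") + "/chat/completions";
  return {
    url,
    body: {
      model,
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: instruction },
            {
              type: "image_url",
              image_url: { url: `data:image/png;base64,${screenshotBase64}` },
            },
          ],
        },
      ],
    },
  };
}

// Example: target a locally hosted server. The port and model name are
// placeholders; use whatever your own server reports.
const req = buildVlmRequest(
  "http://localhost:8000/v1/",
  "my-local-vlm",
  "Click the Save button",
  "iVBORw0KGgo="
);
// Sending it would be a POST of JSON.stringify(req.body) to req.url with a
// "Content-Type: application/json" header (plus an API key header if required).
```

Because only the base URL and model name change between providers, switching from a commercial API to a self-hosted model should not require touching the message-building logic.
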

**Q: Is it safe to let an AI agent control my desktop?**
A: UI-TARS includes configurable safety guards, action confirmation prompts, and a sandboxed browser mode. Review the security settings before granting full control.

## Sources

- https://github.com/bytedance/UI-TARS-desktop
- https://github.com/bytedance/UI-TARS-desktop/blob/main/README.md

---

Source: https://tokrepo.com/en/workflows/asset-0f676ee5

Author: Script Depot