
ByteDance’s UI-TARS Desktop is an open-source agent that controls your computer the way a human does: by looking at the screen and clicking things. No DOM parsing, no accessibility tree, no API integration. It takes a screenshot, figures out what it sees, decides where to click or type, executes the action, then takes another screenshot to verify the result. The 72B model scores 24.6% on the OSWorld benchmark, beating Anthropic’s Claude Computer Use (22.0%) and doubling GPT-4o’s baseline (12.2%). The 7B model runs on a single consumer GPU and still hits 18.8%, making local deployment realistic for teams that cannot send screen data to external APIs.

That pure-vision approach is what makes UI-TARS interesting. Every other agent framework assumes structured access to the application it automates, whether through a browser’s DOM, a REST API, or an MCP server. UI-TARS assumes nothing except pixels on a screen. This makes it the only open-source agent that can automate legacy desktop software, proprietary tools with no API, or any application where the only interface is a GUI.

Related: Browser AI Agents: How They Automate the Web

How the Agent Loop Works: Screenshots In, Actions Out

UI-TARS Desktop is built on Electron and runs on macOS, Windows, and Linux. The core loop is straightforward, as described in the project’s architecture documentation:

  1. Capture a screenshot using platform-native APIs (CGWindowListCreateImage on macOS, Win32 API on Windows)
  2. Send the screenshot plus the user’s instruction and conversation history to the vision-language model
  3. Parse the model’s output into structured action tokens: click(x, y), type('text'), scroll(direction), hotkey('ctrl+c'), wait(), or finished()
  4. Execute the action through native OS input simulation (robotjs/nut.js)
  5. Take a verification screenshot and compare with the expected outcome
  6. Repeat until the task is done or the model outputs finished()
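
The action-token format in step 3 can be sketched with a small parser. This is a hypothetical illustration of the idea, not the project's actual parsing code; the exact grammar UI-TARS emits may differ (this naive version would also mis-split a typed string containing a comma):

```python
import re

# Hypothetical parser for action tokens like click(x, y), type('text'),
# scroll(direction), hotkey('ctrl+c'), wait(), finished().
ACTION_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def parse_action(token: str) -> dict:
    """Parse a single action token into a structured dict."""
    match = ACTION_RE.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognized action token: {token!r}")
    name = match.group("name")
    raw_args = match.group("args").strip()
    if not raw_args:
        return {"action": name, "args": []}
    # Split on commas, strip quotes, and convert numeric args to ints
    args = []
    for part in raw_args.split(","):
        part = part.strip().strip("'\"")
        args.append(int(part) if part.lstrip("-").isdigit() else part)
    return {"action": name, "args": args}

print(parse_action("click(340, 512)"))   # {'action': 'click', 'args': [340, 512]}
print(parse_action("hotkey('ctrl+c')"))  # {'action': 'hotkey', 'args': ['ctrl+c']}
print(parse_action("finished()"))        # {'action': 'finished', 'args': []}
```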

The verification step is critical. After each action, UI-TARS does not blindly proceed. It compares the before and after screenshots, and if the expected state change did not happen (a button was not clicked, a menu did not open), the model generates a corrective plan. ByteDance calls this “System-2 reflection,” borrowing the terminology from Daniel Kahneman’s dual-process theory. In practice, it means the agent retries intelligently rather than plowing through a broken sequence.
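
The loop plus verification step can be sketched in a few lines. Everything here is a hypothetical placeholder: `capture_screen`, `query_model`, `execute`, and `state_changed` stand in for the real pieces (platform screen capture, VLM inference, nut.js input simulation, screenshot diffing), and the stub model finishes immediately so the sketch runs:

```python
def capture_screen() -> bytes:
    return b""  # real code grabs the display (CGWindowListCreateImage / Win32)

def query_model(instruction: str, screenshot: bytes, history: list) -> dict:
    # Real code sends screenshot + instruction + history to the VLM and
    # parses its action token. This stub finishes immediately.
    return {"action": "finished", "args": []}

def execute(action: dict) -> None:
    pass  # real code drives the mouse/keyboard via robotjs/nut.js

def state_changed(before: bytes, after: bytes) -> bool:
    return before != after  # real code compares screenshots semantically

def run_agent(instruction: str, max_steps: int = 50) -> bool:
    history: list = []
    for _ in range(max_steps):
        before = capture_screen()
        action = query_model(instruction, before, history)
        if action["action"] == "finished":
            return True
        execute(action)
        after = capture_screen()
        # "System-2 reflection": if the action had no visible effect,
        # record the failure so the model can plan a correction instead
        # of repeating the same click.
        result = "ok" if state_changed(before, after) else "no_effect"
        history.append({"action": action, "result": result})
    return False  # step budget exhausted

print(run_agent("open the settings menu"))  # True (stub model finishes at once)
```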

The Model Under the Hood

UI-TARS is built on Qwen2-VL, Alibaba’s vision-language model, and comes in three sizes:

  • UI-TARS-2B: Fast inference, limited capability. Suitable for simple, repetitive tasks.
  • UI-TARS-7B-DPO: The sweet spot for local deployment. Fits on a consumer GPU (RTX 3090/4090 with quantization) and scores 18.8% on OSWorld, beating GPT-4o’s 12.2%.
  • UI-TARS-72B-DPO: Full-size model with the best performance. Requires cloud inference or a multi-GPU setup.

All three process screenshots at up to 1344x1344 resolution and unify three capabilities in a single forward pass: perception (understanding what is on screen), grounding (locating specific UI elements by coordinates), and action prediction (deciding what to do next). No separate OCR pipeline, no object detection model, no handoff between components.
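
Because the model consumes a raw screenshot in a single pass, invoking it looks like any other vision-language chat call. A sketch, assuming the model is served behind an OpenAI-compatible `/v1/chat/completions` endpoint (as vLLM provides); the exact prompt format UI-TARS expects may differ:

```python
import base64
import json

def build_request(instruction: str, screenshot_png: bytes,
                  model: str = "ui-tars-7b") -> dict:
    """Build an OpenAI-style chat request carrying one screenshot.

    The model name and message layout here are assumptions for
    illustration, not the project's documented wire format.
    """
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 256,
    }

payload = build_request("Open the Settings app", b"\x89PNG...")
print(json.dumps(payload)[:60])
```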

The training pipeline, detailed in the arXiv paper, runs in three stages: large-scale GUI perception pretraining across web, desktop, and mobile interfaces; supervised fine-tuning on human action trajectories; and iterative DPO (Direct Preference Optimization) where the agent attempts tasks autonomously, and successful traces become positive training examples while failed traces become negative ones. This self-improvement loop runs over multiple iterations, which explains why the smaller 7B model punches above its weight.
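
The third stage uses the standard DPO objective (Rafailov et al.), where the "winner" is a successful task trace and the "loser" a failed one. A scalar sketch of that loss for a single preference pair; the beta value and log-probs below are illustrative, not taken from the paper:

```python
import math

def dpo_loss(pi_logp_w: float, pi_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    pi_logp_*  : log-prob of the winning/losing trace under the policy
    ref_logp_* : same quantities under the frozen reference policy
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy prefers the winner
    return math.log(1 + math.exp(-margin))

# When policy and reference agree, the margin is 0 and the loss is log 2
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```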

Benchmark Reality Check: What 24.6% Actually Means

The headline number, 24.6% on OSWorld, sounds low. Human performance on the same benchmark is 72.4%. But context matters: OSWorld tests agents on genuinely difficult multi-step desktop tasks (installing software, configuring system settings, manipulating files across applications), and no agent comes close to human performance yet. Here is how the field stacks up:

Agent                      OSWorld (screenshot only)
UI-TARS-72B-DPO            24.6%
Claude 3.5 Computer Use    22.0%
GPT-4o baseline            12.2%
SeeAct (GPT-4V)            11.3%
CogAgent                    4.3%
Human baseline             72.4%

UI-TARS-72B leads by 2.6 percentage points over Claude Computer Use, which is a meaningful gap in a benchmark where most models cluster below 15%. On WebArena (web-only tasks), UI-TARS-72B scores 52.1%. On AndroidWorld (mobile), the 7B model alone hits 46.6%.

The practical takeaway: these agents reliably handle routine multi-step tasks (filling forms, transferring data between apps, navigating menus) but still fail on anything that requires novel problem-solving or deep application knowledge. Plan for 70-80% automation of repetitive workflows, not full autonomy.

Related: AI Agent Frameworks Tier List 2026: ADK, Mastra, and OpenAI Agents SDK Join the Race

Vision-Based vs. DOM-Based Agents: The Real Tradeoff

The AI agent ecosystem has split into two camps, and UI-TARS sits firmly in one of them.

DOM-based agents (Browser Use, Playwright MCP, Chrome WebMCP) parse the structured document model of web pages. They know exactly where every button, link, and input field is because the browser tells them. This makes them fast and reliable for web automation: they can click a button by its CSS selector, not by guessing pixel coordinates.

Vision-based agents (UI-TARS, Claude Computer Use) work from raw pixels. They receive a screenshot and must figure out everything from the image: what application is open, where the buttons are, what the text says, what state the interface is in. This is harder and slower, but it works on literally anything with a screen.

The tradeoff is simple:

Dimension            DOM-Based (Browser Use, Playwright)   Vision-Based (UI-TARS, Claude Computer Use)
Speed                Fast (direct element access)          Slow (screenshot + VLM inference)
Reliability on web   High (structured data)                Medium (visual ambiguity)
Desktop app support  None                                  Full
Legacy software      None                                  Full
API dependency       Needs browser APIs                    Needs nothing but a screen
Local deployment     Needs LLM API                         UI-TARS 7B runs locally
Data privacy         Data flows to LLM provider            Can stay on-device

If your automation target is web-only, DOM-based agents are faster and more reliable. If you need to automate SAP, Excel, a legacy ERP system, or any desktop application that was never designed for programmatic access, vision-based agents are your only option besides traditional RPA. UI-TARS is the first open-source, open-weight option in that space.

Related: Hyperautomation 2026: Why AI Agents Are Absorbing RPA, Not Just Replacing It

Where This Matters: Legacy Software and Data Sovereignty

The strongest use case for UI-TARS is not automating Chrome. Browser agents do that better. The strongest case is automating applications that resist automation: the SAP GUI that only accepts keyboard shortcuts, the medical records system from 2008, the insurance underwriting tool that only runs on Windows. These applications have no API, no DOM, no MCP support, and no plans to add any. For decades, RPA tools automated them through brittle, pixel-coordinate-based scripts that broke every time the UI changed.

UI-TARS replaces those brittle scripts with a model that actually understands what it sees. When a button moves 20 pixels to the right after an update, a traditional RPA script breaks. UI-TARS reads the button label and clicks it anyway.

For enterprise teams in regulated industries (banking, insurance, healthcare), the 7B model’s ability to run locally is the buried advantage. Screen data never leaves the machine. No screenshots sent to Anthropic or OpenAI. For organizations bound by GDPR or sector-specific regulations, this is the difference between “possible” and “compliance nightmare.”

Getting Started: What You Actually Need

Running UI-TARS Desktop requires:

  • For the 7B model (local): A GPU with 16GB+ VRAM (RTX 3090, 4090, or equivalent). With 4-bit quantization, 12GB VRAM is workable but slower.
  • For the 72B model (cloud): API access to a cloud inference provider running the model, or a multi-GPU setup with 4x A100 80GB or equivalent.
  • The Electron app: Download from the GitHub releases page. Works on macOS, Windows, and Linux.
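
The VRAM figures above follow from simple weight arithmetic. A back-of-envelope sketch (weights only; it deliberately excludes the KV cache and activations, which grow with the 1344x1344 screenshots and explain why 4-bit still wants ~12GB in practice):

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight memory only, in GB (decimal), ignoring KV cache and
    activation memory."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# 7B model: fp16 weights are ~14 GB, hence the 16GB+ VRAM guidance;
# 4-bit quantized weights are ~3.5 GB, leaving headroom on a 12GB card.
print(weight_footprint_gb(7, 16))  # 14.0
print(weight_footprint_gb(7, 4))   # 3.5
```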

Configuration is straightforward: point the app at your model endpoint (local or remote), give it a natural language instruction, and watch it work. The app overlays action annotations on your screen so you can see exactly what the agent is doing and intervene if needed.

One practical warning: inference speed with the 72B model is roughly 3-5 seconds per action cycle (screenshot capture, model inference, action execution). For a task with 20 steps, expect about 60-100 seconds total. The 7B model is faster at ~1-2 seconds per cycle but makes more mistakes, requiring more correction steps. Neither is fast enough for real-time interaction; this is a tool for batch automation of repetitive tasks, not a replacement for a mouse.

Related: Physical AI: When Software Agents Get a Body

Frequently Asked Questions

What is ByteDance UI-TARS Desktop?

UI-TARS Desktop is an open-source GUI agent application by ByteDance that controls any computer application through natural language instructions. It works by taking screenshots, understanding the visual interface through a vision-language model, and performing mouse and keyboard actions. It runs on macOS, Windows, and Linux under the Apache 2.0 license.

How does UI-TARS compare to Claude Computer Use?

UI-TARS-72B-DPO scores 24.6% on the OSWorld benchmark, beating Claude Computer Use’s 22.0%. The key difference is that UI-TARS is open-source and open-weight, meaning the 7B model can run entirely locally without sending data to external APIs. Claude Computer Use requires API calls to Anthropic and does not offer a local deployment option.

Can UI-TARS Desktop run locally without cloud APIs?

Yes. The UI-TARS-7B model runs on a single consumer GPU with 16GB VRAM (like an RTX 3090 or 4090). With 4-bit quantization, 12GB VRAM is workable. The 7B model scores 18.8% on OSWorld, which is lower than the 72B model’s 24.6% but still significantly above GPT-4o’s 12.2% baseline.

What is the difference between vision-based and DOM-based AI agents?

DOM-based agents (like Browser Use or Playwright MCP) parse the structured HTML of web pages and interact with elements through selectors. They are faster and more reliable for web automation but only work in browsers. Vision-based agents (like UI-TARS and Claude Computer Use) work from screenshots and can automate any application with a visible GUI, including desktop software and legacy systems, but are slower and less precise on web tasks.

What are realistic use cases for UI-TARS Desktop in 2026?

The strongest use cases are automating legacy desktop applications that have no API: SAP GUI interactions, data entry across proprietary enterprise tools, cross-application workflows that require copying data between desktop apps, and automated UI testing for applications without test automation hooks. It is less suited for web automation, where DOM-based agents are faster and more reliable.