AI models directly interacting with graphical user interfaces by perceiving and controlling screens
Computer Use refers to the capability of AI models to directly interact with computer interfaces by viewing screenshots and issuing commands to move cursors, click buttons, type text, and navigate windows—essentially controlling a computer as a human would through a graphical interface. Unlike tool-use APIs where a model calls predefined functions, computer use grants the model perception of and control over arbitrary software, web applications, and operating system interfaces in real time. The model sees a screenshot, reasons about what action is needed, executes a click or keystroke, observes the resulting screen state, and iterates.
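The perceive-reason-act loop described above can be sketched as follows. Everything here is a hypothetical stand-in, not a real API: `take_screenshot`, `query_model`, and `execute` are stubs that simulate screen capture, the model call, and the OS-level driver respectively.

```python
# Minimal sketch of the computer-use agent loop: observe the screen,
# ask the model for an action, execute it, repeat until done.
# All function and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot(state: dict) -> str:
    # Stand-in for capturing the screen; returns a text description.
    return state["screen"]

def query_model(screenshot: str, goal: str) -> Action:
    # Stand-in for the model call: maps the observed screen to an action.
    if "login form" in screenshot:
        return Action("type", text="user@example.com")
    if "filled form" in screenshot:
        return Action("click", x=320, y=480)   # e.g. the submit button
    return Action("done")

def execute(action: Action, state: dict) -> None:
    # Stand-in for the OS-level driver that applies the action.
    if action.kind == "type":
        state["screen"] = "filled form"
    elif action.kind == "click":
        state["screen"] = "confirmation page"

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    state = {"screen": "login form"}
    trace = []
    for _ in range(max_steps):
        action = query_model(take_screenshot(state), goal)
        trace.append(action.kind)
        if action.kind == "done":
            break            # model judges the goal complete
        execute(action, state)
    return trace

# run_agent("submit the form") → ["type", "click", "done"]
```

The `max_steps` cap matters in practice: because each action changes the screen state the model observes next, an agent without a step budget can loop indefinitely on an interface it misreads.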
Anthropic released Computer Use as a capability with Claude 3.5 Sonnet in late 2024, making it the first major foundation model to ship this capability at scale. The technical challenge is substantial: the model must process high-resolution images (screenshots), reason about spatial layout and semantics ("where is the submit button?"), map that reasoning to coordinate-based actions, and maintain context across multiple screen states. Vision transformers provide the perception layer, while reinforcement learning and demonstrations help train the action selection. Importantly, computer use works alongside language—the model can read text on screen and use that context to navigate. It differs fundamentally from API-based tool use because it doesn't require pre-integration: if a human can use software, so can the model, given sufficient capability.
Computer use unlocks automation of knowledge work that was previously hard to automate: navigating complex web portals, managing spreadsheets with dynamic layouts, executing multi-step workflows across disparate systems, and testing software. It also raises significant safety concerns. A system that can control your computer without guardrails could exfiltrate data, make unauthorized transactions, or introduce malware. Therefore, computer use applications typically run in sandboxed or isolated environments and require explicit user authorization for each interaction. The potential impact is enormous—routine office work, debugging, research, and many forms of personal assistance could be partially or fully automated.
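The per-action authorization guardrail mentioned above can be sketched as a gate in front of the executor. The `approve` callback is hypothetical; in a real deployment it might prompt the user or consult a policy engine before any state-changing action runs.

```python
# Hedged sketch of a per-action authorization gate: sensitive actions
# run only if an approval callback allows them. Names are illustrative.
from typing import Callable

# Action kinds that change state and therefore need approval;
# pure observation (e.g. taking a screenshot) passes through.
SENSITIVE = {"click", "type", "key"}

def gated_execute(action_kind: str,
                  approve: Callable[[str], bool]) -> str:
    """Return 'executed' if the action may run, 'blocked' otherwise."""
    if action_kind in SENSITIVE and not approve(action_kind):
        return "blocked"
    return "executed"

# gated_execute("click", lambda a: False) → "blocked"
# gated_execute("screenshot", lambda a: False) → "executed"
```

Combined with a sandboxed or isolated environment, this kind of gate keeps the model's observations cheap while making its effects auditable.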