Introduction
Page-Agent is an open-source JavaScript library from Alibaba that lets AI agents operate web page UIs from natural language instructions. Unlike browser automation tools that drive the browser from outside over protocols such as the Chrome DevTools Protocol (CDP), Page-Agent runs inside the page context itself, giving it direct access to the DOM, event system, and application state.
What Page-Agent Does
- Translates natural language commands into precise DOM interactions
- Operates inside the browser page for direct access to elements and state
- Handles complex multi-step UI workflows like form filling and navigation
- Grounds actions in page layout and element semantics read from the DOM
- Works with any web application without requiring custom selectors or scripts
Architecture Overview
Page-Agent injects a lightweight runtime into the target page that captures a structured snapshot of the DOM, including element positions, text content, and interactive affordances. This snapshot is sent to an LLM, which plans the next actions as a sequence of DOM operations. The runtime executes each action, waits for page transitions, and captures the updated state so that the LLM can plan the next step of a multi-step workflow.
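The observe, plan, act cycle described above can be sketched as a small loop. This is a minimal illustration, not Page-Agent's actual implementation: `ElementInfo`, `Action`, `PageDriver`, `Planner`, and `runTask` are hypothetical names, and the planner is a plain callback standing in for the LLM call.

```typescript
// Hypothetical names throughout; this is not Page-Agent's real API.
interface ElementInfo {
  index: number; // handle the planner uses to refer to an element
  tag: string;
  text: string;
}

type Action =
  | { kind: "click"; index: number }
  | { kind: "type"; index: number; value: string }
  | { kind: "done"; summary: string };

// The in-page runtime: captures state and performs DOM operations.
interface PageDriver {
  snapshot(): Promise<ElementInfo[]>;
  execute(action: Action): Promise<void>;
}

// Stand-in for the LLM: given the goal, current state, and past actions,
// it returns the next action.
type Planner = (
  goal: string,
  snapshot: ElementInfo[],
  history: Action[],
) => Promise<Action>;

async function runTask(
  goal: string,
  page: PageDriver,
  plan: Planner,
  maxSteps = 10,
): Promise<Action[]> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    // Re-capture state on every step so the planner sees the effect
    // of the previous action.
    const snapshot = await page.snapshot();
    const action = await plan(goal, snapshot, history);
    history.push(action);
    if (action.kind === "done") return history;
    await page.execute(action);
  }
  throw new Error(`Gave up after ${maxSteps} steps: ${goal}`);
}
```

Capping the loop with `maxSteps` is one way to keep a confused planner from cycling forever; a real runtime would presumably also wait for the page to settle between `execute` and the next `snapshot`.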
Self-Hosting & Configuration
- Install via npm and bundle with your browser extension or automation script
- Configure the LLM provider and model via initialization options
- Set action timeout and retry policies for unreliable network conditions
- Customize element selection strategies for specific application patterns
- Enable debug mode for step-by-step action logging and screenshots
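As a rough illustration of the options listed above, the sketch below resolves a configuration object with defaults and derives retry delays from a retry policy. Every option name here (`llm`, `actionTimeoutMs`, `retry`, `debug`) and both helper functions are assumptions made for the example; consult the project's documentation for the real initialization options.

```typescript
// Hypothetical option names, for illustration only.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay before the first retry
}

interface AgentOptions {
  llm: { baseURL: string; model: string; apiKey?: string };
  actionTimeoutMs: number;
  retry: RetryPolicy;
  debug: boolean;
}

// Callers must name the LLM; everything else falls back to defaults.
type AgentOptionsInput = Pick<AgentOptions, "llm"> &
  Partial<Omit<AgentOptions, "llm">>;

const DEFAULTS: Omit<AgentOptions, "llm"> = {
  actionTimeoutMs: 10000,
  retry: { maxAttempts: 3, baseDelayMs: 500 },
  debug: false,
};

function resolveOptions(input: AgentOptionsInput): AgentOptions {
  const options: AgentOptions = {
    llm: input.llm,
    actionTimeoutMs: input.actionTimeoutMs ?? DEFAULTS.actionTimeoutMs,
    retry: input.retry ?? DEFAULTS.retry,
    debug: input.debug ?? DEFAULTS.debug,
  };
  if (options.actionTimeoutMs <= 0) {
    throw new Error("actionTimeoutMs must be positive");
  }
  if (options.retry.maxAttempts < 1) {
    throw new Error("retry.maxAttempts must be at least 1");
  }
  return options;
}

// Exponential backoff: one delay per retry, doubling each time.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    delays.push(policy.baseDelayMs * 2 ** retry);
  }
  return delays;
}
```

With the defaults above, `backoffDelays` yields delays of 500 ms and 1000 ms between the three attempts. Validating at startup surfaces a bad timeout or retry count before any action runs.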
Key Features
- In-page execution gives access to JavaScript state and shadow DOM
- Vision-free approach using structured DOM snapshots instead of screenshots
- Supports complex workflows spanning multiple page navigations
- MCP server integration for use with AI coding agents
- Lightweight runtime with no heavy browser dependencies
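To make the vision-free and shadow DOM points above concrete, here is a minimal sketch of how in-page code might collect interactive elements, descending into open shadow roots, and render them as indexed text for an LLM. It is an assumption-laden illustration rather than Page-Agent's own snapshot format: `NodeLike`, `SnapshotEntry`, `collectInteractive`, and `toPrompt` are hypothetical, and the list of interactive tags is deliberately short.

```typescript
// Minimal structural view of an element so the sketch also runs outside a
// browser. A real DOM Element should fit this shape (tagName, textContent,
// children, shadowRoot).
interface NodeLike {
  tagName: string;
  textContent: string | null;
  children: ArrayLike<NodeLike>;
  shadowRoot?: { children: ArrayLike<NodeLike> } | null;
}

interface SnapshotEntry {
  index: number;
  tag: string;
  text: string;
}

// Deliberately short list; a real runtime would also consider roles,
// event listeners, tabindex, and visibility.
const INTERACTIVE_TAGS = new Set(["A", "BUTTON", "INPUT", "SELECT", "TEXTAREA"]);

function collectInteractive(root: NodeLike): SnapshotEntry[] {
  const entries: SnapshotEntry[] = [];
  const visit = (node: NodeLike): void => {
    if (INTERACTIVE_TAGS.has(node.tagName.toUpperCase())) {
      entries.push({
        index: entries.length,
        tag: node.tagName.toLowerCase(),
        text: (node.textContent ?? "").trim(),
      });
    }
    // Open shadow roots are reachable from in-page code; for closed roots
    // shadowRoot is null, so they are skipped here.
    if (node.shadowRoot) {
      Array.from(node.shadowRoot.children).forEach((child) => visit(child));
    }
    Array.from(node.children).forEach((child) => visit(child));
  };
  visit(root);
  return entries;
}

// One line per element, e.g. "[0] <button> Save"; the LLM can answer with
// an index instead of a CSS selector.
function toPrompt(entries: SnapshotEntry[]): string {
  return entries
    .map((e) => (e.text ? `[${e.index}] <${e.tag}> ${e.text}` : `[${e.index}] <${e.tag}>`))
    .join("\n");
}
```

In a page, `collectInteractive(document.body)` would be the natural entry point. Sending indexed text instead of screenshots keeps the prompt small and works with text-only models.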
Comparison with Similar Tools
- Playwright/Puppeteer: require external browser control; Page-Agent runs in-page
- Browser Use: Python-based with screenshot vision; Page-Agent uses DOM snapshots
- Stagehand: adds natural-language actions on top of externally driven browser automation; Page-Agent runs in-page and works from its own DOM snapshot
- Selenium: drives browsers externally through WebDriver; Page-Agent is a lightweight embeddable library
FAQ
Q: Does it require a headless browser? A: No. Page-Agent runs inside any browser context, including headed browsers and extensions.
Q: Which LLM providers are supported? A: Any provider with a chat completions API, including Anthropic, OpenAI, and local models.
Q: Can it handle dynamic single-page applications? A: Yes. It captures DOM state after JavaScript rendering and handles React, Vue, and Angular apps.
Q: Is it suitable for testing? A: It can be used for exploratory testing, but dedicated testing tools offer better assertion and reporting capabilities.
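On the provider question above: "chat completions API" commonly refers to the OpenAI-style `POST {baseURL}/chat/completions` request. The sketch below builds such a request from a provider configuration. It assumes an OpenAI-compatible endpoint and is not taken from Page-Agent's source; `ProviderConfig`, `ChatMessage`, and `buildChatRequest` are hypothetical names.

```typescript
// Hypothetical helper, assuming an OpenAI-compatible endpoint.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ProviderConfig {
  baseURL: string; // e.g. a hosted API or a local server's /v1 endpoint
  model: string;
  apiKey?: string; // local servers often need none
}

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  provider: ProviderConfig,
  messages: ChatMessage[],
): ChatRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  if (provider.apiKey) {
    headers["Authorization"] = `Bearer ${provider.apiKey}`;
  }
  return {
    // Strip trailing slashes so the path is joined with exactly one "/".
    url: `${provider.baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({ model: provider.model, messages }),
    },
  };
}
```

The result can be passed to `fetch(request.url, request.init)`. Switching providers then amounts to changing `baseURL`, `model`, and `apiKey`, provided the endpoint accepts this request shape.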