Page-Agent — In-Page GUI Agent for Natural Language Browser Control

Introduction

Page-Agent is an open-source JavaScript library from Alibaba that lets AI agents operate web page UIs from natural language instructions. Unlike browser automation tools that drive the browser from outside through protocols such as the Chrome DevTools Protocol (CDP), Page-Agent runs inside the page itself, so it has direct access to the DOM, the event system, and application state.

What Page-Agent Does

Translates natural language commands into precise DOM interactions
Operates inside the browser page for direct access to elements and state
Handles complex multi-step UI workflows like form filling and navigation
Grounds actions in page layout and element semantics read from the DOM
Works with any web application without requiring custom selectors or scripts

Architecture Overview

Page-Agent injects a lightweight runtime into the target page that captures a structured snapshot of the DOM, including element positions, text content, and interactive affordances. The snapshot is sent to an LLM, which plans the next actions as a sequence of DOM operations. The runtime executes those actions, waits for page transitions, and captures the updated state so that multi-step workflows can continue.
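
The snapshot, plan, execute cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Page-Agent's actual API: `observe`, `act`, `runTask`, the action shape (`type`, `index`, `text`), and the scripted planner are all hypothetical, and the "page" is a plain object so the sketch runs outside a browser. In a real agent the planner would be an asynchronous LLM call.

```javascript
// Hypothetical observe-plan-act loop; none of these names come from Page-Agent.

// Observe: reduce the page to the elements the planner may act on.
function observe(page) {
  return page.elements.map((el, index) => ({
    index,
    tag: el.tag,
    text: el.text,
    value: el.value ?? null,
  }));
}

// Act: apply one planned action and report whether it could be executed.
function act(page, action) {
  const el = page.elements[action.index];
  if (!el) return false;
  if (action.type === "type") {
    el.value = action.text;
    return true;
  }
  if (action.type === "click") {
    el.onClick?.(page);
    return true;
  }
  return false;
}

// Loop: snapshot, ask the planner for the next action, execute, repeat.
function runTask(page, instruction, planNext, maxSteps = 10) {
  const trace = [];
  for (let step = 0; step < maxSteps; step++) {
    const snapshot = observe(page);
    const action = planNext(instruction, snapshot, trace);
    if (action.type === "done") break;
    const ok = act(page, action);
    trace.push({ ...action, ok });
    if (!ok) break; // stop on the first action that cannot be executed
  }
  return trace;
}

// Stand-in for a page: a search box and a submit button.
const page = {
  submitted: null,
  elements: [
    { tag: "input", text: "Search", value: "" },
    { tag: "button", text: "Go", onClick: (p) => { p.submitted = p.elements[0].value; } },
  ],
};

// Stand-in for the LLM: fills the box, clicks the button, then stops.
function scriptedPlanner(instruction, snapshot, trace) {
  if (trace.length === 0) return { type: "type", index: 0, text: "laptops" };
  if (trace.length === 1) return { type: "click", index: 1 };
  return { type: "done" };
}

const trace = runTask(page, "search for laptops", scriptedPlanner);
console.log(trace.length, page.submitted);
```

Passing the trace of earlier actions back to the planner is one simple way to let it decide when the task is complete; the step cap guards against a planner that never returns `done`.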

Self-Hosting & Configuration

Install via npm and bundle with your browser extension or automation script
Configure the LLM provider and model via initialization options
Set action timeout and retry policies for unreliable network conditions
Customize element selection strategies for specific application patterns
Enable debug mode for step-by-step action logging and screenshots
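
As a rough illustration of the options listed above, the sketch below merges user overrides onto defaults and derives an exponential backoff schedule for retries. Every option name here (`llm`, `actionTimeoutMs`, `maxRetries`, `retryBaseDelayMs`, `debug`) and the model names are assumptions made for the example; consult the library's own documentation for the real initialization options.

```javascript
// Hypothetical option names for illustration only; not Page-Agent's documented API.
const DEFAULTS = {
  llm: { provider: "openai", model: "example-model" },
  actionTimeoutMs: 5000,
  maxRetries: 2,
  retryBaseDelayMs: 250,
  debug: false,
};

// Merge overrides onto the defaults; the nested llm block is merged one level deep.
function resolveConfig(overrides = {}) {
  const config = {
    ...DEFAULTS,
    ...overrides,
    llm: { ...DEFAULTS.llm, ...overrides.llm },
  };
  if (!(config.actionTimeoutMs > 0)) {
    throw new RangeError("actionTimeoutMs must be positive");
  }
  if (!Number.isInteger(config.maxRetries) || config.maxRetries < 0) {
    throw new RangeError("maxRetries must be a non-negative integer");
  }
  return config;
}

// Exponential backoff: the wait before each retry doubles.
function retryDelays(config) {
  return Array.from(
    { length: config.maxRetries },
    (_, attempt) => config.retryBaseDelayMs * 2 ** attempt,
  );
}

const config = resolveConfig({ llm: { model: "my-local-model" }, maxRetries: 3, debug: true });
console.log(config.llm, retryDelays(config));
```

Validating the merged configuration once at startup keeps bad values (a zero timeout, a negative retry count) from surfacing later as hard-to-trace action failures.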

Key Features

In-page execution gives access to JavaScript state and shadow DOM
Vision-free approach using structured DOM snapshots instead of screenshots
Supports complex workflows spanning multiple page navigations
MCP server integration for use with AI coding agents
Lightweight runtime with no heavy browser dependencies
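
To make the idea of a structured, screenshot-free snapshot more concrete, here is a minimal sketch of a walker that indexes interactive elements and descends into open shadow roots. It is an assumption about how such a snapshot could look, not Page-Agent's actual format: `walk`, `snapshot`, the tag list, and the `[index] <tag> label` output are hypothetical. The walker relies only on standard DOM properties (`tagName`, `children`, `shadowRoot`, `textContent`, `getAttribute`), so the same code is exercised here on plain stand-in objects.

```javascript
// Hypothetical snapshot format; the property names used are standard DOM ones.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Depth-first walk over an element, its open shadow root (if any), and its children.
function* walk(node) {
  yield node;
  const shadowChildren = node.shadowRoot ? Array.from(node.shadowRoot.children) : [];
  const lightChildren = Array.from(node.children ?? []);
  for (const child of [...shadowChildren, ...lightChildren]) {
    yield* walk(child);
  }
}

// Produce one line per interactive element: "[index] <tag> label".
function snapshot(root) {
  const lines = [];
  let index = 0;
  for (const node of walk(root)) {
    const tag = node.tagName.toLowerCase();
    if (!INTERACTIVE_TAGS.has(tag)) continue;
    const label = (node.getAttribute?.("aria-label") || node.textContent || "").trim();
    lines.push(`[${index++}] <${tag}> ${label}`);
  }
  return lines.join("\n");
}

// Stand-in element factory so the sketch runs without a browser.
const el = (tagName, text = "", children = [], extra = {}) => ({
  tagName,
  textContent: text,
  children,
  ...extra,
});

const tree = el("DIV", "", [
  el("INPUT", "", [], { getAttribute: (name) => (name === "aria-label" ? "Email" : null) }),
  el("MY-WIDGET", "", [], { shadowRoot: { children: [el("BUTTON", "Subscribe")] } }),
  el("P", "Unrelated text"),
]);

console.log(snapshot(tree));
```

In a browser, `shadowRoot` is `null` for closed shadow roots, so a walker like this only sees components that attach their shadow root in open mode.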

Comparison with Similar Tools

Playwright/Puppeteer: drive the browser from an external process; Page-Agent runs in-page
Browser Use: Python-based with screenshot vision; Page-Agent uses DOM snapshots
Stagehand: also LLM-driven, but layered on external browser control; Page-Agent runs in-page and works from DOM snapshots
Selenium: heavyweight framework; Page-Agent is a lightweight embeddable library

FAQ

Q: Does it require a headless browser? A: No. Page-Agent runs inside any browser context, including headed browsers and extensions.

Q: Which LLM providers are supported? A: Any provider with a chat completions API, including Anthropic, OpenAI, and local models.
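
For providers that expose an OpenAI-style chat completions endpoint, the request body is a `model` plus a list of `messages` with `role` and `content`. The sketch below builds such a body from an instruction and a page snapshot and parses a JSON action from the reply. The prompt wording, the function names, and the action shape are assumptions for illustration, not Page-Agent's own prompts or protocol, and providers with a different native format may need an adapter.

```javascript
// Hypothetical helpers; only the { model, messages: [{ role, content }] } shape
// follows the OpenAI-style chat completions convention.
function buildChatRequest({ model, instruction, snapshot }) {
  return {
    model,
    messages: [
      { role: "system", content: "You control a web page. Reply with one JSON action." },
      { role: "user", content: `Task: ${instruction}\n\nPage elements:\n${snapshot}` },
    ],
  };
}

// Parse the assistant's reply; return null when it is not a JSON object with a string "type".
function parseAction(reply) {
  try {
    const action = JSON.parse(reply);
    return typeof action?.type === "string" ? action : null;
  } catch {
    return null;
  }
}

const body = buildChatRequest({
  model: "example-model",
  instruction: "subscribe with my email",
  snapshot: "[0] <input> Email\n[1] <button> Subscribe",
});

console.log(JSON.stringify(body, null, 2));
console.log(parseAction('{"type":"click","index":1}'));
console.log(parseAction("Sorry, I cannot help with that."));
```

Treating an unparseable reply as `null` rather than throwing lets the calling loop decide whether to retry the request or stop the task.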

Q: Can it handle dynamic single-page applications? A: Yes. It captures DOM state after JavaScript rendering and handles React, Vue, and Angular apps.

Q: Is it suitable for testing? A: It can be used for exploratory testing, but dedicated testing tools offer better assertion and reporting capabilities.
