Scripts · May 31, 2026 · 3 min read

Page-Agent — In-Page GUI Agent for Natural Language Browser Control

JavaScript library by Alibaba that lets AI agents control web interfaces using natural language commands directly in the browser page context.

Agent ready

Ready-to-run agent install

To install this asset, the agent chooses its runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: Page-Agent

Direct install command

npx -y tokrepo@latest install e3a97b73-5ca7-11f1-9bc6-00163e2b0d79 --target codex

Run this command only after a dry run confirms the install plan.

Introduction

Page-Agent is an open-source JavaScript library from Alibaba that enables AI agents to interact with web page UIs using natural language instructions. Unlike browser automation tools that control the browser from outside through protocols such as the Chrome DevTools Protocol (CDP), Page-Agent runs inside the page context itself, giving it direct access to the DOM, event system, and application state.

What Page-Agent Does

  • Translates natural language commands into precise DOM interactions
  • Operates inside the browser page for direct access to elements and state
  • Handles complex multi-step UI workflows like form filling and navigation
  • Grounds each action in page layout and element semantics read from the DOM rather than from screenshots
  • Works with any web application without requiring custom selectors or scripts

Architecture Overview

Page-Agent injects a lightweight runtime into the target page that captures a structured snapshot of the DOM, including element positions, text content, and interactive affordances. This snapshot is sent to an LLM, which plans the next actions as a sequence of DOM operations. The runtime then executes those operations, waits for page transitions, and captures the updated state so that multi-step workflows can continue from the new page.
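The snapshot, plan, and execute cycle described above can be sketched as a small loop. This is a minimal illustration, not Page-Agent's actual API: `runAgentLoop`, the action shapes, and the stubbed `snapshot`, `plan`, and `execute` functions are all hypothetical, and the planner is synchronous here for brevity, whereas a real LLM call would be asynchronous.

```javascript
// Hypothetical sketch of an observe -> plan -> act loop; not Page-Agent's real API.
// `snapshot`, `plan`, and `execute` are injected so the loop can run anywhere.
function runAgentLoop({ task, snapshot, plan, execute, maxSteps = 10 }) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const state = snapshot(); // structured view of the current page
    const action = plan({ task, state, history }); // an LLM call in a real agent
    if (action.type === "done") {
      return { finished: true, steps: history.length, history };
    }
    const result = execute(action); // apply one DOM operation
    history.push({ action, result });
  }
  // Step budget exhausted before the planner reported completion.
  return { finished: false, steps: history.length, history };
}

// Stubs standing in for the page and the model.
const page = { query: "", submitted: false };
const demo = runAgentLoop({
  task: "search for shoes",
  snapshot: () => ({ ...page }),
  plan: ({ state }) => {
    if (state.query === "") return { type: "input", text: "shoes" };
    if (!state.submitted) return { type: "click", target: "submit" };
    return { type: "done" };
  },
  execute: (action) => {
    if (action.type === "input") page.query = action.text;
    if (action.type === "click") page.submitted = true;
    return "ok";
  },
});
console.log(demo.finished, demo.steps); // true 2
```

Capping the loop with `maxSteps` is a design choice in this sketch: it keeps a confused planner from acting on the page indefinitely.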

Self-Hosting & Configuration

  • Install via npm and bundle with your browser extension or automation script
  • Configure the LLM provider and model via initialization options
  • Set action timeout and retry policies for unreliable network conditions
  • Customize element selection strategies for specific application patterns
  • Enable debug mode for step-by-step action logging and screenshots
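As a rough illustration of the configuration points above, the sketch below merges user settings over defaults and derives a retry schedule. Every option name here (`model`, `baseURL`, `actionTimeoutMs`, `retry`, `debug`) and both helper functions are assumptions made for this example; the library's real initialization options may be named and structured differently.

```javascript
// Hypothetical option names for illustration only; consult the Page-Agent
// documentation for the real initialization options.
const DEFAULTS = {
  model: "example-model", // placeholder model id
  baseURL: "https://llm.example.com/v1", // placeholder provider endpoint
  actionTimeoutMs: 5000,
  retry: { attempts: 3, backoffMs: 250 },
  debug: false,
};

// Merge user settings over the defaults and reject obviously invalid values.
function resolveOptions(user = {}) {
  const options = {
    ...DEFAULTS,
    ...user,
    retry: { ...DEFAULTS.retry, ...(user.retry || {}) },
  };
  if (!Number.isInteger(options.retry.attempts) || options.retry.attempts < 1) {
    throw new RangeError("retry.attempts must be a positive integer");
  }
  if (!(options.actionTimeoutMs > 0)) {
    throw new RangeError("actionTimeoutMs must be greater than 0");
  }
  return options;
}

// Delay before each retry, doubling every time (exponential backoff).
function retryDelays({ attempts, backoffMs }) {
  return Array.from({ length: attempts - 1 }, (_, i) => backoffMs * 2 ** i);
}

const options = resolveOptions({ debug: true, retry: { attempts: 4 } });
console.log(retryDelays(options.retry)); // [ 250, 500, 1000 ]
```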

Key Features

  • In-page execution gives access to JavaScript state and shadow DOM
  • Vision-free approach using structured DOM snapshots instead of screenshots
  • Supports complex workflows spanning multiple page navigations
  • MCP server integration for use with AI coding agents
  • Lightweight runtime with no heavy browser dependencies
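To make the DOM-snapshot and shadow DOM points concrete, here is a minimal sketch of how in-page code might list interactive elements as text for a model. It is not Page-Agent's actual snapshot format: `collectInteractive`, `toPrompt`, and the `[index] <tag> text` layout are hypothetical. The walker relies only on `tagName`, `textContent`, `children`, and `shadowRoot`, so it accepts real elements in a browser as well as the plain objects used in the demo.

```javascript
// Sketch only: walks a DOM-like tree and lists interactive elements as text.
const INTERACTIVE = new Set(["a", "button", "input", "select", "textarea"]);

function collectInteractive(node, out = []) {
  const tag = (node.tagName || "").toLowerCase();
  if (INTERACTIVE.has(tag)) {
    out.push({ index: out.length, tag, text: (node.textContent || "").trim() });
  }
  // Open shadow roots are reachable from in-page code through `shadowRoot`;
  // closed ones report null and are skipped.
  if (node.shadowRoot) collectInteractive(node.shadowRoot, out);
  for (const child of node.children || []) collectInteractive(child, out);
  return out;
}

// Render the element list as compact text a model can refer to by index.
function toPrompt(elements) {
  return elements.map((e) => `[${e.index}] <${e.tag}> ${e.text}`.trim()).join("\n");
}

// Plain objects standing in for a page with one custom element.
const fakePage = {
  tagName: "DIV",
  children: [
    { tagName: "INPUT", textContent: "", children: [] },
    {
      tagName: "MY-WIDGET",
      children: [],
      shadowRoot: {
        children: [{ tagName: "BUTTON", textContent: " Save ", children: [] }],
      },
    },
  ],
};
console.log(toPrompt(collectInteractive(fakePage)));
// [0] <input>
// [1] <button> Save
```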

Comparison with Similar Tools

  • Playwright/Puppeteer: control the browser from an external process; Page-Agent runs in-page
  • Browser Use: Python-based with screenshot vision; Page-Agent uses DOM snapshots
  • Stagehand: also driven by natural language, but controls the browser from outside the page; Page-Agent runs in-page
  • Selenium: drives the browser externally through WebDriver; Page-Agent is a lightweight embeddable library

FAQ

Q: Does it require a headless browser? A: No. Page-Agent runs inside any browser context, including headed browsers and extensions.

Q: Which LLM providers are supported? A: Any provider with a chat completions API, including Anthropic, OpenAI, and local models.
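As an illustration of what "a chat completions API" implies, the sketch below builds a request for an OpenAI-style `/chat/completions` endpoint. The helper `buildChatRequest`, the system prompt, the local URL, the model id, and the key are all placeholders chosen for this example and do not describe how Page-Agent itself talks to a provider.

```javascript
// Assumes an OpenAI-style chat completions endpoint; URL, model, and key are placeholders.
function buildChatRequest({ baseURL, apiKey, model, task, snapshotText }) {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [
          { role: "system", content: "Reply with one JSON action for the page." },
          { role: "user", content: `Task: ${task}\n\nPage elements:\n${snapshotText}` },
        ],
      }),
    },
  };
}

const request = buildChatRequest({
  baseURL: "http://localhost:8000/v1/", // hypothetical local model server
  apiKey: "test-key",
  model: "example-model",
  task: "Fill in the search box",
  snapshotText: "[0] <input>\n[1] <button> Search",
});
// In a browser or Node 18+, this could be sent with: fetch(request.url, request.init)
console.log(request.url); // http://localhost:8000/v1/chat/completions
```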

Q: Can it handle dynamic single-page applications? A: Yes. It captures DOM state after JavaScript rendering and handles React, Vue, and Angular apps.

Q: Is it suitable for testing? A: It can be used for exploratory testing, but dedicated testing tools offer better assertion and reporting capabilities.
