Scripts · May 31, 2026 · 1 min read

Page-Agent — In-Page GUI Agent for Natural Language Browser Control

A JavaScript library from Alibaba that lets AI agents control web interfaces with natural language commands, running directly in the browser page context.

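Agent ready

Agents can install this directly

This asset is installable. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Page-Agent
Direct install command:
npx -y tokrepo@latest install e3a97b73-5ca7-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.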

Introduction

Page-Agent is an open-source JavaScript library from Alibaba that lets AI agents interact with web page UIs through natural language instructions. Unlike browser automation tools that drive the browser from outside via protocols such as the Chrome DevTools Protocol (CDP), Page-Agent runs inside the page context itself, which gives it direct access to the DOM, the event system, and application state.

What Page-Agent Does

  • Translates natural language commands into precise DOM interactions
  • Operates inside the browser page for direct access to elements and state
  • Handles complex multi-step UI workflows like form filling and navigation
  • Grounds each action in the page layout and element semantics captured from the DOM
  • Works with any web application without requiring custom selectors or scripts

Architecture Overview

Page-Agent injects a lightweight runtime into the target page. The runtime captures a structured snapshot of the DOM, including element positions, text content, and interactive affordances, and sends it to an LLM, which plans the next steps as a sequence of DOM operations. The runtime then executes those operations, waits for page transitions, and captures the updated state so that multi-step workflows can continue from what the page actually shows.
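The snapshot, plan, execute cycle described above can be sketched as a small loop. This is an illustration only, not Page-Agent's actual API: the names `FakePage`, `takeSnapshot`, `runAgent`, and `scriptedPlanner` are hypothetical, the page is a plain object rather than a live DOM, and a scripted function stands in for the LLM so the example runs without a browser or network.

```typescript
// Hypothetical types for illustration; none of these names come from Page-Agent.
interface ElementInfo {
  id: number;           // index assigned when the snapshot is taken
  tag: string;
  text: string;
  interactive: boolean; // whether an agent could act on this element
}

type Action =
  | { kind: "click"; target: number }
  | { kind: "type"; target: number; value: string }
  | { kind: "done" };

// A plain object standing in for the live page.
interface FakePage {
  elements: ElementInfo[];
  values: Map<number, string>; // text typed into each element
  clicked: number[];           // ids of clicked elements, in order
}

// Observe: keep only the elements a planner could act on.
function takeSnapshot(page: FakePage): ElementInfo[] {
  return page.elements.filter((el) => el.interactive);
}

// Act: apply one planned action to the page.
function execute(page: FakePage, action: Action): void {
  if (action.kind === "type") page.values.set(action.target, action.value);
  if (action.kind === "click") page.clicked.push(action.target);
}

type Planner = (goal: string, snapshot: ElementInfo[], history: Action[]) => Action;

// Loop: snapshot, ask the planner, execute, repeat until "done" or the step cap.
function runAgent(goal: string, page: FakePage, plan: Planner, maxSteps = 10): Action[] {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = plan(goal, takeSnapshot(page), history);
    if (action.kind === "done") break;
    execute(page, action);
    history.push(action);
  }
  return history;
}

// Scripted stand-in for the LLM: type into the input, click the button, stop.
const scriptedPlanner: Planner = (_goal, snapshot, history) => {
  const input = snapshot.find((el) => el.tag === "input");
  const button = snapshot.find((el) => el.tag === "button");
  if (history.length === 0 && input) return { kind: "type", target: input.id, value: "page agent" };
  if (history.length === 1 && button) return { kind: "click", target: button.id };
  return { kind: "done" };
};

const demoPage: FakePage = {
  elements: [
    { id: 0, tag: "h1", text: "Search", interactive: false },
    { id: 1, tag: "input", text: "", interactive: true },
    { id: 2, tag: "button", text: "Go", interactive: true },
  ],
  values: new Map(),
  clicked: [],
};

const demoHistory = runAgent("search for page agent", demoPage, scriptedPlanner);
```

Taking a fresh snapshot on every iteration is the point of the design: each plan is made against the state the page is in after the previous action, not against a stale view.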

Self-Hosting & Configuration

  • Install via npm and bundle with your browser extension or automation script
  • Configure the LLM provider and model via initialization options
  • Set action timeout and retry policies for unreliable network conditions
  • Customize element selection strategies for specific application patterns
  • Enable debug mode for step-by-step action logging and screenshots
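The configuration points listed above might be modeled as an options object with defaults, validation, and a retry schedule. Every name here (`AgentOptions`, `actionTimeoutMs`, `maxRetries`, `retryBaseDelayMs`, the `"example-model"` placeholder) is an assumption made for illustration; the real option names should be taken from the library's own documentation.

```typescript
// Hypothetical option shape; not taken from Page-Agent's documentation.
interface AgentOptions {
  provider: string;         // which LLM provider to call
  model: string;            // model identifier for that provider
  actionTimeoutMs: number;  // how long a single action may take
  maxRetries: number;       // retries after a failed action
  retryBaseDelayMs: number; // first backoff delay
  debug: boolean;           // step-by-step logging
}

const DEFAULT_OPTIONS: AgentOptions = {
  provider: "openai",
  model: "example-model",
  actionTimeoutMs: 10_000,
  maxRetries: 3,
  retryBaseDelayMs: 500,
  debug: false,
};

// Merge user overrides onto the defaults and reject nonsensical values early.
function resolveOptions(overrides: Partial<AgentOptions>): AgentOptions {
  const merged: AgentOptions = { ...DEFAULT_OPTIONS, ...overrides };
  if (merged.actionTimeoutMs <= 0) throw new RangeError("actionTimeoutMs must be positive");
  if (merged.maxRetries < 0) throw new RangeError("maxRetries must not be negative");
  return merged;
}

// Exponential backoff: retry n (0-based) waits base * 2^n, capped at the action timeout.
function retryDelays(options: AgentOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < options.maxRetries; attempt++) {
    delays.push(Math.min(options.retryBaseDelayMs * 2 ** attempt, options.actionTimeoutMs));
  }
  return delays;
}

const demoOptions = resolveOptions({ model: "local-model", maxRetries: 4, debug: true });
const demoDelays = retryDelays(demoOptions); // [500, 1000, 2000, 4000]
```

Capping each delay at the action timeout keeps a long retry chain from waiting longer between attempts than a single attempt is allowed to run.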

Key Features

  • In-page execution gives access to JavaScript state and shadow DOM
  • Vision-free approach using structured DOM snapshots instead of screenshots
  • Supports complex workflows spanning multiple page navigations
  • MCP server integration for use with AI coding agents
  • Lightweight runtime with no heavy browser dependencies
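To make the vision-free idea concrete, the sketch below flattens an element tree into indexed text lines that a language model could read in place of a screenshot. The `SnapshotNode` shape, the `shadow` field (standing in for the children of an open shadow root), and the output format are all assumptions; Page-Agent's real snapshot format may differ.

```typescript
// Hypothetical tree shape; a real implementation would walk live DOM nodes.
interface SnapshotNode {
  tag: string;
  text?: string;
  interactive?: boolean;
  children?: SnapshotNode[];
  shadow?: SnapshotNode[]; // stands in for the children of an open shadow root
}

// Depth-first walk that numbers interactive elements so a planner can refer to them.
function serializeSnapshot(root: SnapshotNode): string[] {
  const lines: string[] = [];
  let nextIndex = 0;
  const visit = (node: SnapshotNode, depth: number): void => {
    const indent = "  ".repeat(depth);
    const label = node.interactive ? `[${nextIndex++}] ` : "";
    const text = node.text ? ` "${node.text}"` : "";
    lines.push(`${indent}${label}<${node.tag}>${text}`);
    for (const child of node.children ?? []) visit(child, depth + 1);
    // Shadow content is visited like ordinary children, so it is not hidden from the planner.
    for (const child of node.shadow ?? []) visit(child, depth + 1);
  };
  visit(root, 0);
  return lines;
}

const demoTree: SnapshotNode = {
  tag: "form",
  children: [
    { tag: "label", text: "Email" },
    { tag: "input", interactive: true },
    { tag: "custom-button", shadow: [{ tag: "button", text: "Submit", interactive: true }] },
  ],
};

const demoLines = serializeSnapshot(demoTree);
```

A text snapshot like this is far smaller than an image, and the bracketed indexes give the model a stable way to name its target without writing selectors.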

Comparison with Similar Tools

  • Playwright/Puppeteer: control the browser from an external process; Page-Agent runs in-page
  • Browser Use: Python-based with screenshot vision; Page-Agent uses DOM snapshots
  • Stagehand: also driven by natural language, but it controls the browser from outside the page; Page-Agent analyzes the DOM from within it
  • Selenium: heavyweight framework; Page-Agent is a lightweight embeddable library

FAQ

Q: Does it require a headless browser? A: No. Page-Agent runs inside any browser context, including headed browsers and extensions.

Q: Which LLM providers are supported? A: Any provider with a chat completions API, including Anthropic, OpenAI, and local models.
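As a rough illustration of what a chat completions style request could look like for this use case, the sketch below builds a request body in the widely used `model` plus `messages` shape. The prompt wording, the `buildChatRequest` helper, and the `"example-model"` name are hypothetical; how Page-Agent actually phrases its prompts is not stated here.

```typescript
// Request shape follows the common chat completions convention (model + messages).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// Hypothetical helper: pair the user's goal with a text snapshot of the page.
function buildChatRequest(model: string, goal: string, snapshotLines: string[]): ChatRequest {
  return {
    model,
    temperature: 0, // favor repeatable plans over creative ones
    messages: [
      { role: "system", content: "You control a web page. Reply with one JSON action." },
      { role: "user", content: `Goal: ${goal}\nPage:\n${snapshotLines.join("\n")}` },
    ],
  };
}

const demoRequest = buildChatRequest("example-model", "submit the form", [
  "[0] <input>",
  '[1] <button> "Submit"',
]);

// The serialized body is what would be sent in a POST to the provider's endpoint.
const demoBody = JSON.stringify(demoRequest);
```

Because the body is plain JSON, switching providers or pointing at a local model mostly comes down to changing the endpoint URL, credentials, and model name.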

Q: Can it handle dynamic single-page applications? A: Yes. It captures DOM state after JavaScript rendering and handles React, Vue, and Angular apps.

Q: Is it suitable for testing? A: It can be used for exploratory testing, but dedicated testing tools offer better assertion and reporting capabilities.
