Introduction
"One script tag. Natural language controls your web app. No screenshots, no browser extensions, no backend, no Python."
This is article #107 in the "One Open Source Project a Day" series. Today's project is page-agent — Alibaba's open-source client-side GUI Agent library.
The dominant approach to making AI operate a browser follows one path: screenshot → multimodal LLM recognizes elements → execute action. That path has two costs: multimodal models (expensive) and a server or headless browser environment (complex infrastructure).
page-agent's answer: serializing the DOM to text is enough. Assign numeric indices to interactive elements, send the text to any LLM, get back "click element 3" as a tool call, execute it directly in the browser. The entire loop stays in-page. No screenshots. No multimodal capability needed.
What You'll Learn
- Text-based DOM: how to turn a page into an LLM-readable indexed structure
- ReAct loop architecture: Observe → Think (reflection + action) → Act, fully implemented in TypeScript
- Reflection-Before-Action model: evaluating the previous step before planning the next
- Built-in tool system: how click, input, scroll, and JS execution tools are defined
- Single package vs. core package: page-agent (with UI) and @page-agent/core (pure logic)
- Chrome extension + MCP Server for cross-page capability
Prerequisites
- Understanding of LLM function calling (tool use) mechanics
- Basic knowledge of DOM and browser events
- Experience with OpenAI SDK or similar APIs
Project Background
Overview
page-agent is a pure client-side GUI Agent library that embeds LLM reasoning directly into web pages. It understands page structure through serialized DOM text and executes actions without leaving the browser.
The core architectural decision: DOM text instead of screenshots. The DOM already fully describes what elements exist on the page, what type they are, and whether they're currently interactive. Serializing this to text is more precise than a screenshot (button labels are never blurry), cheaper (no multimodal models), and faster (DOM reads are synchronous).
The project acknowledges browser-use (server-side Python browser automation) as its inspiration. page-agent's positioning is the client-side counterpart: runs inside the page, not controlling a headless browser from a server.
Author / Team
- Organization: Alibaba
- Primary Language: TypeScript
- License: MIT
- npm packages: page-agent (with UI Panel), @page-agent/core (pure agent logic)
- Latest version: v1.10.0
Project Stats
- 📄 License: MIT
- 📦 npm: page-agent
- 💻 Stack: TypeScript + npm workspaces + Vite
Features
What It Does
Traditional browser GUI Agent (screenshot path):
Page → Screenshot → Multimodal LLM visual understanding → Return coordinates/elements → Execute
Cost: multimodal API fees + server/headless browser infrastructure
page-agent (text DOM path):
Page → Serialize DOM with indexed interactive elements → Pure text LLM reasoning
→ Return tool call: { click_element_by_index: { index: 2 } }
→ Execute directly in the page
Cost: text LLM (cheaper) + zero backend (pure frontend JS)
Use Cases
- SaaS AI Copilot: Embed an AI assistant directly in your product — user says "create a new project named X" and it happens. No backend rewrite needed.
- Smart form filling: Compress 20-click workflows in ERP/CRM/admin systems into a single sentence
- Accessibility: Add natural language control to any web application — voice commands, screen reader enhancement
- Cross-tab Agent: With the Chrome extension, agents can work across tabs (e.g., read data from a spreadsheet tab, fill it into a form in another tab)
- MCP control: Through the MCP Server (Beta), external agent clients can control the browser
Quick Start
One-line integration (free Demo LLM, technical evaluation only):
<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>
After loading, a floating Agent panel appears in the bottom-right corner. Type natural language commands directly.
NPM installation (production):
npm install page-agent
import { PageAgent } from 'page-agent'
const agent = new PageAgent({
  model: 'gpt-4o',
  baseURL: 'https://api.openai.com/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})
const result = await agent.execute('Click the login button, then fill in the username as test@example.com')
console.log(result.success, result.message)
Using @page-agent/core (no UI panel, embedded scenarios):
import { PageAgentCore } from '@page-agent/core'
import { PageController } from '@page-agent/page-controller'
const pageController = new PageController({ enableMask: true })
const agent = new PageAgentCore({
  pageController,
  model: 'gpt-4o',
  apiKey: 'sk-...',
  maxSteps: 20,
  onAfterStep: async (agent, history) => {
    console.log('step completed', history.at(-1))
  }
})
const result = await agent.execute('Find the lowest-priced item in the product list and add it to the cart')
Supported models (any OpenAI-compatible endpoint works):
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4-turbo, gpt-5.2, gpt-5.4 |
| Anthropic | claude-opus-4.8, claude-sonnet-4, claude-haiku-3.5 |
| Alibaba | qwen3.5-plus, qwen3.6-max, qwen3.6-flash |
| DeepSeek | deepseek-chat, deepseek-reasoner |
| Google | gemini-2.0-flash (via OpenAI-compatible endpoint) |
| Local | Ollama (qwen3:14b, tested on RTX 3090 24GB) |
Deep Dive
Text-Based DOM: The Core Technique
page-agent's PageController converts the DOM into an indexed simplified text structure:
Raw DOM (simplified):
<div class="form-container">
<input type="text" placeholder="Username" />
<input type="password" placeholder="Password" />
<button class="btn-primary">Sign In</button>
<a href="/register">Create account</a>
</div>
Serialized for LLM (interactive elements with indices):
URL: https://example.com/login
Title: Login - My App
[0]<input placeholder="Username" />
[1]<input type="password" placeholder="Password" />
[2]<button>Sign In</button>
[3]<a>Create account</a>
Each interactive element gets a numeric index [N]. The LLM only needs to return "click [2]" or "input 'admin@example.com' into [0]."
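To make the indexed format concrete, here is a minimal sketch that reproduces the serialized output above from plain data. The InteractiveNode shape and the renderNode and serializePage helpers are hypothetical names chosen for illustration, not page-agent's actual API; the real PageController walks the live DOM instead of a hand-built array.

```typescript
// Hypothetical, simplified node shape; page-agent's real node type is richer.
interface InteractiveNode {
  tag: string
  attrs: Record<string, string>
  text?: string
}

// Render one node as `[index]<tag attr="value">text</tag>`,
// or as a self-closing tag when the node has no text.
function renderNode(node: InteractiveNode, index: number): string {
  const attrs = Object.keys(node.attrs)
    .map((name) => ` ${name}="${node.attrs[name]}"`)
    .join('')
  return node.text
    ? `[${index}]<${node.tag}${attrs}>${node.text}</${node.tag}>`
    : `[${index}]<${node.tag}${attrs} />`
}

// Prefix the indexed elements with the page URL and title.
function serializePage(url: string, title: string, nodes: InteractiveNode[]): string {
  const header = [`URL: ${url}`, `Title: ${title}`]
  return header.concat(nodes.map(renderNode)).join('\n')
}

const loginPage = serializePage('https://example.com/login', 'Login - My App', [
  { tag: 'input', attrs: { placeholder: 'Username' } },
  { tag: 'input', attrs: { type: 'password', placeholder: 'Password' } },
  { tag: 'button', attrs: {}, text: 'Sign In' },
  { tag: 'a', attrs: {}, text: 'Create account' },
])

console.log(loginPage)
```

The array position doubles as the index, so the same array can later resolve "click [2]" back to a concrete element.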
DOM processing pipeline:
Live DOM
↓ dom_tree/ module
FlatDomTree (flattened tree with DomNode map)
↓ Dehydration (simplification)
Indexed text representation
↓
LLM context input
↓
LLM returns tool call: { click_element_by_index: { index: 2 } }
↓
PageController.clickElement(2) → locate HTMLElement by index → fire click
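The last hop of that pipeline, going from a tool call back to an element, can be sketched as follows. This is an illustration under assumptions: Clickable stands in for HTMLElement so the code runs outside a browser, and dispatch and clickElement are hypothetical helpers rather than the library's real methods.

```typescript
// Stand-in for HTMLElement so the sketch runs outside a browser.
interface Clickable {
  click(): void
}

// Shape of the tool call in the pipeline: { click_element_by_index: { index: 2 } }
type ToolCall = Record<string, { index: number }>

// Resolve an index against the map built during serialization, then click.
function clickElement(elements: Map<number, Clickable>, index: number): string {
  const el = elements.get(index)
  if (el === undefined) return `No element at index ${index}`
  el.click()
  return `Clicked element ${index}`
}

// Route a single-tool call to its handler; the returned string is what
// would be recorded as the action result.
function dispatch(elements: Map<number, Clickable>, call: ToolCall): string {
  const names = Object.keys(call)
  const name = names[0]
  const args = name === undefined ? undefined : call[name]
  if (names.length !== 1 || name === undefined || args === undefined) {
    return 'Expected exactly one tool call'
  }
  if (name === 'click_element_by_index') return clickElement(elements, args.index)
  return `Unknown tool: ${name}`
}

let clicks = 0
const elements = new Map<number, Clickable>([[2, { click: () => { clicks += 1 } }]])
const outcome = dispatch(elements, { click_element_by_index: { index: 2 } })
console.log(outcome, clicks)
```

Returning a message instead of throwing on a stale index is a design choice of this sketch: it lets the model see the failure in history and pick a different action.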
Only elements meeting all three criteria are indexed (to reduce noise):
- isVisible: true (the element is in the viewport or can be scrolled into it)
- isInteractive: true (clickable, inputtable, or selectable elements)
- isTopElement: true (not obscured by overlapping elements)
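The three-flag filter is simple enough to sketch directly. The flag names come from the list above; the CandidateNode shape and the selectIndexable helper are assumptions made for this example, and the flags are supplied by hand here rather than computed from a real layout.

```typescript
// Hypothetical node record carrying the three flags described above.
interface CandidateNode {
  label: string
  isVisible: boolean
  isInteractive: boolean
  isTopElement: boolean
}

// Keep only nodes that pass all three checks and number them in document order.
function selectIndexable(nodes: CandidateNode[]): Map<number, CandidateNode> {
  const indexed = new Map<number, CandidateNode>()
  for (const node of nodes) {
    if (node.isVisible && node.isInteractive && node.isTopElement) {
      indexed.set(indexed.size, node)
    }
  }
  return indexed
}

const indexable = selectIndexable([
  { label: 'Username input', isVisible: true, isInteractive: true, isTopElement: true },
  { label: 'Static heading', isVisible: true, isInteractive: false, isTopElement: true },
  { label: 'Button under a modal', isVisible: true, isInteractive: true, isTopElement: false },
  { label: 'Sign In button', isVisible: true, isInteractive: true, isTopElement: true },
])

console.log(Array.from(indexable.keys()))
```

Because indices are assigned after filtering, they stay dense: the model never sees a gap where a hidden or covered element was skipped.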
ReAct Loop Architecture
agent.execute("complete some task")
↓
┌─────────────────────────────────────────────┐
│ Main loop (up to maxSteps) │
│ │
│ ① Observe │
│ pageController.updateTree() │
│ → refresh DOM, get current page text │
│ │
│ ② Think (LLM call) │
│ Input: system prompt + history + DOM │
│ LLM output (Reflection model): │
│ { │
│ evaluation_previous_goal: "...", ←── how did the last step go?
│ memory: "...", ←── what to remember?
│ next_goal: "...", ←── what to do next?
│ action: { click_element_by_index: {index: 2} }
│ } │
│ │
│ ③ Act │
│ Execute the tool call returned by LLM │
│ Record to history (persistent, visible │
│ to LLM in next steps) │
│ │
│ ④ Check termination │
│ LLM calls done tool → return result │
│ OR maxSteps reached │
└─────────────────────────────────────────────┘
↓
ExecutionResult { success, message, history }
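The diagram maps onto a fairly small loop. The sketch below is not the library's source: LoopDeps, runLoop, and the observe/think/act callbacks are hypothetical names, injected so the loop can run without a browser or a real LLM, and the argument shape of the done tool (success, text) is an assumption.

```typescript
interface StepOutput {
  evaluation_previous_goal: string
  memory: string
  next_goal: string
  action: Record<string, Record<string, unknown>>
}

interface HistoryEntry {
  step: number
  output: StepOutput
  result: string
}

interface ExecutionResult {
  success: boolean
  message: string
  history: HistoryEntry[]
}

// Hypothetical dependencies: page observation, the LLM call, and tool execution.
interface LoopDeps {
  observe(): Promise<string>
  think(task: string, history: HistoryEntry[], pageText: string): Promise<StepOutput>
  act(name: string, args: Record<string, unknown>): Promise<string>
}

async function runLoop(task: string, deps: LoopDeps, maxSteps: number): Promise<ExecutionResult> {
  const history: HistoryEntry[] = []
  for (let step = 1; step <= maxSteps; step++) {
    // 1. Observe: refresh the text view of the page.
    const pageText = await deps.observe()
    // 2. Think: the model sees the task, the full history, and the current page.
    const output = await deps.think(task, history, pageText)
    const name = Object.keys(output.action)[0]
    if (name === undefined) {
      return { success: false, message: 'Model returned no action', history }
    }
    const args = output.action[name] ?? {}
    // 4. Check termination: `done` ends the run (argument names are assumed).
    if (name === 'done') {
      history.push({ step, output, result: 'done' })
      return { success: args['success'] !== false, message: String(args['text'] ?? ''), history }
    }
    // 3. Act: execute the tool and record the result for later steps.
    const result = await deps.act(name, args)
    history.push({ step, output, result })
  }
  return { success: false, message: `Stopped after ${maxSteps} steps`, history }
}
```

Because history is passed back into think on every iteration, each step's reflection and result remain visible to the model for the rest of the run.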
Reflection-Before-Action Model
Before each LLM call, the previous step's result and full history are passed to the model, which is required to reflect before acting:
{
  "evaluation_previous_goal": "Successfully clicked the login button, page navigated to the dashboard",
  "memory": "Login complete, currently on the dashboard. Still need to find the settings page.",
  "next_goal": "Locate and click the Settings or Account option in the navigation bar",
  "action": { "click_element_by_index": { "index": 5 } }
}
evaluation_previous_goal forces the model to evaluate the previous step's outcome before proceeding — preventing blind continuation (e.g., if a click triggered nothing, the model should pivot rather than click again).
memory is the short-term memory mechanism: compressing key progress into 1-3 sentences that persist in history, so the LLM doesn't "forget" what it has already accomplished in a long multi-step task.
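One way to picture how these fields persist is to render past steps into the text that goes into the next prompt. The sketch below assumes a hypothetical step-tagged layout; the article does not specify the exact format page-agent uses, so treat renderHistory and PastStep as illustrative only.

```typescript
interface Reflection {
  evaluation_previous_goal: string
  memory: string
  next_goal: string
}

// Hypothetical record of one completed step.
interface PastStep {
  reflection: Reflection
  actionName: string
  actionResult: string
}

// Render every past step as a small tagged block the model can re-read.
function renderHistory(steps: PastStep[]): string {
  return steps
    .map((step, i) =>
      [
        `<step_${i + 1}>`,
        `Evaluation of previous goal: ${step.reflection.evaluation_previous_goal}`,
        `Memory: ${step.reflection.memory}`,
        `Next goal: ${step.reflection.next_goal}`,
        `Action result: ${step.actionName} -> ${step.actionResult}`,
        `</step_${i + 1}>`,
      ].join('\n'),
    )
    .join('\n')
}

const historyText = renderHistory([
  {
    reflection: {
      evaluation_previous_goal: 'Successfully clicked the login button',
      memory: 'Login complete, currently on the dashboard.',
      next_goal: 'Locate and click the Settings option',
    },
    actionName: 'click_element_by_index',
    actionResult: 'Clicked element 5',
  },
])

console.log(historyText)
```

Keeping memory short matters here: every past step is replayed on every call, so a verbose memory field is paid for again on each subsequent step.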
Built-in Tool System
Tools the LLM can call via function calling:
| Tool | Purpose |
|---|---|
| click_element_by_index | Click element at specified index |
| input_text | Type text into input field at index |
| select_dropdown_option | Select dropdown option by text content |
| scroll | Vertical scroll (by pages or pixels) |
| scroll_horizontally | Horizontal scroll |
| execute_javascript | Execute arbitrary JS (AbortSignal supported) |
| wait | Wait 1-10 seconds (for page loads or animations) |
| ask_user | Ask the user a question (human-in-the-loop node) |
| done | Task complete, return result |
Custom tools (extend or override defaults):
import { PageAgent } from 'page-agent'
import { z } from 'zod' // assumed: the z.object() call below looks like zod
const agent = new PageAgent({
  model: 'gpt-4o',
  apiKey: '...',
  customTools: {
    // Add a custom tool
    get_current_user: {
      description: 'Get the currently logged-in user information',
      inputSchema: z.object({}),
      execute: async function() {
        // fetchCurrentUser is a placeholder for your app's own user lookup
        const user = await fetchCurrentUser()
        return JSON.stringify(user)
      }
    },
    // Set to null to disable a built-in tool
    execute_javascript: null
  }
})
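The extend, override, or disable semantics of customTools can be sketched as a plain merge. This is a guess at the behavior implied by the example above, not the library's implementation; ToolDef and mergeTools are hypothetical names.

```typescript
// Hypothetical minimal tool shape for this sketch.
interface ToolDef {
  description: string
  execute(args: Record<string, unknown>): Promise<string>
}

// null means "disable the built-in tool with this name".
type ToolOverrides = Record<string, ToolDef | null>

function mergeTools(
  builtIn: Record<string, ToolDef>,
  custom: ToolOverrides,
): Record<string, ToolDef> {
  const merged: Record<string, ToolDef> = { ...builtIn }
  for (const name of Object.keys(custom)) {
    const tool = custom[name]
    if (tool === null) {
      delete merged[name] // disabled by the caller
    } else if (tool !== undefined) {
      merged[name] = tool // added, or replaces a built-in with the same name
    }
  }
  return merged
}

const builtInTools: Record<string, ToolDef> = {
  click_element_by_index: { description: 'Click element', execute: async () => 'clicked' },
  execute_javascript: { description: 'Run JS', execute: async () => 'ran' },
}

const activeTools = mergeTools(builtInTools, {
  get_current_user: { description: 'Get current user', execute: async () => '{"name":"demo"}' },
  execute_javascript: null,
})

console.log(Object.keys(activeTools).sort())
```

Disabling execute_javascript this way is a reasonable hardening step when the agent runs on pages where arbitrary script execution is not acceptable.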
Monorepo Architecture
packages/
├── page-agent/ → Main package (npm: page-agent), includes UI panel
├── core/ → Pure agent logic (npm: @page-agent/core), no UI
├── llms/ → LLM client, multi-provider support
├── page-controller/ → DOM operations and visual feedback
├── ui/ → Control panel + i18n, decoupled from core
├── extension/ → Chrome extension (WXT + React)
└── website/ → Documentation site (React)
Key module boundary design:
- page-controller has no knowledge of LLMs: it only handles DOM operations
- llms has no knowledge of page structure: it only handles LLM communication
- core combines them into the ReAct loop
- page-agent (main package) adds a UI panel on top of core, and both are independently usable
How the LLM Client Works
The llms package uses a MacroTool pattern that wraps the reflection model:
// Each LLM call expects this structure back
export interface MacroToolInput extends Partial<AgentReflection> {
  action: Record<string, any>
}
export interface AgentReflection {
  evaluation_previous_goal: string // "Previous action succeeded/failed because..."
  memory: string // "Key facts to remember: ..."
  next_goal: string // "Next I will..."
}
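Given that structure, the receiving side has to split one macro-tool payload into its reflection fields and a single action. The sketch below shows one way to do that; parseMacroToolInput is a hypothetical helper, and the rule that action must contain exactly one tool name is an assumption inferred from the "one action per step" design.

```typescript
interface AgentReflection {
  evaluation_previous_goal: string
  memory: string
  next_goal: string
}

interface MacroToolInput extends Partial<AgentReflection> {
  action: Record<string, unknown>
}

interface ParsedStep {
  reflection: Partial<AgentReflection>
  toolName: string
  toolArgs: unknown
}

// Parse the JSON arguments string of a tool call into reflection + one action.
function parseMacroToolInput(rawArguments: string): ParsedStep {
  const parsed = JSON.parse(rawArguments) as MacroToolInput
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Macro tool input must be a JSON object')
  }
  if (typeof parsed.action !== 'object' || parsed.action === null) {
    throw new Error('Macro tool input is missing an action object')
  }
  const names = Object.keys(parsed.action)
  const toolName = names[0]
  if (names.length !== 1 || toolName === undefined) {
    throw new Error(`Expected exactly one action, got ${names.length}`)
  }
  // Everything except `action` is treated as the reflection part.
  const { action, ...reflection } = parsed
  return { reflection, toolName, toolArgs: action[toolName] }
}

const parsedStep = parseMacroToolInput(
  '{"evaluation_previous_goal":"Clicked login","memory":"On dashboard","next_goal":"Open settings","action":{"click_element_by_index":{"index":5}}}',
)

console.log(parsedStep.toolName, parsedStep.reflection.memory)
```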
The OpenAI-compatible client handles:
- Converting tools to OpenAI function-calling format
- Building requests with parallel_tool_calls: false (one action per step)
- Provider-specific patches (DeepSeek: disable explicit tool_choice; MiniMax: temperature/tool-call compatibility)
- Automatic retry with exponential backoff for 429/500 errors
- Token usage tracking including prompt cache hits and reasoning tokens
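The first three items can be illustrated with a request builder for an OpenAI-compatible Chat Completions endpoint. The tools, tool_choice, and parallel_tool_calls fields are part of that API; the AgentTool shape, the explicitToolChoice flag, and the choice of 'required' as the tool_choice value are assumptions made for this sketch, not page-agent's actual code.

```typescript
// Hypothetical internal tool description; `parameters` is a JSON Schema object.
interface AgentTool {
  name: string
  description: string
  parameters: Record<string, unknown>
}

interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

interface RequestOptions {
  // Set to false for providers that reject an explicit tool_choice.
  explicitToolChoice: boolean
}

// Convert one tool to the OpenAI function-calling format.
function toOpenAITool(tool: AgentTool) {
  return {
    type: 'function' as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  }
}

function buildChatRequest(
  model: string,
  messages: ChatMessage[],
  tools: AgentTool[],
  options: RequestOptions,
): Record<string, unknown> {
  const body: Record<string, unknown> = {
    model,
    messages,
    tools: tools.map(toOpenAITool),
    parallel_tool_calls: false, // one action per step
  }
  if (options.explicitToolChoice) {
    body['tool_choice'] = 'required' // force the model to call a tool
  }
  return body
}

const clickTool: AgentTool = {
  name: 'click_element_by_index',
  description: 'Click element at specified index',
  parameters: {
    type: 'object',
    properties: { index: { type: 'number' } },
    required: ['index'],
  },
}

const openaiBody = buildChatRequest(
  'gpt-4o',
  [{ role: 'user', content: 'Click the login button' }],
  [clickTool],
  { explicitToolChoice: true },
)

console.log(JSON.stringify(openaiBody, null, 2))
```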
Cross-Page Capability: Chrome Extension + MCP
Chrome Extension (for cross-tab workflows):
- Agent runs in an extension-controlled context with access to switch tabs and read different pages' DOM
- Use case: "Read data from a spreadsheet tab, paste it into a form in another tab"
MCP Server (Beta):
- Exposes page-agent as MCP tools, letting external agent clients (Claude Desktop, Claude Code) control the browser remotely
- Use case: connecting browser control capability into a larger agent workflow
Reliability Design
- Step delay: 400ms between steps — breathing room for page renders and network requests
- Click wait: 200ms after click operations, giving DOM updates time to complete
- Concurrency guard: Prevents concurrent execute() calls to avoid race conditions
- AbortSignal support: All tool execution honors cancellation signals; execute_javascript is interruptible
- Automatic retry: LLM failures (429 rate limits, 500 server errors) auto-retry with exponential backoff (100ms base)
- Token tracking: Every step records promptTokens, completionTokens, prompt cache hits, and reasoning tokens
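Two of these mechanisms, retry with exponential backoff and the concurrency guard, are easy to sketch in isolation. The 100ms base and the 429/500 status codes come from the list above; the retry limit of 3, the helper names, and the choice to reject (rather than queue) a concurrent call are assumptions of this example.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Read a numeric `status` from an unknown thrown value, if it has one.
function statusOf(err: unknown): number | undefined {
  if (typeof err === 'object' && err !== null && 'status' in err) {
    const status = (err as { status: unknown }).status
    return typeof status === 'number' ? status : undefined
  }
  return undefined
}

// Retry on 429/500 with delays of base, 2x base, 4x base, and so on.
async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (err) {
      const status = statusOf(err)
      const retryable = status === 429 || status === 500
      if (!retryable || attempt >= maxRetries) throw err
      await sleep(baseDelayMs * 2 ** attempt)
    }
  }
}

// Allow only one task at a time; a second concurrent call is rejected.
function createGuard() {
  let running = false
  return async function guarded<T>(task: () => Promise<T>): Promise<T> {
    if (running) throw new Error('An execution is already in progress')
    running = true
    try {
      return await task()
    } finally {
      running = false
    }
  }
}
```

With the defaults, a persistently rate-limited call waits 100ms, 200ms, and 400ms before giving up on the fourth attempt.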
Resources
Official Links
- 🌟 GitHub: alibaba/page-agent
- 🚀 Live Demo: alibaba.github.io/page-agent
- 📖 Documentation: alibaba.github.io/page-agent/docs
- 📦 npm: page-agent
Summary
page-agent's core insight is that browser automation doesn't need to "look at" the page. The DOM is already a structured description of what exists on the page — serializing it to indexed text lets any text-only LLM understand and operate it directly. No multimodal model, no visual understanding, no coordinate targeting.
This insight dramatically lowers the barrier for GUI agents: no server required, no headless browser, no screenshots, no extra infrastructure — a single script tag or npm package running in the user's browser.
The Reflection-Before-Action model (evaluate last step → plan next step) combined with persistent memory in history is what lets multi-step tasks stay on track. The agent doesn't just chain commands blindly; it continuously re-evaluates whether its actions are having the intended effect.
For developers who want to add AI Copilot capability to a web product, or automate internal tool workflows without touching backend infrastructure, page-agent is one of the lowest-friction options currently available.