The AI Tech behind Web Agent
by Sheng Yi, 01/29/2024
www.simplegen.ai
In a previous post, we introduced the vision of an autonomous AI agent performing daily tasks on behalf of humans. This article will provide a brief introduction to the tech behind a specific type of AI agent: the web agent.
What is a Web Agent
Since "web agent" is not a commonly accepted term, it's worth explaining upfront. By 'web agent', I'm referring to an AI Agent that can perform browsing actions on the web, such as clicking, scrolling, or inputting text.
List of Representative Papers
Tech Problems to solve
To autonomously browse the web, there are three major technical problems to solve (a sketch of how they fit together follows this list):
- plan the next action given the history of actions and observations
- identify the HTML element to act on (a.k.a. grounding)
- execute the action on the identified HTML element
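The three problems naturally form a loop: observe the page, plan, ground, execute, then observe again. The sketch below shows only this control flow; the planner, grounder, and executor are passed in as callables because their implementations vary widely across systems, and this is not the method of any particular paper.

```python
from typing import Callable, Optional

# Illustrative plan -> ground -> execute loop; a sketch of the control flow,
# not the method of any specific paper.
def run_web_agent(
    task: str,
    observe: Callable[[], dict],                         # screenshot and/or serialized DOM
    plan: Callable[[str, list, dict], Optional[dict]],   # (task, history, observation) -> action, or None when done
    ground: Callable[[dict, dict], str],                 # (action, observation) -> element locator
    execute: Callable[[dict, str], None],                # perform the action on the located element
    max_steps: int = 20,
) -> None:
    history: list = []
    for _ in range(max_steps):
        observation = observe()
        action = plan(task, history, observation)        # problem 1: planning
        if action is None:                               # planner decides the task is complete
            break
        locator = ground(action, observation)            # problem 2: grounding
        execute(action, locator)                         # problem 3: execution
        history.append((observation, action))
```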
Existing solutions
The solution to the last problem depends primarily on first-party browser APIs or third-party UI test automation libraries, and is out of the scope of this article; AI is required to solve the first two problems.
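For context, here is roughly what the execution step looks like with an automation library such as Playwright. This is only a sketch: the URL and selectors below are placeholders of my own, not taken from any cited paper.

```python
from playwright.sync_api import sync_playwright

# Sketch of the execution step with Playwright; URL and selectors are placeholders.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    page.fill("input[name='q']", "noise cancelling headphones")  # input text
    page.click("button[type='submit']")                          # click
    page.mouse.wheel(0, 800)                                      # scroll down

    browser.close()
```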
Large Language Models (LLMs), such as GPT and its various iterations, are increasingly used to handle those first two problems. For example, in [2], a transformer model built on top of T5 is trained with both visual and text tokens to plan the next action given the current context and history.
The performance of GPT-4V on both planning and grounding has been studied in [5].
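As a rough, generic illustration of LLM-based planning (not the specific prompting or training schemes of [2] or [5]), the planner can be a single multimodal call that receives the task, the action history, and the current observation, and asks the model for the next action. The `llm` parameter below stands in for whatever model API is used.

```python
import json
from typing import Callable

# Generic illustration of LLM-based planning; `llm` is any multimodal model
# call mapping (prompt text, screenshot bytes) to a text reply.
def plan_next_action(
    task: str,
    history: list,
    screenshot_png: bytes,
    page_text: str,
    llm: Callable[[str, bytes], str],
) -> dict:
    prompt = (
        "You control a web browser.\n"
        f"Task: {task}\n"
        f"Previous actions: {json.dumps(history)}\n"
        f"Visible page text (truncated): {page_text[:2000]}\n"
        'Reply with JSON, e.g. {"kind": "click", "target_description": "search button", "text": null}.'
    )
    return json.loads(llm(prompt, screenshot_png))
```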
Challenges
As noted in [4][5], the main challenge for existing web agent solutions lies in the low success rate of grounding (problem #2). The gap shown in the image above between SeeAct-Oracle and SeeAct-Choices is mainly due to grounding errors (SeeAct-Oracle assumes perfect grounding and thus serves as an upper bound).
In this article, we won't delve deeply into the various grounding methods. However, we will offer a concise overview to shed light on why mastering these methods presents significant challenges.
- For vision-based grounding methods, annotations are added to the screenshot as registration labels that link image regions to HTML elements. These annotations can block key information on the page, which causes errors.
- For text-based grounding methods, HTML attributes are required as input for inference. However, these attributes are often missing, shared across multiple elements, or changed after inference (for example, when the page reloads or a new element is added to the page). A naive sketch of this attribute matching follows below.
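To make the text-based case concrete, the sketch below (my own simplification, not a method from the cited papers) scores candidate elements by word overlap between the planner's target description and each element's attribute values. It also shows why missing or duplicated attributes break grounding.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    selector: str        # e.g. a CSS selector or XPath
    attributes: dict     # e.g. {"aria-label": "Search", "id": "q", "text": "Search"}

# Naive text-based grounding sketch: pick the candidate whose attribute values
# share the most words with the target description.
def ground_by_attributes(target_description: str, candidates: List[Candidate]) -> Optional[str]:
    wanted = set(target_description.lower().split())
    best_selector, best_score = None, 0
    for c in candidates:
        attr_words = set(" ".join(str(v) for v in c.attributes.values()).lower().split())
        score = len(wanted & attr_words)
        if score > best_score:
            best_selector, best_score = c.selector, score
    # When attributes are missing or shared by several elements, scores tie or
    # stay at zero, so grounding fails or picks the wrong element: exactly the
    # failure mode described above.
    return best_selector
```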