Automating Browser-Based Workflows with LLMs and Computer Vision

Why do roboticists value humanoid robots? Because most of society's infrastructure and tools are built for the human body.

We believe that vision-based LLMs are to workflow automation what humanoid robots are to robotics: most software and tools are designed for and optimized around human, vision-based interaction. Vision-based LLMs can interpret and operate these systems in ways that mimic human behavior, which makes them highly adaptable and effective at automating tasks across a wide range of applications.

Automated Web Interaction Demo
Demonstration of automated interactions: "Could you find me some can food for cats?"

Automation of browser-based workflows has traditionally involved writing custom scripts for specific websites, often relying on DOM parsing and XPath-based interactions. These approaches can be fragile and susceptible to breaking when website layouts change. Recent advancements in large language models (LLMs) and computer vision offer a robust alternative by enabling real-time interaction with web elements based on visual and textual inputs.
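To make the fragility concrete, here is a minimal sketch using only Python's standard library. The two page snippets, the path, and both helper functions are hypothetical and invented for illustration; the text-based lookup is only a crude stand-in for what a vision model would perceive, not the system's actual method.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page: the button is unchanged for a
# human viewer, but the surrounding markup was restructured in a redesign.
PAGE_V1 = (
    "<html><body><div><form>"
    "<button id='buy'>Add to cart</button>"
    "</form></div></body></html>"
)
PAGE_V2 = (
    "<html><body><main><section><form>"
    "<button class='cta'>Add to cart</button>"
    "</form></section></main></body></html>"
)

# A structural path of the kind a site-specific script might hard-code.
BRITTLE_PATH = "./body/div/form/button"


def find_by_path(markup, path):
    """Locate an element by its position in the document tree."""
    return ET.fromstring(markup).find(path)


def find_by_visible_text(markup, label):
    """Locate an element by the text a user would see on screen."""
    for element in ET.fromstring(markup).iter():
        if (element.text or "").strip().lower() == label.lower():
            return element
    return None
```

In this sketch the structural path matches only the first layout, while the lookup keyed on visible text finds the button in both, which is the property the vision-based approach aims for.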

The system integrates LLMs and computer vision to interpret web elements, generate interaction plans, and execute them without developer intervention. This approach allows the system to operate on websites it has never encountered before, adapt to layout changes, and apply a single workflow across multiple websites.

System Diagram
System diagram highlighting the process of input processing, scene understanding, and output generation.
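The interpret, plan, and execute cycle described above can be sketched as a simple loop. Everything in this example is an assumption made for illustration: the `Element` and `Action` types, `run_workflow`, the rule-based `toy_planner` (which stands in for an LLM call), and `FakePage` (which stands in for a real browser plus a vision model). It is not the system's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    label: str  # visible text a vision model might extract
    kind: str   # e.g. "input" or "button"


@dataclass
class Action:
    verb: str                     # "type", "click", or "done"
    target: Optional[str] = None  # label of the element to act on
    text: Optional[str] = None    # text to enter, for "type" actions


def run_workflow(
    goal: str,
    observe: Callable[[], List[Element]],
    plan: Callable[[str, List[Element], List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the page, ask the planner for one action, execute, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = plan(goal, observe(), history)
        history.append(action)
        if action.verb == "done":
            break
        execute(action)
    return history


def toy_planner(goal, elements, history):
    """Rule-based stand-in for an LLM: type the goal once, click once, stop."""
    typed = any(a.verb == "type" for a in history)
    clicked = any(a.verb == "click" for a in history)
    inputs = [e for e in elements if e.kind == "input"]
    buttons = [e for e in elements if e.kind == "button"]
    if not typed and inputs:
        return Action("type", target=inputs[0].label, text=goal)
    if typed and not clicked and buttons:
        return Action("click", target=buttons[0].label)
    return Action("done")


class FakePage:
    """Stand-in for a browser: returns fixed elements and records actions."""

    def __init__(self):
        self.log = []

    def observe(self):
        return [Element("Search box", "input"), Element("Search", "button")]

    def execute(self, action):
        self.log.append((action.verb, action.target, action.text))
```

Because the planner sees only labels and element kinds rather than selectors, the same loop could in principle be pointed at a different site by swapping the `observe` and `execute` callables.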

Key features of this approach include:

  • Adaptability: The system can function on new websites without requiring customized code.
  • Resilience: It is resistant to layout changes, as it does not rely on predetermined selectors.
  • Scalability: A single workflow can be applied to numerous websites.
  • Reasoning: The use of LLMs enables handling complex interactions and inferring information.

We demonstrated the system in several scenarios, including generating insurance quotes, performing competitor analysis, and automating job applications. Quantitative evaluations indicate a high success rate in generating accurate interactions, while qualitative feedback from developers highlights the system's potential benefits and areas for improvement.

Key contributions of this work include:

  • Designing a system for integrating real-time behavior generation in web automation.
  • Evaluating the performance and reliability of the system in various scenarios.
  • Analyzing feedback from developers regarding integration into existing workflows.

Real-time behavior generation offers promising opportunities for enhancing web automation. However, broader adoption depends on controlling the quality of generated behaviors and meeting user expectations. Future work will focus on developing robust guardrail systems to manage the quality and impact of generated behaviors.
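As one possible shape for such a guardrail, the sketch below checks each generated action against a small policy before it is executed. The `ProposedAction` type, the allowed verbs, and the list of sensitive terms are all hypothetical choices made for illustration; a real policy would need to be designed around the specific workflows and risks involved.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProposedAction:
    verb: str                   # what the generated plan wants to do
    target: str                 # visible label of the element to act on
    text: Optional[str] = None  # text to enter, if any


# Hypothetical policy: verbs that may run unattended, and label fragments
# that should always be routed to a human for confirmation.
ALLOWED_VERBS = {"click", "type", "scroll"}
SENSITIVE_TERMS = ("pay", "purchase", "delete", "submit application")


def check_action(action: ProposedAction) -> Tuple[bool, str]:
    """Return (allowed, reason) for a generated action."""
    if action.verb not in ALLOWED_VERBS:
        return False, f"verb '{action.verb}' is not allowed"
    target = action.target.lower()
    for term in SENSITIVE_TERMS:
        if term in target:
            return False, f"target mentions '{term}'; needs human confirmation"
    return True, "ok"
```

A check like this would run between planning and execution, so a rejected action can be logged or escalated instead of being carried out.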
