Visually Control the Web with Your AI Agent on Android
Imagine telling your phone, "Book the usual flight for next Tuesday," or "Summarize the key arguments on this forum page," and having it simply... *happen*. While we're not quite there yet, the AGI Companion Browser is a significant step towards that future – a future where AI agents handle the mundane (and complex) tasks of web interaction for us.
Born from the desire to automate web navigation and escape the endless cycle of clicks and typing, this experimental Android browser empowers advanced AI models to see, understand, and act on web pages, bridging the gap between language commands and visual web reality.
How It Works: Giving AI Eyes and Hands
The core concept merges multimodal AI perception with precise action simulation:
See: The app captures the current web page visually, providing the AI with both a clean screenshot and one overlaid with a labeled grid of cells (e.g., R5C8).
Understand: You provide a natural language goal (e.g., "Log me in", "Find the latest AI news"). The multimodal AI (currently Google Gemini) analyzes your request alongside the visual context of the screenshots and its own internal memory.
Plan & Act: The AI plans the necessary steps and responds with specific, actionable commands. These aren't vague suggestions; they are precise instructions like CLICK R7C3, TYPE R2C4 :: My Username, NAVIGATE https://example.com, or NOTE :: ... to update its internal checklist and reasoning.
Execute: The AGI Companion Browser receives these commands and simulates the corresponding user actions (taps, text input, navigation) directly within the Android WebView component.
Iterate: This see-understand-act cycle repeats, allowing the agent to perform complex, multi-step tasks across different pages until your goal is achieved or the interaction concludes (a simplified sketch of this loop follows).
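To make the cycle more concrete, here is a minimal Kotlin sketch of what the command parsing and the outer agent loop could look like. The command formats (CLICK, TYPE ... :: ..., NAVIGATE, NOTE ::) are taken from the description above; the DONE signal and the helper functions (captureScreenshots, askModel, executeCommand) are hypothetical placeholders, not the app's actual API.

```kotlin
import android.graphics.Bitmap

// The agent's command vocabulary, mirroring the formats described above.
sealed class AgentCommand {
    data class Click(val cell: String) : AgentCommand()                   // CLICK R7C3
    data class Type(val cell: String, val text: String) : AgentCommand()  // TYPE R2C4 :: My Username
    data class Navigate(val url: String) : AgentCommand()                 // NAVIGATE https://example.com
    data class Note(val content: String) : AgentCommand()                 // NOTE :: checklist / reasoning
    object Done : AgentCommand()                                          // hypothetical completion signal
}

// Parse one line of model output into a command (returns null on anything unexpected).
fun parseCommand(line: String): AgentCommand? {
    val t = line.trim()
    return when {
        t.startsWith("CLICK ") -> AgentCommand.Click(t.removePrefix("CLICK ").trim())
        t.startsWith("TYPE ") && " :: " in t -> {
            val (cell, text) = t.removePrefix("TYPE ").split(" :: ", limit = 2)
            AgentCommand.Type(cell.trim(), text)
        }
        t.startsWith("NAVIGATE ") -> AgentCommand.Navigate(t.removePrefix("NAVIGATE ").trim())
        t.startsWith("NOTE ::") -> AgentCommand.Note(t.removePrefix("NOTE ::").trim())
        t == "DONE" -> AgentCommand.Done
        else -> null
    }
}

// Hypothetical hooks into the rest of the app; real implementations omitted.
suspend fun captureScreenshots(): Pair<Bitmap, Bitmap> = TODO()           // clean + grid-overlaid
suspend fun askModel(goal: String, note: String, clean: Bitmap, grid: Bitmap): String = TODO()
suspend fun executeCommand(cmd: AgentCommand): Unit = TODO()              // tap / type / navigate

// The see-understand-act loop: capture, ask, act, repeat.
suspend fun runAgent(goal: String) {
    var note = ""                                        // the model's working memory / checklist
    while (true) {
        val (clean, gridded) = captureScreenshots()
        val reply = askModel(goal, note, clean, gridded)
        val cmd = parseCommand(reply) ?: return          // output we can't act on: stop
        when (cmd) {
            is AgentCommand.Note -> note = cmd.content   // update memory, no on-screen action
            is AgentCommand.Done -> return
            else -> executeCommand(cmd)
        }
    }
}
```

The key design point this sketch tries to capture is that NOTE is handled locally as the model's carry-over memory, while every other command becomes a simulated user action.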
Demo: Autonomous Web Task
Words can only explain so much. Watch this short demo where the AGI Companion Browser is asked to find cat videos on Bilibili:
*Demo showing navigation, search-bar interaction, typing (with a small AI typo!), and clicking video results.*
As the demo shows, the agent successfully navigates, identifies elements, types (albeit imperfectly – demonstrating current AI limitations!), and completes the task. This highlights the incredible potential even at this early stage.
The Development Journey & Challenges
Building this bridge between language, vision, and web action presented numerous fascinating challenges:
Sophisticated Prompt Engineering: Designing the master prompt (the "System Prompt") was critical. It has to guide the AI meticulously on interpreting visual grid coordinates, formatting commands flawlessly, reasoning about its actions, and using the NOTE :: command as its working memory and checklist (an illustrative excerpt follows this list).
Reliable Action Simulation: Programmatically triggering precise clicks and keystrokes within the Android WebView across diverse website structures required careful handling of coordinates, timing, and event dispatching.
Maintaining Context: For multi-step tasks, ensuring the AI remembers the overall goal, the plan (checklist in its NOTE), and the outcome of its last action based on the *new* screenshot is crucial but complex.
Visual Grounding: Translating the AI's understanding ("the blue button") into a specific grid cell (e.g., R6C2), and that cell into tappable on-screen coordinates, is fundamental to the agent's operation (see the coordinate-to-tap sketch after this list).
Robust API Handling: Managing the asynchronous communication with the external AI API is another ongoing challenge: encoding/decoding image data, handling network timeouts, and gracefully parsing potentially unexpected responses all need careful treatment (a request-handling sketch also follows).
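To give a feel for the prompt-engineering challenge, here is a much-abridged, illustrative sketch of the kind of rules such a system prompt has to encode. This is not the project's actual prompt; the wording, the grid description, and the DONE line are assumptions for illustration only.

```kotlin
// Illustrative excerpt only; the real system prompt is longer and carefully tuned.
val SYSTEM_PROMPT = """
    You are controlling an Android web browser. Each turn you receive two screenshots:
    one clean, and one overlaid with a labeled grid of cells (rows R1..Rn, columns C1..Cm).
    Respond with exactly one command per turn, in one of these formats:
      CLICK <cell>               e.g. CLICK R7C3
      TYPE <cell> :: <text>      e.g. TYPE R2C4 :: My Username
      NAVIGATE <url>             e.g. NAVIGATE https://example.com
      NOTE :: <checklist and reasoning to carry into the next turn>
      DONE                       (assumed completion signal for this sketch)
    Before issuing a command, check the new screenshot to confirm whether your previous
    action actually succeeded, and update your NOTE checklist accordingly.
""".trimIndent()
```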
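For the grounding and action-simulation challenges, the essential step is mapping a grid cell such as R6C2 to pixel coordinates inside the WebView and dispatching a touch event there. The sketch below assumes a fixed GRID_ROWS x GRID_COLS overlay and plain MotionEvent dispatching; the grid parameters and the helper names (cellToPixels, simulateTap) are illustrative, not the app's real internals.

```kotlin
import android.os.SystemClock
import android.view.MotionEvent
import android.webkit.WebView

// Assumed grid dimensions; the real overlay may use different values.
const val GRID_ROWS = 20
const val GRID_COLS = 10

// "R6C2" -> (x, y) at the centre of that cell, in WebView-local pixels.
fun cellToPixels(cell: String, webView: WebView): Pair<Float, Float> {
    val match = Regex("""R(\d+)C(\d+)""").find(cell)
        ?: error("Unrecognised cell: $cell")
    val (row, col) = match.destructured
    val cellW = webView.width.toFloat() / GRID_COLS
    val cellH = webView.height.toFloat() / GRID_ROWS
    return (col.toInt() - 0.5f) * cellW to (row.toInt() - 0.5f) * cellH
}

// Simulate a tap by dispatching a DOWN/UP MotionEvent pair at that point.
fun simulateTap(webView: WebView, cell: String) {
    val (x, y) = cellToPixels(cell, webView)
    val now = SystemClock.uptimeMillis()
    val down = MotionEvent.obtain(now, now, MotionEvent.ACTION_DOWN, x, y, 0)
    val up = MotionEvent.obtain(now, now + 50, MotionEvent.ACTION_UP, x, y, 0)
    webView.dispatchTouchEvent(down)
    webView.dispatchTouchEvent(up)
    down.recycle()
    up.recycle()
}
```

Typing can be simulated along the same lines, for example by first tapping the target cell to focus the field and then injecting key events or setting the field's value via evaluateJavascript.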
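Finally, for the API-handling challenge, the sketch below shows one defensive pattern: compress and Base64-encode the screenshots, set explicit timeouts, and treat any malformed reply as a recoverable failure. The endpoint and the "text" field in the response are placeholders, not the real Gemini request/response shape.

```kotlin
import android.graphics.Bitmap
import android.util.Base64
import org.json.JSONObject
import java.io.ByteArrayOutputStream
import java.net.HttpURLConnection
import java.net.URL

// Compress a screenshot and Base64-encode it for the API request body.
fun encodeScreenshot(bitmap: Bitmap): String {
    val buffer = ByteArrayOutputStream()
    bitmap.compress(Bitmap.CompressFormat.JPEG, 80, buffer)
    return Base64.encodeToString(buffer.toByteArray(), Base64.NO_WRAP)
}

// Send the request with explicit timeouts and parse the reply defensively.
// The endpoint and the "text" field are placeholders, not the real API contract.
fun callModel(endpoint: String, requestJson: String): String? {
    val conn = (URL(endpoint).openConnection() as HttpURLConnection).apply {
        requestMethod = "POST"
        connectTimeout = 15_000   // fail fast instead of hanging the agent loop
        readTimeout = 60_000      // multimodal responses can be slow
        doOutput = true
        setRequestProperty("Content-Type", "application/json")
    }
    return try {
        conn.outputStream.use { it.write(requestJson.toByteArray()) }
        val body = conn.inputStream.bufferedReader().use { it.readText() }
        // The model may wrap its command in extra prose; extract cautiously.
        JSONObject(body).optString("text").takeIf { it.isNotBlank() }
    } catch (e: Exception) {
        null  // caller decides whether to retry or surface the error
    } finally {
        conn.disconnect()
    }
}
```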
Current Status & Future Vision
The AGI Companion Browser is proudly **Open Source (MIT Licensed)** and available for you to explore, build, and experiment with on GitHub. While currently powered by Google Gemini, this is just the beginning:
Multi-Model Future: Active development is underway to integrate other leading multimodal models, including **Qwen (Alibaba)**, **DeepSeek-VL**, and **Llama** models with visual capabilities. This will provide users with choice and allow leveraging the unique strengths of each architecture.
Continuous Refinement: Both the AI models themselves and the agent's core system architecture are rapidly evolving. Expect significant improvements in reliability, efficiency, and the complexity of tasks the agent can handle. The goal? Near-flawless web automation.
The Ultimate Assistant? Can this evolve into an agent that seamlessly manages *all* routine web browsing, or even extends to controlling desktop applications? While ambitious, the potential is immense. **I anticipate** major leaps perhaps by late 2025 or soon after.
Connecting to the Road to Free Open AGI
This browser isn't just a technical demonstration; it embodies the core principles of the **Road to Free Open AGI** project. By developing and openly sharing powerful AI agent tools like this, we aim to democratize access to advanced AI capabilities, foster collaborative innovation, and ensure the benefits of artificial general intelligence are shared openly and ethically with everyone.
Get Involved & Explore!
Dive into the future of web interaction! Your feedback, ideas, and contributions are highly encouraged: