AGI Companion Browser

Visually Control the Web with Your AI Agent on Android

Imagine telling your phone, "Book the usual flight for next Tuesday," or "Summarize the key arguments on this forum page," and having it simply... *happen*. While we're not quite there yet, the AGI Companion Browser is a significant step towards that future – a future where AI agents handle the mundane (and complex) tasks of web interaction for us.

Born from the desire to automate web navigation and escape the endless cycle of clicks and typing, this experimental Android browser empowers advanced AI models to see, understand, and act on web pages, bridging the gap between language commands and visual web reality.

How It Works: Giving AI Eyes and Hands

The core concept merges multimodal AI perception with precise action simulation:

  1. See: The app captures the current web page visually, providing the AI with both a clean screenshot and a copy overlaid with a labeled grid whose cells are addressed by row and column (e.g., R5C8; a sketch of this overlay follows the list).
  2. Understand: You provide a natural language goal (e.g., "Log me in", "Find the latest AI news"). The multimodal AI (currently Google Gemini) analyzes your request alongside the visual context of the screenshots and its own internal memory.
  3. Plan & Act: The AI plans the necessary steps and responds with specific, actionable commands. These aren't vague suggestions; they are precise instructions like CLICK R7C3, TYPE R2C4 :: My Username, NAVIGATE https://example.com, or NOTE :: ... to update its internal checklist and reasoning (a parsing sketch follows below).
  4. Execute: The AGI Companion Browser receives these commands and simulates the corresponding user actions (taps, text input, navigation) directly within the Android WebView component (see the tap-simulation sketch below).
  5. Iterate: This see-understand-act cycle repeats, allowing the agent to perform complex, multi-step tasks across different pages until your goal is achieved or the interaction concludes.
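
To make the "See" step concrete, here is a minimal sketch of how a labeled grid could be drawn over a page screenshot. Everything here (the overlayGrid name, the 12x8 grid, the styling) is an illustrative assumption, not the app's actual code:

```kotlin
import android.graphics.Bitmap
import android.graphics.Canvas
import android.graphics.Color
import android.graphics.Paint

// Draws a labeled reference grid (R1C1, R1C2, ...) over a screenshot so the
// model can refer to screen regions by cell. The 12x8 default and the styling
// are assumptions for illustration.
fun overlayGrid(screenshot: Bitmap, rows: Int = 12, cols: Int = 8): Bitmap {
    val result = screenshot.copy(Bitmap.Config.ARGB_8888, true)
    val canvas = Canvas(result)
    val linePaint = Paint().apply {
        color = Color.argb(160, 255, 0, 0)   // translucent red grid lines
        style = Paint.Style.STROKE
        strokeWidth = 2f
    }
    val textPaint = Paint().apply {
        color = Color.argb(220, 255, 0, 0)
        textSize = result.width / cols / 4f  // scale labels to cell width
    }
    val cellW = result.width.toFloat() / cols
    val cellH = result.height.toFloat() / rows
    for (r in 0 until rows) {
        for (c in 0 until cols) {
            val left = c * cellW
            val top = r * cellH
            canvas.drawRect(left, top, left + cellW, top + cellH, linePaint)
            canvas.drawText("R${r + 1}C${c + 1}", left + 4f, top + textPaint.textSize, textPaint)
        }
    }
    return result
}
```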
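
The command strings in step 3 lend themselves to simple line-based parsing. Below is a hedged sketch of one way to decode them; the AgentCommand hierarchy and parseCommand name are invented for illustration, though the CLICK / TYPE / NAVIGATE / NOTE formats follow the examples above:

```kotlin
// Structured form of the commands described above. Names are illustrative.
sealed class AgentCommand {
    data class Click(val row: Int, val col: Int) : AgentCommand()
    data class Type(val row: Int, val col: Int, val text: String) : AgentCommand()
    data class Navigate(val url: String) : AgentCommand()
    data class Note(val text: String) : AgentCommand()
}

private val CELL = Regex("""R(\d+)C(\d+)""")  // matches cell refs like R7C3

fun parseCommand(line: String): AgentCommand? {
    val cmd = line.trim()
    return when {
        cmd.startsWith("CLICK ") -> CELL.find(cmd)?.let {
            AgentCommand.Click(it.groupValues[1].toInt(), it.groupValues[2].toInt())
        }
        cmd.startsWith("TYPE ") -> {
            // Format: TYPE R2C4 :: text to enter
            val parts = cmd.removePrefix("TYPE ").split(" :: ", limit = 2)
            if (parts.size != 2) return null
            CELL.find(parts[0])?.let {
                AgentCommand.Type(it.groupValues[1].toInt(), it.groupValues[2].toInt(), parts[1])
            }
        }
        cmd.startsWith("NAVIGATE ") -> AgentCommand.Navigate(cmd.removePrefix("NAVIGATE ").trim())
        cmd.startsWith("NOTE :: ") -> AgentCommand.Note(cmd.removePrefix("NOTE :: "))
        else -> null  // unknown command; the caller can re-prompt the model
    }
}
```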
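
For the "Execute" step, a tap on a grid cell can be simulated by dispatching a synthetic touch-down/touch-up pair to the WebView. Again a sketch under assumptions (the tapCell name and default grid size are invented; the grid must match the one drawn on the screenshot the model saw):

```kotlin
import android.os.SystemClock
import android.view.MotionEvent
import android.webkit.WebView

// Simulates a user tap at the center of grid cell (row, col) by sending a
// synthetic DOWN/UP MotionEvent pair to the WebView. rows/cols must match
// the grid used when the screenshot was captured.
fun WebView.tapCell(row: Int, col: Int, rows: Int = 12, cols: Int = 8) {
    val x = (col - 0.5f) * width / cols   // center of the cell, in view pixels
    val y = (row - 0.5f) * height / rows
    val downTime = SystemClock.uptimeMillis()
    val down = MotionEvent.obtain(downTime, downTime, MotionEvent.ACTION_DOWN, x, y, 0)
    val up = MotionEvent.obtain(downTime, downTime + 50, MotionEvent.ACTION_UP, x, y, 0)
    dispatchTouchEvent(down)
    dispatchTouchEvent(up)
    down.recycle()
    up.recycle()
}
```

Looping capture, parse, and execute against the model's responses yields the see-understand-act cycle described in step 5.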

Demo: Autonomous Web Task

Words can only explain so much. Watch this short demo where the AGI Companion Browser is asked to find cat videos on Bilibili:

Demo showing navigation, search bar interaction, typing (with a small AI typo!), and clicking video results.

As the demo shows, the agent successfully navigates, identifies elements, types (albeit imperfectly – demonstrating current AI limitations!), and completes the task. This highlights the incredible potential even at this early stage.

The Development Journey & Challenges

Building this bridge between language, vision, and web action presented numerous fascinating challenges.

Current Status & Future Vision

The AGI Companion Browser is proudly **Open Source (MIT Licensed)** and available for you to explore, build, and experiment with on GitHub. While it is currently powered by Google Gemini, this is just the beginning.

Connecting to the Road to Free Open AGI

This browser isn't just a technical demonstration; it embodies the core principles of the **Road to Free Open AGI** project. By developing and openly sharing powerful AI agent tools like this, we aim to democratize access to advanced AI capabilities, foster collaborative innovation, and ensure the benefits of artificial general intelligence are shared ethically and openly with everyone.

Get Involved & Explore!

Dive into the future of web interaction! Your feedback, ideas, and contributions are highly encouraged:

View Code on GitHub
Report Issues / Suggest Features

Feel free to reach out via the main site's contact options to discuss the project!
