AI agents excel at web tasks because they can read the DOM. But when you point that same agent at a mobile app, it hits a wall. There is no DOM to read. Some have tried to use computer vision, but it is slow, expensive, and flaky.
I built Agenteract to give mobile apps a "DOM" that agents can actually read.
Instead of taking screenshots, Agenteract instruments your app to "self-report" its view hierarchy through a secure local WebSocket. It serializes the UI into a token-efficient JSON tree.
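To make "token-efficient" concrete, here is a minimal sketch of one way a view tree could be serialized compactly. It is not Agenteract's actual serializer: the `ViewNode` type and both function names are hypothetical, and only the field names (`name`, `text`, `testID`, `children`) come from the response example in this post. The idea is to emit only fields that carry information and to drop whitespace.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ViewNode:
    """Hypothetical in-memory view node (an assumption, not Agenteract's type)."""
    name: str
    text: Optional[str] = None
    test_id: Optional[str] = None
    children: list["ViewNode"] = field(default_factory=list)


def to_compact_dict(node: ViewNode) -> dict:
    """Keep only populated fields so empty ones cost no tokens."""
    out = {"name": node.name}
    if node.text:
        out["text"] = node.text
    if node.test_id:
        out["testID"] = node.test_id
    if node.children:
        out["children"] = [to_compact_dict(child) for child in node.children]
    return out


def serialize(node: ViewNode) -> str:
    """Dump without the spaces json.dumps inserts after ',' and ':' by default."""
    return json.dumps(to_compact_dict(node), separators=(",", ":"))


button = ViewNode(
    "Pressable",
    test_id="test-button",
    children=[ViewNode("Text", text="Simulate Target")],
)
print(serialize(button))
```

Leaf nodes with no text, no testID, and no children shrink to just their name, which is where most of the savings would come from in a deep tree.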
On top of this, Agenteract provides a clean CLI that enables agents to interact with the app with minimal context overhead.
Here's a typical flow for driving an app through the CLI. First, fetch the view hierarchy:
npx @agenteract/agents hierarchy your-app
Simplified response example:

{
  "status": "success",
  "hierarchy": {
    "name": "main(RootComponent)",
    "children": [
      {
        "name": "ViewStuff",
        "children": [
          { "name": "Text", "text": "Home" },
          {
            "name": "Pressable",
            "testID": "test-button",
            "children": [
              { "name": "Text", "text": "Simulate Target" }
            ]
          }
        ]
      }
    ]
  },
  "id": "296d657e-f5e8-49d1-a7ab-62815a"
}
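As a rough illustration of how an agent (or a test script) might consume this response, the sketch below walks the tree and lists every node that carries a `testID`, along with its visible text and its path from the root. The helper names `find_targets` and `collect_text` are made up for this example; the JSON is the simplified response shown above, trimmed to the fields the walk needs.

```python
import json


def collect_text(node):
    """Yield every "text" value at or below this node, in document order."""
    if "text" in node:
        yield node["text"]
    for child in node.get("children", []):
        yield from collect_text(child)


def find_targets(node, path=()):
    """Yield (testID, visible text, path) for each node that has a testID."""
    here = path + (node["name"],)
    if "testID" in node:
        yield node["testID"], " ".join(collect_text(node)), "/".join(here)
    for child in node.get("children", []):
        yield from find_targets(child, here)


response = json.loads("""
{
  "status": "success",
  "hierarchy": {
    "name": "main(RootComponent)",
    "children": [
      {
        "name": "ViewStuff",
        "children": [
          { "name": "Text", "text": "Home" },
          {
            "name": "Pressable",
            "testID": "test-button",
            "children": [ { "name": "Text", "text": "Simulate Target" } ]
          }
        ]
      }
    ]
  }
}
""")

targets = list(find_targets(response["hierarchy"]))
for test_id, label, where in targets:
    print(f"{test_id}: {label!r} at {where}")
```

The `testID` found this way is what gets passed to an action such as `tap`.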
Then tap the button:
npx @agenteract/agents tap your-app test-button
If the interaction is successful, the agent receives:
{"status":"success","message":"Tapped test-button"}
Alternatively, the agent could retrieve the view hierarchy again to see the updated state.

We support the following actions:
- tap
- input
- scroll
- longPress
- swipe
- wait
- log
- cmd (send keystrokes to dev servers, e.g. reload)
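For anyone scripting this loop outside an agent, a thin wrapper around the CLI might look like the sketch below. It uses only the two invocations shown in this post (`hierarchy <app>` and `tap <app> <testID>`) and assumes the CLI prints nothing but the JSON response to stdout, which may not hold in every setup. The function names are hypothetical.

```python
import json
import subprocess

# The command prefix used in the examples in this post.
CLI = ["npx", "@agenteract/agents"]


def build_command(action, app, *args):
    """Build the argv list for one CLI call, e.g. tap your-app test-button."""
    return [*CLI, action, app, *args]


def run_action(action, app, *args, timeout=30.0):
    """Run one CLI call and parse stdout as JSON (assumption: stdout is pure JSON)."""
    proc = subprocess.run(
        build_command(action, app, *args),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


def tap_and_refresh(app, test_id):
    """Tap a target, then re-read the hierarchy to observe the updated state."""
    result = run_action("tap", app, test_id)
    if result.get("status") != "success":
        raise RuntimeError(f"tap failed: {result}")
    return run_action("hierarchy", app)["hierarchy"]


# Only builds the command here; nothing is executed until run_action is called.
print(" ".join(build_command("tap", "your-app", "test-button")))
```

Calling `tap_and_refresh("your-app", "test-button")` with the app and dev server running would perform the tap and return the fresh tree. Passing arguments as a list rather than a shell string avoids quoting problems with unusual testIDs.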
Why not just use screenshots?
* Speed: JSON is instant; processing screenshots takes seconds.
* Cost: Text tokens are orders of magnitude cheaper than image tokens.
* Reliability: Deterministic targeting means no hallucinated coordinates.
Supported Platforms:
* React Native / Expo (via Fiber tree introspection)
* Flutter (via Widget tree traversal)
* Native iOS (Swift/UIKit/SwiftUI)
* Native Android (Kotlin)