This project is an AI-powered browser agent designed to autonomously navigate web applications and capture UI states, even for actions that don't have unique URLs.
Given a high-level task (e.g., "Add two todos: 'Buy milk' and 'Wash car', then delete 'Call mom'"), the agent uses a large language model (LLM) to "think" step-by-step, controlling a live browser to execute the task while saving a screenshot at every stage.
This system was built to solve the challenge of capturing "non-URL states" (like modals, form inputs, or dynamic UI changes) for multi-step workflows.
The agent operates on a continuous Perception-Decision-Action loop, inspired by autonomous agent frameworks like ReAct (Reason + Act).
At every step, the agent performs this loop until the LLM decides the task is finished and returns a FINISH action.
- Perceive: The agent "looks" at the current page. It doesn't read the raw, complex HTML. Instead, it runs the `get_simplified_dom_string()` function to build a simple, text-based representation of all interactive elements (buttons, links, inputs). This abstract view is compact enough for an LLM to reason over (a sketch of this function appears after this list).
  Example "Simplified DOM" entry: `<element id='1' type='input'>What needs to be done?</element>`
- Capture: Before acting, the agent captures a full-page screenshot of the current UI state and saves it to a task-specific folder (e.g., `screenshots/create_a_new_project_in_linear.../`).
- Decide (The "Brain"): The agent sends the Task, Action History, and Current Simplified DOM to an LLM (like GPT-4o). The LLM's sole job is to respond with a single JSON command, like `{"action": "TYPE", "element_id": 1, "text": "New Project"}`.
- Act (The "Hands"): The script parses the JSON command and executes it. It uses `get_element_by_agent_id()` to find the Nth element on the page (e.g., `element_id: 1` is the 1st element). This "re-finding" logic is crucial for handling dynamic UIs and DOM re-renders. The loop repeats until the task is complete; the full loop is sketched below.
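The two DOM helpers named above could look something like the following minimal sketch, assuming Playwright's sync API; the repository's actual selectors and markup may differ.

```python
# Minimal sketch of the perception helpers, assuming Playwright's sync API;
# the repository's real implementations may differ in selectors and markup.
from playwright.sync_api import Page

# Elements the agent treats as "interactive" (an assumption for this sketch).
INTERACTIVE_SELECTOR = "a, button, input, textarea, select, [role='button']"

def _visible_interactive_elements(page: Page):
    """All currently visible interactive elements, in DOM order."""
    return [el for el in page.query_selector_all(INTERACTIVE_SELECTOR) if el.is_visible()]

def get_simplified_dom_string(page: Page) -> str:
    """Render the visible interactive elements as numbered <element> tags."""
    lines = []
    for i, el in enumerate(_visible_interactive_elements(page), start=1):
        tag = el.evaluate("node => node.tagName.toLowerCase()")
        # Prefer visible text; fall back to the placeholder for empty inputs.
        label = (el.inner_text() or el.get_attribute("placeholder") or "").strip()
        lines.append(f"<element id='{i}' type='{tag}'>{label}</element>")
    return "\n".join(lines)

def get_element_by_agent_id(page: Page, element_id: int):
    """Re-locate the Nth visible element; re-querying tolerates DOM re-renders."""
    return _visible_interactive_elements(page)[element_id - 1]  # ids are 1-based
```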
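And a minimal sketch of the overall loop, under the same Playwright assumption. The prompt string, the `decide` callable (which stands in for the provider-specific LLM call), and the `CLICK` action name are illustrative additions, not the project's confirmed interface.

```python
# Sketch of the Perceive -> Capture -> Decide -> Act loop, using the helpers
# sketched above. `decide` is a placeholder for the actual LLM call.
import json
from pathlib import Path
from typing import Callable
from playwright.sync_api import sync_playwright

def run_agent(task: str, start_url: str, screenshot_dir: str,
              decide: Callable[[str], str], max_steps: int = 20) -> None:
    Path(screenshot_dir).mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        history: list[str] = []

        for step in range(max_steps):
            # Perceive: text-only view of the interactive elements.
            dom = get_simplified_dom_string(page)
            # Capture: full-page screenshot of the current UI state.
            page.screenshot(path=f"{screenshot_dir}/step_{step:02d}.png", full_page=True)
            # Decide: the LLM replies with a single JSON command.
            prompt = f"Task: {task}\nHistory: {history}\nPage:\n{dom}"
            command = json.loads(decide(prompt))
            history.append(json.dumps(command))

            # Act: dispatch the command against the live page.
            if command["action"] == "FINISH":
                break
            element = get_element_by_agent_id(page, command["element_id"])
            if command["action"] == "CLICK":
                element.click()
            elif command["action"] == "TYPE":
                element.fill(command["text"])
```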
This agent solves the "non-URL state" problem (like modals or forms that don't change the URL) by ignoring URLs entirely.
The agent's "state" is its perception of the current UI, which it re-evaluates every loop. When a modal appears, it simply sees a new list of interactive elements (like "Save", "Cancel") and acts accordingly.
For the interpretation task, you can use any of six model providers: ChatGPT, Grok, Gemini, Claude, Perplexity, or DeepSeek. Just add your API keys to the `.env` file using the corresponding `<PROVIDER_API_KEY>` variable names, and you'll be all set.
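As a rough sketch of how those keys might be read at runtime (the exact variable names and the `get_api_key` helper below are illustrative assumptions, not the repository's confirmed interface):

```python
# Illustrative provider/key lookup; the real .env variable names may differ.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the project root

# Hypothetical mapping from provider name to the expected .env variable.
PROVIDER_KEY_VARS = {
    "chatgpt": "CHATGPT_API_KEY",
    "grok": "GROK_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "claude": "CLAUDE_API_KEY",
    "perplexity": "PERPLEXITY_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}

def get_api_key(provider: str) -> str:
    """Return the API key for the chosen provider, failing loudly if missing."""
    var_name = PROVIDER_KEY_VARS[provider.lower()]
    key = os.getenv(var_name)
    if not key:
        raise RuntimeError(f"Missing {var_name} in your .env file")
    return key
```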