Added support for more ollama models #239

Open · wants to merge 2 commits into main
42 changes: 32 additions & 10 deletions README.md
@@ -83,28 +83,37 @@ Use Qwen-vl with Vision to see how it stacks up to GPT-4-Vision at operating a computer
 operate -m qwen-vl
 ```

-#### Try LLaVa Hosted Through Ollama `-m llava`
+#### Try Multimodal Models Hosted Through Ollama `-m <model_name>`
 If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
 *Note: Ollama currently only supports MacOS and Linux. Windows support is now in preview.*

-First, install Ollama on your machine from https://ollama.ai/download.
+First, install Ollama on your machine from https://ollama.com/download.

-Once Ollama is installed, pull the LLaVA model:
+Once Ollama is installed, pull the model you want to use:
 ```
-ollama pull llava
+ollama pull <model_name>
 ```
-This will download the model on your machine which takes approximately 5 GB of storage.
+This will download the model to your machine; llava:7b, for example, takes approximately 5 GB of storage.

-When Ollama has finished pulling LLaVA, start the server:
+When Ollama has finished pulling the model, start the server:
 ```
 ollama serve
 ```

-That's it! Now start `operate` and select the LLaVA model:
+That's it! Now start `operate` and specify the model you want to use directly:
 ```
-operate -m llava
+operate -m llama-3.1-vision
 ```
+
+For better text recognition when clicking on elements, you can enable OCR with the `--ocr` flag:
+```
+operate -m llama-3.1-vision --ocr
+```

-**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.
+**Important:**
+- The OCR flag is only available for Ollama models.
+- The system will attempt to run any model you specify, regardless of whether it is detected as multimodal.
+- Error rates when using Ollama are very high, even with large models like llama3.2-vision:90b. This is simply intended to be a base to build off of as local multimodal models improve over time.

 Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)

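Before pointing `operate` at a freshly pulled model, it can help to confirm the model actually answers image prompts. Below is a minimal sketch against Ollama's local REST API, assuming the default port 11434 and the llava:7b pull mentioned above; the screenshot path is a placeholder:

```
import base64

import requests  # third-party; pip install requests

# Ollama listens on localhost:11434 by default once `ollama serve` is running.
OLLAMA_URL = "http://localhost:11434/api/generate"


def describe_screenshot(image_path: str, model: str = "llava:7b") -> str:
    """Ask a locally pulled multimodal model to describe a screenshot."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": "List the clickable UI elements visible in this screenshot.",
            "images": [image_b64],  # Ollama expects base64-encoded image data
            "stream": False,        # return one JSON object instead of a stream
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]


if __name__ == "__main__":
    print(describe_screenshot("screenshot.png"))  # placeholder path
```

If this returns a sensible description, the same model name should work with `operate -m`.
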
@@ -136,6 +145,19 @@ Run with voice mode
 operate --voice
 ```

+### Browser Preference `-b` or `--browser`
+By default, Self-Operating Computer uses Google Chrome as the browser when providing instructions to the model. If you prefer a different browser, specify it with the `-b` or `--browser` flag:
+
+```
+operate -b "Firefox"
+```
+
+```
+operate --browser "Microsoft Edge"
+```
+
+The specified browser is used in the system prompts to guide the model.

 ### Optical Character Recognition Mode `-m gpt-4-with-ocr`
 The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` an element by its text, and the code then references the hash map to get the coordinates of the element GPT-4 wanted to click.

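The diff doesn't show how the browser preference reaches the model, so the following is only a sketch of the idea described above, with a made-up prompt template; the repository's real system prompts live elsewhere and will differ:

```
# Hypothetical template; the framework's actual system prompts are not in this diff.
SYSTEM_PROMPT_TEMPLATE = (
    "You are operating a {os_name} computer. "
    "When a task requires the web, open and use {browser}."
)


def build_system_prompt(browser: str = "Google Chrome", os_name: str = "macOS") -> str:
    """Interpolate the user's preferred browser into the system prompt."""
    return SYSTEM_PROMPT_TEMPLATE.format(os_name=os_name, browser=browser)


print(build_system_prompt("Firefox"))
# -> You are operating a macOS computer. When a task requires the web, open and use Firefox.
```
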
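To make the hash-map description in the OCR section concrete, here is a minimal sketch assuming OCR results arrive as (text, bounding box) pairs; the framework's actual OCR integration may differ in detail:

```
from typing import Dict, List, Tuple

# A bounding box as (x_min, y_min, x_max, y_max) in screen pixels.
BBox = Tuple[int, int, int, int]


def build_element_map(ocr_results: List[Tuple[str, BBox]]) -> Dict[str, Tuple[int, int]]:
    """Map each piece of recognized text to the center of its bounding box."""
    element_map: Dict[str, Tuple[int, int]] = {}
    for text, (x_min, y_min, x_max, y_max) in ocr_results:
        element_map[text.strip().lower()] = ((x_min + x_max) // 2, (y_min + y_max) // 2)
    return element_map


# The model asks to click an element by its text; the code resolves coordinates:
elements = build_element_map([("Sign in", (100, 40, 180, 70)), ("Search", (300, 40, 380, 70))])
x, y = elements["sign in"]  # (140, 55): where the click is dispatched
```
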
4 changes: 4 additions & 0 deletions operate/config.py
@@ -18,6 +18,8 @@ class Config:
         openai_api_key (str): API key for OpenAI.
         google_api_key (str): API key for Google.
         ollama_host (str): URL of an Ollama instance running remotely.
+        ocr_enabled (bool): Flag indicating whether OCR is enabled for Ollama models.
+        browser (str): Preferred browser to use in system prompts.
     """

     _instance = None
@@ -31,6 +33,8 @@ def __new__(cls):
     def __init__(self):
         load_dotenv()
         self.verbose = False
+        self.ocr_enabled = False
+        self.browser = "Google Chrome"
         self.openai_api_key = (
             None  # instance variables are backups in case saving to a `.env` fails
         )
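For context on how these defaults behave at runtime, here is a compact sketch of the singleton pattern the surrounding code hints at (`_instance` plus `__new__`); the attribute assignments at the end are illustrative:

```
class Config:
    _instance = None

    def __new__(cls):
        # Return the one shared instance, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Note: __init__ re-runs on every Config() call, re-applying defaults.
        self.verbose = False
        self.ocr_enabled = False
        self.browser = "Google Chrome"


config_a = Config()
config_b = Config()
print(config_a is config_b)   # True: every caller shares one Config
config_a.browser = "Firefox"  # e.g. applied once from the parsed --browser flag
print(config_b.browser)       # Firefox: same underlying object
```
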
24 changes: 23 additions & 1 deletion operate/main.py
@@ -39,14 +39,36 @@ def main_entry():
         type=str,
         required=False,
     )
+
+    # Add OCR flag for Ollama models
+    parser.add_argument(
+        "--ocr",
+        help="Enable OCR for Ollama models",
+        action="store_true",
+    )
+
+    # Add browser preference flag
+    parser.add_argument(
+        "-b",
+        "--browser",
+        help="Specify preferred browser (default: Google Chrome)",
+        type=str,
+        default="Google Chrome",
+    )
+
     try:
         args = parser.parse_args()
+
+        # No need to prompt for model name if it's directly specified
+        # The Ollama model name can now be passed directly
+
         main(
             args.model,
             terminal_prompt=args.prompt,
             voice_mode=args.voice,
-            verbose_mode=args.verbose
+            verbose_mode=args.verbose,
+            ocr_mode=args.ocr,
+            browser=args.browser
         )
     except KeyboardInterrupt:
         print(f"\n{ANSI_BRIGHT_MAGENTA}Exiting...")
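The hunk above shows the call site gaining `ocr_mode` and `browser` keyword arguments, but not the receiving side. A plausible sketch of that side, assuming `main()` simply copies the parsed flags onto the `Config` singleton; this is hypothetical, since the PR's changes to the rest of the codebase are not shown here:

```
# Hypothetical receiving side of the new keyword arguments.
from operate.config import Config


def main(model, terminal_prompt=None, voice_mode=False, verbose_mode=False,
         ocr_mode=False, browser="Google Chrome"):
    config = Config()
    config.verbose = verbose_mode
    config.ocr_enabled = ocr_mode  # read later when an Ollama model handles a step
    config.browser = browser       # read when the system prompt is assembled
    ...  # remainder of the operating loop
```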