Added support for more ollama models #239

Open · wants to merge 2 commits into main
42 changes: 32 additions & 10 deletions README.md
@@ -83,28 +83,37 @@ Use Qwen-vl with Vision to see how it stacks up to GPT-4-Vision at operating a computer
 operate -m qwen-vl
 ```

-#### Try LLaVa Hosted Through Ollama `-m llava`
+#### Try Multimodal Models Hosted Through Ollama `-m <model_name>`
 If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
 *Note: Ollama currently only supports MacOS and Linux. Windows support is now in preview.*

-First, install Ollama on your machine from https://ollama.ai/download.
+First, install Ollama on your machine from https://ollama.com/download.

-Once Ollama is installed, pull the LLaVA model:
+Once Ollama is installed, pull the model you want to use:
 ```
-ollama pull llava
+ollama pull <model_name>
 ```
-This will download the model on your machine which takes approximately 5 GB of storage.
+This will download the model to your machine; llava:7b, for example, takes approximately 5 GB of storage.

-When Ollama has finished pulling LLaVA, start the server:
+When Ollama has finished pulling the model, start the server:
 ```
 ollama serve
 ```

-That's it! Now start `operate` and select the LLaVA model:
+That's it! Now start `operate` and specify the model you want to use directly:
 ```
-operate -m llava
+operate -m llama-3.1-vision
 ```
+
+For better text recognition when clicking on elements, you can enable OCR with the `--ocr` flag:
+```
+operate -m llama-3.1-vision --ocr
+```

-**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.
+**Important:**
+- The OCR flag is only available for Ollama models.
+- The system will attempt to run any model you specify, regardless of whether it is detected as multimodal.
+- Error rates when using Ollama are very high, even with large models like llama3.2-vision:90b. This is simply intended to be a base to build off of as local multimodal models improve over time.

 Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)

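Before pointing `operate` at a freshly pulled model, it can help to confirm the model actually answers image prompts. Below is a minimal sketch against Ollama's local REST API, assuming the default port 11434 and the llava:7b pull mentioned above; the screenshot path is a placeholder:

```
import base64

import requests  # third-party; pip install requests

# Ollama listens on localhost:11434 by default once `ollama serve` is running.
OLLAMA_URL = "http://localhost:11434/api/generate"


def describe_screenshot(image_path: str, model: str = "llava:7b") -> str:
    """Ask a locally pulled multimodal model to describe a screenshot."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": "List the clickable UI elements visible in this screenshot.",
            "images": [image_b64],  # Ollama expects base64-encoded image data
            "stream": False,        # return one JSON object instead of a stream
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]


if __name__ == "__main__":
    print(describe_screenshot("screenshot.png"))  # placeholder path
```

If this returns a sensible description, the same model name should work with `operate -m`.
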
@@ -136,6 +145,19 @@ Run with voice mode
 operate --voice
 ```

+### Browser Preference `-b` or `--browser`
+By default, Self-Operating Computer uses Google Chrome as the browser when providing instructions to the model. If you prefer a different browser, specify it with the `-b` or `--browser` flag:
+
+```
+operate -b "Firefox"
+```
+
+```
+operate --browser "Microsoft Edge"
+```
+
+The specified browser is used in the system prompts to guide the model.

 ### Optical Character Recognition Mode `-m gpt-4-with-ocr`
 The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` an element by its text, and the code then references the hash map to get the coordinates of the element GPT-4 wanted to click.

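The diff doesn't show how the browser preference reaches the model, so the following is only a sketch of the idea described above, with a made-up prompt template; the repository's real system prompts live elsewhere and will differ:

```
# Hypothetical template; the framework's actual system prompts are not in this diff.
SYSTEM_PROMPT_TEMPLATE = (
    "You are operating a {os_name} computer. "
    "When a task requires the web, open and use {browser}."
)


def build_system_prompt(browser: str = "Google Chrome", os_name: str = "macOS") -> str:
    """Interpolate the user's preferred browser into the system prompt."""
    return SYSTEM_PROMPT_TEMPLATE.format(os_name=os_name, browser=browser)


print(build_system_prompt("Firefox"))
# -> You are operating a macOS computer. When a task requires the web, open and use Firefox.
```
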
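To make the hash-map description in the OCR section concrete, here is a minimal sketch assuming OCR results arrive as (text, bounding box) pairs; the framework's actual OCR integration may differ in detail:

```
from typing import Dict, List, Tuple

# A bounding box as (x_min, y_min, x_max, y_max) in screen pixels.
BBox = Tuple[int, int, int, int]


def build_element_map(ocr_results: List[Tuple[str, BBox]]) -> Dict[str, Tuple[int, int]]:
    """Map each piece of recognized text to the center of its bounding box."""
    element_map: Dict[str, Tuple[int, int]] = {}
    for text, (x_min, y_min, x_max, y_max) in ocr_results:
        element_map[text.strip().lower()] = ((x_min + x_max) // 2, (y_min + y_max) // 2)
    return element_map


# The model asks to click an element by its text; the code resolves coordinates:
elements = build_element_map([("Sign in", (100, 40, 180, 70)), ("Search", (300, 40, 380, 70))])
x, y = elements["sign in"]  # (140, 55): where the click is dispatched
```
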
4 changes: 4 additions & 0 deletions operate/config.py
@@ -18,6 +18,8 @@ class Config:
         openai_api_key (str): API key for OpenAI.
         google_api_key (str): API key for Google.
         ollama_host (str): URL of an Ollama instance running remotely.
+        ocr_enabled (bool): Flag indicating whether OCR is enabled for Ollama models.
+        browser (str): Preferred browser to use in system prompts.
     """

     _instance = None
@@ -31,6 +33,8 @@ def __new__(cls):
     def __init__(self):
         load_dotenv()
         self.verbose = False
+        self.ocr_enabled = False
+        self.browser = "Google Chrome"
         self.openai_api_key = (
             None  # instance variables are backups in case saving to a `.env` fails
         )
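For context on how these defaults behave at runtime, here is a compact sketch of the singleton pattern the surrounding code hints at (`_instance` plus `__new__`); the attribute assignments at the end are illustrative:

```
class Config:
    _instance = None

    def __new__(cls):
        # Return the one shared instance, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Note: __init__ re-runs on every Config() call, re-applying defaults.
        self.verbose = False
        self.ocr_enabled = False
        self.browser = "Google Chrome"


config_a = Config()
config_b = Config()
print(config_a is config_b)   # True: every caller shares one Config
config_a.browser = "Firefox"  # e.g. applied once from the parsed --browser flag
print(config_b.browser)       # Firefox: same underlying object
```
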
24 changes: 23 additions & 1 deletion operate/main.py
@@ -39,14 +39,36 @@ def main_entry():
         type=str,
         required=False,
     )
+
+    # Add OCR flag for Ollama models
+    parser.add_argument(
+        "--ocr",
+        help="Enable OCR for Ollama models",
+        action="store_true",
+    )
+
+    # Add browser preference flag
+    parser.add_argument(
+        "-b",
+        "--browser",
+        help="Specify preferred browser (default: Google Chrome)",
+        type=str,
+        default="Google Chrome",
+    )
+
     try:
         args = parser.parse_args()
+
+        # No need to prompt for model name if it's directly specified
+        # The Ollama model name can now be passed directly
+
         main(
             args.model,
             terminal_prompt=args.prompt,
             voice_mode=args.voice,
-            verbose_mode=args.verbose
+            verbose_mode=args.verbose,
+            ocr_mode=args.ocr,
+            browser=args.browser
         )
     except KeyboardInterrupt:
         print(f"\n{ANSI_BRIGHT_MAGENTA}Exiting...")
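The hunk above shows the call site gaining `ocr_mode` and `browser` keyword arguments, but not the receiving side. A plausible sketch of that side, assuming `main()` simply copies the parsed flags onto the `Config` singleton; this is hypothetical, since the PR's changes to the rest of the codebase are not shown here:

```
# Hypothetical receiving side of the new keyword arguments.
from operate.config import Config


def main(model, terminal_prompt=None, voice_mode=False, verbose_mode=False,
         ocr_mode=False, browser="Google Chrome"):
    config = Config()
    config.verbose = verbose_mode
    config.ocr_enabled = ocr_mode  # read later when an Ollama model handles a step
    config.browser = browser       # read when the system prompt is assembled
    ...  # remainder of the operating loop
```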