Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)
### Voice Mode `--voice`
The framework supports voice inputs for the objective. Try voice by following the instructions below.
**Clone the repo** to a directory on your computer:
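```
# repository URL as given in the contributing link at the end of this README
git clone https://github.com/OthersideAI/self-operating-computer.git
```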
Run with voice mode

```
operate --voice
```
### Optical Character Recognition Mode `-m gpt-4-with-ocr`
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements, keyed by their on-screen text and mapped to screen coordinates. GPT-4 can decide to `click` an element by its text, and the code then looks that text up in the hash map to get the coordinates of the element GPT-4 wants to click.
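As a rough illustration of that lookup (a minimal sketch, not the framework's actual code; the element labels, coordinates, and use of `pyautogui` here are placeholder assumptions):

```
# Minimal sketch: OCR results stored as a hash map from on-screen text to
# screen coordinates, so a click decided "by text" can be resolved to a point.
# Labels, coordinates, and pyautogui usage are illustrative placeholders.
import pyautogui

ocr_elements = {
    "Search": (412, 318),   # text detected by OCR -> center of its bounding box
    "Sign in": (980, 64),
}

def click_by_text(label):
    """Resolve the text chosen by the model to coordinates and click there."""
    coords = ocr_elements.get(label)
    if coords is None:
        return False        # the model picked text that OCR did not detect
    pyautogui.click(*coords)
    return True

click_by_text("Search")
```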
Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use OCR mode, you can simply run `operate`; `operate -m gpt-4-with-ocr` will also work.
### Set-of-Mark Prompting `-m gpt-4-with-som`
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper [here](https://arxiv.org/abs/2310.11441).
For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
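If you want to sanity-check a replacement `best.pt` before opening a PR, a quick standalone check along these lines can help (a sketch that assumes the `ultralytics` package and a local `screenshot.png`; it is not part of the framework itself):

```
# Sketch: load custom YOLOv8 weights and run button detection on one screenshot.
# "screenshot.png" and the confidence threshold are placeholder assumptions.
from ultralytics import YOLO

model = YOLO("model/weights/best.pt")          # or the path to your own weights
results = model("screenshot.png", conf=0.25)   # detect buttons in a single image

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()      # corners of a detected button
    print(f"button ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}) conf={float(box.conf):.2f}")
```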
Start `operate` with the SoM model
```
operate -m gpt-4-with-som
```
## Contributions are Welcomed!
If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).