
Commit bf41c78

Move up ### Voice Mode in README.MD
1 parent 0e4f965 commit bf41c78

File tree: 1 file changed, +22 -20 lines changed

Diff for: README.md

@@ -113,26 +113,6 @@ operate -m llava
 
 Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)
 
-### Optical Character Recognition Mode `-m gpt-4-with-ocr`
-The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` elements by text and then the code references the hash map to get the coordinates for that element GPT-4 wanted to click.
-
-Based on recent tests, OCR performs better than `som` and vanilla GPT-4 so we made it the default for the project. To use the OCR mode you can simply write:
-
-`operate` or `operate -m gpt-4-with-ocr` will also work.
-
-### Set-of-Mark Prompting `-m gpt-4-with-som`
-The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
-
-Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).
-
-For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
-
-Start `operate` with the SoM model
-
-```
-operate -m gpt-4-with-som
-```
-
 ### Voice Mode `--voice`
 The framework supports voice inputs for the objective. Try voice by following the instructions below.
 **Clone the repo** to a directory on your computer:
@@ -161,6 +141,28 @@ Run with voice mode
 operate --voice
 ```
 
+### Optical Character Recognition Mode `-m gpt-4-with-ocr`
+The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` elements by text and then the code references the hash map to get the coordinates for that element GPT-4 wanted to click.
+
+Based on recent tests, OCR performs better than `som` and vanilla GPT-4 so we made it the default for the project. To use the OCR mode you can simply write:
+
+`operate` or `operate -m gpt-4-with-ocr` will also work.
+
+### Set-of-Mark Prompting `-m gpt-4-with-som`
+The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
+
+Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).
+
+For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
+
+Start `operate` with the SoM model
+
+```
+operate -m gpt-4-with-som
+```
+
+
+
 ## Contributions are Welcomed!:
 
 If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).
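
For quick reference, the sections relocated by this commit all describe ways of launching `operate`. The commands below are a recap of those already shown in the README text above (a usage sketch, assuming the framework is installed):

```
# OCR mode (the project default)
operate
operate -m gpt-4-with-ocr

# Set-of-Mark prompting, using the bundled YOLOv8 button-detection weights
operate -m gpt-4-with-som

# Voice input for the objective
operate --voice
```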
