
Add Computer Use with Voice Control Interface #217

Merged 1 commit into main on Feb 25, 2025
Conversation

@Blaizzy (Owner) commented Feb 25, 2025

Overview

This PR introduces Computer Use powered by MLX-VLM, enabling AI-driven control of Mac interfaces through visual understanding. The system processes screenshots to comprehend application states and execute contextually appropriate actions based on user instructions. Voice control has been added as an optional input method for hands-free operation.

Current Status

The implementation delivers a functional GUI Agent (Level 1) with:

  • Visual understanding of screen elements and UI context
  • Action execution based on user commands
  • Dual input options: standard text interface (main.py) and new voice control (main_voice.py)
  • Privacy-focused local processing on Apple Silicon

Key Features

  • Visual Intelligence: Interprets on-screen content and application states
  • Cross-Application Control: Works across native Mac applications and interfaces
  • Flexible Input: Choose between typing commands or using voice input
  • Local Processing: Speech recognition via mlx-whisper runs entirely on-device
  • Mac-Optimized: Performance-tuned for Apple Silicon with MLX framework
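The "Flexible Input" feature can be pictured as a single dispatch over the input mode. This is an illustrative sketch, not the PR's actual API: the function and parameter names below are assumptions. In main_voice.py, the transcription step would wrap on-device mlx-whisper speech recognition; in main.py, commands come from the text prompt.

```python
# Hypothetical sketch of the dual input paths (text vs. voice).
# Names are illustrative; the PR does not publish this exact API.
from typing import Callable, Optional

def get_command(mode: str,
                transcribe: Optional[Callable[[], str]] = None,
                prompt: Callable[[str], str] = input) -> str:
    """Return the next user command from either input method."""
    if mode == "voice":
        if transcribe is None:
            raise ValueError("voice mode requires a transcribe function")
        # In main_voice.py this would record audio and run mlx-whisper.
        return transcribe()
    # In main.py the command is simply typed at a prompt.
    return prompt("> ")
```

Keeping both entry points behind one function is what lets the rest of the pipeline stay identical regardless of how the command arrived.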

Usage Examples

Control your Mac through either text or voice commands:

"Open Safari"
"Click on notifications"
"Search for MLX"
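Commands like the ones above have to be normalized into structured actions before execution. The following is a minimal sketch of what such a parser could look like; the `Action` type and `parse_command` function are hypothetical, not the PR's implementation.

```python
# Hypothetical command parser: normalizes a raw instruction (typed or
# transcribed from voice) into a structured action. Illustrative only.
import re
from dataclasses import dataclass

@dataclass
class Action:
    verb: str      # e.g. "open", "click", "search"
    target: str    # e.g. "safari", "notifications"

def parse_command(raw: str) -> Action:
    """Normalize a text or voice command into an Action."""
    text = raw.strip().strip('"').lower()
    # Voice transcripts often carry filler words; drop a few common ones.
    text = re.sub(r"\b(please|could you|can you)\b", "", text).strip()
    match = re.match(r"(open|click(?: on)?|search for)\s+(.+)", text)
    if not match:
        raise ValueError(f"Unrecognized command: {raw!r}")
    verb = match.group(1).split()[0]   # "click on" -> "click"
    return Action(verb=verb, target=match.group(2))

parse_command("Open Safari")            # Action(verb='open', target='safari')
parse_command("Click on notifications") # Action(verb='click', target='notifications')
```

Lower-casing and filler-word stripping matter more for the voice path, where transcripts are less uniform than typed input.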

Technical Implementation

  • Voice recognition pipeline integrated with existing visual processing system
  • Command parser standardizes inputs from both text and voice sources
  • Screenshot analysis and action execution framework remains consistent across input methods
  • Memory footprint optimized to maintain responsiveness on M-series chips
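The shared screenshot-analysis-and-execution loop described above can be sketched as one step function that both input methods feed into. All names here are assumptions for illustration; the real system hands the screenshot and instruction to an MLX-VLM model, which this sketch abstracts behind an `analyze` callback.

```python
# Illustrative sketch of the shared screenshot -> action step that is
# consistent across input methods. Function names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    instruction: str   # user command (typed or transcribed)
    screenshot: bytes  # current screen capture

def run_step(step: Step,
             analyze: Callable[[bytes, str], str],
             execute: Callable[[str], None]) -> str:
    """One iteration: interpret the screen, then act on it."""
    # The vision-language model receives the screenshot plus the
    # instruction and returns the next UI action as text.
    action = analyze(step.screenshot, step.instruction)
    # The action executor (clicks, keystrokes) is the same for both
    # the text and voice entry points.
    execute(action)
    return action
```

Because the loop takes the command as plain text, swapping typed input for a voice transcript changes nothing downstream.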

Testing Completed

  • Functionality verified across core macOS applications (Finder, Safari, Chrome, Mail)
  • Performance benchmarked on M1 and M3 devices
  • Voice recognition tested with multiple accent patterns and ambient conditions

Roadmap

This implementation is a step toward Level 2 (Autonomous GUI Agent) capabilities, with planned enhancements:

  • Voice feedback system for confirmation and status updates
  • Enhanced reasoning for multi-step task planning
  • Contextual memory for maintaining state across complex operations

@Blaizzy Blaizzy changed the title Add Computer Use Add Computer Use with Voice Control Interface Feb 25, 2025
@Blaizzy Blaizzy merged commit 39242eb into main Feb 25, 2025
1 check passed
@lin72h commented Feb 26, 2025

Incredible! I just discovered a secret PR that's packed with innovative ideas.
