
Add Computer Use with Voice Control Interface #217

Merged 1 commit into main on Feb 25, 2025
Conversation

@Blaizzy (Owner) commented Feb 25, 2025

Overview

This PR introduces Computer Use powered by MLX-VLM, enabling AI-driven control of Mac interfaces through visual understanding. The system processes screenshots to comprehend application states and execute contextually appropriate actions based on user instructions. Voice control has been added as an optional input method for hands-free operation.

Current Status

The implementation delivers a functional GUI Agent (Level 1) with:

  • Visual understanding of screen elements and UI context
  • Action execution based on user commands
  • Dual input options: standard text interface (main.py) and new voice control (main_voice.py)
  • Privacy-focused local processing on Apple Silicon

Key Features

  • Visual Intelligence: Interprets on-screen content and application states
  • Cross-Application Control: Works across native Mac applications and interfaces
  • Flexible Input: Choose between typing commands or using voice input
  • Local Processing: Speech recognition via mlx-whisper runs entirely on-device
  • Mac-Optimized: Performance-tuned for Apple Silicon with MLX framework
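The "Flexible Input" feature can be pictured as a single dispatch over the input mode. This is an illustrative sketch, not the PR's actual API: the function and parameter names below are assumptions. In main_voice.py, the transcription step would wrap on-device mlx-whisper speech recognition; in main.py, commands come from the text prompt.

```python
# Hypothetical sketch of the dual input paths (text vs. voice).
# Names are illustrative; the PR does not publish this exact API.
from typing import Callable, Optional

def get_command(mode: str,
                transcribe: Optional[Callable[[], str]] = None,
                prompt: Callable[[str], str] = input) -> str:
    """Return the next user command from either input method."""
    if mode == "voice":
        if transcribe is None:
            raise ValueError("voice mode requires a transcribe function")
        # In main_voice.py this would record audio and run mlx-whisper.
        return transcribe()
    # In main.py the command is simply typed at a prompt.
    return prompt("> ")
```

Keeping both entry points behind one function is what lets the rest of the pipeline stay identical regardless of how the command arrived.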

Usage Examples

Control your Mac through either text or voice commands:

"Open Safari"
"Click on notifications"
"Search for MLX"
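Commands like the ones above have to be normalized into structured actions before execution. The following is a minimal sketch of what such a parser could look like; the `Action` type and `parse_command` function are hypothetical, not the PR's implementation.

```python
# Hypothetical command parser: normalizes a raw instruction (typed or
# transcribed from voice) into a structured action. Illustrative only.
import re
from dataclasses import dataclass

@dataclass
class Action:
    verb: str      # e.g. "open", "click", "search"
    target: str    # e.g. "safari", "notifications"

def parse_command(raw: str) -> Action:
    """Normalize a text or voice command into an Action."""
    text = raw.strip().strip('"').lower()
    # Voice transcripts often carry filler words; drop a few common ones.
    text = re.sub(r"\b(please|could you|can you)\b", "", text).strip()
    match = re.match(r"(open|click(?: on)?|search for)\s+(.+)", text)
    if not match:
        raise ValueError(f"Unrecognized command: {raw!r}")
    verb = match.group(1).split()[0]   # "click on" -> "click"
    return Action(verb=verb, target=match.group(2))

parse_command("Open Safari")            # Action(verb='open', target='safari')
parse_command("Click on notifications") # Action(verb='click', target='notifications')
```

Lower-casing and filler-word stripping matter more for the voice path, where transcripts are less uniform than typed input.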

Technical Implementation

  • Voice recognition pipeline integrated with existing visual processing system
  • Command parser standardizes inputs from both text and voice sources
  • Screenshot analysis and action execution framework remains consistent across input methods
  • Memory footprint optimized to maintain responsiveness on M-series chips
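The shared screenshot-analysis-and-execution loop described above can be sketched as one step function that both input methods feed into. All names here are assumptions for illustration; the real system hands the screenshot and instruction to an MLX-VLM model, which this sketch abstracts behind an `analyze` callback.

```python
# Illustrative sketch of the shared screenshot -> action step that is
# consistent across input methods. Function names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    instruction: str   # user command (typed or transcribed)
    screenshot: bytes  # current screen capture

def run_step(step: Step,
             analyze: Callable[[bytes, str], str],
             execute: Callable[[str], None]) -> str:
    """One iteration: interpret the screen, then act on it."""
    # The vision-language model receives the screenshot plus the
    # instruction and returns the next UI action as text.
    action = analyze(step.screenshot, step.instruction)
    # The action executor (clicks, keystrokes) is the same for both
    # the text and voice entry points.
    execute(action)
    return action
```

Because the loop takes the command as plain text, swapping typed input for a voice transcript changes nothing downstream.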

Testing Completed

  • Functionality verified across core macOS applications (Finder, Safari, Chrome, Mail)
  • Performance benchmarked on M1 and M3 devices
  • Voice recognition tested with multiple accent patterns and ambient conditions

Roadmap

This implementation is a step toward Level 2 (Autonomous GUI Agent) capabilities, with planned enhancements:

  • Voice feedback system for confirmation and status updates
  • Enhanced reasoning for multi-step task planning
  • Contextual memory for maintaining state across complex operations

@Blaizzy Blaizzy changed the title Add Computer Use Add Computer Use with Voice Control Interface Feb 25, 2025
@Blaizzy Blaizzy merged commit 39242eb into main Feb 25, 2025
1 check passed
@lin72h commented Feb 26, 2025

Incredible! I just discovered a secret PR that's packed with innovative ideas.
