feat(video): add video recording and transcription tools for mobile devices #160
Merged
Changes from 1 commit
15 commits
8c05a44 feat(video): add video recording tools for mobile devices (plfavreau)
3e4a2b7 fix(video): make prompt optional in stop_video_recording (plfavreau)
978787b feat(video): add compression for Gemini API limits (plfavreau)
1f18e1b docs(example): reorder steps to record before play (plfavreau)
5f6d143 feat(video): make video_analyzer optional and validate when video rec… (plfavreau)
f66ecea feat(video): add CLI flag to enable video recording tools (plfavreau)
fef8039 chore: remove unused VS Code workspace file (plfavreau)
ae82a93 fix(config): correct override file name in recommended config comment (plfavreau)
5444119 feat(video): add ffmpeg availability check with platform-specific ins… (plfavreau)
f6ba10b docs(video): streamline video recording tool descriptions and add pla… (plfavreau)
8b82396 refactor(video): rename video_recording flag to with_video_recording_… (plfavreau)
5ee475f refactor(video): move ffmpeg check import to top-level imports (plfavreau)
dbe6c80 chore(doc): Update graph documentation [skip ci] (plfavreau)
075a674 fix(video): cleanup compressed files and use shutil for safer file op… (plfavreau)
6cc7ebf chore(config): update cortex model to gemini-3-pro-preview with gemin… (plfavreau)
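Commit 5444119 adds an ffmpeg availability check, but its code is not part of the diff shown here. As a rough illustration only, such a check can be as small as a PATH lookup. The function name `ffmpeg_available` is hypothetical and not taken from the PR, and the real check may do more than this.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if `executable` can be found on the current PATH.

    Hypothetical helper: the PR's actual check is not visible in this view.
    """
    # shutil.which returns the resolved path as a string, or None if not found.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found on PATH")
    else:
        print("ffmpeg not found on PATH; video compression would be unavailable")
```

A PATH lookup avoids spawning a subprocess, but it only confirms that an executable with that name exists, not that it runs correctly.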
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| { | ||
| "folders": [ | ||
| { | ||
| "path": ".." | ||
| } | ||
| ], | ||
| "settings": { | ||
| "python.languageServer": "None" | ||
| } | ||
| } | ||
Review comment (Collaborator, Author): This file was pushed by mistake and is removed in a later commit.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| """Video analyzer utility for analyzing video content with Gemini models.""" | ||
|
|
||
| from minitap.mobile_use.agents.video_analyzer.video_analyzer import analyze_video | ||
|
|
||
| __all__ = ["analyze_video"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| Please analyze the following video recording and respond to my request. | ||
|
|
||
| --- | ||
|
|
||
| **My Request**: {{ prompt }} |
minitap/mobile_use/agents/video_analyzer/video_analyzer.md (37 additions, 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| ## You are a **Video Analysis Assistant** | ||
|
|
||
| You analyze video recordings of mobile device screens and provide accurate, detailed responses based on what you observe. | ||
|
|
||
| --- | ||
|
|
||
| ## Your Focus Areas | ||
|
|
||
| When analyzing videos, pay attention to: | ||
|
|
||
| - **UI elements** and their states (buttons, text fields, toggles, etc.) | ||
| - **Text content** displayed on screen | ||
| - **Actions that occur** (taps, scrolls, transitions, animations) | ||
| - **Notifications or dialogs** that appear | ||
| - **Changes in the interface** over time | ||
| - **Audio content** if present (transcribe speech, describe sounds) | ||
|
|
||
| --- | ||
|
|
||
| ## Guidelines | ||
|
|
||
| - **Be precise and factual** - Only describe what you can actually see or hear | ||
| - **Note uncertainty** - If you cannot clearly see or determine something, say so | ||
| - **Be thorough** - Capture all relevant details that relate to the user's question | ||
| - **Use timestamps** when describing sequences of events (e.g., "At 0:05, the user taps...") | ||
| - **Structure your response** clearly when there's a lot of information | ||
|
|
||
| --- | ||
|
|
||
| ## Response Format | ||
|
|
||
| Adapt your response format to the user's request: | ||
|
|
||
| - For **transcription requests**: Provide clean, readable text of what was spoken or displayed | ||
| - For **description requests**: Give a chronological narrative of events | ||
| - For **specific questions**: Answer directly and concisely | ||
| - For **extraction requests**: List items clearly (e.g., notifications, text content) |
minitap/mobile_use/agents/video_analyzer/video_analyzer.py (99 additions, 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| """ | ||
| Video Analyzer utility for analyzing video content using Gemini models. | ||
|
|
||
| This utility sends video files to video-capable Gemini models for analysis | ||
| and returns text descriptions based on the provided prompt. | ||
| """ | ||
|
|
||
| import base64 | ||
| from pathlib import Path | ||
|
|
||
| from jinja2 import Template | ||
| from langchain_core.messages import HumanMessage, SystemMessage | ||
|
|
||
| from minitap.mobile_use.context import MobileUseContext | ||
| from minitap.mobile_use.services.llm import get_llm, invoke_llm_with_timeout_message, with_fallback | ||
| from minitap.mobile_use.utils.logger import get_logger | ||
|
|
||
| logger = get_logger(__name__) | ||
|
|
||
|
|
||
| async def analyze_video( | ||
| ctx: MobileUseContext, | ||
| video_path: Path, | ||
| prompt: str, | ||
| ) -> str: | ||
| """ | ||
| Analyze a video file using a video-capable Gemini model. | ||
|
|
||
| Args: | ||
| ctx: The MobileUseContext containing LLM configuration | ||
| video_path: Path to the video file (MP4) | ||
| prompt: The analysis prompt/question about the video | ||
|
|
||
| Returns: | ||
| Text analysis result from the model | ||
|
|
||
| Raises: | ||
| Exception: If video analysis fails | ||
| """ | ||
| logger.info(f"Starting video analysis for {video_path}") | ||
|
|
||
| if not video_path.exists(): | ||
| raise FileNotFoundError(f"Video file not found: {video_path}") | ||
|
|
||
| with open(video_path, "rb") as video_file: | ||
| video_bytes = video_file.read() | ||
|
|
||
| video_base64 = base64.b64encode(video_bytes).decode("utf-8") | ||
|
|
||
| suffix = video_path.suffix.lower() | ||
| mime_type = "video/mp4" if suffix in [".mp4", ".m4v"] else f"video/{suffix[1:]}" | ||
|
|
||
| system_message_content = Template( | ||
| Path(__file__).parent.joinpath("video_analyzer.md").read_text(encoding="utf-8") | ||
| ).render() | ||
|
|
||
| human_message_content = Template( | ||
| Path(__file__).parent.joinpath("human.md").read_text(encoding="utf-8") | ||
| ).render(prompt=prompt) | ||
|
|
||
| messages = [ | ||
| SystemMessage(content=system_message_content), | ||
| HumanMessage( | ||
| content=[ | ||
| { | ||
| "type": "text", | ||
| "text": human_message_content, | ||
| }, | ||
| { | ||
| "type": "file", | ||
| "source_type": "base64", | ||
| "mime_type": mime_type, | ||
| "data": video_base64, | ||
| }, | ||
| ] | ||
| ), | ||
| ] | ||
|
|
||
| llm = get_llm(ctx=ctx, name="video_analyzer", is_utils=True, temperature=0.2) | ||
| llm_fallback = get_llm( | ||
| ctx=ctx, name="video_analyzer", is_utils=True, use_fallback=True, temperature=0.2 | ||
| ) | ||
|
|
||
| logger.info("Sending video to LLM for analysis...") | ||
|
|
||
| response = await with_fallback( | ||
| main_call=lambda: invoke_llm_with_timeout_message( | ||
| llm.ainvoke(messages), timeout_seconds=120 | ||
| ), | ||
| fallback_call=lambda: invoke_llm_with_timeout_message( | ||
| llm_fallback.ainvoke(messages), timeout_seconds=120 | ||
| ), | ||
| ) | ||
|
|
||
| content = response.content if hasattr(response, "content") else str(response) | ||
| result = content if isinstance(content, str) else str(content) | ||
| logger.info("Video analysis completed") | ||
|
|
||
| return result |
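In `analyze_video` above, the MIME type for any suffix other than `.mp4` or `.m4v` is derived as `video/<suffix>`. For a `.mov` file that gives `video/mov`, whereas the registered type for QuickTime files is `video/quicktime`. The following is a minimal sketch of one alternative: an explicit lookup table that rejects unknown extensions. The names `_VIDEO_MIME_TYPES`, `guess_video_mime_type` and `build_video_block` are hypothetical and not part of the PR, and the table is illustrative rather than a list of formats the model accepts. The returned dictionary copies the shape of the file block in the diff.

```python
import base64
from pathlib import Path

# Hypothetical lookup table (not from the PR); extend it as needed.
_VIDEO_MIME_TYPES = {
    ".mp4": "video/mp4",
    ".m4v": "video/mp4",  # same mapping the PR uses for .m4v
    ".mov": "video/quicktime",
    ".webm": "video/webm",
}


def guess_video_mime_type(video_path: Path) -> str:
    """Map a file suffix to a MIME type, failing loudly on unknown suffixes."""
    suffix = video_path.suffix.lower()
    try:
        return _VIDEO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported video extension: {suffix!r}") from None


def build_video_block(video_path: Path) -> dict[str, str]:
    """Build a base64 file content block shaped like the one in the diff."""
    encoded = base64.b64encode(video_path.read_bytes()).decode("utf-8")
    return {
        "type": "file",
        "source_type": "base64",
        "mime_type": guess_video_mime_type(video_path),
        "data": encoded,
    }
```

Raising on an unknown suffix surfaces the problem before the request is sent, at the cost of having to maintain the table by hand.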