Stars
This Windows batch script helps set up a MinGW-w64 compiler environment for building FFmpeg and other media tools under Windows.
Scripts to build a trimmed-down Windows 11 image.
Automatically convert EPUBs to audiobooks
A real-time silent speech recognition tool.
Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.
A machine learning-based video super resolution and frame interpolation framework. Est. Hack the Valley II, 2018.
AI app store powered by 24/7 desktop history. Open source | 100% local | dev friendly | 24/7 screen and mic recording
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, and visual commonsense reasoning).
[ICLR 2024] Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on the TVCaption dataset
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)
[TMM 2023] VideoXum: Cross-modal Visual and Textual Summarization of Videos
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
GIT: A Generative Image-to-text Transformer for Vision and Language
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Official implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
[TPAMI 2024] Code and models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark