This Windows batch script helps set up a Mingw-w64 compiler environment for building ffmpeg and other media tools under Windows.

Shell 1,610 271 Updated Feb 26, 2025

Scripts to build a trimmed-down Windows 11 image.

PowerShell 11,256 930 Updated Nov 17, 2024

Automatically convert epubs to audiobooks

Python 197 17 Updated Feb 13, 2025

A real-time silent speech recognition tool.

Python 468 33 Updated Feb 3, 2025

Auto-AVSR: Lip-Reading Sentences Project

Python 315 48 Updated Jan 8, 2025

Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.

TypeScript 32,229 3,140 Updated Mar 1, 2025
Python 3,458 319 Updated Feb 24, 2025

A machine learning-based video super resolution and frame interpolation framework. Est. Hack the Valley II, 2018.

C++ 12,394 1,118 Updated Feb 24, 2025

AI app store powered by 24/7 desktop history. open source | 100% local | dev friendly | 24/7 screen, mic recording

TypeScript 12,462 875 Updated Feb 28, 2025

MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

Python 18,736 1,339 Updated Feb 21, 2025

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Python 2,770 168 Updated Jan 22, 2025

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch

Python 1,106 89 Updated Dec 12, 2023

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsens…

Python 968 105 Updated Feb 27, 2023

[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Python 41 3 Updated Dec 25, 2024

[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset

Python 90 11 Updated Sep 6, 2023

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Python 3,232 214 Updated Mar 5, 2024

User-friendly AI Interface (Supports Ollama, OpenAI API, ...)

JavaScript 79,869 9,560 Updated Mar 1, 2025

[TMM 2023] VideoXum: Cross-modal Visual and Textual Summarization of Videos

Python 40 2 Updated Apr 9, 2024

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Python 1,438 142 Updated Dec 8, 2023

GIT: A Generative Image-to-text Transformer for Vision and Language

Python 557 69 Updated Dec 2, 2023

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

Python 5,532 939 Updated Feb 3, 2025

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Python 3,171 229 Updated Dec 3, 2024

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Python 14,358 2,096 Updated Jul 24, 2024
Python 177 9 Updated Jul 12, 2024

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Python 1,087 72 Updated Jan 23, 2025

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)

Python 772 45 Updated Jul 29, 2024

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Python 1,708 101 Updated Feb 27, 2025

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Python 277 17 Updated Dec 25, 2024

Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch

Python 100 4 Updated Oct 10, 2023

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

Python 4,466 1,265 Updated Aug 14, 2024