Stars
This Windows batch script helps set up a MinGW-w64 compiler environment for building FFmpeg and other media tools under Windows.
Scripts to build a trimmed-down Windows 11 image.
Automatically convert EPUBs to audiobooks
A real-time silent speech recognition tool.
Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.
A machine learning-based video super resolution and frame interpolation framework. Est. Hack the Valley II, 2018.
AI app store powered by 24/7 desktop history. Open source | 100% local | dev friendly | 24/7 screen and mic recording
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, and visual commonsense reasoning).
[ICLR 2024] Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on the TVCaption dataset
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)
[TMM 2023] VideoXum: Cross-modal Visual and Textual Summarization of Videos
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
GIT: A Generative Image-to-text Transformer for Vision and Language
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Official implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
[TPAMI 2024] Code and models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark