
add audio optimization for qwen2.5-omni #13037

Merged
merged 3 commits on Apr 7, 2025

Conversation

MeouSker77
Contributor

@MeouSker77 MeouSker77 commented Apr 1, 2025

Description

Optimize the audio part of qwen2.5-omni; the TTS part is not yet supported.

After installing ipex-llm, install the required dependencies:

pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install accelerate==1.5.2
pip install qwen-omni-utils
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from ipex_llm import optimize_model
from qwen_omni_utils import process_mm_info

model_path = r"Qwen2.5-Omni-7B"

model = Qwen2_5OmniModel.from_pretrained(model_path, enable_audio_output=False)
model = optimize_model(model, low_bit="sym_int4",
                       modules_to_not_convert=["audio_tower", "visual", "token2wav"])
model = model.half().to('xpu')

processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

# video input (use audio in video)
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": r"test.mp4"},
        ],
    },
]

# image input
# conversation = [
#     {
#         "role": "system",
#         "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
#     },
#     {
#         "role": "user",
#         "content": [
#             {"type": "image", "image": r"test.png"},
#             {"type": "text", "text": "Describe the image in detail"},
#         ],
#     },
# ]

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# note: use `thinker_max_new_tokens` instead of `max_new_tokens`
text_ids = model.generate(**inputs, use_audio_in_video=True, thinker_max_new_tokens=128)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)

The image processor converts each 14*14 pixel patch into 1 vision-model token, then merges every 4 vision-model tokens into 1 language-model token.

When using video input, the processor automatically sets max_pixels to 768 * 28 * 28.

You can set max_pixels to control an image's token count. For example, 1024*14*14 means the image becomes 1024 tokens in the vision model, which are then merged into 256 tokens for the language model.

For an image:

{"type": "image", "image": r"test.png", "max_pixels": 1024*14*14},

For a video:

{"type": "video", "video": r"test.mp4", "max_pixels": 1024*14*14},
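The patch-to-token arithmetic above can be sketched in plain Python. This helper (`image_token_counts` is an illustrative name, not part of the transformers API) assumes the 14x14 patch size and 4-to-1 merge ratio described in this comment:

```python
def image_token_counts(max_pixels, patch_size=14, merge_ratio=4):
    """Estimate token counts from a pixel budget.

    Each patch_size x patch_size pixel patch becomes one vision-model
    token; every merge_ratio vision-model tokens (4 with the defaults
    here) are merged into one language-model token.
    """
    vision_tokens = max_pixels // (patch_size * patch_size)
    lm_tokens = vision_tokens // merge_ratio
    return vision_tokens, lm_tokens


# max_pixels = 1024*14*14 -> 1024 vision tokens -> 256 LM tokens
print(image_token_counts(1024 * 14 * 14))
```

With max_pixels set to 1024*14*14 this gives 1024 vision-model tokens and 256 language-model tokens, matching the example above; the automatic video cap of 768 * 28 * 28 works out to 3072 vision-model tokens per frame under the same assumptions.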

@MeouSker77
Contributor Author

Added support for the TTS part. To use it, remove enable_audio_output=False from from_pretrained, then use:

text_ids, audio = model.generate(...)

import soundfile as sf

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)

@jason-dai
Contributor

Do we have an example?

@MeouSker77
Contributor Author

Do we have an example?

No, we will need to add one after this PR is merged.

@MeouSker77 MeouSker77 requested a review from rnwang04 April 7, 2025 09:16
Contributor

@rnwang04 rnwang04 left a comment


LGTM

@MeouSker77 MeouSker77 merged commit ef852dc into intel:main Apr 7, 2025
1 check passed
@MeouSker77 MeouSker77 deleted the add-qwen2_5-omni-opt-again branch April 7, 2025 09:20