
Microphone or wave-recorder button not work, no voice chat as described #1887

Open
bigmw opened this issue Feb 12, 2025 · 5 comments
Labels
bug (Something isn't working) · frontend (Pertains to the frontend) · needs-triage

Comments

bigmw commented Feb 12, 2025

Describe the bug
I have been using Gradio to build chat apps. I recently found Chainlit and was immediately attracted to it: the chat UI looks very professional and the framework is simple to pick up.
However, I noticed a major bug: the microphone / voice chat does not work. The mic is disabled by default, and after I enabled it and clicked the mic (wave-recorder) button, nothing happened. The browser never showed the "allow access to the microphone?" prompt, the mic was never connected, and the connection attempt seems to hang forever (see the attached screenshots).
I think it is a major bug that a framework specialized in chat cannot talk to or connect to the microphone. This feature has been available for quite a while in frameworks like Gradio, which are less specialized in chat.
This problem was reported a while ago (e.g. #626) but has not been addressed. I tried the workarounds suggested by other users, e.g. deploying over https instead of http, and nothing has worked so far.
Please treat this as the major bug it is and address it. Thank you!

Expected behavior
- A simple way to turn the mic on and off.
- The mic / wave-recorder button should work.
- Voice chat should be supported in general (both the user and the AI can talk).

Screenshots

[Two screenshot attachments]

dosubot (bot) added the bug (Something isn't working) and frontend (Pertains to the frontend) labels on Feb 12, 2025
AidanShipperley (Contributor) commented Feb 13, 2025

Hi @bigmw,

Can you share what code you currently have in your @cl.on_audio_chunk and @cl.on_audio_end decorated functions? Voice chat is supported in Chainlit, but it gives you full control over how you implement the handling of the audio chunks. This lets you pass audio to OpenAI's Realtime API, pass it to a Whisper model for transcription, etc.

Chainlit has an example for setting up realtime audio in its Cookbook, and it worked well for me with some small modifications to fit my use case. Are you able to try that code in your app?

bigmw (Author) commented Feb 14, 2025

Aidan,
Here is the code in my @cl.on_audio_chunk and @cl.on_audio_end decorated functions.
As you mentioned, I also checked the example for realtime audio in the Cookbook, and in addition the Quivr Chatbot Example.
Let me know if you find any problem here. Thank you!

import os
import speech_recognition as sr
from io import BytesIO
import chainlit as cl
from chainlit.element import Element


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    if chunk.isStart:
        buffer = BytesIO()
        # This is required for whisper to recognize the file type
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        # Initialize the session for a new audio stream
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)

    # Write the chunks to a buffer and transcribe the whole audio at the end
    cl.user_session.get("audio_buffer").write(chunk.data)


@cl.on_audio_end
async def on_audio_end(elements: list[Element]):
    # Get the audio buffer from the session
    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
    audio_buffer.seek(0)  # Move the file pointer to the beginning
    audio_file = audio_buffer.read()
    audio_mime_type: str = cl.user_session.get("audio_mime_type")

    input_audio_el = cl.Audio(
        mime=audio_mime_type, content=audio_file, name=audio_buffer.name
    )
    await cl.Message(
        author="You",
        type="user_message",
        content="",
        elements=[input_audio_el, *elements],
    ).send()

    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_buffer.name) as source:
        audio_data = recognizer.record(source)
        try:
            text = recognizer.recognize_google(audio_data)
        except sr.UnknownValueError:
            await cl.Message(content="Sorry, I couldn't understand that audio.").send()
            return  # bail out so `text` is not referenced below when recognition fails
        except sr.RequestError:
            await cl.Message(content="Could not request results, please try again.").send()
            return  # bail out so `text` is not referenced below when the request fails

    msg = cl.Message(author="You", content=text, elements=elements)
    await on_message(message=msg)

AidanShipperley (Contributor)

Hi @bigmw,

Your code looks a lot like the Quivr Chatbot Example you provided, but unfortunately this method of handling audio was changed in Chainlit v2.0.0 to add support for realtime conversations, such as with OpenAI's Realtime API.

You can still get this code to work, but you would need to follow the migration guide from the prerelease. The reason you are seeing the permanent spinning icon is that you do not have a function decorated with @cl.on_audio_start, which is required to begin an audio conversation. You could then run your own voice activity detection (VAD) in on_audio_chunk(), and when the user stops talking, run the code that currently lives in on_audio_end() from inside on_audio_chunk(). Additionally, you would need to remove the elements input argument from on_audio_end(), as it is no longer present post-v2.0.0.
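A minimal sketch of that post-v2.0.0 structure (the handler bodies are placeholders, not Chainlit's reference implementation; only the decorator/signature shape matters here):

import chainlit as cl


@cl.on_audio_start
async def on_audio_start():
    # Returning True accepts the audio connection; without this handler
    # the mic button just spins while waiting for the session to start.
    cl.user_session.set("audio_chunks", [])
    return True


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    # Collect the raw bytes; run your own VAD here and, once the user
    # stops talking, do the transcription/response work that used to
    # live in on_audio_end() before v2.0.0.
    cl.user_session.get("audio_chunks").append(chunk.data)


@cl.on_audio_end
async def on_audio_end():
    # Note: no `elements` argument post-v2.0.0.
    pass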

Alternatively, you could use a Chainlit version pre-v2.0.0 and your code would potentially work.
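For example, one way to pin an older release (assuming a pip-based install):

pip install "chainlit<2.0.0"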

bigmw (Author) commented Feb 16, 2025

Hi Aidan @AidanShipperley,
Thanks for the input and the detailed suggestion. It makes a lot of sense.
However, I still saw the same problem after I updated my script accordingly.

I then followed the second Multi-Modality example in the Chainlit documentation, which includes both text-to-speech and speech-to-text and is very similar to my app.
Note that the example was released/updated recently, after the Chainlit 2.0 release, and is consistent with what you suggested. I still got the same problem.

For demo purposes, I simplified my app script and removed the LLM call and the STT/TTS parts, as shown below. I do now see the app prompt me for mic access, but otherwise it is still the same: the mic does not work and the connection attempt (spinning icon) lasts forever.
In the demo script below, I inserted a few await cl.Message().send() lines for debugging. These show that on_audio_start() does run, but on_audio_chunk() never does; in fact it never even starts, since the first cl.Message().send() line in on_audio_chunk() never fires. Hopefully this gives you a better idea of the bug.
Note that both on_audio_start() and on_audio_chunk() are copied from the official openai-whisper example, except that process_audio() is not called, to keep the demo simple. The problem/bug can be replicated by running the demo app.
Let me know if you have further thoughts/suggestions. Thank you!

import io
import os
import wave
import numpy as np
import audioop
import chainlit as cl

# Define a threshold for detecting silence and a timeout for ending a turn
SILENCE_THRESHOLD = 3500 # Adjust based on your audio level (e.g., lower for quieter audio)
SILENCE_TIMEOUT = 1300.0 # Milliseconds of silence to consider the turn finished


@cl.on_chat_start
async def start_chat():
    msg0="Hello! How can I help you?"
    await cl.Message(content=msg0).send()


@cl.on_audio_start
async def on_audio_start():
    cl.user_session.set("silent_duration_ms", 0)
    cl.user_session.set("is_speaking", False)
    cl.user_session.set("audio_chunks", [])
    # await cl.Message(content="audio starts.").send()
    return True

@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
#     await cl.Message(content="On audio chunk now").send()
    audio_chunks = cl.user_session.get("audio_chunks")
    
    if audio_chunks is not None:
        await cl.Message(content="adding audio chunk..").send()
        audio_chunk = np.frombuffer(chunk.data, dtype=np.int16)
        audio_chunks.append(audio_chunk)
        cl.user_session.set("audio_chunks", audio_chunks)

    # If this is the first chunk, initialize timers and state
    if chunk.isStart:
        await cl.Message(content="first audio chunk..").send()
        cl.user_session.set("last_elapsed_time", chunk.elapsedTime)
        cl.user_session.set("is_speaking", True)
        return

    audio_chunks = cl.user_session.get("audio_chunks")
    last_elapsed_time = cl.user_session.get("last_elapsed_time")
    silent_duration_ms = cl.user_session.get("silent_duration_ms")
    is_speaking = cl.user_session.get("is_speaking")

    # Calculate the time difference between this chunk and the previous one
    time_diff_ms = chunk.elapsedTime - last_elapsed_time
    cl.user_session.set("last_elapsed_time", chunk.elapsedTime)

    # Compute the RMS (root mean square) energy of the audio chunk
    audio_energy = audioop.rms(chunk.data, 2)  # Assumes 16-bit audio (2 bytes per sample)

    if audio_energy < SILENCE_THRESHOLD:
        # Audio is considered silent
        silent_duration_ms += time_diff_ms
        cl.user_session.set("silent_duration_ms", silent_duration_ms)
        if silent_duration_ms >= SILENCE_TIMEOUT and is_speaking:
            cl.user_session.set("is_speaking", False)
            # await process_audio()
            await cl.Message(content="This is an audio response.").send()
    else:
        # Audio is not silent, reset silence timer and mark as speaking
        cl.user_session.set("silent_duration_ms", 0)
        if not is_speaking:
            cl.user_session.set("is_speaking", True)


# @cl.on_audio_end
# async def on_audio_end():
#     pass

@cl.on_message
async def on_message(message: cl.Message):
    await cl.Message(content="This is a reponse.").send()

AidanShipperley (Contributor)

I have not directly tested your code yet; first, could you give me a few things so I can help narrow down where this is happening?

  1. Could you share your .chainlit/config.toml file? Just to ensure that you've set your sample rate to 24000 and everything else is in order (a sketch of the section I mean is at the end of this comment).
  2. Can you share your OS, what browser you are using, and what the browser's version is?
    • I happened to be testing my own audio code, and I noticed that Chainlit's current implementation of the realtime assistant doesn't seem to work in Firefox: a custom sample rate is set for one AudioContext (or another node supplying the microphone stream) while the microphone data arrives at the device's default rate. Edge and Chrome often handle this discrepancy automatically by resampling or allowing inter-context connections, but Firefox enforces stricter rules.
    • Could you try your code with another browser just in case?
  3. After you click the audio button, do any errors print out in either your terminal or in the web browser's developer console (right click -> inspect -> click on Console tab at the top)?
  4. Instead of sending messages to the chat for debugging, which are quite slow to send (compared to how fast on_audio_chunk() will be called), can you try print statements instead and see which functions get called? I usually add print statements at the top of each function.
  5. Was there a reason you commented out on_audio_end()? You may need all three functions defined for it to work, but this is just a guess.

I think with these we can narrow down where the issue is arising from.
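
A sketch of the config.toml section I'm referring to, assuming the audio settings still live under [features.audio] (key names can differ between Chainlit versions, so treat this as illustrative rather than the canonical default):

[features.audio]
    # Sample rate used for the microphone stream; 24000 is the value I'd expect here
    sample_rate = 24000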
