
Hallucinating conclusive remarks ("Thanks for watching!", "That's all!", etc.) with non-speech noise just above VAD threshold #185

kjhenner opened this issue Mar 19, 2024 · 7 comments

@kjhenner

Running the medium English model with VAD enabled, I've noticed a tendency to hallucinate phrases like "Thanks for watching!", "Thanks!", "That's all," etc. I assume it's receiving some audio data just above the VAD threshold but without any intelligible speech. I'm not really sure about the inner workings of the Whisper model, but it makes some sense that the model would be biased towards seeing these kinds of phrases at the end of its training data. Maybe there's an end token that puts a lot of probability on those phrases in the absence of any other intelligible audio?

I haven't looked at the VAD code at all, so I'm not really sure what the approach would be to address this, but it'd be nice if it's fixable!

@Ye83

Ye83 commented Mar 20, 2024

The same thing happened to me, and it would be great to get this fixed.

@makaveli10
Collaborator

Can you check out no_speech_thresh here

self.no_speech_thresh = 0.45

and try changing it to see if the results improve?
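
For anyone trying this, here is a minimal sketch of how a no-speech filter of this kind typically works. This is not WhisperLive's actual implementation; it assumes faster-whisper-style segments that expose a no_speech_prob field, and filter_segments is a hypothetical helper:

# Hypothetical sketch, not WhisperLive's actual code. Whisper assigns each
# decoded segment a no_speech_prob; assuming segments are dropped when that
# probability exceeds the threshold, hallucinated fillers like
# "Thanks for watching!" on near-silent audio are the segments most likely
# to be filtered out.
NO_SPEECH_THRESH = 0.45  # the default shown above

def filter_segments(segments, no_speech_thresh=NO_SPEECH_THRESH):
    """Yield only segments the model believes contain real speech."""
    for segment in segments:
        if segment.no_speech_prob > no_speech_thresh:
            continue  # likely noise/silence: skip instead of emitting text
        yield segment

Under that assumption, lowering the threshold drops borderline noise-only segments more aggressively, at the risk of also dropping quiet genuine speech.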

@kjhenner
Author

Thanks, I'll take a look!

@Ye83

Ye83 commented Mar 21, 2024

Did you solve it? What value did you set? @kjhenner

@kjhenner
Author

Nothing yet. I just have to get this running on my local system so I can experiment a little. I'll let you know if I find a good solution.

@Siim

Siim commented Mar 28, 2024

I faced a comparable issue with the TalTechNLP/whisper-large-et model. To tackle the problem in my Node.js testing application, I used Silero VAD for initial speech detection. However, the model still struggled, hallucinating and generating random text, even when no data was sent to Whisper Live.

Notably, the model's behavior improved when I transmitted empty data to Whisper Live whenever no speech was detected. This approach is shown in my test code below, in the if (!speaking) { ... } branch.

import { WebSocket } from "ws"
import { v4 as uuidv4 } from 'uuid'
import { spawn } from 'child_process'
import { logger } from './src/logger'
import { SpeechDetector } from "./src/vad/speechDetector"

// Capture microphone audio with ffmpeg: mono, 16 kHz, signed 16-bit PCM.
const ffmpeg = spawn('ffmpeg', [
  '-f', 'avfoundation',
  '-i', ':0', // Make sure the index matches your device
  '-ac', '1',  // Capture in mono
  '-ar', '16000',  // Set sample rate to 16 kHz
  '-f', 's16le',  // Set format to signed 16-bit little-endian
  // Note: each flag and its value must be separate array elements for spawn.
  '-af', 'highpass=f=300,asendcmd=0.0 afftdn sn start,asendcmd=1.5 afftdn sn stop,afftdn=nf=-20,dialoguenhance,lowpass=f=3000',
  '-'
])

// Convert raw 16-bit PCM bytes into normalized float samples in [-1, 1).
function bufferToFloat32Array(buffer: Buffer) {
  const data = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / Int16Array.BYTES_PER_ELEMENT);
  const float32Array = new Float32Array(data.length);
  for (let i = 0; i < data.length; i++) {
    float32Array[i] = data[i] / 0x8000; // divide by 2^15 to normalize
  }
  return float32Array;
}

let speaking = false
SpeechDetector.create(0.9, 0.75).then((speechDetector) => {
  speechDetector.readFromStream(ffmpeg.stdout as any).then(() => {

    speechDetector.on('speechStart', (start: number) => {
      speaking = true
      console.log('Speech start:', start)
    })

    speechDetector.on('speechEnd', (end: number) => {
      speaking = false
      console.log('Speech end:', end)
    })
  })
})

type Transcript = {
  uid: string
  message: string
  segments: Array<{
    start: string
    end: string
    text: string
  }>
}


const whisper = new WebSocket('ws://46.227.xxx.xxx:24882')
const uid = uuidv4()

whisper.on('open', () => {
  logger.info('Whisper connection open')
  whisper.send(
    JSON.stringify({
      uid,
      language: "et",
      task: "transcribe",
      use_vad: true
    })
  )
})

whisper.onmessage = (event) => {
  const data: Transcript = JSON.parse(event.data.toString())
  if (data.uid !== uid) return // ignore messages that are not for this recording
  if (data?.message && data?.message === 'SERVER_READY') {
    console.log('Server ready')
    return
  }

  if (data.message === 'DISCONNECTED') {
    console.log('Server disconnected')
    whisper.close()
    return
  }
  console.log(data.segments)
}


whisper.on('open', async () => {
  let windowSizeSamples = 512
  let sampleBuffer = new Float32Array(windowSizeSamples); // Buffer for accumulating samples
  let bufferIndex = 0; // Index for the next sample in the buffer

  ffmpeg.stdout.on('data', (chunk: Buffer) => {
    if (whisper.readyState !== WebSocket.OPEN) {
      logger.error('Whisper not open')
      return
    }
    if (!speaking) {
      whisper.send(new Float32Array(80000).buffer) // 5 x 16k samples of empty data
      return
    }
    const audioData = bufferToFloat32Array(chunk)
    for (let sample of audioData) {
      sampleBuffer[bufferIndex++] = sample;
      if (bufferIndex === windowSizeSamples) {
        whisper.send(Buffer.from(sampleBuffer.buffer))
        bufferIndex = 0
        sampleBuffer = new Float32Array(windowSizeSamples)
      }
    }
  })
})

I'm still not completely free of the issue, so I'm experimenting with different params.

@Hkaisense

Isn't there an expert who can solve this annoying problem?
