
Hallucinating conclusive remarks ("Thanks for watching!", "That's all!", etc.) with non-speech noise just above VAD threshold #185

kjhenner opened this issue Mar 19, 2024 · 7 comments

@kjhenner

Running the medium English model with VAD enabled, I've noticed a tendency to hallucinate phrases like "Thanks for watching!", "Thanks!", "That's all," etc. I assume it's receiving some audio data just above the VAD threshold but without any intelligible speech. I'm not really sure about the inner workings of the Whisper model, but it makes some sense that the model would be biased towards seeing these kinds of phrases at the end of its training data. Maybe there's an end token that puts a lot of probability on those phrases in the absence of any other intelligible audio?

I haven't looked at the VAD code at all, so I'm not really sure what the approach would be to address this, but it'd be nice if it's fixable!

@Ye83

Ye83 commented Mar 20, 2024

The same thing happened to me, and it would be great to get this fixed.

@makaveli10
Collaborator

Can you check out no_speech_thresh here

self.no_speech_thresh = 0.45

and try changing it to see if the results improve?
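
For anyone trying this, here is a minimal sketch of how a no-speech filter of this kind typically works. This is not WhisperLive's actual implementation; it assumes faster-whisper-style segments that expose a no_speech_prob field, and filter_segments is a hypothetical helper:

# Hypothetical sketch, not WhisperLive's actual code. Whisper assigns each
# decoded segment a no_speech_prob; assuming segments are dropped when that
# probability exceeds the threshold, hallucinated fillers like
# "Thanks for watching!" on near-silent audio are the segments most likely
# to be filtered out.
NO_SPEECH_THRESH = 0.45  # the default shown above

def filter_segments(segments, no_speech_thresh=NO_SPEECH_THRESH):
    """Yield only segments the model believes contain real speech."""
    for segment in segments:
        if segment.no_speech_prob > no_speech_thresh:
            continue  # likely noise/silence: skip instead of emitting text
        yield segment

Under that assumption, lowering the threshold drops borderline noise-only segments more aggressively, at the risk of also dropping quiet genuine speech.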

@kjhenner
Author

Thanks, I'll take a look!

@Ye83

Ye83 commented Mar 21, 2024

Did you solve it? What value did you set? @kjhenner

@kjhenner
Author

Nothing yet. I just have to get this running on my local system so I can experiment a little. I'll let you know if I find a good solution.

@Siim

Siim commented Mar 28, 2024

I faced a comparable issue with the TalTechNLP/whisper-large-et model. To tackle the problem in my Node.js testing application, I used Silero VAD for initial speech detection. However, the model still struggled, hallucinating and generating random text, even when no data was sent to Whisper Live.

Notably, the model's behavior improved when I transmitted empty data to Whisper Live whenever no speech was detected. This approach is shown in my test code below, in the if (!speaking) { ... } branch.

import { WebSocket } from "ws"
import { v4 as uuidv4 } from 'uuid'
import { spawn } from 'child_process'
import { logger } from './src/logger'
import { SpeechDetector } from "./src/vad/speechDetector"

// Capture microphone audio with ffmpeg: mono, 16 kHz, signed 16-bit PCM.
const ffmpeg = spawn('ffmpeg', [
  '-f', 'avfoundation',
  '-i', ':0', // Make sure the index matches your device
  '-ac', '1',  // Capture in mono
  '-ar', '16000',  // Set sample rate to 16 kHz
  '-f', 's16le',  // Set format to signed 16-bit little-endian
  // Note: each flag and its value must be separate array elements for spawn.
  '-af', 'highpass=f=300,asendcmd=0.0 afftdn sn start,asendcmd=1.5 afftdn sn stop,afftdn=nf=-20,dialoguenhance,lowpass=f=3000',
  '-'
])

// Convert raw 16-bit PCM bytes into normalized float samples in [-1, 1).
function bufferToFloat32Array(buffer: Buffer) {
  const data = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / Int16Array.BYTES_PER_ELEMENT);
  const float32Array = new Float32Array(data.length);
  for (let i = 0; i < data.length; i++) {
    float32Array[i] = data[i] / 0x8000; // divide by 2^15 to normalize
  }
  return float32Array;
}

let speaking = false
SpeechDetector.create(0.9, 0.75).then((speechDetector) => {
  speechDetector.readFromStream(ffmpeg.stdout as any).then(() => {

    speechDetector.on('speechStart', (start: number) => {
      speaking = true
      console.log('Speech start:', start)
    })

    speechDetector.on('speechEnd', (end: number) => {
      speaking = false
      console.log('Speech end:', end)
    })
  })
})

type Transcript = {
  uid: string
  message: string
  segments: Array<{
    start: string
    end: string
    text: string
  }>
}


const whisper = new WebSocket('ws://46.227.xxx.xxx:24882')
const uid = uuidv4()

whisper.on('open', () => {
  logger.info('Whisper connection open')
  whisper.send(
    JSON.stringify({
      uid,
      language: "et",
      task: "transcribe",
      use_vad: true
    })
  )
})

whisper.onmessage = (event) => {
  const data: Transcript = JSON.parse(event.data.toString())
  if (data.uid !== uid) return // ignore messages that are not for this recording
  if (data?.message && data?.message === 'SERVER_READY') {
    console.log('Server ready')
    return
  }

  if (data.message === 'DISCONNECTED') {
    console.log('Server disconnected')
    whisper.close()
    return
  }
  console.log(data.segments)
}


whisper.on('open', async () => {
  let windowSizeSamples = 512
  let sampleBuffer = new Float32Array(windowSizeSamples); // Buffer for accumulating samples
  let bufferIndex = 0; // Index for the next sample in the buffer

  ffmpeg.stdout.on('data', (chunk: Buffer) => {
    if (whisper.readyState !== WebSocket.OPEN) {
      logger.error('Whisper not open')
      return
    }
    if (!speaking) {
      whisper.send(new Float32Array(80000).buffer) // 5 x 16k samples of empty data
      return
    }
    const audioData = bufferToFloat32Array(chunk)
    for (let sample of audioData) {
      sampleBuffer[bufferIndex++] = sample;
      if (bufferIndex === windowSizeSamples) {
        whisper.send(Buffer.from(sampleBuffer.buffer))
        bufferIndex = 0
        sampleBuffer = new Float32Array(windowSizeSamples)
      }
    }
  })
})

I'm still not completely free of the issue, so I'm experimenting with different params.

@Hkaisense

Isn't there an expert who can solve this annoying problem?
