What kind of performance can we expect? #2157

genglinxiao · 2024-05-16T01:26:02Z

I'm experimenting with the streaming mode on an M2 MacBook Air and found that roughly a third of my speech is not recognized. Is that expected, do I need more RAM, or did something else go wrong? I tried both the medium and large-v3 models.
Here's one of the commands and its initial output:
`./stream --model models/ggml-large-v3.bin --language zh --step 0 --length 4000
init: found 2 capture devices:
init: - Capture device #0: 'MacBook Air麦克风'
init: - Capture device #1: 'Microsoft Teams Audio'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init: - sample rate: 16000
init: - format: 33056 (required: 33056)
init: - channels: 1 (required: 1)
init: - samples per frame: 1024
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
whisper_model_load: Metal total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
whisper_init_state: kv self size = 220.20 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
whisper_init_state: kv cross size = 245.76 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
whisper_init_state: compute buffer (conv) = 36.26 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
whisper_init_state: compute buffer (encode) = 934.34 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
whisper_init_state: compute buffer (cross) = 9.38 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
whisper_init_state: compute buffer (decode) = 209.26 MB

main: processing 0 samples (step = 0.0 sec / len = 4.0 sec / keep = 0.0 sec), 4 threads, lang = zh, task = transcribe, timestamps = 1 ...
main: using VAD, will transcribe on speech activity

[Start speaking]
`
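On the RAM question, the log above already carries some numbers: each `ggml_backend_metal_buffer_type_alloc_buffer` line reports a buffer size followed by a pair of values in parentheses. The following is a minimal sketch that totals them, assuming the first value in parentheses is the cumulative Metal allocation in MiB and the second is the device budget (that reading is an interpretation of the log, not something the log states). `summarize_allocations` is a hypothetical helper written for this post and is not part of whisper.cpp.

```python
import re

# Allocation lines copied from the log above.
LOG = """\
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
"""

# Captures: buffer size, running total, budget (all assumed to be MiB).
ALLOC_RE = re.compile(
    r"allocated buffer, size =\s*([\d.]+) MiB, \(\s*([\d.]+) /\s*([\d.]+)\)"
)


def summarize_allocations(log_text):
    """Return (sum of buffer sizes, last running total, budget) in MiB."""
    sizes = []
    running_total = budget = 0.0
    for match in ALLOC_RE.finditer(log_text):
        size, total, limit = map(float, match.groups())
        sizes.append(size)
        # The last matching line carries the final running total.
        running_total, budget = total, limit
    return sum(sizes), running_total, budget


buffers_mib, total_mib, budget_mib = summarize_allocations(LOG)
used_pct = 100 * total_mib / budget_mib
print(f"buffers: {buffers_mib:.2f} MiB")
print(f"in use:  {total_mib:.2f} of {budget_mib:.2f} MiB ({used_pct:.1f}%)")
```

Under that assumption, the model and its working buffers take up a little over a quarter of the reported budget at startup. This only covers the allocations logged during initialization, so it does not by itself rule out memory pressure later on or explain the missed speech.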
