Tensor backend core dumped #208

Open
muaydin opened this issue May 6, 2024 · 4 comments
muaydin commented May 6, 2024

Here is my nvidia-smi output (screenshot attached).

python -c "import torch; import tensorrt; import tensorrt_llm" works fine.

When a client connects, the server core dumps with an error related to the libcudnn_cnn_infer library.
Here is the relevant part of the log:

60      0x7ff18fd2dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff18fd2dac3]
61      0x7ff18fdbebf4 clone + 68
Could not load library libcudnn_cnn_infer.so.8. Error: /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8: undefined symbol: _ZN5cudnn14cublasSaxpy_v2EP13cublasContextiPKfS3_iPfi, version libcudnn_ops_infer.so.8
[e41f6f59a514:02294] *** Process received signal ***
[e41f6f59a514:02294] Signal: Aborted (6)
[e41f6f59a514:02294] Signal code:  (-6)
[e41f6f59a514:02294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff18fcdb520]
[e41f6f59a514:02294] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff18fd2f9fc]
[e41f6f59a514:02294] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff18fcdb476]
[e41f6f59a514:02294] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff18fcc17f3]

What could be the reason?

My Ubuntu version: running cat /etc/os-release on the Azure VM gives the following.
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

Your Docker image's Ubuntu version (running on the 20.04 host) is:

root@e41f6f59a514:/home/WhisperLive# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

Can it be related to Ubuntu 22.04?
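
For triage, here is a minimal check (a sketch, assuming only that PyTorch is installed, which the import test above already shows) that prints which CUDA and cuDNN builds the Python stack actually resolves:

import torch
print("torch built for CUDA:", torch.version.cuda)
print("cuDNN loaded by torch:", torch.backends.cudnn.version())
print("GPU visible:", torch.cuda.is_available())

If the cuDNN version reported here differs from the copy in /lib/x86_64-linux-gnu, that would be consistent with the undefined-symbol abort above.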


muaydin commented May 6, 2024

Fixed it with:

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
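
A quick sanity check that the loader now resolves the sub-library that crashed before (a minimal sketch; the library name is taken from the error above):

import ctypes
# dlopen the same library that failed earlier; a mismatched libcudnn_ops_infer
# on the search path would raise OSError with the same undefined-symbol message.
ctypes.CDLL("libcudnn_cnn_infer.so.8")
print("libcudnn_cnn_infer.so.8 loaded OK")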

but now I am getting a "TensorRT-LLM not supported" error:

python3 run_server.py --port 9090                       --backend tensorrt                       --trt_model_path "/root/TensorRT-LLM-examples/whisper/whisper_small"
[05/06/2024-09:07:24] TensorRT-LLM not supported: [TensorRT-LLM][ERROR] CUDA runtime error in cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr, cubTempStorageSize, logProbs, (T*) nullptr, idVals, (int*) nullptr, vocabSize * batchSize, batchSize, beginOffsetBuf, offsetBuf + 1, 0, sizeof(T) * 8, stream): no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
1       0x7f4b9c74b825 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7f4b9c837858 void tensorrt_llm::kernels::invokeBatchTopPSampling<__half>(void*, unsigned long&, unsigned long&, int**, int*, tensorrt_llm::kernels::FinishedState const*, tensorrt_llm::kernels::FinishedState*, float*, float*, __half const*, int const*, int*, int*, curandStateXORWOW*, int, unsigned long, int const*, float, float const*, CUstream_st*, bool const*) + 2200

no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
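
"no kernel image is available for execution on the device" usually means the binaries were not compiled for this GPU's compute capability. A quick way to see what the device reports (a minimal sketch, not taken from the original logs):

import torch
major, minor = torch.cuda.get_device_capability(0)  # a Tesla T4 reports (7, 5), i.e. SM 75
print(f"compute capability: {major}.{minor}")

The reported value is what the CUDA architecture list used when building TensorRT-LLM needs to cover.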


muaydin commented May 8, 2024

If I try to build the TensorRT-LLM container manually, I eventually get:

python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/tensorrt_llm/examples/whisper/whisper_small"
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/08/2024-09:19:41] TensorRT-LLM not supported: Trying to create tensor with negative dimension -1: [-1, 1500, 768]

GPU: Tesla T4.
I built TensorRT-LLM with:
make -C docker release_build CUDA_ARCHS="75"

Note: it also threw an exception:
[05/08/2024-09:11:46] TensorRT-LLM not supported: ModelConfig.__init__() missing 2 required positional arguments: 'max_batch_size' and 'max_beam_width'
I fixed it by adding:

decoder_model_config = ModelConfig(
            max_batch_size=self.decoder_config['max_batch_size'],
            max_beam_width=self.decoder_config['max_beam_width'],
...
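
For anyone applying the same patch, here is a sketch to confirm those two values exist in the built engine's config. The path and the "builder_config" key are assumptions on my side; the layout differs between TensorRT-LLM releases:

import json
# Assumed layout: <trt_model_path>/decoder/config.json with a "builder_config" section.
with open("/app/tensorrt_llm/examples/whisper/whisper_small/decoder/config.json") as f:
    builder_cfg = json.load(f).get("builder_config", {})
print(builder_cfg.get("max_batch_size"), builder_cfg.get("max_beam_width"))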

@makaveli10
Collaborator

Thanks for reporting and tracking the issue; we are looking into this on our end as well.

@peldszus
Contributor

I also ran into those issues.

If you stick to TensorRT-LLM 0.7.1, you get neither the model config error (I applied the same fix as you) nor the negative dimension error (I didn't have time to look deeper into that).

I have a working build in #221; feel free to give it a try.
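
If you are unsure which TensorRT-LLM ends up installed in your environment, a one-line check (a sketch, mirroring the version banner in the logs above):

import tensorrt_llm
print(tensorrt_llm.__version__)  # the logs above show 0.10.0.dev2024050700; 0.7.1 is the version that worked here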
