Tensor backend core dumped #208

Open
muaydin opened this issue May 6, 2024 · 4 comments
muaydin commented May 6, 2024

Here is my nvidia-smi output (screenshot attached).

python -c "import torch; import tensorrt; import tensorrt_llm" works fine.

When a client connects, the server core dumps with an error related to the libcudnn_cnn_infer library.
Here is the relevant part of the log:

60      0x7ff18fd2dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff18fd2dac3]
61      0x7ff18fdbebf4 clone + 68
Could not load library libcudnn_cnn_infer.so.8. Error: /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8: undefined symbol: _ZN5cudnn14cublasSaxpy_v2EP13cublasContextiPKfS3_iPfi, version libcudnn_ops_infer.so.8
[e41f6f59a514:02294] *** Process received signal ***
[e41f6f59a514:02294] Signal: Aborted (6)
[e41f6f59a514:02294] Signal code:  (-6)
[e41f6f59a514:02294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff18fcdb520]
[e41f6f59a514:02294] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff18fd2f9fc]
[e41f6f59a514:02294] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff18fcdb476]
[e41f6f59a514:02294] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff18fcc17f3]

What could be the reason?

My Ubuntu version: running cat /etc/os-release on the Azure VM gives the following.
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

Your Docker image's Ubuntu version (running on the 20.04 host) is:

root@e41f6f59a514:/home/WhisperLive# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

Can it be related to Ubuntu 22.04?
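
For triage, here is a minimal check (a sketch, assuming only that PyTorch is installed, which the import test above already shows) that prints which CUDA and cuDNN builds the Python stack actually resolves:

import torch
print("torch built for CUDA:", torch.version.cuda)
print("cuDNN loaded by torch:", torch.backends.cudnn.version())
print("GPU visible:", torch.cuda.is_available())

If the cuDNN version reported here differs from the copy in /lib/x86_64-linux-gnu, that would be consistent with the undefined-symbol abort above.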


muaydin commented May 6, 2024

Fixed it with:

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
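
A quick sanity check that the loader now resolves the sub-library that crashed before (a minimal sketch; the library name is taken from the error above):

import ctypes
# dlopen the same library that failed earlier; a mismatched libcudnn_ops_infer
# on the search path would raise OSError with the same undefined-symbol message.
ctypes.CDLL("libcudnn_cnn_infer.so.8")
print("libcudnn_cnn_infer.so.8 loaded OK")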

but now I am getting a "TensorRT-LLM not supported" error:

python3 run_server.py --port 9090                       --backend tensorrt                       --trt_model_path "/root/TensorRT-LLM-examples/whisper/whisper_small"
[05/06/2024-09:07:24] TensorRT-LLM not supported: [TensorRT-LLM][ERROR] CUDA runtime error in cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr, cubTempStorageSize, logProbs, (T*) nullptr, idVals, (int*) nullptr, vocabSize * batchSize, batchSize, beginOffsetBuf, offsetBuf + 1, 0, sizeof(T) * 8, stream): no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
1       0x7f4b9c74b825 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7f4b9c837858 void tensorrt_llm::kernels::invokeBatchTopPSampling<__half>(void*, unsigned long&, unsigned long&, int**, int*, tensorrt_llm::kernels::FinishedState const*, tensorrt_llm::kernels::FinishedState*, float*, float*, __half const*, int const*, int*, int*, curandStateXORWOW*, int, unsigned long, int const*, float, float const*, CUstream_st*, bool const*) + 2200

no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
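
"no kernel image is available for execution on the device" usually means the binaries were not compiled for this GPU's compute capability. A quick way to see what the device reports (a minimal sketch, not taken from the original logs):

import torch
major, minor = torch.cuda.get_device_capability(0)  # a Tesla T4 reports (7, 5), i.e. SM 75
print(f"compute capability: {major}.{minor}")

The reported value is what the CUDA architecture list used when building TensorRT-LLM needs to cover.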


muaydin commented May 8, 2024

If I try to build the TensorRT-LLM container manually, I eventually get:

python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/tensorrt_llm/examples/whisper/whisper_small"
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/08/2024-09:19:41] TensorRT-LLM not supported: Trying to create tensor with negative dimension -1: [-1, 1500, 768]

GPU: Tesla T4.
I built TensorRT-LLM with:
make -C docker release_build CUDA_ARCHS="75"

Note: it also threw an exception:
[05/08/2024-09:11:46] TensorRT-LLM not supported: ModelConfig.__init__() missing 2 required positional arguments: 'max_batch_size' and 'max_beam_width'
I fixed it by adding:

decoder_model_config = ModelConfig(
            max_batch_size=self.decoder_config['max_batch_size'],
            max_beam_width=self.decoder_config['max_beam_width'],
...
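
For anyone applying the same patch, here is a sketch to confirm those two values exist in the built engine's config. The path and the "builder_config" key are assumptions on my side; the layout differs between TensorRT-LLM releases:

import json
# Assumed layout: <trt_model_path>/decoder/config.json with a "builder_config" section.
with open("/app/tensorrt_llm/examples/whisper/whisper_small/decoder/config.json") as f:
    builder_cfg = json.load(f).get("builder_config", {})
print(builder_cfg.get("max_batch_size"), builder_cfg.get("max_beam_width"))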

@makaveli10
Collaborator

Thanks for reporting and tracking the issue; we are looking into this on our end as well.

@peldszus
Contributor

I also ran into those issues.

If you stick to TensorRT-LLM 0.7.1, you get neither the model config error (I applied the same fix as you) nor the negative dimension error (I didn't have time to look deeper into that).

I have a working build in #221; feel free to give it a try.
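
If you are unsure which TensorRT-LLM ends up installed in your environment, a one-line check (a sketch, mirroring the version banner in the logs above):

import tensorrt_llm
print(tensorrt_llm.__version__)  # the logs above show 0.10.0.dev2024050700; 0.7.1 is the version that worked here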
