
Docker compose tensorrt #177

Closed
wants to merge 6 commits

Conversation

makaveli10
Collaborator

@makaveli10 commented Mar 11, 2024

  • Optimize the TensorRT Docker build with multi-stage builds.
  • Add a Docker Compose setup to build and run the container with any model size from the command line; the models are built at runtime if they don't already exist in the mounted volume (a rough sketch of this check follows below).

Should close #164
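
For context, a minimal sketch of what such a runtime check could look like. It assumes a hypothetical entrypoint that reads MODEL_SIZE and BACKEND and a model directory on the mounted volume; the paths, variable names and server flags are illustrative (taken from the commands quoted later in this conversation), not the exact code in this PR.

#!/bin/bash
# Illustrative entrypoint sketch: build the TensorRT engine on the first
# start, reuse it on later starts. Paths and script names are assumptions.
set -e

MODEL_SIZE="${MODEL_SIZE:-small.en}"                  # e.g. small, small.en, medium
BACKEND="${BACKEND:-tensorrt}"
MODEL_DIR="/root/models/whisper_${MODEL_SIZE//./_}"   # e.g. whisper_small_en, on the mounted volume

if [ "$BACKEND" = "tensorrt" ]; then
    if [ ! -d "$MODEL_DIR" ] || [ -z "$(ls -A "$MODEL_DIR")" ]; then
        echo "$MODEL_DIR does not exist or is empty. Building whisper..."
        bash scripts/build_whisper_tensorrt.sh /root/TensorRT-LLM-examples "$MODEL_SIZE"
    fi
    exec python3 run_server.py --backend tensorrt --trt_model_path "$MODEL_DIR"
else
    exec python3 run_server.py --backend "$BACKEND"
fi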

@jooni22

jooni22 commented Mar 14, 2024

Hi, is it possible to build the Docker image with CUDA_ARCH=52-virtual?
I have a Tesla M40 on Proxmox as a vGPU, and during the build I get this error:

688.3 ptxas /tmp/tmpxft_00000f2a_00000000-6_customAllReduceKernels.ptx, line 2273; error : Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher

@makaveli10
Collaborator Author

Hello @jooni22, from the error it seems that it's not possible to build on this GPU: the f16 kernels require sm_53 or higher, and CUDA_ARCH=52-virtual targets sm_52.

Contributor

@peldszus left a comment


Hi, I was trying to get WhisperLive running using the TensorRT backend. I saw your comment in #164, which brought me here. If you don't mind, I'd like to bring up a question and two smaller comments here in the MR.

# convert small multilingual model
bash scripts/build_whisper_tensorrt.sh /root/TensorRT-LLM-examples small
# For e.g. 3090 RTX cuda architecture is 86-real
CUDA_ARCH=86-real docker compose build
Contributor


I could successfully build an image with these instructions.

It took 50 min on a 12-core workstation. Most of the time, 38 min, was spent compiling in build-trt-llm.sh.

I am wondering why it is necessary to compile TensorRT-LLM from scratch. What is the advantage over simply installing the pre-compiled wheel? Do the 0.7.1 wheels not support newer archs?

Btw: I appreciate your efforts to clean up the image. In the end, the image is still 38.4 GB, though, but I guess it would have been even bigger without those efforts. :)

Collaborator Author


The pre-compiled wheel doesn't work on all archs, but it's been a while since we tested. Thanks for pointing that out; I'm looking into that next.

--backend tensorrt \
--trt_model_path "path/to/whisper_trt/from/build/step" \
--trt_multilingual
MODEL_SIZE=small.en BACKEND=tensorrt docker compose up
Contributor


Unfortunately, I couldn't get any of the model sizes running like that. I always got Invalid model name:

$ MODEL_SIZE=small.en BACKEND=tensorrt docker compose up
WARN[0000] /.../WhisperLive/docker-compose.yml: `version` is obsolete
[+] Running 0/1
 ⠹ Container whisperlive-whisperlive-tensorrt-1  Recreated  0.3s
Attaching to whisperlive-tensorrt-1
whisperlive-tensorrt-1  | MODEL_SIZE is set to: small.en
whisperlive-tensorrt-1  | BACKEND is set to: tensorrt
whisperlive-tensorrt-1  | Running build-models.sh...
whisperlive-tensorrt-1  | whisper_small_en directory does not exist or is empty. Building whisper
whisperlive-tensorrt-1  | Installing requirements for Whisper TensorRT-LLM ...
whisperlive-tensorrt-1  | Invalid model name: whisper_small_en
whisperlive-tensorrt-1 exited with code 1

Contributor

@peldszus commented Apr 19, 2024


Ah! download_and_build_model() needs to be passed both model_name and output_dir as arguments. Currently only the output_dir is passed, and it is then treated as a model_name in the case statement.
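
To illustrate the fix, a minimal sketch of how the call site and the function could line up. The function body and paths are illustrative, not the exact script in this PR; only the build.py flags and the "Invalid model name" message are taken from it.

#!/bin/bash
set -e

# Hypothetical sketch: pass both the model name and the output directory,
# instead of passing only the output directory and reusing it as the name.
download_and_build_model() {
    local model_name="$1"    # e.g. small.en
    local output_dir="$2"    # e.g. /root/models/whisper_small_en

    case "$model_name" in
        tiny|tiny.en|base|base.en|small|small.en|medium|medium.en|large-v2|large-v3)
            echo "Running build script for $model_name with output directory $output_dir"
            python3 build.py --output_dir "$output_dir" \
                --use_gpt_attention_plugin --use_gemm_plugin \
                --use_bert_attention_plugin --model_name "$model_name"
            ;;
        *)
            echo "Invalid model name: $model_name" >&2
            exit 1
            ;;
    esac
}

# Call site: both arguments, not just the output directory.
download_and_build_model "$MODEL_SIZE" "/root/models/whisper_${MODEL_SIZE//./_}"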

#!/bin/bash -e

apt-get update && apt-get -y install git git-lfs
git clone --depth=1 -b cuda12.2 https://github.com/makaveli10/TensorRT-LLM.git
Contributor


If I see this right, the only difference between your fork+branch and the original TensorRT-LLM repo is the addition of smaller model parameters to examples/whisper/build.py.

If using an official TensorRT-LLM wheel were an option (see comment above), then the smaller model options could be added as a simple diff on the example files, without the need to use a fork.
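
A rough sketch of that wheel-plus-patch alternative, in case it helps; the NVIDIA package index URL, the version pin, the release tag and the patch path are assumptions rather than anything in this PR, so the exact commands should be checked against the TensorRT-LLM installation docs.

#!/bin/bash
set -e

# Install the pre-built TensorRT-LLM wheel instead of compiling from source.
# Index URL and version pin are assumptions; see the TensorRT-LLM docs.
pip3 install "tensorrt_llm==0.7.1" --extra-index-url https://pypi.nvidia.com

# Fetch the whisper example files from the matching upstream release tag...
git clone --depth=1 -b v0.7.1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/whisper

# ...and apply a small local patch that adds the smaller model sizes
# (e.g. small, small.en) to build.py, instead of maintaining a fork.
git apply /path/to/whisperlive/patches/whisper_small_models.diff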

echo "Running build script for $model_name with output directory $output_dir"
python3 build.py --output_dir "$output_dir" --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_name "$model_name"
echo "Running TensorRT-LLM build script for $model_name with output directory $output_dir"
python3 build.py --output_dir "$output_dir" --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_name "$model_name" > /dev/null 2>&1
Contributor


I'd propose not to dump stdout/stderr to /dev/null.

This is actually hiding errors that might be useful to know about when one is starting the service.

This is especially important because the script continues even if one of its commands errored. I'd thus also recommend adding set -e at the beginning of this script and of run-whisperlive.sh.

An example I just ran into: some CUDA error during the TensorRT build of the engine. After the hidden error, the next echo told me that the engine was built, but it wasn't.
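
As a minimal sketch of what that could look like (copying the output to a log file via tee is just one option, not something this script currently does):

#!/bin/bash
# Fail fast: -e aborts on the first failing command, pipefail makes the
# pipeline below report build.py's exit status rather than tee's.
set -eo pipefail

echo "Running TensorRT-LLM build script for $model_name with output directory $output_dir"
# Keep stdout/stderr visible (here also copied to a log file) instead of
# discarding them with "> /dev/null 2>&1".
python3 build.py --output_dir "$output_dir" \
    --use_gpt_attention_plugin --use_gemm_plugin \
    --use_bert_attention_plugin --model_name "$model_name" \
    2>&1 | tee /tmp/whisper_trt_build.log

echo "Engine built at $output_dir"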

@makaveli10 closed this May 29, 2024
@peldszus
Contributor

PS: I have a working TensorRT branch. It is not perfect, but it works and the image is not too big. It does not have the compile-model-at-start option yet, though. I was considering opening an MR, if you like. Or you could just have a look and I can share my learnings.

@makaveli10
Collaborator Author

Feel free to open a PR or share the branch, whatever works best for you. I looked into the pip install and it works, so I have already started updating the setup.

@peldszus
Contributor

Sure, here you go: #221

Successfully merging this pull request may close these issues.

Message from Server: TensorRT-LLM not supported on Server yet.