Unable to replicate fps results on AGX Xavier #275

Open
lpkoh opened this issue Nov 23, 2021 · 11 comments

Comments

lpkoh commented Nov 23, 2021

Hi,

I am using an AGX Xavier. I followed the instructions to run the demo for 2D object detection. I built a yolo4_fp16.rt model, which should be a 416x416 model, then ran ./demo yolo4_fp16.rt with batch = 1 and got ~9 FPS. This is significantly less than the ~41 FPS reported. Images are shown below:

[screenshot]

[screenshot]
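For reference, the build-and-run sequence was roughly the following (a sketch based on the standard tkDNN README workflow; the binary names and test video path come from that workflow and may differ on other setups):

    # build the FP16 TensorRT engine for yolo4 (416x416)
    export TKDNN_MODE=FP16
    ./test_yolo4                                  # exports yolo4_fp16.rt

    # run the demo with batch = 1
    ./demo yolo4_fp16.rt ../demo/yolo_test.mp4 y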

I have no other background processes running. I do not have CUDA_VISIBLE_DEVICES set to anything. My nvpmodel is set to 1 (settings below). I have run sudo jetson_clocks.

[screenshot: nvpmodel settings]
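For completeness, the power configuration was set with the usual Jetson tools (a minimal sketch; the mode number matches the setting mentioned above):

    sudo nvpmodel -q      # query the current power mode
    sudo nvpmodel -m 1    # select mode 1 (the setting used for this run)
    sudo jetson_clocks    # pin clocks to the maximum for the current mode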

I am aware from some of the other issues that the reported FPS corresponds to inference only, so I am unsure why it is so slow (significantly slower than just testing with TensorRT via ./trtexec).

lpkoh (Author) commented Nov 23, 2021

[screenshot: yolo4-csp result]

Result on yolo4-csp. I don't think it is a thermal throttling issue, as the Jetson AGX Xavier is cool to the touch and I have a fan blowing directly at it.

lpkoh (Author) commented Nov 23, 2021

Hi,

I have repeated the experiment; the original run was done with a low-power setting.

This is my environment:

  • Jetpack 4.5.1, TensorRT 7.1.3, CUDA 10.2, CUDNN 8.01

Other details:

  • Device: AGX Xavier
  • Mode: nvpmodel 4 - 30W 6CORE
  • Model: yolov4-fp16-516x516
  • Batch: 1

Results:
[screenshot: yolov4 result]

Another result, this time with Yolov4-csp-512x512 fp16:
[screenshot: yolov4-csp result]

I have two questions:

  1. The first result does not match the 41.01 FPS reported for yolo4 416 on the AGX Xavier:
    [screenshot]
    Why is this so? Could it be because I am on the 30W 6CORE mode rather than the MAXN setting? I can't test this, as the device shuts off when I run it with nvpmodel -m 0.
  2. The results seem slower than pure TensorRT. I ran a separate experiment with just ./trtexec on darknet weights converted to a TensorRT engine, multiple times, including on Yolov4-csp-512x512-fp16 (same number of classes, filters, etc.), with the same nvpmodel and jetson_clocks settings (see the sketch below). I obtained 37.4 FPS there vs 21.9 FPS above. As both are inference-only FPS, this would imply TensorRT + tkDNN is actually slower, all else held constant (as far as I can see). Is there a reason for this? Am I failing to max out tkDNN in some way? It seems to be slower than raw TensorRT.
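For reference, the trtexec comparison was along these lines (a rough sketch; the engine file name is illustrative, not the exact path I used):

    # benchmark a serialized TensorRT 7 engine directly
    /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4-csp-512-fp16.engine --iterations=100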

lpkoh (Author) commented Nov 24, 2021

I have re-run the test with adac857, thinking it might be due to this issue: #226

However, the results have actually gotten slightly worse, with ~18 FPS on yolo4-csp. Can anyone advise?
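(The rebuild at that commit was done the usual CMake way, roughly as below; a sketch, assuming a clean build directory:)

    git checkout adac857
    rm -rf build && mkdir build && cd build
    cmake .. && make -j$(nproc)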

mive93 (Collaborator) commented Nov 24, 2021

Hi @lpkoh,

Three considerations:

  • When I do the tests, I always use the MAXN configuration.
  • I generally use sudo jetson_clocks to get the best performance.
  • As specified in the README, only the inference performance is reported there; what the demo prints on screen is preprocessing + inference + postprocessing. The pre/post steps could indeed be optimized.

Finally, yolo4-csp is not YOLOv4; it is Scaled-YOLOv4, which is slower but more accurate.

Let me know if you have further questions.

mive93 (Collaborator) commented Nov 24, 2021

Actually, I get very similar results for Yolov4 and Yolov4-csp.
These results were obtained on an AGX Xavier with JetPack 4.5 at full precision (FP32), selecting only those models in this script.

test               avg ms    min ms    max ms    avg FPS
yolo4_fp32_2       47.3199   46.5271   63.1509   21.1328
yolo4-csp_fp32_2   51.1207   50.8716   51.8859   19.5615

lpkoh (Author) commented Nov 24, 2021

Hi, thank you for replying on this.

I am confused. You said here and in #186 and #173 that what the demo prints on screen is preprocessing + inference + postprocessing. I took the demo's on-screen output to be the reported result, so I concluded that the inference-only FPS of tkDNN was slower than the inference-only FPS of ./trtexec. Where do I find the demo output that corresponds to inference only, without pre/post-processing? I can't find that information here.

Also, as I understand it, tkDNN is a wrapper around TensorRT and cuDNN. Does this mean it is actually meant to be faster than just running ./trtexec on a Jetson board, at least in theory?

mive93 (Collaborator) commented Nov 24, 2021

Yeah, you are actually right. In the past the demo also printed pre/post-processing time, but currently it prints the inference time only, so what you see is the inference time. The same holds for ./test_rtinference and the script scripts/test_all_tests.sh.
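For example, the inference-only timing can be checked directly (a minimal sketch, assuming the usage documented in the repo: the .rt file followed by the number of batches):

    ./test_rtinference yolo4_fp16.rt 1    # inference time only, batch size 1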

Yes, tkDNN is just a wrapper around TensorRT and cuDNN. It is a framework that we use to optimize neural networks for our projects. We did not develop it to be faster, but to easily port models that are not otherwise supported.

lpkoh (Author) commented Nov 24, 2021

Ah, gotcha. So I guess the difference between ~27 FPS on yolo4-416x416 and ~44 FPS in your repo is probably down to MAXN? Could the TensorRT version difference be an issue? I am using 7, while your repo mentions 8. I have heard 8 is faster, but for things like transformers, not YOLO.

mive93 (Collaborator) commented Nov 24, 2021

Maybe it's due to MAXN and jetson_clocks.
JetPack 4.5 uses TensorRT 7.
TensorRT 8, which will be supported by tkDNN very soon, is actually slower on the Jetson platform for now. We hope NVIDIA will solve the issue in the next minor release.

mive93 (Collaborator) commented Nov 24, 2021

TensorRT 8 is now supported on the tensorrt8 branch.

masip85 commented Jul 20, 2022

Is TensorRT 8 still slower?
