
FastApi, multiple models, multiple emotions, 3090 GPU: under stress testing, once concurrency exceeds 10 requests the inference program hangs at 100% GPU utilization #1065

Open
baddogly opened this issue May 8, 2024 · 9 comments


@baddogly

baddogly commented May 8, 2024

Stress test concurrency: 20. At around the 5-minute mark GPU utilization hits 100% and the program hangs.
Inference parameters:
{
"text": "优化设置,降低游戏的图形设置或调整应用程序的性能参数,以减轻GPU的负担关闭不必要的后台程序",
"prompt_lang": "zh",
"top_k": 5,
"top_p": 1,
"temperature": 1,
"text_split_method": "cut5",
"batch_size": 1,
"batch_threshold": 0.75,
"split_bucket": true,
"speed_factor": 1.0,
"fragment_interval": 0.3,
"seed": -1,
"media_type": "wav",
"streaming_mode": true,
"parallel_infer": true,
"repetition_penalty": 1.35
}
[two screenshots attached]
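
A load like the one described above can be reproduced with a plain thread pool hammering the endpoint. Below is a minimal sketch of such a stress client, not the original test harness: the host/port, the /tts route, and the POST-with-JSON calling convention are assumptions about the fast_inference API server, and depending on the server version the payload may also need reference-audio fields that were omitted above.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:9880/tts"  # assumed host/port and route

# Reuses the inference parameters from the issue body; the real request may
# also need fields such as the reference audio path, depending on the branch.
PAYLOAD = {
    "text": "优化设置,降低游戏的图形设置或调整应用程序的性能参数,以减轻GPU的负担关闭不必要的后台程序",
    "prompt_lang": "zh",
    "top_k": 5,
    "top_p": 1,
    "temperature": 1,
    "text_split_method": "cut5",
    "batch_size": 1,
    "streaming_mode": True,
    "parallel_infer": True,
    "media_type": "wav",
}

def one_request(i: int) -> int:
    # Drain the streamed audio so the connection stays open like a real client.
    with requests.post(URL, json=PAYLOAD, stream=True, timeout=300) as resp:
        for _ in resp.iter_content(chunk_size=4096):
            pass
        return resp.status_code

# 20 concurrent requests, matching the reported concurrency; loop the map to
# sustain the load for several minutes as in the original test.
with ThreadPoolExecutor(max_workers=20) as pool:
    print(list(pool.map(one_request, range(20))))
```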

@BOCEAN-FENG

I don't dare to run inference concurrently myself either; the processes seem to fight over VRAM and eventually error out. Is this project's concurrency support actually complete at this point? For my own concurrency I set up a thread pool and called the API from it, but it throws errors. @baddogly how did you do it?

@baddogly
Author

baddogly commented May 8, 2024

@BOCEAN-FENG Turn on streaming with "streaming_mode": true. Each request takes up a share of VRAM, and the requests split the GPU's compute evenly among themselves.

@BOCEAN-FENG

streaming_mode seems to just control whether the response is streamed. I looked at the fast_inference branch, and right below it there happen to be these two parameters:
"streaming_mode": false, # bool.(optional) whether to return a streaming response.
"parallel_infer": True, # bool.(optional) whether to use parallel inference.
parallel_infer should be the switch for concurrent inference. I'll go try it myself 🤞

@TC10127

TC10127 commented May 10, 2024

Could I ask how to modify the code to support multiple models and multiple emotions? Thanks 🙏

@BOCEAN-FENG

For multiple models you have to switch on the fly: when character A is speaking, switch to character A's model; for character B, switch to B's model. For multiple emotions, you have to prepare a reference audio clip for every emotion (happy, angry, sad, and so on) of each character.
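
Concretely, the multi-emotion part boils down to a lookup table from (character, emotion) to a reference clip. A minimal sketch, with every name and path an illustrative assumption rather than anything from GPT-SoVITS itself:

```python
# (character, emotion) -> reference audio path; all entries are made up.
REF_AUDIO = {
    ("role_a", "happy"): "refs/role_a_happy.wav",
    ("role_a", "angry"): "refs/role_a_angry.wav",
    ("role_b", "happy"): "refs/role_b_happy.wav",
    ("role_b", "sad"): "refs/role_b_sad.wav",
}

def pick_ref_audio(character: str, emotion: str) -> str:
    """Return the reference clip for this character/emotion, falling back to
    the character's first registered clip when the emotion is missing."""
    try:
        return REF_AUDIO[(character, emotion)]
    except KeyError:
        for (char, _emotion), path in REF_AUDIO.items():
            if char == character:
                return path
        raise KeyError(f"no reference audio registered for {character!r}")

print(pick_ref_audio("role_a", "angry"))  # refs/role_a_angry.wav
print(pick_ref_audio("role_b", "angry"))  # falls back to refs/role_b_happy.wav
```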

@TC10127

TC10127 commented May 10, 2024

> For multiple models you have to switch on the fly: when character A is speaking, switch to character A's model; for character B, switch to B's model. For multiple emotions, you have to prepare a reference audio clip for every emotion (happy, angry, sad, and so on) of each character.

I may not have expressed myself clearly. What I mean is: how do I modify the code so that multiple people can call the API at the same time without interfering with each other?

@baddogly
Author

In the code, have the two endpoints /set_gpt_weights and /set_sovits_weights load the weights into a global variable, and then have the inference code fetch them from that same global variable. @TC10127 @BOCEAN-FENG
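
A minimal sketch of that idea in FastAPI, to make it concrete. This is not GPT-SoVITS's actual api_v2 code: the loader is a hypothetical stand-in and the real endpoints and inference pipeline look different. The point is only that the weight-setting endpoints write into per-model registries instead of one mutable "current model", and each request names the weights it wants.

```python
from fastapi import FastAPI

app = FastAPI()

# Global registries keyed by weights path; every request reads from these
# instead of a single shared "current model" that callers keep overwriting.
GPT_WEIGHTS: dict = {}
SOVITS_WEIGHTS: dict = {}

def load_checkpoint(path: str):
    # Hypothetical stand-in for the project's real weight-loading code.
    import torch
    return torch.load(path, map_location="cpu")

@app.get("/set_gpt_weights")
def set_gpt_weights(weights_path: str):
    GPT_WEIGHTS[weights_path] = load_checkpoint(weights_path)
    return {"message": "success"}

@app.get("/set_sovits_weights")
def set_sovits_weights(weights_path: str):
    SOVITS_WEIGHTS[weights_path] = load_checkpoint(weights_path)
    return {"message": "success"}

@app.get("/tts")
def tts(text: str, gpt_path: str, sovits_path: str):
    # Each request looks up its own weights, so caller A switching models no
    # longer clobbers caller B's in-flight inference.
    gpt, sovits = GPT_WEIGHTS[gpt_path], SOVITS_WEIGHTS[sovits_path]
    # The actual synthesis call would go here; returning a stub instead.
    return {"text": text, "gpt": gpt_path, "sovits": sovits_path}
```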

@TC10127

TC10127 commented May 10, 2024

Would you mind sharing the code? 🙏

@xwang0415

GPUs do parallel computation natively. Setting up multiple workers to serve concurrency as multiple instances makes them contend for the GPU's compute resources, which actually degrades the average response time of each request.
