| Creatives | SimplifiedDiversityRank | PyTorchDiversityRank | PyTorchDiversityRank V2 |
| --- | --- | --- | --- |
| 106 | 33.6 us (29,748/s) | 692 us (1,444/s) | 423 us (2,363/s) |
| 427 | 136 us (7,374/s) | 760 us (1,316/s) | 483 us (2,069/s) |
| 934 | 340 us (2,945/s) | 900 us (1,111/s) | 623 us (1,604/s) |
| 2,156 | 785 us (1,274/s) | 1,181 us (847/s) | 842 us (1,187/s) |
| 3,016 | 1,103 us (907/s) | 1,388 us (720/s) | 996 us (1,003/s) |
| 3,647 | 1,314 us (761/s) | 1,514 us (660/s) | 1,087 us (920/s) |
| 4,184* | 1,533 us (652/s) | 1,648 us (607/s) | 1,109 us (902/s) |
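
To make the crossover between SimplifiedDiversityRank and PyTorchDiversityRank V2 easier to see, here is a small sketch that recomputes throughput as 1,000,000 / latency_us and compares the two columns row by row. The latencies are copied from the table above; the names `ROWS`, `calls_per_second`, and `crossover` are illustrative only. Recomputed throughput can differ slightly from the reported /s figures, presumably because the reported latencies are rounded.

```python
# Per-call latency in microseconds, copied from the table above.
# Columns: creatives, SimplifiedDiversityRank, PyTorchDiversityRank,
#          PyTorchDiversityRank V2
ROWS = [
    (106, 33.6, 692, 423),
    (427, 136, 760, 483),
    (934, 340, 900, 623),
    (2156, 785, 1181, 842),
    (3016, 1103, 1388, 996),
    (3647, 1314, 1514, 1087),
    (4184, 1533, 1648, 1109),
]


def calls_per_second(latency_us):
    """Single-caller throughput implied by a per-call latency."""
    return 1_000_000 / latency_us


def crossover(rows):
    """Return the first creative count at which V2 is faster than Simplified."""
    for creatives, simplified_us, _pytorch_us, v2_us in rows:
        if v2_us < simplified_us:
            return creatives
    return None


for creatives, simplified_us, pytorch_us, v2_us in ROWS:
    print(
        f"{creatives:>5} creatives: "
        f"Simplified {calls_per_second(simplified_us):>8.0f}/s, "
        f"V2 {calls_per_second(v2_us):>6.0f}/s, "
        f"Simplified/V2 latency ratio {simplified_us / v2_us:.2f}"
    )
print("V2 first beats Simplified at", crossover(ROWS), "creatives")
```

On these numbers, SimplifiedDiversityRank is faster for small creative counts, and V2 only pulls ahead somewhere between 2,156 and 3,016 creatives.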
export LD_LIBRARY_PATH=/usr/local/services/preranking-1.0/lib:/data/cuda/cuda-12.5/lib64/:/usr/local/services/preranking-1.0/thirdparty/gcc/12.3.1-b1/lib:./thirdparty/gcc/12.3.1-b1/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/services/preranking-1.0/thirdparty/gcc/12.3.1-b1/bin:$PATH

SimplifiedDiversityRank needs to be aligned with the V2 logic and then re-benchmarked.

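  • One question to answer: how to evaluate throughput across CPU / PyTorch / GPU.
  • End state: everything business-related should run on PyTorch.
  • Requirements for the plan: 1) latency, 2) cost, 3) ease of use, 4) switchover plan.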
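
One possible way to approach the throughput question is a closed-loop load test that reports the same metrics as the tables in these notes (QPS, average, P50, and P99 latency). The sketch below is an assumption about how such a harness could look, not the tool that produced the numbers here; `run_benchmark`, `summarize`, `percentile`, and `fake_rank` are hypothetical names.

```python
import math
import statistics
import threading
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_us, wall_s):
    """Turn raw per-request latencies and wall time into summary metrics."""
    ordered = sorted(latencies_us)
    return {
        "requests": len(ordered),
        "qps": len(ordered) / wall_s,
        "avg_us": statistics.fmean(ordered),
        "p50_us": percentile(ordered, 50),
        "p99_us": percentile(ordered, 99),
    }


def run_benchmark(fn, concurrency, requests_per_worker):
    """Closed loop: each worker thread calls fn() back to back."""
    latencies_us = []
    lock = threading.Lock()

    def worker():
        local = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            fn()
            local.append((time.perf_counter() - start) * 1e6)
        with lock:  # merge once per worker to keep the hot loop lock-free
            latencies_us.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return summarize(latencies_us, time.perf_counter() - wall_start)


if __name__ == "__main__":
    def fake_rank():
        # Stand-in for a ranking call; replace with the real entry point.
        time.sleep(0.0005)

    for concurrency in (1, 2, 3):
        print(concurrency, run_benchmark(fake_rank, concurrency, 200))
```

Running the same harness against each backend at several concurrency levels gives directly comparable rows. Note that Python threads only overlap work that releases the GIL (sleeps, I/O, most native calls), so a pure-Python `fn` would not scale with concurrency in this harness.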
| Metric | CPU (Simplified) | PyTorch |
| --- | --- | --- |
| QPS | 9541.3 | 1771.3 |
| Avg Latency (us) | 843.1 | 6344.5 |
| P50 Latency (us) | 827.0 | 6124.0 |
| P99 Latency (us) | 1199.0 | 11808.0 |

PyTorch QPS falls far behind the CPU version here... Let's try adding multi-stream.

With multi-stream added (pool size: 20): QPS roughly doubled compared with before, but it still cannot match the CPU version.
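
These notes do not show how the multi-stream pool is implemented, and the service itself appears to be a native binary. Purely as an illustration of the idea, here is a minimal Python sketch of a fixed-size pool of CUDA streams that requests borrow and return. `StreamPool`, `borrow`, and `make_streams` are hypothetical names; when PyTorch or CUDA is unavailable, the sketch falls back to `None` placeholders so the pooling logic can still be exercised.

```python
import queue
from contextlib import contextmanager

try:
    import torch
except ImportError:  # the pool logic below still runs without PyTorch
    torch = None


def make_streams(pool_size):
    """Create pool_size CUDA streams, or None placeholders without CUDA."""
    if torch is not None and torch.cuda.is_available():
        return [torch.cuda.Stream() for _ in range(pool_size)]
    return [None] * pool_size


class StreamPool:
    """Fixed-size pool; callers block when every stream is in use."""

    def __init__(self, pool_size):
        self._free = queue.Queue()
        for stream in make_streams(pool_size):
            self._free.put(stream)

    @contextmanager
    def borrow(self):
        stream = self._free.get()  # blocks while the pool is exhausted
        try:
            if stream is None:
                yield None  # no CUDA: run the body on the default path
            else:
                # Kernels launched inside the body are queued on this stream.
                with torch.cuda.stream(stream):
                    yield stream
                # Wait for this stream's work before handing it back.
                stream.synchronize()
        finally:
            self._free.put(stream)

    def available(self):
        """Number of streams currently not borrowed."""
        return self._free.qsize()


pool = StreamPool(pool_size=20)
with pool.borrow():
    pass  # run one request's forward pass here
```

With a pool like this, at most `pool_size` requests issue GPU work at once, and any further concurrent requests wait in `borrow()`.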

| Metric | CPU (Simplified) | PyTorch |
| --- | --- | --- |
| QPS | 9531.4 | 3670.3 |
| Avg Latency (us) | 846.2 | 2865.1 |
| P50 Latency (us) | 830.0 | 2838.0 |
| P99 Latency (us) | 1201.0 | 3876.0 |
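
The arithmetic behind "roughly doubled, but still short of the CPU" can be checked directly from the two tables. The sketch below also applies Little's law (mean requests in flight = QPS × mean latency) as a sanity check; the dictionary names are illustrative, and how the latency was measured is an assumption, since these notes do not say.

```python
# Values copied from the two benchmark tables in these notes.
CPU = {"qps": 9531.4, "avg_us": 846.2}
PYTORCH_SINGLE = {"qps": 1771.3, "avg_us": 6344.5}  # before multi-stream
PYTORCH_MULTI = {"qps": 3670.3, "avg_us": 2865.1}   # pool size 20


def ratio(a, b):
    return a / b


def in_flight(run):
    """Little's law: mean requests in flight = throughput * mean latency."""
    return run["qps"] * run["avg_us"] / 1e6


print(f"multi-stream gain: {ratio(PYTORCH_MULTI['qps'], PYTORCH_SINGLE['qps']):.2f}x")
print(f"CPU lead over multi-stream: {ratio(CPU['qps'], PYTORCH_MULTI['qps']):.2f}x")
for name, run in [("CPU", CPU), ("PyTorch", PYTORCH_SINGLE),
                  ("PyTorch multi-stream", PYTORCH_MULTI)]:
    print(f"{name}: ~{in_flight(run):.1f} requests in flight")
```

Multi-stream gives about a 2.07x gain, and the CPU version is still about 2.6x ahead. The implied number of requests in flight is roughly 8 to 11 in every run. If the latency were measured end to end by a client holding 200 requests open, that product would be near 200, so the reported latency presumably excludes queueing time, or the effective concurrency is lower than the configured value. This is an inference from the numbers, not something stated in the source.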

Raising the pool size to 200 made little difference (the default concurrency is 200).

With concurrency set to 1, PyTorch QPS appears to exceed CPU QPS. At concurrency 2, PyTorch QPS is roughly equal to CPU QPS. At concurrency 3, PyTorch QPS falls below CPU QPS.