| Creatives | SimplifiedDiversityRank | PyTorchDiversityRank | PyTorchDiversityRank V2 |
|---|---|---|---|
| 106 | 33.6 us (29,748/s) | 692 us (1,444/s) | 423 us (2,363/s) |
| 427 | 136 us (7,374/s) | 760 us (1,316/s) | 483 us (2,069/s) |
| 934 | 340 us (2,945/s) | 900 us (1,111/s) | 623 us (1,604/s) |
| 2,156 | 785 us (1,274/s) | 1,181 us (847/s) | 842 us (1,187/s) |
| 3,016 | 1,103 us (907/s) | 1,388 us (720/s) | 996 us (1,003/s) |
| 3,647 | 1,314 us (761/s) | 1,514 us (660/s) | 1,087 us (920/s) |
| 4,184* | 1,533 us (652/s) | 1,648 us (607/s) | 1,109 us (902/s) |
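
The table shows SimplifiedDiversityRank winning at small creative counts and PyTorchDiversityRank V2 winning at large ones. A rough way to locate the break-even point is to interpolate linearly between the two measured rows where the ordering flips. The sketch below does this with the latencies copied from the table; `estimate_crossover` is a helper name introduced here for illustration, and linear interpolation between measured points is an assumption, not something the benchmark itself establishes.

```python
# Latencies in microseconds, copied from the table above.
creatives = [106, 427, 934, 2156, 3016, 3647, 4184]
simplified_us = [33.6, 136, 340, 785, 1103, 1314, 1533]
v2_us = [423, 483, 623, 842, 996, 1087, 1109]


def estimate_crossover(xs, a, b):
    """Return the x where series `a` stops being faster than `b`.

    Uses linear interpolation between the two neighbouring measurements.
    Returns None if `a` stays faster over the whole measured range.
    """
    for i in range(1, len(xs)):
        d0 = a[i - 1] - b[i - 1]  # negative: a is faster at the left point
        d1 = a[i] - b[i]          # non-negative: a is no longer faster
        if d0 < 0 <= d1:
            frac = -d0 / (d1 - d0)
            return xs[i - 1] + frac * (xs[i] - xs[i - 1])
    return None


crossover = estimate_crossover(creatives, simplified_us, v2_us)
print(f"estimated crossover: ~{crossover:.0f} creatives")
```

With these numbers the flip happens between the 2,156 and 3,016 rows. The figures in parentheses in the table are approximately `1e6 / latency_us`, so the same crossover applies to throughput.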

export LD_LIBRARY_PATH=/usr/local/services/preranking-1.0/lib:/data/cuda/cuda-12.5/lib64/:/usr/local/services/preranking-1.0/thirdparty/gcc/12.3.1-b1/lib:./thirdparty/gcc/12.3.1-b1/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/services/preranking-1.0/thirdparty/gcc/12.3.1-b1/bin:$PATH

SimplifiedDiversityRank needs to be aligned with the V2 logic and then re-benchmarked.
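- Question to answer: how do we evaluate throughput across CPU / PyTorch / GPU?
- End state: all business-related logic runs on PyTorch.
- Requirements for the proposal: 1) latency, 2) cost, 3) ease of use, 4) switchover plan.
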
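
One way to approach the throughput question above is a closed-loop load test: a fixed number of workers each issue requests back to back, and QPS, average latency, P50, and P99 are computed from the recorded samples. The sketch below is a minimal Python version under stated assumptions: `run_benchmark` and `fake_rank` are hypothetical names, `fake_rank` is a stand-in that only sleeps, and the notes do not say how the real numbers were collected.

```python
import math
import threading
import time


def run_benchmark(handler, concurrency, requests_per_worker):
    """Closed-loop load test: each worker calls `handler` back to back."""
    latencies_us = []
    lock = threading.Lock()

    def worker():
        local = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            handler()
            local.append((time.perf_counter() - start) * 1e6)
        with lock:
            latencies_us.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall_s = time.perf_counter() - wall_start

    latencies_us.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted samples.
        rank = max(1, math.ceil(p / 100 * len(latencies_us)))
        return latencies_us[rank - 1]

    return {
        "qps": len(latencies_us) / wall_s,
        "avg_us": sum(latencies_us) / len(latencies_us),
        "p50_us": percentile(50),
        "p99_us": percentile(99),
    }


def fake_rank():
    # Hypothetical stand-in for the real ranking call.
    time.sleep(0.001)


for c in (1, 2, 4):
    stats = run_benchmark(fake_rank, concurrency=c, requests_per_worker=20)
    print(c, {k: round(v, 1) for k, v in stats.items()})
```

Running the same harness against each backend at the same concurrency keeps the comparison like for like. Python threads are used here only to keep the sketch short; a CPU-bound pure-Python handler would be serialized by the interpreter lock, so a production benchmark would more likely drive the service from outside the process.
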
| Metric | CPU (Simplified) | PyTorch |
|---|---|---|
| QPS | 9541.3 | 1771.3 |
| Avg Latency(us) | 843.1 | 6344.5 |
| P50 Latency(us) | 827.0 | 6124.0 |
| P99 Latency(us) | 1199.0 | 11808.0 |
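
PyTorch QPS falls far behind CPU (1,771 vs. 9,541). Next step: try multi-stream.

With multi-stream added (pool size: 20), PyTorch QPS roughly doubles compared with the previous run, but it still does not come close to CPU:
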
| Metric | CPU (Simplified) | PyTorch |
|---|---|---|
| QPS | 9531.4 | 3670.3 |
| Avg Latency(us) | 846.2 | 2865.1 |
| P50 Latency(us) | 830.0 | 2838.0 |
| P99 Latency(us) | 1201.0 | 3876.0 |
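
The notes do not show how the multi-stream pool is built, so the following is only a sketch of the general idea: a fixed-size pool lends one stream to each in-flight request and takes it back afterwards. `StreamPool` and `make_stream` are hypothetical names. The example uses plain `object` instances so it runs without a GPU; with PyTorch the factory could be `torch.cuda.Stream`, and the forward pass would run inside `with torch.cuda.stream(stream):`.

```python
import queue
from contextlib import contextmanager


class StreamPool:
    """Fixed-size pool that lends out one stream per in-flight request."""

    def __init__(self, make_stream, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(make_stream())

    @contextmanager
    def acquire(self):
        stream = self._free.get()   # blocks while every stream is in use
        try:
            yield stream
        finally:
            self._free.put(stream)  # always hand the stream back


# Placeholder factory; pool size 20 matches the run reported above.
pool = StreamPool(make_stream=object, size=20)

with pool.acquire() as stream:
    # The ranking call for one request would go here.
    pass
```

A pool larger than the number of requests that actually execute at the same time adds idle streams only, so increasing the size helps only while the pool itself is the limiting factor.
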
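
Raising the pool size to 200 makes little difference (the default concurrency is 200).

Varying the concurrency instead:
- Concurrency 1: PyTorch QPS > CPU QPS, as far as we can tell.
- Concurrency 2: PyTorch QPS ≈ CPU QPS.
- Concurrency 3: PyTorch QPS < CPU QPS.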
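
A quick sanity check on the reported numbers is Little's law: the average number of requests in flight equals throughput times average latency. The sketch below applies it to the QPS and average latency values from the two tables above. `implied_in_flight` is a helper introduced here, and the run labels are mine; how to interpret the result depends on where latency was measured, which the notes do not state.

```python
def implied_in_flight(qps, avg_latency_us):
    """Little's law: average in-flight requests = throughput x average latency."""
    return qps * avg_latency_us / 1e6


# (QPS, average latency in microseconds), copied from the tables above.
runs = {
    "CPU (Simplified), first run": (9541.3, 843.1),
    "PyTorch, single stream": (1771.3, 6344.5),
    "CPU (Simplified), second run": (9531.4, 846.2),
    "PyTorch, multi-stream (pool size 20)": (3670.3, 2865.1),
}

for name, (qps, avg_us) in runs.items():
    print(f"{name}: ~{implied_in_flight(qps, avg_us):.1f} requests in flight")
```

All four runs come out at roughly 8 to 11 requests in flight, far below the configured concurrency of 200. One possible reading is that the latency column covers only the ranking call and not queueing, with effective parallelism capped somewhere near that range; this is a hypothesis to verify, not a conclusion. It would also fit the concurrency sweep, where PyTorch is ahead for a single caller but stops scaling as callers are added.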