Built an end-to-end workflow for high-throughput LLM serving, combining optimized inference pipelines, model quantization, and task-specific fine-tuning to improve latency, memory efficiency, and cost-performance trade-offs.

In the tables, TTFT is time to first token, TPOT is time per output token, and E2E is end-to-end request latency; p50 and p99 are the median and 99th percentile.

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 16.1 | 37.4 | 11.5 | 11.7 | 1480 | 87 | 0.7 |
| 4 | 16 | 33.9 | 34.7 | 11.8 | 11.9 | 1532 | 337 | 2.6 |
| 8 | 32 | 35.3 | 38.4 | 11.7 | 11.8 | 1518 | 675 | 5.3 |
| 16 | 64 | 39.3 | 42.2 | 11.9 | 12.2 | 1544 | 1324 | 10.3 |
| 32 | 128 | 49.1 | 62.8 | 12.0 | 12.8 | 1576 | 2550 | 19.9 |
| 64 | 256 | 68.3 | 98.6 | 13.5 | 13.9 | 1782 | 4579 | 35.8 |
| 128 | 512 | 121.6 | 2279.3 | 15.6 | 16.6 | 2126 | 5480 | 42.8 |

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 51.7 | 94.8 | 38.1 | 38.2 | 4896 | 26 | 0.2 |
| 4 | 16 | 86.2 | 111.9 | 40.9 | 46.8 | 5275 | 94 | 0.7 |
| 8 | 32 | 172.8 | 205.5 | 53.9 | 57.8 | 6992 | 144 | 1.1 |
| 16 | 64 | 309.9 | 314.1 | 51.9 | 53.0 | 6900 | 297 | 2.3 |
| 32 | 128 | 340.7 | 366.1 | 48.8 | 52.2 | 6507 | 624 | 4.9 |
| 64 | 256 | 437.4 | 2098.7 | 64.1 | 66.3 | 8584 | 916 | 7.2 |
| 128 | 512 | 707.6 | 15839.3 | 108.9 | 111.9 | 14494 | 812 | 6.3 |

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 33.0 | 54.2 | 8.3 | 8.3 | 1081 | 118 | 0.9 |
| 4 | 16 | 46.2 | 56.0 | 9.8 | 9.9 | 1295 | 395 | 3.1 |
| 8 | 32 | 42.5 | 58.3 | 10.2 | 11.0 | 1340 | 764 | 6.0 |
| 16 | 64 | 62.7 | 74.9 | 10.2 | 11.0 | 1355 | 1501 | 11.7 |
| 32 | 128 | 78.0 | 88.7 | 12.1 | 12.3 | 1590 | 2650 | 20.7 |
| 64 | 256 | 96.5 | 126.8 | 12.8 | 13.9 | 1703 | 4740 | 37.2 |
| 128 | 512 | 152.0 | 2852.5 | 20.2 | 21.8 | 2735 | 4297 | 33.6 |
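
The latency columns can be cross-checked against each other. The sketch below assumes the common approximation E2E ≈ TTFT + TPOT × (output tokens − 1) and assumes 128 output tokens per request; that figure is not stated in the source and is only inferred from the tok/s to req/s ratio. The function names are hypothetical, and the rows are copied from the first table. Because a median of sums is not exactly the sum of medians, the match is expected to be approximate.

```python
# Sanity check of the benchmark columns under stated assumptions.
# ASSUMPTION: each request generates 128 output tokens (inferred, not from the source).
ASSUMED_OUTPUT_TOKENS = 128


def estimate_e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then one TPOT per remaining token."""
    return ttft_ms + tpot_ms * (output_tokens - 1)


def implied_tokens_per_request(tok_per_s: float, req_per_s: float) -> float:
    """Average generated tokens per request implied by the two throughput columns."""
    return tok_per_s / req_per_s


def relative_error(estimate: float, reported: float) -> float:
    """Absolute difference between estimate and reported value, as a fraction of reported."""
    return abs(estimate - reported) / reported


# (concurrency, TTFT p50, TPOT p50, E2E p50, tok/s, req/s), taken from the first table.
rows = [
    (1, 16.1, 11.5, 1480, 87, 0.7),
    (32, 49.1, 12.0, 1576, 2550, 19.9),
    (128, 121.6, 15.6, 2126, 5480, 42.8),
]

for concurrency, ttft, tpot, e2e, tok_s, req_s in rows:
    estimate = estimate_e2e_ms(ttft, tpot, ASSUMED_OUTPUT_TOKENS)
    print(
        f"concurrency={concurrency:>3}  "
        f"E2E estimate={estimate:7.1f} ms  reported={e2e} ms  "
        f"error={relative_error(estimate, e2e):.2%}  "
        f"tokens/request={implied_tokens_per_request(tok_s, req_s):.1f}"
    )
```

A large gap between the estimate and the reported E2E would suggest queueing time that TTFT does not capture, or output lengths that differ from the assumed value.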