LLM Deployment

Built an end-to-end workflow for high-throughput LLM serving, combining optimized inference pipelines, model quantization, and task-specific fine-tuning to improve latency, memory efficiency, and cost-performance trade-offs.

Local LLM Deployment Benchmark

Benchmarked with the vLLM (v0.18.0) serving engine. Prefix caching was disabled for all benchmarks so that repeated prompt prefixes could not skew the latency numbers.

Metric abbreviations: TTFT = time to first token, TPOT = time per output token, E2E = end-to-end request latency. Avg In / Avg Out are the mean input and output token counts per request.

Model: NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | GPU: B200

Endpoint: localhost:8000/v1/chat/completions | Max tokens: 256
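The per-request latency metrics in the tables below can be derived from the timestamps at which streamed tokens arrive. A minimal sketch of that derivation, assuming a client that records a monotonic timestamp per received token (the function name is illustrative, not part of the actual benchmark harness):

```python
def per_request_metrics(t_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and E2E latency for one streamed request.

    t_start     -- monotonic time at which the request was sent
    token_times -- monotonic arrival time of each streamed output token
    """
    ttft = token_times[0] - t_start  # time to first token
    # TPOT: mean inter-token latency over the tokens after the first
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    e2e = token_times[-1] - t_start  # end-to-end latency
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

# Example: request sent at t=0, four tokens arriving 100 ms apart
m = per_request_metrics(0.0, [0.1, 0.2, 0.3, 0.4])
```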

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s | Avg In (tok) | Avg Out (tok) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 16 | 116.7 | 122.1 | 6.4 | 6.4 | 1737 | 589 | 2.3 | 343 | 256 |
| 8 | 32 | 117.9 | 126.5 | 7.6 | 7.6 | 2064 | 991 | 3.9 | 348 | 256 |
| 16 | 64 | 154.5 | 164.4 | 8.7 | 8.8 | 2372 | 1725 | 6.7 | 347 | 256 |
| 32 | 128 | 287.3 | 356.1 | 11.0 | 11.6 | 3082 | 2652 | 10.4 | 348 | 256 |
| 64 | 256 | 465.8 | 600.2 | 14.1 | 15.7 | 4072 | 4000 | 15.6 | 348 | 256 |

Model: Qwen3-30B-A3B-Instruct-2507-FP8 | GPU: H100 NVL

Endpoint: localhost:8000/v1/chat/completions | Max tokens: 256

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s | Avg In (tok) | Avg Out (tok) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 16 | 197.7 | 205.5 | 7.2 | 7.2 | 2031 | 508 | 2.0 | 331 | 256 |
| 8 | 32 | 211.5 | 225.3 | 8.4 | 8.6 | 2348 | 869 | 3.4 | 336 | 256 |
| 16 | 64 | 258.4 | 335.6 | 9.9 | 10.7 | 2787 | 1465 | 5.7 | 336 | 256 |
| 32 | 128 | 374.7 | 510.3 | 11.9 | 13.2 | 3426 | 2385 | 9.3 | 336 | 256 |
| 64 | 256 | 464.2 | 911.9 | 15.1 | 16.3 | 4316 | 3776 | 14.8 | 336 | 256 |

Model: Qwen3-30B-A3B-Instruct-2507-FP8 | GPU: NVIDIA RTX 6000 Ada

Endpoint: localhost:8000/v1/chat/completions | Max tokens: 256

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s | Avg In (tok) | Avg Out (tok) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 16 | 88.8 | 97.9 | 12.3 | 12.5 | 3217 | 318 | 1.2 | 331 | 256 |
| 8 | 32 | 126.2 | 168.8 | 15.6 | 15.9 | 4104 | 498 | 1.9 | 336 | 256 |
| 16 | 64 | 206.1 | 302.8 | 19.9 | 20.4 | 5281 | 774 | 3.0 | 336 | 256 |
| 32 | 128 | 305.9 | 591.0 | 25.2 | 26.0 | 6719 | 1217 | 4.8 | 336 | 256 |
| 64 | 256 | 433.3 | 1298.3 | 31.8 | 35.4 | 8639 | 1869 | 7.3 | 336 | 256 |

Model: Qwen3-32B-FP8 | GPU: H100 NVL

Endpoint: localhost:8000/v1/chat/completions | Max tokens: 256

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s | Avg In (tok) | Avg Out (tok) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 16 | 194.1 | 224.3 | 24.5 | 24.7 | 6429 | 159 | 0.6 | 331 | 256 |
| 8 | 32 | 372.4 | 421.5 | 24.8 | 25.9 | 6686 | 306 | 1.2 | 336 | 256 |
| 16 | 64 | 627.7 | 829.2 | 26.1 | 28.0 | 7275 | 562 | 2.2 | 336 | 256 |
| 32 | 128 | 735.6 | 1652.3 | 29.7 | 32.3 | 8340 | 977 | 3.8 | 336 | 256 |
| 64 | 256 | 1170.5 | 3219.5 | 37.3 | 41.0 | 10762 | 1532 | 6.0 | 336 | 256 |

Model: Qwen3-32B-FP8 | GPU: NVIDIA RTX 6000 Ada

Endpoint: localhost:8000/v1/chat/completions | Max tokens: 256

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s | Avg In (tok) | Avg Out (tok) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 16 | 556.2 | 595.6 | 48.2 | 49.7 | 12825 | 80 | 0.3 | 331 | 256 |
| 8 | 32 | 1007.4 | 1185.5 | 50.0 | 53.6 | 13795 | 148 | 0.6 | 336 | 256 |
| 16 | 64 | 1666.0 | 2453.4 | 54.0 | 60.0 | 15481 | 264 | 1.0 | 336 | 256 |
| 32 | 128 | 2329.6 | 4406.1 | 61.2 | 69.0 | 17913 | 456 | 1.8 | 336 | 256 |
| 64 | 256 | 4929.9 | 14339.0 | 92.2 | 152.1 | 28461 | 525 | 2.1 | 336 | 256 |
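Each table row above collapses many per-request samples into percentiles and run-level throughput: p50/p99 over the per-request TTFT, TPOT, and E2E values, and tok/s and req/s as totals divided by the wall-clock duration of the run. A hedged sketch of that aggregation (the `quantile` helper and `summarize` function are illustrative assumptions, not the actual harness code):

```python
def quantile(samples: list[float], q: float) -> float:
    """Nearest-rank quantile over per-request samples (hypothetical helper)."""
    s = sorted(samples)
    idx = min(len(s) - 1, max(0, round(q * (len(s) - 1))))
    return s[idx]

def summarize(ttfts, tpots, e2es, total_tokens, wall_seconds, n_requests):
    """Collapse per-request samples into one benchmark-table row."""
    return {
        "ttft_p50": quantile(ttfts, 0.50),
        "ttft_p99": quantile(ttfts, 0.99),
        "tpot_p50": quantile(tpots, 0.50),
        "tpot_p99": quantile(tpots, 0.99),
        "e2e_p50": quantile(e2es, 0.50),
        "tok_s": total_tokens / wall_seconds,  # aggregate output-token throughput
        "req_s": n_requests / wall_seconds,    # aggregate request throughput
    }
```

Note that tok/s here is an aggregate over all concurrent streams, which is why it keeps climbing with concurrency even as per-request TTFT and TPOT degrade.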