LLM Deployment

Built an end-to-end workflow for high-throughput LLM serving, combining optimized inference pipelines, model quantization, and task-specific fine-tuning to reduce latency and memory footprint and improve cost-performance trade-offs.

Local LLM Deployment Benchmark

Model: Meta-Llama-3-8B-Instruct-FP8 | GPU: NVIDIA RTX 4090

Endpoint: localhost:8000/v1/completions | Max tokens: 128
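Each run issues plain completions calls against the local endpoint. A representative request payload, assuming an OpenAI-compatible server (the prompt is illustrative, not from the benchmark; streaming is enabled so per-token timings can be observed):

```python
import json
import urllib.request

payload = {
    "model": "Meta-Llama-3-8B-Instruct-FP8",  # model name as served locally
    "prompt": "Explain KV-cache paging in one paragraph.",  # illustrative prompt
    "max_tokens": 128,   # matches the benchmark setting above
    "stream": True,      # stream tokens so TTFT/TPOT can be measured
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     for line in resp:  # SSE chunks, one (or more) tokens each
#         ...
```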

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 16.1 | 37.4 | 11.5 | 11.7 | 1480 | 87 | 0.7 |
| 4 | 16 | 33.9 | 34.7 | 11.8 | 11.9 | 1532 | 337 | 2.6 |
| 8 | 32 | 35.3 | 38.4 | 11.7 | 11.8 | 1518 | 675 | 5.3 |
| 16 | 64 | 39.3 | 42.2 | 11.9 | 12.2 | 1544 | 1324 | 10.3 |
| 32 | 128 | 49.1 | 62.8 | 12.0 | 12.8 | 1576 | 2550 | 19.9 |
| 64 | 256 | 68.3 | 98.6 | 13.5 | 13.9 | 1782 | 4579 | 35.8 |
| 128 | 512 | 121.6 | 2279.3 | 15.6 | 16.6 | 2126 | 5480 | 42.8 |
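TTFT and TPOT follow the usual definitions: time from request submission to the first streamed token, and the mean gap between subsequent tokens. A minimal sketch of how they can be derived from per-token arrival timestamps (function names are mine, not from the benchmark harness; the nearest-rank percentile helper mirrors how p50/p99 columns are commonly reported):

```python
import statistics

def ttft_tpot(start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT: first-token arrival minus request start.
    TPOT: mean inter-token gap over the remaining tokens."""
    ttft = token_times[0] - start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = statistics.mean(gaps) if gaps else 0.0
    return ttft, tpot

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over one finished run (e.g. p=50 or p=99)."""
    xs = sorted(values)
    k = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[k]
```

As a sanity check, with max tokens fixed at 128 the end-to-end latency should be roughly TTFT + 127 × TPOT; e.g. at concurrency 1 above, 16.1 + 127 × 11.5 ≈ 1477 ms, close to the reported E2E p50 of 1480 ms.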

Model: Qwen3-32B-FP8 | GPU: NVIDIA RTX 6000 Ada

Endpoint: localhost:8000/v1/completions | Max tokens: 128

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 51.7 | 94.8 | 38.1 | 38.2 | 4896 | 26 | 0.2 |
| 4 | 16 | 86.2 | 111.9 | 40.9 | 46.8 | 5275 | 94 | 0.7 |
| 8 | 32 | 172.8 | 205.5 | 53.9 | 57.8 | 6992 | 144 | 1.1 |
| 16 | 64 | 309.9 | 314.1 | 51.9 | 53.0 | 6900 | 297 | 2.3 |
| 32 | 128 | 340.7 | 366.1 | 48.8 | 52.2 | 6507 | 624 | 4.9 |
| 64 | 256 | 437.4 | 2098.7 | 64.1 | 66.3 | 8584 | 916 | 7.2 |
| 128 | 512 | 707.6 | 15839.3 | 108.9 | 111.9 | 14494 | 812 | 6.3 |

Model: Qwen3-30B-A3B-Instruct-2507-FP8 | GPU: NVIDIA RTX 6000 Ada

Endpoint: localhost:8000/v1/completions | Max tokens: 128

| Concurrency | Requests | TTFT p50 (ms) | TTFT p99 (ms) | TPOT p50 (ms) | TPOT p99 (ms) | E2E p50 (ms) | tok/s | req/s |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 33.0 | 54.2 | 8.3 | 8.3 | 1081 | 118 | 0.9 |
| 4 | 16 | 46.2 | 56.0 | 9.8 | 9.9 | 1295 | 395 | 3.1 |
| 8 | 32 | 42.5 | 58.3 | 10.2 | 11.0 | 1340 | 764 | 6.0 |
| 16 | 64 | 62.7 | 74.9 | 10.2 | 11.0 | 1355 | 1501 | 11.7 |
| 32 | 128 | 78.0 | 88.7 | 12.1 | 12.3 | 1590 | 2650 | 20.7 |
| 64 | 256 | 96.5 | 126.8 | 12.8 | 13.9 | 1703 | 4740 | 37.2 |
| 128 | 512 | 152.0 | 2852.5 | 20.2 | 21.8 | 2735 | 4297 | 33.6 |
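The tok/s and req/s columns are wall-clock aggregates over each run, not per-request numbers, which is why throughput keeps climbing with concurrency until the server saturates and then dips at concurrency 128 for the two larger configurations (alongside the multi-second TTFT p99 values, a sign of request queueing). A sketch of that aggregation, assuming all requests in a run finish inside the measured window (function name is mine):

```python
def aggregate_throughput(
    window_s: float, tokens_per_request: list[int]
) -> tuple[float, float]:
    """Wall-clock throughput for one run: (tokens/s, requests/s).

    window_s: elapsed seconds from first request sent to last response done.
    tokens_per_request: generated-token counts of the completed requests.
    """
    tok_s = sum(tokens_per_request) / window_s
    req_s = len(tokens_per_request) / window_s
    return tok_s, req_s
```

For example, 80 requests of 128 tokens completed over a 10 s window yields 1024 tok/s and 8 req/s.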