AI Model Launch Settings

LOCKED Configuration — Do Not Change Without Approval

DEFINITIVE MODEL LAUNCH SETTINGS — NEVER LOSE AGAIN

Created: Session 951 (2026-03-12)
Updated: Session 968 (2026-03-14) — Synced with ACTUAL running Docker containers
Cost of this lesson: $2,000+ over 3 days of downtime
Root cause: Wrong Python packages in manual venv (torch cu128 vs cu129, missing flashinfer-jit-cache, wrong sgl-fa4 version)
Solution: Run models from official Docker image lmsysorg/sglang:dev-x86


THE GOLDEN RULE

NEVER run SGLang from a manual venv again. ALWAYS use the official Docker image.

The Docker image lmsysorg/sglang:dev-x86 contains the EXACT blessed package set that SGLang developers test with. Our manual venv had:

- torch==2.9.1+cu128 (WRONG — should be cu129)
- flashinfer-python==0.6.3 (WRONG — should be 0.6.4)
- Missing flashinfer-jit-cache==0.6.4+cu129 entirely
- sgl-fa4==4.0.5 (WRONG — should be 4.0.3)

These mismatches caused SIGSEGV (segfault) during CUDA graph capture EVERY time.
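
A freeze diff would have caught every one of these mismatches before launch. Below is a minimal sketch (Python; the function name `find_mismatches` and the inline freeze strings are illustrative, not existing tooling) of comparing a live `pip freeze` against the blessed freeze file:

```python
# Sketch: diff a live environment's `pip freeze` output against the
# blessed freeze (e.g. docs/BLESSED_DOCKER_PIP_FREEZE.txt).

def find_mismatches(live_freeze: str, blessed_freeze: str) -> dict:
    """Return {package: (live_version, blessed_version)} for every
    blessed pin that differs or is missing in the live environment."""
    def parse(text):
        pins = {}
        for line in text.splitlines():
            line = line.strip()
            if "==" in line and not line.startswith("#"):
                name, version = line.split("==", 1)
                pins[name.lower()] = version
        return pins

    live, blessed = parse(live_freeze), parse(blessed_freeze)
    return {
        pkg: (live.get(pkg), ver)
        for pkg, ver in blessed.items()
        if live.get(pkg) != ver
    }
```

Running this against the Session 949-951 venv would have flagged torch, both flashinfer packages, and sgl-fa4 in seconds instead of days.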


QWEN3.5-397B-A17B-FP8 (PRIMARY — Port 8010, GPUs 0-3)

Docker Launch Command (PRODUCTION — ACTUAL RUNNING CONFIG)

docker run -d --name truthsi-llm-primary \
  --gpus '"device=0,1,2,3"' \
  --shm-size 32g \
  --restart unless-stopped \
  -v /opt/dlami/nvme/models/Qwen3.5-397B-A17B-FP8:/model \
  -p 8010:8010 \
  --env SGLANG_DISABLE_CUDNN_CHECK=1 \
  --env SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
  lmsysorg/sglang:dev-x86 \
  python -m sglang.launch_server \
    --model-path /model \
    --tp 4 \
    --port 8010 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --context-length 1048576 \
    --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"mrope_section":[11,11,10],"mrope_interleaved":true,"rope_theta":10000000,"partial_rotary_factor":0.25}}' \
    --mem-fraction-static 0.92 \
    --served-model-name "genesis,deepseek-chat" \
    --schedule-policy lpm \
    --schedule-conservativeness 0.8 \
    --chunked-prefill-size 16384 \
    --max-prefill-tokens 65536 \
    --enable-mixed-chunk \
    --disable-custom-all-reduce \
    --num-continuous-decode-steps 8 \
    --mamba-full-memory-ratio 1.5 \
    --reasoning-parser qwen3 \
    --cuda-graph-max-bs 1536 \
    --enable-metrics \
    --fp8-gemm-backend triton \
    --moe-runner-backend triton \
    --watchdog-timeout 1200

Key Parameters

| Parameter | Value | Why |
|-----------|-------|-----|
| --gpus "device=0,1,2,3" | GPUs 0-3 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | 397B model needs 4 GPUs |
| --context-length 1048576 | 1M tokens | Extended via YaRN (4x native 262K) |
| --json-model-override-args | YaRN rope_scaling | Full mRoPE preservation (mrope_section, mrope_interleaved, rope_theta, partial_rotary_factor) |
| --mem-fraction-static 0.92 | 92% GPU memory | LOCKED parameter — maximizes KV cache |
| --served-model-name "genesis,deepseek-chat" | Dual aliases | Backward compatibility with deepseek-chat API calls |
| --schedule-policy lpm | Longest Prefix Match | Optimizes for repeated context/prefix sharing |
| --schedule-conservativeness 0.8 | 0.8 | Balances throughput vs latency |
| --chunked-prefill-size 16384 | 16K tokens per chunk | Prevents OOM on long prefills |
| --max-prefill-tokens 65536 | 64K max prefill | Limits single-request prefill memory |
| --enable-mixed-chunk | Mixed chunking | Overlaps prefill and decode for throughput |
| --disable-custom-all-reduce | Disable custom AR | Uses NCCL default (more stable on H200) |
| --num-continuous-decode-steps 8 | 8 decode steps | Batch decode optimization |
| --mamba-full-memory-ratio 1.5 | 1.5x | Memory for Mamba-style attention layers |
| --reasoning-parser qwen3 | Qwen3 parser | Parses thinking/reasoning tokens correctly |
| --cuda-graph-max-bs 1536 | Max batch 1536 | CUDA graph optimization for large batches |
| --enable-metrics | Prometheus metrics | Exposed at /metrics endpoint |
| --fp8-gemm-backend triton | Triton for FP8 GEMM | Avoids DeepGEMM assertion error on H200 |
| --moe-runner-backend triton | Triton for MoE | Stable, tested backend |
| --watchdog-timeout 1200 | 20 min timeout | Allows slow startup |
| --shm-size 32g | Shared memory | Required for NCCL communication |
| --restart unless-stopped | Auto-restart | Survives reboots/crashes |
| SGLANG_DISABLE_CUDNN_CHECK=1 | Disable CuDNN check | torch 2.9.1 + CuDNN < 9.15 compatibility |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 | Allow extended context | REQUIRED to set context > model's native max_position_embeddings (262144) |
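
The 1M context and the YaRN override must stay in sync: --context-length has to equal factor times original_max_position_embeddings. A quick sanity check (Python; values copied from the launch command above, not new configuration):

```python
# Consistency check for the YaRN flags in the launch command above:
# --context-length must equal factor * original_max_position_embeddings.

rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}
context_length = 1048576  # value passed via --context-length

extended = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
assert extended == context_length, (extended, context_length)
```

If these ever drift (e.g. the factor is edited without updating --context-length), positions beyond the scaled window will produce garbage output rather than a clean error.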

Flags NOT Used (And Why)

| Flag | Reason NOT Used |
|------|-----------------|
| --enable-hierarchical-cache | Incompatible with Qwen3.5 hybrid attention (linear + full). Raises "HiRadixCache only supports MHA and MLA yet" |
| --disable-cuda-graph | CUDA graphs ARE enabled (default). --cuda-graph-max-bs 1536 controls the batch size limit |

Performance (Verified Working)


GLM-4.7-FP8 (CRITIC — Port 8011, GPUs 4-7)

Docker Launch Command (PRODUCTION — ACTUAL RUNNING CONFIG)

docker run -d --name truthsi-llm-critic \
  --gpus '"device=4,5,6,7"' \
  --shm-size 32g \
  --restart unless-stopped \
  -v /opt/dlami/nvme/models/GLM-4.7-FP8:/model \
  -p 8011:8011 \
  --env SGLANG_DISABLE_CUDNN_CHECK=1 \
  lmsysorg/sglang:dev-x86 \
  python -m sglang.launch_server \
    --model-path /model \
    --tp 4 \
    --port 8011 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --served-model-name "glm-4.7-fp8" \
    --mem-fraction-static 0.80 \
    --max-running-requests 100 \
    --cuda-graph-max-bs 1024 \
    --fp8-gemm-backend triton \
    --moe-runner-backend triton \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --kv-cache-dtype bf16 \
    --enable-metrics \
    --watchdog-timeout 1200

Key Parameters

| Parameter | Value | Why |
|-----------|-------|-----|
| --gpus "device=4,5,6,7" | GPUs 4-7 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | GLM-4.7 is 355B MoE (32B active) |
| --mem-fraction-static 0.80 | 80% GPU memory | LOCKED — shares GPU 7 with NV-Embed |
| --max-running-requests 100 | Max 100 concurrent | Prevents memory overcommit |
| --cuda-graph-max-bs 1024 | Max batch 1024 | CUDA graph optimization |
| --reasoning-parser glm45 | GLM4.5 parser | Handles interleaved thinking tokens |
| --tool-call-parser glm47 | GLM4.7 tool parser | Parses tool call output format |
| --kv-cache-dtype bf16 | BFloat16 KV cache | Better precision for review tasks |
| --served-model-name "glm-4.7-fp8" | Model name | Used in API routing |
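
The --served-model-name flags determine which "model" strings each OpenAI-compatible endpoint accepts, which is how clients route between the two servers. A minimal client-side sketch (Python; the `SERVERS` map and `build_request` helper are illustrative, only the aliases and ports come from the configs above):

```python
# Sketch: route a chat request to the right port based on model alias.
# Aliases come from the --served-model-name flags; helper names are ours.

SERVERS = {
    "genesis": "http://localhost:8010/v1/chat/completions",        # Qwen3.5 primary
    "deepseek-chat": "http://localhost:8010/v1/chat/completions",  # legacy alias, same server
    "glm-4.7-fp8": "http://localhost:8011/v1/chat/completions",    # GLM-4.7 critic
}

def build_request(model: str, prompt: str):
    """Return (url, payload) for an OpenAI-style chat completion."""
    if model not in SERVERS:
        raise ValueError(f"unknown model alias: {model}")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return SERVERS[model], payload
```

Requests using any other model string will be rejected by the server, so old deepseek-chat callers keep working against the primary without code changes.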

Performance


NV-EMBED-V2 (EMBEDDINGS — Port 8014, GPU 7)

NV-Embed runs as a systemd service, NOT Docker (separate from SGLang).

Service: genesis-nv-embed.service
GPU: 7 (shared with GLM-4.7)
VRAM: ~23 GB (INT8 quantized — Session 933 breakthrough)
Embedding dimension: 4096
Max tokens: 32768


BLESSED DOCKER IMAGE

Image: lmsysorg/sglang:dev-x86
Pull command: docker pull lmsysorg/sglang:dev-x86

Critical Packages Inside (DO NOT CHANGE)

| Package | Version | Notes |
|---------|---------|-------|
| torch | 2.9.1+cu129 | MUST be cu129 (NOT cu128!) |
| flashinfer-python | 0.6.4 | MUST be 0.6.4 (NOT 0.6.3!) |
| flashinfer-cubin | 0.6.4 | Pre-compiled CUDA binaries |
| flashinfer-jit-cache | 0.6.4+cu129 | JIT kernel cache (CRITICAL) |
| sgl-fa4 | 4.0.3 | Flash Attention 4 (NOT 4.0.5!) |
| sgl-kernel | 0.3.21 | SGLang CUDA kernels |
| triton | 3.5.1 | Triton compiler |
| cuda-bindings | 12.9.5 | CUDA 12.9 bindings |

Full Package Freeze

Saved to: /mnt/data/truth-si-dev-env/docs/BLESSED_DOCKER_PIP_FREEZE.txt


RECOVERY PROCEDURE (IF MODELS GO DOWN)

Step 1: Check Container Status

docker ps --filter name=truthsi-llm

Step 2: If Containers Stopped, Restart

docker start truthsi-llm-primary
docker start truthsi-llm-critic

Step 3: If Containers Don't Exist, Recreate

Copy-paste the Docker launch commands above.

Step 4: If Docker Image Missing

docker pull lmsysorg/sglang:dev-x86

Then recreate containers.

Step 5: If Model Files Missing (Spot Instance Termination)

Model files are on /opt/dlami/nvme (EPHEMERAL!). Restore with:

bash scripts/restore-models.sh

Step 6: Verify

curl -s http://localhost:8010/v1/models  # Qwen3.5
curl -s http://localhost:8011/v1/models  # GLM-4.7

RESTORE SCRIPT

Location: scripts/restore-models.sh
This script contains the EXACT same flags as the running Docker containers. It is the authoritative restore procedure.


WHAT WENT WRONG (Session 949-951, 3 Days Lost)

  1. Spot instance was terminated — models had to be reloaded
  2. sglang-venv was rebuilt from pip — pulled WRONG package versions
  3. pip installed torch cu128 instead of cu129 (CUDA toolkit is 12.9)
  4. pip installed flashinfer 0.6.4 but without flashinfer-jit-cache
  5. Research from Session 950 incorrectly recommended downgrading flashinfer to 0.6.3
  6. SIGSEGV during CUDA graph capture — compiled kernels from wrong packages
  7. 3 days of debugging wrong packages, mem-fraction, flags
  8. Solution: Run from Docker image which has the EXACT tested packages

LESSONS LEARNED (PERMANENT)

  1. NEVER rebuild sglang from pip — Always use Docker image
  2. Docker IS production — It's not just for testing
  3. Package versions are SACRED — cu128 vs cu129 matters
  4. When something worked, capture the EXACT environment (pip freeze + Docker image tag)
  5. Test from Docker FIRST when debugging — isolates package issues immediately
  6. NVMe is EPHEMERAL — Model weights must be backed up to /mnt/data

Created by THE ARCHITECT — Session 951
Updated by THE ARCHITECT — Session 968 (synced with actual running Docker containers)

THIS DOCUMENT IS SACRED. NEVER DELETE. ALWAYS UPDATE.