Disaster Recovery Runbook

Step-by-Step Recovery Procedures

GENESIS DISASTER RECOVERY RUNBOOK

Session 963: full dry-run verified 2026-03-13.
Previous version (Session 706) archived below this section.

Answer to Carter's question: "Are we 100% we could be back up after a spot interruption?"
ANSWER: YES. Estimated recovery time: 25-35 minutes.


THE BOTTOM LINE

When a spot instance is terminated, the EBS volumes (10TB /mnt/data + 6TB root) automatically re-attach to the new instance. Everything critical is preserved.


IF THE SERVER DIES — DO THIS

Step 1: Don't Panic

The spot instance died. This is expected and planned for. Your data is safe.

The only thing lost is the NVMe cache (/opt/dlami/nvme), which is rebuilt automatically.


Step 2: Launch a New Instance

Launch a new p5en.48xlarge spot instance in AWS.

Wait 3-5 minutes for the instance to boot and the volumes to mount.

The new instance will have a different public IP address.


Step 3: SSH Into the New Instance

ssh -i ~/.ssh/aws-p5en-key.pem ubuntu@<NEW_IP_ADDRESS>
cd /mnt/data/truth-si-dev-env

Verify drives mounted:

df -h | grep -E "mnt/data|nvme"

You should see /mnt/data at ~10TB. NVMe (/opt/dlami/nvme) will be mostly empty.


Step 4: Restore Model Weights to NVMe (LONGEST STEP ~20 min)

This copies the model weights from EBS to NVMe:

sudo bash /mnt/data/truth-si-dev-env/scripts/nvme-cache-restore.sh

What it does:
- Copies Qwen3.5-397B (379 GB) to NVMe
- Copies GLM-4.7-FP8 (338 GB) to NVMe
- Copies NV-Embed-v2 (15 GB) to NVMe
- Creates a 64 GB swap file on NVMe
- Total: ~732 GB at ~1 GB/s, approximately 15-20 minutes
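The copy-time figure can be sanity-checked with a little arithmetic. The sketch below (bash) adds up the three model sizes listed above and divides by the ~1 GB/s rate; the function name estimate_copy_minutes is illustrative only and is not part of nvme-cache-restore.sh.

```shell
#!/usr/bin/env bash
# Sketch: estimate raw copy time from the sizes listed in this runbook.
# The ~1 GB/s rate is the runbook's own figure; real throughput will vary.

# estimate_copy_minutes <total_gb> <rate_gb_per_s>
# Prints the transfer time in minutes, to one decimal place.
estimate_copy_minutes() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb / rate / 60 }'
}

# Model sizes in GB: Qwen3.5-397B + GLM-4.7-FP8 + NV-Embed-v2.
total_gb=$((379 + 338 + 15))
echo "Total to copy: ${total_gb} GB"
echo "Raw copy time at 1 GB/s: $(estimate_copy_minutes "$total_gb" 1) min"
```

The raw transfer comes to about 12 minutes, so the 15-20 minute estimate leaves headroom for swap file creation and throughput below 1 GB/s.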

Watch it run:

tail -f /var/log/nvme-cache-restore.log

Step 5: Launch All Three AI Models

Once Step 4 completes, run:

sudo bash /mnt/data/truth-si-dev-env/scripts/restore-models.sh

This starts the models in the correct order (critical):
1. NV-Embed (embeddings) on GPU 7: ~1 min
2. Qwen3.5-397B (Genesis primary AI) on GPUs 0-3: ~2 min
3. GLM-4.7-FP8 (critic/reviewer) on GPUs 4-7: ~3 min

The script waits for each to become healthy before moving on.

Watch it run:

tail -f /var/log/restore-models.log
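If a model has to be started by hand, the same "wait until healthy" behaviour can be approximated with a small polling helper. This is a minimal bash sketch, not the logic inside restore-models.sh; the name wait_until and the timeout values in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: generic "retry until healthy" helper (hypothetical, not from restore-models.sh).

# wait_until <timeout_s> <interval_s> <command...>
# Re-runs the command until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$((SECONDS + timeout))
  until "$@" >/dev/null 2>&1; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For example, `wait_until 300 5 curl -sf http://localhost:8014/v1/models` would block until NV-Embed answers on its port, giving up after five minutes. Only once it returns 0 would the next model be started, which preserves the required startup order.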

Step 6: Start Infrastructure Containers

cd /mnt/data/truth-si-dev-env
sudo docker compose up -d

Wait 2 minutes. Then check:

sudo docker ps

All containers should show "Up" status.


Step 7: Restart Daemons

sudo mkdir -p /var/log/truthsi    # Needed for disaster-recovery.sh logs
sudo systemctl restart 'truthsi-*.service'   # Quoted so the shell passes the pattern to systemctl unexpanded

Step 8: Verify Everything Works

curl http://localhost:8010/v1/models   # Qwen3.5 (Genesis primary)
curl http://localhost:8011/v1/models   # GLM-4.7 (critic)
curl http://localhost:8014/v1/models   # NV-Embed (embeddings)
curl http://localhost:8000/health      # API

All four should return valid responses (not "connection refused").
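The four checks can also be run in one pass with a clear OK/FAIL per endpoint. A minimal bash sketch follows; the URLs are the ones listed in Step 8, while the function names probe and check_endpoints and the 5-second timeout are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: probe every Step 8 endpoint and summarize the result.

# probe <url>: succeeds if the URL answers with a non-error HTTP status.
probe() {
  curl -sf -o /dev/null --max-time 5 "$1"
}

# check_endpoints: prints OK or FAIL for each endpoint.
# Returns the number of failed endpoints (0 means all healthy).
check_endpoints() {
  local failed=0 url
  for url in \
    http://localhost:8010/v1/models \
    http://localhost:8011/v1/models \
    http://localhost:8014/v1/models \
    http://localhost:8000/health
  do
    if probe "$url"; then
      echo "OK    $url"
    else
      echo "FAIL  $url"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}
```

Run `check_endpoints` after Step 7. Any FAIL line points at the service to investigate in the Troubleshooting section.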


Step 9: Update Mac SSH Tunnel

The new instance has a different IP. Update and restart the tunnel:

# On Mac:
./scripts/forge-tunnel.sh restart

RECOVERY TIME BREAKDOWN

Phase                     Description                        Time
AWS launch                Boot new p5en.48xlarge instance    3-5 min
Drive attach              EBS volumes auto-attach            0-1 min
Step 4: NVMe restore      rsync 732 GB EBS → NVMe            15-20 min
Step 5: Model startup     NV-Embed + Qwen3.5 + GLM-4.7       4-5 min
Step 6: Docker compose    All infrastructure containers      2 min
Step 7: Daemons           Restart systemd services           1 min
TOTAL                     Full system operational            25-35 min
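The total can be cross-checked by summing the per-phase ranges. The bash sketch below does exactly that; sum_ranges is an illustrative helper, not one of the recovery scripts.

```shell
#!/usr/bin/env bash
# Sketch: add up the per-phase time ranges from the breakdown above.

# sum_ranges <range>...
# Each argument is either "N" or "MIN-MAX" (minutes). Prints "MINSUM-MAXSUM".
sum_ranges() {
  local lo=0 hi=0 r
  for r in "$@"; do
    lo=$((lo + ${r%-*}))   # part before the dash (or the whole value)
    hi=$((hi + ${r#*-}))   # part after the dash (or the whole value)
  done
  echo "${lo}-${hi}"
}

# Phase estimates in order: launch, attach, NVMe restore, models, compose, daemons.
sum_ranges 3-5 0-1 15-20 4-5 2 1   # prints 25-34
```

The phases sum to 25-34 minutes, which agrees with the stated 25-35 minute total to within a minute.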

ONE-COMMAND RECOVERY (Advanced)

For a fully automated recovery, one script handles everything:

sudo bash /mnt/data/truth-si-dev-env/scripts/genesis-boot-recovery.sh

This runs all phases automatically including NVMe restore, model startup, venv rebuild, and MCP tools. Check progress:

cat /mnt/data/genesis-recovery-status.json
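For easier reading, the status file can be pretty-printed. This bash sketch makes no assumption about the fields inside the JSON; it only formats whatever is there, using jq or python3 if one of them happens to be installed. The function name show_status is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pretty-print the recovery status file without assuming its schema.

# show_status [file]
# Defaults to the status path named in this runbook.
show_status() {
  local file=${1:-/mnt/data/genesis-recovery-status.json}
  if [ ! -r "$file" ]; then
    echo "status file not readable yet: $file" >&2
    return 1
  fi
  if command -v jq >/dev/null 2>&1; then
    jq . "$file"
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$file"
  else
    cat "$file"   # fall back to raw output
  fi
}
```

Re-run `show_status` periodically while the recovery script is working, or use `watch -n 5 cat /mnt/data/genesis-recovery-status.json` for a raw live view.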

WHAT'S SAFE vs LOST AFTER SPOT TERMINATION

Data                                          Location                       Safe?
Model weights (Qwen3.5, GLM-4.7, NV-Embed)    /mnt/data/models/              YES (EBS)
Neo4j, Redis, Weaviate, YugabyteDB            /mnt/data/docker/volumes/      YES (EBS)
Backup archives                               /mnt/data/backups/             YES (EBS)
Code, configs, scripts                        /mnt/data/truth-si-dev-env/    YES (EBS + GitHub)
NVMe model cache                              /opt/dlami/nvme/models/        LOST (rebuilt in Step 4)
Running containers                            RAM                            LOST (restarted in Step 6)
Running AI models                             VRAM                           LOST (restarted in Step 5)
Running daemons                               RAM                            LOST (restarted in Step 7)
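Before starting the restore, it can be worth confirming that the EBS-backed locations really are present on the new instance. A minimal bash sketch, using the paths from the table above; check_paths and EBS_PATHS are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: confirm the EBS-backed directories survived the spot termination.

# check_paths <dir>...
# Prints one line per directory and returns the number that are missing.
check_paths() {
  local missing=0 d
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "present  $d"
    else
      echo "MISSING  $d"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# EBS-backed locations listed in this runbook.
EBS_PATHS=(
  /mnt/data/models
  /mnt/data/docker/volumes
  /mnt/data/backups
  /mnt/data/truth-si-dev-env
)
```

Calling `check_paths "${EBS_PATHS[@]}"` right after Step 3 should report every path as present. A MISSING line suggests the /mnt/data volume did not mount.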

TROUBLESHOOTING

Models Not Starting

# Check GPUs
nvidia-smi

# Check container logs
docker logs truthsi-llm-primary --tail 50
docker logs truthsi-llm-critic --tail 50

# Check NV-Embed (runs as systemd, not Docker)
journalctl -u genesis-nv-embed.service -n 50

IMPORTANT: NV-Embed MUST start before GLM-4.7. They share GPU 7. If GLM-4.7 starts first, NV-Embed will fail with an out-of-memory error. The restore script handles this order automatically.
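When an out-of-memory failure is suspected, checking how much memory is still free on GPU 7 helps confirm it. The bash sketch below parses the CSV output of nvidia-smi's query mode; gpu_free_mib is a hypothetical helper, and no memory threshold is implied by the source.

```shell
#!/usr/bin/env bash
# Sketch: report free memory on one GPU from nvidia-smi CSV output.

# gpu_free_mib <gpu_index>
# Reads "index, used, total" rows (values in MiB) on stdin
# and prints the free MiB for the requested GPU.
gpu_free_mib() {
  awk -F', *' -v idx="$1" '$1 == idx { print $3 - $2 }'
}
```

Typical use: `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | gpu_free_mib 7`. If GPU 7 is nearly full before NV-Embed is running, GLM-4.7 was probably started first.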

NVMe Restore Slow or Failed

df -h /opt/dlami/nvme     # Check available space (need 750+ GB free; ~800 GB counting the 64 GB swap file)
ls -la /mnt/data/models/  # Verify source files exist
sudo bash /mnt/data/truth-si-dev-env/scripts/nvme-cache-restore.sh  # Re-run
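The free-space check can be made scriptable so that a re-run only starts when there is room. A minimal bash sketch using POSIX df output; has_free_gb is an illustrative name, and GB is treated here as 1024 x 1024 KiB.

```shell
#!/usr/bin/env bash
# Sketch: test whether a filesystem has enough free space before re-running the restore.

# has_free_gb <path> <needed_gb>
# Succeeds if the filesystem holding <path> has at least <needed_gb> GB available.
has_free_gb() {
  local avail_kb
  # df -P gives one line per filesystem; column 4 is available 1 KiB blocks.
  avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

For example, `has_free_gb /opt/dlami/nvme 750 && sudo bash /mnt/data/truth-si-dev-env/scripts/nvme-cache-restore.sh` re-runs the restore only when the space requirement is met.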

Docker Image Not Cached (Adds ~15 min)

If the SGLang Docker image is not cached on the new instance:

docker pull lmsysorg/sglang:dev-x86

This is ~58 GB and takes 10-15 min. After that, models can be launched.
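To avoid an unnecessary 58 GB download, the pull can be made conditional on the image being absent from the local cache. A minimal bash sketch; ensure_image is a hypothetical helper, and the image tag is the one given above.

```shell
#!/usr/bin/env bash
# Sketch: pull the SGLang image only when it is not already cached locally.

# ensure_image <image>
# "docker image inspect" exits non-zero when the image is not present.
ensure_image() {
  local image=$1
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "cached: $image"
  else
    echo "pulling: $image"
    docker pull "$image"
  fi
}
```

Usage: `ensure_image lmsysorg/sglang:dev-x86`. Prefix the docker calls with sudo if the current user cannot reach the Docker daemon directly.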

API Not Responding

docker ps | grep api              # Check if API container is up
docker logs truthsi-api --tail 50 # Check API logs
docker restart truthsi-api        # Restart if needed

RECOVERY SCRIPTS QUICK REFERENCE

Script                                  Use When
scripts/nvme-cache-restore.sh           After new instance: restore model files to NVMe
scripts/restore-models.sh               Launch all 3 AI models (PRIMARY recovery script)
scripts/genesis-boot-recovery.sh        Full automated recovery (one command)
scripts/disaster-recovery.sh            Restart daemons and verify system state
scripts/full-recovery.sh                Restore databases from backup files
scripts/spot-recovery-nvme-models.sh    Alternative NVMe restore with S3 fallback
scripts/weaviate-restore.py             Restore Weaviate vector database
scripts/restore-from-backup.py          Restore from encrypted backup archives

BACKUP STATUS

Your data is automatically backed up:

Backup Type    Location                                      How Often
Neo4j          /mnt/data/backups/enterprise/neo4j/15min/     Every 15 minutes
Neo4j          /mnt/data/backups/enterprise/neo4j/hourly/    Every hour
Neo4j          /mnt/data/backups/enterprise/neo4j/daily/     Daily
Neo4j          /mnt/data/backups/enterprise/neo4j/spot/      On spot interruption event
Weaviate       /mnt/data/backups/weaviate/                   Multiple times daily
Redis          /mnt/data/backups/enterprise/redis/           Continuous
YugabyteDB     /mnt/data/backups/enterprise/yugabyte/        Regular
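Whether a backup job is still producing files can be checked by looking for a recently modified file in its directory. A minimal bash sketch; backup_is_fresh is an illustrative name, and the 30-minute threshold in the usage note is an assumption chosen to allow one missed 15-minute cycle.

```shell
#!/usr/bin/env bash
# Sketch: check that a backup directory contains a recently written file.

# backup_is_fresh <dir> <max_age_minutes>
# Succeeds if <dir> holds at least one regular file modified
# less than <max_age_minutes> minutes ago.
backup_is_fresh() {
  [ -d "$1" ] || return 1
  [ -n "$(find "$1" -type f -mmin "-$2" -print 2>/dev/null | head -n 1)" ]
}
```

Example: `backup_is_fresh /mnt/data/backups/enterprise/neo4j/15min 30 || echo "Neo4j 15-minute backups look stale"`.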

IMPORTANT FILES TO KNOW

File                                           What It Contains
docs/DEFINITIVE_MODEL_LAUNCH_SETTINGS.md       Exact Docker commands and flags for each model
docs/BLESSED_DOCKER_PIP_FREEZE.txt             Exact package versions in the Docker image
venv-backups/nv-embed-server.py.backup         NV-Embed server script backup
venv-backups/nv-embed-venv-requirements.txt    NV-Embed Python requirements
/mnt/data/genesis-recovery-status.json         Real-time recovery status from genesis-boot-recovery.sh

WHAT THIS DRY RUN VERIFIED (Session 963)

This runbook was created after a full verification pass, including actual script execution.


PREVIOUS RUNBOOK (Session 706)

The original disaster recovery runbook from Session 706 covered file deletion, ransomware, and code corruption scenarios. That content is no longer accurate (old file paths, pre-spot-instance architecture). See git history for the original content if needed.


THE ARCHITECT — Session 963 Dry-run verified: 2026-03-13