Enterprise Backup Guide

Complete Backup and Recovery Procedures

Truth.SI / Genesis AWS — Session 938 EXT1

Last Updated: 2026-03-09
Status: PRODUCTION (all systems active)
Recovery Target: < 10 minutes


QUICK REFERENCE

What you need            Command
Run a manual backup now  bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh [15min|hourly|daily]
Check backup health      ls -la /mnt/data/backups/enterprise/*/hourly/ 2>/dev/null
Restore everything       bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh
Dry-run recovery         bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --dry-run
Restore one service      bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --service redis
Take EBS snapshot        bash /mnt/data/truth-si-dev-env/scripts/ebs-snapshot.sh manual
Check timer status       systemctl list-timers 'truthsi-backup-*'

ARCHITECTURE OVERVIEW

The 3-2-1-1-0 Rule (Implemented)

Backup Stack

Genesis AWS (p5en.48xlarge)
│
├── /mnt/data/backups/enterprise/     ← LOCAL BACKUPS (10TB EBS, persistent)
│   ├── neo4j/
│   │   ├── 15min/   (SKIPPED — Neo4j backup takes 5-8 min, exceeds window)
│   │   ├── hourly/  (48 retained = 2 days coverage)
│   │   ├── daily/   (14 retained = 2 weeks coverage)
│   │   └── weekly/  (52 retained = 1 year coverage)
│   ├── redis/       (same interval structure)
│   ├── yugabyte/    (same interval structure)
│   ├── weaviate/    (daily/weekly only: 26.8M objects, slow backup)
│   └── env/         (all intervals — small, fast)
│
├── EBS Snapshots (AWS)               ← VOLUME-LEVEL BACKUPS
│   └── Tagged: Project=truth-si, daily + on spot-warning
│
└── S3 (truthsi-sovereign-backup*)    ← CLOUD BACKUPS
    └── Synced by s3-sovereignty-sync daemon

BACKUP SERVICES

Systemd Timers (Running Automatically)

Timer                        Schedule      Services backed up              Timeout
truthsi-backup-15min.timer   Every 15 min  Redis, YugabyteDB, .env         240s
truthsi-backup-hourly.timer  Every hour    Neo4j, Redis, YugabyteDB, .env  1200s
truthsi-backup-daily.timer   Daily 00:00   All services + Weaviate         1800s

Check timer status:

systemctl list-timers 'truthsi-backup-*'
systemctl status truthsi-backup-hourly.timer
journalctl -u truthsi-backup-hourly -n 50

Why 15min skips Neo4j: Neo4j's 27GB+ data directory takes 5-8 minutes to docker cp and compress. That exceeds the 15min job's 240s timeout and would leave little buffer before the next run, so Neo4j runs on the hourly and daily timers instead.
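For orientation, a timer/service pair matching the 15min row of the schedule could look roughly like the output of the two functions below. This is a sketch, not the deployed units: the function names and unit contents are assumptions, built only from the schedule, the 240s timeout, and the Persistent=true and oneshot choices mentioned in this guide.

```shell
# Hypothetical generators: print unit files for the 15min schedule.
# The unit contents are illustrative assumptions, not the deployed files.
print_15min_timer() {
    cat <<'EOF'
[Unit]
Description=Truth.SI 15-minute backup timer

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

print_15min_service() {
    cat <<'EOF'
[Unit]
Description=Truth.SI 15-minute backup

[Service]
Type=oneshot
ExecStart=/bin/bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh 15min
TimeoutStartSec=240
EOF
}
```

Compare the output against the real files with systemctl cat truthsi-backup-15min.timer before relying on it.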

Backup Script: run-backup-once.sh

The workhorse of the backup system. Standalone bash script — no daemon, no Redis Streams queue, no blocking.

# Usage
bash scripts/run-backup-once.sh 15min    # Fast: Redis + YugabyteDB + .env
bash scripts/run-backup-once.sh hourly   # Neo4j + Redis + YugabyteDB + .env
bash scripts/run-backup-once.sh daily    # Full + Weaviate
bash scripts/run-backup-once.sh weekly   # Same as daily, longer retention

Log: /var/log/truthsi-backup-<interval>.log
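The interval-to-service mapping can be pictured as a single case statement. The sketch below is illustrative only: the function name services_for_interval is hypothetical, the mapping is taken from the timer table in this guide, and the real run-backup-once.sh may be organized differently.

```shell
# Hypothetical helper: map a backup interval to the services it covers.
# Mapping follows the timer table in this guide; not copied from the script.
services_for_interval() {
    case "$1" in
        15min)        echo "redis yugabyte env" ;;
        hourly)       echo "neo4j redis yugabyte env" ;;
        daily|weekly) echo "neo4j redis yugabyte weaviate env" ;;
        *)            echo "unknown interval: $1" >&2; return 1 ;;
    esac
}

# Example:
# for svc in $(services_for_interval hourly); do echo "would back up $svc"; done
```

Rejecting unknown intervals up front keeps a typo on the command line from silently producing an empty backup run.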

Backup Storage Format

Each backup creates a timestamped directory:

/mnt/data/backups/enterprise/<service>/<interval>/<YYYYMMDD_HHMMSS>/
├── <data file>      (neo4j_data.tar.gz, redis_<ts>.rdb, yugabyte_<ts>.sql.gz)
└── metadata.json    (service, interval, timestamp, size_bytes, status, host, backed_up_at)

metadata.json example:

{
  "service": "redis",
  "interval": "hourly",
  "timestamp": "20260309_043820",
  "size_bytes": 274890752,
  "status": "success",
  "host": "ip-10-0-0-1",
  "backed_up_at": "2026-03-09T04:38:20Z"
}
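A metadata file in this shape could be produced with a small helper like the one below. It is a sketch under assumptions: write_metadata and its argument order are hypothetical, and the backup directory is assumed to be named with the YYYYMMDD_HHMMSS timestamp.

```shell
# Hypothetical helper: write metadata.json next to a finished backup file.
# Assumes the backup directory name is the YYYYMMDD_HHMMSS timestamp.
write_metadata() {
    local dir="$1" service="$2" interval="$3" data_file="$4" status="$5"
    local size
    # wc -c may pad its output with spaces; arithmetic expansion normalizes it
    size=$(( $(wc -c < "$data_file") ))
    cat > "${dir}/metadata.json" <<EOF
{
  "service": "${service}",
  "interval": "${interval}",
  "timestamp": "$(basename "$dir")",
  "size_bytes": ${size},
  "status": "${status}",
  "host": "$(uname -n)",
  "backed_up_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
}
```

Writing the file last, after the data file is complete, means its presence can serve as a rough signal that the backup finished.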

RETENTION POLICY

Interval  Retained    Coverage
15min     3 backups   45 minutes
hourly    48 backups  2 days
daily     14 backups  2 weeks
weekly    52 backups  1 year

Old backups are automatically pruned at the end of each backup run.
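Pruning of this kind can be sketched in a few lines. The helper name prune_old_backups is hypothetical, and the approach relies on the YYYYMMDD_HHMMSS directory names sorting chronologically as plain strings; the actual script may prune differently.

```shell
# Hypothetical helper: keep the newest N timestamped directories under one
# <service>/<interval>/ directory and delete the rest.
# Relies on YYYYMMDD_HHMMSS names sorting chronologically as strings.
prune_old_backups() {
    local interval_dir="$1" keep="$2"
    [ -d "$interval_dir" ] || return 0
    find "$interval_dir" -mindepth 1 -maxdepth 1 -type d \
        | sort -r \
        | tail -n +"$((keep + 1))" \
        | while IFS= read -r old; do
              rm -rf -- "$old"
          done
}

# Example, using the hourly retention from the table above:
# prune_old_backups /mnt/data/backups/enterprise/redis/hourly 48
```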


SPOT INTERRUPTION HANDLER

AWS Spot instances get a 2-minute warning before termination.

The spot handler (scripts/spot-interruption-handler.sh) polls the IMDS endpoint every 5 seconds. When termination is detected, it runs this sequence within the 2-minute window:

Step  Action                                      Time budget
1     ntfy.sh urgent alert                        ~1s
2     Redis BGSAVE + docker cp                    ~5s
3     .env backup                                 ~1s
4     Git emergency commit + push                 ~15s
5     Neo4j docker cp (best-effort, 60s timeout)  up to 60s
6     EBS snapshot of all volumes                 async
7     S3 sync                                     async
8     Stop SGLang model services gracefully       ~10s
9     Completion notification                     ~1s
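The polling and the time-boxed steps can be sketched as follows. This is not the contents of spot-interruption-handler.sh: the function names are hypothetical, the IMDSv2 token flow is assumed, and timeout is assumed to be the GNU coreutils command. The metadata paths are the standard EC2 ones, and the instance-action path returns HTTP 404 until an interruption is scheduled.

```shell
# Standard EC2 instance metadata base URL (IMDSv2 assumed).
IMDS=http://169.254.169.254/latest

# Returns 0 when a spot interruption notice is present, 1 otherwise.
spot_interruption_pending() {
    local token code
    token=$(curl -s --max-time 2 -X PUT "${IMDS}/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    code=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
        -H "X-aws-ec2-metadata-token: ${token}" \
        "${IMDS}/meta-data/spot/instance-action") || return 1
    [ "$code" = "200" ]
}

# Run one step with a hard time limit. Failures and timeouts are logged
# but never abort the remaining steps (best-effort, like the Neo4j copy).
run_step() {
    local limit="$1" name="$2"
    shift 2
    if timeout "$limit" "$@"; then
        echo "OK   ${name}"
    else
        echo "FAIL ${name} (exit $?)"
    fi
    return 0
}

# Poll loop, left commented so the sketch can be sourced safely:
# while ! spot_interruption_pending; do sleep 5; done
# run_step 60 "neo4j copy" docker cp truthsi-neo4j:/data "$SPOT_DIR"   # SPOT_DIR is a placeholder
```

Giving every step its own limit keeps one slow step from consuming the whole 2-minute window.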

Spot backups land in:

/mnt/data/backups/enterprise/redis/spot/<timestamp>/
/mnt/data/backups/enterprise/neo4j/spot/<timestamp>/
/mnt/data/backups/enterprise/env/spot/<timestamp>/

Notification: Uses ntfy.sh — topic set via NTFY_TOPIC env var in .env

Service: truthsi-spot-handler.service (persistent, always running)


EBS SNAPSHOT AUTOMATION

EBS snapshots capture the entire /mnt/data volume (10TB) at a point in time. This is the fastest way to restore everything after spot termination: create a volume from the snapshot and attach it to a new instance.

Script: scripts/ebs-snapshot.sh

# Manual snapshot
bash scripts/ebs-snapshot.sh manual

# Called automatically by spot handler on interruption
bash scripts/ebs-snapshot.sh spot-warning

Retention: Snapshots older than 7 days are automatically deleted.

Tags applied to each snapshot:

  - Name: TruthSI-<reason>-<timestamp>
  - Project: truth-si
  - Instance: <instance-id>
  - Device: /dev/sdb (or whichever device)
  - Reason: manual | spot-warning | daily
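As an illustration of how these tags might be passed to the AWS CLI, the sketch below builds a --tag-specifications value in the CLI's shorthand syntax. The helper name snapshot_tag_spec and its argument order are assumptions; ebs-snapshot.sh may construct its tags another way.

```shell
# Hypothetical helper: build the --tag-specifications argument for one
# snapshot, using the tag keys listed in this guide.
snapshot_tag_spec() {
    local reason="$1" instance_id="$2" device="$3" ts="$4"
    printf 'ResourceType=snapshot,Tags=[{Key=Name,Value=TruthSI-%s-%s},{Key=Project,Value=truth-si},{Key=Instance,Value=%s},{Key=Device,Value=%s},{Key=Reason,Value=%s}]' \
        "$reason" "$ts" "$instance_id" "$device" "$reason"
}

# Usage (requires AWS credentials; VOLUME_ID and INSTANCE_ID are placeholders):
# aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
#     --description "TruthSI manual snapshot" \
#     --tag-specifications "$(snapshot_tag_spec manual "$INSTANCE_ID" /dev/sdb "$(date -u +%Y%m%d_%H%M%S)")"
```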

View snapshots:

aws ec2 describe-snapshots --filters "Name=tag:Project,Values=truth-si" \
    --query 'Snapshots[*].[SnapshotId,StartTime,Description,State]' \
    --output table

RECOVERY PROCEDURES

One-Command Full Recovery

bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh

Automatically:

  1. Checks prerequisites (docker, backup root)
  2. Restores .env (if not present)
  3. Starts all Docker containers
  4. Restores Redis (< 60s)
  5. Restores YugabyteDB (< 120s)
  6. Restores Neo4j (< 300s)
  7. Restores Weaviate (via filesystem backup API)
  8. Verifies all services healthy
  9. Reports total time (target: < 10 minutes)

Backup selection: Picks the most recently timestamped backup across ALL intervals (hourly, daily, 15min, spot, weekly). A fresh 15min backup beats a stale hourly backup.
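The selection rule can be sketched by comparing the timestamp directory names across every interval directory. The helper name latest_backup is hypothetical, and full-recovery.sh may implement the rule differently (for example, by reading metadata.json).

```shell
# Hypothetical helper: print the newest backup directory for a service,
# comparing YYYYMMDD_HHMMSS directory names across all intervals.
latest_backup() {
    local root="$1" service="$2"
    find "${root}/${service}" -mindepth 2 -maxdepth 2 -type d 2>/dev/null \
        | awk -F/ '{ print $NF "\t" $0 }' \
        | sort -r \
        | head -n 1 \
        | cut -f2-
}

# Example:
# latest_backup /mnt/data/backups/enterprise redis
```

Sorting on the directory name rather than file modification time keeps the result stable even if a backup directory is touched or copied later.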

Supported flags:

--dry-run          # Show what would be restored, no changes made
--service neo4j    # Restore only one service
--service redis
--service yugabyte
--service weaviate
--service env

Manual Redis Restore

# Find latest backup
ls -dt /mnt/data/backups/enterprise/redis/*/*/ | head -5

# Restore uncompressed (.rdb — new format)
docker stop truthsi-redis
docker cp /mnt/data/backups/enterprise/redis/15min/<ts>/redis_<ts>.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis

# Restore compressed (.rdb.gz — old format)
gunzip -c /mnt/data/backups/enterprise/redis/hourly/<ts>/dump.rdb.gz > /tmp/dump.rdb
docker stop truthsi-redis
docker cp /tmp/dump.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis

Manual Neo4j Restore

# Unpack the archive into a scratch directory
TMP=/tmp/neo4j-restore-$$
mkdir -p "$TMP"
tar xzf /mnt/data/backups/enterprise/neo4j/hourly/<ts>/neo4j_data.tar.gz -C "$TMP"

# Replace the data directory while Neo4j is stopped
docker stop truthsi-neo4j
sleep 5
docker cp "${TMP}/neo4j_bak_<ts>/." truthsi-neo4j:/data/
docker start truthsi-neo4j
sleep 25    # give Neo4j time to start before verifying

# Verify
docker exec truthsi-neo4j cypher-shell -u neo4j -p "${NEO4J_PASSWORD}" "RETURN 1"
rm -rf "$TMP"

Credential note: If Neo4j password was rotated after the backup was taken, the restored auth file reflects the OLD password. Login with the old password, then update it.
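The fixed sleeps in the Neo4j procedure could be replaced by polling until the verification command succeeds. The helper below is a generic sketch; wait_for is a hypothetical name and is not part of the existing scripts.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) elapses. Returns 0 on success, 1 on timeout.
wait_for() {
    local timeout_s="$1"
    shift
    local deadline=$(( $(date +%s) + timeout_s ))
    until "$@" >/dev/null 2>&1; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
    return 0
}

# Example: wait up to 120s for Neo4j to answer a trivial query
# wait_for 120 docker exec truthsi-neo4j cypher-shell -u neo4j -p "${NEO4J_PASSWORD}" "RETURN 1"
```

Polling returns as soon as the service is ready and fails loudly if it never comes up, which a fixed sleep cannot do.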

Manual YugabyteDB Restore

YB_DUMP=/mnt/data/backups/enterprise/yugabyte/hourly/<ts>/yugabyte_<ts>.sql.gz
zcat "$YB_DUMP" | docker exec -i truthsi-yugabyte \
    bash -c "PGPASSWORD='${YUGABYTE_PASSWORD}' psql -h localhost -U yugabyte -d yugabyte"

Restore from EBS Snapshot (Full Instance Recovery)

For complete instance recovery on a new spot instance:

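  1. Launch new p5en.48xlarge spot instance in us-west-2
  2. AWS console: Volumes → Create volume from snapshot (filter by Project=truth-si tag, pick latest). Create the volume in the same Availability Zone as the new instance.
  3. Attach volume to new instance as /dev/sdb
  4. Mount: sudo mount /dev/sdb /mnt/data (on Nitro-based instances the volume may appear as an NVMe device such as /dev/nvme1n1; check with lsblk)
  5. All data is available without any per-database restore. Blocks load lazily from the snapshot, so first reads may be slower than usual.
  6. Run: cd /mnt/data/truth-si-dev-env && docker compose up -d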
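Steps 2 and 3 can also be done from the AWS CLI. The sketch below is untested against this account and rests on assumptions: the function names are hypothetical, the caller is assumed to own the snapshots, and the Availability Zone and instance ID must be supplied by the operator.

```shell
# Sketch only: CLI equivalent of steps 2-3. Requires AWS credentials.

# Print the ID of the newest completed snapshot tagged Project=truth-si.
latest_project_snapshot() {
    aws ec2 describe-snapshots --owner-ids self \
        --filters "Name=tag:Project,Values=truth-si" "Name=status,Values=completed" \
        --query 'sort_by(Snapshots, &StartTime)[-1].SnapshotId' --output text
}

# Create a volume from a snapshot, wait until it is available, attach it,
# and print the new volume ID.
restore_volume_from_snapshot() {
    local snapshot_id="$1" az="$2" instance_id="$3" device="${4:-/dev/sdb}"
    local volume_id
    volume_id=$(aws ec2 create-volume --snapshot-id "$snapshot_id" \
        --availability-zone "$az" --query 'VolumeId' --output text) || return 1
    aws ec2 wait volume-available --volume-ids "$volume_id" || return 1
    aws ec2 attach-volume --volume-id "$volume_id" \
        --instance-id "$instance_id" --device "$device" >/dev/null || return 1
    echo "$volume_id"
}

# Example (AZ and INSTANCE_ID are placeholders):
# restore_volume_from_snapshot "$(latest_project_snapshot)" "$AZ" "$INSTANCE_ID"
```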

BACKUP HEALTH MONITORING

Check Current Status

for svc in neo4j redis yugabyte weaviate env; do
    echo "=== $svc ==="
    # Newest three backup directories (by mtime) across all intervals
    ls -dt /mnt/data/backups/enterprise/"${svc}"/*/*/ 2>/dev/null | head -3 | while read -r d; do
        d="${d%/}"
        info=$(python3 -c 'import sys,json; m=json.load(open(sys.argv[1])); print(m.get("status","?"), m.get("size_bytes",0)//1024//1024, "MB")' \
            "$d/metadata.json" 2>/dev/null || echo 'no metadata')
        echo "  $(basename "$(dirname "$d")")/$(basename "$d"): $info"
    done
done

Timer Next Run Times

systemctl list-timers 'truthsi-backup-*' --no-pager

Backup Logs

tail -50 /var/log/truthsi-backup-15min.log
tail -50 /var/log/truthsi-backup-hourly.log
tail -50 /var/log/truthsi-backup-daily.log

Staleness Alert Thresholds

Service     Alert if backup older than
Redis       2 hours
YugabyteDB  2 hours
Neo4j       4 hours
Weaviate    24 hours
.env        2 hours
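A threshold check against these values might look like the sketch below. The helper name backup_is_stale is hypothetical, GNU date is assumed, and the directory timestamp is treated as UTC, which matches the metadata.json example where timestamp and backed_up_at agree.

```shell
# Hypothetical helper: exit 0 when a backup timestamp (YYYYMMDD_HHMMSS,
# assumed UTC) is older than max_hours, 1 when it is fresh, 2 when the
# timestamp cannot be parsed. GNU date assumed.
backup_is_stale() {
    local ts="$1" max_hours="$2"
    local iso backup_epoch now_epoch
    # 20260309_043820 -> "2026-03-09 04:38:20"
    iso=$(echo "$ts" | sed -E 's/^(....)(..)(..)_(..)(..)(..)$/\1-\2-\3 \4:\5:\6/')
    backup_epoch=$(date -u -d "$iso" +%s 2>/dev/null) || return 2
    now_epoch=$(date -u +%s)
    [ $(( now_epoch - backup_epoch )) -gt $(( max_hours * 3600 )) ]
}

# Example, using the Redis threshold from the table above:
# backup_is_stale 20260309_043820 2 && echo "ALERT: redis backup is stale"
```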

KNOWN ISSUES & WORKAROUNDS

Neo4j 15min Skip

Issue: Neo4j backup takes 5-8 minutes, too slow for the 15min interval.
Workaround: 15min timer skips Neo4j. Neo4j runs hourly and daily only.

Redis Format Compatibility

Issue: Old enterprise-backup-daemon.py created dump.rdb.gz; new run-backup-once.sh creates redis_<ts>.rdb.
Workaround: full-recovery.sh handles both formats automatically (decompresses .gz if needed).

S3 Sync Permissions

Issue: genesisdeploy IAM user lacks the s3:ListBucket permission (the IAM action that authorizes ListObjectsV2 calls) on backup buckets.
Workaround: Attach IAM policy with s3:* on arn:aws:s3:::truthsi-sovereign-backup*, or use instance role.


ARCHITECTURE DECISIONS

Decision                            Rationale
Standalone script vs daemon         enterprise-backup-daemon.py was blocked in a Redis Streams polling loop; a standalone oneshot script avoids that bug entirely
systemd timers vs cron              Timers have Persistent=true (catch-up after reboot), built-in logging via journald, dependency management
Skip Neo4j at 15min                 Backing up the 27GB+ Neo4j data directory takes 5-8 minutes, too long for the 15min job; hourly coverage is sufficient
Most-recent timestamp wins          A fresh 15min backup is more valuable than a stale hourly of the same data
EBS snapshots as disaster recovery  Fastest path to full restore on new instance; no per-database restore scripts needed
ntfy.sh for spot alerts             Zero-infrastructure push notifications; works without Slack/email setup

FILES REFERENCE

File                                                       Purpose
scripts/run-backup-once.sh                                 Main backup script (called by systemd timers)
scripts/full-recovery.sh                                   One-command full recovery
scripts/ebs-snapshot.sh                                    EBS volume snapshot automation
scripts/spot-interruption-handler.sh                       Spot 2-minute warning handler
/etc/systemd/system/truthsi-backup-15min.{service,timer}   15-minute timer
/etc/systemd/system/truthsi-backup-hourly.{service,timer}  Hourly timer
/etc/systemd/system/truthsi-backup-daily.{service,timer}   Daily timer
/var/log/truthsi-backup-*.log                              Backup run logs
/mnt/data/backups/enterprise/                              Backup root directory

Created: Session 938 EXT1, THE ARCHITECT
Carter's mandate: "Every fucking little tiny thing where we can be back up in 5-10 minutes, not this fucking insanity."