Veeam University: Week 6 (Intermediate)
Week 6: VM Replication & Failover

⏱️ 4 hours

🔁 Replication

⚡ Failover

📋 DR Plan

🎓 Learning Outcomes

  • ✅ Understand the difference between backup and replication
  • ✅ Create a replication job (RPO 30 minutes)
  • ✅ Perform failover tests (planned and unplanned)
  • ✅ Measure actual RTO
  • ✅ Write a basic DR runbook

Backup vs Replication

Criterion      | Backup                          | Replication
Stored where?  | Repository (.vbk/.vib files)    | Datastore (ready-to-run VM replica)
RTO            | 15-60 minutes (restore needed)  | 2-5 minutes (power on only)
RPO            | Daily (8-24 h)                  | 15-60 minutes (frequent sync)
Storage cost   | Low (compressed)                | High (full VM copy)
Use case       | Long-term retention             | Critical VMs, fast failover

Replication Flow:

[Source VM] ──(snapshot every 30 min)──→ [Proxy] ──→ [Replica VM]
(Production)                                         (Standby, powered off)

On failover:
[Source VM] ✗ DOWN ──→ [Replica VM] POWER ON ──→ Takes over!
RTO: 2-5 minutes

LAB 6A: Create a Replication Job

Step 1: Prepare the Target Datastore

# In the hypervisor (vSphere/Hyper-V):
# Create a dedicated datastore for replicas

vSphere:
├─ Datastore name: DS-Replicas
├─ Size: 200GB+ (enough for the replica VMs)
├─ Type: NFS or VMFS
└─ Location: Secondary host/cluster (if available)

Hyper-V:
├─ Path: D:\Replicas\
├─ Size: 200GB+
└─ Location: Different physical disk

Step 2: Create the Replication Job

Veeam Console → Home → Jobs → Replication

Configuration:
├─ Name: Lab6-Repl-WebServer
├─ Source VM: Web-Server-01
├─ Target host: [Secondary ESXi / Same host]
├─ Target datastore: DS-Replicas
├─ Replica name suffix: _replica
├─ Restore points: 7 (keep 7 restore points)
├─ RPO schedule: Every 30 minutes
│  ├─ Start: Now
│  └─ Run continuously: ✓
└─ Network mapping:
   ├─ Source network: Production-VLAN
   └─ Target network: DR-VLAN (or the same network)

→ Click Finish & Run
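
To see how far back the replica can be rolled with these settings, the following is a minimal sketch that relates the 30-minute schedule to the 7 retained restore points. It assumes every run succeeds and that retention simply keeps the newest N points; the function name `restore_point_age_range` is hypothetical and not part of any Veeam tooling.

```python
from datetime import timedelta


def restore_point_age_range(interval_min: int, restore_points: int) -> tuple:
    """Return (youngest, oldest) possible age of the OLDEST retained restore point.

    Right after a run completes, the oldest point is (N - 1) intervals old.
    Just before the next run, it is almost N intervals old.
    Assumption: every scheduled run succeeds and only the newest N points are kept.
    """
    youngest = timedelta(minutes=interval_min * (restore_points - 1))
    oldest = timedelta(minutes=interval_min * restore_points)
    return youngest, oldest


low, high = restore_point_age_range(interval_min=30, restore_points=7)
print(f"Oldest restore point is between {low} and {high} old")
```

With the lab values this gives a rollback window of roughly 3 to 3.5 hours, which is worth keeping in mind when choosing the number of restore points.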

Step 3: Monitor the Initial Replication

# Initial sync (baseline): wait for it to complete
# Duration: ~30-60 minutes (depends on VM size)

Veeam Console → Home → Jobs → Lab6-Repl-WebServer
→ Click the job to view progress

Record:
├─ Initial sync size: _____ GB
├─ Duration: _____ minutes
├─ Transfer rate: _____ MB/s
└─ Status: Completed ✓

# Wait for 2-3 more incremental replications (one every 30 minutes)
# After 1.5 h: there will be 3-4 restore points
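
The transfer rate and the expected number of restore points can be cross-checked with a little arithmetic. This is a rough sketch: the 40 GB / 45 minute figures are made-up sample values, 1 GB is taken as 1000 MB, and the helper names are hypothetical.

```python
def transfer_rate_mb_s(size_gb: float, duration_min: float) -> float:
    """Average throughput in MB/s, treating 1 GB as 1000 MB."""
    return size_gb * 1000 / (duration_min * 60)


def restore_points_after(elapsed_min: float, interval_min: float = 30) -> int:
    """Restore points expected `elapsed_min` after the initial sync finished.

    The initial sync counts as the first restore point; each completed
    interval adds one incremental point (assuming no run fails).
    """
    return 1 + int(elapsed_min // interval_min)


# Sample values only: replace with the numbers you recorded.
print(round(transfer_rate_mb_s(size_gb=40, duration_min=45), 1), "MB/s")
print(restore_points_after(elapsed_min=90), "restore points after 1.5 h")
```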

Step 4: Verify the Replica VM

# In the hypervisor, check the replica:

vSphere Client → VMs → Web-Server-01_replica

Verify:
├─ VM exists: ✓
├─ Power state: OFF (standby)
├─ Disk size: Same as source
├─ Snapshots: 3-4 restore points visible
└─ Network: Connected to DR-VLAN

# IMPORTANT: The replica VM stays powered OFF
# Only power it on during failover!

LAB 6B: Failover Testing

Test 1: Planned Failover (Graceful)

# Planned failover = you have time to prepare
# (e.g. migration, maintenance window)

Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Planned Failover

Wizard:
├─ Select replica: Web-Server-01_replica
├─ Restore point: Latest (recommended)
├─ ✓ Power off source VM before failover
│ (ensures no split-brain)
└─ Click Failover

Process:
1. Source VM powered off (graceful shutdown)
2. Latest changes replicated to replica
3. Replica VM powered on
4. Network assigned (DR-VLAN or re-IP)

Record:
├─ Start time: _____
├─ Source powered off: _____
├─ Replica powered on: _____
├─ Total RTO: _____ minutes
└─ Expected: 2-5 minutes ✓

Test 2: Verify Replica is Working

# SSH into the replica VM (or use the console)
ssh root@[replica-IP]

# Check services
systemctl status nginx # Web server running?
systemctl status mysql # Database running?

# Check data integrity
ls -la /var/www/html/ # Files present?
mysql [database] -e "SELECT COUNT(*) FROM users;" # Data intact?

# Check network
curl http://localhost # Website responds?
ping -c 4 8.8.8.8 # Internet connectivity?

# Check timestamps
tail -n 5 /var/log/syslog # Latest log entries?
# → Should be current (from last replication point)

RESULTS:
✓ Services running
✓ Data intact (RPO verified)
✓ Network connected
✓ Application functional
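
If you repeat this verification after every failover test, the checks can be scripted. The sketch below is one possible approach, not part of the lab itself: it wraps a few of the commands above and reports pass/fail based on exit codes. The `CHECKS` table, `exit_code`, and `run_checks` are hypothetical names, and the service names assume the nginx/MySQL setup used in this lab.

```python
import subprocess
from typing import Callable, Dict, List

# Commands mirror the manual checks above; adjust them to your VM.
CHECKS: Dict[str, List[str]] = {
    "nginx active": ["systemctl", "is-active", "--quiet", "nginx"],
    "mysql active": ["systemctl", "is-active", "--quiet", "mysql"],
    "website responds": ["curl", "--fail", "--silent", "--output", "/dev/null",
                         "http://localhost"],
    "network reachable": ["ping", "-c", "1", "8.8.8.8"],
}


def exit_code(command: List[str]) -> int:
    """Run a command and return its exit code (127 if the binary is missing)."""
    try:
        return subprocess.run(command, capture_output=True, timeout=15).returncode
    except FileNotFoundError:
        return 127
    except subprocess.TimeoutExpired:
        return 124


def run_checks(checks: Dict[str, List[str]],
               runner: Callable[[List[str]], int] = exit_code) -> Dict[str, bool]:
    """Map each check name to True when its command exits with code 0."""
    return {name: runner(command) == 0 for name, command in checks.items()}
```

Running `run_checks(CHECKS)` on the replica returns a dictionary such as `{"nginx active": True, ...}`. The `runner` parameter exists so the logic can be tried out with a stub instead of real commands.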

Test 3: Undo Failover (Rollback)

# After testing, undo the failover

Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Undo Failover

Process:
1. Replica VM powered off
2. Source VM powered back on
3. Replication resumes (incremental sync)

Verify:
├─ Source VM online: ✓
├─ Replication job running: ✓
└─ No data loss during test: ✓

Test 4: Unplanned Failover (Emergency)

# Simulate: Source VM suddenly crashes

# Step 1: Kill source VM (simulate crash)
# In hypervisor: Right-click → Power Off (hard)
# DO NOT graceful shutdown - simulate crash!

# Step 2: Emergency failover in Veeam
Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Failover Now

├─ Restore point: Latest available
│ (may be up to 30 min old = RPO)
└─ Click Failover

# Step 3: Measure RTO
├─ Source crashed at: _____ (T0)
├─ Failover initiated at: _____ (T1)
├─ Replica online at: _____ (T2)
├─ Detection time: T1 - T0 = _____ minutes
├─ Failover time: T2 - T1 = _____ minutes
└─ Total RTO: T2 - T0 = _____ minutes

Expected:
├─ Detection: 1-5 min (manual) or 30-60s (automated)
├─ Failover: 2-3 min
└─ Total RTO: 3-8 min ✓
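
The T0/T1/T2 arithmetic above can be sketched as follows. The timestamps in the example are invented sample values, the function name `rto_breakdown` is hypothetical, and all three timestamps are assumed to fall on the same day.

```python
from datetime import datetime


def rto_breakdown(t0: str, t1: str, t2: str, fmt: str = "%H:%M:%S") -> dict:
    """Split total RTO into detection time and failover time, in minutes.

    t0 = source crashed, t1 = failover initiated, t2 = replica online.
    """
    crashed, initiated, online = (datetime.strptime(t, fmt) for t in (t0, t1, t2))

    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "detection_min": minutes(initiated - crashed),
        "failover_min": minutes(online - initiated),
        "total_rto_min": minutes(online - crashed),
    }


# Sample timestamps only: replace with the values you recorded.
print(rto_breakdown("10:00:00", "10:03:00", "10:05:30"))
```

Separating detection from failover time makes it clear which part to improve: monitoring and alerting shorten the first, while the replica itself determines the second.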

Practical Exercises (4 exercises)

💼 Exercise 1: Calculating Actual RPO/RTO

Given: Replication every 30 minutes. Last successful replication: 09:15 AM. Server crash: 09:40 AM. Failover time: 3 minutes.

Calculate: (1) Actual RPO? (2) Actual RTO? (3) Data loss?

Answer:
Actual RPO = 09:40 - 09:15 = 25 minutes (data written between 09:15 and 09:40 is lost)
RTO = 3 minutes (failover time only; add detection time if failover does not start at the moment of the crash)
Data loss = 25 minutes of transactions
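
The same calculation as a small script, for checking other scenarios. This is a sketch under the exercise's assumptions (failover starts immediately at crash time, all times on the same day); `recovery_summary` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

FMT = "%I:%M %p"  # e.g. "09:15 AM"


def recovery_summary(last_replication: str, crash: str,
                     failover_min: float, rpo_target_min: float) -> dict:
    """Compute the data-loss window and the time service is restored."""
    last_ok = datetime.strptime(last_replication, FMT)
    crashed = datetime.strptime(crash, FMT)
    loss_min = (crashed - last_ok).total_seconds() / 60
    restored = crashed + timedelta(minutes=failover_min)
    return {
        "data_loss_min": loss_min,
        "within_rpo_target": loss_min <= rpo_target_min,
        "service_restored": restored.strftime(FMT),
    }


print(recovery_summary("09:15 AM", "09:40 AM", failover_min=3, rpo_target_min=30))
```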

💼 Exercise 2: DR Runbook

Task: Write a DR runbook for a 3-tier application:

  • Tier 1: Database (MySQL) - Critical, RTO 5 min
  • Tier 2: Web Server (Nginx) - Important, RTO 15 min
  • Tier 3: App Server (Node.js) - Standard, RTO 30 min

Write down: Recovery order? Dependencies? Verification steps?

Template:
1. DB first (no dependencies) → verify MySQL running
2. App server (depends on DB) → verify DB connection
3. Web server (depends on App) → verify curl response
Total expected RTO: 5 + 30 + 15 = 50 min worst case (tiers recovered one after another in the order above)
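
For larger applications the recovery order can be derived from the dependency list rather than worked out by hand. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and assumes tiers are recovered strictly one after another, each taking its full RTO; the tier identifiers and the `recovery_plan` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each tier maps to the set of tiers it depends on (from the template above).
DEPENDS_ON = {
    "database": set(),
    "app_server": {"database"},
    "web_server": {"app_server"},
}

# Per-tier RTO in minutes, taken from the exercise.
RTO_MIN = {"database": 5, "web_server": 15, "app_server": 30}


def recovery_plan(depends_on: dict, rto_min: dict) -> list:
    """Return [(tier, cumulative_minutes)] in dependency-safe order."""
    order = TopologicalSorter(depends_on).static_order()
    plan, elapsed = [], 0
    for tier in order:
        elapsed += rto_min[tier]
        plan.append((tier, elapsed))
    return plan


for tier, done_at in recovery_plan(DEPENDS_ON, RTO_MIN):
    print(f"{tier}: recovered by minute {done_at}")
```

Note that the web server only becomes usable at minute 50 in this sequential model, well past its own 15-minute figure, because it waits for the app server. That gap is a useful discussion point for the runbook.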

💼 Exercise 3: Network Bandwidth Calculation

Scenario: Replicate 10 VMs (50 GB each) across a WAN (100 Mbps link). Change rate: 5% per 30-minute cycle.

Calculate: (1) Initial sync time? (2) Incremental bandwidth needed? (3) Feasible?

Answer:
Initial: 10 × 50 GB = 500 GB = 4,000,000 Mb; 4,000,000 Mb / 100 Mbps = 40,000 s ≈ 11 hours
Incremental: 10 × 50 GB × 5% = 25 GB per 30 min
Required: 25 GB / 30 min = 200,000 Mb / 1,800 s ≈ 111 Mbps → EXCEEDS 100 Mbps!
Solution: Enable compression (assuming a 2x ratio) → ~56 Mbps needed ✓
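
The bandwidth arithmetic can be reproduced with a short script, which makes it easy to try other VM counts or change rates. It is a sketch that assumes decimal units (1 GB = 8,000 megabits), a fully usable link with no protocol overhead, and a compression ratio you supply yourself; the function names are hypothetical.

```python
def gb_to_megabits(gb: float) -> float:
    """Convert gigabytes to megabits using decimal units."""
    return gb * 8 * 1000


def initial_sync_hours(total_gb: float, link_mbps: float) -> float:
    """Hours needed to push the full data set over the link."""
    return gb_to_megabits(total_gb) / link_mbps / 3600


def required_mbps(total_gb: float, change_rate: float, cycle_min: float,
                  compression_ratio: float = 1.0) -> float:
    """Bandwidth needed to ship one cycle's changes within the cycle."""
    changed_gb = total_gb * change_rate
    return gb_to_megabits(changed_gb) / compression_ratio / (cycle_min * 60)


total_gb = 10 * 50  # 10 VMs of 50 GB each
print(round(initial_sync_hours(total_gb, link_mbps=100), 1), "hours initial sync")
print(round(required_mbps(total_gb, 0.05, 30), 1), "Mbps uncompressed")
print(round(required_mbps(total_gb, 0.05, 30, compression_ratio=2), 1), "Mbps at 2x compression")
```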

💼 Exercise 4: Backup vs Replication Decision Matrix

For each scenario, choose Backup or Replication (and explain why):

VM               | RTO Need | RPO Need | Choice?
Payment Gateway  | <5 min   | <15 min  | ?
Dev/Test Server  | 4 hours  | Daily    | ?
Email Server     | 1 hour   | 4 hours  | ?
Archive Storage  | 24 hours | Weekly   | ?

Answer:
Payment: Replication (RTO <5min requires instant failover)
Dev/Test: Backup only (4h RTO acceptable, save cost)
Email: Backup + Replication (1h RTO, 4h RPO = replication every 4h)
Archive: Backup only (weekly RPO, 24h RTO = no urgency)
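
The reasoning behind these answers can be written down as a simple rule. The sketch below is only one way to encode it: the two thresholds are assumptions taken from the Backup vs Replication comparison earlier in this lesson (fastest backup restore about 15 minutes, tightest backup RPO about 8 hours), not an official Veeam rule, and `choose_protection` is a hypothetical name.

```python
BACKUP_FASTEST_RTO_MIN = 15        # assumption: backup restore takes 15-60 minutes
BACKUP_TIGHTEST_RPO_MIN = 8 * 60   # assumption: backup RPO is 8-24 hours


def choose_protection(rto_min: float, rpo_min: float) -> str:
    """Pick a protection method from the required RTO and RPO (in minutes)."""
    if rto_min < BACKUP_FASTEST_RTO_MIN:
        return "Replication"            # restore from backup is too slow
    if rpo_min < BACKUP_TIGHTEST_RPO_MIN:
        return "Backup + Replication"   # backup schedule alone is too sparse
    return "Backup only"


scenarios = {
    "Payment Gateway": (5, 15),
    "Dev/Test Server": (4 * 60, 24 * 60),
    "Email Server": (60, 4 * 60),
    "Archive Storage": (24 * 60, 7 * 24 * 60),
}
for vm, (rto, rpo) in scenarios.items():
    print(f"{vm}: {choose_protection(rto, rpo)}")
```

In practice cost, compliance, and retention requirements also feed into the decision, so treat the function as a starting point for the explanation you write, not as the answer itself.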
