⏱️ 4 hours
🔁 Replication
⚡ Failover
📋 DR Plan
🎓 Learning Outcomes
- ✅ Understand the difference between backup and replication
- ✅ Create a replication job (RPO 30 minutes)
- ✅ Perform failover tests (planned & unplanned)
- ✅ Measure the actual RTO
- ✅ Write a basic DR runbook
Backup vs Replication
| Criterion | Backup | Replication |
|---|---|---|
| Where is it stored? | Repository (.vbk/.vib files) | Datastore (ready-to-run VM replica) |
| RTO | 15-60 min (restore required) | 2-5 min (power on only) |
| RPO | Daily (8-24 h) | 15-60 min (frequent sync) |
| Storage cost | Low (compressed) | High (full VM copy) |
| Use case | Long-term retention | Critical VMs, fast failover |
Replication Flow:
[Source VM] ──(snapshot every 30min)──→ [Proxy] ──→ [Replica VM]
(Production)                                        (Standby, powered off)
On failover:
[Source VM] ✗ DOWN ──→ [Replica VM] POWER ON ──→ Takes over!
RTO: 2-5 min
LAB 6A: Create a Replication Job
Step 1: Prepare the Target Datastore
# In the hypervisor (vSphere/Hyper-V):
# Create a dedicated datastore for replicas
vSphere:
├─ Datastore name: DS-Replicas
├─ Size: 200GB+ (enough for the replica VMs)
├─ Type: NFS or VMFS
└─ Location: Secondary host/cluster (if available)
Hyper-V:
├─ Path: D:\Replicas\
├─ Size: 200GB+
└─ Location: Different physical disk
Step 2: Create the Replication Job
Veeam Console → Home → Jobs → Replication
Configuration:
├─ Name: Lab6-Repl-WebServer
├─ Source VM: Web-Server-01
├─ Target host: [Secondary ESXi / Same host]
├─ Target datastore: DS-Replicas
├─ Replica name suffix: _replica
├─ Restore points: 7 (keep 7 restore points)
├─ RPO schedule: Every 30 minutes
│ ├─ Start: Now
│ └─ Run continuously: ✓
└─ Network mapping:
   ├─ Source network: Production-VLAN
   └─ Target network: DR-VLAN (or the same network)
→ Click Finish & Run
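As a rough sanity check on the settings above, the time span covered by the replica's restore points can be estimated from the retention count and the schedule interval. This is a minimal sketch that assumes every scheduled run succeeds and produces exactly one restore point; `restore_point_window` is a hypothetical helper, not part of Veeam.

```python
from datetime import timedelta


def restore_point_window(restore_points: int, interval: timedelta) -> timedelta:
    """Time between the oldest and newest restore point, assuming one point per run."""
    if restore_points < 1:
        raise ValueError("need at least one restore point")
    # N points are separated by N - 1 intervals.
    return (restore_points - 1) * interval


window = restore_point_window(7, timedelta(minutes=30))
print(window)  # 3:00:00
```

With 7 restore points at a 30-minute interval, you can roll back about 3 hours from the latest point. Raise the retention count if the lab needs a longer rollback window.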
Step 3: Monitor the Initial Replication
# Initial sync (baseline): wait for it to complete
# Duration: ~30-60 min (depends on VM size)
Veeam Console → Home → Jobs → Lab6-Repl-WebServer
→ Click the job to view its progress
Record:
├─ Initial sync size: _____ GB
├─ Duration: _____ min
├─ Transfer rate: _____ MB/s
└─ Status: Completed ✓
# Wait for 2-3 more incremental replications (one every 30 min)
# After 1.5 h: there will be 3-4 restore points
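If the job statistics do not show an average transfer rate, it can be derived from the two values recorded above. A minimal sketch, assuming 1 GB = 1024 MB; the function name and the sample figures are illustrative, not measurements from this lab.

```python
def transfer_rate_mb_s(size_gb: float, duration_min: float) -> float:
    """Average throughput in MB/s, using 1 GB = 1024 MB."""
    if duration_min <= 0:
        raise ValueError("duration must be positive")
    return size_gb * 1024 / (duration_min * 60)


# Hypothetical example: a 50 GB initial sync that took 40 minutes.
print(round(transfer_rate_mb_s(50, 40), 1))  # 21.3
```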
Step 4: Verify the Replica VM
# In the hypervisor, check the replica:
vSphere Client → VMs → Web-Server-01_replica
Verify:
├─ VM exists: ✓
├─ Power state: OFF (standby)
├─ Disk size: Same as source
├─ Snapshots: 3-4 restore points visible
└─ Network: Connected to DR-VLAN
# IMPORTANT: the replica VM stays powered OFF
# Power it on only during failover (through Veeam, not manually in the hypervisor)!
LAB 6B: Failover Testing
Test 1: Planned Failover (Graceful)
# Planned failover = you have time to prepare
# (for example: migration, maintenance window)
Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Planned Failover
Wizard:
├─ Select replica: Web-Server-01_replica
├─ Restore point: Latest (recommended)
├─ ✓ Power off source VM before failover
│ (ensures no split-brain)
└─ Click Failover
Process:
1. Source VM powered off (graceful shutdown)
2. Latest changes replicated to replica
3. Replica VM powered on
4. Network assigned (DR-VLAN or re-IP)
Record:
├─ Start time: _____
├─ Source powered off: _____
├─ Replica powered on: _____
├─ Total RTO: _____ min
└─ Expected: 2-5 min ✓
Test 2: Verify Replica is Working
# SSH into the replica VM (or use the console)
ssh root@[replica-IP]
# Check services
systemctl status nginx # Web server running?
systemctl status mysql # Database running?
# Check data integrity
ls -la /var/www/html/ # Files present?
mysql -D [database] -e "SELECT COUNT(*) FROM users;" # Data intact?
# Check network
curl http://localhost # Website responds?
ping -c 4 8.8.8.8 # Internet connectivity?
# Check timestamps
ls -la /var/log/syslog # Latest log entry?
# → Should match the time of the last replication point
RESULTS:
✓ Services running
✓ Data intact (RPO verified)
✓ Network connected
✓ Application functional
Test 3: Undo Failover (Rollback)
# After the test is done, undo the failover
Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Undo Failover
Process:
1. Replica VM powered off
2. Source VM powered back on (start it manually in the hypervisor if it remains off)
3. Replication resumes (incremental sync)
Verify:
├─ Source VM online: ✓
├─ Replication job running: ✓
└─ No production data lost during test (changes made on the replica are discarded): ✓
Test 4: Unplanned Failover (Emergency)
# Simulate: Source VM suddenly crashes
# Step 1: Kill source VM (simulate crash)
# In hypervisor: Right-click → Power Off (hard)
# DO NOT graceful shutdown - simulate crash!
# Step 2: Emergency failover in Veeam
Veeam Console → Replicas → Lab6-Repl-WebServer
→ Right-click → Failover Now
├─ Restore point: Latest available
│ (may be up to 30 min old = RPO)
└─ Click Failover
# Step 3: Measure RTO
├─ Source crashed at: _____ (T0)
├─ Failover initiated at: _____ (T1)
├─ Replica online at: _____ (T2)
├─ Detection time: T1 - T0 = _____ min
├─ Failover time: T2 - T1 = _____ min
└─ Total RTO: T2 - T0 = _____ min
Expected:
├─ Detection: 1-5 min (manual) or 30-60s (automated)
├─ Failover: 2-3 min
└─ Total RTO: 3-8 min ✓
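The three timestamps recorded in Step 3 can be turned into the detection, failover, and total RTO figures with a few lines of Python. This is a minimal sketch that assumes all three times fall on the same day and are written as HH:MM:SS; `rto_breakdown` and the sample times are hypothetical.

```python
from datetime import datetime


def rto_breakdown(t0: str, t1: str, t2: str, fmt: str = "%H:%M:%S") -> dict:
    """Split total RTO into detection and failover time (all values in minutes)."""
    crashed, initiated, online = (datetime.strptime(t, fmt) for t in (t0, t1, t2))
    if not crashed <= initiated <= online:
        raise ValueError("expected T0 <= T1 <= T2")

    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "detection_min": minutes(initiated - crashed),
        "failover_min": minutes(online - initiated),
        "total_rto_min": minutes(online - crashed),
    }


# Hypothetical timings: crash at 10:00:00, failover started 10:03:00, replica online 10:05:30.
print(rto_breakdown("10:00:00", "10:03:00", "10:05:30"))
# {'detection_min': 3.0, 'failover_min': 2.5, 'total_rto_min': 5.5}
```

Splitting the total this way shows where the time goes: in a manual process, detection (T1 - T0) is often the larger share, and it is the part that monitoring and alerting can shorten.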
Practice Exercises (4 exercises)
💼 Exercise 1: Calculating Actual RPO/RTO
Given: Replication runs every 30 minutes. Last successful replication: 09:15 AM. Server crash: 09:40 AM. Failover time: 3 minutes.
Calculate: (1) Actual RPO? (2) Actual RTO? (3) Data loss?
Answer:
Actual RPO = 09:40 - 09:15 = 25 min (data written between 09:15 and 09:40 is lost; within the 30 min target)
RTO = 3 min (failover time only, assuming the failover starts immediately at 09:40)
Data loss = 25 min of transactions
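The same arithmetic can be checked with Python's `datetime` module. This is a small sketch using only the figures given in the exercise; it assumes zero detection time, so the failover starts at the moment of the crash.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"
last_replication = datetime.strptime("09:15", FMT)
crash = datetime.strptime("09:40", FMT)
failover = timedelta(minutes=3)      # failover time from the exercise
rpo_target = timedelta(minutes=30)   # replication interval

data_loss = crash - last_replication
service_restored = crash + failover  # assumes the failover starts at crash time

print(data_loss)                       # 0:25:00
print(data_loss <= rpo_target)         # True
print(service_restored.strftime(FMT))  # 09:43
```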
💼 Exercise 2: DR Runbook
Task: Write a DR runbook for a 3-tier application:
- Tier 1: Database (MySQL) - Critical, RTO 5 min
- Tier 2: Web Server (Nginx) - Important, RTO 15 min
- Tier 3: App Server (Node.js) - Standard, RTO 30 min
Write: Recovery order? Dependencies? Verification steps?
Template:
1. DB first (no dependencies) → verify MySQL running
2. App server (depends on DB) → verify DB connection
3. Web server (depends on App) → verify curl response
Total expected RTO: 5 + 30 + 15 = 50 min worst case (tiers recovered one after another; the web tier waits for the app tier, so it cannot meet its own 15 min target in this case)
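The recovery order in the template follows directly from the dependency chain, so it can be derived rather than memorized. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The tier names are illustrative, and treating each tier's RTO target as its recovery duration is an assumption made for the worst-case sum.

```python
from graphlib import TopologicalSorter

# Each tier maps to the set of tiers it depends on (from the exercise).
dependencies = {
    "database": set(),
    "app_server": {"database"},
    "web_server": {"app_server"},
}
# Assumed recovery duration per tier in minutes (taken from the RTO targets).
recovery_min = {"database": 5, "app_server": 30, "web_server": 15}

# static_order() yields every tier after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['database', 'app_server', 'web_server']

elapsed = 0
for tier in order:
    elapsed += recovery_min[tier]
    print(f"{tier}: online after {elapsed} min")
```

For a runbook with more than three components, the same structure helps catch circular dependencies early: `TopologicalSorter` raises `CycleError` when the graph contains a cycle.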
💼 Exercise 3: Network Bandwidth Calculation
Scenario: Replicate 10 VMs (50 GB each) across a WAN (100 Mbps link). Change rate: 5% per 30 min cycle.
Calculate: (1) Initial sync time? (2) Incremental bandwidth needed? (3) Feasible?
Answer:
Initial: 10 × 50 GB = 500 GB = 4,000,000 Mb; 4,000,000 Mb / 100 Mbps = 40,000 s ≈ 11 hours
Incremental: 10 × 50 GB × 5% = 25 GB per 30 min
Required: 25 GB / 30 min = 200,000 Mb / 1,800 s ≈ 111 Mbps → EXCEEDS 100 Mbps!
Solution: Enable compression (2x) → ~56 Mbps needed ✓
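The unit conversions above (GB to megabits, minutes to seconds) are easy to get wrong by hand, so a short script is a useful cross-check. This is a sketch with decimal units (1 GB = 8,000 Mb) that ignores protocol overhead; the helper names are hypothetical, and the 2x compression ratio is the assumption stated in the answer.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb over a link of link_mbps."""
    return size_gb * 8 * 1000 / link_mbps / 3600


def required_mbps(size_gb: float, window_min: float, compression: float = 1.0) -> float:
    """Bandwidth needed to move size_gb within window_min at a given compression ratio."""
    return size_gb * 8 * 1000 / compression / (window_min * 60)


total_gb = 10 * 50           # 10 VMs of 50 GB each
delta_gb = total_gb * 0.05   # 5% change per 30-minute cycle

print(round(transfer_hours(total_gb, 100), 1))     # 11.1
print(round(required_mbps(delta_gb, 30), 1))       # 111.1
print(round(required_mbps(delta_gb, 30, 2.0), 1))  # 55.6
```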
💼 Exercise 4: Backup vs Replication Decision Matrix
For each scenario, choose Backup or Replication (and explain why):
| VM | RTO Need | RPO Need | Choice? |
|---|---|---|---|
| Payment Gateway | <5 min | <15 min | ? |
| Dev/Test Server | 4 hours | Daily | ? |
| Email Server | 1 hour | 4 hours | ? |
| Archive Storage | 24 hours | Weekly | ? |
Answer:
Payment: Replication (RTO <5 min requires instant failover)
Dev/Test: Backup only (4 h RTO acceptable, saves cost)
Email: Backup + Replication (1 h RTO, 4 h RPO = replication every 4 h)
Archive: Backup only (weekly RPO, 24 h RTO = no urgency)
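The reasoning behind these answers can be written down as a simple rule. The sketch below is illustrative only: the thresholds (15 and 60 minutes) are assumptions loosely based on the Backup vs Replication table, chosen so that the rule reproduces the four answers above. They are not a Veeam recommendation.

```python
def choose_protection(rto_min: float, rpo_min: float) -> str:
    """Pick a protection method from RTO/RPO needs in minutes (illustrative thresholds)."""
    if rto_min < 15 or rpo_min < 60:
        return "Replication"           # restoring from backup is too slow or too coarse
    if rto_min <= 60:
        return "Backup + Replication"  # borderline RTO: keep a replica, retain backups
    return "Backup only"               # relaxed targets: save storage cost


# (RTO, RPO) in minutes, taken from the exercise table.
vms = {
    "Payment Gateway": (5, 15),
    "Dev/Test Server": (240, 1440),
    "Email Server": (60, 240),
    "Archive Storage": (1440, 10080),
}
for name, (rto, rpo) in vms.items():
    print(f"{name}: {choose_protection(rto, rpo)}")
```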