Week 7: High Availability & Cluster Configuration (Lab 7)

Theory: Veeam HA Architecture

HA Architecture Diagram

╔══════════════════════════════════════════════════════════════════╗
║                    VEEAM HA CLUSTER ARCHITECTURE                 ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  ┌─────────────────────┐    Heartbeat     ┌──────────────────┐  ║
║  │  PRIMARY NODE       │◄────────────────►│  SECONDARY NODE  │  ║
║  │  (Active)           │   (10 sec)       │  (Standby)       │  ║
║  │                     │                  │                  │  ║
║  │  VBR Service ●ON    │  Shared Storage  │  VBR Service ●OFF│  ║
║  │  SQL Server ●ON     │◄══════════════►  │  SQL Server ●ON  │  ║
║  │  VBR Console ●ON    │   (config DB)    │  VBR Console ●OFF│  ║
║  └────────┬────────────┘                  └──────────────────┘  ║
║           │                                                      ║
║           │ Virtual IP (Floating)                                ║
║           ▼                                                      ║
║  ┌────────────────────────────────────────────┐                 ║
║  │           BACKUP REPOSITORIES              │                 ║
║  │  ┌──────────┐  ┌──────────┐  ┌──────────┐ │                 ║
║  │  │  Repo 1  │  │  Repo 2  │  │  Tape    │ │                 ║
║  │  │ (Local)  │  │  (NAS)   │  │          │ │                 ║
║  │  └──────────┘  └──────────┘  └──────────┘ │                 ║
║  └────────────────────────────────────────────┘                 ║
║                                                                  ║
║  FAILOVER TRIGGER CONDITIONS:                                    ║
║  • Heartbeat timeout > 60 seconds                                ║
║  • VBR service crash (no self-recovery)                          ║
║  • SQL Server not responding                                     ║
║  • Manual switchover (planned maintenance)                       ║
╚══════════════════════════════════════════════════════════════════╝

Active/Passive Model

✅ Active Node (Primary)

• Runs all Veeam services
• Accepts connections from agents and consoles
• Writes to the shared configuration DB
• Holds the Virtual IP (floating IP)
• Runs backup jobs on schedule

⏸️ Passive Node (Standby)

• VBR services are stopped
• SQL Server is running (continuous sync)
• Monitors the heartbeat every 10 seconds
• Ready to take over in under 60 seconds
• Accepts no direct connections

⚠️ Important: Veeam HA is not Active/Active (load balancing). Only one node runs the VBR services at any time. The secondary node only takes over; it does not share the workload. If you need to scale out, use SOBR (covered in Week 4).
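
The takeover rule described above can be sketched in a few lines. This is only an illustration of the timing logic, using the 60-second timeout from the diagram; `Test-FailoverNeeded` is a hypothetical helper written for this lab note, not a Veeam cmdlet.

```powershell
# Hypothetical helper (not part of any Veeam module): decides whether the
# standby node should take over, based on how long the heartbeat has been silent.
function Test-FailoverNeeded {
    param(
        [Parameter(Mandatory)] [datetime] $LastHeartbeat,
        [Parameter(Mandatory)] [datetime] $Now,
        [int] $TimeoutSeconds = 60,      # heartbeat timeout from the diagram
        [switch] $ManualSwitchover       # planned maintenance always switches
    )
    if ($ManualSwitchover) { return $true }
    $silentFor = ($Now - $LastHeartbeat).TotalSeconds
    return ($silentFor -gt $TimeoutSeconds)
}

$t0 = [datetime]::new(2026, 5, 2, 10, 58, 0)
Test-FailoverNeeded -LastHeartbeat $t0 -Now $t0.AddSeconds(30)   # 3 missed beats: keep waiting
Test-FailoverNeeded -LastHeartbeat $t0 -Now $t0.AddSeconds(61)   # past the timeout: take over
```

The two calls print False and then True. A real cluster would combine this with the service and SQL checks listed in the diagram.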

Comparison: CDP vs Traditional Backup

Criterion	Traditional Backup	Replication (RPO 30 min)	CDP (RPO in Seconds)
Best RPO	24 hours	15-30 minutes	2-15 seconds
RTO	15-60 minutes	2-5 minutes	<1 minute
Mechanism	Scheduled snapshots	Snapshot every N minutes	I/O filter (continuous)
CPU/RAM usage	Low (scheduled)	Medium	High (continuous)
Network bandwidth	Bursts while jobs run	Periodic bursts	Continuous (steady)
Use case	Archive, compliance	Critical VMs	Mission-critical, DB
License cost	Standard	Enterprise	Enterprise Plus

RTO Comparison: Without HA vs With HA

Scenario	Without HA	With HA (Manual)	With HA (Automatic)
VBR Service Crash	30-60 minutes (diagnose + restart)	10-15 minutes (admin action)	~60 seconds (auto)
Hardware Failure	2-8 hours (rebuild server)	15-30 minutes (manual switch)	60-90 seconds (auto)
OS Patch/Reboot	Down for the whole operation	5-10 minutes, planned	<2 minutes (planned switch)
Network Issue	All jobs stop	Depends on the admin	Automatic detection and failover

LAB 7A: Configuration Backup

Step 1: Enable Configuration Backup

Configuration Backup saves the entire VBR configuration (jobs, credentials, repositories) to a .bco file so that it can be restored after a disaster.

# In the VBR Console:
# Main Menu (☰) → Configuration Backup → Enable

Configure as follows:
├─ Enable automatic configuration backup: ✅ ON (checked)
├─ Backup folder: D:\VeeamConfigBackup\  (a separate drive, not C:\)
├─ Restore points to keep: 10
├─ Schedule:
│   ├─ Daily at: 02:00 AM
│   └─ Run on: Every day
├─ Encryption:
│   ├─ Enable backup file encryption: ✅ ON (checked)
│   └─ Password: [create a strong password and store it in KeePass]
└─ Click OK to save

# Or use PowerShell:
Set-VBRConfigurationBackupJob `
    -Enabled $true `
    -Target "D:\VeeamConfigBackup" `
    -RestorePointsToKeep 10 `
    -EncryptionEnabled $true `
    -EncryptionPassword (ConvertTo-SecureString "P@ssw0rd123!" -AsPlainText -Force)

✅ Expected Output (Console):

Configuration Backup Settings:
  Status:          Enabled
  Target folder:   D:\VeeamConfigBackup
  Schedule:        Daily at 2:00 AM
  Encryption:      Enabled (AES 256-bit)
  Retention:       10 restore points
  Last run:        Never (will run tonight)
  Next run:        02/05/2026 02:00:00

Step 2: Verify Backup Created

Run the backup manually right away so you do not have to wait until 2 AM, then check that the .bco file was created.

# In the VBR Console: Main Menu → Configuration Backup → Backup Now
# Or PowerShell:
Start-VBRConfigurationBackupJob

# Check that the file was created:
Get-ChildItem "D:\VeeamConfigBackup\" | Sort-Object LastWriteTime -Descending

# Show the .bco file details:
Get-VBRConfigurationBackup | Select-Object -First 5 | Format-List *

✅ Expected Output:

Mode  LastWriteTime       Length  Name
----  -------------       ------  ----
-a--  02/05/2026 10:15    47.2 MB VeeamConfigBackup-2026-05-02-10-15-00.bco
-a--  02/05/2026 10:15    1.1 KB  VeeamConfigBackup-2026-05-02-10-15-00.bco.md5

ConfigurationBackup Details:
  File:          VeeamConfigBackup-2026-05-02-10-15-00.bco
  Size:          47.2 MB (compressed + encrypted)
  Created:       02/05/2026 10:15:42
  Jobs count:    12 jobs included
  Repos count:   4 repositories included
  Credentials:   8 credential records (encrypted)
  Status:        Success
  MD5 checksum:  a3f9b2c1d8e7f4a6b5c3d2e1f8a7b6c5

Step 3: Test Config Restore (Simulate Disaster)

Simulate a server failure: stop all services, then restore from the backup. This is the most important test!

⚠️ Warning: Do this step only in a lab environment! It stops every backup job that is running.

# STEP 3A: Simulate the disaster - delete/corrupt the database
# (In the lab: just rename the DB file to simulate it)
Stop-Service -Name "VeeamBackupSvc" -Force
Stop-Service -Name "VeeamBrokerSvc" -Force
Stop-Service -Name "VeeamMountSvc" -Force

# Rename the DB file to simulate corruption (SQL Server keeps the .mdf locked while it runs, so stop it first):
Rename-Item "C:\Program Files\Veeam\Backup and Replication\Backup\VeeamBackup.mdf" `
            "VeeamBackup.mdf.BROKEN"

Write-Host "=== Disaster Simulated === Veeam DB is now unavailable"

# STEP 3B: Restore from the Configuration Backup
# Open the Veeam Backup & Replication console
# → Main Menu → Configuration Restore
# → Browse to the .bco file
# → Enter the encryption password
# → Select components: All (Jobs, Repos, Credentials, Settings)
# → Start Restore

# PowerShell method:
Restore-VBRConfiguration `
    -Path "D:\VeeamConfigBackup\VeeamConfigBackup-2026-05-02-10-15-00.bco" `
    -Password (ConvertTo-SecureString "P@ssw0rd123!" -AsPlainText -Force) `
    -RestoreJobs $true `
    -RestoreRepositories $true `
    -RestoreCredentials $true

✅ Expected Output (Restore Progress):

[10:22:01] Starting configuration restore...
[10:22:03] Decrypting backup file... OK
[10:22:05] Validating backup integrity (MD5)... OK
[10:22:08] Restoring database schema... OK
[10:22:15] Restoring credentials (8 records)... OK
[10:22:18] Restoring repositories (4 records)... OK
[10:22:22] Restoring backup jobs (12 records)... OK
[10:22:25] Restoring tape jobs (0 records)... OK
[10:22:27] Restoring global settings... OK
[10:22:30] Restarting Veeam services... OK
[10:22:45] Configuration restore completed successfully!
  Total time: 44 seconds
  Jobs restored: 12/12
  Repositories restored: 4/4
  Credentials restored: 8/8

Step 4: Verify the Restore Succeeded

# Check that the jobs were restored:
Get-VBRJob | Select-Object Name, JobType, IsScheduleEnabled | Format-Table -AutoSize

# Check the repositories:
Get-VBRBackupRepository | Select-Object Name, Path, IsAvailable | Format-Table

# Check the credentials:
Get-VBRCredentials | Select-Object Name, Type, UserName | Format-Table

# Test connectivity to the repositories:
Get-VBRBackupRepository | ForEach-Object {
    $test = Test-VBRRepositoryConnectivity -Repository $_
    Write-Host "$($_.Name): $($test.Status)"
}

# Run one job to confirm normal operation:
Get-VBRJob -Name "Daily-Backup-Lab" | Start-VBRJob

✅ Expected Output:

Name                    JobType      IsScheduleEnabled
----                    -------      -----------------
Daily-Backup-Lab        Backup       True
Weekly-Full-Backup      Backup       True
Replication-CriticalVM  Replica      True
[... 9 more jobs ...]

Name          Path                    IsAvailable
----          ----                    -----------
Repo-Local    D:\VeeamBackup          True
Repo-NAS      \\nas01\veeam-backup    True
SOBR-Extents  Multiple               True
Hardened-Repo /mnt/hardened          True

Repository Connectivity Test:
  Repo-Local:    Connected (latency: 2ms)
  Repo-NAS:      Connected (latency: 15ms)
  Hardened-Repo: Connected (latency: 8ms)

Job "Daily-Backup-Lab" started successfully at 10:30:15
Status: Running... (Processing VM1: 45%)

LAB 7B: Veeam HA Deployment

Step 1: Prepare the Secondary Server

The secondary server must match the primary's configuration. Do not use weaker hardware, because it runs the entire workload after a failover.

# === Minimum Secondary Server configuration ===
Hostname:       VBR-NODE2
OS:             Windows Server 2022 Datacenter
CPU:            8 cores (same as Primary)
RAM:            32 GB (same as Primary)
Disk C:         100 GB (OS + Veeam install)
Disk D:         200 GB (local staging)
Network NIC1:   Production network (192.168.1.x/24)
Network NIC2:   Heartbeat network (10.10.10.x/24) ← DEDICATED!

# Join the domain (important for Kerberos auth):
Add-Computer -DomainName "corp.local" `
             -Credential (Get-Credential) `
             -OUPath "OU=Servers,DC=corp,DC=local"
Restart-Computer

# Install SQL Server 2019 Express (or use a shared SQL Server):
# Download SQL Server 2019 Express
.\SQLEXPR_x64_ENU.exe /Q /IACCEPTSQLSERVERLICENSETERMS `
    /ACTION=Install /FEATURES=SQLEngine `
    /INSTANCENAME=VEEAMSQL2019 `
    /SQLSYSADMINACCOUNTS="CORP\VeeamAdmins" `
    /TCPENABLED=1

# Open the firewall for Veeam:
New-NetFirewallRule -DisplayName "Veeam Ports" -Direction Inbound `
    -Protocol TCP -LocalPort 9392,9393,9394,9395,2500-3300 `
    -Action Allow

✅ Expected Output:

Computer "VBR-NODE2" joined domain "corp.local" successfully.
Restart required: Yes

SQL Server 2019 Express Installation:
  Instance: VEEAMSQL2019
  Status: Installed successfully
  TCP Port: 1433 (enabled)
  Service: SQL Server (VEEAMSQL2019) - Running

Firewall Rules Created:
  "Veeam Ports" - TCP 9392-9395, 2500-3300 - Allow Inbound

Step 2: Install Veeam on the Secondary (Silent Install)

# Copy the installer from the Primary (or mount the ISO):
# \\VBR-NODE1\C$\Install\VeeamBackup&Replication_12.x.x.iso

# Silent install with these parameters:
.\Setup.exe /silent /accepteula /acceptthirdpartylicenses `
    /installdir:"C:\Program Files\Veeam" `
    /vbrservice_user:"CORP\veeam-svc" `
    /vbrservice_password:"SvcP@ss2024!" `
    /sqlserver:"VBR-NODE2\VEEAMSQL2019" `
    /sqldatabase:"VeeamBackup" `
    /create_db

# Follow the install progress:
Get-Content "C:\ProgramData\Veeam\Setup\Logs\VBR_setup_*.log" -Wait -Tail 20

# After the install finishes, verify the services:
Get-Service -Name "Veeam*" | Select-Object Name, Status, StartType

✅ Expected Output:

[Install Log]
[14:05:12] Installing Veeam Backup & Replication...
[14:05:45] Installing SQL components... OK
[14:06:30] Creating database VeeamBackup... OK
[14:07:15] Installing VBR services... OK
[14:08:02] Installation completed successfully.

Service Status (Secondary Node - Standby mode):
Name                     Status   StartType
----                     ------   ---------
VeeamBackupSvc           Stopped  Manual   ← Stopped on standby
VeeamBrokerSvc           Stopped  Manual
VeeamCatalogueSvc        Running  Automatic
VeeamMountSvc            Stopped  Manual
VeeamTransportSvc        Running  Automatic

Step 3: Configure the Cluster Service

# === On the PRIMARY NODE (VBR-NODE1) ===
# Load the Veeam PowerShell module:
Import-Module Veeam.Backup.PowerShell

# Create the HA cluster with the secondary node:
New-VBRHACluster `
    -PrimaryServer "VBR-NODE1" `
    -SecondaryServer "VBR-NODE2" `
    -HeartbeatNetwork "10.10.10.0/24" `
    -HeartbeatInterval 10 `
    -SharedStoragePath "\\SAN01\VeeamShared" `
    -VirtualIP "192.168.1.100" `
    -SubnetMask "255.255.255.0"

# Configure the heartbeat network (NIC2 on both nodes):
# VBR-NODE1 NIC2: 10.10.10.1/24
# VBR-NODE2 NIC2: 10.10.10.2/24

# Check heartbeat connectivity:
Test-NetConnection -ComputerName "10.10.10.2" -Port 9395

✅ Expected Output:

HA Cluster Configuration:
  Cluster Name:     VBR-HA-Cluster
  Primary (Active): VBR-NODE1 (192.168.1.101)
  Secondary (Standby): VBR-NODE2 (192.168.1.102)
  Virtual IP:       192.168.1.100 (currently on NODE1)
  Heartbeat NW:     10.10.10.0/24 (interval: 10s)
  Shared Storage:   \\SAN01\VeeamShared (accessible: Yes)
  Cluster Status:   Healthy

TcpTestSucceeded: True (heartbeat network OK)
PingSucceeded:    True (latency: 1ms)
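
`Test-NetConnection` is only available on Windows and can be slow when a port is filtered. As an alternative, the sketch below checks a TCP port with an explicit timeout using the .NET `TcpClient` class. `Test-TcpPort` is a hypothetical helper; the address and port come from the heartbeat check in this step.

```powershell
# Hypothetical helper: returns $true if a TCP connection succeeds within the timeout.
function Test-TcpPort {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [Parameter(Mandatory)] [int] $Port,
        [int] $TimeoutMs = 2000
    )
    $client = [System.Net.Sockets.TcpClient]::new()
    try {
        $task = $client.ConnectAsync($ComputerName, $Port)
        # Wait() returns $false on timeout and throws if the connection is refused.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    } catch {
        return $false
    } finally {
        $client.Dispose()
    }
}

# Heartbeat port on NODE2, as tested above:
$ok = Test-TcpPort -ComputerName '10.10.10.2' -Port 9395
"Heartbeat port reachable: $ok"
```

The same function can be looped over the ports opened in the firewall step to confirm the rule on VBR-NODE2 actually took effect.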

Step 4: Set Failover Conditions

# Configure the conditions that trigger automatic failover:
Set-VBRHAFailoverPolicy `
    -HeartbeatTimeoutSec 60 `
    -ServiceMonitoring $true `
    -MonitoredServices @("VeeamBackupSvc","VeeamBrokerSvc") `
    -MaxServiceRestartAttempts 3 `
    -ServiceRestartIntervalSec 30 `
    -SQLMonitoring $true `
    -SQLTimeoutSec 45 `
    -AutoFailover $true `
    -NotifyEmail "[email protected]"

# Review the configured policy:
Get-VBRHAFailoverPolicy | Format-List *

✅ Expected Output:

HA Failover Policy:
  HeartbeatTimeout:      60 seconds
  ServiceMonitoring:     Enabled
  Monitored Services:    VeeamBackupSvc, VeeamBrokerSvc
  MaxRestartAttempts:    3 (every 30s before failover)
  SQLMonitoring:         Enabled (timeout: 45s)
  AutoFailover:          Enabled
  NotificationEmail:     [email protected]
  SplitBrainProtection:  Enabled (SCSI reservations)
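
It helps to work out how long each trigger can take before failover even starts. The arithmetic below uses the policy values shown above. How the cluster combines them is an assumption made for illustration: the service trigger is assumed to exhaust all restart attempts first.

```powershell
# Values copied from the failover policy above.
$policy = @{
    HeartbeatTimeoutSec       = 60
    MaxServiceRestartAttempts = 3
    ServiceRestartIntervalSec = 30
    SQLTimeoutSec             = 45
}

# Assumed worst-case delay, in seconds, before each trigger starts a failover.
$detectionDelay = [ordered]@{
    'Node down (heartbeat timeout)'  = $policy.HeartbeatTimeoutSec
    'Service crash (restart budget)' = $policy.MaxServiceRestartAttempts * $policy.ServiceRestartIntervalSec
    'SQL Server unresponsive'        = $policy.SQLTimeoutSec
}

$detectionDelay.GetEnumerator() | ForEach-Object {
    '{0,-32} {1,4} s' -f $_.Key, $_.Value
}
```

Under that assumption a service crash can take up to 90 seconds to trigger failover, longer than a full node outage. Keep this in mind when comparing measured results with the RTO table.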

Step 5: Verify Cluster Status

# Check the status of the whole cluster:
Get-VBRHANode | Format-Table -AutoSize

# Per-node details:
Get-VBRHANode | ForEach-Object {
    Write-Host "=== $($_.Name) ==="
    Write-Host "  Role:         $($_.Role)"
    Write-Host "  Status:       $($_.Status)"
    Write-Host "  Services:     $($_.ServicesStatus)"
    Write-Host "  Last Hrtbeat: $($_.LastHeartbeat)"
    Write-Host "  DB Sync:      $($_.DatabaseSyncStatus)"
}

# Check the Virtual IP:
Resolve-DnsName "vbr-cluster.corp.local" | Select-Object Name, IPAddress

# Verify the backup jobs are still listed correctly:
Get-VBRJob | Measure-Object | Select-Object Count

✅ Expected Output:

Name       Role      Status   Services  LastHeartbeat        DBSync
----       ----      ------   --------  -------------        ------
VBR-NODE1  Active    Healthy  Running   02/05/2026 10:45:30  In Sync
VBR-NODE2  Standby   Healthy  Stopped   02/05/2026 10:45:28  In Sync

=== VBR-NODE1 ===
  Role:         Active
  Status:       Healthy
  Services:     All Running
  Last Hrtbeat: 2 seconds ago
  DB Sync:      Synchronized (lag: 0ms)

=== VBR-NODE2 ===
  Role:         Standby
  Status:       Healthy
  Services:     Stopped (ready to start)
  Last Hrtbeat: 4 seconds ago
  DB Sync:      Synchronized (lag: 150ms)

vbr-cluster.corp.local → 192.168.1.100 (Virtual IP)
Jobs Count: 12 (all visible via cluster VIP)

LAB 7C: Failover Testing

📋 Goal: Test four different failover scenarios. Record the actual failover times in your lab environment and compare them with the committed RTO.

Test 1: Planned Switchover (Planned Maintenance)

Use this when the Primary node needs a reboot or patch and you want to avoid downtime. The admin deliberately moves the active role to the Secondary.

# === Start monitoring (run on a CLIENT machine) ===
# Open a separate terminal and ping the VIP continuously:
ping -t 192.168.1.100

# === Perform the Planned Switchover (from the VBR Console) ===
# VBR Console → Infrastructure → Backup Infrastructure
# → HA Cluster → Right-click → Switch Active Node
# → Confirm: "Switch to VBR-NODE2"

# Or PowerShell:
Invoke-VBRHASwitchover -TargetNode "VBR-NODE2" -Reason "Planned maintenance"

# Monitor in real time:
Watch-VBRHAStatus  # Refreshes every 5 seconds

✅ Expected Behavior & Output:

[10:55:00] Switchover initiated to VBR-NODE2
[10:55:02] Draining active connections on NODE1...
[10:55:05] Stopping VBR services on NODE1 gracefully...
[10:55:08] Starting VBR services on NODE2...
[10:55:12] VBR-NODE2 services started
[10:55:14] Transferring Virtual IP to NODE2...
[10:55:15] VIP 192.168.1.100 → now on VBR-NODE2

Ping output (from the client):
  Reply from 192.168.1.100 time=1ms  ← NODE1 active
  Reply from 192.168.1.100 time=1ms
  Request timed out.                 ← brief interruption (~3-5 seconds)
  Request timed out.
  Reply from 192.168.1.100 time=2ms  ← NODE2 active
  Reply from 192.168.1.100 time=2ms

Switchover complete:
  Active node:  VBR-NODE2 (NEW)
  Standby node: VBR-NODE1 (now standby)
  Total time:   15 seconds
  Ping loss:    3-5 packets (planned)

Verification - Jobs continue running:
  Daily-Backup-Lab: Running (not interrupted)
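
Counting lost pings by eye is error-prone, so the interruption can be computed from timestamped samples instead. The sketch below is one way to do that. `Get-LongestOutage` is a hypothetical helper, and the sample data is synthetic rather than measured.

```powershell
# Hypothetical helper: longest gap between the first failed probe and the next
# successful one. An outage still in progress at the end of the trace is ignored.
function Get-LongestOutage {
    param([Parameter(Mandatory)] [object[]] $Samples)   # items with .Time and .Ok
    $longest = [timespan]::Zero
    $start = $null
    foreach ($s in $Samples) {
        if (-not $s.Ok) {
            if ($null -eq $start) { $start = $s.Time }   # outage begins
        } elseif ($null -ne $start) {
            $gap = $s.Time - $start                      # outage ends
            if ($gap -gt $longest) { $longest = $gap }
            $start = $null
        }
    }
    return $longest
}

# Synthetic trace: one probe per second, probes 2 and 3 lost.
$t0 = [datetime]::new(2026, 5, 2, 10, 55, 12)
$demo = 0..5 | ForEach-Object {
    [pscustomobject]@{ Time = $t0.AddSeconds($_); Ok = ($_ -notin 2, 3) }
}
(Get-LongestOutage -Samples $demo).TotalSeconds
```

For the synthetic trace this prints 2. Real samples could be collected on the client with `Test-Connection -ComputerName 192.168.1.100 -Count 1 -Quiet` once per second, storing each result together with `Get-Date`.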

Test 2: Simulate Primary Crash (Automatic Failover)

Simulate a sudden crash of the active node. The standby detects it through the heartbeat timeout (~60 seconds) and then takes over automatically, which adds up to about 78 seconds in the timeline for this test.

# === Note: NODE2 is currently Active (after Test 1) ===
# Simulate a crash on NODE2 (the current active node):

# Stop the VBR services abruptly (not gracefully):
$services = @("VeeamBackupSvc","VeeamBrokerSvc","VeeamMountSvc")
foreach ($svc in $services) {
    Stop-Service -Name $svc -Force -ErrorAction SilentlyContinue
}
# Or: kill the process
Stop-Process -Name "Veeam.Backup.Service" -Force

# === Monitor from NODE1 (current standby) ===
# NODE1 detects the lost heartbeat and fails over on its own

# Monitor heartbeat detection:
Get-VBRHAHeartbeatLog | Select-Object -Last 10

# Time the interval from killing the process until the VIP moves to NODE1

✅ Expected Behavior (timeline: ~78 seconds in total):

T+0s:   NODE2 services killed (crash simulated)
T+10s:  NODE1 detects missing heartbeat (1st miss)
T+20s:  NODE1 detects missing heartbeat (2nd miss)
T+30s:  NODE1 tries to reconnect to NODE2... timeout
T+45s:  NODE1 attempts SQL query to NODE2... no response
T+60s:  Failover threshold reached! Initiating auto-failover
T+62s:  NODE1 starts VBR services
T+68s:  NODE1 acquires SCSI reservation (split-brain check)
T+72s:  VBR services on NODE1 fully started
T+75s:  NODE1 claims Virtual IP 192.168.1.100
T+78s:  Failover complete. NODE1 is now Active.

Email notification sent to [email protected]:
  Subject: [ALERT] Veeam HA Failover Occurred
  Node: VBR-NODE2 failed
  Failover to: VBR-NODE1
  Time: 78 seconds
  Jobs affected: 0 (queued for next run)

Heartbeat Log:
  10:58:00 NODE2: Last heartbeat received
  10:58:10 WARN: Heartbeat miss #1
  10:58:20 WARN: Heartbeat miss #2
  10:58:30 WARN: Heartbeat miss #3 - initiating failover check
  10:59:00 CRITICAL: Failover triggered
  10:59:18 INFO: NODE1 now Active

Test 3: Network Partition (Production Network Cut)

Disable the production NIC on the active node (not the heartbeat NIC). The active node detects that its production network is gone and yields, and the standby takes over.

# === NODE1 is Active (after Test 2) ===
# Disable the production NIC on NODE1 (keep the heartbeat NIC):
Disable-NetAdapter -Name "Ethernet" -Confirm:$false
# NIC2 (heartbeat: 10.10.10.1) stays UP

# NODE2 sees: heartbeat still present but production IP gone
# Veeam HA uses "dead man's switch" logic:
# NODE1 knows its production NIC is down → voluntarily yields

# Monitor from NODE2:
Get-VBRHAStatus -Watch

# Re-enable after the test:
Enable-NetAdapter -Name "Ethernet" -Confirm:$false

✅ Expected Behavior:

NODE1 (production NIC disabled):
  - Heartbeat still active via NIC2
  - Detects: Production network unreachable
  - Action: Self-demotion (yield to NODE2)
  - Stops VBR services gracefully
  - Releases Virtual IP

NODE2 response:
  - Receives yield signal from NODE1
  - Starts VBR services immediately (no timeout wait)
  - Acquires Virtual IP 192.168.1.100
  - Total time: ~20 seconds (faster than crash scenario)

Failover time with Network Partition: ~20 seconds
Failover time with crash: ~78 seconds
→ Network partition = faster because coordinated

Test 4: Split-Brain Prevention Verification

Verify that Veeam prevents both nodes from believing they are Active at the same time, the most dangerous failure in an HA setup.

# Simulate split-brain: cut the heartbeat network (NIC2) on both nodes
# This is the most dangerous scenario

# On VBR-NODE1: disable the heartbeat NIC
Disable-NetAdapter -Name "Ethernet 2" -Confirm:$false

# On VBR-NODE2: disable the heartbeat NIC
# (Run from a separate remote session)
Disable-NetAdapter -Name "Ethernet 2" -Confirm:$false

# Neither node can see the other over the heartbeat network
# → Each node may conclude: "the other node is dead, I must become Active"

# Veeam uses SCSI Persistent Reservations (tiebreaker):
# Only the node that WINS the shared storage lock may become Active
# The other node is fenced (it shuts down its own VBR services)

Get-VBRHASplitBrainStatus
Get-VBRHAStorageLock | Format-List *

✅ Expected Output (Split-Brain Prevention):

Split-Brain Detection:
  Heartbeat network: DOWN (both NICs disabled)
  Resolution method: SCSI Persistent Reservations

Storage Lock Race:
  NODE1 attempts SCSI reservation... SUCCESS (won)
  NODE2 attempts SCSI reservation... FAILED (NODE1 holds lock)

Result:
  NODE1: Active (holds storage lock)
  NODE2: Fenced (VBR services stopped automatically)
  Split-Brain: PREVENTED ✅

Node2 Log:
  CRITICAL: Lost heartbeat to NODE1
  INFO: Attempting shared storage lock...
  WARN: Storage lock held by NODE1
  INFO: Entering fenced state - stopping VBR services
  INFO: Fencing complete. Awaiting admin resolution.

→ VBR successfully prevented both nodes from becoming Active!
→ Re-enable the heartbeat NICs to restore cluster health
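
The tiebreaker idea (only one claimant can hold the lock, the loser fences itself) can be tried on a single machine with an exclusive file handle. This is an analogy only, not how SCSI Persistent Reservations work on the wire. `Request-ExclusiveLock` and the lock file name are hypothetical.

```powershell
# Hypothetical helper: tries to take an exclusive handle on a lock file.
# Returns the open stream on success, or $null if someone else already holds it.
function Request-ExclusiveLock {
    param([Parameter(Mandatory)] [string] $Path)
    try {
        # FileShare.None: the open fails while any other handle to the file exists.
        return [System.IO.File]::Open($Path, [System.IO.FileMode]::OpenOrCreate,
            [System.IO.FileAccess]::ReadWrite, [System.IO.FileShare]::None)
    } catch {
        return $null
    }
}

$lockPath = Join-Path ([System.IO.Path]::GetTempPath()) 'vbr-ha-demo.lock'
$node1 = Request-ExclusiveLock -Path $lockPath   # first claimant
$node2 = Request-ExclusiveLock -Path $lockPath   # second claimant, while the first still holds it

"NODE1 holds lock: $($null -ne $node1)"
"NODE2 holds lock: $($null -ne $node2)"
```

The node that gets `$null` back is the one that should stop its services and wait for an administrator, which mirrors the fenced state in the log above. Dispose of the stream and delete the lock file when you are done.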

LAB 7D: CDP (Continuous Data Protection)

📋 Requirements: CDP requires a Veeam Enterprise Plus license, VMware vSphere 6.7+, and the I/O filter driver installed on the ESXi hosts. This lab uses a vSphere environment.

Step 1: Enable CDP on the VM

# Before enabling CDP: install the I/O filter on the ESXi host
# VBR Console → Inventory → VMware vSphere
# → Select the ESXi host → Install I/O Filters

# Enable CDP via PowerShell:
$vm = Get-VBRViEntity -Name "DB-Production-01"

# Create the CDP policy:
Add-VBRCDPPolicy `
    -Name "CDP-DatabaseServers" `
    -VMObject $vm `
    -TargetHost "esxi-dr-01.corp.local" `
    -TargetDatastore "DS-CDP-Replicas" `
    -RPOSeconds 15 `
    -BookmarksEnabled $true `
    -BookmarkSchedule "Every 1 Hour"

# Enable the policy:
Enable-VBRCDPPolicy -Name "CDP-DatabaseServers"

✅ Expected Output:

CDP Policy Created:
  Name:           CDP-DatabaseServers
  Protected VMs:  DB-Production-01
  RPO:            15 seconds
  Target host:    esxi-dr-01.corp.local
  Target DS:      DS-CDP-Replicas
  Bookmarks:      Every 1 hour (point-in-time recovery)
  Status:         Enabled

I/O Filter Status on esxi-prod-01:
  VeeamCDP filter: Installed v12.x
  Status:         Active, intercepting I/Os
  Protected VMs:  1 (DB-Production-01)

Step 2: Detailed CDP Policy Configuration

# More detailed settings for a production workload:
Set-VBRCDPPolicy -Name "CDP-DatabaseServers" `
    -Throttle 50 `
    -ThrottleUnit Percent `
    -NetworkThrottleMbps 100 `
    -CompressionLevel Optimal `
    -EncryptionEnabled $true

# Check that the I/O filter is active on the VM:
Get-VBRCDPVMStatus -VMName "DB-Production-01"

# Show the bandwidth in use:
Get-VBRCDPStats -PolicyName "CDP-DatabaseServers" -Last 1Hour

✅ Expected Output:

CDP VM Status - DB-Production-01:
  Protection:     Active
  Current RPO:    8 seconds (within 15s target) ✅
  I/O Filter:     Loaded, intercepting writes
  Write rate:     45 MB/s (database writes)
  Transfer rate:  12 MB/s (compressed+encrypted)
  Compression:    3.75x ratio

CDP Stats (last 1 hour):
  Data written:   162 GB
  Data replicated: 43 GB (after compression)
  Network used:   ~12 MB/s average
  RPO violations: 0 (all within 15s)
  Bookmarks:      1 created (hourly)
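
The figures in this sample output can be cross-checked with a few lines of arithmetic. The sketch below uses only the two rates shown above and assumes decimal units (1 GB = 1,000 MB).

```powershell
$writeRateMBps    = 45     # VM write rate from the sample output
$transferRateMBps = 12     # rate on the wire after compression and encryption
$secondsPerHour   = 3600

$compressionRatio = $writeRateMBps / $transferRateMBps           # 3.75
$writtenGBPerHour = $writeRateMBps * $secondsPerHour / 1000      # 162
$sentGBPerHour    = $transferRateMBps * $secondsPerHour / 1000   # 43.2

"Compression ratio: ${compressionRatio}x"
"Written per hour:  $writtenGBPerHour GB"
"Sent per hour:     $sentGBPerHour GB"
```

The results match the 3.75x ratio, 162 GB written, and roughly 43 GB replicated in the sample output. If your own lab numbers do not reconcile this way, the statistics probably cover different time windows.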

Step 3: Monitor CDP Replication

# CDP dashboard in the VBR Console:
# Home → CDP Replicas → select "DB-Production-01_replica"

# PowerShell monitoring:
while ($true) {
    $status = Get-VBRCDPReplicaState -VMName "DB-Production-01"
    $timestamp = Get-Date -Format "HH:mm:ss"
    Write-Host "$timestamp | RPO: $($status.CurrentRPO)s | Lag: $($status.ReplicationLag)ms | State: $($status.State)"
    Start-Sleep -Seconds 5
}

# Check the replica VM on the DR site:
Get-VBRCDPReplica -VMName "DB-Production-01" | Format-List *

# List the bookmark history for point-in-time restore:
Get-VBRCDPBookmark -PolicyName "CDP-DatabaseServers" | Format-Table Time, Type, Description

✅ Expected Output (Live Monitoring):

11:00:05 | RPO: 6s  | Lag: 234ms | State: Replicating
11:00:10 | RPO: 4s  | Lag: 198ms | State: Replicating
11:00:15 | RPO: 9s  | Lag: 312ms | State: Replicating
11:00:20 | RPO: 5s  | Lag: 201ms | State: Replicating
[All within 15s RPO target - HEALTHY]

CDP Replica Details:
  Name:           DB-Production-01_replica
  Host:           esxi-dr-01.corp.local
  Datastore:      DS-CDP-Replicas
  State:          PoweredOff (standby, ready)
  Last sync:      11:00:20 (5 seconds ago)
  Restore points: 4 bookmarks available

Bookmarks Available:
  Time                  Type    Description
  ----                  ----    -----------
  02/05/2026 11:00:00   Auto    Hourly bookmark
  02/05/2026 10:00:00   Auto    Hourly bookmark
  02/05/2026 09:00:00   Auto    Hourly bookmark
  02/05/2026 08:00:00   Manual  Before-patch bookmark

Step 4: Test CDP Failover (Near-Zero Data Loss)

# Step 4A: Record the time before failover
$beforeFailover = Get-Date
Write-Host "Failover initiated at: $beforeFailover"

# Step 4B: Create test data (simulates an in-flight transaction)
# (Assuming this is SQL Server)
Invoke-Sqlcmd -ServerInstance "DB-Production-01" `
    -Query "INSERT INTO TestTable VALUES ('Test-$(Get-Date)', 'pre-failover')"

# Step 4C: Perform the CDP failover
Start-VBRCDPFailover `
    -VMName "DB-Production-01" `
    -TargetPoint Latest `
    -Reason "Testing CDP failover - lab exercise"

# Follow the progress:
Get-VBRCDPFailoverStatus -VMName "DB-Production-01" -Watch

# Step 4D: Verify on the DR site that the replica VM has booted
# Connect to the replica and check the data:
$afterFailover = Get-Date
$rto = ($afterFailover - $beforeFailover).TotalSeconds
Write-Host "Actual RTO: $rto seconds"

✅ Expected Output (Near-Zero Data Loss):

Failover initiated at: 02/05/2026 11:15:00

CDP Failover Progress:
[11:15:00] Stopping CDP replication stream...
[11:15:02] Final sync: transferring last 8s of I/Os...
[11:15:04] Final sync complete (2.3 MB transferred)
[11:15:05] Powering on replica VM on esxi-dr-01...
[11:15:18] VM powered on, booting OS...
[11:15:45] VM guest OS ready
[11:15:48] Failover complete!

Results:
  Actual RTO:        48 seconds ✅ (target: <60s)
  Data loss (RPO):   8 seconds worth of I/Os
  Data transferred:  2.3 MB (last 8 seconds)
  Transactions lost: 0 committed transactions
                     (8s of in-flight = 3 transactions)

Verification on Replica VM:
  SQL Query: SELECT * FROM TestTable ORDER BY ID DESC
  Latest record: "Test-02/05/2026 11:14:52" ← 8 seconds before failover
  → The data is only 8 seconds behind the failover point!

Hands-On Exercises

Exercise 1: HA Capacity Planning

Your company has two datacenters 15 km apart. Calculate the bandwidth needed for the HA cluster and for CDP replication. Given: 50 VMs, 20 TB of total capacity, a 5% daily data change rate, and 10 mission-critical VMs that need CDP with a 15-second RPO.

💡 Solution Guide:

=== BANDWIDTH CALCULATION ===

1. Heartbeat Network (minimum):
   • Heartbeat packet: ~1 KB every 10 seconds
   • Bandwidth: 1 KB × 2 nodes × 6/minute = 12 KB/minute = negligible
   • Recommendation: a dedicated 1 Gbps link (heartbeat + DB sync only)
   • DB sync overhead: ~5-10 Mbps (config changes are infrequent)
   → Heartbeat NIC: 1 Gbps dedicated = SUFFICIENT

2. Replication Bandwidth (50 VMs):
   • Total data: 20 TB
   • Daily change rate: 5% = 1 TB/day = 1,000 GB/day
   • Backup window: 8 hours (overnight)
   • Required: 1,000 GB / (8 h × 3,600 s/h) = 34.7 MB/s = 278 Mbps
   • With 3x compression: ~93 Mbps in practice
   → Production link needed: 1 Gbps (headroom for peaks)

3. CDP Bandwidth (10 mission-critical VMs):
   • Assume each VM writes 20 MB/s (database workload)
   • Total write: 10 × 20 MB/s = 200 MB/s
   • With 3x compression: ~67 MB/s = 536 Mbps
   • Add 30% headroom: 536 × 1.3 = 697 Mbps
   → CDP needs: a dedicated 1 Gbps link

BANDWIDTH SUMMARY BETWEEN THE 2 DATACENTERS:
├─ Link 1 (Heartbeat + HA sync): 1 Gbps
├─ Link 2 (Replication + Backup): 1 Gbps
└─ Link 3 (CDP mission-critical): 1 Gbps
   TOTAL: 3 Gbps inter-DC bandwidth

Estimated cost:
• 15 km dark fiber: ~$2,000-5,000/month
• 3× 1 Gbps circuits: ~$6,000-15,000/month
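
The replication and CDP figures above can be reproduced with a short script, which also makes it easy to try other change rates or compression ratios. This is a sketch using the exercise's own assumptions (3x compression, 20 MB/s per CDP VM, 30% headroom). `ConvertTo-Mbps` is a helper defined here, not a built-in cmdlet.

```powershell
# Helper defined for this exercise: megabytes per second to megabits per second.
function ConvertTo-Mbps {
    param([double] $MegabytesPerSecond)
    return $MegabytesPerSecond * 8
}

$compression = 3                          # assumed compression ratio from the exercise

# Scheduled replication: 5% of 20 TB per day, sent within an 8-hour window.
$totalGB       = 20000
$dailyChangeGB = $totalGB * 5 / 100       # 1,000 GB/day
$windowSeconds = 8 * 3600
$replRawMBps   = $dailyChangeGB * 1000 / $windowSeconds
$replMbps      = (ConvertTo-Mbps $replRawMBps) / $compression

# CDP: 10 VMs at an assumed 20 MB/s each, plus 30% headroom.
$cdpRawMBps    = 10 * 20
$cdpMbps       = (ConvertTo-Mbps ($cdpRawMBps / $compression)) * 1.3

'Replication: {0:N0} Mbps after compression' -f $replMbps
'CDP:         {0:N0} Mbps with headroom' -f $cdpMbps
```

The script gives about 93 Mbps for replication and about 693 Mbps for CDP. The CDP result is slightly below the 697 Mbps in the worked solution because the solution rounds 66.7 MB/s up to 67 before converting. Either way, a 1 Gbps link covers it.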

Exercise 2: RTO Analysis - Comparing 4 Scenarios

Analyze and compare the RTO for an ERP system (Oracle DB, 500 users) in four different scenarios. Assume a downtime cost of $10,000/hour.

💡 Analysis:

SCENARIO A: No HA, traditional backup only
  Detect the incident:    15 minutes (monitoring alert)
  Escalate to admin:      10 minutes
  Diagnose root cause:    30 minutes
  Order/replace hardware: 4-8 hours (if the hardware failed)
  Restore OS:             1-2 hours
  Restore Veeam DB:       30 minutes
  Restore backup data:    2-4 hours (depends on size)
  Test & verify:          30 minutes
  ─────────────────────────────────────────────
  TOTAL RTO:  about 9-16 hours
  DOWNTIME COST: about $90,000 - $160,000

SCENARIO B: Manual Failover (secondary exists, not automatic)
  Detection + alert:  5 minutes
  Admin gets alert:   5 minutes (on-call)
  Manual switch cmd:  5 minutes
  Service start time: 5 minutes
  Verify & test:      10 minutes
  ─────────────────────────────────────────────
  TOTAL RTO:  30 minutes
  DOWNTIME COST: $5,000

SCENARIO C: Automatic HA (VBR HA Cluster)
  Heartbeat timeout:  60 seconds
  Auto failover:      30-90 seconds
  Service availability: 15 seconds later
  ─────────────────────────────────────────────
  TOTAL RTO:  2-3 minutes
  DOWNTIME COST: ~$500

SCENARIO D: CDP (Enterprise Plus)
  I/O Filter failover: 30-60 seconds
  VM power-on:         15-30 seconds
  ─────────────────────────────────────────────
  TOTAL RTO:  <2 minutes (+ RPO of only 8-15 seconds!)
  DOWNTIME COST: ~$300

TOTAL COST COMPARISON (for one major incident):
  A: about $90,000-160,000 downtime
  B: $5,000 downtime + on-call staffing
  C: $500 downtime + Veeam Enterprise ~$15,000 setup
  D: $300 downtime + Veeam Enterprise Plus ~$25,000 setup

→ Veeam HA (C) pays for itself after the first incident!
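
The downtime costs for scenarios B, C, and D follow directly from the $10,000/hour assumption. The sketch below recomputes them from the RTO in minutes. `Get-DowntimeCost` is a helper defined for this exercise, and the minute values are the upper ends of the RTO ranges above.

```powershell
# Helper defined for this exercise: downtime cost for a given outage length.
function Get-DowntimeCost {
    param(
        [Parameter(Mandatory)] [double] $Minutes,
        [double] $CostPerHour = 10000     # assumption stated in the exercise
    )
    return [math]::Round($Minutes * $CostPerHour / 60, 2)
}

$scenarios = [ordered]@{
    'B (manual failover)' = 30
    'C (automatic HA)'    = 3
    'D (CDP)'             = 2
}

foreach ($name in $scenarios.Keys) {
    '{0,-20} {1,3} min  ${2:N0}' -f $name, $scenarios[$name], (Get-DowntimeCost -Minutes $scenarios[$name])
}
```

This gives $5,000 for B, $500 for C, and about $333 for D, in line with the rounded figures in the comparison.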

Bài Tập 3: Viết DR Runbook (Step-by-Step)

Viết quy trình DR đầy đủ cho tình huống: Datacenter chính mất điện toàn bộ lúc 2 giờ sáng, cần restore hoạt động tại DR site trong vòng 30 phút.

💡 DR Runbook Mẫu:

═══════════════════════════════════════════════════
DR RUNBOOK v2.0 - Datacenter Power Outage
Tổ chức: Corp Inc. | Cập nhật: 02/05/2026
Người thực hiện: On-call Engineer
RTO Target: 30 phút | RPO Target: 15 phút
═══════════════════════════════════════════════════

GIAI ĐOẠN 1: PHÁT HIỆN & KHAI BÁO (0-5 phút)
□ T+0: Nhận alert từ monitoring system
□ T+1: Xác nhận sự cố (ping test DC chính)
□ T+2: Gọi cho DC Manager xác nhận power outage
□ T+3: Khai báo DR Event - gửi email/SMS team
□ T+4: Mở conference bridge: +1-800-XXX-XXXX code 9876
□ T+5: Assign roles: DR Lead, Network, Apps, DB

GIAI ĐOẠN 2: KÍCH HOẠT DR SITE (5-15 phút)
□ T+5: Kết nối VPN đến DR site
□ T+6: Login VBR Console tại DR site: vbr-dr.corp.local
□ T+7: Kiểm tra CDP replica status:
        Get-VBRCDPReplicaState -VMName "DB-Production-01"
□ T+8: Kích hoạt HA Failover (nếu HA cluster chưa tự switch):
        Invoke-VBRHASwitchover -TargetNode "VBR-DR-NODE"
□ T+10: Start CDP replicas cho mission-critical VMs:
         Start-VBRCDPFailover -VMName "DB-Production-01"
         Start-VBRCDPFailover -VMName "WebApp-Prod-01"
□ T+13: Verify VMs booted: ping 192.168.2.10 (DR IP)
□ T+15: Notify Network team: update DNS, BGP routing

PHASE 3: RESTORE SERVICES (15-25 minutes)
□ T+15: Start remaining VMs from replicas:
         Start-VBRFailover -ReplicaName "AppServer*"
□ T+18: Verify database connectivity:
         Invoke-Sqlcmd -ServerInstance "DB-DR-01" -Query "SELECT 1"
□ T+20: Test web application: curl https://app.corp.local/health
□ T+22: Enable DR load balancer rules
□ T+23: Send user notification: "Services restored at DR site"
□ T+25: Update status page (status.corp.local): Degraded → OK

PHASE 4: VERIFICATION (25-30 minutes)
□ T+25: Run smoke tests: DR-SmokeTest.ps1
□ T+27: Verify backup jobs are running at the DR site
□ T+28: Confirm with business stakeholders: services UP
□ T+30: DR COMPLETE - Record actual RTO: _____ minutes

POST-DR ACTIONS (after 30 minutes):
□ Send a status update to management every 30 minutes
□ Plan the failback for when the primary DC recovers
□ Document the timeline and lessons learned
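
To fill in the "actual RTO" line and the lessons-learned timeline after a drill, the phase deadlines from the runbook can be compared with the times each phase actually finished. The following is a minimal sketch, not part of any Veeam tooling; the drill timestamps and all names are hypothetical.

```python
from datetime import datetime, timedelta

RTO_TARGET_MIN = 30

# Planned end of each phase in minutes after T+0, as laid out in the runbook.
PHASE_DEADLINES = {
    "Detect & declare": 5,
    "Activate DR site": 15,
    "Restore services": 25,
    "Verification": 30,
}


def review_timeline(t0, completed):
    """Return [(phase, actual_min, slip_min), ...] and the actual RTO in minutes."""
    rows = []
    for phase, deadline in PHASE_DEADLINES.items():
        actual = (completed[phase] - t0).total_seconds() / 60
        rows.append((phase, actual, actual - deadline))
    actual_rto = rows[-1][1]  # the last phase closes the DR event
    return rows, actual_rto


# Hypothetical timestamps recorded during one drill.
t0 = datetime(2026, 5, 2, 2, 0)
completed = {
    "Detect & declare": t0 + timedelta(minutes=4),
    "Activate DR site": t0 + timedelta(minutes=17),
    "Restore services": t0 + timedelta(minutes=26),
    "Verification": t0 + timedelta(minutes=29),
}

rows, actual_rto = review_timeline(t0, completed)
for phase, actual, slip in rows:
    print(f"{phase:<18} T+{actual:.0f} (slip {slip:+.0f} min)")
print(f"Actual RTO: {actual_rto:.0f} min, target met: {actual_rto <= RTO_TARGET_MIN}")
```

A positive slip on a phase that still ends within the overall target is a useful lessons-learned item: it shows which phase has no margin left.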

Exercise 4: Cost-Benefit Analysis

Calculate the ROI (Return on Investment) of investing in Veeam HA infrastructure. The company has revenue of $5M/year, an uptime requirement of 99.9%, and currently experiences 3 major incidents per year.
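
Before costing anything, it helps to turn the 99.9% uptime requirement into a yearly downtime budget. The sketch below assumes a 24x7 service (8760 hours per year, the same basis the sample solution uses); the 18 hours of downtime is the sum of the three incidents in the sample solution, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def allowed_downtime_hours(availability_pct):
    """Yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_hours):
    """Availability actually delivered for a given yearly downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


budget = allowed_downtime_hours(99.9)
print(f"Budget at 99.9%: {budget:.2f} hours/year")

# 6 + 4 + 8 hours of downtime across the three incidents.
current = achieved_availability_pct(6 + 4 + 8)
print(f"Current availability: {current:.2f}%")

# With HA: 3 incidents at roughly 2 minutes each.
with_ha = achieved_availability_pct(3 * 2 / 60)
print(f"Availability with HA: {with_ha:.4f}%")
```

On these numbers the company currently misses its own 99.9% requirement, which is a business case for HA independent of the dollar figures.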

💡 Cost-Benefit Calculation:

═══════════════════════════════════════════════
COST-BENEFIT ANALYSIS: Veeam HA Investment
Company: Corp Inc. | Revenue: $5M/year
═══════════════════════════════════════════════

PART 1: CURRENT DOWNTIME COST (Without HA)

Revenue/hour: $5,000,000 / 8760h ≈ $570/hour
Direct downtime costs:
  • Lost revenue:          $570/hour
  • Staff overtime:        $200/hour (4 engineers)
  • Customer SLA penalty:  $500/hour (average)
  • Reputation damage:     Hard to measure (estimated $2,000/incident, not included in the total)
  Total cost/hour of downtime:  $1,270/hour

Incident history (3 incidents/year):
  • Incident 1: 6 hours downtime = $7,620
  • Incident 2: 4 hours downtime = $5,080
  • Incident 3: 8 hours downtime = $10,160
  TOTAL DOWNTIME COST/YEAR:  $22,860

PART 2: HA INVESTMENT COST

CapEx (one-time):
  • Secondary server hardware: $15,000
  • Shared storage (SAN):      $20,000
  • Network upgrade (HA):      $5,000
  • Veeam Enterprise Plus:     $8,000 (1 socket)
  • Implementation & testing:  $10,000
  Total CapEx:                 $58,000

OpEx/year:
  • Veeam maintenance (20%):     $1,600
  • Hardware maintenance:        $2,000
  • Power & cooling (+1 server): $1,200
  Total OpEx/year:               $4,800

PART 3: DOWNTIME AFTER HA

With Automatic HA (RTO ~2 minutes):
  • 3 incidents/year × 2 minutes = 6 minutes downtime
  • Downtime cost: $1,270/60 × 6 = $127/year

Saving compared with before: $22,860 - $127 = $22,733/year

PART 4: ROI CALCULATION

Payback Period (CapEx only, OpEx not counted):
  CapEx: $58,000
  Annual saving: $22,733
  Payback: $58,000 / $22,733 = 2.55 years

3-Year ROI:
  Total saving (3 years): $22,733 × 3 = $68,199
  Total cost (3 years):   $58,000 + ($4,800 × 3) = $72,400
  Net benefit:            $68,199 - $72,400 = -$4,201

5-Year ROI:
  Total saving (5 years): $22,733 × 5 = $113,665
  Total cost (5 years):   $58,000 + ($4,800 × 5) = $82,000
  Net benefit:            $113,665 - $82,000 = $31,665 ✅
  ROI:                    $31,665 / $82,000 = 38.6%

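CONCLUSION:
✅ Break-even after about 2.55 years on CapEx alone, or about 3.2 years once OpEx is included: $58,000 / ($22,733 - $4,800)
✅ ROI turns positive during year 4 (net benefit is -$4,201 at 3 years and +$31,665 at 5 years)
✅ Counting reputation damage and customer churn would improve the ROI further
⚠️ Re-evaluate if there are fewer than 2 incidents per year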
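
The whole calculation fits in a short script, which makes it easy to rerun with different assumptions. This is a sketch of the arithmetic only: the inputs are the figures from the analysis above, the function names are illustrative, and the sensitivity loop assumes every incident lasts 6 hours (the average of the three incidents in the history).

```python
# Inputs copied from the analysis above (USD).
COST_PER_HOUR = 1_270          # cost of one hour of downtime
INCIDENT_HOURS = [6, 4, 8]     # downtime of the three incidents per year
HA_RTO_MIN = 2                 # downtime per incident with automatic HA
CAPEX = 58_000
OPEX_PER_YEAR = 4_800


def annual_saving(incident_hours, rto_min=HA_RTO_MIN):
    """Downtime cost avoided per year when each incident shrinks to rto_min."""
    before = sum(incident_hours) * COST_PER_HOUR
    after = len(incident_hours) * rto_min * COST_PER_HOUR / 60
    return before - after


def net_benefit(saving, years):
    """Cumulative saving minus CapEx and cumulative OpEx."""
    return saving * years - (CAPEX + OPEX_PER_YEAR * years)


def roi_pct(saving, years):
    """Net benefit as a percentage of total cost over the period."""
    return 100 * net_benefit(saving, years) / (CAPEX + OPEX_PER_YEAR * years)


def break_even_years(saving):
    """Years until cumulative net saving (after OpEx) covers CapEx."""
    net_per_year = saving - OPEX_PER_YEAR
    return CAPEX / net_per_year if net_per_year > 0 else float("inf")


saving = annual_saving(INCIDENT_HOURS)
print(f"Annual saving:      ${saving:,.0f}")
print(f"3-year net benefit: ${net_benefit(saving, 3):,.0f}")
print(f"5-year net benefit: ${net_benefit(saving, 5):,.0f}")
print(f"5-year ROI:         {roi_pct(saving, 5):.1f}%")
print(f"Break-even (with OpEx): {break_even_years(saving):.2f} years")

# Sensitivity: how the break-even moves with fewer incidents per year.
for count in (1, 2, 3):
    s = annual_saving([6] * count)
    print(f"{count} incident(s)/year -> break-even {break_even_years(s):.1f} years")
```

The sensitivity loop backs up the final warning: break-even stretches out quickly as the number of incidents per year drops.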

Week 7 Completion Checklist

□ Enabled Configuration Backup and ran a manual backup successfully (a .bco file was created)
□ Tested Configuration Restore: simulated a disaster and restored successfully in <2 minutes
□ Installed the Secondary Server and configured the HA Cluster with a dedicated heartbeat network
□ Performed a Planned Switchover successfully (downtime <20 seconds)
□ Tested Automatic Failover (simulated crash) and recorded the actual RTO
□ Confirmed that Split-Brain Prevention works (SCSI reservation tiebreaker)
□ Configured a CDP Policy with an RPO of 15 seconds and verified the monitoring dashboard
□ Completed all 4 exercises: capacity planning, RTO analysis, DR runbook, cost-benefit

🎓 Completion criteria: tick all 8/8 items. Most important: Test 2 (automatic failover) and the DR Runbook exercise must be completed, because they are the most practical, real-world skills covered this week.

Week 7 Summary

🏗️ HA Cluster

• Active/Passive architecture
• Heartbeat every 10 seconds
• Automatic failover in <90 seconds
• SCSI reservation tiebreaker
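
The heartbeat rule (a beat every 10 seconds, failover once silence exceeds the 60-second timeout listed under the failover trigger conditions) can be pictured as a simple timer on the standby node. The sketch below is a generic illustration of timeout-based failure detection, not Veeam's implementation; the class and method names are invented.

```python
HEARTBEAT_INTERVAL_S = 10   # the standby expects a heartbeat this often
HEARTBEAT_TIMEOUT_S = 60    # fail over once silence exceeds this


class HeartbeatMonitor:
    """Tracks the last heartbeat received from the primary node."""

    def __init__(self, now_s):
        self.last_seen_s = now_s

    def record_heartbeat(self, now_s):
        self.last_seen_s = now_s

    def missed_beats(self, now_s):
        """Number of whole heartbeat intervals since the last beat."""
        return int((now_s - self.last_seen_s) // HEARTBEAT_INTERVAL_S)

    def should_failover(self, now_s):
        """True once the silence is strictly longer than the timeout."""
        return (now_s - self.last_seen_s) > HEARTBEAT_TIMEOUT_S


monitor = HeartbeatMonitor(now_s=0)
for t in (10, 20, 30):          # primary is healthy until t = 30 s
    monitor.record_heartbeat(t)

for t in (60, 90, 91):          # primary goes silent after t = 30 s
    print(t, monitor.missed_beats(t), monitor.should_failover(t))
```

Waiting for several missed beats rather than one keeps a single dropped packet from triggering a failover, at the cost of adding the timeout to the RTO.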

⚙️ Config Backup

• The .bco file is the lifeline in a disaster
• Encrypt it and store it off-server
• Test restores regularly
• Retain 10+ restore points

⚡ CDP

• RPO 2-15 seconds (I/O Filter)
• Only for mission-critical VMs
• Requires an Enterprise Plus license
• High bandwidth (continuous)
