Session 7: High Availability (HA) | VMware vSphere 8.0.3

Session Objectives

Understand how vSphere HA works: Master Host, heartbeats, datastore heartbeating
Configure Admission Control with its 3 policies
Enable and test VM Monitoring (VMware Tools heartbeat)
Test a real HA failover: simulate a host failure and watch the VMs restart
Understand Fault Tolerance: RPO=0, RTO=0, and its limitations
Design SLA tiers with HA, DRS, and FT for the enterprise

THEORY: MODULE 8 (HA portion)

8.1 vSphere High Availability (HA): How It Works

Goal: automatically restart VMs when an ESXi host fails

How it works:
  1. The HA agent (FDM) on every host sends a network heartbeat every 1 second
  2. The Master Host (one elected host, called the primary host in recent vSphere releases) monitors all the others
  3. If the Master receives no heartbeat from a host for ≥ 10 seconds (configurable):
     → It checks the datastore heartbeat (secondary mechanism)
     → If the host failure is confirmed → it restarts that host's VMs on the remaining hosts
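
The three steps above can be sketched as a small decision function. This is only an illustration of the flow described in the text, not the FDM agent's actual implementation; the function name and the returned strings are made up for the example, and the 10-second threshold is the value quoted above.

```python
from typing import Optional

HEARTBEAT_TIMEOUT_S = 10.0  # threshold quoted in the text; configurable in HA


def next_step(seconds_silent: float,
              datastore_heartbeat_seen: Optional[bool] = None) -> str:
    """Illustrative decision for one host, as seen from the Master."""
    if seconds_silent < HEARTBEAT_TIMEOUT_S:
        return "healthy"
    if datastore_heartbeat_seen is None:
        # Network heartbeats are missing; consult the secondary mechanism.
        return "check datastore heartbeat"
    if datastore_heartbeat_seen:
        return "isolated or partitioned: host is still alive"
    return "failed: restart VMs on remaining hosts"


print(next_step(3))            # healthy
print(next_step(12))           # check datastore heartbeat
print(next_step(12, False))    # failed: restart VMs on remaining hosts
```

The point of the second argument is that a missing network heartbeat alone never triggers a restart; the datastore heartbeat has to confirm the failure first.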

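Admission Control (guarantees capacity for failover):
  Policy 1: Cluster resource percentage
    Reserve, for example, 25% CPU + 25% memory as failover capacity
  Policy 2: Dedicated failover hosts
    1-2 hosts are always kept in standby (hot spare)
  Policy 3: Slot Policy
    Slot size = the largest CPU reservation and the largest memory reservation (plus overhead) among powered-on VMs; each host's capacity is divided into slots of that size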
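
To make the Slot Policy concrete, here is a minimal sketch of the slot arithmetic. It is a simplified model under stated assumptions: the `VM` and `Host` classes and all function names are hypothetical, memory reservations are assumed to already include overhead, and the 32 MHz floor for VMs without a CPU reservation is the commonly documented default rather than something taken from this lesson.

```python
from dataclasses import dataclass


@dataclass
class VM:
    cpu_reservation_mhz: int
    mem_reservation_mb: int  # reservation plus memory overhead


@dataclass
class Host:
    cpu_mhz: int
    mem_mb: int


def slot_size(vms, default_cpu_mhz=32):
    """Largest CPU and memory reservation among powered-on VMs."""
    cpu = max([vm.cpu_reservation_mhz for vm in vms] + [default_cpu_mhz])
    mem = max([vm.mem_reservation_mb for vm in vms] + [1])
    return cpu, mem


def slots_per_host(host, slot):
    """A host holds as many slots as its scarcer resource allows."""
    cpu, mem = slot
    return min(host.cpu_mhz // cpu, host.mem_mb // mem)


def can_tolerate(hosts, vms, failures=1):
    """True if the slots left after losing the largest hosts cover every VM."""
    slot = slot_size(vms)
    per_host = sorted(slots_per_host(h, slot) for h in hosts)
    remaining = per_host[:len(per_host) - failures]
    return sum(remaining) >= len(vms)


vms = [VM(500, 1024), VM(2000, 4096), VM(0, 512)]
hosts = [Host(20000, 65536) for _ in range(3)]
print(slot_size(vms))            # (2000, 4096)
print(can_tolerate(hosts, vms))  # True
```

Note how the single VM with a 2000 MHz / 4096 MB reservation sets the slot size for every VM, which is why one large VM can shrink the usable capacity of the whole cluster.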

VM Monitoring:
  - VMware Tools sends a heartbeat every 20 seconds
  - If no heartbeat arrives for ≥ the Failure Interval and the VM shows no disk or network I/O → reset the VM
  - Application Monitoring: custom, via the VMCB API
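
The reset decision can be sketched as follows. This is a hedged model, not HA's code: the function name is invented, and the per-sensitivity failure intervals (30/60/120 seconds) and the 120-second I/O window are the commonly documented presets, treated here as assumptions to verify against your own cluster settings.

```python
# Assumed presets: failure interval in seconds per sensitivity level.
FAILURE_INTERVAL_S = {"high": 30, "medium": 60, "low": 120}


def should_reset(seconds_since_heartbeat: float,
                 seconds_since_io: float,
                 sensitivity: str = "medium",
                 io_window_s: float = 120) -> bool:
    """Reset only if Tools heartbeats stopped AND the VM shows no I/O."""
    heartbeat_lost = seconds_since_heartbeat >= FAILURE_INTERVAL_S[sensitivity]
    io_quiet = seconds_since_io >= io_window_s
    return heartbeat_lost and io_quiet


print(should_reset(61, 130))          # True
print(should_reset(61, 10))           # False: the VM is still doing I/O
print(should_reset(45, 130, "high"))  # True
```

The I/O check is what keeps a VM with a crashed or stopped VMware Tools service, but a healthy guest, from being reset needlessly.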

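vSphere Fault Tolerance (FT): Zero Downtime

A step above HA: zero downtime (RPO=0, RTO=0)

How it works:
  - Primary VM: runs normally
  - Secondary VM: a live shadow copy on another host, kept continuously in sync with the Primary (fast checkpointing in current FT; lock-step in legacy FT)
  - If the Primary's host fails → the Secondary takes over immediately
  - A new Secondary is then created on another host

FT limitations:
  - At most 8 vCPUs and 128 GB RAM per VM
  - Storage vMotion is not supported
  - Needs a dedicated 10 Gbps NIC for FT logging
  - Disk format: Eager Zeroed Thick was mandatory for legacy FT; current FT also accepts thin and lazy-zeroed thick disks
  - Cost: double the resources (Primary + Secondary)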
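
As a quick sanity check before enabling FT, the limits listed above can be encoded in a small helper. This is a sketch only: the function names are hypothetical, and it checks just the three numeric limits from the list, not every FT prerequisite.

```python
FT_MAX_VCPUS = 8
FT_MAX_RAM_GB = 128
FT_LOGGING_NIC_GBPS = 10


def ft_blockers(vcpus: int, ram_gb: int, ft_nic_gbps: float) -> list:
    """Return the reasons a VM cannot be FT-protected (empty list = OK)."""
    blockers = []
    if vcpus > FT_MAX_VCPUS:
        blockers.append(f"{vcpus} vCPUs exceeds the limit of {FT_MAX_VCPUS}")
    if ram_gb > FT_MAX_RAM_GB:
        blockers.append(f"{ram_gb} GB RAM exceeds the limit of {FT_MAX_RAM_GB} GB")
    if ft_nic_gbps < FT_LOGGING_NIC_GBPS:
        blockers.append("FT logging NIC is slower than 10 Gbps")
    return blockers


def footprint_with_ft(vcpus: int, ram_gb: int) -> tuple:
    """Primary + Secondary together consume twice the VM's resources."""
    return 2 * vcpus, 2 * ram_gb


print(ft_blockers(4, 64, 10))     # []
print(footprint_with_ft(4, 64))   # (8, 128)
```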

HA vs FT vs DRS Comparison

Feature	vSphere HA	Fault Tolerance	DRS
Purpose	Restart VMs after a host failure	Zero downtime	Load balancing
RPO	Crash-consistent: data on shared storage survives, in-memory state is lost	0 (zero)	N/A
RTO	1–5 minutes	0 (zero)	N/A
Extra resources	25% capacity	2× (double)	10–20% headroom
vCPU limit	No limit	Max 8 vCPUs	No limit
Use case	Production VMs	Mission-critical DB	Resource balancing

HANDS-ON LAB: MODULE 8 (HA)

Lab 8.1: Enable and Configure vSphere HA

Cluster → Configure → vSphere Availability → Edit

Enable vSphere HA: ✓ ON

Host Failures Cluster Tolerates: 1
   (requires at least 2 hosts)

Admission Control:
  ✓ Enable
  Policy: Cluster resource percentage
  Reserved failover CPU: 25%
  Reserved failover Memory: 25%

VM Monitoring:
  ✓ Enable VM Monitoring
  VM Monitoring Sensitivity: Medium

Datastore Heartbeating:
  ✓ Use datastores only from the specified list
  Select: DS-iSCSI-VMFS6-01 (shared storage) plus a second shared datastore (with fewer than 2, HA raises a configuration warning)
  Minimum number of heartbeat datastores: 2

Advanced Options:
  das.isolationaddress1 = 10.100.100.1   (gateway)
  das.usedefaultisolationaddress = false

→ OK → Confirm the HA reconfiguration

Lab 8.2: Test vSphere HA

Test 1: Simulate a host failure
  1. Note which VMs are running on ESXi-02
  2. Disconnect all of ESXi-02's network cables, or force a power-off (pulling only the management uplink triggers the isolation response rather than a host failure)
  3. Watch Events in the vSphere Client
  4. After 30-60 seconds: the VMs are restarted on ESXi-01 or ESXi-03

  Expected events:
  - "Host ESXi-02 is not responding"
  - "Initiating failover"
  - "VM web-server-01 powered on" (on different host)

Test 2: Verify VM Monitoring
  1. Kill VMware Tools inside the (Linux) guest:
     sudo pkill vmtoolsd
  2. Wait at least the failure interval (60 seconds at Medium sensitivity); HA also waits until the VM shows no disk or network I/O
  3. vSphere HA resets the VM (the guest restarts)

  Expected: Event "VM monitoring detected a failure"

Lab 8.3: Configure vSphere DRS (introduction)

Cluster → Configure → vSphere DRS → Edit

Enable DRS: ✓ ON
Automation Level: Fully Automated
Migration Threshold: Level 3 (Balanced)

Advanced Options:
  ✓ Predictive DRS (requires vROps integration)
  ✓ VM Distribution (minimizes the number of VMs affected if one host fails)

DRS Groups (VM and Host Groups):
  1. New VM Group:
     Name: VMG-Critical
     VMs: oracle-db-01, oracle-db-02

  2. New Host Group:
     Name: HG-HighMem
     Hosts: esxi-01 (512 GB RAM)

  3. VM-Host Affinity Rule:
     Name: Oracle-on-HighMem
     Type: Should run on hosts in group
     VM Group: VMG-Critical
     Host Group: HG-HighMem

DRS Anti-Affinity Rule:
  Name: WebServers-Spread
  Type: Separate VMs
  VMs: web-01, web-02, web-03
  → Guarantees the 3 web servers never share a host (needs at least 3 usable hosts)
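
What the "Separate VMs" rule enforces can be checked with a few lines of code. The sketch below is illustrative only and does not call DRS: `placement` is a hypothetical dictionary mapping VM names to host names, and the function simply reports hosts that hold more than one VM from the rule.

```python
from collections import defaultdict


def anti_affinity_violations(placement: dict, rule_vms: list) -> dict:
    """Return {host: [vms]} for hosts running 2+ VMs from the rule."""
    by_host = defaultdict(list)
    for vm in rule_vms:
        by_host[placement[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}


web = ["web-01", "web-02", "web-03"]
ok = {"web-01": "esxi-01", "web-02": "esxi-02", "web-03": "esxi-03"}
bad = {"web-01": "esxi-01", "web-02": "esxi-02", "web-03": "esxi-01"}

print(anti_affinity_violations(ok, web))   # {}
print(anti_affinity_violations(bad, web))  # {'esxi-01': ['web-01', 'web-03']}
```

With three VMs in the rule, any cluster that drops below three usable hosts will necessarily produce a violation, which is worth remembering when planning maintenance windows.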

Lab 8.4: Proactive HA

Integrates with hardware monitoring (HP iLO, Dell iDRAC) and VMware Aria Operations:
  → Aria Operations (vROps) provides predictive health monitoring
  → Hardware providers (iLO/iDRAC) report degradation events

Cluster → Configure → Proactive HA

Providers: (requires installing the HP/Dell plugin)
  HP iLO Provider: 192.168.1.100 (iLO address)

Automation Level: Automated
Remediation: Mixed mode
  Moderate degradation: Quarantine mode (DRS avoids placing VMs on the host)
  Severe degradation: Maintenance mode (DRS evacuates all VMs)

Benefit:
  → When iLO detects an impending PSU/fan/disk failure
  → vSphere proactively vMotions the VMs off the host before it crashes

ENTERPRISE APPLICATION: MODULE 7

Designing and operating vSphere HA in an enterprise environment, from admission control to APD/PDL response and HA event monitoring.

1. HA Admission Control Policy: Design for Production

Admission Control ensures the cluster always has enough capacity to restart VMs after a given number of hosts fail. With the wrong policy, VMs may fail to restart after a failover.

Policy	How it works	Use when	Drawback
Cluster Resource %	Reserves X% of CPU/RAM for failover	Recommended for production	The percentage takes more work to calculate
Dedicated Failover Hosts	1+ hosts held entirely in reserve	Tier 1 workloads, finance	The standby hosts' capacity sits idle
Slot Policy	Sizes slots from the largest VM reservation	Homogeneous clusters (same-size VMs)	One large VM skews the whole cluster
Disabled	No control	Dev/Test, non-critical	Failover can fail for lack of resources

### PowerCLI: Configure HA Admission Control (Cluster Resource %)
Connect-VIServer vcsa-01.lab.local

$cluster = Get-Cluster "CL-HN-Prod-01"

# Settings: tolerate 1 host failure, reserve 25% CPU + RAM
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
$spec.DasConfig.Enabled = $true
$spec.DasConfig.AdmissionControlEnabled = $true
$policy = New-Object VMware.Vim.ClusterFailoverResourcesAdmissionControlPolicy
$policy.AutoComputePercentages         = $false  # keep the manual percentages below
$policy.CpuFailoverResourcesPercent    = 25
$policy.MemoryFailoverResourcesPercent = 25
$policy.FailoverLevel                  = 1       # tolerate 1 host failure
$spec.DasConfig.AdmissionControlPolicy = $policy

# Cluster-wide VM defaults
$spec.DasConfig.DefaultVmSettings = New-Object VMware.Vim.ClusterDasVmSettings
$spec.DasConfig.DefaultVmSettings.RestartPriority = "high"
$spec.DasConfig.DefaultVmSettings.IsolationResponse = "none"   # "none" = leave powered on

# $true = modify: merge this spec into the existing cluster configuration
($cluster | Get-View).ReconfigureComputeResource_Task($spec, $true)

Calculating the Admission Control % for a 6-host cluster (N+1, N+2)

N+1 (1 host failure): reserve = 1/6 ≈ 17% CPU + RAM → set 20% for a buffer
N+2 (2 host failures): reserve = 2/6 ≈ 33% → set 35%
Always set the percentage a few points above the theoretical value (here, rounded up to the next multiple of 5) to leave a buffer
View it at: Cluster → Configure → vSphere Availability → Current Failover Capacity
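
The same arithmetic can be scripted for clusters of any size. This is a small sketch, assuming the buffer rule is "round strictly up to the next multiple of 5", which reproduces the 20% and 35% figures above; the function names and the `step` parameter are illustrative.

```python
from fractions import Fraction
from math import floor


def theoretical_percent(hosts: int, failures: int) -> Fraction:
    """Exact share of the cluster lost when `failures` equal hosts go down."""
    if not 0 < failures < hosts:
        raise ValueError("failures must be between 1 and hosts - 1")
    return Fraction(failures * 100, hosts)


def reserve_percent(hosts: int, failures: int, step: int = 5) -> int:
    """Round strictly up to the next multiple of `step` to leave a buffer."""
    exact = theoretical_percent(hosts, failures)
    return (floor(exact / step) + 1) * step


print(reserve_percent(6, 1))  # 20
print(reserve_percent(6, 2))  # 35
print(reserve_percent(4, 1))  # 30
```

Rounding strictly upward means an exact value such as 25% (4 hosts, N+1) still gets a buffer and becomes 30%. The calculation assumes equally sized hosts; with mixed hardware, size the reserve for the largest host instead.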

2. Heartbeat Datastore Configuration: Avoiding False-Positive Isolation

The datastore heartbeat is a backup mechanism that distinguishes "the host is cut off from the management network but still alive" from "the host is completely dead".

How it works

When the master receives no network heartbeat from a host for about 15 seconds → the master checks the datastore heartbeat
If the host is still writing heartbeats to the datastore → the host is isolated or partitioned (management network lost, storage still reachable)
VMware recommends 2 heartbeat datastores on 2 different storage paths
Pick datastores on SAN/NFS; do not use local storage for heartbeating
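
The selection guidelines above lend themselves to a simple pre-flight check. The sketch below is a hypothetical helper, not a vSphere API: the `Datastore` class, its fields, and the `array` label for the backing storage path are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Datastore:
    name: str
    kind: str                  # "SAN", "NFS", or "local"
    array: str                 # hypothetical label for the storage array/path
    shared_by_all_hosts: bool


def heartbeat_selection_problems(selected: list) -> list:
    """Return guideline violations for a proposed heartbeat datastore list."""
    problems = []
    if len(selected) < 2:
        problems.append("fewer than 2 heartbeat datastores")
    for ds in selected:
        if ds.kind == "local" or not ds.shared_by_all_hosts:
            problems.append(f"{ds.name}: not shared storage")
    if len(selected) >= 2 and len({ds.array for ds in selected}) < 2:
        problems.append("all selected datastores use the same storage path")
    return problems


good = [
    Datastore("VMFS-SAN-FC-LUN01", "SAN", "fc-array-a", True),
    Datastore("NFS-NetApp-Vol01", "NFS", "netapp-b", True),
]
print(heartbeat_selection_problems(good))  # []
```

Putting both heartbeat datastores behind the same array or fabric defeats the purpose: a single storage fault would then remove the only evidence that an unreachable host is still alive.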

### Configure Heartbeat Datastores (vSphere Client + PowerCLI)
# vSphere Client:
# Cluster → Configure → vSphere Availability → Heartbeat Datastores
# → Use datastores only from the specified list → Add
# → Select 2 datastores on 2 different storage controllers

### PowerCLI: Set the heartbeat datastores
$cluster = Get-Cluster "CL-HN-Prod-01"
$hbDS1   = Get-Datastore "VMFS-SAN-FC-LUN01"
$hbDS2   = Get-Datastore "NFS-NetApp-Vol01"

$clusterView = $cluster | Get-View
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
$spec.DasConfig.HBDatastoreCandidatePolicy = "userSelectedDs"
# HeartbeatDatastore expects managed object references, not the string Id
$spec.DasConfig.HeartbeatDatastore = @($hbDS1.ExtensionData.MoRef, $hbDS2.ExtensionData.MoRef)
$clusterView.ReconfigureComputeResource_Task($spec, $true)

### Verify heartbeat status
# Heartbeat datastore details are part of the HA advanced runtime info
($cluster | Get-View).RetrieveDasAdvancedRuntimeInfo().HeartbeatDatastoreInfo |
    Select-Object Datastore, Hosts

3. APD / PDL Response Policy: VM Component Protection (VMCP)

VMCP reacts when storage access is lost and distinguishes APD (All Paths Down, possibly transient) from PDL (Permanent Device Loss).

State	Meaning	Suggested action	When to use "Power off"
APD	Paths are lost but the device has not confirmed permanent loss (switch failure, HBA reset)	Issue events → restart VMs if the APD outlasts the timeout	Production: once the APD has not recovered after 140 seconds (default)
PDL	The storage device confirms it is permanently gone (SCSI sense code)	Power off and restart immediately	Always: PDL does not recover on its own
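
The table boils down to a short decision rule, sketched below. It is a simplified illustration of the policy described above, not VMCP's internal logic; the function name and return strings are invented, and the 140-second value is the default quoted in the table.

```python
def vmcp_action(state: str, seconds_in_state: float = 0,
                apd_timeout_s: float = 140) -> str:
    """Pick the response for a storage-loss condition (illustrative)."""
    if state == "PDL":
        # The device reported permanent loss: waiting cannot help.
        return "power off and restart on a healthy host"
    if state == "APD":
        if seconds_in_state < apd_timeout_s:
            return "issue events and wait"
        return "restart VMs on a healthy host"
    raise ValueError(f"unknown state: {state}")


print(vmcp_action("PDL"))        # power off and restart on a healthy host
print(vmcp_action("APD", 30))    # issue events and wait
print(vmcp_action("APD", 200))   # restart VMs on a healthy host
```

The asymmetry is deliberate: APD may clear on its own after a switch or HBA recovers, so acting too early causes unnecessary restarts, while PDL never clears without intervention.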

### PowerCLI: Configure VMCP (APD + PDL response)
$cluster = Get-Cluster "CL-HN-Prod-01"
$view    = $cluster | Get-View
$spec    = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo

# Enable VMCP on the cluster
$spec.DasConfig.VmComponentProtecting = "enabled"
$spec.DasConfig.DefaultVmSettings = New-Object VMware.Vim.ClusterDasVmSettings
$vmcp = New-Object VMware.Vim.ClusterVmComponentProtectionSettings

# PDL: power off the VMs as soon as PDL is detected, then restart them on a healthy host
$vmcp.VmStorageProtectionForPDL = "restartAggressive"

# APD: restart the VMs if the APD has not cleared once the delay below expires
$vmcp.VmStorageProtectionForAPD = "restartConservative"
$vmcp.VmTerminateDelayForAPDSec = 140   # seconds, counted after the host declares the APD timeout

$spec.DasConfig.DefaultVmSettings.VmComponentProtectionSettings = $vmcp
$view.ReconfigureComputeResource_Task($spec, $true)
Write-Host "VMCP configured: PDL=restartAggressive, APD=restartConservative(140s)"

4. Network Isolation Response: Avoiding Split-Brain

When a host loses its management network, vSphere HA must decide whether to shut down the VMs on the isolated host or leave them running. The wrong choice can leave a VM running on 2 hosts at once (split-brain).

Leave Powered On

The VM keeps running on the isolated host. Use when the VM has its own network path (iSCSI, NFS) that does not depend on the management vmk.

Risk: split-brain if HA also restarts the VM on another host.

Power Off (Recommended)

The VM is powered off on the isolated host → HA restarts it on a healthy host. The safest choice for workloads on shared storage.

Use for production clusters with shared SAN/vSAN.

Shutdown Guest OS

Sends a shutdown command to the OS through VMware Tools before powering off, giving the OS a graceful shutdown.

Requires VMware Tools installed and running.

Best Practice: Isolation Response

Configure 2 isolation addresses: by default only the default gateway is used, so add the management switch IP
Production SAN/vSAN clusters: use "Power Off" to eliminate the split-brain risk
Add an isolation address: Advanced Settings → das.isolationaddress0 = 10.100.10.254 (switch IP)
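
Why a second isolation address helps can be seen in a small model of the isolation check. This is a sketch under assumptions: a host is taken to declare itself isolated only when it sees no HA traffic from other hosts and none of its isolation addresses answers a ping; the function names and the response labels are illustrative, not HA's API.

```python
def is_isolated(sees_ha_traffic: bool, ping_ok: dict) -> bool:
    """Isolated = no HA traffic AND no isolation address reachable."""
    return not sees_ha_traffic and not any(ping_ok.values())


def isolation_action(isolated: bool, response: str = "powerOff") -> str:
    """Map the configured isolation response to what happens to the VMs."""
    if not isolated:
        return "no action"
    actions = {
        "none": "leave VMs powered on",
        "powerOff": "power off VMs so HA can restart them elsewhere",
        "shutdown": "shut down guest OS, then HA restarts the VMs elsewhere",
    }
    return actions[response]


# The gateway is down for maintenance, but the switch IP still answers:
pings = {"10.100.100.1": False, "10.100.10.254": True}
print(is_isolated(False, pings))                           # False
print(isolation_action(is_isolated(False, pings)))         # no action
```

With only the gateway configured, the same gateway outage would have made every host that lost HA traffic declare itself isolated, which is the false positive the second address is meant to prevent.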

5. HA Event Monitoring & Alerting Setup

Set up automatic alerts for the important HA events so the IT team is notified as soon as a host fails or HA restarts a VM.

### vSphere Client: Create Alarms for HA Events
# Cluster → Configure → Alarm Definitions → Add

## Alarm 1: Host isolated by HA
Name:       "HA Host Isolation Detected"
Target:     Host
Trigger:    Event — com.vmware.vc.ha.HostIsolatedEvent
Action:     Send Email → [email protected]
            Run Script → /scripts/notify-oncall.sh

## Alarm 2: VM restarted by HA
Name:       "HA VM Restart"
Target:     Virtual Machine
Trigger:    Event — com.vmware.vc.ha.VmRestartedByHAEvent
Action:     Send Email + SNMP Trap to the monitoring system (Zabbix/PRTG)

## Alarm 3: HA insufficient failover capacity
Name:       "HA Admission Control Violated"
Target:     Cluster
Trigger:    Metric — vSphere HA.vSphere HA host status = Insufficient capacity
Threshold:  = true (any violation)
Action:     Send Email + PagerDuty webhook (critical)

### PowerCLI: Retrieve the HA event history (audit log)
Connect-VIServer vcsa-01.lab.local

$start  = (Get-Date).AddDays(-7)
# Event type IDs to match; only EventEx/ExtendedEvent objects carry EventTypeId
$haEvents = @(
    "com.vmware.vc.ha.VmRestartedByHAEvent",
    "com.vmware.vc.ha.HostIsolatedEvent",
    "com.vmware.vc.ha.VmFailoverFailed"
)

# @() keeps .Count valid when zero or one event matches
$events = @(Get-VIEvent -Start $start -MaxSamples 1000 |
            Where-Object { $haEvents -contains $_.EventTypeId })

$events | Select-Object CreatedTime, FullFormattedMessage, ObjectName |
          Format-Table -AutoSize |
          Tee-Object -FilePath "ha-events-$(Get-Date -Format yyyyMMdd).txt"

Write-Host "Total HA events in the last 7 days: $($events.Count)"


Key Takeaways

Master Host election and its role in monitoring the whole cluster
Datastore heartbeat: the mechanism that tells a host failure from network isolation
Admission Control: the 3 policies and when to use each
FT limitations: max 8 vCPUs, no Storage vMotion, needs a 10 Gbps NIC
SRM RPO/RTO: 15-minute replication → RPO = 15 minutes, RTO = 15-30 minutes

Milestone: 50% Complete

You have completed 7 of 14 sessions. Half the course is behind you! Next up: DRS, Lifecycle Manager, and vSAN.
