Session 8: DRS & Resource Pools (VMware vSphere 8.0.3)

Session objectives

Understand the architecture and operation of vSphere DRS
Configure DRS automation levels, the migration threshold, and affinity/anti-affinity rules
Deploy Proactive HA integrated with hardware health monitoring
Understand vSphere Fault Tolerance: RPO = 0, near-zero RTO
Design SLA tiers that combine HA, DRS, and FT to match each workload

Theory

vSphere DRS: Distributed Resource Scheduler

DRS continuously monitors CPU and memory demand in the cluster and uses vMotion to move VMs automatically so that load stays balanced across hosts.

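DRS Cluster: Load Balancing

  ESXi-01 [CPU: 85%] ──vMotion──→ ESXi-02 [CPU: 35%]
  ESXi-02 [CPU: 35%]               ESXi-03 [CPU: 50%]
  ESXi-03 [CPU: 50%]

  Cluster DRS Score (0-100%): higher is better; it summarizes how well the VMs' resource demands are being met
  DRS runs every minute (every 5 minutes before vSphere 7.0)
  A vMotion happens only when its expected benefit clearly outweighs the migration cost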
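
To make the balancing idea concrete, here is a toy Python model of the diagram above. It is not the DRS algorithm, which also weighs per-VM demand, migration cost, and rules; it only shows how one move from the busiest host to the idlest one narrows the load spread. The function names, the `min_gain` cutoff, and the assumption that a VM accounts for a fixed number of CPU percentage points are hypothetical.

```python
from statistics import pstdev

def imbalance(loads):
    """Population standard deviation of host CPU utilization (percent)."""
    return pstdev(loads.values())

def move_improves(loads, vm_cost, min_gain=1.0):
    """Simulate moving one VM (vm_cost percentage points) from the busiest
    host to the least busy one. Return the new loads if the imbalance drops
    by at least min_gain, otherwise None (the move is not worth it)."""
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    after = dict(loads)
    after[src] -= vm_cost
    after[dst] += vm_cost
    if imbalance(loads) - imbalance(after) >= min_gain:
        return after
    return None

# Host loads taken from the diagram; the 20-point VM is an assumption.
cluster = {"ESXi-01": 85, "ESXi-02": 35, "ESXi-03": 50}
result = move_improves(cluster, vm_cost=20)
print(result)
```

A move that does not reduce the spread by at least `min_gain` is rejected, which mirrors the cost/benefit rule stated above in a very simplified form.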

DRS Automation Levels

Level	Behavior	When to use
Manual	Recommendations only; the admin decides	Sensitive environments that need tight control
Partially Automated	Automatic initial placement; load-balancing moves are only recommended	Hybrid: automatic at power-on, manual while running
Fully Automated	Initial placement and load-balancing vMotion are both automatic	Production (recommended)

DRS Rules: Affinity & Anti-Affinity

VM-VM Affinity Rule:
  → VM A and VM B are kept on the SAME host
  Example: app-server + local-db on one host to reduce latency

VM-VM Anti-Affinity Rule:
  → VM A and VM B are kept on DIFFERENT hosts
  Example: web-01, web-02, web-03 never share a host
  → If one host fails, only 1/3 of the web servers are lost

VM-Host Affinity Rule:
  → VM Group "VMG-Critical" SHOULD run on Host Group "HG-HighMem"
  Example: Oracle DB runs on hosts with 512 GB RAM
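
As a small illustration of what an anti-affinity rule enforces, the Python sketch below checks a placement map for violations. It is only a model for reasoning about the rule, not how DRS evaluates it; the function name and the sample placement are hypothetical.

```python
from collections import defaultdict

def anti_affinity_violations(placement, rule_vms):
    """Return {host: [vms]} for every host that runs more than one VM
    covered by the same anti-affinity rule."""
    by_host = defaultdict(list)
    for vm in rule_vms:
        by_host[placement[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}

# Hypothetical placement: web-02 and web-03 ended up on the same host.
placement = {"web-01": "esxi-01", "web-02": "esxi-02", "web-03": "esxi-02"}
violations = anti_affinity_violations(placement, ["web-01", "web-02", "web-03"])
print(violations)
```

An empty result means the rule is satisfied; any entry names a host that DRS would need to fix by moving one of the listed VMs.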

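DRS Migration Threshold:
  Level 1: Conservative (very few migrations; only mandatory moves such as rule fixes and maintenance mode)
  Level 3: Balanced (default, recommended)
  Level 5: Aggressive (many migrations; tighter balance at the cost of more vMotion overhead)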

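vSphere Fault Tolerance (FT)

FT provides continuous availability: a Secondary VM on a different host is kept in sync with the Primary VM through fast checkpointing (the successor to the legacy lock-step mechanism), so it can take over without data loss.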

FT Architecture:

  ESXi-01                    ESXi-02
  ┌──────────────┐           ┌──────────────┐
  │ Primary VM   │◄──FT Log──│ Secondary VM │
  │ (Active)     │──────────►│ (Standby)    │
  └──────────────┘           └──────────────┘
       │                          │
       └─── 10 Gbps FT Network ───┘

  If ESXi-01 fails:
  → The Secondary VM on ESXi-02 takes over immediately
  → RPO = 0 (zero data loss)
  → RTO ≈ 0 (no guest restart; clients see at most a brief pause)
  → A new Secondary is created on ESXi-03

FT limits & requirements

Requirement / Limit	Value	Notes
Max vCPUs	8 vCPUs	Not suitable for VMs with many vCPUs
Max vRAM	128 GB	Per-VM maximum
NIC requirement	Dedicated 10 Gbps	FT logging network
Disk format	Thin, lazy-zeroed, or eager-zeroed thick	Eager Zeroed Thick was mandatory only for legacy FT (vSphere 5.x)
Resource cost	Double	1 Primary + 1 Secondary
Storage vMotion	Not supported	Turn FT off before moving the VM's storage
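
The two numeric limits in the table are easy to check up front. The Python sketch below is a minimal pre-check that uses only the 8 vCPU and 128 GB figures from the table; the function and constant names are hypothetical, and a real readiness check would also cover networking, licensing, and host compatibility.

```python
FT_MAX_VCPUS = 8      # per-VM vCPU limit from the table
FT_MAX_RAM_GB = 128   # per-VM memory limit from the table

def ft_blockers(vcpus, ram_gb):
    """Return a list of reasons why a VM exceeds the FT sizing limits.
    An empty list means the VM passes this (partial) check."""
    problems = []
    if vcpus > FT_MAX_VCPUS:
        problems.append(f"{vcpus} vCPUs exceeds the limit of {FT_MAX_VCPUS}")
    if ram_gb > FT_MAX_RAM_GB:
        problems.append(f"{ram_gb} GB RAM exceeds the limit of {FT_MAX_RAM_GB} GB")
    return problems

print(ft_blockers(4, 64))    # small VM: no blockers
print(ft_blockers(16, 256))  # large VM: both limits exceeded
```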

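Proactive HA and Predictive DRS

Proactive HA integrates with health providers from hardware vendors (for example, plug-ins that surface HP iLO or Dell iDRAC sensor data) to detect hardware degradation before an outright failure and to vMotion VMs off the affected host automatically. Predictive DRS is a separate feature: it relies on workload forecasts from VMware Aria Operations (formerly vROps), so in vSphere 8.0 it requires the Aria Operations integration.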

Proactive HA Flow:

  HP iLO / Dell iDRAC
       │ PSU degradation detected
       ↓
  vCenter (Proactive HA Provider)
       │
       ↓
  DRS: vMotion VMs off the degraded host
       │
       ↓
  Host is placed in "Quarantine Mode" or "Maintenance Mode"
  (before the hardware actually fails)

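Remediation options:
  Quarantine mode for all failures → DRS avoids the host as long as this does not hurt performance or violate rules
  Mixed mode → Quarantine for moderate degradation, Maintenance mode for severe degradation
  Maintenance mode for all failures → all VMs are evacuated from the host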
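
The remediation choice can be read as a lookup from policy and severity to a host state. The Python sketch below assumes the three Proactive HA remediation options offered in the vSphere Client (quarantine for all failures, mixed, maintenance for all failures); the short policy keys and the function name are hypothetical labels, not vSphere identifiers.

```python
# policy -> severity -> resulting host state (labels are illustrative)
REMEDIATION_POLICIES = {
    "quarantine":  {"moderate": "quarantine",  "severe": "quarantine"},
    "mixed":       {"moderate": "quarantine",  "severe": "maintenance"},
    "maintenance": {"moderate": "maintenance", "severe": "maintenance"},
}

def remediation_action(policy, severity):
    """Return the host state Proactive HA would request for a health update."""
    return REMEDIATION_POLICIES[policy][severity]

print(remediation_action("mixed", "moderate"))
print(remediation_action("mixed", "severe"))
```

Mixed mode is a common middle ground: a degraded fan only steers new load away from the host, while a severe fault evacuates it completely.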

Hands-On Lab

Lab 8.3: Configure vSphere DRS

Cluster → Configure → vSphere DRS → Edit

Enable DRS: ✓ ON
Automation Level: Fully Automated
Migration Threshold: Level 3 (Balanced)

Additional options:
  ✓ Predictive DRS (requires Aria Operations / vROps integration)
  ✓ VM Distribution (spreads VMs more evenly so that fewer are affected if one host fails)

--- DRS Groups (VM and Host Groups) ---

1. New VM Group:
   Name: VMG-Critical
   VMs: oracle-db-01, oracle-db-02

2. New Host Group:
   Name: HG-HighMem
   Hosts: esxi-01 (512 GB RAM)

3. VM-Host Affinity Rule:
   Name: Oracle-on-HighMem
   Type: Should run on hosts in group
   VM Group: VMG-Critical
   Host Group: HG-HighMem

--- DRS Anti-Affinity Rule ---
Name: WebServers-Spread
Type: Separate VMs
VMs: web-01, web-02, web-03
→ Ensures the 3 web servers never share a host (requires at least 3 hosts in the cluster)

Lab 8.4: Proactive HA with Hardware Monitoring

Integrate with hardware monitoring (HP iLO, Dell iDRAC):

Cluster → Configure → vSphere Availability → Proactive HA → Edit

Providers: (requires the HP/Dell vendor plug-in to be installed)
  HP iLO Provider: 192.168.1.100 (iLO address)
  Credentials: admin / *****

Automation Level: Automated
Remediation: Mixed mode
  Moderate degradation: Quarantine mode
  Severe degradation: Maintenance mode

--- Verify Proactive HA ---
Monitor → vSphere Availability → Proactive HA Events
Review the event history for hardware degradation

Benefit:
  → When iLO detects an impending PSU/fan/disk failure
  → vSphere proactively vMotions VMs away before the host crashes
  → VMs avoid unplanned downtime for failures that give advance warning

Lab 8.5: Enable Fault Tolerance for a critical VM

Prerequisites before turning on FT:
  ✓ vSphere HA enabled on the cluster
  ✓ VM has ≤ 8 vCPU, ≤ 128 GB RAM
  ✓ Disk: thin or thick provisioned (Eager Zeroed Thick is no longer required)
  ✓ VMware Tools installed
  ✓ Dedicated 10 Gbps NIC for FT logging

Step 1: Create a VMkernel adapter for FT logging:
  Host → Configure → VMkernel Adapters → Add
  Port Group: DPG-FT-Logging (VLAN 50)
  IP: 192.168.50.11
  Services: ✓ Fault Tolerance logging

Step 2: Turn on FT for the VM:
  Right-click VM "finance-app-01" → Fault Tolerance → Turn On
  Select the datastore for the Secondary VM: DS-iSCSI-VMFS6-01
  Select the host for the Secondary: esxi-02.lab.local

Step 3: Verify:
  VM Summary → Fault Tolerance: Protected
  Monitor → Events: "FT secondary VM created on esxi-02"

Step 4: Test FT failover:
  Power off ESXi-01 (forcefully, lab only), or use Fault Tolerance → Test Failover on the VM for a non-destructive test
  → The Secondary on ESXi-02 immediately becomes the Primary
  → No downtime!
  Event: "FT failover occurred"

ENTERPRISE APPLICATIONS: MODULE 8

Designing DRS clusters and managing resources in an enterprise environment: from migration thresholds and affinity rules to resource pool hierarchy and capacity planning.

1. DRS Cluster Design: Threshold & Migration Sensitivity

The DRS migration threshold determines how much imbalance triggers an automatic vMotion. Too aggressive a setting causes constant vMotion and overhead; too conservative a setting leaves the cluster imbalanced.

Level	Recommendations applied	Behavior	Use for
1 (Conservative)	Priority 1 only (mandatory moves: rules, maintenance mode)	No load-balancing migrations	Latency-sensitive workloads (databases)
3 (Moderate, default)	Priority 1-3	Good balance with a moderate number of vMotions	Most production clusters
5 (Aggressive)	Priority 1-5 (even small improvements)	Very frequent vMotion	Dev/Test, where maximizing resource efficiency matters
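
One way to read the threshold is as a filter on recommendation priority: at level N, DRS acts on recommendations of priority 1 through N. The Python sketch below models only that simplified rule; the function names and the sample recommendations are hypothetical, and newer DRS versions score individual VMs rather than relying on this priority model alone.

```python
def applied_priorities(threshold_level):
    """Priorities acted on at a given threshold (1 = conservative, 5 = aggressive)."""
    if threshold_level not in range(1, 6):
        raise ValueError("threshold_level must be between 1 and 5")
    return list(range(1, threshold_level + 1))

def filter_recommendations(recommendations, threshold_level):
    """Keep only the recommendations the threshold allows DRS to apply."""
    allowed = set(applied_priorities(threshold_level))
    return [r for r in recommendations if r["priority"] in allowed]

# Hypothetical recommendations with priorities 1 (mandatory) to 5 (minor gain).
recs = [
    {"vm": "web-01", "priority": 1},
    {"vm": "app-02", "priority": 3},
    {"vm": "dev-07", "priority": 5},
]
applied = [r["vm"] for r in filter_recommendations(recs, 3)]
print(applied)
```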

### PowerCLI: Configure a production DRS cluster
Connect-VIServer vcsa-01.lab.local

# Fully Automated; the migration threshold stays at its default (level 3)
# because Set-Cluster has no parameter for it
Set-Cluster "CL-HN-Prod-01" `
    -DrsEnabled $true `
    -DrsAutomationLevel FullyAutomated `
    -Confirm:$false

# Per-VM DRS automation level (overrides the cluster setting)
# Critical database VM: Manual (DRS recommends moves but never performs them on its own)
$dbVM = Get-VM "WINDB-ERP-PRD-01"
Set-VM -VM $dbVM -DrsAutomationLevel Manual -Confirm:$false

# Alternative: the same override through the vSphere API (run one approach, not both)
$clusterSpec = New-Object VMware.Vim.ClusterConfigSpecEx
$vmOverride  = New-Object VMware.Vim.ClusterDrsVmConfigSpec
$vmOverride.Operation = "add"    # use "edit" if an override for this VM already exists
$vmOverride.Info = New-Object VMware.Vim.ClusterDrsVmConfigInfo
$vmOverride.Info.Key = $dbVM.ExtensionData.MoRef   # managed object reference, not the Id string
$vmOverride.Info.Enabled = $true
$vmOverride.Info.Behavior = "manual"
$clusterSpec.DrsVmConfigSpec = @($vmOverride)
(Get-Cluster "CL-HN-Prod-01" | Get-View).ReconfigureComputeResource_Task($clusterSpec, $true)

DRS per-VM Override: when to use each level

Manual: Oracle Database, SAP HANA; migrations need control, for example because of per-socket licensing
Partially Automated: DB servers; DRS recommends moves but does not perform them
Fully Automated: web tier, app servers; DRS balances without intervention
Disabled: VMs whose software license is tied to a host MAC address

2. Affinity / Anti-Affinity Rules: Enterprise Workloads

Rules ensure that critical VMs do not run on the same host (anti-affinity) or always run together (affinity), as the architecture requires.

Rule Type	Use Case	Real-world example	Mandatory?
VM Anti-Affinity	2 VMs must not share a host	DB Primary + DB Secondary (AlwaysOn AG)	Treat as required in production
VM Affinity	VMs should share a host	App tier + cache (Redis) to reduce latency	Preferred
VM-Host Affinity	VM runs only on specific hosts	VM needs a GPU host or a licensed host	Must or Should, depending on use case
VM-Host Anti-Affinity	VM must not run on specific hosts	Keep Production workloads off Dev hosts	Optional

### PowerCLI: Create an anti-affinity rule for SQL AlwaysOn AG
$cluster = Get-Cluster "CL-HN-Prod-01"

# Anti-affinity: SQL-Primary and SQL-Secondary must not share a host
$sqlPrimary   = Get-VM "WINSQL-ERP-PRD-01"
$sqlSecondary = Get-VM "WINSQL-ERP-PRD-02"

New-DrsRule -Cluster $cluster `
    -Name "AntiAffinity-SQL-AlwaysOn-ERP" `
    -KeepTogether $false `
    -VM $sqlPrimary, $sqlSecondary `
    -Enabled $true

# Affinity: Web + Cache run on the same host (latency optimization)
$webVM   = Get-VM "LNXWEB-PORTAL-PRD-01"
$cacheVM = Get-VM "LNXCACHE-REDIS-PRD-01"

New-DrsRule -Cluster $cluster `
    -Name "Affinity-WebApp-Redis-Portal" `
    -KeepTogether $true `
    -VM $webVM, $cacheVM `
    -Enabled $true

# New-DrsRule has no -Mandatory parameter; the must/should choice exists
# only for VM-Host rules (New-DrsVMHostRule -Type MustRunOn / ShouldRunOn)

# List all DRS rules
Get-DrsRule -Cluster $cluster | Select-Object Name, Type, Enabled, KeepTogether, VMIds

3. Resource Pool Hierarchy: Allocation by Department & Environment

Resource pools create a CPU/RAM "quota" for each department or environment, preventing one team from consuming the whole cluster. Default share values for a resource pool keep a 4:2:1 ratio: CPU shares are High = 8000, Normal = 4000, Low = 2000, and memory shares are High = 327680, Normal = 163840, Low = 81920.

### TOPOLOGY: Resource Pool Hierarchy (enterprise with 500+ VMs)

  Cluster: CL-HN-Prod-01  (Total: 200 GHz CPU, 1.5 TB RAM)
  │
  ├── RP-PROD-Tier1  (Shares: High, Reservation: 40 GHz CPU / 400 GB RAM)
  │   ├── RP-PROD-ERP      (SAP, Oracle — high priority, reservation)
  │   └── RP-PROD-Finance  (Fintech, banking apps)
  │
  ├── RP-PROD-Tier2  (Shares: Normal, Reservation: 20 GHz / 200 GB)
  │   ├── RP-PROD-CRM      (Salesforce connector, CRM apps)
  │   └── RP-PROD-Web      (Web tier, load balancers)
  │
  ├── RP-STAGING   (Shares: Normal, No reservation, Limit: 30 GHz / 200 GB)
  │   └── (pre-prod testing environment)
  │
  └── RP-DEV       (Shares: Low, No reservation, Limit: 20 GHz / 100 GB)
      ├── RP-DEV-Team-A
      └── RP-DEV-Team-B
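
Shares only matter under contention, when sibling pools compete for the same capacity. The Python sketch below splits the cluster's 200 GHz among the four top-level pools in proportion to their CPU shares. It deliberately ignores reservations and limits, so it shows the shares mechanism only (for example, RP-DEV's 20 GHz limit would cap the value computed here). The share values assume the resource pool defaults of High = 8000, Normal = 4000, Low = 2000; the function name is hypothetical.

```python
# Default CPU share values for resource pools (4:2:1 ratio).
RP_CPU_SHARES = {"High": 8000, "Normal": 4000, "Low": 2000}

def split_by_shares(capacity_ghz, pools):
    """Divide capacity among sibling pools in proportion to their shares.
    pools maps pool name -> share level; reservations/limits are ignored."""
    total = sum(RP_CPU_SHARES[level] for level in pools.values())
    return {
        name: round(capacity_ghz * RP_CPU_SHARES[level] / total, 1)
        for name, level in pools.items()
    }

pools = {
    "RP-PROD-Tier1": "High",
    "RP-PROD-Tier2": "Normal",
    "RP-STAGING": "Normal",
    "RP-DEV": "Low",
}
entitlement = split_by_shares(200, pools)
for name, ghz in entitlement.items():
    print(f"{name}: {ghz} GHz")
```

Under full contention the High pool receives twice what each Normal pool gets and four times the Low pool, regardless of how many VMs each pool contains.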

### PowerCLI: Create the resource pool hierarchy
$cluster = Get-Cluster "CL-HN-Prod-01"

# Tier 1 Production Pool
$rpProdT1 = New-ResourcePool -Location $cluster -Name "RP-PROD-Tier1" `
    -CpuSharesLevel High -MemSharesLevel High `
    -CpuReservationMHz 40000 -MemReservationMB 409600 `
    -CpuExpandableReservation $false -MemExpandableReservation $false

# ERP sub-pool under Tier1
New-ResourcePool -Location $rpProdT1 -Name "RP-PROD-ERP" `
    -CpuSharesLevel High -MemSharesLevel High `
    -CpuReservationMHz 20000 -MemReservationMB 204800

# Dev pool with a limit (ceiling) to keep it from consuming too much
$rpDev = New-ResourcePool -Location $cluster -Name "RP-DEV" `
    -CpuSharesLevel Low -MemSharesLevel Low `
    -CpuLimitMHz 20000 -MemLimitMB 102400   # Limit = hard cap

4. NUMA-Aware Scheduling: Optimizing Database & HPC Workloads

NUMA (Non-Uniform Memory Access) directly affects the performance of databases and in-memory workloads. ESXi has a NUMA-aware scheduler, but VMs must be sized correctly to benefit from it.

NUMA scheduling principles

A dual-socket server typically has 2 NUMA nodes; each socket has its own local RAM
VM smaller than 1 NUMA node → ESXi keeps the VM entirely within one node (NUMA local)
VM larger than 1 NUMA node → it spans 2 nodes, and remote memory access adds latency (on the order of 30%, depending on the platform)
Rule: a VM's vCPU count and vRAM should each fit within 1 NUMA node

### Check the NUMA topology of an ESXi host
# SSH to the ESXi host:
esxcli hardware memory get
# Reports physical memory and "NUMA Node Count"
# Example topology (dual-socket, 28 cores/socket, 512 GB RAM/socket):
# NUMA Node 0: Cores 0-27,  Memory 524288 MB
# NUMA Node 1: Cores 28-55, Memory 524288 MB

# Check VM NUMA placement (vCPU/vRAM should fit in 1 node)
# VM needs 24 vCPU + 384 GB RAM → fits 1 node (28 cores, 512 GB)
# VM needs 32 vCPU + 512 GB RAM → spans 2 nodes → consider resizing
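
The sizing rule can be written as a one-line check. The Python sketch below uses the example topology above (28 cores and 512 GB per node) as default arguments; the function name is hypothetical, and the check ignores the memory that ESXi itself consumes on each node, so real headroom is somewhat smaller.

```python
def fits_one_numa_node(vcpus, ram_gb, cores_per_node=28, ram_gb_per_node=512):
    """True if both the vCPU count and the vRAM fit inside one NUMA node."""
    return vcpus <= cores_per_node and ram_gb <= ram_gb_per_node

print(fits_one_numa_node(24, 384))  # the 24 vCPU / 384 GB example
print(fits_one_numa_node(32, 512))  # the 32 vCPU / 512 GB example
```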

### PowerCLI: Set the vNUMA topology for a database VM
$vm = Get-VM "WINDB-SAP-HANA-PRD-01"
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
# vSphere 8.0 API: the writable property is VirtualNuma (NumaInfo is read-only config info)
$spec.VirtualNuma = New-Object VMware.Vim.VirtualMachineVirtualNuma
$spec.VirtualNuma.CoresPerNumaNode = 28  # = 1 physical NUMA node
($vm | Get-View).ReconfigVM_Task($spec)  # apply while the VM is powered off

# Monitor NUMA stats in vROps:
# VM → Metrics → Guest Memory → NUMA Remote memory (%)
# Alert if NUMA Remote > 20%: a sign that the VM spans 2 NUMA nodes

5. DRS Automation Level per VM Criticality

Not every VM needs the same degree of DRS automation; classify VMs by SLA tier to balance automation against control.

VM Category	DRS Level	Reason	Examples
Oracle / SAP DB	Manual	Per-socket licensing, NUMA sensitivity	WINDB-ORACLE-, WINDB-SAP-
SQL Server AG	Partially Automated	DRS suggests migrations, the DBA approves	WINSQL-ERP-PRD-, WINSQL-CRM-
App / Web Tier	Fully Automated	Stateless, vMotion-safe, needs balancing	LNXWEB-, LNXAPP-
Dev / Test	Fully Automated	Non-critical; maximize efficiency	-DEV-, -TST-
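
If VM names follow the prefixes in the table, the classification can be automated. The Python sketch below maps a VM name to a DRS level by substring match, first match wins. The name tokens come from the table; the function name, the rule ordering, and the "AsSpecifiedByCluster" fallback label are assumptions for illustration.

```python
# (name token, DRS level) pairs, checked in order; tokens come from the table.
DRS_LEVEL_RULES = [
    ("WINDB-ORACLE-", "Manual"),
    ("WINDB-SAP-", "Manual"),
    ("WINSQL-ERP-PRD-", "PartiallyAutomated"),
    ("WINSQL-CRM-", "PartiallyAutomated"),
    ("LNXWEB-", "FullyAutomated"),
    ("LNXAPP-", "FullyAutomated"),
    ("-DEV-", "FullyAutomated"),
    ("-TST-", "FullyAutomated"),
]

def drs_level_for(vm_name, default="AsSpecifiedByCluster"):
    """Return the DRS level for a VM name; fall back to the cluster setting."""
    for token, level in DRS_LEVEL_RULES:
        if token in vm_name:
            return level
    return default

for name in ["WINDB-ORACLE-PRD-01", "WINSQL-ERP-PRD-01", "LNXWEB-PORTAL-PRD-01"]:
    print(name, "->", drs_level_for(name))
```

The resulting level could then be applied per VM, for example with PowerCLI's `Set-VM -DrsAutomationLevel`.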

6. Capacity Planning with DRS Stats & vROps

DRS migration history and cluster utilization statistics are key inputs for planning new host purchases at the right time.

### PowerCLI: Cluster capacity report (weekly)
Connect-VIServer vcsa-01.lab.local

# A foreach statement cannot be piped directly, so collect its output first
$report = foreach ($cluster in Get-Cluster) {
    $hosts = @($cluster | Get-VMHost)
    if ($hosts.Count -lt 2) { continue }   # N+1 needs at least 2 hosts

    $totalCPU   = ($hosts | Measure-Object -Property CpuTotalMhz -Sum).Sum
    $usedCPU    = ($hosts | Measure-Object -Property CpuUsageMhz -Sum).Sum
    $totalMemGB = ($hosts | Measure-Object -Property MemoryTotalGB -Sum).Sum
    $usedMemGB  = ($hosts | Measure-Object -Property MemoryUsageGB -Sum).Sum

    $cpuPct = [math]::Round(($usedCPU / $totalCPU) * 100, 1)
    $memPct = [math]::Round(($usedMemGB / $totalMemGB) * 100, 1)

    # N+1 capacity: exclude the largest host
    $n1CPU = $totalCPU - ($hosts | Sort-Object CpuTotalMhz -Descending | Select-Object -First 1).CpuTotalMhz
    $cpuN1Pct = [math]::Round(($usedCPU / $n1CPU) * 100, 1)

    [PSCustomObject]@{
        Cluster    = $cluster.Name
        Hosts      = $hosts.Count
        "CPU%"     = "$cpuPct%"
        "CPU N+1%" = "$cpuN1Pct%"
        "MEM%"     = "$memPct%"
        VMs        = @($cluster | Get-VM).Count
        Status     = if ($cpuN1Pct -gt 85 -or $memPct -gt 85) { "WARNING: add hosts" } else { "OK" }
    }
}
$report | Format-Table -AutoSize
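
The N+1 figure in the report is plain arithmetic: used CPU divided by the capacity that remains after the largest host is removed. The Python sketch below reproduces that calculation without a vCenter; the function name and the four-host sample figures are hypothetical.

```python
def n_plus_1_cpu_pct(host_total_mhz, used_mhz):
    """CPU utilization (percent) measured against cluster capacity
    with the largest host removed, i.e. the N+1 view."""
    if len(host_total_mhz) < 2:
        raise ValueError("N+1 needs at least two hosts")
    surviving = sum(host_total_mhz) - max(host_total_mhz)
    return round(used_mhz / surviving * 100, 1)

# Hypothetical cluster: 4 hosts x 50,000 MHz, 120,000 MHz in use.
hosts = [50000, 50000, 50000, 50000]
plain_pct = round(120000 / sum(hosts) * 100, 1)
n1_pct = n_plus_1_cpu_pct(hosts, 120000)
print(f"CPU: {plain_pct}%  CPU N+1: {n1_pct}%")
```

A cluster that looks comfortable at 60% is already at 80% once one host is lost, which is why the report tracks the N+1 column separately.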

Capacity warning thresholds

CPU utilization (N+1) >70% → start planning a host purchase
Memory utilization regularly >80% → add a host or right-size VMs
DRS migration count >20/day → cluster is likely overloaded
Average VM CPU Ready >5% → more CPU capacity is needed
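
These thresholds can be evaluated together in one pass. The Python sketch below takes the four metrics as plain numbers and returns the warnings that apply; the function name, parameter names, and message wording are hypothetical, while the cutoff values are the ones listed above.

```python
def capacity_warnings(cpu_n1_pct, mem_pct, drs_moves_per_day, cpu_ready_pct):
    """Return the list of capacity warnings triggered by the given metrics."""
    checks = [
        (cpu_n1_pct > 70, "CPU (N+1) above 70%: plan a host purchase"),
        (mem_pct > 80, "Memory above 80%: add a host or right-size VMs"),
        (drs_moves_per_day > 20, "More than 20 DRS migrations/day: cluster likely overloaded"),
        (cpu_ready_pct > 5, "CPU Ready above 5%: add CPU capacity"),
    ]
    return [message for triggered, message in checks if triggered]

for warning in capacity_warnings(cpu_n1_pct=75, mem_pct=60, drs_moves_per_day=25, cpu_ready_pct=2):
    print(warning)
```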

Right-Sizing Workflow

Collect 90 days of CPU/RAM usage data from vROps
Identify oversized VMs (avg CPU <10%, avg RAM <30%)
Propose the resize to the app owner and obtain approval
Resize during a maintenance window and verify after 2 weeks
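
The identification step of the workflow is a simple filter over the collected averages. The Python sketch below assumes the usage data has already been exported as a list of records and treats a VM as oversized only when both averages are below their cutoffs; the record fields, VM names, and function name are hypothetical.

```python
def oversized_vms(vms, cpu_max_pct=10.0, ram_max_pct=30.0):
    """Return names of VMs whose 90-day average CPU and RAM usage are
    both below the given cutoffs (candidates for downsizing)."""
    return [
        vm["name"]
        for vm in vms
        if vm["avg_cpu_pct"] < cpu_max_pct and vm["avg_ram_pct"] < ram_max_pct
    ]

# Hypothetical 90-day averages.
usage = [
    {"name": "app-01", "avg_cpu_pct": 4.0,  "avg_ram_pct": 22.0},
    {"name": "db-01",  "avg_cpu_pct": 35.0, "avg_ram_pct": 70.0},
    {"name": "web-03", "avg_cpu_pct": 8.0,  "avg_ram_pct": 45.0},
]
candidates = oversized_vms(usage)
print(candidates)
```

Averages hide peaks, so the resulting list is a starting point for the conversation with the app owner rather than a resize order.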


Key Concepts

▸DRS Score: 0-100%, higher is better. Measures how well VM resource demands are met in the cluster.
▸Anti-Affinity Rule: Keeps critical VMs from running on the same host.
▸FT synchronization: Primary & Secondary are kept continuously in sync (fast checkpointing), so no data is lost.
▸Proactive HA: Detects hardware degradation BEFORE a failure.
▸EVC Mode: Ensures CPU compatibility for vMotion in clusters with mixed CPU generations.
▸RPO/RTO: FT achieves RPO = 0 and near-zero RTO. HA restarts VMs on another host, so its RTO is failure detection plus guest boot time (typically minutes).
