Session 14: Monitoring, Alarms & Performance Tuning (VMware vSphere 8.0.3)

Learning objectives

Master the key metrics: CPU Ready, Memory Balloon, Disk Latency
Configure vSphere Alarms for Datastore, CPU, Host, and vSAN
Analyze a slow Oracle DB performance incident end to end
Optimize NUMA-aware sizing for large VMs
Build a 3-tier monitoring stack and a weekly KPI dashboard

Theory

vSphere Native Monitoring — Key Metrics

=== CPU Metrics ===
Usage (%)      → Share of CPU actually in use
Ready (ms)     → Time the VM waits to be scheduled on a pCPU
               → Target: <5% per vCPU (50 ms per second, i.e. 1000 ms per 20 s real-time sample)
               → >10%: clear performance impact
Co-stop (ms)   → SMP VM is held back because its vCPUs cannot be co-scheduled
               → Target: <3% (SMP VMs)
Swap Wait      → VM waits for swapped memory to be read back; CPU is stalled

=== Memory Metrics ===
Active (MB)    → Memory the VM is actively touching (hot pages)
Granted (MB)   → Memory the VMkernel has mapped to the VM
Balloon (MB)   → Idle guest memory reclaimed by the VMkernel via the balloon driver
               → Target: 0 when the host has enough RAM
               → >0: host is (or recently was) under memory pressure
Swap In/Out    → VM memory paged to disk by the host: very bad
               → Target: strictly 0
Consumed       → Host physical memory backing the VM

=== Disk Metrics ===
Read/Write IOPS → I/O operations per second
Latency (ms)    → Target: <10 ms (SAS/HDD), <2 ms (SSD/NVMe)
                → >20 ms: application timeout risk
Queue Depth     → I/O congestion indicator (>32: bottleneck)

=== Network Metrics ===
Received/Transmitted (KBps) → Throughput
Packets Dropped              → Network overload, buffer overflow
                              → Target: 0
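
The CPU Ready counter is reported as milliseconds accumulated over a sampling interval, while the targets above are percentages. The following is a minimal sketch of that conversion, assuming the 20-second real-time interval used by vCenter performance charts; the function name `cpu_ready_percent` is illustrative and not part of any VMware tool.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a per-vCPU percentage.

    ready_ms   -- ready time accumulated during one sample, in milliseconds
    interval_s -- sample length in seconds (assumed 20 s for real-time charts)
    vcpus      -- number of vCPUs the summation covers
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


# 1000 ms of ready time in one 20 s sample on a single vCPU
print(cpu_ready_percent(1000))
# 4000 ms summed across a 4-vCPU VM in the same interval
print(cpu_ready_percent(4000, vcpus=4))
```

For historical charts with longer roll-up intervals, pass the matching `interval_s` value instead of the default.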

vSphere Alarms — Types & Actions

Alarm Types:
  Stateful:  Green → Yellow (Warning) → Red (Critical) → Green
             Tracks the lifecycle, so you know when the issue resolves
  Stateless: Trigger → Action (one-shot event)
             No state; fires only when the event occurs

Action Types:
  Send email notification   → [email protected]
  Send SNMP trap            → NMS (Nagios, Zabbix, PRTG)
  Run script                → PagerDuty webhook, auto-remediation
  Log to syslog             → SIEM correlation

Alarm Scope:
  vCenter → Applied globally
  Datacenter → Applied to all objects in DC
  Cluster → Applied to cluster and hosts/VMs within
  Host / VM / Datastore → Per-object alarms
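
To make the stateful lifecycle concrete, here is a small sketch of a threshold alarm that remembers its current color and reports only transitions. It is a simplified model for illustration, not vCenter's implementation; the class name and the 75/85 thresholds are assumptions chosen for the example.

```python
from typing import Optional, Tuple


def alarm_color(value: float, warning: float, critical: float) -> str:
    """Map a metric value to green / yellow / red."""
    if value >= critical:
        return "red"
    if value >= warning:
        return "yellow"
    return "green"


class StatefulAlarm:
    """Keeps the last state so actions fire on transitions, including recovery."""

    def __init__(self, warning: float, critical: float) -> None:
        self.warning = warning
        self.critical = critical
        self.state = "green"

    def update(self, value: float) -> Optional[Tuple[str, str]]:
        """Return (old, new) when the state changes, otherwise None."""
        new_state = alarm_color(value, self.warning, self.critical)
        if new_state == self.state:
            return None
        transition = (self.state, new_state)
        self.state = new_state
        return transition


alarm = StatefulAlarm(warning=75, critical=85)
for sample in [70, 78, 90, 90, 60]:
    print(sample, alarm.update(sample))
```

A stateless alarm, by contrast, would have no `state` attribute and would simply run its action every time the triggering event arrives.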

VMware Aria Operations (formerly vROps, v8.17) — Capabilities

Architecture:
  Analytics Node (primary)   → Processing + UI
  + Remote Collector          → Collects data from remote sites
  + Data Node                 → Scale-out storage

Key Capabilities:
  1. Predictive Analytics (ML-based)
     → Capacity forecasting 6-12 months ahead
     → "Days until CPU/RAM/Storage exhaustion"
     → Alert when <60 days remain

  2. Right-sizing Recommendations
     → Detects "oversized" VMs (allocation >> actual use over 90+ days)
     → Savings: reduce vCPU/vRAM → lower license cost
     → "Reclaim" potential: $X/year savings

  3. Compliance Dashboards
     → CIS Benchmark compliance score
     → PCI-DSS, HIPAA controls status
     → Automated remediation actions

  4. Workload Optimization
     → Recommends VM placement to reduce hotspots
     → Integrates with DRS recommendations

Performance Tuning — Key Issues

Issue	Symptom	Fix
High CPU Ready	%RDY >5%, slow VM, high latency	Reduce vCPU count, vMotion to a less-loaded host, or add a host
Memory Balloon	Swap activity, application suddenly slow	Add host RAM, reduce VM RAM allocation, vMotion
High Disk Latency	App timeouts, slow DB queries	Move the VM to a faster datastore (SSD), check snapshot overhead
Network Drop	Packet loss, connection drops	Check NIC teaming, add bandwidth, enable NIOC
NUMA Imbalance	Uneven CPU/RAM, high latency	Size vCPU ≤ pCPUs per NUMA node, enable vNUMA
High Co-Stop	SMP VM slow, does not scale linearly	Reduce vCPU count (fewer vCPUs are easier to schedule)

NUMA — Non-Uniform Memory Access

Host with 2 sockets (2 NUMA nodes):

  NUMA Node 0              NUMA Node 1
  ┌──────────────┐         ┌──────────────┐
  │ Socket 0     │         │ Socket 1     │
  │ 16 cores     │◄─QPI───►│ 16 cores     │
  │ 256 GB RAM   │         │ 256 GB RAM   │
  └──────────────┘         └──────────────┘

  Local access:  CPU0 → RAM Node 0 = <80ns latency
  Remote access: CPU0 → RAM Node 1 = ~120-150ns latency (via QPI)

Best Practice:
  VM 16 vCPU → fits in 1 NUMA node → all local memory → fast
  VM 32 vCPU → spans 2 NUMA nodes → remote memory access → slower

  Rule: vCPU count ≤ physical cores per NUMA node
  Example: 2-socket × 16-core host → max 16 vCPU per VM (optimal)

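vNUMA (Virtual NUMA, for large VMs with more than 8 vCPUs):
  ESXi exposes a virtual NUMA topology to the guest OS
  Guest scheduler becomes NUMA-aware → better performance
  Enabled by default for VMs with more than 8 vCPUs; the guest sees multiple
  virtual nodes once the vCPU count exceeds the cores of one physical NUMA node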
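
The sizing rule can be checked with simple arithmetic. The sketch below estimates how many physical NUMA nodes a VM needs from its vCPU count and memory size; it is a deliberately simplified model (it ignores hyperthreading, host overhead, and the ESXi NUMA scheduler's actual placement logic), and the function name is hypothetical.

```python
import math


def numa_nodes_spanned(vcpus: int, vm_mem_gb: float,
                       cores_per_node: int, mem_per_node_gb: float) -> int:
    """Estimate the minimum number of NUMA nodes a VM must span.

    A VM fits in one node only if both its vCPUs and its memory fit there.
    """
    if min(vcpus, cores_per_node) < 1 or min(vm_mem_gb, mem_per_node_gb) <= 0:
        raise ValueError("all sizes must be positive")
    nodes_for_cpu = math.ceil(vcpus / cores_per_node)
    nodes_for_mem = math.ceil(vm_mem_gb / mem_per_node_gb)
    return max(nodes_for_cpu, nodes_for_mem)


# Host from the diagram above: 2 nodes, each with 16 cores and 256 GB
print(numa_nodes_spanned(16, 128, 16, 256))  # fits in one node
print(numa_nodes_spanned(32, 128, 16, 256))  # vCPUs force a second node
print(numa_nodes_spanned(8, 384, 16, 256))   # memory forces a second node
```

The third call shows a case the vCPU rule alone misses: a VM with few vCPUs can still incur remote memory access when its RAM exceeds what one node holds.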

Hands-on Labs

Lab 14.1: Configuring vSphere Alarms

vSphere Client → vCenter → Configure → Alarm Definitions

--- Alarm 1: Datastore Usage Warning ---
Name: Datastore Usage Warning
Object Type: Datastore
Trigger: Metric Threshold
  Metric: Disk → Capacity Usage %
  Warning: 75
  Critical: 85
Action (Warning):
  Send email: [email protected]
Action (Critical):
  Send email + SNMP trap to NMS

--- Alarm 2: VM CPU Ready Time High ---
Name: VM CPU Ready Time High
Object Type: Virtual Machine
Trigger:
  Metric: CPU → Ready
  Warning: >2000 ms per 20 s real-time sample (~10% ready time per vCPU)
  Critical: >4000 ms per 20 s real-time sample (~20% ready time per vCPU)
Action: Email + create ServiceNow ticket (script)

--- Alarm 3: ESXi Host Not Responding ---
Name: ESXi Host Not Responding
Object Type: Host
Trigger: Condition = Host Connection State = Not Responding
Action: Email + PagerDuty webhook (script)

--- Alarm 4: vSAN Disk Failure ---
Object: Cluster
Trigger: vSAN → Degraded Disk
Action: Critical email + SMS alert
Repeat: every 30 minutes until acknowledged

Lab 14.2: Performance Analysis, "Oracle DB is slow"

Scenario: oracle-db-01 reports unusually slow queries

Step 1: Check CPU Ready
  VM → Monitor → Performance → Advanced
  Chart: CPU → CPU Ready
  Interval: Last 1 hour, Last 1 day

  Thresholds:
  <5%   → OK
  5-10% → Warning, minor impact
  >10%  → Critical, act immediately

  If CPU Ready >5% → check host CPU utilization:
  Host → Monitor → Performance → CPU
  If host >80% → vMotion the VM to a less-loaded host

Step 2: Check Storage Latency
  VM → Monitor → Performance → Disk
  Metric: Disk Latency (ms)
  Target: <10 ms (HDD), <2 ms (SSD)

  If high → check:
  a. Any snapshots? → VM → Snapshots → Manage Snapshots → delete
  b. VMFS contention? (many VMs on the same datastore)
     → Storage vMotion the VM to a dedicated datastore
  c. Storage queue depth?
     Host → Performance → Disk → Queue Depth
     >32 → storage bottleneck

Step 3: Check Memory
  VM → Monitor → Performance → Memory
  Balloon > 0 → RAM pressure on the host
  Swap > 0   → CRITICAL, add RAM immediately

Step 4: esxtop in real time (SSH to the ESXi host)
  esxtop
  → c (CPU view)
  Find the oracle-db-01 entry (vmx)
  %RDY column → CPU Ready for this VM (summed over all of its worlds; divide by the vCPU count)
  %CSTP column → Co-stop (SMP VMs)

  Batch capture for deeper analysis:
  esxtop -b -d 5 -n 60 > /tmp/esxtop-oracle-$(date +%Y%m%d-%H%M).csv
  # 5 s/sample × 60 samples = 5 minutes of data

Step 5: Conclusion & remediation
  High CPU Ready → vMotion to esxi-03 (CPU 45%)
  High disk latency → delete the 3-day-old snapshot (2 GB delta)
  → Verify after 15 minutes: performance back to normal
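
The batch capture from Step 4 can be summarized offline. The sketch below pulls the "% Ready" column for one VM out of an esxtop CSV and reports its average and peak. It assumes the column headers follow the pattern `\\<host>\Group Cpu(<id>:<vm name>)\% Ready`; verify that against a real capture before relying on it. The sample data is synthetic.

```python
import csv
import io


def ready_stats(csv_text: str, vm_name: str):
    """Return (average, maximum) of the VM group's "% Ready" column, or None."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    wanted = f":{vm_name})"
    # Find the first header that belongs to the VM's CPU group and ends in "% Ready"
    col = next(
        (i for i, name in enumerate(header)
         if "Group Cpu(" in name and wanted in name and name.endswith("% Ready")),
        None,
    )
    if col is None:
        return None
    values = [float(row[col]) for row in rows if len(row) > col and row[col]]
    if not values:
        return None
    return sum(values) / len(values), max(values)


# Synthetic three-sample capture in the assumed esxtop batch layout
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esxi-01\Group Cpu(1001:oracle-db-01)\% Ready","\\esxi-01\Group Cpu(1002:web-01)\% Ready"',
    '"01/01/2024 10:00:00","4.0","1.0"',
    '"01/01/2024 10:00:05","12.0","2.0"',
    '"01/01/2024 10:00:10","8.0","1.5"',
])

print(ready_stats(SAMPLE, "oracle-db-01"))
```

Because the group-level value is summed over the VM's worlds, divide the result by the vCPU count before comparing it with the 5% and 10% thresholds.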

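Lab 14.3: SNMP Configuration on ESXi

SSH to the ESXi host:

# Configure the SNMP community and trap target
esxcli system snmp set --communities public
esxcli system snmp set --targets 10.100.100.50@162/public
esxcli system snmp set --enable true

# Verify the configuration
esxcli system snmp get

# Send a test trap
esxcli system snmp test
# → The NMS (Zabbix/Nagios) should receive the trap

# SNMP v3 (recommended for production):
# 1. Choose the authentication and privacy protocols
esxcli system snmp set --authentication SHA1 --privacy AES128
# 2. Generate hashes for the secrets (authpass / privpass are placeholders)
esxcli system snmp hash --auth-hash authpass --priv-hash privpass --raw-secret
# 3. Create the user from the hashes printed in step 2
esxcli system snmp set --users authuser/<auth-hash>/<priv-hash>/priv
# 4. Add the v3 trap target: host@port/user/security-level/message-type
esxcli system snmp set --v3targets 10.100.100.50@162/authuser/priv/trap

# PowerCLI: enable SNMP in bulk on all hosts: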
Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $esxcli.system.snmp.set.Invoke(@{
        enable = $true
        communities = "public"
        targets = "10.100.100.50@162/public"
    })
}

Lab 14.4: NUMA-Aware VM Sizing & Tuning

View the host's NUMA topology (SSH):
  esxcli hardware memory get
  # Example output (excerpt):
  #    NUMA Node Count: 2
  # Per-node memory: esxtop → m (memory view), "NUMA /MB" line

Rule: vCPU count ≤ pCPUs per NUMA node
  Host: 2 sockets × 16 cores = 32 pCPUs total
  NUMA node size: 16 cores
  Optimal VM: ≤ 16 vCPUs (fits in 1 NUMA node)
  32-vCPU VM: spans 2 NUMA nodes → remote memory access

Pin a VM to a NUMA node (advanced):
  VM Edit Settings → VM Options → Advanced → Edit Configuration Parameters
  numa.nodeAffinity = 0  (pins the VM to Node 0; the NUMA scheduler can no longer rebalance it)

Verify vNUMA in a Windows guest:
  Task Manager → Performance → CPU → right-click the graph → Change graph to → NUMA nodes
  If two NUMA nodes are shown → vNUMA is active

Verify vNUMA in a Linux guest:
  numactl --hardware
  # available: 2 nodes (0-1)
  # node 0 cpus: 0 1 2 ... 15
  # node 1 cpus: 16 17 ... 31

Sizing recommendations for large VMs:
  32-vCPU DB server → split into 2×16 vCPU, or do not set a limit
  16-vCPU VM on a 2×16-core host → 1 NUMA node → optimal

ENTERPRISE APPLICATIONS: MODULE 14

Hands-on VMware Aria Operations (formerly vRealize Operations; v8.17 with vSphere 8.0.3) deployment and automation: sizing, custom dashboards, alert definitions, PowerCLI scripts, the Terraform vSphere provider, and Ansible VMware modules.

1. VMware Aria Operations Deployment Sizing: Small / Medium / Large

Pick the right deployment size from the start: an undersized Aria Operations instance makes dashboards slow and drops metrics, while an oversized one wastes resources. (Nodes = monitored ESXi hosts + VMs)

Size	vCPU	RAM	Disk	Objects	Metrics/min	Use case
Small	4	16 GB	250 GB	≤100 nodes	<750K	Lab, small environments (<100 hosts)
Medium	8	32 GB	1 TB	≤300 nodes	<2.5M	Mid-size DC
Large	16	64 GB	2 TB	≤1,500 nodes	<7.5M	Enterprise
XLarge	24+	128 GB	4 TB	>1,500 nodes	>7.5M	Large enterprise, multi-vCenter
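
As a quick illustration of how the table is read, the sketch below picks the smallest size whose node and metric limits both cover the environment. The limits are copied from the table above rather than from an official sizing tool, and `pick_size` is a hypothetical helper.

```python
# (size name, max monitored nodes, max metrics per minute), taken from the table above
SIZES = [
    ("Small", 100, 750_000),
    ("Medium", 300, 2_500_000),
    ("Large", 1_500, 7_500_000),
]


def pick_size(nodes: int, metrics_per_min: int) -> str:
    """Return the smallest deployment size that satisfies both limits."""
    for name, max_nodes, max_metrics in SIZES:
        if nodes <= max_nodes and metrics_per_min < max_metrics:
            return name
    return "XLarge"


print(pick_size(80, 500_000))
print(pick_size(250, 3_000_000))  # node count fits Medium, metric rate does not
print(pick_size(2_000, 1_000_000))
```

Whichever limit is hit first decides the size, which is why the second example lands on Large even though its node count would fit Medium.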

vROps Deployment Best Practices

Deploy on an all-flash datastore: vROps writes metrics continuously, so disk latency affects it directly
Use a dedicated service account with the Read-Only role in vCenter, not an admin account
Enable "Global Settings → Remote Collector" when monitoring multiple vCenters at different sites
Schedule a "maintenance window" in vROps for planned downtime to avoid false alarms

2. Custom Dashboard cho Capacity Management

A capacity management dashboard forecasts when more resources must be purchased, so production does not run out of capacity unexpectedly.

### Create a Capacity Management Dashboard in vROps

# 1. Home → Dashboards → Create Dashboard
# Name: "Capacity Management — Executive View"
# Layout: 3 columns

# WIDGET 1: Cluster Remaining Capacity (Scoreboard)
# Metrics:
#   - Cluster|CPU|Capacity Remaining (%)
#   - Cluster|Memory|Capacity Remaining (%)
#   - Datastore|Disk Space|Capacity Remaining (%)
# Color: Green >40%, Yellow 20-40%, Red <20%

# WIDGET 2: Time to Exhaustion (Metric Chart trend)
# Metrics:
#   - Cluster|CPU|Time Remaining (days)
#   - Cluster|Memory|Time Remaining (days)
# Threshold line: 60 days (procurement lead time)

# WIDGET 3: Top 10 Oversized VMs (Table)
# From: vROps Reclaim → Oversized VMs
# Columns: VM Name, Current vCPU, Recommended vCPU, Wasted CPU (GHz)
# Sort by: Wasted CPU descending
# → Direct link to "Right-size" action

# WIDGET 4: Capacity Forecast (12 months)
# vROps built-in: Capacity → Potential Headroom
# Shows: current + projected usage with confidence bands

KPI	Target	Warning	Action on Warning
CPU Time to Exhaustion	>90 days	<60 days	Start the procurement process
Memory Time to Exhaustion	>90 days	<60 days	Right-size oversized VMs first
Storage Remaining Capacity	>30%	<20%	Clean up snapshots, archive data
vROps Reclaim Savings	0 oversized VMs	>5 oversized	Review and resize with VM owners
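
The "time to exhaustion" KPIs come down to extrapolating a usage trend until it reaches capacity. The sketch below does this with an ordinary least-squares line over daily utilization samples. It is only meant to show the idea: the product's forecasting model is more sophisticated, and the function name and sample data are made up for the example.

```python
from typing import Optional, Sequence


def days_to_exhaustion(daily_usage_pct: Sequence[float],
                       capacity_pct: float = 100.0) -> Optional[float]:
    """Days from the last sample until a linear trend reaches capacity.

    Returns None when usage is flat or falling (no exhaustion in sight).
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_pct))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # percentage points per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_pct - intercept) / slope  # day index where the line hits capacity
    return max(0.0, day_full - (n - 1))


usage = [70.0, 70.5, 71.0, 71.5, 72.0]  # CPU usage % over five consecutive days
remaining = days_to_exhaustion(usage)
print(remaining, "WARNING" if remaining is not None and remaining < 60 else "OK")
```

With growth of half a percentage point per day, the example crosses the 60-day warning line from the table even though current usage is only 72%.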

3. Alert Definition & Notification Plugins

The vROps alert system goes beyond vCenter alarms by adding predictive analysis and automated root cause analysis, which can reduce MTTR considerably.

### Create a Custom Alert Definition in vROps

# 1. Home → Alerts → Alert Definitions → Add
# Name: "VM CPU Ready Critical — Production"
# Base Object Type: Virtual Machine

# Symptom Set:
#   Symptom 1: CPU|CPU Ready (ms) > 500 for 5 minutes  [CRITICAL]
#   Symptom 2: CPU|Demand (MHz) > 90% of limit for 5m  [WARNING]
#   Condition: Symptom 1 OR Symptom 2

# Recommendation:
#   "Check host CPU utilization. Consider:
#    1. Reduce vCPU count of this VM
#    2. Migrate VM to less-loaded host (DRS manual)
#    3. Add host to cluster if sustained"

# Notification Plugin: send alerts to Slack/Teams:
# Administration → Notifications → Outbound Settings
# → Add: Webhook notification plugin
# URL: https://hooks.slack.com/services/xxx/yyy/zzz
# Payload template:
# {"text":"*[vROps Alert]* {{alertName}}\n*Object:* {{resourceName}}\n*Severity:* {{criticality}}\n*Time:* {{startTimeUTC}}"}

# Email notification for Critical alerts:
# Outbound Settings → SMTP → smtp.hoatranlab.io.local:587
# → Notification rule: Alert level = CRITICAL → email [email protected]
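
To preview what the webhook receiver would get, the payload template can be rendered locally. The sketch below substitutes the `{{...}}` placeholders from the template above and serializes the result as JSON. The placeholder names are taken from this document's template and are not verified against the product's variable list; the alert values are invented.

```python
import json
import re

TEMPLATE = (
    "*[vROps Alert]* {{alertName}}\n"
    "*Object:* {{resourceName}}\n"
    "*Severity:* {{criticality}}\n"
    "*Time:* {{startTimeUTC}}"
)


def render(template: str, fields: dict) -> str:
    """Replace {{name}} placeholders; unknown names are left untouched."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: str(fields.get(match.group(1), match.group(0))),
        template,
    )


def build_payload(fields: dict) -> str:
    """Build the JSON body a Slack-style incoming webhook expects."""
    return json.dumps({"text": render(TEMPLATE, fields)})


example = {
    "alertName": "VM CPU Ready Critical",
    "resourceName": "oracle-db-01",
    "criticality": "CRITICAL",
    "startTimeUTC": "2024-01-01T10:00:00Z",
}
print(build_payload(example))
```

Leaving unknown placeholders intact makes a misspelled variable name visible in the delivered message instead of silently producing an empty field.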

4. PowerCLI Automation Scripts cho Common Tasks

PowerCLI (currently v13.x) scripts repetitive tasks, from weekly health reports to bulk VM provisioning and compliance auditing.

### Install PowerCLI (v13.x; requires PowerShell 5.1+ or PowerShell 7+)
Install-Module VMware.PowerCLI -AllowClobber -Scope CurrentUser
Get-Module VMware.PowerCLI -ListAvailable | Select Version

### Connect to vCenter
# Prompt for credentials instead of putting a password on the command line
Connect-VIServer -Server vcsa-01.lab.local -Credential (Get-Credential)

### Script 1: Weekly VM Health Report (emailed automatically)
Connect-VIServer vcsa-01.lab.local -User [email protected]

# cpu.ready.summation is ms per 20 s real-time sample, so ms / 200 = percent.
# The aggregate instance ('') sums all vCPUs; divide by NumCpu for a per-vCPU figure.
$report = Get-VM | Select-Object Name,
    @{N='PowerState';E={$_.PowerState}},
    @{N='CPU_Ready_%';E={[math]::Round((Get-Stat -Entity $_ -Stat "cpu.ready.summation" -MaxSamples 1 -Realtime -ErrorAction SilentlyContinue | Where-Object Instance -eq '' | Select-Object -First 1).Value / 200 / $_.NumCpu, 2)}},
    @{N='Mem_Balloon_MB';E={[math]::Round((Get-Stat -Entity $_ -Stat "mem.vmmemctl.average" -MaxSamples 1 -Realtime -ErrorAction SilentlyContinue).Value / 1024, 0)}},
    @{N='Snapshot_Count';E={(Get-Snapshot -VM $_).Count}},
    @{N='Tools_Status';E={$_.ExtensionData.Guest.ToolsStatus}}

# Alert: VMs with snapshots older than 3 days
$oldSnapshots = Get-VM | Get-Snapshot | Where-Object {$_.Created -lt (Get-Date).AddDays(-3)}
$oldSnapshots | Format-Table VM, Name, Created, SizeGB -AutoSize

# Export & Email
$report | Export-Csv "C:\Reports\VM-Health-$(Get-Date -f yyyyMMdd).csv" -NoTypeInformation
Send-MailMessage -To "[email protected]" -From "[email protected]" `
    -Subject "Weekly VM Health Report $(Get-Date -f yyyy-MM-dd)" `
    -Attachments "C:\Reports\VM-Health-$(Get-Date -f yyyyMMdd).csv" `
    -SmtpServer "smtp.hoatranlab.io.local"

### Script 2: Bulk VM Provisioning from a Template
$vmList = Import-Csv "C:\Provision\vm-list.csv"  # Columns: VMName,Template,Cluster,Datastore,Folder
$vmList | ForEach-Object {
    New-VM -Name $_.VMName -Template (Get-Template $_.Template) `
           -VMHost (Get-Cluster $_.Cluster | Get-VMHost | Get-Random) `
           -Datastore (Get-Datastore $_.Datastore) `
           -Location (Get-Folder $_.Folder) `
           -RunAsync
    Write-Host "Provisioning $($_.VMName)..."
}

5. Terraform vSphere Provider cho Infrastructure as Code

The Terraform vSphere provider lets you define and version-control the entire VM infrastructure: reproducible, auditable, and ready to plug into a CI/CD pipeline.

### Terraform vSphere — Deploy VM Production (main.tf)

terraform {
  required_providers {
    vsphere = { source = "hashicorp/vsphere", version = "~> 2.5" }
  }
}

provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = "vcsa-01.lab.local"
  allow_unverified_ssl = false
}

data "vsphere_datacenter"     "dc"      { name = "DC-HaNoi-01" }
# HCL does not allow ";" separators: each argument goes on its own line
data "vsphere_compute_cluster" "cluster" {
  name          = "CL-HN-Prod-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_datastore" "ds" {
  name          = "ds-prod-ssd-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_network" "net" {
  name          = "PG-VLAN100-Production"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_virtual_machine" "tmpl" {
  name          = "TMPL-RHEL9-Base"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "app_server" {
  count            = var.instance_count  # scale: terraform apply -var="instance_count=3"
  name             = "LNXAPP-${var.app_name}-PRD-0${count.index + 1}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id
  folder           = "/DC-HaNoi-01/vm/Production/${var.app_name}"

  num_cpus = 4
  memory   = 8192
  guest_id = data.vsphere_virtual_machine.tmpl.guest_id

  network_interface { network_id = data.vsphere_network.net.id }
  disk {
    label            = "disk0"
    size             = 100
    thin_provisioned = true
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.tmpl.id
    customize {
      linux_options {
        host_name = "lnxapp-${lower(var.app_name)}-prd-0${count.index + 1}"
        domain    = "hoatranlab.io.local"
      }
      network_interface {
        ipv4_address = "10.100.${var.vlan_octet}.${10 + count.index}"
        ipv4_netmask = 24
      }
      ipv4_gateway = "10.100.${var.vlan_octet}.1"
    }
  }
  tags = [vsphere_tag.env_prod.id, vsphere_tag.app.id]
}

6. Ansible VMware Modules cho Configuration Management

The Ansible community.vmware collection has 100+ modules for managing vSphere: idempotent, declarative, and well integrated with AWX/Tower.

### Install the Ansible VMware Collection
# Requires: Python 3.8+, Ansible 2.12+
ansible-galaxy collection install community.vmware
# Verify
ansible-galaxy collection list | grep vmware

### Ansible Playbook — ESXi Host Hardening (playbook-esxi-hardening.yml)
---
- name: ESXi Security Hardening
  hosts: localhost
  gather_facts: false
  vars:
    vcenter_hostname: vcsa-01.lab.local
    vcenter_username: "{{ vault_vcenter_user }}"     # Stored in Ansible Vault
    vcenter_password: "{{ vault_vcenter_password }}"
    datacenter: DC-HaNoi-01

  tasks:
    - name: Disable SSH on all ESXi hosts
      community.vmware.vmware_host_service_manager:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01   # applies to every host in the cluster, so no loop is needed
        state: absent                 # same effect as "stop"
        service_policy: "off"         # keep the service from starting at boot
        service_name: TSM-SSH

    - name: Configure NTP servers
      community.vmware.vmware_host_ntp:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01
        state: present
        ntp_servers: ["10.100.1.10", "10.100.1.11"]

    - name: Enable Lockdown Mode (Normal)
      community.vmware.vmware_host_lockdown:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01
        state: normal  # normal | strict | disabled

    - name: Configure Syslog forwarding
      community.vmware.vmware_host_config_manager:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        esxi_hostname: "{{ item }}"
        options:
          'Syslog.global.logHost': "ssl://siem.hoatranlab.io.local:6514"
      loop: "{{ groups['esxi_hosts'] }}"   # inventory group of ESXi host names

Congratulations on completing the course!

You have completed 14 sessions on VMware vSphere 8.0.3, from basic ESXi installation to enterprise-grade operations.

ESXi & vCenter · vDS Networking · iSCSI/NFS Storage · HA + DRS + FT · Lifecycle Manager · Backup & DR · Security & RBAC · vSAN HCI · Monitoring & Tuning

Next step: take the VCP-DCV certification exam (2V0-21.23)

Key Concepts

▸CPU Ready >5%: Signals vCPU overcommitment; reduce vCPUs or vMotion promptly.
▸Memory Swap >0: Critical; VM memory is being paged to disk and performance collapses.
▸Stateful Alarm: Has a Green→Yellow→Red lifecycle, so you know when the issue resolves.
▸NUMA Node: Size VMs ≤ cores per socket (NUMA node) to avoid remote memory access latency.
▸vROps Right-sizing: Finds "oversized" VMs → saves licenses and resources.
▸Capacity 60-day rule: Alert when <60 days of CPU/RAM or <30 days of storage remain.
