Session Objectives
- Master the key metrics: CPU Ready, Memory Balloon, Disk Latency
- Configure vSphere Alarms for Datastore, CPU, Host, and vSAN
- Analyze an end-to-end performance incident: a slow Oracle DB
- Optimize NUMA-aware VM sizing for large VMs
- Build a 3-tier monitoring stack and a weekly KPI dashboard
Theory
vSphere Native Monitoring — Key Metrics
=== CPU Metrics ===
Usage (%) → Percentage of CPU actually in use
Ready (ms) → Time the VM waits to be scheduled on a pCPU
→ Target: <5% (1,000 ms per 20-second real-time sample, about 50 ms per second)
→ >10%: noticeable performance impact
Co-stop (ms) → An SMP VM is held back because one of its sibling vCPUs is not yet ready
→ Target: <3% (SMP VMs)
Swap Wait → VM waits for swapped memory to be read back, which stalls the CPU
=== Memory Metrics ===
Active (MB) → Memory the VM is actively using (hot pages)
Granted (MB) → Memory the VMkernel has granted to the VM
Balloon (MB) → VMkernel reclaims idle guest memory via the balloon driver
→ Target: 0 when the host has enough RAM
→ >0: the host is under memory pressure
Swap In/Out → VM memory paged out to disk: very bad!
→ Target: strictly 0
Consumed → Host physical memory currently backing the VM
=== Disk Metrics ===
Read/Write IOPS → I/O operations per second (operation rate, not MB/s)
Latency (ms) → Target: <10ms (SAS/HDD), <2ms (SSD/NVMe)
→ >20ms: application timeout risk
Queue Depth → IO congestion indicator (>32: bottleneck)
=== Network Metrics ===
Received/Transmitted (KBps) → Throughput
Packets Dropped → Network overload, buffer overflow
→ Target: 0
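The CPU Ready targets above are percentages, while vCenter reports the counter as a summation in milliseconds per sample. A minimal sketch of the conversion, assuming the 20-second real-time sampling interval and thresholds of 5% and 10%; the function names are illustrative and not part of any VMware API:

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms per sample) into a per-vCPU percentage."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    # The sample covers interval_s * 1000 ms of wall-clock time per vCPU.
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def classify_ready(percent: float) -> str:
    """Map a ready percentage onto the thresholds used in this session."""
    if percent > 10.0:
        return "critical"
    if percent >= 5.0:
        return "warning"
    return "ok"


if __name__ == "__main__":
    pct = cpu_ready_percent(1000)      # 1,000 ms in a 20 s sample
    print(pct, classify_ready(pct))    # 5.0 warning
```

For historical charts the interval is longer (for example 300 seconds for the past-day rollup), so the same millisecond value maps to a much smaller percentage.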
vSphere Alarms — Types & Actions
Alarm Types:
Stateful: Green → Yellow (Warning) → Red (Critical) → Green
Tracks the lifecycle, so you know when the condition resolves
Stateless: Trigger → Action (one-shot event)
Has no state; fires only when the event occurs
Action Types:
Send email notification → [email protected]
Send SNMP trap → NMS (Nagios, Zabbix, PRTG)
Run script → PagerDuty webhook, auto-remediation
Log to syslog → SIEM correlation
Alarm Scope:
vCenter → Applied globally
Datacenter → Applied to all objects in DC
Cluster → Applied to cluster and hosts/VMs within
Host / VM / Datastore → Per-object alarms
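The difference between stateful and stateless alarms is easiest to see in code. The sketch below models a stateful threshold alarm that runs its action only when the state changes; the class and callback names are hypothetical, and this illustrates the concept rather than how vCenter implements it:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StatefulAlarm:
    warning: float                                   # yellow threshold
    critical: float                                  # red threshold
    on_change: Callable[[str, str, float], None]     # action(old, new, value)
    state: str = "green"

    def evaluate(self, value: float) -> str:
        """Update the alarm state and fire the action only on a transition."""
        if value >= self.critical:
            new_state = "red"
        elif value >= self.warning:
            new_state = "yellow"
        else:
            new_state = "green"
        if new_state != self.state:
            old_state, self.state = self.state, new_state
            self.on_change(old_state, new_state, value)
        return self.state


if __name__ == "__main__":
    events = []
    alarm = StatefulAlarm(75, 85, lambda old, new, v: events.append((old, new, v)))
    for usage in (60, 78, 80, 90, 70):   # datastore usage in percent
        alarm.evaluate(usage)
    print(events)
```

Five samples produce three actions (green to yellow, yellow to red, red to green). A stateless alarm would instead run its action on every matching event.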
VMware Aria Operations (formerly vROps, v8.17) — Capabilities
Architecture:
Analytics Node (primary) → Processing + UI
+ Remote Collector → Collects data from remote sites
+ Data Node → Scale-out storage
Key Capabilities:
1. Predictive Analytics (ML-based)
→ Capacity forecasting 6-12 months ahead
→ "Days until CPU/RAM/Storage exhaustion"
→ Alert khi <60 days remaining
2. Right-sizing Recommendations
→ Detects "oversized" VMs (allocation >> actual usage over 90+ days)
→ Savings: reduce vCPU/vRAM → lower license cost
→ "Reclaim" potential: $X/year savings
3. Compliance Dashboards
→ CIS Benchmark compliance score
→ PCI-DSS, HIPAA controls status
→ Automated remediation actions
4. Workload Optimization
→ Recommends VM placement to reduce hotspots
→ Integrates with DRS recommendations
Performance Tuning — Key Issues
| Issue | Symptoms | Resolution |
|---|---|---|
| High CPU Ready | %RDY >5%, slow VM, high latency | Reduce vCPU count, vMotion to a less loaded host, or add a host |
| Memory Balloon | Swap activity, application slows suddenly | Add host RAM, reduce VM RAM allocation, vMotion |
| High Disk Latency | App timeouts, slow DB queries | Move the VM to a faster datastore (SSD), check snapshot overhead |
| Network Drop | Packet loss, connection drops | Check NIC teaming, increase bandwidth, enable NIOC |
| NUMA Imbalance | Uneven CPU/RAM, high latency | Size vCPU ≤ pCPU per NUMA node, enable vNUMA |
| High Co-Stop | SMP VM is slow, does not scale linearly | Reduce vCPU count (fewer vCPUs are easier to schedule) |
NUMA — Non-Uniform Memory Access
Host with 2 sockets (2 NUMA nodes):

  NUMA Node 0               NUMA Node 1
┌──────────────┐         ┌──────────────┐
│ Socket 0     │         │ Socket 1     │
│ 16 cores     │◄─QPI───►│ 16 cores     │
│ 256 GB RAM   │         │ 256 GB RAM   │
└──────────────┘         └──────────────┘

Local access:  CPU0 → RAM Node 0 = <80ns latency
Remote access: CPU0 → RAM Node 1 = ~120-150ns latency (via QPI)

Best Practice:
  VM 16 vCPU → fits in 1 NUMA node → all local memory → fast
  VM 32 vCPU → spans 2 NUMA nodes → remote memory access → slower

Rule: vCPU count ≤ physical cores per NUMA node
Example: 2-socket × 16-core host → max 16 vCPU per VM (optimal)

vNUMA (Virtual NUMA, for large VMs >8 vCPU):
  ESXi exposes virtual NUMA topology to guest OS
  Guest scheduler becomes NUMA-aware → better performance
  Auto-enabled when VM vCPU > physical cores per socket
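The sizing rule can be checked mechanically. The sketch below estimates how many NUMA nodes a VM must span. Checking memory in addition to vCPUs is an added assumption (a VM larger than one node's RAM cannot stay node-local either), and the function name is illustrative:

```python
import math


def numa_nodes_spanned(vcpus: int, vm_mem_gb: float,
                       cores_per_node: int, mem_per_node_gb: float) -> int:
    """Return the minimum number of NUMA nodes needed to hold the VM."""
    if min(vcpus, cores_per_node) <= 0 or min(vm_mem_gb, mem_per_node_gb) <= 0:
        raise ValueError("all sizes must be positive")
    by_cpu = math.ceil(vcpus / cores_per_node)        # nodes needed for vCPUs
    by_mem = math.ceil(vm_mem_gb / mem_per_node_gb)   # nodes needed for RAM
    return max(by_cpu, by_mem)


if __name__ == "__main__":
    # Host from the diagram: 16 cores and 256 GB per NUMA node.
    print(numa_nodes_spanned(16, 128, 16, 256))  # 1 -> all memory is local
    print(numa_nodes_spanned(32, 128, 16, 256))  # 2 -> remote access likely
```

A result of 1 means the VM can be scheduled within a single node. Anything higher means some memory accesses will cross the inter-socket link.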
Hands-on Labs
Lab 14.1: Configuring vSphere Alarms
vSphere Client → vCenter → Configure → Alarm Definitions

--- Alarm 1: Datastore Usage Warning ---
Name: Datastore Usage > 80%
Object Type: Datastore
Trigger: Metric Threshold
  Metric: Disk → Capacity Usage %
  Warning: 75
  Critical: 85
Action (Warning): Send email: [email protected]
Action (Critical): Send email + SNMP trap to NMS

--- Alarm 2: VM CPU Ready Time High ---
Name: VM CPU Ready Time High
Object Type: Virtual Machine
Trigger: Metric: CPU → Ready
  Warning: >2000 ms per 20-second sample (~10% ready time)
  Critical: >4000 ms per 20-second sample (~20% ready time)
Action: Email + create ServiceNow ticket (script)

--- Alarm 3: ESXi Host Not Responding ---
Name: ESXi Host Not Responding
Object Type: Host
Trigger: Condition = Host Connection State = Not Responding
Action: Email + PagerDuty webhook (script)

--- Alarm 4: vSAN Disk Failure ---
Object: Cluster
Trigger: vSAN → Degraded Disk
Action: Critical email + SMS alert
Repeat: every 30 minutes until acknowledged
Lab 14.2: Performance Analysis of a Slow Oracle DB
Scenario: oracle-db-01 reports unusually slow queries
Step 1: Check CPU Ready
VM → Monitor → Performance → Advanced
Chart: CPU → CPU Ready
Interval: Last 1 hour, Last 1 day
Assessment thresholds:
<5% → OK
5-10% → Warning, minor impact
>10% → Critical, act immediately
If CPU Ready >5% → check host CPU utilization:
Host → Monitor → Performance → CPU
If the host is >80% → vMotion the VM to a less loaded host
Step 2: Check Storage Latency
VM → Monitor → Performance → Disk
Metric: Disk Latency (ms)
Target: <10ms (HDD), <2ms (SSD)
If high → check:
a. Are there snapshots? → Snapshot → Manage → delete
b. VMFS contention? (many VMs on the same datastore)
→ Storage vMotion the VM to a dedicated datastore
c. Storage queue depth?
Host → Performance → Disk → Queue Depth
>32 → storage bottleneck
Step 3: Check Memory
VM → Monitor → Performance → Memory
Balloon > 0 → RAM pressure on the host
Swap > 0 → CRITICAL, add RAM immediately
Step 4: Real-time esxtop (SSH to the ESXi host)
esxtop
→ c (CPU view)
Find the oracle-db-01 process (vmx)
%RDY column → CPU Ready for this VM
%CSTP column → Co-stop (SMP VMs)
Batch capture for deeper analysis:
esxtop -b -d 5 -n 60 > /tmp/esxtop-oracle-$(date +%Y%m%d-%H%M).csv
# 5 seconds/sample × 60 samples = 5 minutes of data
Step 5: Conclusion & remediation
High CPU Ready → vMotion to esxi-03 (CPU 45%)
High Disk Latency → delete the 3-day-old snapshot (2 GB delta)
→ Verify after 15 minutes: performance is back to normal
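The batch capture from Step 4 can be summarized offline. The sketch below pulls the `% Ready` column for one VM out of an esxtop CSV. It assumes the perfmon-style column naming `\\host\Group Cpu(<id>:<vm name>)\% Ready`; check the header of your own capture first. The sample data is synthetic and the function name is hypothetical:

```python
import csv
import io


def ready_stats(csv_text: str, vm_name: str) -> tuple:
    """Return (average, maximum) of the '% Ready' column for one VM group."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    suffix = "\\% Ready"
    # Assumed header shape: \\host\Group Cpu(<id>:<vm_name>)\% Ready
    cols = [i for i, name in enumerate(header)
            if "Group Cpu(" in name and f":{vm_name})" in name
            and name.endswith(suffix)]
    if not cols:
        raise KeyError(f"no '% Ready' column found for {vm_name}")
    values = [float(row[cols[0]]) for row in reader
              if len(row) > cols[0] and row[cols[0]]]
    if not values:
        raise ValueError("capture contains no samples")
    return sum(values) / len(values), max(values)


# Synthetic two-sample capture, for illustration only.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esxi-01\Group Cpu(1001:oracle-db-01)\% Ready"',
    '"01/01/2025 10:00:00","4.0"',
    '"01/01/2025 10:00:05","12.0"',
])

if __name__ == "__main__":
    print(ready_stats(SAMPLE, "oracle-db-01"))  # (8.0, 12.0)
```

The group-level value is summed across all worlds in the VM, so divide by the vCPU count before comparing it with the per-vCPU thresholds from Step 1.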
Lab 14.3: SNMP Configuration on ESXi
SSH to the ESXi host:
# Configure the SNMP community and trap target
esxcli system snmp set --communities public
esxcli system snmp set --targets 10.100.100.50@162/public
esxcli system snmp set --enable true
# Verify the configuration
esxcli system snmp get
# Send a test trap
esxcli system snmp test
# → The NMS (Zabbix/Nagios) should receive the trap
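# Configure SNMP v3 (recommended for production):
# 1. Select the authentication and privacy protocols
esxcli system snmp set --authentication SHA1 --privacy AES128
# 2. Hash the passphrases (authpass/privpass are placeholders)
esxcli system snmp hash --raw-secret --auth-hash authpass --priv-hash privpass
# 3. Register the user with the hashes printed by step 2, then add the v3 trap target
esxcli system snmp set --users authuser/<authhash>/<privhash>/priv
esxcli system snmp set --v3targets 10.100.100.50@162/authuser/priv/trap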
# PowerCLI: bulk-enable SNMP on all hosts:
Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $esxcli.system.snmp.set.Invoke(@{
        enable      = $true
        communities = "public"
        targets     = "10.100.100.50@162/public"
    })
}
Lab 14.4: NUMA-Aware VM Sizing & Tuning
View the host's NUMA topology (SSH):
esxcli hardware memory get       # includes "NUMA Node Count"
esxcli hardware cpu global get   # includes "CPU Packages" and "CPU Cores"
# Example topology for this lab host:
# NUMA Node Count: 2
# Node 0: 16 CPUs, 262144 MB RAM
# Node 1: 16 CPUs, 262144 MB RAM

Rule: vCPU count ≤ pCPU per NUMA node
Host: 2 sockets × 16 cores = 32 pCPUs total
NUMA node size: 16 cores
Optimal VM: ≤ 16 vCPUs (fits in 1 NUMA node)
VM 32 vCPU: spans 2 NUMA nodes → remote memory access

Configure VM NUMA pinning (advanced):
VM Edit Settings → VM Options → Advanced → NUMA
Preferred NUMA Node: 0 (pins the VM to Node 0; advanced parameter numa.nodeAffinity = 0)

vNUMA verification in a Windows guest:
Task Manager → Performance → CPU → Sockets
If it shows "2 Sockets, 16 Cores each" → vNUMA is active

vNUMA verification in a Linux guest:
numactl --hardware
# Available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 ... 15
# node 1 cpus: 16 17 ... 31

Sizing recommendations for large VMs:
DB server 32 vCPU → split into 2×16, or do not set a limit
16 vCPU VM on a 2×16-core host → 1 NUMA node → optimal
ENTERPRISE APPLICATIONS: MODULE 14
VMware Aria Operations (formerly vRealize Operations; v8.17 with vSphere 8.0.3) deployment and automation in practice: sizing, custom dashboards, alert definitions, PowerCLI scripts, the Terraform vSphere provider, and Ansible VMware modules.
1. VMware Aria Operations Deployment Sizing: Small / Medium / Large
Choose the right deployment size from the start: an undersized Aria Operations deployment causes slow dashboards and lost metrics, while an oversized one wastes resources. (Nodes = monitored ESXi hosts + VMs)
| Size | vCPU | RAM | Disk | Objects | Metrics/min | Use case |
|---|---|---|---|---|---|---|
| Small | 4 | 16 GB | 250 GB | ≤100 nodes | <750K | Lab, small environments (<100 hosts) |
| Medium | 8 | 32 GB | 1 TB | ≤300 nodes | <2.5M | Mid-size DC |
| Large | 16 | 64 GB | 2 TB | ≤1,500 nodes | <7.5M | Enterprise |
| XLarge | 24+ | 128 GB | 4 TB | >1,500 nodes | >7.5M | Large enterprise, multi-vCenter |
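The table can be turned into a simple lookup for planning scripts. This sketch uses only the node thresholds from the table above; the function name is illustrative, and the official VMware sizing guidelines should be consulted for a real deployment:

```python
# (size name, maximum monitored nodes) taken from the sizing table above
SIZE_LIMITS = [("Small", 100), ("Medium", 300), ("Large", 1500)]


def pick_deployment_size(monitored_nodes: int) -> str:
    """Return the smallest deployment size whose node limit fits the count."""
    if monitored_nodes < 0:
        raise ValueError("monitored_nodes must not be negative")
    for name, max_nodes in SIZE_LIMITS:
        if monitored_nodes <= max_nodes:
            return name
    return "XLarge"


if __name__ == "__main__":
    for count in (80, 250, 1200, 4000):
        print(count, pick_deployment_size(count))
```

Sizing for expected growth rather than the current count avoids a redeployment when the environment crosses a threshold.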
vROps Deployment Best Practices
- Deploy on an all-flash datastore: vROps writes metrics continuously, so disk latency affects it directly
- Use a dedicated service account with the Read-Only role in vCenter; do not use an admin account
- Enable "Global Settings → Remote Collector" when monitoring multiple vCenters at different sites
- Schedule a "maintenance window" in vROps for planned downtime to avoid false alarms
2. Custom Dashboard for Capacity Management
A capacity management dashboard forecasts when more resources must be purchased, so production does not run out of capacity unexpectedly.
### Creating a Capacity Management Dashboard in vROps
# 1. Home → Dashboards → Create Dashboard
#    Name: "Capacity Management — Executive View"
#    Layout: 3 columns
# WIDGET 1: Cluster Remaining Capacity (Scoreboard)
#   Metrics:
#   - Cluster|CPU|Capacity Remaining (%)
#   - Cluster|Memory|Capacity Remaining (%)
#   - Datastore|Disk Space|Capacity Remaining (%)
#   Color: Green >40%, Yellow 20-40%, Red <20%
# WIDGET 2: Time to Exhaustion (Metric Chart trend)
#   Metrics:
#   - Cluster|CPU|Time Remaining (days)
#   - Cluster|Memory|Time Remaining (days)
#   Threshold line: 60 days (procurement lead time)
# WIDGET 3: Top 10 Oversized VMs (Table)
#   From: vROps Reclaim → Oversized VMs
#   Columns: VM Name, Current vCPU, Recommended vCPU, Wasted CPU (GHz)
#   Sort by: Wasted CPU descending
#   → Direct link to "Right-size" action
# WIDGET 4: Capacity Forecast (12 months)
#   vROps built-in: Capacity → Potential Headroom
#   Shows: current + projected usage with confidence bands
| KPI | Target | Warning | Action on Warning |
|---|---|---|---|
| CPU Time to Exhaustion | >90 days | <60 days | Start the procurement process |
| Memory Time to Exhaustion | >90 days | <60 days | Right-size oversized VMs first |
| Storage Remaining Capacity | >30% | <20% | Clean up snapshots, archive data |
| vROps Reclaim Savings | 0 oversized VMs | >5 oversized | Review & resize with VM owners |
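To make "time to exhaustion" concrete, the sketch below fits a straight line to daily usage samples and projects when it reaches capacity. This is a plain least-squares trend chosen for illustration, not the analytics engine Aria Operations actually uses; the function names and the 60-day warning default (from the table above) are assumptions:

```python
from typing import Optional, Sequence


def days_to_exhaustion(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Project days from the last sample until a linear trend hits capacity.

    Returns None when usage is flat or decreasing (no exhaustion in sight).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx                      # usage growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))    # relative to the latest sample


def kpi_status(days_left: Optional[float], warning_days: float = 60) -> str:
    """Apply the warning threshold from the KPI table."""
    if days_left is not None and days_left < warning_days:
        return "warning"
    return "ok"


if __name__ == "__main__":
    left = days_to_exhaustion([50, 51, 52, 53, 54], capacity=100)
    print(left, kpi_status(left))  # 46.0 warning
```

With usage growing by one unit per day from 54 toward a capacity of 100, the projection is 46 days, which falls under the 60-day procurement threshold.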
3. Alert Definition & Notification Plugins
The vROps alert system goes beyond vCenter alarms by adding predictive analysis and automated root cause analysis, which helps reduce MTTR.
### Creating a Custom Alert Definition in vROps
# 1. Home → Alerts → Alert Definitions → Add
# Name: "VM CPU Ready Critical — Production"
# Base Object Type: Virtual Machine
# Symptom Set:
# Symptom 1: CPU|CPU Ready (ms) > 500 for 5 minutes [CRITICAL]
# Symptom 2: CPU|Demand (MHz) > 90% of limit for 5m [WARNING]
# Condition: Symptom 1 OR Symptom 2
# Recommendation:
# "Check host CPU utilization. Consider:
# 1. Reduce vCPU count of this VM
# 2. Migrate VM to less-loaded host (DRS manual)
# 3. Add host to cluster if sustained"
# Notification plugin (send alerts to Slack/Teams):
# Administration → Notifications → Outbound Settings
# → Add: Webhook notification plugin
# URL: https://hooks.slack.com/services/xxx/yyy/zzz
# Payload template:
# {"text":"*[vROps Alert]* {{alertName}}\n*Object:* {{resourceName}}\n*Severity:* {{criticality}}\n*Time:* {{startTimeUTC}}"}
# Email notification for Critical alerts:
# Outbound Settings → SMTP → smtp.hoatranlab.io.local:587
# → Notification rule: Alert level = CRITICAL → email [email protected]
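When testing the webhook receiver independently of vROps, it helps to build the same payload by hand. The sketch below renders the message from the template fields shown above and leaves escaping to the JSON encoder. The function name is hypothetical, and whether the webhook plugin substitutes the `{{...}}` placeholders exactly this way is an assumption:

```python
import json


def render_alert(alert_name: str, resource_name: str,
                 criticality: str, start_time_utc: str) -> str:
    """Build a Slack-style JSON body mirroring the payload template."""
    text = (
        f"*[vROps Alert]* {alert_name}\n"
        f"*Object:* {resource_name}\n"
        f"*Severity:* {criticality}\n"
        f"*Time:* {start_time_utc}"
    )
    # json.dumps escapes quotes and newlines, which string concatenation
    # into a hand-written JSON template would not.
    return json.dumps({"text": text})


if __name__ == "__main__":
    body = render_alert("VM CPU Ready Critical", "oracle-db-01",
                        "CRITICAL", "2025-01-01T10:00:00Z")
    print(body)
```

The resulting string can be sent to a test webhook URL with any HTTP client to confirm the channel formatting before enabling the notification rule.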
4. PowerCLI Automation Scripts for Common Tasks
PowerCLI (currently v13.x) scripts automate repetitive tasks, from weekly health reports to bulk VM provisioning and compliance auditing.
### Install PowerCLI (v13.x; requires PowerShell 5.1+ or PowerShell 7+)
Install-Module VMware.PowerCLI -AllowClobber -Scope CurrentUser
Get-Module VMware.PowerCLI -ListAvailable | Select Version

### Connect to vCenter
Connect-VIServer -Server vcsa-01.lab.local -User [email protected] -Password 'VMware1!'

### Script 1: Weekly VM Health Report (emailed automatically)
Connect-VIServer vcsa-01.lab.local -User [email protected]
# cpu.ready.summation is in ms per 20-second real-time sample, so ms / 200 = %.
# Instance '' is the VM-level aggregate; powered-off VMs return no stats.
$report = Get-VM | Select-Object Name,
    @{N='PowerState';E={$_.PowerState}},
    @{N='CPU_Ready_%';E={[math]::Round((Get-Stat -Entity $_ -Stat "cpu.ready.summation" -MaxSamples 1 -Realtime -ErrorAction SilentlyContinue | Where-Object { $_.Instance -eq '' }).Value / 200, 2)}},
    @{N='Mem_Balloon_MB';E={[math]::Round((Get-Stat -Entity $_ -Stat "mem.vmmemctl.average" -MaxSamples 1 -Realtime -ErrorAction SilentlyContinue).Value / 1024, 0)}},
    @{N='Snapshot_Count';E={(Get-Snapshot -VM $_).Count}},
    @{N='Tools_Status';E={$_.ExtensionData.Guest.ToolsStatus}}

# Alert: VMs with snapshots older than 3 days
$oldSnapshots = Get-VM | Get-Snapshot | Where-Object {$_.Created -lt (Get-Date).AddDays(-3)}
$oldSnapshots | Format-Table VM, Name, Created, SizeGB -AutoSize

# Export & Email
$report | Export-Csv "C:\Reports\VM-Health-$(Get-Date -f yyyyMMdd).csv" -NoTypeInformation
Send-MailMessage -To "[email protected]" -From "[email protected]" `
    -Subject "Weekly VM Health Report $(Get-Date -f yyyy-MM-dd)" `
    -Attachments "C:\Reports\VM-Health-$(Get-Date -f yyyyMMdd).csv" `
    -SmtpServer "smtp.hoatranlab.io.local"

### Script 2: Bulk VM Provisioning from a Template
$vmList = Import-Csv "C:\Provision\vm-list.csv"
# Columns: VMName,Template,Cluster,Datastore,Folder
$vmList | ForEach-Object {
    New-VM -Name $_.VMName -Template (Get-Template $_.Template) `
        -VMHost (Get-Cluster $_.Cluster | Get-VMHost | Get-Random) `
        -Datastore (Get-Datastore $_.Datastore) `
        -Location (Get-Folder $_.Folder) `
        -RunAsync
    Write-Host "Provisioning $($_.VMName)..."
}
5. Terraform vSphere Provider for Infrastructure as Code
The Terraform vSphere provider lets you define and version-control the entire VM infrastructure: reproducible, auditable, and easy to integrate into a CI/CD pipeline.
### Terraform vSphere — Deploy VM Production (main.tf)
terraform {
  required_providers {
    vsphere = { source = "hashicorp/vsphere", version = "~> 2.5" }
  }
}
provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = "vcsa-01.lab.local"
  allow_unverified_ssl = false
}
data "vsphere_datacenter" "dc" { name = "DC-HaNoi-01" }
data "vsphere_compute_cluster" "cluster" {
  name          = "CL-HN-Prod-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_datastore" "ds" {
  name          = "ds-prod-ssd-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_network" "net" {
  name          = "PG-VLAN100-Production"
  datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_virtual_machine" "tmpl" {
  name          = "TMPL-RHEL9-Base"
  datacenter_id = data.vsphere_datacenter.dc.id
}
# The var.* inputs and the vsphere_tag.* resources are assumed to be declared elsewhere in the module.
resource "vsphere_virtual_machine" "app_server" {
  count            = var.instance_count # scale: terraform apply -var="instance_count=3"
  name             = "LNXAPP-${var.app_name}-PRD-0${count.index + 1}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id
  folder           = "/DC-HaNoi-01/vm/Production/${var.app_name}"
  num_cpus         = 4
  memory           = 8192
  guest_id         = data.vsphere_virtual_machine.tmpl.guest_id
  network_interface {
    network_id = data.vsphere_network.net.id
  }
  disk {
    label            = "disk0"
    size             = 100
    thin_provisioned = true
  }
  clone {
    template_uuid = data.vsphere_virtual_machine.tmpl.id
    customize {
      linux_options {
        host_name = "lnxapp-${lower(var.app_name)}-prd-0${count.index + 1}"
        domain    = "hoatranlab.io.local"
      }
      network_interface {
        ipv4_address = "10.100.${var.vlan_octet}.${10 + count.index}"
        ipv4_netmask = 24
      }
      ipv4_gateway = "10.100.${var.vlan_octet}.1"
    }
  }
  tags = [vsphere_tag.env_prod.id, vsphere_tag.app.id]
}
6. Ansible VMware Modules for Configuration Management
The Ansible community.vmware collection provides 100+ modules for managing vSphere: idempotent, declarative, and well integrated with AWX/Tower.
### Installing the Ansible VMware Collection
# Requirements: Python 3.8+, Ansible 2.12+, and the pyVmomi Python library on the control node
ansible-galaxy collection install community.vmware
# Verify
ansible-galaxy collection list | grep vmware
### Ansible Playbook — ESXi Host Hardening (playbook-esxi-hardening.yml)
---
- name: ESXi Security Hardening
  hosts: localhost
  gather_facts: false
  vars:
    vcenter_hostname: vcsa-01.lab.local
    vcenter_username: "{{ vault_vcenter_user }}"  # Stored in Ansible Vault
    vcenter_password: "{{ vault_vcenter_password }}"
    datacenter: DC-HaNoi-01
  tasks:
    # cluster_name already targets every host in the cluster, so no loop is needed
    - name: Disable SSH on all ESXi hosts
      community.vmware.vmware_host_service_manager:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01
        state: absent
        service_name: TSM-SSH
    - name: Configure NTP servers
      community.vmware.vmware_host_ntp:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01
        state: present
        ntp_servers: ["10.100.1.10", "10.100.1.11"]
    - name: Enable Lockdown Mode (Normal)
      community.vmware.vmware_host_lockdown:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        cluster_name: CL-HN-Prod-01
        state: normal  # normal | strict | disabled
    # Assumes an 'esxi_hosts' inventory group that lists the ESXi host names
    - name: Configure Syslog forwarding
      community.vmware.vmware_host_config_manager:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        validate_certs: true
        esxi_hostname: "{{ item }}"
        options:
          'Syslog.global.logHost': "ssl://siem.hoatranlab.io.local:6514"
      loop: "{{ groups['esxi_hosts'] }}"
Checklist Automation Readiness
Checklist vROps Operations
Congratulations on completing the course!
You have completed 14 sessions on VMware vSphere 8.0.3, from basic ESXi installation to enterprise-grade operations.
Next step: take the VCP-DCV certification exam (2V0-21.23)