Phase 10 — Monitoring & Troubleshooting — Proxmox VE 9


Phase 10: Monitoring & Troubleshooting

Lesson 10.1: Logs (journalctl, /var/log, syslog)

Core theory

Important logs:

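Source	Contents
`/var/log/syslog`	Aggregated system log (present only if rsyslog is installed; otherwise use `journalctl`)
`/var/log/daemon.log`	Daemons (pvedaemon, pveproxy, ...); also depends on rsyslog
`/var/log/pveproxy/access.log`	Web GUI access
`journalctl -u pvedaemon`	pvedaemon service
`journalctl -u pve-cluster`	pmxcfs
`journalctl -u corosync`	Cluster
`/var/log/ceph/ceph-*.log`	Ceph (if installed)
`pvenode task list --vmid <VMID>`	Task history for one VM (read a task with `pvenode task log <UPID>`)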

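Hands-on exercise

Debug why VM 100 will not start:

# 1. VM task log: list failed tasks for VM 100, then read one by its UPID
pvenode task list --vmid 100 --errors
pvenode task log <UPID>

# 2. Real-time journal
journalctl -u pveproxy -f

# 3. Last 10 minutes, jumping to the end, with explanatory text
journalctl -xe --since '10 min ago'

# 4. Filter by priority
journalctl -p err -b    # errors since the current boot

# 5. Stream a remote node's log to local stdout over SSH
ssh pve02 'journalctl -u corosync -f'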

Sample output (illustrative; the exact wording depends on the failure)

$ pvenode task log <UPID>
QEMU [2026-04-22 11:30:05] starting vm 100
KVM: entry failed, hardware error 0x0
VM 100 qmp command 'query-proxmox-support' failed - client disconnected

Troubleshooting

KVM entry failed: VT-x/AMD-V is not enabled for the CPU; check the BIOS (or nested virtualization, if the host is itself a VM)
qemu-guest-agent not responding: restart the agent inside the VM
pveproxy 502 Bad Gateway: run systemctl restart pveproxy pvedaemon pvestatd
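
The three hints above can be turned into a small triage filter. This is only a sketch: `triage_hints` is a hypothetical helper name, and the match patterns are assumptions based on the error strings quoted in this lesson, not an exhaustive list of what Proxmox VE can print.

```shell
# Hypothetical helper: read log text on stdin and print one hint per
# known error pattern. The patterns are assumptions taken from the
# troubleshooting list in this lesson.
triage_hints() (
    text=$(cat)
    if printf '%s\n' "$text" | grep -qi 'KVM: entry failed'; then
        echo 'kvm: check that VT-x/AMD-V is enabled in the BIOS'
    fi
    if printf '%s\n' "$text" | grep -Eqi 'guest[- ]agent'; then
        echo 'agent: restart qemu-guest-agent inside the VM'
    fi
    if printf '%s\n' "$text" | grep -q '502 Bad Gateway'; then
        echo 'proxy: systemctl restart pveproxy pvedaemon pvestatd'
    fi
)

# Usage (on a PVE node):
#   journalctl -b -p err --no-pager | triage_hints
```

Feeding it the journal of the current boot gives a quick first guess; anything it does not recognise still needs to be read by hand.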

Lesson 10.2: Performance tuning with pveperf, iostat

Core theory

Measurement tools:

pveperf: I/O + CPU benchmark
iostat: disk stats
iftop: network per-connection
htop: process + CPU
ceph -s / ceph osd perf: Ceph stats

Hands-on exercise

Benchmark a node:

# CPU + I/O benchmark
pveperf /var/lib/vz

# Detailed disk latency (5 samples, 2 seconds apart)
iostat -x 2 5

# Ceph performance
ceph osd perf

# Network bandwidth test between two nodes
# node A:
iperf3 -s
# node B:
iperf3 -c 10.0.0.11 -t 30 -P 4

Sample output

$ pveperf /var/lib/vz
CPU BOGOMIPS:      96000 (48 core * 2000)
REGEX/SECOND:      6540000
HD SIZE:           98.00 GB (/dev/mapper/pve-data)
FSYNCS/SECOND:     12500
DNS EXT:           15 ms
DNS INT:           5 ms

$ iperf3 -c 10.0.0.11 -t 30 -P 4
[SUM]   0.00-30.00  sec  34.7 GBytes  9.94 Gbits/sec   0             sender
[SUM]   0.00-30.00  sec  34.6 GBytes  9.92 Gbits/sec                 receiver
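
A quick way to sanity-check an iperf3 summary line is to recompute the transfer size from the bitrate and the duration. The sketch below assumes iperf3's usual convention of decimal units for bit rates and binary units (1024^3 bytes) for "GBytes"; `expected_gbytes` is a hypothetical helper name.

```shell
# Expected transfer in GBytes for a given rate (Gbit/s) and duration (s).
# Assumes bit rates are decimal (1 Gbit = 10^9 bits) and GBytes are
# binary (1 GByte = 1024^3 bytes).
expected_gbytes() {
    awk -v rate="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", rate * 1e9 * secs / 8 / (1024 ^ 3) }'
}

# Usage:
#   expected_gbytes 9.94 30
```

If the transfer column of a report differs from this figure by an order of magnitude, the line was most likely mistyped or the test ran for a different duration than expected.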

Tuning tips:

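Disk: enable discard=on for SSD-backed disks (TRIM)
Network: MTU 9000 for Ceph (a gain of roughly 10%; verify with a benchmark on your own network)
CPU: enable NUMA topology with qm set <id> --numa 1; pin vCPUs to host cores with qm set <id> --affinity <cores>
Memory: disable swap on production hosts (swapoff -a, and remove the swap entry from /etc/fstab so it stays off after a reboot)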
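
Before relying on MTU 9000, it helps to confirm that jumbo frames actually pass end to end. A minimal sketch, assuming IPv4 and Linux iputils `ping`: the largest ICMP payload that fits in one frame is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. `ping_payload_for_mtu` is a hypothetical helper name.

```shell
# Largest ICMP payload that fits in a single frame at the given MTU
# (IPv4 only: 20-byte IP header + 8-byte ICMP header).
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Usage (10.0.0.11 is the peer node from the iperf3 exercise):
#   ping -M do -s "$(ping_payload_for_mtu 9000)" -c 3 10.0.0.11
# -M do forbids fragmentation, so the ping fails if any hop has a smaller MTU.
```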

Lesson 10.3: Common incidents and how to handle them

Core theory

The ten incidents seen most often in production, and how to handle each:

1. Node loses quorum after reboot

# Symptom: cannot log in to the Web GUI, /etc/pve is read-only
# Cause: corosync was not up before pmxcfs
# Fix:
systemctl restart corosync pve-cluster
# If that does not help:
pvecm expected 1   # TEMPORARY: lets a single node operate
# After fixing the network, restore it: pvecm expected <actual node count>

2. Disk storage full (local-zfs)

# Symptom: VM will not start, reports "no space left"
# Fix: see what is using the space
zfs list -o space
# Delete old snapshots:
zfs list -t snapshot | grep vm-100
zfs destroy tank/vm-100-disk-0@old-snap
# Grow the pool (adds a new top-level vdev; match the pool's existing redundancy):
zpool add tank /dev/sdh  # new disk
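
Catching a pool before it fills up is cheaper than cleaning up afterwards. The sketch below filters the tab-separated output of `zpool list -H -o name,capacity` and prints pools at or above a threshold; `pools_over` is a hypothetical helper name and the 80% threshold is only an example value.

```shell
# Print the names of pools whose capacity is at or above a threshold.
# Expects "<pool><TAB><capacity>%" lines on stdin, the shape produced by
# `zpool list -H -o name,capacity`.
pools_over() {
    awk -F '\t' -v limit="$1" '{
        cap = $2
        sub(/%/, "", cap)          # strip the percent sign
        if (cap + 0 >= limit) print $1
    }'
}

# Usage (on a node with ZFS):
#   zpool list -H -o name,capacity | pools_over 80
```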

3. Ceph OSD down

# Symptom: ceph -s reports HEALTH_WARN, 1 osd down
# Check:
systemctl status ceph-osd@3
journalctl -u ceph-osd@3 -n 50
# Restart:
systemctl restart ceph-osd@3
# If the disk has failed, take the OSD out and stop it before destroying it:
ceph osd out 3
systemctl stop ceph-osd@3
pveceph osd destroy 3
pveceph osd create /dev/sdf  # new disk

4. VM hangs and will not shut down

# Force stop
qm stop 100 --skiplock
# Last resort if qm stop itself hangs: kill the QEMU process
kill -9 $(cat /var/run/qemu-server/100.pid)
# Clear the stale lock
qm unlock 100

5. Cluster split-brain (2 vs 1)

# The minority node loses quorum (/etc/pve becomes read-only)
# Fix: repair the network, then reboot the minority node
# If both partitions are still active: this is a serious condition; shut down
# the wrong partition before rejoining
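
The "2 vs 1" case follows from simple majority arithmetic: a partition keeps quorum only if it holds more than half of the votes. A minimal sketch, assuming one vote per node and no QDevice; the function names are hypothetical.

```shell
# Votes needed for quorum in a cluster with N total votes: floor(N/2) + 1.
# Assumes one vote per node and no QDevice.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if a partition with $1 votes is quorate out of $2 total votes.
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

# Usage:
#   has_quorum 2 3 && echo "majority side keeps running"
#   has_quorum 1 3 || echo "minority side goes read-only"
```

The same arithmetic shows why a two-node cluster cannot survive the loss of either node without extra help: `quorum_needed 2` is 2.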

6. Backup fails with "vma write failed"

# Usually the target storage is full or has a permission problem
df -h /mnt/pve/backup-nfs
# Check NFS permissions: chown nobody:nogroup on the NFS server

7. High I/O wait (iowait > 20%)

iotop -aoP  # see which processes are generating I/O
# Check Ceph slow ops (replace X with the OSD id, run on that OSD's node):
ceph daemon osd.X ops
# Tuning: jumbo-frame MTU, NVMe instead of HDD
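
To check the 20% iowait threshold without extra tools, the percentage can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. This is a sketch under the usual Linux field layout (user, nice, system, idle, iowait, irq, softirq, steal); `iowait_pct` is a hypothetical helper name.

```shell
# Compute iowait percentage between two samples of the "cpu" line from
# /proc/stat. Fields 2..9 are user, nice, system, idle, iowait, irq,
# softirq, steal; iowait is field 6.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; w[NR] = $6 }
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
              else print "0.0" }'
}

# Usage:
#   a=$(head -n1 /proc/stat); sleep 2; b=$(head -n1 /proc/stat)
#   iowait_pct "$a" "$b"
```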

8. Windows VM BSOD after restore

# Usually an incompatible driver
# Boot the VM into Safe Mode and uninstall the old storage driver
# Install the virtio-scsi driver from the virtio-win ISO

9. HA loop: VM restarts continuously

ha-manager status
# If it is a restart loop: check `max_restart` in the resource config
ha-manager set vm:100 --max_restart 1
# Take the VM out of HA temporarily:
ha-manager remove vm:100

10. Upgrade from VE 8 to VE 9 fails

# Always run pve8to9 before upgrading
pve8to9 --full
# Review every WARN/FAIL, fix them, switch the APT sources to trixie, then upgrade:
apt update
apt dist-upgrade  # USE dist-upgrade, NOT upgrade
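
The "fix every WARN/FAIL first" rule can be enforced with a small gate in front of the upgrade. This sketch assumes the checker prefixes result lines with `WARN: ` and `FAIL: ` and that piped output carries no colour codes; both are assumptions to verify against the actual output on your node. `upgrade_gate` is a hypothetical helper name.

```shell
# Count WARN/FAIL result lines on stdin and exit non-zero if any exist.
# Assumes result lines start with "WARN: " or "FAIL: " (the trailing
# ": " keeps summary lines such as "WARNINGS: 2" from being counted).
upgrade_gate() {
    awk '
        /^WARN: / { w++ }
        /^FAIL: / { f++ }
        END { printf "warn=%d fail=%d\n", w + 0, f + 0
              exit ((w + f) > 0 ? 1 : 0) }'
}

# Usage (on the node to be upgraded):
#   pve8to9 --full | upgrade_gate && echo "checklist clean, safe to continue"
```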

Hands-on exercise

Simulate three incidents in the lab:

Shut down node pve02 unexpectedly → observe HA failover + cluster status
Fill /var/lib/vz to 99% → see how the VMs are affected
Block the corosync network with iptables → check cluster behavior

Expected output

After each incident, write a post-mortem covering:

Symptom description
Root cause
Remediation
Prevention for next time

Real-world application

Maintain an internal runbook for the sysadmin team covering the ten incidents above and the workflow for handling each. Review it every quarter and update it for new versions.
