Phase 12 — IaC + Observability + Hybrid Cloud

Lesson 12.1: Automating infrastructure with Terraform + Ansible

Core theory

IaC (Infrastructure as Code) for Proxmox means defining VMs, LXC containers, and networks in code, versioning that code in Git, and applying it idempotently.

Division of roles:

Terraform (HashiCorp): creates and destroys resources (VM, network, storage)
Ansible: configures the OS inside the VM (packages, users, services)

Providers:

bpg/proxmox (community-maintained, the most active as of 2025-2026): registry.terraform.io/providers/bpg/proxmox
Telmate/proxmox (legacy, rarely updated): avoid for new projects

Authentication: use an API token, not the root password.

Hands-on exercise

Step 1: Create an API token on PVE:

pveum user add terraform@pve --password 'tokenP@ss'
pveum acl modify / --users terraform@pve --roles PVEAdmin
pveum user token add terraform@pve tf-token --privsep 0
# → save the secret UUID that is returned; it is shown only once
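
To make the token format concrete, here is a minimal sketch of the HTTP header a client would send with this token. Proxmox VE expects API tokens in the form `PVEAPIToken=USER@REALM!TOKENID=SECRET`; the helper name `pve_auth_header` is hypothetical and the secret is the same placeholder used in this lesson.

```python
def pve_auth_header(user: str, token_id: str, secret: str) -> dict[str, str]:
    """Build the Authorization header used for Proxmox VE API tokens."""
    return {"Authorization": f"PVEAPIToken={user}!{token_id}={secret}"}


# Placeholder secret; the real value is the UUID printed by `pveum user token add`
header = pve_auth_header("terraform@pve", "tf-token", "xxxxxxxx-xxxx")
print(header["Authorization"])
```

The part after `PVEAPIToken=` is the same string that goes into the provider's `api_token` argument.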

Step 2: Terraform configuration (main.tf):

terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = "~> 0.68"
    }
  }
}

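# Declared so that var.root_password below resolves; supply it via TF_VAR_root_password
variable "root_password" {
  type      = string
  sensitive = true
}

provider "proxmox" {
  endpoint  = "https://10.0.0.11:8006/"
  api_token = "terraform@pve!tf-token=xxxxxxxx-xxxx" # placeholder; keep the real token out of Git
  insecure  = true                                   # skips TLS verification; lab use with self-signed certs only
  ssh {
    agent    = false
    username = "root"
    password = var.root_password
  }
}

# Clone VMs from template 9000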
resource "proxmox_virtual_environment_vm" "web" {
  count     = 3
  name      = "web-${count.index + 1}"
  node_name = "pve0${(count.index % 3) + 1}"
  clone {
    vm_id = 9000
    full  = true
  }
  memory   { dedicated = 2048 }
  cpu      { cores = 2 }
  initialization {
    ip_config {
      ipv4 {
        address = "10.0.10.${50 + count.index}/24"
        gateway = "10.0.10.1"
      }
    }
    user_account {
      username = "ubuntu"
      keys     = [trimspace(file("~/.ssh/id_rsa.pub"))] # trimspace drops the trailing newline of the key file
    }
  }
  network_device {
    bridge  = "vmbr1"
    vlan_id = 10
  }
}

output "web_ips" {
  value = [for v in proxmox_virtual_environment_vm.web :
           v.initialization[0].ip_config[0].ipv4[0].address]
}
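
The `count` expressions above are easier to check when evaluated by hand. The following sketch mirrors them in Python to show which name, node, and address each of the three instances receives; `plan_web_vms` is a hypothetical helper for illustration, not something Terraform provides.

```python
def plan_web_vms(count: int = 3) -> list[dict[str, str]]:
    """Mirror the HCL expressions for name, node_name and ipv4 address."""
    vms = []
    for index in range(count):
        vms.append({
            "name": f"web-{index + 1}",               # "web-${count.index + 1}"
            "node_name": f"pve0{(index % 3) + 1}",    # round-robin across pve01..pve03
            "address": f"10.0.10.{50 + index}/24",    # "10.0.10.${50 + count.index}/24"
        })
    return vms


for vm in plan_web_vms():
    print(vm)
```

With a count above 3, the modulo wraps around, so a fourth VM would land on pve01 again.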

Apply:

terraform init
terraform plan
terraform apply -auto-approve
# → the three VMs web-1, web-2, web-3 are created automatically

terraform destroy  # tear everything down

Step 3: Ansible inventory (inventory.yml), written statically here to match the IPs assigned by Terraform:

all:
  hosts:
    web-1: { ansible_host: 10.0.10.50 }
    web-2: { ansible_host: 10.0.10.51 }
    web-3: { ansible_host: 10.0.10.52 }
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/id_rsa
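
The inventory above is typed by hand. As a rough sketch of how it could be generated instead, the snippet below builds the same structure from the `web_ips` output of the Terraform configuration. The function name `inventory_from_output` is hypothetical, and the input list is hard-coded here where a real script would read `terraform output -json web_ips`.

```python
import json


def inventory_from_output(web_ips: list[str], user: str = "ubuntu") -> dict:
    """Turn CIDR strings such as 10.0.10.50/24 into an Ansible inventory structure."""
    hosts = {
        f"web-{i + 1}": {"ansible_host": cidr.split("/")[0]}  # drop the /24 suffix
        for i, cidr in enumerate(web_ips)
    }
    return {"all": {"hosts": hosts, "vars": {"ansible_user": user}}}


# Stand-in for the JSON printed by: terraform output -json web_ips
web_ips = json.loads('["10.0.10.50/24", "10.0.10.51/24", "10.0.10.52/24"]')
inventory = inventory_from_output(web_ips)
print(json.dumps(inventory, indent=2))
```

JSON is a subset of YAML, so the printed structure has the same shape as inventory.yml.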

Playbook that installs Nginx (site.yml):

- hosts: all
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        update_cache: yes
        state: present
    - name: Start nginx
      service: { name: nginx, state: started, enabled: yes }
    - name: Deploy index.html
      copy:
        dest: /var/www/html/index.html
        content: "<h1>{{ inventory_hostname }}</h1>"

Run:

ansible-playbook -i inventory.yml site.yml

Expected output

$ terraform apply
Plan: 3 to add, 0 to change, 0 to destroy.
proxmox_virtual_environment_vm.web[0]: Creation complete after 45s
proxmox_virtual_environment_vm.web[1]: Creation complete after 47s
proxmox_virtual_environment_vm.web[2]: Creation complete after 46s
Outputs:
web_ips = ["10.0.10.50/24", "10.0.10.51/24", "10.0.10.52/24"]

$ curl http://10.0.10.50
<h1>web-1</h1>

Troubleshooting

"Error: 501 no such node": node_name is wrong; list the valid names with pvesh get /nodes
Cloud-init settings not applied: the source template is missing --ciuser/--sshkeys; set them on the VM before converting it with qm template
Ansible fails on SSH: check StrictHostKeyChecking; in a lab you can export ANSIBLE_HOST_KEY_CHECKING=False

Real-world application

Scenario: Every sprint, developers need 10 temporary test VMs, and an admin spends 2 hours per sprint creating them by hand.

Solution: A Terraform module dev-env with the variables vm_count, vlan, and size. A developer runs terraform apply -var vm_count=10, has all 10 VMs provisioned by Ansible within about 5 minutes, uses them for 3 days, then runs terraform destroy.

Benefits: About 95% less provisioning time, versioned code, and an easy audit trail of who created what and when.
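
The 95% figure follows directly from the two durations in the scenario. A short worked calculation, using only the 2 hours and 5 minutes quoted above:

```python
manual_minutes = 2 * 60   # 2 hours of manual work per sprint, from the scenario
automated_minutes = 5     # time until terraform apply has delivered the VMs

reduction = 1 - automated_minutes / manual_minutes
print(f"Provisioning time reduced by {reduction:.1%}")
```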

Lesson 12.2: Centralized monitoring with Prometheus + Grafana + Loki

Core theory

The three pillars of the observability stack:

Pillar	Tool	Data
Metrics	Prometheus	Numeric time series (CPU %, RAM, disk IOPS)
Logs	Loki	Log text (syslog, application logs)
Traces	Tempo / Jaeger	Distributed traces (advanced)

Visualization: Grafana (one dashboard layer for all three).

How Proxmox exposes metrics:

InfluxDB line protocol: built in; pushes directly to InfluxDB, or reaches Prometheus through a relay
PVE Exporter (github.com/prometheus-pve/prometheus-pve-exporter): pull mode for Prometheus, the most common choice
Graphite: built in, push only

Reference architecture:

  Proxmox nodes          Prometheus          Grafana
  pve01 ─┐                  │                   │
  pve02 ─┼──→ PVE Exporter ─┤                   │
  pve03 ─┘     :9221        ├──→ scrape every 30s
                            │                   │
  PBS node ──→ pbs-exporter ┤                   │
                            │                   │
  Linux VM ──→ node_exporter┘                   │
                                                ▼
                            VM Grafana ────→ Dashboard UI :3000
                                     │
                                     └──→ Loki ←─── Promtail (VM)

Hands-on exercise

Step 1: Deploy the Prometheus + Grafana stack (using Docker Compose on the monitoring VM):

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:11.4.0
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: 'Gr@fana123'
    volumes:
      - grafana-data:/var/lib/grafana

  loki:
    image: grafana/loki:3.3.0
    ports: ["3100:3100"]

  pve-exporter:
    image: prompve/prometheus-pve-exporter:latest
    ports: ["9221:9221"]
    volumes:
      - ./pve.yml:/etc/prometheus/pve.yml:ro  # the exporter image reads its config from /etc/prometheus/pve.yml by default

volumes:
  prom-data:
  grafana-data:

Step 2: Configure the PVE exporter (pve.yml):

default:
  user: monitor@pve
  token_name: monitor-token
  token_value: 'xxxxxxxx-xxxx-xxxx'
  verify_ssl: false

Step 3: Prometheus scrape configuration (prometheus.yml):

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
      cluster: ['1']
      node: ['1']
    static_configs:
      - targets: [10.0.0.11, 10.0.0.12, 10.0.0.13]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: pve-exporter:9221

  - job_name: node-exporter
    static_configs:
      - targets: [10.0.10.50:9100, 10.0.10.51:9100, 10.0.10.52:9100]
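
The three relabel rules in the pve job are the part of this file that most often causes confusion. The sketch below replays them on a plain dictionary to show what each target looks like afterwards. It is a simplified model of Prometheus relabeling, and the function name `relabel` is hypothetical.

```python
def relabel(target: str, exporter: str = "pve-exporter:9221") -> dict[str, str]:
    """Replay the three relabel_configs rules of the pve job on one target."""
    labels = {"__address__": target}
    # Rule 1: copy the node address into the ?target= URL parameter
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the node address as the visible instance label
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual scrape request to the exporter instead of the node
    labels["__address__"] = exporter
    return labels


for node in ["10.0.0.11", "10.0.0.12", "10.0.0.13"]:
    result = relabel(node)
    print(f'GET http://{result["__address__"]}/pve?target={result["__param_target"]} '
          f'-> instance={result["instance"]}')
```

Every request goes to the single exporter, while the `instance` label still identifies the Proxmox node, which is what dashboards filter on.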

Step 4: Grafana dashboard:

Log in at http://monitor-vm:3000 (admin/Gr@fana123)
Add data sources: Prometheus (http://prometheus:9090) and Loki (http://loki:3100)
Import dashboard ID 10347 (Proxmox via Prometheus) or 11510 (PVE cluster)

Step 5: Alerting (Prometheus rules; the file must be mounted into the Prometheus container and listed under rule_files in prometheus.yml):

# alerts.yml
groups:
  - name: pve-critical
    rules:
      - alert: NodeDown
        expr: up{job="pve"} == 0
        for: 2m
        annotations:
          summary: "Proxmox node {{ $labels.instance }} down"
      - alert: HighCPU
        expr: pve_cpu_usage_ratio > 0.90
        for: 10m
      - alert: StorageFull
        expr: (pve_disk_usage_bytes / pve_disk_size_bytes) > 0.85
        for: 5m
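      - alert: CephHealthWarn
        # ceph_health_status: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
        # (exposed by the Ceph mgr prometheus module, which needs its own scrape job)
        expr: ceph_health_status >= 1
        for: 5m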
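
The `for:` clause in these rules means the expression must stay true for the whole duration before the alert fires; a single dip resets the timer. Here is a simplified sketch of that behaviour, assuming one sample per minute. The function `fires` and the sample values are made up for illustration, and real Prometheus checks the condition at each rule evaluation interval rather than per sample.

```python
def fires(samples: list[tuple[int, float]], threshold: float, for_seconds: int) -> bool:
    """samples are (unix_time, value) pairs in time order.

    Returns True once the value has stayed above threshold for for_seconds.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp      # alert enters the pending state
            if timestamp - pending_since >= for_seconds:
                return True                    # pending long enough: firing
        else:
            pending_since = None               # condition broke: timer resets
    return False


# Hypothetical storage usage ratios, one sample per minute
usage = [(0, 0.80), (60, 0.86), (120, 0.87), (180, 0.84), (240, 0.88),
         (300, 0.90), (360, 0.91), (420, 0.92), (480, 0.93), (540, 0.95)]
print(fires(usage, threshold=0.85, for_seconds=300))
```

The dip to 0.84 at second 180 resets the timer, so the StorageFull condition fires only at second 540, five minutes after the value crossed 0.85 for good.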

Step 6: Log aggregation with Loki + Promtail:

Install Promtail on every PVE node:

# /etc/promtail/config.yml
clients:
  - url: http://monitor-vm:3100/loki/api/v1/push

scrape_configs:
  - job_name: pve-syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: pve01
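          # /var/log/syslog exists only when rsyslog is installed; recent Debian-based PVE releases may log to the journal only
          __path__: /var/log/syslog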
  - job_name: pve-daemon
    journal:
      matches: _SYSTEMD_UNIT=pvedaemon.service
      labels: { job: pvedaemon }
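
Promtail posts to the `/loki/api/v1/push` URL configured above. As a rough illustration of what such a request carries, this sketch builds a JSON body in the shape the Loki push API accepts: a list of streams, each with a label set and `[timestamp_ns, line]` pairs. The helper name `loki_payload` and the sample log line are hypothetical, and nothing is sent over the network.

```python
import json
import time
from typing import Optional


def loki_payload(labels: dict[str, str], lines: list[str],
                 now_ns: Optional[int] = None) -> str:
    """Build a Loki push body: one stream, all lines sharing one timestamp."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    stream = {"stream": labels, "values": [[timestamp, line] for line in lines]}
    return json.dumps({"streams": [stream]})


body = loki_payload({"job": "syslog", "host": "pve01"},
                    ["pvedaemon[1234]: starting task"], now_ns=1)
print(body)
```

The `job` and `host` labels are the ones set in the Promtail config, and they are what you filter on later in Grafana.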

Expected output

$ curl -s 'http://pve-exporter:9221/pve?target=10.0.0.11' | head -20
# HELP pve_up Node/VM/Storage/... is online/running
# TYPE pve_up gauge
pve_up{id="node/pve01"} 1.0
pve_up{id="qemu/100"} 1.0
# HELP pve_cpu_usage_ratio CPU usage ratio
pve_cpu_usage_ratio{id="node/pve01"} 0.23
pve_cpu_usage_ratio{id="qemu/100"} 0.45
# HELP pve_memory_usage_bytes Memory usage
pve_memory_usage_bytes{id="node/pve01"} 137438953472
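
The exporter output is plain text, so it is easy to inspect without Prometheus. Below is a minimal sketch that parses a few of the lines shown above and picks out the entries with a CPU ratio over 0.4. The regular expression handles only the single-label `id="..."` form seen in this sample, not the full exposition format.

```python
import re

SAMPLE = """\
# HELP pve_up Node/VM/Storage/... is online/running
# TYPE pve_up gauge
pve_up{id="node/pve01"} 1.0
pve_up{id="qemu/100"} 1.0
pve_cpu_usage_ratio{id="node/pve01"} 0.23
pve_cpu_usage_ratio{id="qemu/100"} 0.45
"""

LINE = re.compile(r'^(?P<name>\w+)\{id="(?P<id>[^"]+)"\}\s+(?P<value>\S+)$')


def parse_samples(text: str) -> dict[tuple[str, str], float]:
    """Map (metric name, id label) to the sample value."""
    samples = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # comment lines (# HELP / # TYPE) do not match
            samples[(match["name"], match["id"])] = float(match["value"])
    return samples


samples = parse_samples(SAMPLE)
busy = {key[1]: value for key, value in samples.items()
        if key[0] == "pve_cpu_usage_ratio" and value > 0.4}
print(busy)
```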

The Grafana dashboard shows CPU/RAM/disk/network for each VM and node, a Ceph OSD heatmap, backup job durations, and an alert timeline.

Troubleshooting

Exporter returns a 500 error: the token's role must include at least Sys.Audit, VM.Audit, and Datastore.Audit (the built-in PVEAuditor role covers these)
Missing metrics: some metrics are exposed only by the node collector and others only by the cluster collector; if a metric is missing, check which of cluster: ['1'] and node: ['1'] is set in the scrape params
Empty Grafana dashboard: check the $instance query variable and make sure it matches the instance label coming from Prometheus

Real-world application

Scenario: Three clusters with nine nodes in total; the admin cannot tell which node is running hot or which VM is eating RAM.

Solution: A Prometheus + Grafana + Loki stack on a single VM with 4 CPUs, 8 GB RAM, and 200 GB disk. Scrape every 30 s; retain metrics for 15 days and logs for 30 days.
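
To check whether 200 GB leaves enough room, the usual Prometheus rule of thumb is: disk = retention seconds × ingested samples per second × bytes per sample. The sketch below applies it with the 15-day retention and 30-second scrape interval from the scenario. The series count and bytes-per-sample figures are assumptions for illustration and should be replaced with measured values.

```python
RETENTION_DAYS = 15      # from the scenario
SCRAPE_INTERVAL_S = 30   # from the scenario
ACTIVE_SERIES = 50_000   # assumption: total series across the 9 nodes and their guests
BYTES_PER_SAMPLE = 2     # assumption: upper end of the 1-2 bytes Prometheus typically needs

samples_per_second = ACTIVE_SERIES / SCRAPE_INTERVAL_S
needed_bytes = RETENTION_DAYS * 86_400 * samples_per_second * BYTES_PER_SAMPLE
print(f"{needed_bytes / 1e9:.1f} GB")
```

Under these assumptions the metrics take only a few gigabytes, so most of the 200 GB disk would be available for the 30 days of Loki logs.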

Benefits:

Spotted a CPU spike at 3 a.m. and fixed the cron job causing the peak
An alert on Ceph OSD usage above 85% left time to add disks before they filled up
A CEO-ready dashboard: VM count, uptime, capacity trend

Lesson 12.3: Hybrid cloud between on-prem Proxmox and AWS (DR + burst)

Core theory

Hybrid cloud use cases:

DR site in the cloud: on-prem is primary, AWS is secondary (RTO of a few hours, RPO of about 5 minutes)
Burst workload: scale out to EC2 temporarily during peak traffic
Cold tier: archive backups to S3 Glacier

On-prem ↔ AWS network connectivity:

Option	Bandwidth	Latency	Cost	Use when
Site-to-Site VPN	≤1.25 Gbps per tunnel	~50 ms	~36 USD/month	SMB, PoC
Direct Connect	1–100 Gbps	<10 ms	300+ USD/month	Enterprise production
Transit Gateway	-	-	per attachment	Multi-VPC

DR approaches:

Backup-based DR (cheapest): PBS sync → AWS S3 → restore to EC2 when needed
Pilot light: core VMs (DNS, AD) already running on AWS; scale out on failover
Warm standby: a replica kept regularly in sync; scale up on failover
Multi-site active: both sites active at once (complex and expensive)

Hands-on exercise

Scenario: On-prem 3-node Proxmox cluster (Ho Chi Minh City), pilot-light DR on AWS Singapore (ap-southeast-1).

Step 1: Site-to-Site VPN:

# AWS side (console or CLI):
# 1. Create a Customer Gateway (the on-prem public IP)
aws ec2 create-customer-gateway \
    --type ipsec.1 --public-ip <ON_PREM_PUB_IP> --bgp-asn 65001

# 2. Create a Virtual Private Gateway and attach it to the VPC
aws ec2 create-vpn-gateway --type ipsec.1
aws ec2 attach-vpn-gateway --vpc-id vpc-xxx --vpn-gateway-id vgw-xxx

# 3. Create the VPN connection, then download the configuration for the on-prem router
aws ec2 create-vpn-connection \
    --type ipsec.1 \
    --customer-gateway-id cgw-xxx \
    --vpn-gateway-id vgw-xxx \
    --options StaticRoutesOnly=false

# On-prem side: configure the VPN on pfSense, FortiGate, or a Proxmox VM running strongSwan
# On-prem subnet: 10.0.0.0/16, AWS subnet: 172.16.0.0/16
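
Before bringing the tunnel up, it is worth confirming that the two address ranges do not overlap, because overlapping ranges make routing across the VPN ambiguous. The sketch below does this with Python's standard `ipaddress` module. It uses the two subnets stated above plus two sample hosts from this lesson series: a web VM at 10.0.10.50 and the EC2 AD DC at 172.16.10.5.

```python
import ipaddress

on_prem = ipaddress.ip_network("10.0.0.0/16")
aws_vpc = ipaddress.ip_network("172.16.0.0/16")

# Overlapping ranges would make routes across the tunnel ambiguous
print("overlap:", on_prem.overlaps(aws_vpc))

# Each sample host should belong to exactly one side
print(ipaddress.ip_address("10.0.10.50") in on_prem)
print(ipaddress.ip_address("172.16.10.5") in aws_vpc)
```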

Step 2: Back up to S3 (cold tier):

PBS 4.1 S3 backend (built in):

# On PBS (S3 datastores are a recent feature; confirm the exact subcommand and
# flag names with proxmox-backup-manager help on your version):
proxmox-backup-manager s3-endpoint create aws-sg \
    --endpoint s3.ap-southeast-1.amazonaws.com \
    --access-key AKIAxxx --secret-key xxx

proxmox-backup-manager datastore create dr-s3 /mnt/s3-cache \
    --backend-type s3 --s3-endpoint aws-sg \
    --bucket hoatranlab-dr

# Sync job: copy from the source datastore (big-ds in the command below) to dr-s3 every night
proxmox-backup-manager sync-job create sync-to-s3 \
    --store dr-s3 --remote local --remote-store big-ds \
    --schedule 'daily'

Step 3: Pilot light (AD/DNS on AWS):

Deploy 2 EC2 t3.small instances running Samba 4 AD DCs (replicating with the on-prem AD)
Route 53 Private Hosted Zone for the internal domain
Cost: ~30 USD/month

Step 4: Failover runbook (when the on-prem DC goes down):

# 1. Restore the critical VMs from the S3 backup onto EC2
# Launch one EC2 instance per critical VM (from a pre-sized template)
aws ec2 run-instances --image-id ami-xxx \
    --instance-type m5.large --count 1 \
    --subnet-id subnet-dr --security-group-ids sg-xxx \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Role,Value=web-prod-dr}]'

# 2. Fetch the backup archive from the S3 bucket and unpack it
# Note: a PBS datastore holds deduplicated chunks, not per-VM tarballs, so this step
# assumes an archive exported separately to this key; otherwise restore through a PBS
# instance attached to the bucket.
aws s3 cp s3://hoatranlab-dr/vm/100/latest.tar.zst /tmp/
mkdir -p /restore
tar --zstd -xf /tmp/latest.tar.zst -C /restore   # the archive is zstd, so -z (gzip) would fail

# 3. Switch DNS: Route 53 failover record
aws route53 change-resource-record-sets --hosted-zone-id Zxxx \
    --change-batch file://failover-to-aws.json

# 4. Notify users and verify
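
The runbook references failover-to-aws.json without showing it. Here is a minimal sketch of what such a change batch could contain, built as a Python dictionary in the structure that `change-resource-record-sets` accepts. The record name and DR address are hypothetical placeholders, and this is a plain UPSERT of an A record rather than a health-check-based failover routing policy.

```python
import json


def failover_change_batch(name: str, dr_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that repoints one A record at the DR site."""
    return {
        "Comment": "Fail over to the AWS DR site",
        "Changes": [{
            "Action": "UPSERT",              # create the record, or overwrite it if present
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,                  # short TTL so clients pick up the change quickly
                "ResourceRecords": [{"Value": dr_ip}],
            },
        }],
    }


# Hypothetical record name and EC2 private address
batch = failover_change_batch("app.example.internal.", "172.16.10.20")
print(json.dumps(batch, indent=2))
```

Writing the printed JSON to failover-to-aws.json ahead of time keeps the failover step itself to a single command.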

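Step 5: Burst scaling (advanced):

Kubernetes cluster spanning both sites:

On-prem: 3 control-plane nodes + 6 workers (Proxmox VMs)
AWS: add up to 6 workers (EC2 Spot) when CPU exceeds 80%
Network: VPN + Calico BGP peering between on-prem and AWS
Autoscaling through Cluster Autoscaler with mixed node groups (it adds nodes when pods cannot be scheduled, so pair it with an HPA that scales pods on CPU)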
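
The burst rule above (more AWS workers when CPU passes 80%, capped at 6) can be written as a small decision function. This is only an illustration of the policy, not how Cluster Autoscaler is implemented. The step size of 2 and the 50% scale-in threshold are assumptions that do not come from the lesson.

```python
def extra_workers(avg_cpu: float, current_extra: int,
                  max_extra: int = 6, step: int = 2) -> int:
    """Return how many AWS workers should exist after this evaluation."""
    if avg_cpu > 0.80:                    # burst threshold from the lesson
        return min(current_extra + step, max_extra)
    if avg_cpu < 0.50:                    # assumption: scale-in threshold
        return max(current_extra - step, 0)
    return current_extra                  # between thresholds: hold steady


state = 0
for cpu in [0.70, 0.85, 0.90, 0.92, 0.95, 0.60, 0.40]:
    state = extra_workers(cpu, state)
    print(f"cpu={cpu:.0%} -> {state} AWS workers")
```

The gap between the two thresholds keeps the node count from flapping when the load hovers around 80%.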

Expected output

$ aws ec2 describe-vpn-connections --query 'VpnConnections[*].[VpnConnectionId,State]'
[
    ["vpn-0abc123", "available"]
]

$ ping 172.16.10.5  # EC2 AD DC, pinged from on-prem
PING 172.16.10.5: 64 bytes from 172.16.10.5: icmp_seq=1 ttl=253 time=48.2 ms
64 bytes from 172.16.10.5: icmp_seq=2 ttl=253 time=47.8 ms

$ proxmox-backup-manager sync-job status sync-to-s3
┌────────────┬──────────┬─────────┬──────────┐
│ Last Run   │ Duration │ Status  │ Size     │
├────────────┼──────────┼─────────┼──────────┤
│ 2026-04-23 │ 02:34:12 │ Success │ 78.4 GiB │
│ 02:00:00   │          │         │ (dedup)  │
└────────────┴──────────┴─────────┴──────────┘

Troubleshooting

VPN tunnel down: check that the firewall allows UDP 500/4500, and check the MTU (AWS recommends 1436)
Slow S3 upload: increase the number of threads in PBS, or use S3 Transfer Acceleration
High cross-region latency: pick the nearest region (Singapore or Tokyo for Vietnam) and use multiple VPN tunnels
DR test fails: run a DR drill at least twice a year rather than only reading the documentation
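
A DR drill is only useful if its outcome is measured. The sketch below derives the achieved RPO and RTO from three timestamps recorded during a drill: the last successful backup, the simulated outage, and the moment service was restored. The function name `measure_drill` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def measure_drill(last_backup: datetime, outage: datetime,
                  restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for one drill."""
    rpo = outage - last_backup   # data written in this window is lost
    rto = restored - outage      # how long the service was unavailable
    return rpo, rto


# Hypothetical drill timeline
rpo, rto = measure_drill(
    last_backup=datetime(2026, 4, 23, 2, 0),
    outage=datetime(2026, 4, 23, 2, 40),
    restored=datetime(2026, 4, 23, 4, 25),
)
print(f"RPO={rpo}, RTO={rto}")
```

Comparing these measured values with the targets in the runbook shows whether the runbook needs updating.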

Real-world application

Scenario: A small bank whose compliance rules require a DR site at least 50 km away, and which does not want to invest in a second data center.

Solution:

Primary: 3-node PVE cluster in Ho Chi Minh City (production)
DR on AWS Singapore (pilot light):
PBS sync to S3 every hour (~50 GB/hour incremental)
2 EC2 t3.medium instances running DNS/AD replicas (30 USD/month)
Failover runbook: 2 h RTO, 1 h RPO
Cost: VPN 36 USD + S3 20 TB × 0.023 USD/GB-month ≈ 500 USD/month
DR drill: every 6 months, restore 5 critical VMs, measure the actual RTO/RPO, and update the runbook
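
The monthly cost can be reproduced from the individual figures in the list. The worked calculation below uses the VPN, S3, and EC2 numbers as quoted, and assumes 1 TB = 1000 GB. The per-GB rate is the one used in this lesson and should be checked against current regional pricing.

```python
VPN_USD = 36                 # Site-to-Site VPN, as quoted
S3_TB = 20                   # stored backup volume, as quoted
S3_USD_PER_GB_MONTH = 0.023  # rate used in this lesson; verify for your region
EC2_USD = 30                 # the two DNS/AD replicas, as quoted

s3_usd = S3_TB * 1000 * S3_USD_PER_GB_MONTH   # assumes 1 TB = 1000 GB
total = VPN_USD + s3_usd + EC2_USD
print(f"S3 {s3_usd:.0f} + VPN {VPN_USD} + EC2 {EC2_USD} = {total:.0f} USD/month")
```

S3 storage is by far the largest component, so the amount of retained backup data is the main lever on the bill.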

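Benefits:

An estimated ~80% saving compared with building a second DC (~500 USD/month, about 6,000 USD/year, versus 20,000 USD CAPEX before any running costs)
Passes the compliance audit (a site at least 50 km away exists)
Pay-as-you-go: costs scale up only during an actual failover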

Course summary

After completing all 12 phases, you have:

✅ Deployed a 3-node Proxmox VE 9 cluster from scratch
✅ Set up Ceph HCI, SDN Fabric, HA, and PBS backup
✅ Managed VMs, LXC, and OCI containers
✅ Handled production incidents
✅ Gained the confidence to migrate workloads from VMware to Proxmox

Next steps

Take the Proxmox Certified Engineer exam (once the vendor launches an official program)
Join the community forum: https://forum.proxmox.com
Contribute bug reports and patches at https://bugzilla.proxmox.com
Go further: Proxmox Mail Gateway (PMG), Proxmox Datacenter Manager (PDM)

Additional resources

Official Wiki: https://pve.proxmox.com/wiki/Main_Page
Admin Guide PDF: https://pve.proxmox.com/pve-docs/pve-admin-guide.pdf
Video training: https://www.proxmox.com/en/services/training-courses/videos
Blog HoaTranLab: https://hoatranlab.io.vn/proxmox
