Baseline benchmark: measure before you tune
The golden rule of tuning
Measure a baseline → tune one parameter → measure again → compare. Do NOT tune several things at once, or you will not know which change had the effect.
Measuring IOPS & latency with fio
# Install fio
apt install fio -y
# Test 1: Random Read 4K (peak IOPS)
fio \
--name=randread \
--filename=/dev/rbd0 \
--ioengine=libaio \
--direct=1 \
--rw=randread \
--bs=4k \
--iodepth=64 \
--numjobs=4 \
--time_based \
--runtime=60 \
--group_reporting
# Test 2: Random Write 4K (WARNING: this overwrites data on /dev/rbd0; use a scratch image)
fio \
--name=randwrite \
--filename=/dev/rbd0 \
--ioengine=libaio \
--direct=1 \
--rw=randwrite \
--bs=4k \
--iodepth=64 \
--numjobs=4 \
--time_based \
--runtime=60 \
--group_reporting
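# Test 3: Sequential Read 1MB (throughput)
# libaio + direct=1 are required: the default sync engine ignores iodepth, and buffered reads hit the page cache
fio \
--name=seqread \
--filename=/dev/rbd0 \
--ioengine=libaio \
--direct=1 \
--rw=read \
--bs=1M \
--iodepth=16 \
--numjobs=1 \
--time_based \
--runtime=60 \
--group_reporting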
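# Built-in rados bench (keep the written objects so the seq/rand read tests have data)
rados bench -p vm-storage 60 write --no-cleanup
rados bench -p vm-storage 60 seq
rados bench -p vm-storage 60 rand
# Remove the benchmark objects afterwards
rados -p vm-storage cleanup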
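To apply the measure, tune, measure again rule, it helps to reduce each run to one number and express the difference as a percentage. The following is a minimal sketch, assuming you have noted the IOPS of a baseline run and a tuned run; `pct_change` is a hypothetical helper name and the figures are illustrative, not measurements.

```shell
# pct_change BASELINE TUNED
# Prints the change of TUNED relative to BASELINE, in percent.
# The inputs can be read off fio's summary, or extracted from a run made with
# --output-format=json --output=FILE (with --group_reporting the aggregated
# read IOPS is at .jobs[0].read.iops in that file).
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Hypothetical values: 100000 IOPS before the change, 125000 after
pct_change 100000 125000   # prints 25.0
```

A negative result means the change made things worse and should be reverted before the next parameter is tried.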
Performance targets (3-node all-NVMe reference)
L-1
BIOS & CPU Settings
Foundation: if this layer is wrong, tuning the layers above is meaningless
- Power Profile: Maximum Performance / OS Control (no power saving)
- C-States: Disabled (lower wake-up latency)
- Turbo Boost: Enabled
- Hyper-Threading: Enabled (more threads for OSDs)
- NUMA: Enabled (pin OSDs per NUMA node)
- SR-IOV: Enabled (for NIC passthrough)
- VT-d / IOMMU: Enabled
- Memory Frequency: Max supported (3200/4800 MT/s)
L-2
Kernel & OS Tuning
sysctl, CPU governor, IO scheduler
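CPU Governor: performance mode
# Make sure the CPUs run at maximum frequency
# On Debian/Proxmox, cpupower ships in linux-cpupower; cpufrequtils reads /etc/default/cpufrequtils at boot
apt install linux-cpupower cpufrequtils -y
cpupower frequency-set -g performance
# Persistent
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
# Verify (should print "performance")
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor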
IO Scheduler: none for NVMe, mq-deadline for SATA
# Check the current scheduler
cat /sys/block/nvme0n1/queue/scheduler
# NVMe: use "none" (native multi-queue)
echo none > /sys/block/nvme0n1/queue/scheduler
# SATA SSD: use "mq-deadline"
echo mq-deadline > /sys/block/sda/queue/scheduler
# Persist via a udev rule (the quoted 'EOF' keeps the shell from expanding anything inside)
cat > /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
EOF
# Reload the rules and re-trigger the block devices
udevadm control --reload-rules && udevadm trigger --subsystem-match=block
sysctl tuning for Ceph
# File: /etc/sysctl.d/99-ceph-tuning.conf
# Network buffers (for 25/100 Gbps)
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 134217728
net.core.netdev_max_backlog = 250000
# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_mem = 786432 1048576 26777216
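# BBR needs the tcp_bbr kernel module (modprobe tcp_bbr); check net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_congestion_control = bbr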
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
# tcp_low_latency has had no effect since kernel 4.14; it only matters on older kernels
net.ipv4.tcp_low_latency = 1
# File descriptors (each OSD needs ~1024)
fs.file-max = 26234859
fs.nr_open = 26234859
# Memory
vm.swappiness = 1
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 4194304
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
# Disable THP (Transparent Huge Pages): it causes latency spikes
# These two are not sysctl keys; run them from rc.local or a systemd unit at boot:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Apply the sysctl file
sysctl -p /etc/sysctl.d/99-ceph-tuning.conf
L-3
Network Stack
Jumbo Frame · IRQ pin · RDMA · MTU
The network is Ceph's #1 bottleneck
Every write must be replicated to (size-1) other OSDs, so write traffic on the wire grows to 2-3x the client bandwidth. 10 Gbps is usually not enough for all-NVMe; use ≥25 Gbps.
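The replication arithmetic can be made concrete with a short calculation. This is a rough sketch, assuming all replica copies travel over the cluster network and ignoring protocol overhead; `replication_gbps` is a hypothetical helper name and the input is an illustrative figure.

```shell
# replication_gbps CLIENT_WRITE_MBPS POOL_SIZE
# Extra replication traffic in Gbit/s (integer, rounded down):
# each client write is forwarded to (size - 1) other OSDs.
replication_gbps() {
    echo $(( $1 * ($2 - 1) * 8 / 1000 ))
}

# Illustrative: 1000 MB/s of client writes on a size=3 pool
replication_gbps 1000 3   # prints 16
```

Under these assumptions, 1000 MB/s of client writes already generates 16 Gbit/s of replica traffic, which is more than a single 10 Gbps link can carry.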
Enable Jumbo Frames (MTU 9000)
# /etc/network/interfaces: on both the Ceph public and cluster networks
auto bond1
iface bond1 inet manual
bond-slaves eno3 eno4
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4
mtu 9000
auto vmbr1
iface vmbr1 inet static
address 10.10.10.11/24
bridge-ports bond1
mtu 9000
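# The switch must also have jumbo frames enabled on the matching ports and VLANs
# Test after configuring (8972 = 9000 minus the 20-byte IP header and 8-byte ICMP header):
ping -c 4 -M do -s 8972 10.10.10.12
# If it succeeds, jumbo frames work end to end. If it fails with "frag needed" or "message too long", some hop (usually the switch) is still at a smaller MTU.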
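The payload size passed to `ping -s` follows from the MTU. A small sketch of that calculation, assuming IPv4 without header options; `icmp_payload` is a hypothetical helper name.

```shell
# icmp_payload MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

icmp_payload 9000   # prints 8972
icmp_payload 1500   # prints 1472
```

The result can be fed straight into the test, for example `ping -c 4 -M do -s "$(icmp_payload 9000)" 10.10.10.12`.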
IRQ pinning: spread interrupt load across multiple CPUs
# Install tools
apt install ethtool irqbalance -y
# Increase the number of NIC queues (e.g. 16)
ethtool -L eno3 combined 16
# Increase the ring buffers
ethtool -G eno3 rx 4096 tx 4096
# Disable LRO/GRO on the Ceph network (lower latency)
ethtool -K eno3 lro off gro off
# Pin IRQs to NUMA-local CPUs; disable irqbalance first so it does not undo the pinning
systemctl disable --now irqbalance
# Manually pin each queue to its own CPU (script); assumes CPUs 0-15 are local to the NIC.
# A counter is used because IRQ numbers are not contiguous, so "IRQ number mod 16" would spread unevenly.
n=0
for i in $(grep eno3 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
echo $((n % 16)) > /proc/irq/$i/smp_affinity_list
n=$((n + 1))
done
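The pinning loop assumes you know which CPUs are local to the NIC. The kernel exposes that list in sysfs as a compact string such as `0-3,8-11`. The sketch below expands such a string into individual CPU numbers so they can be assigned round-robin; `expand_cpulist` is a hypothetical helper name and the input shown is illustrative.

```shell
# expand_cpulist CPULIST
# Expands a kernel cpulist string ("0-3,8") into space-separated CPU numbers.
expand_cpulist() {
    out=""
    old_ifs=$IFS
    IFS=','
    for part in $1; do
        case "$part" in
            *-*) start=${part%-*}; end=${part#*-} ;;   # a range "a-b"
            *)   start=$part; end=$part ;;             # a single CPU
        esac
        cpu=$start
        while [ "$cpu" -le "$end" ]; do
            out="$out $cpu"
            cpu=$((cpu + 1))
        done
    done
    IFS=$old_ifs
    echo "${out# }"
}

expand_cpulist "0-3,8"   # prints 0 1 2 3 8

# On a real host the input would come from the NIC's PCI device, e.g.:
#   expand_cpulist "$(cat /sys/class/net/eno3/device/local_cpulist)"
```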
RDMA / RoCEv2 (advanced)
RDMA bypasses the kernel TCP stack, which can cut latency by 50% or more and lowers CPU usage. It requires an RDMA-capable NIC (Mellanox CX-5/6, Intel E810).
# Enable the RDMA messenger in ceph.conf (experimental)
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 0
L-4
BlueStore — OSD Backend Tuning
WAL · DB · Cache · Compression
BlueStore architecture (keep at hand while tuning)
graph LR
APP[Client Write] --> OSD[OSD Daemon]
OSD --> BS[BlueStore Engine]
BS --> WAL[(WAL<br/>Write-Ahead Log<br/>Fast NVMe)]
BS --> DB[(RocksDB<br/>Metadata<br/>Fast NVMe)]
BS --> DATA[(Block Device<br/>Data Storage<br/>HDD/SSD/NVMe)]
style WAL fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style DB fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style DATA fill:#d1fae5,stroke:#10b981,stroke-width:2px
💡 WAL/DB rule of thumb
If the main OSD device is an HDD, you MUST put WAL+DB on a separate NVMe, which raises IOPS 3-5x. If the main OSD device is NVMe, keep them on the same device (less complexity).
Sizing WAL & DB
| Component | Recommended size | Device |
| WAL (Write-Ahead Log) | 1-2 GB | NVMe enterprise (high endurance) |
| DB (RocksDB metadata) | 4% of data size (minimum 30 GB) | Same NVMe as the WAL |
| Data | The remainder | HDD/SSD/NVMe depending on use case |
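The DB sizing row in the table can be turned into a quick calculation. A minimal sketch, assuming sizes in whole GB and the 4% rule with a 30 GB floor exactly as stated in the table; `db_size_gb` is a hypothetical helper name.

```shell
# db_size_gb DATA_SIZE_GB
# DB volume size by the 4% rule, never below 30 GB.
db_size_gb() {
    size=$(( $1 * 4 / 100 ))
    if [ "$size" -lt 30 ]; then
        size=30
    fi
    echo "$size"
}

db_size_gb 4000   # prints 160
db_size_gb 500    # prints 30 (4% would be 20, so the floor applies)
```

By this rule a 4 TB HDD would get a 160 GB DB volume, so six such OSDs would need 960 GB of NVMe in total.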
Creating an OSD with separate WAL/DB
# Assume one 1 TB NVMe serves as WAL+DB for 6 HDD OSDs
# Split the NVMe into 6 slices of 50 GB each for DB (the WAL shares the DB volume)
# PVE GUI: Ceph → OSD → Create OSD
# OSD Disk: /dev/sdb (HDD 4TB)
# DB Disk: /dev/nvme0n1 (choose manually or auto-split)
# DB size: 50GB
# Or via CLI (the size option is --db_dev_size, in GiB):
pveceph osd create /dev/sdb \
--db_dev /dev/nvme0n1 \
--db_dev_size 50
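BlueStore Cache Memory
# In /etc/ceph/ceph.conf, section [osd]
[osd]
# Automatic memory management (supported by every Ceph release shipped with PVE 7+)
osd_memory_target = 8589934592 # 8 GB per OSD
# Cache split (with bluestore_cache_autotune enabled, the default, Ceph adjusts these ratios itself):
bluestore_cache_kv_ratio = 0.4 # RocksDB cache
bluestore_cache_meta_ratio = 0.4 # Onode metadata cache
# The block data cache gets the remainder (1 - 0.4 - 0.4 = 0.2); recent releases have no separate bluestore_cache_data_ratio option
# For HDD clusters, increase prefetch (confirm the option exists in your release: ceph config help bluestore_prefetch_size)
bluestore_prefetch_size = 1048576
# Apply: restart the OSDs one at a time
systemctl restart ceph-osd@0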
Calculating the RAM needed for OSDs
Rule: osd_memory_target × number of OSDs + 8 GB for the OS. A server with 12 OSDs: 12 × 8 GB = 96 GB, plus 8 GB = 104 GB RAM (round up to 128 GB).
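The RAM rule is easy to script when comparing server configurations. A minimal sketch; `osd_ram_gb` and `next_pow2` are hypothetical helper names, and rounding up to the next power of two is an assumption used here to reproduce the 128 GB figure (real DIMM layouts allow other sizes).

```shell
# osd_ram_gb OSD_COUNT MEMORY_TARGET_GB OS_RESERVE_GB
osd_ram_gb() {
    echo $(( $1 * $2 + $3 ))
}

# next_pow2 N: smallest power of two that is >= N
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

need=$(osd_ram_gb 12 8 8)
echo "need ${need} GB, provision $(next_pow2 "$need") GB"
# prints: need 104 GB, provision 128 GB
```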
L-5
OSD Daemon Tuning
Threading · Op queue · Recovery throttle
# /etc/ceph/ceph.conf, [osd] section
[osd]
# Op threads: size them to roughly half the core count
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
# Disk threads (a FileStore-era option; recent releases may ignore it)
osd_disk_threads = 4
# Recovery throttle: important to limit client impact during recovery
# (with the mClock scheduler, default since Quincy, the two limits below apply only if osd_mclock_override_recovery_settings = true)
osd_max_backfills = 1 # default 1; do NOT raise in production
osd_recovery_max_active = 3
osd_recovery_op_priority = 3 # 1-63, lower means less impact
osd_client_op_priority = 63 # client I/O always gets top priority
# Scrub (integrity check): run during off-peak hours
osd_scrub_begin_hour = 2
osd_scrub_end_hour = 6
osd_scrub_during_recovery = false
osd_scrub_load_threshold = 0.5
osd_deep_scrub_interval = 604800 # 7 days
# Snapshot trimming throttle
osd_snap_trim_priority = 5
osd_snap_trim_sleep = 0
# Async messenger threads
ms_async_op_threads = 5
Pinning OSDs to a NUMA node
# Check the NUMA topology
numactl --hardware
lspci -vv | grep -i numa
# Pin the OSD to the NUMA node closest to its disk and NIC
# Edit a systemd unit override:
systemctl edit ceph-osd@0
# Add the following (systemd does not allow a trailing comment on a setting line):
[Service]
# Cores of NUMA node 0
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
L-6
Pool, PG & CRUSH Tuning
PG count · Erasure coding · Device classes · Failure domain
PG (Placement Group) sizing
PG count directly affects performance and balance. Too few → hot spots. Too many → CPU overhead.
# Official Ceph rule of thumb:
PG count = (Total OSDs × 100) / (Pool replicas)
→ round up to a power of 2
# Example: 9 OSDs, replica size=3
PG = (9 × 100) / 3 = 300 → round up = 512
# Setting PGs (avoid manual changes; use the autoscaler)
ceph osd pool set vm-storage pg_autoscale_mode on
ceph osd pool autoscale-status
# If the autoscaler is slow or off target, set the value manually
# (switch pg_autoscale_mode to off or warn first, otherwise the autoscaler may change it back):
ceph osd pool set vm-storage pg_num 512
ceph osd pool set vm-storage pgp_num 512
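The PG formula above can be checked with a few lines of shell. A minimal sketch of the stated rule (OSDs × 100 / replicas, rounded up to a power of two); `pg_count` is a hypothetical helper name, not a Ceph command.

```shell
# pg_count TOTAL_OSDS REPLICA_SIZE
pg_count() {
    target=$(( $1 * 100 / $2 ))
    pg=1
    # Double until the power of two reaches or exceeds the target
    while [ "$pg" -lt "$target" ]; do
        pg=$((pg * 2))
    done
    echo "$pg"
}

pg_count 9 3   # prints 512 (target 300, rounded up)
```

The result is only a starting point for a manual override; with the autoscaler on, `ceph osd pool autoscale-status` shows what Ceph itself would pick.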
Device Classes: separating HDD and SSD
# Ceph auto-detects the device class (hdd/ssd/nvme)
ceph osd tree
# Create a CRUSH rule that uses only SSDs
ceph osd crush rule create-replicated ssd-rule default host ssd
# Format: rule_name default_root failure_domain device_class
# Create a pool that uses the new rule
ceph osd pool create ssd-pool 128 128 replicated ssd-rule
# Move an existing pool to a different rule
ceph osd pool set vm-storage crush_rule ssd-rule
Erasure Coding: save 50% of raw capacity
3x replication consumes 300% raw capacity. EC 4+2 consumes only 150%, at the cost of an IOPS penalty. It suits backups and cold data.
# EC profile 4+2 (needs ≥6 OSDs on ≥6 hosts)
ceph osd erasure-code-profile set ec-4-2 \
k=4 m=2 \
crush-failure-domain=host
# Create the EC pool
ceph osd pool create ec-backup 128 128 erasure ec-4-2
ceph osd pool set ec-backup allow_ec_overwrites true
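The 150% versus 300% figures come from the ratio of stored chunks to data chunks. A short sketch of that arithmetic; `ec_raw_pct` and `replica_raw_pct` are hypothetical helper names.

```shell
# ec_raw_pct K M: raw capacity used per 100 units of data with k data + m coding chunks
ec_raw_pct() {
    echo $(( ($1 + $2) * 100 / $1 ))
}

# replica_raw_pct SIZE: raw capacity used per 100 units of data with SIZE replicas
replica_raw_pct() {
    echo $(( $1 * 100 ))
}

ec_raw_pct 4 2      # prints 150
replica_raw_pct 3   # prints 300
```

150 out of 300 is the 50% saving named in the heading. Both layouts here tolerate the loss of two hosts, which is what makes the comparison fair.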
CRUSH failure domain
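# Default failure domain = host (one replica per host)
# Multi-rack setup: create the rack buckets, place them under the root, then move the hosts in
ceph osd crush add-bucket rack01 rack
ceph osd crush add-bucket rack02 rack
ceph osd crush add-bucket rack03 rack
ceph osd crush move rack01 root=default
ceph osd crush move rack02 root=default
ceph osd crush move rack03 root=default
ceph osd crush move pve-node1 rack=rack01
ceph osd crush move pve-node2 rack=rack02
ceph osd crush move pve-node3 rack=rack03
# Create a rule whose failure domain is rack
ceph osd crush rule create-replicated rack-rule default rack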
L-7
RBD Client & VM Tuning
RBD cache · QEMU IO threads · VirtIO multi-queue
# /etc/ceph/ceph.conf: [client] section on PVE
[client]
# RBD cache: important for write performance
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 268435456 # 256MB
rbd_cache_max_dirty = 134217728 # 128MB
rbd_cache_target_dirty = 67108864 # 64MB
rbd_cache_max_dirty_age = 2.0 # seconds
# Concurrent ops
rbd_concurrent_management_ops = 20
VM tuning on Proxmox
# Edit the VM config: /etc/pve/qemu-server/<vmid>.conf
# Disk: use the VirtIO SCSI single controller + iothread (iothread needs scsihw: virtio-scsi-single)
scsihw: virtio-scsi-single
scsi0: vm-storage:vm-100-disk-0,iothread=1,discard=on,ssd=1,cache=writeback
# CPU: host type + NUMA
cpu: host
numa: 1
# Network: VirtIO multi-queue
net0: virtio=AA:BB:...,bridge=vmbr0,queues=8
# Balloon + memory
memory: 8192
balloon: 4096
# Apply: shut the VM down and start it again (a live change does not apply everything)
Is cache=writeback safe?
It is safe if the host has a UPS and QEMU receives fsync (flush) requests from the guest. It raises IOPS 2-3x compared with writethrough. Production setups commonly use cache=writeback with iothread=1.
L-8
Monitoring & Diagnostics
Measure and monitor continuously to know whether tuning is working
Built-in tools
# Real-time IOPS, latency
ceph osd perf
ceph -s
ceph osd df
ceph daemon osd.0 perf dump | jq .
# Slow ops detection
ceph health detail | grep -i slow
ceph daemon osd.0 dump_ops_in_flight
# Slowest recent ops, sorted by duration
ceph daemon osd.0 dump_historic_ops_by_duration
# Pool stats
ceph osd pool stats vm-storage
rados df
Prometheus + Grafana stack
# Enable Ceph Prometheus module (built-in)
ceph mgr module enable prometheus
ceph config set mgr mgr/prometheus/server_port 9283
# Verify endpoint
curl http://<mgr-ip>:9283/metrics
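# Install Prometheus + Grafana (LXC container)
# grafana is typically not in the stock Debian repositories; add Grafana's own APT repository first
apt install prometheus grafana -y
# Import the official Grafana dashboards: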
# https://grafana.com/grafana/dashboards/2842 — Ceph Cluster
# https://grafana.com/grafana/dashboards/5336 — Ceph OSD
# https://grafana.com/grafana/dashboards/5342 — Ceph Pools
Key alerting metrics
| Metric | Threshold | Action |
| OSD usage | >80% | Add OSD / balance |
| PG inactive/incomplete | >0 | Critical: check for down OSDs |
| Apply latency | >50ms | Check disk health, network |
| Slow ops | >0 | Identify slow OSD |
| Recovery throughput | < 500MB/s | Raise osd_recovery_max_active |
| Client IOPS | Sudden drop | Check network + OSD health |
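The first row of the table can be checked from the command line without a full monitoring stack. A rough sketch, assuming the input is one "osd-id utilization-percent" pair per line; `osds_over_threshold` is a hypothetical helper name and the sample data is made up.

```shell
# osds_over_threshold THRESHOLD_PCT
# Reads "id utilization" lines on stdin and prints the ids above the threshold.
osds_over_threshold() {
    awk -v t="$1" '$2 + 0 > t + 0 { print $1 }'
}

# Made-up sample: only OSD 1 is above 80%
printf '0 75.2\n1 83.9\n2 80.0\n' | osds_over_threshold 80   # prints 1

# On a cluster the pairs could come from the JSON form of "ceph osd df".
# The field names below are an assumption; check them against your release:
#   ceph osd df -f json | jq -r '.nodes[] | "\(.id) \(.utilization)"' | osds_over_threshold 80
```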
Quick wins: fast tuning checklist
- L1: BIOS Power = Max Performance, disable C-States
- L2: CPU governor = performance, disable THP
- L2: sysctl tuning (net buffer, TCP BBR)
- L3: MTU 9000 on Ceph Public + Cluster Net
- L3: Put Ceph Public and Cluster on two separate networks
- L4: osd_memory_target = 8GB (or 4GB if RAM is limited)
- L4: Move WAL/DB to NVMe if the main OSD device is an HDD
- L5: osd_max_backfills = 1 (production)
- L5: Scrub hours = 2-6 (off-peak)
- L6: Enable the PG autoscaler, use device classes
- L7: VM disk: SCSI + iothread + cache=writeback
- L7: RBD cache enabled (256MB)
- L8: Set up Prometheus + Grafana dashboards