Ceph Performance Tuning: Optimizing IOPS & Latency

Baseline benchmark: measure before tuning

The golden rule of tuning

Measure a baseline → change one parameter → measure again → compare. Do NOT change several things at once, or you will not know which change made the difference.
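The compare step can be made explicit. The following is a minimal sketch, assuming each benchmark run is reduced to a single IOPS figure; the function names and the 5% noise threshold are illustrative assumptions, not values from this guide.

```python
def percent_change(baseline: float, tuned: float) -> float:
    """Relative change of the tuned run versus the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (tuned - baseline) / baseline * 100.0


def verdict(baseline: float, tuned: float, noise_pct: float = 5.0) -> str:
    """Classify one single-parameter change.

    noise_pct is an assumed run-to-run variance; measure your own by
    repeating the baseline a few times.
    """
    change = percent_change(baseline, tuned)
    if change > noise_pct:
        return "keep"
    if change < -noise_pct:
        return "revert"
    return "inconclusive"


# Hypothetical numbers: 80K IOPS before, 92K IOPS after changing one setting.
print(percent_change(80_000, 92_000))
print(verdict(80_000, 92_000))
```

Because only one parameter changed between the two runs, a "keep" or "revert" verdict can be attributed to that parameter alone.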

Measuring IOPS & latency with fio

# Install fio
apt install fio -y

# Test 1: Random Read 4K (peak IOPS)
fio \
    --name=randread \
    --filename=/dev/rbd0 \
    --ioengine=libaio \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --iodepth=64 \
    --numjobs=4 \
    --time_based \
    --runtime=60 \
    --group_reporting

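# Test 2: Random Write 4K
# WARNING: this writes directly to /dev/rbd0 and destroys any data on it; use a scratch image
fio \
    --name=randwrite \
    --filename=/dev/rbd0 \
    --ioengine=libaio \
    --direct=1 \
    --rw=randwrite \
    --bs=4k \
    --iodepth=64 \
    --numjobs=4 \
    --time_based \
    --runtime=60 \
    --group_reporting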

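# Test 3: Sequential Read 1MB (throughput)
# libaio + direct=1 are required for iodepth=16 to take effect
fio \
    --name=seqread \
    --filename=/dev/rbd0 \
    --ioengine=libaio \
    --direct=1 \
    --rw=read \
    --bs=1M \
    --iodepth=16 \
    --numjobs=1 \
    --time_based \
    --runtime=60 \
    --group_reporting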

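# Built-in rados bench ("seq" and "rand" read back the objects kept by --no-cleanup)
rados bench -p vm-storage 60 write --no-cleanup
rados bench -p vm-storage 60 seq
rados bench -p vm-storage 60 rand
# Remove the benchmark objects afterwards
rados -p vm-storage cleanup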
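To compare runs mechanically, fio can write a JSON report when invoked with `--output-format=json`. The sketch below pulls IOPS and p99 completion latency out of such a report. The sample document is a hypothetical, heavily trimmed excerpt with made-up numbers, and `summarize` is an illustrative helper, not part of fio.

```python
import json

# Hypothetical, trimmed excerpt of a fio JSON report (numbers are invented).
SAMPLE_REPORT = """
{
  "jobs": [
    {
      "jobname": "randread",
      "read": {
        "iops": 104200.5,
        "clat_ns": {"percentile": {"99.000000": 2500000}}
      }
    }
  ]
}
"""


def summarize(report: dict, direction: str = "read") -> dict:
    """Return IOPS and p99 completion latency (ms) for the first reported job.

    With --group_reporting, fio merges all numjobs workers into one entry,
    so jobs[0] covers the whole test.
    """
    stats = report["jobs"][0][direction]
    p99_ns = stats["clat_ns"]["percentile"]["99.000000"]
    return {"iops": stats["iops"], "p99_ms": p99_ns / 1_000_000}


summary = summarize(json.loads(SAMPLE_REPORT))
print(summary)
```

For the write tests, pass `direction="write"` so the helper reads the write statistics instead.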
                

Performance targets (3-node all-NVMe reference)

Metric	Target	Unit
Random Read 4K	100K+	IOPS
Random Write 4K	40K+	IOPS
Latency p99	(not stated)	
Throughput	(not stated)	GB/s

L-1

BIOS & CPU Settings

Foundation: if this layer is wrong, tuning the layers above it is pointless

Power Profile: Maximum Performance / OS Control (no power saving)
C-States: Disabled (lower wake-up latency)
Turbo Boost: Enabled
Hyper-Threading: Enabled (more threads for the OSDs)
NUMA: Enabled (pin OSDs per NUMA node)
SR-IOV: Enabled (for NIC passthrough)
VT-d / IOMMU: Enabled
Memory Frequency: Max supported (3200/4800 MT/s)

L-2

Kernel & OS Tuning

sysctl, CPU governor, IO scheduler

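CPU governor: performance mode

# Keep the CPU at maximum frequency
# (on Debian-based systems the cpupower tool is packaged as linux-cpupower)
apt install linux-cpupower cpufrequtils -y
cpupower frequency-set -g performance

# Persistent
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils

# Verify: every core should report "performance"
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c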
                

IO scheduler: none for NVMe, mq-deadline for SATA

# Check the current scheduler
cat /sys/block/nvme0n1/queue/scheduler

# NVMe: use "none" (native multi-queue)
echo none > /sys/block/nvme0n1/queue/scheduler

# SATA SSD: use "mq-deadline"
echo mq-deadline > /sys/block/sda/queue/scheduler

# Persist via a udev rule (whole disks only, not partitions)
cat > /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
EOF
# Reload the rules and re-apply them to existing block devices
udevadm control --reload-rules && udevadm trigger --subsystem-match=block
                

sysctl tuning for Ceph

# File: /etc/sysctl.d/99-ceph-tuning.conf

# Network buffers (for 25/100 Gbps)
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 134217728
net.core.netdev_max_backlog = 250000

# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_mem = 786432 1048576 26777216
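# bbr requires the tcp_bbr kernel module (check: sysctl net.ipv4.tcp_available_congestion_control)
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
# tcp_low_latency is a no-op on kernels 4.14 and later
net.ipv4.tcp_low_latency = 1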

# File descriptors (each OSD needs ~1024)
fs.file-max = 26234859
fs.nr_open = 26234859

# Memory
vm.swappiness = 1
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 4194304
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

# Disable THP (Transparent Huge Pages): it causes latency spikes
# These two lines are shell commands, not sysctl keys; run them from rc.local or a systemd unit:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Apply the sysctl file
sysctl -p /etc/sysctl.d/99-ceph-tuning.conf
                

L-3

Network Stack

Jumbo Frame · IRQ pin · RDMA · MTU

The network is Ceph's #1 bottleneck

Every write must be replicated to (size-1) other OSDs, so total network traffic is 2-3x the client write bandwidth. 10 Gbps is usually not enough for an all-NVMe cluster; use ≥25 Gbps.
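The replication multiplier is easy to work out for a given pool size. This is a rough sketch of the arithmetic only: it ignores protocol overhead, recovery traffic, and reads, and the function name and the 8 Gbps figure are illustrative assumptions.

```python
def replication_traffic_gbps(client_write_gbps: float, size: int) -> dict:
    """Split write traffic into the client leg and the replication leg.

    The primary OSD receives the client write once and forwards it to the
    (size - 1) other OSDs holding the remaining replicas.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    replica = client_write_gbps * (size - 1)
    return {
        "client_to_primary": client_write_gbps,
        "primary_to_replicas": replica,
        "total": client_write_gbps + replica,
    }


# Hypothetical load: clients write 8 Gbps into a size=3 pool.
traffic = replication_traffic_gbps(8.0, size=3)
print(traffic)
```

In this example the replication leg alone is twice the client traffic, which is why a shared 10 Gbps link saturates well before the NVMe devices do.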

Enabling jumbo frames (MTU 9000)

# /etc/network/interfaces: on both the Ceph public and cluster networks
auto bond1
iface bond1 inet manual
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.11/24
    bridge-ports bond1
    mtu 9000

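# The switch must also have jumbo frames enabled on the matching ports + VLANs
# Test after configuring:
ping -c 4 -M do -s 8972 10.10.10.12
# If it succeeds, jumbo frames work end to end. A "message too long" / "frag needed" error
# or silent timeouts mean a hop (host interface or switch) is still at a smaller MTU.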
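The 8972-byte payload in the ping test is the MTU minus the IPv4 and ICMP headers. A small sketch of that arithmetic, assuming IPv4 without header options (the function name is illustrative):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(max_ping_payload(9000))  # value to pass to `ping -M do -s`
print(max_ping_payload(1500))  # same check for a standard-MTU path
```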
                

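IRQ pinning: spread interrupts across multiple CPUs

# Install the tools
apt install ethtool irqbalance -y

# Increase the number of NIC queues (e.g. 16 queues)
ethtool -L eno3 combined 16

# Increase the ring buffers
ethtool -G eno3 rx 4096 tx 4096

# Disable LRO/GRO on the Ceph network (lower latency, but more CPU per packet; measure both ways)
ethtool -K eno3 lro off gro off

# Pin IRQs to NUMA-local CPUs
# irqbalance would overwrite manual affinities, so stop and disable it first
systemctl disable --now irqbalance
# Pin each queue's IRQ to its own CPU, round-robin
# (assumes CPUs 0-15 are local to the NIC; check with numactl --hardware)
cpu=0
for irq in $(grep eno3 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
    echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
    cpu=$(( (cpu + 1) % 16 ))
done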
                

RDMA / RoCEv2 (advanced)

RDMA bypasses the kernel TCP stack, which can cut latency by 50% or more and lowers CPU usage. It requires an RDMA-capable NIC (Mellanox CX-5/6, Intel E810).

# Enable the RDMA messenger in ceph.conf (experimental)
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 0
                

L-4

BlueStore — OSD Backend Tuning

WAL · DB · Cache · Compression

BlueStore architecture (for reference while tuning)

graph LR
    APP[Client Write] --> OSD[OSD Daemon]
    OSD --> BS[BlueStore Engine]
    BS --> WAL[("WAL<br/>Write-Ahead Log<br/>Fast NVMe")]
    BS --> DB[("RocksDB<br/>Metadata<br/>Fast NVMe")]
    BS --> DATA[("Block Device<br/>Data Storage<br/>HDD/SSD/NVMe")]
    style WAL fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style DB fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style DATA fill:#d1fae5,stroke:#10b981,stroke-width:2px

💡 WAL/DB rule of thumb

If the main OSD device is an HDD, you MUST place WAL+DB on a separate NVMe; this typically raises IOPS 3-5x. If the main OSD device is already NVMe, keep WAL+DB on the same device (less complexity).

Sizing WAL & DB

Component	Recommended size	Device
WAL (Write-Ahead Log)	1-2 GB	Enterprise NVMe (high endurance)
DB (RocksDB metadata)	4% of data size (minimum 30 GB)	Same NVMe as the WAL
Data	The remainder	HDD/SSD/NVMe depending on use case
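The DB sizing rule in the table (4% of the data device, but never below 30 GB) can be written as a one-line calculation. A minimal sketch; the function name and the sample disk sizes are illustrative assumptions.

```python
def db_size_gb(data_size_gb: float, pct: int = 4, minimum_gb: float = 30.0) -> float:
    """RocksDB (block.db) size per the table: pct% of the data size, floored at minimum_gb."""
    return max(data_size_gb * pct / 100, minimum_gb)


print(db_size_gb(4000))  # 4 TB data device
print(db_size_gb(500))   # small device: the 30 GB minimum applies
```

Multiplying the result by the number of OSDs that share one NVMe shows whether a given NVMe is large enough to host all of their DB volumes.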

Creating an OSD with separate WAL/DB

# Assume one 1 TB NVMe serves as WAL+DB for 6 HDD OSDs
# Split the NVMe into 6 slices: 50 GB each for DB (the WAL shares the DB device)

# PVE GUI: Ceph → OSD → Create OSD
#   OSD Disk: /dev/sdb (HDD 4TB)
#   DB Disk: /dev/nvme0n1 (choose manual or auto-split)
#   DB size: 50GB

# Or via CLI (size in GiB). Current PVE releases name the option --db_dev_size;
# older releases used --db_size. Check `man pveceph` for your version.
pveceph osd create /dev/sdb \
    --db_dev /dev/nvme0n1 \
    --db_dev_size 50
                

BlueStore Cache Memory

# In /etc/ceph/ceph.conf, [osd] section

[osd]
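# Automatic cache sizing from a memory target (PVE 7+, Ceph Quincy+)
osd_memory_target = 8589934592   # 8 GiB per OSD

# Cache split:
bluestore_cache_kv_ratio = 0.4   # RocksDB cache
bluestore_cache_meta_ratio = 0.4 # Onode metadata cache
# Block data cache: receives the remainder, 1 - 0.4 - 0.4 = 0.2.
# It is derived from the two ratios above rather than set directly
# (verify with: ceph config help bluestore_cache_data_ratio)

# For HDD clusters, increase prefetch
bluestore_prefetch_size = 1048576

# Apply: restart the OSDs one at a time (shell command, not a ceph.conf line)
systemctl restart ceph-osd@0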
                

Calculating RAM for OSDs

Rule: osd_memory_target × number of OSDs + 8 GB for the OS. A server with 12 OSDs needs 12 × 8 GB = 96 GB, plus 8 GB = 104 GB of RAM (round up to 128 GB).
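The RAM rule works the same way for any OSD count. A minimal sketch of the calculation; rounding up to the next power of two is an assumption chosen to match the 104 → 128 GB example, since real DIMM configurations may allow other totals.

```python
def osd_host_ram_gb(num_osds: int, memory_target_gb: int = 8, os_reserve_gb: int = 8) -> int:
    """RAM needed per the rule: osd_memory_target x OSD count + OS reserve."""
    return num_osds * memory_target_gb + os_reserve_gb


def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (assumed purchasing granularity)."""
    return 1 << (n - 1).bit_length()


needed = osd_host_ram_gb(12)
print(needed, round_up_pow2(needed))
```

With RAM-constrained hosts, rerun the calculation with `memory_target_gb=4` to see what a smaller osd_memory_target frees up.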

L-5

OSD Daemon Tuning

Threading · Op queue · Recovery throttle

# /etc/ceph/ceph.conf — [osd] section

[osd]
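# Op threads: size to about half the core count (here 8 shards × 2 threads = 16)
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2

# Disk threads (legacy option; confirm your release still recognizes it: ceph config help osd_disk_threads)
osd_disk_threads = 4

# Recovery throttle: important for limiting client impact during recovery
# On releases that use the mClock scheduler (the default since Quincy), the backfill/recovery
# limits below only take effect with osd_mclock_override_recovery_settings = true
osd_max_backfills = 1          # default 1; do NOT raise in production
osd_recovery_max_active = 3
osd_recovery_op_priority = 3   # 1-63, lower means less client impact
osd_client_op_priority = 63    # client ops always get high priority

# Scrub (integrity check): run during off-peak hours
osd_scrub_begin_hour = 2
osd_scrub_end_hour = 6
osd_scrub_during_recovery = false
osd_scrub_load_threshold = 0.5
osd_deep_scrub_interval = 604800   # 7 days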

# Snapshot trimming throttle
osd_snap_trim_priority = 5
osd_snap_trim_sleep = 0

# Async messenger threads
ms_async_op_threads = 5
                

NUMA pinning OSD

# Check NUMA topology
numactl --hardware
lspci -vv | grep -i numa

# Pin the OSD to the NUMA node closest to its disk + NIC
# Edit the systemd unit override:
systemctl edit ceph-osd@0

# Add (systemd does not allow trailing comments on a setting line):
[Service]
# Cores of NUMA node 0
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
                

L-6

Pool, PG & CRUSH Tuning

PG count · Erasure coding · Device classes · Failure domain

PG (Placement Group) sizing

PG count directly affects performance and data balance. Too few PGs create hot spots; too many add CPU overhead.

# Official Ceph formula:
PG count = (Total OSDs × 100) / (Pool replicas)
       → round up to the next power of 2

# Example: 9 OSDs, replica size=3
PG = (9 × 100) / 3 = 300 → rounded up = 512

# Setting PGs (prefer the autoscaler over manual values)
ceph osd pool set vm-storage pg_autoscale_mode on
ceph osd pool autoscale-status

# If the autoscaler is slow or skewed, set the values manually
# (switch pg_autoscale_mode to warn or off first, or the autoscaler will change them back):
ceph osd pool set vm-storage pg_num 512
ceph osd pool set vm-storage pgp_num 512
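The PG formula above can be checked for other cluster sizes with a few lines of code. A minimal sketch; the function name is illustrative, and the target of 100 PGs per OSD is taken from the formula as given.

```python
def pg_count(total_osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """(total OSDs x 100) / replicas, rounded up to the next power of two."""
    if total_osds < 1 or replicas < 1:
        raise ValueError("total_osds and replicas must be positive")
    raw = -(-total_osds * pgs_per_osd // replicas)  # ceiling division
    return 1 << (raw - 1).bit_length()


print(pg_count(9, 3))   # the worked example: 300 rounds up to 512
print(pg_count(3, 3))   # a minimal 3-OSD cluster
```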
                

Device classes: separating HDD and SSD

# Ceph detects the device class automatically (hdd/ssd/nvme)
ceph osd tree

# Create a CRUSH rule that uses only SSDs
ceph osd crush rule create-replicated ssd-rule default host ssd
# Format: rule_name root failure_domain device_class

# Create a pool that uses the new rule
ceph osd pool create ssd-pool 128 128 replicated ssd-rule

# Move an existing pool to a different rule (this triggers data migration)
ceph osd pool set vm-storage crush_rule ssd-rule
                

Erasure coding: save 50% of raw capacity

3x replication consumes 300% of the stored data in raw capacity. EC 4+2 consumes only 150%, at the cost of an IOPS penalty. It suits backups and cold data.
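The 300% and 150% figures follow directly from the layout: replication stores `size` full copies, while erasure coding stores k data chunks plus m coding chunks for every k chunks of data. A short sketch of that arithmetic (function names are illustrative):

```python
def replica_raw_pct(size: int) -> float:
    """Raw capacity consumed per 100% of stored data with N-way replication."""
    return size * 100.0


def ec_raw_pct(k: int, m: int) -> float:
    """Raw capacity consumed per 100% of stored data with a k+m erasure code."""
    return (k + m) * 100.0 / k


def savings_pct(baseline_pct: float, other_pct: float) -> float:
    """How much less raw capacity `other` needs compared with `baseline`."""
    return (baseline_pct - other_pct) / baseline_pct * 100.0


print(replica_raw_pct(3), ec_raw_pct(4, 2))
print(savings_pct(replica_raw_pct(3), ec_raw_pct(4, 2)))
```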

# EC profile 4+2 (needs ≥6 OSDs on ≥6 hosts)
ceph osd erasure-code-profile set ec-4-2 \
    k=4 m=2 \
    crush-failure-domain=host

# Create the EC pool
ceph osd pool create ec-backup 128 128 erasure ec-4-2
ceph osd pool set ec-backup allow_ec_overwrites true
                

CRUSH failure domain

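# Default failure domain = host (one replica per host)
# Multi-rack setup: change the failure domain

# The rack buckets must exist under the root before hosts can be moved into them
for rack in rack01 rack02 rack03; do
    ceph osd crush add-bucket "$rack" rack
    ceph osd crush move "$rack" root=default
done
ceph osd crush move pve-node1 rack=rack01
ceph osd crush move pve-node2 rack=rack02
ceph osd crush move pve-node3 rack=rack03

# Create a rule with failure domain = rack, then assign it to the pool
ceph osd crush rule create-replicated rack-rule default rack
ceph osd pool set vm-storage crush_rule rack-rule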
                

L-7

RBD Client & VM Tuning

RBD cache · QEMU IO threads · VirtIO multi-queue

# /etc/ceph/ceph.conf: [client] section on PVE

[client]
# RBD cache: important for write performance
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 268435456              # 256MB
rbd_cache_max_dirty = 134217728         # 128MB
rbd_cache_target_dirty = 67108864       # 64MB
rbd_cache_max_dirty_age = 2.0           # seconds

# Concurrent ops
rbd_concurrent_management_ops = 20
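The byte values in the RBD cache settings are plain MiB multiples, and the three sizes are expected to nest: target_dirty below max_dirty, and max_dirty below the cache size. A small sketch that derives the values and checks the ordering; the helper names are illustrative.

```python
MIB = 1024 ** 2


def mib(n: int) -> int:
    """Convert MiB to bytes, the unit ceph.conf expects for these options."""
    return n * MIB


cache = {
    "rbd_cache_size": mib(256),
    "rbd_cache_max_dirty": mib(128),
    "rbd_cache_target_dirty": mib(64),
}


def sizes_are_consistent(cfg: dict) -> bool:
    """target_dirty < max_dirty < cache_size, so writeback starts before the cache fills."""
    return cfg["rbd_cache_target_dirty"] < cfg["rbd_cache_max_dirty"] < cfg["rbd_cache_size"]


for key, value in cache.items():
    print(f"{key} = {value}")
print(sizes_are_consistent(cache))
```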
                

VM tuning on Proxmox

# Edit VM config: /etc/pve/qemu-server/<vmid>.conf

# Disk: VirtIO SCSI single controller + iothread (iothread=1 needs the virtio-scsi-single controller)
scsihw: virtio-scsi-single
scsi0: vm-storage:vm-100-disk-0,iothread=1,discard=on,ssd=1,cache=writeback

# CPU: host type + numa
cpu: host
numa: 1

# Network: VirtIO multi-queue
net0: virtio=AA:BB:...,bridge=vmbr0,queues=8

# Balloon + memory
memory: 8192
balloon: 4096

# Apply: shut the VM down and start it again (a live change does not apply everything)
                

Is cache=writeback safe?

It is safe as long as the host has a UPS and QEMU receives flush (fsync) requests from the guest. It can raise IOPS 2-3x compared with writethrough. Production setups commonly use cache=writeback with iothread=1.

L-8

Monitoring & Diagnostics

Measure and monitor continuously to know whether tuning is actually working

Built-in tools

# Real-time IOPS, latency
ceph osd perf
ceph -s
ceph osd df

ceph daemon osd.0 perf dump | jq .

# Slow ops detection
ceph health detail | grep -i slow
ceph daemon osd.0 dump_ops_in_flight

# Slowest recent ops, sorted by duration
ceph daemon osd.0 dump_historic_ops_by_duration

# Pool stats
ceph osd pool stats vm-storage
rados df
                

Prometheus + Grafana stack

# Enable Ceph Prometheus module (built-in)
ceph mgr module enable prometheus
ceph config set mgr mgr/prometheus/server_port 9283

# Verify endpoint
curl http://<mgr-ip>:9283/metrics

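# Install Prometheus + Grafana (in an LXC container)
# Note: grafana is not in the stock Debian repositories; add the Grafana APT repository first
apt install prometheus grafana -y

# Import the Ceph dashboards from grafana.com: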
# https://grafana.com/grafana/dashboards/2842 — Ceph Cluster
# https://grafana.com/grafana/dashboards/5336 — Ceph OSD
# https://grafana.com/grafana/dashboards/5342 — Ceph Pools
                

Key alerting metrics

Metric	Threshold	Action
OSD usage	>80%	Add OSDs / rebalance
PG inactive/incomplete	>0	Critical: check for down OSDs
Apply latency	>50ms	Check disk health and network
Slow ops	>0	Identify the slow OSD
Recovery throughput	< 500MB/s	Raise osd_recovery_max_active
Client IOPS	Sudden drop	Check network + OSD health
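The fixed thresholds in the table translate directly into a rule list. The sketch below is illustrative only: the metric key names are hypothetical (they are not Prometheus metric names), the sample values are invented, and the "sudden drop" rule for client IOPS is left out because it needs a history of samples rather than a single reading.

```python
# Thresholds copied from the table; metric key names are hypothetical.
RULES = [
    ("osd_usage_pct", lambda v: v > 80, "Add OSDs / rebalance"),
    ("pg_inactive", lambda v: v > 0, "Critical: check for down OSDs"),
    ("apply_latency_ms", lambda v: v > 50, "Check disk health and network"),
    ("slow_ops", lambda v: v > 0, "Identify the slow OSD"),
    ("recovery_mb_s", lambda v: v < 500, "Raise osd_recovery_max_active"),
]


def triggered_actions(metrics: dict) -> list:
    """Return the actions whose threshold is breached; missing metrics are skipped."""
    actions = []
    for key, breached, action in RULES:
        if key in metrics and breached(metrics[key]):
            actions.append(action)
    return actions


# Invented sample: a nearly full cluster with a few slow ops, no recovery running.
sample = {"osd_usage_pct": 85, "pg_inactive": 0, "apply_latency_ms": 12, "slow_ops": 3}
print(triggered_actions(sample))
```

Only supply `recovery_mb_s` while a recovery is in progress; otherwise an idle cluster would always breach the recovery throughput rule.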

Quick wins: a fast tuning checklist

L1: BIOS power = Max Performance, disable C-states
L2: CPU governor = performance, disable THP
L2: sysctl tuning (network buffers, TCP BBR)
L3: MTU 9000 on the Ceph public + cluster networks
L3: Put the Ceph public and cluster networks on two separate networks
L4: osd_memory_target = 8GB (or 4GB if RAM is limited)
L4: Move WAL/DB to NVMe if the main OSD device is an HDD
L5: osd_max_backfills = 1 (production)
L5: Scrub hours = 2-6 (off-peak)
L6: Enable the PG autoscaler, use device classes
L7: VM disk: SCSI + iothread + cache=writeback
L7: RBD cache enabled (256MB)
L8: Set up Prometheus + Grafana dashboards
