Advanced · Performance Tuning

Ceph Performance Optimization
IOPS · Latency · Throughput

Eight tuning layers: BIOS → kernel → network → BlueStore → OSD → CRUSH → RBD client → monitoring. Reach 100K+ IOPS on a 3-node all-NVMe cluster.

Baseline benchmark: measure before you tune

The golden rule of tuning

Measure a baseline → change one parameter → measure again → compare. Do NOT change several things at once, or you will not know which change made the difference.
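A minimal sketch of the "compare" step, assuming you record one IOPS figure per run by hand. The helper name `pct_change` and the sample numbers are hypothetical; only the arithmetic is the point.

```shell
#!/usr/bin/env bash
# pct_change BASELINE TUNED
# Prints the relative change between two measurements, in percent.
pct_change() {
    awk -v b="$1" -v t="$2" 'BEGIN {
        if (b == 0) { print "n/a"; exit }          # avoid division by zero
        printf "%+.1f\n", (t - b) / b * 100
    }'
}

# Hypothetical run: 82000 IOPS before, 94300 IOPS after a single change
pct_change 82000 94300   # prints +15.0
```

Logging one such line per changed parameter keeps every gain or regression attributable to exactly one setting.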

Measuring IOPS & latency with fio

# Install fio
apt install fio -y

# WARNING: the write test overwrites /dev/rbd0. Use a scratch RBD image only.

# Test 1: Random Read 4K (peak IOPS)
fio \
  --name=randread \
  --filename=/dev/rbd0 \
  --ioengine=libaio \
  --direct=1 \
  --rw=randread \
  --bs=4k \
  --iodepth=64 \
  --numjobs=4 \
  --time_based \
  --runtime=60 \
  --group_reporting

# Test 2: Random Write 4K
fio \
  --name=randwrite \
  --filename=/dev/rbd0 \
  --ioengine=libaio \
  --direct=1 \
  --rw=randwrite \
  --bs=4k \
  --iodepth=64 \
  --numjobs=4 \
  --time_based \
  --runtime=60 \
  --group_reporting

# Test 3: Sequential Read 1MB (throughput)
# libaio + direct=1 are needed for iodepth to take effect
fio \
  --name=seqread \
  --filename=/dev/rbd0 \
  --ioengine=libaio \
  --direct=1 \
  --rw=read \
  --bs=1M \
  --iodepth=16 \
  --numjobs=1 \
  --time_based \
  --runtime=60

# Built-in rados bench (seq and rand read the objects left by the write run)
rados bench -p vm-storage 60 write --no-cleanup
rados bench -p vm-storage 60 seq
rados bench -p vm-storage 60 rand

# Remove the benchmark objects afterwards
rados -p vm-storage cleanup

Performance targets (3-node all-NVMe reference)

  • Random Read 4K: 100K+ IOPS
  • Random Write 4K: 40K+ IOPS
  • Latency p99: <5 ms
  • Throughput: 5+ GB/s

L-1

BIOS & CPU Settings

The foundation: if this layer is wrong, tuning the layers above it is pointless

  • Power Profile: Maximum Performance / OS Control (no power saving)
  • C-States: Disabled (lower wake-up latency)
  • Turbo Boost: Enabled
  • Hyper-Threading: Enabled (more threads for OSDs)
  • NUMA: Enabled (pin OSDs per NUMA node)
  • SR-IOV: Enabled (for NIC passthrough)
  • VT-d / IOMMU: Enabled
  • Memory Frequency: Max supported (3200/4800 MT/s)
L-2

Kernel & OS Tuning

sysctl, CPU governor, IO scheduler

CPU Governor — performance mode

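# Make sure the CPU runs at max frequency
# (cpupower ships in linux-cpupower; cpufrequtils reads /etc/default/cpufrequtils)
apt install linux-cpupower cpufrequtils -y
cpupower frequency-set -g performance

# Persistent
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils

# Verify (the governor is printed on the line after "current policy")
cpupower frequency-info | grep -A 2 "current policy"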

IO Scheduler: none for NVMe, mq-deadline for SATA

# Check the current scheduler
cat /sys/block/nvme0n1/queue/scheduler

# NVMe: use "none" (native multi-queue)
echo none > /sys/block/nvme0n1/queue/scheduler

# SATA SSD: use "mq-deadline"
echo mq-deadline > /sys/block/sda/queue/scheduler

# Persistent via a udev rule
cat > /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
EOF

# Load the new rule without rebooting
udevadm control --reload-rules
udevadm trigger --subsystem-match=block

sysctl tuning for Ceph

# File: /etc/sysctl.d/99-ceph-tuning.conf

# Network buffers (for 25/100 Gbps)
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 134217728
net.core.netdev_max_backlog = 250000

# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_mem = 786432 1048576 26777216
# bbr requires the tcp_bbr kernel module
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
# tcp_low_latency is a legacy key with no effect on kernels 4.14 and later
net.ipv4.tcp_low_latency = 1

# File descriptors (each OSD needs ~1024)
fs.file-max = 26234859
fs.nr_open = 26234859

# Memory
vm.swappiness = 1
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 4194304
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

# Apply (shell command, not part of the file)
sysctl -p /etc/sysctl.d/99-ceph-tuning.conf

# Disable THP (Transparent Huge Pages), which causes latency spikes.
# These are shell commands, not sysctl keys: run them from rc.local or a systemd unit.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
L-3

Network Stack

Jumbo Frame · IRQ pin · RDMA · MTU

The network is Ceph's #1 bottleneck

Every write must be replicated to (size-1) other OSDs, so it costs 2-3x its size in bandwidth. 10 Gbps is usually not enough for all-NVMe; use ≥25 Gbps.
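The replication cost above can be estimated with a few lines of shell. This is a rough sketch: the function names are made up for illustration, and it only counts the (size-1) replica copies sent by the primary OSDs, ignoring protocol overhead and recovery traffic.

```shell
#!/usr/bin/env bash
# replication_mbps CLIENT_WRITE_MBPS POOL_SIZE
# Replica traffic (MB/s) generated by a given client write rate.
replication_mbps() {
    local client_mbps=$1 size=$2
    echo $(( client_mbps * (size - 1) ))
}

# mbps_to_gbps MB_PER_S
# Converts MB/s to Gbit/s (decimal units), one decimal place.
mbps_to_gbps() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m * 8 / 1000 }'
}

# Example: 1000 MB/s of client writes into a size=3 pool
repl=$(replication_mbps 1000 3)      # 2000 MB/s of replica traffic
mbps_to_gbps "$repl"                 # prints 16.0
```

At 16 Gbit/s of replica traffic alone, a 10 Gbps link is already saturated, which is the reasoning behind the ≥25 Gbps recommendation.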

Enable jumbo frames (MTU 9000)

# /etc/network/interfaces (on both the Ceph Public and Cluster networks)
auto bond1
iface bond1 inet manual
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.11/24
    bridge-ports bond1
    mtu 9000

# The switch must also have jumbo frames enabled on the matching ports and VLANs

# Test after configuring:
ping -c 4 -M do -s 8972 10.10.10.12
# Success → jumbo frames work end to end.
# A "frag needed" failure → the switch does not have jumbo frames enabled yet.
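The `-s 8972` in the ping test is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A small helper (the name `ping_payload` is just for illustration) makes that explicit for other MTU values:

```shell
#!/usr/bin/env bash
# ping_payload MTU
# Largest ICMP payload that fits in one unfragmented IPv4 packet.
ping_payload() {
    local mtu=$1
    echo $(( mtu - 20 - 8 ))   # 20-byte IPv4 header + 8-byte ICMP header
}

ping_payload 9000   # prints 8972
```

It could be combined with the test above as `ping -c 4 -M do -s "$(ping_payload 9000)" 10.10.10.12`.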

IRQ pinning: spread interrupt load across multiple CPUs

# Install the tool
apt install ethtool -y

# Increase the number of NIC queues (e.g. 16)
ethtool -L eno3 combined 16

# Increase the ring buffers
ethtool -G eno3 rx 4096 tx 4096

# Disable LRO/GRO on the Ceph network (lower latency)
ethtool -K eno3 lro off gro off

# Stop irqbalance (if installed) so it does not overwrite the manual pinning
systemctl disable --now irqbalance

# Pin each queue's IRQ to its own CPU (round-robin over CPUs 0-15;
# adjust the range to the cores on the NIC's local NUMA node)
for i in $(grep eno3 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
    cpu=$((i % 16))
    echo "$cpu" > "/proc/irq/$i/smp_affinity_list"
done

RDMA / RoCEv2 (advanced)

RDMA bypasses the kernel TCP stack, which can cut latency by 50% or more and lowers CPU usage. It requires an RDMA-capable NIC (Mellanox CX-5/6, Intel E810).

# Enable the RDMA messenger in ceph.conf (experimental)
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 0
L-4

BlueStore — OSD Backend Tuning

WAL · DB · Cache · Compression

BlueStore architecture (keep it in view while tuning)

graph LR
    APP[Client Write] --> OSD[OSD Daemon]
    OSD --> BS[BlueStore Engine]
    BS --> WAL[("WAL<br/>Write-Ahead Log<br/>Fast NVMe")]
    BS --> DB[("RocksDB<br/>Metadata<br/>Fast NVMe")]
    BS --> DATA[("Block Device<br/>Data Storage<br/>HDD/SSD/NVMe")]
    style WAL fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style DB fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style DATA fill:#d1fae5,stroke:#10b981,stroke-width:2px

💡 WAL/DB rule of thumb

If the main OSD device is an HDD, putting WAL+DB on a separate NVMe is MANDATORY and typically raises IOPS 3-5x. If the main OSD device is NVMe, keep everything on the same device (less complexity).

Sizing WAL & DB

Component | Recommended size | Device
WAL (Write-Ahead Log) | 1-2 GB | Enterprise NVMe (high endurance)
DB (RocksDB metadata) | 4% of data size (minimum 30 GB) | Same NVMe as the WAL
Data | Remaining capacity | HDD/SSD/NVMe depending on use case
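The DB sizing rule in the table (4% of data size, never below 30 GB) can be written as a small helper. The function name `db_size_gb` is hypothetical and the result is rounded down to whole GB.

```shell
#!/usr/bin/env bash
# db_size_gb DATA_GB
# Recommended RocksDB volume size: 4% of the data device, minimum 30 GB.
db_size_gb() {
    local data_gb=$1
    local size=$(( data_gb * 4 / 100 ))
    if (( size < 30 )); then
        size=30
    fi
    echo "$size"
}

db_size_gb 4000   # 4 TB HDD   → prints 160
db_size_gb 500    # 500 GB SSD → prints 30 (the minimum applies)
```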

Creating an OSD with a separate WAL/DB device

# Assume one 1 TB NVMe provides WAL+DB for 6 HDD OSDs
# Split the NVMe into 6 DB volumes of 50 GB each (the WAL shares the DB volume)

# PVE GUI: Ceph → OSD → Create OSD
#   OSD Disk: /dev/sdb (HDD 4TB)
#   DB Disk: /dev/nvme0n1 (manual or auto-split)
#   DB size: 50GB

# Or via CLI (size in GiB; the option is db_dev_size):
pveceph osd create /dev/sdb \
  --db_dev /dev/nvme0n1 \
  --db_dev_size 50

BlueStore Cache Memory

# In /etc/ceph/ceph.conf, section [osd]
[osd]
# Automatic memory management (PVE 7+, Ceph Quincy+)
osd_memory_target = 8589934592    # 8GB per OSD

# Cache split:
bluestore_cache_kv_ratio = 0.4      # RocksDB cache
bluestore_cache_meta_ratio = 0.4    # Onode metadata cache
# Block data cache gets the remainder (1 - 0.4 - 0.4 = 0.2);
# BlueStore derives it, so no separate data ratio is set here.

# For HDD clusters, increase prefetch
# (check that the option exists in your release: ceph config help <option>)
bluestore_prefetch_size = 1048576

# Apply: restart the OSDs one at a time
systemctl restart ceph-osd@0

Calculating the RAM needed for OSDs

Rule: osd_memory_target × number of OSDs + 8GB for OS buffers. A server with 12 OSDs needs 12 × 8GB = 96GB, plus 8GB = 104GB RAM (round up to 128GB).
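The same rule as a one-line calculation, for trying other OSD counts or memory targets. The function name and its optional third argument are assumptions made for this sketch.

```shell
#!/usr/bin/env bash
# osd_ram_gb NUM_OSDS TARGET_GB [OS_GB]
# RAM needed on one host: per-OSD memory target times OSD count, plus OS headroom.
osd_ram_gb() {
    local osds=$1 target_gb=$2 os_gb=${3:-8}   # OS headroom defaults to 8 GB
    echo $(( osds * target_gb + os_gb ))
}

osd_ram_gb 12 8   # prints 104, then round up to the next DIMM configuration (128 GB)
```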

L-5

OSD Daemon Tuning

Threading · Op queue · Recovery throttle

# /etc/ceph/ceph.conf, [osd] section
[osd]
# Op threads: shards × threads should match about half the core count
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2

# Disk threads (legacy option; recent releases may ignore it,
# check with: ceph config help osd_disk_threads)
osd_disk_threads = 4

# Recovery throttle: important to limit client impact during recovery
# (with the mClock scheduler, default since Quincy, these limits are
# governed by the mClock profile unless explicitly overridden)
osd_max_backfills = 1              # default 1; do NOT raise in production
osd_recovery_max_active = 3
osd_recovery_op_priority = 3       # 1-63, lower means less impact
osd_client_op_priority = 63        # clients always get top priority

# Scrub (integrity check): run during off-peak hours
osd_scrub_begin_hour = 2
osd_scrub_end_hour = 6
osd_scrub_during_recovery = false
osd_scrub_load_threshold = 0.5
osd_deep_scrub_interval = 604800   # 7 days

# Snapshot trimming throttle
osd_snap_trim_priority = 5
osd_snap_trim_sleep = 0

# Async messenger threads
ms_async_op_threads = 5

NUMA pinning OSD

# Check the NUMA topology
numactl --hardware
lspci -vv | grep -i numa

# Pin the OSD to the NUMA node closest to its disk + NIC
# Edit a systemd unit override:
systemctl edit ceph-osd@0

# Add (unit files do not allow trailing comments, so keep them on their own line):
[Service]
# NUMA 0 cores
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
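To decide which NUMA node an OSD should be pinned to, it helps to read the node of its NIC and NVMe device from sysfs. A minimal sketch follows; the helper name `numa_node_of` is made up, and the device names in the usage comments (`eno3`, `nvme0`) are assumptions taken from the examples in this guide.

```shell
#!/usr/bin/env bash
# numa_node_of SYSFS_DEVICE_DIR
# Prints the NUMA node a PCI device is attached to, or "unknown" if sysfs
# does not expose it. A value of -1 means the platform reports no affinity.
numa_node_of() {
    local f="$1/numa_node"
    if [[ -r $f ]]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Typical usage on a real host (paths depend on your hardware):
#   numa_node_of /sys/class/net/eno3/device
#   numa_node_of /sys/class/nvme/nvme0/device
```

If the NIC and the NVMe report different nodes, one of them will always be remote for a given OSD; which one to favor is a judgment call that the before/after measurements should settle.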
L-6

Pool, PG & CRUSH Tuning

PG count · Erasure coding · Device classes · Failure domain

PG (Placement Group) sizing

The PG count directly affects performance and data balance. Too few → hot spots. Too many → CPU overhead.

# Official Ceph formula:
PG count = (Total OSDs × 100) / (Pool replicas)
→ round up to a power of 2

# Example: 9 OSDs, replica size=3
PG = (9 × 100) / 3 = 300 → round up = 512

# Setting PGs (prefer the autoscaler over manual values)
ceph osd pool set vm-storage pg_autoscale_mode on
ceph osd pool autoscale-status

# If the autoscaler is slow or off target, set it manually:
ceph osd pool set vm-storage pg_num 512
ceph osd pool set vm-storage pgp_num 512
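The formula above, including the round-up to a power of two, as a short function. The name `pg_count` and the optional per-OSD target argument are illustrative; the default of 100 PGs per OSD is the value used in the formula.

```shell
#!/usr/bin/env bash
# pg_count TOTAL_OSDS REPLICAS [PGS_PER_OSD]
# (OSDs × target PGs per OSD) / replicas, rounded up to the next power of two.
pg_count() {
    local osds=$1 replicas=$2 per_osd=${3:-100}
    local raw=$(( osds * per_osd / replicas ))
    local pg=1
    while (( pg < raw )); do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

pg_count 9 3   # raw value 300 → prints 512
```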

Device Classes: separating HDD and SSD

# Ceph detects the device class automatically (hdd/ssd/nvme)
ceph osd tree

# Create a CRUSH rule that uses only SSDs
ceph osd crush rule create-replicated ssd-rule default host ssd
# Format: rule_name root failure_domain device_class

# Create a pool that uses the new rule
ceph osd pool create ssd-pool 128 128 replicated ssd-rule

# Move an existing pool to a different rule
ceph osd pool set vm-storage crush_rule ssd-rule

Erasure Coding: save 50% of raw capacity

3x replication consumes 300% of the data size in raw capacity. EC 4+2 consumes only 150%, at the cost of an IOPS penalty. It suits backups and cold data.
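The 300% and 150% figures follow directly from the layout: a replicated pool stores `size` full copies, while an EC pool stores k data chunks plus m coding chunks for every k chunks of data. A small sketch with hypothetical helper names:

```shell
#!/usr/bin/env bash
# replica_raw_pct SIZE
# Raw capacity consumed by a replicated pool, as a percentage of the data size.
replica_raw_pct() {
    echo $(( $1 * 100 ))
}

# ec_raw_pct K M
# Raw capacity consumed by an erasure-coded pool: (k + m) / k, in percent.
ec_raw_pct() {
    local k=$1 m=$2
    echo $(( (k + m) * 100 / k ))
}

replica_raw_pct 3   # prints 300
ec_raw_pct 4 2      # prints 150
```

Going from 300% to 150% halves the raw capacity needed for the same data, which is where the 50% saving comes from.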

# EC profile 4+2 (needs ≥6 OSDs on ≥6 hosts)
ceph osd erasure-code-profile set ec-4-2 \
  k=4 m=2 \
  crush-failure-domain=host

# Create the EC pool
ceph osd pool create ec-backup 128 128 erasure ec-4-2
ceph osd pool set ec-backup allow_ec_overwrites true

CRUSH failure domain

# Default failure domain = host (1 replica per host)

# Multi-rack setup: create the rack buckets first (repeat for rack02, rack03)
ceph osd crush add-bucket rack01 rack
ceph osd crush move rack01 root=default

# Move each host into its rack
ceph osd crush move pve-node1 rack=rack01
ceph osd crush move pve-node2 rack=rack02
ceph osd crush move pve-node3 rack=rack03

# Create a rule with failure domain = rack
ceph osd crush rule create-replicated rack-rule default rack
L-7

RBD Client & VM Tuning

RBD cache · QEMU IO threads · VirtIO multi-queue

# /etc/ceph/ceph.conf, [client] section on the PVE hosts
[client]
# RBD cache: important for write performance
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 268435456          # 256MB
rbd_cache_max_dirty = 134217728     # 128MB
rbd_cache_target_dirty = 67108864   # 64MB
rbd_cache_max_dirty_age = 2.0       # seconds

# Concurrent ops
rbd_concurrent_management_ops = 20
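When changing the cache sizes, the three values need to stay ordered: target_dirty below max_dirty, and max_dirty below the cache size. A quick sanity check before editing ceph.conf might look like this (the function name is hypothetical, and the write-through case of max_dirty = 0 is deliberately not handled):

```shell
#!/usr/bin/env bash
# rbd_cache_ok CACHE_SIZE MAX_DIRTY TARGET_DIRTY   (all in bytes)
# Succeeds when target_dirty < max_dirty < cache_size.
rbd_cache_ok() {
    local size=$1 max_dirty=$2 target=$3
    (( target < max_dirty && max_dirty < size ))
}

# Values from the [client] section above
if rbd_cache_ok 268435456 134217728 67108864; then
    echo "rbd cache sizes are consistent"
else
    echo "rbd cache sizes are out of order" >&2
fi
```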

VM tuning on Proxmox

# Edit the VM config: /etc/pve/qemu-server/<vmid>.conf

# Disk: VirtIO SCSI single controller + iothread
scsihw: virtio-scsi-single
scsi0: vm-storage:vm-100-disk-0,iothread=1,discard=on,ssd=1,cache=writeback

# CPU: host type + NUMA
cpu: host
numa: 1

# Network: VirtIO multi-queue
net0: virtio=AA:BB:...,bridge=vmbr0,queues=8

# Balloon + memory
memory: 8192
balloon: 4096

# Apply: shut the VM down and start it again (a live change does not apply everything)

Is cache=writeback safe?

It is safe if the host has a UPS and QEMU receives the guest's fsync (flush) requests. It raises IOPS 2-3x compared with writethrough. Production setups commonly use cache=writeback with iothread=1.

L-8

Monitoring & Diagnostics

Measure and monitor continuously to know whether a tuning change actually helped

Built-in tools

# Real-time IOPS, latency
ceph osd perf
ceph -s
ceph osd df
ceph daemon osd.0 perf dump | jq .

# Slow ops detection
ceph health detail | grep -i slow
ceph daemon osd.0 dump_ops_in_flight

# Recent slowest ops, sorted by duration
ceph daemon osd.0 dump_historic_ops_by_duration

# Pool stats
ceph osd pool stats vm-storage
rados df

Prometheus + Grafana stack

# Enable the Ceph Prometheus module (built-in)
ceph mgr module enable prometheus
ceph config set mgr mgr/prometheus/server_port 9283

# Verify the endpoint
curl http://<mgr-ip>:9283/metrics

# Install Prometheus + Grafana (in an LXC container)
# Note: grafana is not in the stock Debian repositories; add the Grafana APT repo first
apt install prometheus grafana -y

# Import the official Grafana dashboards:
# https://grafana.com/grafana/dashboards/2842 (Ceph Cluster)
# https://grafana.com/grafana/dashboards/5336 (Ceph OSD)
# https://grafana.com/grafana/dashboards/5342 (Ceph Pools)

Key metrics to alert on

Metric | Threshold | Action
OSD usage | >80% | Add OSDs / rebalance
PG inactive/incomplete | >0 | Critical: check for down OSDs
Apply latency | >50ms | Check disk health, network
Slow ops | >0 | Identify the slow OSD
Recovery throughput | <500MB/s | Increase osd_recovery_max_active
Client IOPS | Sudden drop | Check network + OSD health

Quick wins: fast tuning checklist

  • L1: BIOS Power = Max Performance, disable C-States
  • L2: CPU governor = performance, disable THP
  • L2: sysctl tuning (net buffers, TCP BBR)
  • L3: MTU 9000 on the Ceph Public + Cluster networks
  • L3: Put Ceph Public and Cluster traffic on two separate networks
  • L4: osd_memory_target = 8GB (or 4GB if RAM is limited)
  • L4: Move WAL/DB to NVMe if the main OSD device is an HDD
  • L5: osd_max_backfills = 1 (production)
  • L5: Scrub hours = 2-6 (off-peak)
  • L6: Enable the PG autoscaler, use device classes
  • L7: VM disk: SCSI + iothread + cache=writeback
  • L7: RBD cache enabled (256MB)
  • L8: Set up Prometheus + Grafana dashboards