Fixing a Ceph performance WTF
Here are initial performance results for a simple write workload on a new Ceph cluster. There are 6 nodes in the cluster with 2 OSDs per node. Each OSD has a dedicated data drive formatted with XFS, and both OSDs on a node share an SSD for their journals.
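For context, the journal-on-SSD layout means each OSD's journal points at its own partition of the node's shared SSD. In ceph.conf terms, one node looks roughly like this (the device paths below are made up for illustration; only the general shape matters):
[osd.3]
    host = issdm-23
    osd data = /var/lib/ceph/osd/ceph-3    ; XFS-formatted data drive mounted here
    osd journal = /dev/sdc1                ; partition 1 of the shared journal SSD
[osd.9]
    host = issdm-23
    osd data = /var/lib/ceph/osd/ceph-9
    osd journal = /dev/sdc2                ; partition 2 of the same SSD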
nwatkins@issdm-9:~$ rados -p iod1 bench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_issdm-9_7039
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 32 16 63.975 64 0.414645 0.419866
2 16 60 44 87.9757 112 0.480593 0.337239
3 16 87 71 94.644 108 0.168348 0.324669
Well, I hope the cluster can do better than that, but we are only connected with gigabit Ethernet, so it looks alright so far.
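As a quick sanity check on that (my arithmetic, not anything measured here): gigabit Ethernet gives 1000 Mbit/s ÷ 8 ≈ 125 MB/s of raw bandwidth, and after Ethernet/IP/TCP framing and Ceph messenger overhead a single client usually tops out somewhere around 110 MB/s. So 90-110 MB/s out of rados bench is already close to line rate.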
2014-01-31 12:24:55.910827min lat: 0.07462 max lat: 19.2477 avg lat: 1.0116
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 116 100 19.9974 6 0.107051 1.0116
21 16 116 100 19.0451 0 - 1.0116
22 16 116 100 18.1795 0 - 1.0116
23 16 116 100 17.3891 0 - 1.0116
24 16 116 100 16.6645 0 - 1.0116
25 16 116 100 15.998 0 - 1.0116
26 16 116 100 15.3827 0 - 1.0116
27 16 116 100 14.8129 0 - 1.0116
28 16 126 110 15.7123 5 0.084678 1.66572
29 16 130 114 15.7222 16 0.077086 1.61002
Um, well, it didn’t get any better. Notice all the zeros in the cur MB/s column. Something is wonky. Check ceph status:
cluster 51a40b44-b068-4da0-b25d-bd31ffd134c4
health HEALTH_WARN 12 requests are blocked > 32 sec
monmap e1: 1 mons at {issdm-2=192.168.141.102:6789/0}, election epoch 2, quorum 0 issdm-2
osdmap e115: 12 osds: 12 up, 12 in
pgmap v398: 3000 pgs, 4 pools, 468 MB data, 117 objects
914 MB used, 2792 GB / 2793 GB avail
3000 active+clean
client io 3866 kB/s wr, 0 op/s
That’s concerning. Get some details with ceph health detail:
HEALTH_WARN 13 requests are blocked > 32 sec; 2 osds have slow requests
3 ops are blocked > 65.536 sec
10 ops are blocked > 32.768 sec
6 ops are blocked > 32.768 sec on osd.3
3 ops are blocked > 65.536 sec on osd.9
4 ops are blocked > 32.768 sec on osd.9
2 osds have slow requests
Alright, well, it’s nice that Ceph also thinks something is wrong. Where are those OSDs?
nwatkins@issdm-9:~$ ceph osd tree
# id weight type name up/down reweight
-1 2.76 root default
...
-5 0.46 host issdm-23
3 0.23 osd.3 up 1
9 0.23 osd.9 up 1
-6 0.46 host issdm-24
4 0.23 osd.4 up 1
10 0.23 osd.10 up 1
...
Oh, both osd.3 and osd.9 are on the same node. Must be a shitty node. There doesn’t really seem to be anything out of the ordinary in dmesg on issdm-23. This is relatively old hardware, and we’ve had problems before. I’m just going to decommission the node and replace it later.
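(If I were going to dig further before pulling the node, the next step would be something like the following to figure out whether the data drives or the shared journal SSD are the slow part; the device name is an assumption and I didn’t actually run these here.)
$ iostat -x 5                   # watch for a device with pegged %util or huge await
$ sudo smartctl -a /dev/sdb     # check SMART counters: reallocated/pending sectors, error logs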
First, mark the OSDs on issdm-23 out and wait for recovery to complete.
nwatkins@issdm-9:~$ ceph osd out 3
marked out osd.3.
nwatkins@issdm-9:~$ ceph osd out 9
marked out osd.9.
...
nwatkins@issdm-9:~$ ceph health detail
HEALTH_OK
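(During the elided wait above, something like this is handy for keeping an eye on the rebalance; just a sketch, not output from this session.)
$ ceph -w              # stream cluster status updates until all PGs are back to active+clean
$ watch -n 5 ceph -s   # or just poll the status summary every few seconds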
Ok, now we are ready to really remove them from the cluster. On issdm-23 we’ll shut down the daemons.
nwatkins@issdm-23:~$ sudo stop ceph-osd-all
ceph-osd-all stop/waiting
Then remove the OSDs from the CRUSH map, revoke their authentication keys, and all that other good stuff.
nwatkins@issdm-9:~$ ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
nwatkins@issdm-9:~$ ceph osd crush remove osd.9
removed item id 9 name 'osd.9' from crush map
nwatkins@issdm-9:~$ ceph auth del osd.3
updated
nwatkins@issdm-9:~$ ceph auth del osd.9
updated
nwatkins@issdm-9:~$ ceph osd rm 3
removed osd.3
nwatkins@issdm-9:~$ ceph osd rm 9
removed osd.9
Ok, we’re back up with a clean bill of health, and two fewer OSDs.
nwatkins@issdm-9:~$ ceph -s
cluster 51a40b44-b068-4da0-b25d-bd31ffd134c4
health HEALTH_OK
monmap e1: 1 mons at {issdm-2=192.168.141.102:6789/0}, election epoch 2, quorum 0 issdm-2
osdmap e127: 10 osds: 10 up, 10 in
pgmap v465: 3000 pgs, 4 pools, 616 MB data, 154 objects
1031 MB used, 2326 GB / 2327 GB avail
3000 active+clean
Running the same benchmark test again…
nwatkins@issdm-9:~$ rados -p iod1 bench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_issdm-9_8653
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 39 23 91.9677 92 0.626457 0.442627
2 16 66 50 99.975 108 0.527901 0.521575
3 16 96 80 106.644 120 0.268914 0.537312
4 16 125 109 108.978 116 1.29911 0.535492
5 16 151 135 107.98 104 0.335791 0.526973
6 16 180 164 109.313 116 0.771654 0.551064
7 16 208 192 109.695 112 0.584754 0.548393
8 16 236 220 109.981 112 0.468813 0.543709
9 16 265 249 110.647 116 0.347052 0.553681
10 15 291 276 110.381 108 0.723344 0.561019
11 16 320 304 110.527 112 0.557533 0.560427
12 16 348 332 110.648 112 1.2155 0.563304
Alright! This WTF has been resolved.