Fixing a Ceph performance WTF
Here are initial performance results for a simple write workload on a new Ceph cluster. There are 6 nodes in the cluster with 2 OSDs per node. Each OSD has a dedicated data drive formatted with XFS, and both OSDs on a node share an SSD for their journals.
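For context, the journal-on-SSD layout means each OSD's journal points at its own partition of the node's shared SSD. In ceph.conf terms, one node looks roughly like this (the device paths below are made up for illustration; only the general shape matters):
[osd.3]
    host = issdm-23
    osd data = /var/lib/ceph/osd/ceph-3    ; XFS-formatted data drive mounted here
    osd journal = /dev/sdc1                ; partition 1 of the shared journal SSD
[osd.9]
    host = issdm-23
    osd data = /var/lib/ceph/osd/ceph-9
    osd journal = /dev/sdc2                ; partition 2 of the same SSD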
nwatkins@issdm-9:~$ rados -p iod1 bench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_issdm-9_7039
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 32 16 63.975 64 0.414645 0.419866
2 16 60 44 87.9757 112 0.480593 0.337239
3 16 87 71 94.644 108 0.168348 0.324669
Well, I hope the cluster can do better than that, but we are only connected with gigabit Ethernet, so it looks alright so far.
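As a quick sanity check on that (my arithmetic, not anything measured here): gigabit Ethernet gives 1000 Mbit/s ÷ 8 ≈ 125 MB/s of raw bandwidth, and after Ethernet/IP/TCP framing and Ceph messenger overhead a single client usually tops out somewhere around 110 MB/s. So 90-110 MB/s out of rados bench is already close to line rate.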
2014-01-31 12:24:55.910827min lat: 0.07462 max lat: 19.2477 avg lat: 1.0116
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 116 100 19.9974 6 0.107051 1.0116
21 16 116 100 19.0451 0 - 1.0116
22 16 116 100 18.1795 0 - 1.0116
23 16 116 100 17.3891 0 - 1.0116
24 16 116 100 16.6645 0 - 1.0116
25 16 116 100 15.998 0 - 1.0116
26 16 116 100 15.3827 0 - 1.0116
27 16 116 100 14.8129 0 - 1.0116
28 16 126 110 15.7123 5 0.084678 1.66572
29 16 130 114 15.7222 16 0.077086 1.61002
Um, well, it didn’t get any better. Notice all the zeros in the cur MB/s column. Something is wonky. Check ceph status:
cluster 51a40b44-b068-4da0-b25d-bd31ffd134c4
health HEALTH_WARN 12 requests are blocked > 32 sec
monmap e1: 1 mons at {issdm-2=192.168.141.102:6789/0}, election epoch 2, quorum 0 issdm-2
osdmap e115: 12 osds: 12 up, 12 in
pgmap v398: 3000 pgs, 4 pools, 468 MB data, 117 objects
914 MB used, 2792 GB / 2793 GB avail
3000 active+clean
client io 3866 kB/s wr, 0 op/s
That’s concerning. Get some details with ceph health detail:
HEALTH_WARN 13 requests are blocked > 32 sec; 2 osds have slow requests
3 ops are blocked > 65.536 sec
10 ops are blocked > 32.768 sec
6 ops are blocked > 32.768 sec on osd.3
3 ops are blocked > 65.536 sec on osd.9
4 ops are blocked > 32.768 sec on osd.9
2 osds have slow requests
Alright, well, it’s nice that Ceph also thinks something is wrong. Where are those OSDs?
nwatkins@issdm-9:~$ ceph osd tree
# id weight type name up/down reweight
-1 2.76 root default
...
-5 0.46 host issdm-23
3 0.23 osd.3 up 1
9 0.23 osd.9 up 1
-6 0.46 host issdm-24
4 0.23 osd.4 up 1
10 0.23 osd.10 up 1
...
Oh, both osd.3 and osd.9 are on the same node. Must be a shitty node. There doesn’t really seem to be anything out of the ordinary in dmesg on issdm-23. This is relatively old hardware, and we’ve had problems before. I’m just going to decommission the node and replace it later.
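(If I were going to dig further before pulling the node, the next step would be something like the following to figure out whether the data drives or the shared journal SSD are the slow part; the device name is an assumption and I didn’t actually run these here.)
$ iostat -x 5                   # watch for a device with pegged %util or huge await
$ sudo smartctl -a /dev/sdb     # check SMART counters: reallocated/pending sectors, error logs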
First, mark the OSDs on issdm-23 out and wait for recovery to complete.
nwatkins@issdm-9:~$ ceph osd out 3
marked out osd.3.
nwatkins@issdm-9:~$ ceph osd out 9
marked out osd.9.
...
nwatkins@issdm-9:~$ ceph health detail
HEALTH_OK
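(During the elided wait above, something like this is handy for keeping an eye on the rebalance; just a sketch, not output from this session.)
$ ceph -w              # stream cluster status updates until all PGs are back to active+clean
$ watch -n 5 ceph -s   # or just poll the status summary every few seconds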
Ok, now we are ready to really remove them from the cluster. On issdm-23 we’ll shut down the daemons.
nwatkins@issdm-23:~$ sudo stop ceph-osd-all
ceph-osd-all stop/waiting
Then remove the OSDs from the CRUSH map, revoke their authentication keys, and all that other good stuff.
nwatkins@issdm-9:~$ ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
nwatkins@issdm-9:~$ ceph osd crush remove osd.9
removed item id 9 name 'osd.9' from crush map
nwatkins@issdm-9:~$ ceph auth del osd.3
updated
nwatkins@issdm-9:~$ ceph auth del osd.9
updated
nwatkins@issdm-9:~$ ceph osd rm 3
removed osd.3
nwatkins@issdm-9:~$ ceph osd rm 9
removed osd.9
Ok, we’re back up with a clean bill of health, and two fewer OSDs.
nwatkins@issdm-9:~$ ceph -s
cluster 51a40b44-b068-4da0-b25d-bd31ffd134c4
health HEALTH_OK
monmap e1: 1 mons at {issdm-2=192.168.141.102:6789/0}, election epoch 2, quorum 0 issdm-2
osdmap e127: 10 osds: 10 up, 10 in
pgmap v465: 3000 pgs, 4 pools, 616 MB data, 154 objects
1031 MB used, 2326 GB / 2327 GB avail
3000 active+clean
Running the same benchmark test again…
nwatkins@issdm-9:~$ rados -p iod1 bench 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_issdm-9_8653
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 39 23 91.9677 92 0.626457 0.442627
2 16 66 50 99.975 108 0.527901 0.521575
3 16 96 80 106.644 120 0.268914 0.537312
4 16 125 109 108.978 116 1.29911 0.535492
5 16 151 135 107.98 104 0.335791 0.526973
6 16 180 164 109.313 116 0.771654 0.551064
7 16 208 192 109.695 112 0.584754 0.548393
8 16 236 220 109.981 112 0.468813 0.543709
9 16 265 249 110.647 116 0.347052 0.553681
10 15 291 276 110.381 108 0.723344 0.561019
11 16 320 304 110.527 112 0.557533 0.560427
12 16 348 332 110.648 112 1.2155 0.563304
Alright! This WTF has been resolved.