Initial Test Notes for Rook Ceph

Issues

  • PVCs do not seem to get deleted when the cluster is deleted
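
    Until this is tracked down, the leftover PVCs have to be cleaned up by hand. A minimal sketch, assuming the default rook-ceph namespace and that the leftovers are the OSD data PVCs (double-check before deleting anything):

    # List the PVCs still present after the CephCluster has been deleted
    kubectl -n rook-ceph get pvc

    # Delete them once they are confirmed to be leftovers
    kubectl -n rook-ceph delete pvc --all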

Resilience testing

Executive Summary:

  • I was not able to put the cluster in a state which caused permanent data loss

  • All scenarios eventually end with Ceph fully operational again

Scenario 1: systemctl reboot on a node with 1 OSD and 1 Mon

Scenario/Execution:

  • Have containers using the Ceph cluster (via RBD and CephFS volumes) constantly write and check data (a sketch of such a workload follows this list)

  • All pools without redundancy

  • Pick a node with just an OSD and a Mon and reboot it
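
    A minimal sketch of such a writer/checker workload (an assumed stand-in, not the exact harness used in these tests): write chunks into the mounted volume, record their checksums, and continuously re-verify everything written so far:

    #!/bin/sh
    # Runs inside a pod with an RBD or CephFS volume mounted at /data.
    i=0
    while true; do
        i=$((i + 1))
        dd if=/dev/urandom of="/data/blob-$i" bs=1M count=4 2>/dev/null
        sha256sum "/data/blob-$i" >> /data/checksums
        # Any mismatch here would indicate data loss or corruption
        sha256sum -c /data/checksums > /dev/null || echo "CORRUPTION at iteration $i"
        sleep 1
    done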

Effect:

  • Writers stall while the cluster is reconfiguring

  • OSD is migrated away from the rebooted node as soon as it comes up

  • Mon appears to be restarted sooner (not verified)

Scenario 2: kubectl drain --ignore-daemonsets --delete-local-data on a node with ALL the OSDs

Part 1

Scenario/Execution:

  • Have containers using the Ceph cluster (via RBD and CephFS volumes) constantly write and check data

  • All pools without redundancy

  • OSDs are (for some reason) all scheduled to the same node

  • kubectl drain that node
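
    For reference, roughly the commands used (the node name is an example from this environment; the selector assumes the usual Rook pod labels):

    # Check which node all the OSD pods ended up on
    kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide

    # Drain that node
    kubectl drain managed-k8s-worker-2 --ignore-daemonsets --delete-local-data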

Effect:

  • Writers stall while the cluster is reconfiguring

  • OSDs are rescheduled to other nodes immediately-ish

  • Mon is not rescheduled (since we only have three nodes and they’re configured with anti-affinity)

  • Writers resume as soon as OSDs are up

  • Cluster ends in HEALTH_WARN due to missing Mon

Part 2

Execution:

  • kubectl uncordon the drained node
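
    i.e., roughly (node name as in the drain sketch above):

    kubectl uncordon managed-k8s-worker-2

    # Watch the Mon get scheduled back onto the freed-up node
    kubectl -n rook-ceph get pods -l app=rook-ceph-mon -o wide -w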

Effect:

  • No effect on the writers

  • Mon is rescheduled to the fresh node

  • Cluster is in HEALTH_WARN due to “2 daemons have recently crashed”. I’m not being gentle to this thing :>
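
    The crash warning can be inspected and then cleared via the ceph CLI (a sketch, assuming a toolbox pod or some other way to run ceph commands):

    # List the recorded daemon crashes
    ceph crash ls

    # Acknowledge them all so the cluster can return to HEALTH_OK
    ceph crash archive-all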

Scenario 3: Hard-reset a node with 1 OSD and 1 Mon

Scenario/Execution:

  • Have containers using the Ceph cluster (via RBD and CephFS volumes) constantly write and check data

  • All pools without redundancy

  • Pick a node with just an OSD and a Mon and hard-reset it using OpenStack
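
    Roughly (assuming the usual Rook pod labels; the instance name is an example from this environment):

    # Find a node that hosts exactly one OSD and one Mon
    kubectl -n rook-ceph get pods -l 'app in (rook-ceph-osd, rook-ceph-mon)' -o wide

    # Hard-reset the corresponding OpenStack instance
    openstack server reboot --hard managed-k8s-worker-1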

Effect:

  • Writers stall while OSD is unavailable

    health: HEALTH_WARN
    insufficient standby MDS daemons available
    1 MDSs report slow metadata IOs
    1 osds down
    1 host (1 osds) down
    no active mgr
    2 daemons have recently crashed
    1/3 mons down, quorum a,b
    

    moohahaha

  • OSD is not rescheduled to another node. Probably the reboot was too quick for Rook to act?

  • No data loss AFAICT

  • Mon failed to re-join quorum for several minutes

  • Deleted the Mon pod (roughly as sketched after the log excerpt below), which did not seem to help

  • However, it joined quorum after four more minutes. Until then, its log was a loop of:

    debug 2020-02-06 07:46:56.889 7fae42cd5700  1 mon.c@2(electing) e3 handle_auth_request failed to assign global_id
    debug 2020-02-06 07:46:57.785 7fae414d2700  1 mon.c@2(electing).elector(563) init, last seen epoch 563, mid-election, bumping
    debug 2020-02-06 07:46:57.809 7fae414d2700 -1 mon.c@2(electing) e3 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
    debug 2020-02-06 07:46:57.829 7fae3eccd700 -1 mon.c@2(electing) e3 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
    debug 2020-02-06 07:46:59.425 7fae414d2700 -1 mon.c@2(electing) e3 get_health_metrics reporting 1 slow ops, oldest is log(1 entries from seq 1 at 2020-02-06 07:43:56.297914)
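
    The poking at the Mon was done roughly like this (a sketch; assumes the ceph CLI is available via the toolbox pod and that the Mon pods carry the usual app=rook-ceph-mon and mon=<id> labels):

    # Check which Mon is out of quorum
    ceph quorum_status --format json-pretty

    # Delete the stuck Mon pod so its deployment recreates it
    # (mon=c is the out-of-quorum daemon in the log above; label is an assumption)
    kubectl -n rook-ceph delete pod -l app=rook-ceph-mon,mon=c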
    

Scenario 4: Shut down a node for good, without draining (hard VM host crash)

Scenario/Execution:

  • Have containers using the Ceph cluster (via RBD and CephFS volumes) constantly write and check data

  • All pools without redundancy

  • Pick a node with just an OSD and a Mon and power it off using openstack
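
    Roughly (the instance name is an example from this environment):

    # Power the instance off without any draining, simulating a hard host crash
    openstack server stop managed-k8s-worker-1

    # Watch the node go NotReady and the pods on it get stuck
    kubectl get nodes -w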

Effect:

  • Writers stall

  • Rook reschedules the CephFS daemons and tries to reschedule the Mon

  • Mon cannot be rescheduled due to lack of nodes

  • State after ~4m:

    cluster:
       id:     da1f93e9-8ce0-47ee-82c7-f32a5d0caedf
       health: HEALTH_WARN
                2 MDSs report slow metadata IOs
                1 MDSs report slow requests
                1 osds down
                1 host (1 osds) down
                Reduced data availability: 7 pgs inactive
                2 daemons have recently crashed
                1/3 mons down, quorum a,b
    
    services:
       mon: 3 daemons, quorum a,b (age 3m), out of quorum: c
       mgr: a(active, since 3m)
       mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
       osd: 3 osds: 2 up (since 4m), 3 in (since 19h)
    
    data:
       pools:   3 pools, 24 pgs
       objects: 1.23k objects, 1.9 GiB
       usage:   3.9 GiB used, 174 GiB / 178 GiB avail
       pgs:     29.167% pgs unknown
                17 active+clean
                7  unknown
    

    Wat, it killed the operator?!

    po/rook-ceph-operator-7d65b545f7-8x4z8                              1/1       Running       0          10s
    po/rook-ceph-operator-7d65b545f7-wj9jb                              1/1       Terminating   1          32m
    

    Oh… it was running on the node I killed…

  • At 10m, the OSD is still not respawned. The issue is:

    FirstSeen     LastSeen        Count   From                            SubObjectPath   Type            Reason                  Message
    ---------     --------        -----   ----                            -------------   --------        ------                  -------
    9m            9m              1       default-scheduler                               Normal          Scheduled               Successfully assigned rook-ceph/rook-ceph-osd-2-86d456488b-628rc to managed-k8s-worker-2
    9m            9m              1       attachdetach-controller                         Warning         FailedAttachVolume      Multi-Attach error for volume "pvc-f3af19ea-0c59-44fa-a574-e1e9a86b6199" Volume is already used by pod(s) rook-ceph-osd-2-86d456488b-m45kb
    7m            59s             4       kubelet, managed-k8s-worker-2                   Warning         FailedMount             Unable to mount volumes for pod "rook-ceph-osd-2-86d456488b-628rc_rook-ceph(5a0ce905-028c-4dca-b610-7e116968e8ab)": timeout expired waiting for volumes to attach or mount for pod "rook-ceph"/"rook-ceph-osd-2-86d456488b-628rc". list of unmounted volumes=[cinder-2-ceph-data-qt5dp]. list of unattached volumes=[rook-data rook-config-override rook-ceph-log rook-ceph-crash devices cinder-2-ceph-data-qt5dp cinder-2-ceph-data-qt5dp-bridge run-udev rook-binaries rook-ceph-osd-token-4tlzx]
    
  • I’m now hard-detaching the volume from the powered-off instance (roughly as sketched below)…
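
    A sketch of the detach, assuming the Cinder volume is named after the PV from the Multi-Attach error above (the actual "hard" detach may additionally need an admin-level force detach or volume state reset):

    # Find the Cinder volume backing the OSD's PV
    openstack volume list | grep pvc-f3af19ea-0c59-44fa-a574-e1e9a86b6199

    # Detach it from the powered-off instance (instance name is an example)
    openstack server remove volume managed-k8s-worker-1 pvc-f3af19ea-0c59-44fa-a574-e1e9a86b6199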

  • Did not help. After >1h, the cluster is still broken. Since we do not have any redundancy, we cannot recover from this unless we reboot the node, which is unfortunate; the volume exists and the data is there, but cinder CSI can’t re-attach it :(

  • Now I deleted the wrong node and probably broke the cluster :(

  • Re-starting worker-1 in an attempt to recover

  • Mons and OSDs are being rescheduled

  • It’s Ceph we’re talking about: the cluster is healthy, despite me messing with it in even more ways (unintentionally!):

    • Hard-reboot another node

    • systemctl restart docker on all nodes (masters and workers)

Scenario 4a: Hard power-off a node without draining, then delete the node

Scenario/Execution:

  • Have containers using the Ceph cluster (via RBD and CephFS volumes) constantly write and check data

  • All pools without redundancy

  • Pick a node with just an OSD and a Mon and power it off using openstack

  • Once its pods enter the Terminating state, delete the node
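
    Roughly (the instance/node name is an example from this environment):

    # Hard power-off as before
    openstack server stop managed-k8s-worker-1

    # Wait for the pods on the node to show up as Terminating, then drop the node object
    kubectl -n rook-ceph get pods -o wide | grep Terminating
    kubectl delete node managed-k8s-worker-1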

Effect:

  • Writers are blocked as soon as the node is off

    cluster:
        id:     da1f93e9-8ce0-47ee-82c7-f32a5d0caedf
        health: HEALTH_WARN
                2 MDSs report slow metadata IOs
                1 MDSs report slow requests
                1 osds down
                1 host (1 osds) down
                Reduced data availability: 7 pgs stale
                4 daemons have recently crashed
                1/3 mons down, quorum a,c
    
    services:
        mon: 3 daemons, quorum a,c (age 2m), out of quorum: b
        mgr: a(active, since 16m)
        mds: ceph-fs:1 {0=ceph-fs-b=up:active} 1 up:standby-replay
        osd: 3 osds: 2 up (since 2m), 3 in (since 59m)
    
    data:
        pools:   3 pools, 24 pgs
        objects: 2.10k objects, 2.5 GiB
        usage:   5.5 GiB used, 261 GiB / 267 GiB avail
        pgs:     17 active+clean
                 7  stale+active+clean
    
  • Terminating pods disappear and the OSD gets rescheduled, initially blocked on its volume (but wait for it…)

    Events:
      FirstSeen     LastSeen        Count   From                            SubObjectPath   Type            Reason                  Message
      ---------     --------        -----   ----                            -------------   --------        ------                  -------
      2m            2m              1       default-scheduler                               Normal          Scheduled               Successfully assigned rook-ceph/rook-ceph-osd-2-86d456488b-slmf6 to managed-k8s-worker-2
      2m            2m              1       attachdetach-controller                         Warning         FailedAttachVolume      Multi-Attach error for volume "pvc-f3af19ea-0c59-44fa-a574-e1e9a86b6199" Volume is already used by pod(s) rook-ceph-osd-2-86d456488b-dg775
      10s           10s             1       kubelet, managed-k8s-worker-2                   Warning         FailedMount             Unable to mount volumes for pod "rook-ceph-osd-2-86d456488b-slmf6_rook-ceph(8405cbb6-95ff-4512-befa-e279009e2e07)": timeout expired waiting for volumes to attach or mount for pod "rook-ceph"/"rook-ceph-osd-2-86d456488b-slmf6". list of unmounted volumes=[cinder-2-ceph-data-qt5dp]. list of unattached volumes=[rook-data rook-config-override rook-ceph-log rook-ceph-crash devices cinder-2-ceph-data-qt5dp cinder-2-ceph-data-qt5dp-bridge run-udev rook-binaries rook-ceph-osd-token-4tlzx]
    

    (it can take up to 5 minutes for cinder to recognize that a pod is gone and re-try attaching a volume…)

  • Mon does not get rescheduled for the usual reasons (anti-affinity)

  • Aaand there we go. After ~10 minutes of downtime, the OSD is up again and data is available.