Resizing an OSD
Prerequisites
You need a shell inside the Ceph toolbox. You can open such a shell with:
$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
The Ceph cluster must be healthy.
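A quick way to check the second prerequisite from inside the toolbox:

toolbox# ceph health detail

It should report HEALTH_OK; if it reports anything else, resolve that before continuing.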
Change the size of a single OSD
In theory, there are two ways to go about this. In either case, the plan is to increase the OSD size in the cluster configuration and then substitute an existing OSD: either by first removing the existing OSD or by first adding a new one.
It is important to remember the restrictions around removal of OSDs: only the OSD on the highest numbered volume can be permanently removed from the cluster; any other OSD would be recreated by the operator, even with a reduced OSD count.
This makes it impossible to first add a larger OSD and then (permanently) remove a smaller OSD: the newly added larger OSD would have the highest numbered volume, and would thus be the only OSD you can permanently remove from the cluster.
Thus, the only feasible way is to first remove any OSD and have the operator replace it with a larger one. Note that this replacement is exactly the effect which prevents us from permanently removing any OSD except the one on the highest numbered volume.
If you do not have enough space in the cluster to buffer away the data of an OSD, you can temporarily increase the OSD count by one and later remove that newly created OSD.
Note that all of this applies in the same way to resizing an OSD to a smaller volume size and to resizing many OSDs (though you probably shouldn’t resize many at once, for your own cognitive good).
The steps are in many places very similar to removing an OSD, but since we don’t want to change the overall number of OSDs, we’ll never decrease the OSD count.
Caution
This process will create and attach Cinder volumes. Due to a known kernel bug, this may crash the Ceph workers. Ceph will recover from this; however, it may lead to temporarily unavailable data.
Ensure that the cluster is healthy and pick a victim OSD to remove:
toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META     AVAIL  %USE VAR  PGS STATUS
 0   hdd 0.08690  1.00000  89 GiB 7.5 GiB 6.5 GiB    0 B    1 GiB 81 GiB 8.47 1.19  18     up
 2   hdd 0.08690  1.00000  89 GiB 6.1 GiB 5.1 GiB 20 KiB 1024 MiB 83 GiB 6.80 0.95  18     up
 1   hdd 0.08690  1.00000  89 GiB 5.5 GiB 4.5 GiB 24 KiB 1024 MiB 84 GiB 6.16 0.86  12     up
                    TOTAL 267 GiB  19 GiB  16 GiB 44 KiB  3.0 GiB 248 GiB 7.14
MIN/MAX VAR: 0.86/1.19  STDDEV: 0.98

toolbox# ceph -s
  cluster:
    id:     9c61da6b-67e9-4acd-a25c-929db5cbb3f2
    health: HEALTH_WARN
            3 slow ops, oldest one blocked for 512 sec, mon.c has slow ops

  services:
    mon: 3 daemons, quorum a,b,c (age 8s)
    mgr: a(active, since 32m)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 29m), 3 in (since 44m)

  data:
    pools:   3 pools, 48 pgs
    objects: 4.21k objects, 16 GiB
    usage:   19 GiB used, 248 GiB / 267 GiB avail
    pgs:     48 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Pick one of the OSDs. In this example, we’ll “resize” the OSD with ID 2.
We need to find the volume which the OSD uses:
$ kubectl -n rook-ceph describe pod -l ceph-osd-id=2
[…]
  cinder-1-ceph-data-8hd8s:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cinder-1-ceph-data-8hd8s
    ReadOnly:   false
[…]
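If you prefer a one-liner over scanning the describe output, something along these lines should print the claim name(s) of the PVC-backed volumes of the OSD pod (a sketch; the OSD pod normally has exactly one data PVC):

$ kubectl -n rook-ceph get pod -l ceph-osd-id=2 \
    -o jsonpath='{.items[0].spec.volumes[*].persistentVolumeClaim.claimName}'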
$id shall now refer to 2, $name shall refer to osd.2, and $volume shall refer to cinder-1-ceph-data-8hd8s. (This is also an excellent example of OSD IDs not mapping 1:1 to volume indices.)
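Since the toolbox and your workstation are separate shells, these placeholders have to be set in each shell where they are used. A convenience sketch for this example:

$ id=2                              # used by the kubectl commands on your workstation
$ volume=cinder-1-ceph-data-8hd8s   # used by the kubectl commands on your workstation
toolbox# name=osd.2                 # used by the ceph commands inside the toolbox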
Set the new target volume size. Update the config.toml of the Kubernetes cluster by setting k8s-service-layer.rook.osd_volume_size to the new desired size (see the config.toml sketch below).

Run the toml_helper.py and apply the changes by running ansible stage 3 (possibly with -t rook to only apply the rook changes).
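For orientation, the relevant excerpt of config.toml could look roughly like this after the change. The key path is the one named above; the section layout and the value format shown here are assumptions, so mirror whatever your existing config.toml already uses:

# config.toml (excerpt) -- example only
[k8s-service-layer.rook]
osd_volume_size = "190Gi"  # new desired size; use the value format your config already uses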
Evict all data from the victim OSD.

toolbox# ceph osd crush reweight $name 0
Wait for the migration to finish.
You can run watch ceph osd df as well as watch ceph -s to observe the migration status; the former will show how the number of placement groups (PGS column) for that OSD decreases, while the latter will show the status of the cluster overall (a combined one-liner is sketched below the list).

The migration is over when:

- the number of placement groups for your victim OSD is 0, and
- all placement groups show as active+clean in ceph -s.
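If you prefer a single toolbox terminal, watching both views at once works as well (adjust the interval to taste):

toolbox# watch -n 30 "ceph osd df; echo; ceph -s"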
Note

The RAW USE column of the ceph osd df output does not decrease for some reason. The column to look at is DATA, which should reduce to something in the order of 10 MiB.

toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META     AVAIL  %USE  VAR  PGS STATUS
 0   hdd 0.08690  1.00000  89 GiB  11 GiB 9.7 GiB    0 B    1 GiB 78 GiB 11.97 1.46  29     up
 2   hdd 0        1.00000  89 GiB 3.0 GiB  24 MiB 20 KiB 1024 MiB 86 GiB  3.39 0.41   0     up
 1   hdd 0.08690  1.00000  89 GiB 8.3 GiB 7.3 GiB 24 KiB 1024 MiB 81 GiB  9.27 1.13  19     up
                    TOTAL 267 GiB  22 GiB  17 GiB 44 KiB  3.0 GiB 245 GiB 8.21
MIN/MAX VAR: 0.41/1.46  STDDEV: 3.58

toolbox# ceph -s
  cluster:
    id:     9c61da6b-67e9-4acd-a25c-929db5cbb3f2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 9m)
    mgr: a(active, since 41m)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 38m), 3 in (since 2m)

  data:
    pools:   3 pools, 48 pgs
    objects: 4.21k objects, 16 GiB
    usage:   21 GiB used, 246 GiB / 267 GiB avail
    pgs:     48 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
Mark the OSD as out.
toolbox# ceph osd out $name
ceph osd df should now show it with all zeros, and ceph -s should still be HEALTH_OK with all placement groups being active+clean, since the data has been moved elsewhere. If this is not the case, abort now and seek help immediately!

toolbox# ceph -s
[…]
  services:
    mon: 3 daemons, quorum a,b,c (age 10m)
    mgr: a(active, since 42m)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 39m), 2 in (since 22s)

  data:
    pools:   3 pools, 48 pgs
    objects: 4.21k objects, 16 GiB
    usage:   21 GiB used, 246 GiB / 267 GiB avail
    pgs:     48 active+clean
[…]
Restart the operator to trigger the removal of the evicted OSD:
$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0
$ sleep 5
$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 1
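Assuming a kubectl new enough to have rollout restart (v1.15+), bouncing the operator in one go should have the same effect:

$ kubectl -n rook-ceph rollout restart deployment rook-ceph-operator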
Wait until the operator has deleted the OSD pod (should be at most 10 minutes).
$ watch kubectl -n rook-ceph get pod -l ceph-osd-id=$id
This command should eventually print “No resources found in rook-ceph namespace.”
(Rook will auto-delete OSDs which are marked as out and have no placement groups.)
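If you prefer a blocking command over watching, kubectl wait can be used instead (a sketch; it returns immediately if the pod is already gone):

$ kubectl -n rook-ceph wait --for=delete pod -l ceph-osd-id=$id --timeout=15m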
Purge the OSD. If the data has not been moved, data loss will occur here!
toolbox# ceph osd purge $name
Note

You do not need --yes-i-really-mean-it since all data was moved to another device. If ceph asks you for --yes-i-really-mean-it, something is wrong!

ceph osd df should not list the OSD anymore, and ceph -s should say that there are now only 2 OSDs (if you started out with 3), all of which should be up and in.
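As an additional, optional sanity check, the purged OSD should also no longer appear in the CRUSH tree:

toolbox# ceph osd tree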
Delete the preparation job and the PVC.

Caution
Deleting the wrong job and PVC will inevitably lead to loss of data! Double-check that you’re killing the correct volume by first running:
$ kubectl -n rook-ceph describe pvc $volume
The Mounted By: line should only list a single user, which is rook-ceph-osd-prepare-$volume-$suffix (where $suffix is a random suffix).

Once you’ve verified that, you can delete the job and the PVC:
$ kubectl -n rook-ceph delete job rook-ceph-osd-prepare-$volume
$ kubectl -n rook-ceph delete pvc $volume
Verify that the volume is gone with kubectl get pvc -n rook-ceph and in OpenStack.
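For the OpenStack side, a simple spot check is to list the Cinder volumes and confirm that nothing belonging to the deleted PVC is left over (note that the Cinder volume is usually named after the bound PersistentVolume, i.e. something like pvc-<uid>, rather than after the PVC itself; the exact naming depends on the provisioner):

$ openstack volume list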
Restart the operator to trigger re-creation of the OSD.

$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0
$ sleep 5
$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 1
You can observe progress by watching the pod list and ceph osd df. The newly created volume (kubectl -n rook-ceph get pvc) should have the new size.

The process is done when you see:

- all OSDs in ceph -s as up and in, and
- all placement groups as active+clean.
toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META     AVAIL   %USE VAR  PGS STATUS
 2   hdd 0.18459  1.00000 189 GiB 9.9 GiB 8.9 GiB    0 B    1 GiB 179 GiB 5.25 0.91  27     up
 0   hdd 0.08690  1.00000  89 GiB 6.2 GiB 5.2 GiB    0 B    1 GiB  83 GiB 6.92 1.20  14     up
 1   hdd 0.08690  1.00000  89 GiB 5.0 GiB 2.0 GiB 44 KiB 1024 MiB  84 GiB 5.66 0.98   7     up
                    TOTAL 367 GiB  21 GiB  16 GiB 44 KiB  3.0 GiB 346 GiB 5.75
MIN/MAX VAR: 0.91/1.20  STDDEV: 0.73

toolbox# ceph -s
  cluster:
    id:     9c61da6b-67e9-4acd-a25c-929db5cbb3f2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 27s)
    mgr: a(active, since 12m)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)

  data:
    pools:   3 pools, 48 pgs
    objects: 4.21k objects, 16 GiB
    usage:   21 GiB used, 346 GiB / 367 GiB avail
    pgs:     48 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr