Reducing the number of OSDs
Warning
This document was created quite some time ago and its information may be misleading or outdated.
If the load on the meta machines is too high because of the number of OSDs, the number of OSDs can be reduced losslessly. This requires a few steps and great care, but it is possible.
Prerequisites
You need a shell inside the Ceph toolbox. You can open such a shell with:
$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
The ceph cluster must be healthy (refer to Cluster Health Verification).
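A quick way to check this from the toolbox (this is not a replacement for the full verification procedure, just the basic commands):

toolbox# ceph status          # should report health: HEALTH_OK
toolbox# ceph health detail   # explains any warnings or errors if it does not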
Remove an OSD without data loss
The replacement and removal of OSDs are documented in the official Rook documentation.
Manual approach
The following procedure was tested and verified in a test cluster with 6 OSDs. To check that no data loss occurs, a volume was created with 8 GiB of data on it: 4 GiB of zeroes and 4 GiB of random data. After each step, the data was validated against SHA256 checksums, with caches dropped so that the data was actually read back from the OSDs.
During creation of the data, it was ensured by watching the output of ceph osd df that all OSDs got a share of the data, to prevent a false negative. In addition, the procedure was repeated three times.
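The validation can be reproduced roughly like this (a sketch, not the original test setup; the mount point, file sizes and the use of /proc/sys/vm/drop_caches are assumptions):

$ dd if=/dev/zero    of=/mnt/testvol/zeros.bin  bs=1M count=4096   # 4 GiB of zeroes
$ dd if=/dev/urandom of=/mnt/testvol/random.bin bs=1M count=4096   # 4 GiB of random data
$ sha256sum /mnt/testvol/*.bin > /mnt/testvol/checksums.sha256
# after each step of the procedure: drop the page cache, then re-verify
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ sha256sum -c /mnt/testvol/checksums.sha256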
Check the current disk usage of the OSDs:
toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS
 0   hdd 0.08690  1.00000  89 GiB 5.0 GiB 4.0 GiB  0 B 1 GiB  84 GiB 5.59 1.35   6     up
 2   hdd 0.08690  1.00000  89 GiB 5.2 GiB 4.2 GiB  0 B 1 GiB  84 GiB 5.83 1.41   7     up
 1   hdd 0.08690  1.00000  89 GiB 3.0 GiB 2.0 GiB  0 B 1 GiB  86 GiB 3.34 0.81   2     up
 3   hdd 0.08690  1.00000  89 GiB 4.9 GiB 3.9 GiB  0 B 1 GiB  84 GiB 5.46 1.32   4     up
 4   hdd 0.08690  1.00000  89 GiB 1.0 GiB 4.6 MiB  0 B 1 GiB  88 GiB 1.13 0.27   3     up
 5   hdd 0.08690  1.00000  89 GiB 3.0 GiB 2.0 GiB  0 B 1 GiB  86 GiB 3.42 0.83   2     up
                     TOTAL 534 GiB  22 GiB  16 GiB  0 B 6 GiB 512 GiB 4.13
MIN/MAX VAR: 0.27/1.41  STDDEV: 1.68
In this example, we have six OSDs. Most of them hold between 2 GiB and 4 GiB of data (DATA column). One (osd.4) holds almost no data, and one (osd.2) holds the most with 4.2 GiB.
Find the OSD to remove. You can only remove the OSD with the highest-numbered cinder volume PVC.
Note
The numbering of the PVCs and the OSDs does not necessarily match: OSD 0 may use volume 3, OSD 3 may use volume 2, and OSD 2 may use volume 0. You always have to discover the volume in use with the following procedure.
First list all the OSD PVCs:
$ kubectl -n rook-ceph get pvc -l ceph.rook.io/DeviceSet=cinder
NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
cinder-0-ceph-data-v6grl   Bound    pvc-27cccd47-5443-462b-b8d9-fe3192945223   90Gi       RWO            csi-sc-cinderplugin   134m
cinder-1-ceph-data-8hd8s   Bound    pvc-83c714af-617b-4834-b9be-30f9b9f4e96f   90Gi       RWO            csi-sc-cinderplugin   134m
cinder-2-ceph-data-n6gdx   Bound    pvc-9c304b41-bb0c-4a84-8481-3990f5f34618   90Gi       RWO            csi-sc-cinderplugin   134m
cinder-3-ceph-data-446b4   Bound    pvc-3929dde6-eb35-402f-a4ff-07c00db59447   90Gi       RWO            csi-sc-cinderplugin   134m
cinder-4-ceph-data-hjwmt   Bound    pvc-3f4f4a66-692f-4cd6-8f12-e55244da62df   90Gi       RWO            csi-sc-cinderplugin   134m
cinder-5-ceph-data-bwctn   Bound    pvc-07400ac2-bc0c-4f97-999f-d50ba81b33ec   90Gi       RWO            csi-sc-cinderplugin   134m
Then pick the one with the highest number in the name; in this example this is cinder-5-ceph-data-bwctn:
$ kubectl -n rook-ceph describe pvc cinder-5-ceph-data-bwctn
[…]
Mounted By:  rook-ceph-osd-5-c6d577548-vpscr
             rook-ceph-osd-prepare-cinder-5-ceph-data-bwctn-89sm8
[…]
The pod names under Mounted By give you the OSD ID of the OSD using this volume; here the rook-ceph-osd-5-… pod tells us it is OSD 5. In this example, the OSD ID is the same as the volume index, but you cannot rely on this!
In the following, we’ll refer to osd.5 by $name, to 5 by $id (ceph lingo) and to cinder-5-ceph-data-bwctn by $volume.
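For the remaining steps it is convenient to keep these in shell variables (values from this example; adjust them to your cluster, and remember that the toolbox shell and your local shell are separate, so set them wherever you run the respective commands):

$ volume=cinder-5-ceph-data-bwctn   # PVC name (used with kubectl)
$ id=5                              # OSD ID (used with kubectl and inside the toolbox)
$ name=osd.$id                      # OSD name in ceph lingo (used inside the toolbox)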
Before proceeding, check that there is enough space. For that, at least the amount of DATA of the OSD to remove needs to be AVAIL on the other OSDs combined. Due to the way ceph balances data, it is better if most OSDs can take most of the data of the OSD you remove.

I’m not sure what’s going to happen if the data cannot be rebalanced, but I assume that (a) it will not cause data loss if you abort before removing the OSD and (b) ceph will tell you.
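One way to make this check concrete is to compare the numbers from the JSON output of ceph osd df (a sketch; it assumes jq is available in the toolbox and that your Ceph release exposes kb_used_data and kb_avail per OSD in the JSON output):

toolbox# ceph osd df -f json | jq --arg name "$name" '
      ([.nodes[] | select(.name == $name) | .kb_used_data] | add) as $need
    | ([.nodes[] | select(.name != $name) | .kb_avail]     | add) as $free
    | {need_kb: $need, avail_kb: $free, enough_space: ($free > $need)}'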
Set the weight of the OSD to 0. This makes ceph redistribute the data on that OSD to the other OSDs:
toolbox# ceph osd crush reweight $name 0
Note
There is a difference between ceph osd crush reweight and ceph osd reweight (see here). ceph osd crush reweight is a permanent weighting measure, while ceph osd reweight is a temporary measure which gets lost on an in/out cycle of an OSD.

Wait for the migration to finish.
You can run watch ceph osd df as well as watch ceph -s to observe the migration status; the former will show how the number of placement groups (PGS column) for that OSD decreases, while the latter will show the status of the cluster overall.

The migration is over when:
The number of placement groups for your victim OSD is 0
All placement groups show as active+clean in ceph -s
Note: the RAW USE column of the ceph osd df output does not decrease for some reason. The column to look at is DATA, which should reduce to something in the order of 10 MiB.

toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS
 0   hdd 0.08690  1.00000  89 GiB 5.0 GiB 4.0 GiB  0 B 1 GiB  84 GiB 5.59 1.24   7     up
 2   hdd 0.08690  1.00000  89 GiB 7.2 GiB 6.2 GiB  0 B 1 GiB  82 GiB 8.12 1.80   8     up
 1   hdd 0.08690  1.00000  89 GiB 3.0 GiB 2.0 GiB  0 B 1 GiB  86 GiB 3.34 0.74   2     up
 3   hdd 0.08690  1.00000  89 GiB 4.9 GiB 3.9 GiB  0 B 1 GiB  84 GiB 5.46 1.21   4     up
 4   hdd 0.08690  1.00000  89 GiB 1.0 GiB 5.2 MiB  0 B 1 GiB  88 GiB 1.13 0.25   3     up
 5   hdd 0        1.00000  89 GiB 3.0 GiB 1.9 GiB  0 B 1 GiB  86 GiB 3.42 0.76   0     up
                     TOTAL 534 GiB  24 GiB  18 GiB  0 B 6 GiB 510 GiB 4.51
MIN/MAX VAR: 0.25/1.80  STDDEV: 2.20

toolbox# ceph -s
  cluster:
    id:     9c61da6b-67e9-4acd-a25c-929db5cbb3f2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 2h), 6 in (since 2h)

  data:
    pools:   3 pools, 24 pgs
    objects: 4.21k objects, 16 GiB
    usage:   24 GiB used, 510 GiB / 534 GiB avail
    pgs:     24 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
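As an additional check, you can list the placement groups still mapped to the victim OSD directly; an empty list means the migration is done (ceph pg ls-by-osd is a standard Ceph command):

toolbox# ceph pg ls-by-osd $name    # should print no placement groups once the migration is over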
Mark the OSD as out.
toolbox# ceph osd out $name
ceph osd df should now show it with all zeros, and ceph -s should still be HEALTH_OK with all placement groups being active+clean, since the data has been moved elsewhere. If this is not the case, abort now and seek help immediately!

toolbox# ceph -s
[…]
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 2h), 5 in (since 7s)

  data:
    pools:   3 pools, 24 pgs
    objects: 4.21k objects, 16 GiB
    usage:   22 GiB used, 512 GiB / 534 GiB avail
    pgs:     24 active+clean
[…]
Reduce the number of OSDs in the Cluster CRD. Update the config.toml of the cluster by decreasing k8s-service-layer.rook.nosds by one.
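For illustration, the change could look like this (a sketch; the exact layout of your config.toml may differ, only the key k8s-service-layer.rook.nosds is taken from this document):

[k8s-service-layer.rook]
nosds = 5   # was 6 before removing the OSD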
Run the toml_helper.py and execute stage three (possibly with -t rook to only apply the rook changes).

Wait until the cluster has updated. Watch the output of:
$ kubectl -n rook-ceph get cephcluster rook-ceph -o yaml | tail
until the state has changed to Updating and back to Created.
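If you prefer to watch only the state field instead of the whole tail, something like this should work (assuming the CephCluster status exposes the state field mentioned above):

$ watch "kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.status.state}'"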
Wait until the operator has deleted the OSD pod.
$ watch kubectl -n rook-ceph get pod -l ceph-osd-id=$id
This command should print “No resources found in rook-ceph namespace.”
(Rook will auto-delete OSDs which are marked as out and have no placement groups.)
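As an additional safety net that is not part of the original procedure, you can ask Ceph itself whether the OSD can be removed without risking data (ceph osd safe-to-destroy is available since Luminous):

toolbox# ceph osd safe-to-destroy $name    # exits successfully only if no data would be lost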
Purge the OSD. If the data has not been moved, data loss will occur here!
toolbox# ceph osd purge $name
Note
You do not need --yes-i-really-mean-it since all data was moved to another device. If ceph asks you for --yes-i-really-mean-it, something is wrong!

ceph osd df should not list the OSD anymore, and ceph -s should say that there are now only 5 OSDs (if you started out with 6), all of which should be up and in.

toolbox# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS
 0   hdd 0.08690  1.00000  89 GiB 5.0 GiB 4.0 GiB  0 B 1 GiB  84 GiB 5.59 1.18   7     up
 2   hdd 0.08690  1.00000  89 GiB 7.2 GiB 6.2 GiB  0 B 1 GiB  82 GiB 8.12 1.72   8     up
 1   hdd 0.08690  1.00000  89 GiB 3.0 GiB 2.0 GiB  0 B 1 GiB  86 GiB 3.34 0.71   2     up
 3   hdd 0.08690  1.00000  89 GiB 4.9 GiB 3.9 GiB  0 B 1 GiB  84 GiB 5.46 1.16   4     up
 4   hdd 0.08690  1.00000  89 GiB 1.0 GiB 5.6 MiB  0 B 1 GiB  88 GiB 1.13 0.24   3     up
                     TOTAL 445 GiB  21 GiB  16 GiB  0 B 5 GiB 424 GiB 4.73
MIN/MAX VAR: 0.24/1.72  STDDEV: 2.35

toolbox# ceph -s
  cluster:
    id:     9c61da6b-67e9-4acd-a25c-929db5cbb3f2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ceph-fs:1 {0=ceph-fs-a=up:active} 1 up:standby-replay
    osd: 5 osds: 5 up (since 2m), 5 in (since 5m)

  data:
    pools:   3 pools, 24 pgs
    objects: 4.21k objects, 16 GiB
    usage:   21 GiB used, 424 GiB / 445 GiB avail
    pgs:     24 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Delete the preparation job and the PVC:
$ kubectl -n rook-ceph delete job rook-ceph-osd-prepare-$volume
$ kubectl -n rook-ceph delete pvc $volume

Verify that the volume is gone with kubectl get pvc -n rook-ceph and in openstack.
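A possible way to do that verification (a sketch; it assumes the Cinder volume created by the CSI plugin carries the PV name, pvc-07400ac2-… in this example, in its name):

$ kubectl -n rook-ceph get pvc                  # $volume must no longer be listed
$ openstack volume list | grep pvc-07400ac2     # example: search for the old PV name; no output expected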
Congratulations, you now have one OSD less.