Release notes
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
We use towncrier to generate our release notes file.
Information about unreleased changes can be found here.
For changes before summer 2023, see the end of this document; additionally, git log --no-merges will help you get a rough overview of earlier changes.
v8.1.6 (2024-11-18)
Bugfixes
The kubernetes.storage.cinder_enable_topology config option is now applied when using Kubernetes v1.29 or a higher supported version. (!1563)
v8.1.5 (2024-11-05)
Bugfixes
v8.1.4 (2024-10-23)
Bugfixes
A bug has been fixed which caused Kubernetes updates to fail during PKI renewal if kubernetes.controller_manager.enable_signing_requests is enabled. (!1535)
v8.1.3 (2024-10-16)
Bugfixes
A bug has been fixed which prevented Kubernetes workers from joining the cluster. (!1524)
v8.1.2 (2024-10-16)
Bugfixes
A bug has been fixed which caused frontend nodes which were set up before release v8.1 to potentially perform a system update and reboot. (!1521)
A bug has been fixed which prevented freshly initialized nodes from getting a system update. (!1521)
The bug which affected existing clusters set up before release v8.1 and caused Kubernetes workers to try to re-join the cluster has been fixed for real now. (!1521)
v8.1.1 (2024-10-07)
Bugfixes
A bug has been fixed which affected existing clusters set up before release v8.1 and caused Kubernetes workers to try to re-join the cluster. (!1510)
v8.1.0 (2024-10-01)
New Features
We now have a binary cache for Nix build artifacts, so users won’t have to build anything from source. To use it, configure the following in /etc/nix/nix.conf:
extra-substituters = https://yaook.cachix.org
extra-trusted-public-keys = yaook.cachix.org-1:m85JtxgDjaNa7hcNUB6Vc/BTxpK5qRCqF4yHoAniwjQ=
(!930)
Poetry dependencies are now packaged within the Nix devShell (!930)
Add option to let calico announce the service cluster IP range to external peers. This is necessary in setups where external entities want to send traffic to cluster IPs instead of pods or node ports. (!1455)
The devShell can now be selected with the env var YAOOK_K8S_DEVSHELL (defaulting to 'default'); see the sketch after this list. (!1457)
Support for a separate inventory for the custom stage was added. It is used in addition to the main inventory.
Previously, the custom stage inventory was silently dropped with release v4.0.0. This is now noted in the v4.0.0 release notes. (!1472)
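A hedged sketch of selecting a non-default devShell, assuming the variable is simply exported before the devShell is (re)loaded, e.g. via direnv (the shell name "ci" is purely illustrative):
export YAOOK_K8S_DEVSHELL=ci
direnv reload   # or re-enter the devShell manually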
Changed functionality
Bugfixes
A missing call to install_prerequisites has been added to upgrade.sh (!1481)
Spawning clusters on Openstack with apply-all.sh has been fixed (!1488)
Cluster-health-verification in combination with poetry2nix has been fixed (!1495)
Devshell reloading in combination with poetry2nix has been fixed (!1495)
Changes in the Documentation
A hint to set USE_VAULT_IN_DOCKER for development setups has been added to the Initialization documentation (!1461)
The documentation has been updated to ensure compliance with the Kubernetes trademarks and streamlined to consistently use “YAOOK/K8s” for the LCM. (!1465)
Deprecations and Removals
tools/patch_config.py was removed, completing its deprecation cycle started in release 6.1.0. (!1468)
Other Tasks
Misc
v8.0.2 (2024-09-20)
Bugfixes
Spawning clusters on Openstack with apply-all.sh has been fixed (!1488)
v8.0.1 (2024-09-09)
Bugfixes
The release migration script was fixed to support bare metal cluster repos as well. (!1470)
v8.0.0 (2024-08-28)
Breaking changes
The YAOOK/K8s Terraform module now allows worker nodes to be joined into individual anti-affinity groups.
Attention
Action required
You must migrate your Terraform state by running the migration script.
./managed-k8s/actions/migrate-to-release.sh
(!1317)
The YAOOK/K8s Terraform module does not build a default set of nodes (3 masters + 4 workers) anymore when no nodes are given. (!1317)
The automatic just-in-time migration of Terraform resources from count to for_each introduced in July 2022 was removed in favor of a once-and-for-all migration.
./managed-k8s/actions/migrate-to-release.sh
(!1317)
YAOOK/K8s Terraform does not implicitly assign nodes to availability zones anymore if none was configured for a node.
For all master and worker nodes, availability zones must now be configured explicitly; [terraform].enable_az_management has therefore been removed.
Not configuring availability zones now leaves the choice to the cloud controller, which may or may not select one. To achieve the same effect for gateway nodes, turn off [terraform].spread_gateways_across_azs.
Attention
Action required
To prevent Terraform from unnecessarily rebuilding master and worker nodes, you must run the migration script. It will determine each node's availability zone from the Terraform state and set it in the config for you.
./managed-k8s/actions/migrate-to-release.sh
(!1317)
The format of the [terraform] config section changed significantly.
Terraform nodes are now to be configured as blocks of values rather than across separate lists for each type of value. Furthermore, you now have control over the whole name of Terraform nodes; see the documentation for further details.
 [terraform]
-masters = 2
-master_names = ["X", "Y"]
-master_flavors = ["M", "M"]
-master_images = ["Ubuntu 20.04 LTS x64", "Ubuntu 22.04 LTS x64"]
-master_azs = ["AZ1", "AZ3"]
+#....
+
+[terraform.nodes.master-X]
+role = "master"  # mandatory
+flavor = "M"
+image = "Ubuntu 20.04 LTS x64"
+az = "AZ1"
+#....
+
+[terraform.nodes.worker-A]
+role = "worker"  # mandatory
+flavor = "S"
+image = "Debian 12 (bookworm)"
+az = "AZ3"
 #....
The gateway/master/worker defaults are consolidated into blocks as well.
 [terraform]
-gateway_image_name = "Debian 12 (bookworm)"
-gateway_flavor = "XS"
-default_master_image_name = "Ubuntu 22.04 LTS x64"
-default_master_flavor = "M"
-default_master_root_disk_size = 50
-default_worker_image_name = "Ubuntu 22.04 LTS x64"
-default_worker_flavor = "L"
-default_worker_root_disk_size = 100
+#....
+
+[terraform.gateway_defaults]
+image = "Debian 12 (bookworm)"
+flavor = "XS"
+
+[terraform.master_defaults]
+image = "Ubuntu 22.04 LTS x64"
+flavor = "M"
+root_disk_size = 50
+
+[terraform.worker_defaults]
+image = "Ubuntu 22.04 LTS x64"
+flavor = "L"
+root_disk_size = 100
 #....
The worker anti-affinity settings [terraform].worker_anti_affinity_group_name and [terraform].worker_join_anti_affinity_group are merged into [terraform.workers.<name>].anti_affinity_group or [terraform.worker_defaults].anti_affinity_group. Unset means “no join”.
 [terraform]
-worker_anti_affinity_group_name = "some-affinity-group"
-worker_join_anti_affinity_group = [false, true]
+#....
+
+[terraform.worker_defaults]
+
+[terraform.workers.0]
+
+[terraform.workers.1]
+anti_affinity_group = "some-affinity-group"
 #....
Attention
Action required
You must convert your config into the new format.
./managed-k8s/actions/migrate-to-release.sh
(!1317)
Gateway node names are now index-based rather than availability-zone-based, leading to names like managed-k8s-gw-0 instead of managed-k8s-gw-az1.
Attention
Action required
To prevent Terraform from unnecessarily rebuilding gateway nodes, you must run the migration script.
./managed-k8s/actions/migrate-to-release.sh
(!1317)
New Features
Terraform: Anti-affinity group settings are now configurable per worker node. (!1317)
Terraform: The number of gateway nodes created no longer depends on the number of availability zones and can be set with [terraform].gateway_count; see the sketch after this list. The setting’s default yields the previous behavior when [terraform].spread_gateways_across_azs is enabled, which it is by default. (!1317)
A rework has been done which now allows triggering a specific playbook of k8s-core or k8s-supplements. The default behavior of triggering install-all.yaml has been preserved. See apply-k8s-core.sh and apply-k8s-supplements.sh for usage information. (!1433)
It is now possible to set the root URL for Grafana (!1447)
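A minimal config.toml sketch for pinning the gateway count, using the option name from the entry above (the value 3 is illustrative):
[terraform]
gateway_count = 3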
Changed functionality
The minimum Terraform version is increased to 1.3 (!1317)
Bugfixes
Importing the Thanos object storage configuration has been reworked to not fail erroneously. (!1437)
Other Tasks
v7.0.2 (2024-08-26)
Bugfixes
A bug has been fixed which prevented the configuration of an exposed Vault service. (!1448)
v7.0.1 (2024-08-26)
Bugfixes
kube-state-metrics not being able to read namespace labels has been fixed (!1438)
v7.0.0 (2024-08-22)
Breaking changes
The dual stack support has been reworked and fixed. The variable dualstack_support has been split into two variables, ipv4_enabled (defaults to true) and ipv6_enabled (defaults to false), to allow IPv6-only deployments and a more fine-granular configuration.
The following configuration changes are recommended, but not mandatory:
 [terraform]
-dualstack_support = false
+ipv6_enabled = false
Existing clusters running on OpenStack must execute the Terraform stage once:
$ ./managed-k8s/actions/apply-terraform.sh
to re-generate the inventory and hosts file for Ansible. (!1304)
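A config.toml sketch for an IPv6-only deployment, using the new variables in the [terraform] section shown above (this assumes your environment actually provides IPv6 connectivity):
[terraform]
ipv4_enabled = false
ipv6_enabled = true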
New Features
Support for ch-k8s-lbaas v0.8.0 and v0.9.0 has been added. The ch-k8s-lbaas version is now an optional variable. To ensure the latest supported version is used, one can simply unset it:
$ tomlq --in-place --toml-output 'del(."ch-k8s-lbaas".version)' config/config.toml
(!1304)
Introduce support for setting remote write targets ([[remote_writes]]) for Prometheus; a sketch follows after this list. (!1396)
Add new modules http_api and http_api_insecure for the Blackbox exporter, allowing status codes 200, 300, 401 to be returned for HTTP probes. http_api_insecure additionally doesn’t care about the issuer of a certificate. (!1420)
The default version for rook/Ceph has been bumped to v1.14.9. (!1430)
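A hedged sketch of a Prometheus remote write target. The exact location of [[remote_writes]] within config.toml and the full set of supported keys are assumptions here, so check the configuration template (the URL is illustrative):
[[k8s-service-layer.prometheus.remote_writes]]
url = "https://metrics.example.com/api/v1/write"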
Changed functionality
Bugfixes
Changes in the Documentation
We now explain our release withdrawal procedure in the Release and Versioning Policy (!1376)
The documentation now links to the latest version of the Calico docs instead of a specific version (where possible). (!1408)
The generated Terraform docs were updated. (!1434)
Deprecations and Removals
The “global monitoring” functionality has been dropped. It was a provider-specific feature and has been dropped as the LCM should be kept as general as possible. (!1270)
Other Tasks
Misc
v6.1.2 (2024-08-19)
Bugfixes
In the v6.0.0 release notes, we now draw attention to committing etc/ssh_known_hosts in the cluster repository so that the re-enabled SSH host key verification does not require every user to use TOFU at first. (!1413)
v6.1.1 (2024-08-15)
Bugfixes
Fixed a bug in k8s-login.sh which would fail if the etc directory did not exist. (!1416)
v6.1.0 (2024-08-07)
New Features
Added support for Kubernetes v1.30 (!1385)
Configuration options have been added to cert-manager and ingress-controller to further streamline general helm chart handling. (!1387)
Add MANAGED_K8S_GIT_BRANCH environment variable to allow specifying a branch that should be checked out when running init-cluster-repo.sh. (!1388)
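A usage sketch, assuming the variable is exported for the invocation (the branch name is illustrative):
MANAGED_K8S_GIT_BRANCH=release/v6.1 ./actions/init-cluster-repo.sh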
Changed functionality
The mapped Calico versions have been bumped due to a bug which can result in high CPU utilization on nodes. If no custom Calico version is configured, Calico will get updated automatically on the next rollout. It is strongly recommended to do a rollout:
$ AFLAGS="--diff -t calico" bash managed-k8s/actions/apply-k8s-supplements.sh
(!1393)
Bugfixes
Deprecations and Removals
Other Tasks
v6.0.3 (2024-07-22)
Updated the changelog after a few patch releases in the v5.1 series were withdrawn and superseded by another patch release.
Because the v6.0 release series already includes the breaking change that is removed again in the v5.1 release series, we kept it and just added it to the v6.0.0 release notes.
v6.0.2 (2024-07-20)
Changed functionality
Sourcing lib.sh is now side-effect free (!1340)
The entrypoint for the custom stage has been moved into the LCM. It now includes the connect-to-nodes role and then dispatches to the custom playbook. If you had included connect-to-nodes in the custom playbook, you may now remove it.
diff --git a/k8s-custom/main.yaml b/k8s-custom/main.yaml
-# Node bootstrap is needed in most cases
-- name: Initial node bootstrap
-  hosts: frontend:k8s_nodes
-  gather_facts: false
-  vars_files:
-    - k8s-core-vars/ssh-hardening.yaml
-    - k8s-core-vars/disruption.yaml
-    - k8s-core-vars/etc.yaml
-  roles:
-    - role: bootstrap/detect-user
-      tag: detect-user
-    - role: bootstrap/ssh-known-hosts
-      tags: ssh-known-hosts
(!1352)
The version of bird-exporter for prometheus has been updated to 1.4.3, haproxy-exporter to 0.15, and keepalived-exporter to 0.7.0. (!1357)
Bugfixes
The required actions in the notes of release v6.0.0 were incomplete and have now been fixed. (!1366)
Other Tasks
v6.0.1 (2024-07-17)
Changed functionality
The default version of the kube-prometheus-stack helm chart has been updated to 59.1.0, and prometheus-adapter to version 4.10.0. (!1314)
Bugfixes
When initializing a new Wireguard endpoint, nftables may not get reloaded. This has been fixed. (!1339)
If the vault instance is not publicly routable, nodes were not able to log in to it as the vault certificate handling was faulty. This has been fixed. (!1358)
A fix to properly generate short-lived kubeconfigs with intermediate CAs has been supplied. (!1359)
Other Tasks
Misc
v6.0.0 (2024-07-02)
Breaking changes
We now use short-lived (8d) kubeconfigs
The kubeconfig at etc/admin.conf is now only valid for 8 days after creation (it was 1 year). Also, it is now discouraged to check it into version control; instead, refresh it on each orchestrator as needed using tools/vault/k8s-login.sh.
If your automation relies on the kubeconfig being checked into VCS or on it being valid for one year, you probably need to adapt it.
In order to switch to the short-lived kubeconfig, run
$ git rm etc/admin.conf
$ sed --in-place '/^etc\/admin\.conf$/d' .gitignore
$ git commit etc/admin.conf -m "Remove kubeconfig from git"
$ ./managed-k8s/tools/vault/init.sh
$ ./managed-k8s/tools/vault/update.sh
$ ./managed-k8s/actions/k8s-login.sh
This will remove the long-term kubeconfig and generate a short-lived one. (!1178)
We now provide an opt-in regression fix that restores Kubernetes’ ability to respond to certificate signing requests.
Using the fix is completely optional; see Restoring Kubernetes’ ability to sign certificates for further details.
Action required: As a prerequisite for making the regression fix functional you must update your Vault policies by executing the following:
# execute with Vault root token sourced
./managed-k8s/tools/vault/init.sh
(!1219)
Some environment variables have been removed.
WG_USAGE and TF_USAGE have been moved from .envrc to config.toml. If they have been set to false, the respective options wireguard.enabled and terraform.enabled in config.toml need to be set accordingly; a minimal sketch follows below. If they were not touched (i.e. they are set to true), no action is required.
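A minimal config.toml sketch for clusters that previously set WG_USAGE or TF_USAGE to false (the section names follow the option paths named above):
[wireguard]
enabled = false

[terraform]
enabled = false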
CUSTOM_STAGE_USAGE has been removed. The custom stage is now always run if the playbook exists. No manual action required. (!1263)
SSH host key verification has been re-enabled. Nodes are getting signed SSH certificates. For clusters not using a vault running inside docker as backend, automated certificate renewal is configured on the nodes. The SSH CA is stored inside $CLUSTER_REPOSITORY/etc/ssh_known_hosts and can be used to SSH to nodes.
Attention: Make sure that file is not gitignored and is committed after rollout.
The vault policies have been adjusted to allow the orchestrator role to read the SSH CA from vault. You must therefore update the vault policies:
Note
A root token is required.
$ ./managed-k8s/tools/vault/init.sh
This is needed just once. (!1272)
With Kubernetes v1.29, the user specified in the admin.conf kubeconfig is now bound to the kubeadm:cluster-admins RBAC group. This requires an update to the Vault cluster policies and configuration.
You must update your vault policies and roles; a root token must be sourced.
$ ./managed-k8s/tools/vault/init.sh
$ ./managed-k8s/tools/vault/update.sh
To upgrade your Kubernetes cluster from version v1.28 to v1.29, follow these steps:
Warning
You must upgrade to a version greater than v1.29.5 due to kubeadm #3055.
$ MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.29.x
$ ./managed-k8s/actions/k8s-login.sh
Note that the default upgrade procedure has changed: addons now get upgraded after all control plane nodes have been upgraded, not along with the first control plane node. (!1284)
Use the volumeV3 client in Terraform. volumeV2 is not supported everywhere.
Note
This breaking change was originally introduced by release 5.1.2, but was reverted again with release 5.1.5 as release 5.1.2 got withdrawn.
If you have [terraform].create_root_disk_on_volume = true set in your config, you must migrate the openstack_blockstorage_volume_v2 resources in your Terraform state to the v3 resource type in order to prevent rebuilds of all servers and their volumes.
# Execute the lines produced by the following script
# This will import all v2 volumes as v3 volumes
# and remove the v2 volume resources from the Terraform state.
terraform_module="managed-k8s/terraform"
terraform_config="../../terraform/config.tfvars.json"
for item in $( \
  terraform -chdir=$terraform_module show -json \
  | jq --raw-output '.values.root_module.resources[] | select(.type == "openstack_blockstorage_volume_v2") | .name+"[\""+.index+"\"]"+","+.values.id' \
); do
  echo "terraform -chdir=$terraform_module import -var-file=$terraform_config 'openstack_blockstorage_volume_v3.${item%,*}' '${item#*,}' " \
    "&& terraform -chdir=$terraform_module state rm 'openstack_blockstorage_volume_v2.${item%,*}'"
done
(!1245)
New Features
Changed functionality
The CI image is now built as part of this repo’s pipeline using a Nix Flake (!1175)
Thanos CPU limits have been removed (!1186)
PKI renewal during Kubernetes upgrades has been refined and can be explicitly triggered or skipped via the newly introduced renew-pki tag. (!1251)
All release notes will now have a link to their corresponding MR. (!1294)
(!1325)
Bugfixes
Adjust .gitignore template to keep the whole inventory (!1274). Action recommended: adapt your .gitignore with
sed --in-place '/^!\?\/inventory\/.*$/d' .gitignore
After each phase of a root CA rotation, a new kubeconfig is automatically generated (!1293)
The common monitoring labels feature has been fixed. (!1303)
Keys in the wireguard endpoint dict have been fixed. (!1329)
Changes in the Documentation
Add hints for the Terraform config (!1246)
A variable setting to avoid problems with the keyring backend has been added to the template of ~/.config/yaook-k8s/env. (!1269)
A hint to fix incorrect locale settings for Ansible has been added. (!1297)
A missing variable has been added to the reference (!1313)
Deprecations and Removals
Support for rook_v1 has been dropped. From now on, we only support deploying rook via Helm. (!1042)
Deprecated vault policies have been removed after a sufficient transition time.
Hint
A root token is required.
./managed-k8s/tools/vault/init.sh
Execute the above to remove them from your vault instance. (!1318)
Other Tasks
Security
All Ansible tasks that handle secret keys are now prevented from logging them. (!1295)
Misc
v5.1.5 (2024-07-22)
Note
This release replaces all releases since and including 5.1.2.
Patch release 5.1.2 and its successors 5.1.3 and 5.1.4 were withdrawn due to #676 “Release v5.1.2 is breaking due to openstack_blockstorage_volume_v3”
This release reverts the breaking change introduced by !1245 “terraform use volume_v3 API”, while retaining all other changes introduced by the withdrawn releases.
!1245 “terraform use volume_v3 API” will be re-added with a later major release.
Attention
DO NOT update to this or a higher non-major release if you are currently on one of the withdrawn releases. Make sure to only upgrade to the major release which re-adds !1245 “terraform use volume_v3 API” instead.
v5.1.4 (2024-06-07) [withdrawn]
Bugfixes
The root CA rotation has been fixed. (!1289)
v5.1.3 (2024-06-06) [withdrawn]
New Features
A Poetry group has been added so update-inventory.py can be called with minimal dependencies. (!1277)
v5.1.2 (2024-05-27) [withdrawn]
Note
This release was withdrawn due to #676 “Release v5.1.2 is breaking due to openstack_blockstorage_volume_v3”
Changed functionality
Bugfixes
(!1255)
Changes in the Documentation
Terraform references updated (!1189)
A guide on how to simulate a self-managed bare metal cluster on top of OpenStack has been added to the documentation. (!1231)
Instructions to install Vault have been added to the installation guide (!1247)
Deprecations and Removals
Misc
v5.1.1 (2024-05-21)
Bugfixes
The LCM is again able to retrieve the default subnet CIDR when [terraform].subnet_cidr is not set in the config.toml. (!1249)
v5.1.0 (2024-05-07)
New Features
An option to use a minimal virtual Python environment has been added. Take a look at Minimal Access Venv. (!1225)
Bugfixes
Dummy-build the changelog for the current release notes in the CI build-docs-check job (!1234)
v5.0.0 (2024-05-02)
Breaking changes
Added the MANAGED_K8S_DISRUPT_THE_HARBOUR environment variable.
Disruption of harbour infrastructure is now excluded from MANAGED_K8S_RELEASE_THE_KRAKEN. To allow it nonetheless, MANAGED_K8S_DISRUPT_THE_HARBOUR needs to be set instead. (See the documentation on environment variables.)
[terraform].prevent_disruption has been added in the config to allow the environment variable to be overridden when Terraform is used (TF_USAGE=true). It is set to true by default.
Ultimately this prevents unintended destruction of the harbour infrastructure and hence the whole YAOOK/K8s deployment when MANAGED_K8S_RELEASE_THE_KRAKEN must be used, e.g. during Kubernetes upgrades. (!1176)
Vault tools now read the cluster name from config.toml.
If your automation relies on any tool in ./tools/vault/, you need to adapt its signature: <clustername> has been removed as the first argument; see the sketch below. (!1179)
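A sketch of the changed call signature, using tools/vault/update.sh as an example (the cluster name used to be the first argument and is now read from config.toml):
# before
./managed-k8s/tools/vault/update.sh my-cluster
# now
./managed-k8s/tools/vault/update.sh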
New Features
Support for Kubernetes v1.28 has been added (!1205)
Changed functionality
Bugfixes
Cluster repository migration has been fixed for bare metal clusters. (!1183)
Core Split migration script doesn't fail anymore when the inventory folder is missing (!1196)
(!1203)
Some images got moved to the yaook registry, so we updated the image path.
For registry.yaook.cloud/yaook/backup-shifter:1.0.166, a newer tag needs to be used, as the old one is not available in the new registry. (!1206)
Cluster repo initialization with ./actions/init-cluster-repo.sh does not fail anymore when the config already exists. (!1211)
Changes in the Documentation
Deprecations and Removals
Support for the legacy installation procedure of Thanos with jsonnet has been dropped (!1214)
Other Tasks
Misc
v4.0.0 (2024-04-15)
Breaking changes
The first and main serving of the core-split has been merged and the code base has been tossed around. One MUST take action to migrate a pre-core-split cluster.
$ bash managed-k8s/actions/migrate-cluster-repo.sh
This BREAKS the air-gapped and cluster-behind-proxy functionality.
Please refer to the respective documentation (!823).
The custom stage now uses the main inventory exclusively, like all other stages. A separate inventory for the custom stage is not supported anymore and will be removed by the migrate-cluster-repo action.
Changed functionality
The custom stage is enabled by default now. (!823)
Change etcd-backup to use the new Service and ServiceMonitor manifests supplied by the Helm chart.
The old manifests that were included in the YAOOK/K8s repo in the past will be overwritten (etcd-backup ServiceMonitor) and removed (etcd-backup-monitoring Service) in existing installations. (!1131)
Bugfixes
Changes in the Documentation
Streamline Thanos bucket management configuration (!1173)
Deprecations and Removals
Dropping the on_openstack variable from the [k8s-service-layer.rook] section.
Previously, this was a workaround to tell rook whether we're running on top of OpenStack or not. With the new repository layout that's not needed anymore, as the on_openstack variable is specified in the hosts file (inventory/yaook-k8s/hosts) and is available when invoking the rook roles. (!823)
Remove configuration option for Thanos query persistence, as it is not possible to set via the used helm chart and the variable is useless. (!1174)
Other Tasks
Disable “-rc”-tagging (!1170)
v3.0.2 (2024-04-09)
Changes in the Documentation
Add some details about Thanos configuration (!1146)
Misc
v3.0.1 (2024-04-03)
Bugfixes
Fix Prometheus stack deployment
If scheduling_key and allow_external_rules were set, rendering the values file for the Prometheus stack failed due to wrong indentation. Also, the scheduling_key did not take effect, even without allow_external_rules configured, due to the wrong indentation. (!1142)
v3.0.0 (2024-03-27)
Breaking changes
Drop passwordstore functionality
We’re dropping the already deprecated and legacy passwordstore functionality. As the inventory updater checks for valid sections in the “config/config.toml” only, the “[passwordstore]” section must be dropped in its entirety for existing clusters. (!996)
Adjust configuration for persistence of Thanos components
Persistence for Thanos components can be enabled/disabled by setting/unsetting k8s-service-layer.prometheus.thanos_storage_class. It is disabled by default. You must adjust your configuration to re-enable it; have a look at the configuration template and at the sketch after this list. Furthermore, the volume size for each component can be configured separately. (!1106)
Fix disabling storage class creation for rook/ceph pools
Previously, the create_storage_class attribute of a ceph pool was a string which was interpreted as a boolean. This has been changed and the attribute must now be a boolean.
[[k8s-service-layer.rook.pools]]
name = "test-true"
create_storage_class = true
replicated = 3
This is restored behavior pre-rook_v2, where storage classes for ceph blockpools didn’t get created by default. (!1130)
The Thanos object storage configuration must be moved to vault if it is not automatically managed. Please check the documentation on how to create a configuration and move it to vault.
You must update your vault policies if you use Thanos with a custom object storage configuration
./managed-k8s/tools/vault/update.sh $CLUSTER_NAME
Execute the above to update your vault policies. A root token must be sourced.
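A minimal config.toml sketch for re-enabling Thanos persistence, based on the option path named in the entry above (the storage class name is illustrative; the per-component volume size options are not shown, see the configuration template):
[k8s-service-layer.prometheus]
thanos_storage_class = "csi-sc-cinderplugin"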
New Features
Add Sonobuoy testing to CI (!957)
Add support to define memory limits for the kube-apiservers
The values set in the config.toml are only applied on K8s upgrades. If no values are explicitly configured, neither memory resource requests nor limits will be set by default. (!1027)
Thanos: Add option to configure in-memory index cache sizes (!1116)
Changed functionality
Poetry virtual envs are now deduplicated between cluster repos and can be switched much more quickly (!931)
Allow unsetting CPU limits for rook/ceph components (!1089)
Add check whether VAULT_TOKEN is set for stages 2 and 3 (!1108)
Enable auto-downsampling for Thanos query (!1116)
Add option for testing clusters to enforce a reboot of the nodes after each system update to simulate cluster behaviour in the real world. (!1121)
Add a new env var $MANAGED_K8S_LATEST_RELEASE for the init.sh script, which is true by default and causes the latest release to be checked out instead of devel (!1122)
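A usage sketch for checking out devel instead of the latest release (assuming the variable is exported for the invocation; the location of init.sh depends on your setup):
MANAGED_K8S_LATEST_RELEASE=false ./init.sh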
Bugfixes
Fix & generalize scheduling_key usage for managed K8s services (!1088)
Fix vault import for non-OpenStack clusters (!1090)
Don’t create Flux PodMonitors if monitoring is disabled (!1092)
Fix a bug which prevented nuking a cluster if Gitlab is used as Terraform backend (!1093)
Fix tool tools/assemble_cephcluster_storage_nodes_yaml.py to produce valid YAML.
The tool helps to generate a Helm values file for the rook-ceph-cluster Helm chart. The data type used for encryptedDevice in the YAML path cephClusterSpec.storage has been fixed. It was boolean before but needs to be a string. (!1118)
(!1120)
Ensure minimal IPSec package installation (!1129)
Fix testing of rook ceph block storage classes - Now all configured rook ceph block storage pools for which a storage class is configured are checked rather than only rook-ceph-data. (!1130)
Changes in the Documentation
Include missing information in the “new Vault” case in the “Pivot vault” section of the Vault documentation (!1086)
Deprecations and Removals
Other Tasks
Add hotfixing strategy (!1063)
Add deprecation policy. (!1076)
Prevent CI jobs from failing if there are volume snapshots left (!1091)
Fix releasenote-file-check in ci (!1096)
Refine hotfixing procedure (!1101)
We define how long we’ll support older releases. (!1112)
Update flake dependencies (!1117)
Misc
v2.1.1 (2024-03-01)
Bugfixes
Fix kubernetes-validate installation for K8s updates (!1097)
v2.1.0 (2024-02-20)
New Features
Add support for Kubernetes v1.27 (!1065)
Allow enabling the Ceph dashboard
Changed functionality
Disarm GPU tests until #610 is properly addressed
Bugfixes
Allow using clusters before and after the introduction of the Root CA rotation feature to use the same Vault instance. (!1069)
Fix loading order in envrc template
envrc.lib.sh: Run poetry install with --no-root
Changes in the Documentation
Add information on how to pack a release.
Update information about how to write release notes
Deprecations and Removals
Drop support for Kubernetes v1.24 (!1040)
Other Tasks
Update flake dependencies and allow unfree license for Terraform (!929)
Misc
v2.0.0 (2024-02-07)
Breaking changes
Add functionality to rotate certificate authorities of a cluster
This is needed, e.g., if the old one is about to expire. As paths of vault policies have been updated for this feature, one must update them. Please refer to our documentation about the Vault setup. (!939)
New Features
Add support for generating Kubernetes configuration from Vault
This allows “logging into Kubernetes” using your Vault credentials. For more information, see the updated vault documentation (!1016).
Bugfixes
Disable automatic certificate renewal by kubeadm as we manage certificates via vault
Fixed variable templates for Prometheus persistent storage configuration
Other Tasks
Further improvement to the automated release process. (!1033)
Automatically delete volume snapshots in the CI
Bump required Python version to >=3.10
CI: Don’t run the containerd job every time on devel
Enable renovate bot for Ansible galaxy requirements
v1.0.0 (2024-01-29)
Breaking changes
Add option to configure multiple Wireguard endpoints
Note that you must update the vault policies once. See Wireguard documentation for further information.
# execute with root vault token sourced
bash managed-k8s/tools/vault/init.sh
(!795)
Improve smoke tests for dedicated testing nodes
Smoke tests have been reworked a bit such that they are executed only on dedicated testing nodes (if any are defined). You must update your config if you defined testing nodes. (!952)
New Features
Add option to migrate terraform backend from local to gitlab (!622)
Add support for Kubernetes v1.26 (!813)
Support the bitnami thanos helm chart
This will create new service names for thanos in K8s. The migration to the bitnami thanos helm chart is triggered by default. (!816)
Add tool to assemble snippets for CephCluster manifest
Writing the part of the CephCluster manifest describing which disks are to be used for Ceph OSDs and metadata devices for every single storage node is error-prone. Once an erroneous manifest has been applied, it can be very time-consuming to correct the errors, as OSDs have to be un-deployed and wiped before re-applying the correct manifest. (!855)
Add project-specific managers for renovate-bot (!856)
Add option to configure custom DNS nameserver for OpenStack subnet (IPv4) (!904)
Add option to allow snippet annotations for NGINX Ingress controller (!906)
Add configuration option for persistent storage for Prometheus (!917)
Add optional configuration options for soft and hard disk pressure eviction to the config.toml. (!948)
Additionally pull a local copy of the Terraform state for disaster recovery purposes if Gitlab is configured as backend. (!968)
Changed functionality
Bump default Kubernetes node image to Ubuntu 22.04 (!756)
Update Debian Version for Gateway VMs to 12 (!824)
Spawn Tigera operator on Control Plane only by adjusting its nodeSelector (!850)
A minimum version of v1.5.0 is now required for poetry (!861)
Rework installation procedure of flux
Flux will be deployed via the community helm chart from now on. A migration is automatically triggered (but can be prevented, see our flux documentation for further information). The old installation method will be dropped very soon. (!891)
Use the v1beta3 kubeadm Configuration format for initialization and join processes (!911)
Switch to new community-owned Kubernetes package repositories
As the Google-hosted repository got frozen, we’re switching over to the community-owned repositories. For more information, please refer to https://kubernetes.io/blog/2023/08/15/pkgs-k8s-io-introduction/#what-are-significant-differences-between-the-google-hosted-and-kubernetes-package-repositories (!937)
Moving IPSec credentials to vault. This requires manual migration steps. Please check the documentation. (!949)
Don’t set resource limits for the NGINX ingress controller by default
Bugfixes
Create a readable terraform var file (!817)
Fixed the missing gpu flag and monitoring scheduling key (!819)
Update the terraform linter and fix the related issues (!822)
Fixed the check for monitoring common labels in the rook-ceph cluster chart values template. (!826)
Fix the vault.sh script
The script will stop if a config.hcl file already exists. This can be avoided with a prior existence check. Coreutils v9.2 changed the behaviour of --no-clobber[1].
[1] https://github.com/coreutils/coreutils/blob/df4e4fbc7d4605b7e1c69bff33fd6af8727cf1bf/NEWS#L88 (!828)
Added missing dependencies to flake.nix (!829)
ipsec: Include passwordstore role only if enabled
The ipsec role hasn’t been fully migrated to vault yet and still depends on the passwordstore role. If ipsec is not used, initializing a password store is not necessary. However, as an ansible dependency, it was still run and thus failed if passwordstore hadn’t been configured. This change adds the role via include_role instead of as a dependency. (!833)
Docker support has been removed along with k8s versions <1.24, but some places remained dependent on the now unnecessary variable container_runtime. This change removes every use of the variable along with the documentation for migrating from docker to containerd. (!834)
Fix non-gpu clusters
For non-gpu clusters, the roles containerd and kubeadm-join would fail, because the variable has_gpu was not defined. This commit changes the order of the condition, so has_gpu is only checked if gpu support is enabled for the cluster.
This is actually kind of a workaround for a bug in Ansible. has_gpu would be set in a dependency of both roles, but Ansible skips dependencies if they have already been skipped earlier in the play. (!835)
Fix rook for clusters without prometheus
Previously, the rook cluster chart would always try to create PrometheusRules, which would fail without Prometheus’ CRD. This change makes the creation dependent on whether monitoring is enabled or not. (!836)
Fix vault for clusters without prometheus
Previously, the vault role would always try to create ServiceMonitors, which would fail without Prometheus’ CRD. This change makes the creation dependent on whether monitoring is enabled or not. (!838)
Change the default VRRP priorities from 150/100/80 to 150/100/50. This makes it less likely that two backup nodes attempt to become primary at the same time, avoiding race conditions and flappiness. (!841)
Fix Thanos v1 cleanup tasks during migration to prevent accidental double deletion of resources (!849)
Fixed incorrect templating of Thanos secrets for buckets managed by Terraform and clusters with custom names (!854)
Rename rook_on_openstack field in config.toml to on_openstack (!888)
Fixed configuration of host network mode for rook/ceph (!899)
Only delete volumes, ports and floating IPs from the current OpenStack project on destroy, even if the OpenStack credentials can access more than this project. (!921)
destroy: Ensure port deletion works even if only OS_PROJECT_NAME is set (!922)
destroy: Ensure port deletion works even if both OS_PROJECT_NAME and OS_PROJECT_ID are set (!924)
Add support for ch-k8s-lbaas version 0.7.0. Excerpt from the upstream release notes:
Improve scoping of actions within OpenStack. Previously, if the credentials allowed listing of ports or floating IPs outside the current project, those would also be affected. This is generally only the case with OpenStack admin credentials which you aren’t supposed to use anyway.
It is strongly recommended that you upgrade your cluster to use 0.7.0 as soon as possible. To do so, change the version value in the ch-k8s-lbaas section of your config.toml to "0.7.0", as shown in the sketch below. (!938)
Fixed collection of Pod logs as job artifacts in the CI. (!953)
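The corresponding config.toml snippet (a sketch; only the version value is taken from the entry above):
[ch-k8s-lbaas]
version = "0.7.0"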
Fix forwarding nftable rules for multiple Wireguard endpoints. (!969)
The syntax of the rook ceph operator_memory_limit and _request was fixed in config.toml. (!973)
Fix migration tasks for Flux (!976)
It is ensured that the values passed to the cloud-config secret are proper strings. (!980)
Fix configuration of Grafana resource limits & requests (!982)
Bump to latest K8s patch releases (!994)
Fix the behaviour of the Terraform backend when multiple users are maintaining the same cluster, especially when migrating the backend from local to http. (!998)
Constrain kubernetes-validate pip package on Kubernetes nodes (!1004)
Add automatic migration to community repository for Kubernetes packages
Create a workaround which should allow the renovate bot to create release notes
Changes in the Documentation
Added clarification for available release-note types. (!830)
Add clarification in vault setup. (!831)
Fix tip about .envrc in Environment Variable Reference (!832)
Clarify general upgrade procedure and remove obsolete version specific steps (!837)
The repo link to the prometheus blackbox exporter changed (!840)
Added clarification in initialization for the different .envrc used. (!852)
Update and convert Terraform documentation to reStructuredText (!904)
rook-ceph: Clarify role of mon_volume_storage_class (!955)
Deprecations and Removals
Remove acng-related files (!978)
Other Tasks
We start using our release pipeline. That includes automatic versioning and release note generation. (!825)
(!839, !842, !864, !865, !866, !867, !868, !869, !870, !871, !872, !874, !875, !876, !877, !878, !879, !880, !881, !885, !886, !890, !893, !894, !895, !896, !901, !907, !920, !927)
Adjusted CI and code base for ansible-lint v6.20 (!882)
Update dependency ansible to v8.5.0 (!909)
Enable renovate for Nix flake (!914)
Unpin poetry in flake.nix (!915)
Update kubeadm api version (!963)
The poetry.lock file will update automatically. (!965)
Changed the job rules for the ci-pipeline. (!992)
Security
Security hardening settings for the nginx ingress controller. (!972)
Misc
Preversion
Towncrier as tooling for releasenotes
From now on we use towncrier to generate our release notes. If you are a developer, see the coding guide for further information.
Add .pre-commit-config.yaml
This repository now contains pre-commit hooks to validate the linting
stage of our CI (except ansible-lint) before committing. This allows for
a smoother development experience as mistakes can be caught more quickly. To
use this, install pre-commit (if you use Nix
flakes, it is automatically installed for you) and then run
pre-commit install
to enable the hooks in the repo (if you use
direnv, they are automatically enabled for you).
Create volume snapshot CRDs (!763)
You can now create snapshots of your OpenStack PVCs. Missing CRDs and the snapshot-controller from [1] and [2] were added.
[1] https://github.com/kubernetes-csi/external-snapshotter/tree/master/client/config/crd
Add support for rook v1.8.10
Update by setting version=1.8.10 and running
MANAGED_K8S_RELEASE_THE_KRAKEN=true AFLAGS="--diff --tags mk8s-sl/rook" managed-k8s/actions/apply-stage4.sh
Use poetry to lock dependencies
Poetry allows to declaratively set Python dependencies and lock versions. This way we can ensure that everybody uses the same isolated environment with identical versions and thus reduce inconsistencies between individual development environments.
requirements.txt has been removed. Python dependencies are now declared in pyproject.toml and locked in poetry.lock. New deps can be added using the command poetry add package-name. After manually editing pyproject.toml, run poetry lock to update the lock file.
Drop support for Kubernetes v1.21, v1.22, v1.23
We’re dropping support for EOL Kubernetes versions.
Add support for Kubernetes v1.25
We added support for all patch versions of Kubernetes v1.25. One can either directly create a new cluster with a patch release of that version or upgrade an existing cluster to one as usual via:
# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.25.10
Note
By default, the Tigera operator is deployed with Kubernetes v1.25. Therefore, during the upgrade from Kubernetes v1.24 to v1.25, the migration to the Tigera operator will be triggered automatically by default!
Add support for Helm-based installation of rook-ceph (!676)
Starting with rook v1.7, an official Helm chart is provided and has become the recommended installation method. The charts take care of most installation and upgrade processes. The role rook_v2 adds support for the Helm-based installation as well as a migration path from rook_v1.
In order to migrate, make sure that rook v1.7.11 is installed and healthy, then set use_helm=true in the k8s-service-layer.rook section and run stage4.
GPU: Rework setup and check procedure (!750)
We reworked the setup and smoke test procedure for GPU nodes to be used inside of Kubernetes [1]. In the last two ShoreLeave meetings (our official development meetings) [2] and our IRC channel [3], we asked for feedback on whether the old procedure is in use in the wild. As that does not seem to be the case, we decided to save the overhead of implementing and testing a migration path. If you have GPU nodes in your cluster and support for these breaks due to the reworked code, please create an issue or consider rebuilding the nodes with the new procedure.
Change kube-apiserver Service-Account-Issuer
Kube-apiserver now issues service-account tokens with https://kubernetes.default.svc as issuer instead of kubernetes.default.svc. Tokens with the old issuer are still considered valid, but should be renewed as this additional support will be dropped in the future.
This change had to be made to make yaook-k8s pass all k8s-conformance tests.
Drop support for Kubernetes v1.20
We’re dropping support for Kubernetes v1.20 as this version has been EOL for quite some time. This step has been announced several times in our public development meeting.
Drop support for Kubernetes v1.19
We’re dropping support for Kubernetes v1.19 as this version has been EOL for quite some time. This step has been announced several times in our public development meeting.
Implement support for Tigera operator-based Calico installation
Instead of using a customized manifest-based installation method, we’re now switching to an operator-based installation method based on the Tigera operator.
Existing clusters must be migrated. Please have a look at our Calico documentation for further information.
Support for Kubernetes v1.24
The LCM now supports Kubernetes v1.24. One can either directly create a new cluster with a patch release of that version or upgrade an existing cluster to one as usual via:
# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.24.10
Note
If you’re using docker as CRI, you must migrate to containerd in advance.
Further information is given in the Upgrading Kubernetes documentation.
Implement automated docker to containerd migration
A migration path to change the container runtime on each node of a cluster from docker to containerd has been added. More information about this can be found in the documentation.
Drop support for kube-router
We’re dropping support for kube-router as CNI. This step has been announced via our usual communication channels months ago. A migration path from kube-router to calico has been available quite some time and is also removed now.
Support for Rook 1.7 added
The LCM now supports Rook v1.7.*. Upgrading is as easy as setting your rook version to 1.7.11, allowing the kraken to be released, and running stage 4.
Support for Calico v3.21.6
We now added support for Calico v3.21.6, which is tested against Kubernetes v1.20, v1.21 and v1.22 by the Calico project team. We also added the possibility to specify one of our supported Calico versions (v3.17.1, v3.19.0, v3.21.6) through a config.toml variable: calico_custom_version.
ch-k8s-lbaas now respects NetworkPolicy objects
If you are using NetworkPolicy objects, ch-k8s-lbaas will now interpret them and enforce restrictions on the frontend. That means that if you previously only allowlisted the CIDR in which the lbaas agents themselves reside, your inbound traffic will be dropped now.
You have to add external CIDRs to the network policies as needed to avoid that.
Clusters where NetworkPolicy objects are not in use or where filtering only happens on namespace/pod targets are not affected (as LBaaS wouldn’t have worked there anyway, as it needs to be allowlisted in a CIDR already).
Add Priority Class to essential cluster components (!633)
The priority classes system-cluster-critical and system-node-critical have been added to all managed and therefore essential services and components. There is no switch to avoid that. For existing clusters, all managed components will therefore be restarted/updated once during the next application of the LCM. This is considered not disruptive.
Decoupling thanos and terraform
When enabling thanos, one can now prevent terraform from creating a bucket in the same OpenStack project by setting manage_thanos_bucket=false in the [k8s-service-layer.prometheus] section. Then it's up to the user to manage the bucket by configuring an alternative storage backend.
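A minimal config.toml sketch, using the option named above:
[k8s-service-layer.prometheus]
manage_thanos_bucket = false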
OpenStack: Ensure that credentials are used
https://gitlab.com/yaook/k8s/-/merge_requests/625 introduces the role check-openstack-credentials, which fires a token request against the given Keystone endpoint to ensure that credentials are available. For details, check the commit messages. This sanity check can be skipped either by passing -e check_openstack_credentials=False to your call to ansible-playbook or by setting check_openstack_credentials = False in the [miscellaneous] section of your config.toml.
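A config.toml sketch for skipping the sanity check persistently, based on the section and option named above:
[miscellaneous]
check_openstack_credentials = false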
Thanos: Allow alternative object storage backends
By providing thanos_objectstorage_config_file, one can tell thanos-{compact,store} to use a specific (pre-configured) object storage backend (instead of using the bucket the LCM built for you). Please note that the usage of thanos still requires that the OpenStack installation provides a SWIFT backend. That's a bug.
Observation of etcd
Our monitoring stack now includes the observation of etcd. To fetch the metrics securely (cert-auth based), a thin socat-based proxy is installed inside the kube-system namespace.
Support for Kubernetes v1.23
The LCM now supports Kubernetes v1.23. One can either directly create a new cluster with that version or upgrade an existing one as usual via:
# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.23.11
Further information is given in the Upgrading Kubernetes documentation.
config.toml: Introduce the mandatory option [miscellaneous]/container_runtime
This must be set to "docker" for pre-existing clusters. New clusters should be set up with "containerd". Migration of pre-existing clusters from docker to containerd is not yet supported.
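A config.toml sketch for a new cluster, using the values from the entry above:
[miscellaneous]
container_runtime = "containerd"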
Replace count with for_each in terraform (!524)
terraform now uses for_each to manage instances, which allows the user to delete instances of any index without extraordinary terraform black-magic. The LCM auto-magically orchestrates the migration.
Add action for system updates of initialized nodes (!429)
The node system updates have been pulled out into a separate action script. The reason for that is that, even though one has not set MANAGED_K8S_RELEASE_THE_KRAKEN, the cache of the package manager of the host node is updated in stage2 and stage3. That takes quite some time and is unnecessary as the update itself won't happen. More rationale is explained in the commit message of e4c62211.
cluster-repo: Move submodules into dedicated directory (!433)
We’re now moving (git) submodules into a dedicated directory submodules/. For users enabling these, the cluster repository starts to get messy, at the latest after introducing the option to use customization playbooks.
As this is a breaking change, users who use at least one submodule must re-execute the init.sh script!
The init.sh script will move your enabled submodules into the submodules/ directory. Otherwise, at least the symlink to the ch-role-users role will be broken.
Note
By re-executing init.sh, the latest devel branch of the managed-k8s module will be checked out under normal circumstances!