Release Notes

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

We use towncrier to generate our release notes file.

Information about unreleased changes can be found here.

For changes before summer 2023, see the end of this document; git log --no-merges will also help you get a rough overview of earlier changes.

v8.1.1 (2024-10-07)

Bugfixes

  • A bug has been fixed which affected existing clusters set up before release v8.1 and caused Kubernetes workers to try to re-join the cluster. (!1510)

v8.1.0 (2024-10-01)

New Features

  • We now have a binary cache for Nix build artifacts, so users won’t have to build anything from source. To use it, configure

    extra-substituters = https://yaook.cachix.org
    extra-trusted-public-keys = yaook.cachix.org-1:m85JtxgDjaNa7hcNUB6Vc/BTxpK5qRCqF4yHoAniwjQ=
    

    in /etc/nix/nix.conf (!930)
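
    For example, on a typical system-wide Nix installation the settings could be appended as sketched below (assuming you can write to /etc/nix/nix.conf and these keys are not already defined there):

    # Append the binary cache settings from above to the Nix configuration
    echo 'extra-substituters = https://yaook.cachix.org' | sudo tee -a /etc/nix/nix.conf
    echo 'extra-trusted-public-keys = yaook.cachix.org-1:m85JtxgDjaNa7hcNUB6Vc/BTxpK5qRCqF4yHoAniwjQ=' | sudo tee -a /etc/nix/nix.conf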

  • Poetry dependencies are now packaged within the Nix devShell (!930)

  • Add an option to let Calico announce the service cluster IP range to external peers. This is necessary in setups where external entities want to send traffic to cluster IPs instead of pods or node ports. (!1455)

  • The devShell can now be selected with the env var YAOOK_K8S_DEVSHELL (defaulting to 'default') (!1457)
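
    A minimal sketch of switching devShells, assuming the environment is entered via direnv as used elsewhere in this project:

    # YAOOK_K8S_DEVSHELL selects the devShell to load; 'default' is the default value.
    # Set it to another devShell name and re-enter the development environment
    # (e.g. by reloading direnv).
    export YAOOK_K8S_DEVSHELL=default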

  • Support for a separate inventory for the custom stage was added. It is used in addition to the main inventory.

    Previously, the custom stage inventory was silently dropped with release v4.0.0. This is now noted in the v4.0.0 release notes. (!1472)

Changed functionality

  • Some internal rework has been done to improve runtime performance. (!1450)

  • Wireguard client templates now set PersistentKeepalive to 25 seconds. Existing configurations do not need to be replaced. (!1452)

  • Supported Kubernetes versions have been bumped. (!1476)

Bugfixes

  • A missing call to install_prerequisites has been added to upgrade.sh (!1481)

  • Spawning clusters on OpenStack with apply-all.sh has been fixed (!1488)

  • Cluster-health-verification in combination with poetry2nix has been fixed (!1495)

  • Devshell reloading in combination with poetry2nix has been fixed (!1495)

Changes in the Documentation

  • (!1454)

  • A hint to set USE_VAULT_IN_DOCKER for development setups has been added to Initialization documentation (!1461)

  • The documentation has been updated to ensure compliance with the Kubernetes trademarks and streamlined to consistently use “YAOOK/K8s” for the LCM. (!1465)

Deprecations and Removals

  • tools/patch_config.py was removed, completing the deprecation cycle started in release v6.1.0. (!1468)

v8.0.2 (2024-09-20)

Bugfixes

  • Spawning clusters on OpenStack with apply-all.sh has been fixed (!1488)

v8.0.1 (2024-09-09)

Bugfixes

  • The release migration script was fixed to support bare metal cluster repos as well. (!1470)

v8.0.0 (2024-08-28)

Breaking changes

  • The YAOOK/K8s Terraform module now allows worker nodes to be joined into individual anti affinity groups.

    Attention

    Action required

    You must migrate your Terraform state by running the migration script.

    ./managed-k8s/actions/migrate-to-release.sh
    

    (!1317)

  • The YAOOK/K8s Terraform module does not build a default set of nodes (3 masters + 4 workers) anymore when no nodes are given. (!1317)

  • The automatic just-in-time migration of Terraform resources from count to for_each introduced in July 2022 was removed in favor of a once-and-for-all migration.

    ./managed-k8s/actions/migrate-to-release.sh
    

    (!1317)

  • YAOOK/K8s Terraform no longer implicitly assigns nodes to availability zones if none is explicitly configured for a node.

    For all master and worker nodes, availability zones must now be configured explicitly; [terraform].enable_az_management has therefore been removed.

    Not configuring availability zones now leaves the choice to the cloud controller, which may or may not select one. To achieve the same effect for gateway nodes, turn off [terraform].spread_gateways_across_azs.

    Attention

    Action required

    To prevent Terraform from unnecessarily rebuilding master and worker nodes, you must run the migration script. It will determine each node’s availability zone from the Terraform state and set it in the config for you.

    ./managed-k8s/actions/migrate-to-release.sh
    

    (!1317)

  • The format of the [terraform] config section changed significantly.

    Terraform nodes are now to be configured as blocks of values rather than across separate lists for each type of value.

    Furthermore, you now have control over the full name of Terraform nodes; see the documentation for further details.

      [terraform]
    
    - masters = 2
    - master_names = ["X", "Y"]
    - master_flavors = ["M", "M"]
    - master_images = ["Ubuntu 20.04 LTS x64", "Ubuntu 22.04 LTS x64"]
    - master_azs = ["AZ1", "AZ3"]
    + #....
    +
    + [terraform.nodes.master-X]
    + role     = "master"  # mandatory
    + flavor   = "M"
    + image    = "Ubuntu 20.04 LTS x64"
    + az       = "AZ1"
    + #....
    +
    + [terraform.nodes.worker-A]
    + role     = "worker"  # mandatory
    + flavor   = "S"
    + image    = "Debian 12 (bookworm)"
    + az       = "AZ3"
      #....
    

    The gateway/master/worker defaults are consolidated into blocks as well.

      [terraform]
    
    - gateway_image_name = "Debian 12 (bookworm)"
    - gateway_flavor = "XS"
    - default_master_image_name = "Ubuntu 22.04 LTS x64"
    - default_master_flavor = "M"
    - default_master_root_disk_size = 50
    - default_worker_image_name = "Ubuntu 22.04 LTS x64"
    - default_worker_flavor = "L"
    - default_worker_root_disk_size = 100
    + #....
    +
    + [terraform.gateway_defaults]
    + image                      = "Debian 12 (bookworm)"
    + flavor                     = "XS"
    +
    + [terraform.master_defaults]
    + image                      = "Ubuntu 22.04 LTS x64"
    + flavor                     = "M"
    + root_disk_size             = 50
    +
    + [terraform.worker_defaults]
    + image                      = "Ubuntu 22.04 LTS x64"
    + flavor                     = "L"
    + root_disk_size             = 100
      #....
    

    The worker anti affinity settings [terraform].worker_anti_affinity_group_name and [terraform].worker_join_anti_affinity_group are merged into [terraform.workers.<name>].anti_affinity_group or [terraform.worker_defaults].anti_affinity_group. Unset means “no join”.

      [terraform]
    
    - worker_anti_affinity_group_name = "some-affinity-group"
    - worker_join_anti_affinity_group = [false, true]
    + #....
    +
    + [terraform.worker_defaults]
    +
    + [terraform.workers.0]
    +
    + [terraform.workers.1]
    + anti_affinity_group        = "some-affinity-group"
    
      #....
    

    Attention

    Action required

    You must convert your config into the new format.

    ./managed-k8s/actions/migrate-to-release.sh
    

    (!1317)

  • Gateway node names are now index-based rather than availability-zone-based, leading to names like managed-k8s-gw-0 instead of managed-k8s-gw-az1.

    Attention

    Action required

    To prevent Terraform from unnecessarily rebuilding gateway nodes, you must run the migration script.

    ./managed-k8s/actions/migrate-to-release.sh
    

    (!1317)

New Features

  • Terraform: Anti affinity group settings are now configurable per worker node. (!1317)

  • Terraform: The number of gateway nodes created no longer depends on the number of availability zones and can be set with [terraform].gateway_count. The setting’s default yields the previous behavior when [terraform].spread_gateways_across_azs is enabled, which it is by default. (!1317)
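
    A sketch of setting the new option via tomlq, as used elsewhere in these notes; the value 3 is purely illustrative:

    # Explicitly set the number of gateway nodes in config.toml
    tomlq --in-place --toml-output '.terraform.gateway_count = 3' config/config.toml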

  • A rework has been done which now allows triggering a specific playbook of k8s-core or k8s-supplements. The default behavior of triggering install-all.yaml has been preserved. See apply-k8s-core.sh and apply-k8s-supplements.sh for usage information. (!1433)

  • It is now possible to set the root URL for Grafana (!1447)

Changed functionality

  • The minimum Terraform version is increased to 1.3 (!1317)

Bugfixes

  • Importing the Thanos object storage configuration has been reworked to not fail erroneously. (!1437)

v7.0.2 (2024-08-26)

Bugfixes

  • A bug has been fixed which prevented the configuration of an exposed Vault service. (!1448)

v7.0.1 (2024-08-26)

Bugfixes

  • kube-state-metrics not being able to read namespace labels has been fixed (!1438)

v7.0.0 (2024-08-22)

Breaking changes

  • The dual-stack support has been reworked and fixed. The variable dualstack_support has been split into two variables, ipv4_enabled (defaults to true) and ipv6_enabled (defaults to false), to allow IPv6-only deployments and a more fine-grained configuration.

    The following configuration changes are recommended, but not mandatory:

    [terraform]
    -dualstack_support = false
    +ipv6_enabled = false
    

    Existing clusters running on OpenStack must execute the Terraform stage once:

    $ ./managed-k8s/actions/apply-terraform.sh
    

    to re-generate the inventory and hosts file for Ansible. (!1304)

New Features

  • Support for ch-k8s-lbaas v0.8.0 and v0.9.0 has been added. The ch-k8s-lbaas version is now an optional variable. To ensure the latest supported version is used, one can simply unset it.

    $ tomlq --in-place --toml-output 'del(."ch-k8s-lbaas".version)' config/config.toml
    

    . (!1304)

  • Introduce support for setting remote write targets ([[remote_writes]]) for Prometheus (!1396)

  • Add new modules http_api and http_api_insecure for the Blackbox exporter, allowing status codes 200, 300 and 401 to be returned for HTTP probes. http_api_insecure additionally does not check the issuer of the certificate. (!1420)

  • The default version for rook/Ceph has been bumped to v1.14.9. (!1430)

Changed functionality

  • The sysctl settings fs.inotify.max_user_instances, fs.inotify.max_user_watches and vm.max_map_count are now also adjusted on master nodes. (!1419)

  • The vault image used in the CI and for local development has been changed to “hashicorp/vault”. (!1429)

Changes in the Documentation

Deprecations and Removals

  • The “global monitoring” functionality has been dropped. It was a provider-specific feature, and the LCM should be kept as general as possible. (!1270)

v6.1.2 (2024-08-19)

Bugfixes

  • In the v6.0.0 release notes, we now draw attention to committing etc/ssh_known_hosts in the cluster repository so that the re-enabled SSH host key verification does not require every user to use TOFU at first. (!1413)

v6.1.1 (2024-08-15)

Bugfixes

  • Fixed a bug in k8s-login.sh which would fail if the etc directory did not exist. (!1416)

v6.1.0 (2024-08-07)

New Features

  • Added support for Kubernetes v1.30 (!1385)

  • Configuration options have been added to cert-manager and ingress-controller to further streamline general helm chart handling. (!1387)

  • Add MANAGED_K8S_GIT_BRANCH environment variable to allow specifying a branch that should be checked out when running init-cluster-repo.sh. (!1388)

Changed functionality

  • The mapped Calico versions have been bumped due to a bug which can result in high CPU utilization on nodes. If no custom Calico version is configured, Calico will get updated automatically on the next rollout. It is strongly recommended to do a rollout.

    $ AFLAGS="--diff -t calico" bash managed-k8s/actions/apply-k8s-supplements.sh
    

    . (!1393)

Deprecations and Removals

  • Support for Kubernetes v1.27 has been removed (!1362)

  • The tools/patch_config.py script was deprecated in favor of tomlq. (!1379)

v6.0.3 (2024-07-22)

Updated the changelog after a few patch releases in the v5.1 series were withdrawn and superseded by another patch release.

Because the v6.0 release series already includes the breaking change that is removed again in the v5.1 release series, we kept it and just added it to the v6.0.0 release notes.

v6.0.2 (2024-07-20)

Changed functionality

  • Sourcing lib.sh is now side-effect free (!1340)

  • The entrypoint for the custom stage has been moved into the LCM. It now includes the connect-to-nodes role and then dispatches to the custom playbook. If you had included connect-to-nodes in the custom playbook, you may now remove it.

    diff --git a/k8s-custom/main.yaml b/k8s-custom/main.yaml
    -# Node bootstrap is needed in most cases
    -- name: Initial node bootstrap
    -  hosts: frontend:k8s_nodes
    -  gather_facts: false
    -  vars_files:
    -    - k8s-core-vars/ssh-hardening.yaml
    -    - k8s-core-vars/disruption.yaml
    -    - k8s-core-vars/etc.yaml
    -  roles:
    -    - role: bootstrap/detect-user
    -      tag: detect-user
    -    - role: bootstrap/ssh-known-hosts
    -      tags: ssh-known-hosts
    

    . (!1352)

  • The version of bird-exporter for prometheus has been updated to 1.4.3, haproxy-exporter to 0.15, and keepalived-exporter to 0.7.0. (!1357)

Bugfixes

  • (!1366)

  • The required actions in the notes of release v6.0.0 were incomplete and are fixed now.

v6.0.1 (2024-07-17)

Changed functionality

  • The default version of the kube-prometheus-stack helm chart has been updated to 59.1.0, and prometheus-adapter to version 4.10.0. (!1314)

Bugfixes

  • When initializing a new Wireguard endpoint, nftables may not get reloaded. This has been fixed. (!1339)

  • If the Vault instance is not publicly routable, nodes were not able to log in to it because the Vault certificate handling was faulty. This has been fixed. (!1358)

  • A fix to properly generate short-lived kubeconfigs with intermediate CAs has been supplied. (!1359)

v6.0.0 (2024-07-02)

Breaking changes

  • We now use short-lived (8d) kubeconfigs

    The kubeconfig at etc/admin.conf is now only valid for 8 days after creation (it used to be 1 year). Also, checking it into version control is now discouraged; instead, refresh it on each orchestrator as needed using tools/vault/k8s-login.sh.

    If your automation relies on the kubeconfig to be checked into VCS or for it to be valid for one year, you probably need to adapt it.

    In order to switch to the short-lived kubeconfig, run

    $ git rm etc/admin.conf
    $ sed --in-place '/^etc\/admin\.conf$/d' .gitignore
    $ git commit etc/admin.conf -m "Remove kubeconfig from git"
    $ ./managed-k8s/tools/vault/init.sh
    $ ./managed-k8s/tools/vault/update.sh
    $ ./managed-k8s/actions/k8s-login.sh
    

    This will remove the long-term kubeconfig and generate a short-lived one. (!1178)

  • We now provide an opt-in regression fix that restores Kubernetes’ ability to respond to certificate signing requests.

    Using the fix is completely optional; see Restoring Kubernetes’ ability to sign certificates for further details.

    Action required: As a prerequisite for making the regression fix functional you must update your Vault policies by executing the following:

    # execute with Vault root token sourced
    ./managed-k8s/tools/vault/init.sh
    

    . (!1219)

  • Some environment variables have been removed.

    WG_USAGE and TF_USAGE have been moved from .envrc to config.toml. If they have been set to false, the respective options wireguard.enabled and terraform.enabled in config.toml need to be set accordingly. If they were not touched (i.e. they are set to true), no action is required.

    CUSTOM_STAGE_USAGE has been removed. The custom stage is now always run if the playbook exists. No manual action required. (!1263)
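
    A sketch of the equivalent config.toml change, only needed if the variables had been set to false (tomlq used as shown elsewhere in these notes):

    # Disable Wireguard and Terraform handling in config.toml instead of .envrc
    tomlq --in-place --toml-output '.wireguard.enabled = false | .terraform.enabled = false' config/config.toml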

  • SSH host key verification has been re-enabled. Nodes receive signed SSH certificates. For clusters not using a Vault running inside Docker as backend, automated certificate renewal is configured on the nodes. The SSH CA is stored inside $CLUSTER_REPOSITORY/etc/ssh_known_hosts and can be used to SSH into nodes.

    Attention: Make sure that file is not gitignored and is committed after rollout.

    The Vault policies have been adjusted to allow the orchestrator role to read the SSH CA from Vault. You must therefore update the Vault policies:

    Note

    A root token is required.

    $ ./managed-k8s/tools/vault/init.sh
    

    This is needed just once. (!1272)
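
    For illustration, a sketch of committing the file and using it for host key verification (user and node address are placeholders):

    # Commit the SSH CA / known-hosts file and use it when connecting to a node
    git add etc/ssh_known_hosts
    git commit -m "Add SSH known hosts"
    ssh -o UserKnownHostsFile="$CLUSTER_REPOSITORY/etc/ssh_known_hosts" <user>@<node-address>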

  • With Kubernetes v1.29, the user specified in the admin.conf kubeconfig is now bound to the kubeadm:cluster-admins RBAC group. This requires an update to the Vault cluster policies and configuration.

    You must update your Vault policies and roles; a root token must be sourced.

    $ ./managed-k8s/tools/vault/init.sh
    $ ./managed-k8s/tools/vault/update.sh
    

    To upgrade your Kubernetes cluster from version v1.28 to v1.29, follow these steps:

    Warning

    You must upgrade to a version greater than v1.29.5 due to kubeadm #3055

    $ MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.29.x
    $ ./managed-k8s/actions/k8s-login.sh
    

    Note that the default upgrade procedure has changed such that addons are upgraded after all control plane nodes have been upgraded, not along with the first control plane node. (!1284)

  • Use the volumeV3 client in Terraform. volumeV2 is not supported everywhere.

    Note

    This breaking change was originally introduced by release 5.1.2, but was reverted again with release 5.1.5 as release 5.1.2 got withdrawn.

    If you have [terraform].create_root_disk_on_volume = true set in your config, you must migrate the openstack_blockstorage_volume_v2 resources in your Terraform state to the v3 resource type in order to prevent rebuilds of all servers and their volumes.

    # Execute the lines produced by the following script
    # This will import all v2 volumes as v3 volumes
    #  and remove the v2 volume resources from the Terraform state.
    
    terraform_module="managed-k8s/terraform"
    terraform_config="../../terraform/config.tfvars.json"
    for item in $( \
        terraform -chdir=$terraform_module show -json \
        | jq --raw-output '.values.root_module.resources[] | select(.type == "openstack_blockstorage_volume_v2") | .name+"[\""+.index+"\"]"+","+.values.id' \
    ); do
        echo "terraform -chdir=$terraform_module import -var-file=$terraform_config 'openstack_blockstorage_volume_v3.${item%,*}' '${item#*,}' " \
             "&& terraform -chdir=$terraform_module state rm 'openstack_blockstorage_volume_v2.${item%,*}'"
    done
    

    (!1245)

New Features

  • Add an option to install the CCM and the Cinder CSI plugin via Helm charts. The migration to the Helm chart will be enforced when upgrading to Kubernetes v1.29. (!1107)

  • A guide on how to rotate OpenStack credentials has been added. (!1266)

Changed functionality

  • The CI image is now built as part of this repo’s pipeline using a Nix Flake (!1175)

  • Thanos CPU limits have been removed (!1186)

  • PKI renewal during Kubernetes upgrades has been refined and can be explicitly triggered or skipped via the newly introduced renew-pki tag. (!1251)

  • All releasenotes will now have a link to their corresponding MR. (!1294)

  • (!1325)

Bugfixes

  • Adjust .gitignore template to keep the whole inventory (!1274) Action recommended: Adapt your .gitignore with sed --in-place '/^!\?\/inventory\/.*$/d' .gitignore.

  • After each phase of a root CA rotation a new kubeconfig is automatically generated (!1293)

  • (!1298, !1316, !1336)

  • The common monitoring labels feature has been fixed. (!1303)

  • Keys in the wireguard endpoint dict have been fixed. (!1329)

Changes in the Documentation

  • Add hints for the Terraform config (!1246)

  • A variable setting to avoid problems with the keyring backend has been added to the template of ~/.config/yaook-k8s/env. (!1269)

  • A hint to fix incorrect locale settings for Ansible has been added. (!1297)

  • (!1308, !1315)

  • A missing variable has been added to the reference (!1313)

Deprecations and Removals

  • Support for rook_v1 has been dropped. From now on, we only support deploying rook via Helm. (!1042)

  • Deprecated vault policies have been removed after a sufficient transition time.

    Hint

    A root token is required.

    ./managed-k8s/tools/vault/init.sh
    

    Execute the above to remove them from your vault instance. (!1318)

Security

  • All Ansible tasks that handle secret keys are now prevented from logging them. (!1295)

v5.1.5 (2024-07-22)

Note

This release replaces all releases since and including 5.1.2.

Patch release 5.1.2 and its successors 5.1.3 and 5.1.4 were withdrawn due to #676 “Release v5.1.2 is breaking due to openstack_blockstorage_volume_v3”

This release reverts the breaking change introduced by !1245 “terraform use volume_v3 API”, while retaining all other changes introduced by the withdrawn releases.

!1245 “terraform use volume_v3 API” will be re-added with a later major release.

Attention

DO NOT update to this or a higher non-major release if you are currently on one of the withdrawn releases. Make sure to only upgrade to the major release which re-adds !1245 “terraform use volume_v3 API” instead.

v5.1.4 (2024-06-07) [withdrawn]

Bugfixes

  • The root CA rotation has been fixed. (!1289)

v5.1.3 (2024-06-06) [withdrawn]

New Features

  • A Poetry group has been added so update-inventory.py can be called with minimal dependencies. (!1277)

v5.1.2 (2024-05-27) [withdrawn]

Changed functionality

  • The default version of the Thanos Helm Chart has been updated to 15.1.0 (!1188)

  • Make hosts file backing up more robust in bare metal clusters. (!1236)

  • Use the volumeV3 client in Terraform. volumeV2 is not supported everywhere. (!1245)

Changes in the Documentation

  • Terraform references updated (!1189)

  • A guide on how to simulate a self-managed bare metal cluster on top of OpenStack has been added to the documentation. (!1231)

  • Instructions to install Vault have been added to the installation guide (!1247)

Deprecations and Removals

  • A service-account-issuer patch for kube-apiserver has been removed which was necessary for a flawless transition to an OIDC conformant HTTPS URL (!1252)

  • Support for Kubernetes v1.26 has been removed (!1253)

v5.1.1 (2024-05-21)

Bugfixes

  • The LCM is again able to retrieve the default subnet CIDR when [terraform].subnet_cidr is not set in the config.toml. (!1249)

v5.1.0 (2024-05-07)

Bugfixes

  • Dummy-build the changelog for the current release notes in the CI build-docs-check-job (!1234)

v5.0.0 (2024-05-02)

Breaking changes

  • Added the MANAGED_K8S_DISRUPT_THE_HARBOUR environment variable.

    Disruption of harbour infrastructure is now excluded from MANAGED_K8S_RELEASE_THE_KRAKEN. To allow it nonetheless, MANAGED_K8S_DISRUPT_THE_HARBOUR needs to be set instead. (See the documentation on environment variables.)

    [terraform].prevent_disruption has been added in the config to allow the environment variable to be overridden when Terraform is used (TF_USAGE=true). It is set to true by default.

    Ultimately this prevents unintended destruction of the harbour infrastructure and hence the whole YAOOK/K8s deployment when MANAGED_K8S_RELEASE_THE_KRAKEN must be used, e.g. during Kubernetes upgrades. (!1176)

  • Vault tools now read the cluster name from config.toml

    If your automation relies on any tool in ./tools/vault/, you need to adapt its signature. <clustername> has been removed as the first argument. (!1179)
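
    A before/after sketch of the changed invocation (other tools in ./tools/vault/ follow the same pattern):

    # Before (hypothetical): ./managed-k8s/tools/vault/init.sh <clustername>
    # Now: the cluster name is read from config.toml
    ./managed-k8s/tools/vault/init.sh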

New Features

  • Support for Kubernetes v1.28 has been added (!1205)

Changed functionality

  • Check whether the WireGuard networks and the cluster network are disjoint (!1049)

  • The LCM has been adjusted to talk to the K8s API via the orchestrator node only (!1202)

Bugfixes

  • Cluster repository migration has been fixed for bare metal clusters. (!1183)

  • The Core Split migration script doesn’t fail anymore when the inventory folder is missing (!1196)

  • (!1203)

  • Some images got moved to the yaook registry, so we updated the image path.

    For registry.yaook.cloud/yaook/backup-shifter:1.0.166 a newer tag needs to be used, as the old one is not available in the new registry. (!1206)

  • Cluster repo initialization with ./actions/init-cluster-repo.sh does not fail anymore when the config already exists. (!1211)

Changes in the Documentation

  • The documentation has been reworked according to Diátaxis. (!1181)

  • Add user tutorial on how to create a cluster (!1191)

  • Add a copy button for code blocks (!1193)

Deprecations and Removals

  • Support for the legacy installation procedure of Thanos with jsonnet has been dropped (!1214)

Other Tasks

  • Added yq as a dependency. This allows shell scripts to read the config with tomlq. (!1176)

  • Helm module execution is not retried anymore as that obfuscated failed rollouts (!1215)

  • (!1218)

v4.0.0 (2024-04-15)

Breaking changes

  • The first and main part of the core split has been merged and the code base has been reorganized. One MUST take action to migrate a pre-core-split cluster.

    $ bash managed-k8s/actions/migrate-cluster-repo.sh
    

    This BREAKS the air-gapped and cluster-behind-proxy functionality.

    Please refer to the respective documentation (!823).

  • The custom stage now uses the main inventory exclusively, like all other stages. A separate inventory for the custom stage is not supported anymore and will be removed by the migrate-cluster-repo action.

Changed functionality

  • The custom stage is enabled by default now. (!823)

  • Change etcd-backup to use the new Service and ServiceMonitor manifests supplied by the Helm chart.

    The old manifests that were included in the YAOOK/K8s repo in the past will be overwritten (etcd-backup ServiceMonitor) and removed (etcd-backup-monitoring Service) in existing installations. (!1131)

Bugfixes

  • Fix patch-release tagging (!1169)

  • Change of the proposed hotfix procedure (!1171)

  • (!1172)

Changes in the Documentation

  • Streamline Thanos bucket management configuration (!1173)

Deprecations and Removals

  • Dropping the on_openstack variable from the [k8s-service-layer.rook] section

    Previously, this was a workaround to tell rook if we’re running on top of OpenStack or not. With the new repository layout that’s not needed anymore as the on_openstack variable is specified in the hosts file (inventory/yaook-k8s/hosts) and available when invoking the rook roles. (!823)

  • Remove configuration option for Thanos query persistence

    It is not possible to set this via the used Helm chart, so the variable is useless. (!1174)

Other Tasks

  • Disable “-rc”-tagging (!1170)

v3.0.2 (2024-04-09)

Changes in the Documentation

  • Add some details about Thanos configuration (!1146)

v3.0.1 (2024-04-03)

Bugfixes

  • Fix Prometheus stack deployment

    If scheduling_key and allow_external_rules were set, rendering the values file for the Prometheus stack failed due to wrong indentation. Also, the scheduling_key did not take effect, even without allow_external_rules configured, due to the wrong indentation. (!1142)

v3.0.0 (2024-03-27)

Breaking changes

  • Drop passwordstore functionality

    We’re dropping the already deprecated and legacy passwordstore functionality. As the inventory updater checks for valid sections in the “config/config.toml” only, the “[passwordstore]” section must be dropped in its entirety for existing clusters. (!996)

  • Adjust configuration for persistence of Thanos components

    Persistence for Thanos components can be enabled/disabled by setting/unsetting k8s-service-layer.prometheus.thanos_storage_class. It is disabled by default. You must adjust your configuration to re-enable it. Have a look at the configuration template. Furthermore, the volume size for each component can be configured separately. (!1106)
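
    A sketch of re-enabling persistence via tomlq (as used elsewhere in these notes); the storage class name is a placeholder:

    # Re-enable persistence for Thanos components
    tomlq --in-place --toml-output '."k8s-service-layer".prometheus.thanos_storage_class = "<your-storage-class>"' config/config.toml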

  • Fix disabling storage class creation for rook/ceph pools

    Previously, the create_storage_class attribute of a ceph pool was a string which has been interpreted as boolean. This has been changed and that attribute must be a boolean now.

    [[k8s-service-layer.rook.pools]]
    name = "test-true"
    create_storage_class = true
    replicated = 3
    

    This restores the pre-rook_v2 behavior, where storage classes for Ceph blockpools didn’t get created by default. (!1130)

  • The Thanos object storage configuration must be moved to vault if it is not automatically managed. Please check the documentation on how to create a configuration and move it to vault.

    You must update your vault policies if you use Thanos with a custom object storage configuration:

    ./managed-k8s/tools/vault/update.sh $CLUSTER_NAME
    

    Execute the above to update your vault policies. A root token must be sourced.

New Features

  • Add Sonobuoy testing to CI (!957)

  • Add support to define memory limits for the kube-apiservers

    The values set in the config.toml are only applied on K8s upgrades. If no values are explicitly configured, neither memory resource requests nor limits will be set by default. (!1027)

  • Thanos: Add option to configure in-memory index cache sizes (!1116)

Changed functionality

  • Poetry virtual envs are now deduplicated between cluster repos and can be switched much more quickly (!931)

  • Allow unsetting CPU limits for rook/ceph components (!1089)

  • Add check whether VAULT_TOKEN is set for stages 2 and 3 (!1108)

  • Enable auto-downsampling for Thanos query (!1116)

  • Add an option for testing clusters to enforce a reboot of the nodes after each system update, to simulate the cluster behaviour in the real world. (!1121)

  • Add a new env var $MANAGED_K8S_LATEST_RELEASE for the init.sh script, which is true by default and causes the latest release to be checked out instead of devel (!1122)
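
    A hedged usage sketch (the script path is assumed from the surrounding tooling):

    # Opt out of the new default and keep checking out devel
    MANAGED_K8S_LATEST_RELEASE=false bash managed-k8s/actions/init.sh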

Bugfixes

  • Fix & generalize scheduling_key usage for managed K8s services (!1088)

  • Fix vault import for non-OpenStack clusters (!1090)

  • Don’t create Flux PodMonitors if monitoring is disabled (!1092)

  • Fix a bug which prevented nuking a cluster if Gitlab is used as Terraform backend (!1093)

  • Fix the tool tools/assemble_cephcluster_storage_nodes_yaml.py to produce valid YAML.

    The tool helps to generate a Helm values file for the rook-ceph-cluster Helm chart. The data type used for encryptedDevice in the YAML path cephClusterSpec.storage has been fixed. It was boolean before but needs to be a string. (!1118)

  • (!1120)

  • Ensure minimal IPSec package installation (!1129)

  • Fix testing of rook ceph block storage classes - Now all configured rook ceph block storage pools for which a storage class is configured are checked rather than only rook-ceph-data. (!1130)

Changes in the Documentation

  • Include missing information in the “new Vault” case in the “Pivot vault” section of the Vault documentation (!1086)

Deprecations and Removals

  • Drop support for Kubernetes v1.25 (!1056)

  • Support for the manifest-based Calico installation has been dropped (!1084)

Other Tasks

  • Add hotfixing strategy (!1063)

  • Add deprecation policy. (!1076)

  • Prevent CI jobs from failing if there are volume snapshots left (!1091)

  • Fix the releasenote-file-check in the CI (!1096)

  • Refine hotfixing procedure (!1101)

  • We define how long we’ll support older releases. (!1112)

  • Update flake dependencies (!1117)

v2.1.1 (2024-03-01)

Bugfixes

  • Fix kubernetes-validate installation for K8s updates (!1097)

v2.1.0 (2024-02-20)

New Features

  • Add support for Kubernetes v1.27 (!1065)

  • Allow enabling the Ceph dashboard

Changed functionality

  • Disarm GPU tests until #610 is properly addressed

Bugfixes

  • Allow clusters from before and after the introduction of the root CA rotation feature to use the same Vault instance. (!1069)

  • Fix loading order in envrc template

  • envrc.lib.sh: Run poetry install with --no-root

Changes in the Documentation

  • Add information on how to pack a release.

  • Update information about how to write releasenotes

Deprecations and Removals

  • Drop support for Kubernetes v1.24 (!1040)

Other Tasks

  • Update flake dependencies and allow unfree license for Terraform (!929)

v2.0.0 (2024-02-07)

Breaking changes

  • Add functionality to rotate certificate authorities of a cluster

    This is needed, for example, if the old one is about to expire. As the paths of the Vault policies have been updated for this feature, you must update them. Please refer to our documentation about the Vault setup. (!939)

New Features

  • Add support for generating Kubernetes configuration from Vault

    This allows “logging into Kubernetes” using your Vault credentials. For more information, see the updated vault documentation (!1016).

Bugfixes

  • Disable automatic certificate renewal by kubeadm as we manage certificates via Vault

  • Fixed variable templates for Prometheus persistent storage configuration

Other Tasks

  • Further improvement to the automated release process. (!1033)

  • Automatically delete volume snapshots in the CI

  • Bump required Python version to >=3.10

  • CI: Don’t run the containerd job every time on devel

  • Enable renovate bot for Ansible galaxy requirements

v1.0.0 (2024-01-29)

Breaking changes

  • Add option to configure multiple Wireguard endpoints

    Note that you must update the vault policies once. See Wireguard documentation for further information.

    # execute with root vault token sourced
    bash managed-k8s/tools/vault/init.sh
    
  • Improve smoke tests for dedicated testing nodes

    Smoke tests have been reworked a bit such that they are executed only on defined testing nodes (if any are defined). You must update your config if you defined testing nodes. (!952)

New Features

  • Add option to migrate terraform backend from local to gitlab (!622)

  • Add support for Kubernetes v1.26 (!813)

  • Support the bitnami thanos helm chart

    This will create new service names for thanos in K8s. The migration to the bitnami thanos helm chart is triggered by default. (!816)

  • Add tool to assemble snippets for CephCluster manifest

    Writing the part of the CephCluster manifest that describes which disks are to be used for Ceph OSDs and metadata devices for every single storage node is error-prone. Once an erroneous manifest has been applied, it can be very time-consuming to correct the errors as OSDs have to be un-deployed and wiped before re-applying the correct manifest. (!855)

  • Add project-specific managers for renovate-bot (!856)

  • Add option to configure custom DNS nameserver for OpenStack subnet (IPv4) (!904)

  • Add option to allow snippet annotations for NGINX Ingress controller (!906)

  • Add configuration option for persistent storage for Prometheus (!917)

  • Add optional configuration options for soft and hard disk pressure eviction to the config.toml. (!948)

  • Additionally pull a local copy of the Terraform state for disaster recovery purposes if Gitlab is configured as backend. (!968)

Changed functionality

  • Bump default Kubernetes node image to Ubuntu 22.04 (!756)

  • Update Debian Version for Gateway VMs to 12 (!824)

  • Spawn Tigera operator on Control Plane only by adjusting its nodeSelector (!850)

  • A minimum version of v1.5.0 is now required for poetry (!861)

  • Rework installation procedure of flux

    Flux will be deployed via the community helm chart from now on. A migration is automatically triggered (but can be prevented, see our flux documentation for further information). The old installation method will be dropped very soon. (!891)

  • Use the v1beta3 kubeadm Configuration format for initialization and join processes (!911)

  • Switch to new community-owned Kubernetes package repositories

    As the Google-hosted repository got frozen, we’re switching over to the community-owned repositories. For more information, please refer to https://kubernetes.io/blog/2023/08/15/pkgs-k8s-io-introduction/#what-are-significant-differences-between-the-google-hosted-and-kubernetes-package-repositories (!937)

  • Moving IPSec credentials to vault. This requires manual migration steps. Please check the documentation. (!949)

  • Don’t set resource limits for the NGINX ingress controller by default

Bugfixes

  • Create a readable terraform var file (!817)

  • Fixed the missing gpu flag and monitoring scheduling key (!819)

  • Update the terraform linter and fix the related issues (!822)

  • Fixed the check for monitoring common labels in the rook-ceph cluster chart values template. (!826)

  • Fix the vault.sh script

    The script will stop if a config.hcl file already exists. This can be avoided with a prior existence check. Coreutils v9.2 changed the behaviour of --no-clobber [1].

    [1] https://github.com/coreutils/coreutils/blob/df4e4fbc7d4605b7e1c69bff33fd6af8727cf1bf/NEWS#L88 (!828)

  • Added missing dependencies to flake.nix (!829)

  • ipsec: Include passwordstore role only if enabled

    The ipsec role hasn’t been fully migrated to vault yet and still depends on the passwordstore role. If ipsec is not used, initializing a password store is not necessary. However, as an ansible dependency, it was still run and thus failed if passwordstore hadn’t been configured. This change adds the role via include_role instead of as a dependency. (!833)

  • Docker support has been removed along with k8s versions <1.24, but some places remained dependent on the now unnecessary variable container_runtime. This change removes every use of the variable along with the documentation for migrating from docker to containerd. (!834)

  • Fix non-gpu clusters

    For non-gpu clusters, the roles containerd and kubeadm-join would fail, because the variable has_gpu was not defined. This commit changes the order of the condition, so has_gpu is only checked if gpu support is enabled for the cluster.

    This is actually kind of a workaround for a bug in Ansible. has_gpu would be set in a dependency of both roles, but Ansible skips dependencies if they have already been skipped earlier in the play. (!835)

  • Fix rook for clusters without prometheus

    Previously, the rook cluster chart would always try to create PrometheusRules, which would fail without Prometheus’ CRD. This change makes the creation dependent on whether monitoring is enabled or not. (!836)

  • Fix vault for clusters without prometheus

    Previously, the vault role would always try to create ServiceMonitors, which would fail without Prometheus’ CRD. This change makes the creation dependent on whether monitoring is enabled or not. (!838)

  • Change the default VRRP priorities from 150/100/80 to 150/100/50. This makes it less likely that two backup nodes attempt to become primary at the same time, avoiding race conditions and flappiness. (!841)

  • Fix Thanos v1 cleanup tasks during migration to prevent accidental double deletion of resources (!849)

  • Fixed incorrect templating of Thanos secrets for buckets managed by Terraform and clusters with custom names (!854)

  • Rename rook_on_openstack field in config.toml to on_openstack (!888)

  • (!889, !910)

  • Fixed configuration of host network mode for rook/ceph (!899)

  • Only delete volumes, ports and floating IPs from the current OpenStack project on destroy, even if the OpenStack credentials can access more than this project. (!921)

  • destroy: Ensure port deletion works even if only OS_PROJECT_NAME is set (!922)

  • destroy: Ensure port deletion works even if both OS_PROJECT_NAME and OS_PROJECT_ID are set (!924)

  • Add support for ch-k8s-lbaas version 0.7.0. Excerpt from the upstream release notes:

    • Improve scoping of actions within OpenStack. Previously, if the credentials allowed listing of ports or floating IPs outside the current project, those would also be affected. This is generally only the case with OpenStack admin credentials which you aren’t supposed to use anyway.

    It is strongly recommended that you upgrade your cluster to use 0.7.0 as soon as possible. To do so, change the version value in the ch-k8s-lbaas section of your config.toml to "0.7.0". (!938)
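
    A one-line sketch of the version bump described above (tomlq used as shown elsewhere in these notes):

    # Bump the ch-k8s-lbaas version in config.toml
    tomlq --in-place --toml-output '."ch-k8s-lbaas".version = "0.7.0"' config/config.toml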

  • Fixed collection of Pod logs as job artifacts in the CI. (!953)

  • Fix forwarding nftable rules for multiple Wireguard endpoints. (!969)

  • The syntax of the rook ceph operator_memory_limit and _request was fixed in config.toml. (!973)

  • Fix migration tasks for Flux (!976)

  • It is ensured that the values passed to the cloud-config secret are proper strings. (!980)

  • Fix configuration of Grafana resource limits & requests (!982)

  • Bump to latest K8s patch releases (!994)

  • Fix the behaviour of the Terraform backend when multiple users are maintaining the same cluster, especially when migrating the backend from local to http. (!998)

  • Constrain kubernetes-validate pip package on Kubernetes nodes (!1004)

  • Add automatic migration to community repository for Kubernetes packages

  • Create a workaround which should allow the renovate bot to create releasenotes

Changes in the Documentation

  • Added clarification for available release-note types. (!830)

  • Add clarification in vault setup. (!831)

  • Fix tip about .envrc in Environment Variable Reference (!832)

  • Clarify general upgrade procedure and remove obsolete version specific steps (!837)

  • The repo link to the prometheus blackbox exporter changed (!840)

  • (!851, !853, !908, !979)

  • Added clarification in initialization for the different .envrc used. (!852)

  • Update and convert Terraform documentation to restructured Text (!904)

  • rook-ceph: Clarify role of mon_volume_storage_class (!955)

Deprecations and Removals

  • remove acng related files (!978)

Security

  • Security hardening settings for the nginx ingress controller. (!972)

Preversion

Towncrier as tooling for releasenotes

From now on we use towncrier to generate our release notes. If you are a developer, see the coding guide for further information.

Add .pre-commit-config.yaml

This repository now contains pre-commit hooks to validate the linting stage of our CI (except ansible-lint) before committing. This allows for a smoother development experience as mistakes can be caught more quickly. To use this, install pre-commit (if you use Nix flakes, it is automatically installed for you) and then run pre-commit install to enable the hooks in the repo (if you use direnv, they are automatically enabled for you).

Create volume snapshot CRDs (!763)

You can now create snapshots of your OpenStack PVCs. The missing CRDs and the snapshot-controller from [1] and [2] were added.

[1] https://github.com/kubernetes-csi/external-snapshotter/tree/master/client/config/crd

[2] https://github.com/kubernetes-csi/external-snapshotter/tree/master/deploy/kubernetes/snapshot-controller

Add support for rook v1.8.10

Update by setting version=1.8.10 and running MANAGED_K8S_RELEASE_THE_KRAKEN=true AFLAGS="--diff --tags mk8s-sl/rook" managed-k8s/actions/apply-stage4.sh

Use poetry to lock dependencies

Poetry allows to declaratively set Python dependencies and lock versions. This way we can ensure that everybody uses the same isolated environment with identical versions and thus reduce inconsistencies between individual development environments.

requirements.txt has been removed. Python dependencies are now declared in pyproject.toml and locked in poetry.lock. New deps can be added using the command poetry add package-name. After manually editing pyproject.toml, run poetry lock to update the lock file.
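
For reference, the day-to-day commands described above (some-package is a placeholder):

# Typical poetry workflow
poetry add some-package   # declare a new dependency and update poetry.lock
poetry lock               # re-lock after editing pyproject.toml by hand
poetry install            # sync the local environment with poetry.lock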

Drop support for Kubernetes v1.21, v1.22, v1.23

We’re dropping support for EOL Kubernetes versions.

Add support for Kubernetes v1.25

We added support for all patch versions of Kubernetes v1.25. One can either directly create a new cluster with a patch release of that version or upgrade an existing cluster to one as usual via:

# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.25.10

Note

By default, the Tigera operator is deployed with Kubernetes v1.25. Therefore, during the upgrade from Kubernetes v1.24 to v1.25, the migration to the Tigera operator will be triggered automatically by default!

Add support for Helm-based installation of rook-ceph (!676)

Starting with rook v1.7, an official Helm chart is provided and has become the recommended installation method. The charts take care of most installation and upgrade processes. The role rook_v2 adds support for the Helm-based installation as well as a migration path from rook_v1.

In order to migrate, make sure that rook v1.7.11 is installed and healthy, then set use_helm=true in the k8s-service-layer.rook section and run stage4.
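
A sketch of the configuration switch (tomlq used as shown elsewhere in these notes); stage 4 is then run as usual:

# Opt into the Helm-based rook installation
tomlq --in-place --toml-output '."k8s-service-layer".rook.use_helm = true' config/config.toml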

GPU: Rework setup and check procedure (!750)

We reworked the setup and smoke test procedure for GPU nodes to be used inside of Kubernetes [1]. In the last two ShoreLeave meetings (our official development meetings) [2] and our IRC channel [3] we asked for feedback on whether the old procedure is in use in the wild. As that does not seem to be the case, we decided to save the overhead of implementing and testing a migration path. If you have GPU nodes in your cluster and support for them breaks with the reworked code, please create an issue or consider rebuilding the nodes with the new procedure.

[1] GPU Support Documentation

[2] https://gitlab.com/yaook/meta#subscribe-to-meetings

[3] https://gitlab.com/yaook/meta/-/wikis/home#chat

Change kube-apiserver Service-Account-Issuer

Kube-apiserver now issues service-account tokens with https://kubernetes.default.svc as issuer instead of kubernetes.default.svc. Tokens with the old issuer are still considered valid, but should be renewed as this additional support will be dropped in the future.

This change had to be made to make yaook-k8s pass all k8s-conformance tests.

Drop support for Kubernetes v1.20

We’re dropping support for Kubernetes v1.20 as this version has been EOL for quite some time. This step has been announced several times in our public development meeting.

Drop support for Kubernetes v1.19

We’re dropping support for Kubernetes v1.19 as this version has been EOL for quite some time. This step has been announced several times in our public development meeting.

Implement support for Tigera operator-based Calico installation

Instead of using a customized manifest-based installation method, we’re now switching to an operator-based installation method based on the Tigera operator.

Existing clusters must be migrated. Please have a look at our Calico documentation for further information.

Support for Kubernetes v1.24

The LCM now supports Kubernetes v1.24. One can either directly create a new cluster with a patch release of that version or upgrade an existing cluster to one as usual via:

# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.24.10

Note

If you’re using docker as CRI, you must migrate to containerd in advance.

Further information is given in the Upgrading Kubernetes documentation.

Implement automated docker to containerd migration

A migration path to change the container runtime on each node of a cluster from docker to containerd has been added. More information about this can be found in the documentation.

Drop support for kube-router

We’re dropping support for kube-router as CNI. This step has been announced via our usual communication channels months ago. A migration path from kube-router to calico has been available quite some time and is also removed now.

Support for Rook 1.7 added

The LCM now supports Rook v1.7.*. Upgrading is as easy as setting your rook version to 1.7.11, allowing the kraken to be released, and running stage 4.

Support for Calico v3.21.6

We now added support for Calico v3.21.6, which is tested against Kubernetes v1.20, v1.21 and v1.22 by the Calico project team. We also added the possibility to specify one of our supported Calico versions (v3.17.1, v3.19.0, v3.21.6) through a config.toml variable: calico_custom_version.

ch-k8s-lbaas now respects NetworkPolicy objects

If you are using NetworkPolicy objects, ch-k8s-lbaas will now interpret them and enforce restrictions on the frontend. That means that if you previously only allowlisted the CIDR in which the lbaas agents themselves reside, your inbound traffic will be dropped now.

You have to add external CIDRs to the network policies as needed to avoid that.

Clusters where NetworkPolicy objects are not in use or where filtering only happens on namespace/pod targets are not affected (as LBaaS wouldn’t have worked there anyway, as it needs to be allowlisted in a CIDR already).

Add Priority Class to esssential cluster components (!633)

The priority classes system-cluster-critical and system-node-critical have been added to all managed and therefore essential services and components. There is no switch to avoid that. For existing clusters, all managed components will therefore be restarted/updated once during the next application of the LCM. This is considered non-disruptive.

Decoupling thanos and terraform

When enabling Thanos, one can now prevent Terraform from creating a bucket in the same OpenStack project by setting manage_thanos_bucket=false in the [k8s-service-layer.prometheus] section. It is then up to the user to manage the bucket by configuring an alternative storage backend.
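
A sketch of the configuration change (tomlq used as shown elsewhere in these notes):

# Let an externally managed bucket be used instead of the Terraform-managed one
tomlq --in-place --toml-output '."k8s-service-layer".prometheus.manage_thanos_bucket = false' config/config.toml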

OpenStack: Ensure that credentials are used

https://gitlab.com/yaook/k8s/-/merge_requests/625 introduces the role check-openstack-credentials, which fires a token request against the given Keystone endpoint to ensure that credentials are available. For details, check the commit messages. This sanity check can be skipped either by passing -e check_openstack_credentials=False to your call to ansible-playbook or by setting check_openstack_credentials = False in the [miscellaneous] section of your config.toml.
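
A sketch of skipping the check permanently via config.toml (tomlq used as shown elsewhere in these notes):

# Disable the OpenStack credentials sanity check
tomlq --in-place --toml-output '.miscellaneous.check_openstack_credentials = false' config/config.toml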

Thanos: Allow alternative object storage backends

By providing thanos_objectstorage_config_file, one can tell thanos-{compact,store} to use a specific (pre-configured) object storage backend (instead of using the bucket the LCM built for you). Please note that the usage of Thanos still requires that the OpenStack installation provides a Swift backend. That’s a bug.

Observation of etcd

Our monitoring stack now includes the observation of etcd. To fetch the metrics securely (cert-auth based), a thin socat-based proxy is installed inside the kube-system namespace.

Support for Kubernetes v1.23

The LCM now supports Kubernetes v1.23. One can either directly create a new cluster with that version or upgrade an existing one as usual via:

# Replace the patch version
MANAGED_K8S_RELEASE_THE_KRAKEN=true ./managed-k8s/actions/upgrade.sh 1.23.11

Further information is given in the Upgrading Kubernetes documentation.

config.toml: Introduce the mandatory option [miscellaneous]/container_runtime

This must be set to "docker" for pre-existing clusters. New clusters should be set up with "containerd". Migration of pre-existing clusters from docker to containerd is not yet supported.
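
A sketch for a newly set-up cluster (tomlq used as shown elsewhere in these notes):

# New clusters should use containerd; pre-existing clusters must keep "docker" for now
tomlq --in-place --toml-output '.miscellaneous.container_runtime = "containerd"' config/config.toml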

Replace count with for_each in terraform (!524)

terraform now uses for_each to manage instances which allows the user to delete instances of any index without extraordinary terraform black-magic. The LCM auto-magically orchestrates the migration.

Add action for system updates of initialized nodes (!429)

The node system updates have been pulled out into a separate action script. The reason is that, even though one has not set MANAGED_K8S_RELEASE_THE_KRAKEN, the cache of the host node’s package manager is updated in stage2 and stage3. That takes quite some time and is unnecessary as the update itself won’t happen. More rationale is given in the commit message of e4c62211.

cluster-repo: Move submodules into dedicated directory (!433)

We’re now moving (git) submodules into a dedicated directory submodules/. For users enabling these, the cluster repository starts to get messy, at the latest after the introduction of the option to use customization playbooks.

As this is a breaking change, users which use at least one submodule must re-execute the init.sh-script! The init.sh-script will move your enabled submodules into the submodules/ directory. Otherwise at least the symlink to the ch-role-users- role will be broken.

Note

By re-executing the init.sh, the latest devel branch of the managed-k8s-module will be checked out under normal circumstances!