Cluster Configuration

The environment variables affect how the user interacts with the cluster via the action scripts. The config/config.toml, however, is the main configuration file and can be adjusted to customize the YAOOK/K8s cluster to fit your needs. It also contains operational flags which can trigger operational tasks. After initializing a cluster repository, config/config.toml contains the necessary (default) values to create a cluster. However, you’ll still need to adjust some of them before triggering a cluster creation.

The config/config.toml configuration file

The config.toml configuration file is created during the cluster repository initialization from the templates/config.template.toml file. You can (and must) adjust some of its values.

Before triggering an action script, the inventory updater automatically reads the configuration file, processes it, and puts variables into the inventory/. The inventory/ is automatically included. Following the concept of separation of concerns, variables are only available to stages/layers which need them.
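
Throughout the configuration excerpts below you will find comments like `# ansible prefix: "rook"`. They document this mapping: a key from the corresponding config.toml section is exposed to Ansible with the given prefix prepended to the key name. A small illustration of the convention, using keys from the Rook section further down:

[k8s-service-layer.rook]    # ansible prefix: "rook"
mon_memory_limit = "1Gi"    # becomes the Ansible variable `rook_mon_memory_limit`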

Configuring Terraform

You can overwrite all Terraform-related variables (see below for where to find a complete list) in the Terraform section of your config.toml.

By default, 3 control plane nodes and 4 worker nodes are created. You’ll need to adjust these values if you e.g. want to enable rook.

Note

The nodes variable configures the K8s master and worker servers. The role attribute must be used to distinguish between them [1].

The amount of gateway nodes can be controlled with the gateway_count variable. It defaults to the number of elements in the azs array when spread_gateways_across_azs=true and 3 otherwise.
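
For example, with three availability zones and spreading enabled, three gateway nodes are created without setting gateway_count explicitly. A sketch; the availability zone names are placeholders:

[terraform]
azs = ["AZ1", "AZ2", "AZ3"]
spread_gateways_across_azs = true
# gateway_count defaults to 3 here (one gateway per availability zone);
# set it explicitly only if spread_gateways_across_azs = false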

You can add and delete Terraform nodes simply by adding and removing their entries to/from the config or tuning gateway_count for gateway nodes. Consider the following example:

 [terraform]

- gateway_count = 3
+ gateway_count = 2                # <-- one gateway gets deleted

 [terraform.nodes.worker-0]
 role = "worker"
 flavor = "M"
 image = "Debian 12 (bookworm)"
-
-[terraform.nodes.worker-1]       # <-- gets deleted
-role = "worker"
-flavor = "M"

 [terraform.nodes.worker-2]
 role = "worker"
 flavor = "L"
+
+[terraform.nodes.mon1]           # <-- gets created
+role = "worker"
+flavor = "S"
+image = "Ubuntu 22.04 LTS x64"

Attention

You must configure at least one master node.

For an auto-generated complete list of variables, please refer to Terraform docs.

The name of a Terraform node is composed of the following parts:

  • for master/worker nodes: [terraform].cluster_name <the nodes' table name>

  • for gateway nodes: [terraform].cluster_name [terraform.gateway_defaults].common_name <numeric-index>

[terraform]

cluster_name = "yk8s"
gateway_count = 1
#....

[terraform.gateway_defaults]
common_name = "gateway-"

[terraform.nodes.master-X]
role = "master"

[terraform.nodes.worker-A]
role = "worker"

# yields the following node names:
# - yk8s-gateway-0
# - yk8s-master-X
# - yk8s-worker-A

To store the Terraform state files in GitLab (GitLab-managed Terraform backend), adapt the Terraform section of your config.toml: set gitlab_backend to true, and set the GitLab base URL, the GitLab project ID and the name of the GitLab state object.

[terraform]
gitlab_backend    = true
gitlab_base_url   = "https://gitlab.com"
gitlab_project_id = "012345678"
gitlab_state_name = "tf-state"

Put your GitLab username and access token into ~/.config/yaook-k8s/env. Your GitLab access token must have at least the Maintainer role and read/write access to the API. Please see the GitLab documentation for creating a personal access token.

To successfully migrate from the “local” to the “http” Terraform backend method, ensure that gitlab_backend is set to true and that all other required variables are set correctly. Incorrect data may result in an HTTP error response, such as HTTP/401 for incorrect credentials. If the credentials are correct and an HTTP/404 error is returned (the state does not yet exist remotely), Terraform is executed and the state is migrated to GitLab.

To migrate from the “http” back to the “local” Terraform backend method, set gitlab_backend=false and MANAGED_K8S_NUKE_FROM_ORBIT=true, and make sure that all variables above are properly set and that the Terraform state exists on GitLab. Once the migration is successful, unset the variables above to continue using the “local” backend method.

export TF_HTTP_USERNAME="<gitlab-username>"
export TF_HTTP_PASSWORD="<gitlab-access-token>"
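
As a minimal sketch, the config change for migrating back to the “local” backend looks like this (MANAGED_K8S_NUKE_FROM_ORBIT is an environment variable and is exported in your shell, not set in the config file):

[terraform]
gitlab_backend = false
# additionally: export MANAGED_K8S_NUKE_FROM_ORBIT=true in your shell;
# once the migration has succeeded, unset the variables above to continue
# using the "local" backend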

Excerpt from templates/config.template.toml:

config.toml: Terraform configuration
# --- TERRAFORM ---
# ansible prefix: /
[terraform]

# Allows to disable execution of the terraform stage by setting it to false.
# Intended use case are bare-metal or otherwise pre-provisioned setups.
enabled = true

# If true, prevent Terraform from performing disruptive action
# defaults to true if unset
prevent_disruption = true

#subnet_cidr = "172.30.154.0/24"

# NOTE: Disabling this to build an IPv6-only cluster with Terraform
#       is not supported yet and will fail.
# https://gitlab.com/yaook/k8s/-/issues/685
#ipv4_enabled = true

# Optionally enable IPv6 for DualStack support
# WARNING: IPv6 isn't fully supported by all components yet. (e.g. LBaaS)
#ipv6_enabled = false

# If you enabled IPv6-support you may want to adjust the IPv6 subnet
#subnet_v6_cidr = "fd00::/120"

# If true, create block volume for each instance and boot from there.
# Equivalent to `openstack server create --boot-from-volume […]`.
#create_root_disk_on_volume = false

# Volume type that is used if `create_root_disk_on_volume` is true.
#root_disk_volume_type = "three_times_replicated"

# Enable GitLab-managed Terraform backend
# If true, the Terraform state will be stored inside the provided gitlab project.
# If set, the environment variables `TF_HTTP_USERNAME` and `TF_HTTP_PASSWORD`
# must be configured in a separate file `~/.config/yaook-k8s/env`.
#gitlab_backend = false

# The base URL of your GitLab project.
#gitlab_base_url = "https://gitlab.com"

# The unique ID of your GitLab project.
#gitlab_project_id = "01234567"

# The name of the Gitlab state object in which to store the Terraform state, e.g. 'tf-state'
#gitlab_state_name = "tf-state"


# if set to true it is ensured that gateway nodes are evenly spread across the specified availability zones
#spread_gateways_across_azs = true

# Amount of gateway nodes
#gateway_count = 3  # if 'spread_gateways_across_azs=true' defaults to one in each availability zone

# Change the default values for all gateway nodes
# Values can be selectively specified
#[terraform.gateway_defaults]
#common_name                = "gw-"  # NOTE: will be suffixed by an index
#image                      = "Debian 12 (bookworm)"
#flavor                     = "XS"
#root_disk_size             = 10
#root_disk_volume_type      = ""  # == default volume type in IaaS environment
#
# Change the default values for all master nodes
# Values can be selectively specified
#[terraform.master_defaults]
#image                      = "Ubuntu 22.04 LTS x64"
#flavor                     = "M"
#root_disk_size             = 50
#root_disk_volume_type      = ""  # == default volume type in IaaS environment
#
# Change the default values for all worker nodes
# Values can be selectively specified
#[terraform.worker_defaults]
#image                      = "Ubuntu 22.04 LTS x64"
#flavor                     = "M"
#root_disk_size             = 50
#root_disk_volume_type      = ""  # == default volume type in IaaS environment
##anti_affinity_group             # == <unset>: don't join any group
#
# Master nodes to be created (at least one)
#[terraform.nodes.<master-name>]
#role = "master"
#image                    =
#flavor                   =
#az                       =
#root_disk_size           =
#root_disk_volume_type    =
#
# Worker nodes to be created
#[terraform.nodes.<worker-name>]
#role = "worker"
#image                    =
#flavor                   =
#az                       =
#root_disk_size           =
#root_disk_volume_type    =
#anti_affinity_group      =
#
# One gateway node is created per availability zone (AZ1, AZ2, AZ3),
# see [terraform].gateway_count

Configuring Load-Balancing

By default, if you’re deploying on top of OpenStack, the self-developed load-balancing solution ch-k8s-lbaas will be used to avoid the headaches of using OpenStack Octavia. Nonetheless, you are not forced to use it and can easily disable it.

The following section contains legacy load-balancing options which will probably be removed in the foreseeable future.

config.toml: Historic load-balancing configuration
# --- LOAD-BALANCING ---
# ansible prefix: /
[load-balancing]
# lb_ports is a list of ports that are exposed by HAProxy on the gateway nodes and forwarded
# to NodePorts in the k8s cluster. This poor man's load-balancing / exposing of services
# has been superseded by ch-k8s-lbaas. For legacy reasons and because it's useful under
# certain circumstances it is kept inside the repository.
# The NodePorts are either literally exposed by HAProxy or can be mapped to other ports.
# The `layer` attribute can either be `tcp` (L4) or `http` (L7). For `http`, `option forwardfor`
# is added implicitly to the backend servers in the haproxy configuration.
# If `use_proxy_protocol` is set to `true`, HAProxy will use the proxy protocol to convey information
# about the connection initiator to the backend. NOTE: the backend has to accept the proxy
# protocol, otherwise your traffic will be discarded.
# Short form:
#lb_ports = [30060]
# Explicit form:
#lb_ports = [{external=80, nodeport=30080, layer="tcp", use_proxy_protocol=true}]

# A list of priorities to assign to the gateway/frontend nodes. The priorities
# will be assigned based on the sorted list of matching nodes.
#
# If more nodes exist than there are entries in this list, the rollout will
# fail.
#
# Please consult the keepalived.conf manpage when choosing priority values.
#vrrp_priorities = [150, 100, 50]

# Enable/Disable OpenStack-based load-balancing.
# openstack_lbaas = false

# Port for HAProxy statistics
#haproxy_stats_port = 48981

Kubernetes Cluster Configuration

This section contains generic information about the Kubernetes cluster configuration.

Basic Cluster Configuration

config.toml: Kubernetes basic cluster configuration
# --- KUBERNETES: BASIC CLUSTER CONFIGURATION ---
# ansible prefix: "k8s_"
[kubernetes]
# Kubernetes version. Currently, we support from 1.28.* to 1.30.*.
version = "1.30.5" # •ᴗ•

# Set this variable if this cluster contains worker with GPU access
# and you want to make use of these inside of the cluster,
# so that the driver and surrounding framework is deployed.
is_gpu_cluster = false # •ᴗ•

# Set this variable to virtualize Nvidia GPUs on worker nodes
# for usage outside of the Kubernetes cluster / above the Kubernetes layer.
# It will install a VGPU manager on the worker node and
# split the GPU according to chosen vgpu type.
# Note: This will not install Nvidia drivers to utilize vGPU guest VMs!!
# If set to true, please set further variables in the [miscellaneous] section.
# Note: This is mutually exclusive with "is_gpu_cluster"
virtualize_gpu = false

[kubernetes.apiserver]
frontend_port = 8888 # •ᴗ•

# Configure memory resources for the kube-apiserver
# memory_limit = ""

[kubernetes.controller_manager]
#large_cluster_size_threshold = 50

# Note: This currently means that the cluster CA key is copied to the control
# plane nodes which decreases security compared to storing the CA only in the Vault.
# IMPORTANT: Manual steps required when enabled after cluster creation
#   The CA key is made available through Vault's kv store and fetched by Ansible.
#   Due to Vault's security architecture this means
#   you must run the CA rotation script
#   (or manually upload the CA key from your backup to Vault's kv store).
#enable_signing_requests = false

Calico Configuration

The following configuration options are specific to calico, our CNI plugin in use.

config.toml: Kubernetes basic cluster configuration
[kubernetes.network.calico]
#mtu = 1450 # for OpenStack at most 1450

# Only takes effect for operator-based installations
# https://docs.tigera.io/calico/latest/reference/installation/api/#operator.tigera.io/v1.EncapsulationType
#encapsulation = "None"

# Only takes effect for manifest-based installations
# Define if the IP-in-IP encapsulation of calico should be activated
# https://docs.tigera.io/calico/latest/reference/resources/ippool#spec
#ipipmode = "Never"

# Make the auto detection method variable as one downside of
# using can-reach mechanism is that it produces additional logs about
# other interfaces i.e. tap interfaces. Also a simpler way will be to
# use an interface to detect ip settings i.e. interface=bond0
# calico_ip_autodetection_method = "can-reach=www.cloudandheat.com"
# calico_ipv6_autodetection_method = "can-reach=www.cloudandheat.com"

# For the operator-based installation,
# it is possible to link to self-maintained values file for the helm chart
#values_file_path = "path-to-a-custom/values.yaml"

# We're mapping a fitting calico version to the configured Kubernetes version.
# You can however pick a custom Calico version.
# Be aware that not all combinations of Kubernetes and Calico versions are recommended:
# https://docs.tigera.io/calico/latest/getting-started/kubernetes/requirements
# Any version should work as long as
# you stick to the calico-Kubernetes compatibility matrix.
#
# If not specified here, a predefined Calico version will be matched against
# the above specified Kubernetes version.
#custom_version = "3.25.1"

Storage Configuration

config.toml: Kubernetes - Basic Storage Configuration
# --- KUBERNETES: STORAGE CONFIGURATION ---
# ansible prefix: "k8s_storage"
[kubernetes.storage]
# Many clusters will want to use rook, so you should enable
# or disable it here if you want. It requires extra options
# which need to be chosen with care.
rook_enabled = false # •ᴗ•

# Setting this to true will cause the storage plugins
# to run on all nodes (ignoring all taints). This is often desirable.
nodeplugin_toleration = false # •ᴗ•

# This flag enables the topology feature gate of the cinder controller plugin.
# Its purpose is to allocate volumes from cinder which are in the same AZ as
# the worker node to which the volume should be attached.
# Important: Cinder must support AZs and the AZs must match the AZs used by nova!
#cinder_enable_topology=true
config.toml: Kubernetes - Static Local Storage Configuration
# --- KUBERNETES: STATIC LOCAL STORAGE CONFIGURATION ---
# ansible prefix: "k8s_local_storage"
[kubernetes.local_storage.static]
# Enable static provisioning of local storage. This provisions a single local
# storage volume per worker node.
#
# It is recommended to use the dynamic local storage instead.
enabled = false # •ᴗ•

# Name of the storage class to create.
#
# NOTE: the static and dynamic provisioner must have distinct storage class
# names if both are enabled!
#storageclass_name = "local-storage"

# Namespace to deploy the components in
#namespace = "kube-system"

# Directory where the volume will be placed on the worker node
#data_directory = "/mnt/data"

# Synchronization directory where the provisioner will pick up the volume from
#discovery_directory = "/mnt/mk8s-disks"

# Version of the provisioner to use
#version = "v2.3.4"

# Toleration for the plugin. Defaults to `kubernetes.storage.nodeplugin_toleration`
#nodeplugin_toleration = ...
config.toml: Kubernetes - Dynamic Local Storage Configuration
# --- KUBERNETES: DYNAMIC LOCAL STORAGE CONFIGURATION ---
# ansible prefix: "k8s_local_storage"
[kubernetes.local_storage.dynamic]
# Enable dynamic local storage provisioning. This provides a storage class which
# can be used with PVCs to allocate local storage on a node.
enabled = false # •ᴗ•

# Name of the storage class to create.
#
# NOTE: the static and dynamic provisioner must have distinct storage class
# names if both are enabled!
#storageclass_name = "local-storage"

# Namespace to deploy the components in
#namespace = "kube-system"

# Directory where the volumes will be placed on the worker node
#data_directory = "/mnt/dynamic-data"

# Version of the local path controller to deploy
#version = "v0.0.20"

# Toleration for the plugin. Defaults to `kubernetes.storage.nodeplugin_toleration`
#nodeplugin_toleration = ...

Monitoring Configuration

config.toml: Kubernetes - Monitoring Configuration
# --- KUBERNETES: MONITORING CONFIGURATION ---
# ansible prefix: "k8s_monitoring"
[kubernetes.monitoring]
# Enable Prometheus-based monitoring.
# For prometheus-specific configurations take a look at the
# k8s-service-layer.prometheus section.
enabled = false # •ᴗ•

Network Configuration

Note

To enable the calico network plugin, kubernetes.network.plugin needs to be set to calico.

config.toml: Kubernetes - Network Configuration
# --- KUBERNETES: NETWORK CONFIGURATION ---
# ansible prefix: "k8s_network"
[kubernetes.network]
# This is the subnet used by Kubernetes for Pods. Subnets will be delegated
# automatically to each node.
#pod_subnet = "10.244.0.0/16"
# pod_subnet_v6 = "fdff:2::/56"

# This is the subnet used by Kubernetes for Services.
#service_subnet = "10.96.0.0/12"
# service_subnet_v6 = "fdff:3::/108"

# If enabled, the service cluster IP range will be announced to external
# BGP peers. By default, only per-node pod networks are announced.
#bgp_announce_service_ips = false

# calico:
# High-performance, pure IP networking, policy engine. Calico provides
# layer 3 networking capabilities and associates a virtual router with each node.
# Allows the establishment of zone boundaries through BGP
plugin = "calico" # •ᴗ•

kubelet Configuration

The LCM supports the customization of certain variables of kubelet for (meta-)worker nodes.

Note

Applying changes requires enabling disruptive actions.

config.toml: Kubernetes - kubelet Configuration
# --- KUBERNETES: KUBELET CONFIGURATION (WORKERS) ---
# ansible prefix: "k8s_kubelet"
[kubernetes.kubelet]
# This section enables you to customize kubelet on the k8s workers (sic!)
# Changes will be rolled out only during k8s upgrades or if you explicitly
# allow disruptions.

# Maximum number of Pods per worker
# Increasing this value may also decrease performance,
# as more Pods can be packed into a single node.
#pod_limit = 110

# Config for soft and hard eviction values.
# Note: To change these values you have to release the Kraken
#evictionsoft_memory_period = "1m30s"
#evictionhard_nodefs_available = "10%"
#evictionhard_nodefs_inodesfree = "5%"

KSL - Kubernetes Service Layer

Rook Configuration

The used rook setup is explained in more detail here.

Note

To enable rook in a cluster on top of OpenStack, you need to set both k8s-service-layer.rook.nosds and k8s-service-layer.rook.osd_volume_size, enable kubernetes.storage.rook_enabled, and enable either dynamic (kubernetes.local_storage.dynamic.enabled) or static (kubernetes.local_storage.static.enabled) local storage, or both (see storage configuration).
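
Putting the note above together, a minimal sketch of the settings involved (the counts and sizes shown are the documented defaults):

[kubernetes.storage]
rook_enabled = true

[kubernetes.local_storage.dynamic]
enabled = true    # and/or enable [kubernetes.local_storage.static]

[k8s-service-layer.rook]
nosds = 3
osd_volume_size = "90Gi"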

config.toml: KSL - Rook Configuration
# --- KUBERNETES SERVICE LAYER : ROOK (STORAGE) ---
# ansible prefix: "rook"
[k8s-service-layer.rook]
# If kubernetes.storage.rook_enabled is enabled, rook will be installed.
# In this section you can customize and configure rook.

namespace    = "rook-ceph" # •ᴗ•
cluster_name = "rook-ceph" # •ᴗ•

# Configure a custom Ceph version.
# If not defined, the one mapped to the rook version
# will be used. Be aware that you can't choose an
# arbitrary Ceph version, but should stick to the
# rook-ceph-compatibility-matrix.
#custom_ceph_version = ""

# The helm chart version to be used.
#version = "v1.14.9"

# Enable the ceph dashboard for viewing cluster status
#dashboard = false

#nodeplugin_toleration = true

# Storage class name to be used by the ceph mons. SHOULD be compliant with one
# storage class you have configured in the kubernetes.local_storage section (or
# you should know what you are doing). Note that this is not the storage class
# name that rook will provide.
#mon_volume_storage_class = "local-storage"

# Enables rook to use the host network.
#use_host_networking = false

# If set to true Rook won’t perform any upgrade checks on Ceph daemons
# during an upgrade. Use this at YOUR OWN RISK, only if you know what
# you’re doing.
# https://rook.github.io/docs/rook/v1.3/ceph-cluster-crd.html#cluster-settings
#skip_upgrade_checks = false

# If true, the rook operator will create and manage PodDisruptionBudgets
# for OSD, Mon, RGW, and MDS daemons.
#rook_manage_pod_budgets = true

# Scheduling keys control where services may run. A scheduling key corresponds
# to both a node label and to a taint. In order for a service to run on a node,
# it needs to have that label key.
# If no scheduling key is defined for a service, it will run on any untainted
# node.
#scheduling_key = null
# If you're using a general scheduling key prefix,
# you can reference it here directly:
#scheduling_key = "{{ scheduling_key_prefix }}/storage"

# Set to false to disable CSI plugins, if they are not needed in the rook cluster.
# (For example if the ceph cluster is used for an OpenStack cluster)
#csi_plugins=true

# Additionally it is possible to schedule mons and mgrs pods specifically.
# NOTE: Rook does not merge scheduling rules set in 'all' and the ones in 'mon' and 'mgr',
# but will use the most specific one for scheduling.
#mon_scheduling_key = "{{ scheduling_key_prefix }}/rook-mon"
#mgr_scheduling_key = "{{ scheduling_key_prefix }}/rook-mgr"

# Number of mons to run.
# Default is 3 and is the minimum to ensure high-availability!
# The number of mons has to be uneven.
#nmons = 3

# Number of mgrs to run. Default is 1 and can be extended to 2
# and achieve high-availability.
# The count of mgrs is adjustable since rook v1.6 and does not work with older versions.
#nmgrs = 1

# Number of OSDs to run. This should be equal to the number of storage meta
# workers you use.
#nosds = 3

# The size of the storage backing each OSD.
#osd_volume_size = "90Gi"

# Enable the rook toolbox, which is a pod with ceph tools installed to
# introspect the cluster state.
#toolbox = true

# Enable the CephFS shared filesystem.
#ceph_fs = false
#ceph_fs_name = "ceph-fs"
#ceph_fs_replicated = 1
#ceph_fs_preserve_pools_on_delete = false

# Enable the encryption of OSDs
#encrypt_osds = false

# ROOK POD RESOURCE LIMITS
# The default values are the *absolute minimum* values required by rook. Going
# below these numbers will make rook refuse to even create the pods. See also:
# https://rook.io/docs/rook/v1.2/ceph-cluster-crd.html#cluster-wide-resources-configuration-settings

# Memory limit for mon Pods
#mon_memory_limit = "1Gi"
#mon_memory_request = "{{ rook_mon_memory_limit }}"
#mon_cpu_limit = null
#mon_cpu_request = "100m"

# Resource limits for OSD pods
# Note that these are chosen so that the OSD pods end up in the
# Guaranteed QoS class.
#osd_memory_limit = "2Gi"
#osd_memory_request = "{{ rook_osd_memory_limit }}"
#osd_cpu_limit = null
#osd_cpu_request = "{{ rook_osd_cpu_limit }}"

# Memory limit for mgr Pods
#mgr_memory_limit = "512Mi"
#mgr_memory_request = "{{ rook_mgr_memory_limit }}"
#mgr_cpu_limit = null
#mgr_cpu_request = "100m"

# Memory limit for MDS / CephFS Pods
#mds_memory_limit = "4Gi"
#mds_memory_request = "{{ rook_mds_memory_limit }}"
#mds_cpu_limit = null
#mds_cpu_request = "{{ rook_mds_cpu_limit }}"

# Rook-ceph operator limits
#operator_memory_limit = "512Mi"
#operator_memory_request = "{{ rook_operator_memory_limit }}"
#operator_cpu_limit = null
#operator_cpu_request = "{{ rook_operator_cpu_limit }}"

#[[k8s-service-layer.rook.pools]]
#name = "data"
#create_storage_class = true
#replicated = 1

# Custom storage configuration is documented at
# docs/user/guide/custom-storage.rst
#on_openstack = true
#use_all_available_devices = true
#use_all_available_nodes = true

Prometheus-based Monitoring Configuration

The used prometheus-based monitoring setup will be explained in more detail soon :)

Note

To enable prometheus, k8s-service-layer.prometheus.install and kubernetes.monitoring.enabled need to be set to true.
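
Put together, a minimal sketch of enabling the monitoring stack looks like this:

[kubernetes.monitoring]
enabled = true

[k8s-service-layer.prometheus]
install = true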

config.toml: KSL - Prometheus Configuration
# --- KUBERNETES SERVICE LAYER : MONITORING(PROMETHEUS) ---
# ansible prefix: "monitoring_"
[k8s-service-layer.prometheus]
# If kubernetes.monitoring.enabled is true, choose whether to install or uninstall
# Prometheus. IF SET TO FALSE, PROMETHEUS WILL BE DELETED WITHOUT CHECKING FOR
# DISRUPTION (sic!).
#install = true

#namespace = "monitoring"

# helm chart version of the prometheus stack
# https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
# If you set this empty (not unset), the latest version is used
# Note that upgrades require additional steps and maybe even LCM changes are needed:
# https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#upgrading-chart
#prometheus_stack_version = "59.1.0"

# Configure persistent storage for Prometheus
# By default an empty-dir is used.
# https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
#prometheus_persistent_storage_class = ""
#prometheus_persistent_storage_resource_request = "50Gi"

# Enable remote write for Prometheus
# https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.RemoteWriteSpec
#[[k8s-service-layer.prometheus.remote_writes]]
#url = "http://remote-write-receiver:9090/api/v1/write"
#
#[[k8s-service-layer.prometheus.remote_writes.write_relabel_configs]]
#targetLabel = "prometheus"
#replacement = "my-cluster"
#
#[[k8s-service-layer.prometheus.remote_writes.write_relabel_configs]]
#targetLabel = "cluster"
#replacement = "my-cluster"

# Enable grafana
#use_grafana = true

# If this variable is defined, Grafana will store its data in a PersistentVolume
# in the defined StorageClass. Otherwise, persistence is disabled for Grafana.
# The value has to be a valid StorageClass available in your cluster.
#grafana_persistent_storage_class=""

# Reference to an (existing) secret containing credentials
# for the Grafana admin user. If the secret does not already exist,
# it is automatically created with the default credentials.
#grafana_admin_secret_name = "cah-grafana-admin"

# The full public facing url you use in browser, used for redirects and emails
#grafana_root_url = ""

# Enable use of Thanos
#use_thanos = false

# Let terraform create an object storage container / bucket for you if `true`.
# If set to `false` one must provide a valid configuration via Vault
# See: https://yaook.gitlab.io/k8s/release/v3.0/managed-services/prometheus/prometheus-stack.html#custom-bucket-management
#manage_thanos_bucket = true

# Set custom Bitnami/Thanos chart version
#thanos_chart_version = "13.3.0"

# Thanos uses emptyDirs by default for its components
# for faster access.
# If that's not feasible, a storage class can be set to
# enable persistence and the size for each component volume
# can be configured.
# Note that switching between persistence requires
# manual intervention and it may be necessary to reinstall
# the helm chart completely.
#thanos_storage_class = null
# You can explicitly set the PV size for each component.
# If left undefined, the helm chart defaults will be used
#thanos_storegateway_size = null
#thanos_compactor_size = null

# By default, the monitoring will capture all namespaces. If this is not
# desired, the following switch can be turned off. In that case, only the
# kube-system, monitoring and rook namespaces are scraped by Prometheus.
#prometheus_monitor_all_namespaces=true

# How many replicas of the alertmanager should be deployed inside the cluster
#alertmanager_replicas=1

# Scheduling keys control where services may run. A scheduling key corresponds
# to both a node label and to a taint. In order for a service to run on a node,
# it needs to have that label key.
# If no scheduling key is defined for a service, it will run on any untainted
# node.
#scheduling_key = "node-restriction.kubernetes.io/cah-managed-k8s-monitoring"
# If you're using a general scheduling key prefix
# you can reference it here directly
#scheduling_key = "{{ scheduling_key_prefix }}/monitoring"

# Monitoring pod resource limits
# PROMETHEUS POD RESOURCE LIMITS
# The following limits are applied to the respective pods.
# Note that the Prometheus limits are chosen fairly conservatively and may need
# tuning for larger and smaller clusters.
# By default, we prefer to set limits in such a way that the Pods end up in the
# Guaranteed QoS class (i.e. both CPU and Memory limits and requests set to the
# same value).

#alertmanager_memory_limit = "256Mi"
#alertmanager_memory_request = "{{ monitoring_alertmanager_memory_limit }}"
#alertmanager_cpu_limit = "100m"
#alertmanager_cpu_request = "{{ monitoring_alertmanager_cpu_limit }}"

#prometheus_memory_limit = "3Gi"
#prometheus_memory_request = "{{ monitoring_prometheus_memory_limit }}"
#prometheus_cpu_limit = "1"
#prometheus_cpu_request = "{{ monitoring_prometheus_cpu_limit }}"

#grafana_memory_limit = "512Mi"
#grafana_memory_request = "256Mi"
#grafana_cpu_limit = "500m"
#grafana_cpu_request = "100m"

#kube_state_metrics_memory_limit = "128Mi"
#kube_state_metrics_memory_request = "50Mi"
#kube_state_metrics_cpu_limit = "50m"
#kube_state_metrics_cpu_request = "20m"

#thanos_sidecar_memory_limit = "256Mi"
#thanos_sidecar_memory_request = "{{ monitoring_thanos_sidecar_memory_limit }}"
#thanos_sidecar_cpu_limit = "500m"
#thanos_sidecar_cpu_request = "{{ monitoring_thanos_sidecar_cpu_limit }}"

#thanos_query_memory_limit = "786Mi"
#thanos_query_memory_request = "128Mi"
#thanos_query_cpu_limit = "1"
#thanos_query_cpu_request = "100m"

#thanos_store_memory_limit = "2Gi"
#thanos_store_memory_request = "256Mi"
#thanos_store_cpu_limit = "500m"
#thanos_store_cpu_request = "100m"

# https://thanos.io/tip/components/store.md/#in-memory-index-cache
# Note: Unit must be specified as decimal! (MB,GB)
# This value should be chosen in a sane manner based on
# thanos_store_memory_request and thanos_store_memory_limit
#thanos_store_in_memory_max_size = 0

# WARNING: If you have set terraform.cluster_name, you must set this
# variable to "${terraform.cluster_name}-monitoring-thanos-data".
# The default terraform.cluster_name is "managed-k8s" which is why the
# default object store container name is set to the following.
#thanos_objectstorage_container_name = "managed-k8s-monitoring-thanos-data"

# Scrape external targets via blackbox exporter
# https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-blackbox-exporter
#internet_probe = false

# Provide a list of DNS endpoints for additional thanos store endpoints.
# The endpoint will be extended to `dnssrv+_grpc._tcp.{{ endpoint }}.monitoring.svc.cluster.local`.
#thanos_query_additional_store_endpoints = []

# Deploy a specific blackbox exporter version
# https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-blackbox-exporter
#blackbox_version = "7.0.0"

# By default, prometheus and alertmanager only consider global rules from the monitoring
# namespace while other rules can only alert on their own namespace. If this variable is
# set, cluster wide rules are considered from all namespaces.
#allow_external_rules = false

#[[k8s-service-layer.prometheus.internet_probe_targets]]
# name="example"                    # Human readable URL that will appear in Prometheus / AlertManager
# url="http://example.com/healthz"  # The URL that blackbox will scrape
# interval="60s"                    # Scraping interval. Overrides value set in `defaults`
# scrapeTimeout="60s"               # Scrape timeout. Overrides value set in `defaults`
# module = "http_2xx"               # module to be used. Can be "http_2xx" (default), "http_api" (allow status codes 200, 300, 401), "http_api_insecure", "icmp" or "tcp_connect".

# If at least one common_label is defined, Prometheus will be created with selectors
# matching these labels and only ServiceMonitors that meet the criteria of the selector,
# i.e. are labeled accordingly, are included by Prometheus.
# The LCM takes care that all ServiceMonitor created by itself are labeled accordingly.
# The key can not be "release" as that one is already used by the Prometheus helm chart.
#
#[k8s-service-layer.prometheus.common_labels]
#managed-by = "yaook-k8s"

Tweak Thanos Configuration
index-cache-size / in-memory-max-size

Thanos is unaware of its Kubernetes limits which can lead to OOM kills of the storegateway if a lot of metrics are requested.

We therefore added an option to configure the index-cache-size (see the merge request Tweak Thanos configuration (!1116) · YAOOK / K8s · GitLab and Thanos - Highly available Prometheus setup with long term storage capabilities), which should prevent that. It is available as of release/v3.0 · YAOOK / K8s · GitLab.

It can be configured by setting the following in the cluster’s config/config.toml:

# [...]
[k8s-service-layer.prometheus]
# [...]
thanos_store_in_memory_max_size = "XGB"
thanos_store_memory_request = "XGi"
thanos_store_memory_limit = "XGi"
# [...]

Note that the value must be specified with a decimal unit (MB, GB)! Please also note that you should set a meaningful value based on the configured thanos_store_memory_limit. If thanos_store_in_memory_max_size is not explicitly configured, the helm chart default is used, which is not optimal. You should configure both variables, and ideally also set thanos_store_memory_request to the same value as thanos_store_memory_limit.

Persistence

With release/v3.0 · YAOOK / K8s · GitLab, persistence for Thanos components has been reworked. By default, Thanos components use emptyDirs. Depending on the size of the cluster and the amount of metrics flying around, Thanos components may need more disk than the host node can provide, and in that case it makes sense to configure persistence.

If you want to enable persistence for Thanos components, you can do so by configuring a storage class to use, and you can specify the persistent volume size for each component as in the following example.

# [...]
[k8s-service-layer.prometheus]
# [...]
thanos_storage_class = "SOME_STORAGE_CLASS"
thanos_storegateway_size = "XGi"
thanos_compactor_size = "YGi"
thanos_query_size = "ZGi"
# [...]

NGINX Ingress Controller Configuration

The used NGINX ingress controller setup will be explained in more detail soon :)

Note

To enable an ingress controller, k8s-service-layer.ingress.enabled needs to be set to true.
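
A minimal sketch of enabling the ingress controller (all other keys below keep their defaults; the NodePort alternative is just an example):

[k8s-service-layer.ingress]
enabled = true
# optionally expose it via node ports instead of a LoadBalancer service:
#service_type = "NodePort"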

config.toml: KSL - NGINX Ingress Configuration
# --- KUBERNETES SERVICE LAYER : INGRESS ---
# ansible prefix: "k8s_ingress_"
[k8s-service-layer.ingress]
# Enable nginx-ingress management.
enabled = false # •ᴗ•

# Helm chart version
#chart_version = "4.11.1"
# Helm repository URL
#helm_repo_url = "https://kubernetes.github.io/ingress-nginx"
# Helm chart reference
#chart_ref = "ingress-nginx/ingress-nginx"
# Helm release name
#release_name = "ingress"

# Namespace to deploy the ingress in (will be created if it does not exist, but
# never deleted).
#namespace = "k8s-svc-ingress"

# If enabled, choose whether to install or uninstall the ingress. IF SET TO
# FALSE, THE INGRESS CONTROLLER WILL BE DELETED WITHOUT CHECKING FOR
# DISRUPTION.
#install = true

# Scheduling key for the cert manager instance and its resources. Has no
# default.
#scheduling_key =

# Service type for the frontend Kubernetes service.
#service_type = "LoadBalancer"

# Node port for the HTTP endpoint
#nodeport_http = 32080

# Node port for the HTTPS endpoint
#nodeport_https = 32443

# Enable SSL passthrough in the controller
#enable_ssl_passthrough = true

# Replica Count
#replica_count = 1

# Allow snippet annotations
#ingress_allow_snippet_annotations = false

# Recommended memory and CPU requests and limits for the Pod.
# For security reasons, a limit is strongly recommended and
# has a direct impact on the security of the cluster,
# for example to prevent a DoS attack.
#memory_limit = ""
#memory_request = "128Mi"
#cpu_limit = ""
#cpu_request = "100m"

Cert-Manager Configuration

The used Cert-Manager controller setup will be explained in more detail soon :)

Note

To enable cert-manager, k8s-service-layer.cert-manager.enabled needs to be set to true.
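
A minimal sketch of enabling cert-manager, optionally with a cluster-wide Let's Encrypt issuer (the email address is a placeholder):

[k8s-service-layer.cert-manager]
enabled = true
# optional: cluster-wide ACME/Let's Encrypt issuer (requires an ingress)
#letsencrypt_email = "ops@example.com"
#letsencrypt_ingress = "nginx"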

config.toml: KSL - Cert-Manager Configuration
# --- KUBERNETES SERVICE LAYER : CERT MANAGER ---
# ansible prefix: "k8s_cert_manager_"
[k8s-service-layer.cert-manager]
# Enable management of a cert-manager.io instance
enabled = false # •ᴗ•

# Helm chart version
#chart_version = "1.15.2"
# Helm repository URL
#helm_repo_url = "https://charts.jetstack.io"
# Helm chart reference
#chart_ref = "jetstack/cert-manager"
# Helm release name
#release_name = "cert-manager"

# Configure in which namespace the cert-manager is run. The namespace is
# created automatically, but never deleted automatically.
#namespace = "k8s-svc-cert-manager"

# Install or uninstall cert manager. If set to false, the cert-manager will be
# uninstalled WITHOUT CHECK FOR DISRUPTION!
#install = true

# Scheduling key for the cert manager instance and its resources. Has no
# default.
#scheduling_key =

# If given, a *cluster wide* Let's Encrypt issuer with that email address will
# be generated. Requires an ingress to work correctly.
# DO NOT ENABLE THIS IN CUSTOMER CLUSTERS, BECAUSE THEY SHOULD NOT CREATE
# CERTIFICATES UNDER OUR NAME. Customers are supposed to deploy their own
# ACME/Let's Encrypt issuer.
#letsencrypt_email = "..."

# By default, the ACME issuer will let the server choose the certificate chain
# to use for the certificate. This can be used to override it.
#letsencrypt_preferred_chain = "..."

# The ingress class to use for responding to the ACME challenge.
# The default value works for the default k8s-service-layer.ingress
# configuration and may need to be adapted in case a different ingress is to be
# used.
#letsencrypt_ingress = "nginx"

# This variable lets you specify the endpoint of the ACME issuer. A common use case
# is to switch between staging and production. [0]
# [0]: https://letsencrypt.org/docs/staging-environment/
# letsencrypt_server = "https://acme-staging-v02.api.letsencrypt.org/directory"

etcd-backup Configuration

Automated etcd backups can be configured in this section. When enabled, snapshots of the etcd database are created periodically and stored in an S3-compatible object storage. It uses the etcdbackup helm chart from the YAOOK operator helm chart repository. The object storage retains the data for 30 days and then deletes it.

It is disabled by default but can be enabled (and configured) in the following section. The credentials are stored in Vault. By default, they are looked up in the cluster’s kv storage (at yaook/$clustername/kv) under etcdbackup. They must be a JSON object/dict with the keys access_key and secret_key.

Note

To enable etcd-backup, k8s-service-layer.etcd-backup.enabled needs to be set to true.
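
A minimal sketch, relying on the defaults documented below:

[k8s-service-layer.etcd-backup]
enabled = true
#schedule = "21 */12 * * *"    # default cron schedule
#bucket_name = "etcd-backup"   # default bucket name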

config.toml: KSL - Etcd-backup Configuration
# --- KUBERNETES SERVICE LAYER : ETCD-BACKUP ---
# ansible prefix: "etcd_backup_"
[k8s-service-layer.etcd-backup]
enabled = false

# Configure the cron schedule for the etcd backups. If not set, it will be
# set to the default value of `21 */12 * * *`
#schedule = "21 * * * *"

# Name of the s3 bucket to store the backups. It defaults to `etcd-backup`
#bucket_name = "etcd-backup"

# Name of the folder to keep the backup files. It defaults to `etcd-backup`
#file_prefix = "backup"

# Configure the location of the Vault kv2 storage where the credentials can
# be found. This location is the default location used by import.sh and is
# recommended.
#vault_mount_point = "yaook/{{ vault_cluster_name }}/kv"

# Configure the kv2 key under which the credentials are found. This location is
# the default location used by import.sh and is recommended.
#
# The role expects a JSON object with `access_key` and `secret_key` keys,
# containing the corresponding S3 credentials.
#vault_path = "etcdbackup"

# Number of days after which individual items in the bucket are dropped. Enforced by S3 lifecycle rules which
# are also implemented by Ceph's RGW.
#days_of_retention = 30

# etcdbackup chart version to install.
# If this is not specified, the latest version is installed.
#chart_version=""

# Metrics port on which the backup-shifter Pod will provide metrics.
# Please note that the etcd-backup deployment runs in host network mode
# for easier access to the etcd cluster.
#metrics_port = 19100

The following values need to be set:

  • access_key: Identifier for your S3 endpoint

  • secret_key: Credential for your S3 endpoint

  • endpoint_url: URL of your S3 endpoint

  • endpoint_cacrt: Certificate bundle of the endpoint.

etcd-backup configuration template
---
access_key: REPLACEME
secret_key: REPLACEME
endpoint_url: REPLACEME
endpoint_cacrt: |
  -----BEGIN CERTIFICATE-----
  REPLACEME
  -----END CERTIFICATE-----
...
Generate/Figure out etcd-backup configuration values
# Generate access and secret key on OpenStack
openstack ec2 credentials create

# Get certificate bundle of url
openssl s_client -connect ENDPOINT_URL:PORT -showcerts 2>&1 < /dev/null | sed -n '/-----BEGIN/,/-----END/p'

Flux

More details about our FluxCD2 implementation can be found here.

The following configuration options are available:

config.toml: KSL - Flux
# --- KUBERNETES SERVICE LAYER : FLUXCD ---
# ansible prefix: "fluxcd_"
[k8s-service-layer.fluxcd]
# Enable Flux management.
enabled = false

# If enabled, choose whether to install or uninstall fluxcd2. IF SET TO
# FALSE, FLUXCD2 WILL BE DELETED WITHOUT CHECKING FOR
# DISRUPTION.
#install = true

# Helm chart version of fluxcd to be deployed.
#version = "2.10.1"

# Namespace to deploy the flux-system in (will be created if it does not exist, but
# never deleted).
#namespace = "k8s-svc-flux-system"

Node-Scheduling: Labels and Taints Configuration

Note

Nodes get their labels and taints during the Kubernetes cluster initialization and node-join process. Once a node has joined the cluster, its labels and taints will not get updated anymore.

More details about the labels and taints configuration can be found here.

config.toml: Node-Scheduling: Labels and Taints Configuration
# --- NODE SCHEDULING ---
# ansible prefix: /
[node-scheduling]
# Scheduling keys control where services may run. A scheduling key corresponds
# to both a node label and to a taint. In order for a service to run on a node,
# it needs to have that label key. The following defines a prefix for these keys
scheduling_key_prefix = "scheduling.mk8s.cloudandheat.com"

# --- NODE SCHEDULING: LABELS (sent to ansible as k8s_node_labels!) ---
[node-scheduling.labels]
# Labels are assigned to a node during its initialization/join process only!
#
# The following fields are commented out because they make assumptions on the existence
# and naming scheme of nodes. Use them for inspiration :)
#managed-k8s-worker-0 = ["{{ scheduling_key_prefix }}/storage=true"]
#managed-k8s-worker-1 = ["{{ scheduling_key_prefix }}/monitoring=true"]
#managed-k8s-worker-2 = ["{{ scheduling_key_prefix }}/storage=true"]
#managed-k8s-worker-3 = ["{{ scheduling_key_prefix }}/monitoring=true"]
#managed-k8s-worker-4 = ["{{ scheduling_key_prefix }}/storage=true"]
#managed-k8s-worker-5 = ["{{ scheduling_key_prefix }}/monitoring=true"]
#
# --- NODE SCHEDULING: TAINTS (sent to ansible as k8s_node_taints!) ---
[node-scheduling.taints]
# Taints are assigned to a node during its initialization/join process only!
#
# The following fields are commented out because they make assumptions on the existence
# and naming scheme of nodes. Use them for inspiration :)
#managed-k8s-worker-0 = ["{{ scheduling_key_prefix }}/storage=true:NoSchedule"]
#managed-k8s-worker-2 = ["{{ scheduling_key_prefix }}/storage=true:NoSchedule"]
#managed-k8s-worker-4 = ["{{ scheduling_key_prefix }}/storage=true:NoSchedule"]

Wireguard Configuration

You MUST add yourself to the wireguard peers.

You can do so either in the following section of the config file or by using and configuring a git submodule. This submodule would then refer to another repository, holding the wireguard public keys of everybody that should have access to the cluster by default. This is the recommended approach for companies and organizations.
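
To add yourself directly in the config file, append a peer block like the following (both values are placeholders):

[[wireguard.peers]]
pub_key = "<your wireguard public key>"
ident   = "<your name or identifier>"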

config.toml: Wireguard Configuration
# --- WIREGUARD ---
# ansible prefix: "wg_"
[wireguard]
enabled = true

# Set the environment variable "WG_COMPANY_USERS" or this field to 'false' if C&H company members
# should not be rolled out as wireguard peers.
#rollout_company_users = false

# This block defines a WireGuard endpoint/server
# To allow rolling key rotations, multiple endpoints can be added.
# Please make sure that each endpoint has a different id, port and subnet
[[wireguard.endpoints]]
id = 0
enabled = true
port = 7777 # •ᴗ•

# IP address range to use for WireGuard clients. Must be set to a CIDR and must
# not conflict with the terraform.subnet_cidr.
# Should be chosen uniquely for all clusters of a customer at the very least
# so that they can use all of their clusters at the same time without having
# to tear down tunnels.
ip_cidr = "172.30.153.64/26"
ip_gw   = "172.30.153.65/26"

# Same for IPv6
#ipv6_cidr = "fd01::/120"
#ipv6_gw = "fd01::1/120"

# To add WireGuard keys, create blocks like the following
# You can add as many of them as you want. Inventory updater will auto-allocate IP
# addresses from the configured ip_cidr.
#[[wireguard.peers]]
#pub_key = "test1"
#ident = "testkunde1"

IPsec Configuration

More details about the IPsec setup can be found here.

config.toml: IPsec Configuration
# --- IPSEC ---
# ansible prefix: "ipsec_"
[ipsec]
enabled = false

# Flag to enable the test suite.
# Must make sure a remote endpoint, with ipsec enabled, is running and open for connections.
# test_enabled = false

# Must be a list of parent SA proposals to offer to the client.
# Must be explicitly set if ipsec_enabled is set to true.
#proposals =

# Must be a list of ESP proposals to offer to the client.
#esp_proposals = "{{ ipsec_proposals }}"

# List of CIDRs to route to the peer. If not set, only dynamic IP
# assignments will be routed.
#peer_networks = []

# List of CIDRs to offer to the peer.
#local_networks = ["{{ subnet_cidr }}"]

# Pool to source virtual IP addresses from. Those are the IP addresses assigned
# to clients which do not have remote networks. (e.g.: "10.3.0.0/24")
#virtual_subnet_pool = null

# List of addresses to accept as remote. When initiating, the first single IP
# address is used.
#remote_addrs = false

# Private address of remote endpoint.
# only used when test_enabled is True
#remote_private_addrs = ""

Testing

Testing Nodes

The following configuration section can be used to ensure that smoke tests and checks are executed from different nodes. This is disabled by default as it requires some forethought.

config.toml: Testing Nodes Configuration
# --- TESTING: TEST NODES ---
# You can define specific nodes for some
# smoke tests. If you define these, you
# must specify at least two nodes.
[testing]
# Example:
#test-nodes = ["managed-k8s-worker-1", "managed-k8s-worker-3"]

# Enforce rebooting of nodes after every system update
#force_reboot_nodes = false

Custom Configuration

Since YAOOK/K8s allows you to execute custom playbook(s), the following section lets you specify your own custom variables to be used in them.

config.toml: Custom Configuration
# --- CUSTOM ---
# ansible prefix: /
# Specify variables to be used in the custom stage here. See below for examples.

##[custom]
#my_var_foo = "" # produces the variable `my_var_foo = ""`

#[custom.my_custom_section_prefix]
#my_var = "" # produces the var `my_custom_section_prefix_my_var = ""`

Miscellaneous Configuration

This section contains various configuration options for special use cases. You won’t need to enable and adjust any of these under normal circumstances.

Miscellaneous configuration
# --- MISCELLANEOUS ---
# ansible prefix: /
[miscellaneous]
# Install wireguard on all workers (without setting up any server-side stuff)
# so that it can be used from within Pods.
wireguard_on_workers = false

# Configuration details if the cluster will be placed behind an HTTP proxy.
# If the images used to set up the cluster are not pre-configured for the proxy, updating
# package sources, downloading docker images and the initial cluster setup will fail.
# NOTE: These settings are currently only tested for Debian-based operating systems and not for RHEL-based ones!
#cluster_behind_proxy = false
# Set the appropriate HTTP proxy settings for your cluster here. E.g. the address of the proxy or
# internal docker repositories can be added to the no_proxy config entry
# Important note: Settings for the yaook-k8s cluster itself (like the service subnet or the pod subnet)
# will be set automagically and do not have to be set manually here.
#http_proxy = "http://proxy.example.com:8889"
#https_proxy = "http://proxy.example.com:8889"
#no_proxy = "localhost,127.0.0.0/8"

# Name of the internal OpenStack network. This field becomes important if a VM is
# attached to two networks but the controller-manager should only pick up one. If
# you don't understand the purpose of this field, there's a very high chance you
# won't need to touch it/uncomment it.
# Note: This network name isn't fetched automagically (by terraform) on purpose
# because there might be situations where the CCM should not pick the managed network.
#openstack_network_name = "managed-k8s-network"

# Use the helm chart to deploy the CCM and the cinder csi plugin.
# If openstack_connect_use_helm is false the deployment will be done with the help
# of the deprecated manifest code.
# This will be enforced for clusters with Kubernetes >= v1.29 and
# the deprecated manifest code will be dropped along with Kubernetes v1.28
#openstack_connect_use_helm = true

# Value for the kernel parameter `vm.max_map_count` on k8s nodes. Modifications
# might be required depending on the software running on the nodes (e.g., ElasticSearch).
# If you leave the value commented out you're fine and the system's default will be kept.
#vm_max_map_count = 262144

# Custom Docker Configuration
# A list of registry mirrors can be configured as a pull through cache to reduce
# external network traffic and the amount of docker pulls from dockerhub.
#docker_registry_mirrors = [ "https://0.docker-mirror.example.org", "https://1.docker-mirror.example.org" ]

# A list of insecure registries that can be accessed without TLS verification.
#docker_insecure_registries = [ "0.docker-registry.example.org", "1.docker-registry.example.org" ]

# Mirror Configuration for Containerd
# container_mirror_default_host = "install-node"
# container_mirrors = [
#   { name = "docker.io",
#     upstream = "https://registry-1.docker.io/",
#     port = 5000 },
#   { name = "gitlab.cloudandheat.com",
#     upstream = "https://registry.gitlab.cloudandheat.com/",
#     mirrors = ["https://install-node:8000"] },
# ]


# Custom Chrony Configuration
# The ntp servers used by chrony can be customized if it should be necessary or wanted.
# A list of pools and/or servers can be specified.
# Chrony treats both similarly, but it expects that a pool will resolve to several ntp servers.
#custom_chrony_configuration = false
#custom_ntp_pools = [ "0.pool.ntp.example.org", "1.pool.ntp.example.org"]
#custom_ntp_servers = [ "0.server.ntp.example.org", "1.server.ntp.example.org"]

# OpenStack credential checks
# Terrible things will happen when certain tasks are run and OpenStack credentials are not sourced.
# Okay, maybe not so terrible after all, but the templates do not check if certain values exist.
# Hence config files with empty credentials would be written. The LCM will execute a simple sanity check
# to see whether you provided valid credentials, iff you're on OpenStack and the flag below is set
# to true.
#check_openstack_credentials = true

# APT Proxy Configuration
# As a secondary effect, https repositories are not used, since
# those don't work with caching proxies like apt-cacher-ng.
# apt_proxy_url = "..."

# Custom PyPI mirror
# Use this in offline setups or to use a pull-through cache for
# accessing the PyPI.
# If the TLS certificate used by the mirror is not signed by a CA in
# certifi, you can put its cert in `config/pip_mirror_ca.pem` to set
# it explicitly.
# pip_mirror_url = "..."

[nvidia.vgpu]
# vGPU Support
# If virtualize_gpu in the [kubernetes] section is set to true, please also set these variables:
# driver_blob_url should point to an object store or another web server where the vGPU Manager installation file is available.
# driver_blob_url= "..."
# manager_filename should hold the name of the vGPU Manager installation file.
# manager_filename = "..."

Ansible Configuration

The Ansible configuration file can be found in the ansible/ directory. It is used across all stages and layers.

Default Ansible configuration
# Ansible configuration

[defaults]
action_plugins = plugins/action
filter_plugins = plugins/filter
stdout_callback = yaml
bin_ansible_callbacks = True
host_key_checking = True
force_valid_group_names = never

# Give certain events, e.g., escalation prompt (become) more time to avoid premature cancellations
timeout = 60

retry_files_enabled = False # Do not create .retry files

#callback_whitelist = profile_tasks
forks = 42

[inventory]
enable_plugins = host_list,script,yaml,ini,openstack

# Fail, not warn if any inventory source could not be parsed
unparsed_is_failed = true

[ssh_connection]
# https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-retry-on-connection-failure
retries=10
ssh_args=-o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=../../etc/ssh_known_hosts -o ControlMaster=auto -o ControlPersist=60s
pipelining=true
transfer_method=piped

[connection]
# https://docs.ansible.com/ansible/latest/reference_appendices/config.html#ansible-pipelining
pipelining=true