Prometheus Stack

yaook/k8s uses the kube-prometheus-stack helm chart with an additional abstraction layer. To figure out which version is in use, you can run:

$ helm ls -n monitoring
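
The output will look roughly like this (further columns and timestamps are elided; the release name and versions depend on your setup):

NAME              NAMESPACE   REVISION  STATUS    CHART
prometheus-stack  monitoring  1         deployed  kube-prometheus-stack-59.1.0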

Take a look at the values.yaml files of the individual helm charts to see what you can (or can't) potentially modify. Note that not all values might be exposed in the config.toml. The data path is config.toml -> inventory/prometheus.yaml -> monitoring_v2 -> templates/prometheus_stack.yaml.j2. If a field that you need isn't listed in prometheus_stack.yaml or is statically configured there, please open an issue or, even better, submit a merge request :). yaook/k8s' developer guide can be found here.

Yaook/k8s also allows upgrading the kube-prometheus-stack. You can adjust prometheus_stack_version in the config.toml:

...
[monitoring]
...
prometheus_stack_version = "59.1.0"
...

If the variable isn't set, the default will be used, which can be found as monitoring_prometheus_stack_version via the following call:

$ cat managed-k8s/k8s-service-layer/roles/monitoring_v2/defaults/main.yaml
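
In that file, the default looks roughly like this (the actual version depends on your checked-out release):

monitoring_prometheus_stack_version: "59.1.0"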

This file also lists the currently supported versions. As each upgrade requires further steps, e.g. updating the CRDs, you cannot simply jump ahead several versions.

The upgrade routine can be triggered by running the following:

$ MANAGED_K8S_RELEASE_THE_KRAKEN=true AFLAGS="--diff -t monitoring" bash managed-k8s/actions/apply-k8s-supplements.sh

Prometheus

By default, we deploy exactly one Prometheus server if monitoring is enabled. This instance doesn't have persistent storage unless prometheus_persistent_storage_class is set in the config.toml. Prometheus scrapes all PodMonitors and ServiceMonitors it can find across namespaces. This behavior can be altered by setting [k8s-service-layer.prometheus.common_labels] in the config.toml so that only resources matching a certain label set are scraped.
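
A minimal sketch of such a label filter (the label key and value are illustrative):

[k8s-service-layer.prometheus.common_labels]
managed-by = "yaook"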

Prometheus can be integrated into existing Prometheus-based monitoring setups via federation, a pull-based approach for gathering a subset of its metrics. The alternative is a push-based approach called remote write, which lets your Prometheus actively send metrics to an endpoint (a remote write receiver or target) and is configured via [[k8s-service-layer.prometheus.remote_writes]] in the config.toml.
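
A minimal sketch (the URL is a placeholder; further fields follow the Prometheus remote write specification):

[[k8s-service-layer.prometheus.remote_writes]]
url = "https://metrics.example.com/api/v1/write"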

Grafana

The LCM uses the Grafana helm chart in the version that ships with the current kube-prometheus-stack helm chart version. Grafana is not enabled by default; you can enable it in the Prometheus configuration.
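
A sketch, assuming the toggle is named use_grafana (verify the option name against your config.toml template):

[k8s-service-layer.prometheus]
use_grafana = true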

Custom dashboards and datasources

By default, Grafana is rolled out with two sidecars (grafana-sc-dashboard, grafana-sc-datasources) that are configured to pick up additional dashboards/datasources in any namespace. These extra resources can reside either in k8s Secrets or in ConfigMaps and have to carry the label grafana_dashboard. To place one or more dashboards in a custom, logical folder, add the annotation customer-dashboards=<Folder name>.

Note

After 30 seconds of research, the author came to the conclusion that one cannot nest logical dashboard folders in Grafana. If <Folder name> consists of a path of multiple folders, only the last one is picked up.

As an example, let's add a dashboard for the NGINX Ingress Controller and have it displayed under the logical nginx folder in Grafana. The backing ConfigMap for the dashboard will be stored in the namespace fancy.

  1. Download the dashboard as JSON to your workstation. We will call that file nginx_db.json.

  2. Create the ConfigMap: kubectl create configmap nginx-db -n fancy --from-file=nginx_db.json

  3. Add a label so that the sidecar will pick up the dashboard: kubectl label cm -n fancy nginx-db grafana_dashboard=1 (the value of the key/value label pair does not matter).

  4. Annotate the ConfigMap with the proper path: kubectl annotate cm -n fancy nginx-db customer-dashboards=nginx.
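
Steps 2 to 4 combined, expressed as a declarative manifest, would look roughly like this (the dashboard JSON is abbreviated):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-db
  namespace: fancy
  labels:
    grafana_dashboard: "1"
  annotations:
    customer-dashboards: nginx
data:
  nginx_db.json: |
    { "title": "NGINX Ingress Controller", ... }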

The sidecar should pick up the change eventually. If it doesn't, or you're impatient, you can restart Grafana by destroying its pod.
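
For example (the label selector assumes the Grafana chart's default labels):

$ kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana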

Alertmanager

The monitoring stack comes with an Alertmanager instance that is available to the end user for their convenience. One can also create their own Alertmanager resource, which is then translated into a StatefulSet by the prometheus-operator. The Alertmanager configuration should be kept separate and can be injected by creating an AlertmanagerConfig resource within the monitoring namespace. Other namespaces are not considered without further configuration. For further information, please refer to the corresponding documentation. A silly example:

kind: AlertmanagerConfig
apiVersion: monitoring.coreos.com/v1alpha1
metadata:
  name: custom-amc
  namespace: monitoring
spec:
  receivers:
    - name: your mom
      emailConfigs:
        - hello: localhost
          requireTLS: true
          to: a@b.de
          smarthost: a.com:25
          from: c@d.de
      pagerdutyConfigs:
        - url: https://events.pagerduty.com/v2/enqueue
          routingKey:
            key: a
            name: blub
            optional: true
  route:
    receiver: your mom
    groupBy:
    - job
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    continue: false
    routes:
    - receiver: your mom
      match:
        alertname: Watchdog
      continue: false

Note

The author hasn't worked much with Alertmanager(Config) in the past and only ensured that the manifests are read correctly. The test consisted of looking at the output of:

$ kubectl exec -ti -n monitoring alertmanager-prometheus-stack-kube-prom-alertmanager-0 -- amtool --alertmanager.url=http://127.0.0.1:9093 config

Note

You will probably mess up the AlertmanagerConfig manifest in one way or another. The admission controller caught some typos; on other occasions, a look into the logs of the prometheus-operator pod was necessary. Eventually, the AM failed to come up because some further fields were missing, which could be figured out via the logs of the AM pod.

Thanos

Thanos is deployed outside of the kube-prometheus-stack helm chart. By default, it writes its metrics into a Swift object storage container that resides in the same OpenStack project.

We’re deploying the Bitnami Thanos helm chart with adjusted values by default. Please refer to its documentation for further details.

Thanos can be enabled and configured in the Prometheus configuration.
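
A sketch, assuming the toggle is named use_thanos (verify the option name against your config.toml template):

[k8s-service-layer.prometheus]
use_thanos = true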

Object Storage Configuration

You can either choose the automated Thanos object storage management (the default), in which case the LCM takes care of creating a bucket inside your OpenStack project, or you can configure a custom bucket.

Automated bucket management

Warning

The automated bucket management can only be used when your cluster is created on top of OpenStack and a valid OpenStack RC file is sourced.

This method is enabled by default. It lets Terraform create an object storage container inside your OpenStack project and automatically configures Thanos to use that container as its primary storage.

Custom bucket management

The custom bucket management can be enabled by setting k8s-service-layer.prometheus.manage_thanos_bucket to false in your config/config.toml.
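
In config.toml terms:

[k8s-service-layer.prometheus]
manage_thanos_bucket = false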

You must supply a valid configuration for a supported Thanos client.
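
A minimal sketch of such a config/thanos.yaml, assuming an S3-compatible backend (bucket, endpoint and credentials are placeholders; see the Thanos object storage documentation for all supported clients):

type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.example.com"
  access_key: "<access key>"
  secret_key: "<secret key>"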

This configuration must be stored in your cluster's key-value secrets engine under kv/data/thanos-config. Inserting a Thanos client config into vault can be automated by storing the configuration at config/thanos.yaml (or at another location, specified in your config/config.toml under k8s-service-layer.prometheus.thanos_objectstorage_config_file) and then triggering the vault update script:

$ ./managed-k8s/tools/vault/update.sh

Alternatively, you can also manually insert your configuration into vault.

Prometheus Adapter (metrics server)

Background and motivation

The prometheus-adapter provides the metrics API by making use of existing Prometheus metrics. In the case of the default resources (memory and CPU per pod/node), Prometheus fetches these metrics from the kubelet which, in turn, reads these values from cAdvisor, which in turn gets them from the cgroups on the individual node. metrics-server, in contrast, gets those metrics directly from kubelet/cAdvisor.

A common use case for the metrics API is horizontal (HPA) and vertical pod autoscaling (VPA). An advantage of the prometheus-adapter over metrics-server is that one can define custom metrics for HPA and VPA. kubectl top nodes and kubectl top pods also need a working metrics API :)
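
To verify that the metrics API is being served, you can query it directly:

$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes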

As stated above, the values of the metrics API are derived from the cgroup stats on the node. kubelet creates a resource tree with the following layers:

  • QoS (Guaranteed, Burstable, BestEffort)

  • Pod

  • Container

A sample tree:

root@managed-k8s-worker-1:/sys/fs/cgroup/unified/kubepods.slice# tree -d
.
├── kubepods-besteffort.slice
│   ├── kubepods-besteffort-pod1793a176_009e_4b22_9d89_6d71f914f6f7.slice
│   │   ├── docker-2dbb7f0327a157479fda466398aa87664069610232b293f5817b2712b9ff5719.scope
│   │   └── docker-51fdb8e253c7873a04db7219fb602694ad3977957a8ee354d362ce25cd29d3c8.scope
│   ├── kubepods-besteffort-pod2c9a23a5_effa_4130_aa19_5efac4829224.slice
│   │   ├── docker-817cb87c8d31136e3ef7d6274393127184b4781367bf3b9b62e572b796ebecd4.scope
│   │   └── docker-bb2c7f5087e52182667e63fc548fbb15d7981fa7322b58b59c529bbca71a8361.scope
│   ├── kubepods-besteffort-pod6aaaaf32_9f4e_46fe_841f_13bee2413625.slice
│   │   ├── docker-3a58ec66ee269a25dc14d580fd9ea4766ff6fcb269b7be39bdc08abd9c0a87f4.scope
│   │   └── docker-3ad62f52496d25dd5ef3f8b9b462776bbd7023ed1c37c56b19429b8c7b926ad6.scope
│   ├── kubepods-besteffort-podb2481109_b708_49f3_b2bb_52b0fb470fe9.slice
│   │   ├── docker-601173595b1d0d6b08b7965e28e04c83a64900e2642d3c48ff0f972019f9f556.scope
│   │   ├── docker-9edfeb7ab8ae757ffb90e847ffa70b2281e89367eca3f34d89065225e61e47ba.scope
│   │   └── docker-de4b153c2c49bb04c0b45f534694fd143d70f25b18503626d67a4fd73c016ea5.scope
│   ├── kubepods-besteffort-podb393dd5c_0c80_488b_bed1_c548aea803a3.slice
│   │   ├── docker-7f22a8b72620cd7b6d740de9957f10eed127063b64745df8b45b432d299d04f0.scope
│   │   └── docker-e3a42aca173771b1089d97ba8664d6fd04e9f5ed736a1167c75b3f71025315e9.scope
│   ├── kubepods-besteffort-podcd213409_756c_4d17_9b7b_9a9b023d8533.slice
│   │   ├── docker-ab7a790f1afbd39ffaef0ce1bdb0dbbe7b9525ad785190e498b9a68754f96c86.scope
│   │   └── docker-eac640f0373dc37d45e6d36375656db04d2b815e605d9c8b1c8a2652e1a66e65.scope
│   └── kubepods-besteffort-podeba9d649_010c_4122_bcee_27255d8ad69c.slice
│       ├── docker-087baf1b34e7a703d81cbe8a988d2eb9e0837f86b798066789436443cfea090e.scope
│       └── docker-68c2d4b2f374611a1e550b7f3b31dba3039d5c98b5d931fb87638cf0114bd9a8.scope
└── kubepods-burstable.slice
    ├── kubepods-burstable-pod4bbb178b_3396_49b7_90e7_6264b7392aa2.slice
    │   ├── docker-5f4521bde3825fa1b35262ed377c95ce47cdd322e2f017a9a8f1083e05a8d39b.scope
    │   ├── docker-6b6d47a682fc95ca0d7c37cf83f391c3d0f8bacda88eae22634b4c5dff043dbf.scope
    │   └── docker-cd817ca433d294ae3701c61dab312ab5715525cf3cd8c74fc5f1471bbcde59c3.scope
    ├── kubepods-burstable-pod793e426b_16c6_4b86_a0b8_e4b4ed877c15.slice
    │   ├── docker-7158fab7cdc1af3bc68599e8fa0cfcc637840a8a9fea65a94cc467e7836310ea.scope
    │   └── docker-92a0b9788b01f2ca82792d93bbdfb90da419097c61493dcd6587fafacace1d91.scope
    └── kubepods-burstable-pod81795b29_e574_4d5e_866c_ad146e86bdbb.slice
        ├── docker-51c9c0b1dcf6153572661b8bcb9d99ea4a4934db35e074fb88297b4b36002ace.scope
        └── docker-dee4dea98d5e8e6282fe64607d3c91e3ee071d2fab2570d44eedac649702daf2.scope

34 directories

Note: /sys/fs/cgroup/unified/ is the mount point of cgroups v2 on an Ubuntu 20.04 node. It seems cgroups v1 is still the default there, so information on memory usage, for example, has to be fetched from the corresponding memory controllers.

These values are translated into metrics such as the following:

container_memory_working_set_bytes{container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod00376bc9_6679_4f56_a9dd_a10aad6ff2d4.slice/docker-5b9efdb04ff83031b437fde548968ef9b92c3febccb03946ec421b11d12893dd.scope", image="k8s.gcr.io/pause:3.2", instance="172.30.181.39:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_prometheus-stack-prometheus-node-exporter-z8qj7_monitoring_00376bc9-6679-4f56-a9dd-a10aad6ff2d4_0", namespace="monitoring", node="managed-k8s-worker-0", pod="prometheus-stack-prometheus-node-exporter-z8qj7", service="prometheus-stack-kube-prom-kubelet"}
    536576
container_memory_working_set_bytes{container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod01ed3a39_a5f0_4465_a33f_63645893aa1e.slice/docker-469c599d81d233dd2a1d6e1ea252ca1535df26e4c57f04451c066bf1589cc129.scope", image="k8s.gcr.io/pause:3.2", instance="172.30.181.39:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_nvidia-device-plugin-daemonset-4gbd2_kube-system_01ed3a39-a5f0-4465-a33f-63645893aa1e_0", namespace="kube-system", node="managed-k8s-worker-0", pod="nvidia-device-plugin-daemonset-4gbd2", service="prometheus-stack-kube-prom-kubelet"}
    737280
container_memory_working_set_bytes{container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod12321fde_373b_4347_ad3e_f31b4f587d35.slice/docker-2bd3158e1dc1d1911dcb294e62463c6da24517287c77eb132cf22bafe1710bc4.scope", image="k8s.gcr.io/pause:3.2", instance="172.30.181.180:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_thanos-sample-storegateway-0_monitoring_12321fde-373b-4347-ad3e-f31b4f587d35_0", namespace="monitoring", node="managed-k8s-worker-2", pod="thanos-sample-storegateway-0", service="prometheus-stack-kube-prom-kubelet"}