vSphere with Tanzu and NSX Advanced Load Balancer - avi-secret not found

I recently stumbled across an interesting behavior during a functional test of a VMware vSphere with Tanzu container runtime platform combined with the VMware NSX Advanced Load Balancer. Personally, I always find functional tests exciting, because you learn first-hand how your own systems behave under adverse conditions and at what point they stop working.

The platform was based on vSphere with Tanzu (vSphere 8.0 Update 2), NSX Data Center (4.1.2.1), and NSX (Avi) Advanced Load Balancer (22.1.5). All components were deployed in an integrated setup. This means that all interfaces are driven automatically, which offers a true cloud-like experience for developers and platform administrators.

In this setup, NSX Data Center primarily handles network and security automation (for example, the Antrea CNI integration in NSX), while NSX Advanced Load Balancer handles the automated provisioning of L4 load balancer services (this can be extended with L7 ingress integration). You get this degree of automation literally "out of the box" with the initial deployment of the vSphere with Tanzu container platform.

The problem

When I powered off all three NSX ALB Controllers to simulate a disaster scenario, everything behaved as expected. The surprise only surfaced after the NSX ALB control plane had been successfully restored.

When I tried to roll out a deployment with a Service of type LoadBalancer, I noticed that I was not getting an external IP address from the NSX ALB Controller:

kubectl get svc -n ft-nsx-alb

NAMESPACE       NAME               TYPE         CLUSTER-IP  EXTERNAL-IP  PORT(S)  AGE
ft-nsx-alb      service/kubernetes LoadBalancer 10.96.0.10  <pending>    443/TCP  1d
ft-nsx-alb      service/supervisor LoadBalancer 10.96.0.134 <pending>    80/TCP   1d
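On a namespace with many services it can help to filter for exactly this symptom. The following is a minimal sketch, not part of the original test: the helper name pending_lb_services is made up for illustration, and it assumes the default table output of kubectl get svc, with a header row and whitespace-separated columns.

```shell
# Print the names of LoadBalancer services whose EXTERNAL-IP is still <pending>.
# Reads `kubectl get svc` table output on stdin. Column positions are looked up
# from the header row, so an additional NAMESPACE column does not matter.
# (pending_lb_services is a hypothetical helper name.)
pending_lb_services() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $col["TYPE"] == "LoadBalancer" && $col["EXTERNAL-IP"] == "<pending>" {
      print $col["NAME"]
    }
  '
}
```

It would be used as: kubectl get svc -n ft-nsx-alb | pending_lb_services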

I could not find any hints about this behavior in the NSX ALB control plane, because nothing at all was logged there. That in itself made me suspicious, so I checked the responsible service on the Supervisor Cluster.

The Avi Kubernetes Operator (AKO) service is responsible for the communication between the Supervisor Cluster and the NSX ALB Controller cluster, and exactly this service was not in a proper Running state:

kubectl get pods -A | grep -v Running

NAMESPACE          NAME                                                      READY  STATUS           RESTARTS         AGE
vmware-system-ako  vmware-system-ako-ako-controller-manager-65d78d698d-c944k 1/2    CrashLoopBackOff 1465 (2m24s ago) 49d

I then checked all resources in the vmware-system-ako namespace:

kubectl -n vmware-system-ako get all

NAME                                                          READY STATUS           RESTARTS         AGE
pod/vmware-system-ako-ako-controller-manager-65d78d698d-c944k 1/2   CrashLoopBackOff 1465 (4m17s ago) 49d

NAME                                                     READY UP-TO-DATE AVAILABLE AGE
deployment.apps/vmware-system-ako-ako-controller-manager 0/1   1          0         49d

NAME                                                                DESIRED CURRENT READY AGE
replicaset.apps/vmware-system-ako-ako-controller-manager-65d78d698d 1       1       0     49d

A kubectl describe on the failing pod yielded the following clue:

kubectl -n vmware-system-ako describe pod vmware-system-ako-ako-controller-manager-65d78d698d-c944k
[...]
Events:
Type    Reason  Age                       From    Message
----    ------  ----                      ----    -------
Warning BackOff 3m14s (x33529 over 5d10h) kubelet Back-off restarting failed container manager in pod vmware-system-ako-ako-controller-manager-65d78d698d-c944k_vmware-system-ako(e629cb54-f4fe-409f-8ac4-f2b0ac58b506)

The corresponding logs then revealed the actual problem:

kubectl -n vmware-system-ako logs vmware-system-ako-ako-controller-manager-65d78d698d-c944k infra | more

2024-02-02T09:16:01.591Z INFO infra-main/main.go:49 AKO-Infra is running with version: ob-21883866-460f000-7e6ff10
2024-02-02T09:16:01.591Z INFO infra-main/main.go:55 We are running inside kubernetes cluster. Won't use kubeconfig files.
2024-02-02T09:16:01.594Z INFO infra-main/main.go:76 Successfully created kube client for ako-infra
2024-02-02T09:16:01.594Z INFO utils/utils.go:173 Initializing configmap informer in vmware-system-ako
2024-02-02T09:16:01.675Z INFO lib/dynamic_client.go:134 Skipped initializing dynamic informers for cniPlugin
2024-02-02T09:16:01.682Z INFO ingestion/vcf_k8s_controller.go:346 Got data from ConfigMap {"advancedL4":"true","cloudName":"/infra/sites/default/enforcement-points/default/transport-zones/overlay-tz","clusterID":"domain-c[...]","controllerIP":"[...]","credentialsSecretName":"avi-secret","credentialsSecretNamespace":"vmware-system-ako","logLevel":"WARN","serverURL":"https://[...]"}
2024-02-02T09:16:01.682Z INFO ingestion/vcf_k8s_controller.go:427 TransportZone to use for AKO is set to /infra/sites/default/enforcement-points/default/transport-zones/overlay-tz
E0202 09:16:20.192221 1 avisession.go:668] Client error for URI: login. Error: Post "https://[...]/login": dial tcp [...]:443: connect: no route to host
E0202 09:16:20.193030 1 avisession.go:714] CheckControllerStatus is disabled for this session, not going to retry.
E0202 09:16:20.193046 1 avisession.go:716] Failed to invoke API. Error: Post "https://[...]/login": dial tcp [...]:443: connect: no route to host
E0202 09:16:20.193123 1 avisession.go:383] response error: Rest request error, returning to caller: Post " https://[...]/login": dial tcp [...]:443: connect: no route to host
2024-02-02T09:16:20.193Z ERROR ingestion/vcf_k8s_controller.go:381 Failed to connect to AVI controller using secret provided by NCP, the secret would be deleted, err: Rest request error, returning to caller: Post "https://[...]/login": dial tcp [...]:443: connect: no route to host
2024-02-02T09:16:20.201Z INFO ingestion/vcf_k8s_controller.go:210 ConfigMap Add
2024-02-02T09:16:20.204Z INFO ingestion/vcf_k8s_controller.go:346 Got data from ConfigMap {"advancedL4":"true","cloudName":"/infra/sites/default/enforcement-points/default/transport-zones/overlay-tz","clusterID":"domain-c[...]","controllerIP":"[...]","credentialsSecretName":"avi-secret","credentialsSecretNamespace":"vmware-system-ako","logLevel":"WARN","serverURL":"https://[...]"}
2024-02-02T09:16:20.206Z WARN ingestion/vcf_k8s_controller.go:361 Failed to get Secret, got err: secrets "avi-secret" not found
2024-02-02T09:16:20.206Z INFO ingestion/vcf_k8s_controller.go:210 ConfigMap Add
2024-02-02T09:16:20.208Z INFO ingestion/vcf_k8s_controller.go:346 Got data from ConfigMap
[...]

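The line "Failed to connect to AVI controller using secret provided by NCP, the secret would be deleted" was the decisive clue. Judging by the log, AKO could not reach the controller during the outage and discarded its credentials, and the warning secrets "avi-secret" not found right afterwards shows that the secret was indeed gone.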
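To check for this failure pattern without reading the whole log, something like the following sketch could be used. The function name ako_secret_lost is hypothetical, and the two search strings are copied from the log excerpt above. Other AKO versions may word these messages differently.

```shell
# Return exit status 0 if the log on stdin shows that AKO dropped its
# controller credentials: the "secret would be deleted" error, followed
# somewhere later by the "avi-secret not found" warning.
# (ako_secret_lost is a hypothetical helper name.)
ako_secret_lost() {
  awk '
    /the secret would be deleted/ { deleted = 1 }
    deleted && /secrets "avi-secret" not found/ { missing = 1 }
    END { exit !(deleted && missing) }
  '
}
```

Fed with the output of the kubectl logs command shown earlier, for example kubectl -n vmware-system-ako logs vmware-system-ako-ako-controller-manager-65d78d698d-c944k infra | ako_secret_lost && echo "avi-secret was deleted", it only prints the message if both lines are present.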

The solution

So the NSX Container Plugin (NCP) service appeared to be responsible for generating the missing secret, which is why I restarted it to trigger the secret regeneration process:

kubectl -n vmware-system-nsx get pods

NAME                       READY   STATUS    RESTARTS   AGE
nsx-ncp-5f4f7d6597-7rstp   2/2     Running   0          49d
nsx-ncp-5f4f7d6597-ppl7w   2/2     Running   0          49d

kubectl -n vmware-system-nsx delete pod nsx-ncp-5f4f7d6597-7rstp
kubectl -n vmware-system-nsx delete pod nsx-ncp-5f4f7d6597-ppl7w
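The pod name suffixes differ per environment, so the two delete commands can be generalized. This is only a sketch: the function name restart_ncp_pods is invented here, and the nsx-ncp- name prefix is an assumption taken from the pod listing above.

```shell
# Delete every pod in vmware-system-nsx whose name starts with "nsx-ncp-",
# so that their controller recreates them.
# (restart_ncp_pods is a hypothetical helper; the name prefix is an
# assumption based on the pod names shown above.)
restart_ncp_pods() {
  ns="vmware-system-nsx"
  # `-o name` prints one "pod/<name>" entry per line.
  kubectl -n "$ns" get pods -o name | grep '^pod/nsx-ncp-' | while read -r pod; do
    kubectl -n "$ns" delete "$pod"
  done
}
```

Calling restart_ncp_pods then removes the NCP pods one after another, just like the manual commands.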

Once the NCP pods had been recreated, an avi-init-secret appeared, which in turn triggered a restart of the AKO service. Shortly afterwards, the hoped-for avi-secret object showed up as well:

kubectl -n vmware-system-ako get secrets

NAME            TYPE   DATA AGE
avi-init-secret Opaque 3    26h
avi-secret      Opaque 3    26h
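Since the secret only shows up after a short delay, a small polling loop can save repeated manual checks. The following is a sketch under assumptions: the function name wait_for_secret and the default values for retries and delay are arbitrary choices for illustration, not values from the test.

```shell
# Poll until a secret exists, or give up after a number of attempts.
# Usage: wait_for_secret <namespace> <secret-name> [tries] [delay-seconds]
# (wait_for_secret and its defaults are hypothetical.)
wait_for_secret() {
  ns="$1"; name="$2"; tries="${3:-30}"; delay="${4:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # `kubectl get secret` exits non-zero while the secret is missing.
    if kubectl -n "$ns" get secret "$name" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For this case it could be called as wait_for_secret vmware-system-ako avi-secret, which returns successfully as soon as the secret has been regenerated.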

A quick check of my deployment showed that LoadBalancer VIP addresses were once again being assigned by the NSX ALB Controller:

kubectl get svc -n ft-nsx-alb

NAMESPACE  NAME               TYPE         CLUSTER-IP  EXTERNAL-IP  PORT(S) AGE
ft-nsx-alb service/kubernetes LoadBalancer 10.96.0.10  172.16.22.31 443/TCP 1d
ft-nsx-alb service/supervisor LoadBalancer 10.96.0.134 172.16.22.32 80/TCP  1d

TL;DR

After a total outage of the NSX ALB Controller cluster, restarting the NCP pods may work wonders.
