Bug 1607349 (Closed): Opened 5 years ago, closed 4 years ago

ingestion-sink kubernetes cluster metrics server broken

Categories

(Data Platform and Tools Graveyard :: Operations, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Unassigned)

Details

Autoscaling via HPA and CPU metrics is broken on the production cluster, due to what most closely resembles this upstream issue: https://github.com/kubernetes-sigs/metrics-server/issues/269.
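
For reference, the broken autoscaler targets CPU utilization. A minimal sketch of how such an HPA is typically created and inspected (the deployment name, thresholds, and replica counts here are illustrative, not the production values):

    # Create an HPA scaling on CPU utilization (names/values are examples only)
    kubectl autoscale deployment ingestion-sink --cpu-percent=80 --min=5 --max=50
    # With the metrics server down, current utilization is reported as <unknown>
    kubectl get hpa
    kubectl describe hpa ingestion-sink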

Investigating the kube-system namespace, I see the metrics-server pods are in CrashLoopBackOff due to a panic: "Get https://10.8.8.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.8.8.1:443: i/o timeout". I haven't investigated further.
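
For anyone reproducing the diagnosis, something along these lines surfaces the crash loop and the panic above (pod names are placeholders and will differ):

    kubectl -n kube-system get pods | grep metrics-server
    # Logs from the previous (crashed) container show the i/o timeout panic;
    # add -c <container> if the pod runs more than one container
    kubectl -n kube-system logs <metrics-server-pod> --previous
    kubectl -n kube-system describe pod <metrics-server-pod>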

Given the impending increase in traffic from bug #1583649, we should sort this out, but in the meantime, manual scaling and maintaining peak/backlog capacity via minReplicas has worked and is still significantly cheaper than the Dataflow-based version of yesteryear. This also isn't a client-facing application and could easily be moved to an entirely new cluster if necessary.
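
As a sketch of that workaround (the deployment/HPA name and replica counts are placeholders, not the actual production values):

    # Hold capacity at peak by raising the HPA floor ...
    kubectl patch hpa ingestion-sink --patch '{"spec": {"minReplicas": 30}}'
    # ... or scale the deployment directly
    kubectl scale deployment ingestion-sink --replicas=30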

I would normally be inclined to try a cluster upgrade to resolve this, but we're on 1.13.12-gke.16, which, from what I can tell, is the latest stable channel release. If we run into issues with load this week, it will be easier to simply swap out the current cluster for a new one, staying on the stable release track. I may open a support case with Google to track the fix for this, and if they recommend upgrading to the regular or rapid channels (i.e. 1.1[45] releases) I will do so.
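
If we do go the upgrade route, it would presumably look something like the following (cluster name, zone, node pool, and target version are placeholders; exact behavior depends on the gcloud version):

    # Upgrade the control plane first, then the node pools
    gcloud container clusters upgrade <cluster-name> --zone <zone> --master --cluster-version <1.14.x-gke.y>
    gcloud container clusters upgrade <cluster-name> --zone <zone> --node-pool <pool-name>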

This has been fixed, probably by an automatic minor version update (1.13.12-gke.25).

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard