Bug 1607349 (Closed): Opened 5 years ago, closed 4 years ago

ingestion-sink kubernetes cluster metrics server broken

Categories

(Data Platform and Tools Graveyard :: Operations, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Unassigned)

Details

Autoscaling via HPA and CPU metrics is broken on the production cluster, due to what most closely resembles this upstream issue: https://github.com/kubernetes-sigs/metrics-server/issues/269.
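
For reference, the broken autoscaler targets CPU utilization. A minimal sketch of how such an HPA is typically created and inspected (the deployment name, thresholds, and replica counts here are illustrative, not the production values):

    # Create an HPA scaling on CPU utilization (names/values are examples only)
    kubectl autoscale deployment ingestion-sink --cpu-percent=80 --min=5 --max=50
    # With the metrics server down, current utilization is reported as <unknown>
    kubectl get hpa
    kubectl describe hpa ingestion-sink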

Investigating the kube-system namespace, I see the metrics-server pods are in CrashLoopBackOff due to a panic: "Get https://10.8.8.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.8.8.1:443: i/o timeout". I haven't investigated further.
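
For anyone reproducing the diagnosis, something along these lines surfaces the crash loop and the panic above (pod names are placeholders and will differ):

    kubectl -n kube-system get pods | grep metrics-server
    # Logs from the previous (crashed) container show the i/o timeout panic;
    # add -c <container> if the pod runs more than one container
    kubectl -n kube-system logs <metrics-server-pod> --previous
    kubectl -n kube-system describe pod <metrics-server-pod>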

Given the impending increase in traffic from bug #1583649, we should sort this out, but in the meantime, manual scaling and maintaining peak/backlog capacity via minReplicas has worked and is still significantly cheaper than the Dataflow-based version of yesteryear. This also isn't a client-facing application and could easily be moved to an entirely new cluster if necessary.
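
As a sketch of that workaround (the deployment/HPA name and replica counts are placeholders, not the actual production values):

    # Hold capacity at peak by raising the HPA floor ...
    kubectl patch hpa ingestion-sink --patch '{"spec": {"minReplicas": 30}}'
    # ... or scale the deployment directly
    kubectl scale deployment ingestion-sink --replicas=30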

I would normally be inclined to try a cluster upgrade to resolve this, but we're on 1.13.12-gke.16, which, from what I can tell, is the latest stable channel release. If we run into issues with load this week, it will be easier to simply swap out the current cluster for a new one, staying on the stable release track. I may open a support case with Google to track the fix for this, and if they recommend upgrading to the regular or rapid channels (i.e. 1.1[45] releases) I will do so.
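
If we do go the upgrade route, it would presumably look something like the following (cluster name, zone, node pool, and target version are placeholders; exact behavior depends on the gcloud version):

    # Upgrade the control plane first, then the node pools
    gcloud container clusters upgrade <cluster-name> --zone <zone> --master --cluster-version <1.14.x-gke.y>
    gcloud container clusters upgrade <cluster-name> --zone <zone> --node-pool <pool-name>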

This has been fixed, probably by an automatic minor version update (1.13.12-gke.25).

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard