ingestion-sink kubernetes cluster metrics server broken
Categories
(Data Platform and Tools Graveyard :: Operations, defect)
Tracking
(Not tracked)
People
(Reporter: whd, Unassigned)
Details
Autoscaling via HPA and CPU metrics is broken on the production cluster due to an issue that most closely resembles this upstream issue: https://github.com/kubernetes-sigs/metrics-server/issues/269.
Investigating the kube-system namespace, I see the metrics-server pods are in CrashLoopBackOff due to:

panic: Get https://10.8.8.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.8.8.1:443: i/o timeout

I haven't investigated further.
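For reference, a minimal sketch of the kind of inspection that surfaces the error above. The label selector `k8s-app=metrics-server` is the usual one on GKE, but it's an assumption here, not confirmed from this cluster:

```shell
# List the metrics-server pods and their restart state (label is an assumption)
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Pull logs from the previous (crashed) container to see the panic
kubectl -n kube-system logs -l k8s-app=metrics-server --previous

# Check whether the metrics APIService is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
```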
Given the impending increase in traffic from bug 1583649, we should sort this out. In the meantime, manual scaling and maintaining peak/backlog capacity via minReplicas has worked, and is still significantly cheaper than the Dataflow-based version of yesteryear. This also isn't a client-facing application, and it could easily be moved to an entirely new cluster if necessary.
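The interim workaround amounts to pinning capacity rather than relying on the broken HPA. A hedged sketch (the HPA/deployment name `ingestion-sink` and the replica count are placeholders, not the actual production values):

```shell
# Raise minReplicas on the HPA so capacity stays at peak even without metrics
# (resource name and count are assumptions for illustration)
kubectl patch hpa ingestion-sink --patch '{"spec":{"minReplicas":12}}'

# Or bypass the HPA entirely and scale the deployment by hand
kubectl scale deployment ingestion-sink --replicas=12
```

Raising minReplicas is the safer of the two while the HPA object still exists, since a manually scaled deployment can be scaled back down by the HPA if metrics ever start flowing again.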
I would normally be inclined to try a cluster upgrade to resolve this, but we're on 1.13.12-gke.16, which, as far as I can tell, is the latest stable-channel release. If we run into issues with load this week, it will be easier to simply swap out the current cluster for a new one while staying on the stable release track. I may open a support case with Google to track the fix for this, and if they recommend upgrading to the regular or rapid channels (i.e. 1.1[45] releases) I will do so.
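If an upgrade does become necessary, the version check and upgrade would look roughly like this (cluster name and zone are placeholders; exact available versions depend on the project's release channel):

```shell
# See which GKE versions are currently offered in this zone/channel
gcloud container get-server-config --zone us-central1-a

# Upgrade the control plane of a hypothetical cluster to the default version
gcloud container clusters upgrade my-cluster --zone us-central1-a --master
```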
Reporter
Comment 1•4 years ago
This has been fixed, probably by an automatic minor version update (1.13.12-gke.25).
Updated•1 year ago