Closed Bug 1594220 Opened 6 years ago Closed 6 years ago

taskcluster: scale out firefoxci deployment

Categories

(Cloud Services :: Operations: Taskcluster, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: miles, Assigned: brian)

References

Details

Current numbers of Heroku dynos per service:

queue.web: 25
queue.claimResolver: 4
queue.deadlineResolver: 3
queue.dependencyResolver: 4
auth.web: 10
index.web: 4
secrets.web: 2

Each dyno has roughly 1 CPU and 512MB of memory. These dyno counts roughly correspond to k8s pod replica counts, though we'd rather be overprovisioned than underprovisioned.
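As a sketch, the dyno-to-pod mapping above could look like the following Deployment fragment. The resource names, namespace, labels, and image are illustrative assumptions, not the actual manifests in cloudops-infra:

```yaml
# Hypothetical k8s Deployment mirroring queue.web's 25 Heroku dynos.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-web          # assumed name
  namespace: taskcluster   # assumed namespace
spec:
  replicas: 25             # matches queue.web's current dyno count
  selector:
    matchLabels:
      app: queue-web
  template:
    metadata:
      labels:
        app: queue-web
    spec:
      containers:
        - name: queue-web
          image: taskcluster/queue:latest   # placeholder image
```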

There is a push tomorrow that we'd like to be scaled out for. It's OK if this is done manually for now.

Component: Operations: Deployment Requests → Operations: Taskcluster
Assignee: nobody → bpitts
Status: NEW → ASSIGNED

I assume everything not in that list is using 1 dyno.

Are all taskcluster services universally configured to reserve 1 CPU and 512MB of RAM in Heroku?

Is there a way edunham or I could see historical resource utilization per service, to better fine-tune our initial requests? If not, that's fine; we can adjust downward after launch. We have only resource requests configured in k8s, no limits, so nothing will get throttled artificially. We just have to watch for overloaded nodes in the short term, and address overprovisioning causing us to run too many nodes in the long term.
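A requests-only resource stanza, as described above, would look roughly like this per container. The exact values are taken from the dyno figures earlier in this bug and are a sketch, not the deployed config:

```yaml
# Requests without limits: pods reserve capacity for scheduling,
# but k8s never CPU-throttles or OOM-kills them at an artificial ceiling.
resources:
  requests:
    cpu: "1"        # one Heroku dyno's worth of CPU
    memory: 512Mi   # matches the 512MB dyno memory
```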

Miles, do the resource changes at https://paste.mozilla.org/YowdwOJ0 look good to you? If so I can apply both https://github.com/mozilla-services/cloudops-infra/pull/1558 and them in the morning.

I've reserved 0.9 CPU instead of 1 because the node pool currently consists of n1-standard-2 instances, which have only about 1.94 allocatable CPUs (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture). After tomorrow's push is done, I think we should consider creating a new node pool with a larger instance type, possibly with a higher CPU-to-memory ratio.
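Creating such a node pool might look like the following, assuming gcloud is the tool in use here; the pool name, cluster name, machine type, and node count are all hypothetical:

```shell
# Sketch: add a larger node pool alongside the existing n1-standard-2 pool.
# n1-highmem-4 (4 vCPU / 26GB) is one option for a higher memory:CPU ratio.
gcloud container node-pools create taskcluster-large \
  --cluster=CLUSTER_NAME \
  --machine-type=n1-highmem-4 \
  --num-nodes=6
```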

Flags: needinfo?(miles)

This is done. We now have 0.9 CPU and 500MB of RAM reserved for each service, plus the additional replicas described in the original request.

I'll file a follow-up bug for us to revisit the requests and replica counts in a couple of weeks.

From the Taskcluster dev side, my expectation is that once things settle down you can create HPAs for these seven services, so we won't need to scale up and down manually.
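For one of the seven services, such an HPA could be sketched as below. The name, replica bounds, and CPU target are illustrative assumptions, and clusters of this vintage may need the `autoscaling/v2beta2` API version instead of `autoscaling/v2`:

```yaml
# Hypothetical HorizontalPodAutoscaler scaling queue-web on CPU utilization,
# which the existing resource requests make possible.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-web          # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-web
  minReplicas: 5           # illustrative lower bound
  maxReplicas: 40          # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requested CPU
```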

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
See Also: → 1594476
Flags: needinfo?(miles)