Increase the number of Windows instances for the NSS CI from 10 to 20+
Categories
(Release Engineering :: Firefox-CI Administration, enhancement)
Tracking
(Not tracked)
People
(Reporter: beurdouche, Assigned: jlorenzo)
References
(Blocks 4 open bugs)
Details
Attachments
(2 files)
The NSS CI is extremely slow for Windows tasks even when only one user is pushing to Try and there are no other tasks pending (it often takes 2h+ for a single run with all tests). After discussing with Johan and Sylvestre, it seems that the number of workers (currently 10) is simply too low to handle NSS tasks in a reasonable time.
Would it be possible to increase the maximum capacity for these workers, please?
Thanks a lot, B.
Assignee
Comment 1•5 years ago
Consider what happens when a single developer runs try: -b do -p all -u all -t all -e all on nss-try[1]. I checked the associated task group: 150 tasks run on win2012r2[2]. Even though they likely don't all run at the same time, 10 workers is not enough to absorb the load of a single developer asking for a single full run. Benjamin told me there are 3 people actively using nss-try.
I looked up when the max number was defined: it was in bug 1589706[3]. That said, chances are this value was the one manually defined on the old Taskcluster instance.
As a data point, the task group above has 360+ Linux tasks but uses 50 instances[4] instead.
Long story short: I think we should first bump the number of Windows workers to 30 (a sketch of the kind of change I have in mind is after the references below).
[1] https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=88589004920cb2b9a8e5c847a03c8385de5311b6
[2] Number given by: taskcluster group list --format-string '{{ .Status.TaskID }} {{ .Task.WorkerType }}' --all XQRiZ4v5RWOhip-XScO73g | grep win2012r2 | sort | uniq | wc -l
[3] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1315
[4] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1283
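For illustration only, the change would look roughly like this in worker-pools.yml; the surrounding structure and field names here are from memory and may not match the current schema exactly:

nss-1/win2012r2:
  ...
  config:
    minCapacity: 0
    # bumped from 10 so that a single full nss-try push no longer saturates the pool
    maxCapacity: 30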
Assignee
Comment 2•5 years ago
Updated•5 years ago
Updated•5 years ago
Assignee
Comment 4•5 years ago
Landed and deployed! Please let us know how this goes from now on, Benjamin!
Assignee
Comment 5•5 years ago
Reopening bug. Taskcluster doesn't spawn as many Windows workers as we want. :beurdouche reported it to me. We currently have 100 tasks pending and 24 workers. I'm guessing the root cause is this error[1] we see in one of the data centers:
Instance Creation Error
Error calling AWS API: The image id '[ami-00711ead51d83d62b]' does not exist
It keeps repeating over and over. The fix should be simple: it's a matter of updating this line[2]. That said, I have no idea how to get a valid AMI ID. I looked into the taskcluster CLI[3] but I couldn't find a way to list the available AMIs (a sketch of how the AWS CLI might be used for this is after the references below). :tomprince, would you know where to get it?
[1] https://firefox-ci-tc.services.mozilla.com/worker-manager/nss-1%2Fwin2012r2/errors
[2] https://hg.mozilla.org/ci/ci-configuration/file/a5b6ec0f08448bde885d47ccd86c485a94d04def/worker-images.yml#l226
[3] https://github.com/taskcluster/taskcluster/tree/27691079fc5bbd3b81c752f641c2a2e643ba582e/clients/client-shell#taskcluster-client-for-shell
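For the record, if we have read access to the AWS account that owns these images, the AWS CLI (rather than the taskcluster CLI) can list candidate AMIs and their creation dates. Something along these lines should work, though the name filter is only a guess at the naming scheme:

aws ec2 describe-images \
  --region us-east-1 \
  --owners self \
  --filters "Name=name,Values=*win2012*" \
  --query 'Images[].[ImageId,Name,CreationDate]' \
  --output table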
Assignee
Comment 6•5 years ago
Chatted with Tom on chat.mozilla.org. Let's remove us-east-1 for now (a rough sketch of what that looks like is below) and get the real fix in bug 1649498.
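To make the plan concrete, the removal would roughly mean keeping only the regions whose AMIs still exist in the worker-images.yml entry. The image key and the remaining AMI IDs below are placeholders, not the real values:

nss-win2012r2:
  us-west-1: ami-placeholder1
  us-west-2: ami-placeholder2
  # us-east-1 dropped for now: its AMI (ami-00711ead51d83d62b) no longer exists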
Assignee
Comment 7•5 years ago
Comment 8•5 years ago
Hey coop,
Is this something that might have been deleted in a recent cleanup?
@jlorenzo I think the NSS Windows images are managed by relops now and not by the Taskcluster team, but I'm not 100% sure how they are created or by whom anymore. In the past, when everything ran under one Taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in Taskcluster community deployment workers. I don't think the NSS bootstrapping PowerShell scripts that I used to use are used anymore. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)
Assignee
Comment 10•5 years ago
Thanks Pete for the context! :fubar, would you know who owns these AMIs?
In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂
Comment 11•5 years ago
Looking at the history, it looks like the Taskcluster team has done most of the support for the NSS workers. I don't know whether they have custom config in that AMI or not. If not, we should try to get them onto something recent. Rob, do you know more?
Comment 12•5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
@jlorenzo I think the NSS Windows images are managed by relops now and not by the Taskcluster team, but I'm not 100% sure how they are created or by whom anymore. In the past, when everything ran under one Taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in Taskcluster community deployment workers. I don't think the NSS bootstrapping PowerShell scripts that I used to use are used anymore. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)
In case it's still valuable, I warrant that the AMI hasn't changed since Pete last touched it. The NSS configs are long-lived and rarely updated.
Comment 13•5 years ago
We don't have any NSS configs under the relops repos. I can add configs to OCC for NSS, which would make it possible for us to create new AMIs, if someone can point me at the PowerShell scripts that were last used to create the NSS AMIs, so that I can determine what's in the AMI. I'd only be guessing without that information.
Perhaps provide a link to the source code for the PowerShell bootstrap script, or just attach the script to this bug and ni me again.
Reporter
Comment 14•5 years ago
(In reply to Johan Lorenzo [:jlorenzo] - On PTO - Back on July 13th from comment #10)
Thanks Pete for the context! :fubar, would you know who owns these AMIs?
In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂
Hi :jlorenzo, hi all! Thanks a lot for looking into this!
We've observed that the CI seems to be consistently faster (about 1h instead of 2h-2h30), which is great, but we've also observed a significant rise in the number of intermittent failures where some containers are killed after 3600 seconds (we had none before).
Is that a known issue when updating these configurations?
Assignee
Comment 15•5 years ago
Sorry for the delay, I tried to catch you on chat.m.o last week but it was a miss 🙂
Do you have some examples I can look into? At first glance, I don't see any causal link between what was changed and what happens now. Having some Treeherder links will help me make sure of that (the first thing I'd check is sketched below).
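My first guess, and it is only a guess at this point, is that the 3600-second cutoff simply matches the tasks' maxRunTime, which workers enforce per task regardless of pool size. Once I have a failing task ID, something like the following, using the same taskcluster CLI as in comment 1 (assuming its task def subcommand is available in this version), should confirm or rule that out:

taskcluster task def <taskId> | grep -i maxRunTime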
Reporter
Comment 16•5 years ago
No problem, thanks for looking into this! :)
The intermittents started on July 30th, the day the CI patch here was merged.
It looks like the first build showing these timeouts is this try run; everything before it has none:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=78360e69f9dad5dfe7462d7ac9505b1d66076b77
Here are a few recent examples; on the latest push to nss-try, all the failures seem to be timeouts:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=dc3146dcb08c3b4c23c4a0c2ee0ec1e75a3f4ca2
Assignee
Comment 17•4 years ago
Chatted with :beurdouche over Element. No big issues have happened over the last 4 months. He agreed to close this issue.
Reporter
Updated•1 year ago