Closed Bug 1648475 Opened 4 years ago Closed 4 years ago

Increase the number of Windows instances for the NSS CI from 10 to 20+

Categories

(Release Engineering :: Firefox-CI Administration, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: beurdouche, Assigned: jlorenzo)

References

(Blocks 4 open bugs)

Details

Attachments

(2 files)

The NSS CI is extremely slow for Windows tasks, even when only one user is pushing to Try and no other tasks are pending (a single run with all tests often takes 2h+). After discussing with Johan and Sylvestre, it seems that the number of workers (currently 10) is simply too low to handle NSS tasks in a reasonable time.

https://hg.mozilla.org/ci/ci-configuration/annotate/11448df50bf8e0975b276f257ac8118645bbaa2f/worker-pools.yml#l1983

Would it be possible to increase the maximum capacity for these workers, please?
Thanks a lot, B.

I looked at what happens when a single developer runs try: -b do -p all -u all -t all -e all on nss-try[1]. In the associated task group, 150 tasks run on win2012r2[2]. Even though they likely don't all run at the same time, 10 workers are not enough to absorb the load of a single developer asking for a single full run. Benjamin told me there are 3 people actively using nss-try.

I looked up when the max number was defined: it was in bug 1589706[3]. That said, chances are this value was the one manually defined on the old Taskcluster instance.

As a data point, the task group above has 360+ Linux tasks, but those use 50 instances[4].

Long story short: I think we should first bump the number of Windows workers to 30.

[1] https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=88589004920cb2b9a8e5c847a03c8385de5311b6
[2] Number given by: taskcluster group list --format-string '{{ .Status.TaskID }} {{ .Task.WorkerType }}' --all XQRiZ4v5RWOhip-XScO73g | grep win2012r2 | sort | uniq | wc -l
[3] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1315
[4] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1283
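
To make the capacity reasoning above concrete, here is a minimal Python sketch. It mirrors the counting done by the shell pipeline in [2] and then estimates how many "waves" of tasks a pool needs. The helper names `count_worker_types` and `waves` are hypothetical, and the estimate assumes all tasks are independent and take the same time, which is not true in practice (the tasks likely don't all run at once), so treat it as a rough lower bound rather than a prediction.

```python
import math
from collections import Counter


def count_worker_types(lines):
    """Count distinct task IDs per worker type.

    Each line is expected to look like '<taskId> <workerType>', i.e. the
    output format requested by the format string in [2]. Malformed lines
    and repeated task IDs are ignored.
    """
    seen = set()
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        task_id, worker_type = parts
        if task_id in seen:
            continue
        seen.add(task_id)
        counts[worker_type] += 1
    return counts


def waves(num_tasks, num_workers):
    """Minimum number of sequential batches needed to run num_tasks
    on num_workers, assuming one task per worker at a time."""
    return math.ceil(num_tasks / num_workers)


if __name__ == "__main__":
    # 150 is the win2012r2 task count quoted in the comment above.
    for workers in (10, 30):
        print(f"{workers} workers -> at least {waves(150, workers)} waves")
```

With the numbers from the comment, 150 tasks on 10 workers need at least 15 waves, while 30 workers bring that down to 5.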

Depends on: 1589706
Assignee: nobody → jlorenzo
Status: NEW → ASSIGNED
Attachment #9159318 - Attachment description: Bug 1648475 - Increase the number of Windows instances for the NSS CI from 10 to 30 r=tomprince → Bug 1648475 - Increase the number of Windows instances for the NSS CI from 10 to 150 r=tomprince
Pushed by jlorenzo@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/a5b6ec0f0844 Increase the number of Windows instances for the NSS CI from 10 to 150 r=tomprince

Landed and deployed! Please let us know how this goes from now on, Benjamin!

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bbeurdouche)
Resolution: --- → FIXED

Reopening bug. Taskcluster doesn't spawn as many Windows workers as we want; :beurdouche reported it to me. We currently have 100 tasks pending and 24 workers. I'm guessing the root cause is this error[1], which we see in one of the data centers:

Instance Creation Error
Error calling AWS API: The image id '[ami-00711ead51d83d62b]' does not exist

It keeps repeating over and over. The fix should be simple: it's a matter of updating this line[2]. That said, I have no idea how to get a valid AMI ID. I looked into the taskcluster CLI[3] but I couldn't find a way to list the available AMIs. :tomprince, would you know where to get it?

[1] https://firefox-ci-tc.services.mozilla.com/worker-manager/nss-1%2Fwin2012r2/errors
[2] https://hg.mozilla.org/ci/ci-configuration/file/a5b6ec0f08448bde885d47ccd86c485a94d04def/worker-images.yml#l226
[3] https://github.com/taskcluster/taskcluster/tree/27691079fc5bbd3b81c752f641c2a2e643ba582e/clients/client-shell#taskcluster-client-for-shell
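
For triage, a small Python sketch like the one below can tally which image IDs the repeated errors name. It is only an illustration: it does not query Taskcluster or AWS, `missing_amis` is a hypothetical helper, and it assumes the error messages have already been copied as plain strings from the worker-manager errors page in [1] and follow the wording shown above.

```python
import re
from collections import Counter

# Matches the wording of the error quoted above, e.g.
# "The image id '[ami-00711ead51d83d62b]' does not exist"
AMI_ERROR = re.compile(r"The image id '\[(ami-[0-9a-f]+)\]' does not exist")


def missing_amis(messages):
    """Return a Counter mapping each AMI ID reported as missing
    to the number of error messages that mention it."""
    counts = Counter()
    for message in messages:
        match = AMI_ERROR.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Error calling AWS API: The image id '[ami-00711ead51d83d62b]' does not exist",
        "Error calling AWS API: The image id '[ami-00711ead51d83d62b]' does not exist",
    ]
    for ami, hits in missing_amis(sample).items():
        print(ami, hits)
```

If a single ID accounts for all the errors, as seems to be the case here, only one image reference needs fixing.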

Status: RESOLVED → REOPENED
Flags: needinfo?(bbeurdouche) → needinfo?(mozilla)
Resolution: FIXED → ---
Blocks: 1649498

Chatted with Tom on chat.mozilla.org. Let's remove us-east-1 for now and get the real fix in bug 1649498.

Flags: needinfo?(mozilla)

Hey coop,

Is this something that might have been deleted in a recent cleanup?

@jlorenzo I think the NSS windows images are managed by relops now and not by taskcluster team, but I'm not 100% sure how they are created or by who any more. In the past, when everything ran under one taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in taskcluster community deployment workers, I don't think the NSS bootstrapping powershell scripts that I used to use are used any more. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)

Flags: needinfo?(coop)
Pushed by jlorenzo@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/fc09b04c98a6 part 2: Remove nss windows workers off us-east-1 r=tomprince

Thanks Pete for the context! :fubar, would you know who owns these AMIs?

In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂

Flags: needinfo?(klibby)
Flags: needinfo?(bbeurdouche)

Looking at history, it looks like Taskcluster has done most of the support for the NSS workers. I don't know if they have custom config in that AMI or not. If not we should try and get them onto something recent. Rob, do you know more?

Flags: needinfo?(klibby) → needinfo?(rthijssen)

(In reply to Pete Moore [:pmoore][:pete] from comment #8)

@jlorenzo I think the NSS windows images are managed by relops now and not by taskcluster team, but I'm not 100% sure how they are created or by who any more. In the past, when everything ran under one taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in taskcluster community deployment workers, I don't think the NSS bootstrapping powershell scripts that I used to use are used any more. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)

In case it's still valuable, I warrant that the AMI hasn't changed since Pete last touched it. The NSS configs are long-lived and rarely updated.

Flags: needinfo?(coop)

we don't have any nss configs under relops repos. i can add configs to occ for nss that would make it possible for us to create new amis, if someone can point me at the powershell scripts that were last used to create nss amis, so that i can determine what's in the ami. i'd only be guessing without that information.
perhaps a link to the source code for the powershell bootstrap script, or just attach the script to this bug and ni me again.

Flags: needinfo?(rthijssen)

(In reply to Johan Lorenzo [:jlorenzo] - On PTO - Back on July 13th from comment #10)

Thanks Pete for the context! :fubar, would you know who owns these AMIs?

In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂

Hi :jlorenzo , Hi all ! Thanks a lot for looking into this!
We've observed that the CI seems to be consistently faster (about 1h instead of 2h-2h30), which is great, but we've also observed a significant rise in the number of intermittent failures where some containers are killed after 3600 seconds (we had none before).
Is that a known issue when updating these configurations?

Flags: needinfo?(bbeurdouche) → needinfo?(jlorenzo)
Blocks: 1640328

Sorry for the delay, I tried to catch you on chat.m.o last week but it was a miss 🙂

Do you have some examples I can look into? At first glance, I don't see any causal link between what was changed and what is happening now. Having some Treeherder links will help me make sure of that.

Flags: needinfo?(jlorenzo) → needinfo?(bbeurdouche)

No problem, thanks for looking into this ! : )

The intermittent failures started on July 30th, the day the CI patch here was merged.
It looks like the first build showing these timeouts is this try run; everything before it has none:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=78360e69f9dad5dfe7462d7ac9505b1d66076b77

Here are a few recent examples. In the latest push on nss-try, all the failures seem to be timeouts:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=dc3146dcb08c3b4c23c4a0c2ee0ec1e75a3f4ca2

Flags: needinfo?(bbeurdouche)

Chatted over Element with :beurdouche. No big issue has happened over the last 4 months. He agreed to close this issue.

Status: REOPENED → RESOLVED
Closed: 4 years ago
QA Contact: mtabara
Resolution: --- → FIXED
See Also: → 1805216