Increase the number of Windows instances for the NSS CI from 10 to 20+
Categories
(Release Engineering :: Firefox-CI Administration, enhancement)
Tracking
(Not tracked)
People
(Reporter: beurdouche, Assigned: jlorenzo)
References
(Blocks 4 open bugs)
Details
Attachments
(2 files)
The NSS CI is extremely slow for Windows tasks even when only one user is pushing to Try and there are no other tasks pending (it often takes 2h+ for a single run with all tests). After discussing with Johan and Sylvestre, it seems that the number of workers (currently 10) is simply too low to handle NSS tasks in a reasonable time.
Would it be possible to increase the maximum capacity for these workers, please?
Thanks a lot, B.
Assignee
Comment 1•5 years ago
Consider what happens when a single developer runs try: -b do -p all -u all -t all -e all on nss-try[1]. I checked the associated task group: 150 tasks run on win2012r2[2]. Even though they likely don't all run at the same time, 10 workers is not enough to absorb the load of a single developer asking for a single full run. Benjamin told me there are 3 people actively using nss-try.
I looked up when the max number was defined: it was in bug 1589706[3]. That said, chances are this value was the one manually defined on the old Taskcluster instance.
As a data point, the task group above has 360+ Linux tasks but uses 50 instances[4] instead.
Long story short: I think we should first bump the number of Windows workers to 30 (a sketch of the kind of change I have in mind is after the references below).
[1] https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=88589004920cb2b9a8e5c847a03c8385de5311b6
[2] Number given by: taskcluster group list --format-string '{{ .Status.TaskID }} {{ .Task.WorkerType }}' --all XQRiZ4v5RWOhip-XScO73g | grep win2012r2 | sort | uniq | wc -l
[3] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1315
[4] https://hg.mozilla.org/ci/ci-configuration/file/7d44b77acfbf17662937d0e747aa54d9f94d362b/worker-pools.yml#l1283
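For illustration only, the change would look roughly like this in worker-pools.yml; the surrounding structure and field names here are from memory and may not match the current schema exactly:

nss-1/win2012r2:
  ...
  config:
    minCapacity: 0
    # bumped from 10 so that a single full nss-try push no longer saturates the pool
    maxCapacity: 30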
Assignee
Comment 2•5 years ago
Updated•5 years ago
Updated•5 years ago
Assignee
Comment 4•5 years ago
Landed and deployed! Please let us know how this goes from now on, Benjamin!
Assignee
Comment 5•5 years ago
Reopening bug. Taskcluster doesn't spawn as many Windows workers as we want. :beurdouche reported it to me. We currently have 100 tasks pending and 24 workers. I'm guessing the root cause is this error[1] we see in one of the data centers:
Instance Creation Error
Error calling AWS API: The image id '[ami-00711ead51d83d62b]' does not exist
It keeps repeating over and over. The fix should be simple: it's a matter of updating this line[2]. That said, I have no idea how to get a valid AMI ID. I looked into the taskcluster CLI[3] but I couldn't find a way to list the available AMIs (a sketch of how the AWS CLI might be used for this is after the references below). :tomprince, would you know where to get it?
[1] https://firefox-ci-tc.services.mozilla.com/worker-manager/nss-1%2Fwin2012r2/errors
[2] https://hg.mozilla.org/ci/ci-configuration/file/a5b6ec0f08448bde885d47ccd86c485a94d04def/worker-images.yml#l226
[3] https://github.com/taskcluster/taskcluster/tree/27691079fc5bbd3b81c752f641c2a2e643ba582e/clients/client-shell#taskcluster-client-for-shell
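For the record, if we have read access to the AWS account that owns these images, the AWS CLI (rather than the taskcluster CLI) can list candidate AMIs and their creation dates. Something along these lines should work, though the name filter is only a guess at the naming scheme:

aws ec2 describe-images \
  --region us-east-1 \
  --owners self \
  --filters "Name=name,Values=*win2012*" \
  --query 'Images[].[ImageId,Name,CreationDate]' \
  --output table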
Assignee
Comment 6•5 years ago
Chatted with Tom on chat.mozilla.org. Let's remove us-east-1 for now (a rough sketch of what that looks like is below) and get the real fix in bug 1649498.
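To make the plan concrete, the removal would roughly mean keeping only the regions whose AMIs still exist in the worker-images.yml entry. The image key and the remaining AMI IDs below are placeholders, not the real values:

nss-win2012r2:
  us-west-1: ami-placeholder1
  us-west-2: ami-placeholder2
  # us-east-1 dropped for now: its AMI (ami-00711ead51d83d62b) no longer exists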
Assignee
Comment 7•5 years ago
Comment 8•5 years ago
Hey coop,
Is this something that might have been deleted in a recent cleanup?
@jlorenzo I think the NSS Windows images are managed by relops now and not by the Taskcluster team, but I'm not 100% sure how they are created or by whom anymore. In the past, when everything ran under one Taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in Taskcluster community deployment workers. I don't think the NSS bootstrapping PowerShell scripts that I used to use are used anymore. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)
Assignee
Comment 10•5 years ago
Thanks Pete for the context! :fubar, would you know who owns these AMIs?
In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂
Comment 11•5 years ago
Looking at the history, it looks like the Taskcluster team has done most of the support for the NSS workers. I don't know whether they have custom config in that AMI or not. If not, we should try to get them onto something recent. Rob, do you know more?
Comment 12•5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
@jlorenzo I think the NSS Windows images are managed by relops now and not by the Taskcluster team, but I'm not 100% sure how they are created or by whom anymore. In the past, when everything ran under one Taskcluster deployment and one AWS account, I used to create them, but these days I'm only involved in Taskcluster community deployment workers. I don't think the NSS bootstrapping PowerShell scripts that I used to use are used anymore. It might be worth checking in with Kendall. It's been so long since I touched NSS stuff, I genuinely can't remember! :-)
In case it's still valuable, I warrant that the AMI hasn't changed since Pete last touched it. The NSS configs are long-lived and rarely updated.
Comment 13•5 years ago
We don't have any NSS configs under the relops repos. I can add configs to OCC for NSS, which would make it possible for us to create new AMIs, if someone can point me at the PowerShell scripts that were last used to create the NSS AMIs, so that I can determine what's in the AMI. I'd only be guessing without that information.
Perhaps provide a link to the source code for the PowerShell bootstrap script, or just attach the script to this bug and ni me again.
Reporter
Comment 14•5 years ago
(In reply to Johan Lorenzo [:jlorenzo] - On PTO - Back on July 13th from comment #10)
Thanks Pete for the context! :fubar, would you know who owns these AMIs?
In the meantime, I landed and deployed patch #2. Please let me know if you notice any difference, :beurdouche 🙂
Hi :jlorenzo, hi all! Thanks a lot for looking into this!
We've observed that the CI seems to be consistently faster (about 1h instead of 2h-2h30), which is great, but we've also observed a significant rise in the number of intermittent failures where some containers are killed after 3600 seconds (we had none before).
Is that a known issue when updating these configurations?
Assignee
Comment 15•5 years ago
Sorry for the delay, I tried to catch you on chat.m.o last week but it was a miss 🙂
Do you have some examples I can look into? At first glance, I don't see any causal link between what was changed and what happens now. Having some Treeherder links will help me make sure of that (the first thing I'd check is sketched below).
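My first guess, and it is only a guess at this point, is that the 3600-second cutoff simply matches the tasks' maxRunTime, which workers enforce per task regardless of pool size. Once I have a failing task ID, something like the following, using the same taskcluster CLI as in comment 1 (assuming its task def subcommand is available in this version), should confirm or rule that out:

taskcluster task def <taskId> | grep -i maxRunTime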
Reporter
Comment 16•5 years ago
No problem, thanks for looking into this! :)
The intermittents started on July 30th, the day the CI patch here was merged.
It looks like the first build showing these timeouts is this try run; everything before it has none:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=78360e69f9dad5dfe7462d7ac9505b1d66076b77
Here are a few recent examples; on the latest push to nss-try, all the failures seem to be timeouts:
https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=dc3146dcb08c3b4c23c4a0c2ee0ec1e75a3f4ca2
Assignee
Comment 17•4 years ago
Chatted with :beurdouche over Element. No big issues have happened over the last 4 months. He agreed to close this issue.
Reporter
Updated•1 year ago