windows 10 talos pending counts are at ~3800

RESOLVED FIXED

Status

Release Engineering
Buildduty
4 months ago
4 months ago

People

(Reporter: kmoir, Assigned: kmoir)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(1 attachment)

(Assignee)

Description

4 months ago
nagios-releng> Mon 08:15:06 PDT [4007] [moc] nagios1.private.releng.scl3.mozilla.com:Pending jobs is CRITICAL: CRITICAL Pending Jobs: 3881 on [t-w1064-ix] (http://m.mozilla.org/Pending+jobs)
11:47 AM <kmoir> ^^ looks like the pending count is from jobs that are from Thursday onwards. Verifying that jobs are being coalesced by SETA.
11:54 AM <kmoir> So looking at inbound, it appears that the win10 jobs are being scheduled twice and not being coalesced by SETA.
(Assignee)

Updated

4 months ago
Assignee: nobody → kmoir
(Assignee)

Comment 1

4 months ago
Created attachment 8874478
bug1370270.patch

One item I noticed is that win10 talos is enabled on try by default; the comments on this bug were not incorporated into the patch when it landed:

https://bugzilla.mozilla.org/show_bug.cgi?id=1366029#c17

https://hg.mozilla.org/build/buildbot-configs/rev/b86e54ce5992#l2.15
(Assignee)

Comment 2

4 months ago
Comment on attachment 8874478
bug1370270.patch

Actually, it wasn't enabled by default in the initial patch; that change landed later, here, as part of bug 1369165:

https://hg.mozilla.org/build/buildbot-configs/rev/0d17eb5ae115
Attachment #8874478 - Flags: checked-in+
(Assignee)

Comment 3

4 months ago
Now win10 pending counts are down to ~1000

jmaher, looking here, it appears we are triggering win10 talos against both the opt and PGO builds on the same push:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2a45f5c74d5a525eef9ebd8a57dc519ec857dcdd

Is this intended, or should it be restricted to one of the two platforms?
Flags: needinfo?(jmaher)
This looks correct: we only do PGO periodically, and we need benchmark numbers for both opt and PGO. I don't know why there is such a large spike in numbers; I did a bunch of try runs about 10 hours ago, and win10 picked up the jobs and finished quickly.
Flags: needinfo?(jmaher)
Why did we turn off try-by-default? We have it on for linux64 and the load should be the same (just talos), and I believe we have more machines for Windows 10?

The goal of moving to win10 and turning non-e10s tests off was to have enough machines so we could run on try by default when -p win64 -t X is defined.

win7 is keeping up fine. I don't see many jobs on try requesting win10 talos; could the culprit be elsewhere?
Flags: needinfo?(kmoir)
(Assignee)

Comment 6

4 months ago
Okay, I am backing out that patch.
Flags: needinfo?(kmoir)
(Assignee)

Updated

4 months ago
Attachment #8874478 - Flags: checked-in+ → checked-in-
(Assignee)

Comment 7

4 months ago
Nick, did you clean up the DBs last night to reduce the pending count for win10 talos? I'm trying to figure out why the pending count suddenly dropped last night after being so high yesterday.
Flags: needinfo?(nthomas)
No, I didn't touch the DB. The only kinda-related work was fixing some stuck reconfigs on the Windows test masters, caused by some long-running t-w732-spot jobs. I think the t-w10 backlog was already clearing when I started that, though.
Flags: needinfo?(nthomas)
(Assignee)

Comment 9

4 months ago
Okay, I don't really understand how the backlog could have cleared so quickly given the number of machines; perhaps the jobs were coalesced.
Status: NEW → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → FIXED