nagios-releng> Mon 08:15:06 PDT [moc] nagios1.private.releng.scl3.mozilla.com:Pending jobs is CRITICAL: CRITICAL Pending Jobs: 3881 on [t-w1064-ix] (http://m.mozilla.org/Pending+jobs)
11:47 AM <kmoir> ^^looks like the pending count is from jobs that are from Thursday onwards. verifying that jobs are being coalesced by seta
11:54 AM <kmoir> so looking at inbound it appears that the win10 jobs are being scheduled twice and not being coalesced by seta
Created attachment 8874478 bug1370270.patch One item I noticed is that win10 talos is on in try by default. The comments on this bug were not incorporated into the patch when it landed: https://bugzilla.mozilla.org/show_bug.cgi?id=1366029#c17 https://hg.mozilla.org/build/buildbot-configs/rev/b86e54ce5992#l2.15
Comment on attachment 8874478 bug1370270.patch Actually, it wasn't enabled by default in the initial patch; it landed later, from changes in bug 1369165: https://hg.mozilla.org/build/buildbot-configs/rev/0d17eb5ae115
Now win10 pending counts are down to ~1000. jmaher, looking here, it appears we are triggering win10 talos against both the opt and pgo builds on the same push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2a45f5c74d5a525eef9ebd8a57dc519ec857dcdd Is this intended, or should it be restricted to one of the two platforms?
This looks correct: we only do pgo periodically, and we need benchmark numbers for both opt and pgo. I don't know why there is such a large spike in numbers. I did a bunch of try runs about 10 hours ago, and win10 picked up the jobs and finished quickly.
Why did we turn try by default off? We have it on for linux64, the load should be the same (just talos), and I believe we have more machines for windows10. The goal of moving to win10 and turning non-e10s tests off was to have enough machines so we could run on try by default when -p win64 -t X is defined. win7 is keeping up fine. I don't see many jobs on try requesting win10 talos. Could the culprit be elsewhere?
Okay, I am backing out that patch.
Nick, did you clean up the dbs last night to reduce the pending count for win10 talos? I'm trying to figure out why the pending count suddenly went down last night after being so high yesterday.
No, I didn't touch the DB. The only kinda-related work was fixing some stuck reconfigs on windows test masters, due to some long-running t-w732-spot jobs. I think the t-w10 backlog was already clearing when I started that, though.
Okay, I don't really understand how the backlog could have completed so quickly given the number of machines. Perhaps the jobs were coalesced.