Last Comment Bug 818833 - (toodamnhigh!) Pending count for Linux32 test slaves is too high
Status: RESOLVED FIXED
[capacity]
: sheriffing-P1
Product: Release Engineering
Classification: Other
Component: General Automation
: other
: x86 Linux
: -- critical
: ---
Assigned To: John O'Duinn [:joduinn] (please use "needinfo?" flag)
: Chris AtLee [:catlee]
Mentors:
Depends on: 784913 818968 820299 820958 821012 822924 823642 830923 835955 843054 843229
Blocks: 772458 819044
Reported: 2012-12-06 02:10 PST by Ed Morley [:emorley]
Modified: 2013-08-12 21:54 PDT


Attachments
[configs] per c#20 (2.62 KB, patch)
2013-01-09 06:40 PST, Justin Wood (:Callek) (Away until Aug 29)
bhearsum: review+
bhearsum: checked‑in+
[configs] re-enable (2.58 KB, patch)
2013-01-14 09:18 PST, Justin Wood (:Callek) (Away until Aug 29)
bhearsum: review+
bugspam.Callek: checked‑in+
[configs] v1 - re-enable for TB as well. (1.11 KB, patch)
2013-01-28 08:32 PST, Justin Wood (:Callek) (Away until Aug 29)
bhearsum: review+
bugspam.Callek: checked‑in+

Description Ed Morley [:emorley] 2012-12-06 02:10:01 PST
I would imagine this is a combination of:
* "Linux used to be the fastest platform to request on Try, so if I'm testing just one platform, I'll request linux (32) in my trychooser syntax"
* B2G emulator uses linux slaves

~ Non-try:
Pending test(s) @ Dec 06 02:05:02
linux (70)
   25 mozilla-aurora
   24 mozilla-inbound
   11 ionmonkey
   10 mozilla-central
linux64 (22)
    9 mozilla-beta
    8 ionmonkey
    5 mozilla-inbound
mac10.6-rev4 (13)
   13 mozilla-inbound
mac10.7 (13)
   13 mozilla-inbound
mac10.8 (2)
    2 mozilla-inbound
winxp (5)
    5 ionmonkey

~ Try:
Pending test(s) @ Dec 06 02:05:02
linux (1057)
 1057 try
linux64 (14)
   14 try
win7 (89)
   89 try
winxp (27)
   27 try
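
For anyone wanting to compare these snapshots over time, here is a minimal sketch of parsing the report text above into per-platform totals. It assumes the layout is exactly as pasted (a "platform (total)" line followed by indented "count branch" lines); the names parse_pending, NON_TRY and TRY are hypothetical and not part of any existing releng tool.

```python
import re

# Snapshot data copied from the description (Dec 06 02:05:02).
NON_TRY = """\
linux (70)
   25 mozilla-aurora
   24 mozilla-inbound
   11 ionmonkey
   10 mozilla-central
linux64 (22)
    9 mozilla-beta
    8 ionmonkey
    5 mozilla-inbound
mac10.6-rev4 (13)
   13 mozilla-inbound
mac10.7 (13)
   13 mozilla-inbound
mac10.8 (2)
    2 mozilla-inbound
winxp (5)
    5 ionmonkey
"""

TRY = """\
linux (1057)
 1057 try
linux64 (14)
   14 try
win7 (89)
   89 try
winxp (27)
   27 try
"""

# Assumed line shapes: "platform (N)" and an indented "N branch".
HEADER_RE = re.compile(r"^(\S+) \((\d+)\)$")
ROW_RE = re.compile(r"^\s+(\d+)\s+(\S+)$")


def parse_pending(text):
    """Return {platform: {"total": int, "branches": {branch: int}}}."""
    platforms = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        row = ROW_RE.match(line)
        if header:
            current = header.group(1)
            platforms[current] = {"total": int(header.group(2)), "branches": {}}
        elif row and current is not None:
            platforms[current]["branches"][row.group(2)] = int(row.group(1))
        # Any other line (titles, timestamps, blanks) is ignored.
    return platforms


non_try = parse_pending(NON_TRY)
try_jobs = parse_pending(TRY)

# Share of all pending Try test jobs that are waiting on linux (32-bit) slaves.
try_total = sum(p["total"] for p in try_jobs.values())
linux_try_share = try_jobs["linux"]["total"] / try_total

if __name__ == "__main__":
    print("pending Try test jobs: %d" % try_total)
    print("linux32 share of Try backlog: %.1f%%" % (100 * linux_try_share))
```

Run against the snapshot above, this makes the imbalance explicit: nearly all of the pending Try test jobs are queued for the linux (32-bit) pool.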
Comment 1 Ed Morley [:emorley] 2012-12-06 02:13:28 PST
jgriffin, is there a bug filed for trying to use the emulator on linux64 slaves too? Happy to file one if not :-)
Comment 2 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-12-06 06:21:24 PST
We should really attempt to move them to aws. The rev3 minis won't scale even if you split them among 2 of the platforms.

Can we encourage people to use fedora64 on the try server?
Comment 3 Rail Aliiev [:rail] 2012-12-06 06:31:47 PST
So far the pool looks OK: we don't have many test slaves that are hung or waiting for a reboot. The running jobs look sane as well.
Comment 4 Jonathan Griffin (:jgriffin) 2012-12-06 09:34:08 PST
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #1)
> jgriffin, is there a bug filed for trying to use the emulator on linux64
> slaves too? Happy to file one if not :-)

No, there's no such bug yet.
Comment 5 Gregor Wagner [:gwagner] 2012-12-06 14:24:26 PST
Currently I have to debug marionette code on inbound. That's BAD!

I pushed to try but even after 24hours the marionette tests were still pending.
16 out of 35 tests fail locally on my linux machine on mc tip and nobody knows why.

We need try-server coverage for marionette tests!
Comment 6 Ed Morley [:emorley] 2012-12-06 14:29:09 PST
(In reply to Jonathan Griffin (:jgriffin) from comment #4)
> (In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #1)
> > jgriffin, is there a bug filed for trying to use the emulator on linux64
> > slaves too? Happy to file one if not :-)
> 
> No, there's no such bug yet.

Filed bug 818968.

(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0)
> I would imagine this is a combination of:
> * "Linux used to be the fastest platform to request on Try, so if I'm
> testing just one platform, I'll request linux (32) in my trychooser syntax"

Posted to dev.platform to try and reverse this habit:
https://groups.google.com/d/msg/mozilla.dev.platform/XcVw9IeUXVU/wvHM0OjCN5MJ
Comment 7 Ed Morley [:emorley] 2012-12-07 02:44:53 PST
At a similar time to yesterday, we're now up to 1500 pending linux32 Try jobs (from ~1000) :-(
Comment 8 Ed Morley [:emorley] 2012-12-07 06:43:16 PST
We now have Try jobs that are still pending after 1 day 20 hours :-(
Comment 9 Ed Morley [:emorley] 2012-12-11 07:12:42 PST
Mandatory meme :-)

http://www.quickmeme.com/meme/3s585s/
Comment 10 Aki Sasaki [:aki] 2012-12-12 12:25:47 PST
The current fix in line for this is turning off linux32 desktop tests on m-c and project branches.  Newsgroup posts incoming.
Comment 11 Rail Aliiev [:rail] 2012-12-12 12:52:34 PST
(In reply to Aki Sasaki [:aki] from comment #10)
> The current fix in line for this is turning off linux32 desktop tests on m-c
> and project branches.  Newsgroup posts incoming.

Maybe m-i and branches? m-c is not so overloaded.
Comment 12 Aki Sasaki [:aki] 2012-12-12 12:55:39 PST
That would just lead to them being hidden.
Comment 13 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-12-18 12:55:45 PST
Totally true, and only going to get worse as more B2G test suites are figured out and start running in production. Meanwhile, new hardware is still months out.

We've proposed some short term plans for reducing load on linux32 test slaves in dev.planning newsgroup, and the dev platform meetings (last week and today). Details still being worked out, so grabbing this for now.
Comment 14 Ed Morley [:emorley] 2012-12-19 01:14:29 PST
B2G marionette-webapi is getting broken quite frequently - and is broken again this morning - and the high levels of coalescing often mean I've had to close the tree to bisect due to the large range. I suspect the bustage would have been more obvious (and thus in today's case not left in for 45 pushes) if more jobs had been run and so more oranges were showing on TBPL.

Thank you for looking at this :-)
Comment 15 Ed Morley [:emorley] 2012-12-19 02:32:38 PST
Release branches keep on stealing linux32 test slaves (meaning the marionette retriggers on inbound are still pending and holding inbound closed), so I've temporarily closed aurora, beta, b2g18, esr18 and esr10 to ensure we can get inbound open sooner rather than later.
Comment 16 Ed Morley [:emorley] 2012-12-19 06:29:44 PST
There are ~4 linux32 machines that haven't taken a job in anywhere from 10hrs to a day+.

Could you give them a kick? :-)

http://build.mozilla.org/builds/last-job-per-slave.html
Comment 17 Ben Hearsum (:bhearsum) 2013-01-07 06:53:27 PST
The chronic lack of capacity is not an issue that buildduty can deal with. We'll continue to deal with it acutely by unsticking slaves and the like, but I'm removing [buildduty] because this is unactionable for buildduty.
Comment 18 Ben Hearsum (:bhearsum) 2013-01-08 12:56:07 PST
I'm currently going through the buildduty queue and the last-job-per-slave list and kicking machines. I've already fixed up a couple of the fed ones.
Comment 19 Ed Morley [:emorley] 2013-01-08 12:59:33 PST
That's great - thank you :-)
Comment 20 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2013-01-09 03:16:14 PST
At this point, we've done the following:
1) adjusted priority of b2g jobs
2) disabled any known broken test jobs (which are just wasting cpu cycles)
3) scavenged additional test machines from others

...which helped. A bit. But not enough to offset:
1) the traditional busiest week of the year is first week after Christmas/NewYears. ie now.
2) the b2g workweek in progress this week, which is also last one before 15jan, so a spike in b2g traffic.
3) the lack of replacement machines to keep up with load, which are still some way off from being delivered/online. 

As a short term emergency move, we're disabling linux32 desktop test jobs effective immediately until we get through this week. Note: given today's FF18.0/FF10esr/FF17esr releases, we will leave linux32 desktop test jobs enabled on mozilla-aurora/beta/release/esr10/esr17. These branches combined are ~7% of load, so not significant load, but they are important to have in case we need to chemspill. Note: this change is for linux32 *test* jobs only. linux32 builds continue as usual. Also, linux64 builds and tests continue as usual.
Comment 21 Justin Wood (:Callek) (Away until Aug 29) 2013-01-09 06:40:20 PST
Created attachment 699786 [details] [diff] [review]
[configs] per c#20
Comment 22 Ben Hearsum (:bhearsum) 2013-01-09 08:50:30 PST
Comment on attachment 699786 [details] [diff] [review]
[configs] per c#20

This was landed and put into production a couple of hours ago.
Comment 23 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2013-01-09 15:13:07 PST
cjones: from newsgroups, there was a question about whether B2G still needed the crashtest-ipc test suite run on linux32, or if this was now covered by other suites.

Per comment#20, we disabled linux32 test suites, including crashtest-ipc, on most branches last night, in order to improve b2g test waittimes. Let us know if you need this reenabled.
Comment 24 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2013-01-14 09:12:58 PST
Per bmoss, the volume and urgency of b2g checkins have decreased to the point that we are now ok to start re-enabling the linux32 tests that we disabled last week.

The linux32 desktop tests should be back live in production soon, and certainly sometime today.
Comment 25 Justin Wood (:Callek) (Away until Aug 29) 2013-01-14 09:18:56 PST
Created attachment 701849 [details] [diff] [review]
[configs] re-enable
Comment 26 Justin Wood (:Callek) (Away until Aug 29) 2013-01-14 09:23:19 PST
Comment on attachment 701849 [details] [diff] [review]
[configs] re-enable

http://hg.mozilla.org/build/buildbot-configs/rev/98b27f79a36d
Comment 27 Aki Sasaki [:aki] 2013-01-14 10:35:48 PST
This is in production.
Comment 28 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-14 12:00:24 PST
Sorry for the reply lag.

(In reply to John O'Duinn [:joduinn] from comment #23)
> cjones: from newsgroups, there was a question about whether B2G still needed
> the crashtest-ipc test suite run on linux32, or if this was now covered by
> other suites.

B2G doesn't (directly) need crashtest-ipc on linux32.  However, those tests along with reftest-ipc are the only thing that keeps cross-process graphics somewhere close to working on desktop builds.  That's not a shipping configuration, but it's important for developers.  There's also a project on the back burner that wants this.
Comment 29 Ed Morley [:emorley] 2013-01-16 05:01:07 PST
Linux32 try pending counts are pretty bad again:
http://builddata.pub.build.mozilla.org/reports/pending/pending_test_try_day.png
Comment 30 Mark Banner (:standard8) 2013-01-28 05:38:37 PST
(In reply to Justin Wood (:Callek) from comment #26)
> Comment on attachment 701849 [details] [diff] [review]
> [configs] re-enable
> 
> http://hg.mozilla.org/build/buildbot-configs/rev/98b27f79a36d

This never re-enabled tests for the Thunderbird tree(s).
Comment 31 Justin Wood (:Callek) (Away until Aug 29) 2013-01-28 08:32:50 PST
Created attachment 707091 [details] [diff] [review]
[configs] v1 - re-enable for TB as well.

with apologies to the TB team
Comment 32 Justin Wood (:Callek) (Away until Aug 29) 2013-01-28 09:03:28 PST
Comment on attachment 707091 [details] [diff] [review]
[configs] v1 - re-enable for TB as well.

http://hg.mozilla.org/build/buildbot-configs/rev/b30523f75d91
Comment 33 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2013-01-30 12:28:55 PST
Yesterday we did 48,302 test jobs. Help is on the way with bug#835955 live in production before the end of this week.
Comment 34 Chris AtLee [:catlee] 2013-04-10 15:08:20 PDT
still critical?
Comment 35 Phil Ringnalda (:philor) 2013-04-10 21:10:56 PDT
Neither critical nor an issue.
