Bug 818833 (toodamnhigh!)

Pending count for Linux32 test slaves is too high

RESOLVED FIXED

Status

Release Engineering
General Automation
--
critical
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: emorley, Assigned: joduinn)

Tracking

({sheriffing-P1})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [capacity])

Attachments

(3 attachments)

I would imagine this is a combination of:
* "Linux used to be the fastest platform to request on Try, so if I'm testing just one platform, I'll request linux (32) in my trychooser syntax"
* B2G emulator uses linux slaves

~ Non-try:
Pending test(s) @ Dec 06 02:05:02
linux (70)
   25 mozilla-aurora
   24 mozilla-inbound
   11 ionmonkey
   10 mozilla-central
linux64 (22)
    9 mozilla-beta
    8 ionmonkey
    5 mozilla-inbound
mac10.6-rev4 (13)
   13 mozilla-inbound
mac10.7 (13)
   13 mozilla-inbound
mac10.8 (2)
    2 mozilla-inbound
winxp (5)
    5 ionmonkey

~ Try:
Pending test(s) @ Dec 06 02:05:02
linux (1057)
 1057 try
linux64 (14)
   14 try
win7 (89)
   89 try
winxp (27)
   27 try
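
For anyone who wants to track these numbers over time, here is a rough sketch that parses a pending-count report in the layout pasted above and works out each platform's share of the queue. The names `parse_pending` and `platform_share` are made up for illustration and are not part of any existing releng tool. The format handling is an assumption based only on the sample in this comment.

```python
import re

# Sample input copied from the Try report above. The parsing rules are an
# assumption inferred from this one sample, not a documented format.
TRY_REPORT = """\
linux (1057)
 1057 try
linux64 (14)
   14 try
win7 (89)
   89 try
winxp (27)
   27 try
"""

# A platform header looks like "linux (1057)".
HEADER = re.compile(r"^(\S+) \((\d+)\)$")
# A branch row is indented and looks like "   25 mozilla-aurora".
ROW = re.compile(r"^\s+(\d+) (\S+)$")


def parse_pending(text):
    """Return (totals, branches): per-platform totals and per-branch counts."""
    totals = {}
    branches = {}
    platform = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            platform = header.group(1)
            totals[platform] = int(header.group(2))
            branches[platform] = {}
            continue
        row = ROW.match(line)
        if row and platform is not None:
            branches[platform][row.group(2)] = int(row.group(1))
        # Anything else (e.g. the "Pending test(s) @ ..." line) is ignored.
    return totals, branches


def platform_share(totals, platform):
    """Fraction of all pending jobs that belong to one platform."""
    return totals[platform] / sum(totals.values())


totals, branches = parse_pending(TRY_REPORT)
print(f"linux32 share of pending Try tests: {platform_share(totals, 'linux'):.1%}")
```

Run against the Try numbers above, this shows how heavily the queue is skewed towards linux32 compared with the other platforms.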
jgriffin, is there a bug filed for trying to use the emulator on linux64 slaves too? Happy to file one if not :-)
Flags: needinfo?(jgriffin)
We should really attempt to move them to AWS. The rev3 minis won't scale even if you split them between two of the platforms.

Can we encourage people to use fedora64 on the try server?
So far the pool looks OK: there aren't many test slaves that are hung or waiting for a reboot. The running jobs look sane as well.
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #1)
> jgriffin, is there a bug filed for trying to use the emulator on linux64
> slaves too? Happy to file one if not :-)

No, there's no such bug yet.
Flags: needinfo?(jgriffin)
Depends on: 818968
Currently I have to debug marionette code on inbound. That's BAD!

I pushed to try but even after 24 hours the marionette tests were still pending.
16 out of 35 tests fail locally on my linux machine on mc tip and nobody knows why.

We need try-server coverage for marionette tests!
(In reply to Jonathan Griffin (:jgriffin) from comment #4)
> (In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #1)
> > jgriffin, is there a bug filed for trying to use the emulator on linux64
> > slaves too? Happy to file one if not :-)
> 
> No, there's no such bug yet.

Filed bug 818968.

(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #0)
> I would imagine this is a combination of:
> * "Linux used to be the fastest platform to request on Try, so if I'm
> testing just one platform, I'll request linux (32) in my trychooser syntax"

Posted to dev.platform to try and reverse this habit:
https://groups.google.com/d/msg/mozilla.dev.platform/XcVw9IeUXVU/wvHM0OjCN5MJ
Whiteboard: [buildduty] → [buildduty][capacity]
At a similar time to yesterday, we're now up to 1500 pending linux32 Try jobs (from ~1000) :-(
We now have Try jobs that are still pending after 1 day 20 hours :-(
Mandatory meme :-)

http://www.quickmeme.com/meme/3s585s/
Depends on: 820958

Comment 10

5 years ago
The current fix in line for this is turning off linux32 desktop tests on m-c and project branches.  Newsgroup posts incoming.
(In reply to Aki Sasaki [:aki] from comment #10)
> The current fix in line for this is turning off linux32 desktop tests on m-c
> and project branches.  Newsgroup posts incoming.

Maybe m-i and branches? m-c is not so overloaded.

Comment 12

5 years ago
That would just lead to them being hidden.
Depends on: 821012

Updated

5 years ago
Depends on: 820299
Totally true, and only going to get worse as more B2G test suites are figured out and start running in production. Meanwhile, new hardware is still months out.

We've proposed some short-term plans for reducing load on linux32 test slaves in the dev.planning newsgroup and at the dev platform meetings (last week and today). Details are still being worked out, so grabbing this for now.
Assignee: nobody → joduinn
Keywords: sheriffing-P1
B2G marionette-webapi is getting broken quite frequently - and is broken again this morning - and the high levels of coalescing often mean I've had to close the tree to bisect due to the large range. I suspect the bustage would have been more obvious (and thus in today's case not left in for 45 pushes) if more jobs had been run and the oranges had shown up on TBPL.

Thank you for looking at this :-)
Release branches keep on stealing linux32 test slaves (meaning the marionette retriggers on inbound are still pending and holding inbound closed), so I've temporarily closed aurora, beta, b2g18, esr18 and esr10 to ensure we can get inbound open sooner rather than later.
There are ~4 linux32 machines that haven't taken a job in anywhere from 10hrs to a day+.

Could you give them a kick? :-)

http://build.mozilla.org/builds/last-job-per-slave.html
Blocks: 819044

Updated

5 years ago
Depends on: 823642
Depends on: 822924
Blocks: 772458
The chronic issue of not having enough capacity is not an issue that buildduty can deal with. We'll continue to deal with it acutely by unsticking slaves and the like, but I'm removing [buildduty] because this is unactionable for buildduty.
Whiteboard: [buildduty][capacity] → [capacity]
I'm currently going through the buildduty queue and the last-job-per-slave list and kicking machines. I've already fixed up a couple of the fed ones.
That's great - thank you :-)

Updated

4 years ago
Depends on: 828198
At this point, we've done the following:
1) adjusted the priority of b2g jobs
2) disabled any known broken test jobs (which are just wasting cpu cycles)
3) scavenged additional test machines from others

...which helped. A bit. But not enough to offset:
1) traditionally the busiest week of the year is the first week after Christmas/New Year's, i.e. now.
2) the b2g workweek in progress this week, which is also the last one before 15 Jan, so a spike in b2g traffic.
3) the lack of replacement machines to keep up with load, which are still some way off from being delivered/online.

As a short-term emergency move, we're disabling linux32 desktop test jobs effective immediately until we get through this week.

Note: given today's FF18.0/FF10esr/FF17esr releases, we will leave linux32 desktop test jobs enabled on mozilla-aurora/beta/release/esr10/esr17. These branches combined are ~7% of load, so not a significant load, but they are important to have in case we need to chemspill.

Note: this change is for linux32 *test* jobs only. linux32 builds continue as usual. Also, linux64 builds and tests continue as usual.
Created attachment 699786 [details] [diff] [review]
[configs] per c#20
Attachment #699786 - Flags: review?(bhearsum)
Attachment #699786 - Flags: review?(bhearsum) → review+
Comment on attachment 699786 [details] [diff] [review]
[configs] per c#20

This was landed and put into production a couple of hours ago.
Attachment #699786 - Flags: checked-in+
cjones: from newsgroups, there was a question about whether B2G still needed the crashtest-ipc test suite run on linux32, or if this was now covered by other suites.

Per comment#20, we disabled linux32 test suites, including crashtest-ipc, on most branches last night, in order to improve b2g test wait times. Let us know if you need this re-enabled.
Per bmoss, the volume and urgency of b2g checkins have decreased to the point that we are now OK to start re-enabling the linux32 tests that we disabled last week.

The linux32 desktop tests should be back live in production soon, and certainly sometime today.
Created attachment 701849 [details] [diff] [review]
[configs] re-enable
Attachment #701849 - Flags: review?(bhearsum)
Attachment #701849 - Flags: review?(bhearsum) → review+
Comment on attachment 701849 [details] [diff] [review]
[configs] re-enable

http://hg.mozilla.org/build/buildbot-configs/rev/98b27f79a36d
Attachment #701849 - Flags: checked-in+

Comment 27

4 years ago
This is in production.
Sorry for the reply lag.

(In reply to John O'Duinn [:joduinn] from comment #23)
> cjones: from newsgroups, there was a question about whether B2G still needed
> the crashtest-ipc test suite run on linux32, or if this was now covered by
> other suites.

B2G doesn't (directly) need crashtest-ipc on linux32.  However, those tests along with reftest-ipc are the only thing that keeps cross-process graphics somewhere close to working on desktop builds.  That's not a shipping configuration, but it's important for developers.  There's also a project on the back burner that wants this.
Depends on: 830923
No longer depends on: 828198
Linux32 try pending counts are pretty bad again:
http://builddata.pub.build.mozilla.org/reports/pending/pending_test_try_day.png
Alias: toodamnhigh!
(In reply to Justin Wood (:Callek) from comment #26)
> Comment on attachment 701849 [details] [diff] [review]
> [configs] re-enable
> 
> http://hg.mozilla.org/build/buildbot-configs/rev/98b27f79a36d

This never re-enabled tests for the Thunderbird tree(s).
Created attachment 707091 [details] [diff] [review]
[configs] v1 - re-enable for TB as well.

with apologies to the TB team
Attachment #707091 - Flags: review?(bhearsum)
Attachment #707091 - Flags: review?(bhearsum) → review+
Comment on attachment 707091 [details] [diff] [review]
[configs] v1 - re-enable for TB as well.

http://hg.mozilla.org/build/buildbot-configs/rev/b30523f75d91
Attachment #707091 - Flags: checked-in+
Yesterday we did 48,302 test jobs. Help is on the way, with bug#835955 due to go live in production before the end of this week.
Depends on: 835955
Depends on: 842629
No longer depends on: 842629
Depends on: 843054
Depends on: 843229
Depends on: 784913
still critical?
Neither critical nor an issue.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering