Increase b2g desktop mochitest chunk size to 2 due to oversized log

RESOLVED WONTFIX

Status

Release Engineering
General Automation
4 years ago
4 years ago

People

(Reporter: ahal, Assigned: ahal)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

4 years ago
The b2g desktop mochitest jobs are running so many tests that we are more or less at the 50 MB log limit. We should spend some time looking into ways to make the log smaller, but in the meantime we are starting to see several intermittents. We should chunk the job to get around this in the short term.
(Assignee)

Comment 1

4 years ago
Created attachment 8356223
Patch 1.0 - increase b2g desktop mochitest chunks to 2
Attachment #8356223 - Flags: review?(aki)

Comment 2

4 years ago
Comment on attachment 8356223
Patch 1.0 - increase b2g desktop mochitest chunks to 2

lgtm
Attachment #8356223 - Flags: review?(aki) → review+
(Assignee)

Comment 3

4 years ago
https://hg.mozilla.org/build/buildbot-configs/rev/070f36bc24b6

Comment 4

4 years ago
In production (not CCed).

Comment 5

Sadly, the following failures are now being seen since we chunked the job:
https://tbpl.mozilla.org/php/getParsedLog.php?id=32641891&tree=Mozilla-Inbound

{
9479 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/general/test_domWindowUtils.html | Test timed out.
13027 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_after_paint_pref.html | Test timed out.
13253 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug423523.html | Caret should be at offset 1
13724 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug558663.html | before and middle should not be the same snapshots compare incorrectly. snapshot 1: <snip>
13725 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug558663.html | before and middle2 should not be the same snapshots compare incorrectly. snapshot 1: <snip>
13726 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug558663.html | middle and end should not be the same snapshots compare incorrectly. snapshot 1: <snip>
13727 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug558663.html | middle2 and end should not be the same snapshots compare incorrectly. snapshot 1: <snip>
14122 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Should have got mousemove - got 0, expected 1
14123 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Should have got mousemove - got 0, expected 2
14124 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Shouldn't have got mousemove - got 0, expected 2
14125 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Should have got mousemove - got 0, expected 1
14126 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Should have got mousemove - got 0, expected 3
14127 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Shouldn't have got mousemove - got 0, expected 1
14128 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug582771.html | Should have got mousemove - got 0, expected 4
14191 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug677878.html | Test timed out.
14237 ERROR TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_bug761572.html | Test timed out.
14238 ERROR TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | 4 test timeouts, giving up.
}
(Assignee)

Comment 6

4 years ago
I backed out the patch for now, but we still need to merge it to production.. I'd rather let releng handle that:
http://hg.mozilla.org/build/buildbot-configs/rev/d08f0b276cdc

Comment 7

4 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #6)
> I backed out the patch for now, but we still need to merge it to
> production.. I'd rather let releng handle that:
> http://hg.mozilla.org/build/buildbot-configs/rev/d08f0b276cdc

The reconfig will be happening soon.

Comment 8

Andrew, don't suppose you've had any luck figuring out the test failures? Bug 951771 is occurring extremely frequently now (more so than the in-bug TBPLbot comments would imply), so would be good to get this chunked soon if we can :-)
(Assignee)

Comment 9

4 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #8)
> Andrew, don't suppose you've had any luck figuring out the test failures?
> Bug 951771 is occurring extremely frequently now (more so than the in-bug
> TBPLbot comments would imply), so would be good to get this chunked soon if
> we can :-)

I decided to go the route of hiding subtests in bug 957768 instead of chunking. That bug should go live when the next reconfig happens, though I still haven't seen it working in the wild (I'm waiting for some jobs to finish on Cypress), so I may need some follow-up fixes.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX

Comment 10

Great, thank you :-)
(Assignee)

Comment 11

4 years ago
Going to reopen this just to see whether it improves the situation in bug 967588 at all.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
(Assignee)

Updated

4 years ago
Blocks: 967588
No longer blocks: 951771
(Assignee)

Comment 12

4 years ago
Oh, apparently they were already unhidden; the other bug just hadn't been resolved. We may still want to revisit this in the future, but re-resolving for now.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX

Comment 13

Nope and nope - we haven't permanently switched all slaves back to m1, and the switch to m1 didn't actually entirely cure this - e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=34483029&tree=Mozilla-Inbound on an m1 instance. Somebody would have to carefully count the percentage of runs on -spot- versus -ec2- and the percentage failing on each to see whether a permanent switch back to m1 would make them meet the letter of the law with a 9.9% failure rate, but they still fail by hanging in random tests.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---

Comment 14

(In reply to Phil Ringnalda (:philor) from comment #13)
> Nope and nope - we haven't permanently switched all slaves back to m1, and the
> switch to m1 didn't actually entirely cure this - e.g.
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34483029&tree=Mozilla-Inbound
> on an m1 instance. Somebody would have to carefully count the percentage of
> runs on -spot- versus -ec2- and the percentage failing on each to see whether
> a permanent switch back to m1 would make them meet the letter of the law with
> a 9.9% failure rate, but they still fail by hanging in random tests.

On b2g-inbound, there have been 0 failures of this type on Linux64 on spot instances; all the failures (other than the one that resulted in a backout) have been on on-demand instances.  I think switching the on-demand instance type to m1.medium will allow us to unhide these.

Comment 15

(In reply to Jonathan Griffin (:jgriffin) from comment #14)
> 
> On b2g-inbound, there have been 0 failures of this type on Linux64 on spot
> instances; all the failures (other than the one that resulted in a backout)
> have been on on-demand instances.  I think switching the on-demand instance
> type to m1.medium will allow us to unhide these.

This is since yesterday morning, I should add.

Comment 16

Right. It's more like a couple a day on mozilla-inbound (plus the hangs in a non-random test), but we've got pretty good momentum going for starring things as just "b" and not retriggering.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX