Closed Bug 820299 Opened 7 years ago Closed 7 years ago

Please enable DEBUG b2g emulator test runs on b2g18 and cedar


(Release Engineering :: General, defect)

Gonk (Firefox OS)
Not set


(Not tracked)



(Reporter: cjones, Assigned: aki)



(Whiteboard: [b2g])


(4 files)

--enable-debug builds help catch a lot of dumb bugs, and they're additionally extremely valuable for developers testing patches locally.  We should keep these working reliably by enabling test runs for these builds on tbpl.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
Whiteboard: [b2g]
Which b2g builds do you want these enabled on? Or do you want a debug variant of one of the existing build types?
Currently we have

B2G Arm opt    Bg 1 2 3 4 5 6 R1 R2 R3 R4 R5 R6 Mn
B2G Arm debug  Bg

The request of this bug is to make this

B2G Arm opt    Bg 1 2 3 4 5 6 R1 R2 R3 R4 R5 R6 Mn
B2G Arm debug  Bg 1 2 3 4 5 6 R1 R2 R3 R4 R5 R6 Mn

that is, parity between opt and debug testing.
Summary: Please enable DEBUG test runs on b2g builds → Please enable DEBUG emulator test runs on b2g builds
What kind of priority does this have compared to our other tasks?
I'm currently assuming we should finish C2 tasks first, and this is in the C3 set.
Yes.  We'll need some platform work in parallel to get these green.
Assignee: nobody → aki
Testing this in staging.
We'll be running with the exact same scripts+commandline options as opt, just using the debug installer+test zip.  I think this is sufficient; the staging run should tell me if that's accurate.
Ok, this works (some green, some orange).

Something to consider: we're already seeing bad wait times on Fedora 32, due to the combined load of desktop Firefox linux 32 testing + B2G emulator testing.

Compare Fedora 32 vs Fedora 64, and you'll see just over 25% of Fedora 32 jobs are picked up by a free test slave within 15 minutes, as opposed to 40% for Fedora 64:

fedora: 6310
  0:     1631    25.85%
 15:      882    13.98%
 30:      494     7.83%
 45:      396     6.28%
 60:      357     5.66%
 75:      255     4.04%
 90+:     2295    36.37%

fedora64: 4513
  0:     1827    40.48%
 15:      673    14.91%
 30:      439     9.73%
 45:      351     7.78%
 60:      217     4.81%
 75:      142     3.15%
 90+:      864    19.14%

This is due to the larger load (6,310 jobs yesterday versus 4,513 on 64bit).  If that ~1800 job differential is due to emulator tests, turning on debug tests will increase this number to ~8110 and awful wait times will become even worse.
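The projection above is simple arithmetic. A minimal sketch, assuming (as the comment does) that the whole gap between the two pools is emulator load and that debug runs would add about the same number of jobs again; the variable names are illustrative only:

```python
# Job counts quoted above for one day of test-slave load.
fedora32_jobs = 6310
fedora64_jobs = 4513

# Assumption from the comment: the entire gap is B2G emulator (opt) testing.
emulator_jobs = fedora32_jobs - fedora64_jobs

# Assumption: debug emulator runs would add roughly the same number of jobs again.
projected_fedora32_jobs = fedora32_jobs + emulator_jobs

print(emulator_jobs)            # 1797, the "~1800" above
print(projected_fedora32_jobs)  # 8107, roughly the "~8110" above
```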

Do we want debug tests per-checkin?  Nightly-only?  Only on certain branches?  (this will mean they'll be hidden on tbpl, but can be viewed with ?noignore=1 in the url).

If we want per-checkin on all branches, I recommend waiting til bug 818833 is closed.  (The linux32 wait times issue is being tracked in bug 818833.)
Blocks: toodamnhigh!
Comment on attachment 691568 [details] [diff] [review]
turn debug emulator tests on on cedar

Turning debug tests on for Cedar only shouldn't add much load, and it will show us which test chunks turn orange.
Attachment #691568 - Flags: review?(bhearsum)
Attachment #691568 - Flags: review?(bhearsum) → review+
Chris, needinfo on frequency of debug tests.  (see comment 7)
Flags: needinfo?(jones.chris.g)
The problem is that if they're not run on every checkin on at least try, inbound, and m-c, then developers will break the tests without knowing it, and that's not fair to anyone.

Let's start with every checkin on mozilla-b2g18, and turn them on in other branches after bug 818833, I guess.
Flags: needinfo?(jones.chris.g)
jgriffin, ahal:

Looks like we have green mochitests and orange everything else on Cedar.
Do we have a way to limit tests for debug test runs?
Wondering if we can+should do something to make the tests less perma-orange before enabling elsewhere.
Current plan is to enable mochitests on b2g18, and leave the others til we can resolve their perma-orangeness.
I'm going to turn on a large set of layout mochitests today (bug 815416).  I'll land this change on cedar too; let's wait until we have a green run there before enabling debug mochitests on b2g18.
This avoids turning on the perma-orange reftests/xpcshell/marionette webapi tests.
We should wait to land this til after jgriffin lands and we see Cedar go green.
Attachment #693602 - Flags: review?(bhearsum)
Attachment #693602 - Flags: review?(bhearsum) → review+
We have a timed out mochitest-6 on debug on cedar; rekicked to see if that's real.
It looks real, and just due to the fact that we're hitting buildbot's hard 60-minute timeout.  We can probably resolve this by adding another chunk or two to mochitests.
(In reply to Jonathan Griffin (:jgriffin) from comment #18)
> It looks real, and just due to the fact that we're hitting buildbot's hard
> 60-minute timeout.  We can probably resolve this by adding another chunk or
> two to mochitests.

If optimal chunk time is ~20-30 minutes then we should be adding another 4-6 chunks :(
You still have 10 minutes of setup/teardown overhead, so you still need to completely forget about 20 minute chunks.

If I'm reading things right, desktop debug non-mozharness has a maxTime (which becomes script_maxtime) of 2 hours, and desktop mozharness has a script_maxtime of 2 hours, but these are getting the default 1 hour, which isn't enough (it's enough time to run the tests, but not enough time to grovel through the logs counting up passes and fails).
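The point about fixed setup/teardown overhead can be sketched as follows. This is a rough model, not anything from the harness or buildbot: the function name and the 300-minute total are hypothetical, and only the 10-minute overhead figure comes from the comment above.

```python
import math

def chunks_needed(total_test_minutes, target_job_minutes, overhead_minutes=10):
    """Chunks needed so each job (overhead + its share of tests) fits the target."""
    test_budget = target_job_minutes - overhead_minutes
    if test_budget <= 0:
        raise ValueError("target must exceed the fixed per-job overhead")
    return math.ceil(total_test_minutes / test_budget)

# Hypothetical input: 300 minutes of actual test time across the whole suite.
for target in (20, 30, 60):
    print(target, chunks_needed(300, target))
```

Under these assumptions a 20-minute job leaves only 10 minutes for tests, so the chunk count grows much faster than the target time shrinks.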
I actually just noticed that chunks 1-5 have <1052 tests while chunk six has >32000. The best solution would be to even out the chunking.
Isn't the chunking in the test harness itself?
Yeah. I think this is happening because it is taking skipped tests into account when chunking. My bet is there is a bug in the mochitest harness that needs to be fixed (it probably doesn't take into account the manifest specified by --run-only-tests).
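A toy illustration of the suspected problem, assuming the harness splits the full test list into chunks before dropping skipped tests. This is not the mochitest harness's code; the `chunk` helper and the test names are hypothetical.

```python
def chunk(items, total_chunks, this_chunk):
    """Return the 1-based `this_chunk`-th of `total_chunks` contiguous slices."""
    per_chunk, extra = divmod(len(items), total_chunks)
    start = (this_chunk - 1) * per_chunk + min(this_chunk - 1, extra)
    end = start + per_chunk + (1 if this_chunk <= extra else 0)
    return items[start:end]

# Hypothetical test list: only the last 4 of 12 tests are enabled by the manifest.
all_tests = [f"test_{i:02d}" for i in range(12)]
enabled = set(all_tests[8:])

# Suspected buggy order: chunk the full list, then drop skipped tests.
chunk_then_filter = [
    [t for t in chunk(all_tests, 3, n) if t in enabled] for n in (1, 2, 3)
]

# Alternative order: drop skipped tests first, then chunk what remains.
runnable = [t for t in all_tests if t in enabled]
filter_then_chunk = [chunk(runnable, 3, n) for n in (1, 2, 3)]

print([len(c) for c in chunk_then_filter])  # [0, 0, 4]
print([len(c) for c in filter_then_chunk])  # [2, 1, 1]
```

If the enabled tests cluster at one end of the list, chunking before filtering piles them all into one chunk, which would match the lopsided counts reported above.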
Is it doing --chunk-by-dirs the way desktop does? If so, you're not going to be able to split up that M6 much, since by far the bulk of it is the brutally detailed layout/style/. If not, then it's totally unpredictable which chunk a test will be in, and people will either run every chunk on try or, more likely, won't and will land busted patches. (Or maybe not passing --chunk-by-dirs 4 gives some less optimal default and it still chunks by dir, just more poorly?)

Also, "tests" is a pretty loose measure of load, since you can compare 1000 test values to reference values by either is(t[1], r[1], "t[1] isn't equal to r[1]") etc., producing 1000 "tests", or accum += t[1] - r[1] ... accum += t[1000] - r[1000]; is(accum, 0, "something was different"), producing 1 "test". Actual runtime between " | Running tests: start" and " | Running tests: end" might be a better measure, and that's around 45 minutes for the 597 "tests" in M1, and 55 minutes for the 34852 "tests" in M6.
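The difference between the two styles can be sketched with a stand-in counter. The `Counter` class is hypothetical and only mimics a harness that counts each is() call as one "test"; the sketch sums absolute differences so that offsetting errors cannot cancel, which is a small departure from the expression quoted above.

```python
class Counter:
    """Hypothetical stand-in for a harness that counts each is() call as one "test"."""

    def __init__(self):
        self.checks = 0

    def is_(self, actual, expected, message):
        self.checks += 1
        assert actual == expected, message

test_values = list(range(1000))
reference_values = list(range(1000))

# Style 1: one is() per value, reported as 1000 "tests".
per_value = Counter()
for i, (t, r) in enumerate(zip(test_values, reference_values)):
    per_value.is_(t, r, f"t[{i}] isn't equal to r[{i}]")

# Style 2: accumulate the differences, then a single is(), reported as 1 "test".
accumulated = Counter()
accum = sum(abs(t - r) for t, r in zip(test_values, reference_values))
accumulated.is_(accum, 0, "something was different")

print(per_value.checks, accumulated.checks)  # 1000 1
```

Both loops do the same comparison work, which is why the reported "test" count says little about how long a chunk takes.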

layout/style/ is brutal, and there's no real reason to believe that this emulator is twice as fast at running debug builds as the Win2K3 slaves for which we set a 2 hour maxTime on desktop were. If that desktop maxTime had already been ported over, and M6 was green and taking 70 minutes, while the others were green and taking just under 60, how concerned would you be? On b2g18, where opt M6 already takes 40 minutes, that's another half hour; in the real world, it's still vastly faster than end-to-end on Windows.
Attachment #694518 - Flags: review?(armenzg) → review+
Comment on attachment 693602 [details] [diff] [review]
turn on b2g emulator debug mochitests on b2g18
Attachment #693602 - Flags: checked-in+
Comment on attachment 694518 [details] [diff] [review]
bump mozharness test default script_maxtime to 2 hours
Attachment #694518 - Flags: checked-in+
Mochitests are now being scheduled on b2g18.
Attachment #694926 - Flags: review?(armenzg)
Attachment #694926 - Flags: review?(armenzg) → review+
in production
Can this now be closed?
Flags: needinfo?(aki)
(In reply to Andrew Overholt [:overholt] from comment #33)
> Can this now be closed?

We currently have debug emulator tests (minus reftests, marionette-webapi, and crashtests) on b2g18.

We can either keep this open til we have some or all of those ready, which may take a while (they're not live on b2g18 because they're not currently giving useful information -- perma-orange or red).
Or we can resolve and file new bugs if we want those other test suites in the future.

I'm fine with either approach.
Flags: needinfo?(aki)
I managed to unstick some of the reftest emulator debug tests, but they go over the 2 hour timeout per job. Not sure if we'll ever get to a point where we can run these on emulators. At least not without some substantial work put into the reftest harness to speed things up (if that's even possible).
Since there's no immediate action here, and no one seems to be clamoring for the missing/broken test suites, I'm going to resolve this bug.
We can file new bugs for the other test suites, but those may be long-lived bugs depending on how long it takes to get those test suites to report anything useful.
Closed: 7 years ago
Resolution: --- → FIXED
Thought this was talking about trunk from bugmail; clarifying summary.
Summary: Please enable DEBUG emulator test runs on b2g builds → Please enable DEBUG b2g emulator test runs on b2g18 and cedar
Product: → Release Engineering
Component: General Automation → General