Closed Bug 818576 Opened 12 years ago Closed 12 years ago

Tests for "Android 2.2 debug" (on cedar)

Categories

(Release Engineering :: General, defect)

Hardware: ARM
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: Callek)

References

Details

Attachments

(1 file)

We have "Android 2.2 debug" builds on tbpl, but no tests are run against them. Tests run against debug builds might provide better diagnostics for some of our on-going intermittent failures. It looks like there is some history to this issue; see bug 693015.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
Someone smart recently suggested that we might move this forward by allowing debug tests to be requested on try, warts and all, leaving the bigger task of getting all-green debug test runs on m-i, etc. for another day.
(In reply to Geoff Brown [:gbrown] from comment #1)
> Someone smart recently suggested that we might move this forward by allowing
> debug tests to be requested on try, warts and all, leaving the bigger task
> of getting all-green debug test runs on m-i, etc. for another day.

This would be fantastic. A 'matching' Try run (from trychooser) could skip these, but it'd be great to be able to run them if needed.
That is fine, but 100% of the tests will time out, which means we will not get full coverage. How will getting debug tests help us? The mochitest.log file looks the same, and logcat only captures a small portion of the data since it is a circular buffer.
Debug tests are really important because throughout the code, there are assertions and #ifdef DEBUG snippets which work to guarantee that the code is functioning and being used properly. Especially since fennec and b2g sometimes use unique codepaths (which aren't used by any desktop platform), this mobile-only code is getting no DEBUG testing, and therefore no assertions and no extra debug-only checks.

This means that when things go wrong, they go off the rails instead of hitting an assertion at a useful stage. Tests coming back green means things appear to be working, but we could actually be skipping important assertions. Tests can also come back green intermittently, where they might have gone orange from hitting asserts. Running a DEBUG build locally, I've had this happen a couple of times.

We need this coverage.
That is good to know. To make tests report green, we will have to modify the harnesses and the buildbot runners. They run way too long, and we have timers that kill a test after 1 hour; in fact, as we run things now, the tests will run for 2+ hours. I believe on android the buildbot step kills it at 2 hours, and for mochitest the harness kills it after an hour. Changing these timeouts just because could really eat up a lot of machine time if somebody checked in a bad change (imagine 400 jobs sitting around for an extra hour because we made the timeout longer).

There are a couple solutions I can think of for this:
1) adjust the timeouts based on a --debugBuild flag to the harness
2) adjust buildbot to run smaller chunks (15 chunks for mochitest instead of 8, etc...)
3) selectively run tests we really really care about in debug mode and ignore the rest.
4) all of the above

We will need releng to work on this as I don't have access or the ability to turn on debug tests and make necessary buildbot adjustments to tweak how we run the tests.
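To make option 1 concrete, here is a minimal sketch of how a debug-build flag could scale a harness timeout; the option names, the 3x multiplier, and the plumbing are assumptions for illustration, not the actual mochitest code:

from argparse import ArgumentParser

DEBUG_TIMEOUT_MULTIPLIER = 3  # assumption: debug builds run roughly 3x slower than opt

def build_parser():
    # hypothetical option wiring; the real mochitest harness options differ
    parser = ArgumentParser()
    parser.add_argument("--debugBuild", action="store_true", default=False,
                        help="application under test is a debug build")
    parser.add_argument("--run-timeout", type=int, dest="run_timeout", default=3600,
                        help="seconds before the harness kills the run")
    return parser

def effective_timeout(options):
    # scale the per-run timeout only for debug builds, so opt runs keep the existing 1 hour cap
    timeout = options.run_timeout
    if options.debugBuild:
        timeout *= DEBUG_TIMEOUT_MULTIPLIER
    return timeout

options = build_parser().parse_args(["--debugBuild"])
print(effective_timeout(options))  # 10800 seconds when --debugBuild is passed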
Do we actually do something vaguely approximating useful with assertions now, something that we did not do the last time we ran debug tests on Android? You know, like, log them within a thousand lines of where they actually happened?
Fatal assertions are logged fine: https://tbpl.mozilla.org/php/getParsedLog.php?id=18580536&tree=Try

I don't know about non-fatal assertions.
Yeah, that's a nice looking MOZ_ASSERT in a desktop build. How would you feel about having only the crash and the crash stack, with the "Assertion failure: dest->GetSize() == src->Size(), at ../../../gfx/gl/GLScreenBuffer.cpp:422" line appearing about 400 lines below the bottom of the stack, so that rather than an assertion failure it looks like an actual crash, and only two or three people involved in the project know where to look to see that it was really an assertion failure and what assertion it was?

And you can only have even that unhappy state for mochitests, not for crashtests or reftests, since those count assertions per-test, and the only reason your MOZ_ASSERT would be sure to appear 400 lines below is that you're crashing and thus cutting off the partial logcat, ensuring you'll be there in the relatively few lines that get printed. For crashtests and reftests, you'd have a test failure, "assertion count 3 is more than expected 1 to 2", with absolutely no way of knowing what assertions failed unless you happened to be looking at one of the last tests in the run.
Depends on: 682129
OS: Mac OS X → Android
Hardware: x86 → ARM
While what philor proposes isn't ideal, Jeff's right that this stuff is important. It's high-leverage for automatically finding bugs closer to the time they're landed, when they're much cheaper to fix, because the introducer won't have swapped out the relevant mental context. So any step that moves us in the right direction seems worth doing to me.
releng folks (armenzg, callek, kmoir): I have outlined some issues with running debug tests in automation (in comment #5, https://bugzilla.mozilla.org/show_bug.cgi?id=818576#c5); we need resolutions to these before moving forward. I would personally prefer the greater-chunking approach over longer timeouts: the longer a test runs, the greater the chance that we hit a hardware or connectivity issue.

Can the developers here give us a top-priority list of which tests (not test suites) to run? Do we care about dom-level* (which never fail)? Do we care about crashtests or jsreftests?

We are doing OK with our tegra pool now; turning on debug tests will only increase wait times and cause more delay. Should these only run on inbound and try? Can we limit the number of jobs per push to 10? With the volume on inbound, that seems like enough to give us results vs. take out the entire pool.
(In reply to Joel Maher (:jmaher) from comment #5)
> There are a couple solutions I can think of for this:
> 1) adjust the timeouts based on a --debugBuild flag to the harness

I can't think of a good way to do this [and propagate to buildbot] off the top of my head.

> 2) adjust buildbot to run smaller chunks (15 chunks for mochitest instead of
> 8, etc...)

Fwiw, :philor already noticed yesterday that mochi-8 for devices is running end to end in < 1 minute, and thought at first that the device was rebooting in the middle given how short the "Application ran for" was. Increasing the chunk count further, which greatly increases our overall pool load (setup/teardown time), is a no-go for me.

> 3) selectively run tests we really really care about in debug mode and
> ignore the rest.

This sounds like the best choice to me.

> We will need releng to work on this as I don't have access or the ability to
> turn on debug tests and make necessary buildbot adjustments to tweak how we
> run the tests.

Once we know exactly what suites, and where, this can then be actionable by releng.
mochitest-8 runs for 12 minutes on inbound for android 2.2 opt
mochitest-7 runs for 9 minutes on inbound for android 2.2 opt
mochitest-8 runs for 8 minutes on inbound for android 4.0
mochitest-7 runs for 5 minutes on inbound for android 4.0

While we do have a need to do fewer chunks on our tests right now, I cannot find a 1-minute runtime on android.
My example was: https://tbpl.mozilla.org/php/getParsedLog.php?id=19027505&tree=Mozilla-Beta

Application ran for: 0:00:53.128938
with whole step elapsed: 1 mins, 16 secs
with whole job:
Started at 2013-01-22 14:17:17.721565
Finished at 2013-01-22 14:35:08.004424
Elapsed: 17:50.282859

% of time spent doing testing vs time spent in rest of job: 4.96%
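For reference, those numbers check out: the whole job ran from 14:17:17.7 to 14:35:08.0, i.e. roughly 1070.3 seconds, of which the application was running for only 53.1 seconds, and 53.1 / 1070.3 ≈ 4.96%.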
One nice property of prioritizing mochitests is that since they tend to be system/functional tests, they seem more likely to exercise more code throughout the stack.
WebGL appears to touch M1, R, Ru, android R2, and b2g R6. As such, those are my recommendations for DEBUG test coverage. JS tests should also very strongly be considered for DEBUG coverage.
We don't run Ru on android, but we do run R1-3. So: mochitests, R2, and jsreftests. This bug is just for android, not b2g or desktop. Callek, it looks like we are getting a reasonable list of tests built up.
Assignee: nobody → bugspam.Callek
Attached patch [configs] v1 (Splinter Review)
So this ended up being simpler on the code side than I first imagined (and first started). This looked good enough for cedar per staging. Will need dev eyes to green them up properly before moving on to m-c.
Attachment #728101 - Flags: review?(kmoir)
Comment on attachment 728101 [details] [diff] [review]
[configs] v1

lgtm
Attachment #728101 - Flags: review?(kmoir) → review+
(In reply to Dan Mosedale (:dmose) from comment #9)
> While what philor proposes isn't ideal

Sorry that I wasn't sufficiently clear. That wasn't what I was proposing, it was what I was telling you that you would get, and the problem is not that it isn't ideal, it's that it isn't acceptable, and will not run visible on any tree other than Cedar.

Take a look at https://tbpl.mozilla.org/php/getParsedLog.php?id=20995435&tree=Cedar&full=1 and picture yourself filing the bugs about the unexpected assertion test failures, particularly the ones in the first five tests, where you do not even know *what* assertions were hit. Now picture yourself having to fix the assertions in the last two tests, where you know what assertion was hit, but because you don't have a stack, you have no idea what code path was followed to get to that assertion. Now picture yourself trying to persuade bz to help you fix them.

Now picture yourself telling someone that you backed out their patch because it caused Android tests to fail by hitting unexpected assertions. How? Sorry, they cannot know that. Which? Sorry, they cannot know that.
That does sound like a nightmare, but it really sounds like a problem with the design of the tests, where they're counting assertions instead of running cleanly. Not being familiar with the reasons (historical or otherwise) why it works this way, it sounds really broken.

Other tests we have don't have this problem, but do have a ton of asserts and #ifdef DEBUG-only code we need coverage on. Skipping DEBUG-build tests on android shouldn't have ever been an option, but here we are. We need as many of these back online as we can get. If some of them are currently inviable and need to stay disabled until fixed, so be it. (Also, someone will need to fix them.)

We *already* deal with mystery crashes in android opt-build tests during development, and part of what makes this painful is not having our DEBUG tests running to keep us honest.
philor, yes, that does sound, uh, exciting. :-) That said, +1 to jgilbert's suggestion to blacklist busted tests in these runs until they are fixed and turn on the rest.
(In reply to Jeff Gilbert [:jgilbert] from comment #20)
> That does sound like a nightmare, but it really sounds like a problem with
> the design of the tests, where they're counting assertions instead of
> running cleanly. Not being familiar with the reasons (historical or
> otherwise) why it works this way, it sounds really broken.
>
> Other tests we have don't have this problem, but do have a ton of asserts
> and #ifdef DEBUG-only code we need coverage on. Skipping DEBUG-build tests
> on android shouldn't have ever been an option, but here we are. We need as
> many of these back online as we can get. If some of them are currently
> inviable and need to stay disabled until fixed, so be it. (Also, someone
> will need to fix them.)
>
> We *already* deal with mystery crashes in android opt-build tests during
> development, and part of what makes this painful is not having our DEBUG
> tests running to keep us honest.

That still doesn't solve the "logs only show part of the logcat, so future landings will be backed out for causing assertions, but people will have no idea what they broke and how to fix it" problem that philor was talking about.
To be more explicit: Debug tests on android will need to remain hidden until assertions + stacks show up in the log like every other platform.
I'm assuming that could be fixed by logcat scraping, a la bug 682129, upon which this bug depends. (Correct?) Is it possible that the shortest path to success here is to tweak the Android assertion code to complain to logcat in addition to stdout/stderr?
As you saw when you looked at the log in comment 19, they are already there and visible to logcat, but as you also saw, logcat is just displaying a circular buffer which doesn't have the capacity to hold all of our output for an entire test run, and so bug 682129 is exactly about "capture all of our output from logcat."
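As an illustration of what capturing everything would involve, here is a hedged sketch of draining the device's logcat from the host before the ring buffer wraps; the plain adb calls stand in for whatever the harness and device manager actually use, and the dump-then-clear approach still has a small race where lines logged in between are lost:

import subprocess
import time

def drain_logcat(outfile, interval=10, iterations=None):
    # Append each logcat dump to a host-side file, then clear the device
    # buffer so later output cannot be lost to wraparound. Note: lines
    # logged between the dump and the clear are dropped; a real
    # implementation would filter on timestamps instead of clearing.
    count = 0
    with open(outfile, "ab") as f:
        while iterations is None or count < iterations:
            dump = subprocess.check_output(["adb", "logcat", "-d", "-v", "time"])
            f.write(dump)
            subprocess.check_call(["adb", "logcat", "-c"])
            time.sleep(interval)
            count += 1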
Or we could just ensure we print assertions to stdout, and not have to scrape the logcat at all...
for the record, all mochitest data is captured in a special logfile on the device (typically /mnt/sdcard/tests/mochitest.log). The harness itself writes there instead of to stdout. Capturing stdout or even logcat is a major undertaking on android and mobile in general.
(In reply to Joel Maher (:jmaher) from comment #27)
> for the record, all mochitest data is captured in a special logfile on the
> device (typically /mnt/sdcard/tests/mochitest.log). The harness itself
> writes there instead of to stdout. Capturing stdout or even logcat is a
> major undertaking on android and mobile in general.

Couldn't we just output printf_stderr (and such) to a file, instead of to actual stderr?
Great idea! In the harness we could pull that stderr file to the host and append it to the logfile. If it had timestamps, we could sort it all properly with the test log.
So, I can do a mini-patch to dump output to a file, but I can't do the rest of the build/harness stuff. Is there anyone else who could help out with that part?
:jgilbert, I would be happy to help. This just requires editing testing/mochitest/runtestsremote.py. That would only be for the mochitest harness, but it would be a place to start. Basically we would do something like this: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtestsremote.py#217

self._dm.getFile('/mnt/sdcard/tests/<jgilbert.log>', "%s.jgilbert" % self.localLog)
# append the .jgilbert file to localLog
ll = open(self.localLog, 'a')
jg = open("%s.jgilbert" % self.localLog, 'r')
data = jg.read()
ll.write(data)
jg.close()
ll.close()
os.remove("%s.jgilbert" % self.localLog)

It is a hack; if this seems to solve our problems, I would like to figure out a better way to support >1 type of file coming back, and likewise interleaving the data to make it useful.
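Building on that hack, a sketch of one way the harness could support more than one device-side file and interleave the data by timestamp; the method name, file handling, and the leading-timestamp assumption are all illustrative, not existing runtestsremote.py code:

import os

def pull_and_merge_logs(self, extra_device_logs):
    # Pull each extra device-side log, then merge everything into
    # self.localLog sorted by a leading timestamp on every line.
    # Assumes self._dm and self.localLog as in runtestsremote.py.
    lines = []
    for devpath in extra_device_logs:
        local = "%s.%s" % (self.localLog, os.path.basename(devpath))
        self._dm.getFile(devpath, local)
        with open(local) as f:
            lines.extend(f.readlines())
        os.remove(local)
    with open(self.localLog) as f:
        lines.extend(f.readlines())
    # Assumes every line starts with a sortable "HH:MM:SS.mmm " timestamp;
    # lines without one would need special handling in a real version.
    lines.sort(key=lambda line: line.split(" ", 1)[0])
    with open(self.localLog, "w") as f:
        f.writelines(lines)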
The major work [in buildbot] has been completed; I'm going to spin off "turn on for other branches" to a new bug to get this off my radar until we have actionable steps for said other branches. This is already live on cedar.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Summary: Tests for "Android 2.2 debug" → Tests for "Android 2.2 debug" (on cedar)
Blocks: 866795
Product: mozilla.org → Release Engineering
Component: General Automation → General