Closed Bug 808410 Opened 7 years ago Closed 7 years ago

Android shutdown crashes on mochitest/crashtest/reftest/jsreftest + all crashes on robocop are green on TBPL

Categories

(Testing :: General, defect, critical)

ARM
Android
defect
Not set
critical

Tracking

(firefox17 wontfix, firefox18 affected, firefox19 affected, firefox-esr10 wontfix, firefox-esr17 affected)

RESOLVED FIXED
mozilla20
Tracking Status
firefox17 --- wontfix
firefox18 --- affected
firefox19 --- affected
firefox-esr10 --- wontfix
firefox-esr17 --- affected

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

Yes, you read that correctly \o/

The fix for this will need backporting to all branches, since this has been present since the beginning of remote reftest (2010).

Examples:

Android Tegra 250 mozilla-release opt test reftest-1 on 2012-10-23 10:46:30 PDT for push 774491e9f207
slave: tegra-149
results: success (0)
...but shutdown crash in [@ nssCertificate_Destroy]
...remotereftest.py finished with exit code 0
https://tbpl.mozilla.org/php/getParsedLog.php?id=16386302&full=1&branch=mozilla-release

Android Armv6 Tegra 250 mozilla-inbound opt test crashtest-3 on 2012-11-04 03:50:10 PST for push 58b4bd7b5065
slave: tegra-274
results: success (0)
...but shutdown crash in [@ nssCertificate_Destroy]
...remotereftest.py finished with exit code 0
https://tbpl.mozilla.org/php/getParsedLog.php?id=16736472&full=1&branch=mozilla-inbound

Android Armv6 Tegra 250 mozilla-inbound opt test crashtest-3 on 2012-11-03 20:46:11 PDT for push 38b97d2a1bc5
slave: tegra-277
results: success (0)
...but shutdown crash in [@ nssCertificate_Destroy]
...remotereftest.py finished with exit code 0
https://tbpl.mozilla.org/php/getParsedLog.php?id=16733514&full=1&branch=mozilla-inbound

Background...

Buildbotcustom's unittest.py does:

>    # Assume that having the "Unexpected: 0" line
>    # means the tests run completed (successfully).
>    # Also check for "^TEST-UNEXPECTED-" for harness errors.
>    if not re.search(r"^REFTEST INFO \| Unexpected: 0 \(", log, re.MULTILINE) or \
>       re.search("^TEST-UNEXPECTED-", log, re.MULTILINE):
>        return worst_status(superResult, WARNINGS)

However, reftest.js' summary appended to the log doesn't specifically look for crashes, which means that in the case of a shutdown crash, we get a summary like:

> REFTEST INFO | Result summary:
> REFTEST INFO | Successful: 704 (1 pass, 703 load only)
> REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
> REFTEST INFO | Known problems: 10 (0 known fail, 0 known asserts, 0 random, 10 skipped, 0 slow)
> REFTEST INFO | Total canvas count = 2
> REFTEST TEST-START | Shutdown
> 
> INFO | automation.py | Application ran for: 0:05:16.468885
> INFO | automation.py | Reading PID log: /tmp/tmpBwFVijpidlog
> getting files in '/mnt/sdcard/tests/reftest/profile/minidumps/'
> Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-armv6/1352026305/fennec-19.0a1.en-US.android-arm-armv6.crashreporter-symbols.zip
> PROCESS-CRASH | Shutdown | application crashed (minidump found)

...which means the unittest.py snippet above relies on superResult (~the exit code) only. However remotereftest.py unfortunately does:

>    retVal = 0
>    try:
>        cmdlineArgs = ["-reftest", manifest]
>        if options.bootstrap:
>            cmdlineArgs = []
>        dm.recordLogcat()
>        reftest.runTests(manifest, options, cmdlineArgs)
>    except:
>        print "Automation Error: Exception caught while running tests"
>        traceback.print_exc()
>        retVal = 1
>
>    reftest.stopWebServer(options)
>    try:
>        logcat = dm.getLogcat()
>        print ''.join(logcat[-500:-1])
>        print dm.getInfo()
>    except devicemanager.DMError:
>        print "WARNING: Error getting device information at end of test"
>
>    return retVal

...so unless we get an exception (which we don't in the case of crashes) we'll exit with code 0; causing buildbot (and thus TBPL) to treat the run as a pass.

Compare with...

runreftest.py:
line 288 -- sys.exit(reftest.runTests(args[0], options))

runreftestb2g.py
line 548 -- retVal = reftest.runTests(manifest, options, cmdlineArgs)

I've checked the other harnesses and none of the others seem to need fixing (http://mxr.mozilla.org/mozilla-central/search?string=.runTests&case=1&find=\.py&findi=\.js&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central).
> However, reftest.js' summary appended to the log doesn't specifically look for 
> crashes, which means that in the case of a shutdown crash, we get a summary like:

Bah brain fail, that's an orthogonal issue; meant to say something like:

However, in the case of a shutdown crash, we get a summary like:
> REFTEST INFO | Result summary:
> REFTEST INFO | Successful: 704 (1 pass, 703 load only)
> REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
...
Assignee: nobody → bmo
Status: NEW → ASSIGNED
Attachment #678132 - Flags: review?(jmaher)
Blocks: 761987
Duplicate of this bug: 794244
Comment on attachment 678132 [details] [diff] [review]
Part 1: remotereftest.py should use runTests return value

Review of attachment 678132 [details] [diff] [review]:
-----------------------------------------------------------------

good catch.
Attachment #678132 - Flags: review?(jmaher) → review+
(In reply to Ed Morley [:edmorley UTC+0] from comment #5)
> Hmm I'm still getting exit code 0:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=16743181&full=1&branch=try

Due to bug 810405 comment 1.
I'll attach further patches here.
Duplicate of this bug: 810405
Attachment #678132 - Attachment description: Patch v1 → Part 1: remotereftest.py should use runTests return value
Ok, so as the duped bugs have shown, this doesn't just affect reftest variants.

We miss Android shutdown crashes on everything but talos, and on robocop we even miss non-shutdown crashes too \o/

This is due to a combination of:
* remotereftest.py not using the return value from runTests (dealt with by patch part 1).
* The return value for robocop tests being overwritten after each test run, so unless the last test failed, we miss earlier failures. (For non-crashes this doesn't matter, since we do an end-of-overall-run log parse + summary (printLog), where failures can still change the return value. However printLog does not look for PROCESS-CRASH lines, but I'll deal with this in another bug, since with the other fixes here, it doesn't affect buildbo/TBPL job status.)
* checkForCrashes returning false when we hit an exception partway through, since we don't set foundCrash to True until the very end.
* Callers of checkForCrashes() don't use its return value as part of determining the overall exit code.

In addition, I spotted:
* mochitest/runtestsb2g.py possibly doesn't always set the retVal when the final mochitest.cleanup succeeds without exception.
* We fail to print the overall robocop run summary unless the return value was 0.
* checkForZombies() doesn't return the result (unlike checkForCrashes) & callers don't use it to determine the final exit value.
Summary: remotereftest.py ignores runTests() exit code, so Android crashtest/reftest/jsreftest shutdown crashes are green on TBPL → Android shutdown crashes on mochitest/crashtest/reftest/jsreftest + all crashes on robocop are green on TBPL
Component: Reftest → General
Only sets retVal if it is at its initial 'None' value or else 0. If we have multiple failures during the robocop run, it will remain at the first non-zero return value.
If there were do dumps found we return early, so no need to use foundCrash (and fail to set it if we hit an exception).
(In reply to Ed Morley [:edmorley UTC+0] from comment #11)
> If there were do dumps found we return early,

That should be "no dumps".
Whilst peptest's checkForCrashes sets foundCrash earlier (and so doesn't have the same issue as in comment 11), we might as well do similar cleanup here.
...to ensure shutdown crashes actually make the run fail.

Note: The remoteautomation.py changes effectively make the "WARNING: unable to remove directory: <minidumps>" fatal, since the comment alongside implied that it should be. Whilst testing this patch, I have seen robocop runs that hit this, but I suspect they are a real issue that needs to be looked at separately (ie startup crashes that we've been missing until now?).

gps, please may I have feedback on the runxpcshelltests.py part (notably the xunitResult format).
Attachment #681926 - Flags: feedback?(gps)
If for some reason we hit an exception during runTests, but the later mochitest.cleanup call doesn't cause an exception, we may not set retVal. I can't see us hitting this very often, but just in case.
Try run looks good:
https://tbpl.mozilla.org/?tree=Try&rev=679262f1068d

Example crashes on the Try run, that we now correctly catch...

Reftest shutdown crash:
https://tbpl.mozilla.org/php/getParsedLog.php?id=17068582&tree=Try

Robocop crash during testBookmarksTab:
https://tbpl.mozilla.org/php/getParsedLog.php?id=17068026&tree=Try

Example "Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) on remote device" (see comment 14; I think these are startup crashes - either way I believe the right thing to do is make these fatal and deal in another bug):
https://tbpl.mozilla.org/php/getParsedLog.php?id=17067851&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=17067872&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=17067832&tree=Try
(In reply to Ed Morley [:edmorley UTC+0] from comment #16)
> I think these are startup crashes - either way I believe the right thing to
> do is make these fatal

(which is what the part 5 patch does)
(In reply to Ed Morley [:edmorley UTC+0] from comment #16)
> Example "Automation Error: No crash directory
> (/mnt/sdcard/tests/profile/minidumps/) on remote device" (see comment 14; I
> think these are startup crashes - either way I believe the right thing to do
> is make these fatal and deal in another bug):
> https://tbpl.mozilla.org/php/getParsedLog.php?id=17067851&tree=Try
> https://tbpl.mozilla.org/php/getParsedLog.php?id=17067872&tree=Try
> https://tbpl.mozilla.org/php/getParsedLog.php?id=17067832&tree=Try

In fact, these look like bug 722166 perhaps.
Comment on attachment 681916 [details] [diff] [review]
Part 2: Robocop - non-zero return values shouldn't be overwritten by later success

Review of attachment 681916 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

::: testing/mochitest/runtestsremote.py
@@ +497,5 @@
>              try:
>                  dm.recordLogcat()
> +                result = mochitest.runTests(options)
> +                # Ensure failures aren't obscured by subsequent passes
> +                if not retVal:

nit: "if retVal is None", to be consistent with the check just below
Attachment #681916 - Flags: review+
(In reply to Geoff Brown [:gbrown] from comment #19)
> Comment on attachment 681916 [details] [diff] [review]
> Part 2: Robocop - non-zero return values shouldn't be overwritten by later
> success
> 
> nit: "if retVal is None", to be consistent with the check just below

That unfortunately won't fix the problem being addressed by that patch - along the lines of comment 10, we should only overwrite retVal if the former value was either None (ie we haven't run any tests yet) or 0.
Comment on attachment 681926 [details] [diff] [review]
Part 5: Callers of checkForCrashes should use its return value

Review of attachment 681926 [details] [diff] [review]:
-----------------------------------------------------------------

Patch looks good as is. Just some minor style nits.

::: build/automation.py.in
@@ +1069,5 @@
> +
> +    crashed = self.checkForCrashes(profileDir, symbolsPath)
> +
> +    if crashed:
> +      status = 1

No need for "crashed" variable.

@@ +1075,5 @@
>      if os.path.exists(processLog):
>        os.unlink(processLog)
>  
>      if self.IS_TEST_BUILD and runSSLTunnel:
>        ssltunnelProcess.kill()

This script looks like it could use some context manager love!

::: build/mobile/b2gautomation.py
@@ +103,5 @@
>          try:
>            shutil.rmtree(dumpDir)
>          except:
>            print "WARNING: unable to remove directory: %s" % (dumpDir)
> +        return crashed

See comment in remoteautomation.py regarding lingering temp directories and try..finally.

::: build/mobile/remoteautomation.py
@@ +100,5 @@
> +        try:
> +            shutil.rmtree(dumpDir)
> +        except:
> +            print "WARNING: unable to remove directory: %s" % dumpDir
> +        return crashed

If getDirectory() or checkForCrashes() raises, the temporary directory will linger. I'd use a try..finally to ensure dumpDir is deleted.

::: testing/xpcshell/runxpcshelltests.py
@@ +870,5 @@
>              self.todoCount += 1
>              xunitResult["todo"] = True
>  
> +        crashed = checkForCrashes(testdir, self.symbolsPath, testName=name)
> +        if crashed:

This is the only use of "created" so I wouldn't bother with a variable.
Attachment #681926 - Flags: feedback?(gps) → feedback+
(In reply to Gregory Szorc [:gps] from comment #21)
> ::: build/automation.py.in
> @@ +1069,5 @@
> > +
> > +    crashed = self.checkForCrashes(profileDir, symbolsPath)
> > +
> > +    if crashed:
> > +      status = 1
> 
> No need for "crashed" variable.

This is added to in bug 812135 (whose patch applies after this), and whilst the variable is still only used once after the patches there, I think it reads more clearly like this.

> @@ +1075,5 @@
> >      if os.path.exists(processLog):
> >        os.unlink(processLog)
> >  
> >      if self.IS_TEST_BUILD and runSSLTunnel:
> >        ssltunnelProcess.kill()
> 
> This script looks like it could use some context manager love!

These lines weren't changed by the patch. I'll file another bug for this.

> ::: build/mobile/b2gautomation.py
> @@ +103,5 @@
> >          try:
> >            shutil.rmtree(dumpDir)
> >          except:
> >            print "WARNING: unable to remove directory: %s" % (dumpDir)
> > +        return crashed
> 
> See comment in remoteautomation.py regarding lingering temp directories and
> try..finally.
> 
> ::: build/mobile/remoteautomation.py
> @@ +100,5 @@
> > +        try:
> > +            shutil.rmtree(dumpDir)
> > +        except:
> > +            print "WARNING: unable to remove directory: %s" % dumpDir
> > +        return crashed
> 
> If getDirectory() or checkForCrashes() raises, the temporary directory will
> linger. I'd use a try..finally to ensure dumpDir is deleted.

Other than indentation, these weren't changed by this patch, so I'll file another bug.

> ::: testing/xpcshell/runxpcshelltests.py
> @@ +870,5 @@
> >              self.todoCount += 1
> >              xunitResult["todo"] = True
> >  
> > +        crashed = checkForCrashes(testdir, self.symbolsPath, testName=name)
> > +        if crashed:
> 
> This is the only use of "created" so I wouldn't bother with a variable.

Since bug 812135 doesn't apply to runxpcshelltests.py (or at least, runxpcshelltests.py doesn't have an equivalent at the moment), I'll make this change.
(In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> (In reply to Geoff Brown [:gbrown] from comment #19)
> > Comment on attachment 681916 [details] [diff] [review]
> > Part 2: Robocop - non-zero return values shouldn't be overwritten by later
> > success
> > 
> > nit: "if retVal is None", to be consistent with the check just below
> 
> That unfortunately won't fix the problem being addressed by that patch -
> along the lines of comment 10, we should only overwrite retVal if the former
> value was either None (ie we haven't run any tests yet) or 0.

gbrown, instead of:
"if not retVal:"
...would you rather:
"if retVal is None or retVal == 0:"
...to be more explicit, and also consistent with the style lower down?
Comment on attachment 681918 [details] [diff] [review]
Part 3: checkForCrashes should return true for cases where we hit an exception during minidump stackwalk

(For all of these, see the comment where the attachment was added, for reasoning. [datetime link by attachment name should jump you to it])
Attachment #681918 - Flags: review?(jmaher)
Attachment #681919 - Flags: review?(jmaher)
Attachment #681926 - Flags: review?(jmaher)
Attachment #681927 - Flags: review?(jmaher)
Comment on attachment 681918 [details] [diff] [review]
Part 3: checkForCrashes should return true for cases where we hit an exception during minidump stackwalk

Review of attachment 681918 [details] [diff] [review]:
-----------------------------------------------------------------

this simplifies the code a bit, thanks!
Attachment #681918 - Flags: review?(jmaher) → review+
Comment on attachment 681919 [details] [diff] [review]
Part 4: Make peptest's checkForCrashes more consistent with that in automationutils.py

Review of attachment 681919 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm!
Attachment #681919 - Flags: review?(jmaher) → review+
Comment on attachment 681926 [details] [diff] [review]
Part 5: Callers of checkForCrashes should use its return value

Review of attachment 681926 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/mobile/remoteautomation.py
@@ +100,5 @@
> +        try:
> +            shutil.rmtree(dumpDir)
> +        except:
> +            print "WARNING: unable to remove directory: %s" % dumpDir
> +        return crashed

to that note, we should have a try/except/finall around the entire getDirectory/checkForCrashes code.  Right now we only have it around shutil.rmtree.

::: testing/xpcshell/runxpcshelltests.py
@@ +870,5 @@
>              self.todoCount += 1
>              xunitResult["todo"] = True
>  
> +        crashed = checkForCrashes(testdir, self.symbolsPath, testName=name)
> +        if crashed:

likewise 'if checkForCrashes(testdir, self.symbolsPath, testName=name):'
Attachment #681926 - Flags: review?(jmaher) → review+
(In reply to Ed Morley [:edmorley UTC+0] from comment #23)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Geoff Brown [:gbrown] from comment #19)
> > > nit: "if retVal is None", to be consistent with the check just below
> > 
> > That unfortunately won't fix the problem being addressed by that patch -
> > along the lines of comment 10, we should only overwrite retVal if the former
> > value was either None (ie we haven't run any tests yet) or 0.
> 
> gbrown, instead of:
> "if not retVal:"
> ...would you rather:
> "if retVal is None or retVal == 0:"
> ...to be more explicit, and also consistent with the style lower down?

Yes, I would prefer that. But it's a very minor issue -- change only if you feel like it.
Blocks: 813132
Comment on attachment 681927 [details] [diff] [review]
Part 6: B2G mochitests - Ensure retVal is set even when cleanup did not generate an exception

Review of attachment 681927 [details] [diff] [review]:
-----------------------------------------------------------------

Taking over trivial review as per request on irc
Attachment #681927 - Flags: review?(jmaher) → review+
Target Milestone: mozilla19 → mozilla20
(In reply to Geoff Brown [:gbrown] from comment #28)
> Yes, I would prefer that. But it's a very minor issue -- change only if you
> feel like it.

Done :-)

(In reply to Joel Maher (:jmaher) from comment #27)
> to that note, we should have a try/except/finall around the entire
> getDirectory/checkForCrashes code.  Right now we only have it around
> shutil.rmtree.

Filed bug 813577.

> ::: testing/xpcshell/runxpcshelltests.py
> @@ +870,5 @@
> >              self.todoCount += 1
> >              xunitResult["todo"] = True
> >  
> > +        crashed = checkForCrashes(testdir, self.symbolsPath, testName=name)
> > +        if crashed:
> 
> likewise 'if checkForCrashes(testdir, self.symbolsPath, testName=name):'

Done.

(In reply to Gregory Szorc [:gps] from comment #21)
> @@ +1075,5 @@
> >      if os.path.exists(processLog):
> >        os.unlink(processLog)
> >  
> >      if self.IS_TEST_BUILD and runSSLTunnel:
> >        ssltunnelProcess.kill()
> 
> This script looks like it could use some context manager love!

Filed bug 813576.
Added bonus, this made bug 722166 starrable :-)
Blocks: 778688
Blocks: 816952
No longer blocks: 830046
You need to log in before you can comment on or make changes to this bug.