Closed Bug 823452 Opened 7 years ago Closed 7 years ago

Android's checkForCrashes "No crash directory found on remote device" error is misleading (obscures startup java stacks)

Categories

(Testing :: General, defect)

ARM
Android
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

(Keywords: sheriffing-P1)

Attachments

(1 file, 2 obsolete files)

During the night mozilla-inbound was closed, since devs in #developers saw this push (and subsequent similarly busted):
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=06d72606ec56

...and inferred it was an infra problem, because the failure line shown in TBPL's annotated summary was:

"Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device"

And thus filed bug 823420.

The real problem in this case was visible lower down in the logcat:
{
12-19 21:07:16.938 E/GeckoAppShell( 1816): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 1 ("main")
12-19 21:07:16.938 E/GeckoAppShell( 1816): java.lang.RuntimeException: Unable to start activity ComponentInfo{org.mozilla.fennec/org.mozilla.fennec.App}: android.content.res.Resources$NotFoundException: Resource ID #0x7f0a0034
12-19 21:07:16.938 E/GeckoAppShell( 1816): 	at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2663)
...
}

...however, you have to view the full log on TBPL to see it, which is unlikely to happen if one didn't know what to look for.

In addition, I've just looked at a handful of the logs in the intermittent failure bug 722166, and many of those actually have usable Java exceptions in the logcat (and aren't all the "NSS load done", with no further useful output, which is what that bug is primarily about). We should do a better job of separating out those different intermittent failures.

I was toying with the idea of:
* Rewording the "Automation Error: No crash directory..." string
* Adding "REPORTING UNCAUGHT EXCEPTION FROM THREAD" (or similar) to TBPL's parser regexes, so the annotated summary lists (and also TBPL's brief log view highlights/links to) the relevant parts of the logcat.

However, there is also bug 809065, which is presumably going to output the java stack directly to the log (it relies on the .extra file in the minidumps directly, which we don't have here aiui). So if TBPL's parser matches the logcat exception lines, we'll end up double-reporting them in the annotated summary. 

It seems like we should just be handling this in remoteautomation.py's checkForCrashes. (ie call out to a new automationutils.py checkForJavaCrash(), which would parse the logcat and extract the failure or something; unless there is a more intelligent way to retrieve it from Android? I really don't have much experience with Android java crash debugging/devicemanager etc). 

What do you think?
Keywords: sheriffing-P1
(In reply to Ed Morley (Away 20th Dec-2nd Jan) [UTC+0; email:edmorley@moco] from comment #0)
> However, there is also bug 809065, which is presumably going to output the
> java stack directly to the log (it relies on the .extra file in the
> minidumps directly, which we don't have here aiui). So if TBPL's parser
> matches the logcat exception lines, we'll end up double-reporting them in
> the annotated summary. 

Actually, once the crashreporter is in a position to give us the .extra file, it then handles the java exceptions, meaning that we could get TBPL to match on ">>> REPORTING UNCAUGHT EXCEPTION FROM THREAD .*", which would only be present in the startup-crash logs where nothing caught them (or at least that's what I've seen so far from inspecting the logs).
Same confusion has now happened on https://tbpl.mozilla.org/?rev=ea373e534245 too. dbaron has starred all the crashes as the intermittent failure (even though it occurred on every testsuite for that push, which is highly unlikely, but hey), when in fact it was:

{
12-22 06:32:41.937 E/GeckoAppShell( 2197): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 1 ("main")
12-22 06:32:41.937 E/GeckoAppShell( 2197): java.lang.RuntimeException: Unable to start activity ComponentInfo{org.mozilla.fennec/org.mozilla.fennec.App}: java.lang.RuntimeException: Your TabHost must have a TabWidget whose id attribute is 'android.R.id.tabs'
12-22 06:32:41.937 E/GeckoAppShell( 2197): 	at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:1956)
12-22 06:32:41.937 E/GeckoAppShell( 2197): 	at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:1981)
12-22 06:32:41.937 E/GeckoAppShell( 2197): 	at android.app.ActivityThread.access$600(ActivityThread.java:123)
12-22 06:32:41.937 E/GeckoAppShell( 2197): 	at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1147)
...
}
Joel, Geoff, Ted: Any thoughts on comments 0-2?
This is what a startup Java crash looks like in Android builds? I don't think I have any real position on this, we should just do whatever makes these easy to deal with.
Right now the mobile developers are not seeing these crashes on their radar screen.  Half of our orange factor is related to crashes on startup or shutdown.  I am willing to add whatever we need to to talos and mochitest/reftest harnesses.  We are switching talos to use mozcrash here soon (this week), so whatever we change in mozcrash should handle our scenarios for all automation types.
We are in a bad place now because we have the on-going problem of intermittent crash-on-test-startup (bug 810471) and crash-on-test-shutdown (bug 761987). If we were not so accustomed to encountering those long-unresolved crashes, we would be looking closely at the logs for the remaining crashes and all would be fine.

But as it is, this mis-starring of crashes is a problem and we are losing logs for problems we can fix in the mess of logs for known, difficult problems.


It would be *great* if the Java exception was highlighted and star-able. At the moment, I think the only location of that info is in logcat.
Depends on: 803158
Blocks: 722166
Ok, thinking about this more, I think we should:
1) Make checkForCrashes() scrape the logcat if no minidumps directory found, and extract the line after the ">>> REPORTING UNCAUGHT EXCEPTION", and output it in a TBPL-compatible manner (ie like we do for the top frame of minidump stackwalk output).
2) Still add ">>> REPORTING UNCAUGHT EXCEPTION" to TBPL's regexes, so that way the resultant log output will be something like:

{
PROCESS-CRASH | java-startup-crash | java.lang.RuntimeException: foo
...
...
12-22 06:32:41.937 E/GeckoAppShell( 2197): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 1 ("main")
}

...and so when the TBPL log is opened, the regex match on "REPORTING UNCAUGHT EXCEPTION" will mean we get shown the logcat with the rest of the stack in the brief log (and not have to open the full log and hunt for it).

The alternative to #2 would be to parse and output the entire stack at #1, but that seems unnecessary IMO.
(In reply to Ed Morley [:edmorley UTC+0] from comment #7)
> 1) Make checkForCrashes() scrape the logcat if no minidumps directory found,
> and extract the line after the ">>> REPORTING UNCAUGHT EXCEPTION"

In virtually all of the logs I have seen with ">>> REPORTING UNCAUGHT EXCEPTION", the Java exception was the most important debug information and the best identifier for the bug, so I suggest:

  1) Make checkForCrashes() scrape the logcat (always) 
  and extract the line after the ">>> REPORTING UNCAUGHT EXCEPTION" ...
...actually, checkForCrashes doesn't currently access logcat data, so maybe that's not the right place for logcat processing.

A lot of the ndk-stack work is unrelated to this bug, but both involve processing logcat, so we should at least think about both at the same time.
Depends on: 836333
No longer depends on: 803158
Assignee: nobody → gbrown
I would be fine with putting logcat checking in mozcrash, since that would be useful for crash handling on Android. We talked about adding a check_for_remote_crashes method that would pull crashes from the remote directory and call check_for_crashes on the local directory, you could put the logic in there.
The mozcrash approach did not work out. try runs failed on "import mozcrash" and I note that remote mochitests and reftests are not currently using mozcrash, and perhaps not any mozbase module directly. 

This approach simply adds on to remoteautomation's checkForCrashes. If a Java exception is found in the logcat, that is reported as a PROCESS-CRASH, and additional crash detection is skipped.

try run showing normal (no java crash) operation: https://tbpl.mozilla.org/?tree=Try&rev=d96f80619f1a&noignore=1
try run with java exceptions: https://tbpl.mozilla.org/?tree=Try&rev=1f231280ffdb&noignore=1

I will handle talos in a follow-up patch.
Attachment #711887 - Attachment is obsolete: true
Attachment #711888 - Attachment is obsolete: true
Attachment #711893 - Flags: review?(jmaher)
Depends on: 839569
Comment on attachment 711893 [details] [diff] [review]
check for java exception in logcat; if found, report preferentially to other crash reporting

Review of attachment 711893 [details] [diff] [review]:
-----------------------------------------------------------------

it seems to be a longer road to get mozcrash fully implemented, one or two more hurdles left, but it shouldn't hold this patch up any more.

::: build/mobile/remoteautomation.py
@@ +120,4 @@
>      def checkForCrashes(self, directory, symbolsPath):
> +        logcat = self._devicemanager.getLogcat(filterOutRegexps=fennecLogcatFilters)
> +        crashed = self.checkForJavaException(logcat)
> +        if not crashed:

the variable name 'crashed' is confusing here as we are checking if not crashed, but then checking for a crash again.  Maybe javaException would be more descriptive?
Attachment #711893 - Flags: review?(jmaher) → review+
As landed, we now don't delete the old minidumps directory if we found a java exception.
(In reply to Ed Morley [:edmorley UTC+0] from comment #16)

I have a feeling I missed something and you are right, but I still don't see the problem...can you clarify? 

It looks to me like the remote minidumps dir (remoteCrashDir) is never deleted, regardless of my change. 

The local dump dir (dumpDir) is neither created nor deleted in the case of a Java exception.
Whiteboard: [leave open]
Sorry, I meant the local dir (dumpDir) isn't deleted since we return early if there was a java exception.
I checked with Ed on irc and we agree that the code is okay as it is: on java exception, dumpDir is never created.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Blocks: 809065
Blocks: 778688
Depends on: 872507
Blocks: 910320
You need to log in before you can comment on or make changes to this bug.