Closed Bug 1027802 Opened 5 years ago Closed 5 years ago

Try to get a ANR report and stack trace following Android test hangs

Categories

(Testing :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla33

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Attachments

(1 file)

In bug 960265, we considered but backed away from using "kill SIGABRT" to get a stack trace following detection of a test hang on Android.

My experience tracking down a shutdown hang in bug 1020345 convinced me that we should revisit that idea, or otherwise invest in trying to get a stack trace following an Android hang.
Depends on: 890026
See Also: → 1029550
Duplicate of this bug: 1029550
Summary: Try to get a stack trace following Android test hangs → Try to get a ANR report and stack trace following Android test hangs
We want to SIGQUIT (kill -3) to generate an ANR report, then SIGABRT (kill -6) to trigger breadpad -- requested in bug 1029550.
When my previous patch was reviewed (https://bugzilla.mozilla.org/show_bug.cgi?id=960265#c12) we thought it might be good to leverage devicemanager.killProcess(). killProcess() takes a signal parameter, which is supported for ADB, but not for SUT (added for B2G in bug 931078). For SUT, the issue is that sutagent's kill command does not support a signal argument.

Rather than block on a sutagent upgrade, I'm thinking I'll hack around sutagent with a shellCheckOutput command, and open another bug to support the signal parameter in sutagent.
I pushed an initial patch to try with: [ kill -3, kill -6, kill ]. It consistently produces an ANR report and consistently produces an .dmp file. But frequently (50%?), the dmp file has a bad header signature. This happens even if I [ kill -3, kill -6, sleep 3 seconds, kill ].

https://tbpl.mozilla.org/php/getParsedLog.php?id=42381315&tree=Try&full=1#error0

13:20:36     INFO -  mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-a7818fa4ad10/try-android/fennec-33.0a1.en-US.android-arm.crashreporter-symbols.zip
13:20:37  WARNING -  PROCESS-CRASH | http://10.12.135.20:30737/tests/layout/reftests/reftest-sanity/test-async.html | application crashed [Unknown top frame]
13:20:37     INFO -  Crash dump filename: /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp
13:20:37     INFO -  stderr from minidump_stackwalk:
13:20:37     INFO -  2014-06-24 13:20:37: minidump_processor.cc:264: INFO: Processing minidump in file /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp
13:20:37     INFO -  2014-06-24 13:20:37: minidump.cc:3815: INFO: Minidump opened minidump /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp
13:20:37     INFO -  2014-06-24 13:20:37: minidump.cc:3847: ERROR: Minidump header signature mismatch: (0x0, 0x0) != 0x504d444d
13:20:37     INFO -  2014-06-24 13:20:37: minidump_processor.cc:268: ERROR: Minidump /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp could not be read
13:20:37     INFO -  2014-06-24 13:20:37: minidump.cc:3787: INFO: Minidump closing minidump
13:20:37     INFO -  2014-06-24 13:20:37: minidump_stackwalk.cc:529: ERROR: MinidumpProcessor::Process failed
13:20:37     INFO -  minidump_stackwalk exited with return code 1
13:20:37     INFO -  mozcrash INFO | Saved minidump as /builds/panda-0737/test/build/blobber_upload_dir/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp


I think we have seen this before, but I don't know the cause. 

:ted -- Any ideas? Is it worth sleeping longer before the final kill?
Flags: needinfo?(ted)
(In reply to Geoff Brown [:gbrown] from comment #4)
> I pushed an initial patch to try with: [ kill -3, kill -6, kill ]. It
> consistently produces an ANR report and consistently produces an .dmp file.
> But frequently (50%?), the dmp file has a bad header signature. This happens
> even if I [ kill -3, kill -6, sleep 3 seconds, kill ].

Not sure if it matters here, but I would put a sleep between the 'kill -3' and 'kill -6' too.
Ugh, it's really depressing that the ANR output does not have the native stack. I guess they added that in later versions. When are we getting JB/KK emulators up? :)
Depends on: 1030342
See Also: → 1030330
(In reply to Geoff Brown [:gbrown] from comment #4)
> :ted -- Any ideas? Is it worth sleeping longer before the final kill?

Ideally the right behavior would be "kill -6; wait for the process to exit, if that times out kill -9". Maybe that's a pain to do, but we don't know how long it will take to write the minidump and exit cleanly.
Flags: needinfo?(ted)
Following Ted's advice, the approach here is to kill -3 to generate an ANR report, wait 3 seconds, kill -6 to generate a .dmp file, wait up to 15 seconds for the process to end, kill -9 if still running, and finally fall back to the old killProcess behavior if *still* running.

Testing on Android 2.3 always produces an ANR and a stack trace, and the process consistently dies quickly following kill -6.

Testing on Android 4.0 always produces an ANR and sometimes produces a stack trace and dies quickly following kill -6. Frequently, kill -6 produces a .dmp file with a header signature mismatch (Comment 4) and Fennec does not die until kill -9; in these cases, even waiting for several minutes does not help. Although we don't get a stack trace in these cases, no harm is done and at least we get the ANR report.
Attachment #8450559 - Flags: review?(jmaher)
Comment on attachment 8450559 [details] [diff] [review]
on hang, try to get Fennec ANR and stack trace

Review of attachment 8450559 [details] [diff] [review]:
-----------------------------------------------------------------

a few nits

::: build/mobile/remoteautomation.py
@@ +301,5 @@
> +                    if pid and pid > 0:
> +                        time.sleep(5)
> +                    else:
> +                        return
> +                    retries += 1

if we retry and don't succeed, should we print a warning?

@@ +302,5 @@
> +                        time.sleep(5)
> +                    else:
> +                        return
> +                    retries += 1
> +                self.dm.killProcess(self.procName, 9)

do we need a 9 here or can we do the old school way?

::: testing/mozbase/mozdevice/mozdevice/devicemanagerSUT.py
@@ +539,5 @@
>          if sig:
> +            pid = self.processExist(appname)
> +            if pid and pid > 0:
> +                try:
> +                    self.shellCheckOutput(['kill', '-%d' % sig, str(pid)], 

nit: trailing whitespace

@@ +557,5 @@
> +                    if self.processExist(appname):
> +                        self._runCmds([{ 'cmd': 'kill ' + appname }])
> +                    return
> +                except DMError, err:
> +                    retries +=1

nit: retries += 1 <- add a space before the 1
Attachment #8450559 - Flags: review?(jmaher) → review+
Thanks! I fixed the nits and added a retry warning, but kept the kill -9 -- I am not sure it is needed but I think it is good practice to kill aggressively when faced with a process that is not exiting.
https://hg.mozilla.org/mozilla-central/rev/e7ba9bfbea36
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
\o/

I'm excited to see what this turns up for existing intermittent failure modes :-)
You need to log in before you can comment on or make changes to this bug.