Closed Bug 1027802 Opened 11 years ago Closed 11 years ago

Try to get a ANR report and stack trace following Android test hangs

Categories

(Testing :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla33

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Attachments

(1 file)

In bug 960265, we considered but backed away from using "kill SIGABRT" to get a stack trace following detection of a test hang on Android. My experience tracking down a shutdown hang in bug 1020345 convinced me that we should revisit that idea, or otherwise invest in trying to get a stack trace following an Android hang.
Depends on: 890026
See Also: → 1029550
Summary: Try to get a stack trace following Android test hangs → Try to get a ANR report and stack trace following Android test hangs
We want to SIGQUIT (kill -3) to generate an ANR report, then SIGABRT (kill -6) to trigger breadpad -- requested in bug 1029550.
When my previous patch was reviewed (https://bugzilla.mozilla.org/show_bug.cgi?id=960265#c12) we thought it might be good to leverage devicemanager.killProcess(). killProcess() takes a signal parameter, which is supported for ADB, but not for SUT (added for B2G in bug 931078). For SUT, the issue is that sutagent's kill command does not support a signal argument. Rather than block on a sutagent upgrade, I'm thinking I'll hack around sutagent with a shellCheckOutput command, and open another bug to support the signal parameter in sutagent.
I pushed an initial patch to try with: [ kill -3, kill -6, kill ]. It consistently produces an ANR report and consistently produces an .dmp file. But frequently (50%?), the dmp file has a bad header signature. This happens even if I [ kill -3, kill -6, sleep 3 seconds, kill ]. https://tbpl.mozilla.org/php/getParsedLog.php?id=42381315&tree=Try&full=1#error0 13:20:36 INFO - mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-a7818fa4ad10/try-android/fennec-33.0a1.en-US.android-arm.crashreporter-symbols.zip 13:20:37 WARNING - PROCESS-CRASH | http://10.12.135.20:30737/tests/layout/reftests/reftest-sanity/test-async.html | application crashed [Unknown top frame] 13:20:37 INFO - Crash dump filename: /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp 13:20:37 INFO - stderr from minidump_stackwalk: 13:20:37 INFO - 2014-06-24 13:20:37: minidump_processor.cc:264: INFO: Processing minidump in file /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp 13:20:37 INFO - 2014-06-24 13:20:37: minidump.cc:3815: INFO: Minidump opened minidump /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp 13:20:37 INFO - 2014-06-24 13:20:37: minidump.cc:3847: ERROR: Minidump header signature mismatch: (0x0, 0x0) != 0x504d444d 13:20:37 INFO - 2014-06-24 13:20:37: minidump_processor.cc:268: ERROR: Minidump /tmp/tmpLwEjSY/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp could not be read 13:20:37 INFO - 2014-06-24 13:20:37: minidump.cc:3787: INFO: Minidump closing minidump 13:20:37 INFO - 2014-06-24 13:20:37: minidump_stackwalk.cc:529: ERROR: MinidumpProcessor::Process failed 13:20:37 INFO - minidump_stackwalk exited with return code 1 13:20:37 INFO - mozcrash INFO | Saved minidump as /builds/panda-0737/test/build/blobber_upload_dir/3a1bd888-2347-0b34-180d56c8-6ccc20aa.dmp I think we have seen this before, but I don't know the cause. :ted -- Any ideas? Is it worth sleeping longer before the final kill?
Flags: needinfo?(ted)
(In reply to Geoff Brown [:gbrown] from comment #4) > I pushed an initial patch to try with: [ kill -3, kill -6, kill ]. It > consistently produces an ANR report and consistently produces an .dmp file. > But frequently (50%?), the dmp file has a bad header signature. This happens > even if I [ kill -3, kill -6, sleep 3 seconds, kill ]. Not sure if it matters here, but I would put a sleep between the 'kill -3' and 'kill -6' too.
Ugh, it's really depressing that the ANR output does not have the native stack. I guess they added that in later versions. When are we getting JB/KK emulators up? :)
Depends on: 1030342
See Also: → 1030330
(In reply to Geoff Brown [:gbrown] from comment #4) > :ted -- Any ideas? Is it worth sleeping longer before the final kill? Ideally the right behavior would be "kill -6; wait for the process to exit, if that times out kill -9". Maybe that's a pain to do, but we don't know how long it will take to write the minidump and exit cleanly.
Flags: needinfo?(ted)
Following Ted's advice, the approach here is to kill -3 to generate an ANR report, wait 3 seconds, kill -6 to generate a .dmp file, wait up to 15 seconds for the process to end, kill -9 if still running, and finally fall back to the old killProcess behavior if *still* running. Testing on Android 2.3 always produces an ANR and a stack trace, and the process consistently dies quickly following kill -6. Testing on Android 4.0 always produces an ANR and sometimes produces a stack trace and dies quickly following kill -6. Frequently, kill -6 produces a .dmp file with a header signature mismatch (Comment 4) and Fennec does not die until kill -9; in these cases, even waiting for several minutes does not help. Although we don't get a stack trace in these cases, no harm is done and at least we get the ANR report.
Attachment #8450559 - Flags: review?(jmaher)
Comment on attachment 8450559 [details] [diff] [review] on hang, try to get Fennec ANR and stack trace Review of attachment 8450559 [details] [diff] [review]: ----------------------------------------------------------------- a few nits ::: build/mobile/remoteautomation.py @@ +301,5 @@ > + if pid and pid > 0: > + time.sleep(5) > + else: > + return > + retries += 1 if we retry and don't succeed, should we print a warning? @@ +302,5 @@ > + time.sleep(5) > + else: > + return > + retries += 1 > + self.dm.killProcess(self.procName, 9) do we need a 9 here or can we do the old school way? ::: testing/mozbase/mozdevice/mozdevice/devicemanagerSUT.py @@ +539,5 @@ > if sig: > + pid = self.processExist(appname) > + if pid and pid > 0: > + try: > + self.shellCheckOutput(['kill', '-%d' % sig, str(pid)], nit: trailing whitespace @@ +557,5 @@ > + if self.processExist(appname): > + self._runCmds([{ 'cmd': 'kill ' + appname }]) > + return > + except DMError, err: > + retries +=1 nit: retries += 1 <- add a space before the 1
Attachment #8450559 - Flags: review?(jmaher) → review+
Thanks! I fixed the nits and added a retry warning, but kept the kill -9 -- I am not sure it is needed but I think it is good practice to kill aggressively when faced with a process that is not exiting.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
\o/ I'm excited to see what this turns up for existing intermittent failure modes :-)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: