Intermittent TV timeout cases on "android-em-4-3-armv7-api16" that were caused by firefox shutdown hang
Categories
(Firefox for Android Graveyard :: Testing, defect, P3)
Tracking
(Not tracked)
People
(Reporter: violet.bugreport, Unassigned)
References
(Blocks 1 open bug)
Details
Crash Data
This bug is meant to track intermittent timeout reports on "android-em-4-3-armv7-api16", which were caused by shutdown hanging problem in the testing platform itself after the successful completion of the actual testcase.
For more context:
https://bugzilla.mozilla.org/show_bug.cgi?id=1534079#c6
https://groups.google.com/forum/#!topic/mozilla.dev.platform/WoHZrsfUIyk
Updated•5 years ago
|
Comment 5•5 years ago
|
||
These are browser shutdown hangs: After test completion, during or after xpcom-will-shutdown, firefox hangs and fails to exit. In all cases reported so far, this happens in TV, with chaos mode.
Updated•5 years ago
|
Reporter | ||
Comment 11•5 years ago
|
||
I'm pretty sure this is not an infrequent issue. It is infrequent only in the sense that people don't add new crashtest/reftest frequently.
Recently SVG component has resolved a lot of crash bugs, thus we added many new crashtests these days. Almost all of them have got this problem.
Comment 12•5 years ago
|
||
I see ScriptPreloader::Observe in several of the crash reports -- could be significant?
Comment hidden (Intermittent Failures Robot) |
Comment 17•5 years ago
|
||
It is easy to reproduce the failures on try: Modify a few of the tests mentioned in this bug and run TV on Android/pgo:
I see ScriptPreloader::Observe in several of the crash reports -- could be significant?
That try push has extra logging to check ScriptPreloader::Observe -- see the logcat artifacts for the extra logging. For example:
03-27 13:20:30.219 963 980 E ZZZ : ScriptPreloader::Observe start [browser-idle-startup-tasks-finished]
03-27 13:20:30.219 963 980 E ZZZ : ScriptPreloader::Observe end
03-27 13:20:30.219 963 980 E ZZZ : ScriptPreloader::Observe start [browser-idle-startup-tasks-finished]
03-27 13:20:30.219 963 980 E ZZZ : ScriptPreloader::Observe end
03-27 13:20:30.729 963 980 I Gecko : 1553718030739 Marionette TRACE Received observer notification xpcom-will-shutdown
03-27 13:20:30.849 963 980 E ZZZ : ScriptPreloader::Observe start [xpcom-shutdown]
03-27 13:20:30.849 963 980 E ZZZ : ScriptPreloader::Observe end
03-27 13:20:30.849 963 980 E ZZZ : ScriptPreloader::Observe start [xpcom-shutdown]
03-27 13:21:19.440 275 279 D dalvikvm: GC_CONCURRENT freed 714K, 43% free 5570K/9764K, paused 6ms+19ms, total 121ms
03-27 13:21:59.649 275 279 D dalvikvm: GC_CONCURRENT freed 653K, 44% free 5560K/9764K, paused 5ms+5ms, total 74ms
03-27 13:22:00.169 334 338 D dalvikvm: GC_CONCURRENT freed 14221K, 70% free 6348K/20640K, paused 4ms+13ms, total 138ms
03-27 13:22:22.639 275 279 D dalvikvm: GC_CONCURRENT freed 635K, 43% free 5568K/9764K, paused 5ms+17ms, total 103ms
03-27 13:22:54.219 275 279 D dalvikvm: GC_CONCURRENT freed 642K, 43% free 5569K/9764K, paused 6ms+5ms, total 84ms
03-27 13:23:25.859 275 279 D dalvikvm: GC_CONCURRENT freed 643K, 43% free 5568K/9764K, paused 5ms+6ms, total 83ms
03-27 13:23:59.839 275 279 D dalvikvm: GC_CONCURRENT freed 650K, 44% free 5560K/9764K, paused 4ms+5ms, total 71ms
03-27 13:24:29.019 275 279 D dalvikvm: GC_CONCURRENT freed 636K, 43% free 5567K/9764K, paused 6ms+10ms, total 104ms
03-27 13:25:00.599 275 279 D dalvikvm: GC_CONCURRENT freed 650K, 43% free 5568K/9764K, paused 7ms+7ms, total 98ms
03-27 13:25:42.579 275 279 D dalvikvm: GC_CONCURRENT freed 695K, 43% free 5593K/9764K, paused 6ms+19ms, total 137ms
03-27 13:26:14.289 275 279 D dalvikvm: GC_CONCURRENT freed 714K, 43% free 5587K/9764K, paused 5ms+17ms, total 133ms
03-27 13:26:46.019 275 279 D dalvikvm: GC_CONCURRENT freed 699K, 43% free 5590K/9764K, paused 8ms+24ms, total 141ms
03-27 13:27:17.889 275 279 D dalvikvm: GC_CONCURRENT freed 677K, 43% free 5566K/9764K, paused 5ms+8ms, total 96ms
03-27 13:27:18.409 963 968 I dalvikvm: threadid=3: reacting to signal 3
It does appear that these hangs are caused by a hang in ScriptPreloader::Observe for "xpcom-shutdown".
:kmag -- it looks like you have worked on ScriptPreloader::Cleanup before; would you be able to look into this?
Comment 18•5 years ago
|
||
Narrowed this down a little more here:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237000287&repo=try&lineNumber=1487
03-29 14:03:54.936 957 974 E ZZZ : [0x52ba39d0] ScriptPreloader::Observe cleanup start [xpcom-shutdown]
03-29 14:03:54.936 957 974 E ZZZ : [0x52ba39d0] ScriptPreloader::Observe cleanup end
03-29 14:03:54.946 957 974 E ZZZ : [0x52ba3500] ScriptPreloader::Observe cleanup start [xpcom-shutdown]
03-29 14:03:54.946 957 974 E ZZZ : [0x52ba3500] Cleanup/mSaveThread before wait loop
03-29 14:04:40.476 275 278 D dalvikvm: GC_CONCURRENT freed 714K, 44% free 5569K/9828K, paused 6ms+5ms, total 116ms
03-29 14:05:12.137 275 278 D dalvikvm: GC_CONCURRENT freed 653K, 44% free 5566K/9828K, paused 5ms+5ms, total 111ms
03-29 14:05:54.037 275 278 D dalvikvm: GC_CONCURRENT freed 694K, 44% free 5586K/9828K, paused 14ms+19ms, total 149ms
03-29 14:06:25.736 275 278 D dalvikvm: GC_CONCURRENT freed 704K, 44% free 5590K/9828K, paused 6ms+23ms, total 139ms
03-29 14:06:57.476 275 278 D dalvikvm: GC_CONCURRENT freed 707K, 44% free 5591K/9828K, paused 7ms+24ms, total 150ms
03-29 14:07:29.146 275 278 D dalvikvm: GC_CONCURRENT freed 682K, 44% free 5565K/9828K, paused 7ms+5ms, total 115ms
03-29 14:08:00.706 275 278 D dalvikvm: GC_CONCURRENT freed 652K, 44% free 5565K/9828K, paused 6ms+5ms, total 114ms
03-29 14:08:42.616 275 278 D dalvikvm: GC_CONCURRENT freed 704K, 44% free 5576K/9828K, paused 6ms+5ms, total 138ms
03-29 14:09:14.336 275 278 D dalvikvm: GC_CONCURRENT freed 706K, 44% free 5577K/9828K, paused 6ms+24ms, total 139ms
03-29 14:09:46.066 275 278 D dalvikvm: GC_CONCURRENT freed 699K, 44% free 5586K/9828K, paused 5ms+23ms, total 150ms
03-29 14:10:17.727 275 278 D dalvikvm: GC_CONCURRENT freed 675K, 44% free 5564K/9828K, paused 5ms+5ms, total 109ms
03-29 14:10:39.496 957 962 I dalvikvm: threadid=3: reacting to signal 3
...suggesting a hang at:
https://hg.mozilla.org/try/rev/1b99b00f943bf59548eb34b83b788ade22e61e0d#l4.35
Comment 19•5 years ago
|
||
And tracking that a little deeper:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237726224&repo=try&lineNumber=1510
04-02 15:56:40.881 1101 1118 E ZZZ : [0x52b9f500] Cleanup running on main thread!
04-02 15:56:40.881 1101 1118 E ZZZ : [0x52b9f500] Cleanup/mSaveThread before wait loop
04-02 15:56:50.310 1101 1194 E ZZZ : [0x52b9f500] mSaveThread after preloader WriteCache
04-02 15:56:50.310 1101 1194 E ZZZ : [0x52b9f500] WriteCache dispatches to PrepareCacheWrite
04-02 15:57:20.210 275 278 D dalvikvm: GC_CONCURRENT freed 718K, 42% free 5606K/9644K, paused 5ms+22ms, total 144ms
04-02 15:57:55.701 275 278 D dalvikvm: GC_CONCURRENT freed 698K, 43% free 5560K/9644K, paused 5ms+6ms, total 75ms
04-02 15:58:23.460 275 278 D dalvikvm: GC_CONCURRENT freed 638K, 43% free 5568K/9644K, paused 6ms+5ms, total 96ms
04-02 15:59:00.200 275 278 D dalvikvm: GC_CONCURRENT freed 655K, 43% free 5564K/9644K, paused 4ms+27ms, total 199ms
04-02 15:59:37.010 275 278 D dalvikvm: GC_CONCURRENT freed 685K, 42% free 5594K/9644K, paused 12ms+22ms, total 147ms
04-02 16:00:08.731 275 278 D dalvikvm: GC_CONCURRENT freed 712K, 42% free 5595K/9644K, paused 6ms+25ms, total 143ms
04-02 16:00:40.480 275 278 D dalvikvm: GC_CONCURRENT freed 705K, 43% free 5590K/9644K, paused 6ms+23ms, total 150ms
04-02 16:01:12.130 275 278 D dalvikvm: GC_CONCURRENT freed 671K, 43% free 5568K/9644K, paused 6ms+7ms, total 107ms
04-02 16:01:50.010 275 278 D dalvikvm: GC_CONCURRENT freed 651K, 43% free 5559K/9644K, paused 4ms+8ms, total 77ms
04-02 16:02:15.200 275 278 D dalvikvm: GC_CONCURRENT freed 635K, 43% free 5566K/9644K, paused 6ms+9ms, total 109ms
04-02 16:02:54.650 275 278 D dalvikvm: GC_CONCURRENT freed 650K, 43% free 5558K/9644K, paused 4ms+5ms, total 74ms
04-02 16:03:18.501 275 278 D dalvikvm: GC_CONCURRENT freed 638K, 43% free 5565K/9644K, paused 6ms+6ms, total 97ms
04-02 16:03:19.050 1101 1106 I dalvikvm: threadid=3: reacting to signal 3
So Cleanup(), running on the main thread, is waiting for mSaveComplete:
https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.38
Another thread, which would eventually set mSaveComplete, tries to dispatch to the main thread with NS_DISPATCH_SYNC:
https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.114
and they are deadlocked.
Although Android/pgo test-verify seems to provide a reliable way of triggering this, I don't see any Android-specific or test-specific code involved; I think we are vulnerable to a shutdown hang on any platform, whenever !mSaveComplete.
:kmag -- Still hoping you can provide a fix!
Reporter | ||
Comment 20•5 years ago
•
|
||
I don't see any Android-specific or test-specific code involved; I think we are vulnerable to a shutdown hang on any platform, whenever !mSaveComplete.
I'm actually a little surprised that the bug isn't revealed on other platform in chaos mode. At least the release assert at https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.30 isn't that difficult to trigger under strange scheduling.
STR:
- Add a
sleep(30)
just afterMonitorAutoUnlock mau(mSaveMonitor);
https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.110 to mimic a strange scheduling of that thread. - Open the browser go to a random site.
- Close it quickly (within 30s of course).
Then the release assert will be triggered.
Is the chaos mode not chaos enough? A single extremely slow scheduling doesn't seem to be a complicated condition if the chaos mode does try various strange scheduling randomly.
Comment hidden (Intermittent Failures Robot) |
Comment 23•5 years ago
|
||
I wonder whether rr's chaos mode would catch this.
Updated•5 years ago
|
Comment 32•5 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #19)
So Cleanup(), running on the main thread, is waiting for mSaveComplete:
https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.38
Another thread, which would eventually set mSaveComplete, tries to dispatch to the main thread with NS_DISPATCH_SYNC:
https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.114
and they are deadlocked.
Thanks. That makes my job much easier.
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Description
•