Closed Bug 1536696 Opened 5 years ago Closed 5 years ago

Intermittent TV timeout cases on "android-em-4-3-armv7-api16" that were caused by firefox shutdown hang

Categories

(Firefox for Android Graveyard :: Testing, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1526086

People

(Reporter: violet.bugreport, Unassigned)

References

(Blocks 1 open bug)

Details

Crash Data

This bug is meant to track intermittent timeout reports on "android-em-4-3-armv7-api16", which were caused by shutdown hanging problem in the testing platform itself after the successful completion of the actual testcase.

For more context:

https://bugzilla.mozilla.org/show_bug.cgi?id=1534079#c6
https://groups.google.com/forum/#!topic/mozilla.dev.platform/WoHZrsfUIyk

Crash Signature: [@ libc.so + 0x1c3dc]

These are browser shutdown hangs: After test completion, during or after xpcom-will-shutdown, firefox hangs and fails to exit. In all cases reported so far, this happens in TV, with chaos mode.

Blocks: test-verify
Summary: Intermittent TV timeout cases on "android-em-4-3-armv7-api16" that were caused by shutdown hanging on the testing platform itself → Intermittent TV timeout cases on "android-em-4-3-armv7-api16" that were caused by firefox shutdown hang
Priority: -- → P3

I'm pretty sure this is not an infrequent issue. It is infrequent only in the sense that people don't add new crashtest/reftest frequently.

Recently SVG component has resolved a lot of crash bugs, thus we added many new crashtests these days. Almost all of them have got this problem.

I see ScriptPreloader::Observe in several of the crash reports -- could be significant?

It is easy to reproduce the failures on try: Modify a few of the tests mentioned in this bug and run TV on Android/pgo:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=4e0370d14994162476a30b3ddbe625bbd533def3

I see ScriptPreloader::Observe in several of the crash reports -- could be significant?

That try push has extra logging to check ScriptPreloader::Observe -- see the logcat artifacts for the extra logging. For example:

https://taskcluster-artifacts.net/Yv6CP6JbQFSp7I4LjFPJwA/0/public/test_info//logcat-emulator-5554.log

03-27 13:20:30.219   963   980 E ZZZ     : ScriptPreloader::Observe start [browser-idle-startup-tasks-finished]
03-27 13:20:30.219   963   980 E ZZZ     : ScriptPreloader::Observe end
03-27 13:20:30.219   963   980 E ZZZ     : ScriptPreloader::Observe start [browser-idle-startup-tasks-finished]
03-27 13:20:30.219   963   980 E ZZZ     : ScriptPreloader::Observe end
03-27 13:20:30.729   963   980 I Gecko   : 1553718030739	Marionette	TRACE	Received observer notification xpcom-will-shutdown
03-27 13:20:30.849   963   980 E ZZZ     : ScriptPreloader::Observe start [xpcom-shutdown]
03-27 13:20:30.849   963   980 E ZZZ     : ScriptPreloader::Observe end
03-27 13:20:30.849   963   980 E ZZZ     : ScriptPreloader::Observe start [xpcom-shutdown]
03-27 13:21:19.440   275   279 D dalvikvm: GC_CONCURRENT freed 714K, 43% free 5570K/9764K, paused 6ms+19ms, total 121ms
03-27 13:21:59.649   275   279 D dalvikvm: GC_CONCURRENT freed 653K, 44% free 5560K/9764K, paused 5ms+5ms, total 74ms
03-27 13:22:00.169   334   338 D dalvikvm: GC_CONCURRENT freed 14221K, 70% free 6348K/20640K, paused 4ms+13ms, total 138ms
03-27 13:22:22.639   275   279 D dalvikvm: GC_CONCURRENT freed 635K, 43% free 5568K/9764K, paused 5ms+17ms, total 103ms
03-27 13:22:54.219   275   279 D dalvikvm: GC_CONCURRENT freed 642K, 43% free 5569K/9764K, paused 6ms+5ms, total 84ms
03-27 13:23:25.859   275   279 D dalvikvm: GC_CONCURRENT freed 643K, 43% free 5568K/9764K, paused 5ms+6ms, total 83ms
03-27 13:23:59.839   275   279 D dalvikvm: GC_CONCURRENT freed 650K, 44% free 5560K/9764K, paused 4ms+5ms, total 71ms
03-27 13:24:29.019   275   279 D dalvikvm: GC_CONCURRENT freed 636K, 43% free 5567K/9764K, paused 6ms+10ms, total 104ms
03-27 13:25:00.599   275   279 D dalvikvm: GC_CONCURRENT freed 650K, 43% free 5568K/9764K, paused 7ms+7ms, total 98ms
03-27 13:25:42.579   275   279 D dalvikvm: GC_CONCURRENT freed 695K, 43% free 5593K/9764K, paused 6ms+19ms, total 137ms
03-27 13:26:14.289   275   279 D dalvikvm: GC_CONCURRENT freed 714K, 43% free 5587K/9764K, paused 5ms+17ms, total 133ms
03-27 13:26:46.019   275   279 D dalvikvm: GC_CONCURRENT freed 699K, 43% free 5590K/9764K, paused 8ms+24ms, total 141ms
03-27 13:27:17.889   275   279 D dalvikvm: GC_CONCURRENT freed 677K, 43% free 5566K/9764K, paused 5ms+8ms, total 96ms
03-27 13:27:18.409   963   968 I dalvikvm: threadid=3: reacting to signal 3

It does appear that these hangs are caused by a hang in ScriptPreloader::Observe for "xpcom-shutdown".

:kmag -- it looks like you have worked on ScriptPreloader::Cleanup before; would you be able to look into this?

Flags: needinfo?(kmaglione+bmo)

Narrowed this down a little more here:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237000287&repo=try&lineNumber=1487

https://taskcluster-artifacts.net/JRUVjwwhQcKaCI77J4u1NA/0/public/test_info//logcat-emulator-5554.log

03-29 14:03:54.936   957   974 E ZZZ     : [0x52ba39d0] ScriptPreloader::Observe cleanup start [xpcom-shutdown]
03-29 14:03:54.936   957   974 E ZZZ     : [0x52ba39d0] ScriptPreloader::Observe cleanup end
03-29 14:03:54.946   957   974 E ZZZ     : [0x52ba3500] ScriptPreloader::Observe cleanup start [xpcom-shutdown]
03-29 14:03:54.946   957   974 E ZZZ     : [0x52ba3500] Cleanup/mSaveThread before wait loop
03-29 14:04:40.476   275   278 D dalvikvm: GC_CONCURRENT freed 714K, 44% free 5569K/9828K, paused 6ms+5ms, total 116ms
03-29 14:05:12.137   275   278 D dalvikvm: GC_CONCURRENT freed 653K, 44% free 5566K/9828K, paused 5ms+5ms, total 111ms
03-29 14:05:54.037   275   278 D dalvikvm: GC_CONCURRENT freed 694K, 44% free 5586K/9828K, paused 14ms+19ms, total 149ms
03-29 14:06:25.736   275   278 D dalvikvm: GC_CONCURRENT freed 704K, 44% free 5590K/9828K, paused 6ms+23ms, total 139ms
03-29 14:06:57.476   275   278 D dalvikvm: GC_CONCURRENT freed 707K, 44% free 5591K/9828K, paused 7ms+24ms, total 150ms
03-29 14:07:29.146   275   278 D dalvikvm: GC_CONCURRENT freed 682K, 44% free 5565K/9828K, paused 7ms+5ms, total 115ms
03-29 14:08:00.706   275   278 D dalvikvm: GC_CONCURRENT freed 652K, 44% free 5565K/9828K, paused 6ms+5ms, total 114ms
03-29 14:08:42.616   275   278 D dalvikvm: GC_CONCURRENT freed 704K, 44% free 5576K/9828K, paused 6ms+5ms, total 138ms
03-29 14:09:14.336   275   278 D dalvikvm: GC_CONCURRENT freed 706K, 44% free 5577K/9828K, paused 6ms+24ms, total 139ms
03-29 14:09:46.066   275   278 D dalvikvm: GC_CONCURRENT freed 699K, 44% free 5586K/9828K, paused 5ms+23ms, total 150ms
03-29 14:10:17.727   275   278 D dalvikvm: GC_CONCURRENT freed 675K, 44% free 5564K/9828K, paused 5ms+5ms, total 109ms
03-29 14:10:39.496   957   962 I dalvikvm: threadid=3: reacting to signal 3

...suggesting a hang at:

https://hg.mozilla.org/try/rev/1b99b00f943bf59548eb34b83b788ade22e61e0d#l4.35

And tracking that a little deeper:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237726224&repo=try&lineNumber=1510

https://taskcluster-artifacts.net/aiejVdl-RIKtuUW3LH_euw/0/public/test_info//logcat-emulator-5554.log

04-02 15:56:40.881  1101  1118 E ZZZ     : [0x52b9f500] Cleanup running on main thread!
04-02 15:56:40.881  1101  1118 E ZZZ     : [0x52b9f500] Cleanup/mSaveThread before wait loop
04-02 15:56:50.310  1101  1194 E ZZZ     : [0x52b9f500] mSaveThread after preloader WriteCache
04-02 15:56:50.310  1101  1194 E ZZZ     : [0x52b9f500] WriteCache dispatches to PrepareCacheWrite
04-02 15:57:20.210   275   278 D dalvikvm: GC_CONCURRENT freed 718K, 42% free 5606K/9644K, paused 5ms+22ms, total 144ms
04-02 15:57:55.701   275   278 D dalvikvm: GC_CONCURRENT freed 698K, 43% free 5560K/9644K, paused 5ms+6ms, total 75ms
04-02 15:58:23.460   275   278 D dalvikvm: GC_CONCURRENT freed 638K, 43% free 5568K/9644K, paused 6ms+5ms, total 96ms
04-02 15:59:00.200   275   278 D dalvikvm: GC_CONCURRENT freed 655K, 43% free 5564K/9644K, paused 4ms+27ms, total 199ms
04-02 15:59:37.010   275   278 D dalvikvm: GC_CONCURRENT freed 685K, 42% free 5594K/9644K, paused 12ms+22ms, total 147ms
04-02 16:00:08.731   275   278 D dalvikvm: GC_CONCURRENT freed 712K, 42% free 5595K/9644K, paused 6ms+25ms, total 143ms
04-02 16:00:40.480   275   278 D dalvikvm: GC_CONCURRENT freed 705K, 43% free 5590K/9644K, paused 6ms+23ms, total 150ms
04-02 16:01:12.130   275   278 D dalvikvm: GC_CONCURRENT freed 671K, 43% free 5568K/9644K, paused 6ms+7ms, total 107ms
04-02 16:01:50.010   275   278 D dalvikvm: GC_CONCURRENT freed 651K, 43% free 5559K/9644K, paused 4ms+8ms, total 77ms
04-02 16:02:15.200   275   278 D dalvikvm: GC_CONCURRENT freed 635K, 43% free 5566K/9644K, paused 6ms+9ms, total 109ms
04-02 16:02:54.650   275   278 D dalvikvm: GC_CONCURRENT freed 650K, 43% free 5558K/9644K, paused 4ms+5ms, total 74ms
04-02 16:03:18.501   275   278 D dalvikvm: GC_CONCURRENT freed 638K, 43% free 5565K/9644K, paused 6ms+6ms, total 97ms
04-02 16:03:19.050  1101  1106 I dalvikvm: threadid=3: reacting to signal 3

So Cleanup(), running on the main thread, is waiting for mSaveComplete:

https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.38

Another thread, which would eventually set mSaveComplete, tries to dispatch to the main thread with NS_DISPATCH_SYNC:

https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.114

and they are deadlocked.

Although Android/pgo test-verify seems to provide a reliable way of triggering this, I don't see any Android-specific or test-specific code involved; I think we are vulnerable to a shutdown hang on any platform, whenever !mSaveComplete.

:kmag -- Still hoping you can provide a fix!

I don't see any Android-specific or test-specific code involved; I think we are vulnerable to a shutdown hang on any platform, whenever !mSaveComplete.

I'm actually a little surprised that the bug isn't revealed on other platform in chaos mode. At least the release assert at https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.30 isn't that difficult to trigger under strange scheduling.

STR:

  1. Add a sleep(30) just after MonitorAutoUnlock mau(mSaveMonitor); https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.110 to mimic a strange scheduling of that thread.
  2. Open the browser go to a random site.
  3. Close it quickly (within 30s of course).

Then the release assert will be triggered.

Is the chaos mode not chaos enough? A single extremely slow scheduling doesn't seem to be a complicated condition if the chaos mode does try various strange scheduling randomly.

I wonder whether rr's chaos mode would catch this.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(kmaglione+bmo)
Resolution: --- → DUPLICATE

(In reply to Geoff Brown [:gbrown] from comment #19)

So Cleanup(), running on the main thread, is waiting for mSaveComplete:

https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.38

Another thread, which would eventually set mSaveComplete, tries to dispatch to the main thread with NS_DISPATCH_SYNC:

https://hg.mozilla.org/try/rev/ea532c361fe586d57d1409bbc26b68b92a1e7827#l4.114

and they are deadlocked.

Thanks. That makes my job much easier.

Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.