Closed Bug 846774 Opened 11 years ago Closed 11 years ago

We need to find out what Gecko is doing when geckoEventSync aborts due to timeout

Categories

(Firefox for Android Graveyard :: General, defect)

All
Android
defect
Not set
normal

Tracking

(firefox20+ wontfix, firefox21+ fixed, firefox22 fixed)

RESOLVED FIXED
Firefox 22
Tracking Status
firefox20 + wontfix
firefox21 + fixed
firefox22 --- fixed

People

(Reporter: kats, Assigned: kats)

References

Details

Attachments

(3 files, 1 obsolete file)

We added code in bug 835356 to abort the call to geckoEventSync() if it doesn't complete in 4 seconds, so as to avoid the ANR that Android delivers at 5 seconds. However, if the geckoEventSync() hasn't completed in 4 seconds, then the gecko thread is doing something that's taking too long, and we need to find out what.
My idea here is that when we hit this condition, call a native method that prints out the type of the "curEvent" that is being processed in nsAppShell::ProcessNextNativeEvent. However it would be even better if we could dump a backtrace of the gecko thread to see exactly where it is.
Also CC'ing jchen and cpeterson for any ideas you might have.
Last time I had a sync timeout, I dropped into gdb and backtraced the Gecko thread. The top frame was in mozilla::layers::TiledLayerBuffer::Update, but I wasn't able to trace back further.

The symptom of the sync timeout was pages not loading. I was on about:home, but tapping on top sites didn't do anything, with the throbber spinning. I wonder if this is related to reports of Firefox being "slow"

This also happened before the compositor startup bug was fixed. I don't remember if my problem happened on startup, but it's possible that it was caused by that bug.
Bug 797615 shows this happening quite often during robocop testSystemPages -- that might help us reproduce / investigate.
Tracking this since it's blocking a couple of tracked bugs.
(In reply to Geoff Brown [:gbrown] from comment #4)
> Bug 797615 shows this happening quite often during robocop testSystemPages
> -- that might help us reproduce / investigate.

I pushed a try build with some logging, but the robocop tests didn't run on it. My trychooser syntax seems fine unless I'm missing something. wtf!

https://tbpl.mozilla.org/?tree=Try&rev=192c083838e6

Also what's the deal with "robocop-1" and "robocop-2"?
Looks like a flag got clobbered here.
(In reply to Kevin Brosnan [:kbrosnan] from comment #7)
> Looks like a flag got clobbered here.

Yeah, whoops :(

Re-pushed with robocop-1 and robocop-2 in the try syntax: https://tbpl.mozilla.org/?tree=Try&rev=ecddbc545fc6
I goofed the try build in comment 8 since I used ALOG which is only enabled on debug builds. I re-pushed to https://tbpl.mozilla.org/?tree=Try&rev=ccf7affd6164 and eventually got it to reproduce. The log output during the UI thread blockage was:

nsAppShell: queued event type: 28 details: 
nsAppShell: queued event type: 20 details: 
nsAppShell: queued event type: 0 details: 
nsAppShell: queued event type: 25 details: 
nsAppShell: queued event type: 19 details: SearchEngines:Get
nsAppShell: queued event type: 15 details: 

which is kind of surprising. It means that at the time this was dumped, the gecko thread was not processing an event in nsAppShell::ProcessNextNativeEvent, and that the queue contained 6 items (COMPOSITOR_PAUSE, VIEWPORT, NATIVE_POKE, THUMBNAIL, SearchEngines:Get, and GECKO_EVENT_SYNC). So the Gecko thread hadn't even started processing the compositor-pause message, but was stuck before that. Also surprising is that 4 other events manage to get into the queue between the two calls to GeckoAppShell.notifyGeckoOfEvent() generated by the call to GeckoAppShell.sendEventToGeckoSync(). I'll add some more logging and try again.
Updated try build is at https://tbpl.mozilla.org/?tree=Try&rev=5c9bc8aa82a9

I found two logs that contained the exception:
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444575&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444753&tree=Try&full=1

They're both similar in that they have output showing a large gap (~15 seconds) between calls to ProcessNextNativeEvent, during which the awesomebar is opened, and the sync compositor pause request gets queued, resulting in the timeout and stack dump. What's interesting is that the gap occurs between calls to ProcessNextNativeEvent, rather than during a call to ProcessNextNativeEvent (as evidenced by the "current event" being null), so that means the gecko thread is doing something else like idle work or one of the other things in MessagePump::Run() [1].

The other suspicious piece of the log is that right before the call to the post-gap ProcessNextNativeEvent, there is this:

E/Surface ( 5496): using an invalid surface id=0, identity=196 should be 199
I/CompositorParent( 5496): Unable to renew compositor surface; remaining in paused state

This makes no sense to me, because this gets printed from CompositorParent::ResumeComposition [2] which runs on the compositor thread but is triggered from only a couple of other places, and all of those places would have shown up somehow in the log.

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#79
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#343
Maybe you can cause a crash when the sync times out? Then we can grab the minidump and get the Gecko stack.
I now have a device on which it is pretty easy to reproduce this crash. Unfortunately the adb connection on it seems to be unreliable so I can't use jimdb to debug it, but I can still do printf-debugging. I found at least one problem so far: when we send a sync compositor-pause event, the code in nsAppShell.cpp inserts the compositor-pause at the front of the queue because it is a high-priority event. However the sync event that follows it doesn't get the same treatment and gets put at the back of the queue. So even if the compositor-pause is handled quickly, the UI thread doesn't know about it until the rest of the queue is processed (which in one instance consisted of many broadcast events and took a long time to clear). So instead of having a separate "sync" event that we send from java, I would prefer to set a sync flag on the event itself that lets nsAppShell.cpp know it needs to ack it.
.. and the other problem I found that I did not expect is that ProcessNextNativeEvent is (or can be) recursive. In my case, when I hit the home button on the device and put fennec in the background, it sends a bunch of events from java to nsAppShell, including ACTIVITY_PAUSING and ACTIVITY_STOPPING which both take a while to run. The COMPOSITOR_PAUSE event also gets sent and often gets inserted into the queue ahead of these two (because of the special-casing in nsAppShell) whereas the GECKO_EVENT_SYNC event it's paired with gets inserted after those two. So the time gap until the sync is acknowledged ends up being very large and hits the timeout.

The recursion happens during the processing of ACTIVITY_PAUSING. It looks like some of the code in that handler calls back into the messageloop to pump more events (maybe it's blocked waiting on I/O or something?). This shouldn't really be a problem, I think, but it explains why in comment 9 and 10 I was seeing the "current event" as null (my mCurEvent variable would get clobbered on the recursive exit even though the outer level was still processing an event).

So in summary, there are two things we can do here:
1) Turn the "sync" event into a sync flag on the event itself, so that there isn't a huge gap between an event and the corresponding sync event.
2) Try and reduce the amount of time spent in ACTIVITY_PAUSING and ACTIVITY_STOPPING as much as possible, because it blocks the gecko thread for long periods of time. Any work there that can be posted back to the main thread or done async should be done that way to keep the gecko thread responsive.
They don't need to be public.
Attachment #723020 - Flags: review?(cpeterson)
Small fix to deal with the oranges in that try build (s/GetIntField/GetBooleanField/). Updated try build is green at https://tbpl.mozilla.org/?tree=Try&rev=2299b7ef236a
Attachment #723018 - Attachment is obsolete: true
Attachment #723018 - Flags: review?(cpeterson)
Attachment #723063 - Flags: review?(cpeterson)
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

Review of attachment 723020 [details] [diff] [review]:
-----------------------------------------------------------------

+1
Attachment #723020 - Flags: review?(cpeterson) → review+
Comment on attachment 723063 [details] [diff] [review]
Part 1 - Move event sync behaviour to a flag in GeckoEvent

Review of attachment 723063 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #723063 - Flags: review?(cpeterson) → review+
https://hg.mozilla.org/mozilla-central/rev/95729bbde220
https://hg.mozilla.org/mozilla-central/rev/5f92acf34104
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 22
To uplift this would require uplifting bug 844275 as well and since there's too much risk to doing that on aurora we'll have to wontfix this for FF20.
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

[Approval Request Comment]
Bug caused by (feature/regressing bug #): various old bugs
User impact if declined: occasional compositor startup errors
Testing completed (on m-c, etc.): on m-c, aurora try build at https://tbpl.mozilla.org/?tree=Try&rev=26b79db37644
Risk to taking this patch (and alternatives if risky): fennec only, low risk
String or UUID changes made by this patch: none
Attachment #723020 - Flags: approval-mozilla-aurora?
Attachment #723063 - Flags: approval-mozilla-aurora?
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

This aids with bug 844275(recently approved on aurora) to help with compositor startup errors.Low risk, fennec only approving for uplift
Attachment #723020 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #723063 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attached file Nightly (03/21) - Log
On Nightly (03/21), I just hit a similar issue here. I had returned to my device after idle for an hour opened the awesome-screen; see log
Flags: needinfo?
Flags: needinfo?
I tracked down another possible cause of this error in bug 797615 and filed bug 852467 for it. There may be other causes as well, but without a specially-instrumented build it's pretty much impossible to tell from the log what's going on. :(
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: