Closed
Bug 846774
Opened 12 years ago
Closed 12 years ago
We need to find out what Gecko is doing when geckoEventSync aborts due to timeout
Categories
(Firefox for Android Graveyard :: General, defect)
Tracking
(firefox20+ wontfix, firefox21+ fixed, firefox22 fixed)
RESOLVED
FIXED
Firefox 22
People
(Reporter: kats, Assigned: kats)
References
Details
Attachments
(3 files, 1 obsolete file)
3.19 KB,
patch
|
cpeterson
:
review+
bajaj
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
18.93 KB,
patch
|
cpeterson
:
review+
bajaj
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
6.90 KB,
text/plain
|
Details |
We added code in bug 835356 to abort the call to geckoEventSync() if it doesn't complete in 4 seconds, so as to avoid the ANR that Android delivers at 5 seconds. However, if the geckoEventSync() hasn't completed in 4 seconds, then the gecko thread is doing something that's taking too long, and we need to find out what.
Assignee | ||
Comment 1•12 years ago
|
||
My idea here is that when we hit this condition, call a native method that prints out the type of the "curEvent" that is being processed in nsAppShell::ProcessNextNativeEvent. However it would be even better if we could dump a backtrace of the gecko thread to see exactly where it is.
Assignee | ||
Comment 2•12 years ago
|
||
Also CC'ing jchen and cpeterson for any ideas you might have.
Comment 3•12 years ago
|
||
Last time I had a sync timeout, I dropped into gdb and backtraced the Gecko thread. The top frame was in mozilla::layers::TiledLayerBuffer::Update, but I wasn't able to trace back further.
The symptom of the sync timeout was pages not loading. I was on about:home, but tapping on top sites didn't do anything, with the throbber spinning. I wonder if this is related to reports of Firefox being "slow"
This also happened before the compositor startup bug was fixed. I don't remember if my problem happened on startup, but it's possible that it was caused by that bug.
Comment 4•12 years ago
|
||
Bug 797615 shows this happening quite often during robocop testSystemPages -- that might help us reproduce / investigate.
Comment 5•12 years ago
|
||
Tracking this since it's blocking a couple of tracked bugs.
tracking-firefox20:
--- → +
Assignee | ||
Comment 6•12 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #4)
> Bug 797615 shows this happening quite often during robocop testSystemPages
> -- that might help us reproduce / investigate.
I pushed a try build with some logging, but the robocop tests didn't run on it. My trychooser syntax seems fine unless I'm missing something. wtf!
https://tbpl.mozilla.org/?tree=Try&rev=192c083838e6
Also what's the deal with "robocop-1" and "robocop-2"?
tracking-firefox20:
+ → ---
Assignee | ||
Comment 8•12 years ago
|
||
(In reply to Kevin Brosnan [:kbrosnan] from comment #7)
> Looks like a flag got clobbered here.
Yeah, whoops :(
Re-pushed with robocop-1 and robocop-2 in the try syntax: https://tbpl.mozilla.org/?tree=Try&rev=ecddbc545fc6
Assignee | ||
Comment 9•12 years ago
|
||
I goofed the try build in comment 8 since I used ALOG which is only enabled on debug builds. I re-pushed to https://tbpl.mozilla.org/?tree=Try&rev=ccf7affd6164 and eventually got it to reproduce. The log output during the UI thread blockage was:
nsAppShell: queued event type: 28 details:
nsAppShell: queued event type: 20 details:
nsAppShell: queued event type: 0 details:
nsAppShell: queued event type: 25 details:
nsAppShell: queued event type: 19 details: SearchEngines:Get
nsAppShell: queued event type: 15 details:
which is kind of surprising. It means that at the time this was dumped, the gecko thread was not processing an event in nsAppShell::ProcessNextNativeEvent, and that the queue contained 6 items (COMPOSITOR_PAUSE, VIEWPORT, NATIVE_POKE, THUMBNAIL, SearchEngines:Get, and GECKO_EVENT_SYNC). So the Gecko thread hadn't even started processing the compositor-pause message, but was stuck before that. Also surprising is that 4 other events manage to get into the queue between the two calls to GeckoAppShell.notifyGeckoOfEvent() generated by the call to GeckoAppShell.sendEventToGeckoSync(). I'll add some more logging and try again.
Assignee | ||
Comment 10•12 years ago
|
||
Updated try build is at https://tbpl.mozilla.org/?tree=Try&rev=5c9bc8aa82a9
I found two logs that contained the exception:
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444575&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444753&tree=Try&full=1
They're both similar in that they have output showing a large gap (~15 seconds) between calls to ProcessNextNativeEvent, during which the awesomebar is opened, and the sync compositor pause request gets queued, resulting in the timeout and stack dump. What's interesting is that the gap occurs between calls to ProcessNextNativeEvent, rather than during a call to ProcessNextNativeEvent (as evidenced by the "current event" being null), so that means the gecko thread is doing something else like idle work or one of the other things in MessagePump::Run() [1].
The other suspicious piece of the log is that right before the call to the post-gap ProcessNextNativeEvent, there is this:
E/Surface ( 5496): using an invalid surface id=0, identity=196 should be 199
I/CompositorParent( 5496): Unable to renew compositor surface; remaining in paused state
This makes no sense to me, because this gets printed from CompositorParent::ResumeComposition [2] which runs on the compositor thread but is triggered from only a couple of other places, and all of those places would have shown up somehow in the log.
[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#79
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#343
Comment 11•12 years ago
|
||
Maybe you can cause a crash when the sync times out? Then we can grab the minidump and get the Gecko stack.
Assignee | ||
Comment 12•12 years ago
|
||
I now have a device on which it is pretty easy to reproduce this crash. Unfortunately the adb connection on it seems to be unreliable so I can't use jimdb to debug it, but I can still do printf-debugging. I found at least one problem so far: when we send a sync compositor-pause event, the code in nsAppShell.cpp inserts the compositor-pause at the front of the queue because it is a high-priority event. However the sync event that follows it doesn't get the same treatment and gets put at the back of the queue. So even if the compositor-pause is handled quickly, the UI thread doesn't know about it until the rest of the queue is processed (which in one instance consisted of many broadcast events and took a long time to clear). So instead of having a separate "sync" event that we send from java, I would prefer to set a sync flag on the event itself that lets nsAppShell.cpp know it needs to ack it.
Assignee | ||
Comment 13•12 years ago
|
||
.. and the other problem I found that I did not expect is that ProcessNextNativeEvent is (or can be) recursive. In my case, when I hit the home button on the device and put fennec in the background, it sends a bunch of events from java to nsAppShell, including ACTIVITY_PAUSING and ACTIVITY_STOPPING which both take a while to run. The COMPOSITOR_PAUSE event also gets sent and often gets inserted into the queue ahead of these two (because of the special-casing in nsAppShell) whereas the GECKO_EVENT_SYNC event it's paired with gets inserted after those two. So the time gap until the sync is acknowledged ends up being very large and hits the timeout.
The recursion happens during the processing of ACTIVITY_PAUSING. It looks like some of the code in that handler calls back into the messageloop to pump more events (maybe it's blocked waiting on I/O or something?). This shouldn't really be a problem, I think, but it explains why in comment 9 and 10 I was seeing the "current event" as null (my mCurEvent variable would get clobbered on the recursive exit even though the outer level was still processing an event).
So in summary, there are two things we can do here:
1) Turn the "sync" event into a sync flag on the event itself, so that there isn't a huge gap between an event and the corresponding sync event.
2) Try and reduce the amount of time spent in ACTIVITY_PAUSING and ACTIVITY_STOPPING as much as possible, because it blocks the gecko thread for long periods of time. Any work there that can be posted back to the main thread or done async should be done that way to keep the gecko thread responsive.
Updated•12 years ago
|
Assignee | ||
Comment 14•12 years ago
|
||
This seems to do the job. Pushed to try at https://tbpl.mozilla.org/?tree=Try&rev=3ad722b4691e
Attachment #723018 -
Flags: review?(cpeterson)
Assignee | ||
Comment 15•12 years ago
|
||
They don't need to be public.
Attachment #723020 -
Flags: review?(cpeterson)
Assignee | ||
Comment 16•12 years ago
|
||
Small fix to deal with the oranges in that try build (s/GetIntField/GetBooleanField/). Updated try build is green at https://tbpl.mozilla.org/?tree=Try&rev=2299b7ef236a
Attachment #723018 -
Attachment is obsolete: true
Attachment #723018 -
Flags: review?(cpeterson)
Attachment #723063 -
Flags: review?(cpeterson)
Comment 17•12 years ago
|
||
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private
Review of attachment 723020 [details] [diff] [review]:
-----------------------------------------------------------------
+1
Attachment #723020 -
Flags: review?(cpeterson) → review+
Comment 18•12 years ago
|
||
Comment on attachment 723063 [details] [diff] [review]
Part 1 - Move event sync behaviour to a flag in GeckoEvent
Review of attachment 723063 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM
Attachment #723063 -
Flags: review?(cpeterson) → review+
Assignee | ||
Comment 19•12 years ago
|
||
Comment 20•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/95729bbde220
https://hg.mozilla.org/mozilla-central/rev/5f92acf34104
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 22
Comment 21•12 years ago
|
||
To uplift this would require uplifting bug 844275 as well and since there's too much risk to doing that on aurora we'll have to wontfix this for FF20.
status-firefox20:
--- → wontfix
status-firefox21:
--- → affected
tracking-firefox21:
--- → +
Depends on: 844275
Assignee | ||
Comment 23•12 years ago
|
||
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private
[Approval Request Comment]
Bug caused by (feature/regressing bug #): various old bugs
User impact if declined: occasional compositor startup errors
Testing completed (on m-c, etc.): on m-c, aurora try build at https://tbpl.mozilla.org/?tree=Try&rev=26b79db37644
Risk to taking this patch (and alternatives if risky): fennec only, low risk
String or UUID changes made by this patch: none
Attachment #723020 -
Flags: approval-mozilla-aurora?
Assignee | ||
Updated•12 years ago
|
Attachment #723063 -
Flags: approval-mozilla-aurora?
Comment 24•12 years ago
|
||
Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private
This aids with bug 844275(recently approved on aurora) to help with compositor startup errors.Low risk, fennec only approving for uplift
Attachment #723020 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•12 years ago
|
Attachment #723063 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 25•12 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/bab7bc14c912
https://hg.mozilla.org/releases/mozilla-aurora/rev/a7232156dace
status-firefox22:
--- → fixed
Comment 26•12 years ago
|
||
On Nightly (03/21), I just hit a similar issue here. I had returned to my device after idle for an hour opened the awesome-screen; see log
Flags: needinfo?
Updated•12 years ago
|
Flags: needinfo?
Assignee | ||
Comment 27•12 years ago
|
||
I tracked down another possible cause of this error in bug 797615 and filed bug 852467 for it. There may be other causes as well, but without a specially-instrumented build it's pretty much impossible to tell from the log what's going on. :(
Updated•4 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•