Closed Bug 846774 Opened 11 years ago Closed 11 years ago

We need to find out what Gecko is doing when geckoEventSync aborts due to timeout

Tracking

(firefox20+ wontfix, firefox21+ fixed, firefox22 fixed)

Status:

RESOLVED FIXED

Milestone:

Firefox 22

Tracking Flags:

Tracking

Status

firefox20

wontfix

firefox21

fixed

firefox22

---

fixed

People

(Reporter: kats, Assigned: kats)

References

Details

Attachments

(3 files, 1 obsolete file)

Part 1 - Move event sync behaviour to a flag in GeckoEvent 11 years ago Kartikaya Gupta (email:kats@mozilla.staktrace.com) 18.92 KB, patch		Details \| Diff \| Splinter Review
Part 2 - Make GeckoEvent variables private 11 years ago Kartikaya Gupta (email:kats@mozilla.staktrace.com) 3.19 KB, patch	cpeterson : review+ bajaj : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Part 1 - Move event sync behaviour to a flag in GeckoEvent 11 years ago Kartikaya Gupta (email:kats@mozilla.staktrace.com) 18.93 KB, patch	cpeterson : review+ bajaj : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Nightly (03/21) - Log 11 years ago Aaron Train [:aaronmt] 6.90 KB, text/plain		Details

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Description

•

11 years ago

We added code in bug 835356 to abort the call to geckoEventSync() if it doesn't complete in 4 seconds, so as to avoid the ANR that Android delivers at 5 seconds. However, if the geckoEventSync() hasn't completed in 4 seconds, then the gecko thread is doing something that's taking too long, and we need to find out what.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

11 years ago

Blocks: 834243

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 1

•

11 years ago

My idea here is that when we hit this condition, call a native method that prints out the type of the "curEvent" that is being processed in nsAppShell::ProcessNextNativeEvent. However it would be even better if we could dump a backtrace of the gecko thread to see exactly where it is.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 2

•

11 years ago

Also CC'ing jchen and cpeterson for any ideas you might have.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

11 years ago

Blocks: 829100

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

11 years ago

Blocks: 706500

(inactive) Jim Chen [:jchen] [:darchons]

Comment 3

•

11 years ago

Last time I had a sync timeout, I dropped into gdb and backtraced the Gecko thread. The top frame was in mozilla::layers::TiledLayerBuffer::Update, but I wasn't able to trace back further.

The symptom of the sync timeout was pages not loading. I was on about:home, but tapping on top sites didn't do anything, with the throbber spinning. I wonder if this is related to reports of Firefox being "slow"

This also happened before the compositor startup bug was fixed. I don't remember if my problem happened on startup, but it's possible that it was caused by that bug.

Geoff Brown [:gbrown] (pto, back Jun 10)

Comment 4

•

11 years ago

Bug 797615 shows this happening quite often during robocop testSystemPages -- that might help us reproduce / investigate.

Geoff Brown [:gbrown] (pto, back Jun 10)

Updated

•

11 years ago

Blocks: 797615

Lukas Blakk [:lsblakk] use ?needinfo

Comment 5

•

11 years ago

Tracking this since it's blocking a couple of tracked bugs.

tracking-firefox20: --- → +

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 6

•

11 years ago

(In reply to Geoff Brown [:gbrown] from comment #4)
> Bug 797615 shows this happening quite often during robocop testSystemPages
> -- that might help us reproduce / investigate.

I pushed a try build with some logging, but the robocop tests didn't run on it. My trychooser syntax seems fine unless I'm missing something. wtf!

https://tbpl.mozilla.org/?tree=Try&rev=192c083838e6

Also what's the deal with "robocop-1" and "robocop-2"?

tracking-firefox20: + → ---

Kevin Brosnan [Ex-Mozilla]

Comment 7

•

11 years ago

Looks like a flag got clobbered here.

tracking-firefox20: --- → ?

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 8

•

11 years ago

(In reply to Kevin Brosnan [:kbrosnan] from comment #7)
> Looks like a flag got clobbered here.

Yeah, whoops :(

Re-pushed with robocop-1 and robocop-2 in the try syntax: https://tbpl.mozilla.org/?tree=Try&rev=ecddbc545fc6

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 9

•

11 years ago

I goofed the try build in comment 8 since I used ALOG which is only enabled on debug builds. I re-pushed to https://tbpl.mozilla.org/?tree=Try&rev=ccf7affd6164 and eventually got it to reproduce. The log output during the UI thread blockage was:

nsAppShell: queued event type: 28 details: 
nsAppShell: queued event type: 20 details: 
nsAppShell: queued event type: 0 details: 
nsAppShell: queued event type: 25 details: 
nsAppShell: queued event type: 19 details: SearchEngines:Get
nsAppShell: queued event type: 15 details: 

which is kind of surprising. It means that at the time this was dumped, the gecko thread was not processing an event in nsAppShell::ProcessNextNativeEvent, and that the queue contained 6 items (COMPOSITOR_PAUSE, VIEWPORT, NATIVE_POKE, THUMBNAIL, SearchEngines:Get, and GECKO_EVENT_SYNC). So the Gecko thread hadn't even started processing the compositor-pause message, but was stuck before that. Also surprising is that 4 other events manage to get into the queue between the two calls to GeckoAppShell.notifyGeckoOfEvent() generated by the call to GeckoAppShell.sendEventToGeckoSync(). I'll add some more logging and try again.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 10

•

11 years ago

Updated try build is at https://tbpl.mozilla.org/?tree=Try&rev=5c9bc8aa82a9

I found two logs that contained the exception:
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444575&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=20444753&tree=Try&full=1

They're both similar in that they have output showing a large gap (~15 seconds) between calls to ProcessNextNativeEvent, during which the awesomebar is opened, and the sync compositor pause request gets queued, resulting in the timeout and stack dump. What's interesting is that the gap occurs between calls to ProcessNextNativeEvent, rather than during a call to ProcessNextNativeEvent (as evidenced by the "current event" being null), so that means the gecko thread is doing something else like idle work or one of the other things in MessagePump::Run() [1].

The other suspicious piece of the log is that right before the call to the post-gap ProcessNextNativeEvent, there is this:

E/Surface ( 5496): using an invalid surface id=0, identity=196 should be 199
I/CompositorParent( 5496): Unable to renew compositor surface; remaining in paused state

This makes no sense to me, because this gets printed from CompositorParent::ResumeComposition [2] which runs on the compositor thread but is triggered from only a couple of other places, and all of those places would have shown up somehow in the log.

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#79
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#343

(inactive) Jim Chen [:jchen] [:darchons]

Comment 11

•

11 years ago

Maybe you can cause a crash when the sync times out? Then we can grab the minidump and get the Gecko stack.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 12

•

11 years ago

I now have a device on which it is pretty easy to reproduce this crash. Unfortunately the adb connection on it seems to be unreliable so I can't use jimdb to debug it, but I can still do printf-debugging. I found at least one problem so far: when we send a sync compositor-pause event, the code in nsAppShell.cpp inserts the compositor-pause at the front of the queue because it is a high-priority event. However the sync event that follows it doesn't get the same treatment and gets put at the back of the queue. So even if the compositor-pause is handled quickly, the UI thread doesn't know about it until the rest of the queue is processed (which in one instance consisted of many broadcast events and took a long time to clear). So instead of having a separate "sync" event that we send from java, I would prefer to set a sync flag on the event itself that lets nsAppShell.cpp know it needs to ack it.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 13

•

11 years ago

.. and the other problem I found that I did not expect is that ProcessNextNativeEvent is (or can be) recursive. In my case, when I hit the home button on the device and put fennec in the background, it sends a bunch of events from java to nsAppShell, including ACTIVITY_PAUSING and ACTIVITY_STOPPING which both take a while to run. The COMPOSITOR_PAUSE event also gets sent and often gets inserted into the queue ahead of these two (because of the special-casing in nsAppShell) whereas the GECKO_EVENT_SYNC event it's paired with gets inserted after those two. So the time gap until the sync is acknowledged ends up being very large and hits the timeout.

The recursion happens during the processing of ACTIVITY_PAUSING. It looks like some of the code in that handler calls back into the messageloop to pump more events (maybe it's blocked waiting on I/O or something?). This shouldn't really be a problem, I think, but it explains why in comment 9 and 10 I was seeing the "current event" as null (my mCurEvent variable would get clobbered on the recursive exit even though the outer level was still processing an event).

So in summary, there are two things we can do here:
1) Turn the "sync" event into a sync flag on the event itself, so that there isn't a huge gap between an event and the corresponding sync event.
2) Try and reduce the amount of time spent in ACTIVITY_PAUSING and ACTIVITY_STOPPING as much as possible, because it blocks the gecko thread for long periods of time. Any work there that can be posted back to the main thread or done async should be done that way to keep the gecko thread responsive.

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

11 years ago

tracking-firefox20: ? → +

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 14

•

11 years ago

Attached patch Part 1 - Move event sync behaviour to a flag in GeckoEvent (obsolete) — Details — Splinter Review

This seems to do the job. Pushed to try at https://tbpl.mozilla.org/?tree=Try&rev=3ad722b4691e

Attachment #723018 - Flags: review?(cpeterson)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 15

•

11 years ago

Attached patch Part 2 - Make GeckoEvent variables private — Details — Splinter Review

They don't need to be public.

Attachment #723020 - Flags: review?(cpeterson)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 16

•

11 years ago

Attached patch Part 1 - Move event sync behaviour to a flag in GeckoEvent — Details — Splinter Review

Small fix to deal with the oranges in that try build (s/GetIntField/GetBooleanField/). Updated try build is green at https://tbpl.mozilla.org/?tree=Try&rev=2299b7ef236a

Attachment #723018 - Attachment is obsolete: true

Attachment #723018 - Flags: review?(cpeterson)

Attachment #723063 - Flags: review?(cpeterson)

Chris Peterson [:cpeterson]

Comment 17

•

11 years ago

Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

Review of attachment 723020 [details] [diff] [review]:
-----------------------------------------------------------------

+1

Attachment #723020 - Flags: review?(cpeterson) → review+

Chris Peterson [:cpeterson]

Comment 18

•

11 years ago

Comment on attachment 723063 [details] [diff] [review]
Part 1 - Move event sync behaviour to a flag in GeckoEvent

Review of attachment 723063 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

Attachment #723063 - Flags: review?(cpeterson) → review+

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 19

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/95729bbde220
https://hg.mozilla.org/integration/mozilla-inbound/rev/5f92acf34104

Ryan VanderMeulen [:RyanVM]

Comment 20

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/95729bbde220
https://hg.mozilla.org/mozilla-central/rev/5f92acf34104

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → Firefox 22

Lukas Blakk [:lsblakk] use ?needinfo

Comment 21

•

11 years ago

To uplift this would require uplifting bug 844275 as well and since there's too much risk to doing that on aurora we'll have to wontfix this for FF20.

status-firefox20: --- → wontfix

status-firefox21: --- → affected

tracking-firefox21: --- → +

Depends on: 844275

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 23

•

11 years ago

Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

[Approval Request Comment]
Bug caused by (feature/regressing bug #): various old bugs
User impact if declined: occasional compositor startup errors
Testing completed (on m-c, etc.): on m-c, aurora try build at https://tbpl.mozilla.org/?tree=Try&rev=26b79db37644
Risk to taking this patch (and alternatives if risky): fennec only, low risk
String or UUID changes made by this patch: none

Attachment #723020 - Flags: approval-mozilla-aurora?

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

11 years ago

Attachment #723063 - Flags: approval-mozilla-aurora?

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

11 years ago

Depends on: 851180

bhavana bajaj [:bajaj]

Comment 24

•

11 years ago

Comment on attachment 723020 [details] [diff] [review]
Part 2 - Make GeckoEvent variables private

This aids with bug 844275(recently approved on aurora) to help with compositor startup errors.Low risk, fennec only approving for uplift

Attachment #723020 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

bhavana bajaj [:bajaj]

Updated

•

11 years ago

Attachment #723063 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 25

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/bab7bc14c912
https://hg.mozilla.org/releases/mozilla-aurora/rev/a7232156dace

status-firefox21: affected → fixed

status-firefox22: --- → fixed

Aaron Train [:aaronmt]

Comment 26

•

11 years ago

Attached file Nightly (03/21) - Log — Details

On Nightly (03/21), I just hit a similar issue here. I had returned to my device after idle for an hour opened the awesome-screen; see log

Flags: needinfo?

Aaron Train [:aaronmt]

Updated

•

11 years ago

Flags: needinfo?

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 27

•

11 years ago

I tracked down another possible cause of this error in bug 797615 and filed bug 852467 for it. There may be other causes as well, but without a specially-instrumented build it's pretty much impossible to tell from the log what's going on. :(

Nobody; OK to take it and work on it

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard

You need to log in before you can comment on or make changes to this bug.