Closed Bug 797615 Opened 13 years ago Closed 12 years ago

Intermittent testHistoryTab, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3)

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME
Firefox 24

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [see comment 534, 549, 553, 562, 709])

Attachments

(2 files, 2 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=15783475&tree=Mozilla-Inbound Android no-ionmonkey Tegra 250 mozilla-inbound opt test robocop on 2012-10-03 12:23:57 PDT for push 6925601f4299 slave: tegra-283 4 INFO TEST-UNEXPECTED-FAIL | testHistoryTab | PaintExpecter - blockUtilClear timeout junit.framework.AssertionFailedError: 4 INFO TEST-UNEXPECTED-FAIL | testHistoryTab | PaintExpecter - blockUtilClear timeout 5 INFO TEST-UNEXPECTED-FAIL | testHistoryTab | Exception caught - junit.framework.AssertionFailedError: 4 INFO TEST-UNEXPECTED-FAIL | testHistoryTab | PaintExpecter - blockUtilClear timeout
Summary: Intermittent testHistoryTab | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testBookmark | PaintExpecter - blockUtilClear timeout (x3)
Blocks: 784186
(In reply to TinderboxPushlog Robot from comment #9) This log actually has logcat data for the failed test! But I don't see anything helpful there.... :(
The failure in comments 12, 13, and 14 - blockUntilClear timeout in testHistoryTab - occurs when no drawFinished event is received after trying to load a url. This test actually loads 2 urls back-to-back; the first seems to succeed, the second fails -- I do not know why.
Whiteboard: [orange]
Summary: Intermittent testHistoryTab, testBookmark | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testBookmark, testHistory | PaintExpecter - blockUtilClear timeout (x3)
Summary: Intermittent testHistoryTab, testBookmark, testHistory | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages | PaintExpecter - blockUtilClear timeout (x3)
Assignee: nobody → gbrown
logcat excerpt from successful run of testSystemPages: 01-01 01:54:48.356 9264 9272 I Robocop : 5 INFO TEST-PASS | testSystemPages | Awesomebar URL typed properly - about:rights should equal about:rights 01-01 01:54:48.366 9264 9272 D Robocop : waiting for DOMContentLoaded 01-01 01:54:48.396 9264 9276 D GeckoApp: State - 983041 01-01 01:54:48.396 9264 9276 D GeckoApp: Got a document start event. 01-01 01:54:48.426 9264 9264 I GeckoToolbar: zerdatime 6872009 - Throbber start 01-01 01:54:48.466 9264 9276 I GeckoApp: Security Mode - unknown 01-01 01:54:48.536 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:48.576 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:48.616 9264 9264 I GeckoToolbar: zerdatime 6872191 - Throbber start 01-01 01:54:48.646 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:48.666 9264 9273 W GeckoGlobalHistory: Rebuilding visited link set... 01-01 01:54:48.666 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:48.676 9264 9276 D Robocop : Waking up on handleMessage 01-01 01:54:48.676 9264 9276 D Robocop : received event DOMContentLoaded 01-01 01:54:48.676 9264 9276 D Robocop : Waking up on handleMessage 01-01 01:54:48.676 9264 9276 D Robocop : received event DOMContentLoaded 01-01 01:54:48.676 9264 9272 D Robocop : unblocked on expecter for DOMContentLoaded 01-01 01:54:48.756 9264 9276 D GeckoApp: State - 786448 01-01 01:54:48.756 9264 9276 D GeckoApp: Got a document stop event. 01-01 01:54:48.806 9264 9264 I GeckoToolbar: zerdatime 6872383 - Throbber stop 01-01 01:54:48.926 9264 9264 D dalvikvm: GC_EXTERNAL_ALLOC freed 8535 objects / 494544 bytes in 57ms 01-01 01:54:48.926 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:48.966 9264 9269 I dalvikvm: Jit: resizing JitTable from 8192 to 16384 01-01 01:54:48.988 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:49.066 9264 9307 D Robocop : Received drawFinished notification 01-01 01:54:49.086 9264 9315 D GeckoFavicons: The provided favicon URL is not valid 01-01 01:54:52.156 9264 9272 D dalvikvm: GC_FOR_MALLOC freed 4260 objects / 218904 bytes in 71ms 01-01 01:54:52.166 9264 9272 I dalvikvm-heap: Grow heap (frag case) to 7.063MB for 2846736-byte allocation 01-01 01:54:52.206 9264 9285 D dalvikvm: GC_FOR_MALLOC freed 68 objects / 3128 bytes in 39ms 01-01 01:54:52.376 9264 9272 I global : Default buffer size used in BufferedOutputStream constructor. It would be better to be explicit if an 8k buffer is required. 01-01 01:54:52.876 9264 9272 I Robocop : 6 INFO TEST-PASS | testSystemPages | The correct number of tabs are opened - 1 should equal 1
...and from failed run (Comment 58): 02-05 10:53:48.153 I/Robocop ( 9218): 5 INFO TEST-PASS | testSystemPages | Awesomebar URL typed properly - about:rights should equal about:rights 02-05 10:53:48.153 D/Robocop ( 9218): waiting for DOMContentLoaded 02-05 10:53:50.103 I/ALL_PAGES( 9218): Got cursor in 10658ms 02-05 10:53:50.103 I/ALL_PAGES( 9218): Got cursor in 6267ms 02-05 10:53:50.163 I/ALL_PAGES( 9218): Got cursor in 53ms 02-05 10:53:50.263 E/Surface ( 9218): using an invalid surface id=0, identity=378 should be 381 02-05 10:53:50.263 I/CompositorParent( 9218): Unable to renew compositor surface; remaining in paused state 02-05 10:53:53.983 D/GeckoApp( 9218): State - 983041 02-05 10:53:53.983 D/GeckoApp( 9218): Got a document start event. 02-05 10:53:53.983 I/GeckoToolbar( 9218): zerdatime 2922168 - Throbber start 02-05 10:53:54.043 I/GeckoApp( 9218): Security Mode - unknown 02-05 10:53:54.153 I/GeckoToolbar( 9218): zerdatime 2922333 - Throbber start 02-05 10:53:54.843 D/Robocop ( 9218): Waking up on handleMessage 02-05 10:53:54.843 D/Robocop ( 9218): received event DOMContentLoaded 02-05 10:53:54.843 D/Robocop ( 9218): Waking up on handleMessage 02-05 10:53:54.843 D/Robocop ( 9218): received event DOMContentLoaded 02-05 10:53:54.843 D/Robocop ( 9218): unblocked on expecter for DOMContentLoaded 02-05 10:53:54.923 W/GeckoGlobalHistory( 9218): Rebuilding visited link set... 02-05 10:53:54.933 D/GeckoApp( 9218): State - 786448 02-05 10:53:54.933 D/GeckoApp( 9218): Got a document stop event. 02-05 10:53:54.933 I/GeckoToolbar( 9218): zerdatime 2923118 - Throbber stop 02-05 10:53:55.043 D/GeckoFavicons( 9218): The provided favicon URL is not valid 02-05 10:54:57.923 E/GeckoConsole( 9218): [JavaScript Warning: "WARN addons.updates: Request failed: 404: Not Found" {file: "resource://gre/modules/AddonUpdateChecker.jsm" line: 464}] 02-05 10:55:24.943 E/Robocop ( 9218): Dumping ALL the threads!
I reviewed the last 6 failure logs. They all have: ... D/GeckoAppShell( 9217): Gecko event sync taking too long: 4022ms W/GeckoAppShell( 9217): Gecko event sync took too long, aborting! W/GeckoAppShell( 9217): java.lang.Exception W/GeckoAppShell( 9217): at org.mozilla.gecko.GeckoAppShell.geckoEventSync(GeckoAppShell.java:675) W/GeckoAppShell( 9217): at org.mozilla.gecko.GeckoAppShell.sendEventToGeckoSync(GeckoAppShell.java:610) W/GeckoAppShell( 9217): at org.mozilla.gecko.gfx.GeckoLayerClient.compositionPauseRequested(GeckoLayerClient.java:599) W/GeckoAppShell( 9217): at org.mozilla.gecko.gfx.LayerView.onDestroyed(LayerView.java:363) W/GeckoAppShell( 9217): at org.mozilla.gecko.gfx.LayerView.access$300(LayerView.java:42) W/GeckoAppShell( 9217): at org.mozilla.gecko.gfx.LayerView$SurfaceListener.surfaceDestroyed(LayerView.java:405) W/GeckoAppShell( 9217): at android.view.SurfaceView.reportSurfaceDestroyed(SurfaceView.java:568) W/GeckoAppShell( 9217): at android.view.SurfaceView.updateWindow(SurfaceView.java:472) W/GeckoAppShell( 9217): at android.view.SurfaceView.onWindowVisibilityChanged(SurfaceView.java:206) W/GeckoAppShell( 9217): at android.view.View.dispatchWindowVisibilityChanged(View.java:3891) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) W/GeckoAppShell( 9217): at android.view.ViewRoot.performTraversals(ViewRoot.java:744) W/GeckoAppShell( 9217): at android.view.ViewRoot.handleMessage(ViewRoot.java:1727) W/GeckoAppShell( 9217): at android.os.Handler.dispatchMessage(Handler.java:99) W/GeckoAppShell( 9217): at android.os.Looper.loop(Looper.java:123) W/GeckoAppShell( 9217): at android.app.ActivityThread.main(ActivityThread.java:4627) W/GeckoAppShell( 9217): at java.lang.reflect.Method.invokeNative(Native Method) W/GeckoAppShell( 9217): at java.lang.reflect.Method.invoke(Method.java:521) W/GeckoAppShell( 9217): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:868) W/GeckoAppShell( 9217): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:626) W/GeckoAppShell( 9217): at dalvik.system.NativeStart.main(Native Method) ... E/Surface ( 9218): using an invalid surface id=0, identity=378 should be 381 I/CompositorParent( 9218): Unable to renew compositor surface; remaining in paused state :snorp, :kats -- what does this mean?
(In reply to Geoff Brown [:gbrown] from comment #62) > > E/Surface ( 9218): using an invalid surface id=0, identity=378 should be 381 > I/CompositorParent( 9218): Unable to renew compositor surface; remaining in > paused state > > > :snorp, :kats -- what does this mean? The "unable to renew" message means that when the command to resume the compositor was executed (on the compositor thread) there was no surface available. The usual cause for this is us calling scheduleResumeCompositor() just before the awesome screen comes up and kills the surface. By the time the resume executes on the compositor thread the surface is gone and so it cannot actually resume. In theory this message should be benign since we'll do another resume when we dismiss the awesomescreen and that one will actually resume the compositor and renew the surface. As for the geckoEventSync stack, well, it means that gecko was busy doing something when we tried to pause the compositor. Doing what I don't know. I wonder if there's some way to reliably get a stack from Gecko when this happens. Or attach gdb if you can reproduce it locally.
Depends on: 846774
Summary: Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3)
I thought (hoped?) that bug 846774 would address the "Gecko event sync took too long" issue, but it has not: 03-12 08:41:46.590 W/GeckoAppShell( 5339): Gecko event sync took too long, aborting! 03-12 08:41:46.590 W/GeckoAppShell( 5339): java.lang.Exception 03-12 08:41:46.590 W/GeckoAppShell( 5339): at org.mozilla.gecko.GeckoAppShell.sendEventToGeckoSync(GeckoAppShell.java:397) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at org.mozilla.gecko.gfx.GLController.surfaceDestroyed(GLController.java:92) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at org.mozilla.gecko.gfx.LayerView.onDestroyed(LayerView.java:402) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at org.mozilla.gecko.gfx.LayerView.access$400(LayerView.java:44) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at org.mozilla.gecko.gfx.LayerView$SurfaceListener.surfaceDestroyed(LayerView.java:444) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.SurfaceView.reportSurfaceDestroyed(SurfaceView.java:568) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.SurfaceView.updateWindow(SurfaceView.java:472) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.SurfaceView.onWindowVisibilityChanged(SurfaceView.java:206) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.View.dispatchWindowVisibilityChanged(View.java:3891) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewRoot.performTraversals(ViewRoot.java:744) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.view.ViewRoot.handleMessage(ViewRoot.java:1727) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.os.Handler.dispatchMessage(Handler.java:99) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.os.Looper.loop(Looper.java:123) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at android.app.ActivityThread.main(ActivityThread.java:4627) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at java.lang.reflect.Method.invokeNative(Native Method) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at java.lang.reflect.Method.invoke(Method.java:521) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:868) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:626) 03-12 08:41:46.590 W/GeckoAppShell( 5339): at dalvik.system.NativeStart.main(Native Method)
I had hoped so too. I'll see if I can figure out what's going on here.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #300) > I had hoped so too. I'll see if I can figure out what's going on here. Of course, when I try to repro it with debugging it doesn't happen. Sigh. https://tbpl.mozilla.org/?tree=Try&rev=09daa717f2f5
I pushed another try build with some more logging at https://tbpl.mozilla.org/?tree=Try&rev=8d54e671f556 and got the log at https://tbpl.mozilla.org/php/getParsedLog.php?id=20722317&tree=Try&full=1. In this log, if you look around the "Gecko event sync taking too long" messages, you can see that the gecko thread finishes processing an event of type=0 at 04:40:19.856 and then doesn't start processing another event (type=28) until 04:40:37.593. That's a 18-second gap during which a big pile of events get added to the queue and the 4-second sync event timer expires. I don't know what the gecko thread is doing during that gap, but it's not in the nsAppShell event loop. It's also not recursed inside another event processing (as happens later in the log with event type=10).
Oh, here's something more interesting: the event type=0 that wasn't processed for 18 seconds was inserted by a call to ScheduleNativeEventCallback from thread 0x5261e400, which doesn't appear anywhere else in the log.
Depends on: 834243
Ok, so I made some more progress. I think the "interesting" thread thing I pointed out in comment 327 was just a red herring. I pushed a new try build at [1] with some more dumps and a crash to try get and see what all the threads were doing at the time. In all 12 crash traces at [2], the gecko thread is here: 2 libxul.so!nsCacheService::Lock(mozilla::Telemetry::ID) [Mutex.h 3 libxul.so!nsCacheService::SetDiskSmartSize() [nsCacheService.h 4 libxul.so!nsDisableOldMaxSmartSizePrefEvent::Run() [nsCacheService.cpp 5 libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp So the problem stems from the code at [3] running on the gecko thread and it blocking for long periods of time. I will file a bug for this and mark it as blocking this bug. [1] https://tbpl.mozilla.org/?tree=Try&rev=83cde65aa095 [2] https://tbpl.mozilla.org/php/getParsedLog.php?id=20803898&tree=Try&full=1 [3] http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsCacheService.cpp#1580
Blocks: 834243
No longer depends on: 834243
Whiteboard: [see comment 342]
The try push at https://tbpl.mozilla.org/?tree=Try&rev=336832f5b5d5 has the disk cache disabled and testSystemPages has never been greener.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #348) > The try push at https://tbpl.mozilla.org/?tree=Try&rev=336832f5b5d5 has the > disk cache disabled and testSystemPages has never been greener. Sadly Try was reset since (bug 853697), so that run sadly isn't available to view now :-( Thank you for looking into this :-)
Just waiting for 852467 to be resolved -- probably nothing else to do here.
Assignee: gbrown → nobody
So, bug 852467 is fixed now, landed on the 26th of April. All the failures in this bug reported since then (comment ~504 onwards) are either on Aurora or point to some other problem. That is, none of the new failures have the sendEventToGeckoSync timeout that used to be present. The last one of those I can find is in comment 486.
Try push with more logging to try and figure out the new issue: https://tbpl.mozilla.org/?tree=Try&rev=899bae21617c
Is the dynamic toolbar disabled for these tests? The output from my try run at [1] indicates that it is not. Specifically, just before the "dump all the threads" output, the drag of the content results in the margin values changing rather than the content y-coordinate changing. this indicates the dynamic toolbar is intercepting the scroll and adjusting the margins. [1] https://tbpl.mozilla.org/php/getParsedLog.php?id=22372444&tree=Try&full=1#error0
Flags: needinfo?(gbrown)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #522) > Is the dynamic toolbar disabled for these tests? We intend to disable it: https://hg.mozilla.org/mozilla-central/file/059a39ad762c/testing/mochitest/runtestsremote.py#l559 Is this going wrong somehow?
Flags: needinfo?(gbrown)
This does not look right to me: https://hg.mozilla.org/mozilla-central/file/059a39ad762c/mobile/android/base/BrowserApp.java#l431 ie, shouldn't setDynamicToolbarEnabled() be called regardless of the accessibility setting?
So after discussion with Cwiiis it turns out I misinterpreted my output the from the try run above. The nonzero margins reported in the ImmutableViewportMetrics is actually caused by overscroll, not by dynamic toolbar behaviour. I now think that the problem here is that the second page (in testVkbOverlap, which is the only test still affected AFAICT) is loaded at a y=18 scroll offset. This shouldn't be happening, and results in the "pan down" not doing any paints, because the page just goes into overscroll and snaps back. Therefore the "wait for paint" never fires, and causes the test to time out. I will continue investigating.
Whiteboard: [see comment 342] → [see comment 534]
Summary: Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUntilClear timeout (x3)
Oops, looks like this typo is in the code too.
Summary: Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUntilClear timeout (x3) → Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3)
My try build with more logging [1] shows that in the failure case, when the second or third page in testVkbOverlap is loaded, setFirstPaintViewport is called with a y-value of 17 (sometimes 18, depending on rounding?): got first paint viewport: ImmutableViewportMetrics v=(0.0,17.0,1024.0,712.0) p=(0.0,0.0,1024.0,713.0) c=(0.0,0.0,1024.0,713.0) m=(0.0,0.0,0.0,0.0) z=1.0, rtl=false In the success case, the y-value is 0, as expected: got first paint viewport: ImmutableViewportMetrics v=(0.0,0.0,1024.0,695.0) p=(0.0,0.0,1024.0,713.0) c=(0.0,0.0,1024.0,713.0) m=(0.0,0.0,0.0,0.0) z=1.0, rtl=false This difference is what causes the failure as explained in comment 534. I don't know why setFirstPaintViewport is using the scroll position from the previous page in the failure case. The other interesting thing is that of the three pages that testVkbOverlap loads, sometimes this happens on the second page and sometimes it happens on the third. When it happens on the third page (which has no meta-viewport tag), the first-paint viewport is: got first paint viewport: ImmutableViewportMetrics v=(0.0,17.0,1024.0,712.0) p=(0.0,0.0,980.0,683.0) c=(0.0,0.0,980.0,683.13336) m=(0.0,0.0,0.0,0.0) z=1.0, rtl=false Note that this retains the 1.0 zoom from the first two pages, even though it should scale the page up to z=1.0448979 to make it fit in the 1024-pixel-wide viewport. This, to me, indicates that the viewport is getting clobbered by Java/JS rather than being a bug in core Gecko. [1] https://tbpl.mozilla.org/?tree=Try&rev=d635b9de3de5
Whiteboard: [see comment 534] → [see comment 534, 549]
The next try build I pushed [1] shows that the viewport is getting set in browser.js sometime between browser.js receives the before-first-paint event and the compositor thread calls setFirstPaintViewport in Java-land. The guard at [2] is supposed to prevent this from happening, so I've added some more logging to try and figure out why that guard isn't working. Try push at [3] [1] https://tbpl.mozilla.org/?tree=Try&rev=e9a8e94a3a2b [2] https://hg.mozilla.org/try/file/f246fb2ae748/mobile/android/chrome/content/browser.js#l1361 [3] https://tbpl.mozilla.org/?tree=Try&rev=5e6c6aecab84
The output from the final try run indicates that the Gecko event loop is getting spun between the paint happening and the layers transaction that hands off the painted content to the compositor. This event loop spin is when the viewport gets clobbered. // here the before-first-paint is triggered 05-01 14:24:06.488 E/GeckoConsole( 2364): staktrace: got the before-first-paint event 05-01 14:24:06.508 D/staktrace( 2364): GLC::setViewportMetrics ImmutableViewportMetrics v=(0.0,17.999985,1024.0,713.0) p=(0.0,0.0,1024.0,713.0) c=(0.0,0.0,1024.0,713.0) m=(0.0,0.0,0.0,0.0) z=1.0, rtl=false 05-01 14:24:06.508 D/staktrace( 2364): setViewportMetrics called from setIsRTL // here the paint happens 05-01 14:24:06.508 I/Gecko ( 2364): staktrace: setting nsPresShell::mIsFirstPaint back to false in Paint() // here the event loop is spun and the viewport is clobbered 05-01 14:24:06.538 E/GeckoConsole( 2364): staktrace: Viewport:Change with { "x" : 0.0, "y" : 17.999985, "zoom" : 1.0, "fixedMarginLeft" : 0.0, "fixedMarginTop" : 0.0, "fixedMarginRight" : 0.0, "fixedMarginBottom" : 0.0, "displayPort" :{ "left": 0.0, "top": 0.0, "right": 1024.0, "bottom": 713.0, "resolution": 1.0}} and true 05-01 14:24:06.548 D/GeckoEventDispatcher( 2364): dispatchEvent: no listeners registered for event 'robocop:scroll' // here the layers transaction happens 05-01 14:24:06.558 I/Gecko ( 2364): staktrace: got a first-paint txn in LayerTransactionParent.cpp // and the clobbered viewport is sent back to java 05-01 14:24:06.558 D/staktrace( 2364): got first paint viewport: ImmutableViewportMetrics v=(0.0,17.0,1024.0,712.0) p=(0.0,0.0,1024.0,713.0) c=(0.0,0.0,1024.0,713.0) m=(0.0,0.0,0.0,0.0) z=1.0, rtl=false
I may have jumped the gun in comment 562. The log I added to LayerTransactionParent actually gets run on the CompositorThread so it's possible that the output I saw is expected. Pushed another try build to https://tbpl.mozilla.org/?tree=Try&rev=a033695bd1db - I'll keep working on this in the background.
Whiteboard: [see comment 534, 549] → [see comment 534, 549, 553, 562]
My latest try build at [1] sheds more light on what's going on. 1) Layout does the "first paint" correctly 2) Layout sends the transaction over to the compositor and clears the "first paint" flag 3) Java sends a viewport update to browser.js with the y=17 viewport. At this point the first paint flag is cleared, so the update takes effect 4) Layout does another paint 5) Layout sends the new transaction over to the compositor, clobbering the old (correct) viewport metrics with the new (bad) ones 6) The compositor composites and sends the bad viewport to Java. The bug doesn't happen if the composition in step 6 happens anytime before step 5. Since it is on a separate thread, it can happen after step 3 or step 4 and in those cases the bug doesn't manifest. I'll have to think about what the correct fix here is. [1] https://tbpl.mozilla.org/?tree=Try&rev=068ce8cf9c1c
Testing a possible fix at https://tbpl.mozilla.org/?tree=Try&rev=30230058bb31. I'm not entirely happy with this patch but if it works at least it should confirm the theory above.
Try is looking happy so far. I cleaned up the patch a bit and pushed to try to make sure it didn't break anything else. https://tbpl.mozilla.org/?tree=Try&rev=e3504565c5ad
Attached patch Part 1 - cleanupSplinter Review
This is just moving a few lines around so that it is consistent with other things in the file. Cosmetic changes only.
Attachment #754435 - Flags: review?(chrislord.net)
See comment 709 for what's happening. The commit message on the patch also explains things.
Attachment #754436 - Flags: review?(chrislord.net)
Whiteboard: [see comment 534, 549, 553, 562] → [see comment 534, 549, 553, 562, 709]
Comment on attachment 754435 [details] [diff] [review] Part 1 - cleanup Review of attachment 754435 [details] [diff] [review]: ----------------------------------------------------------------- LGTM.
Attachment #754435 - Flags: review?(chrislord.net) → review+
Comment on attachment 754436 [details] [diff] [review] Part 2 - Fix isBrowserContentDocumentDisplayed intermittently returning the wrong value Review of attachment 754436 [details] [diff] [review]: ----------------------------------------------------------------- Your overview comment mentions the is-first-paint flag (which I assume is the mInFirstPaint flag) should be modified synchronously, but it seems in the patch that it isn't... I may have misunderstood, but r- until there's an explanation. ::: mobile/android/base/gfx/GeckoLayerClient.java @@ +94,5 @@ > private LayerView mView; > > + /* This flag is true from the time that browser.js detects a first-paint is about to start, > + * to the time that we receive the first-paint composite notification from the compositor. */ > + private volatile boolean mInFirstPaint; I think mFirstPaintPending might be a slightly better name for this? I don't have strong feelings, feel free to disagree. @@ +421,5 @@ > + mInFirstPaint = true; > + } > + > + /* This is invoked by JNI on the gecko thread */ > + boolean isFirstPaintInProgress() { similarly, isFirstPaintPending is a little more descriptive (layout can finish painting and this will still return true, right?) - I am similarly weak-willed with my feelings about this though :) @@ +592,5 @@ > } > DisplayPortCalculator.resetPageState(); > mDrawTimingQueue.reset(); > + > + mInFirstPaint = false; This is called on the compositor thread, but startingFirstPaint is called by the Gecko thread - does there need to be some synchronisation here?
Attachment #754436 - Flags: review?(chrislord.net) → review-
(In reply to Chris Lord [:cwiiis] from comment #746) > Your overview comment mentions the is-first-paint flag (which I assume is > the mInFirstPaint flag) should be modified synchronously, but it seems in > the patch that it isn't... I may have misunderstood, but r- until there's an > explanation. Ah, I should have updated the comment. I originally had the flag accesses inside a synchronized block but then I changed it to just be volatile instead since that should make it thread-safe without incurring the overhead of the lock. I can update the comment there. > I think mFirstPaintPending might be a slightly better name for this? I don't > have strong feelings, feel free to disagree. > > similarly, isFirstPaintPending is a little more descriptive (layout can > finish painting and this will still return true, right?) - I am similarly > weak-willed with my feelings about this though :) To me "isFirstPaintPending" sounds like the first paint hasn't happened yet, which isn't the case here - the first paint is in progress and still working its way through the layout/compositor pipeline. Maybe if we said isFirstCompositePending that might be more accurate. I can change it to that if you prefer.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #747) > (In reply to Chris Lord [:cwiiis] from comment #746) > > Your overview comment mentions the is-first-paint flag (which I assume is > > the mInFirstPaint flag) should be modified synchronously, but it seems in > > the patch that it isn't... I may have misunderstood, but r- until there's an > > explanation. > > Ah, I should have updated the comment. I originally had the flag accesses > inside a synchronized block but then I changed it to just be volatile > instead since that should make it thread-safe without incurring the overhead > of the lock. I can update the comment there. Is the lock unnecessary? Could it be possible, for example, to get a first-paint-started for a document, then get a second one, at nearly the same time as the compositor sets it to false, causing mInFirstPaint to be false when it should be true? (I've not thought this through much, I just want to confirm that you have and hear your thinking) > > I think mFirstPaintPending might be a slightly better name for this? I don't > > have strong feelings, feel free to disagree. > > > > similarly, isFirstPaintPending is a little more descriptive (layout can > > finish painting and this will still return true, right?) - I am similarly > > weak-willed with my feelings about this though :) > > To me "isFirstPaintPending" sounds like the first paint hasn't happened yet, > which isn't the case here - the first paint is in progress and still working > its way through the layout/compositor pipeline. Maybe if we said > isFirstCompositePending that might be more accurate. I can change it to that > if you prefer. mm, we've traditionally spoken of painting as what Gecko does and compositing being a separate thing... I guess the proper name ought to be mBetweenFirstPaintAndComposite, but given that's a bit of a mouthful, I'll leave this down to you. Will change my r-.
Comment on attachment 754436 [details] [diff] [review] Part 2 - Fix isBrowserContentDocumentDisplayed intermittently returning the wrong value Review of attachment 754436 [details] [diff] [review]: ----------------------------------------------------------------- r+ given a revised commit message and optionally a more detailed comment above mInFirstPaint/a rename.
Attachment #754436 - Flags: review- → review+
(In reply to Chris Lord [:cwiiis] from comment #748) > Is the lock unnecessary? Could it be possible, for example, to get a > first-paint-started for a document, then get a second one, at nearly the > same time as the compositor sets it to false, causing mInFirstPaint to be > false when it should be true? (I've not thought this through much, I just > want to confirm that you have and hear your thinking) In this case using a lock instead of volatile would make no difference. But you're right in that there is still a potential race if gecko two different documents in a row such that they are both "first" paints, but the first of the two gets composited right after the second one starts painting. If that happened the flag would be set to false by the compositor after it was double-set to true, and it would stay false until the second paint got composited. I think the only way to avoid this is turn it into a counter instead of a flag. Instead of setting it to true, I would increment the counter, and instead of setting it to false I would decrement it. It would also need to be decremented at http://mxr.mozilla.org/mozilla-central/source/gfx/layers/composite/AsyncCompositionManager.h#94 if both mIsFirstPaint and isFirstPaint are true, because that means two paints are getting squashed into one composite. This approach seems more scary to me though since if the counter gets out of sync because of a dropped paint or composite somewhere then it will stay out of sync forever. I don't know if that can happen or not. It's a pretty big risk to take for closing a pretty small edge condition. Will think about this a bit more. > mm, we've traditionally spoken of painting as what Gecko does and > compositing being a separate thing... I guess the proper name ought to be > mBetweenFirstPaintAndComposite, but given that's a bit of a mouthful, I'll > leave this down to you. Will change my r-. I'll see if I need to change the patch first and then figure out what to do with this.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #751) > In this case using a lock instead of volatile would make no difference. But > you're right in that there is still a potential race if gecko two different > documents in a row such that they are both "first" paints, but the first of > the two gets composited right after the second one starts painting. If that > happened the flag would be set to false by the compositor after it was > double-set to true, and it would stay false until the second paint got > composited. I think the only way to avoid this is turn it into a counter > instead of a flag. Instead of setting it to true, I would increment the > counter, and instead of setting it to false I would decrement it. It would > also need to be decremented at > http://mxr.mozilla.org/mozilla-central/source/gfx/layers/composite/ > AsyncCompositionManager.h#94 if both mIsFirstPaint and isFirstPaint are > true, because that means two paints are getting squashed into one composite. > This approach seems more scary to me though since if the counter gets out of > sync because of a dropped paint or composite somewhere then it will stay out > of sync forever. I don't know if that can happen or not. It's a pretty big > risk to take for closing a pretty small edge condition. Will think about > this a bit more. I spent some time thinking about this and looking around the code, and I wasn't able to convince myself that I can sufficiently mitigate the risk with the "correct fix" that I described above. The patch as attached significantly reduces the size of the race window (enough to fix this intermittent failure, as evidenced by many try runs) so I would rather land it as-is and then re-consider the riskier fix if it's still a problem. I can add comments and maybe a debug log to the code to help catch this problem in the future. > > mm, we've traditionally spoken of painting as what Gecko does and > > compositing being a separate thing... I guess the proper name ought to be > > mBetweenFirstPaintAndComposite, but given that's a bit of a mouthful, I'll > > leave this down to you. Will change my r-. > > I'll see if I need to change the patch first and then figure out what to do > with this. I think I'll rename this to isContentDocumentDisplayed() since that is less ambiguous and also correct.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #777) > I spent some time thinking about this and looking around the code, and I > wasn't able to convince myself that I can sufficiently mitigate the risk > with the "correct fix" that I described above. The patch as attached > significantly reduces the size of the race window (enough to fix this > intermittent failure, as evidenced by many try runs) so I would rather land > it as-is and then re-consider the riskier fix if it's still a problem. I can > add comments and maybe a debug log to the code to help catch this problem in > the future. This sounds good to me, if only to stop my mailbox getting flooded :)
Uploaded the wrong patch last time.
Attachment #755415 - Attachment is obsolete: true
Attachment #755415 - Flags: review?(chrislord.net)
Attachment #755921 - Flags: review?(chrislord.net)
Comment on attachment 755921 [details] [diff] [review] Part 2 - Mostly fix isBrowserContentDocumentDisplayed intermittently returning the wrong value (v2) Review of attachment 755921 [details] [diff] [review]: ----------------------------------------------------------------- LGTM.
Attachment #755921 - Flags: review?(chrislord.net) → review+
Comment 803 is on testBookmark, not testVkbOverlap. This appears to be a different bug that was not addressed by my patch. Looking at the history on this bug, the testBookmark failure didn't happen between april 17 and may 14 so the recent occurrences are probably a regression.
Split out the testBookmark failures to bug 877737.
No longer blocks: 877737
Summary: Intermittent testHistoryTab, testBookmark, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3) → Intermittent testHistoryTab, testHistory, testSystemPages, testVkbOverlap | PaintExpecter - blockUtilClear timeout (x3)
Blocks: 877882
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Firefox 24
(In reply to TinderboxPushlog Robot from comment #812)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Closing inactive keywords:intermittent-failure bugs where the TBPLbot has previously commented and the test isn't marked as disabled; filter on orange-cleanup-201401.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → WORKSFORME
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: