Closed Bug 766059 Opened 13 years ago Closed 13 years ago

Talos hang with DLBI

Categories

(Firefox for Android Graveyard :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: mattwoodrow, Unassigned)

References

Details

Attachments

(2 files, 1 obsolete file)

Attached patch Reduced patchSplinter Review
With DLBI (bug 539356) I'm getting hangs in some talos tests (testCheck{2}, trobopan). This appears to be hanging inside a call to Instrumentation::sendStringSync (in FennecNativeActions), originating from the loadAndPaint call. I'm attaching a reduced patch that reproduces the issue on tryserver, I haven't been able to get this same behaviour locally. The important change is the extra PresShell::Paint call inside nsViewManager::ProcessPendingUpdatesForView. This is called from the refresh driver. The docs for sendStringSync are pretty thin, but is it possible that it waits for the entire java event queue to be clear instead of just the event created by that call to be processed? The only guess I have is that I've moved painting to be synchronous with the java event queue processing, and painting is generating new events every time. Does anyone have ideas as to what would cause this, how to debug it, or how to fix it? Even just dumping the event queue state into the tinderbox logs would help diagnose why this call blocks indefinitely.
This doesn't make much sense to me. I've been looking over the code and the way I understand it, the test runs in a thread that is completely independent of the threads created by Fennec. When sendStringSync gets called, it creates a bunch of key events and injects them into the UI thread in Fennec via some sort of IPC binding here: (IWindowManager.Stub.asInterface(ServiceManager.getService("window"))) .injectKeyEvent(event, true); This also blocks for the events to be processed by Fennec before returning. Now the only way I can see these events not getting processed is if the Fennec UI thread is stuck in an infinite loop somehow, or perhaps already so flooded with events that the test times out before it finishes processing these key events. It looks like you came to the same conclusion. To debug this, I would try something like the attached patch. In theory this should dump the event queue to the browser_output file every second. I think the browser_output.txt file is the one that shows up as "NOISE" lines in the tbpl logs, but I'm not sure about that. CC'ing jmaher as well since he knows more about this part.
Attached patch dump event queue every second (obsolete) — Splinter Review
The patch that should dump the event queue every second. (Compiles for me but I haven't tested it)
slightly more tested version: i forgot to open the log file in append mode so it would only have the last line of output. this version seems to log stuff reasonably.
Attachment #634418 - Attachment is obsolete: true
everything in browser_output.txt is the noise.
https://tbpl.mozilla.org/php/getParsedLog.php?id=12811524&tree=Try&full=1 Looks like the event queue isn't being processed at all..
I'm having a hard time finding what the event codes mean, but it looks like this event: Message 0: { what=1000 when=823393 } Is the one holding up the queue (assuming they are processed from the 0 end first). That gets added between my 'd' and 'e' printfs, so somewhere within BaseTest::getActivityFromClick. The sleeps to 'avoid hangs' here worry me more than a little.
Looking through the android source, I found message type 1000 corresponds to the DO_TRAVERSAL message in ViewRoot.java: http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/2.2_r1.1/android/view/ViewRoot.java#1695 which seems to traverse through the view hierarchy. I suppose it's possible this gets into an infinite loop somehow, but I don't understand why your patch would trigger it. (Also a caveat: I think these message codes can be reused by different files so it's possible somebody else has a message type of 1000, but I didn't find any other ones.) I agree the sleeps and comments there are worrying; bug 758792 has the backstory but we still don't know what was causing the hangs.
CC'ing gbrown as well in case he has any insights into this.
Bug 766445 is also interesting, and might provide some clues. It seems to imply that when the test tries to type in the URL the keys get sent to the wrong activity, so perhaps our code to getActivityFromClick is broken in more ways than one.
This sounds a little different, but be aware of: Bug 756817 - during tcheck2/tcheck3 loadAndPaint is hanging in blockUntilClear because it is receiving a constant stream of drawFinished events Bug 759792 - during robocop tests, I sometimes see "Key dispatching timed out sending..." and ANR errors, but only on my Galaxy S
Looking at the handleMessage implementation for DO_TRAVERSAL, it looks like it can issue another event of the same type in certain cases. Though I don't think that's actually happening since the message in the queue always has the same timestamp. It looks like the message is removed from the queue before we call handleMessage http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/4.0.4_r1.2/android/os/MessageQueue.java#128 so I guess we're never actually getting that far.
I also just tested disabling the drawWidgetOverlay/Underlay calls (from LayerManagerOGL - compositor thread -> nsWindow), as these are the only widget callbacks (that I know of) that are triggered by painting. No improvement.
There are also widget callbacks in CompositorParent.cpp to three AndroidBridge functions.
Interestingly, disabling shadow layer transactions (removing the calls to ShadowLayerForwarder::ForwardTransaction) still doesn't fix the bug. Not sure what else within painting could be even remotely related. This is fairly high priority for the layout team, so I'll probably try to check out a talos machine unless someone has time to look at this.
In bug 756817, we found that an animated gif was generating draw events that were causing loadAndPaint to wait indefinitely (hang) for paint completion, when running tcheck2 and tcheck3. This issue has likely always existed for tcheck2, but did not become apparent until improvements in efficiency allowed the draw events to be generated and/or processed without interruption. There may be similar issues in other robocop tests that use loadAndPaint or blockUntilClear, waiting for draw events to complete.
Depends on: tegra-011
I've got a tegra board loaned out to help me debug this issue. I have talos running, and believe I'm hitting the same bug. I get this in logcat: W/WindowManager( 1017): Key dispatching timed out sending to org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar W/WindowManager( 1017): Previous dispatch state: {{KeyEvent{action=0 code=36 repeat=0 meta=0 scancode=0 mFlags=8} to Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false} @ 1340571936623 lw=Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false} lb=android.os.BinderProxy@44603a08 fin=false gfw=true ed=true tts=0 wf=false fp=false mcf=Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false}}} W/WindowManager( 1017): Current dispatch state: {{null to Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false} @ 1340571996629 lw=Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false} lb=android.os.BinderProxy@44603a08 fin=false gfw=true ed=true tts=0 wf=false fp=false mcf=Window{44623830 org.mozilla.fennec_matt/org.mozilla.gecko.AwesomeBar paused=false}}} Which suggests that we're actually blocking on the key event, not the DO_TRAVERSAL event. How would I debug this? I've had a look at the code for the AwesomeBar class and can't see any obvious event related code that might block. It would be nice to be able to get a stack trace for where the current code handling the key dispatch event is stuck at.
That sounds just like bug 759792, which I see very occasionally now. I was thinking that it might be caused by the UI thread hanging, or being too busy (bug 756817?) or starved (bug 764147??)...but I'm just guessing, and not sure how to proceed.
Top shows nothing obvious hanging, total CPU usage was very low. Is there a way to print the current stack trace of all java threads? Similar to pausing, and then 'info threads' in gdb.
Deadlock! The UI thread is hung inside GeckoAppShell::geckoEventSync, it posts a message (GECKO_EVENT_SYNC) to the gecko thread and then waits until the gecko thread notifies it that the event has been processed. The gecko thread processes events until it ends up attempting a paint. Due to me moving where paint is called from, this can happen before we process the sync event. Though I very much suspect that we were only getting lucky previously, as any unprocessed DRAW event in the queue would have had the same issue. Painting ends up calling AndroidBridge::ProvideEGLSurface (via a synchronous call to the compositor thread I believe?) which then blocks inside GLController::waitForValidSurface. I *think* this is expecting the java UI thread (which is already blocked) to release it. Both threads end up blocked, and nothing happens. Anyone have suggestions for fixing this? CC'ing gfx people that might have ideas for the EGL waiting side.
Ali, does this deadlock make sense to you?
There are only a handful of pieces of code that end up calling geckoEventSync() - two of them are in GeckoInputConnection, one is in GeckoApp (onLowMemory) and one is in GeckoLayerClient (compositionPauseRequested, which gets called on surface destruction). Matt, did you end up getting a full stack trace? It would be useful to know which of these calls is doing the call to better understand the deadlock.
It's being called from compositionPauseRequested. I have a screenshot of the stack trace, I couldn't see a way to copy it from DDMS. I'm thinking it might be best to block painting when the compositor is paused, since we're getting stuck trying to create a compositor.
The only situations in which we call ProvideEGLSurface are when composition is being resumed, and during compositor creation (in nsWindow::GetLayerManager). This leads to a couple theories: 1) We're getting a surfaceDestroyed event (triggering a compositionPauseRequested) before we've even created the compositor. Then we get the paint event, triggering a call to nsWindow::GetLayerManager, which causes us to try to create the compositor, but this blocks on getting a surfaceChanged event (by calling waitForValidSurface), but the surfaceChanged event doesn't arrive since the UI thread is still handling the surfaceDestroyed event. If this is the problem, we need to track the creation of the compositor and avoid sending a compositionPauseRequested when there isn't actually a compositor to pause. 2) We're creating a new widget and compositor just as we're pausing the existing compositor. This sounds weird, but an attempt to create an additional widget does seem to happen is some situations (e.g. see Bug 766425). If this is the problem, we need to prevent the creation of the additional widget. (In reply to Matt Woodrow (:mattwoodrow) from comment #22) > I'm thinking it might be best to block painting when the compositor is > paused, since we're getting stuck trying to create a compositor. We already block painting from nsWindow when the compositor is paused (see the use of sCompositorPaused there). However, painting shouldn't cause us to create *additional* compositors (assuming we already have one).
As I mentioned earlier, I see similar symptoms for an occasional Talos failure in current builds (without the DLBI changes). I have now verified that those failures have the same cause. The following is from a run of the testCheck2 Talos test. One thread is at: 06-26 18:31:56.052 32527 32527 W System.err: at org.mozilla.gecko.GeckoAppShell.geckoEventSync(GeckoAppShell.java:549) 06-26 18:31:56.054 32527 32527 W System.err: at org.mozilla.gecko.GeckoAppShell.sendEventToGeckoSync(GeckoAppShell.java:507) 06-26 18:31:56.054 32527 32527 W System.err: at org.mozilla.gecko.gfx.GeckoLayerClient.compositionPauseRequested(GeckoLayerClient.java:448) 06-26 18:31:56.054 32527 32527 W System.err: at org.mozilla.gecko.gfx.LayerView.surfaceDestroyed(LayerView.java:230) 06-26 18:31:56.054 32527 32527 W System.err: at android.view.SurfaceView.reportSurfaceDestroyed(SurfaceView.java:586) 06-26 18:31:56.054 32527 32527 W System.err: at android.view.SurfaceView.updateWindow(SurfaceView.java:478) 06-26 18:31:56.056 32527 32527 W System.err: at android.view.SurfaceView.onWindowVisibilityChanged(SurfaceView.java:208) 06-26 18:31:56.056 32527 32527 W System.err: at android.view.View.dispatchWindowVisibilityChanged(View.java:3944) 06-26 18:31:56.056 32527 32527 W System.err: at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewGroup.dispatchWindowVisibilityChanged(ViewGroup.java:719) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewRoot.performTraversals(ViewRoot.java:755) 06-26 18:31:56.058 32527 32527 W System.err: at android.view.ViewRoot.handleMessage(ViewRoot.java:1752) 06-26 18:31:56.060 32527 32527 W System.err: at android.os.Handler.dispatchMessage(Handler.java:99) 06-26 18:31:56.060 32527 32527 W System.err: at android.os.Looper.loop(Looper.java:123) 06-26 18:31:56.062 32527 32527 W System.err: at android.app.ActivityThread.main(ActivityThread.java:4627) 06-26 18:31:56.062 32527 32527 W System.err: at java.lang.reflect.Method.invokeNative(Native Method) 06-26 18:31:56.064 32527 32527 W System.err: at java.lang.reflect.Method.invoke(Method.java:521) 06-26 18:31:56.064 32527 32527 W System.err: at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:858) 06-26 18:31:56.064 32527 32527 W System.err: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:616) 06-26 18:31:56.064 32527 32527 W System.err: at dalvik.system.NativeStart.main(Native Method) The other thread is at: 06-26 18:31:56.080 32527 32570 W System.err: at org.mozilla.gecko.gfx.GLController.waitForValidSurface(GLController.java:83) 06-26 18:31:56.080 32527 32570 W System.err: at dalvik.system.NativeStart.run(Native Method) 06-26 18:31:56.080 32527 32570 W System.err: at dalvik.system.NativeStart.run(Native Method)
Thanks Ali, that explains a lot. So, with my patch above, painting isn't happening from the nsWindow::OnDraw function, so we weren't protected by the compositor paused check in that function. I've added an equivalent check the actual painting path under DLBI and this appears to be fixing it for now. That doesn't really explain why Geoff is hitting this without DLBI, though we could potentially have this issue from any caller of nsWindow::GetLayerManager (when we don't have a compositor created), not just painting. Geoff: It would be really good to get a c++ stack trace of the second thread to try figure out what is happening here. I don't have many ideas how to do this though, I managed to hang my tegra board trying to get gdb attached remotely. You might have more luck doing this with a local machine, instead of across the pacific. I ended up scattering gecko code with printfs to narrow down the callstack. Not a very efficient method, but effective at least. If you do end up going with this, I'd try putting start/end log prints inside widget/android/nsWindow.cpp:GetLayerManager to confirm that we are getting stuck within that, and then audit the callers of that function.
(In reply to Matt Woodrow (:mattwoodrow) from comment #26) > Geoff: It would be really good to get a c++ stack trace of the second thread > to try figure out what is happening here. I will try, but that may take a while -- it's not easy to repro this.
The Talos harness also doesn't push an updated robotium.config to the device when --fennecIDs isn't specified during setup. Running make mochitest-robotium (as suggested in the existing docs) generates one, but with the wrong logfile (mochitest.log), which isn't picked up by the Talos harness. I've updated the docs here: https://wiki.mozilla.org/Mobile/Fennec/Android#talos to say this, though it would be useful if someone could add what else the fennecIDs parameter is used for.
Updated the notes a bit. --fennecIDs is only used for robocop based talos tests and is a flag to talos that it needs to copy over and generate bits to run robocop successfully.
(In reply to Matt Woodrow (:mattwoodrow) from comment #26) > Geoff: It would be really good to get a c++ stack trace of the second thread > to try figure out what is happening here. > > If you do end up going with this, I'd try putting start/end log prints > inside widget/android/nsWindow.cpp:GetLayerManager to confirm that we are > getting stuck within that, and then audit the callers of that function. I have confirmed that in my case (see comment 24) we are stuck in nsWindow::GetLayerManager, and the call sequence is: nsWindow::OnGlobalAndroidEvent receives AndroidGeckoEvent::DRAW, calls nsWindow::OnDraw, calls nsWindow::GetLayerManager, calls nsBaseWidget::CreateCompositor, calls CompositorChild::SendPLayersConstructor(LayerManager::LAYERS_OPENGL, ...) That is: https://hg.mozilla.org/mozilla-central/annotate/027b86239bf3/widget/android/nsWindow.cpp#l861 -> https://hg.mozilla.org/mozilla-central/annotate/027b86239bf3/widget/android/nsWindow.cpp#l1101 -> https://hg.mozilla.org/mozilla-central/annotate/027b86239bf3/widget/android/nsWindow.cpp#l991 -> https://hg.mozilla.org/mozilla-central/annotate/027b86239bf3/widget/android/nsWindow.cpp#l698 -> https://hg.mozilla.org/mozilla-central/annotate/a12b54b2f848/widget/xpwidgets/nsBaseWidget.cpp#l888
...forgot to mention: nsWindow::Invalidate posts the DRAW event.
It looks like this may also be the cause of bug 769524; I see threads stuck in the same place described in comment 24.
Spun off bug 771189 to consolidate the non-DLBI discussion.
Please reopen if anyone sees this again.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: