Closed Bug 814282 Opened 12 years ago Closed 11 years ago

Intermittent testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428

Categories

(Firefox for Android Graveyard :: Keyboards and IME, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 22

People

(Reporter: RyanVM, Assigned: kats)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=17251801&tree=Mozilla-Inbound

Android Armv6 Tegra 250 mozilla-inbound opt test robocop on 2012-11-21 12:44:20 PST for push 8f61af760950
slave: tegra-055

0 INFO SimpleTest START
1 INFO TEST-START | testVkbOverlap
2 INFO TEST-PASS | testVkbOverlap | Awesomebar URL typed properly - http://mochi.test:8888/tests/robocop/robocop_input.html should equal http://mochi.test:8888/tests/robocop/robocop_input.html
3 INFO TEST-PASS | testVkbOverlap | testInputVisible - Found 2428 green pixels after scrolling
4 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428
Exception caught during test!
junit.framework.AssertionFailedError: 4 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428
	at junit.framework.Assert.fail(Assert.java:47)
	at org.mozilla.fennec.FennecMochitestAssert._logMochitestResult(FennecMochitestAssert.java:107)
	at org.mozilla.fennec.FennecMochitestAssert.ok(FennecMochitestAssert.java:136)
	at org.mozilla.fennec.tests.testVkbOverlap.testVkbOverlap(testVkbOverlap.java:55)
	at java.lang.reflect.Method.invokeNative(Native Method)
	at java.lang.reflect.Method.invoke(Method.java:521)
	at android.test.InstrumentationTestCase.runMethod(InstrumentationTestCase.java:204)
	at android.test.InstrumentationTestCase.runTest(InstrumentationTestCase.java:194)
	at android.test.ActivityInstrumentationTestCase2.runTest(ActivityInstrumentationTestCase2.java:186)
	at org.mozilla.fennec.tests.BaseTest.runTest(BaseTest.java:120)
	at junit.framework.TestCase.runBare(TestCase.java:127)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:118)
	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:169)
	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:154)
	at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:520)
	at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1447)
5 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | Exception caught - junit.framework.AssertionFailedError: 4 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428
6 INFO TEST-END | testVkbOverlap | finished in 47101ms
7 INFO TEST-START | Shutdown
8 INFO Passed: 2
9 INFO Failed: 2
10 INFO Todo: 0
11 INFO SimpleTest FINISHED
Running locally on a tegra, I normally see 4 drawFinished notifications following the testInputVisible assertion:

11-23 21:23:06.261 17865 17873 I Robocop : 3 INFO TEST-PASS | testVkbOverlap | testInputVisible - Found 2428 green pixels after scrolling
11-23 21:23:06.261 17865 17873 D RobocopMotionEventHelper: Triggering down at (5.0,690.0)
11-23 21:23:06.261 17865 17873 D RobocopMotionEventHelper: Triggering up at (5.0,690.0)
11-23 21:23:06.291 17865 17922 D Robocop : Received drawFinished notification
11-23 21:23:06.301 17865 17877 I GeckoScreenshot: rect: 678.000000, 0.000000, 154.000000, 696.000000
11-23 21:23:06.614 17865 17922 D Robocop : Received drawFinished notification
11-23 21:23:06.614 17865 17877 I GeckoScreenshot: rect: 678.000000, 0.000000, 154.000000, 696.000000
11-23 21:23:06.634 17865 17922 D Robocop : Received drawFinished notification
11-23 21:23:06.861  1025  1309 D dalvikvm: GC_EXPLICIT freed 14542 objects / 680640 bytes in 43ms
11-23 21:23:06.891 17865 17922 D Robocop : Received drawFinished notification
11-23 21:23:06.891  1025  1026 D dalvikvm: GREF has decreased to 1099
11-23 21:23:07.391 17865 17873 D dalvikvm: GC_FOR_MALLOC freed 2775 objects / 3012792 bytes in 40ms
11-23 21:23:07.391 17865 17873 I dalvikvm-heap: Grow heap (frag case) to 8.310MB for 2846736-byte allocation
11-23 21:23:07.442 17865 17867 D dalvikvm: GC_FOR_MALLOC freed 108 objects / 4864 bytes in 45ms
11-23 21:23:11.401 17865 17873 I Robocop : 4 INFO TEST-PASS | testVkbOverlap | testVkbOverlap - Found 2352 green pixels after tapping; expected 2428

From the failure log in the Description, there is only 1 drawFinished:

11-21 13:36:30.873 I/Robocop (10710): 3 INFO TEST-PASS | testVkbOverlap | testInputVisible - Found 2428 green pixels after scrolling
11-21 13:36:30.873 D/RobocopMotionEventHelper(10710): Triggering down at (5.0,690.0)
11-21 13:36:30.873 D/RobocopMotionEventHelper(10710): Triggering up at (5.0,690.0)
11-21 13:36:30.903 D/Robocop (10710): Received drawFinished notification
11-21 13:36:30.903 I/GeckoScreenshot(10710): rect: 678.000000, 0.000000, 154.000000, 696.000000
11-21 13:36:31.393 I/dalvikvm-heap(10710): Grow heap (frag case) to 7.594MB for 2846736-byte allocation
11-21 13:36:31.493 I/GeckoScreenshot(10710): rect: 678.000000, 0.000000, 154.000000, 696.000000
11-21 13:36:35.595 I/Robocop (10710): 4 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428

I notice there is a delay of over 4000 ms before the failed assertion and I cannot think of why that would be (blockUntilClear(LESS_THAN_CURSOR_BLINK_PERIOD) should only wait for 450 ms after the drawFinished).
(In reply to Geoff Brown [:gbrown] from comment #1)
> I notice there is a delay of over 4000 ms before the failed assertion and I
> cannot think of why that would be

I see now -- that's normal and present in my "success" log above. countGreenPixels takes the majority of the time.
Based on the log in comment 4, it looks like we might need to wait longer for the VKB to be displayed and the page to "settle" after simulating the tap. Here's my best guess as to what's happening:

> 12-05 15:56:40.226 I/Robocop (11861): 3 INFO TEST-PASS | testVkbOverlap | testInputVisible - Found 2428 green pixels after scrolling

Page with input field is done scrolling, and the input field is in view.

> 12-05 15:56:40.226 D/RobocopMotionEventHelper(11861): Triggering down at (5.0,690.0)
> 12-05 15:56:40.236 D/RobocopMotionEventHelper(11861): Triggering up at (5.0,690.0)

We simulate the tap on the input field

> 12-05 15:56:40.256 D/Robocop (11861): Received drawFinished notification

The page finishes some sort of draw, and our blockUntilClear(450) timer starts. However it looks like two screenshots also get queued up, maybe because of the focus highlight that appears/disappears on the text input field.

> 12-05 15:56:40.256 I/GeckoScreenshot(11861): rect: 678.000000, 0.000000, 154.000000, 696.000000

Screenshot the first. This takes a some time, preventing the page from processing all the input-related events and scrolling the input field into view.

> 12-05 15:56:40.756 I/dalvikvm-heap(11861): Grow heap (frag case) to 6.602MB for 2846736-byte allocation

In the meantime, our blockUntilClear timeout expires and we start grabbing the pixels from the page (this allocation is the pixel buffer).

> 12-05 15:56:40.839 I/GeckoScreenshot(11861): rect: 678.000000, 0.000000, 154.000000, 696.000000

Screenshot the second. We probably finish grabbing the pixels around here somewhere as well.

> 12-05 15:56:41.999 I/global  (11861): Default buffer size used in BufferedOutputStream constructor. It would be better to be explicit if an 8k buffer is required.
> 12-05 15:56:45.849 I/Robocop (11861): 4 INFO TEST-UNEXPECTED-FAIL | testVkbOverlap | testVkbOverlap - Found 0 green pixels after tapping; expected 2428

We finish checking the pixels. The pixels were probably grabbed before the page scroll was done so there is no green.

Therefore I think inserting a delay between the tap and blockUntilClear might help with this as it will give the page time to settle into its new position and finish processing any focus change draws and/or screenshots that it's doing.
Attached patch PatchSplinter Review
Compiled but untested.
Assignee: nobody → bugmail.mozilla
Attachment #689198 - Flags: review?(gbrown)
Comment on attachment 689198 [details] [diff] [review]
Patch

Review of attachment 689198 [details] [diff] [review]:
-----------------------------------------------------------------

nit: you could use mSolo.sleep, which doesn't throw the mostly-useless InterruptedException
Attachment #689198 - Flags: review?(gbrown) → review+
Just to update the bug: we discussed an alternate solution to this racy behaviour during the mobile platform deep dive meeting today. The solution involves throwing things from the test thread onto the gecko main thread so that we can be sure they run after certain events in gecko. In this case we would queue it on some event such that the final check is guaranteed to run after the page scroll is done. This would allow us to eliminate the sleep and the test architecture would be more of a continuation-style. I plan to try this out but it will take some time so in the meantime I'll land this patch as is. This should help hopefully reduce robocop failures and also confirm my hypothesis from comment 6.
See Also: → 837318
testVkbOverlap was failing consistently; that was fixed in bug 837318. It may have continued as an intermittent failure and went unnoticed until Panda rc was unhidden on 2013-03-05.
kats, could you take a look at this top orange please? :-)
Flags: needinfo?(bugmail.mozilla)
Whiteboard: [leave open] → [disable-me 2013-04-01]
(In reply to Ed Morley [:edmorley UTC+0] from comment #82)
> kats, could you take a look at this top orange please? :-)

Looks like it only hits pandaboards, and it looks like it might be a legitimate failure in our code. Based on the log, after the input field is clicked on, the VKB does come up (as seen by the window resizing to 1280x236) but there are no green pixels on the screen. This could either happen because the page isn't scrolled correctly to move the input field into view, or it might be related to the "eglMakeCurrent:651 error 3009 (EGL_BAD_MATCH)" error output I see in the log. Either way it's something we should fix.

I pushed a try build to https://tbpl.mozilla.org/?tree=Try&rev=1b6c90ddaf1d with more logging to see if I can figure out what's going on.
Flags: needinfo?(bugmail.mozilla)
Thank you :-)
(In reply to Geoff Brown [:gbrown] from comment #22)
> testVkbOverlap was failing consistently; that was fixed in bug 837318. It
> may have continued as an intermittent failure and went unnoticed until Panda
> rc was unhidden on 2013-03-05.

Although panda rc was unhidden on 2013-03-05, it was still running before that date. I wrote a script to scrape the inbound logs from ftp.mozilla.org and see if it was failing before that date (attached). The output shows that the first failure happened here:

FAIL 1362526821/ mozilla-inbound_panda_android_test-robocop-2-bm44-tests1-panda-build3.txt.gz

which is actually the same build reported in comment 15. Since then there have been 118 failures and 1844 passes, which is around a 6% failure rate. I'll retrigger the test on panda for some of the changesets prior to the one it first showed up on in an attempt to bisect the exact cset where this started happening. Note that we can do this even for csets where panda rc was hidden by using the build self-serve link and looking at the result log files in the ftp folder.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #87)
> I pushed a try build to https://tbpl.mozilla.org/?tree=Try&rev=1b6c90ddaf1d
> with more logging to see if I can figure out what's going on.

That build didn't work, maybe because of the mozconfig changes I accidentally included. New try at https://tbpl.mozilla.org/?tree=Try&rev=e8b95e6e65e2
I found an earlier build on which I was able to trigger this failure via manual retriggers. Log at [1].

Because of the intermittentness it's tricky to definitively bisect the change that introduced this, so I'm going to hold off on that approach for now. For now I know that the change is at or before 6d8d36cce3e3.

The try push at [2] seems to indicate that the failures happen when we get two window-resize events instead of one. That is, for the success cases, when the input field is tapped, we get this:

Window-size changed to (1280,236)

but for the failure cases we get this:

Window-size changed to (1280,280)
Window-size changed to (1280,236)

In both cases I see the same GL error output so I don't think that is related. Will keep digging.

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=21061803&tree=Mozilla-Inbound
[2] https://tbpl.mozilla.org/?tree=Try&rev=e8b95e6e65e2
Yeah I think this is almost certainly related to the extra window-size event we get in some cases but not others. See [1] (success case) and [2] (failure case).

In the success case, the ZoomToRect message ends up animating to this viewport:

ImmutableViewportMetrics v=(0.0,18.0,1280.0,254.0) p=(0.0,0.0,1280.0,254.0) c=(0.0,0.0,1280.0,254.0) m=(0.0,0.0,0.0,0.0) z=1.0

but in the failure case, the first ZoomToRect animates to here:

ImmutableViewportMetrics v=(0.0,18.0,1280.0,298.0) p=(0.0,0.0,1280.0,298.0) c=(0.0,0.0,1280.0,298.0) m=(0.0,0.0,0.0,0.0) z=1.0

which seems fine given the window size but then the second ZoomToRect animates to here:

ImmutableViewportMetrics v=(65.51178,0.0,1345.5117,280.0) p=(0.0,0.0,1411.0236,280.0) c=(0.0,0.0,1280.0,254.0) m=(0.0,0.0,0.0,0.0) z=1.1023622

which is at y=0 and so doesn't keep the input field in view. This is certainly a bug in our code, but I don't know if it's a regression or not. It's certainly not nice of the panda boards to be sending different sets of events in some cases but I guess we should be handling it correctly regardless.

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=21108320&tree=Try&full=1
[2] https://tbpl.mozilla.org/php/getParsedLog.php?id=21108256&tree=Try&full=1
Based on the log (with even more debug output) at [1], I believe the fundamental problem in our code is that if the viewport size changes while we're in the middle of a bounce animation, the viewport size change gets clobbered. Our bounce animation always interpolates between mBounceStartMetrics and mBounceEndMetrics, which are calculated at the start of the bounce using the original viewport size.

In this case, we process the first viewport size change and our viewport is at v=(0.0,0.0,1280.0,280.0) when we get the first animatedZoomTo call. We do an animation to interpolate our way over to v=(0.0,18.0,1280.0,298.0). At some point during this animation the viewport size changes again, and the java code clobbers this size change because of the animation as described above. The JS code triggers a second animatedZoomTo, this time to v=(0.0,18.0,1280.0,254.0). This request arrives around 2/3 of the way through the first animation, at v=(0.0,12.667846,1280.0,292.66785).

This test is particular annoying in this case, because it so happens that the second viewport resize shrinks the viewport height from 280 down to 236, which means the page height shrinks from 298 down to 254 (since the page is 100% of screen height + height of input field). Since 254 is less than 280 (which is what the Java code still thinks the viewport height is), the java code then increases the zoom factor even though it shouldn't have needed to, and then that throws the rendering off.

I think the correct fix here is to somehow not drop the viewport size change that arrives in the middle of an animation, but I worry that may not fully fix the problems. Still, it's somewhere to start.

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=21210632&tree=Try&full=1#error0
I pushed a patch with my fix to https://tbpl.mozilla.org/?tree=Try&rev=f98f88263673 and it seems to be an improvement in that this "found 0 green pixels" error hasn't shown up yet. There's all sorts of other intermittents and restarts though. I'll monitor it some more and if it still looks good I'll clean up the patches and request review.
Landed bug 856039 which should fix this problem. Leaving open for now, but unassigning from myself since my work here is done. The test seems to be hidden now anyway (not sure if it was because of this bug or some other).
Assignee: bugmail.mozilla → nobody
Whiteboard: [disable-me 2013-04-01]
Agreed - the only orange I'm seeing since this landed is the Panda timeouts. Thanks Kats! Re-assigning it to you for posterity's sake since it was your work that fixed it.
Assignee: nobody → bugmail.mozilla
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 22
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: