Closed Bug 1390059 Opened 5 years ago Closed 1 year ago

Intermittent testFlingCorrectness | Pixel at 0, 0 - Color rgba(0,0,0,0) not close enough to expected rgb(0,0,0)

Categories

(Firefox for Android Graveyard :: Testing, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=122889230&repo=autoland

https://queue.taskcluster.net/v1/task/Fu4h4H5CS9ObuhFktXYw_g/runs/0/artifacts/public/test_info//robocop_errorsummary.log

[task 2017-08-14T00:26:57.286312Z] 00:26:57     INFO -  TEST-START | testFlingCorrectness
[task 2017-08-14T00:26:57.286985Z] 00:26:57     INFO -  GECKO | EventExpecter: no longer listening for Gecko:Ready
[task 2017-08-14T00:26:57.287508Z] 00:26:57     INFO -  GECKO | EventExpecter: no longer listening for Content:DOMContentLoaded
[task 2017-08-14T00:26:57.288166Z] 00:26:57     INFO -  GECKO | PaintExpecter: no longer listening for events
[task 2017-08-14T00:26:57.288841Z] 00:26:57     INFO -  Buffered messages logged at 00:26:15
[task 2017-08-14T00:26:57.288925Z] 00:26:57     INFO -  TEST-PASS | testFlingCorrectness | Robocop tests need the test device screen to be powered on. -
[task 2017-08-14T00:26:57.293426Z] 00:26:57     INFO -  Buffered messages logged at 00:26:24
[task 2017-08-14T00:26:57.293522Z] 00:26:57     INFO -  TEST-PASS | testFlingCorrectness | Given message occurred for registered event: {type=null} - Gecko:Ready should equal Gecko:Ready
[task 2017-08-14T00:26:57.293575Z] 00:26:57     INFO -  Buffered messages logged at 00:26:34
[task 2017-08-14T00:26:57.293683Z] 00:26:57     INFO -  TEST-PASS | testFlingCorrectness | Given message occurred for registered event: {metadata=null, bgColor=rgba(0, 0, 0, 0), type=null, tabID=0, errorType=} - Content:DOMContentLoaded should equal Content:DOMContentLoaded
[task 2017-08-14T00:26:57.293729Z] 00:26:57     INFO -  Buffered messages finished
[task 2017-08-14T00:26:57.293802Z] 00:26:57  WARNING -  TEST-UNEXPECTED-FAIL | testFlingCorrectness | Pixel at 0, 0 - Color rgba(0,0,0,0) not close enough to expected rgb(0,0,0)
[task 2017-08-14T00:26:57.293919Z] 00:26:57     INFO -  0 ERROR Exception caught during test! - junit.framework.AssertionFailedError: TEST-UNEXPECTED-FAIL | testFlingCorrectness | Pixel at 0, 0 - Color rgba(0,0,0,0) not close enough to expected rgb(0,0,0)
[task 2017-08-14T00:26:57.293975Z] 00:26:57     INFO -  	at junit.framework.Assert.fail(Assert.java:50)
[task 2017-08-14T00:26:57.294047Z] 00:26:57     INFO -  	at org.mozilla.gecko.FennecMochitestAssert._logMochitestResult(FennecMochitestAssert.java:124)
[task 2017-08-14T00:26:57.294113Z] 00:26:57     INFO -  	at org.mozilla.gecko.FennecMochitestAssert.ok(FennecMochitestAssert.java:145)
[task 2017-08-14T00:26:57.294182Z] 00:26:57     INFO -  	at org.mozilla.gecko.FennecMochitestAssert.ispixel(FennecMochitestAssert.java:164)
[task 2017-08-14T00:26:57.294248Z] 00:26:57     INFO -  	at org.mozilla.gecko.tests.PixelTest.checkScrollWithBoxes(PixelTest.java:95)
[task 2017-08-14T00:26:57.294313Z] 00:26:57     INFO -  	at org.mozilla.gecko.tests.PixelTest.loadAndVerifyBoxes(PixelTest.java:112)
[task 2017-08-14T00:26:57.294385Z] 00:26:57     INFO -  	at org.mozilla.gecko.tests.testFlingCorrectness.testFlingCorrectness(testFlingCorrectness.java:25)
[task 2017-08-14T00:26:57.294443Z] 00:26:57     INFO -  	at java.lang.reflect.Method.invokeNative(Native Method)
[task 2017-08-14T00:26:57.294504Z] 00:26:57     INFO -  	at java.lang.reflect.Method.invoke(Method.java:525)
[task 2017-08-14T00:26:57.294573Z] 00:26:57     INFO -  	at android.test.InstrumentationTestCase.runMethod(InstrumentationTestCase.java:214)
[task 2017-08-14T00:26:57.294641Z] 00:26:57     INFO -  	at android.test.InstrumentationTestCase.runTest(InstrumentationTestCase.java:199)
[task 2017-08-14T00:26:57.294713Z] 00:26:57     INFO -  	at android.test.ActivityInstrumentationTestCase2.runTest(ActivityInstrumentationTestCase2.java:192)
[task 2017-08-14T00:26:57.294779Z] 00:26:57     INFO -  	at org.mozilla.gecko.tests.BaseRobocopTest.runTest(BaseRobocopTest.java:188)
[task 2017-08-14T00:26:57.294836Z] 00:26:57     INFO -  	at junit.framework.TestCase.runBare(TestCase.java:134)
[task 2017-08-14T00:26:57.295083Z] 00:26:57     INFO -  	at junit.framework.TestResult$1.protect(TestResult.java:115)
[task 2017-08-14T00:26:57.295327Z] 00:26:57     INFO -  	at junit.framework.TestResult.runProtected(TestResult.java:133)
[task 2017-08-14T00:26:57.295564Z] 00:26:57     INFO -  	at junit.framework.TestResult.run(TestResult.java:118)
[task 2017-08-14T00:26:57.295804Z] 00:26:57     INFO -  	at junit.framework.TestCase.run(TestCase.java:124)
[task 2017-08-14T00:26:57.296054Z] 00:26:57     INFO -  	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:191)
[task 2017-08-14T00:26:57.296392Z] 00:26:57     INFO -  	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:176)
[task 2017-08-14T00:26:57.296652Z] 00:26:57     INFO -  	at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:554)
[task 2017-08-14T00:26:57.296915Z] 00:26:57     INFO -  	at org.mozilla.gecko.FennecInstrumentationTestRunner.onStart(FennecInstrumentationTestRunner.java:64)
[task 2017-08-14T00:26:57.297162Z] 00:26:57     INFO -  	at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1701)
[task 2017-08-14T00:26:57.297554Z] 00:26:57  WARNING -  TEST-UNEXPECTED-FAIL | testFlingCorrectness | Exception caught - junit.framework.AssertionFailedError: TEST-UNEXPECTED-FAIL | testFlingCorrectness | Pixel at 0, 0 - Color rgba(0,0,0,0) not close enough to expected rgb(0,0,0)
See Also: → 1387264
This became very frequent today.
Whiteboard: [stockwell needswork]
There have been earlier, occasional failures similar to this (bug 1387264), but https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=robocop&tochange=b6840d399d8208d3c6ef890b97cf88e16d41381d&fromchange=09b6a5ec8d7cc93edd3929594f1760774f85348e suggests that bug 1390321 made this a frequent failure.

:botond, :kats -- Can you have a look?
Blocks: 1390321
Flags: needinfo?(bugmail)
Flags: needinfo?(botond)
I did some more pushes and indeed this seems to have started with bug 1390321. However that patch was definitely only removing dead code so I don't see how it could be directly responsible. It must have perturbed something else somehow.

That being said I also looked at the logs between a passing run and a failing run and there's not really much to go on. The passing run finds the pixel at (0,0) to be rgba(0, 0, 0, 255) as expected and the failing run finds rgba(0, 0, 0, 0). The screenshot on the failing run is also all black in the content area which is consistent.

I think next steps here would be to reproduce locally. I'll see if I can find some time to try that.
Flags: needinfo?(bugmail)
As Kats mentioned, bug 1390321 was just dead code removal, so any effect on this intermittent failure must be very incidental, like changing some code size / timings.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> I think next steps here would be to reproduce locally. I'll see if I can
> find some time to try that.

Thanks - let me know if I can help with something.
Flags: needinfo?(botond)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> I think next steps here would be to reproduce locally. I'll see if I can
> find some time to try that.

I ran the test locally using `mach robocop testFlingCorrectness` in the android 4.3 emulator but was unable to repro the problem. I get the same behaviour on an older (pre-bug 1390321) build as well as a newer one that includes bug 1390321. In both cases the page loads, the pixel checks run and pass, the page scrolls, the rest of the pixel checks run and pass, and then it seems like the test hangs waiting for something and eventually times out. It definitely gets past the point of the error we're seeing in automation.

I kicked off a try push with the "regressing bug" backed out to see if that makes the problem go away. https://treeherder.mozilla.org/#/jobs?repo=try&revision=042be61521665f3994ce5b7492e4c611be926c22
As I suspected, the backout doesn't fix the problem. The options now are to debug via try pushes or disable the test. Or just hope it resolves itself. Frankly, I don't know if it's worth the effort to debug.
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/570450788720
Disable testFlingCorrectness. r=me a=testonly
leaving this open, please consider looking into the failure and fixing this in the near future :)
Flags: needinfo?(gbrown)
Keywords: leave-open
Whiteboard: [stockwell needswork] → [stockwell disabled]
The robocop PixelTests - testFlingCorrectness, testPanCorrectness, testLoad, testAxisLocking - have much in common and tend to fail together. There are similar failures recently reported against all of them. We may need to disable all of them.

I had another look at treeherder and retried many robocop tasks:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=robocop&fromchange=57e52be85a1ad9904a4de911ad007db11d5af7b6&tochange=7d03c3a5bb4b722c07e7acf76509f59cb9a88fa5

Bug 1390321 remains the earliest failure in this range (as in comment 2), but retries show that is was still very infrequent - just one failure in ~40 runs. (Thanks for checking :kats!)

Now it looks like the failure became frequent with https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b6840d399d8208d3c6ef890b97cf88e16d41381d&filter-searchStr=robocop, bug 1389715. That seems really unlikely, but I see 13 failures in 22 runs - a big change.

:gps -- I can try to narrow this down further, but thought I'd check with you quickly first. It looks to me like those changes should have only affected checkouts + a no-op for deleting the unused image. Any idea how it could have affected tests?
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Priority: P5 → P1
Flags: needinfo?(gps)
Because Docker image rebuilding isn't deterministic, every time we rebuild Docker images we could pick up a different, newer package version and this can break things in mysterious ways. See bug 1289812.

This is really unfortunate because it can create a) roadblocks to updating Docker images and b) unplanned work to fix whatever breaks. However, if things fail with the latest packages installed by a distro, then those same failures will likely manifest on actual machines in the wild. So I consider these legitimate bugs worth investigating.

I think the appropriate thing to do is disable failing tests as a result of a Docker image rebuilds / package upgrades and then fix the failures and restore the test at the earliest convenience. This is annoying. But if we block on fixing the test failures, then it becomes much harder to do things like upgrade packages for security fixes, which is exactly what happened in bug 1389715.
Flags: needinfo?(gps)
See Also: → 1394692
Thanks :gps.

For completeness, I tried a backout, but failures persisted -- consistent with the non-deterministic image rebuilding theory.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4b524c1a1003d1220042ef94e99af6a35cf28e1d
My conclusion is that the PixelTest-based robocop tests are flaky. 

I'm particularly concerned about the poor definition of the drawFinished event. As I recall, we initially used drawFinished as a signal that the test could proceed to check pixel colors. But now tests generally wait for drawFinished to be received, expect multiple such events, and then wait for a period of time when drawFinished is *not* received as an approximation of when it is safe-ish to check pixels. We need a better signal.

I'd like to disable the remaining PixelTests, dup the related bugs here, leave this bug open, and hope that some interested party re-writes this mechanism and restores these tests one day.
Attachment #8902407 - Flags: review?(snorp)
Comment on attachment 8902407 [details] [diff] [review]
disable remaining PixelTests

Sorry, I'm anxious to stop the bleeding here: I'm going to go ahead and check this in, r=me.
Attachment #8902407 - Flags: review?(snorp) → feedback?(snorp)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5f6ae84f80cb
Disable all robocop PixelTests, for intermittent failures; r=me,test-only
Duplicate of this bug: 1387264
Duplicate of this bug: 1394218
Duplicate of this bug: 1394183
Duplicate of this bug: 1394159
Duplicate of this bug: 1392073
Duplicate of this bug: 1389808
Duplicate of this bug: 1283784
Duplicate of this bug: 1394692
Duplicate of this bug: 1395038
Duplicate of this bug: 1395041
Duplicate of this bug: 1395045
Duplicate of this bug: 1395165
Duplicate of this bug: 1395180
Assignee: gbrown → nobody
Priority: P1 → P3
Assignee: nobody → rbarker
Bulk priority update of open intermittent test failure bugs. 

P3 => P5

https://bugzilla.mozilla.org/show_bug.cgi?id=1381960
Priority: P3 → P5
Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195

Needinfo :susheel if you think this bug should be re-triaged.
Assignee: rbarker → nobody
We have completed our launch of our new Firefox on Android. The development of the new versions use GitHub for issue tracking. If the bug report still reproduces in a current version of [Firefox on Android nightly](https://play.google.com/store/apps/details?id=org.mozilla.fenix) an issue can be reported at the [Fenix GitHub project](https://github.com/mozilla-mobile/fenix/). If you want to discuss your report please use [Mozilla's chat](https://wiki.mozilla.org/Matrix#Connect_to_Matrix) server https://chat.mozilla.org and join the [#fenix](https://chat.mozilla.org/#/room/#fenix:mozilla.org) channel.
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.