Closed Bug 756817 Opened 12 years ago Closed 12 years ago

Fix and reenable tcheck2 and tcheck3

Categories

(Testing :: Talos, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: gbrown)

Attachments

(2 files)

See, for example, https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ab3f805f3210&jobname=remote-trobocheck3 where a run of bad luck has us at 21 failed runs in a row and counting. Typical numbers are more like 6-8 runs before it goes green.

It's sort of funny, in a sick horrible way, to keep retriggering it and retriggering it and retriggering it, but the fact is, we don't have the tegra capacity to add another 6-8(-21) runs per push, and I'll be hiding it and filing a bug to get it turned off everywhere except Try, where we can leave it running hidden for the benefit of whoever's going to try to fix it.

Then I'm going to be looking at whether we can actually afford to run the other two robochecks, since while I was counting check3, I counted check2 at 66% failure.
Depends on: 756818
Several local runs of check3 have succeeded, with no sign of trouble. However, in one run, the test hung and timed out. Logs showed Gecko:Ready was received, but the test was waiting for something before entering the url -- possibly a variation on bug 756813.
Wrong bug -- I meant bug 756183.
Assignee: nobody → gbrown
Depends on: 758792, 756183, 758405
Depends on: 759792
Depends on: 756704
I have found several possible ways for robocop tests to hang and spun off dependent bugs. Fixes for most of those bugs do not have an appreciable effect on the failure rate for testCheck3 (or testCheck2).

My best guess is that bug 759792 is the main cause of testCheck3 failures, but the evidence for this is still unclear, and I don't yet have a fix for that bug.
One curiosity about this bug is that tcheck3 fails more than 90% of the time, while its cousin tcheck2 fails less than 50% of the time. tcheck3 differs from tcheck2 only in that it disables the screenshot feature. I did a try run with that difference removed, making tcheck2 and tcheck3 identical. tcheck3 continued to fail: https://tbpl.mozilla.org/?tree=Try&rev=fd51955d1ab3&noignore=1.
The configs inside talos and buildbot are identical except for the name.

If we have 1 difference between the two tests, why do we have two tests? Is it really that critical to have a screenshot be the difference?
One other curiosity is that following the landing of bug 755070 (which apparently did something to avoid screenshotting), tcheck2 now fails 90% of the time.
Blocks: 755070
well, then tcheck3 is not needed :)  

Seriously, the mobile devs should look at why this is happening. Maybe it is a timing issue or a resource issue.
Right.

I'll hide check2, and file and fix a bug to stop running it, just like check3, when I get home.

Should I go ahead and drop check3 from running hidden on Try, or run both check2 and check3 pointlessly for nobody to see on Try until we finally forget why they even run there?
(In reply to Joel Maher (:jmaher) from comment #7)
> well, then tcheck3 is not needed :)  

Damon requested this. The issue is that we are shipping with screenshotting on and need to make sure we don't regress; however, we are also working on solutions to turn screenshotting off and need to know how we progress there.
Summary: Fix and reenable tcheck3 → Fix and reenable tcheck2 and tcheck3
(In reply to Phil Ringnalda (:philor) from comment #8)
> Should I go ahead and drop check3 from running hidden on Try, or run both
> check2 and check3 pointlessly for nobody to see on Try until we finally
> forget why they even run there?

Please keep them both running on Try -- failures occur much more reliably on Try than any local configuration that I have access to, and I am actively testing on Try: https://tbpl.mozilla.org/?tree=Try&rev=c14c319dc45c.
Depends on: 765830
Finally some progress: In recent builds, I am now often seeing loadAndPaint() taking several minutes to complete, or not completing at all (likely because of a timeout). When running locally, the cnn test page is displayed and has not started scrolling or zooming. logcat shows a constant stream of:
...
D/Robocop (30061): Received drawFinished notification
I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000
D/Robocop (30061): Received drawFinished notification
I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000
D/Robocop (30061): Received drawFinished notification
I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000
D/Robocop (30061): Received drawFinished notification
I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000
D/Robocop (30061): Received drawFinished notification
...
There are often multiple drawFinished events per second.

loadAndPaint() is stuck in blockUntilClear(500), which is waiting for a duration of 500 ms in which no drawFinished events are received. Since drawFinished events keep being generated, blockUntilClear keeps resetting and waiting, again and again, until the test times out.
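
To illustrate why the wait never finishes, here is a minimal Python simulation of the quiet-window logic described above. It is a sketch under assumptions, not the Robocop implementation: the name block_until_clear, its parameters, and the event spacing used in the example are hypothetical, chosen only to mimic a page that keeps repainting.

```python
from typing import Iterable, Optional


def block_until_clear(event_times_ms: Iterable[int], quiet_ms: int,
                      timeout_ms: int) -> Optional[int]:
    """Simulate waiting for a quiet window with no paint events.

    event_times_ms: sorted timestamps (ms since the wait began) of
    drawFinished-like events. Returns the time at which a window of
    quiet_ms with no events completes, or None if that would happen
    only after timeout_ms.
    """
    window_start = 0  # the quiet window restarts at every event
    for t in event_times_ms:
        if window_start + quiet_ms <= t:
            break  # the window completed before this event arrived
        window_start = t  # event landed inside the window: reset it
    done = window_start + quiet_ms
    return done if done <= timeout_ms else None


# A page that stops painting after 200 ms clears quickly.
settled = block_until_clear([100, 200], quiet_ms=500, timeout_ms=30000)

# A page that repaints every 300 ms (hypothetical spacing) never
# offers a 500 ms gap, so the wait runs into the timeout.
stuck = block_until_clear(range(0, 60000, 300), quiet_ms=500,
                          timeout_ms=30000)
```

With these inputs, settled is 700 and stuck is None: any event source that fires more often than once per quiet_ms keeps the window from ever completing.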


Why are we getting all these drawFinished events?
(BTW, I hypothesize that screenshots interrupted the generation or delivery of drawFinished events sufficiently to allow blockUntilClear to progress. That's why tcheck2 worked better than tcheck3 initially, and why tcheck2 started failing more once 755070 landed.)
I can manipulate the test so that it doesn't get stuck, but I would prefer not to, unless there is a good reason for the repeated drawFinished events.

https://tbpl.mozilla.org/?tree=Try&rev=903c463396f6&noignore=1
Removed 2 dependencies: I don't have evidence that those bugs are affecting these tests on the tegras.
No longer depends on: 756704, 759792
(In reply to Geoff Brown [:gbrown] from comment #11)
> Finally some progress: In recent builds, I am now often seeing
> loadAndPaint() taking several minutes to complete, or not completing at all
> (likely because of a timeout). When running locally, the cnn test page is
> displayed and has not started scrolling or zooming. logcat shows a constant
> stream of:
> ...
> D/Robocop (30061): Received drawFinished notification
> I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000,
> 257.000000
> D/Robocop (30061): Received drawFinished notification
> I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000,
> 257.000000
> D/Robocop (30061): Received drawFinished notification
> I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000,
> 257.000000
> D/Robocop (30061): Received drawFinished notification
> I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000,
> 257.000000
> D/Robocop (30061): Received drawFinished notification
> ...
> There are often multiple drawFinished events per second.
> 
> loadAndPaint() is stuck in blockUntilClear(500), which is waiting for a
> duration of 500 ms in which no drawFinished events are received. Since
> drawFinished events keep being generated, blockUntilClear keeps resetting
> and waiting, again and again, until the test times out.
> 
> 
> Why are we getting all these drawFinished events?

Presumably, we are actually painting. We'll get these events if there is a blinking cursor or an animated gif. The rect reported with the GeckoScreenshot tag is the bounding dirty rect of the mozAfterPaint event, which is showing the same 30x30px rect on the screen being repainted continuously.
As expected, here is a 30x30 animated gif, which is displayed at roughly that position on the cnn.com page in talos:

https://mxr.mozilla.org/build/source/talos/talos/startup_test/fennecmark/cnn/i.cdn.turner.com/cnn/.element/img/3.0/global/misc/loading.gif
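
The "same rect repainted continuously" observation can also be checked mechanically on a captured log. The following Python sketch groups the GeckoScreenshot rect lines by value; the function name count_rects and the regular expression are assumptions based only on the log excerpt quoted in this bug, and the four numbers are treated as an opaque tuple because the excerpt does not say which field is which.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000
RECT_RE = re.compile(
    r"GeckoScreenshot\(\s*\d+\):\s*rect:\s*"
    r"([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)"
)


def count_rects(lines: Iterable[str]) -> Counter:
    """Count how often each reported rect appears in the log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = RECT_RE.search(line)
        if match:
            counts[tuple(float(v) for v in match.groups())] += 1
    return counts


sample = [
    "D/Robocop (30061): Received drawFinished notification",
    "I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000",
    "D/Robocop (30061): Received drawFinished notification",
    "I/GeckoScreenshot(30061): rect: 227.000000, 837.000000, 867.000000, 257.000000",
]
rect_counts = count_rects(sample)
```

A log dominated by a single rect, as in the sample, points at one small region being repainted over and over, which is what an animated image would produce.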
I don't think I can test this on try, but I suspect this will fix the issue.
Attachment #635269 - Flags: review?(jmaher)
Comment on attachment 635269 [details] [diff] [review]
patch to make loading gif static

Review of attachment 635269 [details] [diff] [review]:
-----------------------------------------------------------------

We will have to land this on talos and get releng to update the bits.
Attachment #635269 - Flags: review?(jmaher) → review+
I searched the talos files for other animated gifs. The one Brad found is the only one in fennecmark. 

There are animated gifs in page_load_test, but I would not expect these to cause a problem:

./page_load_test/mobile_tp4/amazon.com/z-ecx.images-amazon.com/images/G/01/s9-campaigns/music-player/spinner._V206785309_.gif
./page_load_test/mobile_tp4/amazon.com/z-ecx.images-amazon.com/images/G/01/x-locale/personalization/shoveler/loading-indicator._V31970667_.gif
./page_load_test/mobile_tp4/amazon.com/g-ecx.images-amazon.com/images/G/01/advertising/banners/gw/300x250_grocery._V44581809_.gif
./page_load_test/mobile_tp4/amazon.com/g-ecx.images-amazon.com/images/G/01/x-locale/personalization/shoveler/loading-indicator._V31970667_.gif
./page_load_test/mobile_tp4/m.yandex.ru/yandex.st/lego/_/ACaplGGucdbQGmgLs7pJSLAUQAk.gif
./page_load_test/mobile_tp4/m.yandex.ru/awaps.yandex.ru/0/c1/tgFtaeLDK0ya43AvjS329vn3evnqZ4urteKsRuH89GZogQqjF7S6XDpF2VxU9_tZm3kyskqUp4M1vdnyR5tzzCP-s6zpmTRK1-EeFol3OwpcpxN2A6W1G5qQc1A_t7uxFqah3rCyMK1kKb654gJP4JqASrTNHdu8U2xvLZY7vmAzM67vFGmmyUBAp_p0SACyYWxw7l1cTxYe5QStu1fyDzRyhW9UDscKUj97aCLgrA5Q66BAfYA_A_.gif
./page_load_test/mobile_tp4/m.nytimes.com/mobile.nytimes.com/i/s-loader.gif
./page_load_test/mobile_tp4/m.baidu.com/wap.baidu.com/static/hb/hot.gif
./page_load_test/mobile_tp4/m.twitter.com/a0.twimg.com/a/1297205904/images/loader.gif
./page_load_test/mobile_tp4/m.twitter.com/a1.twimg.com/a/1297205904/images/spinner.gif
./page_load_test/mobile_tp4/m.twitter.com/a3.twimg.com/a/1297205904/images/petal_spinner.gif
./page_load_test/mobile_tp4/m.twitter.com/a3.twimg.com/a/1297205904/images/spinner.gif
./page_load_test/mobile_tp4/m.twitter.com/a3.twimg.com/a/1297205904/images/icon_throbber.gif
./page_load_test/mobile_tp4/m.twitter.com/a3.twimg.com/a/1297205904/images/ajax.gif
./page_load_test/mobile_tp4/m.espn.com/abcnews.go.com/assets/iproto/images/loader_vid.gif
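
For anyone repeating this kind of search, one possible way to spot animated gifs is to count image frames in each file. The Python sketch below is not the method used for the list above (that is not recorded here); it is a standard-library-only illustration that walks the GIF block structure without decoding pixels. The names count_gif_frames and is_animated_gif are hypothetical.

```python
def count_gif_frames(data: bytes) -> int:
    """Count image descriptors (frames) in raw GIF bytes."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = data[10]
    pos = 13  # 6-byte header + 7-byte logical screen descriptor
    if packed & 0x80:  # global color table present
        pos += 3 * (2 << (packed & 0x07))
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:  # trailer: end of GIF data
            break
        if block == 0x21:  # extension: skip the label byte
            pos += 1
        elif block == 0x2C:  # image descriptor: one frame
            frames += 1
            img_packed = data[pos + 8]
            pos += 9  # left, top, width, height, packed flags
            if img_packed & 0x80:  # local color table present
                pos += 3 * (2 << (img_packed & 0x07))
            pos += 1  # LZW minimum code size byte
        else:
            raise ValueError("unexpected block type 0x%02X" % block)
        # Skip data sub-blocks: a size byte, then that many bytes,
        # repeated until a zero-length terminator.
        while pos < len(data) and data[pos] != 0:
            pos += data[pos] + 1
        pos += 1
    return frames


def is_animated_gif(data: bytes) -> bool:
    """A GIF with more than one frame is treated as animated."""
    return count_gif_frames(data) > 1
```

To use it on a tree of files, read each .gif in binary mode and pass the bytes to is_animated_gif. Truncated or malformed files may raise ValueError or IndexError, so a real scan would want to catch those.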
With the static gif, tcheck2 and tcheck3 are running much better:

https://tbpl.mozilla.org/?tree=Try&rev=085a378263e6&noignore=1
Attachment #637349 - Flags: review?(armenzg) → review+
landed in buildbot-configs:
http://hg.mozilla.org/build/buildbot-configs/rev/4f147929e69d

The next step is to get this rolled out in a buildbot reconfig :)
Comment on attachment 637349 [details] [diff] [review]
turn on tcheck3 by default in buildbot (1.0)

This is live now.
tcheck2 and tcheck3 are still hidden on Try (only)... what did we miss?
(In reply to Geoff Brown [:gbrown] from comment #25)
> what did we miss?

Unhiding them? :)

Done.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED