Closed Bug 913683 Opened 6 years ago Closed 6 years ago

Talos regressions in trobopan and tcheck2 on Android since Aug 22 2013

Categories

(Firefox for Android :: General, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 27
Tracking Status
firefox26 --- fixed
firefox27 --- fixed
fennec - ---

People

(Reporter: gbrown, Assigned: Margaret)

References

Details

(Keywords: perf, regression)

Attachments

(2 files)

Since about Aug 22, trobopan and tcheck2 have been unstable on both Android 2.2 and Android 4.0, triggering lots of Talos alerts. 

I think we are already working on sorting this out (via bug 908823 especially) but I wanted another bug on file to reflect that there have been various regressions and improvements reported since then (it seems pointless to report bugs for all of the individual alerts -- there's a new one almost every day.)

http://graphs.mozilla.org/graph.html#tests=[[201,11,20]]&sel=none&displayrange=30&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[174,11,20]]&sel=none&displayrange=30&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[201,11,29]]&sel=none&displayrange=30&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[174,11,29]]&sel=none&displayrange=30&datatype=running
Cc'ing kats, since we were talking about this on IRC this morning.
Yeah I started looking at this a little today. Note that it's only the Tegras that have been off since the fig landing, and I agree with Margaret that for some reason it was because we were running the test on the home screen rather than the test page. Her bandaid makes us wait a bit longer but I think we're still running the test on the home screen for the most part, based on the huge variance I'm seeing in the tegra talos numbers. It did however make the tests run to completion more often.

I think we need to dig into why the behaviour is wrong on the tegra devices. Margaret, when you ran it locally with your patch, did you see the test getting run on the actual page? I wonder if you can run it locally with and without your patch to see if the reported results are similar. If that is the case then maybe I'm wrong and your patch did fix the behaviour but the numbers are off for some other reason. If the reported results are different before and after (i.e. your patch fixes the talos numbers locally) then the tegras in production are not behaving the same way as your local testing and we need to root-cause the home pager problem.
Depends on: 910106, 907720
Also FWIW I did also do a couple of try pushes for trobopan. [1] just applies your patch to trobopan as well, so that it waits for the homepager. As you reported at bug 907720 comment 346, it fails occasionally. [2] is an additional patch on top of that that increases the timeout, and that one hasn't failed yet. In both cases though the reported talos values have a huge range so I don't think the underlying problem is actually fixed, it just hides the symptoms better.

[1] https://tbpl.mozilla.org/?tree=Try&rev=0440a3838828
[2] https://tbpl.mozilla.org/?tree=Try&rev=5f41b9bf69d8
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)

> I think we need to dig into why the behaviour is wrong on the tegra devices.
> Margaret, when you ran it locally with your patch, did you see the test
> getting run on the actual page?

No, I failed to get the real test pages to load. I may have just been putting the files from the talos repo in the wrong place, but I gave up assuming that not seeing the test get run on about:home anymore was a good enough sign (before applying my patch, I almost always saw us panning and zooming the about:home content, but after applying the patch, I always saw panning and zooming the 404 page that showed up when I couldn't load the test page).

If the numbers are still a mess, I agree that we should try to find a way to observe what's actually going on.
(In reply to :Margaret Leibovic from comment #4)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> No, I failed to get the real test pages to load. I may have just been
> putting the files from the talos repo in the wrong place, but I gave up
> assuming that not seeing the test get run on about:home anymore was a good
> enough sign (before applying my patch, I almost always saw us panning and
> zooming the about:home content, but after applying the patch, I always saw
> panning and zooming the 404 page that showed up when I couldn't load the
> test page).

It could be that the 404 page behaves differently from the actual test page (maybe it loads faster or sends different draw-finished notifications) and so that might throw off the test's behaviour.

To run e.g. testPan locally, I believe the steps I've used successfully in the past are:
1) Copy the relevant fileset (in this case wikipedia.html and en.wikipedia.org) from http://hg.mozilla.org/build/talos/file/ca2229a32cb6/talos/startup_test/fennecmark to your local mobile/android/base/tests/ folder.
2) Modify testPan.java.in to return TEST_MOCHITEST from getTestType()
3) Modify testPan.java.in's getAbsoluteUrl call argument to "/robocop/wikipedia.html"
4) Modify robocop.ini to just have testPan

Then rebuild and run as usual.

In the past I think I've only ever done this with the timecube.html page on testCheck.java.in and since that's a standalone file it works well. Wikipedia and CNN have an additional directory of external content and I don't know if the robocop packaging process will pick that up properly. You can modify the MOCHITEST_ROBOCOP_FILES variable in build/mobile/robocop/Makefile.in to include the "en.wikipedia.org" folder as well, which might do the trick.
We are already tracking the dependent bugs
tracking-fennec: ? → -
I just made a try push that experiments with making sure the home pager is hidden earlier, before we start trying to verify that the page is painted:
https://tbpl.mozilla.org/?tree=Try&rev=78d9b242bc90

I also increased the delay we wait between paint events, since locally I observed us starting to pan a blank page before anything started drawing.

Depending on how this goes, this patch may help fix both bug 908823 and bug 910106.

(For future reference, I found sticking the talos test page files directly in objdir/_tests/testing/mochitest/tests/robocop was the easiest way to make them available in local test runs.)
So, good news is that the patch in that try push brought the tpan numbers back closer to normal (fixing bug 908823).

The bad news is that it didn't fix the intermittent HomePager is hidden issue, and if we land this patch, we'll be adding that intermittent failure to tpan in addition to tcheck2 :(

Maybe I need to revisit the verifyHomePagerHidden implementation to see if something there is causing flakiness, or if the home pager just really doesn't get hidden. Maybe we need an even longer timeout? I can push a patch that does that to try. Maybe I should also land a patch that gives us some more logging.
An attempt to increase MAX_WAIT_HOME_PAGER_HIDDEN_MS, but not PAINT_CLEAR_DELAY:
https://tbpl.mozilla.org/?tree=Try&rev=e95999e1170f

An attempt that increases both:
https://tbpl.mozilla.org/?tree=Try&rev=3d666ef0831d
Assignee: nobody → margaret.leibovic
This patch moves the verifyHomePagerHidden check to before we start trying to verify that the page drawn, which seems like a logical change.

However, just moving that check still caused a lot of intermittent "HomePager is hidden failures". I added some logging to see what happens in those failure cases, and it appears the view just really is still visible, so I think these tegras are just really slow. Increasing MAX_WAIT_HOME_PAGER_HIDDEN_MS appears to fix this problem, but maybe we should start looking into why it actually takes so long for the home pager to disappear (this isn't only dependent on the Java UI, since in these tests the home pager is only hidden after we get a location change event).  

When I got this test running locally I saw that without the PAINT_CLEAR_DELAY increase, we would end up starting to pan the page before it finished drawing. This is definitely wrong, and I believe that's what's caused the big spike reported in bug 908823. This increase appears to fix the issue right now, but I'd also like to look into why this wasn't a problem before the fig merge, since the time to paint a page after the home pager is hidden shouldn't have changed with the fig changes.

Lucas has been doing some fig-related performance investigations, so maybe he would have some ideas about these things. In any case, I think this patch fixes a lot of our failure/bogus-number issues, so it would be worthwhile to land.
Attachment #804802 - Flags: review?(gbrown)
Comment on attachment 804802 [details] [diff] [review]
patch to move verifyHomePagerHidden check and increase timeouts

Review of attachment 804802 [details] [diff] [review]:
-----------------------------------------------------------------

I too am concerned about the length of delays that seem to be necessary, but I don't know of another way to get these passing. Let's hope those performance investigations turn up some answers soon!
Attachment #804802 - Flags: review?(gbrown) → review+
(In reply to Geoff Brown [:gbrown] from comment #11)
> I too am concerned about the length of delays that seem to be necessary, but
> I don't know of another way to get these passing.

The test itself could be broken (or not working well with fig), and if that's the case, then the test needs to be fixed... but first it's probably best to understand what changed with fig that made the test irrelevant?
https://hg.mozilla.org/integration/fx-team/rev/dea697e42d70

Let's see how well this works, and we can look to uplift to Aurora if necessary.
(In reply to :Margaret Leibovic from comment #13)
> https://hg.mozilla.org/integration/fx-team/rev/dea697e42d70
> 
> Let's see how well this works, and we can look to uplift to Aurora if
> necessary.

Failing test:
https://tbpl.mozilla.org/php/getParsedLog.php?id=27974067&tree=Fx-Team
Ugh, we use loadAndPaint in testSystemPages to test about:home... so of course the home pager won't be hidden!

I don't really see the value in an automated test to make sure about:home loads, since it's highly unlike a patch would ever land with that problem, since it's so noticable.

Test passed locally, giving that a try:
https://tbpl.mozilla.org/?tree=Try&rev=ddde61cf46f5
Like I mentioned in my last comment, I don't think we should bother checking about:home in testSystemPages. In fact, using loadAndPaint to check this Java-based page doesn't even make sense.
Attachment #806045 - Flags: review?(gbrown)
Comment on attachment 806045 [details] [diff] [review]
patch to remove about:home from testSystemPages

Review of attachment 806045 [details] [diff] [review]:
-----------------------------------------------------------------

I think we can do without an explicit about:home check, but let's check with Paul too, since he wrote the original test.
Attachment #806045 - Flags: review?(gbrown)
Attachment #806045 - Flags: review+
Attachment #806045 - Flags: feedback?(paul.feher)
Attachment #806045 - Flags: feedback?(paul.feher) → feedback+
https://hg.mozilla.org/mozilla-central/rev/b8e91961a401
https://hg.mozilla.org/mozilla-central/rev/b3cef9e12692
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 27
You need to log in before you can comment on or make changes to this bug.