Investigate gaia unit test timeout issues

RESOLVED FIXED

Status

RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: kgrandon, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [systemsfe])

Attachments

(1 attachment)

(Reporter)

Description

4 years ago
Filing a new bug to specifically track the gaia unit test timeout issues we've been seeing after trying to land a few patches. We are seeing a dramatically increased time to run tests after any patch lands that touches the statusbar.

Here is an example of where the tests started timing out: https://tbpl.mozilla.org/?tree=B2g-Inbound&rev=14419f5436e8

For more details see bug 1042105 comment 20 and bug 1038988 comment 61.
Things to note about this slowdown:

1.) Adding the "will-change: transform" property to the #statusbar container in the system app is the smallest change found to trigger these timeouts. However, the patch in bug 1042105 also triggers the timeouts. This bug is now blocking significant work for 2.1.

2.) The timeout occurs because switching between unit test files goes from less than 1 second to over 10 seconds in certain cases. Between test files, the test agent app removes the iframe holding the previous unit test file, creates a new iframe for the next test file, appends it to the DOM, and then waits for a postMessage saying the new test file is ready. It is still unclear which of these steps is slow, and getting debugging information from app iframes in TBPL has so far proven very difficult (i.e., no one from Gaia or releng can figure out why we don't see dump() statements from iframes in the TBPL log).

3.) The layer tree changes significantly when the will-change property is added. No one from graphics has yet taken a look to see whether it is what we expect. We have a layer dump from TBPL that :lightsofapollo put on GitHub here: https://github.com/lightsofapollo/bug-1038988

4.) No one has been able to reproduce this issue locally, only in TBPL. It is surmised that the slowness in TBPL has something to do with the fact that TBPL VMs have no GPU. We have also been able to greatly decrease the slowness between tests by running b2g-desktop through a frame buffer with 16-bit color depth rather than the default 32-bit. The PR for this fix is here: https://github.com/mozilla-b2g/gaia/pull/22494/files
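
As a rough illustration of why halving the color depth could matter on a VM with no GPU: a software renderer has to write every byte of every region it repaints, and a 16-bit frame buffer holds half the bytes of a 32-bit one. The sketch below only does that arithmetic. The 320x480 resolution is an assumed example, not the actual TBPL configuration, and memory traffic alone may not account for the full slowdown.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed to hold one full frame at the given color depth."""
    return width * height * bits_per_pixel // 8


# Assumed example resolution; the real test VM setting is not stated in this bug.
WIDTH, HEIGHT = 320, 480

bytes_32 = framebuffer_bytes(WIDTH, HEIGHT, 32)
bytes_16 = framebuffer_bytes(WIDTH, HEIGHT, 16)

print("32-bit frame:", bytes_32, "bytes")
print("16-bit frame:", bytes_16, "bytes")
print("ratio:", bytes_32 / bytes_16)
```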
blocking-b2g: --- → 2.1?
Milan, can you take a look at the layer tree dumps James posted here: https://github.com/lightsofapollo/bug-1038988. Is the layer tree what we expect from such a change? Could anything in there cause slowness on b2g-desktop running without GPU acceleration? This bug is now blocking a bunch of work for 2.1, so the more eyes we have on it the better.
Flags: needinfo?(milan)

Comment 3

4 years ago
So I was just talking to Guillaume about this:

My best guess is that the layer configuration is changing in such a way that we're now drawing a/some layers with a sub-pixel offset (probably due to rounding errors?) and because of software rendering, we're falling back to a really slow path to render that requires lots of sampling, rather than just a straight blit.

I'm dubious that this would be it, but it's what immediately pops into mind. It would explain the magnitudes of difference (blitting vs. sampling).
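
To make the blit-versus-sampling distinction concrete, here is a minimal sketch, assuming a software compositor that copies pixels directly when a layer sits on whole-pixel offsets and falls back to bilinear interpolation otherwise. The function names are made up for illustration and this is not how Gecko's compositor is actually structured; it only shows that the sampled path reads four source pixels and does several multiplies per destination pixel, where the blit path reads one.

```python
import math


def is_pixel_aligned(offset_x, offset_y, eps=1e-6):
    """True when both offsets are (close to) whole pixels."""
    return (abs(offset_x - round(offset_x)) < eps
            and abs(offset_y - round(offset_y)) < eps)


def source_reads_per_pixel(offset_x, offset_y):
    """1 read for a straight copy, 4 reads for bilinear sampling."""
    return 1 if is_pixel_aligned(offset_x, offset_y) else 4


def sample_bilinear(image, x, y):
    """Bilinearly interpolate `image` (a list of rows) at a fractional position.

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    height, width = len(image), len(image[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


tiny = [[0, 10],
        [20, 30]]
print(source_reads_per_pixel(3.0, 7.0))   # aligned layer
print(source_reads_per_pixel(3.5, 7.0))   # layer with a sub-pixel offset
print(sample_bilinear(tiny, 0.5, 0.5))    # blend of all four pixels
```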

If this is the case, perhaps running b2g-desktop on X with the vesa or framebuffer drivers would help reproduce it locally and you could profile and see what's taking all that time. Doing this with Xnest (or whatever the alternative is these days) might make this easier than trying to change the system's X configuration.

I'm uncertain if this would end up being a bug that we want to fix at the platform level, but I guess it could lead to some visual consequence (blurry rendering?) and if it could be fixed trivially, it would be worth doing.

Please note that this is entirely conjecture and someone from the gfx team would be more qualified to comment :)
I already tried on my ubuntu 13.04 laptop with xvfb and xnest, without any success.
Can we try to disable the network activity icon and push the patch to try again? This caused some slowness before.

Comment 6

4 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> I already tried on my ubuntu 13.04 laptop with xvfb and xnest, without any
> success.

Did you confirm that GL was available, but unaccelerated? (I assume this is the setup of our test machines... Can anyone confirm that?)
Rail, can you confirm the GL setup on our TBPL VM's in regards to comment 6?

I checked with :jlund in releng, and he said the failing job was run on a "tst-linux64-spot-1002 and apparently that's a m1.medium". But he was unable to verify the GL setup (glxinfo was unavailable). Hopefully you can shed some light.
Flags: needinfo?(rail)
(In reply to Chris Lord [:cwiiis] from comment #6)
> (In reply to Alexandre LISSY :gerard-majax from comment #4)
> > I already tried on my ubuntu 13.04 laptop with xvfb and xnest, without any
> > success.
> 
> Did you confirm that GL was available, but unaccelerated?
Flags: needinfo?(lissyx+mozillians)
(In reply to Gregor Wagner [:gwagner] from comment #5)
> Can we try to disable the network activity icon and push the patch to try
> again? This caused some slowness before.

Sure:

https://tbpl.mozilla.org/?rev=001b6a5392561f22be38915478e82d1370d95b6a&tree=Gaia-Try
(In reply to Michael Henretty [:mhenretty] from comment #9)
> (In reply to Gregor Wagner [:gwagner] from comment #5)
> > Can we try to disable the network activity icon and push the patch to try
> > again? This caused some slowness before.
> 
> Sure:
> 
> https://tbpl.mozilla.org/?rev=001b6a5392561f22be38915478e82d1370d95b6a&tree=Gaia-Try

Gregor, you were right! This passed. The problem does have to do with the rendering of the network activity icon. That should narrow down the problem. Also, we might be able to disable the network activity with a pref during our unit tests in the meantime.
(In reply to Michael Henretty [:mhenretty] from comment #10)
> (In reply to Michael Henretty [:mhenretty] from comment #9)
> > (In reply to Gregor Wagner [:gwagner] from comment #5)
> > > Can we try to disable the network activity icon and push the patch to try
> > > again? This caused some slowness before.
> > 
> > Sure:
> > 
> > https://tbpl.mozilla.org/?rev=001b6a5392561f22be38915478e82d1370d95b6a&tree=Gaia-Try
> 
> Gregor, you were right! This passed. The problem does have to do with the
> rendering of the network activity icon. That should narrow down the problem.
> Also, we might be able to disable the network activity with a pref during
> our unit tests in the meantime.

It worries me that drawing the network icon can cause such a huge amount of work... Do we throttle its updates? Should we? (if we don't, I think we should)
(In reply to Chris Lord [:cwiiis] from comment #11)
> It worries me that drawing the network icon can cause such a huge amount of
> work... Do we throttle its updates? Should we? (if we don't, I think we
> should)

On second thought, has anyone checked that we're drawing the icon unscaled and aligned to pixels? Come to think of it, it does look kind of blurry on my main device...
Flags: needinfo?(lissyx+mozillians)
Alexandre has confirmed that my thoughts in comment #12 are unfounded - the image isn't scaled when it's drawn and the update is throttled to half a second.

On the other hand, if I enable paint flashing, I see that the entire notification bar is constantly repainted during network activity. We need to figure that out, that's a serious regression.
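
For reference, a half-second throttle of the kind described above can be sketched as follows. This is a generic illustration with made-up names (`Throttle`, `should_run`), not the actual statusbar code. Note that a throttle like this only limits how often the icon updates; it says nothing about how much of the bar gets repainted on each update, which is the concern raised here.

```python
import time


class Throttle:
    """Let an action through at most once per `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable so the demo can fake time
        self.last_run = None

    def should_run(self):
        now = self.clock()
        if self.last_run is None or now - self.last_run >= self.interval:
            self.last_run = now
            return True
        return False


# Demo with a fake clock: events arrive at the listed timestamps (seconds).
fake_now = [0.0]
throttle = Throttle(0.5, clock=lambda: fake_now[0])

results = []
for timestamp in [0.0, 0.1, 0.4, 0.5, 0.9, 1.0]:
    fake_now[0] = timestamp
    results.append(throttle.should_run())

print(results)
```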
(In reply to Michael Henretty [:mhenretty] from comment #7)
> Rail, can you confirm the GL setup on our TBPL VM's in regards to comment 6?
> 
> I checked with :jlund in releng, and he said the failing job was run on a
> "tst-linux64-spot-1002 and apparently that's a m1.medium". But he was unable
> to verify the GL setup (glxinfo was unavailable). Hopefully you can shed
> some light.

The VMs mentioned above run Xvfb and use a patched Mesa GLX. Bug 975034 and bug 818968 have more details about the Mesa-related work.

I hope it helps.
Flags: needinfo?(rail)
(In reply to Chris Lord [:cwiiis] from comment #13)
> Alexandre has confirmed that my thoughts in comment #12 are unfounded - the
> image isn't scaled when it's drawn and the update is throttled to half a
> second.
> 
> On the other hand, if I enable paint flashing, I see that the entire
> notification bar is constantly repainted during network activity. We need to
> figure that out, that's a serious regression.

I see something similar on my Nexus S. Now, this reminds me of how often I have noticed that having some network activity made the device look slow at rendering graphics.
Let's see if making the test agent a fullscreen app is a valid workaround until we figure out the graphics issue:

https://tbpl.mozilla.org/?rev=9dbc62f69b068048c79774b302c9a3334efe0259&tree=Gaia-Try
(In reply to Michael Henretty [:mhenretty] from comment #16)
> Let's see if making the test agent a fullscreen app is a valid workaround
> until we figure out the graphics issue:
> 
> https://tbpl.mozilla.org/?rev=9dbc62f69b068048c79774b302c9a3334efe0259&tree=Gaia-Try

Timed out :(

Updated

4 years ago
Depends on: 1054220
I've filed bug 1054220 to track the issue with the status bar, which, independent of what we do for this, ought to get fixed.
(In reply to Chris Lord [:cwiiis] from comment #18)
> I've filed bug 1054220 to track the issue with the status bar, which,
> independent of what we do for this, ought to get fixed.

Thank you, Chris! We'll use this bug to find a workaround for the Gaia unit test failure.
Flags: needinfo?(milan)
Created attachment 8474189 [details] [review]
[Gaia PR] disable network activity icon in DEBUG mode

This patch, which should be reverted when bug 1054220 gets figured out, disables the network activity icon in DEBUG mode. I'll run gaia-try with this patch combined with bug 1042105 and bug 1038988 to see if this successfully works around the issue.
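
The shape of this workaround is an early return guarded by a debug flag. The sketch below is a hypothetical model of that idea, not the actual Gaia patch (which is JavaScript and is not reproduced here); the `StatusBar` class, the `debug` flag, and the method name are all assumptions made for illustration.

```python
class StatusBar:
    """Hypothetical stand-in for the system app's status bar."""

    def __init__(self, debug):
        self.debug = debug
        self.network_activity_visible = False

    def on_network_activity(self):
        # Workaround: in DEBUG mode, skip showing the network activity icon.
        # Rendering it makes switching between unit test files very slow on
        # the test VMs. Revert once bug 1054220 is figured out.
        if self.debug:
            return
        self.network_activity_visible = True


debug_bar = StatusBar(debug=True)
debug_bar.on_network_activity()

production_bar = StatusBar(debug=False)
production_bar.on_network_activity()

print(debug_bar.network_activity_visible, production_bar.network_activity_visible)
```

Keeping the reason and the bug number in a comment next to the guard makes it easier to find and remove the workaround later.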
(In reply to Michael Henretty [:mhenretty] from comment #22)
> Gaia-Try run for bug 1042105 with this patch applied:
> 
> https://tbpl.mozilla.org/?rev=2f25a0d6a73c0b6a40b222279d5a41d9cb506680&tree=Gaia-Try

Yup, that seemed to fix the slowness when switching test files. Let's move forward with this workaround.
Comment on attachment 8474189 [details] [review]
[Gaia PR] disable network activity icon in DEBUG mode

Kevin, can you take a look?
Attachment #8474189 - Flags: review?(kgrandon)
(Reporter)

Comment 25

4 years ago
Comment on attachment 8474189 [details] [review]
[Gaia PR] disable network activity icon in DEBUG mode

Seems fine to me and I would R+ it, though I guess we should have a systems owner/peer take a look because this is outside the sandboxed feature area that I've been working in.
Attachment #8474189 - Flags: review?(timdream)
Attachment #8474189 - Flags: review?(kgrandon)
Attachment #8474189 - Flags: review?(alive)
Attachment #8474189 - Flags: feedback+
Comment on attachment 8474189 [details] [review]
[Gaia PR] disable network activity icon in DEBUG mode

:/ Better to add a comment in the statusbar code explaining why we return early.
Attachment #8474189 - Flags: review?(alive) → review+
(Reporter)

Updated

4 years ago
Attachment #8474189 - Flags: review?(timdream)
master: https://github.com/mozilla-b2g/gaia/commit/8ff5acfd8912d5556d7202f72535a527e62dadf6
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Already in 2.1
blocking-b2g: 2.1? → ---
No longer depends on: 1054220