873725 - Only run Android mochitest-gl on Try, and make it not-by-default

Reporter

Description

11 years ago

Before we had a mochitest-gl, the webgl tests were the reason we couldn't unhide the Android 4.0 mochitest-1, but as far as I know nobody ever even filed a bug about fixing it, much less fixed it.

Now that we have them separated out into their own suite so that we can have an unhidden mochitest-1, we can also stop running that separate permaorange suite on Pandas. Let's do that.

Benoit Jacob [:bjacob] (mostly away)

Comment 1

11 years ago

I didn't know about "mochitest-gl". IIUC that consists solely of the WebGL mochitests; in that case, it should really be named "mochitest-webgl", as just "gl" is not very specific (OpenGL is used for all display on many platforms).

Joel Maher ( :jmaher ) (UTC -8)

Comment 2

11 years ago

gl is the TBPL abbreviation; we don't have room for a 5-letter job name!

Ed Morley [:emorley]

Comment 3

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #1)
> I didn't know about "mochitest-gl". IIUC that consists solely of the WebGL
> mochitests; in that case, it should really be named "mochitest-webgl", as
> just "gl" is not very specific (OpenGL is used for all display on many
> platforms).

We could rename the buildername (mochitest-gl -> mochitest-webgl) but leave the TBPL symbol as "gl", if you like?

Benoit Jacob [:bjacob] (mostly away)

Comment 4

11 years ago

I don't care strongly.

Phil Ringnalda (:philor)

Reporter

Comment 5

11 years ago

Or we could just stop running both runs instead of arguing about the name - I tried disabling subtests to get the failure rate on tegras down to something acceptable, and if anything wound up with more failures.

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

11 years ago

I think removing all instances of WebGL is not a good thing.  With that said, our failure rate is too high on these tests, so I vote for not running these in production until the test authors have made the tests more reliable.

Having the ability to easily run these on try server would be critical in order for the test authors to test changes as they would be in production.

Phil Ringnalda (:philor)

Reporter

Comment 7

11 years ago

Yeah, we could make ourselves feel good by making it only run on try, for both platforms, and making it try_by_default: False, so that it would only run with "-u (whatever, either all or mochitest-gl)[android]", that being the, um, somewhat surprising way that you ask for a not-by-default unittest.

Practically speaking, gfx hackers are not going to remember to do that, since even when it's just a matter of running opt builds in order to get Fennec tests people forget over and over and think that -b d covered their need to test, but at least we'll feel good and not be running hundreds of pointless jobs per day.
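To make the not-by-default rule above concrete, here is a minimal sketch of the selection logic as I read it from this comment. It is a simplified model, not the actual buildbot or try-parser code: the `Suite` class, `parse_unittest_arg`, and `should_run` are hypothetical names, and real try syntax accepts more forms than the single bracketed platform handled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    """Hypothetical model of one unittest suite on one platform."""
    name: str
    platform: str
    try_by_default: bool = True


def parse_unittest_arg(arg):
    """Split 'mochitest-gl[android]' into ('mochitest-gl', 'android').

    Returns (name, None) when there is no bracketed platform.
    """
    if arg.endswith("]") and "[" in arg:
        name, _, rest = arg.partition("[")
        return name, rest[:-1]
    return arg, None


def should_run(suite, arg):
    """Decide whether one '-u' argument selects the given suite.

    Assumption taken from the comment above: a suite with
    try_by_default=False only runs when the platform is named
    explicitly in brackets, even if the suite itself is named.
    """
    name, platform = parse_unittest_arg(arg)
    if name not in ("all", suite.name):
        return False
    if platform is not None:
        return platform == suite.platform
    return suite.try_by_default


gl = Suite("mochitest-gl", "android", try_by_default=False)
print(should_run(gl, "all"))
print(should_run(gl, "mochitest-gl[android]"))
```

Under this model, a plain "-u all" or even "-u mochitest-gl" skips the suite, and only the bracketed forms pick it up, which is the surprising part.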

Summary: Stop running mochitest-gl on Pandas → Only run Android mochitest-gl on Try, and make it not-by-default

Benoit Jacob [:bjacob] (mostly away)

Comment 8

11 years ago

So far I had been only bikeshedding about names... back to the topic of this bug now:

I understand that high intermittent rates are a burden. But not testing, or testing only on Try (which is almost the same as far as catching regressions goes), a feature like WebGL, means that it /will/ regress on Android.

For all I know, the high intermittent rate for WebGL tests on Android boils down to these tests being too resource-intensive for the limited Android testing hardware that we have.

If we can't get better hardware (heck, consumer Android devices with 2 GB of RAM have been around for a while!), then the path of least resistance is to keep disabling selected WebGL tests. That has been our approach so far, but I have had a little bit of trouble getting acceptance for patches disabling selected WebGL tests due to the valid concern that this reduces our test coverage. Now if the alternative, being proposed here, is to stop testing altogether, then we're collectively being self-inconsistent.

I really don't want to be in charge of this (I've not been supposed to be working on WebGL for the past 6 months, but this is hard to get away from) so I'll just needinfo? people.

Flags: needinfo?(vladimir)

Flags: needinfo?(blassey.bugs)

Phil Ringnalda (:philor)

Reporter

Comment 9

11 years ago

The other problem with disabling our way to victory is that it doesn't seem to actually work - not only is bug 863716 not exactly a shining example of success, but I spent most of yesterday doing it for Android 2.2 on Try without success. Every timeout I looked at happened in one of the last two tests to run, so I tried disabling them, and got twice as many failures in an earlier test, including disconnects like we get on WinXP but don't generally get (beyond the normal number in any test) on Android, all in the same earlier test. I disabled that one as well, and got the same higher failure rate spread across two other tests.

Brad Lassey [:blassey] (use needinfo?)

Comment 10

11 years ago

Benoit, what info do you need?

If these tests are too resource intensive to run with 1GB of RAM, it sounds like the tests need some fixing.

Flags: needinfo?(blassey.bugs)

Benoit Jacob [:bjacob] (mostly away)

Comment 11

11 years ago

The question is whether there exists some willpower somewhere at Mozilla to invest in fixing the tests and/or understanding exactly why the slaves are intermittently failing to run them, and whether that could happen fast enough to prevent disabling the test suite wholesale.

Benoit Jacob [:bjacob] (mostly away)

Comment 12

11 years ago

I am trying to make clear here that /nobody/ is currently assigned to fix WebGL tests. I shifted away from WebGL work 6 months ago, and this is an important problem, so /somebody/ should be assigned to it (not me).

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

11 years ago

The WebGL tests fail a lot on the Tegras, 100% of the time on the Pandas, and intermittently on desktop platforms (predominantly Windows).  We need to find an owner for these tests or they will eventually become 100% disabled on more than just the Android platform.  Maybe we can find a community member to look at these?

Benoit Jacob [:bjacob] (mostly away)

Comment 14

11 years ago

We have several full-time engineers and directors whose work rests to a large extent on WebGL (e.g. the people involved in the games effort). So it would be absurd if we couldn't have a full-time engineer (say, one of them) working on this for the time it takes to fix this.

Benoit Jacob [:bjacob] (mostly away)

Comment 15

11 years ago

(This is in reply to the suggestion to hand this off to a "community member" --- community members are awesome but for this kind of high-priority, unexciting work, you want a full-time engineer. Anyway, this conversation doesn't belong here, sorry.)

Ed Morley [:emorley]

Comment 16

11 years ago

(In reply to Benoit Jacob [:bjacob] from comment #8)
> If we can't get better hardware (heck, consumer Android devices with 2 GB of
> RAM have been around for a while!)

Of all the potential avenues we have, obtaining 2GB devices is unlikely to be on the table, given we've only just purchased ~800 Panda boards in the last 6 months (http://pandaboard.org/content/platform).

Benoit Jacob [:bjacob] (mostly away)

Comment 17

11 years ago

(In reply to Ed Morley [:edmorley UTC+1] from comment #16)
> Of all the potential avenues we have, obtaining 2GB devices is unlikely to
> be on the table, given we've only just purchased ~800 Panda boards in the
> last 6 months (http://pandaboard.org/content/platform).

Yeah, we're past that discussion. When I wrote that I also didn't realize that Pandas had 1 GB --- not bad.

Re: community members I want to make sure that I didn't offend anyone. I'm just saying that we shouldn't rely on a miracle happening on a particular problem in a very short timeframe, even though the community has produced a lot of miracles.

Taking a step back:

The real problem here is that disabling WebGL tests on a platform is a very important decision that directly affects the work of several people at Mozilla --- do people here realize that it is 50% of the way toward turning WebGL off by default on that platform? --- so you want to make sure that all the relevant people are part of the conversation. This bug didn't have a lot of the right people in CC. It has more people now. I'm more than happy to hand that off to a combination of Brad and Vlad, whence the needinfo.

Phil Ringnalda (:philor)

Reporter

Comment 18

11 years ago

Taking a further step back:

What can we do to fix this institutional problem we apparently have, where we have deeply concerned stakeholders in a set of tests who have absolutely no idea whether or not they have ever actually run? The thing I actually filed this bug about turning off? NEVER RAN. Every single time since we first turned on tests on Pandas, it crashed and took out the entire mochitest-1 suite. Every time since we created an entire separate suite for what's actually a single mochitest just so we could run the rest of mochitest-1, it has crashed. No part of the webgl tests past that crash has ever run on Fennec on Android 4.0. Where is the outcry about that? Where is anyone touching any related code even asking "hey, did we run this test?" much less doing anything about the fact that we didn't?

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

11 years ago

We focus so much energy on conserving resources, but we have tests that have never passed in 6+ months of running on the pandas, and have been hidden on the tegras as well.  If they are hidden by default, I would like somebody besides the sheriffs to speak up and tell me they are watching those hidden tests and on what branches.

If nobody has watched them, then I don't see why turning these off is a big deal.  We are wasting resources that could be better used to speed up turnaround time, run more reftests, xpcshell tests, robocop tests, or better yet debug builds.

We split webgl tests out of mochitest chunk 1 because of the high failure rate and with the hope that we would give 100% of available resources to just the webgl test suite.  Even without all the other tests loading before it, we still run into problems.  If we need to wait a week or so for person 'X' to finish a few bugs and take a look at it, fine.  If nobody is going to look at this for a few months, then I don't see how it benefits us to run thousands of jobs nobody will look at.

Benoit Jacob [:bjacob] (mostly away)

Comment 20

11 years ago

(In reply to Joel Maher (:jmaher) from comment #19)
> We focus so much energy on conserving resources, but we have tests that have
> never passed in 6+ months of running on the pandas, and have been hidden on
> the tegras as well.

I haven't been working much on WebGL in that timeframe. If you're saying that we already have effectively no or little WebGL regression testing on Android, then well, that's extremely concerning. We must fix that or face disabling WebGL by default on Android (which would IMO be terrible). "Enabled by default but untested" is not a stable place to stay in.

Benoit Jacob [:bjacob] (mostly away)

Comment 21

11 years ago

(In reply to Phil Ringnalda (:philor) from comment #18)
> Taking a further step back:
> 
> What can we do to fix this institutional problem we apparently have, where
> we have deeply concerned stakeholders in a set of tests who have absolutely
> no idea whether or not they have ever actually run? The thing I actually
> filed this bug about turning off? NEVER RAN. Every single time since we
> first turned on tests on Pandas, it crashed and took out the entire
> mochitest-1 suite. Every time since we created an entire separate suite for
> what's actually a single mochitest just so we could run the rest of
> mochitest-1, it has crashed. No part of the webgl tests past that crash has
> ever run on Fennec on Android 4.0. Where is the outcry about that? Where is
> anyone touching any related code even asking "hey, did we run this test?"
> much less doing anything about the fact that we didn't?

You are right, this is the real problem. We shouldn't be enabling WebGL by default on Android without actual test coverage, and we should know and react quickly about such problems. I have no idea how to solve this class of problem in general.

In the present case, basically you've convinced me that at this point it makes sense to disable WebGL tests on Android now BUT we must staff fixing and reenabling this test suite as soon as possible, which means, before this Gecko 24 cycle hits the release channel, or else we'll pretty much have to disable WebGL eventually.

Benoit Jacob [:bjacob] (mostly away)

Updated

11 years ago

Depends on: 874291

Benoit Jacob [:bjacob] (mostly away)

Comment 22

11 years ago

Filed bug 874291 about fixing the WebGL mochitest.

Phil Ringnalda (:philor)

Reporter

Comment 23

11 years ago

Not currently an actionable releng bug (since I'm changing horses and looking to disable my way to victory on Pandas while throwing the Tegras to the wolves instead).

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → INCOMPLETE

Nobody; OK to take it and work on it

Assignee

Updated

11 years ago

Product: mozilla.org → Release Engineering

Kelsey Gilbert [:jgilbert]

Comment 24

10 years ago

Reopening this.
We have Android+DEBUG runs of mochitest-gl running on at least one slave type on Cedar.
Can we get mochitest-gl as opt-in for DEBUG+Android tests on Try?

Status: RESOLVED → REOPENED

Resolution: INCOMPLETE → ---

Kelsey Gilbert [:jgilbert]

Comment 25

10 years ago

I don't think this needinfo is needed anymore.

Flags: needinfo?(vladimir)

Phil Ringnalda (:philor)

Reporter

Comment 26

10 years ago

While I find it uproariously funny that you reopened this bug, which was entirely about the way that the mochitest-gl suite on Android was completely unowned and nobody but me would do a damn thing about it, in order to ask for a new flavor of it to be run, as you can see it's not going to get it run for you. If a new bug about that new suite doesn't already exist, you'll want to open one.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 10 years ago

Resolution: --- → INCOMPLETE

Nobody; OK to take it and work on it

Assignee

Updated

6 years ago

Component: General Automation → General