Closed Bug 960072 Opened 10 years ago Closed 10 years ago

Fix gaia-integration tests

Categories: Firefox OS Graveyard :: General, defect
Platform: All / Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: major

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: emorley, Assigned: daleharvey)

References

Details

The gaia-integration tests have just been hidden on all trees, since they do not meet the requirements for being shown in the default view:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

Notably:

https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#7.29_Low_intermittent_failure_rate
 * For stats see: http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&test=gaia-integration&tree=trunk
 * Main bugs are bug 953212, bug 920153, bug 953309.


And:

https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format
 * Typical failures give the following output, which requires opening the log to ascertain the true cause:
   {
   command timed out: 1200 seconds without output, attempting to kill
   }

   {
   make: *** [node_modules/.bin/mozilla-download] Error 34
   Return code: 512
   Tests exited with return code 512: harness failures
   # TBPL FAILURE #
   }

To see the jobs, the &showall=1 param must be used, e.g.:
https://tbpl.mozilla.org/?tree=B2g-Inbound&jobname=gaia-integration&showall=1

We should either fix these tests, or disable them to save resources - a la bug 784681.
Depends on: 953304, 953318, 953579, 956061, 957854
Blocks: 866909
Summary: Fix or disable gaia-integration tests → Fix or disable gaia-integration tests (currently running but hidden)
I should add that these are also currently permared, but had been starred as one of the intermittents, since the failure messages are all pretty similar unless the full log is opened.

https://tbpl.mozilla.org/php/getParsedLog.php?id=33017441&tree=B2g-Inbound

Failure summary:
{
make: *** [node_modules/.bin/mozilla-download] Error 1
Return code: 512
Tests exited with return code 512: harness failures
# TBPL FAILURE #
}

From full log:
{
00:13:19     INFO - Calling ['make', 'test-integration', 'NPM_REGISTRY=http://npm-mirror.pub.build.mozilla.org', 'REPORTER=mocha-tbpl-reporter'] with output_timeout 330
00:13:19     INFO -  npm install --registry http://npm-mirror.pub.build.mozilla.org
00:13:31     INFO -  npm ERR! Error: shasum check failed for /home/cltbld/tmp/npm-2259-YYIhRDkV/1389773611198-0.06632106122560799/tmp.tgz
00:13:31     INFO -  npm ERR! Expected: 4db64844d80b615b888ca129d12f8accd1e27286
00:13:31     INFO -  npm ERR! Actual:   6633a07cf7b1233a40366ffd16c90170190f139a
00:13:31     INFO -  npm ERR!     at /usr/lib/node_modules/npm/node_modules/sha/index.js:38:8
00:13:31     INFO -  npm ERR!     at ReadStream.<anonymous> (/usr/lib/node_modules/npm/node_modules/sha/index.js:85:7)
00:13:31     INFO -  npm ERR!     at ReadStream.EventEmitter.emit (events.js:117:20)
00:13:31     INFO -  npm ERR!     at _stream_readable.js:920:16
00:13:31     INFO -  npm ERR!     at process._tickCallback (node.js:415:13)
00:13:31     INFO -  npm ERR! If you need help, you may report this log at:
00:13:31     INFO -  npm ERR!     <http://github.com/isaacs/npm/issues>
00:13:31     INFO -  npm ERR! or email it to:
00:13:31     INFO -  npm ERR!     <npm-@googlegroups.com>
00:13:31     INFO -  npm ERR! System Linux 3.2.0-23-generic-pae
00:13:31     INFO -  npm ERR! command "/usr/bin/node" "/usr/bin/npm" "install" "--registry" "http://npm-mirror.pub.build.mozilla.org"
00:13:31     INFO -  npm ERR! cwd /builds/slave/test/gaia
00:13:31     INFO -  npm ERR! node -v v0.10.21
00:13:31     INFO -  npm ERR! npm -v 1.3.11
00:13:32     INFO -  (node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
00:13:32     INFO -  Trace
00:13:32     INFO -      at Socket.EventEmitter.addListener (events.js:160:15)
00:13:32     INFO -      at Socket.Readable.on (_stream_readable.js:689:33)
00:13:32     INFO -      at Socket.EventEmitter.once (events.js:179:8)
00:13:32     INFO -      at Request.onResponse (/usr/lib/node_modules/npm/node_modules/request/request.js:625:25)
00:13:32     INFO -      at ClientRequest.g (events.js:175:14)
00:13:32     INFO -      at ClientRequest.EventEmitter.emit (events.js:95:17)
00:13:32     INFO -      at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1688:21)
00:13:32     INFO -      at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:121:23)
00:13:32     INFO -      at Socket.socketOnData [as ondata] (http.js:1583:20)
00:13:32     INFO -      at TCP.onread (net.js:525:27)
...
...
}
Depends on: 956207
Depends on: 959122
Depends on: 920308
Depends on: 960121
Depends on: 960125
Depends on: 960126
Depends on: 960129
Assignee: nobody → gaye
Depends on: 960188
Depends on: 960393
Depends on: 960394
Depends on: 960578
So yesterday and today (after some tooling work/disabling/patching) Gi was actually quite stable. We still don't have gaia try (see bug 960201), but I think the lower intermittent rate warrants discussion of making these visible again.

Thoughts?
Flags: needinfo?(ryanvm)
Flags: needinfo?(jgriffin)
Flags: needinfo?(emorley)
Depends on: 961336
Depends on: 961337
Depends on: 961338
Depends on: 961340
Depends on: 961438
Depends on: 961450
Just looking at today's runs, there are still a lot of failures...looks like we need to do some more work before unhiding these.
Flags: needinfo?(jgriffin)
Hey Jonathan,

Every time I've looked in the past week (since around last Thursday) we've had about 10 greens in a row (save for the hg issues I brought up with you and Aki). What percent green do we need and over how many observations?
Flags: needinfo?(jgriffin)
I guess there are a lot of dependent bugs. I will share these with Gaia folks and also look through them myself.
Depends on: 962435
(In reply to Gareth Aye [:gaye] from comment #4)
> Hey Jonathan,
> 
> Every time I've looked in the past week (since around last Thursday) we've
> had about 10 greens in a row (save for the hg issues I brought up with you
> and Aki). What percent green do we need and over how many observations?

For guidance on failure rate, see:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#7.29_Low_intermittent_failure_rate

The other requirements on that page will also need to be met, per comment 0.
Flags: needinfo?(ryanvm)
Flags: needinfo?(jgriffin)
Flags: needinfo?(emorley)
Just to quantify this a bit, on the last 25 runs on b2g-inbound, there were 9 failures:

1 instance of bug 961438
7 instances of bug 953212
1 instance of bug 920153

I'm going to add some help for bug 920153, but bug 953212 is a bigger problem.  It is actually conflating several different problems...sometimes this error happens during hg clone, sometimes during npm install, and sometimes during test execution.  This absolutely needs to be fixed before we can unhide the tests, because it makes them very difficult to sheriff.

The changes I'll make to hg clone in bug 920153 will help with the instances of 953212 that occur during hg clone.  So, we'll still need two other things:

1 - an update to the mozharness script to implement more specific error reporting when 'npm install' times out (or even better, figure out why it's timing out and fix it)
2 - better handling when the harness itself times out.  Currently there's no timeout handling in the harness itself, so the timeouts get handled by mozprocess.  It would be much better for the harness to be able to monitor test execution and handle test timeouts more intelligently, but at a minimum, we need better reporting...when a test times out, we should output an error which indicates which test timed out, rather than a generic (and thus unsheriffable) string.  Potentially we could do this in the mozharness script by clever log parsing (a rough sketch of that approach follows this comment), but ideally it's something that should be baked into the harness.

I'm not sure if problem #1 is distinct from bug 953309...it's possible they have the same underlying cause.  It may make sense to investigate and solve that problem, and see if #1 goes away, and if not, to add some smarter failure handling to the mozharness script.
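
For illustration, here is a rough sketch of the kind of log classification described above. This is not the actual mozharness error-list API; the patterns and the clearer messages are assumptions drawn from the failures quoted in this bug.

{
# Hypothetical sketch (not the real mozharness script): map raw failure
# strings from a gaia-integration log to more specific, TBPL-starrable
# messages.
import re

FAILURE_PATTERNS = [
    (re.compile(r"command timed out: \d+ seconds without output"),
     "harness hang: no output from the test run (possible 'npm install' or test timeout)"),
    (re.compile(r"npm ERR! Error: shasum check failed"),
     "npm install failed: shasum mismatch while fetching a package"),
    (re.compile(r"npm ERR! git clone .*(error: The requested URL returned error: 403|fatal: HTTP request failed)"),
     "npm install failed: could not clone a github dependency"),
    (re.compile(r"make: \*\*\* \[node_modules/\.bin/mozilla-download\] Error \d+"),
     "mozilla-download failed before the tests started"),
]

def classify(log_lines):
    # Return the first specific message that matches, or None if the failure
    # is unrecognised and the generic harness output should be kept.
    for line in log_lines:
        for pattern, message in FAILURE_PATTERNS:
            if pattern.search(line):
                return message
    return None
}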
Turns out there is internal timeout handling (though not internal hang handling, that being what results in the 330 seconds without output failures). The error message for a timeout is "test description "before each" hook". The error message for failing to find an iframe the test is looking for is "test description "before each" hook". Probably the error message for 1 != 2 is "test description "before each" hook".

I very strongly feel that this test suite isn't even close to acceptable, and it should be shut off everywhere except Cedar while it's turned into something that is close to acceptable.
Depends on: 964089
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> 1 - an update to the mozharness script to implement more specific error
> reporting when 'npm install' times out (or even better, figure why it's
> timing out and fix it)

Would you mind creating a patch for the reporting?

I can look into improving timeout management in marionette-js-runner. Mocha does this, but I know that the framework that hooks up mocha, the marionette client, and gecko doesn't always handle things gracefully.
Flags: needinfo?(jgriffin)
(In reply to Gareth Aye [:gaye] from comment #9)
> (In reply to Jonathan Griffin (:jgriffin) from comment #7)
> > 1 - an update to the mozharness script to implement more specific error
> > reporting when 'npm install' times out (or even better, figure why it's
> > timing out and fix it)
> 
> Would you mind creating a patch for the reporting?
> 
> I can look into improving timeout management in marionette-js-runner. Mocha
> does this, but I know that the framework that hooks up mocha, the marionette
> client, and gecko doesn't always handle things gracefully.

Yes, I'll make a patch to improve the reporting.  I also separately have made a patch to dump npm-debug.log when 'npm install' fails, which will hopefully help us figure out why it fails so often - http://hg.mozilla.org/build/mozharness/rev/35223f92c123.

Just eyeballing the failures, it looks like roughly a third of them occur because 'npm install' causes something to clone a repo directly from github, and github hangs up on us.  See e.g., https://tbpl.mozilla.org/php/getParsedLog.php?id=33629939&tree=Mozilla-Inbound#error0 :

07:05:34     INFO -  npm ERR! git clone https://github.com/dominictarr/crypto-browserify.git Cloning into bare repository '/home/cltbld/.npm/_git-remotes/https-github-com-dominictarr-crypto-browserify-git-a9d1415f'...
07:05:34     INFO -  npm ERR! git clone https://github.com/dominictarr/crypto-browserify.git
07:05:34     INFO -  npm ERR! git clone https://github.com/dominictarr/crypto-browserify.git error: The requested URL returned error: 403 while accessing https://github.com/dominictarr/crypto-browserify.git/info/refs
07:05:34     INFO -  npm ERR! git clone https://github.com/dominictarr/crypto-browserify.git fatal: HTTP request failed

Is there any way we can prevent us from needing to clone github repos?
Flags: needinfo?(jgriffin)
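As a rough illustration of the npm-debug.log dump described above (a sketch only, not the actual mozharness revision linked in the comment; the function name, arguments, and paths are stand-ins):

{
# Hypothetical sketch: after a failed 'npm install', echo npm-debug.log into
# the build log so the underlying npm error shows up without digging through
# the slave. Not the real mozharness step; names and paths are illustrative.
import os
import subprocess

def npm_install(gaia_dir, registry):
    cmd = ["npm", "install", "--registry", registry]
    returncode = subprocess.call(cmd, cwd=gaia_dir)
    if returncode != 0:
        debug_log = os.path.join(gaia_dir, "npm-debug.log")
        if os.path.exists(debug_log):
            with open(debug_log) as f:
                print(f.read())
    return returncode
}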
> Is there any way we can prevent us from needing to clone github repos?

We should make the npm-mirror fetch them or disallow github dependencies in gaia. I can do the former for now I think...
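
A minimal sketch of what the "disallow github dependencies in gaia" option could look like: a lint-style scan over package.json files for git URLs. The layout and policy here are assumptions for illustration, not an existing Gaia tool.

{
# Hypothetical lint check: flag git/github dependency URLs in package.json
# files so they can either be mirrored ahead of time or rejected outright.
import json
import os

def find_git_dependencies(root):
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "node_modules" in dirpath or "package.json" not in filenames:
            continue
        with open(os.path.join(dirpath, "package.json")) as f:
            manifest = json.load(f)
        for section in ("dependencies", "devDependencies"):
            for name, spec in manifest.get(section, {}).items():
                # Catches "git://", "git+https://", and plain github URLs.
                if "github.com" in spec or spec.startswith("git"):
                    offenders.append((dirpath, name, spec))
    return offenders

if __name__ == "__main__":
    for path, name, spec in find_git_dependencies("."):
        print("%s: %s -> %s" % (path, name, spec))
}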
Depends on: 964527
Depends on: 966156
Depends on: 966165
Depends on: 966293
Depends on: 966583
Depends on: 966608
Depends on: 967563
Depends on: 968040
Depends on: 969249
Depends on: 971585
Depends on: 971619
Depends on: 972226
Blocks: 982262
Depends on: 982260
Depends on: 992220
Depends on: 1000123
Depends on: 1004453
gaia-integration tests on {m-c, inbound, fx-team, b2g-inbound, try} are now hidden on linux64 (the linux32 and OSX variants were already hidden), due to bug 1004453.
And we've now landed permaorange on b2g-inbound, and merged it around to every other tree.
Depends on: 1017607
I've filed bug 1017607 for switching these off on trunk for !cedar, given that they are perma-failing, hidden and bug 1004453 isn't fixed.
Summary: Fix or disable gaia-integration tests (currently running but hidden) → Fix gaia-integration tests (currently running but hidden)
Depends on: 1015657
Depends on: 1008375
Depends on: 1007519
No longer depends on: 1007519
Depends on: 1023392
Depends on: 1032258
Depends on: 1032288
See Also: → 1035939
Blocks: 1037001
Depends on: 1035939
Depends on: 1037194
Depends on: 1005707
Depends on: 1010415
Depends on: 1037716
Depends on: 1017002
Depends on: 1037924
Depends on: 1056156
No longer depends on: 1035939
No longer depends on: 960126
No longer depends on: 1056156
No longer depends on: 1037716
No longer depends on: 1037194
No longer depends on: 1010415
No longer depends on: 982260
I've updated the dependencies here to only include infrastructure fixes. The tests now all block a separate bug which we should fix, but for the purposes of un-hiding, we can mass-disable them. (Most of them have already been disabled).
See Also: → 1087038
See Also: → 1083571
Depends on: 1091484
Depends on: 1091615
Depends on: 1091645
Hey Gareth, I am gonna be working on this until it's unhidden, so stealing it for now if that's cool.
Assignee: gaye → dale
Depends on: 1091680
With the 2 blocking tests disabled (they both have patches), https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=a313fc16de92 is looking pretty green right now
James,

Most of the errors currently give error reports in the form of some type of socket error. I can see unhandled errors being passed through the client a bunch, and there are a few ways to clean that up; I will hopefully get it to the point where tests failing like that fail with an obvious "marionette couldn't send commands to b2g" or something.

However, that doesn't actually fix the tests: is b2g crashing when this happens, and is there a way to get some visibility into what is going on with the b2g process when the socket dies?
Flags: needinfo?(jlal)
Depends on: 1092103
Clearing needinfo since we are looking into it @ https://bugzilla.mozilla.org/show_bug.cgi?id=1093799
Flags: needinfo?(jlal)
No longer blocking since test has been disabled
No longer depends on: 1091484
Depends on: 1030045
Depends on: 1094847
Depends on: 1100305
Depends on: 1100307
No longer depends on: 1100305
Jobs unhidden on Treeherder on all repos apart from b2g32 and b2g30, in bug 1037001.
Summary: Fix gaia-integration tests (currently running but hidden) → Fix gaia-integration tests
This was our meta bug to track enabling Gij; the remaining bugs can be tracked separately.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED