Closed Bug 974564 Opened 10 years ago Closed 10 years ago

Gaia tree needs to be reopened

Categories

(Firefox OS Graveyard :: Gaia, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gaye, Assigned: gaye)

References

Details

Attachments

(1 file)

On Tuesday, Feb 18, 2014 I closed the tree because we started seeing (1) b2g die during the marionette-js email integration tests and (2) Travis die (no output for 10 minutes) during |npm install|.

Issue (1) was first observed on this build https://travis-ci.org/mozilla-b2g/gaia/builds/18901708. The build output was

> email next previous
>
> ◦ should not move down when down tapped if greyed out:
>
> No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.

The intermittent error is not specific to this test though. There were others like

> email notifications, foreground
>
> ◦ should have bulk message notification in the different account:
>
> No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.

I tried running pull requests on Travis that reverted each of the following gaia commits:

- 9bb31f0da81f92f0048207ea9cf51628bb1183a5
- 19304fe63b26630f95c6f3a7da49349b433d476e
- b9859b9c2199e1c5b7e4eda220e556a7ff43cd21

However, none of them fixed the issue (the intermittent error still popped up on each pull request). This makes me think the regression may have been introduced by a gecko patch, but further investigation and bisecting are necessary.

Issue (2) popped up yesterday morning while I was trying to debug issue (1). We are still seeing the following with some frequency:

> ...
> npm http 200 https://registry.npmjs.org/q/-/q-0.9.7.tgz
>
> npm http 200 https://registry.npmjs.org/proxyquire/-/proxyquire-0.4.1.tgz
>
> No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.

Both travis and npm have reported minor outages in the past 24 hours on https://twitter.com/traviscistatus and https://twitter.com/npmjs, but they may not be aware of the severity and ubiquity of the issues.

This bug will track the work to resolve issues 1 and 2 and then (with some luck : ) reopen the tree.
Summary: Tree needs to be reopened → Gaia tree needs to be reopened
I have tweeted at both travis and npmjs to ask them for some guidance: https://twitter.com/garethaye/status/436227129089724416
I joined #travis on IRC and started asking some questions. I'm also running a job with npm log level = verbose, so hopefully that might help us narrow down the problem.
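For reference, two ways to get that verbose npm output (a sketch; how the flag is actually wired into the gaia build here is an assumption on my part):

  # verbose output for a single install
  npm install --loglevel verbose

  # or via the environment, so every npm invocation in the build picks it up
  export npm_config_loglevel=verbose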
https://travis-ci.org/mozilla-b2g/gaia/builds/19214866 is a set of 30 builds on b2g aurora. Notably, it doesn't show any errors unrelated to travis/npm. This demonstrates, to a statistically significant degree, that we have a gecko regression. We are looking into bisecting now.
Now I am really confused. This build started going green like crazy: https://travis-ci.org/mozilla-b2g/gaia/builds/19226368

Unless a fix for gecko just landed, something weird is happening.
(In reply to Kevin Grandon :kgrandon from comment #4)
> Now I am really confused. This build started going green like crazy:
> https://travis-ci.org/mozilla-b2g/gaia/builds/19226368
> 
> Unless a fix for gecko just landed, something weird is happening.

Not sure what you mean?  There are 3 failed builds: 1) https://travis-ci.org/mozilla-b2g/gaia/jobs/19226378 2) https://travis-ci.org/mozilla-b2g/gaia/jobs/19226381 3) https://travis-ci.org/mozilla-b2g/gaia/jobs/19226390
Yup, but we don't care about those (I am disabling those tests and opening bugs). The main ones we care about are those nasty grey ones that popped up later. So it seems there is something funky going on with gecko after all =/
See Also: → 952611
I don't see what points to a gecko issue.

All grey builds I've seen just stop after (or during) the npm install process.
=> https://travis-ci.org/mozilla-b2g/gaia/pull_requests

firefox or b2g is not even launched.
(In reply to Julien Wajsberg [:julienw] from comment #7)
> I don't see what points to a gecko issue.
> 
> All grey builds I've seen just stop after (or during) the npm install
> process.
> => https://travis-ci.org/mozilla-b2g/gaia/pull_requests
> 
> firefox or b2g is not even launched.

We've managed to handle the npm problems with this: https://github.com/mozilla-b2g/gaia/pull/15319

(That is currently under review, and we may want to add additional reviewers.) Tonight we have been basing our investigations on that patch.
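For context, the general shape of a retry-with-timeout wrapper for the npm stalls looks like the sketch below (this illustrates the approach only and is not necessarily what pull 15319 does; the script name is hypothetical):

  #!/bin/bash
  # retry_npm_install.sh (hypothetical): retry npm install a few times,
  # killing any attempt that runs longer than 10 minutes.
  for attempt in 1 2 3; do
    if timeout 600 npm install; then
      exit 0
    fi
    echo "npm install attempt $attempt failed or stalled; retrying..."
  done
  exit 1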

Here is an example of the timeout we think is a gecko regression: https://travis-ci.org/mozilla-b2g/gaia/jobs/19230769
failing b2g desktop hg changeset: https://hg.mozilla.org/mozilla-central/rev/33b3248b4aa0
passing b2g desktop hg changeset: https://hg.mozilla.org/mozilla-central/rev/eac89fb04bb9

(again, the 'passing' changeset does not mean it's good)
So I can see more similar failures on TBPL from before this changeset.

Also, Bug 953212 is the bug the automation team opened to track this issue; we can clearly see an increase in this issue since last week or the week before.
It should be mentioned that the intermittent errors are only showing up in email app test cases.
I've pushed this patch to add some debugging info for the test that's failing most frequently: https://github.com/mozilla-b2g/gaia/pull/16486. The builds are showing up here: https://travis-ci.org/mozilla-b2g/gaia/builds/19260139. Job 30050.7 shows the regression, for instance.
Trying to reproduce locally:

  while xvfb-run -a make test-integration TEST_FILES=apps/email/test/marionette/next_previous_test.js ; do sleep .1 ; done

in 4 different local clones.
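A slight variation that keeps the output of the failing run around for inspection (the log file names are just placeholders I picked):

  i=0
  # loop until the integration test fails, logging each run so the
  # failing output survives
  while xvfb-run -a make test-integration \
      TEST_FILES=apps/email/test/marionette/next_previous_test.js \
      > "run-$i.log" 2>&1 ; do
    i=$((i+1)) ; sleep .1
  done
  echo "failure captured in run-$i.log"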
I got one :

  1) email next previous "before each" hook:
     
  ScriptTimeout: (28) timed out
  Remote Stack:
  <none>
      at Error.MarionetteError (/home/julien/travail/git/gaia-clone-1/node_modules/marionette-client/lib/marionette/error.js:67:13)
      at Object.Client._handleCallback (/home/julien/travail/git/gaia-clone-1/node_modules/marionette-client/lib/marionette/client.js:474:19)
      at /home/julien/travail/git/gaia-clone-1/node_modules/marionette-client/lib/marionette/client.js:508:21
      at TcpSync.send (/home/julien/travail/git/gaia-clone-1/node_modules/marionette-client/lib/marionette/drivers/tcp-sync.js:94:10)


which is _not_ what we have in Travis IINW.
I could reproduce the same ScriptTimeout at "before each" several times, and it's always at:

      at Object.Email.launch (/home/julien/travail/git/gaia/apps/email/test/marionette/lib/email.js:358:17)

But I did manage to reproduce the issue we have in Travis once. Now I don't know how to hook into it...
I could see the crash reporter in the list of my processes, so I know that when B2G hangs it's really a crash. But no crash dump yet...
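One way to try catching a dump on a local run (a sketch: MOZ_CRASHREPORTER and MOZ_CRASHREPORTER_NO_REPORT are the standard gecko crash-reporter toggles, but the dump locations below are guesses for a b2g-desktop profile):

  # make the crash reporter write a local dump instead of submitting it
  export MOZ_CRASHREPORTER=1
  export MOZ_CRASHREPORTER_NO_REPORT=1
  xvfb-run -a make test-integration TEST_FILES=apps/email/test/marionette/next_previous_test.js
  # then search the usual spots for any minidump left behind
  find /tmp ~/.mozilla -name '*.dmp' 2>/dev/null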
hint: running "make test-integration DEBUG=0" gives you the marionette-js commands that are executed.

Now, I looped running this on my local X server (as opposed to Xvfb) and... my computer restarted. Yay.
(In reply to Gareth Aye [:gaye] from comment #12)
> It should be mentioned that the intermittent errors are only showing up in
> email app test cases.

I have actually seen this happen for other applications as well (dialer I think). Unfortunately I don't have a travis link handy - but it does happen in other places.
We have narrowed the issue down to a b2g crash that gets triggered by sending emails (with a low rate of reproducibility) in the marionette js test suite. Now that we know what's going on (even though we don't know why or which patch introduced the bug), it's time to reopen the tree.
Attachment #8379242 - Flags: review?(bugmail)
Makes me sad that we are disabling these tests. Let's make sure we prioritize tracking this down asap. Should the productivity team own this work?
Comment on attachment 8379242 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/16494

r=asuth on disabling all e-mail tests for now until we can isolate and get the gecko bug fixed and/or workaround it in the e-mail backend.
Attachment #8379242 - Flags: review?(bugmail) → review+
(In reply to Kevin Grandon :kgrandon from comment #21)
> Makes me sad that we are disabling these tests. Let's make sure we
> prioritize tracking this down asap. Should the productivity team own this
> work?

I don't think it would make sense for any other functional team to own it unless there's a team/person who owns keeping testing in general healthy.  (I think all y'all have just been drafted by your natural awesomeness!)  Intermittent crashes involving e-mail frequently turn out to be GC hazards related to workers, so it's something we do want to track down in general anyways.
Assignee: nobody → gaye
Tree is open again!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Gareth filed bug 975588 to turn the e-mail tests back on.
See Also: → 975588