Closed Bug 980541 Opened 8 years ago Closed 8 years ago

Marionette-JS does not automatically fail when an oop app process crashes

Categories

(Testing Graveyard :: JSMarionette, defect, P1)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: qdot, Assigned: hub)

References

Details

(Whiteboard: [systemsfe] [c=automation p=3])

Attachments

(5 files)

In researching bug 962988, it was found that marionette simply stalls on the messaging app crashing instead of failing the test. This needs to be fixed in order to reliably run integration tests on OOP b2g-desktop.
Attached file Example test for issue
Here's a sample test for gaia to prove out this issue. This bug doesn't just happen when the app crashes, but even when window.close() is called before a marionetteScriptFinished happens. Usually this would trigger a "unload was called" error when b2g-desktop is run non-OOP. When run OOP, no error fires and the test is stuck. It appears that timeouts no longer fire at this point either.
Assignee: nobody → kyle
Status: NEW → ASSIGNED
Chatted with qDot about this a bit; he thinks we could have a watchdog in marionette-server pinging the active marionette-listener, and if the ping isn't returned, we could send an error back to Marionette.  This would protect against the cases where an OOP app crashes or disappears before the final response gets sent back to marionette-server.

In the particular example he attached, this sendError isn't getting received by the parent:

http://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-listener.js#584
Duplicate of this bug: 980973
This is also important for make perf-test
Blocks: 980993
Wouldn't a timeout on the client side be enough?
Yes, there should be a client-side timeout regardless, to catch other cases where something bad happens on the server side.
ok, filed bug 983461 for the timeout on the client side. I need this for bug 980993
Duplicate of this bug: 983461
Here is the problem:

Using the above test (or with bug 983461), I get node to hang.

1. I can't even debug Node with --debug, which might also explain why no timeout is raised.
2. If I attach gdb to the node process, I get this stack trace:

(gdb) thread apply all bt

Thread 2 (Thread 0x7f54edf62700 (LWP 6964)):
#0  0x0000003c722bc98d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003c722ece24 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x00000039775564fd in v8::internal::SignalSender::Run() () from /lib64/libv8.so.3
#3  0x0000003977556166 in v8::internal::ThreadEntry(void*) () from /lib64/libv8.so.3
#4  0x0000003c72607f33 in start_thread (arg=0x7f54edf62700) at pthread_create.c:309
#5  0x0000003c722f4ded in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f54edf25880 (LWP 6963)):
#0  0x0000003c7260ec0b in __libc_recv (fd=11, buf=0x2d448ff, n=2, flags=-1) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1  0x00007f54edd21d8a in Sockit::Read(v8::Arguments const&) () from /home/hub/source/mozilla/B2G-buri/gaia/node_modules/marionette-client/node_modules/sockit-to-me/build/Release/sockit.node
#2  0x00003e352f3791d9 in ?? ()
#3  0x00007fff2b013200 in ?? ()
#4  0x00007fff2b013208 in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x0000000000000000 in ?? ()


We get stuck on a blocking recv() from sockit-to-me.


Kyle, are you actively working on this? If not may I take it as this is blocking tarako and other tests.
Flags: needinfo?(kyle)
Priority: -- → P1
(In reply to Hubert Figuiere [:hub] from comment #9)
> We get stuck on a blocking recv() from sockit-to-me.

So we're all on the same page, I believe the blocking call is an intentional means of making the API be synchronous.  This has the side-effect that the node.js ChildRunner process is not a traditional node.js async execution environment.

As described in https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/Gaia_integration_tests#Okay.2C_what_does_make_test-integration_actually_do.3F there is a ParentRunner that sits outside the ChildRunner which does operate in an asynchronous fashion and is probably the right place to perform watchdog capabilities.

Having said that, I suspect sockit-to-me could be altered to support some type of fatal timeout mechanism.  Unless the ChildRunner process is keeping the parent appraised of notable test steps as it goes along, that really is the only way to get the call-stack of the JS runner's state.
Ah, yeah, I didn't mean to have this assigned to myself. Feel free to take.
Assignee: kyle → nobody
Flags: needinfo?(kyle)
Taking it then.
QA Contact: hub
Whiteboard: [systemsfe] [p=5] → [systemsfe] [c=automation p=3]
Assignee: nobody → hub
QA Contact: hub
Kyle, do you think this is the right approach? You test case now throw an error instead of hanging.

If that's the way to go, I'll add an API to the module to set the timeout (hardcoded at 60sec) so we can use the same value as marionette's.
Attachment #8391656 - Flags: feedback?(kyle)
Also I need to add tests.
Attachment #8391656 - Flags: feedback?(gaye)
I now have a test and bumped the module version.
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

Actually, Gareth, if you could review.
Attachment #8391656 - Flags: feedback?(gaye) → review?(gaye)
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

While this looks good to me, I'm not super familiar with sockit-to-me, so I'm redirecting to aus, who was the original author.
Attachment #8391656 - Flags: feedback?(kyle) → feedback?(aus)
Blocks: 971771
(In reply to Jonathan Griffin (:jgriffin) from comment #2)
> Chatted with qDot about this a bit; he thinks we could have a watchdog in
> marionette-server pinging the active marionette-listener, and if the ping
> isn't returned, we could send an error back to Marionette.  This would
> protect against the cases where an OOP app crashes or disappears before the
> final response gets sent back to marionette-server.
> 
> In the particular example he attached, this sendError isn't getting received
> by the parent:
> 
> http://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-
> listener.js#584

I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=984508 to get this fixed on the server-side.
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

Yes, you can definitely do it this way. You can also use stock bsd style select and non-blocking recv to timeout but it's nice to use the modern bits from node.
Attachment #8391656 - Flags: feedback?(aus) → feedback+
(In reply to Ghislain Aus Lacroix [:aus] from comment #19)
> Comment on attachment 8391656 [details] [review]
> Link to Github pull-request:
> https://github.com/mozilla-b2g/sockit-to-me/pull/14
> 
> Yes, you can definitely do it this way. You can also use stock bsd style
> select and non-blocking recv to timeout but it's nice to use the modern bits
> from node.

Ugh, nm, you're using poll already. Still feedback+, obviously :P
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

<gaye> I think auswerk should review,[...]
Attachment #8391656 - Flags: review?(gaye) → review?(aus)
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

Hooray! I had been meaning to do something similar to avoid using the default system timeout for socket reads. Thanks for doing this! 

As mentioned on IRC, I would like the pull request to be a single commit, you can do this pretty easily with 'git rebase -i master' and doing a force push to your pull request. When that's done, I'll go ahead and merge the PR.
Attachment #8391656 - Flags: review?(aus) → review+
This is the patch for marionette-client.
Attachment #8393483 - Flags: review?(gaye)
Comment on attachment 8393483 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/marionette-js-client/pull/105

cancelling r? for now.
Attachment #8393483 - Flags: review?(gaye)
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

Need a review for the second patch in that pull request: I needed to set non blocking as in some cases it was still blocking on it.

(I intend to squash it before merge as requested).
Attachment #8391656 - Flags: review?(aus)
Comment on attachment 8393483 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/marionette-js-client/pull/105

Mike,

do you mind reviewing this since Gareth is on PTO. Thanks.

(this patch depends on the other that is for sockit-to-me).
Attachment #8393483 - Flags: review?(mike)
Comment on attachment 8393483 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/marionette-js-client/pull/105

Code looks good and I can confirm the tests against your proposed patch to Sockittome. After some minor changes to the comments, this should be good to merge (so long as nothing crazy happens on that other patch).
Attachment #8393483 - Flags: review?(mike) → review+
Comment on attachment 8391656 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/sockit-to-me/pull/14

r+, with a few nits, see github for details. :)
Attachment #8391656 - Flags: review?(aus) → review+
:aus, do you think we can get a release so I can check the other patch in? Or give me the rights on npm (username hub).

Thanks!
Flags: needinfo?(aus)
Updated to 0.2.0 -- https://www.npmjs.org/package/sockit-to-me
Flags: needinfo?(aus)
merge marionette-js-client

https://github.com/mozilla-b2g/marionette-js-client/commit/9a8aa8913befbc64826524423fb8d243d683ca82



:mhenretty, we need a 1.1.2 release on NPM if possible. Thanks !
Flags: needinfo?(mhenretty)
Published

+ marionette-client@1.1.2
Flags: needinfo?(mhenretty)
The very specific question for feedback, Gareth, is : do I have to rebuild sockit-to-me for Ubuntu and check that in with this pull request?

I want us to use 0.2.0.

Thanks
Attachment #8395047 - Flags: feedback?(gaye)
Comment on attachment 8395047 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia-node-modules/pull/11

Since you answered on IRC, here is the review request. Thanks.

(I haven't figured out how I can "try" this change)
Attachment #8395047 - Flags: feedback?(gaye) → review?(gaye)
Comment on attachment 8395047 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia-node-modules/pull/11

Kevin Grandon told me to merge this since it just pull already published modules.
Attachment #8395047 - Flags: review?(gaye)
Merged

https://github.com/mozilla-b2g/gaia-node-modules/commit/af7758928e2e2024c614672a4cca13483df7f34c
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
And FWIW I forgot to bump the marionette-client requirement for gaia. It was taken care for another bug so we are good.
See https://github.com/mozilla-b2g/gaia/commit/d19a7fb3efa74c479be7f0b24f030e9895eb4b7a
Depends on: 988114
reverted due to general tree bustage.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Here is a try-gaia with the new modules.

https://tbpl.mozilla.org/?tree=Try&rev=0c403419049b

Let's see how it wrecks havoc.
Second try request. The previous had issues with overriding the repo URL.

https://tbpl.mozilla.org/?tree=Try&rev=3c914311e3ae
Status: REOPENED → ASSIGNED
I can reproduce this problem locally on Linux. Trying to isolate the change and fix it ASAP.
Will get Travis and Try server green first.
Attachment #8404380 - Flags: review?(kgrandon)
Strange, that Gi error is the same one we were seeing in bug 974197, which has since been backed out. I'm not sure why we're still seeing it.
Ah right, looks like it was sent to try before that was backed out. I think we should be good to go here. Nice work.
Attachment #8404380 - Flags: review?(kgrandon) → review+
Landed in master: https://github.com/mozilla-b2g/gaia/commit/5e8fe1f267da2a5f7cf88a1a02342ffb1e2b7064
Status: ASSIGNED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Yeah here is the second Try run:  https://tbpl.mozilla.org/?tree=Try&rev=24e7b79d93e3

And only orange is know intermittent in Mochitest.

Thanks !
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.