Marionette helper waitForElement() sometimes throws exception with no stack

RESOLVED INVALID

Status

RESOLVED INVALID
3 years ago
3 years ago

People

(Reporter: djf, Assigned: djf)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Assignee)

Description

3 years ago
While working on bug 1223070, I found that a couple of tests were failing with the message "Crash detected but error running stackwalk"  This made it hard to figure out what was going on.

I think we're getting some kind of non-standard exception that does not have a stacktrace associated with it. I don't understand why it is failing, but I think this problem can be solved by adding a try/catch in waitForElement and rethrowing an exception when client.findElement fails.
Created attachment 8686369 [details] [review]
[gaia] davidflanagan:bug1224070 > mozilla-b2g:master
(Assignee)

Comment 2

3 years ago
Created attachment 8686371 [details]
log from failing tests without the patch
(Assignee)

Comment 3

3 years ago
Created attachment 8686372 [details]
Improved log for failing test with this patch
(Assignee)

Comment 4

3 years ago
Comment on attachment 8686369 [details] [review]
[gaia] davidflanagan:bug1224070 > mozilla-b2g:master

Gareth,

Would you take a look at this patch? It would have saved me hours of console.log() based debugging in bug 1223070. The other two attachments show the difference it makes for the one particular failing test I was looking at in that bug.
Attachment #8686369 - Flags: review?(gaye)
(Assignee)

Updated

3 years ago
Assignee: nobody → dflanagan
(Assignee)

Comment 5

3 years ago
I should add that there may be some deeper reason for this bug. The failing tests shown in the attachments seem to pass normally. They fail consistently though with Michael's patch from bug 1222215 where he turns off retries. If that is affecting whether the tests pass or fail, then maybe it is also affecting exception hadling somehow.

I also noticed in bug 1223070 that there were four failing bookmark tests. They all seemed to fail on the same bookmark.openAndSave() call. But the two tests that make that call in setup() were failing with stack traces, and the two tests that make that call in a test itself, were failing with that "crash detected" error. So it seems to me that something wonky is going on here. My patch is a workaround, but not a deep solution.
(Assignee)

Comment 6

3 years ago
And it looks like I may have busted the marionette client_test.js test with this patch. I'll wait for your feedback on the patch before I look at that. If it is not worth landing, then I won't bother trying to fix the tests.
I'm convinced we need this for bug 1222215, since we will need to be able to debug any intermittents very quickly and this provides great insight into test failures if I understand correctly.
Blocks: 1222215
So the error running stackwalk bit comes from an issue that we have in marionette-js-runner dealing with gecko crashes that we're working on. I don't entirely understand the workaround here, but I'm happy to get it landed if it's helping you and Michael debug test failures. Did this help you and can you help me understand what difference it made?
(Assignee)

Comment 9

3 years ago
Gareth: take a look at the other two attachments. They're excerpts from logs from two test runs. Without this patch I just get the "Crash detected" message. But with this patch I get a stack trace. I don't think that there is any gecko crash involved.  Yes, this did help.  The downside is that by throwing a new exception to get a stack trace we lose the information that would have been in the original Marionette exception object like the remote stack and the marionette environment information.  If you've got suggestions about a better way to work around this, I'd like to hear them.
Flags: needinfo?(gaye)
(Assignee)

Comment 10

3 years ago
(In reply to David Flanagan [:djf] from comment #6)
> And it looks like I may have busted the marionette client_test.js test with
> this patch. I'll wait for your feedback on the patch before I look at that.
> If it is not worth landing, then I won't bother trying to fix the tests.

Looks like I was wrong. Gij30 had some retries, but they were for know intermittents and they were in tests for the client, not for the helper, so I don't think they were caused by this patch.
(Assignee)

Comment 11

3 years ago
Gareth

Hmm. Maybe a better way to fix this would be in marionette-js-runner/host/session.js in parseCrashInfoWithStackwalkError()

That method overwrites the stack of the error object, which is probably why that information is lost. Would it be okay to modify that file to leave the stack property alone and put the crash stack in .crashStack or .geckoStack or something like that? If this is code that you're actively working on, I'm nervous about messing with it myself.

Alternatively, I suppose I could modify the patch here to look for error.name === 'ProcessCrash' and only throw a new error in that case, otherwise rethrowing whatever other error I get.

It still leaves open the question of why gecko was crashing for those tests when I used bug 122215 to turn off retries. If the remote gecko process crashes, would that manifest as a timeout in client.findElement()?
> That method overwrites the stack of the error object, which is probably why that information is lost. Would it be okay to modify that file to leave the stack property alone and put the crash stack in .crashStack or .geckoStack or something like that? If this is code that you're actively working on, I'm nervous about messing with it myself.

That sounds great for now if you'd like to make that change.

> It still leaves open the question of why gecko was crashing for those tests when I used bug 122215 to turn off retries. If the remote gecko process crashes, would that manifest as a timeout in client.findElement()?

Quite possibly. If gecko crashes then it takes down the marionette server with it and, depending on the state of things, the client might get stuck waiting for a full request response.
Flags: needinfo?(gaye)
(Assignee)

Comment 13

3 years ago
So if I just change parseCrashInfoWithStackwalkError() to use .crashStack instead of .stack, then I get an exception with a stack, but the stack does not include any lines of the test case, and so is useless since it doesn't help pin down what part of the test is running when the crash happens.  Here's the log I get for bookmark_test.js when it crashes:

 TEST-START | apps/homescreen/test/marionette/bookmark_test.js | Homescreen - Bookmark Bookmarking adds bookmark to homescreen
 TEST-UNEXPECTED-FAIL | apps/homescreen/test/marionette/bookmark_test.js | Homescreen - Bookmark Bookmarking adds bookmark to homescreen
 ProcessCrash: Crash detected but error running stackwalk

     at parseCrashInfoWithStackwalkError (/home/worker/git_checkout/node_modules/marionette-js-runner/host/session.js:110:15)
     at /home/worker/git_checkout/node_modules/marionette-js-runner/host/session.js:35:16
     at /home/worker/git_checkout/node_modules/marionette-js-runner/node_modules/promise/lib/core.js:33:15
     at flush (/home/worker/git_checkout/node_modules/marionette-js-runner/node_modules/promise/node_modules/asap/asap.js:27:13)
     at process._tickCallback (node.js:442:13)
 TEST-END | apps/homescreen/test/marionette/bookmark_test.js | Homescreen - Bookmark Bookmarking adds bookmark to homescreen took 23920 ms

So I think I would be better off leaving the session.js code as it is, then waiting for that exception to propagate back to somewhere that we can generate useful stack traces and then rethrowing it, copying the gecko stack into the message, so that we can report both the gecko stack and the JS stack.

I can do this as a special case in client.waitForElement(). I imagine that there is somewhere lower level to do it, but I just don't know where that is, since I don't actually understand how marionette works.
(Assignee)

Comment 14

3 years ago
I'm giving up on this bug. I tried to log more information about what was happening with the crash and exception. First, this just made it not crash anymore.  Then with a variant on the patch, I was seeing it log the NoSuchElement exception after findElement() timed out, and then it was printing the "crash but failed stackwalk" error. So maybe the gecko crash is happening after the timeout. I don't know why it would prevent the NoSuchElement exception from being logged, but it does not appear that I can catch the crash and rethrow it in any useful way.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → INVALID
Comment on attachment 8686369 [details] [review]
[gaia] davidflanagan:bug1224070 > mozilla-b2g:master

Resetting flag for now, but we're going to get this fixed as part of a larger marionette-js-runner overhaul this month with any luck.
Attachment #8686369 - Flags: review?(gaye)
You need to log in before you can comment on or make changes to this bug.