Closed Bug 1575021 Opened 5 years ago Closed 4 years ago

Sometimes geckodriver kills the browser already 1s after starting it

Categories

(Testing :: geckodriver, defect, P1)

Version 3
defect

Tracking

(firefox81 fixed)

RESOLVED FIXED
81 Branch
Tracking Status
firefox81 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file)

I have seen it already in the past but never filed a bug for it. See the following log:

https://taskcluster-artifacts.net/Nm24orzAS_e6ahY8r_IcqQ/0/public/logs/live_backing.log

[task 2019-08-19T14:44:19.391Z] 14:44:19     INFO - PID 5036 | 1566225859381	mozrunner::runner	INFO	Running command: "Z:\\task_1566220377\\build\\application\\firefox\\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\\Users\\task_1566220377\\AppData\\Local\\Temp\\rust_mozprofileDCDfsW"
[task 2019-08-19T14:44:19.406Z] 14:44:19     INFO - PID 5036 | 1566225859390	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:44:19.821Z] 14:44:19     INFO - PID 5036 | [3516, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[task 2019-08-19T14:44:20.504Z] 14:44:20     INFO - PID 5036 | 1566225860490	mozrunner::runner	DEBUG	Killing process 5528

Why is Firefox killed after 1s while we expect to wait 60s for its startup? Could the failed Permission check result in an early exit of Firefox? Mark given that you added that code, mind giving us some feedback?

Flags: needinfo?(markh)

I don't recall what code you are referring to - do you have a bug number of revision?

Flags: needinfo?(markh) → needinfo?(hskupin)

It's for nsPermissionManager.cpp, line 2901 which is part of the above comment. It was implemented as part of bug 506446.

If this cannot cause a very early exit of the process, there is really something wrong with geckodriver. Thanks!

Flags: needinfo?(hskupin) → needinfo?(markh)

Sorry, I mis-understood and thought you meant I added code which killed Firefox after some time. The error being referenced is NS_ERROR_FILE_NOT_FOUND trying to import default permissions, and I guess the NS_ENSURE_SUCCESS macro will indeed kill debug versions of Firefox. It's probably not necessary for that failure to be fatal, but it's probably worth trying to understand why the default permissions are missing.

Flags: needinfo?(markh)

Mark, which file is that actually? One from the profile, or from the application directory? If I would know the name I could have a look at the builds.

Flags: needinfo?(markh)

I believe it is whatever the pref permissions.manager.defaultsUrl points at - I'm afraid I can't remember what that was 5 years ago, and even if I could, it could well have changed since then.

Flags: needinfo?(markh)

So far this problem has only been manifested on Windows 7 32bit, and specifically for debug builds.

In regards of the pref I had a look at about:config, and it refers to resource://app/defaults/permissions. Maybe for Firefox this comes from https://searchfox.org/mozilla-central/source/browser/app/permissions.

But having a look at the import again which happens here:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/extensions/permissions/nsPermissionManager.cpp#2840-2865

The file which actually gets imported is hostperm.1 (HOSTPERM_FILE_NAME) from the user's profile. But it looks like this will only succeed for kinda old profiles, which are still prepared with this file. At least for mozprofile generated profiles we use permissions.sqlite.

So maybe this is all just red herring, and there might indeed be a geckodriver issue here, and not waiting longer than 1s before killing Firefox.

OS: Unspecified → Windows 7
Hardware: Unspecified → x86

Indeed when looking at the same log and observing start-ups of Firefox from previous tests the same log entry exists:

[task 2019-08-19T14:27:26.905Z] 14:27:26     INFO - PID 2272 | 1566224846894	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:27:27.693Z] 14:27:27     INFO - PID 2272 | [3820, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[..]
[task 2019-08-19T14:27:29.450Z] 14:27:29     INFO - PID 2272 | 1566224849443	Marionette	TRACE	Received observer notification command-line-startup
[task 2019-08-19T14:27:29.451Z] 14:27:29     INFO - PID 2272 | 1566224849443	Marionette	TRACE	Received observer notification nsPref:changed
[task 
[..]
[task 2019-08-19T14:27:34.043Z] 14:27:34     INFO - PID 2272 | 1566224854028	Marionette	INFO	Listening on port 2828
[task 2019-08-19T14:27:34.043Z] 14:27:34     INFO - PID 2272 | 1566224854029	Marionette	DEBUG	Remote service is active

So this log entry clearly hasn't something to do with that early kill of the application.

The line in mozrunner, which causes this early kill of Firefox is:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/mozbase/rust/mozrunner/src/runner.rs#146

But I fail to parse the marionette.rs code in how we reach that line:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/geckodriver/src/marionette.rs#1182-1225

But maybe this is a problem with calling TcpStream::connect()?

Andreas, any idea?

Flags: needinfo?(ato)

Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.

https://phabricator.services.mozilla.com/D39813

Depends on: 1525126

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

The line in mozrunner, which causes this early kill of Firefox is:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/mozbase/rust/mozrunner/src/runner.rs#146

But I fail to parse the marionette.rs code in how we reach that line:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/geckodriver/src/marionette.rs#1182-1225

I think you may be mixing up try_wait() on std::process::Child and FirefoxProcess.

The only time FirefoxProcess::try_wait(time::Duration) is called is from MarionetteHandler::delete_session(),
which is when we hit the loop you linked to in testing/mozbase/rust/mozrunner/src/runner.rs:138.

In MarionetteConnection::connect() we call std::process::Child::try_wait() directly,
which just tells if the process has exited yet.

The return value is a bit convoluted, but the API documentation is pretty good.

But maybe this is a problem with calling TcpStream::connect()?

TcpStream::connect() shouldn’t cause the Process unexpectedly closed with status
as it ignores any connection error until the timeout has elapsed,
and then stringifies the std::io::Error which would have a different string.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.

https://phabricator.services.mozilla.com/D39813

Yes, I was about to say the same thing. I think that is worth waiting for.

Flags: needinfo?(ato)

The fix on bug 1525126 should have fixed that. I will revisit this bug in a couple of weeks, and after the 0.26.0 release went out.

Priority: -- → P3
Blocks: 1604695

So this is still not fixed. Here another case from bug 1659967:

[task 2020-08-19T11:17:29.527Z] 11:17:29     INFO - PID 891 | 1597835849515	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "--marionette" "-foreground" "-no-remote" "-profile" "/tmp/rust_mozprofiles0QlUe"
[task 2020-08-19T11:17:29.527Z] 11:17:29     INFO - PID 891 | 1597835849515	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2020-08-19T11:17:29.623Z] 11:17:29     INFO - PID 891 | 1597835849616	mozrunner::runner	DEBUG	Killing process 900

James, do you have an idea why the connection attempt to Marionette is aborted within 1s? See also Andreas' last comment 10 from a year ago. The related code is here:

https://searchfox.org/mozilla-central/source/testing/geckodriver/src/marionette.rs#1307-1355

Flags: needinfo?(james)
OS: Windows 7 → All
Hardware: x86 → All
Blocks: 1659967
Blocks: 1659968

Actually all the test jobs for that platform failed due to application terminated with exit code -11 (see bug 1659965):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=157db696462d8a98905d0f8697088aa97cb6e08f&searchStr=ccov%2Clinux

So we don't seem to report the error correctly when Firefox dies during the connection attempt.

Right, from the code it looks like the most likely problem is a crash during startup that gets reported as a UnknownError i.e. a 500 but where we aren't surfacing the error message. I think we should add logging in geckodriver for that case so it appears in the stderr not just in the protocol response.

Flags: needinfo?(james)

Oh, well we actually override the error by force-killing the browser even when the process doesn't exist:

https://searchfox.org/mozilla-central/rev/6cc48251bb97600fdf11a5b4c5f621bfc8606d55/testing/geckodriver/src/marionette.rs#172

As such the original error message is lost and we always report can't kill an exited process, which totally makes sense.

Will be easy to fix.

Assignee: nobody → hskupin
Blocks: 1649094
Status: NEW → ASSIGNED
Priority: P3 → P1
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f354c92a3716
[mozrunner] Don't kill an already exited process. r=webdriver-reviewers,jgraham
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: