Open Bug 1575021 Opened 4 months ago Updated 2 months ago

Sometimes geckodriver kills the browser already 1s after starting it

Categories

(Testing :: geckodriver, defect, P3)

Version 3
x86
Windows 7
defect

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

Details

I have seen it already in the past but never filed a bug for it. See the following log:

https://taskcluster-artifacts.net/Nm24orzAS_e6ahY8r_IcqQ/0/public/logs/live_backing.log

[task 2019-08-19T14:44:19.391Z] 14:44:19     INFO - PID 5036 | 1566225859381	mozrunner::runner	INFO	Running command: "Z:\\task_1566220377\\build\\application\\firefox\\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\\Users\\task_1566220377\\AppData\\Local\\Temp\\rust_mozprofileDCDfsW"
[task 2019-08-19T14:44:19.406Z] 14:44:19     INFO - PID 5036 | 1566225859390	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:44:19.821Z] 14:44:19     INFO - PID 5036 | [3516, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[task 2019-08-19T14:44:20.504Z] 14:44:20     INFO - PID 5036 | 1566225860490	mozrunner::runner	DEBUG	Killing process 5528

Why is Firefox killed after 1s while we expect to wait 60s for its startup? Could the failed Permission check result in an early exit of Firefox? Mark given that you added that code, mind giving us some feedback?

Flags: needinfo?(markh)

I don't recall what code you are referring to - do you have a bug number of revision?

Flags: needinfo?(markh) → needinfo?(hskupin)

It's for nsPermissionManager.cpp, line 2901 which is part of the above comment. It was implemented as part of bug 506446.

If this cannot cause a very early exit of the process, there is really something wrong with geckodriver. Thanks!

Flags: needinfo?(hskupin) → needinfo?(markh)

Sorry, I mis-understood and thought you meant I added code which killed Firefox after some time. The error being referenced is NS_ERROR_FILE_NOT_FOUND trying to import default permissions, and I guess the NS_ENSURE_SUCCESS macro will indeed kill debug versions of Firefox. It's probably not necessary for that failure to be fatal, but it's probably worth trying to understand why the default permissions are missing.

Flags: needinfo?(markh)

Mark, which file is that actually? One from the profile, or from the application directory? If I would know the name I could have a look at the builds.

Flags: needinfo?(markh)

I believe it is whatever the pref permissions.manager.defaultsUrl points at - I'm afraid I can't remember what that was 5 years ago, and even if I could, it could well have changed since then.

Flags: needinfo?(markh)

So far this problem has only been manifested on Windows 7 32bit, and specifically for debug builds.

In regards of the pref I had a look at about:config, and it refers to resource://app/defaults/permissions. Maybe for Firefox this comes from https://searchfox.org/mozilla-central/source/browser/app/permissions.

But having a look at the import again which happens here:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/extensions/permissions/nsPermissionManager.cpp#2840-2865

The file which actually gets imported is hostperm.1 (HOSTPERM_FILE_NAME) from the user's profile. But it looks like this will only succeed for kinda old profiles, which are still prepared with this file. At least for mozprofile generated profiles we use permissions.sqlite.

So maybe this is all just red herring, and there might indeed be a geckodriver issue here, and not waiting longer than 1s before killing Firefox.

OS: Unspecified → Windows 7
Hardware: Unspecified → x86

Indeed when looking at the same log and observing start-ups of Firefox from previous tests the same log entry exists:

[task 2019-08-19T14:27:26.905Z] 14:27:26     INFO - PID 2272 | 1566224846894	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:27:27.693Z] 14:27:27     INFO - PID 2272 | [3820, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[..]
[task 2019-08-19T14:27:29.450Z] 14:27:29     INFO - PID 2272 | 1566224849443	Marionette	TRACE	Received observer notification command-line-startup
[task 2019-08-19T14:27:29.451Z] 14:27:29     INFO - PID 2272 | 1566224849443	Marionette	TRACE	Received observer notification nsPref:changed
[task 
[..]
[task 2019-08-19T14:27:34.043Z] 14:27:34     INFO - PID 2272 | 1566224854028	Marionette	INFO	Listening on port 2828
[task 2019-08-19T14:27:34.043Z] 14:27:34     INFO - PID 2272 | 1566224854029	Marionette	DEBUG	Remote service is active

So this log entry clearly hasn't something to do with that early kill of the application.

The line in mozrunner, which causes this early kill of Firefox is:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/mozbase/rust/mozrunner/src/runner.rs#146

But I fail to parse the marionette.rs code in how we reach that line:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/geckodriver/src/marionette.rs#1182-1225

But maybe this is a problem with calling TcpStream::connect()?

Andreas, any idea?

Flags: needinfo?(ato)

Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.

https://phabricator.services.mozilla.com/D39813

Depends on: 1525126

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

The line in mozrunner, which causes this early kill of Firefox is:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/mozbase/rust/mozrunner/src/runner.rs#146

But I fail to parse the marionette.rs code in how we reach that line:

https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/testing/geckodriver/src/marionette.rs#1182-1225

I think you may be mixing up try_wait() on std::process::Child and FirefoxProcess.

The only time FirefoxProcess::try_wait(time::Duration) is called is from MarionetteHandler::delete_session(),
which is when we hit the loop you linked to in testing/mozbase/rust/mozrunner/src/runner.rs:138.

In MarionetteConnection::connect() we call std::process::Child::try_wait() directly,
which just tells if the process has exited yet.

The return value is a bit convoluted, but the API documentation is pretty good.

But maybe this is a problem with calling TcpStream::connect()?

TcpStream::connect() shouldn’t cause the Process unexpectedly closed with status
as it ignores any connection error until the timeout has elapsed,
and then stringifies the std::io::Error which would have a different string.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.

https://phabricator.services.mozilla.com/D39813

Yes, I was about to say the same thing. I think that is worth waiting for.

Flags: needinfo?(ato)

The fix on bug 1525126 should have fixed that. I will revisit this bug in a couple of weeks, and after the 0.26.0 release went out.

Duplicate of this bug: 1559773
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.