Closed Bug 1282570 Opened 8 years ago Closed 8 years ago

In case of in-app restarts Marionette has to update mozprocess for the new process id


(Remote Protocol :: Marionette, defect)

Version 3
Not set


(firefox49 fixed, firefox50 fixed)

Tracking Status
firefox49 --- fixed
firefox50 --- fixed


(Reporter: whimboo, Assigned: whimboo)




(1 file)

With bug 1281750 Marionette client knows about the new process id of the application after a restart. Especially when the process forks itself into a new process group. Given that mozprocess loses the connection to the process (bug 1176758) Marionette can inform it about the new process id and process group.
So I was working on this bug the last two days and tried to figure out a good solution. As more as I looked at it and had other conversations more options came up. The following will give an overview:

1. The process id Marionette client gets could be used to update `` in ProcessHandler. It would mean that whatever method we call, there will be a running and no detached (zombie) process, and operations will succeed. But there is also a risk of breaking other code which for now should not happen given that Marionette is the only framework to actually support restarts of Firefox. Beside that there is also the question how to handle the pid and process group id given the setting of `ignore_children`.

2. Another option would be to move the new process from a foreign process group into the one we know since the process has been started. It would mean our call to killpg() would still succeed at the end when we kill left-open processes. With that method any method performed for the process should afaik also succeed.

3. We do not update the pid at all but store it while marking the old pid as detached. That way we can ensure to kill any left-open processes at the end in `kill()`. Side-effect here is that any method of the process instance will work on the detached (zombie) process, and doesn't reflect the reality.

After talking with Andrew we both like option 2) but I cannot get it working because of os.setpgid() raises a process not found error whereby this process id definitely exist. This is when using Marionette directly, but when I wrote a little test script the problem was gone. So maybe it's a setting how we start/control the process via mozrunner/marionette.

Before I dive deeper into those problems I kinda would like to get more feedback in regards which of the above mentioned options would be better suitable. 

Ted, if you can have a look soon that would be fantastic. Reason is that once we have the next migration our update tests will fail for beta and block QA from signing off from releases. Thanks a lot!
Flags: needinfo?(ted)
I think trying to do anything useful with this in mozprocess is probably going to be hard to get right, since mozprocess expects that it controls the process it spawned. Option 2 sounds tricky, I'm not sure you can reliably do that. I think option 3 is your best bet, you'll just have to track when ` != pid_from_marionette` and act accordingly.
Flags: needinfo?(ted)
Depends on: 1176758
Attachment #8766787 - Flags: review?(dburns) → review+
Comment on attachment 8766787 [details]
Bug 1282570 - In case of in-app restarts Marionette has to update mozprocess for the new process id.
The patches on bug 1176758 landed earlier today and all looks fine. I'm going to push this patch to autoland now.
Whiteboard: [needs bug 1176758]
Pushed by
In case of in-app restarts Marionette has to update mozprocess for the new process id. r=automatedtester
Depends on: 1284982
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Oopsie, while it is true that bug 1285030 was merely intermittent on OS X opt builds, on debug builds it is permaorange.

Backed out in
Resolution: FIXED → ---
Target Milestone: mozilla50 → ---
I'm not able to reproduce this test failure on my own machine. But noticed that I do not run a self-made debug build. So I will rebuild and check again.
Ok, so for an in-app restart we will have the following code with my patch:

> self._send_message("quitApplication", {"flags": restart_flags})
> self.delete_session()

Whereby delete_session is defined as:

>    def delete_session(self):
>        self._send_message("deleteSession")
>        self.session_id = None
>        self.session = None
>        self.window = None
>        self.client.close()

Any order of those calls would not work. Calling deleteSession first we won't be able to submit a quitApplication message afterward. The other way around also doesn't work because we might end-up in a non working socket - well most of the time because that is racy. Sending the message `quitApplication` will initiate a shutdown of Gecko. So any following message is not guarantied to be received by Marionette server, and I'm not talking about receiving the response at all.

The hang we see here with process is still running happens because in Marionette client we are still waiting for a response from deleteSession. Given the default socket timeout of 360s (see bug 1284457) this can take a while and we get killed somehow.

I'm not sure why this is happening in debug builds only, because they usually tend to run slower as opt builds, and quitting the application should take longer.
Drop my last comment, at least partially. I just noticed that the quitApplication() method in Marionette server actually calls deleteSession() itself. That's why any call to delete_session() on the client side will fail:

Looks like it's an easy fix here, whereby I'm still not sure why only debug builds are affected.
Part of the problem is that the client needs to know when the session is closed, and it’s not safe to send and wait for new responses.  Because Marionette/WebDriver uses a one-way protocol—i.e. the server cannot let the client know that some event occurred; the client needs to poll—a follow-up command might not succeed because the server has already disappeared.

There are a couple of commands that will either explicitly or implicitly cause the session to end: quitApplication is obvious, but closeWindow might also cause the session to close if the last window was closed.

Currently, there is no way to _tell_ that closeWindow caused the session to end, because the next call to getWindowHandles will wait for the full length of the socket timeout.  If the client isn’t let know _in a uniform way_ that the session disappeared, it will be hard to code a universal solution for this in

The WebDriver working group has talked about sending some special flag on the response (header or a body field) indicating if the session was closed as a result of executing the command.  If we do this, we can stop the client from making subsequent calls to Marionette that might otherwise cause it to wait for the full length of the socket timeout.

Another piece to the story is that I believe shutting down the socket listener in Marionette is racy.  This means that when this.stopSignal_() is called, it’s not guaranteed that the socket is actually closed when the call returns.  We can get around this by setting a lock on the dispatcher that reads/receives messages to stop receiving or discard them.
Comment on attachment 8766787 [details]
Bug 1282570 - In case of in-app restarts Marionette has to update mozprocess for the new process id.

Review request updated; see interdiff:
With my last update I have removed this extra call to delete_session() and it seems to work perfectly for me on my developer machine with a debug build. Also all other Marionette harness tests are passing. I still submitted a try build. David, mind giving your feedback again if that is fine with you?
Flags: needinfo?(dburns)
Comment on attachment 8766787 [details]
Bug 1282570 - In case of in-app restarts Marionette has to update mozprocess for the new process id.

Review request updated; see interdiff:
I talked with Andreas on IRC and with the last commit - re-adding the call to client.close() - is is happy with that patch. Given that try if all green, I'm going to push this patch now.
Flags: needinfo?(dburns)
Pushed by
In case of in-app restarts Marionette has to update mozprocess for the new process id. r=automatedtester
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Blocks: 1276220
This test-only patch we also need on mozilla-aurora to fix test bustage for update tests. When you land please first land the patch on bug 1176758. Thanks.
Whiteboard: [needs bug 1176758] → [checkin-needed-aurora][needs bug 1176758 first]
Whiteboard: [checkin-needed-aurora][needs bug 1176758 first]
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.