Closed Bug 1297826 Opened 3 years ago Closed 3 years ago

Fix IPDL error handling in the GPU process

Categories

(Core :: Graphics, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: dvander, Assigned: mattwoodrow)

References

Details

(Whiteboard: [gfx-noted])

Attachments

(2 files, 1 obsolete file)

IPDL expects that "Child" actors live in untrusted processes, and "Parent" actors live in trusted processes. When message decoding goes awry, the Child actor is nuked.

In the GPU process model, the child is the UI process, so we don't want it getting killed when something goes wrong.

Example: http://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PGPUChild.cpp#716

Making FatalError virtual is probably the way to go.
Whiteboard: [gfx-noted]
Attached patch virtual-fatalerror (obsolete) — Splinter Review
This fixes the crash that I've encountered, but I assume we probably want the same sort of thing on all protocols that go across to the GPU process.

I wonder if we could instead change mozilla::ipc::FatalError take extra parameters (like the process IDs?) and have it determine the appropriate action for the given processes.

That way LayerTransactionChilds FatalError could be actually fatal when communicating with the UI process, but just a warning when communicating with the GPU process.
Attachment #8795573 - Flags: feedback?(dvander)
Comment on attachment 8795573 [details] [diff] [review]
virtual-fatalerror

Review of attachment 8795573 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, I think that is a much better idea. Otherwise we have to remember to override every actor which is a pain and error-prone.

Is there a reason not to just check XRE_IsParentProcess here? http://searchfox.org/mozilla-central/rev/767e1e9b118269f957ca1817138472146278a29e/ipc/glue/ProtocolUtils.cpp#314
Attachment #8795573 - Flags: feedback?(dvander)
Can we instead forward FatalError to the top-level protocol? That way we'd need fewer implementations.

Ultimately it sucks sucks that the parent crashes in this situation. We only have that there because it's hard to invoke the crash reporter in a protocol-independent way. We would like for each top-level protocol to be able to override FatalError in the parent process to generate a crash report.
(In reply to Bill McCloskey (:billm) from comment #3)
> Can we instead forward FatalError to the top-level protocol? That way we'd
> need fewer implementations.
> 
> Ultimately it sucks sucks that the parent crashes in this situation. We only
> have that there because it's hard to invoke the crash reporter in a
> protocol-independent way. We would like for each top-level protocol to be
> able to override FatalError in the parent process to generate a crash report.

With the GPU process we have multiple cases that currently crash that we need to not-crash.

We can now have UI process (child) -> GPU process (parent), that would normally be UI process -> UI process. We can also have Content process (child) -> GPU process (parent) where it would normally be Content process -> UI process.

We need to support both of these configurations (since the GPU process can be disabled mid-run if it crashes), so having to overload a virtual FatalError on each protocol that might access the GPU process seems like a lot of duplication.

I think modifying mozilla::ipc::FatalError to handle the various cases at runtime is simpler and puts the complexity in a single location.

I think the various cases we need to handle are:

FatalError with aParent=true:
  - This should be unchanged, these are super rare and only for extreme circumstances.
FatalError with aParent=false and CurrentPid == OtherPid:
  - In-process communication, I think we always want to crash for this.
FatalError with aParent=false, and OtherPid == UIProcessPid:
  - IPC with the UI process (always content -> UI afaik), should still crash.
FatalError with aParent=false and OtherPid != UIProcessPid:
  - IPC with a non-UI process (both of the cases detailed above). We should *not* crash for this case.

I think if we modify the signature of FatalError to take the PIDs (and change lower.py to generate code that sends them), then we can implement this logic.

The last thing we need is to know the PID of the UI process globally, I assume we can set this somewhere during XRE init for the process and then access it.

How does that sound Bill?
Flags: needinfo?(wmccloskey)
> How does that sound Bill?

It seems pretty complicated, and I'm not sure it's the right thing in all cases. For example, we have protocols that communicate between two different non-UI processes (plugins, for example), and this would affect those protocols as well.

I really feel like it's better to have the top-level protocols override FatalError however they want. There are only three top-level protocols that I know of that talk to the GPU process: CompositorBridge, ImageBridge, and VRManager. It doesn't seem like that much of a hassle to have three overrides, especially when they're one line.
Flags: needinfo?(wmccloskey)
This adds a second version of FatalError which takes both the protocol name as well as the error message.

This version is virtual (in IProtocolManager) and lower.py generates code to propagate these calls up to the top-level protocol which then calls mozilla::ipc::FatalError.

The old version of FatalError (single arg) just calls the two arg version using the current protocol name.

It would be to only have the two arg version and have the callers pass the current protocol name, but lower.py makes this very difficult.
Assignee: nobody → matt.woodrow
Attachment #8795936 - Flags: review?(wmccloskey)
Attachment #8795573 - Attachment is obsolete: true
Attachment #8795937 - Flags: review?(wmccloskey)
Comment on attachment 8795936 [details] [diff] [review]
Make FatalError virtual and propagate it to the top level protocols

Review of attachment 8795936 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: ipc/glue/ProtocolUtils.h
@@ +184,5 @@
>      // XXX odd ducks, acknowledged
>      virtual ProcessId OtherPid() const = 0;
>      virtual MessageChannel* GetIPCChannel() = 0;
>  
> +    virtual void FatalError(const char* const, const char* const) const = 0;

Please give these names: aProtocolName and aErrorMsg.

::: ipc/ipdl/ipdl/lower.py
@@ +3443,4 @@
>          ])
>          self.cls.addstmts([ fatalerror, Whitespace.NL ])
>  
> +        msgparamname = ExprVar('aName')

I think protocolname would be a better name for the variable, and aProtocolName for the param.
Attachment #8795936 - Flags: review?(wmccloskey) → review+
Comment on attachment 8795937 [details] [diff] [review]
Override FatalError for GPU process protocols

Review of attachment 8795937 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/VsyncBridgeChild.cpp
@@ +144,5 @@
> +{
> +  // If we're communicating with the same process or the UI process then we
> +  // want to crash normally. Otherwise we want to just warn as the other end
> +  // must be the GPU process and it crashing shouldn't be fatal for us.
> +  if (OtherPid() == base::GetCurrentProcId() ||

Could we move this code to a static method of GPUProcessManager or something so that it can be shared across these protocols? It seems like OtherPid() is the only extra param we'd have to pass in.
Attachment #8795937 - Flags: review?(wmccloskey) → review+
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b3544baf625e
Make FatalError virtual on IPDL generated classes, and propagate calls up to top-level protocols. r=billm
https://hg.mozilla.org/integration/mozilla-inbound/rev/e465fc24bd21
Override FatalError for IPDL protocols that access the GPU process and make it non-fatal. r=billm
https://hg.mozilla.org/mozilla-central/rev/b3544baf625e
https://hg.mozilla.org/mozilla-central/rev/e465fc24bd21
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in before you can comment on or make changes to this bug.