Closed Bug 1614296 Opened 5 years ago Closed 3 years ago

Crash in [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose]

Categories

(Core :: Networking, defect, P2)

Unspecified
Windows
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: gsvelto, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [necko-triaged])

Crash Data

This bug is for crash report bp-eb962072-9915-4d77-897b-7a76a0200209.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ipc::MessageChannel::SynchronouslyClose ipc/glue/MessageChannel.cpp:2694
5 xul.dll mozilla::ipc::MessageChannel::Close ipc/glue/MessageChannel.cpp:2767
6 xul.dll mozilla::net::SocketProcessBridgeChild::Observe netwerk/ipc/SocketProcessBridgeChild.cpp:168
7 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:65
8 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:292
9 xul.dll mozilla::dom::ContentChild::ShutdownInternal dom/ipc/ContentChild.cpp:3059

This is a content process that hung during shutdown, and it seems to be happening almost exclusively on Nightly.

It seems like the content process was stuck here waiting for something to happen before we were forced to kill it because it was taking too long.

I don't know this code well, but that looks like a synchronous IPC message. Those tend to be slow, so we might just be too slow here, but we might also be stuck.
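To make the "slow vs. stuck" distinction concrete, here is a minimal Python sketch of a blocking close handshake. It is not the MessageChannel implementation: `ToyChannel`, `synchronously_close`, `peer_acknowledge_close`, and `close_with_peer` are hypothetical names, and the `timeout` argument only stands in for the parent's shutdown kill timer.

```python
import threading


class ToyChannel:
    """Hypothetical stand-in for an IPC channel; not the real MessageChannel."""

    def __init__(self):
        self._cond = threading.Condition()
        self._closed = False

    def peer_acknowledge_close(self):
        # Called by the "other side" once it has processed the close request.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def synchronously_close(self, timeout):
        # Block the caller until the peer acknowledges or the timeout expires.
        # Returns True if the close completed, False if we gave up waiting.
        with self._cond:
            return self._cond.wait_for(lambda: self._closed, timeout=timeout)


def close_with_peer(peer_delay, timeout):
    """Run one close handshake. The peer answers after peer_delay seconds,
    or never answers when peer_delay is None."""
    channel = ToyChannel()
    if peer_delay is not None:
        timer = threading.Timer(peer_delay, channel.peer_acknowledge_close)
        timer.daemon = True  # do not keep the interpreter alive for a late reply
        timer.start()
    return channel.synchronously_close(timeout)
```

In this toy model a peer that answers after the timeout and a peer that never answers both make `close_with_peer` return `False`, which is why the stack on its own cannot tell "too slow" apart from "stuck".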

I found another signature for this issue.

Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] → [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose]

Found another signature for this.

Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] → [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] [@ IPCError-browser | ShutDownKill | NtSetIoCompletion]

Kershaw, Byron, can you take a look?

Flags: needinfo?(kershaw)
Flags: needinfo?(docfaraday)

Seems like the content process is stuck waiting on the socket process. I see a similar signature that looks related to RDD here.

Maybe there's something about IPC channels between content and other types of child process that simply isn't plumbed correctly right now? It would be really nice to know what is going on in the other process (e.g. socket, RDD, what-have-you).

Flags: needinfo?(docfaraday) → needinfo?(nfroyd)

(In reply to Byron Campen [:bwc] from comment #4)

Maybe there's something about IPC channels between content and other types of child process that simply isn't plumbed correctly right now? It would be really nice to know what is going on in the other process (e.g. socket, RDD, what-have-you).

It's technically possible to do that but it needs specific plumbing within the ContentParent class. ATM we grab a minidump for the affected content process and the main process. It should be possible to also grab minidumps for the socket and RDD and associate them with the crash report. The modifications would be non-trivial though.

I've spent some time on this, but I still can't figure out the root cause.
I think this needs a fresh pair of eyes.

:jld, do you perhaps have an idea about this?
Thanks.

Flags: needinfo?(kershaw) → needinfo?(jld)

I've had another pass at the crashes and I'm now convinced that this is just content processes being slow during shutdown. Here's why: many crashes have the IPCShutdownState annotation set to RecvShutdown, which is consistent with the stack we see here (shutdown has begun but not finished yet). However, the majority of the crashes have that annotation set to SendFinishShutdown (sent), which happens past the point where this stack trace originates.

Since the minidump and the annotations are not perfectly in sync, it's possible that in most cases we grabbed a minidump and, by the time we grabbed the annotations, the content process had already made forward progress. Jed, if you agree with this analysis, feel free to close this as invalid and move the signatures to bug 1279293, since this is just generic slowness and not something we can act upon directly.
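The timing argument can be illustrated with a small Python model. Everything in it is hypothetical: the `TIMELINE` times are made up (only their ordering matters), the second frame label is a placeholder, and `state_at` / `capture_report` are illustrative helpers, not crash reporter APIs.

```python
from bisect import bisect_right

# Hypothetical timeline of a slow but progressing content-process shutdown:
# (time in ms at which the step begins, frame seen in a minidump,
#  IPCShutdownState value). The times are invented for illustration.
TIMELINE = [
    (0, "MessageChannel::SynchronouslyClose", "RecvShutdown"),
    (900, "<code after SynchronouslyClose returned>", "SendFinishShutdown (sent)"),
]


def state_at(t_ms):
    """Return the (frame, annotation) pair in effect at time t_ms."""
    starts = [start for start, _, _ in TIMELINE]
    index = bisect_right(starts, t_ms) - 1
    _, frame, annotation = TIMELINE[max(index, 0)]
    return frame, annotation


def capture_report(minidump_at_ms, annotations_at_ms):
    """Model a crash report whose minidump and annotations are read
    at two different moments."""
    frame, _ = state_at(minidump_at_ms)
    _, annotation = state_at(annotations_at_ms)
    return {"top_frame": frame, "IPCShutdownState": annotation}
```

With these assumptions, `capture_report(850, 950)` pairs the `SynchronouslyClose` frame with the `SendFinishShutdown (sent)` annotation, the same mismatch seen in the majority of reports, even though the modelled process was never stuck.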

Naturally, if we could speed up this step, it should bring the overall volume down.

cc dthayer, since speeding up shutdown has other nice knock-on effects.

(In reply to Gabriele Svelto [:gsvelto] from comment #5)

(In reply to Byron Campen [:bwc] from comment #4)

Maybe there's something about IPC channels between content and other types of child process that simply isn't plumbed correctly right now? It would be really nice to know what is going on in the other process (e.g. socket, RDD, what-have-you).

It's technically possible to do that but it needs specific plumbing within the ContentParent class. ATM we grab a minidump for the affected content process and the main process. It should be possible to also grab minidumps for the socket and RDD and associate them with the crash report. The modifications would be non-trivial though.

We should consider doing this, I think; when the RDD (or socket?) processes were being brought up, I remember fielding questions from people who were puzzled about why we didn't get crash dumps for them...which makes debugging on try annoying.

Flags: needinfo?(nfroyd)

The explanation in comment #7 sounds plausible… but the IPCShutdownState annotation is specific to ContentParent and I don't know it very well; someone from the DOM: Content Processes component might be more helpful.

Flags: needinfo?(jld)

The priority flag is not set for this bug.
:mayhemer, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(honzab.moz)
Flags: needinfo?(honzab.moz)
Priority: -- → P2
Whiteboard: [necko-triaged]
Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] [@ IPCError-browser | ShutDownKill | NtSetIoCompletion] → [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] [@ IPCError-browser | ShutDownKill | NtSetIoCompletion] [@ IPCError-browser | ShutDownKill…
Crash Signature: ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::ipc::MessageChannel::SynchronouslyClose] → ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ mozilla::ipc::MessageChannel::SynchronouslyClose]
Crash Signature: ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ mozilla::ipc::MessageChannel::SynchronouslyClose] → ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ mozilla::ipc::MessageChannel::SynchronouslyClose]

I wasn't shutting down Firefox when I hit this - d16ae064-c3c2-46b4-9aeb-82d9e0201011

The entire window became momentarily blurry, as if the window had been duplicated and offset a few pixels (like bad font smoothing), then Firefox crashed without me doing anything.

Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] [@ IPCError-browser | ShutDownKill | NtSetIoCompletion] [@ IPCError-browser | ShutDownKill… → [@ IPCError-browser | ShutDownKill | mozilla::ipc::MessageChannel::SynchronouslyClose] [@ IPCError-browser | ShutDownKill | mozilla::ipc::ProcessLink::SendClose] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::ipc::Messa…

Recent builds don't seem to be affected.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME