Closed Bug 1614991 Opened 4 years ago Closed 2 years ago

Crash in [@ IPCError-browser | ShutDownKill | mozilla::dom::LSObject::DoRequestSynchronously]

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P5)

x86_64
Windows 10
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox75 --- affected
firefox81 --- affected

People

(Reporter: sg, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

This bug is for crash report bp-7d765394-10b3-481b-afd7-d5cd50200212. Maybe some race within RequestHelper::StartAndReturnResponse?

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForKeyedEvent 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernel32.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1089
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll mozilla::dom::LSObject::DoRequestSynchronously dom/localstorage/LSObject.cpp:832
8 xul.dll mozilla::dom::LSObject::EnsureDatabase dom/localstorage/LSObject.cpp:884
9 xul.dll mozilla::dom::LSObject::GetItem dom/localstorage/LSObject.cpp:584

So we are killed while we are waiting for a LSObject::DoRequestSynchronously which should trigger a PrepareDataStoreOp on the DOM file thread which in turn should send an event that causes the event loop spinning to terminate by setting mWaiting = false;.

It seems, that this event does not arrive in reasonable time and that the user kills us rather than waiting 50s. This timeout seems very long to me.

And I do not see any other thread doing any work, in particular no running PrepareDataStoreOp. So I can only assume, that this operation terminated without sending the return message (or sent it but it gets not interpreted well and thus mWaitingis never touched)? Might something unforeseen happen, if the PrepareDataStoreOp terminates and sends its return message before we even reach the event loop spinning ?

Priority: -- → P2

I added a bunch of signatures and glanced over the crashes. None of them has the IPCShutdownState annotation set which means that the process was stuck even before receiving the IPC shutdown message and could never react to it.

Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::dom::LSObject::DoRequestSynchronously] → [@ IPCError-browser | ShutDownKill | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | mozglue.dll | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | ntdll.dll | kernelbase.dll | moz…

The code in https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#1198 seems to acknowledge that we reach the timeout rather than completing the request. However, it is unclear under which circumstances this is expected.

Furthermore, the comment around https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#30 explains the rationale for choosing 50s, and it says that this should be shorter than the 60s limit of the shutdown kill. Is this still 60s? If yes, then something else seems to be wrong here with the timer. If no, the timeout must probably be adapted in order to prevent this ShutDownKill.

The whole function looks very complex. I think it should be refactored, and (C++) unit tests should be added for the core aspects, validating that the synchronisation and cancellation work as expected.

(In reply to Simon Giesecke [:sg] [he/him] from comment #3)

The code in https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#1198 seems to acknowledge that we reach the timeout rather than completing the request. However, it is unclear under which circumstances this is expected.

Furthermore, the comment around https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#30 explains the rationale for choosing 50s, and it says that this should be shorter than the 60s limit of the shutdown kill. Is this still 60s? If yes, then something else seems to be wrong here with the timer. If no, the timeout must probably be adapted in order to prevent this ShutDownKill.

The 60 second time limit is for the main process shutdown hang detector. The crashes here are all in content processes where the timeout is only 5 seconds between when we send the shutdown IPC message and when the content process gets killed. From the looks of it the affected content processes couldn't even respond to the IPC shutdown message because they were stuck in this situation.

Another datapoint that might be interesting: all those processes were short-lived. Over half of the crashes have an uptime of less than 60s, two-thirds are under three minutes.

Recap: During shutdown, within content processes, the DOM File Thread ends, before it can send back the unblocking event to the event loop that blocks the main thread. The timeout used for this event loop is significantly longer (50s) then the IPC shutdown timeout (5s) for content processes, which causes the crash.

It is unclear, what would happen, if we reduced that timeout to, let's say, 4s, as the error handling code in case of timeout has no test coverage. Ând probably the user does not even notice any problem with the current state of things (as we are just closing down the process). So probably there is nothing urgent to do here (but to mark some of that code as "refactoring candidate").

Priority: P2 → P5

Error handling after the timeout fired is not that important. We added the timer to be able debug (remotely) hard to discover deadlocks. In that case we just print logging information and immediately force crash the whole application. See bug 1535221. That said, ideally this timer should never fire and we will eventually remove it when all possible deadlocks are addressed.

Crash Signature: omitted> | <name omitted> | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | trunc | mozilla::dom::LSObject::DoRequestSynchronously] → omitted> | <name omitted> | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | trunc | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::dom…

It appears that this is also happening on Linux. I can (regularly-ish) cause this to happen, usually on Google properties of some sort, with youtube being the largest offender and search a distant second.

The signature is slightly different ([@ IPCError-browser | ShutDownKill | __pthread_cond_wait | mozilla::dom::LSObject::DoRequestSynchronously ]) but I suspect that's windows/linux differences.

I've tried, but so far been unable, to get this to happen under gdb or rr. Not sure if that's coincidence (it does occur only somewhat regularly), or if it's some race/deadlock somewhere that the additional overhead of tracing causes to not happen.

Crash Signature: omitted> | <name omitted> | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | trunc | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::dom… → omitted> | <name omitted> | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | trunc | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::do…

Re-reading some current stacktrace, I am not sure anymore about the timout values that apply here. The stack seems to indicate, that we crash inside NtWaitForAlertByThreadId inside ntdll.dll without calling any firefox code. The system call originates from this NSProcessNextEvent and reads like this:

void mozilla::detail::ConditionVariableImpl::wait(MutexImpl& lock) {
  SRWLOCK* srwlock = &lock.platformData()->lock;
  bool r =
      SleepConditionVariableSRW(&platformData()->cv_, srwlock, INFINITE, 0);
  MOZ_RELEASE_ASSERT(r);
}

indicating an INFINITE wait according to MSDN. So I would be inclined to think, that the user kills us.

In any case, the comments in RequestHelper::StartAndReturnResponse indicate, that we are dancing on thin ice here. There has been some improvements/additions in bug 1513859, that might be related?

Flags: needinfo?(jvarga)

(In reply to Jens Stutte [:jstutte] (REO for FF 81) from comment #10)

indicating an INFINITE wait according to MSDN. So I would be inclined to think, that the user kills us.

Yes, this is not an actual crash in the content process, it's the main process taking a minidump of it before killing it. What you're seeing is a snapshot of the state of the content process before it was killed. The content process was stuck for over 20s during shutdown and this is the state it was in at the end of that timeout.

Attached file pstree output
Attached file gdb log

I managed to catch it in the state where it froze the entire browser chrome up. Unsurprisingly, perhaps, the main threads were all zombies by the time I got gdb attached, so I couldn't grab their backtraces. However, I tried to grab as many as possible from its child threads (and their grandchildren) before the whole tree got reaped.

I've attached the gdb debug log and the pstree output so you can map the process ids back to their names.

Let me know if there's additional stuff you want me to try. I think I can provoke at least a crash of the tab in about 10 minutes of messing with it now, though it only freezes the entire browser up maybe one in five times.

Flags: needinfo?(jvarga)

Added a signature.

Crash Signature: mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __pthread_cond_wait | mozilla::dom::LSObject::DoRequestSynchronously] → mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __pthread_cond_wait | mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | mozilla::dom::LSObject::DoRequestSynchrono…
Crash Signature: mozilla::dom::LSObject::DoRequestSynchronously] → mozilla::dom::LSObject::DoRequestSynchronously] [@ IPCError-browser | ShutDownKill | __psynch_cvwait | _pthread_cond_wait | mozilla::dom::LSObject::DoRequestSynchronously ]

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: