Crash in [@ IPCError-browser | ShutDownKill | mozilla::dom::LSObject::DoRequestSynchronously]
Categories
(Core :: Storage: localStorage & sessionStorage, defect, P5)
Tracking
()
People
(Reporter: sg, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Attachments
(2 files)
This bug is for crash report bp-7d765394-10b3-481b-afd7-d5cd50200212. Maybe some race within RequestHelper::StartAndReturnResponse?
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForKeyedEvent
1 ntdll.dll RtlSleepConditionVariableSRW
2 kernel32.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1089
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll mozilla::dom::LSObject::DoRequestSynchronously dom/localstorage/LSObject.cpp:832
8 xul.dll mozilla::dom::LSObject::EnsureDatabase dom/localstorage/LSObject.cpp:884
9 xul.dll mozilla::dom::LSObject::GetItem dom/localstorage/LSObject.cpp:584
Updated•4 years ago
|
Comment 1•4 years ago
|
||
So we are killed while we are waiting for a LSObject::DoRequestSynchronously
which should trigger a PrepareDataStoreOp
on the DOM file thread which in turn should send an event that causes the event loop spinning to terminate by setting mWaiting = false;
.
It seems, that this event does not arrive in reasonable time and that the user kills us rather than waiting 50s. This timeout seems very long to me.
And I do not see any other thread doing any work, in particular no running PrepareDataStoreOp
. So I can only assume, that this operation terminated without sending the return message (or sent it but it gets not interpreted well and thus mWaiting
is never touched)? Might something unforeseen happen, if the PrepareDataStoreOp
terminates and sends its return message before we even reach the event loop spinning ?
Updated•4 years ago
|
Comment 2•4 years ago
|
||
I added a bunch of signatures and glanced over the crashes. None of them has the IPCShutdownState
annotation set which means that the process was stuck even before receiving the IPC shutdown message and could never react to it.
Reporter | ||
Comment 3•4 years ago
|
||
The code in https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#1198 seems to acknowledge that we reach the timeout rather than completing the request. However, it is unclear under which circumstances this is expected.
Furthermore, the comment around https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#30 explains the rationale for choosing 50s, and it says that this should be shorter than the 60s limit of the shutdown kill. Is this still 60s? If yes, then something else seems to be wrong here with the timer. If no, the timeout must probably be adapted in order to prevent this ShutDownKill.
The whole function looks very complex. I think it should be refactored, and (C++) unit tests should be added for the core aspects, validating that the synchronisation and cancellation work as expected.
Comment 4•4 years ago
|
||
(In reply to Simon Giesecke [:sg] [he/him] from comment #3)
The code in https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#1198 seems to acknowledge that we reach the timeout rather than completing the request. However, it is unclear under which circumstances this is expected.
Furthermore, the comment around https://searchfox.org/mozilla-central/rev/a4be2fbe9bd4f405c91cc16e4e3a80400f5a9301/dom/localstorage/LSObject.cpp#30 explains the rationale for choosing 50s, and it says that this should be shorter than the 60s limit of the shutdown kill. Is this still 60s? If yes, then something else seems to be wrong here with the timer. If no, the timeout must probably be adapted in order to prevent this ShutDownKill.
The 60 second time limit is for the main process shutdown hang detector. The crashes here are all in content processes where the timeout is only 5 seconds between when we send the shutdown IPC message and when the content process gets killed. From the looks of it the affected content processes couldn't even respond to the IPC shutdown message because they were stuck in this situation.
Comment 5•4 years ago
|
||
Another datapoint that might be interesting: all those processes were short-lived. Over half of the crashes have an uptime of less than 60s, two-thirds are under three minutes.
Comment 6•4 years ago
|
||
Recap: During shutdown, within content processes, the DOM File Thread
ends, before it can send back the unblocking event to the event loop that blocks the main thread. The timeout used for this event loop is significantly longer (50s) then the IPC shutdown timeout (5s) for content processes, which causes the crash.
It is unclear, what would happen, if we reduced that timeout to, let's say, 4s, as the error handling code in case of timeout has no test coverage. Ând probably the user does not even notice any problem with the current state of things (as we are just closing down the process). So probably there is nothing urgent to do here (but to mark some of that code as "refactoring candidate").
Updated•4 years ago
|
Comment 7•4 years ago
•
|
||
Error handling after the timeout fired is not that important. We added the timer to be able debug (remotely) hard to discover deadlocks. In that case we just print logging information and immediately force crash the whole application. See bug 1535221. That said, ideally this timer should never fire and we will eventually remove it when all possible deadlocks are addressed.
Updated•4 years ago
|
Comment 9•4 years ago
|
||
It appears that this is also happening on Linux. I can (regularly-ish) cause this to happen, usually on Google properties of some sort, with youtube being the largest offender and search a distant second.
The signature is slightly different ([@ IPCError-browser | ShutDownKill | __pthread_cond_wait | mozilla::dom::LSObject::DoRequestSynchronously ]) but I suspect that's windows/linux differences.
I've tried, but so far been unable, to get this to happen under gdb or rr. Not sure if that's coincidence (it does occur only somewhat regularly), or if it's some race/deadlock somewhere that the additional overhead of tracing causes to not happen.
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Re-reading some current stacktrace, I am not sure anymore about the timout values that apply here. The stack seems to indicate, that we crash inside NtWaitForAlertByThreadId
inside ntdll.dll
without calling any firefox code. The system call originates from this NSProcessNextEvent
and reads like this:
void mozilla::detail::ConditionVariableImpl::wait(MutexImpl& lock) {
SRWLOCK* srwlock = &lock.platformData()->lock;
bool r =
SleepConditionVariableSRW(&platformData()->cv_, srwlock, INFINITE, 0);
MOZ_RELEASE_ASSERT(r);
}
indicating an INFINITE wait according to MSDN. So I would be inclined to think, that the user kills us.
In any case, the comments in RequestHelper::StartAndReturnResponse
indicate, that we are dancing on thin ice here. There has been some improvements/additions in bug 1513859, that might be related?
Comment 11•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] (REO for FF 81) from comment #10)
indicating an INFINITE wait according to MSDN. So I would be inclined to think, that the user kills us.
Yes, this is not an actual crash in the content process, it's the main process taking a minidump of it before killing it. What you're seeing is a snapshot of the state of the content process before it was killed. The content process was stuck for over 20s during shutdown and this is the state it was in at the end of that timeout.
Comment 12•4 years ago
|
||
Comment 13•4 years ago
|
||
Comment 14•4 years ago
|
||
I managed to catch it in the state where it froze the entire browser chrome up. Unsurprisingly, perhaps, the main threads were all zombies by the time I got gdb attached, so I couldn't grab their backtraces. However, I tried to grab as many as possible from its child threads (and their grandchildren) before the whole tree got reaped.
I've attached the gdb debug log and the pstree output so you can map the process ids back to their names.
Let me know if there's additional stuff you want me to try. I think I can provoke at least a crash of the tab in about 10 minutes of messing with it now, though it only freezes the entire browser up maybe one in five times.
Updated•4 years ago
|
Comment 15•4 years ago
|
||
Added a signature.
Updated•3 years ago
|
Comment 16•2 years ago
|
||
Closing because no crashes reported for 12 weeks.
Description
•