Closed Bug 1402586 Opened 2 years ago Closed 2 years ago

Crash in nsTArray_base<T>::SwapArrayElements<T> | nsTArray_Impl<T>::SwapElements<T> | mozilla::ipc::DeserializeIPCStream


(Core :: DOM: Core & HTML, defect, critical)

57 Branch
Not set



Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 --- fixed
firefox58 --- fixed


(Reporter: philipp, Assigned: bkelly)


(Blocks 1 open bug)


(Keywords: crash, regression)

Crash Data


(1 file)

This bug was filed from the Socorro interface and is report bp-b5433694-a1ff-4539-9f77-0444a0170922.
Crashing Thread (33), Name: DOM Worker
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::SwapArrayElements<nsTArrayInfallibleAllocator, nsTArrayInfallibleAllocator>(nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>&, unsigned __int64, unsigned __int64) 	xpcom/ds/nsTArray-inl.h:349
1 	xul.dll 	nsTArray_Impl<RefPtr<mozilla::dom::quota::DirectoryLockImpl>, nsTArrayInfallibleAllocator>::SwapElements<nsTArrayInfallibleAllocator>(nsTArray_Impl<RefPtr<mozilla::dom::quota::DirectoryLockImpl>, nsTArrayInfallibleAllocator>&) 	xpcom/ds/nsTArray.h:1802
2 	xul.dll 	mozilla::ipc::DeserializeIPCStream(mozilla::ipc::IPCStream const&) 	ipc/glue/IPCStreamUtils.cpp:414
3 	xul.dll 	mozilla::ipc::DeserializeIPCStream(mozilla::ipc::OptionalIPCStream const&) 	ipc/glue/IPCStreamUtils.cpp:430
4 	xul.dll 	<lambda_d131e09cdf1f65f318b74bf31670ad4c>::operator() 	dom/cache/CacheStreamControlChild.cpp:120
5 	xul.dll 	mozilla::MozPromise<mozilla::ipc::OptionalIPCStream, mozilla::ipc::PromiseRejectReason, 0>::ThenValue<<lambda_d131e09cdf1f65f318b74bf31670ad4c>, <lambda_456c3dd1edb9b73c99703b668fed0127> >::DoResolveOrRejectInternal(mozilla::MozPromise<mozilla::ipc::OptionalIPCStream, mozilla::ipc::PromiseRejectReason, 0>::ResolveOrRejectValue&) 	xpcom/threads/MozPromise.h:765
6 	xul.dll 	mozilla::MozPromise<mozilla::ipc::FileDescriptor, mozilla::ipc::PromiseRejectReason, 0>::ThenValueBase::DoResolveOrReject(mozilla::MozPromise<mozilla::ipc::FileDescriptor, mozilla::ipc::PromiseRejectReason, 0>::ResolveOrRejectValue&) 	xpcom/threads/MozPromise.h:497
7 	xul.dll 	mozilla::MozPromise<mozilla::ipc::OptionalIPCStream, mozilla::ipc::PromiseRejectReason, 0>::ThenValueBase::ResolveOrRejectRunnable::Run() 	xpcom/threads/MozPromise.h:402
8 	xul.dll 	`anonymous namespace'::ExternalRunnableWrapper::Cancel 	dom/workers/WorkerPrivate.cpp:277
9 	xul.dll 	mozilla::dom::workers::WorkerRunnable::Run() 	dom/workers/WorkerRunnable.cpp:253
10 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1039
11 	xul.dll 	NS_ProcessPendingEvents(nsIThread*, unsigned int) 	xpcom/threads/nsThreadUtils.cpp:463
12 	xul.dll 	mozilla::dom::workers::WorkerPrivate::ClearMainEventQueue(mozilla::dom::workers::WorkerPrivate::WorkerRanOrNot) 	dom/workers/WorkerPrivate.cpp:5603
13 	xul.dll 	`anonymous namespace'::WorkerThreadPrimaryRunnable::Run 	dom/workers/RuntimeService.cpp:2863
14 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1039
15 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:521
16 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:338
17 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/
18 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/
19 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:427
20 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
21 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
22 	ucrtbase.dll 	o__realloc_base 	
23 	kernel32.dll 	BaseThreadInitThunk 	
24 	ntdll.dll 	LdrpAllocateTls 	
25 	kernelbase.dll 	kernelbase.dll@0x6d4ef

This signature for content process crashes first showed up in 57.0a1 build 20170916220246, which looks related to bug 1397128 timing-wise.
Flags: needinfo?(bkelly)
Assignee: nobody → bkelly
Blocks: 1397128
Flags: needinfo?(bkelly)
I think this is what is happening:

1. A worker thread starts reading the Cache API stream which triggers a lazy open.
2. While the lazy open is in progress, the worker thread starts shutting down.
3. The worker shutdown closes the stream.
4. The lazy open resolves around this time, placing a runnable in the worker event loop.
5. The CacheStreamControlChild sees the close and allows itself to be destroyed.
6. The WorkerHolder for the CacheStreamControlChild is dropped allowing the worker to shutdown.
7. The worker shuts down the PBackground actor, killing the PFileDescriptorChild in the serialized stream.
8. The lazy open resolve runnable finally fires when the worker empties its event loop at the end.
9. The lazy open resolve runnable tries to deserialize the PFileDescriptorChild and crashes.
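The shutdown race in the steps above can be sketched with an ordinary task queue. All names here (`StreamActor`, `resolveWithoutKeepAlive`) are illustrative stand-ins, not the real Gecko classes; a `weak_ptr` models the resolve runnable referencing an actor it does not own:

```cpp
#include <functional>
#include <memory>
#include <queue>

// Stand-in for the IPC actor tree the worker owns (hypothetical name).
struct StreamActor {
    bool deserialize() { return true; }  // stands in for DeserializeIPCStream()
};

// Run every queued task, as the worker does when draining its event loop.
using TaskQueue = std::queue<std::function<bool()>>;

bool drain(TaskQueue& q) {
    bool lastResult = true;
    while (!q.empty()) {
        lastResult = q.front()();
        q.pop();
    }
    return lastResult;
}

// Without a keep-alive: the resolve task only holds a weak reference, so
// once worker shutdown drops the actor (step 7), the task finds it gone
// when it finally fires (step 9).
bool resolveWithoutKeepAlive() {
    auto actor = std::make_shared<StreamActor>();
    TaskQueue workerQueue;
    std::weak_ptr<StreamActor> weak = actor;  // the lazy-open resolve runnable
    workerQueue.push([weak] {
        auto a = weak.lock();
        return a ? a->deserialize() : false;  // false == would have crashed
    });
    actor.reset();  // worker shutdown kills the actor tree first
    return drain(workerQueue);
}
```

In this toy version the stale task can at least detect the dead actor; in the real code there is no such check, which is why it manifests as the `SwapArrayElements` crash.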
Andrew, do you mind reviewing this?  Tom reviewed the original patches for this feature, but he is out on PTO at the moment.

This patch just ensures that we hold the worker alive while the lazy open operation is in progress.  This will prevent the worker thread from killing the PBackground IPC tree out from underneath us.

Note, it's OK if the CacheStreamControlChild actor itself goes away, but we must ensure the PBackground actor and its child PFileDescriptorSetChild actor live until we run DeserializeIPCStream().

I don't have a way to reproduce this at the moment.  It's quite racy, so I'm not planning on writing a test.
Attachment #8911835 - Flags: review?(bugmail)
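The effect of the patch can be mimicked in miniature: the pending resolve task holds a strong reference, playing the role the WorkerHolder plays in keeping the worker (and thus the PBackground tree) alive until the deserialize runs. The names are again hypothetical stand-ins, not the actual Gecko types:

```cpp
#include <functional>
#include <memory>
#include <queue>

// Stand-in for the actor that must outlive the pending deserialize.
struct StreamActor {
    bool deserialize() { return true; }
};

bool resolveWithKeepAlive() {
    auto actor = std::make_shared<StreamActor>();
    std::queue<std::function<bool()>> workerQueue;

    // The resolve runnable captures a strong reference, so shutdown's
    // release of the worker's own reference cannot destroy the actor
    // before the runnable fires.
    workerQueue.push([keepAlive = actor] { return keepAlive->deserialize(); });

    actor.reset();  // worker shutdown drops its reference; actor survives

    bool ok = true;
    while (!workerQueue.empty()) {
        ok = workerQueue.front()();
        workerQueue.pop();
    }
    return ok;
}
```

The design choice mirrors the patch: nothing prevents the owning side from letting go early, but the in-flight operation pins the objects it still needs until it completes.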
Comment on attachment 8911835 [details] [diff] [review]
Hold the worker alive while Cache API completes a lazy body open IPC operation. r=asuth

Review of attachment 8911835 [details] [diff] [review]:

Excellent analysis in comment 1 and in the comment in the patch, thank you!

(In reply to Ben Kelly [:bkelly] from comment #2)
> I don't have a way to reproduce this at the moment.  Its quite racy.  So I'm
> not planning on writing a test.

Yes, I think we'd need test infrastructure changes to support something like this: something that effectively looks like gdb (or rr) playing god with scheduling, driven by a set of symbolic ordering descriptions in the test definition.  It could also be done internally with a swapped-in replacement event dispatch mechanism that performs the symbolic unwinding itself, so it knows when the runnable under test is being dispatched and can stall the queue's execution until whatever the next semantic event is; in this case, thread shutdown.
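A toy version of that swapped-in dispatcher idea might look like the following: the queue holds back a tagged runnable until the test harness signals the semantic event it is waiting on (here, standing in for thread shutdown). All names are hypothetical; this is a sketch of the concept, not proposed test infrastructure:

```cpp
#include <functional>
#include <vector>

// A dispatcher that defers runnables tagged "stall" until the test
// harness announces the semantic event they must wait for.
class StallingDispatcher {
public:
    void dispatch(std::function<void()> task, bool stallUntilEvent = false) {
        if (stallUntilEvent && !eventFired_) {
            stalled_.push_back(std::move(task));  // hold it back for now
        } else {
            task();  // normal runnables execute immediately
        }
    }

    // Called by the test at the chosen point, e.g. simulated thread shutdown.
    void fireSemanticEvent() {
        eventFired_ = true;
        for (auto& t : stalled_) t();  // release the deferred runnables
        stalled_.clear();
    }

private:
    bool eventFired_ = false;
    std::vector<std::function<void()>> stalled_;
};
```

With this, a test could deterministically force the "resolve runnable fires after shutdown" ordering that the crash depends on, instead of hoping the race loses on its own.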
Attachment #8911835 - Flags: review?(bugmail) → review+
Pushed by
Hold the worker alive while Cache API completes a lazy body open IPC operation. r=asuth
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Comment on attachment 8911835 [details] [diff] [review]
Hold the worker alive while Cache API completes a lazy body open IPC operation. r=asuth

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1397128
[User impact if declined]: Intermittent crashes on sites using service workers.
[Is this code covered by automated tests?]: We have extensive tests for this code, but they don't trigger the race condition responsible for the crash.
[Has the fix been verified in Nightly?]: The fix is in Nightly, but we don't have a reliable way to reproduce the crash, so it's difficult to verify.  It's a low-frequency crash.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Minimal risk
[Why is the change risky/not risky?]: The solution to this bug is to keep a worker thread alive slightly longer while some IPC messaging completes.  This is a standard operation we perform in other places.
[String changes made/needed]: None
Attachment #8911835 - Flags: approval-mozilla-beta?
Comment on attachment 8911835 [details] [diff] [review]
Hold the worker alive while Cache API completes a lazy body open IPC operation. r=asuth

Fixes a crash, taking it.
Should be in 57b4 (gtb tomorrow Thursday)
Attachment #8911835 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Component: DOM → DOM: Core & HTML