Closed Bug 1279246 Opened 4 years ago Closed 3 years ago

Hang due to CacheFileInputStream deadlock

Categories

(Core :: Networking: Cache, defect, critical)

All
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: mstange, Assigned: michal)

References

Details

(Keywords: hang, Whiteboard: [necko-active])

Attachments

(1 file, 1 obsolete file)

I just encountered a parent process hang in yesterday's nightly (2016-06-08). The main thread was blocked with this callstack:

> * thread #1: tid = 0xb02a56, 0x00007fff9509ede6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff9509ede6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>   * frame #1: 0x00007fff8c132e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>     frame #2: 0x000000010872f68e libnss3.dylib`PR_Lock + 14
>     frame #3: 0x0000000108dbec62 XUL`mozilla::net::CacheFileInputStream::CloseWithStatus(nsresult) + 34
>     frame #4: 0x0000000108b09d47 XUL`nsInputStreamPump::Cancel(nsresult) + 135
>     frame #5: 0x0000000108e782ce XUL`mozilla::net::nsHttpChannel::Cancel(nsresult) + 398
>     frame #6: 0x0000000108e844bd XUL`non-virtual thunk to mozilla::net::nsHttpChannel::Cancel(nsresult) + 13
>     frame #7: 0x0000000108b0bcb5 XUL`mozilla::net::nsLoadGroup::Cancel(nsresult) + 565
>     frame #8: 0x0000000109a90cb0 XUL`imgRequest::ContinueCancel(nsresult) + 160
>     frame #9: 0x0000000109a90a5f XUL`imgRequest::Cancel(nsresult) + 111
>     frame #10: 0x0000000109a92c2b XUL`imgRequest::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long long, unsigned int) + 1835
>     frame #11: 0x0000000108b416a8 XUL`mozilla::net::nsStreamListenerTee::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long long, unsigned int) + 280
>     frame #12: 0x0000000108d67e7e XUL`mozilla::net::nsHTTPCompressConv::BrotliHandler(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*) + 702
>     frame #13: 0x0000000108dbe9f0 XUL`mozilla::net::CacheFileInputStream::ReadSegments(nsresult (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) + 416
>     frame #14: 0x0000000108d68350 XUL`mozilla::net::nsHTTPCompressConv::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long long, unsigned int) + 560
>     frame #15: 0x0000000108e89056 XUL`mozilla::net::nsHttpChannel::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long long, unsigned int) + 742
>     frame #16: 0x0000000108b0ac17 XUL`nsInputStreamPump::OnStateTransfer() + 599
>     frame #17: 0x0000000108b0a6a8 XUL`nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) + 408
>     frame #18: 0x0000000108a54755 XUL`nsInputStreamReadyEvent::Run() + 37
>     frame #19: 0x0000000108a6ddd8 XUL`nsThread::ProcessNextEvent(bool, bool*) + 632
>     frame #20: 0x0000000108a9b96f XUL`NS_ProcessPendingEvents(nsIThread*, unsigned int) + 95
>     frame #21: 0x000000010ae9d3f1 XUL`nsBaseAppShell::NativeEventCallback() + 113
>     frame #22: 0x000000010aef4858 XUL`nsAppShell::ProcessGeckoEvents(void*) + 296
>     frame #23: 0x00007fff89361881 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
>     frame #24: 0x00007fff89340fbc CoreFoundation`__CFRunLoopDoSources0 + 556
>     frame #25: 0x00007fff893404df CoreFoundation`__CFRunLoopRun + 927
>     frame #26: 0x00007fff8933fed8 CoreFoundation`CFRunLoopRunSpecific + 296
>     frame #27: 0x00007fff8f5c7935 HIToolbox`RunCurrentEventLoopInMode + 235
>     frame #28: 0x00007fff8f5c776f HIToolbox`ReceiveNextEventCommon + 432
>     frame #29: 0x00007fff8f5c75af HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
>     frame #30: 0x00007fff8c609df6 AppKit`_DPSNextEvent + 1067
>     frame #31: 0x00007fff8c609226 AppKit`-[NSApplication _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 454
>     frame #32: 0x000000010aef3ec6 XUL`-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 86
>     frame #33: 0x00007fff8c5fdd80 AppKit`-[NSApplication run] + 682
>     frame #34: 0x000000010aef4e3a XUL`nsAppShell::Run() + 234
>     frame #35: 0x000000010b91d229 XUL`nsAppStartup::Run() + 41
>     frame #36: 0x000000010b975d55 XUL`XREMain::XRE_mainRun() + 4021
>     frame #37: 0x000000010b976232 XUL`XREMain::XRE_main(int, char**, nsXREAppData const*) + 674
>     frame #38: 0x000000010b9765d7 XUL`XRE_main + 247
>     frame #39: 0x0000000107fa4a17 firefox`main + 2007
>     frame #40: 0x0000000107fa3f04 firefox`start + 52

The "Cache2 I/O" thread was also blocked on a cache lock in mozilla::net::CacheFile::WriteMetadataIfNeeded, and the "Cache I/O" thread was idle.
I think the main thread was deadlocked with itself - both CacheFileInputStream::ReadSegments and CacheFileInputStream::CloseWithStatus lock. I haven't verified whether they locked the same lock.
Assignee: nobody → michal.novotny
Severity: normal → major
Whiteboard: [necko-active]
I could not find any recent change in this code so it is mystery to me why we don't hit this deadlock more often. I see several solutions here:

1) release the lock in CacheFileInputStream::ReadSegments before calling aWriter()
2) change nsHttpChannel so that it won't cancel the pump when the data is coming from the cache via CacheFileInputStream::ReadSegments
3) allow this special case, i.e. allow only CacheFileInputStream::CloseWithStatus to be called when we are in CacheFileInputStream::ReadSegments lower on the stack

While 1) seems to be ideal solution, it is quite hard to do since we would need to ensure that the buffer won't be reallocated due to write on another thread. The lock protects several classes and it would be a big change. I'm inclined to implement 3). Honza, what's your opinion about 2)?
Flags: needinfo?(honzab.moz)
Discussed in person.  The code will be made reentrant.
Flags: needinfo?(honzab.moz)
Attached patch patch v1 (obsolete) — Splinter Review
This is quite a large change and we might want to have it on m-c for a while before uplifting. This deadlock occurs when ReadSegments is used to read the data from the input stream and the passed handler calls some method on the stream again. This happens when an image is transferred using brotli compression and the image is already in the imgcache. Imgcache cancels the request from OnDataAvailable which closes the input stream and re-enters the lock. I'm not sure whether another possible scenario exists but I guess that probably not because this is the first and the only report so far.

The idea of the patch is simple. Buffer is now encapsulated in a class and read/write handles are used to access the data. The handles are meant to be a short term handles for a single read/write operation. That's why GetReadHandle/GetWriteHandle returns an object allocated on the stack. The main buffer is CacheFileChunk::mBuf and most of the time no other buffer will be needed when reading or writing the data. Just in case the read handle exists when an output stream wants to write some data, the buffer is cloned and the original is kept as read-only until the read handle is released. In theory something like this is possible:

- thread A, input stream gets the read handle and releases the lock, the handle sees data in mBuf (buf1)
- thread B, output stream writes some data, this moves buf1 info mOldBufs and mBuf now contains new data (buf2)
- thread C, another input stream gets the read handle and releases the lock, the handle sees data in mBuf (buf2)
- thread B, output stream writes some data, this moves buf2 info mOldBufs and mBuf now contains new data (buf3)
Attachment #8764754 - Flags: review?(honzab.moz)
Comment on attachment 8764754 [details] [diff] [review]
patch v1

Review of attachment 8764754 [details] [diff] [review]:
-----------------------------------------------------------------

Nice piece of work.  

Instead of so massive use of MOZ_RELEASE_ASSERT you should rather use MOZ_DIAGNOSTIC_ASSERT.  It real-crashes only on m-c/m-a.  That should be sufficient to find any issues IMO.  

r=honzab but I didn't test this locally nor put too much effort to checks on lifetime/various thread-safety.

::: netwerk/cache2/CacheFileChunk.cpp
@@ +132,5 @@
> +void
> +CacheFileChunkBuffer::RemoveReadHandle()
> +{
> +  AssertOwnsLock();
> +  MOZ_RELEASE_ASSERT(mReadHandlesCount != 0);

just MOZ_RELEASE_ASSERT(mReadHandlesCount) ?

@@ +166,5 @@
> +
> +uint32_t
> +CacheFileChunkHandle::DataSize()
> +{
> +  mBuf->AssertOwnsLock();

do a null check before? (in other methods too)

@@ +198,5 @@
> +CacheFileChunkWriteHandle::CacheFileChunkWriteHandle(CacheFileChunkBuffer *aBuf)
> +{
> +  mBuf = aBuf;
> +  if (mBuf) {
> +    mBuf->mWriteHandleExists = true;

maybe assert false first?

could two streams try to write to the same chunk?  what happens then?

@@ +219,5 @@
> +void
> +CacheFileChunkWriteHandle::UpdateDataSize(uint32_t aOffset, uint32_t aLen, bool aEOF)
> +{
> +  MOZ_ASSERT(mBuf, "Write performed on dummy handle?");
> +  MOZ_ASSERT(!aEOF, "Not implemented yet");

is it then worth keeping here?

@@ +779,5 @@
>    mStatus = aStatus;
>  }
>  
> +CacheFileChunkReadHandle
> +CacheFileChunk::GetReadHandle()

this signature causes unnecessary addref/release on the buffer.  

I understand you may not want something like:

CacheFileChunkBuffer* CacheFileChunk::GetBufferForReading()

and then:

{
  CacheFileChunkReadHandle hnd(GetBufferForReading()); 
  ...
}

just to make sure the buffer is never used w/o a handle.

hence, you should define a move constructors/=operators for the two classes to save the addref/release.

@@ +891,5 @@
> +  uint32_t oldBuffersSize = mBuffersSize;
> +  mBuffersSize += aAllocated;
> +  mBuffersSize -= aFreed;
> +
> +  DoMemoryReport(sizeof(CacheFileChunk) + mBuffersSize);

I think this should be done only after the !mLimitAllocation check?

@@ +909,3 @@
>  {
>    static mozilla::Atomic<uint32_t> chunksMemoryUsage(0);
>    static mozilla::Atomic<uint32_t> prioChunksMemoryUsage(0);

when you are here, these could be changed to only ReleaseAcquire

::: netwerk/cache2/CacheFileChunk.h
@@ +27,5 @@
> +public:
> +  MOZ_DECLARE_REFCOUNTED_TYPENAME(CacheFileChunkBuffer);
> +
> +  explicit CacheFileChunkBuffer(CacheFileChunk *aChunk);
> +  ~CacheFileChunkBuffer();

this must be private when the object is refcounted.

@@ +51,5 @@
> +
> +  void RemoveReadHandle();
> +  void RemoveWriteHandle();
> +
> +  CacheFileChunk *mChunk;

please add a comment why this is a weak ptr (how it is ensured that mChunk overlives this class)

@@ +56,5 @@
> +  char           *mBuf;
> +  uint32_t        mBufSize;
> +  uint32_t        mDataSize;
> +  uint32_t        mReadHandlesCount;
> +  bool            mWriteHandleExists;

are mReadHandlesCount and mWriteHandleExists always processed only under the file's lock?  otherwise must be atomic (rel/acq sufficient)

@@ +221,3 @@
>  
> +  // We need to keep pointers of the old buffers for memory reporting.
> +  nsTArray<RefPtr<CacheFileChunkBuffer> > mOldBufs;

nit: > > can now be >>

::: netwerk/cache2/CacheFileInputStream.cpp
@@ +250,5 @@
>    LOG(("CacheFileInputStream::CloseWithStatusLocked() [this=%p, "
>         "aStatus=0x%08x]", this, aStatus));
>  
>    if (mClosed) {
> +    MOZ_ASSERT(!mCallback || mInReadSegments);

isn't this assertion too strict? (I mean even the original version).  why would it be wrong to have a callback after close?  just a leak precaution?  if so, then why it's now OK to have a callback here why being called from insire readsegments?  changes to assertions need comments..

@@ +296,5 @@
>  
> +  if (mInReadSegments) {
> +    LOG(("CacheFileInputStream::AsyncWait() - Cannot be called while the stream"
> +         " is in ReadSegments!"));
> +    return NS_ERROR_UNEXPECTED;

this seems odd a bit.  I think it should be legal to do it, but depends.  I presume it's not simple to allow this, then we should have an assertion (or diagnostic crash) here just to catch places trying to do this?

::: netwerk/cache2/CacheFileInputStream.h
@@ +58,5 @@
> +  int64_t                mPos;
> +  nsresult               mStatus;
> +  bool                   mClosed : 1;
> +  bool                   mInReadSegments : 1;
> +  bool                   mWaitingForUpdate : 1;

please double check carefully that all of these fieldsare accessed always under the lock, I was too lazy..
Attachment #8764754 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Instead of so massive use of MOZ_RELEASE_ASSERT you should rather use
> MOZ_DIAGNOSTIC_ASSERT.  It real-crashes only on m-c/m-a.  That should be
> sufficient to find any issues IMO.  

I had a look at them and all are important. I.e. if the assertion is not true we would probably crash later which is hard to debug. I'll keep them release assert so if there is anything wrong we don't end up hunting mystery crashes.

> @@ +166,5 @@
> > +
> > +uint32_t
> > +CacheFileChunkHandle::DataSize()
> > +{
> > +  mBuf->AssertOwnsLock();
> 
> do a null check before? (in other methods too)

DataSize must not be called when mBuf is null. I've added an assertion to make that clear.

> @@ +198,5 @@
> > +CacheFileChunkWriteHandle::CacheFileChunkWriteHandle(CacheFileChunkBuffer *aBuf)
> > +{
> > +  mBuf = aBuf;
> > +  if (mBuf) {
> > +    mBuf->mWriteHandleExists = true;
> 
> maybe assert false first?
> 
> could two streams try to write to the same chunk?  what happens then?

This is checked in GetWriteHandle, but I added an assertion here too.

> @@ +779,5 @@
> >    mStatus = aStatus;
> >  }
> >  
> > +CacheFileChunkReadHandle
> > +CacheFileChunk::GetReadHandle()
> 
> this signature causes unnecessary addref/release on the buffer.  
> 
> I understand you may not want something like:
> 
> CacheFileChunkBuffer* CacheFileChunk::GetBufferForReading()
> 
> and then:
> 
> {
>   CacheFileChunkReadHandle hnd(GetBufferForReading()); 
>   ...
> }
> 
> just to make sure the buffer is never used w/o a handle.
> 
> hence, you should define a move constructors/=operators for the two classes
> to save the addref/release.

It seems this is not needed. I checked with debugger that only one addref and one release is called when the read/write handle is obtained, used and destroyed.

> @@ +891,5 @@
> > +  uint32_t oldBuffersSize = mBuffersSize;
> > +  mBuffersSize += aAllocated;
> > +  mBuffersSize -= aFreed;
> > +
> > +  DoMemoryReport(sizeof(CacheFileChunk) + mBuffersSize);
> 
> I think this should be done only after the !mLimitAllocation check?

No, we do report memory size always. If !mLimitAllocation we don't count this chunk into chunksMemoryUsage.

> @@ +56,5 @@
> > +  char           *mBuf;
> > +  uint32_t        mBufSize;
> > +  uint32_t        mDataSize;
> > +  uint32_t        mReadHandlesCount;
> > +  bool            mWriteHandleExists;
> 
> are mReadHandlesCount and mWriteHandleExists always processed only under the
> file's lock?

yes

> ::: netwerk/cache2/CacheFileInputStream.cpp
> @@ +250,5 @@
> >    LOG(("CacheFileInputStream::CloseWithStatusLocked() [this=%p, "
> >         "aStatus=0x%08x]", this, aStatus));
> >  
> >    if (mClosed) {
> > +    MOZ_ASSERT(!mCallback || mInReadSegments);
> 
> isn't this assertion too strict? (I mean even the original version).  why
> would it be wrong to have a callback after close?  just a leak precaution? 
> if so, then why it's now OK to have a callback here why being called from
> insire readsegments?  changes to assertions need comments..

I added a comment. It's not too strict, we notify the listener as soon as the stream is closed. Now we postpone the notification until we step out of ReagSegments.

> @@ +296,5 @@
> >  
> > +  if (mInReadSegments) {
> > +    LOG(("CacheFileInputStream::AsyncWait() - Cannot be called while the stream"
> > +         " is in ReadSegments!"));
> > +    return NS_ERROR_UNEXPECTED;
> 
> this seems odd a bit.  I think it should be legal to do it, but depends.  I
> presume it's not simple to allow this, then we should have an assertion (or
> diagnostic crash) here just to catch places trying to do this?

I don't think it makes sense to call AsyncWait while the data has not been fully read and I don't want to implement something that will never be used and cannot be easily tested. I added an assertion to make sure this code is unused.

> ::: netwerk/cache2/CacheFileInputStream.h
> @@ +58,5 @@
> > +  int64_t                mPos;
> > +  nsresult               mStatus;
> > +  bool                   mClosed : 1;
> > +  bool                   mInReadSegments : 1;
> > +  bool                   mWaitingForUpdate : 1;
> 
> please double check carefully that all of these fieldsare accessed always
> under the lock, I was too lazy..

Yes, they are.
Attachment #8764754 - Attachment is obsolete: true
Attachment #8768350 - Flags: review+
Severity: major → critical
Keywords: hang
Pushed by mnovotny@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8446a3f5f3c2
Hang due to CacheFileInputStream deadlock, r=honzab
https://hg.mozilla.org/mozilla-central/rev/8446a3f5f3c2
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Duplicate of this bug: 1295572
Comment on attachment 8768350 [details] [diff] [review]
patch v2 - addressed reviewer's comments

Approval Request Comment
[Feature/regressing bug #]: enabled brotli compression 
[User impact if declined]: possible hang of cache IO thread (see bug 1295572 comment 7)
[Describe test coverage new/current, TreeHerder]: added xpcshell test + all existing tests that use cache
[Risks and why]: low, although it is quite a large change it's in the tree for a while and we're are not aware of any regression
[String/UUID change made/needed]: none
Attachment #8768350 - Flags: approval-mozilla-beta?
Comment on attachment 8768350 [details] [diff] [review]
patch v2 - addressed reviewer's comments

Fix for a hang that may be fairly common, may be risky/big change but includes new tests. Let's uplift this for the beta 6 build next Monday. 

That doesn't leave a lot of time to evaluate the fix; Michal and Honza, can you help look over crash-stats next week, or suggest anything else to help verify that this fixes the issue and hasn't caused worse issues?
Attachment #8768350 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Duplicate of this bug: 1299586
Depends on: 1401224
You need to log in before you can comment on or make changes to this bug.