Closed Bug 1035411 Opened 6 years ago Closed 6 years ago

Suspect lock handling in cache2

Categories

(Core :: Networking: Cache, defect, critical)

x86_64
Windows 8.1
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox32 --- fixed
firefox33 --- fixed

People

(Reporter: jgilbert, Assigned: michal)

References

Details

Attachments

(1 file)

In checking out my recent crash reports[1], I found this code:
http://hg.mozilla.org/mozilla-central/annotate/1204667a2935/netwerk/cache2/CacheFileInputStream.cpp#l229

  // We need to release the lock to avoid lock re-entering
  ...
  lock.Unlock();
  rv = aWriter(this, aClosure, buf, 0, toRead, _retval);
  lock.Lock();

This is suspect handling of this lock, since it is generally susceptible to race conditions. This should either pass along the still-held lock to the callee, or use a reentrant lock.

Alternatively, if proper behavior is assured regardless, there should be a comment about why this does not introduce a race condition.
Michal, can you please elaborate?
Flags: needinfo?(michal.novotny)
BTW, we only have reentrant monitors (also has condvar, which here would have no use), AFAIK.
BTW2, if this were split to two protected blocks with writer() call between them, you wouldn't probably file this bug.  This is the same.
I don't think that releasing the lock is really necessary, but the common rule is to not call an unknown callback while holding the lock. The CacheFile's lock is shared among CacheFileInputStream, CacheFileOutputStream and CacheFileChunk and it's OK when e.g. some output stream method is called while the lock is released in CacheFileInputStream::ReadSegments(). The only problem at this place would be if some method of the _same_ input stream would be called. Such input stream usage doesn't make sense and the fact that it doesn't happen is verified in debug build with the mInReadSegments checks.
Flags: needinfo?(michal.novotny)
Thanks.  WONTFIX based on comment 5.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
(In reply to Michal Novotny (:michal) from comment #5)
> I don't think that releasing the lock is really necessary, but the common
> rule is to not call an unknown callback while holding the lock. The
> CacheFile's lock is shared among CacheFileInputStream, CacheFileOutputStream
> and CacheFileChunk and it's OK when e.g. some output stream method is called
> while the lock is released in CacheFileInputStream::ReadSegments(). The only
> problem at this place would be if some method of the _same_ input stream
> would be called. Such input stream usage doesn't make sense and the fact
> that it doesn't happen is verified in debug build with the mInReadSegments
> checks.

I'm not super-satisfied with this, since I'm getting this crash about once every two weeks. I think it's reasonable to say that it could be missed by DEBUG builds if it is a race condition in this case.

Is it impossible for contention by construction?

I would argue that "don't call an unknown callback while holding a lock" does not necessarily apply here, since if you're worried the callback might deadlock, the possibility of the deadlock means you are preventing the race condition. Removing the lock just removes the symptom, and will let it work *in many cases*.

What situations are there where a callback might re-enter the lock, but would not constitute a race condition?
Flags: needinfo?(michal.novotny)
I've spent some time trying find out whether releasing of the lock is really necessary. I've removed releasing of the lock and I've run FF locally as well as I've pushed it to try (https://tbpl.mozilla.org/?tree=Try&rev=0d60b0a69f21). So far I didn't experience reentering the lock, so this problem is probably just theoretical and I think we could demand that callback passed to CacheFileInputStream::ReadSegments() won't call the cache code.

There is really a problem when the same entry is written on a different thread while the lock is released. When the data won't fit into the existing buffer, it is reallocated in CacheFileChunk::EnsureBufSize() which invalidates the buffer returned by CacheFileInputStream::CanRead().

So my proposal is to hold the lock all the time. If this is not acceptable, then I don't see any other solution than removing ReadSegments() method and using just Read().
Severity: normal → critical
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: WONTFIX → ---
Attached patch patch v1Splinter Review
Assignee: nobody → michal.novotny
Attachment #8455814 - Flags: review?(honzab.moz)
Michal, this is not about reentering the lock, it's about potential deadlock warning we may get.  The code is designed to enter locks in cascades like: cache service lock -> cache entry lock -> cache file lock.  But when we don't release the lock in ReadSegments before calling the callback, we may get cascade of cache file lock -> cache entry lock.  This will be detected (correctly!) as a potential deadlock since there are two codepaths that may enter two locks (the file's and the entry's) in opposite ways.
Comment on attachment 8455814 [details] [diff] [review]
patch v1

Review of attachment 8455814 [details] [diff] [review]:
-----------------------------------------------------------------

We need to release the lock, that's obvious.  What's wrong here is that the buffer we pass as an argument may be released.  Can we hold the chunk locally (in a local var, on the stack) and maybe keep it in some 'locked' or 'just being operated' state so that it cannot go away sooner than the callback returns?
Attachment #8455814 - Flags: review?(honzab.moz) → review-
Few notes:

- the crash reports all indicate the buffer is corrupted (supports my suggestion)
- nsBufferedInputStream::ReadSegments is not very smart too (doesn't count with call to Seek() while inside ReadSegments()->writer()) and doesn't keep any locks despite the class implements thread-safe ref counter, [1]

[1] http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsBufferedStreams.cpp#334
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Michal, this is not about reentering the lock, it's about potential deadlock
> warning we may get.  The code is designed to enter locks in cascades like:
> cache service lock -> cache entry lock -> cache file lock.  But when we
> don't release the lock in ReadSegments before calling the callback, we may
> get cascade of cache file lock -> cache entry lock.  This will be detected
> (correctly!) as a potential deadlock since there are two codepaths that may
> enter two locks (the file's and the entry's) in opposite ways.

I clearly wrote that we could demand that the callback won't call any cache code which would mean that this cannot happen.


> We need to release the lock, that's obvious.  What's wrong here is that the
> buffer we pass as an argument may be released.  Can we hold the chunk
> locally (in a local var, on the stack) and maybe keep it in some 'locked' or
> 'just being operated' state so that it cannot go away sooner than the
> callback returns?

No, you didn't read my comment correctly. It's not about releasing the chunk and its buffer. The problem is that on one thread we use some buffer outside the lock and on another thread this buffer is reallocated and so likely moved, so the previous pointer becomes invalid.
(In reply to Honza Bambas (:mayhemer) from comment #12)
> Few notes:
> 
> - the crash reports all indicate the buffer is corrupted (supports my
> suggestion)
> - nsBufferedInputStream::ReadSegments is not very smart too (doesn't count
> with call to Seek() while inside ReadSegments()->writer()) and doesn't keep
> any locks despite the class implements thread-safe ref counter, [1]
> 
> [1]
> http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/
> nsBufferedStreams.cpp#334

I don't know what you're trying to say with this, please read comment #13 to understand the problem.
(In reply to Michal Novotny (:michal) from comment #13)
> (In reply to Honza Bambas (:mayhemer) from comment #10)
> > Michal, this is not about reentering the lock, it's about potential deadlock
> > warning we may get.  The code is designed to enter locks in cascades like:
> > cache service lock -> cache entry lock -> cache file lock.  But when we
> > don't release the lock in ReadSegments before calling the callback, we may
> > get cascade of cache file lock -> cache entry lock.  This will be detected
> > (correctly!) as a potential deadlock since there are two codepaths that may
> > enter two locks (the file's and the entry's) in opposite ways.
> 
> I clearly wrote that we could demand that the callback won't call any cache
> code which would mean that this cannot happen.

I'm afraid we cannot enforce this.  The problem comes from an old code we just undermine with this change, but read on...

> 
> 
> > We need to release the lock, that's obvious.  What's wrong here is that the
> > buffer we pass as an argument may be released.  Can we hold the chunk
> > locally (in a local var, on the stack) and maybe keep it in some 'locked' or
> > 'just being operated' state so that it cannot go away sooner than the
> > callback returns?
> 
> No, you didn't read my comment correctly. It's not about releasing the chunk
> and its buffer. The problem is that on one thread we use some buffer outside
> the lock and on another thread this buffer is reallocated and so likely
> moved, so the previous pointer becomes invalid.


OK, reallocated means released as well.  

So, the problem is that the buffer may get invalid while we are in writer().

Can we somehow prevent it?  I assume reallocation may only happen when we do partial request?  (one thread reads, other writes and prolongs = reallocates the buffer.)


This is rather complicated... hmm... maybe we should find the original stack that calls from the callback back on the cache and fix that code and try to enforce (I suspect this will be very hard tho and will cause us just troubles in the future).  

Unfortunatelly this change was introduced in times we were working on Gum w/o recording bugs.  I'll try to find any logs on my hdd somewhere to find the stack...

The changeset we have added the unlock change: https://hg.mozilla.org/projects/gum/rev/abd66f7c85b8
(In reply to Michal Novotny (:michal) from comment #14)
> (In reply to Honza Bambas (:mayhemer) from comment #12)
> > Few notes:
> > 
> > - the crash reports all indicate the buffer is corrupted (supports my
> > suggestion)
> > - nsBufferedInputStream::ReadSegments is not very smart too (doesn't count
> > with call to Seek() while inside ReadSegments()->writer()) and doesn't keep
> > any locks despite the class implements thread-safe ref counter, [1]
> > 
> > [1]
> > http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/
> > nsBufferedStreams.cpp#334
> 
> I don't know what you're trying to say with this, please read comment #13 to
> understand the problem.

After further understanding this bug all this seems moot now, right.
Comment on attachment 8455814 [details] [diff] [review]
patch v1

Review of attachment 8455814 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, I didn't finish reading comment 8 first!  Sorry.

OK, let's get this in and see what happens.  For now it's better to have a potential deadlock code path we can then find and fix than a bad crash.

r+
Attachment #8455814 - Flags: review- → review+
Status: REOPENED → ASSIGNED
Comment on attachment 8455814 [details] [diff] [review]
patch v1

Approval Request Comment
[Feature/regressing bug #]: cache2
[User impact if declined]:  crash
[Describe test coverage new/current, TBPL]: needs to bake on m-c for a few days
[Risks and why]:  This is a change in locking code, so always some risk.  That's why I think this should bake for a few days on m-c.
[String/UUID change made/needed]: none
Attachment #8455814 - Flags: approval-mozilla-aurora?
This looks like it will hit m-c today. Let's give it until at least Friday before considering for Aurora approval. (Note that we're almost out of time for Aurora uplift.)
https://hg.mozilla.org/mozilla-central/rev/6f0a5bfa00a6
Status: ASSIGNED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Attachment #8455814 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1052266
Looking through Socorro [1] for [@ nsUTF8ToUnicode::Convert(char const*, int*, wchar_t*, int*)], I can see that there are 3 crashes on Firefox 32 Beta, in builds with IDs: 20140722195627, 20140804164216, 20140807212602.

Michal, should I reopen this issue? Is there more work needed here?

[1] - https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=nsUTF8ToUnicode%3A%3AConvert%28char+const%2A%2C+int%2A%2C+wchar_t%2A%2C+int%2A%29#tab-reports
Flags: needinfo?(michal.novotny)
Please file a new bug. It seems that this crash is not caused by releasing the lock in CacheFileInputStream::ReadSegments(), but as I described in comment #8 it could cause a crash, so let's keep this bug about that issue.
Flags: needinfo?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #24)
> Please file a new bug. It seems that this crash is not caused by releasing
> the lock in CacheFileInputStream::ReadSegments(), but as I described in
> comment #8 it could cause a crash, so let's keep this bug about that issue.

Logged separate bug for the crash (bug 1055529).
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.