Closed Bug 673543 Opened 13 years ago Closed 12 years ago

crash nsDiskCacheMap::DeleteRecord

Categories

(Core :: Networking: Cache, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla14

Tracking Status:
firefox10: tracking+, status ---
firefox11: tracking+, status ---
firefox12: tracking+, wontfix
firefox13: tracking+, wontfix
firefox14: tracking+, fixed
firefox-esr10: tracking 14+, fixed

People

(Reporter: WeirdAl, Assigned: michal)

Details

(Keywords: crash, topcrash, Whiteboard: [tbird crash][qa?])

Crash Data

Attachments

(5 files, 2 obsolete files)

This bug was filed from the Socorro interface and is report bp-67651a52-dd9a-4499-b25a-0940b2110722.
============================================================= 

Filing here because Socorro didn't offer me the Core product option. I don't know quite how this happened, but the stack trace appears to be different from the one in bug 631008.
Component: General → Networking: Cache
Product: Firefox → Core
QA Contact: general → networking.cache
Version: Trunk → 6 Branch
Assigning to Michal for investigation because it is probably related to bug 631008.
Assignee: nobody → michal.novotny
The stack trace looks like:
Frame  Module        Signature                                   Source
0      xul.dll       nsDiskCacheMap::DeleteRecord                netwerk/cache/nsDiskCacheMap.cpp:488
1      xul.dll       nsDiskCacheDevice::DeactivateEntry_Private
2      xul.dll       nsDiskCacheDeviceDeactivateEntryEvent::Run  netwerk/cache/nsDiskCacheDevice.cpp:473
3      xul.dll       nsThread::ProcessNextEvent                  xpcom/threads/nsThread.cpp:618
4      xul.dll       nsRunnable::Release                         obj-firefox/xpcom/build/nsThreadUtils.cpp:55
5      nspr4.dll     _PR_NativeRunThread                         nsprpub/pr/src/threads/combined/pruthr.c:426
6      nspr4.dll     pr_root                                     nsprpub/pr/src/md/windows/w95thred.c:122
7      mozcrt19.dll  _callthreadstartex                          obj-firefox/memory/jemalloc/crtsrc/threadex.c:348
8      mozcrt19.dll  _threadstartex                              obj-firefox/memory/jemalloc/crtsrc/threadex.c:326
9      kernel32.dll  BaseThreadStart
Crash Signature: [@ nsDiskCacheMap::DeleteRecord(nsDiskCacheRecord*)] → [@ nsDiskCacheMap::DeleteRecord(nsDiskCacheRecord*)] [@ nsDiskCacheMap::DeleteRecord]
OS: Windows NT → All
Hardware: x86 → All
Keywords: topcrash
Sitting at the #18 spot for crashes on 8.0 and at #38 on 8.0.1 over the past week.
Still occurs according to crash-stats. Grabbing this on request from Josh and by agreement with Michal.
Assignee: michal.novotny → bjarne
This is caused by the same problem as bug #691108, which I can reproduce with an xpcshell test in a debugger.

Bug #691108 only occurs in debug builds, since the crashing statement (NS_NOTREACHED) is active only in debug builds. In an opt build, nsDiskCacheMap::WriteDiskCacheEntry() returns failure (instead of crashing in the NOTREACHED), making DeactivateEntry_Private() call DeleteStorage() and DeleteRecord(). The latter crashes (this bug) because the record has already been deleted from nsDiskCacheMap.

It seems like an entry can be deleted from nsDiskCacheMap while there is a pending deactivation. This is basically because nsDiskCacheStreamIO keeps a binding to the record even after CloseOutputStream/Internal is called. I am not 100% sure under which circumstances this happens, but I can provoke it by dispatching an "xpcom-shutdown" notification from the test and using the debugger to make nsDiskCacheMap::DeleteStorage() fail in nsDiskCacheStreamIO::Flush(), which triggers the call to nsDiskCacheMap::DeleteRecord(). When the deactivate-event fires, the record has already been deleted and we get the behaviour described above.
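To make the sequencing concrete, here is a toy model of the double delete (all names are hypothetical stand-ins, not the real cache classes; the assert plays the role of the crash):

    #include <cassert>
    #include <set>

    // Stand-in for nsDiskCacheMap: like the real bucket scan,
    // DeleteRecord() assumes the record is still present.
    struct ToyCacheMap {
        std::set<int> records;
        void DeleteRecord(int hash) {
            assert(records.count(hash) && "record already gone -> crash");
            records.erase(hash);
        }
    };

    int main() {
        ToyCacheMap map;
        map.records.insert(42);
        map.DeleteRecord(42);   // Flush() fails and deletes the record...
        map.DeleteRecord(42);   // ...then the pending deactivate event
                                // deletes it again: this bug's crash
        return 0;
    }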

Will have to think about how to fix this... The obvious way is to clear the binding to the record in nsDiskCacheStreamIO but there may be other ways. I also do not really understand the logic in nsDiskCacheStreamIO::Flush() if nsDiskCacheMap::DeleteStorage() fails - why would we just remove the record..? Wouldn't dooming the entry be more appropriate?
Since bug #691108 is about one particular test failing, I've been trying to find the cache access-pattern used in that test in order to come up with some test-code which fails consistently. No clear conclusion yet, but the observation is that these crashes occur immediately when the test enables the disk cache, loads one entry, clears the cache and tries to load a second entry. The crash seems to be in the deactivate-call for the entry loaded immediately before clearing the cache.
(In reply to Bjarne (:bjarne) from comment #5)
> In an opt build,
> nsDiskCacheMap::WriteDiskCacheEntry() returns failure (instead of crashing
> in the NOTREACHED) making DeactivateEntry_Private() call DeleteStorage() and
> DeleteRecord(). The latter crashes (this bug) because the record has already
> been deleted from nsDiskCacheMap.

This does not seem to be right... I have a test which can consistently trigger bug #691108 but I do not seem to be able to trigger this bug with an opt build. Will investigate further...
This signature is now #5 on 10.* in yesterday's data, #8 in total 10.0b4 data.
Any idea if this is because it happens more frequently or because other crashes are fixed?
It was #18 top browser crasher in 8.0.
It's #25 top browser crasher in 9.0.1.
It's stable across 10.0 Betas, between #8 and #11.
So there's probably a regression between 9 and 10.
..  or maybe some other crashes were fixed between 9 and 10, causing this one to climb on the list?  ;)

Anyway, here's a theory: Using the test mentioned in comment #7, it can be shown that a failure in |nsDiskCacheMap::DeleteStorage| can cause an entry to be deleted from the cache-map twice (the essence is in comment #5). The crash occurs when |nsDiskCacheMap::DeleteRecord| is called a second time for the same entry:

https://hg.mozilla.org/releases/mozilla-release/file/4e309e63c279/netwerk/cache/nsDiskCacheMap.cpp#l488

Observe that on the second call |mHeader.mBucketUsage[bucketIndex]| can be 0 (for example if it happens immediately after clearing the cache). This means that |last| will be assigned the value -1, but since |last| is a |PRUint32|, a conversion will happen. In the for-loop below, |last| is converted into an |int| (note: plain signed int, with no specified bit width), and my guess is that on certain platforms and with certain compilers and flags, |i| ends up as a large positive number, which makes |records[i].HashNumber()| crash. This is possibly also a 32/64-bit issue - I'm not too deep into the gory details in this area.
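For concreteness, here is a minimal standalone sketch of the suspected conversion hazard (the loop shape is paraphrased from the linked code, not copied from it):

    #include <cstdio>

    typedef unsigned int PRUint32;     // 32-bit unsigned, as in NSPR

    int main() {
        PRUint32 bucketUsage = 0;          // bucket already emptied
        PRUint32 last = bucketUsage - 1;   // wraps to 0xffffffff
                                           // (well-defined for unsigned)
        printf("last = 0x%x\n", last);
        // The loop converts |last| back to a signed int; the result of
        // (int)0xffffffff is implementation-defined - commonly -1, which
        // skips the body, but nothing guarantees that everywhere.
        for (int i = last; i >= 0; i--) {
            printf("would touch records[%d]\n", i);
            break;   // one iteration is enough to show the hazard
        }
        return 0;
    }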

Michal: On my Linux-box this is handled somehow and I don't have the setup for Windows-builds. Could you try the patch I'll attach in a while on your setup and verify that the contained test crashes in an opt-build? Then, change |last| to be a signed int and see if it fixes the crash?
Attached patch Test to provoke crash (obsolete) — Splinter Review
This is the code mentioned in comment #7.

In a debug-build it reproduces the issue from bug #691108. In an opt-build, the theory is that on some platforms (non-Linux) it will reproduce the issue seen in this bug.

Note that this is not suitable for checking in since it cripples |nsICacheService| to provide a mechanism for the test to make |nsDiskCacheMap::DeleteStorage| fail.
(In reply to Bjarne (:bjarne) from comment #9)
> Any idea if this is because it happens more frequently or because other
> crashes are fixed?

Frequency of this crash has definitely risen in Firefox 10+ compared to 9. It was at 110-130 crashes per million ADU for the ~2 weeks before, and at 356 in data from 2012-01-17 (but that's a single-day spike). It's also going between 100 and 300 for Firefox 11. And it's at 5-8 crashes per million ADU in data for Firefox 9.

Also, as you mentioned 32bit vs. 64bit, https://crash-stats.mozilla.com/report/list?signature=nsDiskCacheMap%3A%3ADeleteRecord%28nsDiskCacheRecord*%29 seems to have mostly 32bit, but the odd 64bit crash is visible there as well. All the crashes are on Windows, though.
(In reply to Bjarne (:bjarne) from comment #11)
> Michal: On my Linux-box this is handled somehow and I don't have the setup
> for Windows-builds. Could you try the patch I'll attach in a while on your
> setup and verify that the contained test crashes in an opt-build? Then,
> change |last| to be a signed int and see if it fixes the crash?

I went ahead and tried this on tryserver, but the opt-build does not crash as expected...  :(
Still believe the conversion int -> unsigned 32-bit int -> int is unsafe. Additionally, even if the conversion happens correctly, an optimizing compiler might peel the first iteration of the loop (line 490 in the link from comment #11), because |last|, being an unsigned int, should always be >= 0. So in this case the loop body would be executed once even if |i| is -1, and a crash is likely.

Of course, the proper fix for this is to avoid deleting the record twice. However, that fix is not as obvious as simply avoiding unsigned ints here in order to make this code more robust.
IMO it's worth trying this safe, simple change and seeing if it fixes this crash. If it doesn't, then no harm should have been done.
Attachment #590359 - Flags: review?(michal.novotny)
This can be applied alone or together with attachment #590359 [details] [diff] [review]

The basic idea is to make sure error-values are propagated properly from internal methods up to a level where they can be handled. I've chosen nsCacheEntry as the point to deal with errors, but I'm open to discussion. (Note that the test in attachment #589849 [details] [diff] [review] needs a try-catch to deal with this - I'll attach a new version shortly.)
Attachment #590702 - Flags: review?(michal.novotny)
Patch sent to tryserver

https://tbpl.mozilla.org/?tree=Try&rev=1f6764f8dfca

As a side-note: Bug #691108 seems to have disappeared since I started looking at this. I'm not aware of any fixes which may have fixed that. Anyone who knows..?
Whiteboard: [tbird crash]
Comment on attachment 590702 [details] [diff] [review]
More complete fix which avoids deleting record twice

(In reply to Bjarne (:bjarne) from comment #17)
> The basic idea is to make sure error-values are propagated properly from
> internal methods and up to a level where they can be handled. I've chosen
> nsCacheEntry as the point to deal with errors, but I'm open to discussions.

If we simply call nsCacheService::DoomEntry() in nsDiskCacheStreamIO::Flush() instead of propagating the error, then the only difference would be that we wouldn't stop early in nsCacheEntry::RemoveDescriptor(), right? Would it be a problem?
Attachment #590702 - Flags: review?(michal.novotny)
(In reply to Bjarne (:bjarne) from comment #11)
> Michal: On my Linux-box this is handled somehow and I don't have the setup
> for Windows-builds. Could you try the patch I'll attach in a while on your
> setup and verify that the contained test crashes in an opt-build? Then,
> change |last| to be a signed int and see if it fixes the crash?

I've tried the patch on a Windows build but it didn't crash. Variable "last" is 0xffffffff and the for loop is not entered.
Comment on attachment 590359 [details] [diff] [review]
Change to signed int

I'm not sure that we want this change. This would just suppress the crash without fixing the issue. Anyway, if we want to handle the bucket usage better, I'd prefer to make variable 'i' unsigned since mBucketUsage is also unsigned, i.e. something like this:

    PRUint32            last = mHeader.mBucketUsage[bucketIndex];
    for (PRUint32 i = last; i > 0; i--) {
        if (records[i-1].HashNumber() == hashNumber) {
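(For reference, a standalone check of this counting-down-over-an-unsigned pattern - a sketch, not the real |DeleteRecord|:)

    #include <cstdio>

    typedef unsigned int PRUint32;

    static void scan(PRUint32 usage) {
        // |i| runs over (usage, 0], indexing with i-1, so nothing is
        // ever computed from usage - 1 when usage == 0.
        for (PRUint32 i = usage; i > 0; i--)
            printf("visit records[%u]\n", i - 1);
    }

    int main() {
        scan(3);   // visits 2, 1, 0
        scan(0);   // visits nothing; no wraparound
        return 0;
    }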
Attachment #590359 - Flags: review?(michal.novotny) → review-
(In reply to Michal Novotny (:michal) from comment #20)
> Comment on attachment 590702 [details] [diff] [review]
> If we simply call nsCacheService::DoomEntry() in
> nsDiskCacheStreamIO::Flush() instead of propagating the error, then the only
> difference would be that we wouldn't stop early in
> nsCacheEntry::RemoveDescriptor(), right? Would it be a problem?

That would resolve this particular problem, yes, but I'd consider it more of a hack. I think it is a mistake in any case not to propagate errors. IMO, it is not the responsibility of the disk-stream to decide whether to doom a cache-entry. It should propagate errors and leave such decisions to a higher layer.
Attachment #590702 - Flags: review?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #22)
> I'm not sure that we want this change. This would just suppress the crash
> without fixing the issue. 

Agreed, but it would make the code robust. The way it is now is just plain wrong.

> Anyway, if we want to handle the bucket usage
> better, I'd prefer to make variable 'i' unsigned since mBucketUsage is also
> unsigned, i.e. something like this:
> 
>     PRUint32            last = mHeader.mBucketUsage[bucketIndex];
>     for (PRUint32 i = last; i > 0; i--) {
>         if (records[i-1].HashNumber() == hashNumber) {

The point of this change is to avoid unsafe conversions. If you prefer to do it this way, that's OK with me, although I see neither the rationale nor the benefit compared to the suggested way.

Also observe that this loop is found in many methods in this class, but without using |last|, instead subtracting 1 in the initial assignment to |i|. The reason for using |last| in |DeleteRecord| is that |last| is needed in three other places in the loop. You would need to subtract 1 in those three places as well, which I guess is the reason it was done like this.

I've provided two theories for why this particular line crashes (comment #11 and comment #15) - do you have any other theory for what can cause a crash here?
(In reply to Bjarne (:bjarne) from comment #19)
> Patch sent to tryserver
> 
> https://tbpl.mozilla.org/?tree=Try&rev=1f6764f8dfca

Seems to be as green as a tryserver-run can be these days...

> As a side-note: Bug #691108 seems to have disappeared since I started
> looking at this.

... and of course bug #691108, comment #77 came in a few hours after this comment. But the frequency of these has at least dropped.
Comment on attachment 590702 [details] [diff] [review]
More complete fix which avoids deleting record twice

> +    // Doom entry if something bad happens while closing. See bug #673543
> +    if (NS_FAILED(rv) && rv != NS_BASE_STREAM_WOULD_BLOCK) {
> +        nsCacheService::DoomEntry(this);
> +        return false; // it' doomed - deactivate it
> +    }

Dooming the entry doesn't remove any active descriptor (though there shouldn't be any other descriptor since we had opened output) or request. So you IMO can't return false here.


> +        // Doom entry if something bad happens while closing. See bug #673543
> +        nsresult rv = descriptor->CloseOutput();
> +        if (NS_FAILED(rv) && rv != NS_BASE_STREAM_WOULD_BLOCK)
> +            nsCacheService::DoomEntry(this);

Cache streams never return NS_BASE_STREAM_WOULD_BLOCK, see nsCacheEntryDescriptor::nsOutputStreamWrapper::IsNonBlocking() and nsCacheEntryDescriptor::nsInputStreamWrapper::IsNonBlocking(). But if we ever change it, we would need to doom the entry too, because the data wasn't flushed properly. Or am I wrong?


>      if (!mClosed) {
>          mClosed = true;
>          // tell parent streamIO we are closing
> -        mStreamIO->CloseOutputStream(this);
> +        return mStreamIO->CloseOutputStream(this);
>      }
>  
>      mozilla::Telemetry::ID id;
>      if (NS_IsMainThread())

This disables the telemetry. The same problem is in nsDiskCacheOutputStream::CloseInternal().
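
One way to keep the telemetry while still propagating the close result might be (just a sketch against the quoted hunk, not the actual follow-up patch):

    nsresult rv = NS_OK;
    if (!mClosed) {
        mClosed = true;
        // tell parent streamIO we are closing
        rv = mStreamIO->CloseOutputStream(this);
    }

    // ... record telemetry exactly as before ...

    return rv;   // propagate the close result only after telemetry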


>      if (mBinding && mOutStream)
> -        Flush();
> +        // This method is only called from nsDiskCacheBinding-dtor
> +        return Flush();

Although this is true, I'd prefer to null out mBinding (at least in case of successful flush).
Attachment #590702 - Flags: review?(michal.novotny) → review-
(In reply to Michal Novotny (:michal) from comment #26)
> Dooming the entry doesn't remove any active descriptor (though there
> shouldn't be any other descriptor since we had opened output) or request. So
> you IMO can't return false here.

Agreed.

> Cache streams never return NS_BASE_STREAM_WOULD_BLOCK, see
> nsCacheEntryDescriptor::nsOutputStreamWrapper::IsNonBlocking() and
> nsCacheEntryDescriptor::nsInputStreamWrapper::IsNonBlocking(). But if we
> ever change it, we would need to doom the entry too, because the data wasn't
> flushed properly. Or am I wrong?

You are probably right... the right action actually seems to be to just return true in this case, something like this:

    nsresult rv = descriptor->CloseOutput();
    if (rv == NS_BASE_STREAM_WOULD_BLOCK)
        return true;
    descriptor->ClearCacheEntry();
    PR_REMOVE_AND_INIT_LINK(descriptor);
    // Doom entry if something bad happens while closing. See bug #673543
    if (NS_FAILED(rv))
        nsCacheService::DoomEntry(this);

> This disables the telemetry. The same problem is in
> nsDiskCacheOutputStream::CloseInternal().

Ahh - good catch. Will fix.

> Although this is true, I'd prefer to null out mBinding (at least in case of
> successful flush).

I could agree with this but I think it is a different issue and should not be done here. See my recent comments to bug #648232. (If I'm missing something and nulling is necessary for this patch, please point out why.)
(In reply to Bjarne (:bjarne) from comment #27)
> I could agree with this but I think it is a different issue and should not
> be done here. See my recent comments to bug #648232. (If I'm missing
> something and nulling is necessary for this patch, please point out why.)

Hmm...  I missed your point here. Please forget this comment.
Addressed comments. Pushed to tryserver as https://tbpl.mozilla.org/?tree=Try&rev=6211c80cfb86
Attachment #590702 - Attachment is obsolete: true
Attachment #591087 - Flags: review?(michal.novotny)
First, I'd like to note that I have no theory. Your theory about the "signed int -> unsigned 32-bit int -> signed int" conversion would make sense if int were 64 bits long. But it seems that all Windows versions are IL32P64 (a.k.a. LLP64), i.e. int is always 32 bits.

> The point of this change is to avoid unsafe conversions. If you prefer to do
> it this way, that's ok with me although I don't see the rationale nor the
> benefit compared to the suggested way.

The benefit is that there is no unsigned -> signed conversion in my change. See section 6.3.1.3 paragraph 3 of the C99 standard:

6.3.1.3 Signed and unsigned integers

1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2) Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
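
For concreteness, a tiny standalone illustration of paragraphs 2 and 3 above (the value of |s| is implementation-defined; it is commonly -1 on two's-complement targets):

    #include <cstdio>
    #include <climits>

    int main() {
        unsigned int u = -1;      // paragraph 2: well-defined,
                                  // -1 + (UINT_MAX + 1) == UINT_MAX
        printf("u == UINT_MAX: %d\n", u == UINT_MAX);

        int s = 4294967295u;      // paragraph 3: UINT_MAX does not fit
                                  // in a 32-bit int
        printf("s = %d\n", s);
        return 0;
    }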
(In reply to Bjarne (:bjarne) from comment #27)
> You are probably right...  the right action seems to actually be just return
> true in this case, something like this:
> 
>     nsresult rv = descriptor->CloseOutput();
>     if (rv == NS_BASE_STREAM_WOULD_BLOCK)
>         return true;

Returning true will lead to an assertion in nsCacheEntryDescriptor::Close(). But I'm probably OK with it, because the cache streams are blocking and if we change it, we need to change the closing of entries too...
Attachment #591087 - Flags: review?(michal.novotny) → review+
Keywords: checkin-needed
(In reply to Michal Novotny (:michal) from comment #21)
> I've tried the patch on windows build but it didn't crash. Variable "last"
> is 0xffffffff and the for loop is not entered.

Yeah... I guess it adds 2^32 to -1, which sets all the bits.

(In reply to Michal Novotny (:michal) from comment #30)
> First, I'd like to note that I have no theory. Your theory about "signed int
> -> unsigned 32b int -> signed int" conversion would make sense if int were
> 64 bit long. But it seems that all windows versions are I32LP64, i.e. int is
> always 32 bit.

I am really no expert on how modern compilers work, so I'll accept this if you say so. (I also see that MSDN states that MS compilers preserve the bit-patterns in signed -> unsigned and unsigned -> signed conversions, which confirms your comment #21 and should make this safe on Win32.)

What about the unroll-idea? Could an optimizing modern compiler assume that a value assigned from an unsigned int will always be >= 0 and hence always execute the first iteration of the loop |for (int i = last; i >= 0; i--)|? (We did such things manually in the good old days - maybe compilers have learnt how to do it by now..?)

> The benefit is that there is no unsigned -> signed conversion in my change.

Well, there is a point in avoiding the wrapping, which is permitted when subtracting unsigned ints. Another way to handle this would be to follow the secure-coding guidelines from CERT:

https://www.securecoding.cert.org/confluence/display/cplusplus/INT30-CPP.+Ensure+that+unsigned+integer+operations+do+not+wrap#INT30-CPP.Ensurethatunsignedintegeroperationsdonotwrap-Subtraction

The code we have is essentially the "noncompliant code example", and we could make it compliant as they suggest there.
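Applied here, the compliant version would presumably add a precondition check before the subtraction (a sketch, assuming |last| is currently computed as |mHeader.mBucketUsage[bucketIndex] - 1|):

    PRUint32 usage = mHeader.mBucketUsage[bucketIndex];
    if (usage == 0) {
        // The record cannot be in an empty bucket; bail out instead of
        // letting the subtraction wrap to 0xffffffff.
        NS_WARNING("DeleteRecord called for an empty bucket");
        return NS_ERROR_UNEXPECTED;
    }
    PRUint32 last = usage - 1;   // guaranteed not to wrap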
https://hg.mozilla.org/mozilla-central/rev/f10313959cb0
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
-> default owner
Assignee: bjarne → nobody
Not only are there still crashes, the patch appears to have had zero impact.
Top crasher #29 in FF11, and still apparent in FF12. Josh - can we get somebody to look at this?
Assignee: nobody → joshmoz
Michal, can you look into this?
Assignee: joshmoz → michal.novotny
It's tracked for FF12 and we haven't had any comments in 10 days. It's a top crash that we shipped in 11, and it would be great to get it fixed in 12. It's still very high volume, so the fix from 1/26 didn't work.
I'm actively working on this. So far I have a test that reproduces an assertion from bug #691108. I believe that the cause of both bugs is the same. But the problem is that I need to run the test overnight to get the assertion, and the logs are really huge, so analyzing them is not a quick job.
Unfortunately this bug isn't a duplicate of bug #691108. Did anybody succeed in reproducing this crash? An NSPR log (nsHttp:5,cache:5) from a crashing session would be very helpful.
Attached patch fix — Splinter Review
In the end I think the problem is similar to bug #691108. It could happen that a new cache entry is activated while we are waiting for pending events on the cache IO thread in nsCacheService::Shutdown() at http://hg.mozilla.org/mozilla-central/annotate/6fe5b0271cd1/netwerk/cache/nsCacheService.cpp#l1245 and in nsCacheService::OnProfileShutdown() at http://hg.mozilla.org/mozilla-central/annotate/6fe5b0271cd1/netwerk/cache/nsCacheService.cpp#l2183. In both cases we clear mActiveEntries before calling SyncWithCacheIOThread(), and we expect that there is no active entry when SyncWithCacheIOThread() finishes. I'm not able to reproduce the crash and I don't see exactly what scenario causes it, but using a debugger I'm able to force a cache entry for a wyciwyg channel to be created while the main thread is blocked in SyncWithCacheIOThread(). Unfortunately the cache entry is also closed and removed from mActiveEntries before SyncWithCacheIOThread() finishes. But if I were able to keep a reference to the wyciwyg channel a bit longer, the cache entry wouldn't be closed immediately, and IMO that would lead to the crash that we see in this bug. The attached patch should fix this problem.
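The guard presumably has roughly this shape (all names are hypothetical; the real change is in the attached patch):

    nsresult
    nsCacheService::ActivateEntry(nsCacheRequest * request,
                                  nsCacheEntry ** result)
    {
        // While Shutdown()/OnProfileShutdown() are clearing
        // mActiveEntries and blocking in SyncWithCacheIOThread(),
        // refuse to activate new entries: an entry activated now would
        // outlive the wait and later be deactivated against an
        // already-torn-down cache map (this bug's crash).
        if (mClearingEntries)
            return NS_ERROR_NOT_AVAILABLE;

        // ... normal activation path continues here ...
    }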
Attachment #613553 - Flags: review?(jduell.mcbugs)
Comment on attachment 613553 [details] [diff] [review]
fix

Review of attachment 613553 [details] [diff] [review]:
-----------------------------------------------------------------

I'm no expert in this code, but poking around I see that ActivateEntry can already fail commonly enough (e.g. on a hash collision) that I'm assuming ProcessRequest will behave well once we've added the additional failure mode (when we're clearing entries).
Attachment #613553 - Flags: review?(jduell.mcbugs) → review+
Tryserver is green except for a few intermittent failures: https://tbpl.mozilla.org/?tree=Try&rev=9f4bbf96ea71
https://hg.mozilla.org/mozilla-central/rev/38fcd9047d39
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: mozilla12 → mozilla14
Version: 6 Branch → Trunk
It's too late in FF12's cycle to fix this top crasher (which first regressed in FF10) - please nominate for Aurora approval if you feel the fix is low-risk.
This is now in Aurora, adjusting flags accordingly.
It's #4 top crasher in 10.0.4 ESR.
[Approval Request Comment]
User impact if declined: possible crash during FF shutdown
Fix Landed on Version: mozilla14
Risk to taking this patch (and alternatives if risky): probably none
String or UUID changes made by this patch: none
Attachment #636168 - Flags: approval-mozilla-esr10?
Comment on attachment 636168 [details] [diff] [review]
unbitrotted for esr10

looks good, thanks for the updated patch.
Attachment #636168 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
Is there something I can do to verify that this is fixed in Firefox 10.0.6esrpre? I don't see that this was ever reliably reproducible, nor can I find a testcase for it.
Whiteboard: [tbird crash] → [tbird crash][qa?]