Last Comment Bug 673543 - crash nsDiskCacheMap::DeleteRecord
: crash nsDiskCacheMap::DeleteRecord
Status: RESOLVED FIXED
[tbird crash][qa?]
: crash, topcrash
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: All All
: -- critical (vote)
: mozilla14
Assigned To: Michal Novotny (:michal)
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-07-22 14:00 PDT by Alex Vincent [:WeirdAl]
Modified: 2012-07-11 14:39 PDT (History)
13 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
+
+
wontfix
+
wontfix
+
fixed
14+
fixed


Attachments
Test to provoke crash (7.62 KB, patch)
2012-01-19 06:24 PST, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Change to signed int (1.10 KB, patch)
2012-01-20 14:49 PST, Bjarne (:bjarne)
michal.novotny: review-
Details | Diff | Splinter Review
More complete fix which avoids deleting record twice (8.17 KB, patch)
2012-01-23 06:50 PST, Bjarne (:bjarne)
michal.novotny: review-
Details | Diff | Splinter Review
Test to trigger crash - modified to handle failure-returns (7.68 KB, patch)
2012-01-23 06:51 PST, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
More complete fix which avoids deleting record twice, v1.1 (8.73 KB, patch)
2012-01-24 07:30 PST, Bjarne (:bjarne)
michal.novotny: review+
Details | Diff | Splinter Review
fix (3.36 KB, patch)
2012-04-10 04:57 PDT, Michal Novotny (:michal)
jduell.mcbugs: review+
Details | Diff | Splinter Review
unbitrotted for esr10 (3.35 KB, patch)
2012-06-24 08:52 PDT, Michal Novotny (:michal)
lukasblakk+bugs: approval‑mozilla‑esr10+
Details | Diff | Splinter Review

Description Alex Vincent [:WeirdAl] 2011-07-22 14:00:14 PDT
This bug was filed from the Socorro interface and is 
report bp-67651a52-dd9a-4499-b25a-0940b2110722 .
============================================================= 

Filing here because Socorro didn't offer me the Core product option.  I don't know quite how this happened, but the stack trace appears different than bug 631008.
Comment 1 Josh Aas 2011-11-01 10:12:51 PDT
Assigning to Michal for investigation because it is probably related to bug 631008.
Comment 2 Scoobidiver (away) 2011-12-08 05:25:00 PST
The stack trace looks like:
Frame 	Module 	Signature [Expand] 	Source
0 	xul.dll 	nsDiskCacheMap::DeleteRecord 	netwerk/cache/nsDiskCacheMap.cpp:488
1 	xul.dll 	nsDiskCacheDevice::DeactivateEntry_Private 	
2 	xul.dll 	nsDiskCacheDeviceDeactivateEntryEvent::Run 	netwerk/cache/nsDiskCacheDevice.cpp:473
3 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:618
4 	xul.dll 	nsRunnable::Release 	obj-firefox/xpcom/build/nsThreadUtils.cpp:55
5 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
6 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
7 	mozcrt19.dll 	_callthreadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:348
8 	mozcrt19.dll 	_threadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:326
9 	kernel32.dll 	BaseThreadStart
Comment 3 Sheila Mooney 2011-12-12 22:15:01 PST
Sitting at the #18 spot for crashes on 8.0 in the past week and #38 on 8.0.1 in the past week.
Comment 4 Bjarne (:bjarne) 2012-01-11 05:44:34 PST
Still occurs according to crash-stats. Grabbing this on request from Josh and by agreement with Michal.
Comment 5 Bjarne (:bjarne) 2012-01-13 14:06:33 PST
This is caused by the same problem causing bug #691108, which I can reproduce with an xpcshell-test in a debugger.

Bug #691108 only occurs in debug-builds since the crashing statement (NS_NOTREACHED) is active only for debug builds. In an opt build, nsDiskCacheMap::WriteDiskCacheEntry() returns failure (instead of crashing in the NOTREACHED) making DeactivateEntry_Private() call DeleteStorage() and DeleteRecord(). The latter crashes (this bug) because the record has already been deleted from nsDiskCacheMap.

It seems like an entry can be deleted from nsDiskCacheMap while there is a pending deactiavtion. This is basically because nsDiskCacheStreamIO keeps a binding to the record even after CloseOutputStream/Internal is called. I am not 100% sure under which circumstances this happens but I can provoke it by dispatching an "xpcom-shutdown" notification from the test and use the debugger to make nsDiskCacheMap::DeleteStorage() fail when in nsDiskCacheStreamIO::Flush(), which triggers the call to nsDiskCacheMap::DeleteRecord(). When the deactivate-event fires, the record has already been deleted and we get the behaviour described above.

Will have to think about how to fix this... The obvious way is to clear the binding to the record in nsDiskCacheStreamIO but there may be other ways. I also do not really understand the logic in nsDiskCacheStreamIO::Flush() if nsDiskCacheMap::DeleteStorage() fails - why would we just remove the record..? Wouldn't dooming the entry be more appropriate?
Comment 6 Bjarne (:bjarne) 2012-01-17 08:04:04 PST
Since bug #691108 is about one particular test failing, I've been trying to find the cache access-pattern used in that test in order to come up with some test-code which fails consistently. No clear conclusion yet but the observation is that these crashes occur immediately when the test enables disk-cache, loads one entry, clears the cache and tries to load a second entry. The crash seems to be the deactivate-call for the entry loaded immediately before clearing the cache.
Comment 7 Bjarne (:bjarne) 2012-01-18 07:28:57 PST
(In reply to Bjarne (:bjarne) from comment #5)
> In an opt build,
> nsDiskCacheMap::WriteDiskCacheEntry() returns failure (instead of crashing
> in the NOTREACHED) making DeactivateEntry_Private() call DeleteStorage() and
> DeleteRecord(). The latter crashes (this bug) because the record has already
> been deleted from nsDiskCacheMap.

This does not seem to be right... I have a test which can consistently trigger bug #691108 but I do not seem to be able to trigger this bug with an opt build. Will investigate further...
Comment 8 Robert Kaiser 2012-01-18 10:19:16 PST
This signature is now #5 on 10.* in yesterday's data, #8 in total 10.0b4 data.
Comment 9 Bjarne (:bjarne) 2012-01-18 14:38:40 PST
Any idea if this is because it happens more frequently or because other crashes are fixed?
Comment 10 Scoobidiver (away) 2012-01-18 15:04:24 PST
It was #18 top browser crasher in 8.0.
It's #25 top browser crasher in 9.0.1.
It's stable across 10.0 Betas, between #8 and #11.
So there's probably a regression between 9 and 10.
Comment 11 Bjarne (:bjarne) 2012-01-19 05:12:37 PST
..  or maybe some other crashes were fixed between 9 and 10, causing this one to climb on the list?  ;)

Anyway, here's a theory: Using the test mentioned in comment #7 it can be shown that a failure in |nsDiskCacheMap::DeleteStorage| can cause an entry to be deleted from the cache-map twice (essence in comment #5). The crash occurs when |nsDiskCacheMap::DeleteRecord| is called for the second time for the same entry

https://hg.mozilla.org/releases/mozilla-release/file/4e309e63c279/netwerk/cache/nsDiskCacheMap.cpp#l488

Observe that on the second call |mHeader.mBucketUsage[bucketIndex]| can be 0 (for example if it happens immediately after clearing the cache). This means that |last| will be assigned the value of -1 but since |last| is an |PRUint32|, some conversion will happen. In the for-loop below |last| is converted into an |int| (note: no number of bits and signed) and my guess is that on certain platforms and with certain compilers and -flags |i| ends up as a large positive number which makes |records[i].HashNumber()| crash. This is possibly also a 32/64 bit issue - I'm not too deep into the gory details in this area.

Michal: On my Linux-box this is handled somehow and I don't have the setup for Windows-builds. Could you try the patch I'll attach in a while on your setup and verify that the contained test crashes in an opt-build? Then, change |last| to be a signed int and see if it fixes the crash?
Comment 12 Bjarne (:bjarne) 2012-01-19 06:24:51 PST
Created attachment 589849 [details] [diff] [review]
Test to provoke crash

This is the code mentioned in comment #7.

In a debug-build it reproduces the issue from bug #691108. In an opt-build, the theory is that on some platforms (non-Linux) it will reproduce the issue seen in this bug.

Note that this is not suitable for checking in since it cripples |nsICacheService| to provide a mechanism for the test to make |nsDiskCacheMap::DeleteStorage| fail.
Comment 13 Robert Kaiser 2012-01-19 09:03:39 PST
(In reply to Bjarne (:bjarne) from comment #9)
> Any idea if this is because it happens more frequently or because other
> crashes are fixed?

Frequency of this crash has definitely risen in Firefox 10+ compared to 9. Was 110-130 crashes per million ADU for the ~2 weeks before, 356 in data from 2012-01-17 (but that's a single-day spike). Also going between 100 and 300 for Firefox 11. And it's at 5-8 crashes per million ADU in data for Firefox 9.

Also, as you mentioned 32bit vs. 64bit, https://crash-stats.mozilla.com/report/list?signature=nsDiskCacheMap%3A%3ADeleteRecord%28nsDiskCacheRecord*%29 seems to have mostly 32bit, but the odd 64bit crash is visible there as well. All the crashes are on Windows, though.
Comment 14 Bjarne (:bjarne) 2012-01-20 07:42:44 PST
(In reply to Bjarne (:bjarne) from comment #11)
> Michal: On my Linux-box this is handled somehow and I don't have the setup
> for Windows-builds. Could you try the patch I'll attach in a while on your
> setup and verify that the contained test crashes in an opt-build? Then,
> change |last| to be a signed int and see if it fixes the crash?

I went ahead and tried this on tryserver, but the opt-build does not crash as expected...  :(
Comment 15 Bjarne (:bjarne) 2012-01-20 14:23:27 PST
Still believe the conversion int -> unsigned 32bit int -> int is unsafe. Additionally, even if conversion happens correctly, an optimizing compiler might unroll the outer if-loop (line 490 in link from comment #11) because |last|, being an unsigned int, should always be >= 0. So, in this case the inner loop would be executed once even if |i| is -1, and a crash is likely.

Of-course, the proper fix for this is to avoid deleting the record twice. However, that fix is not as obvious as avoid using unsigned ints here in order to make this code more stable.
Comment 16 Bjarne (:bjarne) 2012-01-20 14:49:19 PST
Created attachment 590359 [details] [diff] [review]
Change to signed int

IMO it's worth to try this safe, simple change and see if it fixes this crash. If it doesn't then no harm should have been done.
Comment 17 Bjarne (:bjarne) 2012-01-23 06:50:31 PST
Created attachment 590702 [details] [diff] [review]
More complete fix which avoids deleting record twice

This can be applied alone or together with attachment #590359 [details] [diff] [review]

The basic idea is to make sure error-values are propagated properly from internal methods and up to a level where they can be handled. I've chosen nsCacheEntry as the point to deal with errors, but I'm open to discussions. (Note that the test in attachment #589849 [details] [diff] [review] need a try-catch to deal with this - I'll attach a new version shortly.)
Comment 18 Bjarne (:bjarne) 2012-01-23 06:51:39 PST
Created attachment 590704 [details] [diff] [review]
Test to trigger crash - modified to handle failure-returns
Comment 19 Bjarne (:bjarne) 2012-01-23 07:09:13 PST
Patch sent to tryserver

https://tbpl.mozilla.org/?tree=Try&rev=1f6764f8dfca

As a side-note: Bug #691108 seems to have disappeared since I started looking at this. I'm not aware of any fixes which may have fixed that. Anyone who knows..?
Comment 20 Michal Novotny (:michal) 2012-01-23 16:19:10 PST
Comment on attachment 590702 [details] [diff] [review]
More complete fix which avoids deleting record twice

(In reply to Bjarne (:bjarne) from comment #17)
> The basic idea is to make sure error-values are propagated properly from
> internal methods and up to a level where they can be handled. I've chosen
> nsCacheEntry as the point to deal with errors, but I'm open to discussions.

If we simply call nsCacheService::DoomEntry() in nsDiskCacheStreamIO::Flush() instead of propagating the error, then the only difference would be that we wouldn't stop early in nsCacheEntry::RemoveDescriptor(), right? Would it be a problem?
Comment 21 Michal Novotny (:michal) 2012-01-23 16:21:21 PST
(In reply to Bjarne (:bjarne) from comment #11)
> Michal: On my Linux-box this is handled somehow and I don't have the setup
> for Windows-builds. Could you try the patch I'll attach in a while on your
> setup and verify that the contained test crashes in an opt-build? Then,
> change |last| to be a signed int and see if it fixes the crash?

I've tried the patch on windows build but it didn't crash. Variable "last" is 0xffffffff and the for loop is not entered.
Comment 22 Michal Novotny (:michal) 2012-01-23 16:57:40 PST
Comment on attachment 590359 [details] [diff] [review]
Change to signed int

I'm not sure that we want this change. This would just suppress the crash without fixing the issue. Anyway, if we want to handle the bucket usage better, I'd prefer to make variable 'i' unsigned since mBucketUsage is also unsigned, i.e. something like this:

    PRUint32            last = mHeader.mBucketUsage[bucketIndex];
    for (PRUint32 i = last; i > 0; i--) {
        if (records[i-1].HashNumber() == hashNumber) {
Comment 23 Bjarne (:bjarne) 2012-01-24 02:14:28 PST
(In reply to Michal Novotny (:michal) from comment #20)
> Comment on attachment 590702 [details] [diff] [review]
> If we simply call nsCacheService::DoomEntry() in
> nsDiskCacheStreamIO::Flush() instead of propagating the error, then the only
> difference would be that we wouldn't stop early in
> nsCacheEntry::RemoveDescriptor(), right? Would it be a problem?

That would resolve this particular problem, yes, but I'd consider it more of a hack. I think it is a mistake in any case not to propagate errors. IMO, it is not the responsibility of the disk-stream to decide whether to doom a cache-entry. It should propagate errors and leave such decisions to a higher layer.
Comment 24 Bjarne (:bjarne) 2012-01-24 02:43:44 PST
(In reply to Michal Novotny (:michal) from comment #22)
> I'm not sure that we want this change. This would just suppress the crash
> without fixing the issue. 

Agreed, but it would make the code robust. The way it is now is just plain wrong.

> Anyway, if we want to handle the bucket usage
> better, I'd prefer to make variable 'i' unsigned since mBucketUsage is also
> unsigned, i.e. something like this:
> 
>     PRUint32            last = mHeader.mBucketUsage[bucketIndex];
>     for (PRUint32 i = last; i > 0; i--) {
>         if (records[i-1].HashNumber() == hashNumber) {

The point of this change is to avoid unsafe conversions. If you prefer to do it this way, that's ok with me although I don't see the rationale nor the benefit compared to the suggested way.

Also observe that this loop is found in many methods in this class, but without using |last| and rather subtracting 1 in the initial assignment to |i|. The reason for using |last| in |DeleteRecord| is that |last| is needed three other places in the loop. You would need to subtract 1 also in these three places, which is the reason it was done like this I guess.

I've provided two theories for why this particular line crashes (comment #11 and comment #15) - do you have any other theory for what can cause a crash here?
Comment 25 Bjarne (:bjarne) 2012-01-24 05:12:56 PST
(In reply to Bjarne (:bjarne) from comment #19)
> Patch sent to tryserver
> 
> https://tbpl.mozilla.org/?tree=Try&rev=1f6764f8dfca

Seems to be as green as a tryserver-run can be these days...

> As a side-note: Bug #691108 seems to have disappeared since I started
> looking at this.

.. and of-course bug #691108, comment #77 came in a few hours after this comment. But the frequency of these has at least dropped.
Comment 26 Michal Novotny (:michal) 2012-01-24 05:52:54 PST
Comment on attachment 590702 [details] [diff] [review]
More complete fix which avoids deleting record twice

> +    // Doom entry if something bad happens while closing. See bug #673543
> +    if (NS_FAILED(rv) && rv != NS_BASE_STREAM_WOULD_BLOCK) {
> +        nsCacheService::DoomEntry(this);
> +        return false; // it' doomed - deactivate it
> +    }

Dooming the entry doesn't remove any active descriptor (though there shouldn't be any other descriptor since we had opened output) or request. So you IMO can't return false here.


> +        // Doom entry if something bad happens while closing. See bug #673543
> +        nsresult rv = descriptor->CloseOutput();
> +        if (NS_FAILED(rv) && rv != NS_BASE_STREAM_WOULD_BLOCK)
> +            nsCacheService::DoomEntry(this);

Cache streams never return NS_BASE_STREAM_WOULD_BLOCK, see nsCacheEntryDescriptor::nsOutputStreamWrapper::IsNonBlocking() and nsCacheEntryDescriptor::nsInputStreamWrapper::IsNonBlocking(). But if we ever change it, we would need to doom the entry too, because the data wasn't flushed properly. Or am I wrong?


>      if (!mClosed) {
>          mClosed = true;
>          // tell parent streamIO we are closing
> -        mStreamIO->CloseOutputStream(this);
> +        return mStreamIO->CloseOutputStream(this);
>      }
>  
>      mozilla::Telemetry::ID id;
>      if (NS_IsMainThread())

This disables the telemetry. The same problem is in nsDiskCacheOutputStream::CloseInternal().


>      if (mBinding && mOutStream)
> -        Flush();
> +        // This method is only called from nsDiskCacheBinding-dtor
> +        return Flush();

Although this is true, I'd prefer to null out mBinding (at least in case of successful flush).
Comment 27 Bjarne (:bjarne) 2012-01-24 06:47:39 PST
(In reply to Michal Novotny (:michal) from comment #26)
> Dooming the entry doesn't remove any active descriptor (though there
> shouldn't be any other descriptor since we had opened output) or request. So
> you IMO can't return false here.

Agreed.

> Cache streams never return NS_BASE_STREAM_WOULD_BLOCK, see
> nsCacheEntryDescriptor::nsOutputStreamWrapper::IsNonBlocking() and
> nsCacheEntryDescriptor::nsInputStreamWrapper::IsNonBlocking(). But if we
> ever change it, we would need to doom the entry too, because the data wasn't
> flushed properly. Or am I wrong?

You are probably right...  the right action seems to actually be just return true in this case, something like this:

    nsresult rv = descriptor->CloseOutput();
    if (rv == NS_BASE_STREAM_WOULD_BLOCK)
        return true;
    descriptor->ClearCacheEntry();
    PR_REMOVE_AND_INIT_LINK(descriptor);
    // Doom entry if something bad happens while closing. See bug #673543
    if (NS_FAILED(rv))
        nsCacheService::DoomEntry(this);

> This disables the telemetry. The same problem is in
> nsDiskCacheOutputStream::CloseInternal().

Ahh - good catch. Will fix.

> Although this is true, I'd prefer to null out mBinding (at least in case of
> successful flush).

I could agree with this but I think it is a different issue and should not be done here. See my recent comments to bug #648232. (If I'm missing something and nulling is necessary for this patch, please point out why.)
Comment 28 Bjarne (:bjarne) 2012-01-24 06:58:48 PST
(In reply to Bjarne (:bjarne) from comment #27)
> I could agree with this but I think it is a different issue and should not
> be done here. See my recent comments to bug #648232. (If I'm missing
> something and nulling is necessary for this patch, please point out why.)

Hmm...  I missed your point here. Please forget this comment.
Comment 29 Bjarne (:bjarne) 2012-01-24 07:30:06 PST
Created attachment 591087 [details] [diff] [review]
More complete fix which avoids deleting record twice, v1.1

Addressed comments. Pushed to tryserver as https://tbpl.mozilla.org/?tree=Try&rev=6211c80cfb86
Comment 30 Michal Novotny (:michal) 2012-01-24 07:40:22 PST
First, I'd like to note that I have no theory. Your theory about "signed int -> unsigned 32b int -> signed int" conversion would make sense if int were 64 bit long. But it seems that all windows versions are I32LP64, i.e. int is always 32 bit.

> The point of this change is to avoid unsafe conversions. If you prefer to do
> it this way, that's ok with me although I don't see the rationale nor the
> benefit compared to the suggested way.

The benefit is that there is no unsigned -> signed conversion in my change. See section 6.3.1.3 paragraph 3 of the C99 standard:

6.3.1.3 Signed and unsigned integers

1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2) Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
Comment 31 Michal Novotny (:michal) 2012-01-24 07:59:19 PST
(In reply to Bjarne (:bjarne) from comment #27)
> You are probably right...  the right action seems to actually be just return
> true in this case, something like this:
> 
>     nsresult rv = descriptor->CloseOutput();
>     if (rv == NS_BASE_STREAM_WOULD_BLOCK)
>         return true;

Returning true will lead to an assertion in nsCacheEntryDescriptor::Close(). But I'm probably OK with it, because the cache streams are blocking and if we change it, we need to change the closing of entries too...
Comment 32 Bjarne (:bjarne) 2012-01-24 15:01:54 PST
(In reply to Michal Novotny (:michal) from comment #21)
> I've tried the patch on windows build but it didn't crash. Variable "last"
> is 0xffffffff and the for loop is not entered.

Yeah...  I guess it adds 2^N to -1 which should set all the bits when rhs is -1.

(In reply to Michal Novotny (:michal) from comment #30)
> First, I'd like to note that I have no theory. Your theory about "signed int
> -> unsigned 32b int -> signed int" conversion would make sense if int were
> 64 bit long. But it seems that all windows versions are I32LP64, i.e. int is
> always 32 bit.

I am really no expert modern compilers work so I'll accept this if you say so. (I also see that msdn states that MS-compilers preserve the bit-patterns in s -> u and u -> s conversions which confirms your comment #21 and should make this safe on win32.)

What about the unroll-idea? Could an optimizing modern compiler assume that a value assigned from an unsigned int will always be >= 0 and hence always execute the first iteration of the loop |for (int i = last; i >= 0; i--)|? (We did such things manually in the good old days - maybe compilers have learnt how to do it by now..?)

> The benefit is that there is no unsigned -> signed conversion in my change.

Well, it is a point to avoid the wrapping which seems to be allowed when subtracting unsigned ints. Another way to handle this can be to follow the secure-code guidelines from CERT

https://www.securecoding.cert.org/confluence/display/cplusplus/INT30-CPP.+Ensure+that+unsigned+integer+operations+do+not+wrap#INT30-CPP.Ensurethatunsignedintegeroperationsdonotwrap-Subtraction

The code we have is essentially the "noncompliant code example" and we could make it compliant like they suggest here.
Comment 34 Marco Bonardo [::mak] 2012-01-26 16:00:25 PST
https://hg.mozilla.org/mozilla-central/rev/f10313959cb0
Comment 36 Bjarne (:bjarne) 2012-02-06 03:20:47 PST
-> default owner
Comment 37 Wayne Mery (:wsmwk, NI for questions) 2012-03-10 12:19:50 PST
not only are there still crashes, the patch appears to have had zero impact
Comment 38 Alex Keybl [:akeybl] 2012-03-19 16:59:19 PDT
Top crasher #29 in FF11, and still apparent in FF12. Josh - can we get somebody to look at this?
Comment 39 Josh Aas 2012-03-20 08:35:36 PDT
Michal, can you look into this?
Comment 40 Sheila Mooney 2012-03-30 09:21:34 PDT
It's tracking for FF12 and we don't have any comments in 10 days. It's a top crash we shipped in 11 and it would be great to get it fixed in 12. It's still very high volume so the fix 1/26 didn't work.
Comment 41 Michal Novotny (:michal) 2012-04-02 05:48:45 PDT
I'm actively working on this. So far I have a test that reproduces an assertion from bug #691108. I believe that the cause of both bugs is the same. But the problem is that I need to run the test overnight to get the assertion and the logs are really huge, so it's not a matter of a short while to analyze them.
Comment 42 Michal Novotny (:michal) 2012-04-04 16:45:48 PDT
Unfortunately this bug isn't duplicate of #691108. Did anybody succeed to reproduce this crash? A NSPR log (nsHttp:5,cache:5) from a crashing session would be very helpful.
Comment 43 Michal Novotny (:michal) 2012-04-10 04:57:33 PDT
Created attachment 613553 [details] [diff] [review]
fix

In the end I think that the problem is similar to bug #691108. It could happen that a new cache entry is activated while we are waiting for pending events on the cache IO thread in nsCacheService::Shutdown() at http://hg.mozilla.org/mozilla-central/annotate/6fe5b0271cd1/netwerk/cache/nsCacheService.cpp#l1245 and in nsCacheService::OnProfileShutdown() at http://hg.mozilla.org/mozilla-central/annotate/6fe5b0271cd1/netwerk/cache/nsCacheService.cpp#l2183. In both cases we clear mActiveEntries before calling SyncWithCacheIOThread() and we expect that there is no active entry when SyncWithCacheIOThread() finishes. I'm not able to reproduce the crash and also I don't see exactly what scenario causes the crash, but using debugger I'm able to force a cache entry for a wyciwyg channel to be created while the main thread is blocked by SyncWithCacheIOThread(). Unfortunately the cache entry is also closed and removed from mActiveEntries before SyncWithCacheIOThread() finishes. But if I was able to reference the wyciwyg channel for a bit longer, the cache entry wouldn't be closed immediately and IMO it would lead to the crash that we see in this bug. Attached patch should fix this problem.
Comment 44 Jason Duell [:jduell] (needinfo me) 2012-04-11 12:33:30 PDT
Comment on attachment 613553 [details] [diff] [review]
fix

Review of attachment 613553 [details] [diff] [review]:
-----------------------------------------------------------------

I'm no expert in this code, but poking around I see that ActivateEntry can already fail commonly enough (ex: if hash collision) that I'm assuming ProcessRequest will behave well once we've added in the additonal failure mode (if we're clearing entries).
Comment 45 Michal Novotny (:michal) 2012-04-12 07:06:24 PDT
tryserver is green except few intermittent failures: https://tbpl.mozilla.org/?tree=Try&rev=9f4bbf96ea71
Comment 47 Marco Bonardo [::mak] 2012-04-13 04:01:41 PDT
https://hg.mozilla.org/mozilla-central/rev/38fcd9047d39
Comment 48 Alex Keybl [:akeybl] 2012-04-16 09:52:50 PDT
It's too late in FF12's cycle to fix this top crasher (first regressing in FF10) - please nominate for Aurora approval if you feel the fix is low risk.
Comment 49 Lukas Blakk [:lsblakk] use ?needinfo 2012-05-02 11:52:06 PDT
This is now in Aurora, adjusting flags accordingly.
Comment 50 Scoobidiver (away) 2012-05-23 08:29:21 PDT
It's #4 top crasher in 10.0.4 ESR.
Comment 51 Michal Novotny (:michal) 2012-06-24 08:52:10 PDT
Created attachment 636168 [details] [diff] [review]
unbitrotted for esr10

[Approval Request Comment]
User impact if declined: possible crash during FF shutdown
Fix Landed on Version: mozilla14
Risk to taking this patch (and alternatives if risky): probably none
String or UUID changes made by this patch: none
Comment 52 Lukas Blakk [:lsblakk] use ?needinfo 2012-06-28 16:18:05 PDT
Comment on attachment 636168 [details] [diff] [review]
unbitrotted for esr10

looks good, thanks for the updated patch.
Comment 53 Michal Novotny (:michal) 2012-07-07 00:42:00 PDT
http://hg.mozilla.org/releases/mozilla-esr10/rev/974fb8245eb5
Comment 54 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-07-10 08:28:11 PDT
Is there something I can do to test this is fixed on Firefox 10.0.6esrpre? I don't see that this was ever reliably reproducible, nor can I find a testcase for it.

Note You need to log in before you can comment on or make changes to this bug.