Closed Bug 614513 Opened 14 years ago Closed 13 years ago

Crash in nsCacheService::DeactivateEntry during Linux Talos

Categories

(Core :: Networking: Cache, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -

People

(Reporter: mstange, Assigned: michal)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(3 files, 2 obsolete files)

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1290590523.1290592205.13810.gz
Rev3 Fedora 12 mozilla-central talos cold on 2010/11/24 01:22:03

Thread 11 (crashed)
 0  firefox-bin!free [jemalloc.c:836ab778ffc5 : 4271 + 0x0]
    eip = 0x0804edbf   esp = 0xb01fef60   ebp = 0xb01fef88   ebx = 0x08054858
    esi = 0x00000076   edi = 0xacd9f7c0   eax = 0x00000076   ecx = 0xacc23000
    edx = 0x00000000   efl = 0x00010246
    Found by: given as instruction pointer in context
 1  libmozalloc.so!moz_free [mozalloc.cpp:836ab778ffc5 : 92 + 0x8]
    eip = 0x00114d73   esp = 0xb01fef90   ebp = 0xacd6c580   ebx = 0x00115f50
    esi = 0xacd4cd08   edi = 0xacd9f7c0
    Found by: call frame info
 2  libxul.so!NS_Free_P [nsMemoryImpl.cpp:836ab778ffc5 : 235 + 0x8]
    eip = 0x019c9cb6   esp = 0xb01fefb0   ebp = 0xacd6c580   ebx = 0x021220f0
    esi = 0xacd4cd08   edi = 0xacd9f7c0
    Found by: call frame info
 3  libxul.so!ReleaseData [nsMemory.h : 74 + 0x8]
    eip = 0x019d8e11   esp = 0xb01fefd0   ebp = 0xacd6c580   ebx = 0x021220f0
    esi = 0xacd4cd08   edi = 0xacd9f7c0
    Found by: call frame info
 4  0xb772c0c3
    eip = 0xb772c0c4   esp = 0xb01fefe8   ebp = 0xacd6c580   ebx = 0x00000fff
    esi = 0xacd4cd08   edi = 0xacd9f7c0
    Found by: call frame info
 5  libxul.so!nsACString_internal::Finalize [nsTSubstring.cpp:836ab778ffc5 : 186 + 0x9]
    eip = 0x019d7e13   esp = 0xb01feff0   ebp = 0xacd6c580
    Found by: stack scanning
 6  libxul.so!nsCacheEntry::~nsCacheEntry [nsTSubstring.h : 113 + 0x6]
    eip = 0x00f0d1ee   esp = 0xb01ff010   ebp = 0xacd6c580
    Found by: call frame info
 7  libxul.so!nsDiskCacheDevice::DeactivateEntry [nsDiskCacheDevice.cpp:836ab778ffc5 : 530 + 0x9]
    eip = 0x00f14884   esp = 0xb01ff020   ebp = 0xacd6c580   ebx = 0x021220f0
    esi = 0xacd9f7c0   edi = 0xad0639e8
    Found by: call frame info
 8  libxul.so!nsCacheService::DeactivateEntry [nsCacheService.cpp:836ab778ffc5 : 2269 + 0x8]
    eip = 0x00f0fc41   esp = 0xb01ff050   ebp = 0x00000000   ebx = 0x021220f0
    esi = 0xacd9f7c0   edi = 0xacd3fc00
    Found by: call frame info
 9  libxul.so!nsCacheService::DoomEntry_Internal [nsCacheService.cpp:836ab778ffc5 : 1902 + 0x8]
    eip = 0x00f0fd9d   esp = 0xb01ff070   ebp = 0x00000000   ebx = 0x021220f0
    esi = 0xacd9f7c0   edi = 0xacd3fc00
    Found by: call frame info
10  libxul.so!nsCacheService::DoomEntry [nsCacheService.cpp:836ab778ffc5 : 1866 + 0x10]
    eip = 0x00f0fff2   esp = 0xb01ff090   ebp = 0xace85800   ebx = 0x021220f0
    esi = 0x80040111   edi = 0xacd4cc00
    Found by: call frame info
11  libxul.so!nsDiskCacheStreamIO::Flush [nsDiskCacheStreams.cpp:836ab778ffc5 : 531 + 0xd]
    eip = 0x00f17f06   esp = 0xb01ff0b0   ebp = 0xace85800   ebx = 0x021220f0
    esi = 0x80040111   edi = 0xacd4cc00
    Found by: call frame info
...
This started after bug 592422 landed.
Looking at the rest of the stack

12  libxul.so!nsDiskCacheStreamIO::CloseOutputStream [nsDiskCacheStreams.cpp:836ab778ffc5 : 460 + 0x8]
    eip = 0x00f17f66   esp = 0xb01ff0e0   ebp = 0xace85800   ebx = 0x021220f0
    esi = 0x8000ffff   edi = 0xacd4cc00
    Found by: call frame info
13  libxul.so!nsDiskCacheOutputStream::Close [nsDiskCacheStreams.cpp:836ab778ffc5 : 238 + 0xa]
    eip = 0x00f17fbd   esp = 0xb01ff100   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xacd2d0c0   edi = 0x00000000
    Found by: call frame info
14  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::Close [nsCacheEntryDescriptor.cpp:836ab778ffc5 : 642 + 0x8]
    eip = 0x00f0d4ce   esp = 0xb01ff120   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xacd2d0c0   edi = 0x00000000
    Found by: call frame info
15  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper [nsCacheEntryDescriptor.h:836ab778ffc5 : 140 + 0x5]
    eip = 0x00f0e26e   esp = 0xb01ff140   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xacd2d0c0   edi = 0x00000000
    Found by: call frame info
16  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::Release [nsCacheEntryDescriptor.cpp:836ab778ffc5 : 594 + 0x3]
    eip = 0x00f0d8bb   esp = 0xb01ff160   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xacd2d0c0   edi = 0x00000000
    Found by: call frame info
17  libxul.so!nsCOMPtr_base::assign_assuming_AddRef [nsCOMPtr.h:836ab778ffc5 : 479 + 0x8]
    eip = 0x01986128   esp = 0xb01ff180   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0x00000000   edi = 0xad0c0a38
    Found by: call frame info
18  libxul.so!nsCOMPtr_base::assign_with_AddRef [nsCOMPtr.cpp : 89 + 0x8]
    eip = 0x01985f7e   esp = 0xb01ff1a0   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0x00000000   edi = 0xad0c0a38
    Found by: call frame info
19  libxul.so!nsWyciwygChannel::CloseCacheEntryInternal [nsCOMPtr.h : 663 + 0x9]
    eip = 0x00f59cde   esp = 0xb01ff1c0   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xad0c09e0   edi = 0x00000000
    Found by: call frame info
20  libxul.so!nsWyciwygCloseEvent::Run [nsWyciwygChannel.cpp:836ab778ffc5 : 109 + 0xa]
    eip = 0x00f5abd6   esp = 0xb01ff1d0   ebp = 0xb01ff2c8   ebx = 0x021220f0
    esi = 0xace31880   edi = 0x00000000
    Found by: call frame info

I.e. the deactivation runs off main-thread (as a result of flushing the output-stream of a wysiwyg-channel - see frame 20). This is similar to the crash seen in bug #549767, comment #112 in the sense that deactivation happens off the main-thread, but similarities ends here IMO.

Looking at this further, note frame 11: It is a call to Doom() after WriteDataCacheBlocks() failed. Hence it *may* be related to bug #592422 although it's not clear to me how. I would rather expect this crash to be triggered by closing/flushing wysiwug-channels off main thread (I believe I saw something related to that not too long ago).

Deactivating an entry off the main-thread shouldn't be an issue though, so this is another indication that the book-keeping in the cache is not thread-safe.
Blocks: 438871
Whiteboard: [orange]
blocking2.0: --- → ?
There might be a variant of this crash in bug 615020, since they're both cache-related.
(In reply to comment #5)
> Looking at this further, note frame 11: It is a call to Doom() after
> WriteDataCacheBlocks() failed. Hence it *may* be related to bug #592422
> although it's not clear to me how. I would rather expect this crash to be
> triggered by closing/flushing wysiwug-channels off main thread (I believe I saw
> something related to that not too long ago).

Right, this seems to be related to bug #599127. But as you say, deactivating entry on a background thread should be safe. I'll have a look at it...
Assignee: nobody → michal.novotny
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1291253758.1291256642.8334.gz

May or may not be interesting in that it has two crashes, one the normal DeactivateEntry and another going through nsCacheService::Shutdown.
This looks like the same crash, but it's on ts rather than ts_cold:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291287379.1291287842.2048.gz
Michal, I have a theory why we sometimes deactivate the same entry twice:

Refer to remark about frame 11 in comment #5 - in nsDiskCacheStreamIO::Flush() we call DoomEntry() if WriteDataCacheBlocks() fails. In some cases (when entry is considered to be unused) nsCacheService::DoomEntry_Internal() calls DeactivateEntry() which eventually deletes the entry.

However, if the entry was deactivated for some reason before nsDiskCacheStreamIO::Flush() is called we end up deactivating twice. For this to happen WriteDataCacheBlocks() must fail (which *may* be related to bug #592422, ref comment #3). And the entry must have been deactivated while it was written, which could happen via nsCacheService::Shutdown() -> ClearActiveEntries() or nsCacheService::OnProfileShutdown() which calls DoomActiveEntries() and then ClearDoomList(). (There may also be other paths.)

(Note also that ClearActiveEntries() removes descriptors and requests from the entry, making it unused which will cause DoomEntry_Internal() to call DeactivateEntry() - ref second paragraph).

I see at least two different ways to deal with this:

1) check if entry is already deactivated in DeactivateEntry() and exit early
2) sync up with cache-io thread in Shutdown() and OnProfileShutdown() (like I'm suggesting to do in bug #549767) to let Flush() finish first.

Or maybe we should do both.
I think we need to block on getting our thread safety straight in the cache code, especially given that we now do much more off the main thread in the cache.
blocking2.0: ? → betaN+
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291409731.1291411392.9539.gz
Rev3 Fedora 12 mozilla-central talos cold [testfailed]

Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 10 (crashed)
 0  firefox-bin!free [jemalloc.c : 4271 + 0x0]
    eip = 0x0804edbf   esp = 0xb2e9ef60   ebp = 0xb2e9ef88   ebx = 0x08054858
    esi = 0x00000076   edi = 0xad723be0   eax = 0x00000076   ecx = 0xad701000
    edx = 0x00000000   efl = 0x00010246
    Found by: given as instruction pointer in context
 1  libmozalloc.so!moz_free [mozalloc.cpp : 92 + 0x8]
    eip = 0x00650d73   esp = 0xb2e9ef90   ebp = 0xad71c400   ebx = 0x00651f50
    esi = 0xad77d088   edi = 0xad723be0
    Found by: call frame info
 2  libxul.so!NS_Free_P [nsMemoryImpl.cpp : 235 + 0x8]
    eip = 0x019d2f76   esp = 0xb2e9efb0   ebp = 0xad71c400   ebx = 0x0212b630
    esi = 0xad77d088   edi = 0xad723be0
    Found by: call frame info
 3  libxul.so!ReleaseData [nsMemory.h : 74 + 0x8]
    eip = 0x019e2313   esp = 0xb2e9efd0   ebp = 0xad71c400   ebx = 0x0212b630
    esi = 0xad77d088   edi = 0xad723be0
    Found by: call frame info
 4  0xb77120c3
    eip = 0xb77120c4   esp = 0xb2e9efe8   ebp = 0xad71c400   ebx = 0x00000fff
    esi = 0xad77d088   edi = 0xad723be0
    Found by: call frame info
 5  libxul.so!nsACString_internal::Finalize [nsTSubstring.cpp : 187 + 0x9]
    eip = 0x019e12a1   esp = 0xb2e9eff0   ebp = 0xad71c400
    Found by: stack scanning
 6  libxul.so!nsCacheEntry::~nsCacheEntry [nsTSubstring.h : 113 + 0x6]
    eip = 0x00f0e22e   esp = 0xb2e9f010   ebp = 0xad71c400
    Found by: call frame info
 7  libxul.so!nsDiskCacheDevice::DeactivateEntry [nsDiskCacheDevice.cpp : 530 + 0x9]
    eip = 0x00f158c4   esp = 0xb2e9f020   ebp = 0xad71c400   ebx = 0x0212b630
    esi = 0xad723be0   edi = 0xad98dd28
    Found by: call frame info
 8  libxul.so!nsCacheService::DeactivateEntry [nsCacheService.cpp : 2269 + 0x8]
    eip = 0x00f10c81   esp = 0xb2e9f050   ebp = 0x00000000   ebx = 0x0212b630
    esi = 0xad723be0   edi = 0xad77f580
    Found by: call frame info
 9  libxul.so!nsCacheService::DoomEntry_Internal [nsCacheService.cpp : 1902 + 0x8]
    eip = 0x00f10ddd   esp = 0xb2e9f070   ebp = 0x00000000   ebx = 0x0212b630
    esi = 0xad723be0   edi = 0xad77f580
    Found by: call frame info
10  libxul.so!nsCacheService::DoomEntry [nsCacheService.cpp : 1866 + 0x10]
    eip = 0x00f11032   esp = 0xb2e9f090   ebp = 0xad71aad0   ebx = 0x0212b630
    esi = 0x80040111   edi = 0xad703f40
    Found by: call frame info
11  libxul.so!nsDiskCacheStreamIO::Flush [nsDiskCacheStreams.cpp : 532 + 0xd]
    eip = 0x00f18f94   esp = 0xb2e9f0b0   ebp = 0xad71aad0   ebx = 0x0212b630
    esi = 0x80040111   edi = 0xad703f40
    Found by: call frame info
12  libxul.so!nsDiskCacheStreamIO::CloseOutputStream [nsDiskCacheStreams.cpp : 461 + 0x8]
    eip = 0x00f18ff4   esp = 0xb2e9f0e0   ebp = 0xad71aad0   ebx = 0x0212b630
    esi = 0x8000ffff   edi = 0xad703f40
    Found by: call frame info
13  libxul.so!nsDiskCacheOutputStream::Close [nsDiskCacheStreams.cpp : 239 + 0xa]
    eip = 0x00f1904b   esp = 0xb2e9f100   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad71bc80   edi = 0x00000000
    Found by: call frame info
14  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::Close [nsCacheEntryDescriptor.cpp : 642 + 0x8]
    eip = 0x00f0e50e   esp = 0xb2e9f120   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad71bc80   edi = 0x00000000
    Found by: call frame info
15  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper [nsCacheEntryDescriptor.h : 140 + 0x5]
    eip = 0x00f0f2ae   esp = 0xb2e9f140   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad71bc80   edi = 0x00000000
    Found by: call frame info
16  libxul.so!nsCacheEntryDescriptor::nsOutputStreamWrapper::Release [nsCacheEntryDescriptor.cpp : 594 + 0x3]
    eip = 0x00f0e8fb   esp = 0xb2e9f160   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad71bc80   edi = 0x00000000
    Found by: call frame info
17  libxul.so!nsCOMPtr_base::assign_assuming_AddRef [nsCOMPtr.h : 479 + 0x8]
    eip = 0x0198f2e8   esp = 0xb2e9f180   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0x00000000   edi = 0xad9c4b88
    Found by: call frame info
18  libxul.so!nsCOMPtr_base::assign_with_AddRef [nsCOMPtr.cpp : 89 + 0x8]
    eip = 0x0198f13e   esp = 0xb2e9f1a0   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0x00000000   edi = 0xad9c4b88
    Found by: call frame info
19  libxul.so!nsWyciwygChannel::CloseCacheEntryInternal [nsCOMPtr.h : 663 + 0x9]
    eip = 0x00f5b396   esp = 0xb2e9f1c0   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad9c4b30   edi = 0x00000000
    Found by: call frame info
20  libxul.so!nsWyciwygCloseEvent::Run [nsWyciwygChannel.cpp : 109 + 0xa]
    eip = 0x00f5c28e   esp = 0xb2e9f1d0   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad723100   edi = 0x00000000
    Found by: call frame info
21  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp : 626 + 0xa]
    eip = 0x019ca8bd   esp = 0xb2e9f1f0   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xad723100   edi = 0x00000000
    Found by: call frame info
22  libxul.so!NS_ProcessNextEvent_P [nsThreadUtils.cpp : 250 + 0xf]
    eip = 0x019936e9   esp = 0xb2e9f240   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xb2e9f28c   edi = 0xad723100
    Found by: call frame info
23  libxul.so!nsThread::ThreadFunc [nsThread.cpp : 277 + 0x9]
    eip = 0x019cb15e   esp = 0xb2e9f270   ebp = 0xb2e9f2c8   ebx = 0x0212b630
    esi = 0xb2e9f28c   edi = 0xad723100
    Found by: call frame info
24  libnspr4.so!_pt_root [ptthread.c : 187 + 0x8]
    eip = 0x00aac251   esp = 0xb2e9f2a0   ebp = 0xb2e9f2c8   ebx = 0x00ab554c
    esi = 0xad835c10   edi = 0x00000000
    Found by: call frame info
25  libpthread-2.11.so + 0x5ab4
    eip = 0x00c6bab5   esp = 0xb2e9f2d0   ebp = 0xb2e9f3c8   ebx = 0x00c7cff4
    esi = 0xb2e9fb70   edi = 0x003d0f00
    Found by: call frame info
26  libc-2.11.so + 0xda83d
    eip = 0x00bc283e   esp = 0xb2e9f3d0   ebp = 0x00000000
    Found by: previous frame's frame pointer

Thread 0
 0  linux-gate.so + 0x424
    eip = 0x00967424   esp = 0xbfd5ae24   ebp = 0xbfd5ae58   ebx = 0xaf41f4c8
    esi = 0x00000000   edi = 0x00000000   eax = 0xfffffe00   ecx = 0x00000080
    edx = 0x00000002   efl = 0x00000202
    Found by: given as instruction pointer in context
 1  libmozsqlite3.so!pthreadMutexEnter [sqlite3.c : 16637 + 0xa]
    eip = 0x0042d27d   esp = 0xbfd5ae60   ebp = 0xbfd5ae68
    Found by: previous frame's frame pointer
 2  libmozsqlite3.so!sqlite3_mutex_enter [sqlite3.c : 15877 + 0x8]
    eip = 0x003e45a2   esp = 0xbfd5ae70   ebp = 0xbfd5ae78   ebx = 0x00465c84
    Found by: call frame info
 3  libmozsqlite3.so!sqlite3_finalize [sqlite3.c : 57620 + 0x7]
    eip = 0x0041b93c   esp = 0xbfd5ae80   ebp = 0xbfd5aea8   ebx = 0x00465c84
    Found by: call frame info
 4  libxul.so!mozilla::storage::Statement::internalFinalize [mozStorageStatement.cpp : 400 + 0x8]
    eip = 0x0179a864   esp = 0xbfd5aeb0   ebp = 0xb30d3980   ebx = 0x0212b630
    esi = 0x01e23f2b   edi = 0xbfd5af34
    Found by: call frame info
...

Failed ts_cold_generated_med: 
                Stopped Fri, 03 Dec 2010 13:22:48
FAIL: Busted: ts_cold_generated_med
FAIL: crash during run (stack found)
And yet another way to crash while closing a nsWyciwygChannel, because you can never have too many:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291704028.1291704699.1633.gz

Thread 17 (crashed)
 0  xul.dll!nsCacheService::ProcessPendingRequests(nsCacheEntry *) [nsCacheService.cpp:37b29506a7d4 : 2336 + 0x0]
    eip = 0x6baa9ccd   esp = 0x09d5f96c   ebp = 0x09d5f98c   ebx = 0x03b44930
    esi = 0x00000000   edi = 0x00000000   eax = 0x00000001   ecx = 0x080e2880
    edx = 0x00000037   efl = 0x00010246
    Found by: given as instruction pointer in context
 1  xul.dll!nsCacheService::DoomEntry_Internal(nsCacheEntry *,int) [nsCacheService.cpp:37b29506a7d4 : 1898 + 0x7]
    eip = 0x6c0be4c2   esp = 0x09d5f994   ebp = 0x80040111
    Found by: previous frame's frame pointer
 2  xul.dll!nsCacheService::DoomEntry(nsCacheEntry *) [nsCacheService.cpp:37b29506a7d4 : 1866 + 0x10]
    eip = 0x6bf4bbd4   esp = 0x09d5f9a8   ebp = 0x80040111
    Found by: call frame info with scanning
 3  xul.dll!nsDiskCacheStreamIO::Flush() + 0x3b
    eip = 0x6bccb532   esp = 0x09d5f9b4   ebp = 0x80040111
    Found by: call frame info
 4  xul.dll!nsDiskCacheStreamIO::CloseOutputStream(nsDiskCacheOutputStream *) [nsDiskCacheStreams.cpp:37b29506a7d4 : 461 + 0x6]
    eip = 0x6b8e8ac9   esp = 0x09d5f9d0   ebp = 0x80040111
    Found by: stack scanning
 5  xul.dll!nsDiskCacheOutputStream::Close() [nsDiskCacheStreams.cpp:37b29506a7d4 : 239 + 0x8]
    eip = 0x6b8e8a93   esp = 0x09d5f9dc   ebp = 0x80040111
    Found by: call frame info with scanning
 6  xul.dll!nsCacheEntryDescriptor::nsOutputStreamWrapper::`vector deleting destructor'(unsigned int) + 0x21
    eip = 0x6ba8b067   esp = 0x09d5f9e4   ebp = 0x80040111
    Found by: call frame info
 7  xul.dll!nsCacheEntryDescriptor::nsOutputStreamWrapper::Release() [nsCacheEntryDescriptor.cpp:37b29506a7d4 : 595 + 0x22]
    eip = 0x6b8fbdf3   esp = 0x09d5f9f0   ebp = 0x80040111
    Found by: call frame info with scanning
 8  xul.dll!nsCOMPtr_base::assign_with_AddRef(nsISupports *) [nsCOMPtr.cpp:37b29506a7d4 : 89 + 0x9]
    eip = 0x6b9c9936   esp = 0x09d5f9f8   ebp = 0x80040111
    Found by: call frame info
 9  xul.dll!nsWyciwygChannel::CloseCacheEntryInternal(unsigned int) [nsWyciwygChannel.cpp:37b29506a7d4 : 481 + 0xa]
    eip = 0x6bfc35c7   esp = 0x09d5fa00   ebp = 0x80040111
    Found by: call frame info
For the record, Michal says he's been able to reproduce this and will likely have a fix in a couple of days.
Blocks: 620266
No longer blocks: 620266
I've filed bug 620642 on the Windows crash with the different stack from comment 51 which has just happened again.
Summary: Crash in nsCacheService::DeactivateEntry during Linux Talos ts_cold → Crash in nsCacheService::DeactivateEntry during Linux Talos ts_cold and ts_cold_generated_min
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1292997361.1292997717.4799.gz
Summary: Crash in nsCacheService::DeactivateEntry during Linux Talos ts_cold and ts_cold_generated_min → Crash in nsCacheService::DeactivateEntry during Linux Talos
Michael, this seems common enough that it can be caught in recording. Have you tried to do this yet?
Fwiw, see bug #620660, comment #0 second paragraph.
Attached patch patch v1 (obsolete) — Splinter Review
The problem is in the way we forcibly deactivate entries during shutdown. We call DetachDescriptors() (which removes all descriptors but input/output can stay open) and then DeactivateEntry() (which deletes the entry). Deletion of the binding is posted to the appropriate thread and it happens when the entry is already deleted. The binding releases the reference to streamIO which causes writing to and dooming the released entry:

#0  nsCacheService::DoomEntry (entry=0xeb025220) at /opt/moz/hg/netwerk/cache/nsCacheService.cpp:1866
#1  0x0138e07c in nsDiskCacheStreamIO::Flush (this=0xee920680) at /opt/moz/hg/netwerk/cache/nsDiskCacheStreams.cpp:532
#2  0x0138dc43 in nsDiskCacheStreamIO::ClearBinding (this=0xee920680) at /opt/moz/hg/netwerk/cache/nsDiskCacheStreams.cpp:438
#3  0x013838c8 in nsDiskCacheBinding::~nsDiskCacheBinding (this=0xee920640, __in_chrg=<value optimized out>)
    at /opt/moz/hg/netwerk/cache/nsDiskCacheBinding.cpp:136
#4  0x01383927 in nsDiskCacheBinding::~nsDiskCacheBinding (this=0xee920640, __in_chrg=<value optimized out>)
    at /opt/moz/hg/netwerk/cache/nsDiskCacheBinding.cpp:139
#5  0x01383686 in nsDiskCacheBinding::Release (this=0xee920640) at /opt/moz/hg/netwerk/cache/nsDiskCacheBinding.cpp:115
#6  0x0282bbd6 in nsProxyReleaseEvent::Run (this=0xeb067100) at nsProxyRelease.cpp:52
#7  0x02896c8d in nsThread::ProcessNextEvent (this=0xee8c6820, mayWait=1, result=0xee6ff21c) at /opt/moz/hg/xpcom/threads/nsThread.cpp:626
#8  0x02827275 in NS_ProcessNextEvent_P (thread=0xee8c6820, mayWait=1) at nsThreadUtils.cpp:250
#9  0x02895d5e in nsThread::ThreadFunc (arg=0xee8c6820) at /opt/moz/hg/xpcom/threads/nsThread.cpp:278
#10 0x00359757 in _pt_root (arg=0xee88a240) at /opt/moz/hg/nsprpub/pr/src/pthreads/ptthread.c:187
#11 0x0026b925 in start_thread (arg=0xee6ffb70) at pthread_create.c:297
#12 0x07b0407e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

The stack with wyciwyg channel is an alternate crash but the problem is the same.

This patch forces closing of the output when we call DetachDescriptors(). The tricky and ugly part of the patch is the nsIDiskCacheStreamInternal interface. nsDiskCacheOutputStream::Close() grabs the lock and we need to call it under the lock, so to prevent reentering the lock I had to create a new method.

I pushed the patch to the try server and there are few failures that seem to be unrelated.
Attachment #500025 - Flags: review?(bzbarsky)
Please note (see comment #141) that with the patch for bug #620660 I have not been able to reproduce this issue at all. I'd say we get 620660 reviewed and landed first, then see if it improves the situation here. Michal is in fact the reviewer there so coordination should be simple.
Comment on attachment 500025 [details] [diff] [review]
patch v1

Is CloseOutput really a callback for the cache service?  If not, it should probably not be grouped with the things that are.

r=me with that nit.
Attachment #500025 - Flags: review?(bzbarsky) → review+
Attached patch patch v2Splinter Review
Attachment #500025 - Attachment is obsolete: true
Attachment #500132 - Flags: superreview?(cbiesinger)
(In reply to comment #156)
> Please note (see comment #141) that with the patch for bug #620660 I have not
> been able to reproduce this issue at all. I'd say we get 620660 reviewed and
> landed first, then see if it improves the situation here. Michal is in fact the
> reviewer there so coordination should be simple.

I hope I'll review it tomorrow. Anyway, with that patch we still call Flush() and Doom() for already released entry. The fact that it doesn't crash is just a luck by chance...
Ahh - of-course: Flush() (and the following Doom()) is not called on the cache-io thread, hence draining it doesn't help in this case. Is it the STS-thread? I guess waiting for the STS-thread to drain is not a proper way to deal with this...

I do have a funny feeling about this approach though. IMO, at the very least it should make clear that DetachDescriptors() has the side-effect of closing the output-stream (e.g. by renaming the method, or at least documenting properly).

To be constructive: Why not rather take action in nsDiskCacheDevice::DeactivateEntry()? Afaics this problem only applies to the disk-device, and DeactivateEntry() is called from the same places as DetachDescriptors(). From nsDiskCacheDevice::DeactivateEntry() you have access to the nsDiskCacheStreamIO object via the binding. Moreover, calling nsDiskCacheStreamIO::ClearBinding() may be sufficient to prevent this problem because if calls Flush() and then clears |mBinding|, preventing further writes to the outputstream.
Attached patch Alternative approach (obsolete) — Splinter Review
Following up on the hint in comment #160, this patch gave me the greenest tryserver-run in a long time (says a lot about my other patches I guess, but nevertheless...) - 08e2016af8f3 on tryserver. Michals procedure for reproducing do not crash with this patch applied.

The code-block added to nsDiskCacheDevice::DeactivateEntry() is also found in dtor for nsDiskCacheBinding and could probably be stuffed in a method named nsDiskCacheBinding::DetachIoStreams() or similar. I also removed the redundant call to UpdateRecord() in Flush() (as indicated in the comment) but this change has no relevance to this problem and could be removed.
Attachment #500389 - Flags: review?(michal.novotny)
Whiteboard: [orange] → [orange], hardblocker
Comment on attachment 500389 [details] [diff] [review]
Alternative approach

(In reply to comment #160)
> Ahh - of-course: Flush() (and the following Doom()) is not called on the
> cache-io thread, hence draining it doesn't help in this case. Is it the
> STS-thread? I guess waiting for the STS-thread to drain is not a proper way to
> deal with this...

It depends on which thread the entry was bound.


>      NS_ASSERTION(mBinding, "oops");
> +    if (!mBinding)  return NS_ERROR_NOT_AVAILABLE;

The assertion must be removed.


> I do have a funny feeling about this approach though. IMO, at the very least it
> should make clear that DetachDescriptors() has the side-effect of closing the
> output-stream (e.g. by renaming the method, or at least documenting properly).

DetachDescriptors() is called only in situations when it is expected that the output should be closed.


> To be constructive: Why not rather take action in
> nsDiskCacheDevice::DeactivateEntry()? Afaics this problem only applies to the
> disk-device, and DeactivateEntry() is called from the same places as
> DetachDescriptors().

Yes, FF seems to crash only in case of disk cache, but IMO it makes sense to always close the output.
Attachment #500389 - Flags: review?(michal.novotny)
Maybe something like this? Can be extended so that each cache-device close the output-streams in the most convenient way.
Attachment #500389 - Attachment is obsolete: true
Attachment #501491 - Flags: review?(michal.novotny)
Whiteboard: [orange], hardblocker → [orange][hardblocker]
Comment on attachment 500132 [details] [diff] [review]
patch v2

bz already reviewed this, no need for sr here.
Attachment #500132 - Flags: superreview?(cbiesinger)
Comment on attachment 500132 [details] [diff] [review]
patch v2

... and carrying bz's review forward.
Attachment #500132 - Flags: review+
Per discussion with Michal, and in the interest of getting rid of this random orange and landing this blocker bug fix sooner rather than later, I landed this today. Michal believes his patch gives us a more consistent approach here, but if that for some reason doesn't work out he'd be happy to take Bjarne's alternate fix here. But until then, this is fixed.

http://hg.mozilla.org/mozilla-central/rev/82ac194b7537
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1294795415.1294796285.16360.gz
Rev3 Fedora 12 mozilla-central talos dirty on 2011/01/11 17:23:35 

Well after that push

Thread 11 (crashed)
 0  firefox-bin!free [jemalloc.c : 4271 + 0x0]
    eip = 0x0804edbf   esp = 0xad3fef40   ebp = 0xad3fef68   ebx = 0x08054858
    esi = 0x00000076   edi = 0xad52c0a0   eax = 0x00000076   ecx = 0xaece5000
    edx = 0x00000000   efl = 0x00010246
    Found by: given as instruction pointer in context
 1  libmozalloc.so!moz_free [mozalloc.cpp : 92 + 0x8]
    eip = 0x00151d73   esp = 0xad3fef70   ebp = 0xad511430   ebx = 0x00152f50
    esi = 0xad527408   edi = 0xad52c0a0
    Found by: call frame info
 2  libxul.so!NS_Free_P [nsMemoryImpl.cpp : 235 + 0x8]
    eip = 0x019dff62   esp = 0xad3fef90   ebp = 0xad511430   ebx = 0x02149804
    esi = 0xad527408   edi = 0xad52c0a0
    Found by: call frame info
 3  libxul.so!ReleaseData [nsMemory.h : 74 + 0x8]
    eip = 0x019ef4c3   esp = 0xad3fefb0   ebp = 0xad511430   ebx = 0x02149804
    esi = 0xad527408   edi = 0xad52c0a0
    Found by: call frame info
 4  0xccd4e9a1
    eip = 0xccd4e9a2   esp = 0xad3fefc8   ebp = 0xad511430   ebx = 0xad52d168
    esi = 0xad527408   edi = 0xad52c0a0
    Found by: call frame info
 5  libxul.so!nsACString_internal::Finalize [nsTSubstring.cpp : 187 + 0x9]
    eip = 0x019ee451   esp = 0xad3fefd0   ebp = 0xad511430
    Found by: stack scanning
 6  libxul.so!nsCacheEntry::~nsCacheEntry [nsTSubstring.h : 113 + 0x6]
    eip = 0x00f0ffd6   esp = 0xad3feff0   ebp = 0xad511430
    Found by: call frame info
 7  libxul.so!nsDiskCacheDevice::DeactivateEntry [nsDiskCacheDevice.cpp : 530 + 0x9]
    eip = 0x00f176a2   esp = 0xad3ff000   ebp = 0xad511430   ebx = 0x02149804
    esi = 0xad52c0a0   edi = 0xaff2ef08
    Found by: call frame info
 8  libxul.so!nsCacheService::DeactivateEntry [nsCacheService.cpp : 2269 + 0x8]
    eip = 0x00f12a6d   esp = 0xad3ff030   ebp = 0x00000000   ebx = 0x02149804
    esi = 0xad52c0a0   edi = 0xaef72700
    Found by: call frame info
 9  libxul.so!nsCacheService::DoomEntry_Internal [nsCacheService.cpp : 1902 + 0x8]
    eip = 0x00f12bc9   esp = 0xad3ff050   ebp = 0x00000000   ebx = 0x02149804
    esi = 0xad52c0a0   edi = 0xaef72700
    Found by: call frame info
10  libxul.so!nsCacheService::DoomEntry [nsCacheService.cpp : 1866 + 0x10]
    eip = 0x00f12e1e   esp = 0xad3ff070   ebp = 0xad3ff2c8   ebx = 0x02149804
    esi = 0x80040111   edi = 0xad527280
    Found by: call frame info
11  libxul.so!nsDiskCacheStreamIO::Flush [nsDiskCacheStreams.cpp : 554 + 0xd]
    eip = 0x00f1b03c   esp = 0xad3ff090   ebp = 0xad3ff2c8   ebx = 0x02149804
    esi = 0x80040111   edi = 0xad527280
    Found by: call frame info
12  libxul.so!nsDiskCacheStreamIO::CloseOutputStreamInternal [nsDiskCacheStreams.cpp : 483 + 0x8]
    eip = 0x00f1b092   esp = 0xad3ff0c0   ebp = 0xad3ff2c8   ebx = 0x02149804
    esi = 0xad527280   edi = 0xad529100
    Found by: call frame info
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 501491 [details] [diff] [review]
Alternative approach, v2

Fwiw, I think bug #620660 should be landed before we can state too much about these random oranges related to the cache. Current code randomly executes cache-related operations *after* internal datastructures of the disk cache has been freed, clearly carrying the potential for weird problems. The patch has been waiting for review for quite some time so I guess it will happen any day now.

Since this bug has been reopened I also take the liberty to change reviewer on my alternative (and IMO more light-handed) patch. Bz already reviewed the landed patch and may be able to do a quick one for the alternative as well since he knows the problem-area.
Attachment #501491 - Flags: review?(michal.novotny) → review?(bzbarsky)
(In reply to comment #166)
> Created attachment 501491 [details] [diff] [review]
> Alternative approach, v2
> 
> Maybe something like this? Can be extended so that each cache-device close the
> output-streams in the most convenient way.

Are you sure that other devices can be extended easily? We don't have access to the output stream from the device. So IMO this is still a disk-cache solution only.
With this approach it is up to the device to ensure it has enough information to "close" its IO. In the disk-cache case, nsDiskCacheBinding holds a ref to the nsDiskCacheStreamIO-object, which is sufficient to "shut down" (or rather: "prevent further") IO for the disk-cache. The offline-device would have to add appropriate info to nsOfflineCacheBinding, then implement its own version of DetachStreamIO() to use this info to close IO. The memory-device might not need this, hence it can avoid implementing DetachStreamIO().

It is pretty much what you also do, except that the necessary info is per device and not kept in nsCacheEntryDescriptor. We avoid the new interface and the explicit QI. Moreover, with this approach a device can do more than just close the outputstream for an entry, if this is necessary or desirable for whatever reason.
Comment on attachment 501491 [details] [diff] [review]
Alternative approach, v2

This seems fine, though I think you overestimate my familiarity with this code... ;)
Attachment #501491 - Flags: review?(bzbarsky) → review+
Previous patch closes output only when detaching descriptors during shutdown. In case the cache entry is closed explicitly with Close() instead of just releasing reference, the entry is deactivated without closing the output and could crash in certain situations. So we need to close the output whenever we remove the descriptor.

On the other hand I don't think that this was the cause of the latest crashes since I haven't find any place where we would call Close() directly.
Attachment #507477 - Flags: review?(bzbarsky)
Comment on attachment 507477 [details] [diff] [review]
patch v3 - close output in RemoveDescriptor() too

r=me
Attachment #507477 - Flags: review?(bzbarsky) → review+
Note that since bug #620660 landed on 2011-01-20 13:44:27 PST we have not seen this crash.
Patch v3 landed:

http://hg.mozilla.org/mozilla-central/rev/4215d5d9bc90

I'm going to go ahead and unblock on this, but I'll leave it open for a while to check if Bjarne's belief in the previous comment holds true!
blocking2.0: betaN+ → -
Whiteboard: [orange][hardblocker] → [orange]
(In reply to comment #185)
> to check if Bjarne's belief in the previous comment holds true!

It's not a belief - it's an observation... ;)
A year later I think it's safe to say this was fixed.
Status: REOPENED → RESOLVED
Closed: 14 years ago13 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: