Closed Bug 784414 Opened 12 years ago Closed 12 years ago

Intermittent crash in xpcshell-unpack/test_hotfix.js [@ PR_Seek | nsDiskCacheMap::WriteCacheClean]

Categories

(Core :: Networking: Cache, defect)

x86_64
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla17

People

(Reporter: emorley, Assigned: bbondy)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(1 file)

Rev3 Fedora 12x64 mozilla-inbound pgo test xpcshell on 2012-08-21 07:30:51 PDT for push eb4392d9bc48

slave: talos-r3-fed64-023

https://tbpl.mozilla.org/php/getParsedLog.php?id=14563427&tree=Mozilla-Inbound

{
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell-unpack/head_addons.js | [null : 1269] false == false

TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell-unpack/head_addons.js | [null : 1272] false == false
*** LOG addons.manager: shutdown
*** LOG addons.xpi: shutdown
*** LOG addons.xpi-utils: shutdown
*** LOG addons.xpi-utils: Database closed

TEST-PASS | (xpcshell/head.js) | 140 (+ 0) check(s) passed

TEST-INFO | (xpcshell/head.js) | 0 check(s) todo
<<<<<<<
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-pgo/1345554102/firefox-17.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | /home/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell-unpack/test_hotfix.js | application crashed (minidump found)
Crash dump filename: /home/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell-unpack/0960b613-c79c-ecdf-650739c8-754853b2.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 8 (crashed)
 0  libnspr4.so!PR_Seek [priometh.c:eb4392d9bc48 : 119 + 0x0]
    rbx = 0x47539450   r12 = 0x80004005   r13 = 0x00000002   r14 = 0x475ac8a8
    r15 = 0x00000002   rip = 0x5781ed80   rsp = 0x466feb68   rbp = 0x466feb90
    Found by: given as instruction pointer in context
 1  libxul.so!nsDiskCacheMap::WriteCacheClean [nsDiskCacheMap.cpp:eb4392d9bc48 : 1281 + 0x8]
    rbx = 0x47539450   r12 = 0x80004005   r13 = 0x00000002   r14 = 0x475ac8a8
    r15 = 0x00000002   rip = 0x582980ad   rsp = 0x466feb70   rbp = 0x466feb90
    Found by: call frame info
 2  libxul.so!nsDiskCacheMap::RevalidateCache [nsDiskCacheMap.cpp:eb4392d9bc48 : 1388 + 0x4]
    rbx = 0x47539450   r12 = 0x58297faa   r13 = 0x00000002   r14 = 0x475ac8a8
    r15 = 0x00000002   rip = 0x58297a7c   rsp = 0x466feba0   rbp = 0x466febc0
    Found by: call frame info
 3  libxul.so!nsDiskCacheMap::RevalidateTimerCallback [nsDiskCacheMap.cpp:eb4392d9bc48 : 1355 + 0x7]
    rbx = 0x47539450   r12 = 0x58297faa   r13 = 0x00000002   r14 = 0x475ac8a8
    r15 = 0x00000002   rip = 0x58297ff8   rsp = 0x466febd0   rbp = 0x466febe0
    Found by: call frame info
 4  libxul.so!nsTimerImpl::Fire [nsTimerImpl.cpp:eb4392d9bc48 : 473 + 0x9]
    rbx = 0x475ac880   r12 = 0x58297faa   r13 = 0x00000002   r14 = 0x475ac8a8
    r15 = 0x00000002   rip = 0x58ec7811   rsp = 0x466febf0   rbp = 0x466fec40
    Found by: call frame info
 5  libxul.so!nsTimerEvent::Run [nsTimerImpl.cpp:eb4392d9bc48 : 556 + 0x7]
    rbx = 0x475ac880   r12 = 0x00000000   r13 = 0x5a42d970   r14 = 0x00000000
    r15 = 0x466fecb0   rip = 0x58ec78f5   rsp = 0x466fec50   rbp = 0x466fec60
    Found by: call frame info
 6  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:eb4392d9bc48 : 624 + 0x6]
    rbx = 0x48fa31a0   r12 = 0x00000000   r13 = 0x5a42d970   r14 = 0x00000000
    r15 = 0x466fecb0   rip = 0x58ec3eb6   rsp = 0x466fec70   rbp = 0x466fecf0
    Found by: call frame info
 7  libxul.so!NS_ProcessNextEvent_P [nsThreadUtils.cpp:eb4392d9bc48 : 220 + 0xd]
    rbx = 0x48fa31a0   r12 = 0x466fed50   r13 = 0x48fa31f0   r14 = 0x00000000
    r15 = 0x00000003   rip = 0x58e8ec0c   rsp = 0x466fed00   rbp = 0x466fed30
    Found by: call frame info
 8  libxul.so!nsThread::ThreadFunc [nsThread.cpp:eb4392d9bc48 : 257 + 0xc]
    rbx = 0x48fa31a0   r12 = 0x466fed50   r13 = 0x48fa31f0   r14 = 0x00000000
    r15 = 0x00000003   rip = 0x58ec4af8   rsp = 0x466fed40   rbp = 0x466fed80
    Found by: call frame info
 9  libnspr4.so!_pt_root [ptthread.c:eb4392d9bc48 : 156 + 0x6]
    rbx = 0x56512890   r12 = 0xb0fb0c10   r13 = 0x00000000   r14 = 0x00000000
    r15 = 0x00000003   rip = 0x57836064   rsp = 0x466fed90   rbp = 0x00000000
}
Assignee: nobody → netzen
Blocks: 777328
Attached patch Patch v1.Splinter Review
I think what's happening is that after a Close(), even know it is protected by the lock, and even know it calls Cancel() first before closing the file handle, I think the callback can sometimes still get through.
Attachment #653890 - Flags: review?(michal.novotny)
Probably it is waiting on the lock inside the callback while Close() is running.  Then Close() completes and it keeps running.  And now the cache clean file is set to nullptr causing a crash when you seek.
Comment on attachment 653890 [details] [diff] [review]
Patch v1.

I would much rather see a proper solution to not process RevalidateTimerCallback() when the timer was canceled.

Btw. I've found another potential problem in nsDiskCacheMap::RevalidateTimerCallback(). The disk cache map passed as arg to the timer can be already deleted in case the disk device was shutdown in nsCacheService::Shutdown() instead of nsCacheService::OnProfileShutdown(). We should first check if the disk device nsCacheService::gService->mDiskDevice still exists, then we could access the cache map as nsCacheService::gService->mDiskDevice.mCacheMap. And since we shouldn't use aClosure argument to pass the cache map to the timer, we can use it for the proper canceling.

We could have a counter in the cache map, that we increment every time we cancel the timer. We would pass the actual value of the counter as aClosure argument and in nsDiskCacheMap::RevalidateTimerCallback() we would compare it with the counter in the cache map. If it differs the timer was canceled.
Attachment #653890 - Flags: review?(michal.novotny) → review-
> I would much rather see a proper solution to not process 
> RevalidateTimerCallback() when the timer was canceled.

The problem is that the callback timer is already called, not that it will be called (I was wrong in comment 1, but correct in comment 2). 

We are inside the timer callback already, context switch happens, we do Close(), and then we go back to the timer callback thread and it waits on the lock.  Then Close() finishes, then back in our callback we proceed and obtain the lock.

In that case I think this is the proper solution, check if the file is already closed.


----

Re the other comments, I'll work on those.
Actually for the other comments since they aren't related to this exact crash, I'll post another bug. I don't want to hold up a good fix for another potential problem.  I'll still work on it right away though.
Comment on attachment 653890 [details] [diff] [review]
Patch v1.

Review of attachment 653890 [details] [diff] [review]:
-----------------------------------------------------------------

Resetting review based on Comment 4. I think this is the correct solution.
For the other comments I will implement those in another bug soon.
Attachment #653890 - Flags: review- → review?(michal.novotny)
Comment on attachment 653890 [details] [diff] [review]
Patch v1.

(In reply to Brian R. Bondy [:bbondy] from comment #4)
> The problem is that the callback timer is already called, not that it will
> be called (I was wrong in comment 1, but correct in comment 2). 

Sure, I know. But what is the difference? We cancel the timer to avoid performing actions in the timer callback. So the proper solution would be to check in the callback if the timer was canceled. Btw I think that the last diskCacheMap->ResetCacheTimer() in nsDiskCacheMap::RevalidateTimerCallback() should be also inside the lock.

Anyway, I'm fine with the current fix if we're going to change it in bug #784657.
Attachment #653890 - Flags: review?(michal.novotny) → review+
The crash isn't because it is canceled or not though, it is just because the handle gets set to nullptr in the Close() while the timer is waiting on the lock in the callback.  This patch will ensure that this specific crash won't happen though so as you mentioned I'll land it.  Thanks for the review!
 
> Btw I think that the last diskCacheMap->ResetCacheTimer() 
> in nsDiskCacheMap::RevalidateTimerCallback() should be also inside the lock.

Ya I moved it into the lock in bug 784657.
BTW if we get through bug 784657 today then I'll land these together.
https://hg.mozilla.org/mozilla-central/rev/281b36542acd
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: