Closed Bug 163841 Opened 22 years ago Closed 22 years ago

deadlock when disk cache fills up [was: Mozilla hangs in PR_Lock at ptsynch.c:190]

Categories

(Core :: Networking: Cache, defect, P1)


Tracking


VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: Mitch, Assigned: darin.moz)

References

Details

(Keywords: hang)

Attachments

(1 file)

I had my first hang in several months today with yesterday's build
of Mozilla, 20020820.  A stack trace from the hung process yields:

(gdb) bt
#0  0xfea65e88 in __lwp_park ()
#1  0xfea61c10 in mutex_lock_queue ()
#2  0xfea62610 in slow_lock ()
#3  0xfefcfc78 in PR_Lock (lock=0xb26b98) at ptsynch.c:190
#4  0xfcb545f0 in nsAutoLock::nsAutoLock ()
#5  0xfcb448c8 in nsDiskCacheDevice::DoomEntry ()
#6  0xfcb3f024 in nsCacheService::DoomEntry_Internal ()
#7  0xfcb3ef80 in nsCacheService::DoomEntry ()
#8  0xfcb432fc in nsDiskCacheEvictor::VisitRecord ()
#9  0xfcb4734c in nsDiskCacheBucket::VisitEachRecord ()
#10 0xfcb4888c in nsDiskCacheMap::EvictRecords ()
#11 0xfcb45a24 in nsDiskCacheDevice::EvictDiskCacheEntries ()
#12 0xfcb44f98 in nsDiskCacheDevice::OnDataSizeChange ()
#13 0xfcb3fd44 in nsCacheService::OnDataSizeChange ()
#14 0xfcb38320 in nsCacheEntryDescriptor::RequestDataSizeChange ()
#15 0xfcb3a250 in nsCacheEntryDescriptor::nsOutputStreamWrapper::OnWrite ()
#16 0xfcb3a020 in nsCacheEntryDescriptor::nsOutputStreamWrapper::Write ()
#17 0xff0cc24c in nsInputStreamTee::TeeSegment ()
#18 0xff0cc360 in nsInputStreamTee::WriteSegmentFun ()
#19 0xff0d16d4 in nsPipe::nsPipeInputStream::ReadSegments ()
#20 0xff0cc8c0 in nsInputStreamTee::ReadSegments ()
#21 0xfcb14c50 in nsGIFDecoder2::WriteFrom ()
#22 0xfd681d00 in imgRequest::OnDataAvailable ()
#23 0xfd67e964 in ProxyListener::OnDataAvailable ()
#24 0xfd67f928 in httpValidateChecker::OnDataAvailable ()
#25 0xfe562ad4 in nsStreamListenerTee::OnDataAvailable ()
#26 0xfe5bdba0 in nsHttpChannel::OnDataAvailable ()
#27 0xfe561748 in nsOnDataAvailableEvent::HandleEvent ()
#28 0xfe5476b4 in nsARequestObserverEvent::HandlePLEvent ()
#29 0xff0f6e1c in PL_HandleEvent (self=0x18f7170) at plevent.c:596
#30 0xff0f6d38 in PL_ProcessPendingEvents (self=0x83808) at plevent.c:526
#31 0xff0f822c in nsEventQueueImpl::ProcessPendingEvents ()
#32 0xfe33154c in event_processor_callback ()
#33 0xfe3310a4 in Letext ()
#34 0xfeda6c10 in g_main_dispatch ()
#35 0xfeda7424 in g_main_iterate ()
#36 0xfeda7660 in g_main_run ()
#37 0xfee8fde8 in gtk_main ()
#38 0xfe331c38 in nsAppShell::Run ()
#39 0xfe3bbac4 in nsAppShellService::Run ()
#40 0x1f2c0 in main1 ()
#41 0x204f0 in main ()
(gdb) quit


Possibly an OS (Solaris) bug, but added to Bugzilla for future reference.
Will add comments if I can reproduce this with today's build (21st August).
The stack points to the cache.
Assignee: asa → gordon
Component: Browser-General → Networking: Cache
QA Contact: asa → tever
Severity: normal → critical
Keywords: hang
We're attempting to re-enter the disk cache device lock, and it looks like this
hang will occur whenever we reach the size limit on the disk cache!
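For context: NSPR locks are not reentrant, so a second PR_Lock on the same lock
from the same thread blocks forever.  A minimal standalone sketch of that
failure mode (plain NSPR, not the actual cache code):

  #include "prlock.h"

  int main()
  {
      PRLock *lock = PR_NewLock();
      PR_Lock(lock);        /* first acquisition succeeds */
      PR_Lock(lock);        /* same thread, same lock: blocks forever */
      PR_Unlock(lock);      /* never reached */
      PR_Unlock(lock);
      PR_DestroyLock(lock);
      return 0;
  }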
Status: UNCONFIRMED → NEW
Ever confirmed: true
Happened again with today's build, 20020822.

If Darin's analysis is correct, then I'm purposely not cleaning out my
disk cache so we have a reproducible testcase...

Unfortunately, according to the cache limit set in my Mozilla preferences, I
have 2000KB for the disk cache, and the cache folder size is currently

montgomery $ du -skh psl75l81.slt
 239K   psl75l81.slt

So it's way under the limit...

Today's stack from a "pstack" command on the hung process
shows (no C++ demangling, unfortunately):


9841:	/share/eu/build/insecure/tools/mozilla/mozilla_localsrc.2002082209/moz
-----------------  lwp# 1 / thread# 1  --------------------
 fea65e88 lwp_park (0, 0, 0)
 fea61c08 mutex_lock_queue (fea78b44, 0, e8a1f0, fea78000, ff0d1dd0, 0) + 104
 fea62608 slow_lock (e8a1f0, fec40000, ffbfd700, afab7c, 1f3400, 0) + 58
 fefcfc70 PR_Lock  (e8a1f0, e8a1f0, 0, 1753ea8, 1d44cb, b68) + 4
 fcb045e8 __10nsAutoLockP6PRLock (ffbfd700, e8a1f0, 7f00, fcaf4e58, 1d44cb, b68)
+ 30
 fcaf48c0 DoomEntry__17nsDiskCacheDeviceP12nsCacheEntry (afab70, 11593c0,
fcaf4888, 0, 1d44cb, b68) + 38
 fcaef01c DoomEntry_Internal__14nsCacheServiceP12nsCacheEntry (326220, 11593c0,
0, ffffffff, 4, 0) + 88
 fcaeef78 DoomEntry__14nsCacheServiceP12nsCacheEntry (11593c0, 17993c0,
80470000, b35, 33, ffbfdd04) + 28
 fcaf32f4 VisitRecord__18nsDiskCacheEvictorP17nsDiskCacheRecord (ffbfda88,
d99da4, fcaf3198, 0, fcae9ffc, ff0d1488) + 15c
 fcaf7344 VisitEachRecord__17nsDiskCacheBucketP24nsDiskCacheRecordVisitorUiPUi
(d99da4, ffbfda88, c29b488a, ffbfd988, ff0d1dd0, 0) + bc
 fcaf8884 EvictRecords__14nsDiskCacheMapP24nsDiskCacheRecordVisitor (d7bd60,
ffbfda88, d7bd60, afab7c, 1f3400, 0) + 138
 fcaf5a1c EvictDiskCacheEntries__17nsDiskCacheDevicei (afab70, 1f3400, ffbfdc30,
1753ea8, 1d44cb, b68) + 5c
 fcaf4f90 OnDataSizeChange__17nsDiskCacheDeviceP12nsCacheEntryi (afab70,
11593c0, b68, fcaf4e58, 1d44cb, b68) + 138
 fcaefd3c OnDataSizeChange__14nsCacheServiceP12nsCacheEntryi (11593c0, b68, 0,
13, 1d44cb, b68) + 84
 fcae8318 RequestDataSizeChange__22nsCacheEntryDescriptori (15fc210, b68, 0,
ffffffff, 4, 0) + 64
 fcaea248 OnWrite__Q222nsCacheEntryDescriptor21nsOutputStreamWrapperUi (150f058,
b68, 80470000, b35, 33, ffbfdd04) + 34
 fcaea018 Write__Q222nsCacheEntryDescriptor21nsOutputStreamWrapperPCcUiPUi
(150f058, 191c000, b68, ffbfdd98, fcae9ffc, ff0d1488) + 1c
 ff0cc244 TeeSegment__16nsInputStreamTeePCcUi (1753ea8, 191c000, b68, ffbfdf58,
ff0d1dd0, 0) + 9c
 ff0cc7d8 Read__16nsInputStreamTeePcUiPUi (1753ea8, 191c000, b68, ffbfdf58,
ff0cc6e4, 0) + f4
 fd92660c
OnDataAvailable__20nsExternalAppHandlerP10nsIRequestP11nsISupportsP14nsIInputStreamUiUi
(12dc980, 12bfb60, 0, 1753ea8, 1d44cb, b68) + 15c
 fd918f70
OnDataAvailable__18nsDocumentOpenInfoP10nsIRequestP11nsISupportsP14nsIInputStreamUiUi
(13a6030, 12bfb60, 0, 1753ea8, 1d44cb, b68) + 88
 fe562ac4
OnDataAvailable__19nsStreamListenerTeeP10nsIRequestP11nsISupportsP14nsIInputStreamUiUi
(13a6030, 12bfb60, 0, 11595d4, 1d44cb, b68) + 31c
 fe5bdb90
OnDataAvailable__13nsHttpChannelP10nsIRequestP11nsISupportsP14nsIInputStreamUiUi
(12bfb60, 11596f4, 0, 11595d4, 1d44cb, b68) + 20c
 fe561738 HandleEvent__22nsOnDataAvailableEvent (1150f60, fe5615a4, 4, 12bfb60,
11596f4, 0) + 194
 fe5476d4 HandlePLEvent__23nsARequestObserverEventP7PLEvent (1150f60, fe547688,
156ddb0, 0, 0, 0) + 4c
 ff0f6e14 PL_HandleEvent (1150f60, 765c0, 0, 0, 0, 0) + 14
 ff0f6d30 PL_ProcessPendingEvents (852e0, 0, 0, 0, 0, 0) + 60
 ff0f8224 ProcessPendingEvents__16nsEventQueueImpl (cd3d0, ff0f81e8, feb40604,
1, 0, 0) + 3c
 fe331598 event_processor_callback__FPvi17GdkInputCondition (cd3d0, 5, 1,
fe33154c, fedced8c, 0) + 4c
 fe3310f0 our_gdk_io_invoke__FP11_GIOChannel12GIOConditionPv (1bcdc0, 1, fbd68,
0, 78d00, ffbfe4b0) + 4c
 feda6c08 g_main_dispatch (0, fedcedb8, fedcedb4, fedcedbc, 0, 0) + 104
 feda741c g_main_iterate (1, 1, 0, fedced28, 1, feda4ea4) + 758
 feda7658 g_main_run (fbdc8, 0, 1, fe331d9c, 0, fe397c90) + 94
 fee8fde0 gtk_main (cd3d0, a2e80, fe3318b0, 0, fe394030, 0) + f8
 fe331c84 Run__10nsAppShell (a2e78, a2e78, fe331c0c, 0, 21ce8c, 21ce8c) + 78
 fe3bbabc Run__17nsAppShellService (21c200, 21c200, fe3bba7c, fe3f6120, a81c0,
a81c0) + 40
 0001f53c main1__FiPPcP11nsISupports (ffbfe7b0, ffbfe904, 69f68, fe8bf220,
81c28, 81c28) + f44
 0002076c main     (2, ffbfe904, ffbfe910, 695e0, 0, 0) + 244
 00018f10 _start   (0, 0, 0, 0, 0, 0) + 5c
-----------------  lwp# 2 / thread# 2  --------------------
 feb1ae68 poll     (fdb4fba8, 3, 88b8)
 fea5dd0c poll     (fdb4fba8, 3, 88b8, fefd1fa8, 3, fff6e608) + 5c
 fefd3cc8 _pr_poll_with_poll (919e8, 3, 3567e0, aaaaaaaa, 73e8962f, 0) + 348
 fefd3e3c PR_Poll  (919e8, 3, 3567e0, 0, 0, 0) + c
 fe55d254 Run__24nsSocketTransportService (91978, 91978, fe55d1c4, 0, 0, 0) + 90
 ff0fa624 Main__8nsThreadPv (26d788, ff0fa5b8, 0, 0, 0, 0) + 6c
 fefd50d4 _pt_root (91d58, 0, 0, 0, 0, 0) + 90
 fea65d48 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 3 / thread# 3  --------------------
 fea65e88 lwp_park (0, 0, 0)
 fea63138 cond_wait_queue (21b8c8, fea78b48, 0, 0, fec40400, fea78000) + d4
 fea638f4 cond_wait (21b8c8, 91cf8, 1, f, 819c550b, ff263248) + 14
 fea63930 pthread_cond_wait (21b8c8, 91cf8, fe6761dc, 0, fe558130, de8ce4) + 8
 fefcffa8 PR_WaitCondVar (21b8c0, ffffffff, fff80000, 0, 0, 0) + 64
 fe56d764 DequeuePendingQ__12nsDNSService (a8900, fe56ab20, fff80000, 0, 0, 0) + 44
 fe56ce58 Run__12nsDNSService (a8900, a8900, fe56ce20, 0, 0, 0) + 38
 ff0fa624 Main__8nsThreadPv (20d720, ff0fa5b8, 0, 0, 0, 0) + 6c
 fefd50d4 _pt_root (a99e0, 0, 0, 0, 0, 0) + 90
 fea65d48 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 4 / thread# 4  --------------------
 fea65e88 lwp_park (0, fd2ffc68, 0)
 fea63138 cond_wait_queue (17f200, fea78b48, 0, 0, fec40600, fea78000) + d4
 fea636b0 cond_wait_common (0, 23de58, fd2ffc68, 0, 0, 3d64b844) + 1d8
 fea63b40 _cond_timedwait (17f200, 23de58, fd2ffd98, 0, 8, fdb7e8) + 1f0
 fea63b74 cond_timedwait (17f200, 23de58, fd2ffd98, 186a0, 0, 0) + 18
 fea63bb4 pthread_cond_timedwait (17f200, 23de58, fd2ffd98, 2710, 0, fd2ffda0) + c
 fefcfdb8 pt_TimedWait (17f200, 23de58, c2dc, ff0f9f5c, ff0fdd74, 0) + b8
 fefcffc0 PR_WaitCondVar (17f1f8, c2dc, cd3d0, 85548, 0, 0) + 7c
 ff0ffc48 Run__11TimerThread (23dda0, 23dda0, ff0ffa7c, 0, 0, 0) + 1cc
 ff0fa624 Main__8nsThreadPv (93d38, ff0fa5b8, 0, 0, 0, 0) + 6c
 fefd50d4 _pt_root (23df18, 0, 0, 0, 0, 0) + 90
 fea65d48 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 15 / thread# 15  --------------------
 fea65e88 lwp_park (0, 0, 0)
 fea63138 cond_wait_queue (21b940, fea78b48, 0, 0, fec40800, fea78000) + d4
 fea638f4 cond_wait (21b940, a9f38, 1, f, fe547b64, 0) + 14
 fea63930 pthread_cond_wait (21b940, a9f38, 1, ff096364, fe561c1c, 18bcb40) + 8
 fefcffa8 PR_WaitCondVar (21b938, ffffffff, ff16b9c0, ff1c27f0, 18bcab0, 0) + 64
 ff0fc0f8 GetRequest__12nsThreadPoolP9nsIThread (a9980, 1318698, a9ef8, a9ef8,
a9980, ff1c2880) + 458
 ff0fcedc Run__20nsThreadPoolRunnable (1056690, 1056690, ff0fce5c, 0, 0, 0) + 80
 ff0fa624 Main__8nsThreadPv (1318698, ff0fa5b8, 0, 0, 0, 0) + 6c
 fefd50d4 _pt_root (163a698, 0, 0, 0, 0, 0) + 90
 fea65d48 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 11 / thread# 11  --------------------
 fea65e88 lwp_park (0, fcb9fb40, 0)
 fea63138 cond_wait_queue (df59b0, fea78b48, 0, 0, fec40a00, fea78000) + d4
 fea636b0 cond_wait_common (0, 9c33e0, fcb9fb40, 0, 0, 3d64b844) + 1d8
 fea63b40 _cond_timedwait (df59b0, 9c33e0, fcb9fc70, 0, 0, 0) + 1f0
 fea63b74 cond_timedwait (df59b0, 9c33e0, fcb9fc70, 186a0, 0, 0) + 18
 fea63bb4 pthread_cond_timedwait (df59b0, 9c33e0, fcb9fc70, 0, 0, fcb9fc78) + c
 fefcfdb8 pt_TimedWait (df59b0, 9c33e0, 0, 0, 0, 0) + b8
 fefcffc0 PR_WaitCondVar (df59a8, 186a0, 2, 7, 0, 0) + 7c
 fefd026c PR_Wait  (9c33d8, 186a0, ff0f81e8, fefee844, 76690, 0) + 18
 fbf6d8a0 ImapThreadMainLoop__14nsImapProtocol (9c2e90, 1, 9c2f08, ff0f9f5c, 0,
0) + 13c
 fbf6ca04 Run__14nsImapProtocol (85548, 9c2e90, fbf6c7e8, 0, 9c2e90, 0) + 21c
 ff0fa624 Main__8nsThreadPv (d77438, ff0fa5b8, 0, 0, 0, 0) + 6c
 fefd50d4 _pt_root (df2500, 0, 0, 0, 0, 0) + 90
 fea65d48 _lwp_start (0, 0, 0, 0, 0, 0)




*** Bug 163995 has been marked as a duplicate of this bug. ***
This hang occurs whenever the disk cache device tries to doom an entry.  It
calls nsCacheService::DoomEntry while holding the disk cache device's lock.  The
cache service does what it needs to do and then calls nsCacheDevice::DoomEntry,
which for the disk cache device tries to re-enter the disk cache lock!
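In outline, the cycle looks like this.  A self-contained sketch with
illustrative names (DiskCacheDevice, CacheService, and EvictEntries are
stand-ins, not the real cache classes):

  #include "prlock.h"

  struct DiskCacheDevice;

  struct CacheService {
      DiskCacheDevice *mDevice;
      void DoomEntry();            // routes the doom back to the device
  };

  struct DiskCacheDevice {
      PRLock *mLock;
      CacheService *mService;

      void EvictEntries() {
          PR_Lock(mLock);          // device lock acquired on the way in
          mService->DoomEntry();   // up into the cache service...
          PR_Unlock(mLock);
      }

      void DoomEntry() {
          PR_Lock(mLock);          // ...and back down: the same thread now
                                   // blocks on the lock it already holds
          PR_Unlock(mLock);
      }
  };

  void CacheService::DoomEntry() { mDevice->DoomEntry(); }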

-> me (since gordon is on vacation)
Assignee: gordon → darin
Priority: -- → P1
Target Milestone: --- → mozilla1.2alpha
dave: whenever you crash or hang, the disk cache will be cleaned out on the
next startup, so don't worry about preserving your disk cache.  I know exactly
what's causing this problem.
Attached patch v1 patch
This patch is a very conservative solution, but I believe it is the right thing
to do.  I'm removing the disk cache device lock altogether and using the cache
service lock to protect the disk cache device where necessary.  This simplifies
the locking _a lot_ and shouldn't hurt performance (since reads still don't
have to enter any locks).
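Roughly, the scheme after the patch looks like this (again a hypothetical
sketch with illustrative names, not the patch itself): the device owns no lock
of its own, and every entry point into the cache takes the single service lock,
so a callback from the service into a device never needs to lock again.

  #include "prlock.h"

  struct DiskCacheDevice {
      // No device lock: callers must already hold the service lock.
      void DoomEntryLocked() { /* doom the entry */ }
  };

  struct CacheService {
      PRLock *mLock;               // the one lock protecting all devices
      DiskCacheDevice *mDevice;

      void DoomEntry() {
          PR_Lock(mLock);
          mDevice->DoomEntryLocked();  // safe: no second lock to re-enter
          PR_Unlock(mLock);
      }
  };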
Platform => All from bug 163995.  Is this bug Unix-only?
Hardware: Sun → All
No, saw it on Windows NT and Mac OS 9 too. And bug 163995 was reported on Linux.
OS: SunOS → All
Blocks: 164367
Isn't this a bit coarse-grained? Won't this stop other cache providers from
being accessed separately?

The memcache doesn't have any sort of lock, though (why not?), so I guess
it's academic.
Yes, this probably isn't the best architecture, since writing to the disk cache
will now block all other access to the cache.  In practice, though, I don't
see this as a performance bottleneck: in most cases today, writing to the
cache happens only from the main/UI thread, so for the time being we really
don't have to worry about making the locking more fine-grained.
Comment on attachment 96363
v1 patch

r=dougt
Attachment #96363 - Flags: review+
Request that you get it functional first and then leave the bug open, or file a
new one, to address the locking granularity and the associated performance.
hanasaki: yup, that is definitely the plan.
fixed-on-trunk
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Summary: Mozilla hangs in PR_Lock at ptsynch.c:190 → deadlock when disk cache fills up [was: Mozilla hangs in PR_Lock at ptsynch.c:190]
*** Bug 146429 has been marked as a duplicate of this bug. ***
Verified on trunk - 10/23/02, WinNT4, Linux RH6, Mac OS X - no longer seeing
lockups when the cache is full.
Status: RESOLVED → VERIFIED