Closed Bug 1249811 Opened 9 years ago Closed 9 years ago

crashes [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait ] in mozilla::net::CacheFileIOManager::Shutdown()

Categories

(Core :: Networking: Cache, defect)

defect
Not set
critical

Tracking


RESOLVED DUPLICATE of bug 1247432
Tracking Status
firefox47 --- affected

People

(Reporter: dbaron, Unassigned)

Details

(Keywords: crash, topcrash, Whiteboard: [necko-active])

Crash Data

One of the top crashes on release in Firefox 44 (currently #2 on the topcrash list) is shutdown hangs at:
[@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait ]
https://crash-stats.mozilla.com/signature/?product=Firefox&version=44.0.2&date=%3C2016-02-19T22%3A34%3A21&date=%3E%3D2016-02-12T22%3A34%3A21&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+mozilla%3A%3ACondVar%3A%3AWait&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports

A sample of the 4 most recent such crashes shows that for all 4, the next frame in the stack is:
mozilla::net::CacheFileIOManager::Shutdown()
(I filed bug 1249805 on Socorro.)

This may be related to bug 1029213.

Taking the most recent crash:
https://crash-stats.mozilla.com/report/index/f4773486-0113-43fe-aa4b-baa3d2160213#allthreads

The main thread is doing:
0 ntdll.dll NtWaitForSingleObject
1 kernelbase.dll WaitForSingleObjectEx
2 kernelbase.dll WaitForSingleObject
3 nss3.dll PR_WaitCondVar nsprpub/pr/src/threads/combined/prucv.c
4 xul.dll mozilla::CondVar::Wait(unsigned int) xpcom/glue/CondVar.h
5 xul.dll mozilla::net::CacheFileIOManager::Shutdown() netwerk/cache2/CacheFileIOManager.cpp
6 xul.dll mozilla::net::CacheObserver::Observe(nsISupports*, char const*, wchar_t const*)

and the cache IO thread is doing:
0 ntdll.dll NtCreateFile
1 KERNELBASE.dll CreateFileInternal
2 KERNELBASE.dll CreateFileW
3 xul.dll OpenFile(nsString const&, int, int, bool, PRFileDesc**) xpcom/io/nsLocalFileWin.cpp
4 xul.dll nsLocalFile::OpenNSPRFileDescMaybeShareDelete(int, int, bool, PRFileDesc**) xpcom/io/nsLocalFileWin.cpp
5 xul.dll nsLocalFile::OpenNSPRFileDesc(int, int, PRFileDesc**) xpcom/io/nsLocalFileWin.cpp
6 xul.dll mozilla::net::CacheFileIOManager::OpenNSPRHandle(mozilla::net::CacheFileHandle*, bool) netwerk/cache2/CacheFileIOManager.cpp
7 xul.dll mozilla::net::CacheFileIOManager::WriteInternal(mozilla::net::CacheFileHandle*, __int64, char const*, int, bool, bool) netwerk/cache2/CacheFileIOManager.cpp
8 xul.dll mozilla::net::WriteEvent::Run() netwerk/cache2/CacheFileIOManager.cpp
9 xul.dll mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) netwerk/cache2/CacheIOThread.cpp
10 xul.dll mozilla::net::CacheIOThread::ThreadFunc() netwerk/cache2/CacheIOThread.cpp
11 xul.dll mozilla::net::CacheIOThread::ThreadFunc(void*) netwerk/cache2/CacheIOThread.cpp
12 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c
8 of the last 10 crashes had the Skype Click-to-Call extension per bug 1215970 comment 43, so most of this may be related to that extension. Has it been confirmed that bug 1248049 fixed these crashes on beta? (This may well be a duplicate.)
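For readers following the two stacks in the description: the main thread sits in mozilla::CondVar::Wait inside CacheFileIOManager::Shutdown() while the cache IO thread is still inside a file open. Below is a minimal standard C++ sketch of that pattern, using a bounded wait instead of an unbounded one. IoWorker, ShutdownWithTimeout and the sleep that stands in for slow file IO are all hypothetical names made up for illustration; this is not the cache2 code and it does not use NSPR. It only shows how a deadline turns an indefinite wait into a timeout the caller can detect.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the cache IO thread; not a Mozilla API.
class IoWorker {
public:
  // Starts a worker whose single "file operation" blocks for ioDuration.
  explicit IoWorker(std::chrono::milliseconds ioDuration)
      : mThread([this, ioDuration] {
          // Assumption: a sleep simulates a slow CreateFileW call.
          std::this_thread::sleep_for(ioDuration);
          {
            std::lock_guard<std::mutex> lock(mMutex);
            mDone = true;
          }
          mCondVar.notify_all();
        }) {}

  // The sketch still joins on destruction so the thread is never leaked.
  ~IoWorker() {
    if (mThread.joinable()) {
      mThread.join();
    }
  }

  // Waits for the worker to finish, but gives up after timeout.
  // Returns true if the worker finished in time, false on timeout.
  bool ShutdownWithTimeout(std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(mMutex);
    return mCondVar.wait_for(lock, timeout, [this] { return mDone; });
  }

private:
  std::mutex mMutex;
  std::condition_variable mCondVar;
  bool mDone = false;
  std::thread mThread;  // declared last so the members above exist before it starts
};
```

With a worker built as IoWorker(std::chrono::milliseconds(500)), calling ShutdownWithTimeout(std::chrono::milliseconds(20)) returns false rather than blocking for the full delay. What the caller should do after a timeout is a separate decision that this sketch does not try to answer.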
Flags: needinfo?(honzab.moz)
This may well be just a duplicate of bug 1247432.
Flags: needinfo?(honzab.moz) → needinfo?(michal.novotny)
I guess bug #913822 and bug #1247432 should solve this.
Flags: needinfo?(michal.novotny)
Whiteboard: [necko-active]
Crash Signature: mozilla::net::CacheFileIOManager::Shutdown ] → mozilla::net::CacheFileIOManager::Shutdown ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::net::CacheIOThread::Shutdown ]
(In reply to Michal Novotny (:michal) from comment #3)
> I guess bug #913822 and bug #1247432 should solve this.

The last comment in bug 1247432 indicates that fixing bug 1251130 first is the best path forward on that bug. Does that still help with this bug?

Crash comments show this is costing us users each day the problem is still out there on release channels.
Keywords: regression
(In reply to chris hofmann from comment #4)
> (In reply to Michal Novotny (:michal) from comment #3)
> > I guess bug #913822 and bug #1247432 should solve this.
>
> last comment in bug 1247432 indicates fixing bug 1251130 first is the best
> path forward on that bug. does that still help with this bug.

If this is a question, then I'm not sure I fully follow it. Bug 1247432 is reopened but still left in the repo. We want to decide on its status (back it out from everywhere or mark it as fixed again) based on how simply and quickly we can deal with bug 1251130, which appeared after bug 1247432 landed.

> crash comments show this is costing us users each day the problem is still
> out there on release channels.

So far, bug 1247432 has been uplifted as far as beta. Within the next 5 weeks it will be in release. I don't think we can do anything more here.

If you want this to be marked as a regression, please provide a bug number first.
Status: NEW → RESOLVED
Closed: 9 years ago
Keywords: regression
Resolution: --- → DUPLICATE
(In reply to Honza Bambas (:mayhemer) from comment #5)
> If you want this be marked as a regression, please provide a bug number
> first.
>
> *** This bug has been marked as a duplicate of bug 1247432 ***

Here is the rationale for marking it as a regression. This crash signature was mostly invisible or non-existent before Firefox 44. This bug was filed from Socorro when the signature raced to the top after the release of 44. Now it's bouncing around in the top 5-10 crashes for 44 and 45 and will stay that way until we back out whatever changes caused that, or make another run at fixing it.

Yeah, I could use some help in understanding which approach or approaches we intend to take, and which bugs might move us from this regressed state to a better place, but it's also worthwhile to understand how we got this new top crash. I wasn't able to figure that out from the complex relationships among all the bugs that might have contributed to the new stability problem, or what we intend to do now.
dbaron 2/19: This may be related to Bug 1029213 - Shutdown hang in CacheFileIOManager::Shutdown

-- 1029213 now depends on Bug 913822 - HTTP cache v2: fix shutdown time regression on a slow storage

913822 has now been fixed and uplifted to beta.
Comment on 2/8: Unfortunately, this patch seems to not have helped bug 1215970 or bug 1158189 in 45.0b3.

Also on 2/9 asked: Has it been confirmed that Bug 1248049 - Don't HTTP cache XHR POSTs fixed these crashes on beta? That fix went on to 45 beta around 2/18. It's not clear if anyone has confirmed this on 45 beta or release yet by looking at crash data.

If these signatures had multiple sources for the crash, the Skype Click-to-Call part may have been reduced or eliminated. Honza confirmed with testing, and now crash data from the 45 release seems to back that up, with only about 5% of this signature having {82AF8DCA-6DE9-405D-BD5E-43525BDAD38A} skype click to call 8.0.0.9103 version installed. A possible conclusion here, with this signature still seen in high volume on 45, is that the Skype issue wasn't a big part of this signature, or it's been replaced by some other thing that tickles the crash now.

Seems like all those parts of tracking these signatures can be put to bed now.

The focus then is on tracking other possible sources for these signatures.

re: comment 5
> So far, bug 1247432 is uplifted up to beta. In next 5 weeks from now on it will be in release.
> I don't think we can do anything more here.

Comments in that bug indicate a backout is being considered, or maybe waiting for a fix to bug 1251130 first.

With a few days of crash data on beta it looks like this signature still ranks at #12:
https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=46.0b&days=7
https://crash-stats.mozilla.com/signature/?product=Firefox&version=46.0b&date=%3C2016-03-15T00%3A54%3A24&date=%3E%3D2016-03-08T00%3A54%3A24&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Anet%3A%3ACacheIOThread%3A%3AShutdown&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports
meaning some possible reduction, or migration to other signatures, or just the need to watch for more data.

That still leaves us with no direct answers or bugs that caused the original spike in 44, but at least tries to explain the path forward.

Does all that sound correct?
(In reply to chris hofmann from comment #7)
> dbaron 2/19 This may be related to Bug 1029213 - Shutdown hang in
> CacheFileIOManager::Shutdown

If it is, the cause is quite different. This particular bug belongs to the "blocked on IO during shutdown" group. Bug 1029213 might be just an endless or long-running loop in a very different piece of code. If you believe bug 1029213 happens at scale, I can check crash-stats. Note that bug 1029213 was reported individually, not from Socorro.

> -- 1029213 now depends on Bug 913822 - HTTP cache v2: fix shutdown time
> regression on a slow storage

No, it doesn't, and I think it never has. It depends on a different fix in our cache index, which could be another group of shutdown hangs, this time "inefficiency of the cache index". It has been fixed or largely mitigated in bug 1025913.

> 913822 has now been fixed and uplifted to beta.
> comment on 2/8 Unfortunately, this patch seems to not have
> helped bug 1215970

It's possible, because we have to push bug 913822 to be even more aggressive; that is bug 1247432. Bug 1247432 is now only on m-c and unfortunately uncovers some issues in lower levels of the cache2 code (bug 1251130, currently being diagnosed).

> or bug 1158189 in 45.0b3.

That bug is unrelated to this one.

> also on 2/9 asked Has it been confirmed that Bug 1248049 - Don't HTTP cache
> XHR POSTs fixed these crashes on beta? that fix went on to 45beta around
> 2/18
> not clear if anyone has confirmed on 45 beta or release yet looking at
> crash data.

Honestly, I did only local testing. Cache shutdown was simply hard-killed soon enough to not cause a shutdown timeout. But that is of course not an objective way of confirming a fix here.

> if these signatures had multiple sources for the crash
> the skype click to call part may have been reduced or eliminated.
> honza confirmed with testing and now crash data from the 45 release
> seems to back that up with only about 5% of this signature has
> {82AF8DCA-6DE9-405D-BD5E-43525BDAD38A} skype click to call 8.0.0.9103
> version installed. possible conclusion here with this signature still seen
> in high volume on 45 is that the skype issue wasn't a big part of this
> signature or its been replaced by some other thing that tickles the crash
> now.

Good to know.

> seems like all those parts of tracking these signatures can be put to bed
> now.
>
> then is on tracking other possible sources for these signatures.
>
> re: comment 5
> > So far, bug 1247432 is uplifted up to beta. In next 5 weeks from now on it will be in release.
> > I don't think we can do anything more here.
>
> comments in that bug indicate back out is being considered, or maybe wait
> for a fix to bug 1251130 first

Yes! Sorry, I mixed up bug 1251130 with bug 913822. Right, as I also wrote a bit above, it's probably the right (missing) fix for cutting off cache2 at shutdown correctly.

> With a few days of crash data on beta it looks like this signature still
> ranks at #12
> https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=46.0b&days=7
>
> https://crash-stats.mozilla.com/signature/?product=Firefox&version=46.0b&date=%3C2016-03-15T00%3A54%3A24&date=%3E%3D2016-03-08T00%3A54%3A24&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Anet%3A%3ACacheIOThread%3A%3AShutdown&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports
>
> meaning some possible reduction, or migration to other signatures, or just
> the need to watch for more data.
>
> that still leaves us with no direct answers or bugs that caused the original
> spike in 44, but at least tries to explain the path forward.
>
> does all that sound correct?

I think it does.
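To illustrate what "cutting off" cache IO at shutdown could look like in principle, here is a minimal standard C++ sketch: once a shutdown flag is set, events still in the queue are dropped instead of being run, so no further potentially blocking file opens are started. IoQueue, Dispatch, RequestShutdown and Drain are hypothetical names invented for this example. This is not the CacheIOThread implementation and not the patch from bug 1247432, just one way the idea can be expressed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical event queue; not the actual cache2 code.
class IoQueue {
public:
  // Queues an event (for example a write) to be run later by Drain().
  void Dispatch(std::function<void()> event) {
    assert(event && "event must be callable");
    mEvents.push_back(std::move(event));
  }

  // Called at shutdown: events that have not started yet will be skipped.
  void RequestShutdown() { mShuttingDown.store(true); }

  // Runs queued events in order, checking the shutdown flag before each one.
  // Returns the number of events dropped without running.
  std::size_t Drain() {
    std::size_t dropped = 0;
    while (!mEvents.empty()) {
      std::function<void()> event = std::move(mEvents.front());
      mEvents.pop_front();
      if (mShuttingDown.load()) {
        ++dropped;  // skip the potentially blocking file IO entirely
        continue;
      }
      event();
    }
    return dropped;
  }

private:
  std::deque<std::function<void()>> mEvents;
  std::atomic<bool> mShuttingDown{false};
};
```

An event that is already running when the flag is set still runs to completion in this sketch, so an open that is already blocked in the OS would not be interrupted; only work that has not started yet is cut off.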
Could bug 1176988 be the cause of this regression? But it landed in 43, not 44. Could a change (bug) in the Skype Click-to-Call add-on have triggered it in 44, and now it's just an echo?