Closed Bug 1251130 Opened 4 years ago Closed 4 years ago

Intermittent test_caches.html | application crashed [@ mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*)]

Categories

(Core :: Networking: Cache, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 1257611
Tracking Status
firefox46 --- affected
firefox47 --- affected
firefox48 --- affected

People

(Reporter: philor, Assigned: michal)

References

Details

(Keywords: intermittent-failure, Whiteboard: [necko-active])

Attachments

(1 obsolete file)

This is necko cache, not DOM Cache API.
Component: DOM → Networking: Cache
Necko put its test in dom/cache/test/mochitest/test_caches.html?

Naughty!
The test may be in dom/cache, but the stack is all necko:

 14:26:46     INFO -   0  XUL!mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*) [RefPtr.h:3d625cdac138 : 36 + 0x0]
 14:26:46     INFO -      rax = 0x0000000113094500   rdx = 0x0000000000000000
 14:26:46     INFO -      rcx = 0x0000000000000003   rbx = 0x0000000000000003
 14:26:46     INFO -      rsi = 0x0000000000000000   rdi = 0x0000000100500044
 14:26:46     INFO -      rbp = 0x00000001170d3da0   rsp = 0x00000001170d3d40
 14:26:46     INFO -       r8 = 0x0000000000000000    r9 = 0x0000000000000000
 14:26:46     INFO -      r10 = 0x0000000000080000   r11 = 0xfffffffffbb433e0
 14:26:46     INFO -      r12 = 0x00000001080bc0a8   r13 = 0x000000010060f228
 14:26:46     INFO -      r14 = 0x0000000000000001   r15 = 0x00000001170d3dd0
 14:26:46     INFO -      rip = 0x000000010260224c
 14:26:46     INFO -      Found by: given as instruction pointer in context
 14:26:46     INFO -   1  XUL!mozilla::net::CacheFileIOManager::ShutdownInternal() [CacheFileIOManager.cpp:3d625cdac138 : 1215 + 0x14]
 14:26:46     INFO -      rbx = 0x00000001170d3dd0   rbp = 0x00000001170d3e00
 14:26:46     INFO -      rsp = 0x00000001170d3db0   r12 = 0x0000000000000004
 14:26:46     INFO -      r13 = 0x00000001080bc0a8   r14 = 0x00007fff5fbfebd0
 14:26:46     INFO -      r15 = 0x00000001161dc680   rip = 0x0000000102603c3a
 14:26:46     INFO -      Found by: call frame info
 14:26:46     INFO -   2  XUL!mozilla::net::ShutdownEvent::Run() [CacheFileIOManager.cpp:3d625cdac138 : 570 + 0xc]
 14:26:46     INFO -      rbx = 0x0000000117cee220   rbp = 0x00000001170d3e20
 14:26:46     INFO -      rsp = 0x00000001170d3e10   r12 = 0x0000000000000004
 14:26:46     INFO -      r13 = 0x00000001173e0110   r14 = 0x00007fff5fbfebd0
 14:26:46     INFO -      r15 = 0x00000001080bc0a8   rip = 0x0000000102621748
 14:26:46     INFO -      Found by: call frame info
 14:26:46     INFO -   3  XUL!mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) [CacheIOThread.cpp:3d625cdac138 : 294 + 0x6]
 14:26:46     INFO -      rbx = 0x0000000000000020   rbp = 0x00000001170d3e70
 14:26:46     INFO -      rsp = 0x00000001170d3e30   r12 = 0x0000000000000004
 14:26:46     INFO -      r13 = 0x00000001173e0110   r14 = 0x0000000000000003
 14:26:46     INFO -      r15 = 0x00000001080bc0a8   rip = 0x0000000102614d8a
 14:26:46     INFO -      Found by: call frame info
 14:26:46     INFO -   4  XUL!mozilla::net::CacheIOThread::ThreadFunc() [CacheIOThread.cpp:3d625cdac138 : 224 + 0x8]
 14:26:46     INFO -      rbx = 0x00000001173e0110   rbp = 0x00000001170d3ea0
 14:26:46     INFO -      rsp = 0x00000001170d3e80   r12 = 0x000000000000692b
 14:26:46     INFO -      r13 = 0x00000000000008ff   r14 = 0x00000001170d3e87
 14:26:46     INFO -      r15 = 0x0000000000000000   rip = 0x0000000102614b88
 14:26:46     INFO -      Found by: call frame info
 14:26:46     INFO -   5  XUL!mozilla::net::CacheIOThread::ThreadFunc(void*) [CacheIOThread.cpp:3d625cdac138 : 173 + 0x8]
 14:26:46     INFO -      rbx = 0x00000001173e0110   rbp = 0x00000001170d3ec0
 14:26:46     INFO -      rsp = 0x00000001170d3eb0   r12 = 0x000000000000692b
 14:26:46     INFO -      r13 = 0x00000000000008ff   r14 = 0x00000001170d4000
 14:26:46     INFO -      r15 = 0x0000000000000000   rip = 0x00000001026149e4
14:26:46 INFO - Found by: call frame info

I suppose dom Cache could be doing something bad, but I think we need to investigate whats happening in this stack first.
Some shutdown related things changed in neck cache recently.  For example, bug 1247432.  It seems the change to use profile-change-net-teardown as the shutdown hook causes necko cache to shutdown before DOM cache API now.  It didn't use to do that.  Not sure if thats related, but seems possible.

In dom/cache the only thing that changed recently is bholley's NS_ProxyRelease() stuff in bug 1164581.  That also touched the necko cache.
Honza, can you take a look at this crash?  I won't be able to look again until next week after I get back from PTO.
Flags: needinfo?(honzab.moz)
Michal?
Flags: needinfo?(honzab.moz) → needinfo?(michal.novotny)
Assignee: nobody → michal.novotny
Flags: needinfo?(michal.novotny)
Whiteboard: [necko-active]
It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is now called multiple times and what happens here is:

- MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and waits until condVar is notified
- IO Thread: ShutdownEvent::Run() notifies condVar
- MainThread: CacheFileIOManager::Shutdown() joins IO thread (via gInstance->mIOThread->Shutdown()) which means we process pending events on mainthread
- MainThread: CacheFileIOManager::Shutdown() is called again while we're still waiting for IO thread to be shut down and because gInstance is not null the whole shutdown process is executed again.
Blocks: 1247432
Severity: normal → critical
Attached patch fix (obsolete) — Splinter Review
CacheFileIOManager::gInstance is now nulled out on cache IO thread in CacheFileIOManager::ShutdownInternal() while main thread is blocked on condVar in CacheFileIOManager::Shutdown(). I also did a cleanup of CacheFileIOManager usage in events. Now it is safe to check gInstance and then use it whenever on Cache IO thread without addrefing.

I'm not able to get clean try run. I tried several pushes but win7 seems to be broken (not by this patch).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0fbc500589f9
Attachment #8726816 - Flags: review?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #7)
> It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is
> now called multiple times and what happens here is:
> 
> - MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and
> waits until condVar is notified
> - IO Thread: ShutdownEvent::Run() notifies condVar
> - MainThread: CacheFileIOManager::Shutdown() joins IO thread (via
> gInstance->mIOThread->Shutdown()) which means we process pending events on
> mainthread

Hmm.. no, we don't.  It's doing a PR_JoingThread() which doesn't loop.

> - MainThread: CacheFileIOManager::Shutdown() is called again while we're
> still waiting for IO thread to be shut down and because gInstance is not
> null the whole shutdown process is executed again.

Based on the above, this is impossible.
comment 9
Flags: needinfo?(michal.novotny)
Then I'm out of ideas here. How can we be past condVar.wait() on the main thread when condVar was not yet notified on Cache IO thread? Any ideas?
Flags: needinfo?(michal.novotny) → needinfo?(honzab.moz)
You are right we can call Shutdown in the io manager twice now.  But the problem is elsewhere than in reentrance of it.  It may be a race condition on access to gInstance during this shutdown phase.  Actually, something I was pointing out at a long time.  Still, the patch may be fixing the issue and seems correct to me.
Flags: needinfo?(honzab.moz)
Comment on attachment 8726816 [details] [diff] [review]
fix

Review of attachment 8726816 [details] [diff] [review]:
-----------------------------------------------------------------

This should be uplifted, right?  I'm a bit scared of this patch, also because we don't the true cause of this bug yet, but if you believe it's safe enough, please nominate to go along the shutdown bug.  Thanks.
Attachment #8726816 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #13)
because we don't *know* the true cause of this bug yet
Duplicate of this bug: 1254720
Comment on attachment 8726816 [details] [diff] [review]
fix

Approval Request Comment
[Feature/regressing bug #]: 1247432
[User impact if declined]: possible shutdown crash
[Describe test coverage new/current, TreeHerder]: this is tested by all existing tests that use disk cache
[Risks and why]: low - This patch cleans up shutdown procedure in the cache a bit. We are not sure what exactly causes these crashes (also bug 1254720 and bug 1248389 comment 30) but this change makes accessing CacheFileIOManager::gInstance safe at least on main thread and cache IO thread. I'll probably come with yet another clean up later, but for now we should land this on the same branches as bug 1247432 was landed.
[String/UUID change made/needed]: none
Attachment #8726816 - Flags: approval-mozilla-beta?
Attachment #8726816 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/1ff643f2c358
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Duplicate of this bug: 1252746
Comment on attachment 8726816 [details] [diff] [review]
fix

Crash fix, taking it.
Attachment #8726816 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8726816 [details] [diff] [review]
fix

Dropping the request.  This is causes (at least) bug 1255690.

Candidate for a complete backout!
Attachment #8726816 - Flags: approval-mozilla-beta?
Depends on: 1255690
Michal, please back this out ASAP.  We need a better patch here...
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: FIXED → ---
Flags: needinfo?(michal.novotny)
Attachment #8726816 - Flags: approval-mozilla-aurora+
And one more:

https://crash-stats.mozilla.com/report/index/ab06c015-bcb0-41be-8cbb-a9ec32160309


Tho, these two crashes are just ad only two I've found.  both on OSX, where this crash bug also appeared.
try run with sync,cache2:5 logging:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef

let's try get some info from logs.
another try without the sync module:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=78362bdde9a3

Apparently this is a tight timing issue that sync logging is interfering with.
OK, I have a solution to this bug: let's turn on cache logging, and it will never crash again! :)
(In reply to Honza Bambas (:mayhemer) from comment #30)
> (In reply to Honza Bambas (:mayhemer) from comment #26)
> > try run with sync,cache2:5 logging:
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef
> > 
> > let's try get some info from logs.
> 
> Managed to reproduce:
> 
> http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/
> 68c79c7068c2e5f0d4ba2b47b70325a6fcb427c9c277e8c0a1b82aa66e133f047e5f5307fe0f1
> ebba38662d6eb14d30fc11a02cdcb02eea896f7b1d3c45bf486
> 

nspr-pid=2336-uid=bbd4d41b-d0ac-4301-95aa-ca8612c9025c.log

> http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/
> 3437c29f38a3681b9d3d1f542bda028ba5db5cd86d6c1b03409def6a2c242abfbfd94a73c5ed7
> 272b02c5d493a9b780ee70ac278f02301cc550126bd1267d90e

nspr-pid=2013-uid=ab305ead-f338-43b4-9bfc-875ec11796d8.log

Michal, can you figure out something from it?
Flags: needinfo?(michal.novotny)
At least the theory that CacheFileIOManager::ShutdownInternal is called more than once is gone.
No longer blocks: 1238707
Status: REOPENED → ASSIGNED
Flags: needinfo?(michal.novotny)
Depends on: 1257611
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1257611
Attachment #8726816 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.