Closed Bug 1315285 Opened 8 years ago Closed 8 years ago

Frequent Windows IPC-related test hangs with jemalloc4 enabled

Categories

(Core :: Memory Allocator, defect, P3)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox49 --- unaffected
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: RyanVM, Assigned: ting)

References

Details

(Keywords: crash, hang, regression)

Not sure this is really jemalloc4's fault or not, but starting with the Memory Allocator component for now until there's a better sense of what's going on here.

While testing the forthcoming jemalloc 4.3.0 release on Try, glandium and I noticed that Win8 in particular was hitting a lot of hangs along the line of the logs below (the dom-level* tests seem to be best for reliably triggering it):
https://treeherder.mozilla.org/logviewer.html#?job_id=30453758&repo=try
https://treeherder.mozilla.org/logviewer.html#?job_id=30453765&repo=try

Both win32 and win64 Talos also have been seeing hangs like:
https://treeherder.mozilla.org/logviewer.html#?job_id=30404691&repo=try
https://treeherder.mozilla.org/logviewer.html#?job_id=30399341&repo=try

I tried bisecting this to a broken jemalloc commit only to realize that it reproduced with the in-tree 4.1.1 version as well. I was eventually able to bisect it down to the debugging patch from bug 1293501 as the culprit.

To reproduce on Try, you should only need to make the following changes to the win32/win64 opt build configs, respectively:
https://hg.mozilla.org/try/diff/031fb63ab74b/browser/config/mozconfigs/win32/common-opt
https://hg.mozilla.org/try/diff/031fb63ab74b/browser/config/mozconfigs/win64/common-opt
Flags: needinfo?(janus926)
For now I have no ideas how could VirtualProtect in necko cause hangs. Any ideas about what objects the ZwWaitForMultipleObjects is waiting for in memset?
Flags: needinfo?(janus926)
4 of them are with EXCEPTION_ACCESS_VIOLATION_WRITE:

https://treeherder.mozilla.org/logviewer.html#?job_id=30453758&repo=try
  07:30:34     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
  07:30:34     INFO -  Crash address: 0x88486fb808
  07:30:34     INFO -  Assertion: Unknown assertion type 0x00000000
  07:30:34     INFO -  Process uptime: 42 seconds
  07:30:34     INFO -  Thread 1 (crashed)
  07:30:34     INFO -   0  KERNELBASE.dll!PostQueuedCompletionStatus + 0x108

https://treeherder.mozilla.org/logviewer.html#?job_id=30453765&repo=try
  07:28:52     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
  07:28:52     INFO -  Crash address: 0x84e72e78e8
  07:28:52     INFO -  Assertion: Unknown assertion type 0x00000000
  07:28:52     INFO -  Process uptime: 36 seconds
  07:28:52     INFO -  Thread 1 (crashed)
  07:28:52     INFO -   0  xul.dll!mozilla::BufferList<js::SystemAllocPolicy>::BufferList<js::SystemAllocPolicy>(mozilla::BufferList<js::SystemAllocPolicy> &&) [BufferList.h:533f238d2e70 : 97 + 0x0]

https://treeherder.mozilla.org/logviewer.html#?job_id=30399341&repo=try
  10:51:31     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
  10:51:31     INFO -  Crash address: 0x11bb0cc
  10:51:31     INFO -  Assertion: Unknown assertion type 0x00000000
  10:51:31     INFO -  Process uptime: 1 seconds
  10:51:31     INFO -  Thread 2 (crashed)
  10:51:31     INFO -   0  xul.dll!IPC::Channel::ChannelImpl::OutputQueuePop() [ipc_channel_win.cc:2ce763de722e : 103 + 0x0]

https://treeherder.mozilla.org/logviewer.html#?job_id=30456590&repo=try
  08:30:35     INFO - Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
  08:30:35     INFO - Crash address: 0x63c1f6a0e8
  08:30:35     INFO - Assertion: Unknown assertion type 0x00000000
  08:30:35     INFO - Process uptime: 42 seconds
  08:30:35     INFO - 
  08:30:35     INFO - Thread 1 (crashed)
  08:30:35     INFO -  0  xul.dll!mozilla::BufferList<InfallibleAllocPolicy>::BufferList<InfallibleAllocPolicy>(mozilla::BufferList<InfallibleAllocPolicy> &&) [BufferList.h:8a6d7d7e632d : 97 + 0x0]
Those stacks don't seem the culprit I'm hunting for in bug 1293501 as those come with jemalloc4. Note the debug patch may stay a bit longer in the tree (and merge to Aurora) because I haven't seen any access violation write crashes that seems from it.
... except the patch in bug 1293501 looks wrong to me.

First, it wastes a page by element in the array, even though only the beginning of that array is meant to be protected.

Second, the padding is not large enough to put what follows it out of the protected page.

Third, I don't see anything to handle the array being reallocated when its size changes. Which means the protected page stays there, and can end up being given away to something else by jemalloc, while the array itself is not protected anymore.

Essentially, it seems to me you're *very* lucky that those crashes are not happening with mozjemalloc already. In fact, for all I know, there may well be unidentified crashes due to that patch.
(In reply to Mike Hommey [:glandium] from comment #5)
> First, it wastes a page by element in the array, even though only the
> beginning of that array is meant to be protected.

Yes. But if I add padding to only first element, I'll need to overwrite the array functions for accessing the elements, which I think is too much work for debug.

> Second, the padding is not large enough to put what follows it out of the
> protected page.

I am not sure I understand, why is this needed? From the crash reports earlier, I found it's likely the first nsEntry in the array has nsCString mData 0x0, but the other fields remains valid. So the protected page includes:

  |nsTArrayHeader|nsEntry.header|padding|nsEntry.value.mData|

which covers mData's orginal address and the shifted address.

> Third, I don't see anything to handle the array being reallocated when its
> size changes. Which means the protected page stays there, and can end up
> being given away to something else by jemalloc, while the array itself is
> not protected anymore.

The only way to alternate the array is through nsHttpRequestHead APIs like:

https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/netwerk/protocol/http/nsHttpRequestHead.cpp#162

which each of them has ReentrantMonitorAutoEnter in the beginning of the function. The patch will unprotect the page in the constructor of ReentrantMonitorAutoEnter, and reprotect the page (of course by the updated array mHdr) in its destructor. Also when nsHttpRequestHead is destructed, the page will set back to readwrite.

> Essentially, it seems to me you're *very* lucky that those crashes are not
> happening with mozjemalloc already. In fact, for all I know, there may well
> be unidentified crashes due to that patch.

Could be, but from crash-stats I don't see significant access violation write crashes seems to be caused by this:

https://crash-stats.mozilla.com/search/?reason=~EXCEPTION_ACCESS_VIOLATION_WRITE&build_id=%3E%3D20161012030211&product=Firefox&version=52.0a1&platform=Windows&date=%3E%3D2016-11-01T09%3A24%3A00.000Z&date=%3C2016-11-08T09%3A24%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Priority: -- → P3
No longer blocks: 1293501
Fixed by the bug 1293501 backout.
Assignee: nobody → janus926
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.