Closed Bug 1091705 Opened 10 years ago Closed 9 years ago

[e10s] content process hang - js::baseops::SetPropertyHelper<(js::ExecutionMode)0>

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
e10s m5+ ---

People

(Reporter: blassey, Assigned: billm)

References

(Depends on 1 open bug)

Details

(Keywords: hang)

* thread #1: tid = 0x122610, 0x0000000103184d64 XUL`bool js::baseops::SetPropertyHelper<(js::ExecutionMode)0>(js::ExecutionModeTraits<(js::ExecutionMode)0>::ContextType, JS::Handle<js::NativeObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, js::baseops::QualifiedBool, JS::MutableHandle<JS::Value>, bool) + 1940, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000103184d64 XUL`bool js::baseops::SetPropertyHelper<(js::ExecutionMode)0>(js::ExecutionModeTraits<(js::ExecutionMode)0>::ContextType, JS::Handle<js::NativeObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, js::baseops::QualifiedBool, JS::MutableHandle<JS::Value>, bool) + 1940
    frame #1: 0x000000010315f872 XUL`js::SetObjectElement(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, bool) + 386
    frame #2: 0x0000000108dc23fe
tracking-e10s: --- → ?
Keywords: hang
Assignee: nobody → wmccloskey
Alon caught this today and Blake and I took a look at it. We found a few things. First, Alon had dom.max_script_run_time set to 0, which means the slow script dialog is disabled. Apparently this happens whenever you check "Don't show again" in the slow script dialog (i.e., it's disabled forever for all sites!).

We're also wondering why the hang appears to be e10s-specific. The script that seems to be hanging is http://b.scorecardresearch.com/rs/vce_st.js. When I google for that filename, I see forum posts reporting hangs that sound similar. So it's possible that this isn't an e10s-specific problem.

When we stopped the script, it was at column 13012, which is a loop. It's hard to know if that's the loop that's not terminating. The code in question is doing some funky calculations relating to the coordinates of an element (probably a plugin). It looks like this:

i.translated = false;
while (e.left + e.width*s <= i.left) {
  i.left -= n;
  i.translated = true;
}

It's possible that |n| is 0 or negative, in which case the loop will never terminate. It's hard to tell where |n| comes from since the script is minified.
I think bug 1034213 would really help to debug problems like this.
I forgot to ask, Brad: what is dom.max_script_run_time set to in your profile?
Flags: needinfo?(blassey.bugs)
It is set to 0
Flags: needinfo?(blassey.bugs)
OK, great. That means your slow script dialog is also disabled. So the only remaining question here is why this script seems to hang in e10s and not without e10s. I'll get to work on bug 1034213 soon, and that will allow the next person who hits this to debug the script using devtools and maybe get a sense of why it's looping.
I was going to file a bug on the weird behavior of "Don't ask again" in the slow script window, but we already have three: bug 487898, bug 614438, and bug 953090!
very reliably reproducible for me. I have stack traces from both processes. Do you want me to attach them here?
No thanks. But can you check the value of dom.max_script_run_time in about:config?
This just happened on my computer tonight. The slow script warning is enabled, so that's not the problem. Like in bug 1100074, cx->runtime()->interruptCallback is null.

I'm pretty sure that nsGlobalWindow::ShowSlowScriptDialog is actually on the stack when this happens. (I searched through memory and found the return address on the stack, but the frames don't show up because the JIT confuses the debugger.) That makes sense since that's the only function that sets the interruptCallback to null.

I also looked at the JS stack:

#0          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043a552b20 @ 117)
#1          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043add44a0 @ 1196)
#2 0x7f047a85f4f0   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f0438c737e0 @ 255)
#3 0x7f047a85f438   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f0438c73710 @ 45)
#4 0x7f047a85f378   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f0438c733d0 @ 122)
#5 0x7f047a85f2c0   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043ac27cc0 @ 14)
#6 0x7f047a85f200   jar:file:///home/billm/Downloads/firefox/omni.ja!/components/nsPrompter.js:494 (0x7f0458185160 @ 500)
#7 0x7f047a85f0f8   jar:file:///home/billm/Downloads/firefox/omni.ja!/components/nsPrompter.js:548 (0x7f04581854a0 @ 394)
#8 0x7f047a85f020   jar:file:///home/billm/Downloads/firefox/omni.ja!/components/nsPrompter.js:697 (0x7f0458185b20 @ 444)
#9          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043a552b20 @ 117)
#10          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043add44a0 @ 1196)
#11          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043add43d0 @ 396)
#12          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f043add4300 @ 292)
#13          (nil)   http://b.scorecardresearch.com/rs/vce_st.js:1 (0x7f0439054a50 @ 43)

In between the scorecardresearch frames, you can see nsPrompt.js. The top-most nsPrompt.js frame is for the place where it does its nested event loop. It should have sent a message to chrome right before that asking it to put up the dialog. But, for whatever reason, I don't see a dialog box.

I tried to look at the browser console in chrome to see if we hit an exception or something, but it won't open (not sure why). My only theory is that maybe the chrome process is trying to contact content before it puts up the dialog, and content has already hung again inside the scorecardresearch code.

However, I don't really understand why that wouldn't happen without e10s as well. Does putting up the dialog somehow "freeze" content scripts? If so, the problem may be that we're not freezing in the content process since it doesn't know that a dialog is up. Does that make any sense Blake?
Flags: needinfo?(mrbkap)
OK, based on that theory I constructed this testcase. It reliably hangs for me with no slow script dialog in e10s.

<script>
setInterval(function() {
  var t0 = Date.now();
  while (Date.now() - t0 < 15000);
}, 0);
</script>

Yay!
With a little more debugging, my testcase seems to trigger a different problem.
The problem in comment 11 was caused by the fact that I had Ion disabled in my dev profile for some reason, and there's a bug in the slow script code that's related to that (bug 1111412).
I guess we call enterModalState to freeze scripts.
Flags: needinfo?(mrbkap)
Hey guys, in bug 1109768 comment 10 Steven provided a library that produces a text log of _sigtramp(). I used it to create a log for the hang from that bug (where chrome is not frozen, just content process is).

I created a text log and a screencase video documenting the behavior there. This bug is reproducible for me as well and manifests with both chrome and content processes hanging.

Would it be useful for me to reproduce the behavior and document it with text log and video screencast as well?
(In reply to Zibi Braniecki [:gandalf] from comment #15)
> I created a text log and a screencase video documenting the behavior there.
> This bug is reproducible for me as well and manifests with both chrome and
> content processes hanging.

How do you reproduce it? Does it happen consistently?
(In reply to Bill McCloskey (:billm) from comment #16)
> (In reply to Zibi Braniecki [:gandalf] from comment #15)
> > I created a text log and a screencase video documenting the behavior there.
> > This bug is reproducible for me as well and manifests with both chrome and
> > content processes hanging.
> 
> How do you reproduce it? Does it happen consistently?

I can't find a good STR, but I definitely get hit around once per hour.

Actually, it just happened as I was looking at the bmo email with your comment. I recorded it:

 - https://www.youtube.com/watch?v=09BVipwcwhE - part 1
 - https://www.youtube.com/watch?v=3raURLXVg_s - part 2

One difference is that in the video you see normal pointers. Instead i saw the spinning wheel whenever Firefox was the active app.
You can notice in the second video that after I kill the content process the chrome process "executes" all my previous clicks. It looks weird. Also, processes don't spin CPU to 100%.
Also, when I launch Firefox from a console and reproduce it, that's what I see in my terminal log: http://pastebin.mozilla.org/8038717
(In reply to Bill McCloskey (:billm) from comment #10)
> I'm pretty sure that nsGlobalWindow::ShowSlowScriptDialog is actually on the
> stack when this happens.

This is correct. Here's a profile of gandalf's content process that includes C++ symbols and JavaScript stack: http://people.mozilla.org/~bgirard/cleopatra/#report=788e2f7c26254ae8aef44d66887eb681935d6e28

nsGlobalWindow::ShowSlowScriptDialog() creates a dialog that launches a nested event loop, and from that event loop we re-enter scorecardresearch JavaScript through a NotifyOffThreadScriptLoadCompletedRunnable runnable.
With help from Zibi I was able to reproduce this somewhat consistently. The STR is as follows:
1. Visit phoronix.com.
2. Ctrl-click on 20 or so front-page links.
3. Close them all in quick succession.
This seems to reproduce for me pretty often. I can also reproduce a bunch of other random issues this way. I'll have to follow up on those once this one is fixed.

I haven't had much time to confirm this, but it looks like there are two independent problems:

1. The first one is the one Markus discovered in the profile. Even after we call enterModalDialog on the window, NotifyOffThreadScriptLoadCompletedRunnable is running script for that window.

2. The Prompt:Open message is sent but never received. It looks like this is happening because we're in the process of tearing down the tab that the script is running in, so the message manager doesn't work for it.
To elaborate on problem #2, we get to TabParent::ReceiveMessage and frameLoader->GetFrameMessageManager() is null.
(In reply to comment #19)

I see exactly what you report in the profile, Markus.  It seems to be a problem in cross-platform code -- not in any native event loop.  So in principle this bug (and bug 1109768) could also happen on Windows and Linux.  Do you agree?
Yes, I was able to reproduce this on Linux using the steps in comment 20.
So there's one more problem here. (I'll have to break this into multiple bugs.) The reason the scorecardresearch script is running forever is that window.screen.width is returning 0 and the script assumes it can never be 0.

When the script queries screen.width, we end up calling nsScreenManagerProxy::ScreenForNativeWidget. That sends a ScreenForBrowser message to the parent. The parent then calls GetWidget() on the TabParent. However, GetWidget() returns null because mFrameElement == null because the window is in the process of being closed. Yuck.
I filed bug 1113012 for the off thread parsing issue. It doesn't really block this bug but I added it to the list to make it easy to find.
Depends on: 1113012
I think all the important issues here have been fixed bu bug 1113006 and bug 1118618.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.