Closed Bug 1167883 Opened 5 years ago Closed 5 years ago
.com reply field fails to open due to stack overflow
82.53 KB, image/png
2.06 KB, patch
|Details | Diff | Splinter Review|
163 bytes, text/html
For the past week (or more), the reply box on twitter.com has been failing to open around 90% of the time for me. Looking in the JS console shows that it's overflowing the stack. It doesn't happen 100% of the time, but it's quite consistent. I've tried repeatedly and been unable to reproduce this in x86 builds of FF Aurora (39.0a2 (2015-04-19) and 40.0a2 (2015-05-23) both seem fine); it only seems to happen in x64 builds. My guess is this is more of an evangelism issue than a JS engine one - they probably shipped some completely broken JS - but that x64 builds are worse off here because pointers are larger on the stack? My simple reproduction steps are to log in, visit the test URL, and click in the reply box. In an x64 build it almost always fails, and fails repeatedly. I've yet to get it to fail in an x86 build.
This page: https://twitter.com/antumbral/status/602307485416820736 Seems to consistently trigger the stack overflow right as it loads in x86-64 but not x86. You can tell it happened because the inline GIFs are broken.
I can't repro this on OS X. Stack limits are very platform dependent though, will try it on Windows later.
Twitter pushed some code changes a little while back that fixed some other Firefox bugs (like image upload UI being totally busted), and this issue went away for a day or two. But now it's back again. :/
My repro URL doesn't appear to work anymore because they totally changed how that page is laid out (the other GIFs don't autoload). My other main repro step does still work but is less consistent (click the reply arrow when viewing a tweet page)
I'm getting similar stack overflow errors when trying to save changes to calendar events in Google Calendar now. Should I file a separate issue or should I expand this into a general issue about too much recursion issues in win64 Firefox?
Did some manual testing of the calendar issue, looks similar. Decided to write it up, and tested out a workaround: Open an existing google calendar event, change the start and/or end time, hit save Results: broken (silent failure, 'too much recursion' in console when clicking save) 40.0a2 (2015-06-08) x64 40.0a2 (2015-06-09) x64 works 40.0a2 (2015-06-09) Workaround: editbin.exe .\firefox.exe /STACK:4194304
Katelyn, thanks for the info. I can get you a Try build that logs the JS stack when we hit an overrecursion. That way we can see how many frames there are actually on the stack. I assume this also affects Nightly?
Haven't tried x64 nightly yet, I'll make a note to do that.
Here's an Aurora Win64 build that should log the JS stack to stderr when we overrecurse: https://treeherder.mozilla.org/#/jobs?repo=try&revision=51506e427bf4 http://firstname.lastname@example.org/try-win64/ If you don't have time to try it I can also do it soon :)
That try build doesn't run out of stack >_< It also doesn't seem to log anything to stdout or stderr when I run it from cmd.exe, though.
Does this help at all? I'll see if I can dump a JS stacktrace from the debugger, but I remember it being prohibitively difficult...
If the try build is an opt build, you should be able to get commandline output by launching it with --attach-console from cmd.exe or --console otherwise.
(In reply to K. Gadd (:kael) from comment #10) > That try build doesn't run out of stack >_< Ugh, we probably need a PGO build, I'll trigger one. (In reply to K. Gadd (:kael) from comment #11) > Created attachment 8620744 [details] > Native stack trace during google calendar stack overflow > > Does this help at all? Hm that looks pretty worst-case: interpreter -> fun.apply -> bound function -> parser to delazify. Each of those eat stack space. Unfortunately we don't see the whole stack due to JIT code. > I'll see if I can dump a JS stacktrace from the debugger, but I remember it > being prohibitively difficult... You could try to call DumpJSStack() or js::DumpBacktrace(cx), but not sure offhand if they work in opt builds.
Win64 Aurora PGO build, this will easily take 4-5 hours: https://treeherder.mozilla.org/#/jobs?repo=try&revision=38a8b8a65b32
(In reply to Jan de Mooij [:jandem] from comment #14) > Win64 Aurora PGO build, this will easily take 4-5 hours: Bah, it hit a weird hang or timeout and the build took 5.5 hours :( I just retriggered it: the only difference with the previous push is that I enabled PGO so I doubt it's a problem with the patch.
How do I actually download that build? It shows as 'complete' but the ftp folder just has two huge .txt files in it.
(In reply to K. Gadd (:kael) from comment #16) > How do I actually download that build? It shows as 'complete' but the ftp > folder just has two huge .txt files in it. Well this one failed too, in a different way. I just did two more retriggers; two so that we get at least one different slave, in case it's something with that machine.
Did you ever get a try build for this? It's reproducing on Twitter again.
Is there a way I can do an instrumented PGO build myself? At present a handful of sites (ones important to me, unfortunately) are broken in x64 firefox, and a handful are broken in x86 firefox (address space exhaustion/compositor issues, which I also filed bugs about ages ago), so this kinda leaves me with no reasonable option other than switching over to another browser or finding some labyrinthine way of running both builds of FF side-by-side...
Sorry for the slow response due to the work week and PGO. Is this also broken with the new Aurora release, 41? I just updated my Aurora tree and triggered a new instrumented Win64 PGO build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ebd81d3a3722 If this fails again I'm going to open a RelEng bug; it's silly if we can't do PGO builds on Try.
It comes and goes intermittently but I still hit this pretty often on Twitter and Google Calendar. Once it starts happening I see it for at least a day or so (presumably until they push new code or the phases of the moon change)
This now also affects our tests, bug 1180948.
So I was finally able to reproduce this locally with my own build. When we hit the overrecursion error there aren't a lot of JS frames on the stack, but there are some bound function calls and those eat a lot of stack space. This is bug 1093158. What makes this even worse here is that the Invoke + RunScript stack frames we have for each bound function invocation eat ~6.2 KB stack space because PGO inlines a ton of stuff in them. I'm now doing another build with a patch to disable PGO in Invoke* and RunScript, let's see if that helps.
Depends on: 1093158
Two small fixes: (1) Keep MSVC from optimizing RunScript too much. PGO was inlining a lot of code in it (code to enter the JITs etc) and RunScript's stack frame became > 4 KB. (2) Use cx->currentScript instead of FrameIter in Invoke. On x64, FrameIter is > 1.5 KB (also bug 1181558), so Invoke's stack frame was > 2 KB.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attachment #8631020 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8631020 [details] [diff] [review] Patch Review of attachment 8631020 [details] [diff] [review]: ----------------------------------------------------------------- This sounds good. Out of curiosity, can you report the result returned by js/xpconnect/tests/chrome/test_bug732665_meta.js, before and after this patch. I feel that this might have a big impact on our recursion depth.
Attachment #8631020 - Flags: review?(nicolas.b.pierron) → review+
> Out of curiosity, can you report the result returned by > js/xpconnect/tests/chrome/test_bug732665_meta.js, before and after this > patch. I feel that this might have a big impact on our recursion depth. Not really on this test because (1) the recursive calls are optimized by Baseline (and also the interpreter btw) so we don't go through Invoke/RunScript and (2) I ran it in the shell, which is very different from the browser build with PGO. Attached is a bound function call benchmark, for this I get the following max recursion depths before/after the patch: - normal : 88 -> 340 - no baseline: 47 -> 86 This is still not great, but bind is the worst case scenario (see bug 1093158) and hopefully this is good enough for now until we fix bind itself.
sorry had to back this out for test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=11513958&repo=mozilla-inbound
(In reply to Carsten Book [:Tomcat] from comment #28) > sorry had to back this out for test failures like Hm this exposed a completely unrelated bug; will investigate.
Comment on attachment 8631020 [details] [diff] [review] Patch Requesting approval so we can reopen Aurora (see bug 1180948). Approval Request Comment [Feature/regressing bug #]: Hard to say; compiler issue. [User impact if declined]: Overrecursion errors. [Describe test coverage new/current, TreeHerder]: On inbound. [Risks and why]: Low risk; very small patch with some simple tweaks to avoid large stack frames. [String/UUID change made/needed]: None.
Attachment #8631020 - Flags: approval-mozilla-aurora?
Comment on attachment 8631020 [details] [diff] [review] Patch Actually let's land this on beta too - there has been some talk about our Win64 builds and this will fix some websites for people willing to try them.
I went ahead and pushed this to Aurora ahead of the formal approval so we can get it reopened ASAP. https://hg.mozilla.org/releases/mozilla-aurora/rev/113007b56abf
Comment on attachment 8631020 [details] [diff] [review] Patch Low risk, on inbound, already fixed in Aurora.
:kael, could you confirm the fix works with the latest Firefox 40 Beta 4 build: ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/40.0b4-candidates/build1/?
You need to log in before you can comment on or make changes to this bug.