Closed Bug 837011 - Opened 11 years ago, Closed 11 years ago

BananaBread benchmark halts in the middle silently

Categories

(Core :: JavaScript Engine, defect)
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla22
Tracking Status:
firefox21 --- affected
firefox22 --- fixed
firefox32 --- verified

People

(Reporter: azakai, Assigned: luke)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [js:p1][games:p1])

Attachments

(2 files)

http://kripken.github.com/misc-js-benchmarks/banana/

is a deterministic benchmark of the BananaBread game engine. When running it in Nightly, the benchmark often halts silently partway through: JS execution stops, often in mid-frame (you can tell because part of the scene was rendered but not all of it, e.g. the health bar at the bottom is missing). This shows the issue isn't that we call requestAnimationFrame and it forgets to call us back - we somehow just stop executing JS in the middle of a frame. (I remember seeing something like this a very long time ago, where a silent OOM would just halt execution with no error. But here memory use is nowhere close to being exhausted. Unless it is virtual memory?)
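
(For illustration only, not part of the benchmark: a minimal sketch of frame instrumentation that separates the two failure modes. If the last log line is 'frame begin', JS halted mid-frame; if it is 'frame end', requestAnimationFrame simply stopped calling back. The renderScene stub is hypothetical.)

function renderScene() { /* stand-in for the engine's per-frame work */ }

function frame(timestamp) {
  console.log('frame begin', timestamp);
  renderScene();
  console.log('frame end', timestamp);
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);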

On Chrome, the benchmark always reaches the end and prints out the results.

Nothing shows up in the JS console when this happens. It just silently halts.

There is nothing different between this benchmark and playing BananaBread except that input is ignored and Math.random etc. are deterministic. So this problem likely happens in BananaBread gameplay as well.
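
(As an aside, making Math.random deterministic typically just means replacing it with a seeded PRNG. A sketch with a made-up seed; the benchmark's actual generator is not shown here.)

var seed = 42;  // example seed, chosen arbitrarily
Math.random = function () {
  // Park-Miller LCG; intermediate products stay below 2^53, so this is
  // exact in doubles and gives the same sequence on every run.
  seed = (seed * 16807) % 2147483647;
  return (seed - 1) / 2147483646;
};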

The problem happens more often in "Go" than "Go headless" (headless runs the exact same thing, except it replaces the canvas with empty shims and no rendering is done), but I've seen it happen in headless as well, so it isn't a graphics issue.

Note that headless does allocate less memory than when rendering, so if an OOM is happening here, it could be related. The headless version linked here does still do audio though, which uses memory, and in fact many threads end up being created. So perhaps we are using a lot of virtual memory for stack space.

I've gotten reports of a very similar problem on FF19, so this might be a long-standing issue.
Can you run this in a debug browser build?
Whiteboard: [games:p1]
I don't have a debug build handy, and

ftp://ftp.mozilla.org/pub/firefox/nightly/2013-02-01-mozilla-central-debug/

seems to not be a debug build despite the name. I'll try to build later today. If someone else already has a debug build around though and can test this that would be great.
I tried on a debug build. The good news is that the problem seems much more reproducible there. It still fails at random places, but it fails very frequently and typically very close to the beginning of the benchmark.

The bad news is there is nothing odd in the debug output. It still just stops silently.

However, it did crash half the times I ran it; some reports:

https://crash-stats.mozilla.com/report/index/f51dd736-2075-4cc4-8037-f129d2130201
https://crash-stats.mozilla.com/report/index/bp-40e94727-2094-4eef-a2bc-004612130201
https://crash-stats.mozilla.com/report/index/a3aa8d85-21cd-4414-806a-6c8062130201
Keywords: crash
Attached file python testcase
Smaller testcase showing the same problem.

STR:

1. Load the page
2. Open the web console
3. Reload the page and look at the web console output. In most cases, it will just print 'start', which is done at the beginning of python.js, and not 'finish' which is done at the end of that script. In other words, execution halts silently in the middle of the script. I see this happen in about 75% of reloads, so doing say 5 quick reloads should be enough to confirm/disconfirm this bug on a build.

I am seeing this on a 32-bit linux machine. The BananaBread testcase shows up on 64-bit linux and os x.
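
(The instrumentation in step 3 amounts to nothing more than console markers; a sketch of what was added to python.js, plus an error hook to confirm that nothing at all is reported when 'finish' never appears. The hook is my addition, not part of the testcase.)

window.onerror = function (message, source, line) {
  console.log('error reported:', message, source + ':' + line);
};
console.log('start');   // first statement of python.js
// ... the compiled Python interpreter runs here ...
console.log('finish');  // last statement of python.js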
This is definitely a regression, since the last testcase (in the previous comment) is Python that was compiled 2-3 years ago (I just added the console.log stuff at the beginning and end). Requesting a regression range. Note that we have no idea how big the regression range is - we started seeing it over the last month or two, but it might have regressed much earlier and not been noticed.

To be clear, I think we should get a regression range for the testcase in the last comment, it is much smaller and more reproducible. (The original testcase in this bug is much larger and harder to work with.)
I tested with the STR from comment #4.

Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/30b4f99a137c
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228112349
Bad:
http://hg.mozilla.org/mozilla-central/rev/3812d0ce274e
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120229 Firefox/13.0a1 ID:20120229104909
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=30b4f99a137c&tochange=3812d0ce274e


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/0983ca6b6ce6
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228114554
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/491ceed82be3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228123549
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=0983ca6b6ce6&tochange=491ceed82be3

In local build
Last Good: a77570b9a808
First Bad: 491ceed82be3
Triggered by:
  491ceed82be3	Peter Van der Beken — Fix for bug 637099 (Remove JS_ClearScope calls on windows). r=mrbkap.
Whiteboard: [games:p1] → [js:p1][games:p1]
Also, we have a demo that was meant to ship out on the 19th.  Once the cause is identified, I would like to know if there is a workaround or if we should delay the release until the fix is widely available.
(In reply to Alon Zakai (:azakai) from comment #4)
> Created attachment 712218 [details]
> python testcase
> 
> Smaller testcase showing the same problem.
> 
> STR:
> 
> 1. Load the page
> 2. Open the web console
> 3. Reload the page and look at the web console output. In most cases, it
> will just print 'start', which is done at the beginning of python.js, and
> not 'finish' which is done at the end of that script. In other words,
> execution halts silently in the middle of the script. I see this happen in
> about 75% of reloads, so doing say 5 quick reloads should be enough to
> confirm/disconfirm this bug on a build.
> 
> I am seeing this on a 32-bit linux machine. The BananaBread testcase shows
> up on 64-bit linux and os x.

I haven't been able to get this one to repro on OSX. I'm trying the BananaBread testcase now, but it hasn't repro'd for me yet. What fraction of the time does it fail?
Luke and I saw it now and then. I would say 10% of the time.
(In reply to David Mandelin [:dmandelin] from comment #8)
> I haven't been able to get this one to repro on OSX. I'm trying the
> BananaBread testcase now, but it hasn't repro'd for me yet. What fraction of
> the time does it fail?

The python failure only reproduces on x86, about 50% of the time.
python testcase

STR
1. Open the Web Console
2. Open the python testcase
3. Hold down F5 for a while, then release it
4. Press F5
    --- observe the start/finish log
5. Repeat step 4

6. Wait for a while
7. Press F5
    --- observe the start/finish log

Actual result
No "finish" appears at step 4
"finish" appears again at step 7
Assignee: general → sstangl
Adding some prints to both the forced_return and the exit labels in jsinterp; note that my (modified) python.js has 7914 lines:

-- when everything's ok:
start
finish
** interp exit pc 0513CFF1 (current line 20)
** interp exit pc 0A465020 (current line 74)
** interp exit pc 0E78F870 (current line 7914)

-- when it fails:
start
** forced_return
** interp exit pc 0DEA17D8 (current line 7914)
start
GOTO ERROR @ 2342
*** error label with no pending Exception!
** forced_return pc 0DF10552 (current line 10)

InvokeConstructorKernel at around line 2340 of jsinterp.cpp is what's returning false and causing a goto error without setting an exception.
Ok, so dmandelin and I think we figured it out.  The python testcase seems to be allocating typed arrays with 52,428,800 elements each: an Int32Array (~200MB) and a Float64Array (52,428,800 * 8 bytes = ~400MB).  If you keep hitting F5, memory usage keeps going up until the GC kicks in and we clean up the DOM windows, and thus their global objects, and thus these previously-allocated buffers.  With ClearScope before, we were likely clearing out the global object even before the window was GCd, so it was much harder to hit this.

So, I suspect that this is an unreported OOM, probably in the typed array view create code -- the actual thing that's failing is in ArrayBufferObject::create, the call to allocateSlots() and ultimately to AllocateArrayBufferContents.  maybecx is not NULL, so js_ReportOutOfMemory *is* being called...
The code does:

try {
    ua = !! Int32Array && !! Float64Array && !! (new Int32Array).subarray
} catch (faa) {}
if (ua) na = u = new Int32Array(52428800), v = new Float64Array(52428800);
else {
    na = Array(1638400);
    for (var va = 0; va < 1638400; va++) na[va] = 0;
    u = v = na
}
for (var ya = wa("(null)"), va = 0; va < ya.length; va++) u[va] = ya[va];
Module.HEAP = na;


I don't know why that Float64Array allocation is there; is this just from an earlier version of emscripten?  It doesn't seem like it gets used, here at least. 

So this is a legitimate (but somehow unreported) OOM... we need to figure out if this is the same issue with BananaBench though.
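
(A sketch of what a reported OOM would look like from the testcase's side, using the sizes from the snippet above; everything else is hypothetical. If the engine surfaced the failure as an exception or console error, it would show up here instead of execution silently stopping.)

var u, v;
try {
  u = new Int32Array(52428800);    // ~200MB
  v = new Float64Array(52428800);  // ~400MB
  console.log('heap allocated');
} catch (e) {
  // A properly reported OOM would land here (or at least in the web console).
  console.log('allocation failed:', e);
}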
We have bug 839631 for GC triggers surrounding typed arrays not working right, maybe that's related...
I attempted to determine whether the bananabread demo also calls js_ReportOutOfMemory() immediately before halting execution, but I am no longer able to reproduce the bad behavior.
Sorry for unsetting the tracking flags. Bugzilla has a bug where leaving a tab open for a long time, even through refreshes, causes these flags to be unset.
> 
> I don't know why that Float64Array allocation is there; is this just from an
> earlier version of emscripten?  It doesn't seem like it gets used, here at
> least. 

Yeah, it uses an old memory model of two separate typed arrays. It's possible the float one is not actually used in practice.

> 
> So this is a legitimate (but somehow unreported) OOM... we need to figure
> out if this is the same issue with BananaBench though.

(In reply to Andrew McCreight [:mccr8] from comment #16)
> We have bug 839631 for GC triggers surrounding typed arrays not working
> right, maybe that's related...

We have been seeing symptoms like BananaBread on some other codebases too (nonshareable). All use WebGL. One feature of how WebGL is used is that many small typed arrays are created each frame - so that bug makes perfect sense to me, marking as dependent. Best guess seems to be that the failures we see are a combination of the silent OOM halt from the regression tracked down for the Python testcase + an OOM from bug 839631.

Perhaps once we fix the Python testcase and have an exception or a logged error shown for this OOM, we can see whether it happens on the other codebases.
Depends on: 839631
(In reply to Alon Zakai (:azakai) from comment #19)
> We have been seeing symptoms like BananaBread on some other codebases too
> (nonshareable). All use WebGL. One feature of how WebGL is used is that many
> small typed arrays are created each frame - so that bug makes perfect sense
> to me, marking as dependent. Best guess seems to be that the failures we see
> are a combination of the silent OOM halt from the regression tracked down
> for the Python testcase + an OOM from bug 839631.

Small typed arrays should not get a singleton type, which is what's required to make bug 839631 happen. You need to be making many big typed arrays during an animation. Hopefully that's pretty rare.

However, it sounds like we definitely need better reporting of when an OOM happened.
Vlad, can you confirm that the bananabread benchmark calls js_ReportOutOfMemory() immediately before erroneously halting? I am no longer able to reproduce the failure locally, and doing so would lend greater credence to the idea that the two failures have the same cause.
Flags: needinfo?(vladimir)
The bananabread bug only reproduces with PGO enabled. I have not yet determined whether the PGO build calls js_ReportOutOfMemory() before ceasing execution.

To figure out why I can't reproduce the bug with local builds, despite trying to mimic memory conditions in the Nightly I run, I pulled two builds from TBPL (https://tbpl.mozilla.org/?rev=aceeea086ccb):

- The Fedora64 opt build (http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1360838908/)

- The Fedora64 PGO build (http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-pgo/1360850901/)

The PGO build halts every time I run the benchmark. The opt build successfully ran the benchmark to completion 5 times, which, based on recent experience hitting F5 repeatedly, means it probably will never halt.

So this is most likely yet another PGO bug.
Usually "yet another PGO bug" bugs are on Win32 though; I don't know that we've had linux pgo issues before.  But, your tests certainly seem to indicate that...
Flags: needinfo?(vladimir)
Bug 827946 is a recent linux PGO bug.
(In reply to Sean Stangl from comment #21)
> Vlad, can you confirm that the bananabread benchmark calls
> js_ReportOutOfMemory() immediately before erroneously halting? I am no
> longer able to reproduce the failure locally, and doing so would lend
> greater credence to the idea that the two failures have the same cause.

I have verified that the bananabread benchmark does not call js_ReportOutOfMemory() before halting, so it is most likely unrelated to Bug 839631. I can also reproduce the hang in headless mode, so it is unrelated to WebGL.

I am still only able to reproduce the bug in Nightly -- not even with Tryserver builds with forced PGO. I patched Nightly's libxul.so to trigger an interrupt on JS_ReportOutOfMemory() [2 locations] and js_ReportOutOfMemory() [1 location]. Reproducing the halt did not trigger an INT3, so that's probably not the issue. I'm assuming here that it's unlikely for PGO to decide to inline js_ReportOutOfMemory().
Putting back the tracking that Dave set, which was undone in comment 17.
I am now unable to reproduce the failure. For the last hour or so, even the known-bad PGO build is succeeding, which makes data collection nigh-impossible. Alon, could you lend me the computer that can reproduce the error?

Luke had mentioned that Alon's machine, which could reproduce the failure in a debug build per Comment 3, was a slow 2-core machine, but even downscaling cpufreq and turning off cores did not make the error appear. I also have never been able to reproduce the error on a slow x86_64 machine at home that is likely similar to Alon's.

The easiest way to debug this error would be to instrument "return false" with tracing information, compile a build, and watch the log for events around the time execution ceases. Unfortunately, I have never been able to reproduce the behavior in a custom build, so I cannot debug this. Vlad, can you?

Another idea would be to use ltrace to watch libxul calls around the engine termination -- but this takes an extraordinary amount of time, and ltrace process attachment is a no-go due to dlopen() not using the PLT. But if we had that information, that would help also. If we could separate the JS engine from libxul, that would be even better.

Yet another idea is to use a record-and-replay computer--it was mentioned that perhaps khuey has one--but it would need to be able to reproduce the error.

It would also be extremely helpful for someone to play with jit options and report combinations with which the error reproduces. Since the error is likely to be timing-related, parallel compilation is suspect.
Flags: needinfo?(azakai)
I can't reproduce this anymore...

We have gotten reports from other (nonshareable) codebases where this shows up; perhaps we can get a better testcase from there, but I'm not sure.
Flags: needinfo?(azakai)
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #23)
> Usually "yet another PGO bug" bugs are on Win32 though; I don't know that
> we've had linux pgo issues before.  But, your tests certainly seem to
> indicate that...

FYI, we just recently had bug 827946, which was a pretty explosive Linux PGO-only crash. So they certainly exist - unfortunately.
Sean, I have another codebase (also nonshareable) where I can now reproduce the problem 50% of the time. I can give you the codebase or loan you my machine tomorrow if that's convenient.
(In reply to Alon Zakai (:azakai) from comment #30)
> Sean, I have another codebase (also nonshareable) where I can now reproduce
> the problem 50% of the time. I can give you the codebase or loan you my
> machine tomorrow if that's convenient.

Great! Sending me the codebase would be a fine first step before I steal your machine.
...and I can't reproduce it anymore this morning :(

One thing I noticed yesterday and today is that Firefox often uses 2 cores in this testcase for sustained periods of time. Background compilation? Perhaps that is relevant somehow, a rare race condition or some such?

Anyhow, we need another approach to reproduce this; it's too hard to do normally. Can we perhaps make a build of Firefox that logs each time it starts to execute JS and each time it stops executing it normally, something like that? Then when I do see the rare bug, I can look in the log and send you something useful.
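
(Short of an instrumented engine build, a userland approximation is to wrap the testcase's entry points so that a silent halt leaves an 'enter' line with no matching 'exit'. A sketch; the wrapped function name below is hypothetical.)

function traced(name, fn) {
  return function () {
    console.log('enter', name);
    var result = fn.apply(this, arguments);
    console.log('exit', name);
    return result;
  };
}
// e.g. Module._main = traced('_main', Module._main);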
Compiling a new codebase, we hit this issue again, except this time it is reliable (it happens every time across two computers).  Toggling the various jit modes on and off shows that the problem only happens in JM (w/ or w/o inference enabled).  Unfortunately, the codebase is private, but I'll send out a link privately in a second.

I did a bit of triage and the error seems to be caused by an operation callback that doesn't abort execution (i.e., we call stubs::Interrupt and it doesn't throw).  The JS code actually has a bug that causes an infinite loop, which, I suspect, is what makes the error reliable.  The two obvious sources of errors are (1) register state not being correctly saved/restored, or (2) something happening from inside the operation callback that invalidates the mjit code.  The whole methodjit stubcall path is pretty confusing, so I wasn't able to really dig in further.
(In reply to Luke Wagner [:luke] from comment #33)
> Compiling a new codebase we hit this issue again except this time it is
> reliable (happens every time across two computers).  Toggling on and off the
> various jit modes shows that the problem only happens in JM (w/ or w/o
> inference enabled).  Unfortunately, the codebase is private but I'll send
> out a link privately in a second.
> 
> I did a bit of triage and the error seems to be caused by an operation
> callback that doesn't abort execution (i.e., we call stubs::Interrupt and it
> doesn't throw).  The JS code actually has a bug that causes an infinite loop
> which, I suspect, is what makes the error reliable.  The two obvious sources
> of errors are (1) register state not being correctly saved/restored, (2)
> something happening from inside the operation callback that invalidates the
> mjit code.  The whole methodjit stubcall path is pretty confusing so I
> wasn't able to really dig in further.

This issue seems to be due to an OOM: if I add printfs, I see calls to js_ReportOutOfMemory before execution halts and the browser becomes responsive again.
(In reply to Brian Hackett (:bhackett) from comment #34)
> This issue seems to be due to an OOM, if I add printfs I see calls to
> js_ReportOutOfMemory before execution halts and the browser becomes
> responsive again.

That suggests it is a separate issue from the one in bananabread, via Comment 25. It may be Bug 839631.
I looked into the particular OOM we were seeing for the testcase in comment 33.  It ended up being pretty goofy: due to a codegen bug, the codebase was doing a ton of accesses via negative indices, which created a ton of atoms, which eventually caused the OOM.  Sorry for the distraction on this bug; I wasn't aware that we don't report OOM to the dev console; we really should.
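
(For the curious, the pattern is roughly the following; this is a hypothetical reconstruction, not the actual codebase. Each distinct negative index becomes a string property key, and every such string gets atomized by the engine, so a long-running loop grows the atom table until allocation fails.)

var obj = {};
for (var i = 1; i < 10000000; i++) {
  // obj[-i] is a property access with the string key "-1", "-2", ...;
  // each new key is interned as an atom that is never used again.
  obj[-i] = 0;
}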

I'm also unable to reproduce any of the original failures from comment 0.
Untracking after discussion in email with Dvander, who says that this cannot be reproduced and that the urgency for getting this to work has also passed.  It can be renominated if those circumstances change.
I would say this is no less urgent than before, but yes, it is proving very difficult to reproduce.

One of the issues is that JS OOMs can be silent; we've seen that on a few testcases with this symptom. I was talking about that with luke, and it seemed like we could add some logging output in that case, if I understood correctly?
I agree that this is urgent but we should renominate it once we can reproduce it.  No point in triage watching it while we don't have a solution.
Attached patch fixSplinter Review
w00t!  Alon got this error reproducing somewhat reliably on his laptop on the BananaBread headless benchmark, so I could finally dig in.  The bug only affects 64-bit and only occurs when the memory allocator puts a newly-generated stub more than 4GB away from its caller.  That's why this was so terribly non-deterministic.  Fortunately the fix is trivial.  We should probably backport this to branches.

On a separate note, these bugs are very hard to track down.  I was only able to solve this using gdb reverse debugging.  We should add a ton of assertions to all paths to the effect:
  JS_ASSERT(ok || cx->isExceptionPending() || cx->reportedOOM || cx->stoppedByOperationCallback)
I'll file a separate bug.
Assignee: sstangl → luke
Status: NEW → ASSIGNED
Attachment #724720 - Flags: review?(dvander)
(In reply to Luke Wagner [:luke] from comment #40)
> On a separate note, these bugs are very hard to track down.  I was only able
> to solve this using gdb reverse debugging.  We should add a ton of
> assertions to all paths to the effect:
>   JS_ASSERT(ok || cx->isExceptionPending() || cx->reportedOOM ||
> cx->stoppedByOperationCallback)
> I'll file a separate bug.

This sort of check would be very well suited to static analysis, and now that we are using sixgill for analyzing roots, the barrier to entry for such a strategy is pretty low.  We can take that to the separate bug though.
(In reply to Brian Hackett (:bhackett) from comment #41)
> This sort of check would be very well suited to static analysis, and now
> that we are using sixgill for analyzing roots the barrier to entry for such
> a strategy is pretty low.

Ah, yes, that sounds excellent.
Attachment #724720 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/f67071c94584
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
(In reply to Luke Wagner [:luke] from comment #40)
> We should
> probably backport this to branches.
> 

I have just gotten another report from a user who sees this bug, not on Nightly, so it would definitely be great to backport this to as many branches as we can.
Blocks: gecko-games
(In reply to Alice0775 White from comment #11)
> python testcase
> 
> STR
> 1. Open the Web Console
> 2. Open the python testcase
> 3. Hold down F5 for a while, then release it
> 4. Press F5
>     --- observe the start/finish log
> 5. Repeat step 4
> 
> 6. Wait for a while
> 7. Press F5
>     --- observe the start/finish log
> 
> Actual result
> No "finish" appears at step 4
> "finish" appears again at step 7
Verified fixed 32.0a1 (2014-05-18), Win 7 x64
Status: RESOLVED → VERIFIED