Closed Bug 837011 - Opened 11 years ago, Closed 11 years ago

BananaBread benchmark halts in the middle silently

Categories

(Core :: JavaScript Engine, defect)
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla22
Tracking Status:
firefox21 --- affected
firefox22 --- fixed
firefox32 --- verified

People

(Reporter: azakai, Assigned: luke)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [js:p1][games:p1])

Attachments

(2 files)

http://kripken.github.com/misc-js-benchmarks/banana/

is a deterministic benchmark of the BananaBread game engine. When running it in Nightly, the benchmark often halts silently partway through: JS execution stops, often in mid-frame (you can tell because part of the scene was rendered but not all of it, e.g. the health bar at the bottom is missing). This shows the issue isn't that we call requestAnimationFrame and it forgets to call us back - we somehow just stop executing JS in the middle of a frame. (I remember seeing something like this a very long time ago, where a silent OOM would just halt execution with no error. But here memory use is nowhere close to being exhausted. Unless it is virtual memory?)
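
(For illustration only, not part of the benchmark: a minimal sketch of frame instrumentation that separates the two failure modes. If the last log line is 'frame begin', JS halted mid-frame; if it is 'frame end', requestAnimationFrame simply stopped calling back. The renderScene stub is hypothetical.)

function renderScene() { /* stand-in for the engine's per-frame work */ }

function frame(timestamp) {
  console.log('frame begin', timestamp);
  renderScene();
  console.log('frame end', timestamp);
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);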

On Chrome, the benchmark always reaches the end and prints out the results.

Nothing shows up in the JS console when this happens. It just silently halts.

There is nothing different between this benchmark and playing BananaBread except that input is ignored and Math.random etc. are deterministic. So this problem likely happens in BananaBread gameplay as well.
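
(As an aside, making Math.random deterministic typically just means replacing it with a seeded PRNG. A sketch with a made-up seed; the benchmark's actual generator is not shown here.)

var seed = 42;  // example seed, chosen arbitrarily
Math.random = function () {
  // Park-Miller LCG; intermediate products stay below 2^53, so this is
  // exact in doubles and gives the same sequence on every run.
  seed = (seed * 16807) % 2147483647;
  return (seed - 1) / 2147483646;
};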

The problem happens more often in "Go" than "Go headless" (headless runs the exact same thing, except it replaces the canvas with empty shims and no rendering is done), but I've seen it happen in headless as well, so it isn't a graphics issue.

Note that headless does allocate less memory than when rendering, so if an OOM is happening here, it could be related. The headless version linked here does still do audio though, which uses memory, and in fact many threads end up being created. So perhaps we are using a lot of virtual memory for stack space.

I've gotten reports of a very similar problem on FF19, so this might be a long-standing issue.
Can you run this in a debug browser build?
Whiteboard: [games:p1]
I don't have a debug build handy, and

ftp://ftp.mozilla.org/pub/firefox/nightly/2013-02-01-mozilla-central-debug/

seems to not be a debug build despite the name. I'll try to build later today. If someone else already has a debug build around though and can test this that would be great.
I tried on a debug build. The good news is that the problem seems much more reproducible there. It still fails at random places, but it fails very frequently and typically very close to the beginning of the benchmark.

The bad news is there is nothing odd in the debug output. It still just stops silently.

However, it did crash half the times I ran it; some reports:

https://crash-stats.mozilla.com/report/index/f51dd736-2075-4cc4-8037-f129d2130201
https://crash-stats.mozilla.com/report/index/bp-40e94727-2094-4eef-a2bc-004612130201
https://crash-stats.mozilla.com/report/index/a3aa8d85-21cd-4414-806a-6c8062130201
Keywords: crash
Attached file python testcase
Smaller testcase showing the same problem.

STR:

1. Load the page
2. Open the web console
3. Reload the page and look at the web console output. In most cases, it will just print 'start', which is done at the beginning of python.js, and not 'finish' which is done at the end of that script. In other words, execution halts silently in the middle of the script. I see this happen in about 75% of reloads, so doing say 5 quick reloads should be enough to confirm/disconfirm this bug on a build.

I am seeing this on a 32-bit linux machine. The BananaBread testcase shows up on 64-bit linux and os x.
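
(The instrumentation in step 3 amounts to nothing more than console markers; a sketch of what was added to python.js, plus an error hook to confirm that nothing at all is reported when 'finish' never appears. The hook is my addition, not part of the testcase.)

window.onerror = function (message, source, line) {
  console.log('error reported:', message, source + ':' + line);
};
console.log('start');   // first statement of python.js
// ... the compiled Python interpreter runs here ...
console.log('finish');  // last statement of python.js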
This is definitely a regression, since the last testcase (in the previous comment) is Python that was compiled 2-3 years ago (I just added the console.log stuff at the beginning and end). Requesting a regression range. Note that we have no idea how big the regression range is - we started seeing it over the last month or two, but it might have regressed much earlier and not been noticed.

To be clear, I think we should get a regression range for the testcase in the last comment, it is much smaller and more reproducible. (The original testcase in this bug is much larger and harder to work with.)
I tested with the STR from comment #4.

Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/30b4f99a137c
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228112349
Bad:
http://hg.mozilla.org/mozilla-central/rev/3812d0ce274e
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120229 Firefox/13.0a1 ID:20120229104909
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=30b4f99a137c&tochange=3812d0ce274e


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/0983ca6b6ce6
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228114554
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/491ceed82be3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120228 Firefox/13.0a1 ID:20120228123549
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=0983ca6b6ce6&tochange=491ceed82be3

In local build
Last Good: a77570b9a808
First Bad: 491ceed82be3
Triggered by:
  491ceed82be3	Peter Van der Beken — Fix for bug 637099 (Remove JS_ClearScope calls on windows). r=mrbkap.
Whiteboard: [games:p1] → [js:p1][games:p1]
Also, we have a demo that was meant to ship out on the 19th.  Once the cause is identified, I would like to know if there is a workaround or if we should delay the release until the fix is widely available.
(In reply to Alon Zakai (:azakai) from comment #4)
> Created attachment 712218 [details]
> python testcase
> 
> Smaller testcase showing the same problem.
> 
> STR:
> 
> 1. Load the page
> 2. Open the web console
> 3. Reload the page and look at the web console output. In most cases, it
> will just print 'start', which is done at the beginning of python.js, and
> not 'finish' which is done at the end of that script. In other words,
> execution halts silently in the middle of the script. I see this happen in
> about 75% of reloads, so doing say 5 quick reloads should be enough to
> confirm/disconfirm this bug on a build.
> 
> I am seeing this on a 32-bit linux machine. The BananaBread testcase shows
> up on 64-bit linux and os x.

I haven't been able to get this one to repro on OSX. I'm trying the BananaBread testcase now, but it hasn't repro'd for me yet. What fraction of the time does it fail?
Luke and I saw it now and then. I would say 10% of the time.
(In reply to David Mandelin [:dmandelin] from comment #8)
> I haven't been able to get this one to repro on OSX. I'm trying the
> BananaBread testcase now, but it hasn't repro'd for me yet. What fraction of
> the time does it fail?

The python failure only reproduces on x86, about 50% of the time.
python testcase

STR
1. Open the Web Console
2. Open the python testcase
3. Hold down F5 for a while, then release it
4. Press F5
    --- observe the start/finish log
5. Repeat step 4

6. Wait for a while
7. Press F5
    --- observe the start/finish log

Actual result
No "finish" appears at step 4
"finish" appears again at step 7
Assignee: general → sstangl
Adding some prints to both the forced_return and the exit labels in jsinterp; note that my (modified) python.js has 7914 lines:

-- when everything's ok:
start
finish
** interp exit pc 0513CFF1 (current line 20)
** interp exit pc 0A465020 (current line 74)
** interp exit pc 0E78F870 (current line 7914)

-- when it fails:
start
** forced_return
** interp exit pc 0DEA17D8 (current line 7914)
start
GOTO ERROR @ 2342
*** error label with no pending Exception!
** forced_return pc 0DF10552 (current line 10)

InvokeConstructorKernel at around line 2340 of jsinterp.cpp is what's returning false and causing a goto error without setting an exception.
Ok, so dmandelin and I think we figured it out.  The python testcase seems to be allocating typed arrays with 52,428,800 elements each: an Int32Array (~200MB) and a Float64Array (52,428,800 * 8 bytes = ~400MB).  If you keep hitting F5, memory usage keeps going up until the GC kicks in and we clean up the DOM windows, and thus their global objects, and thus these previously-allocated buffers.  With ClearScope before, we were likely clearing out the global object even before the window was GCd, so it was much harder to hit this.

So, I suspect that this is an unreported OOM, probably in the typed array view create code -- the actual thing that's failing is in ArrayBufferObject::create, the call to allocateSlots() and ultimately to AllocateArrayBufferContents.  maybecx is not NULL, so js_ReportOutOfMemory *is* being called...
The code does:

try {
    ua = !! Int32Array && !! Float64Array && !! (new Int32Array).subarray
} catch (faa) {}
if (ua) na = u = new Int32Array(52428800), v = new Float64Array(52428800);
else {
    na = Array(1638400);
    for (var va = 0; va < 1638400; va++) na[va] = 0;
    u = v = na
}
for (var ya = wa("(null)"), va = 0; va < ya.length; va++) u[va] = ya[va];
Module.HEAP = na;


I don't know why that Float64Array allocation is there; is this just from an earlier version of emscripten?  It doesn't seem like it gets used, here at least. 

So this is a legitimate (but somehow unreported) OOM... we need to figure out if this is the same issue with BananaBench though.
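
(A sketch of what a reported OOM would look like from the testcase's side, using the sizes from the snippet above; everything else is hypothetical. If the engine surfaced the failure as an exception or console error, it would show up here instead of execution silently stopping.)

var u, v;
try {
  u = new Int32Array(52428800);    // ~200MB
  v = new Float64Array(52428800);  // ~400MB
  console.log('heap allocated');
} catch (e) {
  // A properly reported OOM would land here (or at least in the web console).
  console.log('allocation failed:', e);
}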
We have bug 839631 for GC triggers surrounding typed arrays not working right, maybe that's related...
I attempted to determine whether the bananabread demo also calls js_ReportOutOfMemory() immediately before halting execution, but I am no longer able to reproduce the bad behavior.
Sorry for unsetting the tracking flags. Bugzilla has a bug where leaving a tab open for a long time, even through refreshes, causes these flags to be unset.
> 
> I don't know why that Float64Array allocation is there; is this just from an
> earlier version of emscripten?  It doesn't seem like it gets used, here at
> least. 

Yeah, it uses an old memory model of two separate typed arrays. It's possible the float one is not actually used in practice.

> 
> So this is a legitimate (but somehow unreported) OOM... we need to figure
> out if this is the same issue with BananaBench though.

(In reply to Andrew McCreight [:mccr8] from comment #16)
> We have bug 839631 for GC triggers surrounding typed arrays not working
> right, maybe that's related...

We have been seeing symptoms like BananaBread on some other codebases too (nonshareable). All use WebGL. One feature of how WebGL is used is that many small typed arrays are created each frame - so that bug makes perfect sense to me, marking as dependent. Best guess seems to be that the failures we see are a combination of the silent OOM halt from the regression tracked down for the Python testcase + an OOM from bug 839631.

Perhaps once we fix the Python testcase and have an exception or a logged error shown for this OOM, we can see whether it happens on the other codebases.
Depends on: 839631
(In reply to Alon Zakai (:azakai) from comment #19)
> We have been seeing symptoms like BananaBread on some other codebases too
> (nonshareable). All use WebGL. One feature of how WebGL is used is that many
> small typed arrays are created each frame - so that bug makes perfect sense
> to me, marking as dependent. Best guess seems to be that the failures we see
> are a combination of the silent OOM halt from the regression tracked down
> for the Python testcase + an OOM from bug 839631.

Small typed arrays should not get a singleton type, which is what's required to make bug 839631 happen. You need to be making many big typed arrays during an animation. Hopefully that's pretty rare.

However, it sounds like we definitely need better reporting of when an OOM happened.
Vlad, can you confirm that the bananabread benchmark calls js_ReportOutOfMemory() immediately before erroneously halting? I am no longer able to reproduce the failure locally, and doing so would lend greater credence to the idea that the two failures have the same cause.
Flags: needinfo?(vladimir)
The bananabread bug only reproduces with PGO enabled. I have not yet determined whether the PGO build calls js_ReportOutOfMemory() before ceasing execution.

To figure out why I can't reproduce the bug with local builds, despite trying to mimic memory conditions in the Nightly I run, I pulled two builds from TBPL (https://tbpl.mozilla.org/?rev=aceeea086ccb):

- The Fedora64 opt build (http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1360838908/)

- The Fedora64 PGO build (http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-pgo/1360850901/)

The PGO build halts every time I run the benchmark. The opt build successfully ran the benchmark to completion 5 times, which, based on recent experience hitting F5 repeatedly, means it probably will never halt.

So this is most likely yet another PGO bug.
Usually "yet another PGO bug" bugs are on Win32 though; I don't know that we've had linux pgo issues before.  But, your tests certainly seem to indicate that...
Flags: needinfo?(vladimir)
Bug 827946 is a recent linux PGO bug.
(In reply to Sean Stangl from comment #21)
> Vlad, can you confirm that the bananabread benchmark calls
> js_ReportOutOfMemory() immediately before erroneously halting? I am no
> longer able to reproduce the failure locally, and doing so would lend
> greater credence to the idea that the two failures have the same cause.

I have verified that the bananabread benchmark does not call js_ReportOutOfMemory() before halting, so it is most likely unrelated to Bug 839631. I can also reproduce the hang in headless mode, so it is unrelated to WebGL.

I am still only able to reproduce the bug in Nightly -- not even with Tryserver builds with forced PGO. I patched Nightly's libxul.so to trigger an interrupt on JS_ReportOutOfMemory() [2 locations] and js_ReportOutOfMemory() [1 location]. Reproducing the halt did not trigger an INT3, so that's probably not the issue. I'm assuming here that it's unlikely for PGO to decide to inline js_ReportOutOfMemory().
Putting back the tracking that Dave set, which was undone in comment 17.
I am now unable to reproduce the failure. For the last hour or so, even the known-bad PGO build is succeeding, which makes data collection nigh-impossible. Alon, could you lend me the computer that can reproduce the error?

Luke had mentioned that Alon's machine, which could reproduce the failure in a debug build per Comment 3, was a slow 2-core machine, but even downscaling cpufreq and turning off cores did not make the error appear. I also have never been able to reproduce the error on a slow x86_64 machine at home that is likely similar to Alon's.

The easiest way to debug this error would be to instrument "return false" with tracing information, compile a build, and watch the log for events around the time execution ceases. Unfortunately, I have never been able to reproduce the behavior in a custom build, so I cannot debug this. Vlad, can you?

Another idea would be to use ltrace to watch libxul calls around the engine termination -- but this takes an extraordinary amount of time, and ltrace process attachment is a no-go due to dlopen() not using the PLT. But if we had that information, that would help also. If we could separate the JS engine from libxul, that would be even better.

Yet another idea is to use a record-and-replay computer--it was mentioned that perhaps khuey has one--but it would need to be able to reproduce the error.

It would also be extremely helpful for someone to play with jit options and report combinations with which the error reproduces. Since the error is likely to be timing-related, parallel compilation is suspect.
Flags: needinfo?(azakai)
I can't reproduce this anymore...

We have gotten reports from other (nonshareable) codebases where this shows up; perhaps we can get a better testcase from there, but I'm not sure.
Flags: needinfo?(azakai)
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #23)
> Usually "yet another PGO bug" bugs are on Win32 though; I don't know that
> we've had linux pgo issues before.  But, your tests certainly seem to
> indicate that...

FYI, we just recently had bug 827946, which was a pretty explosive Linux PGO-only crash. So they certainly exist - unfortunately.
Sean, I have another codebase (also nonshareable) where I can now reproduce the problem 50% of the time. I can give you the codebase or loan you my machine tomorrow if that's convenient.
(In reply to Alon Zakai (:azakai) from comment #30)
> Sean, I have another codebase (also nonshareable) where I can now reproduce
> the problem 50% of the time. I can give you the codebase or loan you my
> machine tomorrow if that's convenient.

Great! Sending me the codebase would be a fine first step before I steal your machine.
...and I can't reproduce it anymore this morning :(

One thing I noticed yesterday and today is that Firefox often uses 2 cores in this testcase for sustained periods of time. Background compilation? Perhaps that is relevant somehow, a rare race condition or some such?

Anyhow, we need another approach to reproduce this; it's too hard to do normally. Can we perhaps make a build of Firefox that logs each time it starts to execute JS and each time it stops executing it normally, something like that? Then when I do see the rare bug, I can look in the log and send you something useful.
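
(Short of an instrumented engine build, a userland approximation is to wrap the testcase's entry points so that a silent halt leaves an 'enter' line with no matching 'exit'. A sketch; the wrapped function name below is hypothetical.)

function traced(name, fn) {
  return function () {
    console.log('enter', name);
    var result = fn.apply(this, arguments);
    console.log('exit', name);
    return result;
  };
}
// e.g. Module._main = traced('_main', Module._main);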
Compiling a new codebase, we hit this issue again, except this time it is reliable (it happens every time across two computers).  Toggling the various jit modes on and off shows that the problem only happens in JM (w/ or w/o inference enabled).  Unfortunately, the codebase is private, but I'll send out a link privately in a second.

I did a bit of triage and the error seems to be caused by an operation callback that doesn't abort execution (i.e., we call stubs::Interrupt and it doesn't throw).  The JS code actually has a bug that causes an infinite loop, which, I suspect, is what makes the error reliable.  The two obvious sources of errors are (1) register state not being correctly saved/restored, or (2) something happening from inside the operation callback that invalidates the mjit code.  The whole methodjit stubcall path is pretty confusing, so I wasn't able to really dig in further.
(In reply to Luke Wagner [:luke] from comment #33)
> Compiling a new codebase we hit this issue again except this time it is
> reliable (happens every time across two computers).  Toggling on and off the
> various jit modes shows that the problem only happens in JM (w/ or w/o
> inference enabled).  Unfortunately, the codebase is private but I'll send
> out a link privately in a second.
> 
> I did a bit of triage and the error seems to be caused by an operation
> callback that doesn't abort execution (i.e., we call stubs::Interrupt and it
> doesn't throw).  The JS code actually has a bug that causes an infinite loop
> which, I suspect, is what makes the error reliable.  The two obvious sources
> of errors are (1) register state not being correctly saved/restored, (2)
> something happening from inside the operation callback that invalidates the
> mjit code.  The whole methodjit stubcall path is pretty confusing so I
> wasn't able to really dig in further.

This issue seems to be due to an OOM: if I add printfs, I see calls to js_ReportOutOfMemory before execution halts and the browser becomes responsive again.
(In reply to Brian Hackett (:bhackett) from comment #34)
> This issue seems to be due to an OOM, if I add printfs I see calls to
> js_ReportOutOfMemory before execution halts and the browser becomes
> responsive again.

That suggests it is a separate issue from the one in bananabread, via Comment 25. It may be Bug 839631.
I looked into the particular OOM we were seeing for the testcase in comment 33.  It ended up being pretty goofy: due to a codegen bug, the codebase was doing a ton of accesses via negative indices, which created a ton of atoms, which eventually caused the OOM.  Sorry for the distraction on this bug; I wasn't aware that we don't report OOM to the dev console; we really should.
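
(For the curious, the pattern is roughly the following; this is a hypothetical reconstruction, not the actual codebase. Each distinct negative index becomes a string property key, and every such string gets atomized by the engine, so a long-running loop grows the atom table until allocation fails.)

var obj = {};
for (var i = 1; i < 10000000; i++) {
  // obj[-i] is a property access with the string key "-1", "-2", ...;
  // each new key is interned as an atom that is never used again.
  obj[-i] = 0;
}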

I'm also unable to reproduce any of the original failures from comment 0.
Untracking after discussion in email with Dvander, who says that this cannot be reproduced and that the urgency for getting this to work has also passed.  It can be renominated if those circumstances change.
I would say this is no less urgent than before, but yes, it is proving very difficult to reproduce.

One of the issues is that JS OOMs can be silent; we've seen that on a few testcases with this symptom. I was talking about that with luke, and it seemed like we could add some logging output in that case, if I understood correctly?
I agree that this is urgent but we should renominate it once we can reproduce it.  No point in triage watching it while we don't have a solution.
Attached patch fixSplinter Review
w00t!  Alon got this error reproducing somewhat reliably on his laptop on the BananaBread headless benchmark, so I could finally dig in.  The bug only affects 64-bit and only occurs when the memory allocator puts a newly-generated stub more than 4GB away from its caller.  That's why this was so terribly non-deterministic.  Fortunately the fix is trivial.  We should probably backport this to branches.

On a separate note, these bugs are very hard to track down.  I was only able to solve this using gdb reverse debugging.  We should add a ton of assertions to all paths to the effect:
  JS_ASSERT(ok || cx->isExceptionPending() || cx->reportedOOM || cx->stoppedByOperationCallback)
I'll file a separate bug.
Assignee: sstangl → luke
Status: NEW → ASSIGNED
Attachment #724720 - Flags: review?(dvander)
(In reply to Luke Wagner [:luke] from comment #40)
> On a separate note, these bugs are very hard to track down.  I was only able
> to solve this using gdb reverse debugging.  We should add a ton of
> assertions to all paths to the effect:
>   JS_ASSERT(ok || cx->isExceptionPending() || cx->reportedOOM ||
> cx->stoppedByOperationCallback)
> I'll file a separate bug.

This sort of check would be very well suited to static analysis, and now that we are using sixgill for analyzing roots, the barrier to entry for such a strategy is pretty low.  We can take that to the separate bug though.
(In reply to Brian Hackett (:bhackett) from comment #41)
> This sort of check would be very well suited to static analysis, and now
> that we are using sixgill for analyzing roots the barrier to entry for such
> a strategy is pretty low.

Ah, yes, that sounds excellent.
Attachment #724720 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/f67071c94584
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
(In reply to Luke Wagner [:luke] from comment #40)
> We should
> probably backport this to branches.
> 

I have just gotten another report from a user who sees this bug, not on Nightly, so it would definitely be great to backport this to as many branches as we can.
Blocks: gecko-games
(In reply to Alice0775 White from comment #11)
> python testcase
> 
> STR
> 1. Open the Web Console
> 2. Open the python testcase
> 3. Hold down F5 for a while, then release it
> 4. Press F5
>     --- observe the start/finish log
> 5. Repeat step 4
> 
> 6. Wait for a while
> 7. Press F5
>     --- observe the start/finish log
> 
> Actual result
> No "finish" appears at step 4
> "finish" appears again at step 7
Verified fixed 32.0a1 (2014-05-18), Win 7 x64
Status: RESOLVED → VERIFIED