Closed Bug 468840 Opened 11 years ago Closed 11 years ago

Page hangs browser with JIT enabled

Categories

(Core :: JavaScript Engine, defect, P1, critical)

x86
Windows XP
defect

Tracking


VERIFIED FIXED
mozilla1.9.2a1

People

(Reporter: srt, Assigned: dmandelin)

References


Details

(Keywords: hang, verified1.9.1)

Attachments

(3 files, 5 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3pre) Gecko/20081209 Shiretoko/3.1b3pre (.NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3pre) Gecko/20081209 Shiretoko/3.1b3pre (.NET CLR 3.5.30729)

The referenced page does image evolution in client-side JavaScript.  With JIT content enabled, the page eventually hangs my browser (after ~20 minutes): CPU goes to 100% and the browser becomes non-responsive.

As far as I can tell, this doesn't occur with JIT content disabled.

I have JIT chrome disabled.

This is reproducible on at least two machines, both Windows XP running the latest Shiretoko.

Reproducible: Always

Steps to Reproduce:
1. Load page.
2. Hit "Start"
3. Wait.
Confirmed on Windows XP. The unresponsiveness appears to happen because memory is never released.
The first screenshot was made with JIT content to false: http://img244.imageshack.us/img244/401/mona1zs5.png
and the second with JIT content to true: http://img253.imageshack.us/img253/3892/mona2iu6.png
Assignee: nobody → general
Status: UNCONFIRMED → NEW
Component: General → JavaScript Engine
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
Flags: blocking1.9.2?
Flags: blocking1.9.1?
Igor, think this will be fixed by the watchdog?
Flags: blocking1.9.1? → blocking1.9.1+
I've been looking into this on TM trunk (minus a day or two). GC is run exactly once during the run of this script, several seconds after hitting "Start". After that, mem usage just keeps increasing, and the process eventually bogs down in page faults. If I hit "Stop" before that happens, a GC runs a few seconds after that. This would seem to be a bug in the watchdog mechanism or its usage. I will continue looking into this.
I was able to prove that the watchdog mechanism was working fine. The problem was that GCs were still not being run often enough (in fact, hardly at all). During a test run just now, it appears that the GC got triggered 2 ways: (a) through the cycle collector, and (b) through MaybeGC, which I think is intended to be the main browser entry point to GC. I saw that MaybeGC uses a very different criterion from JS_MaybeGC:

    // JS_MaybeGC
    if ((bytes > 8192 && bytes > lastBytes + lastBytes / 3) ||
        rt->gcMallocBytes >= rt->gcMaxMallocBytes) {

    // MaybeGC in nsJSEnvironment.cpp
    if ((bytes > 8192 && bytes / 16 > lastBytes)) {

The new MaybeGC in nsJSEnvironment.cpp comes from bug 417052, which was made to speed up SunSpider by reducing collection frequency. In English, this runs the GC only if current GC live bytes exceed 16 times the GC live bytes after the last collection. I'm not sure exactly what GC live bytes means these days, but I think it means the total size of arenas in use.
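As standalone predicates, the two criteria look like this (a sketch: names and parameter types are illustrative stand-ins for JSRuntime/nsJSEnvironment fields; only the comparisons come from the snippets above):

```cpp
#include <cstdint>

// JS_MaybeGC's check: ~33% growth in live bytes, or malloc budget exhausted.
bool EngineMaybeGCFires(uint32_t bytes, uint32_t lastBytes,
                        uint32_t gcMallocBytes, uint32_t gcMaxMallocBytes) {
    return (bytes > 8192 && bytes > lastBytes + lastBytes / 3) ||
           gcMallocBytes >= gcMaxMallocBytes;
}

// nsJSEnvironment's MaybeGC check: only fires after ~16x growth in live
// bytes, and never looks at malloc'd memory at all.
bool BrowserMaybeGCFires(uint32_t bytes, uint32_t lastBytes) {
    return bytes > 8192 && bytes / 16 > lastBytes;
}
```

For example, with 2 MB live after the last collection, the engine check fires once the heap passes about 2.7 MB, while the browser check waits until it passes 32 MB.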

This criterion is questionable (it appears to ignore JS_malloc; also, an application will thrash if it has a working set of > (avail physical mem) / 16), but the problem is actually deeper and not necessarily with the formula constant. After running a while, this particular script reaches a point where JS is using about 25 MB (with a current factor of 10: 18 MB in arenas vs. 1.8 MB arena working set + 9 MB malloc bytes). But FF was using > 1 GB at this time vs. a trivial 60 MB startup usage. 

The script draws lots of shapes (filled triangles) on a canvas. I think it had created something on the order of a million shapes at the point I measured above. I don't know enough about how canvas works to know what memory is being used, so I guess I'll have to do some runs with the leak tools.

Assuming the 1 GB is some kind of DOM stuff, it will be hard to get MaybeGC to make this application collect more often, because the JS memory system never "sees" DOM memory usage. Options include:

- Arbitrarily cut the factor of 16 to 4 or so, which would probably fix this application, at least on 2 GB machines.

- Start measuring memory usage of DOM stuff that JS might hold pointers to and include that in GC criteria.

- Add an additional criterion to make sure GC gets run for sure after some maximum time interval T.

- Somehow detect thrash/near-thrash conditions and start GC then.

- Various bogus tweaks that would make the cycle collector more apt to break in to long-running JS, etc.

In the absence of external constraints, I think that for a standard stop-the-world GC such as ours, the right thing to do is to GC when you run out of physical memory. Unfortunately, I'm not sure if detecting that condition is very easy. If anyone knows some good approximations, that would be great.
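One crude approximation would compare available and total physical memory against a fixed fraction. In this sketch the platform query itself (GlobalMemoryStatusEx on Windows, for instance) is abstracted into the parameters, and the 10% threshold is an assumption for illustration, not a tuned value:

```cpp
#include <cstdint>

// Hypothetical "severe memory pressure" predicate; a real implementation
// would fill the parameters from a platform-specific memory query.
bool UnderSevereMemoryPressure(uint64_t availPhysBytes, uint64_t totalPhysBytes) {
    if (totalPhysBytes == 0)
        return false;  // unknown total: don't force extra GCs on bad data
    return availPhysBytes < totalPhysBytes / 10;
}
```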
It's non-trivial to measure the memory usage of DOM stuff. I think a better and more useful number here would be the total memory consumption of the Firefox process. Does jemalloc provide that by any chance?
Duplicate of this bug: 470192
Attached patch Slightly better patch (obsolete) — Splinter Review
This patch is terrible, but it may alleviate the problem in the short term. The patch varies the MaybeGC parameters based on the time since the last GC triggered by MaybeGC. Namely, it does it the old way (bytes > 16x last time) if the last GC was within 60 seconds, but becomes more likely to collect after that.

It just so happens that on my machine SunSpider runs in about 60 seconds, while the evolve app starts to use up too much total memory at just about that time.
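A minimal sketch of the time-scaled trigger the patch describes, assuming 60 s / 600 s breakpoints and 16/4/1 factors (the function name is illustrative):

```cpp
#include <cstdint>

const int64_t kUsecPerSec = 1000000;  // stand-in for PR_USEC_PER_SEC

// Keep the lazy 16x growth criterion within 60 s of the last
// MaybeGC-triggered collection, then collect progressively more eagerly.
int GCTriggerFactor(int64_t usecSinceLastGC) {
    if (usecSinceLastGC < 60 * kUsecPerSec)
        return 16;  // recent GC: stay lazy (protects SunSpider-style runs)
    if (usecSinceLastGC < 600 * kUsecPerSec)
        return 4;   // a while since the last GC: collect 4x sooner
    return 1;       // long-running allocation-heavy page: collect on any growth
}
```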
Attachment #354266 - Attachment is obsolete: true
I've decided that the real solution to this problem requires detecting memory pressure. Memory pressure is the correct criterion and I don't think any other data is a reliable proxy for it. The GC policy should be something like:

- On severe memory pressure, run full GC including CC. "Severe memory pressure" means we think the system is running low on physical memory, page space, or virtual address space.

- When GC cost is rising, run regular GC. "GC cost is rising" means we think the next GC will take kind of a long time if we keep waiting. We should collect at this point to keep pause times down. I don't know the details of GC costs, but it appears that a GC is more expensive when more memory is in use. I found that a regular GC takes about 0.1 seconds, but after a lot of stuff has been allocated, it can take 0.5 seconds or more. It may be possible to determine moderate memory pressure adaptively.

- When the system goes idle (in both user input and JS activity), run regular GC.

There is code in nsJSEnvironment that is supposed to cause a CC under severe memory pressure, detected by calling nsMemoryImpl::isLowMemory. But that function doesn't work: on Mac OS X it always returns false, and on Windows it only checks whether the available page space is less than 10% of the total page space. I'm pretty sure the app will start thrashing way before that condition holds.
Attachment #354907 - Flags: review?(igor)
Attachment #354907 - Flags: review?(Olli.Pettay)
Assignee: general → dmandelin
Priority: -- → P1
Given systems these days and virtual memory, it is pretty difficult to detect when you are actually low on memory.  jemalloc with stats enabled can tell you how much memory it has allocated, but we don't use it on mac currently and it doesn't know how much memory the system has/will give out/etc.

The isLowMemory check was added fairly recently, and was primarily done just for mobile where we're likely to run low on memory while JS is being processed.
Comment on attachment 354907 [details] [diff] [review]
Slightly better patch

FYI, LL_ macros aren't needed anymore.

Could you explain why JIT triggers this problem?
(In reply to comment #10)
> Could you explain why JIT triggers this problem?
Though, on my machine this doesn't seem to be a JIT-only problem.
How is this bug related to P1 blocking1.9.1+ bug 453432?
The patch there removes the whole MaybeGC function.
(In reply to comment #11)
> (In reply to comment #10)
> > Could you explain why JIT triggers this problem?
> Though, on my machine this doesn't seem to be JIT-only problem.

I don't think it's a JIT-only problem either. But it happens a lot faster with the JIT enabled because the JIT runs the program faster. There may also be an effect where the cycle collector can come in via its timer more easily when not JITting, but I haven't tried to figure out if that's true because it doesn't seem relevant to the fix.
(In reply to comment #12)
> How is this bug related to P1 blocking1.9.1+ bug 453432?
> The patch there removes the whole MaybeGC function.

Hmm, that's a minor problem. The new policy code would have to be moved into the IsGCThresholdReached function introduced by that patch and changed to use SpiderMonkey time functions. But otherwise it should be OK.

The more important question is, is this new policy good enough even to be a stopgap? 

I would prefer to use a policy of running GC whenever we expect to collect at least X memory, where X would probably be tuned to 100M-200M. The simplest way to guess that is if total application memory usage has grown by X since the last GC. But that would take longer to implement and test.
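A sketch of that reclaim-X policy, with an assumed 128 MB threshold inside the suggested 100M-200M tuning range (all names illustrative; measuring total application memory is left abstract):

```cpp
#include <cstdint>

const uint64_t kExpectedReclaimBytes = 128ull << 20;  // assumed X

// Growth since the last GC is the simplest proxy for how much a
// collection might reclaim.
bool ShouldGCByGrowth(uint64_t totalAppBytesNow, uint64_t totalAppBytesAtLastGC) {
    return totalAppBytesNow >= totalAppBytesAtLastGC + kExpectedReclaimBytes;
}
```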
Flags: blocking1.9.2?
Policy sounds so fancy. Given what we had, the patch seems strictly better. We can have a followup bug to design and implement something even better.

/be
Comment on attachment 354907 [details] [diff] [review]
Slightly better patch

>+  int factor = 16;
Use PRxxx types here and elsewhere.

>+  if (!LL_IS_ZERO(sPreviousGCTime)) {
No need to use LL_ macros.

>+    sPreviousGCTime = PR_Now();
I think it should be ok to have sPreviousGCTime = now;
Attachment #354907 - Flags: review?(Olli.Pettay) → review-
Attached patch Patch revised per review (obsolete) — Splinter Review
Attachment #354907 - Attachment is obsolete: true
Attachment #356622 - Flags: review?
Attachment #354907 - Flags: review?(igor)
Attachment #356622 - Flags: review? → review?(Olli.Pettay)
Comment on attachment 356622 [details] [diff] [review]
Patch revised per review

>+  PRTime now = PR_Now();
>+
>+  PRInt32 factor = 16;
>+  if (sPreviousGCTime) {
>+    PRInt64 usec = now - sPreviousGCTime;
>+    if (usec >= 60 * PR_USEC_PER_SEC)
This adds a new signed vs. unsigned warning.

>+      factor = usec < 600 * PR_USEC_PER_SEC ? 4 : 1;
and so does this.

You probably want PRInt64(XXX * PR_USEC_PER_SEC)

Could you use a #define for those 60 and 600 and
document it.

With those, r=me
But Igor should look at this too.
Attachment #356622 - Flags: review?(Olli.Pettay) → review+
Attached patch Patch w/ more minor fixes (obsolete) — Splinter Review
Attachment #356622 - Attachment is obsolete: true
Attachment #356786 - Flags: review?(igor)
(In reply to comment #19)
> Created an attachment (id=356786) [details]
> Patch w/ more minor fixes

This conflicts badly with the bug 453432. Can one check if that patch solves the problem as well?
(In reply to comment #20)
> (In reply to comment #19)
> > Created an attachment (id=356786) [details] [details]
> > Patch w/ more minor fixes
> 
> This conflicts badly with the bug 453432. Can one check if that patch solves
> the problem as well?

That patch won't fix it. I would have tested it if it applied cleanly, but I can tell by inspection anyway. The problem is that for this application, the 16x collection factor (JS_GC_TRIGGER=1600 in the API of the bug 453432 patch) causes JS to wait way too long before collecting. 

At a pure coding level, I don't think it's too hard to move the trigger-level-changing logic that I added to the place the other patch needs it (JS_IsGCThresholdReached or thereabouts). More problematic is the fact that other patch seems to be offering some kind of API for tuning GC parameters that doesn't include this feature. Personally, I hope we can replace all of the triggering code soon enough so I don't think it matters too much what we do for now, but YMMV.
Depends on: 453432
Here is a version that would go on top of the new stuff. It's the same idea, but inside SM. I have it ifdef'd to be active only in the browser because it really doesn't make sense to run it in the shell.
Attachment #359191 - Flags: review?(igor)
Attachment #359191 - Flags: review?(igor)
Previous try had a bunch of bugs (I stuck it up speculatively to minimize latency). This one appears to work and prevent the given page from behaving too badly.
Attachment #359199 - Flags: review?(igor)
(In reply to comment #23)
> Created an attachment (id=359199) [details]
> Fixed patch to go on top of bug 453432 patch(es)
> 
> Previous try had a bunch of bugs (I stuck it up speculatively to minimize
> latency). This one appears to work and prevent the given page from behaving too
> badly.

I suggest going with the original patch - the patch for bug 453432 has a subtle flaw (I will add comments there) that requires some work.
That is, my suggestion is to land this first and then modify the patch in bug 453432 to allow for new heuristics.
(In reply to comment #25)
> That is, my suggestion is to land this first and then modify the patch in bug
> 453432 to allow for new heuristics.

My second patch actually seems to solve this bug on top of current trunk (since you got the JS shell part landed) with the small addition of setting a default trigger factor of 1600. I've been running that way on the URL in this bug for 16 hours straight without trouble. 

But I guess it wouldn't even be that bad to land my first patch first, and then the second one after you have bug 453432 all finished. Please tell me how to proceed and review any patches as you want them committed (I have you listed for both).
(In reply to comment #26)
> But I guess it wouldn't even be that bad to land my first patch first

This is exactly what I suggest - bug 453432 requires some work and should not block this bug. I will review the patch that changes nsJSEnvironment.cpp tomorrow.
Igor, just pinging for review here, thanks.
I let this web page run until the process was killed with an out-of-memory message in the syslog. 1GB physical memory, 2GB swap partition, Ubuntu 8.10. (The system became unresponsive while the Firefox process consumed memory.)

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b3pre) Gecko/20090206 Shiretoko/3.1b3pre

It sounds like the GC patches should take care of this. Looking forward to testing that out!
Igor, another poke on review here.  The more we can do in parallel the better.
There are some issues with using bug 453432 as the base for the patch. I suggest rebasing the patch on the current MaybeGC code from dom/src/base/nsJSEnvironment.cpp. I will *promptly* review it.
Attachment #356786 - Attachment is obsolete: true
Attachment #361624 - Flags: review?(igor)
Attachment #356786 - Flags: review?(igor)
Attachment #359191 - Attachment is obsolete: true
Attachment #359199 - Flags: review?(igor)
(In reply to comment #31)
> There are some issues with using the bug 453432 as the base for the patch. I
> suggest to rebase the patch on the current MaybeGC code from
> dom/src/base/nsJSEnvironment.cpp. I will *promptly* review it.

Ok, there's exactly one patch for you to review now. :-)
Comment on attachment 361624 [details] [diff] [review]
Pre-453432 patch, rebased

>@@ -859,13 +868,22 @@ MaybeGC(JSContext *cx)
> {
...
>+  if ((bytes > 8192 && bytes > lastBytes * factor)
> #ifdef DEBUG
>       || cx->runtime->gcZeal > 0
> #endif
>       ) {
>     JS_GC(cx);
>+    sPreviousGCTime = now;

Nit: move sPreviousGCTime = now before the JS_GC call to emphasize that this is the time when JS_GC was last called, not when it finished.
Attachment #361624 - Flags: review?(igor) → review+
Pushed with nit fixed to TM as 2ce90221e80c.
Btw, once this patch makes its way into nightlies, it would be very helpful to get some test coverage on machines with <2GB memory--I can't predict whether the tuning parameters I chose will fix the problem on machines with arbitrary amounts of memory.
I'll help test this on the next few nightly TM builds; 1GB memory, here.
I have a bit more insight into the cause of the problem now. 

If I take a memory snapshot of this program (trace-malloc), almost all the memory in use at any given time seems to be allocated by nsCanvasRenderingContext2D::GetImageData(). On my machine, the program tests about 30 'mutations' per second, each using a call to GetImageData:

    30 ImageData   200*200px   4 bytes   60 s
    ------------ * --------- * ------- * ---- = 274 MB / min
      second       ImageData     px      min

It seems to consume closer to 1 GB per minute on my machine, so I might be missing some JS array overhead, or perhaps the contents are not stored as ints. Either way, it's a lot of data and on the order of what I see in the activity manager. 

Most of this doesn't seem to be showing up in gcMallocBytes. I recall gcMallocBytes reaching about 32MB or so by the time I start thrashing. My stack goes through JS_realloc, which apparently doesn't get counted by the malloc counter.

With this info, I think we can think of the problem as having two causes:

- Memory allocated with JS_realloc is not accounted for in gcMallocBytes.

- gcMallocBytes is not used by MaybeGC.

The first problem is fairly easy to fix, although simply adding realloc bytes into the total will somewhat overcount the amount of memory in use.

The second problem seems harder because we don't have a good guide for triggering it. Currently I think the browser does a GC at 32 MB. This may or may not be a good time to do it for this application. I think on most machines a much larger trigger number would give better throughput, although the larger number might create unwanted heap fragmentation.

I think I'm going to play around with the malloc counter a bit to see if I can get something reasonable out of that. Hopefully it can be more reliable and higher performance than my first try.
Actually, the image data array has 4 elements per pixel, and each element is 4 bytes:

  30 ImageData   200*200px   4 elements   4 bytes   60 s
  ------------ * --------- * ---------- * ------- * ---- = 1099 MB / min
    second       ImageData       px        index    min

So there's your 1GB per minute.
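The corrected arithmetic above can be packaged as a small helper (the 2^20-bytes-per-MB convention is an assumption that matches the 1099 figure, and the 552 MB/min report later in this bug; the function name is illustrative):

```cpp
// mutations/sec * pixels * 4 elements/px * 4 bytes/element * 60 sec,
// converted to MB at 2^20 bytes per MB.
double ImageDataMBPerMinute(double mutationsPerSec, int width, int height) {
    const double bytesPerPixel = 4.0 * 4.0;  // 4 array elements of 4 bytes
    double bytesPerMin = mutationsPerSec * width * height * bytesPerPixel * 60.0;
    return bytesPerMin / (1024.0 * 1024.0);
}
```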
Running: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2a1pre) Gecko/20090212 Minefield/3.2a1pre (official nightly tracemonkey build) Hoping this is new enough to contain your patch. It seems to work better with the image evolution program, anyway.

About 100,000 mutations in total, 2 hr 10 min runtime at 13 mutations per second. I'm seeing in the system monitor that physical memory typically climbs to 95% every minute or so (200*232 px image: 552 MB / min; see formula above) and then drops down to 37%.

The memory freeing (looks like GC at work) sometimes occurs after the system starts using swap space, for up to 30 seconds, slowing everything else down, of course. And then GC action, physical memory goes down, swap memory goes down, system goes back to normal responsiveness.

The longest 'cycle' I've seen (between two large spikes in memory dropping) was about a minute and a half. And used swap has never gone above 400 MB.

low peak: 300 MB physical, 300 MB virtual
high peak: 990 MB physical, 375 MB virtual

I will let it run overnight and check again in the morning, but basically it looks better than my earlier tests, which usually crashed Shiretoko in less than an hour.
(In reply to comment #40)

Thanks for the prompt test report. This is what I was afraid of. Unfortunately, I don't think there's a simple retune (along the lines of my first patch, but with different fudge factors) that will satisfy all of our requirements here. I'm going to see if I can get arrays at least to count their memory usage more accurately--that should allow us to fix this application at least.
Note that the changes to nsJSEnvironment.cpp simply back out the previous patch, which appears to be unnecessary with the new approach.

Part of the original problem was that new arrays (as created by GetImageData) are allocated with JS_realloc, so they don't count toward mallocBytes. But if the pointer argument to JS_realloc is null, we can count it the same as a malloc, strictly gaining accuracy in mallocBytes. 

With the improved counting in place, we end up doing a GC after every 32 MB of allocation, which keeps the total mem usage on this page under 100 MB on my browser and appears to have minimal effect on throughput. Thus, we should be OK for any typical amount of RAM.

This also appears to have no effect on SunSpider score.
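The counting fix can be sketched like this; the counter and function names are illustrative, not the actual SpiderMonkey API:

```cpp
#include <cstddef>
#include <cstdlib>

static size_t gGCMallocBytes = 0;  // stand-in for the runtime's malloc counter

// realloc(NULL, n) behaves exactly like malloc(n), so charging it to the
// GC's malloc counter strictly gains accuracy, with no overcounting risk.
void* CountingRealloc(void* p, size_t size) {
    if (p == nullptr)
        gGCMallocBytes += size;  // fresh allocation: count it like JS_malloc
    return std::realloc(p, size);
}
```

Growing an existing block is still uncounted in this sketch, which is the remaining inaccuracy the comment alludes to.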
Attachment #362145 - Flags: review?(mrbkap)
Comment on attachment 362145 [details] [diff] [review]
Patch for a better approach

Looks reasonable to me.
Attachment #362145 - Flags: review?(mrbkap) → review+
Pushed to TM as 871d61f0fbfa.
TM with old patch ran all night (>15 hours) without crashing. And swap usage actually went down.


Today's TM build:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2a1pre) Gecko/20090213 Minefield/3.2a1pre

In 20 minutes, the program has not once caused total physical system memory usage to go above 32.5%! This is a significant improvement as far as the memory usage is concerned (and performance, since it doesn't thrash at swap). And at 13.3 mutations per second (so far) ... whatever performance hit it might have looks negligible.

Low peak: 275 MB physical, 165 MB virtual
High peak: 315 MB physical, 165 MB virtual

I will leave it running for the rest of the day.
David, TM has been running all day with your latest patch, and I'm seeing the same results as in my last report. So it looks good.

The one bit of 'trouble' I've had is that the whole UI is slightly less responsive while the evolution program runs. The same thing happened on older, unpatched builds as well. Stopping the program makes Minefield nice and snappy again. Not a regression.
(In reply to comment #46)
> David, TM has been running all day with your latest patch, and I'm seeing the
> same results as in my last report. So it looks good.

Great news. Thanks!

> The one bit of 'trouble' I've had is that the whole UI is slightly less
> responsive while the evolution program runs. The same thing happened on older,
> unpatched builds, as well. Stopping the program makes Minefield act nice and
> snappy like again. Not a regression.

Please consider filing a bug on this: it sounds worth fixing but I'm not an expert on that aspect of the problem so it would probably need wider attention.
http://hg.mozilla.org/mozilla-central/rev/871d61f0fbfa
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
When will it be checked into 1.9.1, or has it already landed there?
Target Milestone: --- → mozilla1.9.2a1
Severity: normal → critical
Keywords: hang
(In reply to comment #49)
> When will it be checked into 1.9.1 or has it already been landed there?

it hasn't yet. is there a reason you are changing the target milestone?
verified FIXED on builds: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090422 Minefield/3.6a1pre ID:20090422044118

and

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b4pre) Gecko/20090422 Shiretoko/3.5b4pre ID:20090422042031
Status: RESOLVED → VERIFIED