Closed Bug 656120 Opened 13 years ago Closed 13 years ago

Increase GC frequency: GC occasionally based on a timer (except when completely idle)

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla7
Tracking Status
firefox6 - ---

People

(Reporter: gwagner, Assigned: gwagner)

References

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P1], fixed-in-tracemonkey)

Attachments

(1 file, 5 obsolete files)

There are many open bug reports about not reclaimed memory (not leaks).
We don't have the notion of an "idle" GC.
The idea is to perform the GC in MaybeGC if there was no GC during the last N seconds.
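In pseudo-C++, the idea amounts to something like this (the names `RuntimeSketch`, `ShouldIdleGC`, and the 20-second interval are illustrative assumptions, not the actual SpiderMonkey identifiers):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative interval; the real value was tuned over the course of this bug.
const int64_t GC_IDLE_INTERVAL_US = 20 * 1000000; // N seconds, in microseconds

struct RuntimeSketch {
    int64_t lastGCTimeUS; // timestamp of the previous GC
};

// Called from MaybeGC: collect if no GC ran during the last N seconds.
bool ShouldIdleGC(const RuntimeSketch &rt, int64_t nowUS) {
    return nowUS - rt.lastGCTimeUS >= GC_IDLE_INTERVAL_US;
}
```

Because the check only runs when MaybeGC is called (i.e. when script is executing), a completely idle browser never triggers it.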
Attached patch patch (obsolete) — Splinter Review
Assignee: general → anygregor
Igor what do you think about this patch and can we remove the browser mem_frequency?
Attached patch patch (obsolete) — Splinter Review
update.
Also perform idle GC when chunks are waiting to expire.
Attachment #531446 - Attachment is obsolete: true
(In reply to comment #2)
> Igor what do you think about this patch and can we remove the browser
> mem_frequency?

The patch still does not try to recalculate the last GC bytes at the end of finalization.
(In reply to comment #4)
> (In reply to comment #2)
> > Igor what do you think about this patch and can we remove the browser
> > mem_frequency?
> 
> The patch still does not try to recalculate the last GC bytes at the end of
> finalization.

We do it on the fly with reduceGCTriggerBytes(...).
Whenever we release an arena during background finalization we adjust gcTriggerBytes.
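A minimal sketch of that on-the-fly adjustment, assuming field and method names that mirror the patch (this is an illustration, not the real implementation):

```cpp
#include <cassert>
#include <cstdint>

const uint32_t ARENA_SIZE = 4096; // 4 KB arenas, as described later in this bug

// Whenever background finalization releases an arena, lower the allocation
// threshold that will trigger the next GC, instead of recomputing it once at
// the end of finalization.
struct CompartmentSketch {
    uint32_t gcTriggerBytes;

    void reduceGCTriggerBytes(uint32_t amount) {
        assert(amount > 0);
        assert(gcTriggerBytes >= amount); // never underflow the trigger
        gcTriggerBytes -= amount;
    }
};
```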
(In reply to comment #5)
> We do it on the fly with reduceGCTriggerBytes(...).
> Whenever we release an arena during background finalization we adjust
> gcTriggerBytes.

But then why do we need setGCLastBytes at all? With background finalization, arenas are freed and allocated at the same time, so the last GC bytes become rather moot. On the other hand, we only release chunks during the GC, so perhaps we should just focus on chunks for GC heuristics? I.e. the idea is to make gcBytes == the amount of allocated memory in chunks.

Also, did the background finalization patch decrease by one the number of GC cycles a free chunk lives before it returns to the system?
Gregor: could you move the removal of the trigger factor API to a separate bug? I will review that quickly - let's simplify the code first before any other changes.
Depends on: 603916
Attached patch patch (obsolete) — Splinter Review
update
Attachment #531452 - Attachment is obsolete: true
The patch removes chunks from the pool when the browser is idle during the GC. But this would not release the chunks that the following background finalization can free. Perhaps we should disable that finalization when the GC is idle?

An alternative is not to pool chunks based on the number of GC cycles they survive, but rather to pool a fixed number of chunks and release them when idling or when hitting OOM.
(In reply to comment #9)
> The patch removes chunks from the pool when the browser is idle during the
> GC. But this would not release the chunks that the following background
> finalization can free. Perhaps we should disable that finalization when the
> GC is idle?

Or can we do the chunk expiration at the end of the background finalization, not at the end of the GC?
Yes, we can move the chunk expiration. The worst case now is that we perform 2 GCs, so I don't see that as a big problem.

I am still experimenting with this patch and the frequency of GC calls.
This patch always calls a full GC, but I want to change this to compartment GCs.
So if we allocate additional arenas, I call compartment GCs every 10 secs and a full GC every minute.

The definition of "idle" is that there was no GC within the last N seconds and MaybeGC was called. So that doesn't mean that the browser is completely idle.

I also tried in bug 619822 to change the event queue to call MaybeGC during idle time but it turned out to be very complicated on all the different platforms.
Attached patch patch (obsolete) — Splinter Review
update

Only call idle GCs if we have allocated an arena or if chunks are waiting to be expired.
The frequency right now is 10 sec for compartment GC and one minute for a full GC.
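The two-tier timer described above could look roughly like this (constants and names are assumptions for illustration; the final patch below settled on a single 20-second timer instead):

```cpp
#include <cassert>
#include <cstdint>

const int64_t COMPARTMENT_GC_US = 10 * 1000000; // 10 s between compartment GCs
const int64_t FULL_GC_US        = 60 * 1000000; // 60 s between full GCs

enum GCKind { GC_NONE, GC_COMPARTMENT, GC_FULL };

// Decide which idle collection, if any, is due right now.
GCKind PickIdleGC(int64_t lastCompartmentGCUS, int64_t lastFullGCUS,
                  int64_t nowUS) {
    if (nowUS - lastFullGCUS >= FULL_GC_US)
        return GC_FULL;        // full GC at most once a minute
    if (nowUS - lastCompartmentGCUS >= COMPARTMENT_GC_US)
        return GC_COMPARTMENT; // cheaper per-compartment GC every 10 s
    return GC_NONE;
}
```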
Attachment #531550 - Attachment is obsolete: true
Blocks: 505308
No longer blocks: mlk-fx5+
Blocks: 654028
Blocks: 657115
Whiteboard: [MemShrink:P1]
OS: Mac OS X → All
Hardware: x86 → All
If it would help to know when the user is idle, we have nsIIdleService.
I talked to Gregor about this, he says there are two things left to do here:

- Check if it makes sense to perform a compartment GC with the compartment of the JSContext calling MaybeGC.

- And test if there are any regression with this patch.

He also said he's busy with other stuff for now.
No longer blocks: 664613
We really need this. Without this, it's very difficult for people to distinguish real leaks from late GC (especially if they don't know to click obsessively on the GC buttons in about:memory). Not to mention that it's probably a global performance win anyway to keep the heap down.
(In reply to comment #14)
> If it would help to know when the user is idle, we have nsIIdleService.

We also have user-interaction-active/inactive notifications to indicate
whether the user is using the browser. That might be more useful than
system-level idle time.

Also, especially because of mobile, don't fire some timer constantly.
If it is known that there isn't anything to collect, we should do nothing.
(In reply to comment #17)
> Also, especially because of mobile, don't fire some timer constantly.
> If it is known that there isn't anything to collect, we should do nothing.
Bug 508518
	

Though, that timer was firing every 5s. Perhaps something which fires
less often is ok.
This patch doesn't perform a GC every N seconds.
It only triggers a GC if we may shrink the heap (return unused chunks to the OS) or if we allocated new space after the last GC.
The patch also relies on some kind of JS activity in order to trigger a GC in MaybeGC.

Basically this patch triggers a GC in MaybeGC if there was no GC within the last N seconds and we are likely to collect some garbage.

The remaining question is if the compartment that we get out of the context that calls MaybeGC is a good target for a compartment-GC. Maybe it's better to just force a global GC there all the time or use some other logic? It works fine for my settings but I don't know if this is the right approach for 50+ tabs.

I am very busy with my university stuff right now. Maybe somebody should steal this bug if it's high priority.
Attached patch patch (obsolete) — Splinter Review
OK, let's keep it small and simple.

This patch calls the GC every 20 seconds from MaybeGC if:
a) a new chunk was allocated after the previous GC or
b) chunks are waiting to be returned to the OS.

A normal GC always resets the 20-second timer.
More complexity can be added later if needed.

This should solve the problem of growing heaps caused by long periods without GCs.
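The final heuristic can be sketched as follows (struct and field names are illustrative, loosely modeled on the flags discussed in the review below, not the exact patch code):

```cpp
#include <cassert>
#include <cstdint>

const int64_t GC_IDLE_US = 20 * 1000000; // 20 s

struct GCState {
    int64_t nextFullGCTimeUS;        // deadline, reset at the end of each GC
    bool chunkAllocationSinceLastGC; // (a) a new chunk was allocated
    bool chunksWaitingToExpire;      // (b) empty chunks could go back to the OS
};

// Called from MaybeGC; returns true when a timer-based GC should run.
bool MaybeIdleGC(GCState &st, int64_t nowUS) {
    if (nowUS < st.nextFullGCTimeUS)
        return false; // a GC ran within the last 20 seconds
    if (!st.chunkAllocationSinceLastGC && !st.chunksWaitingToExpire)
        return false; // nothing worth collecting
    // Any GC resets the timer and the allocation flag.
    st.nextFullGCTimeUS = nowUS + GC_IDLE_US;
    st.chunkAllocationSinceLastGC = false;
    return true;
}
```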
Attachment #531763 - Attachment is obsolete: true
Attachment #540273 - Flags: review?(igor)
Comment on attachment 540273 [details] [diff] [review]
patch

Review of attachment 540273 [details] [diff] [review]:
-----------------------------------------------------------------

r+ if it is sufficient to add comments to address the issues below.

::: js/src/jsgc.cpp
@@ +498,5 @@
>      for (GCChunkSet::Range r(rt->gcChunkSet.all()); !r.empty(); r.popFront()) {
>          chunk = r.front();
>          if (chunk->hasAvailableArenas()) {
>              cx->compartment->chunk = chunk;
> +            rt->gcChunkAllocationSinceLastGC = true;

The gcChunkAllocationSinceLastGC name contradicts the usage, as the flag is set when the arena allocation code picks a new chunk from the set of currently allocated ones. Is it intentional? 

If yes, we need a better name and comments here explaining why the flag is set even for chunks that may have almost no free arenas.

@@ +537,5 @@
>      for (GCChunkSet::Enum e(rt->gcChunkSet); !e.empty(); e.popFront()) {
>          Chunk *chunk = e.front();
>          JS_ASSERT(chunk->info.runtime == rt);
>          if (chunk->unused()) {
> +            if (chunk->info.age++ > MaxAge || gckind == GC_SHRINK) {

Change the condition to:

gckind == GC_SHRINK || chunk->info.age++ > MaxAge

This way we avoid useless chunk->info.age++ for soon to be released chunks.
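A tiny standalone demonstration of why the operand order matters (the enum and `MaxAge` value here are simplified stand-ins for the real jsgc.cpp definitions):

```cpp
#include <cassert>

enum JSGCInvocationKind { GC_NORMAL, GC_SHRINK };
const int MaxAge = 3; // illustrative value

struct ChunkInfo { int age; };

// With GC_SHRINK tested first, || short-circuits and the age counter is
// not incremented for chunks that will be released anyway.
bool ShouldReleaseChunk(ChunkInfo &info, JSGCInvocationKind gckind) {
    return gckind == GC_SHRINK || info.age++ > MaxAge;
}
```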

@@ +1094,5 @@
>  
>  void
>  JSRuntime::reduceGCTriggerBytes(uint32 amount) {
>      JS_ASSERT(amount > 0);
> +    JS_ASSERT((gcTriggerBytes - amount) >= 0);

Nit: remove () around gcTriggerBytes - amount

@@ +1106,4 @@
>  {
>      gcLastBytes = lastBytes;
> +
> +    size_t base = gckind == GC_SHRINK ? lastBytes : Max(lastBytes, GC_ARENA_ALLOCATION_TRIGGER);

Nit: Add () around gckind == GC_SHRINK. In general the style dictates () in front of ? for anything but simple names.

@@ +1113,5 @@
>  
>  void
>  JSCompartment::reduceGCTriggerBytes(uint32 amount) {
>      JS_ASSERT(amount > 0);
> +    JS_ASSERT((gcTriggerBytes - amount) >= 0);

Same here

@@ +1957,5 @@
> +        return;
> +    }
> +
> +    int64 now = PRMJ_Now();
> +    if (rt->gcNextFullGCTime && rt->gcNextFullGCTime <= now) {

gcNextFullGCTime is 64 bit and its assignments are not atomic on 32 bits. Add comments explaining why it is OK to use it here outside the locks.

::: js/src/jsgc.h
@@ +944,5 @@
>       * it is imperative that rt->gcPoke gets cleared early in js_GC.
>       */
> +    GC_LAST_CONTEXT     = 1,
> +
> +    /* Minimize GC triggers and return chunks right away. */

Nit: s/return chunks/release empty GC chunks
(In reply to comment #21)
> Comment on attachment 540273 [details] [diff] [review] [review]
> patch
> 
> Review of attachment 540273 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> r+ if it is sufficient to add comments to address the issues below.
> 
> ::: js/src/jsgc.cpp
> @@ +498,5 @@
> >      for (GCChunkSet::Range r(rt->gcChunkSet.all()); !r.empty(); r.popFront()) {
> >          chunk = r.front();
> >          if (chunk->hasAvailableArenas()) {
> >              cx->compartment->chunk = chunk;
> > +            rt->gcChunkAllocationSinceLastGC = true;
> 
> The gcChunkAllocationSinceLastGC name contradicts the usage, as the flag is
> set when the arena allocation code picks a new chunk from the set of
> currently allocated ones. Is it intentional? 
> 
> If yes, we need a better name and comments here explaining why the flag is
> set even for chunks that may have almost no free arenas.
> 

Right! This was my first approach where I didn't trigger a GC when empty chunks were around. They are covered now with gcChunksWaitingToExpire. I will remove the first assignment.
Attached patch patchSplinter Review
Attachment #540273 - Attachment is obsolete: true
Attachment #540273 - Flags: review?(igor)
Attachment #540492 - Flags: review?(igor)
Comment on attachment 540492 [details] [diff] [review]
patch

Review of attachment 540492 [details] [diff] [review]:
-----------------------------------------------------------------

Now gcChunkAllocationSinceLastGC becomes self-descriptive :)

::: js/src/jsgc.cpp
@@ +1971,5 @@
> +    int64 now = PRMJ_Now();
> +    /* 
> +     * We accept the fact that setting gcNextFullGCTime is not atomic on 32bit
> +     * and a race condition could trigger a GC.
> +     */

Nit: style dictates putting a blank line before comments. It looks even better to move the comment before the "int64 now" line. Also, in other places the word "tolerate" is used to describe acceptance of races. So write this as:

/*
 * On 32 bit, setting gcNextFullGCTime below is not atomic and a race
 * condition could trigger a GC. We tolerate this.
 */
int64 now = PRMJ_Now();
Attachment #540492 - Flags: review?(igor) → review+
http://hg.mozilla.org/tracemonkey/rev/41d5782eabf2
Whiteboard: [MemShrink:P1] → [MemShrink:P1], fixed-in-tracemonkey
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to comment #20)
> 
> This patch calls the GC every 20 seconds from MaybeGC if:
> a) a new chunk was allocated after the previous GC or
> b) chunks are waiting to be returned to the OS.
> 
> A normal GC always resets the 20 seconds timer.
> More complexity can be added later if needed.

Just to clarify... is the following right?

- If the browser is truly idle (ie. nothing is being allocated from the GC heap) nothing will happen.

- If some GC heap allocations are occurring... actually, I'm not sure what the behaviour is.

Can you clarify the new heuristic, for people who aren't authors of jsgc.cpp?  Thanks :)
BTW, here's some info from jst, hopefully this problem will be fixed by this change:

"I left firefox running here with a clean profile and one tab open with a single article from cbsnews.com loaded, and over the course of ~14h the RES mem usage grew from ~200 megs or so after loading the page to 640 megs. I noticed this because my wife's firefox twice now got up to 5+ gigs, and I finally traced it to be caused by cbsnews.com that she had forgotten open in a tab."  

about:memory was dominated by the JS heap.  Here's the abridged version:

550,738,724 B (100.0%) -- explicit
├──489,820,652 B (88.94%) -- js
│  ├──475,004,928 B (86.25%) -- gc-heap
(In reply to comment #27)
> (In reply to comment #20)
> > 
> > This patch calls the GC every 20 seconds from MaybeGC if:
> > a) a new chunk was allocated after the previous GC or
> > b) chunks are waiting to be returned to the OS.
> > 
> > A normal GC always resets the 20 seconds timer.
> > More complexity can be added later if needed.
> 
> Just to clarify... is the following right?
> 
> - If the browser is truly idle (ie. nothing is being allocated from the GC
> heap) nothing will happen.

Yes. MaybeGC is usually called once a JS script finishes executing. 

> 
> - If some GC heap allocations are occurring... actually, I'm not sure what
> the behaviour is.
>

Some basics again: 
We allocate 1MB chunks from the OS and split them up into 4 KB arenas. If a single arena contains reachable objects, we can't release the whole chunk.
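The arithmetic behind that, ignoring chunk-header overhead (so this is a simplification of the real layout):

```cpp
#include <cassert>
#include <cstddef>

const size_t CHUNK_SIZE = 1024 * 1024; // 1 MB allocated from the OS
const size_t ARENA_SIZE = 4096;        // 4 KB arenas carved out of a chunk
const size_t ARENAS_PER_CHUNK = CHUNK_SIZE / ARENA_SIZE; // 256

// A chunk can only be returned to the OS when every arena in it is free:
// a single reachable object pins the whole 1 MB.
bool ChunkReleasable(size_t liveArenas) {
    return liveArenas == 0;
}
```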

The problem was that our GC was very expensive and we tried to minimize the impact by weak GC triggers. So we allowed the heap to grow very fast without performing GCs. Most of the GC cost is gone because we don't do sweeping during the critical GC pause time any more and parallel marking should also speed up the remaining cost. The incremental GC is also a very important factor for the future. With this patch and also future patches we are trying to tighten our GC triggers and increase the frequency of GCs.

There are 2 things that this patch addresses:

1) We have this delayed scheme for releasing 1MB chunks to the OS because we don't want to free chunks after the GC and allocate them again right away. Releasing these chunks only happens after a GC.
Once a benchmark or an allocation-heavy animation finishes, we are usually in a state where many empty chunks are allocated but no GC is triggered, because we are waiting for allocation to hit a GC trigger. 
With this patch, we now also perform a GC if we have empty chunks allocated and don't use them within 20 seconds. So we return unused chunks to the OS after 20 seconds.

2) 
The second part deals with slow allocation rates or keeping the browser open overnight.
After a GC we fill up already-existing chunks/arenas. Once they are full and we allocate a new chunk from the OS, we set a flag, and after 20 seconds we perform a GC. So we trigger the GC right after a new chunk is allocated and no longer wait until the heap increases by 3x.


> Can you clarify the new heuristic, for people who aren't authors of
> jsgc.cpp?  Thanks :)

Hope that helps :)
I went through all the MemShrink:P2 bugs looking for cases where this GC trigger change will fix the problem.  I came up with this list:

- Confirmed fixed: bug 654028.

- Very likely to be fixed, not yet confirmed: bug 636077, bug 638238, bug 664128, bug 666101.

- Reasonable chance of being fixed: bug 666104, 666105.

- Slight chance of being fixed: bug 639780, bug 653817, bug 665628.

Not bad!  Thanks again, Gregor, for finding time to fix this :)

I'm nominating this for tracking-firefox6 because it has the potential to fix problems that a *lot* of users are seeing;  it'd be great to fix this in FF6 rather than FF7.  The risk seems low-to-moderate... not a lot of code was changed, but Gregor might have more to say about that.
Wow that's an impressive list. Thanks for tracking it!

This patch might have a much bigger impact for regular surf behavior than I expected. 
I was running a nightly build the whole day and noticed a much smaller memory footprint. I don't have any scientific measurements but usually the browser needed about 400-500MB after a few hours for my settings.  Today I noticed a reduction to about 200-300MB. It would be great if we could measure this impact somehow.

My explanation is that reusing memory instead of allocating new one makes a lot of sense for regular surf behavior. So this patch greatly improves the fragmentation-problem over time without influencing the benchmark scores in a bad way.

As for FF6: The risk of the new code itself should be low and if we can confirm the improvements we should take it.

It would be good if we could get feedback from users with 200+ tabs. I know they exist :)
They either see an improvement because the overall memory consumption decreases or they might complain about a GC-pause every 20 seconds.
While the improvements here are great, and I'm dying to get them in the hands of our users as soon as possible, I really don't think that this is Firefox 6 material.  I understand the urge to want to backport it on Aurora, but I would resist it, since officially Aurora is a place for stabilizing features, and introducing changes like this will only increase the risk factor and also make the next Aurora migration more painful.

Thanks for the awesome work, Gregor!  :-)
Summary: Change MaybeGC trigger → Increase GC frequency: GC occasionally based on a timer (except when completely idle)
Letting this wait for FF7 would be a huge mistake, if the improvement is indeed significant and the GC doesn't turn out to cause a problem. For any users still seeing major memory issues in FF5 (the jury still seems to be out), even the wait for FF6 will be too long. While I'm not sure I see the wisdom in waiting for FF6 at all if the benefits of this patch can be confirmed, I see none in waiting for FF7.

I don't think you fully appreciate, Ehsan, how many users are still hesitant to jump ship from 3.6 after 4.0 made the memory situation worse for many of them--there are a lot of people still waiting to find out if 5.0 is worth it. If it isn't, those people shouldn't be asked to wait another three months, let alone six. I'm a little hopeful myself but count me among the cautious. Memory improvements should be considered near-critical at this stage, even sneaked out with security updates if possible.

Just my two cents, but if stability is your concern I can think of few better improvements to it than to give the system back more memory. Indeed high memory use from people not using this patch is only going to obscure other memory bugs that need to be fixed, since it seems to impact so many issues. By all means let's be sure this fix is as good as it seems, but let's not wait on it once we know.
> and the GC doesn't turn out to cause a problem.

That's the rub, no?

> While I'm not sure I see the wisdom in waiting for FF6

Are you suggesting this should be shipped as an unscheduled security update, then?

> those people shouldn't be asked to wait another three months, let alone six.

FF7 is shipping in less than 3 months.  FF6 is shipping in 8 weeks.  No one is talking about six months.

> even sneaked out with security updates if possible.

FF6 _is_ the security update for FF5.

> but let's not wait on it once we know.

Again, that's the rub.  To get this into FF6 we need to "know" within the next several weeks at the most.
(In reply to comment #34)
> Are you suggesting this should be shipped as an unscheduled security update,
> then?

I'm suggesting if the fix is confirmed solid, and a critical security update comes up, it's worth considering bundling it in. The memory problem has been so bad for so long that a fix of this significance should be seriously considered for release as early as absolutely possible. Waiting till August would be a disappointment; waiting longer would be worse.

I was misinformed on the release schedule; while I'm glad to know FF6 is out in eight weeks, that's still two months and that's a long time for users with heavy memory problems to wait. So far what I'm seeing on Twitter suggests FF5 improves memory for some, but others have seen no improvement over FF4. I'm glad the devs are now taking memory bloat more seriously than in the past (not to say it was ever unserious), but it needs to be treated with urgency as well.
I guess the risks are increased power use and less-smooth animations.  Those can take a while to shake out.
This patch can potentially help us with a lot of memory issues. But it's limited too. For example in a single tab, the biggest amount of unreclaimed memory will be 70MB (at which point we *do* finally do a GC even without this patch).

It is not clear to me if the 70MB is a limit for all tabs or each one. If it is all tabs, then I think the motivation to take risks in order to get this out earlier is lessened (since it's just 70MB altogether; that's not great obviously, but I doubt it's worth stability risks). If it is per tab, I'm not sure anymore.
not going to track this and I think it's probably too late for 6 but we can have that discussion if someone nominates the patch.
(In reply to comment #37)
> This patch can potentially help us with a lot of memory issues. But it's
> limited too. For example in a single tab, the biggest amount of unreclaimed
> memory will be 70MB (at which point we *do* finally do a GC even without
> this patch).

I don't think that's true.  Look at the bugs in comment 30.  We saw some much bigger reductions than that.
Comment on attachment 540492 [details] [diff] [review]
patch

Nominating the patch for approval-mozilla-aurora.  My argument:  this is a bad (performance) bug that was introduced in FF 4.0.0.  And we're allowed to fix old, bad bugs on Aurora, surely?  So let's fix it.

We've had arguments from both sides now, so let's leave the decision up to the release drivers.
Attachment #540492 - Flags: approval-mozilla-aurora?
(In reply to comment #39)
> (In reply to comment #37)
> > This patch can potentially help us with a lot of memory issues. But it's
> > limited too. For example in a single tab, the biggest amount of unreclaimed
> > memory will be 70MB (at which point we *do* finally do a GC even without
> > this patch).
> 
> I don't think that's true.  Look at the bugs in comment 30.  We saw some
> much bigger reductions than that.

That could be true. I was basing my statement on Slashdot and on what I saw when I reduced it to a smaller testcase. I thought I ended up with something pretty generic but perhaps there are other use patterns that end up differently.
(In reply to comment #36)
> I guess the risks are increased power use and less-smooth animations.  Those
> can take a while to shake out.

For animations like FOTN, the cycle collector currently triggers a global GC every 4 seconds, so this patch doesn't influence that with the 20-second waiting time.
Using less memory should also decrease the overall power consumption.


Our memory footprint is a big problem and I filed this bug because FF was/is very painful to use or even unusable (see bug 655455) on my new lowest-end netbook. This patch tries to keep the memory footprint small and therefore is a big win on such devices. Finally I can use FF on my new machine.
Also note that even if it was 70 MB, that's a real lot on mobile devices.
(In reply to comment #40)

> So let's fix it.

dmandelin encouraged me to be more specific about the risks and benefits here.

Benefits:

- It fixes a widespread problem.  Confirmed to fix bug 654028, bug 636077, bug 638238, bug 664128, bug 666101, bug 666104.  Bug 661527 was already marked as a dup of this bug.  And see Gregor's netbook experience in comment 42;  presumably mobile Firefox will benefit even more.

- Memory reductions are significant.  In most of the bugs it's in the order of several tens of MB up to ~100MB, though the measurements reported are generally pretty inconsistent so it's hard to be sure.  

  Bug 664128 has better measurements;  without this patch the user saw the "private bytes" on windows go up to 510MB when idle on gamespot.com for 4 hours (http://img221.imageshack.us/img221/2010/firefoxmemoryusage4hourb.png).  With the patch the private bytes never went above 140MB (http://img24.imageshack.us/img24/9255/firefoxgamespotcom9h.png).  That's a factor of 3.6x reduction.

  Bug 654028 also had good measurements, showing we save 70MB. 

- We have tentative evidence that it reduces fragmentation in the JS heap.  See bug 666101 comment 5 and subsequent comments -- after idling for a while on sina.cn.com and then triggering GC, the end GC heap size was 26MB with the patch, and 32MB without.

- Basically, it fixes the "I left Firefox running overnight and now my machine has locked up" problem.  Which is a big problem.


Risks:

- Could affect some benchmarks, but we know (AFAIK) it doesn't affect Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an expert on GC and benchmarks.

- Could cause mobile users to use more power or wake up.  But the extra GCs aren't triggered when the browser is truly idle;  JS allocations have to be already happening.  (Gregor, correct me if I'm wrong about that.)

- Could make animations jerky, if we collect in the middle of one.  Maybe... not sure.  That can already happen.

Note that these are all "coulds".
Stop the presses: bug 666845 was just filed, and looks almost certain to be fixed by the change.
(In reply to comment #44)
> 
> Risks:
> 
> - Could affect some benchmarks, but we know (AFAIK) it doesn't affect
> Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an
> expert on GC and benchmarks.
> 

The benchmark would have to run JS for 20sec without allocations. I don't know of any benchmarks that do that.

> - Could cause mobile users to use more power or wake up.  But the extra GCs
> aren't triggered when the browser is truly idle;  JS allocations have to be
> already happening.  (Gregor, correct me if I'm wrong about that.)
> 

We don't wake up the browser for that. The GC usually happens after some JS execution. 

> - Could make animations jerky, if we collect in the middle of one.  Maybe...
> not sure.  That can already happen.
> 

Animations usually have a high allocation rate, and other triggers take care of the GC. Nothing changed there. The new trigger makes sure that after an animation we shrink the heap again.
Nicholas/Gregor,

Under the risk section, is there also an increased possibility of crashes and *reduced* stability?

We've seen evidence of GC on the stack in lots of crash reports, so the theory here would be that if GC is running more frequently, the risk of stomping on memory in these kinds of GC-related crashes could go up.   

I don't think we have good metrics or test cases that demonstrate how often GC is involved in crashes, so this might need to be one of the things we watch for in evaluation when this hits mozilla-central and aurora, now or later.
(In reply to comment #47)
> Nicholas/Gregor,
> 
> Under the risk section, is there also an increased possibility of crashes
> and *reduced* stability?
> 
> We've seen evidence of GC on the stack in lots of crash reports, so the
> theory here would be that if GC is running more frequently, the risk of
> stomping on memory in these kinds of GC-related crashes could go up.   
> 
> I don't think we have good metrics or test cases that demonstrate how often
> GC is involved in crashes, so this might need to be one of the things we
> watch for in evaluation when this hits mozilla-central and aurora, now or
> later.

Well, I guess it's a possibility.  But the idea "GC sometimes causes crashes, so we better not run it" makes me want to cry.  It's not like GC is some optional thing, it's utterly integral to JavaScript execution.

Also, if some memory has been corrupted that causes GC to crash, it'll presumably happen whenever the next GC happens, whether that's right now or a few minutes from now.

Anyway, enough speculation... this landed on Nightly on June 21 -- can we look at the crash stats since then and see if GC-related crashes have increased?
More data: see bug 657115 comment 26 and subsequent comments.  The ROME demo saw slightly reduced memory usage, maxing out at 2.0GB instead of 2.2GB.  azakai said there are "more" GCs, some of which are quite noticeable.  Gregor said the number of GCs performed for him increased from 15 to 18.  Note that the demo already had some pauses, and Google Chrome also has significant pauses.

This is an *extremely* memory-intensive demo, however, and doesn't represent typical web usage, so I'd put a low weight on it.  Other GC changes (parallel marking, incremental GC, generational GC) will be needed to make this demo behave better.
(In reply to comment #45)
> Stop the presses: bug 666845 was just filed, and looks almost certain to be
> fixed by the change.

More evidence for the prosecution, m'lud:  the fix has been confirmed by the reporter.  The example script in the bug report previously caused the reporter's machine to crash due to OOM, now the "mem used" (reported by Win XP) peaks at 154MB.  OOM vs 154MB!  That's a big difference.
To be fair, they were running it on a VM with only 256MB of memory, half of the recommended minimum for FF4.
This could reduce GC-related crashes by:

* avoiding the extra complexity of allocation-triggered GCs (stack scanning, reliance on temporary rooting, leaving trace)

* avoiding running out of VM

* improving the quality of crash reports and bug reports when we have heap-corruption bugs

I would not expect this change to increase crashiness. If the heap is in a state where GCing will crash, and script isn't in the middle of running, it's unlikely to leave that state just by running more script before GCing.
(In reply to comment #48)
> Anyway, enough speculation... this landed on Nightly on June 21 -- can we
> look at the crash stats since then and see if GC-related crashes have
> increased?

We'll closely watch this, but we probably won't have suitable data until this is on Beta (as both Nightly and Aurora tend to have too few users for decent crash data) so I'm (from a crash analysis POV) all for getting this into Aurora before the next uplift - if nothing bad happens on Nightly, that is. ;-)
(In reply to comment #44)
> 
> 
> Risks:
> 
> - Could affect some benchmarks, but we know (AFAIK) it doesn't affect
> Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an
> expert on GC and benchmarks.
> 
> - Could cause mobile uses to use more power or wake up.  But the extra GCs
> aren't triggered when the browser is truly idle;  JS allocations have to be
> already happening.  (Gregor, correct me if I'm wrong about that.)
> 
> - Could make animations jerky, if we collect in the middle of one.  Maybe...
> not sure.  That can already happen.
>

Another possible area to test, beyond benchmarks and animations, is "web 2.0 apps".

The impact seems low or non-existent on small/fast loading pages.

Maybe moderate on larger/more complex pages that take a while to load, since the loading window is longer and GC has a bigger opportunity to kick in at some point.

Could the risk be even higher on gmail/web mail/other apps that are "running all the time"?   The theory here is that at some point (and now more frequently) GC kicks in and starves interactions and updates on the page.

> Note that these are all "coulds".

Right, the hard part is getting this out in front of a lot of users, then watching for vague hints of problems that come back in feedback reports or bugs, and then having them come back to be investigated as part of this bug, or not.   That's the work ahead for landing in either Firefox 6 or 7.
> Could the risk be even higher on gmail/web mail/other apps that are "running
> all the time"?   The theory here is that at some point (and now more
> frequently) GC kicks and starve interactions and updates on the page.

Apps that are running all the time are exactly the ones that will *benefit* most from this patch.  We might see some GC pauses, but we'll see smaller pauses more frequently, instead of the occasional big one.  (And note that these small pauses happen when the browser is mostly idle, ie. when the user is unlikely to notice.)
And we'll avoid paging, which is what really kills performance.

I think the analysis here is pretty much complete.  We have a large number of clear wins, and no clear losses.  Each suggested risk has had some kind of counter-argument saying why it's unlikely to be a problem.  I agree with KaiRo;  the only way to get this into FF6 is to put it in Aurora now, and watch the betas closely.  We can back it out of beta if we see ill effects.  A release driver needs to say "yes" or "no".
For whatever it's worth (as a LONG-time ago driver, and as a ridiculous-number-of-tabs user), I'm strongly in favor of taking this for FF6.  The number one complaint I get from friends about FF4 (and I don't see it changing in FF5) is "my browser ate all my memory just sitting around!"  And this is *horrid* behavior for a mobile (fennec) or near-mobile (netbooks) client.

Risk of crashes seems extremely low; risk to performance seems very low, and in some cases this change will improve performance by avoiding a big GC when you use the browser after it has sat idle for a while, and avoid possible paging at that time - look at the netbook bug.
I guess I should make one very important advantage of this patch more explicit.

Currently most of our GCs happen during allocation-heavy situations. If you enter GMail you are very likely to hit a GC trigger because GMail does a lot of allocation.
This happens at the worst time: right during the page-load.

Much better would be to perform a GC before we reach an allocation-heavy situation, to make as much space as possible in order to avoid a GC during the critical page load.
That's exactly what this patch does. It makes sure that we are always ready for an allocation-heavy situation.
Waiting with a garbage-collected heap is *always* better than waiting with an almost full heap.
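To make the heuristic being discussed concrete, here is a minimal self-contained sketch of a timer-based MaybeGC check. All names and the interval constant are illustrative assumptions, not SpiderMonkey's actual code: the idea is simply "collect if the normal allocation trigger fires, or if allocations have happened since the last GC and more than N seconds have passed" - so a completely idle runtime is never woken up just to collect.

```cpp
#include <cassert>
#include <cstdint>

// Assumed interval: GC if the last collection is older than this and
// script has allocated since then. N = 20 seconds is a made-up value.
constexpr uint64_t IDLE_GC_INTERVAL_MS = 20 * 1000;

struct Runtime {
    uint64_t lastGCTimeMs = 0;          // timestamp of the last collection
    uint64_t gcBytes = 0;               // bytes currently on the GC heap
    uint64_t gcLastBytes = 0;           // heap size right after the last GC
    uint64_t gcTriggerBytes = 1 << 20;  // allocation-based trigger threshold

    // Returns true if a GC should run now. The timer rule only fires when
    // allocation activity has happened since the last GC, so a completely
    // idle runtime never schedules extra collections.
    bool maybeGC(uint64_t nowMs) const {
        if (gcBytes >= gcTriggerBytes)
            return true;  // normal allocation-based trigger
        bool allocatedSinceLastGC = gcBytes > gcLastBytes;
        return allocatedSinceLastGC &&
               (nowMs - lastGCTimeMs) > IDLE_GC_INTERVAL_MS;
    }
};
```

Under this sketch, a tab that allocated during page load but then went quiet gets collected shortly afterwards, instead of holding its almost-full heap until the next allocation burst.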
This is probably a stupid question, but as I'm not familiar with the process--what does it take for a release driver to take notice of this issue? It was nominated days ago but there's no answer on it. Is that common, and if so does it mean this issue could potentially be ignored until inclusion in FF6 is moot, or does it usually get decided right before a beta?
The issue was nominated after 5pm Pacific on Thursday last week.

Triage for Aurora nominations happens Tuesday/Thursday 2pm-3pm Pacific and Wednesday 11:30am-noon Pacific.  See https://wiki.mozilla.org/Firefox/Aurora

So no answer is expected so far. Give it about 6 hours. ;)
Comment on attachment 540492 [details] [diff] [review]
patch

After a wonderful debate, the nays have it. This should have a full Aurora cycle.
Attachment #540492 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Nay as in it's not going into FF6? If so I'm curious of the logic behind that decision.
(In reply to comment #62)
> Nay as in it's not going into FF6? If so I'm curious of the logic behind
> that decision.

Please take that discussion elsewhere, it doesn't belong in the bug report.


Drivers, with that decision, just be clear that we probably will not be able to tell if this has any impact on crashes of any kind before the end of August at the earliest, when Beta builds with it have been out for some time.
At the request of Nicholas Nethercote in Comment #11 of Bug 652801 I tried the nightly Firefox 7.0a1 (2011-06-28) with the same large set of windows and tabs that was causing the problem I reported in Comments 6, 7, and 8 of Bug 652801. Memory use stayed constant overnight. Seems like the problem has been fixed.
(In reply to comment #63)
> > Nay as in it's not going into FF6? If so I'm curious of the logic behind
> > that decision.
> 
> Please take that discussion elsewhere, it doesn't belong in the bug report.

It's not an unreasonable question.  The nomination for back-porting to Aurora happened in this bug, the discussion of the benefits and risks has been happening in this bug, so an explanation (however brief) of why it was rejected would be perfectly suitable here.

Ultimately, I'm sure the reasoning can be boiled down to "the risks are too high, this needs more time in the Aurora channel to make sure there are no problems".  But the phrase "wonderful debate" has piqued my interest -- was that in reference to the comments in this bug, or the discussion in the meeting?  I'd be interested to hear, for example, if any of the mentioned risks were perceived to be more significant than others.
Sorry I didn't take better notes during triage so I can't really relay the specifics of the debate. I'll just say that we have a diverse group of drivers with varying comfort levels for risk. I'm one of the bravest (most foolish?) and I love saying yes. Others are somewhat more conservative and lean towards using Aurora and Beta for their defined purpose -- which should be almost exclusively subtractive change (back our way out to known good state.) 

You are correct that the final call was "too much risk without enough confidence that we could identify any major problems before release." 

We're not as consistent, even within the core group of drivers, as we ought to be and these decisions will probably never be purely mechanical, but we're getting better and over time I expect the whole community will come to a more shared understanding of what kind of risk for what kind of reward we're willing to accept in Aurora and Beta. Until then, bear with us all and our learning curve.
(In reply to comment #66)

Oh, and I forgot one very important thing. You all are invited to these meetings too. We do not triage Aurora and Beta nominations and requests in secret and we'd love more attendance -- especially from the feature developers and advocates who might be able to give us better information about risk and reward than we can get from what's in the bug. Meetings happen three times a week and if you're joining us for a particular bug, let us know and we can try to get to that one up front so you can participate without having to sit through the whole grueling session.

Aurora triage sessions happen on Tuesday and Thursday at 2PM Pacific https://wiki.mozilla.org/Firefox/Aurora  Beta triage happens Mondays at 2PM Pacific https://wiki.mozilla.org/Firefox/Beta We also meet Wednesday after the Product Planning meeting which happens at 11am Pacific and that session varies between Beta and Aurora depending on which needs us the most.
Asa, thanks for the thoughtful response!

(In reply to comment #66)
> Others are somewhat more conservative and
> lean towards using Aurora and Beta for their defined purpose -- which should
> be almost exclusively subtractive change (back our way out to known good
> state.) 

I had assumed that fixing pre-existing bugs would also be considered legitimate.  Eg. if we found a bad security bug that had been introduced in FF4, presumably we'd consider backporting it to Aurora and Beta?

I thought the same logic could apply here, since this bad GC behaviour was introduced in FF4.  (Note I'm not trying to change the decision, I'm just trying to understand what's considered acceptable for Aurora/Beta.)  Obviously, as we go through the rapid release cycle a few times the number of these "existed since FF4 or earlier" bugs will diminish.
(In reply to comment #68)
> Asa, thanks for the thoughtful response!
> 
> (In reply to comment #66)
> > Others are somewhat more conservative and
> > lean towards using Aurora and Beta for their defined purpose -- which should
> > be almost exclusively subtractive change (back our way out to known good
> > state.) 
> 
> I had assumed that fixing pre-existing bugs would also be considered
> legitimate.  Eg. if we found a bad security bug that had been introduced in
> FF4, presumably we'd consider backporting it to Aurora and Beta?

Yes. We certainly take bugfixes on Aurora and Beta but we also ask ourselves whether backing out the thing that caused the regression is a better path. And our risk aversion for taking additive fixes rather than subtractive goes up the further we are into the cycle. You'll note, for example, that we approve larger and scarier things at the beginning of Aurora than at the end or than in Beta -- unless it's absolutely critical like for security or an intolerable topcrash or something. 

> I thought the same logic could apply here, since this bad GC behaviour was
> introduced in FF4.  (Note I'm not trying to change the decision, I'm just
> trying to understand what's considered acceptable for Aurora/Beta.) 
> Obviously, as we go through the rapid release cycle a few times the number
> of these "existed since FF4 or earlier" bugs will diminish.

Yeah. That's the case I made in favor of taking it. I asserted that this could even fix OOM crashes (though I may be wrong about that -- it could have been other fixes on the trunk alleviating some of the OOM crashes my friends and family were seeing) and the lowered memory usage (getting us back closer to 3.6 era usage) was a very important regression fix. Had this been the first week of Aurora, I might have succeeded in convincing the group. Had this been the only scary change we were evaluating for late in the game or had we not already sort of exhausted our tolerance for risk with already approved fixes then things might have been different.

So a fair bit of context and then straight up arguing plays a role here. That means it's not perfectly predictable or even totally consistent. But as I said above, I think it works pretty well not being a purely mechanical system and we just have to change the expectations we all have as we change to a different release cadence and process. Once we've done a few of these things, it will probably fall into a much more comfortable and consistent rhythm.
> Had this been the only scary change we were evaluating for late in
> the game or had we not already sort of exhausted our tolerance for risk with
> already approved fixes then things might have been different.

Thanks again for the details.  Not everyone can listen in on the meetings (eg. I'm in an awkward timezone) so putting the information here is greatly appreciated.
Future GC changes will increase the GC frequency dramatically. Incremental GC will perform 10x, 100x or even N00x more GCs. So we definitely have to rely on our GC.
Data about the effect of (slowly) increasing the frequency will help Bill.
(In reply to comment #68)
> I had assumed that fixing pre-existing bugs would also be considered
> legitimate.  Eg. if we found a bad security bug that had been introduced in
> FF4, presumably we'd consider backporting it to Aurora and Beta?

See http://mozilla.github.com/process-releases/draft/development_specifics/, to quote from that document:

<quote>
What happens where?

  mozilla-aurora

    Preffing off and backing out fixes/features which the central channel exposes as problematic with a wider audience.
    Remaining localizations
    Landing spot fixes to get the product in a shipping state
    Merge from mozilla-shadow (see below) and fixes/backouts for any resulting issues
    Preffing off and backing out fixes/features which were deemed ready but testing determined they in fact were not

</quote>

According to our policy, we are not supposed to fix bugs which have existed in the previous releases on Aurora.  We need to move away from our old mindset of trying to squeeze in bug fixes in a release out of the fear of the next release happening in an unknown point in the future, and try to implement the policy that we agreed to when we were first designing the rapid release mode.

Unfortunately, we've been too welcoming to Aurora changes.  This may cause us problems (see bug 655703) or make the future migrations more painful than necessary.  Perhaps I need to start attending the triage sessions and start saying No to almost everything.  No is a good word, we should use it more often.  ;-)
(In reply to comment #72)
 
> According to our policy, we are not supposed to fix bugs which have existed
> in the previous releases on Aurora.  We need to move away from our old
> mindset of trying to squeeze in bug fixes in a release out of the fear of
> the next release happening in an unknown point in the future, and try to
> implement the policy that we agreed to when we were first designing the
> rapid release mode

I agree that we need to move away from our old mindset of trying to squeeze bug fixes into a release like that but I don't agree that this document is "policy". It's a guideline that's designed to help engineers and drivers do the right thing. Also, we intentionally gave ourselves room to land additive fixes on Aurora so that we could avoid backing out very nearly complete features if the remaining work was well understood, well contained, and low risk. I think we should add to that that we're very much interested in evaluating fixes for major crashes, hangs, or memory situations -- especially where they're regressions from the previous release. That absolutely should not be the primary purpose of Aurora, but I think that we'd be foolish to not take something like a null check for a topcrasher regression on Aurora because we'd put in place a policy that said we don't fix regressions in Aurora.

That being said, I expect that over time our engineers, project managers, and teams like crashkill will put their major focus on fixing these problems on trunk and not letting them make it up to Aurora. Right now I don't think that's enough of the case, so the "eye of sauron" the drivers can direct towards egregious regressions when they make it up to Aurora and Beta, and the freedom to take well-contained and low-risk regression fixes, plays an important role in keeping Firefox from degrading from release to release.

Gregor and njn, sorry to have taken over your bug with all of this project management conversation. I think it's valuable and I will try to capture what's been said here for a blog post or a .planning post or something.  We're still learning and I would love for this conversation to continue in a way that doesn't spam this bug's cc list so badly and that will likely be seen by a broader audience. 

Ehsan, thank you for the comment. I think you're absolutely right about all of us needing to learn to use the new model well. We do all have to change how we think (and that includes the release drivers who are grappling with all the same issues during this transition.)
I'm not a software project manager per se but I've been in the position of trying to convince Java developers that the user perception of a Java exception with its scary looking stack trace will always be "the software is broken". Don't allow a "harmless" stack trace to appear when the condition is an expected outcome that can be explained as such, e.g. "No books have the title specified."

Users have the same perception of crashes: the software is broken. This is even more true if the crash happens when "I wasn't even doing anything".

Taking better advantage of graphic cards is progress. But a program that doesn't crash, responds quickly, uses memory efficiently, and scales well is priceless. Most users don't know exactly how to measure these things but they have a gut feeling and that feeling translates into market share.

The sooner we can get this fix into the release, the better.
I understand the need for everyone to adjust to the new model and that there's great value in sticking to policy. However as an end user, this decision feeds into a common perception that memory issues are not being treated with the gravity they deserve.

Whatever the policy issues, they're a weak argument against a bug fix that has such an extremely low risk/reward ratio AND fixes a regression of great importance. This isn't a minor or moderate fix like so many others, where denying a backport is more sensible. No may be a dandy word, but the end result is that Joe User won't see this fix for three more months. It beats years, but Joe has already been waiting a few of those too. Joe also doesn't understand why this patch can't make it into 5.0.1, even though the arguments against that are much stronger. Many of Joe's friends are still using 3.6 or even 3.5 because of the regression. Bottom line: when FF6 comes out, the public will still see it as broken.
> that has such an extremely low risk/reward ratio

See, that ratio is what people are disagreeing on...

Just so we're all clear on this, by the way, we're talking a difference between August 16 (Firefox 6) and September 27 (Firefox 7) for when this gets released.
(In reply to comment #75)
> No may be a dandy word, but the end
> result is that Joe User won't see this fix for three more months.

Six weeks, not three months.

And I don't think you, as someone not on the developer or release management teams, can reasonably assess this to be "an extremely low risk/reward ratio". We actually don't know the real risk, nor the full extent of the rewards.

(In reply to comment #74)
> Users have the same perception of crashes: the software is broken. This is
> even more true if the crash happens when "I wasn't even doing anything".

There's no clear knowledge yet, only speculation, as to whether this patch has any influence on crashes at all. This is a memory cleanup patch, not a crash-fixing patch.

And Ehsan, your comment sounds very good, but omits that the channels aren't yet what we planned them to be. We set them up to have a million users on Aurora and ten million on Beta - right now, we have 50-60k on Aurora and 1-2 million on Beta. We need to work on that, and I know this is being done, but until we get nearer to what we set out to have, we e.g. can't see a lot of useful crash analysis on Aurora, and we need to deal with that.
> This is a memory cleanup patch, not a crash-fixing patch.

True. Until you hit the 2 GB process limit on a 32-bit platform. In my case I was seeing memory leak at ~100 MB / hr. By the next morning Firefox would crash. Users don't understand GC; they understand coming back to their computer the next day and seeing that Firefox has crashed. Or they wonder why it takes so long to unlock their screen, never understanding that 500M of memory is being swapped before their desktop manager and related programs can fully run. How many computers are thrown out every year because they are "slow" and the user has no idea how to fix the problem? We can't fix Microsoft but let's not add to the problem.

If Firefox 7 is really scheduled for September 27th, okay. But which annoying option do I choose in the meantime: use the nightly and live without all my extensions, hack all my .xpi files to tell them they are compatible with Firefox 7, or go back to Firefox 4/5 and do a binary search to figure out which tab(s) are leaking?
(In reply to comment #78)

> If Firefox 7 is really scheduled for September 27th, okay. But which
> annoying option do I choose in the meantime

Use the nightly builds, get the Aurora build next week, or wait for Beta or release. Those are your options.
This is pretty off topic, but the Add-on Compatibility Reporter basically does the same thing as editing your .xpi files, without the hassle. https://addons.mozilla.org/en-US/firefox/addon/add-on-compatibility-reporter/
I'm confused why this patch fixes severe "left Firefox open overnight" problems. Shouldn't our other GC heuristics, based on the amount of memory allocated, have caused Firefox to run GC often enough? Maybe we should backport bug 660734 and bug 654399 instead?
Jesse, the memory allocation heuristics are based on the amount of memory allocated on the JS heap.

If there's a significant non-JS-heap component to the allocations but that non-JS-heap memory is kept alive by JS objects (e.g. DOM nodes, canvas backing stores, etc, etc) you could have pretty noticeable memory growth without the JS engine deciding to GC, I suspect.
Jesse:  also, the allocation-based triggers are really weak.  It's something like "don't GC until the heap has grown to 3x its previous size".  AIUI, it was tuned for FF4 mostly to avoid GC during Sunspider and V8bench, because GC'ing during them kills performance.
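A minimal sketch of the multiplicative trigger described above may make the weakness obvious. The factor of 3 and all names here are assumptions for illustration, not SpiderMonkey's actual tuning constants:

```cpp
#include <cassert>
#include <cstdint>

// Assumed growth factor: don't GC until the heap has grown to ~3x its
// size as measured right after the previous collection.
constexpr double GC_HEAP_GROWTH_FACTOR = 3.0;

struct GCTrigger {
    uint64_t lastBytes = 0;     // heap size recorded after the previous GC
    uint64_t triggerBytes = 0;  // threshold that starts the next GC

    // Called at the end of a collection with the surviving heap size.
    void setGCLastBytes(uint64_t bytes) {
        lastBytes = bytes;
        triggerBytes =
            static_cast<uint64_t>(bytes * GC_HEAP_GROWTH_FACTOR);
    }

    // With this rule alone, a heap that settles at a large size after a
    // burst of allocation is not collected again until it triples -- which
    // is why a purely size-based trigger can let garbage sit unreclaimed
    // for hours in a mostly-idle browser.
    bool shouldGC(uint64_t currentBytes) const {
        return currentBytes >= triggerBytes;
    }
};
```

Note also that this counts only JS-heap bytes, so (as bz points out above) non-JS memory kept alive by JS objects never moves the trigger at all.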
(In reply to comment #73)
>
> Gregor and njn, sorry to have taken over your bug with all of this project
> management conversation.

No need to apologize!  This discussion has been helpful, the context is much clearer than it was after comment 60 (and comment 63).
Agreed, the commentary here has been helpful.  And I understand Ehsan's feeling that we should be saying "No" more often, having been a driver in a previous life (and probably having said "Yes" too much).  

Obviously we're all still coming to grips with the new release roadmaps, and it will take a while; adjustments may well be needed not just in what we expect to get in, but also in the process itself as we get experience with the impacts and inputs - see for example the discussion of Aurora and Beta uptake rates.  Probably we should have some semi-formal postmortem on how the process is working after each release, at least for a while.

I weighed in on the yes side because of the perceived risk/reward ratio, and because I felt this was a major regression that we should have fixed in FF5 (if we had been in the Aurora stage for FF5, would we have taken this fix as it was a regression we'd released in FF4?  Some comments above have implied that we might have.)

One thing that should always weigh into our decisions (though not rule them!) is what the *perception* by end-users will be of taking or not taking a change.  This can affect long-term perception of the quality of FF and of uptake/recommendation rates.  It's a weighting factor.  This bug gets a high weight-factor by that scale.  That doesn't necessarily overrule the arguments in favor of No, and as I wasn't on the call where the decision was made, I can accept the decision.  Just something to remember as we go forward and evaluate other Aurora requests.
Target Milestone: --- → mozilla7
Firefox 6 is gone. Not going to track this for a release we're not going to approve it for.