Last Comment Bug 656120 - Increase GC frequency: GC occasionally based on a timer (except when completely idle)
: Increase GC frequency: GC occasionally based on a timer (except when complete...
Status: RESOLVED FIXED
[MemShrink:P1], fixed-in-tracemonkey
: footprint
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal with 4 votes (vote)
: mozilla7
Assigned To: Gregor Wagner [:gwagner]
:
:
Mentors:
: 619822 661527 678185 (view as bug list)
Depends on: 603916
Blocks: GC 657115 636077 638238 653817 654028 mlk-fx6 661527 664128 664613 666101 666104 666845
  Show dependency treegraph
 
Reported: 2011-05-10 13:31 PDT by Gregor Wagner [:gwagner]
Modified: 2013-12-27 14:35 PST (History)
47 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-


Attachments
patch (19.46 KB, patch)
2011-05-10 13:37 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (19.49 KB, patch)
2011-05-10 13:42 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (11.80 KB, patch)
2011-05-10 23:53 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (12.44 KB, patch)
2011-05-11 15:00 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (14.63 KB, patch)
2011-06-18 15:39 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (14.23 KB, patch)
2011-06-20 09:33 PDT, Gregor Wagner [:gwagner]
igor: review+
asa: approval‑mozilla‑aurora-
Details | Diff | Splinter Review

Description Gregor Wagner [:gwagner] 2011-05-10 13:31:40 PDT
There are many open bug reports about not reclaimed memory (not leaks).
We don't have the notion of an "idle" GC.
The idea is to perform the GC in MaybeGC if there was no GC during the last N seconds.
Comment 1 Gregor Wagner [:gwagner] 2011-05-10 13:37:28 PDT
Created attachment 531446 [details] [diff] [review]
patch
Comment 2 Gregor Wagner [:gwagner] 2011-05-10 13:38:14 PDT
Igor what do you think about this patch and can we remove the browser mem_frequency?
Comment 3 Gregor Wagner [:gwagner] 2011-05-10 13:42:43 PDT
Created attachment 531452 [details] [diff] [review]
patch

update.
Also perform idle GC when chunks are waiting to expire.
Comment 4 Igor Bukanov 2011-05-10 15:00:49 PDT
(In reply to comment #2)
> Igor what do you think about this patch and can we remove the browser
> mem_frequency?

The patch still do not try to recalculate the last GC bytes at the and of the finalization.
Comment 5 Gregor Wagner [:gwagner] 2011-05-10 15:04:29 PDT
(In reply to comment #4)
> (In reply to comment #2)
> > Igor what do you think about this patch and can we remove the browser
> > mem_frequency?
> 
> The patch still do not try to recalculate the last GC bytes at the and of
> the finalization.

We do it on the fly with reduceGCTriggerBytes(...).
Whenever we release an arena during background finalization we adjust gcTriggerBytes.
Comment 6 Igor Bukanov 2011-05-10 15:19:00 PDT
(In reply to comment #5)
> We do it on the fly with reduceGCTriggerBytes(...).
> Whenever we release an arena during background finalization we adjust
> gcTriggerBytes.

But then why we need setGCLastBytes at all? With background finalization arenas are freed and allocated at the same time, so the last GC bytes becomes rather moot. On the other hand we only release chunks during the GC, so perhaps we should just focus on chunks for GC heuristics? I.e. the idea is to make gcBytes == the amount of allocated memory in chunks. 

Also, has the background finalization patch decreases by one the amount of GC cycles the free chunk lives before it returns to the system?
Comment 7 Igor Bukanov 2011-05-10 15:25:25 PDT
Gregor: could you move the removal of the trigger factor API to a separated bug? I will review that quickly - lets simplify the code first before any other changes.
Comment 8 Gregor Wagner [:gwagner] 2011-05-10 23:53:58 PDT
Created attachment 531550 [details] [diff] [review]
patch

update
Comment 9 Igor Bukanov 2011-05-11 14:25:06 PDT
The patch removes chunks from the pool when the browser is idle during the GC. But this would not release the chunks that the following background finalization can free. Perhaps we should disable that finalization when the GC is idle?

An alternative is not to pool chunks based on the number of Gc cycles the survive but rather pool fixed number of chunks and release them when idling or when hitting OOM.
Comment 10 Igor Bukanov 2011-05-11 14:30:28 PDT
(In reply to comment #9)
> The patch removes chunks from the pool when the browser is idle during the
> GC. But this would not release the chunks that the following background
> finalization can free. Perhaps we should disable that finalization when the
> GC is idle?

Or can we do the chunk expiration at the end of the background finalization, not at the end of the GC?
Comment 11 Gregor Wagner [:gwagner] 2011-05-11 14:51:00 PDT
Yes we can move the chunk expiration. The worst case now is that we perform 2 GCs so I don't see that as big problem.

I am still experimenting with this patch and the frequency of GC calls.
This patch always calls a full GC but I want to change this to compartment GCs.
So if we allocate additional arenas I call compartment GCs every 10 secs and a full GCs every minute.

The definition of "idle" is that there was no GC within the last N seconds and MaybeGC was called. So that doesn't mean that the browser is completely idle.

I also tried in bug 619822 to change the event queue to call MaybeGC during idle time but it turned out to be very complicated on all the different platforms.
Comment 12 Gregor Wagner [:gwagner] 2011-05-11 15:00:48 PDT
Created attachment 531763 [details] [diff] [review]
patch

update

Only call idle GCs if we have allocated an arena or if chunks are waiting to be expired.
The frequency right now is 10 sec for compartment GC and one minute for a full GC.
Comment 13 Nicholas Nethercote [:njn] 2011-06-09 17:56:40 PDT
*** Bug 661527 has been marked as a duplicate of this bug. ***
Comment 14 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-06-15 17:27:21 PDT
If it would help to know when the user is idle, we have nsIIdleService.
Comment 15 Nicholas Nethercote [:njn] 2011-06-15 17:28:25 PDT
I talked to Gregor about this, he says there are two things left to do here:

- Check if it makes sense to perform a compartment GC with the compartment of the JSContext calling MaybeGC.

- And test if there are any regression with this patch.

He also said he's busy with other stuff for now.
Comment 16 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-06-15 17:34:47 PDT
We really need this. Without this, it's very difficult for people to distinguish real leaks from late GC (especially if they don't know to click obsessively on the GC buttons in about:memory). Not to mention that it's probably a global performance win anyway to keep the heap down.
Comment 17 Olli Pettay [:smaug] 2011-06-15 17:45:13 PDT
(In reply to comment #14)
> If it would help to know when the user is idle, we have nsIIdleService.

We have also user-interaction-active/inactive notifications to indicate
if user is using the browser. That might be more useful that
system level idle time.

Also, especially because of mobile, don't fire some timer constantly.
If it is known that there isn't anything to collect, we should do nothing.
Comment 18 Olli Pettay [:smaug] 2011-06-15 17:49:54 PDT
(In reply to comment #17)
> Also, especially because of mobile, don't fire some timer constantly.
> If it is known that there isn't anything to collect, we should do nothing.
Bug 508518
	

Though, that timer was firing every 5s. Perhaps something which fires
less often is ok.
Comment 19 Gregor Wagner [:gwagner] 2011-06-17 14:45:52 PDT
This patch doesn't perform a GC every N seconds.
It only triggers a GC if we may shrink the heap (return unused chunks to the OS) or if we allocated new space after the last GC.
The patch also relies on some kind of JS activity in order to trigger a GC in MaybeGC.

Basically this patch triggers a GC in MabyeGC if there was no GC within the last N seconds and we are positive to collect some garbage.

The remaining question is if the compartment that we get out of the context that calls MaybeGC is a good target for a compartment-GC. Maybe it's better to just force a global GC there all the time or use some other logic? It works fine for my settings but I don't know if this is the right approach for 50+ tabs.

I am very busy with my university stuff right now. Maybe somebody should steal this bug if it's high priority.
Comment 20 Gregor Wagner [:gwagner] 2011-06-18 15:39:45 PDT
Created attachment 540273 [details] [diff] [review]
patch

Ok lets keep it small and simple.

This patch calls the GC every 20 seconds from MaybeGC if:
a) a new chunk was allocated after the previous GC or
b) chunks are waiting to be returned to the OS.

A normal GC always resets the 20 seconds timer.
More complexity can be added later if needed.

This should solve the problem of growing heaps caused by long periods without GCs.
Comment 21 Igor Bukanov 2011-06-19 03:16:32 PDT
Comment on attachment 540273 [details] [diff] [review]
patch

Review of attachment 540273 [details] [diff] [review]:
-----------------------------------------------------------------

r+ if it is sufficient to add comments to address the issues below.

::: js/src/jsgc.cpp
@@ +498,5 @@
>      for (GCChunkSet::Range r(rt->gcChunkSet.all()); !r.empty(); r.popFront()) {
>          chunk = r.front();
>          if (chunk->hasAvailableArenas()) {
>              cx->compartment->chunk = chunk;
> +            rt->gcChunkAllocationSinceLastGC = true;

The gcChunkAllocationSinceLastGC name contradicts the usage as the flag is set when the the arena allocation code picks a new chunk from the set of currently allocated ones. Is it intentional? 

If yes, we need a better name and comments here why the flag is set even for chunks that may have almost no free arenas.

@@ +537,5 @@
>      for (GCChunkSet::Enum e(rt->gcChunkSet); !e.empty(); e.popFront()) {
>          Chunk *chunk = e.front();
>          JS_ASSERT(chunk->info.runtime == rt);
>          if (chunk->unused()) {
> +            if (chunk->info.age++ > MaxAge || gckind == GC_SHRINK) {

Change the condition to:

gckind == GC_SHRINK || chunk->info.age++ > MaxAge

This way we avoid useless chunk->info.age++ for soon to be released chunks.

@@ +1094,5 @@
>  
>  void
>  JSRuntime::reduceGCTriggerBytes(uint32 amount) {
>      JS_ASSERT(amount > 0);
> +    JS_ASSERT((gcTriggerBytes - amount) >= 0);

Nit: remove () around gcTriggerBytes - amount

@@ +1106,4 @@
>  {
>      gcLastBytes = lastBytes;
> +
> +    size_t base = gckind == GC_SHRINK ? lastBytes : Max(lastBytes, GC_ARENA_ALLOCATION_TRIGGER);

Nit: Add () around gckind == GC_SHRINK. In general the style dictates () in front of ? for anything but simple names.

@@ +1113,5 @@
>  
>  void
>  JSCompartment::reduceGCTriggerBytes(uint32 amount) {
>      JS_ASSERT(amount > 0);
> +    JS_ASSERT((gcTriggerBytes - amount) >= 0);

Same here

@@ +1957,5 @@
> +        return;
> +    }
> +
> +    int64 now = PRMJ_Now();
> +    if (rt->gcNextFullGCTime && rt->gcNextFullGCTime <= now) {

gcNextFullGCTime is 64 bit and its assignments are not atomic on 32 bits. Add comments explaining why it is OK to use it here outside the locks.

::: js/src/jsgc.h
@@ +944,5 @@
>       * it is imperative that rt->gcPoke gets cleared early in js_GC.
>       */
> +    GC_LAST_CONTEXT     = 1,
> +
> +    /* Minimize GC triggers and return chunks right away. */

Nit: s/return chunks/release empty GC chunks
Comment 22 Gregor Wagner [:gwagner] 2011-06-20 09:29:45 PDT
(In reply to comment #21)
> Comment on attachment 540273 [details] [diff] [review] [review]
> patch
> 
> Review of attachment 540273 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> r+ if it is sufficient to add comments to address the issues below.
> 
> ::: js/src/jsgc.cpp
> @@ +498,5 @@
> >      for (GCChunkSet::Range r(rt->gcChunkSet.all()); !r.empty(); r.popFront()) {
> >          chunk = r.front();
> >          if (chunk->hasAvailableArenas()) {
> >              cx->compartment->chunk = chunk;
> > +            rt->gcChunkAllocationSinceLastGC = true;
> 
> The gcChunkAllocationSinceLastGC name contradicts the usage as the flag is
> set when the the arena allocation code picks a new chunk from the set of
> currently allocated ones. Is it intentional? 
> 
> If yes, we need a better name and comments here why the flag is set even for
> chunks that may have almost no free arenas.
> 

Right! This was my first approach where I didn't trigger a GC when empty chunks were around. They are covered now with gcChunksWaitingToExpire. I will remove the first assignment.
Comment 23 Gregor Wagner [:gwagner] 2011-06-20 09:33:49 PDT
Created attachment 540492 [details] [diff] [review]
patch
Comment 24 Igor Bukanov 2011-06-20 12:58:43 PDT
Comment on attachment 540492 [details] [diff] [review]
patch

Review of attachment 540492 [details] [diff] [review]:
-----------------------------------------------------------------

Now gcChunkAllocationSinceLastGC becomes self-descriptive :)

::: js/src/jsgc.cpp
@@ +1971,5 @@
> +    int64 now = PRMJ_Now();
> +    /* 
> +     * We accept the fact that setting gcNextFullGCTime is not atomic on 32bit
> +     * and a race condition could trigger a GC.
> +     */

Nit: style dictates to put a blank line before comments. It looks even better to move the comments before the "int64 now" line. Also in other places the word "tolerate" is used to describe acceptance of the races. So write this as:

/*
 * On 32 bit setting gcNextFullGCTime below is not atomic and a race condition could trigger an GC. We tolerate this. */
int64 now = PRMJ_Now();
Comment 25 Gregor Wagner [:gwagner] 2011-06-20 14:46:52 PDT
http://hg.mozilla.org/tracemonkey/rev/41d5782eabf2
Comment 26 Chris Leary [:cdleary] (not checking bugmail) 2011-06-20 17:16:20 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/41d5782eabf2
Comment 27 Nicholas Nethercote [:njn] 2011-06-21 02:41:32 PDT
(In reply to comment #20)
> 
> This patch calls the GC every 20 seconds from MaybeGC if:
> a) a new chunk was allocated after the previous GC or
> b) chunks are waiting to be returned to the OS.
> 
> A normal GC always resets the 20 seconds timer.
> More complexity can be added later if needed.

Just to clarify... is the following right?

- If the browser is truly idle (ie. nothing is being allocated from the GC heap) nothing will happen.

- If some GC heap allocations are occurring... actually, I'm not sure what the behaviour is.

Can you clarify the new heuristic, for people who aren't authors of jsgc.cpp?  Thanks :)
Comment 28 Nicholas Nethercote [:njn] 2011-06-21 02:44:16 PDT
BTW, here's some info from jst, hopefully this problem will be fixed by this change:

"I left firefox running here with a clean profile and one tab open with a single article from cbsnews.com loaded, and over the course of ~14h the RES mem usage grew from ~200 megs or so after loading the page to 640 megs. I noticed this because my wife's firefox twice now got up to 5+ gigs, and I finally traced it to be caused by cbsnews.com that she had forgotten open in a tab."  

about:memory was dominated by the JS heap.  Here's the abridged version:

550,738,724 B (100.0%) -- explicit
├──489,820,652 B (88.94%) -- js
│  ├──475,004,928 B (86.25%) -- gc-heap
Comment 29 Gregor Wagner [:gwagner] 2011-06-21 08:29:33 PDT
(In reply to comment #27)
> (In reply to comment #20)
> > 
> > This patch calls the GC every 20 seconds from MaybeGC if:
> > a) a new chunk was allocated after the previous GC or
> > b) chunks are waiting to be returned to the OS.
> > 
> > A normal GC always resets the 20 seconds timer.
> > More complexity can be added later if needed.
> 
> Just to clarify... is the following right?
> 
> - If the browser is truly idle (ie. nothing is being allocated from the GC
> heap) nothing will happen.

Yes. MaybeGC is usually called once a JS script finished executing. 

> 
> - If some GC heap allocations are occurring... actually, I'm not sure what
> the behaviour is.
>

Some basics again: 
We allocate 1MB chunk from the OS and split them up into 4k arenas. If a single arena contains reachable objects we can't release the whole chunk.

The problem was that our GC was very expensive and we tried to minimize the impact by weak GC triggers. So we allowed the heap to grow very fast without performing GCs. Most of the GC cost is gone because we don't do sweeping during the critical GC pause time any more and parallel marking should also speed up the remaining cost. The incremental GC is also a very important factor for the future. With this patch and also future patches we are trying to tighten our GC triggers and increase the frequency of GCs.

There are 2 things that this patch addresses:

1) We have this delayed scheme for releasing 1MB chunks to the OS because we don't want to free chunks after the GC and allocate them again right away. Releasing these chunks only happens after a GC.
Once a benchmark or an allocation heavy animation finishes we are usually in a state where we have many empty chunks allocated but no GC is triggered because we are waiting that allocation hits a GC trigger. 
With this patch now we also perform a GC if we have empty chunks allocated and don't use them within 20 seconds. So we return unused chunks to the OS after 20 seconds.

2) 
The second part deals with slow allocation rates or keeping the browser open over night.
After a GC we fill up already existing chunks/arenas. Once they are full and we allocate a new chunk from the OS we set a flag and after 20 seconds we perform a GC. So we trigger the GC right after a new chunk is allocated and don't wait any more until the heap increases by 3x.


> Can you clarify the new heuristic, for people who aren't authors of
> jsgc.cpp?  Thanks :)

Hope that helps :)
Comment 30 Nicholas Nethercote [:njn] 2011-06-21 20:39:02 PDT
I went through all the MemShrink:P2 bugs looking for cases where this GC trigger change will fix the problem.  I came up with this list:

- Confirmed fixed: bug 654028.

- Very likely to be fixed, not yet confirmed: bug 636077, bug 638238, bug 664128, bug 666101.

- Reasonable chance of being fixed: bug 666104, 666105.

- Slight chance of being fixed: bug 639780, bug 653817, bug 665628.

Not bad!  Thanks again, Gregor, for finding time to fix this :)

I'm nominating this for tracking-firefox6 because it has the potential to fix problems that a *lot* of users are seeing;  it'd be great to fix this in FF6 rather than FF7.  The risk seems low-to-moderate... not a lot of code was changed, but Gregor might have more to say about that.
Comment 31 Gregor Wagner [:gwagner] 2011-06-21 21:51:54 PDT
Wow that's an impressive list. Thanks for tracking it!

This patch might have a much bigger impact for regular surf behavior than I expected. 
I was running a nightly build the whole day and noticed a much smaller memory footprint. I don't have any scientific measurements but usually the browser needed about 400-500MB after a few hours for my settings.  Today I noticed a reduction to about 200-300MB. It would be great if we could measure this impact somehow.

My explanation is that reusing memory instead of allocating new one makes a lot of sense for regular surf behavior. So this patch greatly improves the fragmentation-problem over time without influencing the benchmark scores in a bad way.

As for FF6: The risk of the new code itself should be low and if we can confirm the improvements we should take it.

It would be good if we could get feedback from users with 200+ tabs. I know they exist :)
They either see an improvement because the overall memory consumption decreases or they might complain about a GC-pause every 20 seconds.
Comment 32 :Ehsan Akhgari 2011-06-22 11:15:31 PDT
While the improvements here are great, and I'm dying to get them in the hands of our users as soon as possible, I really don't think that this is Firefox 6 material.  I understand the urge to want to backport it on Aurora, but I would resist is, since officially Aurora is a place for stabilizing features, and introducing changes like this will only increase the risk factor and also make the next Aurora migration more painful.

Thanks for the awesome work, Gregor!  :-)
Comment 33 LummoxJR 2011-06-23 08:39:56 PDT
Letting this wait for FF7 would be a huge mistake, if the improvement is indeed significant and the GC doesn't turn out to cause a problem. For any users still seeing major memory issues in FF5 (the jury still seems to be out), even the wait for FF6 will be too long. While I'm not sure I see the wisdom in waiting for FF6 at all if the benefits of this patch can be confirmed, I see none in waiting for FF7.

I don't think you fully appreciate, Ehsan, how many users are still hesitant to jump ship from 3.6 after 4.0 made the memory situation worse for many of them--there are a lot of people still waiting to find out if 5.0 is worth it. If it isn't, those people shouldn't be asked to wait another three months, let alone six. I'm a little hopeful myself but count me among the cautious. Memory improvements should be considered near-critical at this stage, even sneaked out with security updates if possible.

Just my two cents, but if stability is your concern I can think of few better improvements to it than to give the system back more memory. Indeed high memory use from people not using this patch is only going to obscure other memory bugs that need to be fixed, since it seems to impact so many issues. By all means let's be sure this fix is as good as it seems, but let's not wait on it once we know.
Comment 34 Boris Zbarsky [:bz] (still a bit busy) 2011-06-23 09:13:44 PDT
> and the GC doesn't turn out to cause a problem.

That's the rub, no?

> While I'm not sure I see the wisdom in waiting for FF6

Are you suggesting this should be shipped as an unscheduled security update, then?

> those people shouldn't be asked to wait another three months, let alone six.

FF7 is shipping in less than 3 months.  FF6 is shipping in 8 weeks.  No one is talking about six months.

> even sneaked out with security updates if possible.

FF6 _is_ the security update for FF5.

> but let's not wait on it once we know.

Again, that's the rub.  To get this into FF6 we need to "know" within the next several weeks at the most.
Comment 35 LummoxJR 2011-06-23 13:03:01 PDT
(In reply to comment #34)
> Are you suggesting this should be shipped as an unscheduled security update,
> then?

I'm suggesting if the fix is confirmed solid, and a critical security update comes up, it's worth considering bundling it in. The memory problem has been so bad for so long that a fix of this significance should be seriously considered for release as early as absolutely possible. Waiting till August would be a disappointment; waiting longer would be worse.

I was misinformed on the release schedule; while I'm glad to know FF6 is out in eight weeks, that's still two months and that's a long time for users with heavy memory problems to wait. So far what I'm seeing on Twitter suggests FF5 improves memory for some, but others have seen no improvement over FF4. I'm glad the devs are now taking memory bloat more seriously than in the past (not to say it was ever unserious), but it needs to be treated with urgency as well.
Comment 36 Jesse Ruderman 2011-06-23 13:50:53 PDT
I guess the risks are increased power use and less-smooth animations.  Those can take a while to shake out.
Comment 37 Alon Zakai (:azakai) 2011-06-23 14:19:12 PDT
This patch can potentially help us with a lot of memory issues. But it's limited too. For example in a single tab, the biggest amount of unreclaimed memory will be 70MB (at which point we *do* finally do a GC even without this patch).

It is not clear to me if the 70MB is a limit for all tabs or each one. If it is all tabs, then I think the motivation to take risks in order to get this out earlier is lessened (since it's just 70MB altogether. that's not great obviously, but I doubt it's worth stability risks). If it is per tab, I'm not sure anymore.
Comment 38 Asa Dotzler [:asa] 2011-06-23 14:30:33 PDT
not going to track this and I think it's probably too late for 6 but we can have that discussion if someone nominates the patch.
Comment 39 Nicholas Nethercote [:njn] 2011-06-23 17:34:42 PDT
(In reply to comment #37)
> This patch can potentially help us with a lot of memory issues. But it's
> limited too. For example in a single tab, the biggest amount of unreclaimed
> memory will be 70MB (at which point we *do* finally do a GC even without
> this patch).

I don't think that's true.  Look at the bugs in comment 30.  We saw some much bigger reductions than that.
Comment 40 Nicholas Nethercote [:njn] 2011-06-23 17:38:04 PDT
Comment on attachment 540492 [details] [diff] [review]
patch

Nominating the patch for approval-mozilla-aurora.  My argument:  this is a bad (performance) bug that was introduced in FF 4.0.0.  And we're allowed to fix old, bad bugs on Aurora, surely?  So let's fix it.

We've had arguments from both sides now, so let's leave the decision up to the release drivers.
Comment 41 Alon Zakai (:azakai) 2011-06-23 17:47:22 PDT
(In reply to comment #39)
> (In reply to comment #37)
> > This patch can potentially help us with a lot of memory issues. But it's
> > limited too. For example in a single tab, the biggest amount of unreclaimed
> > memory will be 70MB (at which point we *do* finally do a GC even without
> > this patch).
> 
> I don't think that's true.  Look at the bugs in comment 30.  We saw some
> much bigger reductions than that.

That could be true. I was basing my statement on Slashdot and on what I saw when I reduced it to a smaller testcase. I thought I ended up with something pretty generic but perhaps there are other use patterns that end up differently.
Comment 42 Gregor Wagner [:gwagner] 2011-06-23 17:47:54 PDT
(In reply to comment #36)
> I guess the risks are increased power use and less-smooth animations.  Those
> can take a while to shake out.

For animations like FOTN the cycle-collector currently triggers a global GC every 4 seconds. So this patch doesn't influence that with the 20 seconds waiting time.
Using less memory should also decrease the overall power consumption.


Our memory footprint is a big problem and I filed this bug because FF was/is very painful to use or even unusable (see bug 655455) on my new lowest-end netbook. This patch tries to keep the memory footprint small and therefore is a big win on such devices. Finally I can use FF on my new machine.
Comment 43 Robert Kaiser 2011-06-23 18:22:38 PDT
Also note that even if it was 70 MB, that's a real lot on mobile devices.
Comment 44 Nicholas Nethercote [:njn] 2011-06-23 23:49:05 PDT
(In reply to comment #40)

> So let's fix it.

dmandelin encouraged me to be more specific about the risks and benefits here.

Benefits:

- It fixes a widespread problem.  Confirmed to fix bug 654028, bug 636077, bug 638238, bug 664128, bug 666101, bug 666104.  Bug 661527 was already marked as a dup of this bug.  And see Gregor's netbook experience in comment 42;  presumably mobile Firefox will benefit even more.

- Memory reductions are significant.  In most of the bugs it's in the order of several tens of MB up to ~100MB, though the measurements reported are generally pretty inconsistent so it's hard to be sure.  

  Bug 664128 has better measurements;  without this patch the user saw the "private bytes" on windows go up to 510MB when idle on gamespot.com for 4 hours (http://img221.imageshack.us/img221/2010/firefoxmemoryusage4hourb.png).  With the patch the private bytes never went above 140MB (http://img24.imageshack.us/img24/9255/firefoxgamespotcom9h.png).  That's a factor of 3.6x reduction.

  Bug 654028 also had good measurements, showing we save 70MB. 

- We have tentative evidence that it reduces fragmentation in the JS heap.  See bug 666101 comment 5 and subsequent comments -- after idling for a while on sina.cn.com and then triggering GC, the end GC heap size was 26MB with the patch, and 32MB without.

- Basically, it fixes the "I left Firefox running overnight and now my machine has locked up" problem.  Which is a big problem.


Risks:

- Could affect some benchmarks, but we know (AFAIK) it doesn't affect Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an expert on GC and benchmarks.

- Could cause mobile uses to use more power or wake up.  But the extra GCs aren't triggered when the browser is truly idle;  JS allocations have to be already happening.  (Gregor, correct me if I'm wrong about that.)

- Could make animations jerky, if we collect in the middle of one.  Maybe... not sure.  That can already happen.

Note that these are all "coulds".
Comment 45 Nicholas Nethercote [:njn] 2011-06-23 23:56:59 PDT
Stop the presses: bug 666845 was just filed, and looks almost certain to be fixed by the change.
Comment 46 Gregor Wagner [:gwagner] 2011-06-24 01:14:48 PDT
(In reply to comment #44)
> 
> Risks:
> 
> - Could affect some benchmarks, but we know (AFAIK) it doesn't affect
> Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an
> expert on GC and benchmarks.
> 

The benchmark would have to run JS for 20sec without allocations. I don't know of any benchmarks that do that.

> - Could cause mobile uses to use more power or wake up.  But the extra GCs
> aren't triggered when the browser is truly idle;  JS allocations have to be
> already happening.  (Gregor, correct me if I'm wrong about that.)
> 

We don't wake up the browser for that. The GC usually happens after some JS execution. 

> - Could make animations jerky, if we collect in the middle of one.  Maybe...
> not sure.  That can already happen.
> 

Animations usually have a high allocation rate and other triggers take care of of the GC. Nothing changed there. The new trigger makes sure that after an animation we shrink the heap again.
Comment 47 chris hofmann 2011-06-24 13:56:22 PDT
Nicholas/Gregor,

Under the risk section is there also an increased possibility of crashes, and *reduced* stability?

We seen evidence of GC on the stack in lots of crash reports, so the theory here is would be that if GC is running more frequently, the risk of stomping on memory in these kinds of GC related crashes could go up.   

I don't thing we have good metrics or test cases that demonstrate how often GC is involved in crashes so this might need to be one of the things that we need to be watching for in evaluation when this hits mozilla-central and aurora now, or later.
Comment 48 Nicholas Nethercote [:njn] 2011-06-24 14:28:27 PDT
(In reply to comment #47)
> Nicholas/Gregor,
> 
> Under the risk section is there also an increased possibility of crashes,
> and *reduced* stability?
> 
> We seen evidence of GC on the stack in lots of crash reports, so the theory
> here is would be that if GC is running more frequently, the risk of stomping
> on memory in these kinds of GC related crashes could go up.   
> 
> I don't thing we have good metrics or test cases that demonstrate how often
> GC is involved in crashes so this might need to be one of the things that we
> need to be watching for in evaluation when this hits mozilla-central and
> aurora now, or later.

Well, I guess it's a possibility.  But the idea "GC sometimes causes crashes, so we better not run it" makes me want to cry.  It's not like GC is some optional thing, it's utterly integral to JavaScript execution.

Also, if some memory has been corrupted that causes GC to crash, it'll presumably happen whenever the next GC happens, whether that's right now or a few minutes from now.

Anyway, enough speculation... this landed on Nightly on June 21 -- can we look at the crash stats since then and see if GC-related crashes have increased?
Comment 49 Nicholas Nethercote [:njn] 2011-06-24 14:34:52 PDT
More data: see bug 657115 comment 26 and subsequent comments.  The ROME demo saw slightly reduced memory usage, maxing out at 2.0GB instead of 2.2GB.  azakai said there are "more" GCs, some of which are quite noticeable.  Gregor said the number of GCs performed for him increased from 15 to 18.  Note that the demo already had some pauses, and Google Chrome also has significant pauses.

This is an *extremely* memory-intensive demo, however, and doesn't represent typical web usage, so I'd put a low weight on it.  Other GC changes (parallel marking, incremental GC, generational GC) will be needed to make this demo behave better.
Comment 50 Nicholas Nethercote [:njn] 2011-06-24 14:44:36 PDT
(In reply to comment #45)
> Stop the presses: bug 666845 was just filed, and looks almost certain to be
> fixed by the change.

More evidence for the prosecution, m'lud:  the fix has been confirmed by the reporter.  The example script in the bug report previously caused the reporter's machine to crash due to OOM, now the "mem used" (reported by Win XP) peaks at 154MB.  OOM vs 154MB!  That's a big difference.
Comment 51 Andrew McCreight [:mccr8] 2011-06-24 14:48:34 PDT
To be fair, they were running it on a VM with only 256MB of memory, half of the recommended minimum for FF4.
Comment 52 Jesse Ruderman 2011-06-24 14:55:06 PDT
This could reduce GC-related crashes by:

* avoiding the extra complexity of allocation-triggered GCs (stack scanning, reliance on temporary rooting, leaving trace)

* avoiding running out of VM

* improving the quality of crash reports and bug reports when we have heap-corruption bugs

I would not expect this change to increase crashiness. If the heap is in a state where GCing will crash, and script isn't in the middle of running, it's unlikely to leave that state just by running more script before GCing.
Comment 53 Robert Kaiser 2011-06-24 15:05:55 PDT
(In reply to comment #48)
> Anyway, enough speculation... this landed on Nightly on June 21 -- can we
> look at the crash stats since then and see if GC-related crashes have
> increased?

We'll closely watch this, but we probably won't have suitable data until this is on Beta (as both Nightly and Aurora tend to have too few users for decent crash data) so I'm (from a crash analysis POV) all for getting this into Aurora before the next uplift - if nothing bad happens on Nightly, that is. ;-)
Comment 54 chris hofmann 2011-06-24 15:33:18 PDT
(In reply to comment #44)
> 
> 
> Risks:
> 
> - Could affect some benchmarks, but we know (AFAIK) it doesn't affect
> Sunspider, V8bench, FOTN.  Gregor, any others you know about?  I'm not an
> expert on GC and benchmarks.
> 
> - Could cause mobile uses to use more power or wake up.  But the extra GCs
> aren't triggered when the browser is truly idle;  JS allocations have to be
> already happening.  (Gregor, correct me if I'm wrong about that.)
> 
> - Could make animations jerky, if we collect in the middle of one.  Maybe...
> not sure.  That can already happen.
>

Is another possible area to test beyond benchmarks and animations "web 2.0 apps".

The impact seems low or non-existent on small/fast loading pages.

Maybe moderate on larger/more complex pages that take a while to load, since the window of loading is increased and gc has a bigger opportunity to kick in at some point.

Could the risk be even higher on gmail/web mail/other apps that are "running all the time"?   The theory here is that at some point (and now more frequently) GC kicks and starve interactions and updates on the page.

> Note that these are all "coulds".

right, the hard part is getting this out in front of a lot of users, then watching for vague hints of problems that come back in feedback reports or bugs, and then having them come back to be investigated as part of this bug, or not.   that's the work head for landing in either firefox 6 or 7.
Comment 55 Nicholas Nethercote [:njn] 2011-06-24 15:46:10 PDT
> Could the risk be even higher on gmail/web mail/other apps that are "running
> all the time"?   The theory here is that at some point (and now more
> frequently) GC kicks and starve interactions and updates on the page.

Apps that are running all the time are exactly the ones that will *benefit* most from this patch.  We might see some GC pauses, but we'll see smaller pauses more frequently, instead of the occasional big one.  (And note that these small pauses happen when the browser is mostly idle, ie. when the user is unlikely to notice.)
And we'll avoid paging, which is what really kills performance.

I think the analysis here is pretty much complete.  We have a large number of clear wins, and no clear losses.  Each suggested risks has had some kind of counter-argument saying why it's unlikely to be a problem.  I agree with KaiRo;  the only way to get this into FF6 is to put it in Aurora now, and watch the betas closely.  We can back it out of beta if we see ill effects.  A release driver needs to say "yes" or "no".
Comment 56 Randell Jesup [:jesup] 2011-06-27 05:26:42 PDT
For whatever it's worth (as a LONG-time ago driver, and as a ridiculous-number-of-tabs user), I'm strongly in favor of taking this for FF6.  The number one complaint I get from friends about FF4 (and I don't see it changing in FF5) is "my browser ate all my memory just sitting around!"  And this is *horrid* behavior for a mobile (fennec) or near-mobile (netbooks) client.

Risk of crashes seems extremely low; risk to performance seems very low, and in some cases will improve performance by avoiding a big GC when you use it after sitting idle a while, and avoid possible paging at that time - look at the netbook bug.
Comment 57 Gregor Wagner [:gwagner] 2011-06-27 15:23:47 PDT
I guess I should make one very important advantage of this patch more explicit.

Currently most of our GCs happen during allocation-heavy situations. If you enter GMail you are very likely to hit a GC trigger because GMail does a lot of allocation.
This happens at the worst time: right during the page-load.

Much better would be to perform a GC before we reach an allocation heavy situation and make as much space as possible in order to avoid a GC during the critical page load.
That's exactly what this patch does. It make sure that we are always ready for an allocation heavy situation.
Waiting with an garbage-collected heap is *always* better than waiting with an almost full heap.
Comment 58 LummoxJR 2011-06-28 08:47:51 PDT
This is probably a stupid question, but as I'm not familiar with the process--what does it take for a release driver to take notice of this issue? It was nominated days ago but there's no answer on it. Is that common, and if so does it mean this issue could potentially be ignored until inclusion in FF6 is moot, or does it usually get decided right before a beta?
Comment 59 Boris Zbarsky [:bz] (still a bit busy) 2011-06-28 09:10:21 PDT
The issue was nominated after 5pm Pacific on Thursday last week.

Triage for Aurora nominations happens Tuesday/Thursday 2pm-3pm Pacific and Wednesday 11:30am-noon Pacific.  See https://wiki.mozilla.org/Firefox/Aurora

So no answer is expected so far. Give it about 6 hours. ;)
Comment 60 Asa Dotzler [:asa] 2011-06-28 15:15:04 PDT
Comment on attachment 540492 [details] [diff] [review]
patch

After a wonderful debate, the nays have it. This should have a full Aurora cycle.
Comment 61 Gregor Wagner [:gwagner] 2011-06-28 18:23:50 PDT
*** Bug 619822 has been marked as a duplicate of this bug. ***
Comment 62 LummoxJR 2011-06-28 21:33:30 PDT
Nay as in it's not going into FF6? If so I'm curious of the logic behind that decision.
Comment 63 Robert Kaiser 2011-06-29 06:36:02 PDT
(In reply to comment #62)
> Nay as in it's not going into FF6? If so I'm curious of the logic behind
> that decision.

Please take that discussion elsewhere, it doesn't belong in the bug report.


Drivers, with that decision, just be clear that we probably will not be able to tell if this has any impact on crashes of any kind before the end of August at the earliest, when Beta builds with it have been out for some time.
Comment 64 Matthew Kidd 2011-06-29 11:58:39 PDT
At the request of Nicholas Nethercote in Comment #11 of Bug 652801 I tried the nightly Firefox 7.0a1 (2011-06-28) with the same large set of windows and tabs that was causing the problem I reported in Comments 6, 7, and 8 of Bug 652801. Memory use stayed constant overnight. Seems like the problem has been fixed.
Comment 65 Nicholas Nethercote [:njn] 2011-06-29 16:27:30 PDT
(In reply to comment #63)
> > Nay as in it's not going into FF6? If so I'm curious of the logic behind
> > that decision.
> 
> Please take that discussion elsewhere, it doesn't belong in the bug report.

It's not an unreasonable question.  The nomination for back-porting to Aurora happened in this bug, the discussion of the benefits and risks has been happening in this bug, so an explanation (however brief) of why it was rejected would be perfectly suitable here.

Ultimately, I'm sure the reasoning can be boiled down to "the risks are too high, this needs more time in the Aurora channel to make sure there are no problems".  But the phrase "wonderful debate" has piqued my interest -- was that in reference to the comments in this bug, or the discussion in the meeting?  I'd be interested to hear, for example, if any of the mentioned risks were perceived to be more significant than others.
Comment 66 Asa Dotzler [:asa] 2011-06-29 16:37:35 PDT
Sorry I didn't take better notes during triage so I can't really relay the specifics of the debate. I'll just say that we have a diverse group of drivers with varying comfort levels for risk. I'm one of the bravest (most foolish?) and I love saying yes. Others are somewhat more conservative and lean towards using Aurora and Beta for their defined purpose -- which should be almost exclusively subtractive change (back our way out to known good state.) 

You are correct that the final call was "too much risk without enough confidence that we could identify any major problems before release." 

We're not as consistent, even within the core group of drivers, as we ought to be and these decisions will probably never be purely mechanical, but we're getting better and over time I expect the whole community will come to a more shared understanding of what kind of risk for what kind of reward we're willing to accept in Aurora and Beta. Until then, bear with us all and our learning curve.
Comment 67 Asa Dotzler [:asa] 2011-06-29 16:45:14 PDT
(In reply to comment #66)

Oh, and I forgot one very important thing. You all are invited to these meetings too. We do not triage Aurora and Beta nominations and requests in secret and we'd love more attendance -- especially from the feature developers and advocates who might be able to give us better information about risk and reward that we can get from what's in the bug. Meetings happen three times a week and if you're joining us for a particular bug, let us know and we can try to get to that one up front so you can participate without having to sit through the whole grueling session.

Aurora triage sessions happen on Tuesday and Thursday at 2PM Pacific https://wiki.mozilla.org/Firefox/Aurora  Beta triage happens Mondays at 2PM Pacific https://wiki.mozilla.org/Firefox/Beta We also meet Wednesday after the Product Planning meeting which happens at 11am Pacific and that session varies between Beta and Aurora depending on which needs us the most.
Comment 68 Nicholas Nethercote [:njn] 2011-06-29 16:49:59 PDT
Asa, thanks for the thoughtful response!

(In reply to comment #66)
> Others are somewhat more conservative and
> lean towards using Aurora and Beta for their defined purpose -- which should
> be almost exclusively subtractive change (back our way out to known good
> state.) 

I had assumed that fixing pre-existing bugs would also be considered legitimate.  Eg. if we found a bad security bug that had been introduced in FF4, presumably we'd consider backporting it to Aurora and Beta?

I thought the same logic could apply here, since this bad GC behaviour was introduced in FF4.  (Note I'm not trying to change the decision, I'm just trying to understand what's considered acceptable for Aurora/Beta.)  Obviously, as we go through the rapid release cycle a few times the number of these "existed since FF4 or earlier" bugs will diminish.
Comment 69 Asa Dotzler [:asa] 2011-06-29 18:47:36 PDT
(In reply to comment #68)
> Asa, thanks for the thoughtful response!
> 
> (In reply to comment #66)
> > Others are somewhat more conservative and
> > lean towards using Aurora and Beta for their defined purpose -- which should
> > be almost exclusively subtractive change (back our way out to known good
> > state.) 
> 
> I had assumed that fixing pre-existing bugs would also be considered
> legitimate.  Eg. if we found a bad security bug that had been introduced in
> FF4, presumably we'd consider backporting it to Aurora and Beta?

Yes. We certainly take bugfixes on Aurora and Beta but we also ask ourselves whether backing out the thing that cause the regression is a better path. And our risk aversion for taking additive fixes rather than subtractive goes up the further we are into the cycle. You'll notes, for example, that we approve larger and scarier things at the beginning of Aurora than at the end or than in Beta -- unless it's absolutely critical like for security or a intolerable topcrash or something. 

> I thought the same logic could apply here, since this bad GC behaviour was
> introduced in FF4.  (Note I'm not trying to change the decision, I'm just
> trying to understand what's considered acceptable for Aurora/Beta.) 
> Obviously, as we go through the rapid release cycle a few times the number
> of these "existed since FF4 or earlier" bugs will diminish.

Yeah. That's the case I made in favor of taking it. I asserted that this could even fix OOM crashes (though I may be wrong about that -- it could have been other fixes on the trunk alleviating some of the OOM crashes my friends and family were seeing) and the lowered memory usage (getting us back closer to 3.6 era usage) was a very important regression fix. Had this been the first week of Aurora, I might have succeeded in convincing the group. Had this been the only scary change we were evaluating for late in the game or had we not already sort of exhausted our tolerance for risk with already approved fixes then things might have been different.

So a fair bit of context and then straight up arguing plays a role here. That means it's not perfectly predictable or even totally consistent. But as I said above, I think it works pretty well not being a purely mechanical system and we just have to change the expectations we all have as we change to a different release cadence and process. Once we've done a few of these things, it will probably fall into a much more comfortable and consistent rhythm.
Comment 70 Nicholas Nethercote [:njn] 2011-06-29 18:53:59 PDT
> Had this been the only scary change we were evaluating for late in
> the game or had we not already sort of exhausted our tolerance for risk with
> already approved fixes then things might have been different.

Thanks again for the details.  Not everyone can listen in on the meetings (eg. I'm in an awkward timezone) so putting the information here is greatly appreciated.
Comment 71 Gregor Wagner [:gwagner] 2011-06-29 21:19:01 PDT
Future GC changes will increase the GC frequency dramatically. Incremental GC will perform 10x, 100x or even N00x more GCs. So we definitely have to rely on our GC.
Data about the effect of (slowly) increasing the frequency will help Bill.
Comment 72 :Ehsan Akhgari 2011-06-30 07:52:04 PDT
(In reply to comment #68)
> I had assumed that fixing pre-existing bugs would also be considered
> legitimate.  Eg. if we found a bad security bug that had been introduced in
> FF4, presumably we'd consider backporting it to Aurora and Beta?

See http://mozilla.github.com/process-releases/draft/development_specifics/, to quote from that document:

<quote>
What happens where?

  mozilla-aurora

    Preffing off and backing out fixes/features which the central channel exposes as problematic with a wider audience.
    Remaining localizations
    Landing spot fixes to get the product in a shipping state
    Merge from mozilla-shadow (see below) and fixes/backouts for any resulting issues
    Preffing off and backing out fixes/features which were deemed ready but testing determined they in fact were not

</quote>

According to our policy, we are not supposed to fix bugs which have existed in the previous releases on Aurora.  We need to move away from our old mindset of trying to squeeze in bug fixes in a release out of the fear of the next release happening in an unknown point in the future, and try to implement the policy that we agreed to when we were first designing the rapid release mode.

Unfortunately, we've been too welcoming to Aurora changes.  This may cause us problems (see bug 655703) or make the future migrations more painful than necessary.  Perhaps I need to start attending the triage sessions and start saying No to almost everything.  No is a good word, we should use it more often.  ;-)
Comment 73 Asa Dotzler [:asa] 2011-06-30 08:08:29 PDT
(In reply to comment #72)
 
> According to our policy, we are not supposed to fix bugs which have existed
> in the previous releases on Aurora.  We need to move away from our old
> mindset of trying to squeeze in bug fixes in a release out of the fear of
> the next release happening in an unknown point in the future, and try to
> implement the policy that we agreed to when we were first designing the
> rapid release mode

I agree that we need to move away from our old mindset of trying to squeeze bug fixes into a release like that but I don't agree that this document is "policy". It's a guideline that's designed to help engineers and drivers do the right thing. Also, we intentionally gave ourselves room to land additive fixes on Aurora so that we could avoid backing out very nearly complete features if the remaining work was well understood, well contained, and low risk. I think we should add to that that we're very much interested in evaluating fixes for major crashes, hangs, or memory situations -- especially where they're regressions from the previous release. That absolutely should not be the primary purpose of Aurora, but I think that we'd be foolish to not take something like a null check for a topcrasher regression on Aurora because we'd put in place a policy that said we don't fix regressions in Aurora.

That being said, I expect that over time our engineers, project managers, and teams like crashkill will put their major focus on fixing these problems on trunk and not letting them make it up to Aurora. Right now I don't think that's enough of the case so the "eye of sauron" the drivers can direct towards eggregious regressions when they make it up to Aurora and Beta and the freedom to take well-contained and low risk regression fixes, plays an important role in keeping Firefox from degrading from release to release.

Gregor and njn, sorry to have taken over your bug with all of this project management conversation. I think it's valuable and I will try to capture what's been said here for a blog post or a .planning post or something.  We're still learning and I would love for this conversation to continue in a way that doesn't spam this bug's cc list so bad and that will likely be seen by a broader audience. 

Ehsan, thank you for the comment. I think you're absolutely right about all of us needing to learn to use the new model well. We do all have to change how we think (and that includes the release drivers who are grappling with all the same issues during this transition.)
Comment 74 Matthew Kidd 2011-06-30 12:04:19 PDT
I'm not a software project manager per se but I've been in the position of trying to convince Java developers that the user perception of a Java exception with its scary looking stack trace will always be "the software is broken". Don't allow a "harmless" stack trace to appear when the condition is an expected outcome that can be explained as such, e.g. "No books have the title specified."

Users have the same perception of crashes: the software is broken. This is even more true if the crash happens when "I wasn't even doing anything".

Taking better advantage of graphic cards is progress. But a program that doesn't crash, responds quickly, uses memory efficiently, and scales well is priceless. Most users don't know exactly how to measure these things but they have a gut feeling and that feeling translates into market share.

The soon we can get this fix into the release, the better.
Comment 75 LummoxJR 2011-06-30 12:45:31 PDT
I understand the need for everyone to adjust to the new model and that there's great value in sticking to policy. However as an end user, this decision feeds into a common perception that memory issues are not being treated with the gravity they deserve.

Whatever the policy issues, they're a weak argument against a bug fix that has such an extremely low risk/reward ratio AND fixes a regression of great importance. This isn't a minor or moderate fix like so many others, where denying a backport is more sensible. No may be a dandy word, but the end result is that Joe User won't see this fix for three more months. It beats years, but Joe has already been waiting a few of those too. Joe also doesn't understand why this patch can't make it into 5.0.1, even though the arguments against that are much stronger. Many of Joe's friends are still using 3.6 or even 3.5 because of the regression. Bottom line: when FF6 comes out, the public will still see it as broken.
Comment 76 Boris Zbarsky [:bz] (still a bit busy) 2011-06-30 12:58:06 PDT
> that has such an extremely low risk/reward ratio

See, that ratio is what people are disagreeing on...

Just so we're all clear on this, by the way, we're talking a difference between August 16 (Firefox 6) and September 27 (Firefox 7) for when this gets released.
Comment 77 Robert Kaiser 2011-06-30 13:02:15 PDT
(In reply to comment #75)
> No may be a dandy word, but the end
> result is that Joe User won't see this fix for three more months.

Six weeks, not three months.

And I don't think you as someone not in the developer or release management teams can reasonably assess this to be "an extremely low risk/reward ratio". We actually don't know the real risk and not the full extent of rewards either.

(In reply to comment #74)
> Users have the same perception of crashes: the software is broken. This is
> even more true if the crash happens when "I wasn't even doing anything".

There's no clear knowledge yet, only speculation, if this patch even has any influence at all on crashes. This is a memory cleanup patch, not a crash-fixing patch.

And Ehsan, your comment sounds very good, but omits that the channels aren't yet what we planned them to be. We set them up to have a million users on Aurora and ten million on Beta - right now, we have 50-60k on Aurora and 1-2 million on Beta. We need to work on that, and I know this is being done, but until we get nearer to what we set out to have, we e.g. can't see a lot of useful crash analysis on Aurora, and we need to deal with that.
Comment 78 Matthew Kidd 2011-06-30 13:29:37 PDT
> This is a memory cleanup patch, not a crash-fixing patch.

True. Until you hit the 2 GB process limit on a 32-bit platform. In my case I was seeing memory leak at ~100 MB / hr. By the next morning Firefox would crash. Users don't understand GC; they understand coming back to their computer the next day and seeing that Firefox has crashed. Or they wonder why it takes so long to unlock their screen, never understanding that 500M of memory is being swapped before their desktop manager and related programs can fully run. How many computers are thrown out every year because they are "slow" and the user has no idea how to fix the problem? We can't fix Microsoft but let's not add to the problem.

If Firefox 7 is really scheduled for September 27th, okay. But which annoying option do I choose in the meantime: use the nightly and live without all my extensions, hack all my .xpi files to tell them they are compatible Firefox 7, or go back to Firefox 4/5 and do a binary search to figure out which tab(s) are leaking?
Comment 79 Asa Dotzler [:asa] 2011-06-30 13:31:22 PDT
(In reply to comment #78)

> If Firefox 7 is really scheduled for September 27th, okay. But which
> annoying option do I choose in the meantime

Use the nightly builds, get the Aurora build next week ,or wait for Beta or release. Those are your options.
Comment 80 Andrew McCreight [:mccr8] 2011-06-30 13:32:34 PDT
This is pretty off topic, but the Addon Compatibility Reporter basically does the same thing as editing your .xpi files with the hassle. https://addons.mozilla.org/en-US/firefox/addon/add-on-compatibility-reporter/
Comment 81 Jesse Ruderman 2011-06-30 14:09:10 PDT
I'm confused why this patch fixes severe "left Firefox open overnight" problems. Shouldn't our other GC heuristics, based on the amount of memory allocated, have caused Firefox to run GC often enough? Maybe we should backport bug 660734 and bug 654399 instead?
Comment 82 Boris Zbarsky [:bz] (still a bit busy) 2011-06-30 14:11:27 PDT
Jesse, the memory allocation heuristics are based on the amount of memory allocated on the JS heap.

If there's a significant non-JS-heap component to the allocations but that non-JS-heap memory is kept alive by JS objects (e.g. DOM nodes, canvas backing stores, etc, etc) you could have pretty noticeable memory growth without the JS engine deciding to GC, I suspect.
Comment 83 Nicholas Nethercote [:njn] 2011-06-30 16:30:46 PDT
Jesse:  also, the allocation-based triggers are really weak.  It's something like "don't GC until the heap has grown to 3x it's previous size".  AIUI, it was tuned for FF4 mostly to avoid GC during Sunspider and V8bench, because GC'ing during them kills performance.
Comment 84 Nicholas Nethercote [:njn] 2011-06-30 16:33:09 PDT
(In reply to comment #73)
>
> Gregor and njn, sorry to have taken over your bug with all of this project
> management conversation.

No need to apologize!  This discussion has been helpful, the context is much clearer than it was after comment 60 (and comment 63).
Comment 85 Randell Jesup [:jesup] 2011-07-01 12:33:16 PDT
Agreed, the commentary here has been helpful.  And I understand ehsan's feeling we should be saying "No" more often, having been a driver in a previous life (and probably having have said "Yes" too much).  

Obviously we're all still coming to grips with the new release roadmaps, and it will take a while, and adjustments may well be needed not just in what we expect to get in, but also some adjustments in the process will probably be needed as we get experience with the impacts and inputs - see for example the discussion of Aurora and Beta uptake rates.  Probably we should have some semi-formal postmortem on how the process is working after each release, at least for a while.

I weighed in on the yes side because of the perceived risk/reward ratio, and because I felt this was a major regression that we should have fixed in FF5 (if we had been in the Aurora stage for FF5, would we have taken this fix as it was a regression we'd released in FF4?  Some comments above have implied that we might have.)

One thing that should always weigh into our decisions (though not rule them!) is what will the *perception* by end-users of taking or not taking a change.  This can affect long-term perception of the quality of FF and of uptake/recommendation rates.  It's a weighting factor.  This bug gets a high weight-factor by that scale.  That doesn't necessarily overrule the arguments in favor of No, and as I wasn't on the call where the decision was made, I can accept the decision.  Just something to remember as we go forward and evaluate other Aurora requests.
Comment 86 Asa Dotzler [:asa] 2011-07-05 14:34:13 PDT
6 is gone. not going to track this for a release we're not going to approve it for.
Comment 87 :aceman 2011-08-11 07:15:32 PDT
*** Bug 678185 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.