Closed Bug 1332744 Opened 9 years ago Closed 9 years ago

Initial idle compaction takes minutes with many tabs

Categories

(Core :: JavaScript: GC, defect)

x86_64
Windows 10
defect
Not set
major

Tracking


RESOLVED FIXED
Tracking Status
firefox53 --- affected

People

(Reporter: bugzilla.mozilla.org, Assigned: jonco)

Details

(Keywords: perf)

Attachments

(4 files)

Steps to reproduce:

1. Create a profile with 1500 tabs (about:blank should do).
2. Enable restoring the previous session on restart.
3. Set javascript.options.compact_on_user_inactive = false.
4. Restart.
5. Go to about:memory, hit "Minimize memory usage", and measure the time.
6. Hit "Minimize memory usage" again and measure the time.

The first pass takes 2-3 minutes for me; the second one is fast and remains fast even during a long-lasting browser session. So something during startup creates a lot of expensive-to-collect garbage that normal operation does not. That means the compact_on_user_inactive behavior, when enabled, causes a hang lasting several minutes at some point after startup, which is why I have to disable it to keep the browser usable.
Do you have a good way of creating a profile with 1500 tabs? When I do it with a simple gBrowser.addTab loop, it fails to save my tabs even though I've configured it to, and/or it crashes. Neither of which are good, but I'd really like to look at this specific bug. (And while I have 1500 tabs open, minimizing memory usage doesn't take much time, but that's not the STR here.) I may have to go back to my original plan of writing out a custom sessionstore.js.
Ah, something must have changed. about:blank used to be persisted; now apparently it is not. Try: for(let i=0;i<1500;i++){gBrowser.addTab("data:text/plain,");} Once you restart the browser it should be effectively the same, since the tabs will be lazy-loaded.
It appears that my last attempt worked. (I loaded www.google.com, which crashed, but it saved the tabs first.) And I can reproduce the problem. The GC pattern is weird, but the main thing is that yes, a compacting GC took a very long time. I will attach the stats, but it was a 94 second pause, 93 seconds of which was spent in the "Compact Update" phase:

    Compact Update: 92927.703ms
      Other: 57003.267ms
      Mark Roots: 35764.395ms
        Other: 32043.854ms
        Mark Cross Compartment Wrappers: 348.138ms
        Mark C and JS stacks: 113.573ms
        Mark Runtime-wide Data: 2314.053ms
        Mark Embedding: 944.778ms
      Compact Update Cells: 160.040ms

Oh. Looks like there was a 103 second compaction too. (Both long ones were from MEM_PRESSURE.) The next longest pause was "only" 7 seconds, from a 218-slice CC_WAITING GC.
(In reply to Steve Fink [:sfink] [:s:] from comment #5) Thanks for reproducing this. This is my fault - we trace the runtime and a bunch of system wide things once per zone when compacting. I expected this part to be much quicker than iterating all the cells in each zone to update them, but I guess this makes us N^2 in the number of zones or something. We should be able to only trace this system-wide stuff once per compacting slice.
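The cost difference described above can be modeled with a small Python sketch (all names here are illustrative, not SpiderMonkey's actual API): re-tracing runtime-wide roots inside the per-zone loop costs O(zones × roots), while hoisting that tracing out to once per slice makes it O(cells + roots).

```python
def update_pointers_naive(zones, runtime_roots):
    """Traces runtime-wide roots once per zone: O(len(zones) * len(runtime_roots))."""
    work = 0
    for zone in zones:
        work += len(zone)            # update cells in this zone
        work += len(runtime_roots)   # re-trace system-wide roots (redundant)
    return work

def update_pointers_fixed(zones, runtime_roots):
    """Traces runtime-wide roots once per compacting slice: O(total cells + roots)."""
    work = 0
    for zone in zones:
        work += len(zone)            # per-zone cell updates only
    work += len(runtime_roots)       # system-wide tracing hoisted out of the loop
    return work

zones = [[0] * 10 for _ in range(1500)]   # many small zones, e.g. one per tab
roots = [0] * 100_000                     # large runtime-wide root set
assert update_pointers_fixed(zones, roots) < update_pointers_naive(zones, roots)
```

With 1500 mostly-empty zones and a large root set, the naive version does over a thousand times more work, which matches the observation that the pause grows with tab count even when the tabs themselves are cheap to update.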
Assignee: nobody → jcoppeard
This patch splits the compacting GC pointer updates into two parts: one that runs once per zone and one that runs once per compacting slice. For the most part this just meant splitting GCRuntime::updatePointersToRelocatedCells in two based on whether each call took a zone parameter or not.

I moved part of JSCompartment::fixupCrossCompartmentWrappersAfterMovingGC into JSCompartment::fixupAfterMovingGC so as not to iterate all compartments on every zone. I changed Debugger::traceIncomingCrossCompartmentEdges to trace edges for all zones when compacting, since it needs to trace all edges whose destination is in a compacting zone now that it's only called once per slice. I also had to remove an assertion in JSObject2WrappedJSMap::UpdateWeakPointersAfterGC because it assumed that the gray roots were marked before this method was called, which is no longer true now that this marking happens once at the end of the slice.

Here is the GC profiling output for an optdebug build with 500 tabs open.

Without patch, 3 matches for "MEM_PRESSURE" in buffer debuglog0.txt:

    MajorGC:            Reason  States  total bgnCB waitBG discrd relzfy prgTbl purge  mark sweep cmpct endCB brrier
    3175:MajorGC: MEM_PRESSURE  0 -> 0  27884     0    367      0     33      1     0  2113   691 24675     0      0
    3185:MajorGC: MEM_PRESSURE  0 -> 0   3369     0     74      0     31      0     0  1936   674   649     0      0
    3191:MajorGC: MEM_PRESSURE  0 -> 0   3580     0     80      0     28      0     0  1979   762   725     0      0

With patch, 7 matches for "MEM_PRESSURE" in buffer debuglog.txt:

    MajorGC:            Reason  States  total bgnCB waitBG discrd relzfy prgTbl purge  mark sweep cmpct endCB brrier
    2947:MajorGC: MEM_PRESSURE  2 -> 0      7     0      0      0      0      0     0     0     0     0     0      0
    2951:MajorGC: MEM_PRESSURE  0 -> 0   4106     0    169      0     29      1     0  2044   729  1131     0      0
    2956:MajorGC: MEM_PRESSURE  0 -> 0   2964     0     67      0     29      0     0  2009   695   159     0      0
    2962:MajorGC: MEM_PRESSURE  0 -> 0   3529     0     73      0     29      0     0  2010   695   718     0      0
    2971:MajorGC: MEM_PRESSURE  0 -> 0   3605     0     74      0     32      0     0  2061   719   715     0      0
    2974:MajorGC: MEM_PRESSURE  0 -> 0   2942     0     74      0     30      0     0  2016   686   132     0      0
    2981:MajorGC: MEM_PRESSURE  0 -> 0   3589     0     75      0     32      0     0  2028   707   741     0      0

You can see that we were getting a 24-second compact before, which has disappeared. (I'm not sure why we triggered more MEM_PRESSURE GCs in the second run.)
Attachment #8832446 - Flags: review?(sphink)
Comment on attachment 8832446 [details] [diff] [review]
bug1332744-improve-compacting-update

Review of attachment 8832446 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not 100% confident that there isn't *something* lurking in here that could cause problems, but I looked pretty closely, and the fact that compaction was already incremental dramatically reduces the opportunities for new problems.

::: js/src/jscompartment.cpp
@@ +871,5 @@
>     MOZ_ASSERT(trc->runtime()->gc.isHeapCompacting());
>
>     for (CompartmentsIter comp(trc->runtime(), SkipAtoms); !comp.done(); comp.next()) {
> +        // Sweep the wrapper map to update keys (wrapped values) in other
> +        // comparments that may have been moved.

*compartments

::: js/src/jsgc.cpp
@@ +2537,1 @@
>     rt->geckoProfiler.fixupStringsMapAfterMovingGC();

Probably ought to move this one too. We've seen some large profiler string tables.

@@ +5507,5 @@
> +    // TODO: JSScripts can move. If the sampler interrupts the GC in the
> +    // middle of relocating an arena, invalid JSScript pointers may be
> +    // accessed. Suppress all sampling until a finer-grained solution can be
> +    // found. See bug 1295775.
> +    AutoSuppressProfilerSampling suppressSampling(rt);

I'm a little concerned about broadening the scope of the sampler suppression. This adds in 3 cache purges, plus arena protection. Though *should* arena protection be within the sampler suppression zone? If so, ignore this comment; I'm guessing that the cache purges themselves aren't too bad. Though I also wonder if sampling needs to be suppressed for the runtime-wide fixup. I don't know whether any of that data could be read. Still, too risky to roll it up into this patch.

@@ +5538,5 @@
> +        Zone* zone = relocatedZones.front();
> +        relocatedZones.removeFront();
> +        zone->setGCState(Zone::Finished);
> +    }
> +    while (!relocatedZones.isEmpty());

This is really nice. It does point out the (preexisting, but different) budget overruns -- after we blow our budget, we still have to do the runtime update. I don't know if it'll be a little worse now because we're updating for more zones at once, or whether we have to visit the same stuff anyway so it won't make a difference. The overall improvement is totally worth it either way. Maybe someday we can add in a predictive model so we can stop *before* we overrun our budget...
Attachment #8832446 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #8)

> *compartments

Cheers, fixed.

> > rt->geckoProfiler.fixupStringsMapAfterMovingGC();
>
> Probably ought to move this one too. We've seen some large profiler string
> tables.

Good idea, I'll give that a try. Note that this may require profiler suppression until the runtime-wide pointers have been updated, even if it wasn't required before.

> I'm a little concerned about broadening the scope of the sampler
> suppression. This adds in 3 cache purges, plus arena protection. Though
> *should* arena protection be within the sampler suppression zone? If so,
> ignore this comment; I'm guessing that the cache purges themselves aren't
> too bad.

The arena protection only happens in debug builds, and the cache purges shouldn't be too bad, like you say.

> Maybe someday we can add in a predictive model so we can stop
> *before* we overrun our budget...

Yes, that would totally make sense here - stop when we get to 90%, or a few milliseconds before the end, or something along those lines.
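The "stop before we overrun the budget" idea could look something like the following Python sketch (illustrative only; the real slice budget machinery lives in SpiderMonkey's C++ code, and all names here are hypothetical): check the remaining budget before starting each zone and bail out at a safety margin, leaving the rest for the next slice.

```python
import time

def update_runtime_wide_pointers():
    pass  # placeholder for the once-per-slice system-wide fixup

def run_compacting_slice(zones, budget_ms, margin=0.9):
    """Process zones until 90% of the slice budget is spent, then yield.

    Returns the zones left over for the next slice. The margin reserves
    time for the once-per-slice runtime-wide pointer update, which must
    run even when we stop early.
    """
    start = time.monotonic()
    remaining = list(zones)
    while remaining:
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms >= budget_ms * margin:
            break                  # stop early instead of overrunning the budget
        zone = remaining.pop(0)
        zone()                     # relocate/update cells for this zone
    update_runtime_wide_pointers() # runs exactly once per slice
    return remaining

# Fifty zones at ~2 ms each cannot fit in a 10 ms slice, so most are deferred.
leftover = run_compacting_slice([lambda: time.sleep(0.002)] * 50, budget_ms=10)
assert 0 < len(leftover) < 50
```

A real predictive model would also track how long the runtime-wide fixup typically takes and shrink the margin accordingly, rather than using a fixed 90% cutoff.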
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Looks good insofar as it doesn't do any minute-long pauses due to compacting anymore, so I can turn javascript.options.compact_on_user_inactive back on. Alas, startup itself still takes minutes. The hunt continues.
(In reply to The 8472 from comment #11) Thanks for reporting this.

> Looks good in so far as it doesn't do any minute-long pauses due to
> compacting anymore so I can turn javascript.options.compact_on_user_inactive
> back on.
>
> Alas, startup itself still takes minutes. The hunt continues.

Well, some good news at least. Are the current long startups also due to GC? If so, can you provide a trace?
I'm not sure, the profiler only manages to capture a tiny fraction of the startup. How do I turn on GC logging to file again?
(In reply to The 8472 from comment #13) You can set the environment variable MOZ_GCTIMER to a filename. Alternatively if you just want to watch what's going on you can set JS_GC_PROFILE=0 and that will print a line for every GC slice that happens.
(In reply to Jon Coppeard (:jonco) from comment #14)

> (In reply to The 8472 from comment #13)
> You can set the environment variable MOZ_GCTIMER to a filename.

Attached a logfile covering startup. This is with e10s disabled due to crashiness in the current nightlies, but the time spent in startup is similar to the e10s-on case.
Ah, now I recall - the crashes were probably due to bug 1261850.
Flags: needinfo?(jcoppeard)
There are a few longish GCs in there - around 2 seconds each - but if your startup is taking minutes I'd say that it's not a major factor. Incidentally, if you're not on the latest nightly we made a change a few days ago that makes the browser trigger zone GCs much more often. That may help if you're seeing a lot of GC activity.
Flags: needinfo?(jcoppeard)
Severity: normal → major
Keywords: perf