Closed Bug 639262 Opened 10 years ago Closed 7 years ago

pinwheeling after resume due to GC/CC

Categories

(Firefox :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: dietrich, Unassigned)

References

Details

(Keywords: hang, perf, Whiteboard: [snappy:p2])

Attachments

(3 files)

I started using a Mac more often recently, and experience this routinely:

1. use browser
2. close lid
3. open lid
4. try to use browser

Expected: Can interact with browser.

Actual: Browser is non-responsive, pinwheel shows up, for up to 20 seconds. Then browser is usable again. Other applications are responsive during this period of time.
Keywords: hang
Whiteboard: [tsnap]
Can you grab a profile while this is happening? I don't think I see this on my MBP...
This is a Spin Control report from a Macbook Air 2.13ghz/4gb/10.6.6. I can reproduce consistently after a suspend/resume cycle of significant time. Going to sleep and immediately resuming doesn't trigger it.

https://people.mozilla.com/~dietrich/hang-on-resume.txt
Pretty sure this is caused by F1. I have the same thing happening, and it goes away when I disable it.

(Adding in some F1 people on CC)
:dietrich, are you using f1?
I did have it installed and enabled. I disabled it yesterday, while trying to narrow down a memory leak. But I have not had a sleep/resume cycle since, so can't say if it's related yet.
Might have been a false alarm as far as F1 goes, sorry about that. I disabled it, and the hang went away, but after waking up my computer today without F1 enabled, I got the same hang.

Attached is a sample of the process. I assume we should do this sample with a build with debug symbols for it to be any use? Let me know what I can do to help track this down.

(the sample was 8MB, so had to create a zip file of it)
FWIW, I've now resumed from sleep a couple of times. No hang with F1 disabled. However, I disabled a bunch of other add-ons as well at the time that I disabled F1.
For F1 we're using bug 642629 to track down the problem and believe to have it fixed in bug 646598. A hotfix of the add-on code should be pushed out fairly soon. If it is the F1 issue you'll need to restart your browser at some point to clear out the hang but it will likely come back until the new add-on is pushed out.
We've got a new version of the add-on out with the fix in, if you can update to that and give your Firefox a restart we can start to rule out F1 as the cause.
I was definitely getting beachballs without F1 enabled over the last week. Not as much though, so it feels like it was contributing to the problem. I've just now updated to the latest F1 and have enabled it.
still problems deitrich?
Keywords: perf
yes! i need to disable all add-ons, try to narrow it down.
Whiteboard: [tsnap] → [snappy]
Whiteboard: [snappy] → [snappy:p2]
i see the files for session restore, localStorage and cookies all written-to while the app is frozen.

this bug could be a combination of many parts of the app all handling the firing of timers/events after resuming the OS, each doing synchronous work or disk IO. or it could be mostly one of the features listed above. i'll continue to investigate.

cookies writes many times during the frozen period, but supposedly writes its file async and shouldn't block while writing. it could be doing too much synchronous work prior to writing, like some of the session restore bugs we've found.
Attached file another sample
looks like it could very well be bug 722646.

basically, we come back from resume, and the idle service considers that time "idle" since no keyboard/mouse interaction, and then Places immediately does a crap-ton of synchronous I/O before you have time to even hit a key.
Depends on: 722646
(In reply to Dietrich Ayala (:dietrich) from comment #15)
> then Places immediately
> does a crap-ton of synchronous I/O before you have time to even hit a key.

Just to clarify all the Places work is asynchronous on idle, though it assumes user is idle. if you are not it's likely something tries to use a synchronous api and contends the mutex with the async work.
I installed memchaser recently, so am always seeing GC/CC times in the add-on bar (which is a scary way to live).

https://addons.mozilla.org/en-US/firefox/addon/memchaser/

This morning, I opened up the laptop, got the long pinwheel again, and noticed that GC was ~20 seconds.
Dietrich has about 400 unrestored tabs, which is probably unique.  Maybe our scheduling is failing.
Assignee: nobody → continuation
Bill, Olli, can you think of any reason why resuming with a ton of unrestored tabs would cause hideous GC and CC times?  (Resuming meaning resuming the whole computer from sleep, the browser is just staying there).
Dietrich, what kinds of CC times have you seen after resume?

Is it possible that because of idle service announcing idleness we try to do lots of I/O and while
doing that we also GC/CC and that all uses lots of memory and we end up tons of page faults?
Does about:memory show page faults on OSX?
(In reply to Olli Pettay [:smaug] from comment #20)
> Dietrich, what kinds of CC times have you seen after resume?

I've seen up to 11 seconds. The only screenshot I have is one that's 1.5 seconds:

http://cl.ly/2m1r1S36023s0V3j3I0r

> Is it possible that because of idle service announcing idleness we try to do
> lots of I/O and while
> doing that we also GC/CC and that all uses lots of memory and we end up tons
> of page faults?
> Does about:memory show page faults on OSX?

Current value is below, but I've just restarted the browser.

     288 ── page-faults-hard
    913,319 ── page-faults-soft

What would be a lot? Should I look after a resume that pinwheels?
Hmm.  If I added stuff about the size of the CC graph to the error console, then we could see if there is a huge amount of stuff in the graph (eg resuming somehow causes us to look at a lot more stuff in the CC), or if the CC is just super slow.  If it is the latter, then maybe we could delay the CC by an extra 10 seconds like we do on startup.
mitchell is seeing this too. She's seated next to me and will show me the next time it happens. Is there anything I should look for that would tell me if it was the same problem Dietrich is seeing?
If you go to about:config and set javascript.options.mem.log to true, then you can look in the error console and see GC and CC times.  Alternatively, you can install the MemChaser addon ( https://addons.mozilla.org/en-US/firefox/addon/memchaser/ ) and that will show the most recent GC and CC times.
For either option, that has to be done before you resume, it can't be done after the fact.  A third option would be installing about:telemetry addon ( https://addons.mozilla.org/en-US/firefox/addon/abouttelemetry/ ) and seeing there are some really long GCs or CCs, but that just will show you if you've had any since you last started up, or in the last 24 hours.
incremental GC might have fixed this. i have not seen it since.
Nope, still around. Memchaser has been broken for GC since incremental GC landed, but CC post-resume was 4 seconds this time:

http://cl.ly/0V292l0u3q3N0v380X1P
Okay, I'll try to recreate it.
Summary: pinwheeling after resume → pinwheeling after resume due to GC/CC
Note, incremental GC gets disabled usually, see Bug 728686.

But ugh, that is a horrible CC time.

Could you perhaps try about:jank addon.
Check what the values about before suspending, and then
after resume check again.
Also, as I said before, perhaps page faults cause it. Running CC/GC needs to bring
all sorts of objects to memory.
So, if you can reproduce this easily, check about:memory before suspending and after resume.
Hard to check before/after easily, since I can't reproduce on-demand and don't really think about this every time i close the lid of my laptop... would quite likely remember *just* after ;)

FWIW, here are screenshots of after:

about:jank: http://cl.ly/2c3T16420O2L270X191O
about:memory 1: http://cl.ly/2L3m3U2s1S0z2f3F3A40
about:memory 2: http://cl.ly/1z1c3F341x2K3W1j382Y
Memchaser makes tracking memory a bit easier, since the pre-resume value is visible:

resident before resume: 1127mb
resident immediate after resume: 262mb
resident 10 seconds after resume: 944mb

2038ms CC, and unknown GC because incremental GC broke Memchaser.
Same pattern today for resident: 1585mb -> pinwheeling -> 350mb -> a few seconds -> 730mb and can use normally thereafter.
This one was serious business: http://cl.ly/3M3X1F1D0W322p3p3O2r
I wonder what we could do to this. If lots of tabs are open, lots of memory is used.
Then, suspending/resuming may cause data to be moved from memory to disk and back to memory.
(That is what I guess is happening.) If I/O is slow, moving data back to memory can take time.
So, we should reduce memory usage.
But we could also try to postpone GC and CC after resuming.
That way more data is hopefully in memory when GC and CC actually run.
Dietrich, if you have a recent (in the last week or so) could you try setting javascript.options.mem.log to true in about:config?  Then when this happens go to the error console and copy the line about the CC here.  It will look something like this:

CC(T+184517.8) duration: 56ms, suspected: 153, visited: 7462 RCed and 5548 GCed, collected: 52 RCed and 2 GCed (54 waiting for GC)
ForgetSkippable 4 times before CC, min: 0 ms, max: 13 ms, avg: 3 ms, total: 15 ms, removed: 370

If the visited part is not unusually large (also copy in the information about the CC after it gets back to a normal length), that would suggest that scanning each thing is long (like if it is due to paging), as opposed to their being some weird leak.  If you also copy in the information about long GCs that will be in the error console that could be helpful, too.
If you need more data, I have the exact same issue.
Kadir, how many tabs do you usually have open?  What version of Firefox are you running? Thanks.
My computer just came up from a fairly long suspend, after I forgot to turn on memchaser logging, so here's some about:telemetry.

One thing of note is that I've got an SSD--before I got it I had resume times in the *minutes*.

I'm not 100% certain that this is Firefox's fault, I've been having this issue for a pretty long time and I vaguely recall thinking that this *wasn't* Fx's fault because once I resumed and it happened and Firefox wasn't even running. But I know Gnome used to use gecko all through all kinds of things and so even if Firefox wasn't the culprit it may have been some embedded XUL somewhere. (This has been happening for at least 12 months.)

At the very least I think I can confirm that something similar happens on Linux.
Attachment #603584 - Attachment mime type: text/plain → text/html
Andrew, here are the GC/CC entries from the console after a long post-resume hang. Also, these are sent to the console every 3 - 15 seconds, with GCs often around 500ms.

CC(T+27710.4) duration: 1745ms, suspected: 6083, visited: 17338 RCed
and 5848 GCed, collected: 5670 RCed and 0 GCed (5670 waiting for GC)
ForgetSkippable 14 times before CC, min: 4 ms, max: 547 ms, avg: 83
ms, total: 1163 ms, removed: 12571

GC(T+27718.6) TotalTime: 7820.2ms, Type: global, MMU(20ms): 0%,
MMU(50ms): 0%, Reason: PAGE_HIDE, NonIncrementalReason: incremental
permanently disabled, +chunks: 0, -chunks: 0
mark: 6992.0, mark-roots: 309.2, mark-other: 352.1, sweep: 735.1,
sweep-obj: 64.2, sweep-string: 10.3, sweep-script: 2.6, sweep-shape:
69.0, discard-code: 28.1, discard-analysis: 134.4, xpconnect: 53.4,
deallocate: 3.1
(In reply to Olli Pettay [:smaug] from comment #35)
> I wonder what we could do to this. If lots of tabs are open, lots of memory
> is used.
> Then, suspending/resuming may cause data to be moved from memory to disk and
> back to memory.

What data, exactly? Is there a way to tell?

A very large transfer of data out of and back into memory sounds exactly like the pattern I see when I experience this problem: pre-resume at 1gb resident, then after resuming and hanging, resident is <200mb, and boom a few seconds later it's back up to ~800mb resident.
(In reply to Andrew McCreight [:mccr8] from comment #38)
> Kadir, how many tabs do you usually have open?  What version of Firefox are
> you running? Thanks.

loaded tabs: ~15
undloaded tabs: ~30

Firefox 12.0a2 (2012-03-13) on Mac OS 10.7
I found a page that shows how we can be notified when a machine resumes:
  https://developer.apple.com/library/mac/#qa/qa1340/_index.html

Maybe when this happens, we should kill any existing GC and CC timers and schedule new ones in 30 seconds, as Olli suggested.
I'm going to WFM this.  Please reopen if you still see this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Just had a 33 second GC after resuming.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Any better now that bug 765215 is fixed in Firefox 20?
Flags: needinfo?(dietrich)
Comment 46 is private: false
I've seen it pretty bad in the last few days actually.
Flags: needinfo?(dietrich)
I'm not really sure what's going on with this one, but leaving it open doesn't seem very useful at this point.
Assignee: continuation → nobody
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Resolution: --- → INCOMPLETE
See Also: → 1310264
You need to log in before you can comment on or make changes to this bug.