Closed Bug 665997 Opened 13 years ago Closed 12 years ago

General lagginess in recent firefox builds

Categories: Core :: Widget: Gtk, defect
Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: (Reporter: catlee, Unassigned)
Keywords: perf
Attachments: 2 files

Firefox has been unbearably laggy in recent weeks. Firefox 5.0 final as well as 7.0 nightlies are affected. Symptoms include:

* Pauses while typing into text input fields (like this bug report)
* Pauses when switching between tabs
* Pauses when scrolling
* Pauses when navigating awesomebar results
* Slow clipboard access, e.g. I copy something in Firefox with Ctrl-C, then paste in vim or my terminal with Ctrl-V, and it takes a second or two to show up.

Most of the pauses above are between 2-3 seconds. I'm running 64-bit Linux (Debian testing/unstable), on an ext4 filesystem mounted on an encrypted drive. Kernel version is currently 2.6.38; glibc is 2.13-4.

bhearsum sees some of the above symptoms also, but not all of them.

If there's any debugging or profiling information I can provide, let me know; this is driving me crazy.

I took a screencast of the delay in text input: http://people.mozilla.org/~catlee/lag.ogv
Thanks, Chris. Looks bad. I have occasionally noticed a little delay in typing, 2-3 characters behind, but I had thought it was always at a time when the browser was busy with something such as a page load or location bar completions.

Is it reproducible enough for you to get a regression range?

Can you run a system-monitor tool such as gkrellm or xosview, please, to try to determine whether pauses are associated with CPU use, disk activity, or network activity?
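For what it's worth, a command-line alternative to a GUI monitor (a minimal sketch; assumes vmstat is available and iostat comes from the sysstat package):

# Sample CPU, memory, swap, and block I/O once per second while reproducing the lag
vmstat 1

# Per-device disk utilisation, once per second
iostat -x 1

Watching these while reproducing a pause should show whether CPU or disk is pegged at the time.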
I have been seeing some issues as well; WordPress is really bad, mainly the pauses when typing in text areas. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110621 Firefox/7.0a1.
So far it seems like the lags correspond with a spike to 100% CPU usage on one of my cores, as well as disk write activity.
Nothing obvious in the strace output. Lots of lines like this:

poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=10, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=49, events=POLLIN}, {fd=73, events=POLLIN}, {fd=23, events=POLLIN}], 9, 0) = 0 (Timeout) <0.000021>
read(3, 0x7fa51106e074, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000014>
Polling like that is probably expected from the event loop. Excluding these calls might reduce noise:

strace -T -e '!poll,read'

Can you comment on whether the issue gets worse over time? Any evidence of memory leaks?

Running with MOZ_INSTRUMENT_EVENT_LOOP=1 in the environment will log whenever an event takes longer than 50 ms to process. This would help diagnose whether one thing is taking too long, or whether there are just so many events that painting never happens.

Observations from lag.ogv: there is a flashing icon in the desktop status bar indicating that the server is responsive, but the flashing Mozilla icon on an app tab stops flashing, indicating that Firefox is not drawing in general. (It's not specific to the text input.)
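A rough sketch of how those two suggestions could be combined on a running build (assuming the process is named firefox-bin and that the MOZ_EVENT_TRACE lines go to standard output/error):

# Start Firefox with event-loop instrumentation and keep only the trace lines
MOZ_INSTRUMENT_EVENT_LOOP=1 firefox 2>&1 | grep MOZ_EVENT_TRACE

# In another terminal, attach strace with the noisy poll/read calls excluded
strace -T -p $(pidof firefox-bin) -e '!poll,read'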
Sample output:

MOZ_EVENT_TRACE sample 1308797266260 2912
MOZ_EVENT_TRACE sample 1308797269695 119
MOZ_EVENT_TRACE sample 1308797270594 83
MOZ_EVENT_TRACE sample 1308797271876 58
MOZ_EVENT_TRACE sample 1308797272789 198
MOZ_EVENT_TRACE sample 1308797273153 210
MOZ_EVENT_TRACE sample 1308797273345 141
MOZ_EVENT_TRACE sample 1308797273645 95
MOZ_EVENT_TRACE sample 1308797273884 137
MOZ_EVENT_TRACE sample 1308797274148 213
MOZ_EVENT_TRACE sample 1308797274327 76
MOZ_EVENT_TRACE sample 1308797274829 93
MOZ_EVENT_TRACE sample 1308797274944 64
MOZ_EVENT_TRACE sample 1308797275106 60
MOZ_EVENT_TRACE sample 1308797275297 89
MOZ_EVENT_TRACE sample 1308797275447 98
MOZ_EVENT_TRACE sample 1308797284175 2912
MOZ_EVENT_TRACE sample 1308797287561 121
MOZ_EVENT_TRACE sample 1308797291678 87
MOZ_EVENT_TRACE sample 1308797295795 88
MOZ_EVENT_TRACE sample 1308797297324 50
MOZ_EVENT_TRACE sample 1308797298577 79
MOZ_EVENT_TRACE sample 1308797298721 92
MOZ_EVENT_TRACE sample 1308797299439 54
MOZ_EVENT_TRACE sample 1308797302496 2904
MOZ_EVENT_TRACE sample 1308797305876 116
MOZ_EVENT_TRACE sample 1308797309991 86
MOZ_EVENT_TRACE sample 1308797314108 87
MOZ_EVENT_TRACE sample 1308797320388 2863
I have a hunch this is I/O related. It's much, much worse when I'm compiling something like Firefox.
** (firefox:22827): WARNING **: Serious fd usage error 15
** (firefox:22827): WARNING **: Serious fd usage error 11
** (firefox:22829): WARNING **: Serious fd usage error 15
** (firefox:22829): WARNING **: Serious fd usage error 11

MOZ_EVENT_TRACE sample 1308799513602 90452

fd 11,15 are:
firefox 30288 catlee 11u unix 0xffff880115b52d80 0t0 4036225 socket
firefox 30288 catlee 15u unix 0xffff880037a63740 0t0 4036227 /tmp/orbit-catlee/linc-7650-0-3e77f1ef5fb8
And some stats from strace:

[catlee] ~/mozilla/mozilla-central (hg)-[default]-% strace -p 30288 -c
Process 30288 attached - interrupt to quit
^CProcess 30288 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.08    0.028002          32       883        73 futex
 37.88    0.017088           3      4925           poll
  0.03    0.000015           0      6182      4852 read
  0.00    0.000000           0       192           write
  0.00    0.000000           0         2           close
  0.00    0.000000           0        58         4 stat
  0.00    0.000000           0        26           lseek
  0.00    0.000000           0        22           mmap
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0        14           munmap
  0.00    0.000000           0       605           writev
  0.00    0.000000           0        90           madvise
  0.00    0.000000           0        45           getsockname
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           sched_setscheduler
  0.00    0.000000           0         2           sched_get_priority_max
  0.00    0.000000           0         2           sched_get_priority_min
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.045105                 13052      4929 total

Are so many failed read calls normal? Attaching strace shows these two lines repeating over and over:

poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=9, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=36, events=POLLIN}, {fd=35, events=POLLIN}, {fd=22, events=POLLIN}], 9, 0) = 0 (Timeout)
read(4, 0x7f75d5b9d074, 4096) = -1 EAGAIN (Resource temporarily unavailable)

Why do we read from fd 4 if poll timed out? Excluding poll and read from strace shows mostly futex calls.
Attached file jprof profile
Nothing jumps out from that other than it's mostly js::Interpret(). No GC hits. It would help to know what you were doing when you captured it, or to run a longer sequence (or use a shorter period: JP_PERIOD=0.002, or both).
Also, what's the environment: # of tabs, what are they on, etc.?

Hmmm. "on an ext4 filesystem mounted on an encrypted drive" -- perhaps when typing into a text box it writes to the profile, or reads from it? Compiles would make that slower, especially on an encrypted drive.
(In reply to comment #6)
> Sample output:
> MOZ_EVENT_TRACE sample 1308797266260 2912

Yes, some long pauses of about 2.9 seconds.

(In reply to comment #8)
> ** (firefox:22829): WARNING **: Serious fd usage error 15
>
> ** (firefox:22829): WARNING **: Serious fd usage error 11
> MOZ_EVENT_TRACE sample 1308799513602 90452
>
> fd 11,15 are:
> firefox 30288 catlee 11u unix 0xffff880115b52d80 0t0 4036225 socket
> firefox 30288 catlee 15u unix 0xffff880037a63740 0t0 4036227
> /tmp/orbit-catlee/linc-7650-0-3e77f1ef5fb8

A 90 second period of unresponsiveness - wow.

I don't understand the ORBit error.
http://git.gnome.org/browse/ORBit2/tree/linc2/src/linc-connection.c#n929
I don't know why the file descriptor would be invalid. Is 22829 actually the process id of the main Firefox process? (Wondering whether something funny is happening after fork.) But I've seen that before, so I'm not expecting it to be directly related to lagginess.

Can you disable all plugins please to confirm they are not involved?
(In reply to comment #9)
> [catlee] ~/mozilla/mozilla-central (hg)-[default]-% strace -p 30288 -c
>
> Process 30288 attached - interrupt to quit
> ^CProcess 30288 detached
> % time     seconds  usecs/call     calls    errors syscall
> 100.00    0.045105                 13052      4929 total

Time here is system (in-kernel) CPU time, which is minimal, but this doesn't tell us about wall-clock time spent waiting.

> Are so many failed read calls normal?

Many EAGAIN "errors" would be expected with non-blocking IO.

> Attaching strace shows these two lines repeating over and over:
>
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=9, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=36, events=POLLIN}, {fd=35, events=POLLIN}, {fd=22, events=POLLIN}], 9, 0) = 0 (Timeout)
> read(4, 0x7f75d5b9d074, 4096) = -1 EAGAIN (Resource temporarily unavailable)
>
> Why do we read from fd 4 if poll timed out?

Yes, that does seem unnecessary, but it happens even when things are working fine. fd 4 seems to be the connection to the X server. I expect some calls like XNextEvent() would read from the connection at least sometimes.

I would expect these polls to pause during the period of unresponsiveness. Can you tell whether that is what happens?
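If it helps to double-check what fd 4 actually is, a quick sketch (pid 30288 taken from the earlier output):

# Show what file descriptor 4 points at
ls -l /proc/30288/fd/4

# Or ask lsof for just that descriptor
lsof -p 30288 -a -d 4

For the X connection this should show a unix-domain socket.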
CCing Taras as he might have some good hints for watching disk access.
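One low-tech way to watch the browser's disk access in the meantime (a sketch; assumes a kernel with per-process I/O accounting, a process named firefox-bin, and root access for iotop):

# Cumulative bytes read/written by the process so far
cat /proc/$(pidof firefox-bin)/io

# Or watch the I/O rate live
sudo iotop -p $(pidof firefox-bin)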
Attached file larger jprof profile
(In reply to comment #12)
> Also, what's the environment: # of tabs, what are they on, etc.

I've got 26 tabs across 5 tab groups plus 6 app tabs. My app tabs are gmail, greader, google cal, zimbra, remember the milk, and yammer. Most of my other tabs are pretty normal static html pages.
> Can you disable all plugins please to confirm they are not involved?

I'm currently running with -safe-mode and have disabled all plugins, and I'm still experiencing this. (Incidentally, running with -safe-mode doesn't disable plugins.)

> I would expect these polls to pause during the period of unresponsiveness.
> Can you tell whether that is what happens?

I don't notice any difference in poll times in the unresponsive periods. Most polls complete in well under 0.01 seconds.

I just noticed that Firefox is spamming stderr (fd 2) with:

(firefox-bin:15907): Gdk-WARNING **: XID collision, trouble ahead

This isn't showing up in my console for some reason, but is being redirected to a global log I have catching output in my ~/.xsession file. This file is growing at about 3.8 kB/s. I'll try redirecting this to /dev/null.
> This isn't showing up in my console for some reason, but is being redirected
> to a global log I have catching output in my ~/.xsession file. This file is
> growing at about 3.8kb/s. I'll try redirecting this to /dev/null.

Redirecting stderr reduced the spam in my global X log file, but doesn't improve the lagginess.
Well, it's doing mostly JS stuff. A bunch goes through:

27.6% XPCWrappedNative::CallMethod()

The top of the biggest stack is PrepareAndDispatch() at 70%, leading to JS::Interpret() and JS::Invoke(). The XRE_main event loop runs (14.x%) and does some GC, but GC is a small percentage (<4%) of CPU time, so I doubt that's your problem. Around 8% is in chatting with gtk. So barring some of this JS code doing more than expected, I see no real hotspots.

The next step may help: try using JP_REALTIME. This will sample off wall-clock time, so if it's being slow in a system or other library call (but not chewing CPU) it should catch it. Note that you want to not spend more time idle than you have to during the test when done this way, or find a way to start and stop it programmatically using JS (but since this is a user-interaction bug I doubt that would work).
Note that you start collection in JP_REALTIME via kill -ABRT not -PROF (I should fix that; we know it's realtime...)
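For reference, roughly how such a run might look from the shell (a sketch; it uses only the knobs mentioned in the last two comments and assumes a jprof-enabled build whose running process shows up as firefox-bin):

# Wall-clock sampling at a 2 ms period
JP_REALTIME=1 JP_PERIOD=0.002 ./firefox &

# Reproduce the lag, then start collection as described above
kill -ABRT $(pidof firefox-bin)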
(In reply to comment #18)
> > I would expect these polls to pause during the period of unresponsiveness.
> > Can you tell whether that is what happens?
>
> I don't notice any difference in poll times in the unresponsive periods.
> Most polls complete in well under 0.01 seconds.

What I meant (but didn't say clearly) was that I would expect no polling, or at least fewer polls, in the unresponsive times.

> I just noticed that firefox is spamming stderr (fd 2) with
> (firefox-bin:15907): Gdk-WARNING **: XID collision, trouble ahead
>
> This isn't showing up in my console for some reason, but is being redirected
> to a global log I have catching output in my ~/.xsession file.

That was a known issue with Flash Player, but Flash should be running out of process (not firefox-bin) and is disabled anyway. A break point in g_log would enable getting a backtrace.

If it is appearing in ~/.xsession rather than the console that started Firefox, it suggests that there is another Firefox running, unless there's some environment variable we're using to determine error output.
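If you want to chase that warning, one possible way to get the backtrace from g_log (a sketch; assumes gdb and GLib debug symbols are installed):

# Attach to the running process and stop at the next g_log() call
gdb -p $(pidof firefox-bin) -ex 'break g_log' -ex 'continue'

# When it stops:
#   (gdb) backtrace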
So I wrote a little script that monitors page faults per second (as reported by ps -o min_flt). Here's some sample output:

0 12 3 0 0 0 3 0 0 0 0 1439 4591 1912 15351 5 22 524

Periods of lagginess correspond with the spikes in these soft page faults.
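A minimal sketch of that kind of monitor (not the exact script; assumes procps's ps supports the min_flt output field, and the file name is hypothetical):

#!/bin/sh
# minflt-watch.sh: print the per-second delta of minor (soft) page faults for a pid
pid=$1
prev=$(ps -o min_flt= -p "$pid")
while sleep 1; do
    cur=$(ps -o min_flt= -p "$pid")
    echo $((cur - prev))
    prev=$cur
done

Run as, e.g.: ./minflt-watch.sh $(pidof firefox-bin)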
firefox-bin is currently using 2,328,676 kB of virtual memory (vsz) and 1,348,904 kB resident. free reports:

             total       used       free     shared    buffers     cached
Mem:       3920404    3605112     315292          0      62068     827464
-/+ buffers/cache:    2715580    1204824
Swap:      5242876     493860    4749016
I experience this issue with one tab open. It sits at 110% CPU and destroys my battery life. Switching back to the release (Fx 5.0), everything goes back to normal.
BTW: I noticed this increase in CPU usage on 2011-06-24 or 2011-06-25, in the AM.
:ddahl -- that sounds like a different bug than is reported here. Please file a separate bug report and please include the URL that one tab is on, and if it occurs if you start Firefox in safe mode (to rule out issues with extensions/flash/etc). Thanks.
bug 667250 may well be related or the same bug. I have a clear profile on that one.
So my session store recently got corrupted, so I now have far fewer tabs open, and performance is much better.
(In reply to Chris AtLee [:catlee] from comment #29)
> So my session store recently got corrupted, so I now have far fewer tabs
> open, and performance is much better.

Do you mean far fewer than 32? (Comment 25 - FWIW, 32 is peanuts; I typically have >100 on a marginal Vista laptop, and the only appreciable perf hit is GC and CC.)

Better as in fine? Or better as in improved, but still bad?
Flags: needinfo?(catlee)
Keywords: perf
I haven't had this problem in a while. I suspect it was due to bug 705597.
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: needinfo?(catlee)
Resolution: --- → WORKSFORME