Closed Bug 436897 Opened 16 years ago Closed 16 years ago

System slowdown following overactive windowserver (with Thunderbird and Firefox, but not Camino)

Categories

(Core :: Widget: Cocoa, defect, P1)

x86
macOS
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: justdave, Assigned: smichaud)

References

Details

(Keywords: fixed1.9.1, regression, Whiteboard: [tb3needs])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #389683 +++

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a7pre) Gecko/2007072400 Camino/2.0a1pre (like Firefox/3.0a7pre)
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a7pre) Gecko/2007072400 Camino/2.0a1pre (like Firefox/3.0a7pre)

I have noticed over the past week that when Camino has been running for a while, my system starts to slow down drastically. Upon observing Activity Monitor, I found that a WindowServer job is hogging in the excess of 80% of CPU.

It took a few iterations to notice a possible Camino connection. In addition, I have also noticed that this happens more when I am surfing a site with Flash content.

In #Camino, I was told to note the number of ports Camino uses when the slow down occurs. I'll add that data as a comment soon. In my normal use, Camino uses about 240+ (at startup - iGoogle as home page with lots of feeds) and about 140+ afterwards.

Reproducible: Sometimes

Steps to Reproduce:
1. Open a website (possibly with flash content)
2. Wait for a while

Actual Results:  
System starts to slow down with WindowServer starting to use much more CPU than it usually does. Just quitting Camino mostly doesn't help. Logout-login helps.

Expected Results:  
There should be no system performance impact when Camino is running
Flags: blocking-thunderbird3?
Flags: blocking-thunderbird3.0a2?
OK, so creating the pref that's checked by the patch on bug 389683 (ui.use_native_popup_windows) and setting it to true does indeed fix the CPU usage problem in Thunderbird as well.  However, since Thunderbird doesn't use native popup windows, this disables context menus and tooltips when you set this.

That's an acceptable short-term workaround for me, since I rarely use that stuff anyway, but probably not shippable.

Thunderbird version 3.0a2pre (2008053020)
There is a bug report for a similar problem in Firefox: bug 417655.
Assignee: nobody → joshmoz
Component: General → Widget: Cocoa
Flags: blocking-thunderbird3?
Flags: blocking-thunderbird3.0a2?
Product: Thunderbird → Core
QA Contact: general → cocoa
Sigh :-(

Apple's implementation of event taps is very buggy.  But we need event
taps to implement Firefox's (and Thunderbird's) non-native context
menus.

I worked around all the bugs I could in Apple's event taps that I
could consistently reproduce.  Clearly some remain.  I may be able to
work around them.  But first I need to be able to reproduce them.

For what it's worth, I never see this problem myself.  But because I
do so much testing, of so many different versions of Mozilla.org's
browsers, I rarely run any one of them for more than an hour or two at
a time.
> I worked around all the bugs I could in Apple's event taps that I
> could consistently reproduce.

I worked around all the bugs in Apple's event taps that I could
consistently reproduce.
I'll raise the priority if/when this becomes consistently reproducible.
Assignee: joshmoz → smichaud
Priority: -- → P3
Not knowing anything about event taps, here's one piece of information that
might be helpful in debugging (though I rather suspect it won't): if Tb
deadlocks or I break in the debugger and then sit back and wait a while,
WindowServer gradually loses its mind in the same way, presumably because
events are no longer being serviced.
> Just quitting Camino mostly doesn't help. Logout-login helps.

This is probably the most easily identified diagnostic marking this as
an event taps bug.  Look for this in other bugs before marking them
dups of this one.
(In reply to comment #7)

I probably need something like "this sequence of actions makes the
WindowServer start hogging CPU (which doesn't get fixed by quitting
the browser)".

We currently only use event taps for mouse-down events (left-, right-
and other-), and event taps are cross-application.  So sequences of
events that trigger this bug will almost certainly include mouse-downs
and involve more than one application.
Flags: wanted1.9.0.x?
Summary: System slowdown following overactive windowserver, when Thunderbird running → System slowdown following overactive windowserver (with Thunderbird and Firefox, but not Camino)
(In further reply to comment #7)

> or I break in the debugger and then sit back and wait a while

So you're telling me that all you need to do to make this bug happen
is run Thunderbird (or maybe Firefox) in gdb, break, wait a while
(5-10 minutes?), and then "continue" (in gdb)?
I believe justdave is able to reliably reproduce the problem on his system during normal use.  He has provided some Activity Monitor sample traces:

Thunderbird with the progress meter bug (bug 432028) patch applied:
http://people.mozilla.com/~justdave/Sample_of_ThunderbirdDebug-20080531.txt

Window server with the progress bar patch not applied:
http://people.mozilla.com/~justdave/Sample_of_WindowServer-20080530.txt

He has also indicated that, with his custom build, a pop-up for thunderbird in the background will not highlight the active menu options as the mouse passes over them.  This also happens to me for my custom build (with the progress bar patch applied), but the highlight does track with official nightly builds.  I am not able to reproduce the problem, despite much effort.
It's not that I don't believe you.

But we need (and don't yet have) precise and specific STR that anyone
can reproduce.

The traces are nice to have ... but (as far as I can see) they don't
tell us how the problem started.
(In reply to comment #8)
> > Just quitting Camino mostly doesn't help. Logout-login helps.
> 
> This is probably the most easily identified diagnostic marking this as
> an event taps bug.  Look for this in other bugs before marking them
> dups of this one.

This is not the case here.  If I quit Thunderbird, the problem goes away, and WindowServer regains its sanity.
Also of note (mentioned on IRC but I should mention it here as well)...  I'm not sure if it's related, but it's surely possible.  When I do quit Thunderbird, I get a phantom right-click event posted to the system, which gets sent to whatever app is frontmost after TB exits.
(In reply to comment #10)
> (In further reply to comment #7)
> 
> > or I break in the debugger and then sit back and wait a while
> 
> So you're telling me that all you need to do to make this bug happen
> is run Thunderbird (or maybe Firefox) in gdb, break, wait a while
> (5-10 minutes?), and then "continue" (in gdb)?

Except that I never hit continue.  If I hit continue, the problem goes away.


> I get a phantom right-click event posted to the system, which gets
> sent to whatever app is frontmost after TB exits.

This definitely sounds like an event tap problem.

>> > Just quitting Camino mostly doesn't help. Logout-login helps.
>>
>> This is probably the most easily identified diagnostic marking this
>> as an event taps bug.  Look for this in other bugs before marking
>> them dups of this one.
>
> This is not the case here.  If I quit Thunderbird, the problem goes
> away, and WindowServer regains its sanity.

But it sounds like I was wrong about event-tap-induced WindowServer
problems never going away when you quit Thunderbird and/or Firefox.
But yeah, it's 5-10 mins.  If you watch with top, you'll notice the amount of CPU being eaten by WindowServer gradually increasing over time.
(In reply to comment #15 and comment #17)

Interesting, and thanks.

I just failed to reproduce this with Firefox ... but I'll keep trying
(with Thunderbird and Firefox).
(In reply to comment #14)
> Also of note (mentioned on IRC but I should mention it here as well)...  I'm
> not sure if it's related, but it's surely possible.  When I do quit
> Thunderbird, I get a phantom right-click event posted to the system, which gets
> sent to whatever app is frontmost after TB exits.

Jesse reported something like this in bug 393664.
Just throwing out wild ideas (I haven't seen the code at all and don't really know how it works)... when receiving an event tap event from the system, do we have to send something back to the system in the return code from the callback to say that we've handled the event, and if so are we doing that?
(In reply to comment #20)

The relevant code is in widget/src/cocoa/nsToolkit.mm (in
nsToolkit::RegisterForAllProcessMouseEvents() and EventTapCallback()).
Apple's doc (the "Quarz Event Services Reference") is at:

http://developer.apple.com/documentation/Carbon/Reference/QuartzEventServicesRef/Reference/reference.html

No, we don't need to send something back to the system (because we've
installed a "listen only" tap).  We do so anyway ... but in my
original tests it didn't matter whether or not we did this.
(I had a comment about the listener event tap too, but got beaten to the punch :)

Dave, I guess there are two potential debugging things you could do that might help:

1) Is your Thunderbird so busy before the WindowServer activity goes insane
that it is conceivable that events are backing up?  I know you have a lot of
mail accounts configured, and a lot of message filters, all of which might
presumably tie up the UI thread.

2) Can you bring up python (just 'python') in a terminal and type the
following:

import Quartz
Quartz.CGGetEventTapList(0, None)

The return value that gets printed out should be (0, Number of event tap
listeners active on the system).  If you have Thunderbird 3 and Firefox 3 open
and active, we would expect that number to be 2.  If just one of them, then
just 1.
> 2) Can you bring up python (just 'python') in a terminal and type the
> following:
>
> import Quartz
> Quartz.CGGetEventTapList(0, None)

I did this and got an error about a missing argument.

In fact (from the "Quartz Event Services Reference" I cited above) CGGetEventTapList() takes three arguments, the last of which is a pointer to a structure to hold the information it returns.
The correct syntax seems to be:

Quartz.CGGetEventTapList(0, None, None)
>>> Quartz.CGGetEventTapList(0, None, None)
(0, 3)

I have both TB3 and FF3 running (and only one of each), so apparently I have something else running that uses them, too.  I'll have to do some trial and error to figure out what, and it'll have to wait until I have some other work-related stuff out of the way this afternoon.
OK, through some experimenting, I now know with certainty how you can reproduce this at will.

When I ran the above command in python immediately after launching Thunderbird, it says there's 1 listener.  I have TB set to prompt me at launch whether I want to work offline or online.  After I click the Work Online button, the python command then returns 2 event listeners active.

If I set the offline stuff not to prompt me, and restart Thunderbird, then I only get one listener, ever.  I'm going to leave it running right now and see if the presence of the 2nd listener is what caused all the problems.  If this theory is right, then I won't have any CPU issues this time around.
OK, it's been going for a good hour and a half, and no CPU issues yet, so I'd say that's the problem.  Whatever code is restarting things after the "do you want to work offline" prompt isn't throwing away the stuff it's already started at that point.
> I have TB set to prompt me at launch whether I want to work offline
> or online.

How (exactly) do you make it do this?
Preferences > Advanced > Network & Disk Space > Configure Offline Settings

When starting up:
  ( ) Remember previous online state
  (*) Ask me for online state at startup
  ( ) Always start up online
Bingo!

With this setting, I also see two event taps after Thunderbird has
asked me (at startup) whether I want to work offline or online (it
doesn't matter which option I choose).  When I quit Thunderbird the
number of event taps goes back to zero (and I get wierdness in the
Terminal session where I was running python).

This dialog is displayed very early ... which I suspect has something
to do with the problem.

Now that you've given me something to work with, I'll up the priority.
I hope we can kill off Jesse's bug (bug 393664) at the same time.

Thanks both Dave and Andrew -- Dave for the STR, and Andrew for the
Python trick (which is very convenient, and which I didn't know
about).
Priority: P3 → P1
Blocks: 393664
Only wanted1.9.0.x pending the size and risk of the patch. This should be in 1.9.1 though.
Flags: wanted1.9.1?
Flags: wanted1.9.0.x?
Flags: wanted1.9.0.x+
Flags: blocking1.9.1?
Flags: wanted1.9.1?
Flags: blocking1.9.1?
Flags: blocking1.9.1+
Is this really a P1 blocker?  That is, we would hold the beta next week for this?  Seems like we should fix it before final, but even that is kinda iffy; I don't think there's been a lot of feedback about this issue (I see something similar often when firefox is stopped at a gdb breakpoint, but that wouldn't be a blocker).
(In reply to comment #32)

> Is this really a P1 blocker?  That is, we would hold the beta next
> week for this?

Put that way, I think the answer should be "no".

I'm not yet sure I'll be able to fix this bug at all.  And even if the
patch itself isn't particularly complex, investigating this problem
*will* be a complex task.  There's also at least one other bug (itself
quite complex) that should have priority over this one (bug 436473/bug
442442).

Given what I've just said, Vlad, what priority do you think this
should be -- P2 or P3?

(As of comment #30 I know how to reproduce this bug.  So as of that
comment this bug went from impossible to difficult.  But nothing can
make it easy.)
I think P1 is blocking b1, P2 is blocking b2, and P3 is blocking final release -- given those comments, I'd actually put this as a P1-wanted.  That is, we would not block the final release if we didn't fix this, but we really really want to fix it.
Flags: wanted1.9.1+
Flags: blocking1.9.1-
Flags: blocking1.9.1+
While I agree with vlad that the debugging lossage doesn't make this a 1.9.1 blocker, I think that the existence of bug 467630 probably does.  Re-requesting blocking status.
Flags: blocking1.9.1- → blocking1.9.1?
Whiteboard: [tb3needs]
(In reply to comment #36)

What I said in comment #33 still stands.

A lot depends on how reliable the STR for bug 467630 is, and (if it's
100% reliable) how often the problem is likely to occur.
It sounds to me like the behavior in bug 467630 is similar or identical to that described by Dave in comment 29, which you say you're able to reproduce.  Do you disagree?

You're right that it's not clear how prevalent this problem is.
(In reply to comment #38)

> It sounds to me like the behavior in bug 467630 is similar or
> identical to that described by Dave in comment 29, which you say
> you're able to reproduce.  Do you disagree?

I'd say bug 467630 *might* be a dup of this one.  But both bugs are so
complex that one can't (at this point) be more certain than that.

By the way (and I forgot to mention this earlier), as best I can
remember I can only reproduce precisely the symptoms mentioned in
comment #30 -- which doesn't include the "overactive windowserver".

Sometime in the next couple of days I'll check to see I can still
reproduce what I described in comment #30, and to what extent (if any)
I can reproduce bug 467630.
(Following up comment #38)

I still see the "two event taps" (from comment #30) when I set
Shredder to "ask me for online state at startup" (whether I choose
"work offline" or "work online").  But I never see the WindowServer
eating lots of CPU (or RAM).

I tested by starting Shredder and letting it sit for 5-10 minutes.
When I chose "work online" I connected to an IMAP server before
letting Shredder sit.  I tested running Shredder "on its own" and in
gdb.  While running it in gdb, I "broke" before letting Shredder sit
(without doing a "continue", as per comment #15).  So (to cover all
possible combinations of these factors) I did four different tests.  I
never had trouble from the WindowServer in any of them.
I should add that I tested on OS X 10.5.5, with an opt build of
Shredder (containing debug symbols) made from code that I pulled (from
comm-central) this Tuesday morning (2008-12-02).
Steven, as dmose said in comment 7, I see this all the time when I break firefox in gdb and take too long debugging. The system gets its act together as soon as I continue.
(In reply to comment #42)

Thanks for the info.  But (as I *can't* reproduce the "overactive
WindowServer" problem) there's clearly some other factor (or factors)
involved here that nobody's yet been able to identify.

This makes it difficult to discover exactly how widespread this
problem is.

(By the way, though, I've got a hunch about avoiding the "two event
taps" problem which I'm going to try out over the next few hours.  If
I get anywhere I'll post a patch here, for you and others to test.)
Generally when I'm running under the debugger, I use debug builds.  Maybe there's some relevant difference between debug and release builds here?
I've found a fix for the "two event taps" problem.

If you can reproduce the "overactive WindowServer" problem, please try
this patch to see if it also clears that up.

A tryserver build (of Firefox) will follow in a while.  I'm also doing
my own (universal binary) build of Shredder, which I'll upload to my
people.mozilla.com account once it's done.

Turns out the code I was using in
nsToolkit::UnregisterAllProcessMouseEventHandlers() to destroy an
event tap doesn't actually remove it (at least until the app process
ends).  You also need to invalidate the CFMachPortRef object returned
by the call to CGEventTapCreate().  (Of course this isn't documented
anywhere ... as best I can tell.)

It also turns out that (under certain circumstances) the app can
create an nsToolkit object, destroy it, and create it again for the
same thread (this happens every time the app runs, not just when it
does a "double-bounce" on startup).  This may also be a bug ... or
maybe someone here can tell me why it happens (and why it isn't a
bug).  The "certain circumstances" include when you've set Shredder to
ask for the online state at startup.
Anybody tried my patch yet?
I hope to do this today or tomorrow.
Unfortunately, my todo list is a bit crazy.  However, since the problem I'm seeing (stopping in a debugger) isn't what might make this bug want to block 1.9.1, I think having me test doesn't really provide the information we need.  In particular, the Thunderbird problems seen by justdave and bclary are the real issues here.   justdave/bclary, can you guys test the tryserver build linked to from comment 46?
> justdave/bclary, can you guys test the tryserver build linked to
> from comment 46?

Hear! hear! :-)
I thought I had commented about the try server builds here, but must have forgotten to commit it. While the steps to reproduce in Thunderbird were pretty clear to me, I'm not sure how to do so with Firefox. What is the equivalent of asking whether to start up in offline mode in Firefox?
Use my Shredder build from comment #46.
(Following up comment #53)

In other words, I'm satisfied if you only test Shredder, since that's where you originally saw the problem (the overactive WindowServer problem).  There's no guarantee you'd see something similar in Firefox, even with a very similar STR.
I think it is important and exciting that we're making progress on this, but I don't think we should block. Repro steps are still not nailed down, the bug is not particularly visible, it is not a regression from the last release, and I don't see any reason not to software update a final fix if we get one.
Flags: blocking1.9.1? → blocking1.9.1-
fwiw, i'm running steven's build now and don't notice any cpu piggage on windowserver's part, but am letting it continue to run to make sure. I did get the two prompts regarding working on|off line during startup but so far have been good as gold.
build in comment 46 fixes the windowserver piggage. thanks steven.
Attachment #351599 - Flags: review?(joshmoz)
Comment on attachment 351599 [details] [diff] [review]
Fix for "two event taps" (and maybe also for "overactive WindowServer")

Glad to hear it, and you're welcome.

Now I'll start the review process.
Josh: I believe (though I haven't verified) that it is a regression for Thunderbird, since the last release was before the advent of Cocoa widgetry in Gecko.
If this problem is connected with event taps, then yes, this is a regression for Thunderbird (since we didn't use event taps on the 1.8 branch).
Re-requesting blocking, since this is a regression, as per comments 59 and 60.
Flags: blocking1.9.1- → blocking1.9.1?
Attachment #351599 - Flags: superreview?(roc)
Attachment #351599 - Flags: review?(joshmoz)
Attachment #351599 - Flags: review+
So long as we have a patch with confirmation of it fixing the problem I'm happy to block. That is, we should block on getting this patch in. If the patch turns out not to be a fix then we're back to the state of things in my comment #55 we'll need to un-block.
Flags: blocking1.9.1? → blocking1.9.1+
Attachment #351599 - Flags: superreview?(roc) → superreview+
Landed on mozilla-central:
http://hg.mozilla.org/mozilla-central/rev/36d0b62a2f56
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
So now that this is blocking1.9.1+, should I land it right away on the 1.9.1 branch?  Or should I let it bake a few days on the trunk?
Landed on the 1.9.1 branch:
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0ce3af5127dc

The patch is pretty trivial, and will get more testing on the 1.9.1 branch.

This will also make it available to users of Shredder nightlies.
Keywords: fixed1.9.1
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: