Closed Bug 436897 Opened 14 years ago Closed 14 years ago
System slowdown following overactive windowserver (with Thunderbird and Firefox, but not Camino)
+++ This bug was initially created as a clone of Bug #389683 +++ User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a7pre) Gecko/2007072400 Camino/2.0a1pre (like Firefox/3.0a7pre) Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a7pre) Gecko/2007072400 Camino/2.0a1pre (like Firefox/3.0a7pre) I have noticed over the past week that when Camino has been running for a while, my system starts to slow down drastically. Upon observing Activity Monitor, I found that a WindowServer job is hogging in the excess of 80% of CPU. It took a few iterations to notice a possible Camino connection. In addition, I have also noticed that this happens more when I am surfing a site with Flash content. In #Camino, I was told to note the number of ports Camino uses when the slow down occurs. I'll add that data as a comment soon. In my normal use, Camino uses about 240+ (at startup - iGoogle as home page with lots of feeds) and about 140+ afterwards. Reproducible: Sometimes Steps to Reproduce: 1. Open a website (possibly with flash content) 2. Wait for a while Actual Results: System starts to slow down with WindowServer starting to use much more CPU than it usually does. Just quitting Camino mostly doesn't help. Logout-login helps. Expected Results: There should be no system performance impact when Camino is running
OK, so creating the pref that's checked by the patch on bug 389683 (ui.use_native_popup_windows) and setting it to true does indeed fix the CPU usage problem in Thunderbird as well. However, since Thunderbird doesn't use native popup windows, this disables context menus and tooltips when you set this. That's an acceptable short-term workaround for me, since I rarely use that stuff anyway, but probably not shippable.
Thunderbird version 3.0a2pre (2008053020)
There is a bug report for a similar problem in Firefox: bug 417655.
Assignee: nobody → joshmoz
Component: General → Widget: Cocoa
Product: Thunderbird → Core
QA Contact: general → cocoa
Sigh :-( Apple's implementation of event taps is very buggy. But we need event taps to implement Firefox's (and Thunderbird's) non-native context menus. I worked around all the bugs I could in Apple's event taps that I could consistently reproduce. Clearly some remain. I may be able to work around them. But first I need to be able to reproduce them. For what it's worth, I never see this problem myself. But because I do so much testing, of so many different versions of Mozilla.org's browsers, I rarely run any one of them for more than an hour or two at a time.
> I worked around all the bugs I could in Apple's event taps that I > could consistently reproduce. I worked around all the bugs in Apple's event taps that I could consistently reproduce.
I'll raise the priority if/when this becomes consistently reproducible.
Assignee: joshmoz → smichaud
Priority: -- → P3
Not knowing anything about event taps, here's one piece of information that might be helpful in debugging (though I rather suspect it won't): if Tb deadlocks or I break in the debugger and then sit back and wait a while, WindowServer gradually loses its mind in the same way, presumably because events are no longer being serviced.
> Just quitting Camino mostly doesn't help. Logout-login helps. This is probably the most easily identified diagnostic marking this as an event taps bug. Look for this in other bugs before marking them dups of this one.
(In reply to comment #7) I probably need something like "this sequence of actions makes the WindowServer start hogging CPU (which doesn't get fixed by quitting the browser)". We currently only use event taps for mouse-down events (left-, right- and other-), and event taps are cross-application. So sequences of events that trigger this bug will almost certainly include mouse-downs and involve more than one application.
Summary: System slowdown following overactive windowserver, when Thunderbird running → System slowdown following overactive windowserver (with Thunderbird and Firefox, but not Camino)
(In further reply to comment #7) > or I break in the debugger and then sit back and wait a while So you're telling me that all you need to do to make this bug happen is run Thunderbird (or maybe Firefox) in gdb, break, wait a while (5-10 minutes?), and then "continue" (in gdb)?
I believe justdave is able to reliably reproduce the problem on his system during normal use. He has provided some Activity Monitor sample traces: Thunderbird with the progress meter bug (bug 432028) patch applied: http://people.mozilla.com/~justdave/Sample_of_ThunderbirdDebug-20080531.txt Window server with the progress bar patch not applied: http://people.mozilla.com/~justdave/Sample_of_WindowServer-20080530.txt He has also indicated that, with his custom build, a pop-up for thunderbird in the background will not highlight the active menu options as the mouse passes over them. This also happens to me for my custom build (with the progress bar patch applied), but the highlight does track with official nightly builds. I am not able to reproduce the problem, despite much effort.
It's not that I don't believe you. But we need (and don't yet have) precise and specific STR that anyone can reproduce. The traces are nice to have ... but (as far as I can see) they don't tell us how the problem started.
(In reply to comment #8) > > Just quitting Camino mostly doesn't help. Logout-login helps. > > This is probably the most easily identified diagnostic marking this as > an event taps bug. Look for this in other bugs before marking them > dups of this one. This is not the case here. If I quit Thunderbird, the problem goes away, and WindowServer regains its sanity.
Also of note (mentioned on IRC but I should mention it here as well)... I'm not sure if it's related, but it's surely possible. When I do quit Thunderbird, I get a phantom right-click event posted to the system, which gets sent to whatever app is frontmost after TB exits.
(In reply to comment #10) > (In further reply to comment #7) > > > or I break in the debugger and then sit back and wait a while > > So you're telling me that all you need to do to make this bug happen > is run Thunderbird (or maybe Firefox) in gdb, break, wait a while > (5-10 minutes?), and then "continue" (in gdb)? Except that I never hit continue. If I hit continue, the problem goes away.
> I get a phantom right-click event posted to the system, which gets > sent to whatever app is frontmost after TB exits. This definitely sounds like an event tap problem. >> > Just quitting Camino mostly doesn't help. Logout-login helps. >> >> This is probably the most easily identified diagnostic marking this >> as an event taps bug. Look for this in other bugs before marking >> them dups of this one. > > This is not the case here. If I quit Thunderbird, the problem goes > away, and WindowServer regains its sanity. But it sounds like I was wrong about event-tap-induced WindowServer problems never going away when you quit Thunderbird and/or Firefox.
But yeah, it's 5-10 mins. If you watch with top, you'll notice the amount of CPU being eaten by WindowServer gradually increasing over time.
(In reply to comment #15 and comment #17) Interesting, and thanks. I just failed to reproduce this with Firefox ... but I'll keep trying (with Thunderbird and Firefox).
(In reply to comment #14) > Also of note (mentioned on IRC but I should mention it here as well)... I'm > not sure if it's related, but it's surely possible. When I do quit > Thunderbird, I get a phantom right-click event posted to the system, which gets > sent to whatever app is frontmost after TB exits. Jesse reported something like this in bug 393664.
Just throwing out wild ideas (I haven't seen the code at all and don't really know how it works)... when receiving an event tap event from the system, do we have to send something back to the system in the return code from the callback to say that we've handled the event, and if so are we doing that?
(In reply to comment #20) The relevant code is in widget/src/cocoa/nsToolkit.mm (in nsToolkit::RegisterForAllProcessMouseEvents() and EventTapCallback()). Apple's doc (the "Quarz Event Services Reference") is at: http://developer.apple.com/documentation/Carbon/Reference/QuartzEventServicesRef/Reference/reference.html No, we don't need to send something back to the system (because we've installed a "listen only" tap). We do so anyway ... but in my original tests it didn't matter whether or not we did this.
(I had a comment about the listener event tap too, but got beaten to the punch :) Dave, I guess there are two potential debugging things you could do that might help: 1) Is your Thunderbird so busy before the WindowServer activity goes insane that it is conceivable that events are backing up? I know you have a lot of mail accounts configured, and a lot of message filters, all of which might presumably tie up the UI thread. 2) Can you bring up python (just 'python') in a terminal and type the following: import Quartz Quartz.CGGetEventTapList(0, None) The return value that gets printed out should be (0, Number of event tap listeners active on the system). If you have Thunderbird 3 and Firefox 3 open and active, we would expect that number to be 2. If just one of them, then just 1.
> 2) Can you bring up python (just 'python') in a terminal and type the > following: > > import Quartz > Quartz.CGGetEventTapList(0, None) I did this and got an error about a missing argument. In fact (from the "Quartz Event Services Reference" I cited above) CGGetEventTapList() takes three arguments, the last of which is a pointer to a structure to hold the information it returns.
The correct syntax seems to be: Quartz.CGGetEventTapList(0, None, None)
>>> Quartz.CGGetEventTapList(0, None, None) (0, 3) I have both TB3 and FF3 running (and only one of each), so apparently I have something else running that uses them, too. I'll have to do some trial and error to figure out what, and it'll have to wait until I have some other work-related stuff out of the way this afternoon.
OK, through some experimenting, I now know with certainty how you can reproduce this at will. When I ran the above command in python immediately after launching Thunderbird, it says there's 1 listener. I have TB set to prompt me at launch whether I want to work offline or online. After I click the Work Online button, the python command then returns 2 event listeners active. If I set the offline stuff not to prompt me, and restart Thunderbird, then I only get one listener, ever. I'm going to leave it running right now and see if the presence of the 2nd listener is what caused all the problems. If this theory is right, then I won't have any CPU issues this time around.
OK, it's been going for a good hour and a half, and no CPU issues yet, so I'd say that's the problem. Whatever code is restarting things after the "do you want to work offline" prompt isn't throwing away the stuff it's already started at that point.
> I have TB set to prompt me at launch whether I want to work offline > or online. How (exactly) do you make it do this?
Preferences > Advanced > Network & Disk Space > Configure Offline Settings When starting up: ( ) Remember previous online state (*) Ask me for online state at startup ( ) Always start up online
Bingo! With this setting, I also see two event taps after Thunderbird has asked me (at startup) whether I want to work offline or online (it doesn't matter which option I choose). When I quit Thunderbird the number of event taps goes back to zero (and I get wierdness in the Terminal session where I was running python). This dialog is displayed very early ... which I suspect has something to do with the problem. Now that you've given me something to work with, I'll up the priority. I hope we can kill off Jesse's bug (bug 393664) at the same time. Thanks both Dave and Andrew -- Dave for the STR, and Andrew for the Python trick (which is very convenient, and which I didn't know about).
Priority: P3 → P1
Only wanted1.9.0.x pending the size and risk of the patch. This should be in 1.9.1 though.
Is this really a P1 blocker? That is, we would hold the beta next week for this? Seems like we should fix it before final, but even that is kinda iffy; I don't think there's been a lot of feedback about this issue (I see something similar often when firefox is stopped at a gdb breakpoint, but that wouldn't be a blocker).
(In reply to comment #32) > Is this really a P1 blocker? That is, we would hold the beta next > week for this? Put that way, I think the answer should be "no". I'm not yet sure I'll be able to fix this bug at all. And even if the patch itself isn't particularly complex, investigating this problem *will* be a complex task. There's also at least one other bug (itself quite complex) that should have priority over this one (bug 436473/bug 442442). Given what I've just said, Vlad, what priority do you think this should be -- P2 or P3? (As of comment #30 I know how to reproduce this bug. So as of that comment this bug went from impossible to difficult. But nothing can make it easy.)
I think P1 is blocking b1, P2 is blocking b2, and P3 is blocking final release -- given those comments, I'd actually put this as a P1-wanted. That is, we would not block the final release if we didn't fix this, but we really really want to fix it.
While I agree with vlad that the debugging lossage doesn't make this a 1.9.1 blocker, I think that the existence of bug 467630 probably does. Re-requesting blocking status.
Flags: blocking1.9.1- → blocking1.9.1?
(In reply to comment #36) What I said in comment #33 still stands. A lot depends on how reliable the STR for bug 467630 is, and (if it's 100% reliable) how often the problem is likely to occur.
It sounds to me like the behavior in bug 467630 is similar or identical to that described by Dave in comment 29, which you say you're able to reproduce. Do you disagree? You're right that it's not clear how prevalent this problem is.
(In reply to comment #38) > It sounds to me like the behavior in bug 467630 is similar or > identical to that described by Dave in comment 29, which you say > you're able to reproduce. Do you disagree? I'd say bug 467630 *might* be a dup of this one. But both bugs are so complex that one can't (at this point) be more certain than that. By the way (and I forgot to mention this earlier), as best I can remember I can only reproduce precisely the symptoms mentioned in comment #30 -- which doesn't include the "overactive windowserver". Sometime in the next couple of days I'll check to see I can still reproduce what I described in comment #30, and to what extent (if any) I can reproduce bug 467630.
(Following up comment #38) I still see the "two event taps" (from comment #30) when I set Shredder to "ask me for online state at startup" (whether I choose "work offline" or "work online"). But I never see the WindowServer eating lots of CPU (or RAM). I tested by starting Shredder and letting it sit for 5-10 minutes. When I chose "work online" I connected to an IMAP server before letting Shredder sit. I tested running Shredder "on its own" and in gdb. While running it in gdb, I "broke" before letting Shredder sit (without doing a "continue", as per comment #15). So (to cover all possible combinations of these factors) I did four different tests. I never had trouble from the WindowServer in any of them.
I should add that I tested on OS X 10.5.5, with an opt build of Shredder (containing debug symbols) made from code that I pulled (from comm-central) this Tuesday morning (2008-12-02).
Steven, as dmose said in comment 7, I see this all the time when I break firefox in gdb and take too long debugging. The system gets its act together as soon as I continue.
(In reply to comment #42) Thanks for the info. But (as I *can't* reproduce the "overactive WindowServer" problem) there's clearly some other factor (or factors) involved here that nobody's yet been able to identify. This makes it difficult to discover exactly how widespread this problem is. (By the way, though, I've got a hunch about avoiding the "two event taps" problem which I'm going to try out over the next few hours. If I get anywhere I'll post a patch here, for you and others to test.)
Generally when I'm running under the debugger, I use debug builds. Maybe there's some relevant difference between debug and release builds here?
I've found a fix for the "two event taps" problem. If you can reproduce the "overactive WindowServer" problem, please try this patch to see if it also clears that up. A tryserver build (of Firefox) will follow in a while. I'm also doing my own (universal binary) build of Shredder, which I'll upload to my people.mozilla.com account once it's done. Turns out the code I was using in nsToolkit::UnregisterAllProcessMouseEventHandlers() to destroy an event tap doesn't actually remove it (at least until the app process ends). You also need to invalidate the CFMachPortRef object returned by the call to CGEventTapCreate(). (Of course this isn't documented anywhere ... as best I can tell.) It also turns out that (under certain circumstances) the app can create an nsToolkit object, destroy it, and create it again for the same thread (this happens every time the app runs, not just when it does a "double-bounce" on startup). This may also be a bug ... or maybe someone here can tell me why it happens (and why it isn't a bug). The "certain circumstances" include when you've set Shredder to ask for the online state at startup.
Here's my Shredder build (made with my patch from comment #45): http://people.mozilla.com/~stmichaud/bmo/thunderbird-3.0b2pre.en-US.mac.bugzilla436897-smichaud.dmg
Here's a tryserver build (of Firefox) made with my patch from comment #45: https://build.mozilla.org/tryserver-builds/2008-12-05_12:firstname.lastname@example.orgemail@example.com
Anybody tried my patch yet?
I hope to do this today or tomorrow.
Unfortunately, my todo list is a bit crazy. However, since the problem I'm seeing (stopping in a debugger) isn't what might make this bug want to block 1.9.1, I think having me test doesn't really provide the information we need. In particular, the Thunderbird problems seen by justdave and bclary are the real issues here. justdave/bclary, can you guys test the tryserver build linked to from comment 46?
> justdave/bclary, can you guys test the tryserver build linked to > from comment 46? Hear! hear! :-)
I thought I had commented about the try server builds here, but must have forgotten to commit it. While the steps to reproduce in Thunderbird were pretty clear to me, I'm not sure how to do so with Firefox. What is the equivalent of asking whether to start up in offline mode in Firefox?
Use my Shredder build from comment #46.
(Following up comment #53) In other words, I'm satisfied if you only test Shredder, since that's where you originally saw the problem (the overactive WindowServer problem). There's no guarantee you'd see something similar in Firefox, even with a very similar STR.
I think it is important and exciting that we're making progress on this, but I don't think we should block. Repro steps are still not nailed down, the bug is not particularly visible, it is not a regression from the last release, and I don't see any reason not to software update a final fix if we get one.
Flags: blocking1.9.1? → blocking1.9.1-
fwiw, i'm running steven's build now and don't notice any cpu piggage on windowserver's part, but am letting it continue to run to make sure. I did get the two prompts regarding working on|off line during startup but so far have been good as gold.
build in comment 46 fixes the windowserver piggage. thanks steven.
Comment on attachment 351599 [details] [diff] [review] Fix for "two event taps" (and maybe also for "overactive WindowServer") Glad to hear it, and you're welcome. Now I'll start the review process.
Josh: I believe (though I haven't verified) that it is a regression for Thunderbird, since the last release was before the advent of Cocoa widgetry in Gecko.
If this problem is connected with event taps, then yes, this is a regression for Thunderbird (since we didn't use event taps on the 1.8 branch).
Re-requesting blocking, since this is a regression, as per comments 59 and 60.
Flags: blocking1.9.1- → blocking1.9.1?
(and comment 57, really).
So long as we have a patch with confirmation of it fixing the problem I'm happy to block. That is, we should block on getting this patch in. If the patch turns out not to be a fix then we're back to the state of things in my comment #55 we'll need to un-block.
Flags: blocking1.9.1? → blocking1.9.1+
Attachment #351599 - Flags: superreview?(roc) → superreview+
Landed on mozilla-central: http://hg.mozilla.org/mozilla-central/rev/36d0b62a2f56
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
So now that this is blocking1.9.1+, should I land it right away on the 1.9.1 branch? Or should I let it bake a few days on the trunk?
Landed on the 1.9.1 branch: http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0ce3af5127dc The patch is pretty trivial, and will get more testing on the 1.9.1 branch. This will also make it available to users of Shredder nightlies.
You need to log in before you can comment on or make changes to this bug.