Bugzilla

Reporter

Comment 1

•

16 years ago

OK, so creating the pref that's checked by the patch on bug 389683 (ui.use_native_popup_windows) and setting it to true does indeed fix the CPU usage problem in Thunderbird as well.  However, since Thunderbird doesn't use native popup windows, this disables context menus and tooltips when you set this.

That's an acceptable short-term workaround for me, since I rarely use that stuff anyway, but probably not shippable.

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 2

•

16 years ago

Thunderbird version 3.0a2pre (2008053020)

Comment 3

•

16 years ago

There is a bug report for a similar problem in Firefox: bug 417655.

Assignee

Updated

•

16 years ago

Assignee: nobody → joshmoz

Component: General → Widget: Cocoa

Flags: blocking-thunderbird3?

Flags: blocking-thunderbird3.0a2?

Product: Thunderbird → Core

QA Contact: general → cocoa

Assignee

Comment 4

•

16 years ago

Sigh :-(

Apple's implementation of event taps is very buggy.  But we need event
taps to implement Firefox's (and Thunderbird's) non-native context
menus.

I worked around all the bugs I could in Apple's event taps that I
could consistently reproduce.  Clearly some remain.  I may be able to
work around them.  But first I need to be able to reproduce them.

For what it's worth, I never see this problem myself.  But because I
do so much testing, of so many different versions of Mozilla.org's
browsers, I rarely run any one of them for more than an hour or two at
a time.

Assignee

Comment 5

•

16 years ago

> I worked around all the bugs I could in Apple's event taps that I
> could consistently reproduce.

I worked around all the bugs in Apple's event taps that I could
consistently reproduce.

Assignee

Comment 6

•

16 years ago

I'll raise the priority if/when this becomes consistently reproducible.

Assignee: joshmoz → smichaud

Priority: -- → P3

Comment 7

•

16 years ago

Not knowing anything about event taps, here's one piece of information that
might be helpful in debugging (though I rather suspect it won't): if Tb
deadlocks or I break in the debugger and then sit back and wait a while,
WindowServer gradually loses its mind in the same way, presumably because
events are no longer being serviced.

Assignee

Comment 8

•

16 years ago

> Just quitting Camino mostly doesn't help. Logout-login helps.

This is probably the most easily identified diagnostic marking this as
an event taps bug.  Look for this in other bugs before marking them
dups of this one.

Assignee

Comment 9

•

16 years ago

(In reply to comment #7)

I probably need something like "this sequence of actions makes the
WindowServer start hogging CPU (which doesn't get fixed by quitting
the browser)".

We currently only use event taps for mouse-down events (left-, right-
and other-), and event taps are cross-application.  So sequences of
events that trigger this bug will almost certainly include mouse-downs
and involve more than one application.

Flags: wanted1.9.0.x?

Assignee

Updated

•

16 years ago

Summary: System slowdown following overactive windowserver, when Thunderbird running → System slowdown following overactive windowserver (with Thunderbird and Firefox, but not Camino)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 10

•

16 years ago

(In further reply to comment #7)

> or I break in the debugger and then sit back and wait a while

So you're telling me that all you need to do to make this bug happen
is run Thunderbird (or maybe Firefox) in gdb, break, wait a while
(5-10 minutes?), and then "continue" (in gdb)?

Comment 11

•

16 years ago

I believe justdave is able to reliably reproduce the problem on his system during normal use.  He has provided some Activity Monitor sample traces:

Thunderbird with the progress meter bug (bug 432028) patch applied:
http://people.mozilla.com/~justdave/Sample_of_ThunderbirdDebug-20080531.txt

Window server with the progress bar patch not applied:
http://people.mozilla.com/~justdave/Sample_of_WindowServer-20080530.txt

He has also indicated that, with his custom build, a pop-up for thunderbird in the background will not highlight the active menu options as the mouse passes over them.  This also happens to me for my custom build (with the progress bar patch applied), but the highlight does track with official nightly builds.  I am not able to reproduce the problem, despite much effort.

Assignee

Comment 12

•

16 years ago

It's not that I don't believe you.

But we need (and don't yet have) precise and specific STR that anyone
can reproduce.

The traces are nice to have ... but (as far as I can see) they don't
tell us how the problem started.

Reporter

Comment 13

•

16 years ago

(In reply to comment #8)
> > Just quitting Camino mostly doesn't help. Logout-login helps.
> 
> This is probably the most easily identified diagnostic marking this as
> an event taps bug.  Look for this in other bugs before marking them
> dups of this one.

This is not the case here.  If I quit Thunderbird, the problem goes away, and WindowServer regains its sanity.

Reporter

Comment 14

•

16 years ago

Also of note (mentioned on IRC but I should mention it here as well)...  I'm not sure if it's related, but it's surely possible.  When I do quit Thunderbird, I get a phantom right-click event posted to the system, which gets sent to whatever app is frontmost after TB exits.

Comment 15

•

16 years ago

(In reply to comment #10)
> (In further reply to comment #7)
> 
> > or I break in the debugger and then sit back and wait a while
> 
> So you're telling me that all you need to do to make this bug happen
> is run Thunderbird (or maybe Firefox) in gdb, break, wait a while
> (5-10 minutes?), and then "continue" (in gdb)?

Except that I never hit continue.  If I hit continue, the problem goes away.

Assignee

Comment 16

•

16 years ago

> I get a phantom right-click event posted to the system, which gets
> sent to whatever app is frontmost after TB exits.

This definitely sounds like an event tap problem.

>> > Just quitting Camino mostly doesn't help. Logout-login helps.
>>
>> This is probably the most easily identified diagnostic marking this
>> as an event taps bug.  Look for this in other bugs before marking
>> them dups of this one.
>
> This is not the case here.  If I quit Thunderbird, the problem goes
> away, and WindowServer regains its sanity.

But it sounds like I was wrong about event-tap-induced WindowServer
problems never going away when you quit Thunderbird and/or Firefox.

Comment 17

•

16 years ago

But yeah, it's 5-10 mins.  If you watch with top, you'll notice the amount of CPU being eaten by WindowServer gradually increasing over time.

Smokey Ardisson (offline for a while; not following bugs - do not email)

Assignee

Comment 18

•

16 years ago

(In reply to comment #15 and comment #17)

Interesting, and thanks.

I just failed to reproduce this with Firefox ... but I'll keep trying
(with Thunderbird and Firefox).

Comment 19

•

16 years ago

(In reply to comment #14)
> Also of note (mentioned on IRC but I should mention it here as well)...  I'm
> not sure if it's related, but it's surely possible.  When I do quit
> Thunderbird, I get a phantom right-click event posted to the system, which gets
> sent to whatever app is frontmost after TB exits.

Jesse reported something like this in bug 393664.

Reporter

Comment 20

•

16 years ago

Just throwing out wild ideas (I haven't seen the code at all and don't really know how it works)... when receiving an event tap event from the system, do we have to send something back to the system in the return code from the callback to say that we've handled the event, and if so are we doing that?

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 21

•

16 years ago

(In reply to comment #20)

The relevant code is in widget/src/cocoa/nsToolkit.mm (in
nsToolkit::RegisterForAllProcessMouseEvents() and EventTapCallback()).
Apple's doc (the "Quarz Event Services Reference") is at:

http://developer.apple.com/documentation/Carbon/Reference/QuartzEventServicesRef/Reference/reference.html

No, we don't need to send something back to the system (because we've
installed a "listen only" tap).  We do so anyway ... but in my
original tests it didn't matter whether or not we did this.

Comment 22

•

16 years ago

(I had a comment about the listener event tap too, but got beaten to the punch :)

Dave, I guess there are two potential debugging things you could do that might help:

1) Is your Thunderbird so busy before the WindowServer activity goes insane
that it is conceivable that events are backing up?  I know you have a lot of
mail accounts configured, and a lot of message filters, all of which might
presumably tie up the UI thread.

2) Can you bring up python (just 'python') in a terminal and type the
following:

import Quartz
Quartz.CGGetEventTapList(0, None)

The return value that gets printed out should be (0, Number of event tap
listeners active on the system).  If you have Thunderbird 3 and Firefox 3 open
and active, we would expect that number to be 2.  If just one of them, then
just 1.

Assignee

Comment 23

•

16 years ago

> 2) Can you bring up python (just 'python') in a terminal and type the
> following:
>
> import Quartz
> Quartz.CGGetEventTapList(0, None)

I did this and got an error about a missing argument.

In fact (from the "Quartz Event Services Reference" I cited above) CGGetEventTapList() takes three arguments, the last of which is a pointer to a structure to hold the information it returns.

Assignee

Comment 24

•

16 years ago

The correct syntax seems to be:

Quartz.CGGetEventTapList(0, None, None)

Reporter

Comment 25

•

16 years ago

>>> Quartz.CGGetEventTapList(0, None, None)
(0, 3)

I have both TB3 and FF3 running (and only one of each), so apparently I have something else running that uses them, too.  I'll have to do some trial and error to figure out what, and it'll have to wait until I have some other work-related stuff out of the way this afternoon.

Reporter

Comment 26

•

16 years ago

OK, through some experimenting, I now know with certainty how you can reproduce this at will.

When I ran the above command in python immediately after launching Thunderbird, it says there's 1 listener.  I have TB set to prompt me at launch whether I want to work offline or online.  After I click the Work Online button, the python command then returns 2 event listeners active.

If I set the offline stuff not to prompt me, and restart Thunderbird, then I only get one listener, ever.  I'm going to leave it running right now and see if the presence of the 2nd listener is what caused all the problems.  If this theory is right, then I won't have any CPU issues this time around.

Reporter

Comment 27

•

16 years ago

OK, it's been going for a good hour and a half, and no CPU issues yet, so I'd say that's the problem.  Whatever code is restarting things after the "do you want to work offline" prompt isn't throwing away the stuff it's already started at that point.

Assignee

Comment 28

•

16 years ago

> I have TB set to prompt me at launch whether I want to work offline
> or online.

How (exactly) do you make it do this?

Reporter

Comment 29

•

16 years ago

Preferences > Advanced > Network & Disk Space > Configure Offline Settings

When starting up:
  ( ) Remember previous online state
  (*) Ask me for online state at startup
  ( ) Always start up online

Samuel Sidler (old account; do not CC)

Assignee

Comment 30

•

16 years ago

Bingo!

With this setting, I also see two event taps after Thunderbird has
asked me (at startup) whether I want to work offline or online (it
doesn't matter which option I choose).  When I quit Thunderbird the
number of event taps goes back to zero (and I get wierdness in the
Terminal session where I was running python).

This dialog is displayed very early ... which I suspect has something
to do with the problem.

Now that you've given me something to work with, I'll up the priority.
I hope we can kill off Jesse's bug (bug 393664) at the same time.

Thanks both Dave and Andrew -- Dave for the STR, and Andrew for the
Python trick (which is very convenient, and which I didn't know
about).

Priority: P3 → P1

Jesse Ruderman

Updated

•

16 years ago

Blocks: 393664

Comment 31

•

16 years ago

Only wanted1.9.0.x pending the size and risk of the patch. This should be in 1.9.1 though.

Flags: wanted1.9.1?

Flags: wanted1.9.0.x?

Flags: wanted1.9.0.x+

Flags: blocking1.9.1?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

16 years ago

Flags: wanted1.9.1?

Flags: blocking1.9.1?

Flags: blocking1.9.1+

Comment 32

•

16 years ago

Is this really a P1 blocker?  That is, we would hold the beta next week for this?  Seems like we should fix it before final, but even that is kinda iffy; I don't think there's been a lot of feedback about this issue (I see something similar often when firefox is stopped at a gdb breakpoint, but that wouldn't be a blocker).

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 33

•

16 years ago

(In reply to comment #32)

> Is this really a P1 blocker?  That is, we would hold the beta next
> week for this?

Put that way, I think the answer should be "no".

I'm not yet sure I'll be able to fix this bug at all.  And even if the
patch itself isn't particularly complex, investigating this problem
*will* be a complex task.  There's also at least one other bug (itself
quite complex) that should have priority over this one (bug 436473/bug
442442).

Given what I've just said, Vlad, what priority do you think this
should be -- P2 or P3?

(As of comment #30 I know how to reproduce this bug.  So as of that
comment this bug went from impossible to difficult.  But nothing can
make it easy.)

Comment 34

•

16 years ago

I think P1 is blocking b1, P2 is blocking b2, and P3 is blocking final release -- given those comments, I'd actually put this as a P1-wanted.  That is, we would not block the final release if we didn't fix this, but we really really want to fix it.

Updated

•

16 years ago

Flags: wanted1.9.1+

Flags: blocking1.9.1-

Flags: blocking1.9.1+

Comment 36

•

16 years ago

While I agree with vlad that the debugging lossage doesn't make this a 1.9.1 blocker, I think that the existence of bug 467630 probably does.  Re-requesting blocking status.

Flags: blocking1.9.1- → blocking1.9.1?

Whiteboard: [tb3needs]

Assignee

Comment 37

•

16 years ago

(In reply to comment #36)

What I said in comment #33 still stands.

A lot depends on how reliable the STR for bug 467630 is, and (if it's
100% reliable) how often the problem is likely to occur.

Comment 38

•

16 years ago

It sounds to me like the behavior in bug 467630 is similar or identical to that described by Dave in comment 29, which you say you're able to reproduce.  Do you disagree?

You're right that it's not clear how prevalent this problem is.

Assignee

Comment 39

•

16 years ago

(In reply to comment #38)

> It sounds to me like the behavior in bug 467630 is similar or
> identical to that described by Dave in comment 29, which you say
> you're able to reproduce.  Do you disagree?

I'd say bug 467630 *might* be a dup of this one.  But both bugs are so
complex that one can't (at this point) be more certain than that.

By the way (and I forgot to mention this earlier), as best I can
remember I can only reproduce precisely the symptoms mentioned in
comment #30 -- which doesn't include the "overactive windowserver".

Sometime in the next couple of days I'll check to see I can still
reproduce what I described in comment #30, and to what extent (if any)
I can reproduce bug 467630.

Assignee

Comment 40

•

16 years ago

(Following up comment #38)

I still see the "two event taps" (from comment #30) when I set
Shredder to "ask me for online state at startup" (whether I choose
"work offline" or "work online").  But I never see the WindowServer
eating lots of CPU (or RAM).

I tested by starting Shredder and letting it sit for 5-10 minutes.
When I chose "work online" I connected to an IMAP server before
letting Shredder sit.  I tested running Shredder "on its own" and in
gdb.  While running it in gdb, I "broke" before letting Shredder sit
(without doing a "continue", as per comment #15).  So (to cover all
possible combinations of these factors) I did four different tests.  I
never had trouble from the WindowServer in any of them.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 41

•

16 years ago

I should add that I tested on OS X 10.5.5, with an opt build of
Shredder (containing debug symbols) made from code that I pulled (from
comm-central) this Tuesday morning (2008-12-02).

Comment 42

•

16 years ago

Steven, as dmose said in comment 7, I see this all the time when I break firefox in gdb and take too long debugging. The system gets its act together as soon as I continue.

Assignee

Comment 43

•

16 years ago

(In reply to comment #42)

Thanks for the info.  But (as I *can't* reproduce the "overactive
WindowServer" problem) there's clearly some other factor (or factors)
involved here that nobody's yet been able to identify.

This makes it difficult to discover exactly how widespread this
problem is.

(By the way, though, I've got a hunch about avoiding the "two event
taps" problem which I'm going to try out over the next few hours.  If
I get anywhere I'll post a patch here, for you and others to test.)

Comment 44

•

16 years ago

Generally when I'm running under the debugger, I use debug builds.  Maybe there's some relevant difference between debug and release builds here?

Assignee

Comment 45

•

16 years ago

Attached patch Fix for "two event taps" (and maybe also for "overactive WindowServer") — Details — Splinter Review

I've found a fix for the "two event taps" problem.

If you can reproduce the "overactive WindowServer" problem, please try
this patch to see if it also clears that up.

A tryserver build (of Firefox) will follow in a while.  I'm also doing
my own (universal binary) build of Shredder, which I'll upload to my
people.mozilla.com account once it's done.

Turns out the code I was using in
nsToolkit::UnregisterAllProcessMouseEventHandlers() to destroy an
event tap doesn't actually remove it (at least until the app process
ends).  You also need to invalidate the CFMachPortRef object returned
by the call to CGEventTapCreate().  (Of course this isn't documented
anywhere ... as best I can tell.)

It also turns out that (under certain circumstances) the app can
create an nsToolkit object, destroy it, and create it again for the
same thread (this happens every time the app runs, not just when it
does a "double-bounce" on startup).  This may also be a bug ... or
maybe someone here can tell me why it happens (and why it isn't a
bug).  The "certain circumstances" include when you've set Shredder to
ask for the online state at startup.

Assignee

Comment 46

•

16 years ago

Here's my Shredder build (made with my patch from comment #45):

http://people.mozilla.com/~stmichaud/bmo/thunderbird-3.0b2pre.en-US.mac.bugzilla436897-smichaud.dmg

Assignee

Comment 47

•

16 years ago

Here's a tryserver build (of Firefox) made with my patch from comment #45:

https://build.mozilla.org/tryserver-builds/2008-12-05_12:53-smichaud@pobox.com-bugzilla436897/smichaud@pobox.com-bugzilla436897-firefox-try-mac.dmg

Assignee

Comment 48

•

16 years ago

Anybody tried my patch yet?

Comment 49

•

16 years ago

I hope to do this today or tomorrow.

Comment 50

•

16 years ago

Unfortunately, my todo list is a bit crazy.  However, since the problem I'm seeing (stopping in a debugger) isn't what might make this bug want to block 1.9.1, I think having me test doesn't really provide the information we need.  In particular, the Thunderbird problems seen by justdave and bclary are the real issues here.   justdave/bclary, can you guys test the tryserver build linked to from comment 46?

Bob Clary [:bc] (inactive)

Assignee

Comment 51

•

16 years ago

> justdave/bclary, can you guys test the tryserver build linked to
> from comment 46?

Hear! hear! :-)

Comment 52

•

16 years ago

I thought I had commented about the try server builds here, but must have forgotten to commit it. While the steps to reproduce in Thunderbird were pretty clear to me, I'm not sure how to do so with Firefox. What is the equivalent of asking whether to start up in offline mode in Firefox?

Assignee

Comment 53

•

16 years ago

Use my Shredder build from comment #46.

Assignee

Comment 54

•

16 years ago

(Following up comment #53)

In other words, I'm satisfied if you only test Shredder, since that's where you originally saw the problem (the overactive WindowServer problem).  There's no guarantee you'd see something similar in Firefox, even with a very similar STR.

Bob Clary [:bc] (inactive)

Comment 55

•

16 years ago

I think it is important and exciting that we're making progress on this, but I don't think we should block. Repro steps are still not nailed down, the bug is not particularly visible, it is not a regression from the last release, and I don't see any reason not to software update a final fix if we get one.

Flags: blocking1.9.1? → blocking1.9.1-

Comment 56

•

16 years ago

fwiw, i'm running steven's build now and don't notice any cpu piggage on windowserver's part, but am letting it continue to run to make sure. I did get the two prompts regarding working on|off line during startup but so far have been good as gold.

Bob Clary [:bc] (inactive)

Comment 57

•

16 years ago

build in comment 46 fixes the windowserver piggage. thanks steven.

Assignee

Updated

•

16 years ago

Attachment #351599 - Flags: review?(joshmoz)

Assignee

Comment 58

•

16 years ago

Comment on attachment 351599 [details] [diff] [review]
Fix for "two event taps" (and maybe also for "overactive WindowServer")

Glad to hear it, and you're welcome.

Now I'll start the review process.

Comment 59

•

16 years ago

Josh: I believe (though I haven't verified) that it is a regression for Thunderbird, since the last release was before the advent of Cocoa widgetry in Gecko.

Assignee

Comment 60

•

16 years ago

If this problem is connected with event taps, then yes, this is a regression for Thunderbird (since we didn't use event taps on the 1.8 branch).

Updated

•

16 years ago

Keywords: regression

Comment 61

•

16 years ago

Re-requesting blocking, since this is a regression, as per comments 59 and 60.

Flags: blocking1.9.1- → blocking1.9.1?

Comment 62

•

16 years ago

(and comment 57, really).

Updated

•

16 years ago

Attachment #351599 - Flags: superreview?(roc)

Attachment #351599 - Flags: review?(joshmoz)

Attachment #351599 - Flags: review+

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 63

•

16 years ago

So long as we have a patch with confirmation of it fixing the problem I'm happy to block. That is, we should block on getting this patch in. If the patch turns out not to be a fix then we're back to the state of things in my comment #55 we'll need to un-block.

Flags: blocking1.9.1? → blocking1.9.1+

Updated

•

16 years ago

Attachment #351599 - Flags: superreview?(roc) → superreview+

Assignee

Comment 64

•

16 years ago

Landed on mozilla-central:
http://hg.mozilla.org/mozilla-central/rev/36d0b62a2f56

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Assignee

Comment 65

•

16 years ago

So now that this is blocking1.9.1+, should I land it right away on the 1.9.1 branch?  Or should I let it bake a few days on the trunk?