Closed Bug 258370 Opened 20 years ago Closed 19 years ago

Slack Timer with delay zero locks up UI

Categories

(Thunderbird :: General, defect)

PowerPC
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: avi, Assigned: mscott)

References

Details

Attachments

(2 files)

I'm migrating from Eudora, which is known for being incredibly quick in searching for old emails. Thunderbird, unfortunately, isn't nearly as fast. (Should I file a bug?) What's problematic is that while Thunderbird is grinding away at the "Search Messages" results, nothing else can happen. If Thunderbird is going to take forever to return search results, at least it should allow me to switch to the other windows and do things. Using Thunderbird version 0.7.3 (20040803).
are these local message searches? Are you searching on message bodies or just headers? on win2k, search doesn't lock up the UI at all.
Local messages. The particular search that drove me to write up this bug happened to be a body search. I'm using Mac OS X. You're saying that on Win2k, you can have the gnarliest search grinding away, updating the results window, and still use the rest of Thunderbird?
Let me be more precise about what I see. I start a "body contains xxx" search on about 400 megs of archived mailboxes. Thunderbird locks up and stops responding. It finds a few matches, updates the result list, and processes all the clicks and keystrokes that piled up while it was unresponsive. It then goes catatonic for ten more seconds while it searches, and then repeats the dump-results-and-process-events deal. Repeat until all the matching messages are found. I don't know how it balances its time on Windows, but here on the Mac it's really unusable. Using a 0.8 nightly: version 0.8 (20040908)
it's a lot more useable than that. I have around 350MB of messages in my local folders account and I can do a body search over that and still open other folders,read messages, etc.. It's a little sluggish but not too bad. This is on a 1.6GHz PC. The pre Unix mac was pretty notorious for slow file I/O and inefficient thread context switching - I would hope that OS/X would be a lot better. We might be doing something that performs poorly on the Mac. Our basic algorithm is to set a recurrinig timer, and on each timer callback, do some processing for no more than 200 milliseconds, and then return to the event loop.
OS X is a completely different beast than 9, so forget all you know about that... :) Nevertheless, something's going on here, as I'm getting the spinning beachball, which is automatically put up by the system whenever an app stops responding to events for five seconds. While Thunderbird was hung searching, I sampled the app. It really does appear to be spending all its time searching...
Attached file Sample during a search
cc'ing sfraser who's been very helpful with this kind of thing in the past. Herre's where create the timer: http://lxr.mozilla.org/mozilla/source/mailnews/base/search/src/nsMsgSearchSession.cpp#535 Here's where we loop doing 200 milliseconds of work, from the timer callback: http://lxr.mozilla.org/mozilla/source/mailnews/base/search/src/nsMsgLocalSearch.cpp#663 Are we mis-using the timer? I don't see how doing 200 milliseconds of work could lock up the ui for 10 seconds at a time. Otherwise, I'd consider backing off to 100 ms, but I suspect something else is going on. Re the sample data, that's interesting data. I would expect that if you weren't doing anything else in the UI that we would spend most of our time in the search code, so that's not neccesarily wrong. If I'm reading the log, most of the time we're either in nsMsgBodyHandler::GetNextLine, nsMsgSearchTerm::MatchBody, or nsMsgSearchTerm::MatchString. There looks to be lots of room for improvement in that code - for example, we shouldn't be getting the charsets each time through the loop. The dns service call during the reading from the file input stream seems odd as well (nsDNSService::GetAFForLookup). And we're probably copying strings around more than needed.
Thanks for the code references. I'm an experienced developer but know little about Mozilla, though, so I'm not quite sure what I'm looking at. > I would expect that if you weren't > doing anything else in the UI that we would spend most of our time in the search > code, so that's not neccesarily wrong. Sounds good, but I'm _trying_ to do other things in the UI. Just failing... > If I'm reading the log, most of the time > we're either in nsMsgBodyHandler::GetNextLine, nsMsgSearchTerm::MatchBody, or > nsMsgSearchTerm::MatchString. MatchBody is indented out, so it calls GetNextLine and MatchString. The numbers are the number of times the sample hit that function, and the indenting is the call chain.
sorry, what I meant was, were you trying to do other things when you created the sample? I wonder if on the mac, we should specify a small interval before the next slack timer fires, instead of specifying 0. 0 seems fine on windows, and I hate having platform #ifdefs but... Avi, I don't suppose you have a source tree built that you could play with?
> sorry, what I meant was, were you trying to do other things when you created the > sample? Yes. I was trying to pull down a menu or move a window, and was failing (the UI locked up). > I wonder if on the mac, we should specify a small interval before the next slack > timer fires, instead of specifying 0. The timers must be build on some platform-specific code. Where is that going to on the Mac? Wait--I'm looking at the stack trace. Mozilla still uses WNE? :( > 0 seems fine on windows, and I hate having > platform #ifdefs but... Avi, I don't suppose you have a source tree built that > you could play with? Um, I could try to do that. Do the resources of http://www.mozilla.org/source.html and http://www.mozilla.org/build/ apply to Tb?
http://www.mozilla.org/projects/thunderbird/build.html has some thunderbird specific instructions. Re WNE, I don't know about mac-specific stuff. Maybe ducarroz can say more...
I got Tb building. Not too hard. :) First, I tried adjusting the timer frequency. Given that each tweak means a five-minute rebuild, my level of experimentation was limited. It locked up with a frequency of 1ms the same as 0. Adjusting it to 2000ms gave a very responsive UI, but the search never got anywhere. I tried a different approach--make sure that things weren't getting locked up in the callback. I wrote up a log to check out how long the callback was taking. (Will attach.) It occasionally went up over 200ms, but most of the time it was well-behaved. I'm not sure what's wrong here. Perhaps we're not getting enough idle time using WNE, so that every time through we've blown our timer, have to immediately do a callback, and don't get any UI events? I've had similar problems with my own apps. I think I'm going to start looking at the nsTimerImpl source, to see how it works on the Mac. Any recommendations as to where to look?
no idea about the mac timer code. thx for building the source and experimenting with it. I'm confused about your tests, however. I would expect that making the time interval 2000 milliseconds instead of 200 would have opposite effect - the search would go faster, but the UI would lock up even more. I would think setting the search interval to 100 milliseconds would slow the search down but make the ui more responsive. My guess is that when we're going well over the limit is when we have to open a db, which happens synchronously. I see one time it takes us a whole second, and there are a few cases where it takes 3/4's of a second or so. (Is this a debug build, btw?)
> I would expect that making the time interval 2000 milliseconds > instead of 200 would have opposite effect - the search would go faster, but the > UI would lock up even more. I would think setting the search interval to 100 > milliseconds would slow the search down but make the ui more responsive. I was changing the rate of the callback firing, not the length of time that nsMessageLocalSearch was taking during each call back. Thus, the search was taking the same 200ms of time, but was only getting called every 2000ms and thus wasn't progressing. A slack timer is being used, which is good, but that makes me suspect some kind of deeper timer bug. On the Mac, perhaps, specifying too small of a timer fire time prevents event handling? In other places that timers are used, what value do they use for the frequency?
I've been scouring the code. The search uses a slack, repeating timer with a delay of 0 to perform a search, doing only 200 milliseconds of work at a time. This search code is the only code that specifies a delay of 0. (Well, there's some inspector thing at http://lxr.mozilla.org/mozilla/source/extensions/inspector/base/src/inSearchLoop.cpp#61 but that doesn't count.) On the Win32 platform, this means that the search proceeds for 200ms, then events are handled, then as soon as time can be found the search is called back again, and so on. On Mac OS X, having the timer be set to a delay 0 causes UI events to not be handled, and the UI to blocked. This seems to be a problem with the timer implementation. Changing the summary.
Summary: "Search Messages" locks up while searching → Slack Timer with delay zero locks up UI
Please follow along in the stack trace (attachment 158391 [details]). When I saw WaitNextEvent, I was mislead into thinking that no Carbon Events were used. I was mistaken. Let's take the stack crawl from the top down: 260 xre_main(int, char**, nsXREAppData const*) 260 nsAppShell::Run() 260 nsMacMessagePump::DoMessagePump() 255 nsMacMessagePump::GetEvent(EventRecord&) 5 nsMacMessagePump::DispatchEvent(int, EventRecord*) DoMessagePump is just a loop. While we're running, GetEvent and then DispatchEvent. If you look at the children of DispatchEvent, we can see that we're processing the UI. So it looks like all our time is in GetEvent. What's GetEvent? Just a call to WaitNextEvent. Why is WNE eating so much time? Carbon Events. Someone is dispatching kEventClassPL/kEventProcessPLEvents events to one of our event handlers. We're getting a callback (http://lxr.mozilla.org/mozilla/source/xpcom/threads/plevent.c#1596), calling PL_ProcessPendingEvents, and then timer code hits. Who's sending these events? We are, of course, since that's a custom event type. Searching gets us (http://lxr.mozilla.org/mozilla/search?string=kEventProcessPLEvents): _pl_NativeNotify (http://lxr.mozilla.org/mozilla/source/xpcom/threads/plevent.c#1263). Now the overall rule of event handling on the Mac is: event loop pumps events to the event handlers. There are two (main) ways to run the event loop. One is the new-style RunApplicationEventLoop. The other is the classic WaitNextEvent. If you use WNE (like Mozilla does) then each time you call in, it pulls the top event out of the queue, tries to dispatch it, and only if it fails does it return from WNE with the event converted to a classic EventRecord. So now we have one question. Why is all the time used pumping our custom events for the timer and none actually used for processing the UI? After all, the custom events are added at a low priority. This is still under investigation. BTW, do any of the Mozilla Mac developers use Firebird? Can I get some help here? (Or at least a second opinion to my rambling?)
I think there are some mac developers who use firefox, or vice versa...what happens if you change the slack timer from 0 to say, 10 milliseconds? It would not surprise me if the slack timer with a 0 second deley does not perform well on the mac... the timer and event processing code can be hard to get right.
> I think there are some mac developers who use firefox It just seems that the Mac port of Thunderbird is raw. Something as simple as searching shows fundamental problems. And I'm here, a reasonably-experienced Mac developer, rambling on about a bug, trying to approach a massive cross-platform codebase, trying to figure out why it doesn't work like it does. Any Mac developers out there to help? Dagley? Fraser? Pink? > what > happens if you change the slack timer from 0 to say, 10 milliseconds? It's still unusable at 10. At 50, it's tolerable. > It would > not surprise me if the slack timer with a 0 second deley does not perform well > on the mac... the timer and event processing code can be hard to get right. Indeed. But I'd rather get it fixed at the lower level, in the timer code, so that everyone can use a delay of 0, than #ifdef the delay values higher up.
Pink is definitely active - cc'ing him. Yes, I'd definitely want this to be fixed at a lower level. My suggestions were just experiments to try to get some data points on what was happening. As you say, there's really only one piece of code that uses a slack timer with a 0 interval. But it's not really a thunderbird port issue - mozilla mail has been doing this for years now. I don't know if it has always locked up the UI like this, however.
I really have no idea. I've looked over the code, but don't really even know where to begin. And I can't even experiment since each compile takes 5 minutes. I'm giving up. If anyone familiar with the Mac platform code wants to take a look at this, please do.
why does each compile take 5 minutes? surely there's a way just to rebuild just the library that changed and copy that manually into the app package.
I don't know. That's how long the make script takes, and I never took the time to figure out exactly which library the code ended up with and custom-build it. Pink, what do you think of this? Sometimes I feel like I'm the only person actually using TB on the Mac. Do you see this problem too?
shouldn't be hard to determine the library that's rebuilt from where you are in the code. what file are you touching? i don't use TBird, fwiw, either at home or the office.
Well, I was touching files in the mailnews component when experimenting with the timer delay. But once it seemed to me to be a problem with the timer itself I stopped. For now, I'm not touching anything because I don't know what to touch. There seemed to be a problem with slack timers, but when I started digging I started hitting core Mac event code, and I was at a loss to explain the behavior that I was seeing. I'm at a total loss as to how to proceed in the diagnosis of this slack timer problem.
A shot in the dark: hack plevent.c to turn on MAC_USE_CFRUNLOOPSOURCE for the carbon build; it might work better than carbon events.
(In reply to comment #26) > A shot in the dark: hack plevent.c to turn on MAC_USE_CFRUNLOOPSOURCE for the > carbon build; it might work better than carbon events. That's from the checkin from bug 271050, right? My tree is a bit old, so it's erroring out when building. Let me do a clean rebuild. [Goes to bed; it's late.] [Wakes up.] OK. Holy cow, it's significantly better. It's very sluggish, but is actually usable. You can open up messages as they show up in the list, and cancel the search. Look at bug 271050's comment #47. Joel Craig was saying that FF and Mozilla were having event problems like 271050's checkin fixed for Camino. I think this deserves to be turned on for all the Mac ports, not just for Camino. (BTW, why does the trunk look so different than the branch? Did someone forget to check in the artwork?)
I filed bug 258370 to switch Moz/FF to the Camino-style plevent handling.
Depends on: 282940
(In reply to comment #28) > I filed bug 258370 to switch Moz/FF to the Camino-style plevent handling. Eehm, that's this bug :) I think you mean bug 282940
Well, the switch to CFRunLoopSource plevent handling is in, and testing in Thunderbird 1.5 beta something (version 1.5 (20051025)) shows that my original complaint is gone. Therefore, I'm declaring this fixed via bug 282940 and closing it. Thanks, all y'all.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: