Bug 67370 (Closed): opened 24 years ago, closed 24 years ago

interleaving events and xlib events

Categories: Core :: XPCOM, defect
Platform: x86 Linux
Severity: normal
Status: RESOLVED FIXED

People: Reporter: blizzard, Assigned: blizzard
Attachments: 3 files

This is a tracking bug. I want to see if interleaving xlib events and plevent events is a win or not.
Attached patch: patch #1
Have to remember to remove the static nsVoidArray this adds.
Status: NEW → ASSIGNED
Keywords: patch
Adding dougt + danm. There's a patch in here that mucks with gtk events that I'd like you two to look at.
A quickish look tells me it probably won't hurt anything. Curious, though -- processing nspr events during gtk event processing would be basically increasing the priority of nspr events. Whazzup?
It doesn't exactly increase the priority of nspr events. It actually puts them on equal footing, since we only process the plevents that were already queued when the current X event was generated, as determined by comparing against the X event's serial. The most important thing this fixes is that when X is really busy we aren't completely starving the plevent queue, and vice versa. That's what all the 'id' stuff in the PLEvent object is for.
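A rough sketch of the id-gating idea described above (illustrative only, not the actual patch; the struct layout and helper names here are made up):

    /* Illustrative sketch: events carry the serial of the last native
     * (X) event seen when they were posted, and we only process the
     * ones that are "older" than the X event currently being handled. */
    struct SketchEvent {
        struct SketchEvent *next;
        unsigned long       id;   /* native event serial at post time */
        void              (*handler)(struct SketchEvent *ev);
    };

    /* Process queued events posted at or before the given X serial,
     * leaving newer ones for later, so X traffic and plevents keep
     * taking turns instead of starving each other. */
    static int
    ProcessEventsBeforeID(struct SketchEvent **queue, unsigned long x_serial)
    {
        int count = 0;
        while (*queue && (*queue)->id <= x_serial) {
            struct SketchEvent *ev = *queue;
            *queue = ev->next;
            ev->handler(ev);
            count++;
        }
        return count;
    }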
Attached patch: patch #2
OK, I need reviews on this patch. It's done enough, and people don't seem to see any bad effects from it. I need jrgm [now cc'ed on this bug] to apply this patch to a linux build and do a perf run on it, to see whether it affects page load times; I'll be yelled at if it does.
Okay, I'm pulling a new build now, and I'll have an answer later this evening. Just to confirm: patch #2, right?
Okay, so I ran this patch on a build with '--disable-debug --enable-optimize'. I ran 5x40 urls, cacheable content, 1 sec delay between pages, sidebar collapsed, 600x800 content area, ... There is no difference in the times with or without the patch. The average median time is 3.249 seconds without and 3.254 seconds with, and the median times for every URL stay within 3 percent of each other. (Note that this was not done in a situation where the X server was particularly busy.) So, start yelling at me if I'm wrong.
Attached image: pictures and everything
To John Morrison: be sure that X11 is very busy doing things. Try running the Zilla cross-continental (~200 km of wire or more) and/or over an ssh channel to increase latency...
Hmm, one caveat (to any test) is that if I introduce a variance of 20% in the measurements, then I have no confidence in results that differ by less than 20% (waving hands, but you know what I mean). In other words, I don't know if we can get a useful measurement in the exceptional case. Is there some other middle-level case we can look at, and/or is it necessary to test further with the X server in a more active condition?
Remote X servers and slow (local) X servers (the m64 in an Ultra5/Blade100) are the cases where the optimisation described here would make sense... A remote X server should be easy to test (please no 1000baseT... 100baseT may still be too fast when 8-bit visuals are used... best would be 10baseT with a 24-bit visual to make X11 slow & sluggish). Slow local X server performance testing would only be useful on multi-CPU machines (Pavlov has a shiny E450... :-) or by "renice"-ing the X server to a _lower_ priority (i.e. it only gets the CPU when the Mozilla process does not need it)...
Thanks, John. I appreciate your time in testing this.
Oh, and to answer your question, I don't think that you need to test exceptional cases. This patch will probably help in cases where there is a lot of X traffic between mozilla and the server and a lot of layout and thread events happening as well. Less starvation.
I need dougt or danm to look at these changes and say r= or no r= since they are the event queue dudes.
la la la still waiting for people to review this la la la
If a review falls in a forest, does it make a sound?
r=timeless, but it sounds like you really want an r=pavlov. Two questions on the patch:

     #endif /* XP_UNIX */
    +
    +/* extra functions for unix */
    +
    +#ifdef XP_UNIX

Any reason to have 2 separate #ifdef XP_UNIX blocks? And is the '**' style common in /* comments?
I put the extra #ifdef XP_UNIX in there because it seemed like the right thing to do at the time. In that file the XP_* flags tend to surround each function for each platform. Also, the ** in /* comments is how it is in the rest of that header file. Just following local custom.
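For illustration, the local convention being described looks roughly like this (the exact declaration is a guess, not a quote from the patch):

    /* plevent.h-style layout (sketch): each platform's extra functions
     * sit inside their own XP_* block, using the '**' comment style
     * found elsewhere in the header. */
    #ifdef XP_UNIX
    /*
    ** PL_ProcessEventsBeforeID -- process pending events whose id is at
    ** or before the given native event id (unix-only entry point).
    */
    PR_EXTERN(PRInt32) PL_ProcessEventsBeforeID(PLEventQueue *aSelf,
                                                unsigned long aID);
    #endif /* XP_UNIX */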
Sorry it took me so long to get around to reviewing this. Once again my Bugzilla mail and I have lost track of each other.

So basically I think the patch looks like it'll work as intended (I assume you've run with it and it actually helps!). A nit: PL_ProcessEventsBeforeID returns its "count" variable as its value, and that variable is uninitialized. Not important because no one uses the return value, but it wants fixing.

Nits aside, though, in terms of the overall structure of the patch, I've got to express some misgivings. One thing is the duplication of PL_ProcessPendingEvents. Can you fold it and the new PL_ProcessEventsBeforeID into one function with some sort of illegal ID value? I'm worried that this somewhat fragile method, once duplicated, will undergo divergent evolution. As an example, I'm curious why ProcessEventsBefore has a dissimilar exit clause: the original puts a notification back on the native queue if events remain, but the new one doesn't. Accident or intent? This is all twitchy code that makes me twitch.

But more to the point, I'm concerned that there is now code in two places that traverses the active event queues looking for events to process. The plevent.c code is a kind of brute underlayer, and there's already a somewhat delicately balanced thing sitting on top of it (in nsEventQueue and nsEventQueueService) that groups the queues together and controls which queues are processed under which circumstances. You're adding another layer of code that does something very similar, but tangentially. I'm afraid they'll short out on each other some day, wires crossed as they are. I'd really rather you put this stuff in the upper layer, which is already playing the balancing act.
Thanks for your review, Dan. The fact that count isn't initialized is a bug. I've got that change locally, but I don't think it's worth a whole new patch.

As for your other misgivings, I thought that the two functions were different enough to warrant a new function rather than folding them together. I could probably merge them into a single function, but I'm worried about fragility, too. If you want me to try to do that, I will.

As to your specific question about why I don't put a token back into the queue when the event count is > 0: I never take the token out. Remember that this running of the event queue is not done from the standard unix native event queue notification method. That method is to wake up from select() by reading from a file descriptor. With this method, if there are still events in the queue there will still be a native token in the pipe, and if that event ends up being processed outside of the event processing loop, we will still wake up from select() and process the event queue.

And I don't see a way to put this into a higher level except to wrap it, and that wouldn't buy us much except to complicate the code somewhat. I still need to label each event queue and each event with an ID, and that needs to be done at the lowest level. If you want to get rid of plevent.c and move to EventQueueUnix.cpp/EventQueueWin.cpp then we're talking about something else I would buy into. The current impl is organic and a mess. I'm just trying to cause as little pain as possible here. :)
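For reference, the nit about the return value amounts to something like this (a sketch, not the checked-in diff; _pl_GetNextEventBeforeID is a made-up helper standing in for the real queue walk):

    /* Sketch of the uninitialized-count fix. */
    static PLEvent *_pl_GetNextEventBeforeID(PLEventQueue *self,
                                             unsigned long id);  /* made up */

    static PRInt32
    ProcessEventsBeforeID_sketch(PLEventQueue *self, unsigned long id)
    {
        PRInt32 count = 0;   /* the nit: this was left uninitialized */
        PLEvent *event;

        /* made-up helper: returns the next queued event whose id is at
         * or before 'id', or NULL when none remain */
        while ((event = _pl_GetNextEventBeforeID(self, id)) != NULL) {
            PL_HandleEvent(event);
            count++;
        }
        return count;   /* callers ignore it today, but now it's defined */
    }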
Alright; thanks for the explanations. The native notification removal/addition stuff is twitchy in its cross-platform glory, but this being gtk-only makes it easier.

My particular objection to the plevent-vs-nsEventQueue thing is this: nsEventQueue keeps a two-dimensional array of queues, thread vs. stack-o'-stalled-queues. Event processing steps down one of the dimensions (down the stack, within the current thread, though I think that's broken on gtk because of the ListenToEventQueue stuff). At the plevent level, queues are independent. All the queue-group interactions are handled in one place, at the nsEventQueue level. This is goofy, delicate code. (See bug 54371, though that bug's specific complaint is about something different and has been fixed. It remains open because of the gtk problem and gives a window on the seamy underbelly of the event code.) Your patch in this bug adds another level of queue grouping independent of nsEventQueue's grouping, bringing the concept of grouping into plevent as well. That's my main worry.

Oo. Come to think of it, that is a problem. Suddenly I have a specific worry, rather than general qualms. There is a problem in the current (unpatched) code. The ListenToEventQueue thing causes events in worker threads to be processed outside their thread. Oddly, this mostly doesn't seem to hurt anything. But I hear complaints from the networking guys that they're seeing events occasionally processed out of order on the Linux builds, and I'm sure this is the reason. It wants to be fixed, but I haven't figured out how yet. Your patch links all the event queues together and processes them at odd moments, all in a bunch, regardless of their owning thread. It'll exacerbate the ListenToEventQueue problem of events being processed off thread and out of order. Yes, there is already a problem here, but you don't want to make it worse.

I'm becoming pretty convinced that code which wants to jump from queue to queue, processing events, belongs at the nsEventQueue level, where we're already doing that, and where lives the knowledge to stay in the current thread and process the thread's stalled queues in the correct order.
I only use this on the primary thread, on that thread's event queue. I'm not talking about using this anywhere else. Doesn't that absolve your worries about processing events on the wrong thread? Also, if you call ListenToEventQueue(PR_FALSE) I don't process events for that queue anymore, just like normal processing. So if you fix that, you fix this entry point too.

My patch does _not_ process event queues in bunches and at odd times. In fact, that's how I would describe the existing PL_ProcessPendingEvents() method. I no longer queue up events until we get back to the idle loop. This prevents us from being so busy that we never pay attention to the event queue, which might have a lot of important events in it. If you have a lot of timers firing with a lot of drawing going on, it's very likely that you will end up completely starving the event queue or the window queue. This is what I'm trying to fix here.

I always process the events in order with regard to window events. That's what the addition of the 'id' element on the PLEvent structure is for. I never process events for which the corresponding window events haven't been seen yet. This should keep draws and reflows synchronized as well. Essentially what I've tried to do is replicate the way that Windows does its event queue handling, which is to intermingle window event messages and event queue messages. It's better than what we have on unix for some things, and I'm trying to make things smoother wrt event delivery and event queue starvation.
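A very rough sketch of the interleaving being described, assuming a handler installed with gdk_event_handler_set(); the real patch's hook point, and how it obtains the X serial, may well differ (get_x_serial_for is a made-up helper):

    #include <gtk/gtk.h>
    #include "plevent.h"

    /* made-up helper: the X serial of the underlying XEvent */
    extern unsigned long get_x_serial_for(GdkEvent *event);

    static void
    event_handler_sketch(GdkEvent *event, gpointer data)
    {
        PLEventQueue *ui_queue = (PLEventQueue *)data;
        unsigned long serial;

        /* let gtk dispatch the window event as usual */
        gtk_main_do_event(event);

        serial = get_x_serial_for(event);

        /* then drain only the plevents posted before this X event, so
         * window events and plevents stay in order and neither queue
         * starves the other */
        if (ui_queue)
            PL_ProcessEventsBeforeID(ui_queue, serial);
    }

Installed once at startup with something like gdk_event_handler_set(event_handler_sketch, ui_queue, NULL).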
But you're hooking into ListenToEventQueue, so I don't see how you're limiting yourself to queues on the main thread. The patch affects every queue that's listened to, which at the moment is every queue anyone ever creates, anywhere.

It's unfair of me to say the queues are processed at odd moments. But they will be processed at more moments, and in one great fell swoop that traverses every active listening queue in the order they were created. That's probably the right order, at least within a thread, but why second-guess all the careful ordering already in nsEventQueue? Now, within a single queue, you're pulling events out in order and potentially stopping short. No worries there. It's what happens when you jump to the next queue that I'm all bunched up about.

To be honest, I don't completely follow the event ID thing. I gather you're being very careful to stop short in each queue when you run into an NSPR event that belongs to a later native notification. Alright. That may work. It still seems like there could be problems between queues, with event delivery latency and all that.

To be fair, it does seem like part of the fix for ListenToEventQueue that I've alluded to in the comment above would be to stop listening to monitored queues. I'm not certain that that change alone would keep you from jumping into queues for worker threads. I mean, it's an error to make a native queue off the main thread, but it works on Windows and I thought we were doing this in a few places. My point being that I think we're misusing native queues. With some tweaks and careful checking and maybe a paddling or two, queue jumping at the plevent level might not be harmful. But as the situation stands, I think it would be. From there, back to my qualms: I'd just rather not be traversing multiple queues at the plevent level.

A thought. I suspect you're going to more trouble than you need to. We've been trying to cut back on native notifications anyway (Doug put in a patch to that effect which doesn't quite work, but it could be made to work...). I believe the correlation between NSPR events and native notification events is an uncritical one. So I suspect it isn't important to be so careful with the native event serial numbering and all, or it wouldn't be if you stuck with the queue-jumping code in nsEventQueue. Wouldn't something like this shorter patch do the desired thing?

    + if (XPending(GDK_DISPLAY()))
    +   nsAppShell::ProcessAFewPendingEvents();

where the new function is just ProcessPendingEvents on the current thread, with a limit of, I don't know, 3 or 10 events?
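Roughly, that alternative might look like this (ProcessAFewPendingEvents was never written; the body below just drains a bounded number of plevents using the existing NSPR calls):

    #include <X11/Xlib.h>
    #include <gdk/gdkx.h>
    #include "plevent.h"

    /* Sketch of the simpler approach suggested above: if the X server
     * has work pending, process at most a handful of plevents so
     * neither side starves, without tracking native serials at all. */
    static void
    ProcessAFewPendingEvents_sketch(PLEventQueue *ui_queue)
    {
        int processed = 0;
        PLEvent *ev;

        if (XPending(GDK_DISPLAY())) {
            while (processed < 10 && (ev = PL_GetEvent(ui_queue)) != NULL) {
                PL_HandleEvent(ev);
                processed++;
            }
        }
    }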
As far as I know, ListenToEventQueue is only called on native event queues. That is, it is only called on event queues that use an FD and are processed from the main thread. Other threads' queues are monitored from those other threads. There have been bugs in the past where someone called ListenToEventQueue on a non-native event queue and all of a sudden events were being processed on the wrong thread. As far as I know that doesn't happen anymore.

I don't understand your concerns about traversing multiple event queues. I mean, we do that now, just in a very random fashion. As for your proposed patch, I don't like it because you still end up with potential event queue starvation. I'm trying to get reproducible ordering here, not just interleaved events.
I was fooled again by the general code in nsEventQueue, forgetting about the filter in nsAppShellService. So yes, ListenToEventQueue is only called on native queues. That makes more sense anyway, what can I say. So scratch everything I've said above about monitored queues being listened to. That helps, but I'm still bequalmed. But not enough to keep bumping heads. r=me.
sr=shaver.
Checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED