Closed Bug 1384717 Opened 3 years ago Closed 3 years ago
Reduce QI overhead for snow white objects
Ehsan says the overhead of doing QIs on snow white objects shows up in Speedometer. He said "the real issue is the QIs in question need to scan all of the table of interfaces in the child class, then find nothing, then go to the base class and find the answers there".
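For context, a hypothetical, simplified sketch of what a table-driven QI lookup looks like; names are modeled loosely on NS_TableDrivenQI/QITableEntry but this is not the actual Gecko implementation. The linear scan over the child class's table, followed by a miss and a second scan of the base class's table, is the overhead described above.

```cpp
#include <cstring>
#include <cstddef>

// Illustrative stand-in for an interface ID (nsIID in Gecko).
struct IID { unsigned char bytes[16]; };

inline bool Equals(const IID& a, const IID& b) {
  return std::memcmp(a.bytes, b.bytes, sizeof a.bytes) == 0;
}

// One entry maps an interface IID to the byte offset of that interface's
// vtable pointer within the concrete object.
struct QITableEntry {
  const IID* iid;         // nullptr marks the end of the table
  std::ptrdiff_t offset;  // byte offset of the interface within the object
};

// Linear scan of one class's table. A miss (nullptr return) means the caller
// falls through to the base class's table and scans again.
void* TableDrivenQI(void* aThis, const QITableEntry* aEntries,
                    const IID& aIID) {
  for (; aEntries->iid; ++aEntries) {
    if (Equals(*aEntries->iid, aIID)) {
      return static_cast<char*>(aThis) + aEntries->offset;
    }
  }
  return nullptr;  // not found here; try the base class
}
```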
Example profile: https://perfht.ml/2uYsYI7
My idea for doing this is to treat nsISupports like we do native classes. Though I recall there are some gotchas around why subclasses don't work with our approach for native classes, so I'll have to remember what those are...
I've started looking at this. It relies on subclasses that implement their own participant also implementing their own AddRef and Release. I think we can do dynamic checks for that; hopefully that won't be too expensive for debug builds. The problem I was thinking of in bug 802442 only applies if we get rid of CanonicalizeParticipant entirely. It'll be harder to get rid of for NoteXPCOMChild, because in that case we know nothing about the object. Maybe we could have a class that all CC'd nsISupports classes inherit from, and put some methods on that, but that would likely be quite a bit of work.
Assignee: nobody → continuation
FWIW bug 1384821 reduces the amount of time that NoteXPCOMChild takes up in profiles of Speedometer, if we decide to take the patch there.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #2)
> Another possibility if we're willing to break binary compatibility of
> nsISupports is add an explicit nsISupports.getCycleCollectionParticipant
> method and bypass QI altogether.

I've started looking into this. While it is a huge change, it may require fewer lines of code to change, and it'll let us get rid of these weird CC QIs altogether.
Here's a prototype. It at least compiles and starts the browser; I haven't done any speed tests yet. It adds a new method Participant() to nsISupports, then uses it instead of the QI to nsXPCOMCycleCollectionParticipant in the CC. In theory we should be able to remove that QI entirely, though some other places may use it.

We still need QI to do the canonicalization of nsISupports during CC traversal. I'm not sure of a way to avoid that, besides adding another method to nsISupports. It might be possible to avoid ambiguous upcasts by doing virtual inheritance of all nsIFoo interfaces, and then we could just static_cast to nsISupports when passing pointers to the various CC interfaces, but that would be a large change, and I don't know what the performance impact would be, if any.
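A minimal sketch of the prototype's core idea, under illustrative names (ISupportsLike, CCParticipant); the real Gecko types are nsISupports and nsCycleCollectionParticipant, and the real signatures may differ. The point is that a CC'd class hands back its participant in a single virtual call rather than a table-driven QI walk.

```cpp
// Stand-in for the cycle collector participant (the real one carries
// Traverse/Unlink hooks, not just a name).
struct CCParticipant {
  const char* mClassName;
};

// Stand-in for nsISupports with the proposed extra virtual method.
struct ISupportsLike {
  virtual ~ISupportsLike() = default;
  // Non-CC'd classes inherit this default; CC'd classes override it to
  // return their participant directly, bypassing QI.
  virtual CCParticipant* Participant() { return nullptr; }
};

// A class that does not participate in cycle collection.
struct PlainThing : ISupportsLike {};

// A cycle-collected class overrides Participant() with one virtual call.
struct CycleCollectedThing : ISupportsLike {
  static CCParticipant sParticipant;
  CCParticipant* Participant() override { return &sParticipant; }
};
CCParticipant CycleCollectedThing::sParticipant{"CycleCollectedThing"};
```

The CC would then call Participant() through the base pointer and skip the object entirely when it returns null, which is what replaces the failed child-table scan described earlier in the bug.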
Ehsan, do these QIs still show up in profiles? There have been two patches (one by you and one by me) to reduce this overhead.
My patches haven't landed yet (assuming you're mentioning bug 1384821 and bug 1384798). Here is a profile from current m-c tip: https://perfht.ml/2vQyOvU Not sure if this shows any differences compared to comment 1 that you expected?
Oh, sorry, for some reason I'd thought that bug 1384821 had already landed. It looks like we're still spending just as much time in the table driven QI from the purple buffer, so I guess my patch had no real impact. Maybe calling QI twice in a row on the same object is just as fast as calling it once?
(In reply to Andrew McCreight [:mccr8] from comment #9)
> Oh, sorry, for some reason I'd thought that bug 1384821 had already landed.
>
> It looks like we're still spending just as much time in the table driven QI
> from the purple buffer, so I guess my patch had no real impact. Maybe
> calling QI twice in a row on the same object is just as fast as calling it
> once?

Yes, it should be, because the second time it will be a well-predicted branch, so it should be almost free...
This is a profile from the latest trunk from the first ~150 subtests of Speedometer. Things look better now: https://perfht.ml/2i43XpR Not sure how much further we can improve things...
That is a lot better! I guess bug 1384821 worked like magic. It looks like the residual overhead is all from scheduler label stuff. Maybe Bevis has some ideas of easy wins we could get here?
I see only 1ms out of a 30 or so second profile.
(In reply to Bill McCloskey (:billm) from comment #13)
> I see only 1ms out of a 30 or so second profile.

I think you're looking at the wrong content process, perhaps? Sorry, I realized the generated link was misleading; this one is better: https://perfht.ml/2i2ZhQZ

I don't believe in magic. :-) Reopening.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Ehsan, when you have a free moment (ha ha), I'd be interested to see if the table driven QI still shows up in profiles. I think I've made enough changes that nothing CC-related should show up for that.
I'll try to remember to profile this tomorrow.
So I see 11ms of QI over a 15s period. That is 0.073%. In Ehsan's profile there is 16ms over a similar period of time. So there is something there, but it's hard to say whether there is anything to optimize in the QI implementation, or whether we should try to make CanSkip and content blasting more effective so that we wouldn't need to traverse so much.
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Ehsan, when you have a free moment (ha ha), I'd be interested to see if the
> table driven QI still shows up in profiles. I think I've made enough changes
> that nothing CC-related should show up for that.

Here is a profile from a build from last Thursday, before I went away, which does have all of the recent optimizations: https://perfht.ml/2jrGPCd

Now all of the major offenders that I was seeing previously are gone, and the amount of time spent in NS_TableDrivenQI is also reduced drastically. If you pay close attention to the remaining callers, I don't think any of them would matter as far as Speedometer is concerned. So for benchmark purposes I think this bug is all done. :-) Thanks!
Thanks for checking.
Status: REOPENED → RESOLVED
Closed: 3 years ago → 3 years ago
Resolution: --- → FIXED