Closed Bug 517909 (Opened 15 years ago, Closed 13 years ago)
Perf regression August 25th
Categories: Core :: JavaScript Engine, defect, P2
Status: RESOLVED WONTFIX
People: Reporter: sayrer, Assigned: gal
Attachments (9 files, 1 obsolete file)
  137.23 KB, image/png
  153.87 KB, image/png
  141.16 KB, image/png
  2.31 KB, patch
  3.00 KB, patch
  1.18 MB, application/x-bzip2
  6.52 KB, patch (graydon: review+)
  6.52 KB, patch
  4.57 KB, patch
I think this is probably tracing getters. Details coming.
Comment 1 • Reporter • 15 years ago
Comment 2 • Reporter • 15 years ago
Comment 3 • Reporter • 15 years ago
Comment 4 • Reporter • 15 years ago
Assigning to jorendorff on the assumption that it was tracing getters that caused this.
Assignee: general → jorendorff
Updated • Reporter • 15 years ago
Flags: blocking1.9.2+
Priority: -- → P1
Comment 5 • 15 years ago
Any progress here?
Comment 6 • 15 years ago
Here's the patch to txul I've been using to get profiles.
The profiles have not been helpful to me, but if anyone else wants to take a crack, please do.
Comment 7 • 15 years ago
Update?
Comment 8 • 15 years ago
OK, here's the state of things as I see it.
I can maybe reproduce a 2% Txul regression on Mac. It's hard to tell; the numbers are noisy. I looked at the trace logs, and it seemed like most of the newly-acquired branch exits were in the "currentset" setter in toolkit/content/widgets/toolbar.xml.
Measuring just this function locally using the patch below, it seems to have gotten about 1ms slower with the getter landing (usually going from 5ms to 6ms on 1/3 of its invocations, and from 0-1ms to 0-2ms on the other 2/3). My baseline is about 110ms, so this accounts for at least half the regression I observe.
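For context, a minimal sketch of this kind of Date-based instrumentation (a hypothetical wrapper, not the patch attached to this bug; the helper name and use of dump() for chrome logging are assumptions):

  // Hypothetical sketch: wrap a scripted setter and log how long each call
  // takes. Assumes propName is an accessor property defined directly on obj.
  // Date.now() only has millisecond resolution, which is why the numbers
  // above are so coarse.
  function instrumentSetter(obj, propName) {
    var desc = Object.getOwnPropertyDescriptor(obj, propName);
    var originalSet = desc.set;
    Object.defineProperty(obj, propName, {
      configurable: true,
      get: desc.get,
      set: function (value) {
        var start = Date.now();
        originalSet.call(this, value);
        dump(propName + " setter took " + (Date.now() - start) + " ms\n");
      }
    });
  }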
Comment 9 • 15 years ago
OK, so I created some time profiles and malloc traces of that setter in Shark, before and after the getter patch. The time profiles do in fact show us taking more time in that setter (by about 1ms or so on my machine); the difference is entirely accounted for by time spent under TraceRecorder::monitorRecording. About half of this is under closeLoop; the rest is scattered over various TraceRecorder stuff.
I sent Andreas a copy of the TMFLAGS=full,abort,stats log for Txul after the getter change. I'm hoping he'll see something sane. There are about 100 more branch exits than there used to be before the getter patch, with fewer aborts and less blacklisting. So it looks like the issue is that we try to trace more code (in particular this setter), and it just doesn't trace very well. For one thing, it looks reasonably branchy inside the loops.
The malloc log shows about 80KB more malloc allocations coming from nanojit::Allocator::allocChunk; about a third of that is coming from nanojit::LirBufWriter::insSkip, another 25% from nanojit::LInsHashSet::grow, then insStorei, insLoad, ins2, trackCfgMerges, record_JSOP_CALLPROP, monitorRecording, TraceRecorder::guard, getAnchor.
Time spent under vm_fault went from 0.7% of time in this setter to 5%; that's about 1/8-1/4 of the regression I'm seeing right there....
Comment 10 • 15 years ago
Comment 11 • Assignee • 15 years ago
Comment 12 • Assignee • 15 years ago
Attachment #405586 - Attachment is obsolete: true
Updated • Assignee • 15 years ago
Attachment #405587 - Flags: review?(graydon)
Updated • 15 years ago
Attachment #405587 - Flags: review?(graydon) → review+
Comment 13 • Assignee • 15 years ago
The attached patch should make the tempAlloc go to the malloc heap less often. That might help with vm_faults. Need confirmation from bz to proceed.
Comment 14 • Assignee • 15 years ago
And this time without overflowing the allocated blob size.
Comment 15 • Assignee • 15 years ago
Assignee: jorendorff → gal
Comment 16 • 15 years ago
Wow, nice effort bz+sayrer+gal+dvander!
Comment 17 • 15 years ago
Arena pools give back memory promptly to the malloc heap -- that's not evil. Your point must be less moralistic: stack or other "inline" allocation (jsvector can do this) beats heap (whether chunked by arena code or not). That's true!
/be
Comment 19 • 15 years ago
[10:04am] bz: here are the things we do know
[10:04am] bz: 1) We checked in a patch
[10:04am] bz: 2) That patch causes a small but measurable (1-3%, depending on who's measuring) regression in Txul
[10:04am] bz: 3) That patch causes a small but measurable win (10% or so) in microbenchmarks that specifically exercise that code.
[10:05am] bz: 4) That patch has the potential to lead to significant wins in bigger programs that have a lot of JS computation and some DOM access in an inner loop
[10:05am] bz: 5) I can't give you an example of such a program off the top of my head, though I could certainly write one
[10:05am] damons: 6) Something else regressed Tdhtml on Oct 4.
[10:05am] bz: yes
[10:06am] bz: 7) At least half the measured Txul regression can be attributed to a regression in the runtime of a single function
[10:07am] bz: I say "at least" because it's hard to tell for sure given the low resolution of date objects in JS
[10:08am] bz: 8) We have a full TM log of a Txul run (attached in the bug)
[10:08am] bz: 9) Someone who understands such logs might want to examine the parts of it dealing with the above function.
[10:08am] damons: So, to boil this down a bit, Are we willing to take the 1-3% txul regression for a 10% microbenchmark gain with potential for big payoff--in the context of delaying 3.6 beta to look for something we may never find?
[10:08am] bz: #9 is my opinion, not fact.
[10:09am] bz: I think the right course of action is probably to ship the beta
[10:09am] damons: with the regression?
[10:09am] bz: but continue looking into the txul hit here
[10:09am] bz: yes
[10:09am] damons: sayrer: ?
[10:09am] sayrer: yeah, I think so too
[10:10am] bz: The other obvious options are to not ship tracing getters at all in 1.9.2
[10:10am] damons: alright.. I'll unblock on this.
[10:10am] bz: or to hold the beta on the off chance that we need major surgery to fix the txul hit
[10:10am] bz: and that we'll block final on the txul hit
[10:10am] jorendorff: Are we going to block final on the txul hit?
[10:10am] bz: that is another excellent question
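As a rough illustration of bz's points 4 and 5 above, here is a contrived, hypothetical sketch of the kind of program he describes (heavy JS computation plus some DOM access in an inner loop); none of this code is from the bug, and the function name is made up:

  // Hypothetical example only: mostly plain JS arithmetic, with an occasional
  // DOM getter access inside the hot inner loop. If that access can be traced,
  // the whole loop can stay compiled instead of leaving trace on every
  // iteration.
  function sumWeightedLengths(elements) {
    var total = 0;
    for (var i = 0; i < elements.length; i++) {
      var el = elements[i];
      for (var j = 1; j <= 1000; j++) {
        // DOM getter mixed into otherwise pure JS computation.
        total += el.childNodes.length * Math.sin(j) / j;
      }
    }
    return total;
  }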
Comment 20 • Assignee • 15 years ago
So it looks like we made up 1% of the 3%. The patch is on m-c.
I did look through the TM log bz attached. There is nothing obvious that goes wrong. We compile a bunch of stuff and run it. Most loops don't run for long (low iteration count), so we end up not winning back the time we spend compiling.
The patch mentioned above makes compilation faster in general. That helped us make up some time. We will have to land more patches like that to make up the regression, but this isn't really fixing anything; we are just making something else faster to hide the regression again.
One of the problems seems to be that since this is code that runs when opening a new window, its global is different every time (the new window). So we re-record at every window open. That's a bit unfortunate. I will discuss it with Brendan, but we did make a deliberate choice to specialize on global objects, so I am not sure we want to change that.
Comment 21 • Reporter • 15 years ago
(In reply to comment #20)
>
> The patch mentioned above makes compilation faster in general. That helped us
> make up some time. We will have to land more patches like that to make up the
> regression, but this isn't really fixing anything; we are just making
> something else faster to hide the regression again.
This tradeoff seems inherent in the approach we're taking, and in compilation in general when it comes to web content (it's easier to amortize up-front compilation costs for repeatedly executed chrome). I don't think it's hiding anything to make compilation or recording cheaper, though we might make bigger strides on these benchmarks with your suggested investigations.
Comment 22 • 15 years ago
Do we need to hold the 3.6 release for this? My sense, given 3.6's performance relative to 3.5 and the scale of this regression, is that we do not.
Who dares oppose?!?!
I mean: comments?
Comment 23 • Assignee • 15 years ago
We did win back about 30-50% of this, so we are not looking at the full hit any more.
Comment 24 • 15 years ago
(In reply to comment #22)
> Do we need to hold the 3.6 release for this? My sense, given 3.6's performance
> relative to 3.5 and the scale of this regression, is that we do not.
Well, in the past we've said no performance regressions unless we have a really, really good reason (like security). We've also said that we don't take a regression just because it gets us a win in one or more other places, again unless we have a good reason. I've had to back out many sqlite upgrades because of the second reason. Perhaps the justification has been mentioned in the various comments in this bug, but a summary of why we should take the regression should be included. Also, I don't think this bug is the right place for this conversation. dev.tree-management and/or dev.platform would get a wider audience (unless this is strictly a driver decision).
Comment 25 • 15 years ago
Blocking is a driver decision, yes. If you're arguing that we should back it out, I think the time for that has passed, but that would be a thread for d.t-m.
The second paragraph of comment 9 describes the situation, though: as we trace more code, we sometimes cause existing slow-compilation cases to be attempted unprofitably. I don't think this is avoidable unless we believe that we can go from the current state to "trace *everything* profitably" in one step; I very much think that's unrealistic.
The performance discontinuities we see here (and elsewhere, when a single odd construct disrupts the performance of the rest of a loop that might otherwise trace well) seem to be inherent to tracing, so we're going to have to push through them as we complete our coverage.
Comment 26 • Assignee • 15 years ago
The root cause seems to be that we specialize per global object in order to shortcut security checks. This code runs every time a new window opens, so we specialize and recompile anew for that new global object (the window). Undoing that basic decision would make a whole bunch of other cases slower. I agree with a very strict "only better" policy, but in this particular case that would mean never compiling code that looks like the code in the window-opening path, which is not realistic. Also, please note that the JS chrome code in question is supremely unpretty and probably ready for a rewrite, which might help make this faster, too.
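As a rough conceptual model of the per-global specialization described here (this is not TraceMonkey's actual data structure, just a hypothetical sketch of why a fresh global defeats the compiled-trace cache):

  // Hypothetical model only: compiled traces are looked up by both the loop's
  // location and the global object they were recorded against. A newly opened
  // window has a brand-new global, so the lookup misses and the same chrome
  // code gets re-recorded and re-compiled for every window.
  var traceCache = new Map();   // loopId -> (global -> compiled trace)

  function lookupTrace(loopId, globalObj) {
    var perGlobal = traceCache.get(loopId);
    return perGlobal ? perGlobal.get(globalObj) : undefined;
  }

  function storeTrace(loopId, globalObj, compiledTrace) {
    var perGlobal = traceCache.get(loopId);
    if (!perGlobal) {
      perGlobal = new WeakMap();  // one entry per global the trace specializes on
      traceCache.set(loopId, perGlobal);
    }
    perGlobal.set(globalObj, compiledTrace);
  }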
Comment 28 • 14 years ago
These bugs are all part of a search I made for js bugs that are getting lost in transit:
http://tinyurl.com/jsDeadEndBugs
They all have a review+'ed, non-obsoleted patch and are not marked fixed-in-tracemonkey or checkin-needed but have not seen any activity in 300 days. Some of these got lost simply because the assignee/patch provider never requested a checkin, or just because they were forgotten about.
Comment 29 • 13 years ago
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX