Bug 477425 (Closed): Tp memory use regression, midday on January 27, 2009
Opened 15 years ago; closed 15 years ago
Categories: Core :: JavaScript Engine, defect, P2
Status: RESOLVED FIXED
People: (Reporter: dbaron, Assigned: graydon)
Keywords: fixed1.9.1, regression
Attachments (3 files):
- patch, 4.73 KB, gal: review+
- patch, 578 bytes, jorendorff: review+
- patch, 7.20 KB, gal: review+
A few days ago I noticed Tp RSS regressions (this is memory usage during the Tp test) on Linux:

http://graphs.mozilla.org/graph.html#show=395139,395141,395170,911698,1431037

There's one jump midday on January 27 (that's the regression this bug is about), and then another on the early morning of January 30, reported inconsistently between the boxes (that was the talos unthrottling, which increased the memory use number a little, not too surprisingly, given time-based caches).

There's a similar memory use regression in the Tp3 (Mem-PB) test (clearly something to do with memory use, although I don't know what) on Windows XP:

http://graphs.mozilla.org/graph.html#show=395012,395026,395052,912152,1431881

at the same time period (it didn't show up on the *-fast machine, though, or on trunk07, which also seems different from the rest).

The Mac numbers are *much* noisier than the Windows and Linux numbers, so it's hard to see if there's a regression there as well. However, on the smoother 10.5 machines (which are better than the 10.4 machines) it looks like there was an equivalent regression on Mac.
Comment 1 • 15 years ago (Reporter)
Commits around the window are: http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-27+09%3A00%3A00&enddate=2009-01-27+14%3A00%3A00 *If* the times in the graph are actually the build start times (which they should be, but I'm not sure if they are), it's likely towards the end of that window, which means it's the tracemonkey merge.
Comment 2 • 15 years ago (Reporter)
Better graph URLs, showing just the primary machines: http://graphs.mozilla.org/graph.html#show=395012,395026,395052&sel=1233065693,1233115747 http://graphs.mozilla.org/graph.html#show=395139,395141,395170&sel=1233050643,1233110707 which says the Linux RSS number regressed about 2%, the Windows Mem-PB number closer to 3.5%-4%.
Updated • 15 years ago
Flags: blocking1.9.1?
Comment 3 • 15 years ago
Here's a link to Tp-WS on TraceMonkey showing a regression on 1/26 http://graphs.mozilla.org/graph.html#show=2542724,2539662,2539199&sel=1232980571,1233069052 and a regression for Tp-RSS during the same period on linux: http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232891285,1233228446
Comment 4 • 15 years ago
This looks like http://hg.mozilla.org/tracemonkey/rev/72dd6ee4baf7 Graydon Hoare — Bug 473688 - provide a 2-level hash map (global, pc) -> tree, r=gal.
Comment 5 • 15 years ago
Looks like we leak TreeInfos since we don't call Fragment's destructor any more (since this patch landed, Fragments are placed into the JIT cache instead of being malloc'd, and are then freed in bulk). My bad; I reviewed that patch. Is it OK if Graydon takes care of this on Monday, or should I try to patch it?
Updated • 15 years ago
Flags: blocking1.9.1? → blocking1.9.1+
Comment 7 • 15 years ago (Assignee)
Good catch on the leaked TreeInfos, and probably correct. Shall we allocate them inside the lirbuf as well, or try to hook up the dtors? I imagine the former. (Also note: your earlier comment that this patch should never have caused increased memory use is not entirely correct. It just shouldn't have caused *much*. It lets the fragment cache survive 4x as many global mismatches before flushing, so it actually does bring memory use up a touch, by design. But only a touch. Flushing is so frequent anyway, I'd be surprised if it causes more than a few dozen KB of resident increase.)
Updated • 15 years ago (Assignee)
Assignee: nobody → graydon
Comment 8 • 15 years ago
Yeah, I would suggest moving the TreeInfos into the cache (lirbuf). Also, maybe TraceRecorder should have an inline copy of the tree info it uses for recording, and then after a successful recording we copy it into the lirbuf and update the fragment->vmprivate field. This way we don't fill the cache with TreeInfos from unsuccessful recordings. I assume this regression shows up in Talos. If there is JS that has multiple globals and extensive JS use, I would assume we already max out cache use, which is capped at 16MB, so we shouldn't ever see an increase. Quite the contrary: we no longer malloc fragments, so we should see a _decrease_, imo. Let's fix the bug and measure it :)
Comment 9 • 15 years ago (Assignee)
I think you misdiagnosed this, actually. We're destroying treeinfos. They're owned by the recorder, not the fragments, and the recorder is still deleting them. If we're leaking, it's something else. I'm digging presently.
Comment 10 • 15 years ago
No we don't. TreeInfos are owned by the recorder until successful compilation, at which point ownership transfers to the root fragment.
Comment 11 • 15 years ago (Assignee)
Well, the approach you're describing is perhaps ideal, but it's much more intrusive to implement; lifecycles have to shift around to match, since TreeInfos currently come into existence before TraceRecorders, and the lifecycle of UnstableExits still wouldn't match either. For the time being, this is a simple enough modification that is correct, at the expense of possibly filling the LIR buffer faster than we'd ideally want: it just moves the TreeInfo allocation (along with those for UnstableExits) into the lirbuf and removes all the explicit deletes of same. Seems to work fine.
Attachment #361351 - Flags: review?(gal)
Comment 12 • 15 years ago
Comment on attachment 361351 [details] [diff] [review]: Fix the bug

Looks good. Can you please remove the onDestroy hack too? (r=me with that.) We can improve the cache use in a separate bug. You are right; let's fix the regression with as little code motion as possible.
Attachment #361351 - Flags: review?(gal) → review+
Comment 13 • 15 years ago (Assignee)
http://hg.mozilla.org/tracemonkey/rev/14d5479a5c73
Comment 14 • 15 years ago (Assignee)
Memory use appears to have *increased* with this change. I feel ill. Will watch the talos numbers a couple more times before ... backing out? I don't really see how this should be possible, unless we are perhaps getting "closer" to the 16mb lirbuf limit when we previously never did, so we're seeing the ceiling of our own use grow closer to 16mb? I don't know what else to propose.
Whiteboard: fixed-in-tracemonkey
Updated • 15 years ago (Assignee)
Whiteboard: fixed-in-tracemonkey
Comment 15 • 15 years ago (Assignee)
Ho ho. So here is probably our culprit: the fragmento's growth code is indented in such a way as to suggest that it only doubles its page set each time it exhausts it. But its indentation is misleading; its actual meaning is "double the page set on each page allocation". So the fact that we're performing a *few* extra page-based allocations under the initial patch -- having moved TreeInfos from the malloc heap to fragmento pages -- means we're *doubling memory* a few more times than before. It doesn't take many doublings to hit our ceiling, and indeed any nontrivial test gets close if you actually print the fragmento's page size (e.g. even SunSpider gets to the 8mb point after 2 micro-benchmark loops). It only has to request 16 *used* pages to have allocated the full 4096-page set. Fixing this causes the fragmento's memory behavior to drop back down to 9 doublings (~511 pages, 2mb) on SS and our trace-test, though v8 still hits the full 16mb.
Attachment #361531 - Flags: review?(gal)
Updated • 15 years ago (Assignee)
Attachment #361531 - Flags: review?(gal) → review?(jorendorff)
Comment 16 • 15 years ago
Comment on attachment 361531 [details] [diff] [review]: Fix an uncovered, even worse bug, that is not my fault!

You've got a bunch of trailing whitespace after your `}`. r+ with that fixed.
Attachment #361531 - Flags: review?(jorendorff) → review+
Comment 17 • 15 years ago (Assignee)
http://hg.mozilla.org/tracemonkey/rev/8590bf3b592c , now to see if talos feels generous.
Comment 18 • 15 years ago
cc'ing stuart. He was always curious how much JIT cache we need for SS.
Comment 19 • 15 years ago (Assignee)
Confusion upon confusion. Talos shows that private bytes have gone down quite substantially:

http://graphs.mozilla.org/graph.html#show=2539196,2542722,2539660

while simultaneously claiming that working set / RSS is still up, as it was last night:

http://graphs.mozilla.org/graph.html#show=2539199,2539662,2542724

Did I perhaps collide with someone else's regressing merge (say, a shared code increase)? Or ... anyone have any wild guesses about how private bytes can drop while the working set expands?
Comment 20 • 15 years ago
http://hg.mozilla.org/mozilla-central/rev/14d5479a5c73
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated • 15 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 21 • 15 years ago (Assignee)
*Sigh* OK, well, I spent the first 2/3 of the day chasing ghosts in the Fragmento's memory management system (which is somewhat silly) before giving up and doing what I should have done when this was first reported: talking to valgrind's leak checker and chasing the number down to 0. Turns out there remained a fair number of hard leaks due to extra fiddly sub-structures that dangle off the things we moved into the LIR buffers. Enough that it quickly gets tiring to track them all down when there are perfectly good destructors for them already. So this patch just reverts the design decision of "store fragments and treeinfos in the LIR buffer". Back on the malloc heap they go, as before. We still do the non-Fragmento hashtable keyed by multiple globals and all that good stuff. Appears to help with RSS on real pageset tests I'm running locally. Valgrind is (more) happy with this. We still leak a couple of JS structures (property tree bits?) but nothing inside the tracer or nanojit.
Attachment #361919 - Flags: review?(gal)
Comment 22 • 15 years ago
Comment on attachment 361919 [details] [diff] [review]: Back out poorly-implemented lirbuf placement-new allocation scheme

In the long term we should still go for the lirbuf allocation, but there's no need for 3.1.
Attachment #361919 - Flags: review?(gal) → review+
Comment 23 • 15 years ago (Assignee)
http://hg.mozilla.org/tracemonkey/rev/e6238b9ea42a
Comment 24 • 15 years ago (Assignee)
That took a swipe at the problem, anyway. Seems to recover the amount lost by my initial Jan 27 commit:

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136

It still looks like we haven't made up all the distance to the pre-Jan-27 baseline, though; there's a secondary culprit that landed early in the day on Jan 30 that I haven't identified yet. Will continue hunting.
Comment 25 • 15 years ago
(In reply to comment #24)
> That took a swipe at the problem, anyway. Seems to recover the amount lost by
> my initial Jan 27 commit:
>
> http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136
>
> It still looks like we haven't made up all the distance since the pre-Jan-27
> baseline though; there's a secondary culprit that landed early in the day on
> Jan 30 that I haven't identified yet. Will continue hunting.

On Jan 30, the Talos boxes were unthrottled. This executes the tests twice as fast, and our time-based caches (imgs, textruns, etc.) are probably running a little bigger.
Comment 26 • 15 years ago (Assignee)
Hmm, plausible. I wouldn't want to leap to that conclusion and overlook a real bug though. On the other hand, your notion is at least supported by the evidence: Tracemonkey, 1.9.1 and 1.9.2 branches all show a simultaneous jump in RSS.

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=2418072,2421544,2424797&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=395170,395141,395139&sel=1232787589,1233647313

I don't suppose you happened to do a merge across all 3 branches that day? I'll check the logs...
Comment 27 • 15 years ago (Assignee)
Another visual, all 3 branches superimposed: http://graphs.mozilla.org/graph.html#show=395170,395141,395139,2418072,2421544,2424797,2539324,2542679,2568136&sel=1232345177,1234090287
Comment 28 • 15 years ago (Assignee)
Confusing. Some data points:

- The unthrottling happened between 06:00 - 09:00 PST.
- The regression happened around 01:00 PST, give or take. This contradicts the hypothesis.
- The regression is not visible in the staging graph2 servers, which came online unthrottled and cover the same range. This supports the hypothesis.
- Even with a very wide pushlog scope, there are no changes that landed on Tracemonkey, 1.9.1 and m-c simultaneously. This supports the hypothesis.

Reference pushlogs covering a day on either side:

http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/tracemonkey/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00

Reference staging graphs2 window:

http://graphs-stage2.mozilla.org/graph.html#tests=[{%22test%22:%225%22,%22branch%22:%221%22,%22machine%22:%225%22}]&sel=1233157419,1233539428

Reference announcement of downtime for unthrottling:

http://blog.mozilla.com/bhearsum/archives/76

I'm a little perplexed. It *could* be the unthrottling, but only if the timing is significantly different than reported. Any thoughts?
Comment 29 • 15 years ago
(In reply to comment #28)
> I'm a little perplexed. It *could* be the unthrottling, but only if the timing
> is significantly different than reported.

It looks like that is the case. You can spot the first unthrottled build by looking at a performance test (say, SunSpider).
Comment 30 • 15 years ago (Reporter)
(In reply to comment #28)
> - The unthrottling happened between 06:00 - 09:00 PST

In real time.

> - The regression happened around 01:00 PST, give or take.

In the timestamps of the builds being tested.

> This contradicts the hypothesis

So I don't think there's a contradiction. Especially since, at least when I looked at the graphs when I filed the bug, different builds actually showed inconsistent ranges for when the "regression" was. That's consistent with a change associated with real time rather than with the build timestamp.
Comment 31 • 15 years ago (Assignee)
Yeah, OK. I was a little suspicious since the times appear to be somewhat separated, but only if you look at a group of results chosen randomly; turns out that's just multi-machine slew. Same-machine, you do get a simultaneous movement of RSS-up, SunSpider-down:

http://graphs.mozilla.org/graph.html#show=395162&sel=1233279285,1233343445
http://graphs.mozilla.org/graph.html#show=395170&sel=1233266413,1233385107

I guess that's "case closed" then. Testing artifact.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Updated • 15 years ago (Reporter)
Component: General → JavaScript Engine
QA Contact: general → general
Updated•15 years ago
|
Flags: in-testsuite-
Flags: in-litmus-
Comment 33 • 15 years ago
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0e16c9d0ef7d
Keywords: fixed1.9.1
Comment 34 • 15 years ago
and the backout http://hg.mozilla.org/releases/mozilla-1.9.1/rev/5f32d8d14458