Closed Bug 477425 - Opened 15 years ago, Closed 15 years ago

Tp memory use regression, midday on January 27, 2009

Categories

(Core :: JavaScript Engine, defect, P2)


Tracking


RESOLVED FIXED

People

(Reporter: dbaron, Assigned: graydon)

Details

(Keywords: fixed1.9.1, regression)

Attachments

(3 files)

A few days ago I noticed Tp RSS regressions (this is memory usage
during the Tp test) on Linux:
http://graphs.mozilla.org/graph.html#show=395139,395141,395170,911698,1431037

There's one jump midday on January 27 (that's the regression this
bug is about), and then another on the early morning of January 30,
reported inconsistently between the boxes (that was the talos
unthrottling, which increased the memory use number a little, not
too surprisingly, given time-based caches).

There's a similar memory use regression in the Tp3 (Mem-PB) test
(clearly something to do with memory use, although I don't know
what) on Windows XP:
http://graphs.mozilla.org/graph.html#show=395012,395026,395052,912152,1431881
in the same time period (it didn't show up on the *-fast machine,
though, or on trunk07, which also seems different from the rest).

The Mac numbers are *much* noisier than the Windows and Linux 
numbers, so it's hard to see if there's a regression there as well.
However, on the smoother 10.5 machines (which are less noisy than the
10.4 machines) it looks like there was an equivalent regression on
Mac.
Commits around the window are:
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-27+09%3A00%3A00&enddate=2009-01-27+14%3A00%3A00

*If* the times in the graph are actually the build start times (which they should be, but I'm not sure if they are), it's likely towards the end of that window, which means it's the tracemonkey merge.
Better graph URLs, showing just the primary machines:
http://graphs.mozilla.org/graph.html#show=395012,395026,395052&sel=1233065693,1233115747
http://graphs.mozilla.org/graph.html#show=395139,395141,395170&sel=1233050643,1233110707
which show that the Linux RSS number regressed by about 2%, and the Windows Mem-PB number by closer to 3.5%-4%.
Flags: blocking1.9.1?
Here's a link to Tp-WS on TraceMonkey showing a regression on 1/26

http://graphs.mozilla.org/graph.html#show=2542724,2539662,2539199&sel=1232980571,1233069052

and a regression for Tp-RSS during the same period on linux:

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232891285,1233228446
This looks like

http://hg.mozilla.org/tracemonkey/rev/72dd6ee4baf7

Graydon Hoare — Bug 473688 - provide a 2-level hash map (global, pc) -> tree, r=gal.
Looks like we leak treeInfos since we don't call Fragment's destructor any more (Fragments are placed into the JIT cache instead of malloc'd since this patch landed and are then freed in bulk). My bad. I reviewed that patch. Is it ok if Graydon takes care of this on Monday or should I try to patch it?
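Roughly the leak pattern being described, as a minimal hypothetical sketch (only TreeInfo, Fragment and vmprivate are names from this bug; everything else is illustrative, not the actual TraceMonkey code):

    // TreeInfo is still new'd on the malloc heap, but the Fragment that owns
    // it now lives inside JIT cache pages that are released in bulk, so
    // ~Fragment never runs and the TreeInfo is never deleted.
    struct TreeInfo {
        // per-tree metadata
    };

    struct Fragment {
        TreeInfo* vmprivate = nullptr;
        ~Fragment() { delete vmprivate; }  // no longer invoked once Fragments
                                           // are carved out of bulk-freed
                                           // cache pages
    };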
I concur this should block final. Easy fix though.
Priority: -- → P2
Flags: blocking1.9.1? → blocking1.9.1+
Good catch on the leaked treeinfos, and probably correct. Shall we allocate them inside the lirbuf as well, or try to hook up the dtors? I imagine the former.

(Also note: your earlier comment that this patch should never have caused increased memory use is not entirely correct. It just shouldn't have caused *much*. It lets the fragment cache survive 4x as many global mismatches before flushing, so it actually does bring up memory use a touch. By design. But only a touch. Flushing is so frequent anyways, I'd be surprised if it causes more than a few dozen KB resident increase.)
Assignee: nobody → graydon
Yeah, I would suggest moving the treeinfos into the cache (lirbuf). Also, maybe TraceRecorder should have an inline copy of tree info it uses for recording and then after a successful recording we copy it into the lirbuf and update the fragment->vmprivate field. This way we don't fill the cache with treeinfos of unsuccessful recordings.

I assume this regression shows up for talos. If there is JS that has multiple globals and extensive JS use I would assume we already max out cache use, which is capped at 16MB, so we shouldn't ever see an increase. Quite to the contrary, we no longer malloc fragments, so we should see a _decrease_ imo. Let's fix the bug and measure it :)
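Something like the following, as a rough sketch of that suggestion (hypothetical names and signatures, not the real TraceRecorder or LirBuffer API):

    #include <cstddef>
    #include <new>

    struct TreeInfo { /* per-tree metadata */ };

    // Assumed stand-in for the JIT cache's LIR buffer: a simple bump
    // allocator whose storage is thrown away wholesale on flush.
    struct LirBuffer {
        char storage[4096];
        std::size_t used = 0;
        void* alloc(std::size_t n) { void* p = storage + used; used += n; return p; }
    };

    struct Fragment { void* vmprivate = nullptr; };

    struct TraceRecorder {
        TreeInfo scratch;  // inline copy used while recording

        // Only a successful recording pays for cache space: the scratch copy
        // is placement-copied into the lirbuf and hooked onto the root
        // fragment. A failed recording just discards 'scratch'.
        void onSuccessfulCompile(LirBuffer& buf, Fragment& root) {
            void* mem = buf.alloc(sizeof(TreeInfo));
            root.vmprivate = new (mem) TreeInfo(scratch);
        }
    };

That way the 16MB-capped cache would only ever hold treeinfos for trees that actually compiled.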
I think you misdiagnosed this, actually. We're destroying treeinfos. They're owned by the recorder, not the fragments, and the recorder is still deleting them.

If we're leaking, it's something else. I'm digging presently.
No, we don't.

TreeInfos are owned by the recorder until successful compilation, at which point ownership transfers to the root fragment.
Attached patch: Fix the bug (Splinter Review)
Well, the approach you're describing is perhaps ideal, but it's much more intrusive to implement; lifecycles have to shift around to match, as TreeInfos currently come into existence before TraceRecorders. And the lifecycle of UnstableExits still wouldn't match either.

For the time being, this is a simple enough modification that is correct, at the expense of possibly filling the LIR buffer faster than we'd ideally want: it just moves the TreeInfo allocation (along with those for UnstableExits) into the lirbuf and removes all the explicit deletes of same. Seems to work fine.
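In effect, the change is just this (simplified sketch; allocInLirBuffer is a hypothetical stand-in for the real nanojit allocation call):

    #include <cstddef>
    #include <new>

    struct TreeInfo { /* ... */ };

    // Assumed bump allocation out of the current lirbuf.
    void* allocInLirBuffer(std::size_t n) {
        static char buf[1 << 16];
        static std::size_t used = 0;
        void* p = buf + used;
        used += n;
        return p;
    }

    TreeInfo* newTreeInfo() {
        // Before: return new TreeInfo();  ...with a matching explicit delete.
        // After:  placement-new into the lirbuf; the explicit deletes go away
        // because the whole buffer is reclaimed when the cache is flushed, at
        // the cost of consuming lirbuf space a little faster.
        return new (allocInLirBuffer(sizeof(TreeInfo))) TreeInfo();
    }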
Attachment #361351 - Flags: review?(gal)
Comment on attachment 361351 [details] [diff] [review]
Fix the bug

Looks good. Can you please remove the onDestroy hack too? (r=me with that)

We can improve the cache use in a separate bug. You are right, let's fix the regression with as little code motion as possible.
Attachment #361351 - Flags: review?(gal) → review+
Memory use appears to have *increased* with this change. I feel ill. Will watch the talos numbers a couple more times before ... backing out? 

I don't really see how this should be possible, unless we are perhaps getting "closer" to the 16mb lirbuf limit than we ever did before, so we're seeing the ceiling of our own use creep up towards 16mb? I don't know what else to propose.
Whiteboard: fixed-in-tracemonkey
Ho ho. So here is probably our culprit: the fragmento's page-allocation code is indented in such a way as to suggest that it only doubles its page set when it exhausts it. But its indentation is misleading. The actual behavior is "double the page set on each page allocation". So the fact that we're performing a *few* extra page-based allocations under the initial patch -- having moved TreeInfos from the malloc heap to fragmento pages -- means we're *doubling memory* a few more times than before.

It doesn't take many doublings to hit our ceiling, and indeed any nontrivial test gets close if you actually print the fragmento's page size (e.g. even sunspider gets to the 8mb point after 2 micro-benchmark loops). It only has to request 16 *used* pages to have allocated the full 4096-page set.

Fixing this causes fragmento's memory behavior to drop back down to 9 doublings (~511 pages, 2mb) on ss and our trace-test, though v8 still hits the full 16mb.
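For reference, a minimal sketch of that failure mode (names and numbers are made up; this is not the actual Fragmento code):

    #include <cstdio>
    #include <cstddef>

    struct PagePool {
        std::size_t owned = 16;      // pages currently reserved by the pool
        std::size_t freePages = 16;  // pages not yet handed out

        void grow() { freePages += owned; owned *= 2; }  // double the reservation

        void allocPage() {
            // Intended: double only when the free list is exhausted, i.e.
            //   if (freePages == 0)
            //       grow();
            // What the indentation hid: grow() sits outside the if-body, so
            // the pool doubles on *every* page allocation.
            if (freePages == 0)
                std::fprintf(stderr, "free list exhausted\n");
                grow();  // unconditional, despite how it is indented
            --freePages;
        }
    };

Modern compilers would flag this shape with -Wmisleading-indentation, but that warning didn't exist in 2009.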
Attachment #361531 - Flags: review?(gal)
Attachment #361531 - Flags: review?(gal) → review?(jorendorff)
Comment on attachment 361531 [details] [diff] [review]
Fix an uncovered, even worse bug, that is not my fault!

You've got a bunch of trailing whitespace after your `}`.  r+ with that fixed.
Attachment #361531 - Flags: review?(jorendorff) → review+
http://hg.mozilla.org/tracemonkey/rev/8590bf3b592c , now to see if talos feels generous.
cc'ing stuart. He was always curious how much JIT cache we need for SS.
Confusion upon confusion. Talos shows that private bytes have gone down quite
substantially:

http://graphs.mozilla.org/graph.html#show=2539196,2542722,2539660

While simultaneously claiming that working set / RSS is still up, as it was
last night:

http://graphs.mozilla.org/graph.html#show=2539199,2539662,2542724

Did I perhaps collide with someone else's regressing merge (say, shared code
increase)? Or ... anyone have any wild guesses about how private bytes can drop
while working set expands?
http://hg.mozilla.org/mozilla-central/rev/14d5479a5c73
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
*Sigh* ok, well, I spent the first 2/3 of the day chasing ghosts in the Fragmento's memory management system (which is somewhat silly) before giving up and doing what I should have done when this was first reported: talking to valgrind's leak checker and chasing the number down to 0.

Turns out there remained a fair number of hard leaks due to extra fiddly sub-structures that dangle off the things we moved into the LIR buffers. Enough that it quickly gets tiresome to track them all down when there are perfectly good destructors for them already.

So this patch just reverts the design decision of "store fragments and treeinfos in the LIR buffer". Back on the malloc heap they go, as before. We still keep the non-Fragmento hashtable keyed by multiple globals, and all that good stuff. Appears to help with RSS on real pageset tests I'm running locally.

Valgrind is (more) happy with this. We still leak a couple of JS structures (property tree bits?) but nothing inside the tracer or nanojit.
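The reverted ownership shape described above, as a minimal hypothetical sketch (not the actual patch):

    struct TreeInfo { /* ... */ };

    struct Fragment {
        TreeInfo* vmprivate = nullptr;
    };

    void attachTree(Fragment& root) {
        root.vmprivate = new TreeInfo();   // back on the malloc heap
    }

    void destroyTree(Fragment& root) {
        // Explicit delete restored: the TreeInfo's own destructor now runs,
        // which is what cleans up the fiddly sub-structures that were leaking
        // when it sat in a bulk-freed lirbuf.
        delete root.vmprivate;
        root.vmprivate = nullptr;
    }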
Attachment #361919 - Flags: review?(gal)
Comment on attachment 361919 [details] [diff] [review]
Back out poorly-implemented lirbuf placement-new allocation scheme

In the long term we should still go for the lirbuf allocation, but there's no need for that in 3.1.
Attachment #361919 - Flags: review?(gal) → review+
That took a swipe at the problem, anyway. Seems to recover the amount lost by my initial Jan 27 commit:

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136

It still looks like we haven't made up all the distance since the pre-Jan-27 baseline though; there's a secondary culprit that landed early in the day on Jan 30 that I haven't identified yet. Will continue hunting.
(In reply to comment #24)
> That took a swipe at the problem, anyway. Seems to recover the amount lost by
> my initial Jan 27 commit:
> 
> http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136
> 
> It still looks like we haven't made up all the distance since the pre-Jan-27
> baseline though; there's a secondary culprit that landed early in the day on
> Jan 30 that I haven't identified yet. Will continue hunting.

On Jan 30, the Talos boxes were unthrottled. This makes the tests run twice as fast, and our time-based caches (images, textruns, etc.) are probably running a little bigger.
Hmm, plausible. I wouldn't want to leap to that conclusion and overlook a real bug though.

On the other hand, your notion is at least supported by the evidence: Tracemonkey, 1.9.1 and 1.9.2 branches all show a simultaneous jump in RSS.

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=2418072,2421544,2424797&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=395170,395141,395139&sel=1232787589,1233647313

I don't suppose you happened to do a merge across all 3 branches that day? I'll check the logs...
Confusing. Some data points:

  - The unthrottling happened between 06:00 - 09:00 PST
  - The regression happened around 01:00 PST, give or take. This contradicts the hypothesis.
  - The regression is not visible in the staging graph2 servers, which came online unthrottled and cover the same range. This supports the hypothesis.
  - Even with a very wide pushlog window, there are no changes that landed on Tracemonkey, 1.9.1 and m-c simultaneously. This supports the hypothesis.

Reference pushlogs covering a day on either side: 
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/tracemonkey/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00

Reference staging graphs2 window:
http://graphs-stage2.mozilla.org/graph.html#tests=[{%22test%22:%225%22,%22branch%22:%221%22,%22machine%22:%225%22}]&sel=1233157419,1233539428

Reference announcement of downtime for unthrottling:
http://blog.mozilla.com/bhearsum/archives/76

I'm a little perplexed. It *could* be the unthrottling, but only if the timing is significantly different than reported. Any thoughts?
(In reply to comment #28)
> 
> I'm a little perplexed. It *could* be the unthrottling, but only if the timing
> is significantly different than reported.

It looks like that is the case. You can spot the first unthrottled build by looking at a performance test (say, SunSpider).
(In reply to comment #28)
>   - The unthrottling happened between 06:00 - 09:00 PST

In real time.

>   - The regression happened around 01:00 PST, give or take.

In the time of the build being tested.

> This contradicts
> the hypothesis

So I don't think there's a contradiction. Especially since, at least when I looked at the graphs when I filed the bug, different builds actually showed inconsistent ranges for when the "regression" was. That's consistent with a change associated with real time rather than the build timestamp.
Yeah, ok. I was a little suspicious since the times appeared to be a bit more spread out, but only if you look at a group of results chosen randomly; turns out that's just multi-machine slew. On the same machine, you do get a simultaneous movement of RSS-up, sunspider-down:

http://graphs.mozilla.org/graph.html#show=395162&sel=1233279285,1233343445
http://graphs.mozilla.org/graph.html#show=395170&sel=1233266413,1233385107

I guess that's "case closed" then. Testing artifact.
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: General → JavaScript Engine
QA Contact: general → general
Flags: in-testsuite-
Flags: in-litmus-