Closed Bug 477425 - Opened 15 years ago, Closed 15 years ago

Tp memory use regression, midday on January 27, 2009

Categories

(Core :: JavaScript Engine, defect, P2)


Tracking


RESOLVED FIXED

People

(Reporter: dbaron, Assigned: graydon)

Details

(Keywords: fixed1.9.1, regression)

Attachments

(3 files)

A few days ago I noticed Tp RSS regressions (this is memory usage
during the Tp test) on Linux:
http://graphs.mozilla.org/graph.html#show=395139,395141,395170,911698,1431037

There's one jump midday on January 27 (that's the regression this
bug is about), and then another on the early morning of January 30,
reported inconsistently between the boxes (that was the talos
unthrottling, which increased the memory use number a little, not
too surprisingly, given time-based caches).

There's a similar memory use regression in the Tp3 (Mem-PB) test
(clearly something to do with memory use, although I don't know
what) on Windows XP:
http://graphs.mozilla.org/graph.html#show=395012,395026,395052,912152,1431881
in the same time period (it didn't show up on the *-fast machine,
though, or on trunk07, which also seems different from the rest).

The Mac numbers are *much* noisier than the Windows and Linux 
numbers, so it's hard to see if there's a regression there as well.
However, on the smoother 10.5 machines (which are less noisy than the
10.4 machines) it looks like there was an equivalent regression on
Mac.
Commits around the window are:
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-27+09%3A00%3A00&enddate=2009-01-27+14%3A00%3A00

*If* the times in the graph are actually the build start times (which they should be, but I'm not sure if they are), it's likely towards the end of that window, which means it's the tracemonkey merge.
Better graph URLs, showing just the primary machines:
http://graphs.mozilla.org/graph.html#show=395012,395026,395052&sel=1233065693,1233115747
http://graphs.mozilla.org/graph.html#show=395139,395141,395170&sel=1233050643,1233110707
which show that the Linux RSS number regressed by about 2%, and the Windows Mem-PB number by closer to 3.5%-4%.
Flags: blocking1.9.1?
Here's a link to Tp-WS on TraceMonkey showing a regression on 1/26

http://graphs.mozilla.org/graph.html#show=2542724,2539662,2539199&sel=1232980571,1233069052

and a regression for Tp-RSS during the same period on linux:

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232891285,1233228446
This looks like

http://hg.mozilla.org/tracemonkey/rev/72dd6ee4baf7

Graydon Hoare — Bug 473688 - provide a 2-level hash map (global, pc) -> tree, r=gal.
Looks like we leak treeInfos since we don't call Fragment's destructor any more (Fragments are placed into the JIT cache instead of malloc'd since this patch landed and are then freed in bulk). My bad. I reviewed that patch. Is it ok if Graydon takes care of this on Monday or should I try to patch it?
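Roughly the leak pattern being described, as a minimal hypothetical sketch (only TreeInfo, Fragment and vmprivate are names from this bug; everything else is illustrative, not the actual TraceMonkey code):

    // TreeInfo is still new'd on the malloc heap, but the Fragment that owns
    // it now lives inside JIT cache pages that are released in bulk, so
    // ~Fragment never runs and the TreeInfo is never deleted.
    struct TreeInfo {
        // per-tree metadata
    };

    struct Fragment {
        TreeInfo* vmprivate = nullptr;
        ~Fragment() { delete vmprivate; }  // no longer invoked once Fragments
                                           // are carved out of bulk-freed
                                           // cache pages
    };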
I concur this should block final. Easy fix though.
Priority: -- → P2
Flags: blocking1.9.1? → blocking1.9.1+
Good catch on the leaked treeinfos, and probably correct. Shall we allocate them inside the lirbuf as well, or try to hook up the dtors? I imagine the former.

(Also note: your earlier comment that this patch should never have caused increased memory use is not entirely correct. It just shouldn't have caused *much*. It lets the fragment cache survive 4x as many global mismatches before flushing, so it actually does bring up memory use a touch. By design. But only a touch. Flushing is so frequent anyways, I'd be surprised if it causes more than a few dozen KB resident increase.)
Assignee: nobody → graydon
Yeah, I would suggest moving the treeinfos into the cache (lirbuf). Also, maybe TraceRecorder should have an inline copy of tree info it uses for recording and then after a successful recording we copy it into the lirbuf and update the fragment->vmprivate field. This way we don't fill the cache with treeinfos of unsuccessful recordings.

I assume this regression shows up for talos. If there is JS that has multiple globals and extensive JS use I would assume we already max out cache use, which is capped at 16MB, so we shouldn't ever see an increase. Quite to the contrary, we no longer malloc fragments, so we should see a _decrease_ imo. Let's fix the bug and measure it :)
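Something like the following, as a rough sketch of that suggestion (hypothetical names and signatures, not the real TraceRecorder or LirBuffer API):

    #include <cstddef>
    #include <new>

    struct TreeInfo { /* per-tree metadata */ };

    // Assumed stand-in for the JIT cache's LIR buffer: a simple bump
    // allocator whose storage is thrown away wholesale on flush.
    struct LirBuffer {
        char storage[4096];
        std::size_t used = 0;
        void* alloc(std::size_t n) { void* p = storage + used; used += n; return p; }
    };

    struct Fragment { void* vmprivate = nullptr; };

    struct TraceRecorder {
        TreeInfo scratch;  // inline copy used while recording

        // Only a successful recording pays for cache space: the scratch copy
        // is placement-copied into the lirbuf and hooked onto the root
        // fragment. A failed recording just discards 'scratch'.
        void onSuccessfulCompile(LirBuffer& buf, Fragment& root) {
            void* mem = buf.alloc(sizeof(TreeInfo));
            root.vmprivate = new (mem) TreeInfo(scratch);
        }
    };

That way the 16MB-capped cache would only ever hold treeinfos for trees that actually compiled.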
I think you misdiagnosed this, actually. We're destroying treeinfos. They're owned by the recorder, not the fragments, and the recorder is still deleting them.

If we're leaking, it's something else. I'm digging presently.
No, we don't.

TreeInfos are owned by the recorder until successful compilation, at which point ownership transfers to the root fragment.
Attached patch: Fix the bug (Splinter Review)
Well, the approach you're describing is perhaps ideal, but it's much more intrusive to implement; lifecycles have to shift around to match, as TreeInfos currently come into existence before TraceRecorders. And the lifecycle of UnstableExits still wouldn't match either.

For the time being, this is a simple enough modification that is correct, at the expense of possibly filling the LIR buffer faster than we'd ideally want: it just moves the TreeInfo allocation (along with those for UnstableExits) into the lirbuf and removes all the explicit deletes of same. Seems to work fine.
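In effect, the change is just this (simplified sketch; allocInLirBuffer is a hypothetical stand-in for the real nanojit allocation call):

    #include <cstddef>
    #include <new>

    struct TreeInfo { /* ... */ };

    // Assumed bump allocation out of the current lirbuf.
    void* allocInLirBuffer(std::size_t n) {
        static char buf[1 << 16];
        static std::size_t used = 0;
        void* p = buf + used;
        used += n;
        return p;
    }

    TreeInfo* newTreeInfo() {
        // Before: return new TreeInfo();  ...with a matching explicit delete.
        // After:  placement-new into the lirbuf; the explicit deletes go away
        // because the whole buffer is reclaimed when the cache is flushed, at
        // the cost of consuming lirbuf space a little faster.
        return new (allocInLirBuffer(sizeof(TreeInfo))) TreeInfo();
    }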
Attachment #361351 - Flags: review?(gal)
Comment on attachment 361351 [details] [diff] [review]
Fix the bug

Looks good. Can you please remove the onDestroy hack too? (r=me with that)

We can improve the cache use in a separate bug. You are right, let's fix the regression with as little code motion as possible.
Attachment #361351 - Flags: review?(gal) → review+
Memory use appears to have *increased* with this change. I feel ill. Will watch the talos numbers a couple more times before ... backing out? 

I don't really see how this should be possible, unless we are perhaps getting "closer" to the 16mb lirbuf limit than we ever did before, so we're seeing the ceiling of our own use creep up towards 16mb? I don't know what else to propose.
Whiteboard: fixed-in-tracemonkey
Ho ho. So here is probably our culprit: the fragmento's page-allocation code is indented in such a way as to suggest that it only doubles its page set when it exhausts it. But its indentation is misleading. The actual behavior is "double the page set on each page allocation". So the fact that we're performing a *few* extra page-based allocations under the initial patch -- having moved TreeInfos from the malloc heap to fragmento pages -- means we're *doubling memory* a few more times than before.

It doesn't take many doublings to hit our ceiling, and indeed any nontrivial test gets close if you actually print the fragmento's page size (e.g. even sunspider gets to the 8mb point after 2 micro-benchmark loops). It only has to request 16 *used* pages to have allocated the full 4096-page set.

Fixing this causes fragmento's memory behavior to drop back down to 9 doublings (~511 pages, 2mb) on ss and our trace-test, though v8 still hits the full 16mb.
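For reference, a minimal sketch of that failure mode (names and numbers are made up; this is not the actual Fragmento code):

    #include <cstdio>
    #include <cstddef>

    struct PagePool {
        std::size_t owned = 16;      // pages currently reserved by the pool
        std::size_t freePages = 16;  // pages not yet handed out

        void grow() { freePages += owned; owned *= 2; }  // double the reservation

        void allocPage() {
            // Intended: double only when the free list is exhausted, i.e.
            //   if (freePages == 0)
            //       grow();
            // What the indentation hid: grow() sits outside the if-body, so
            // the pool doubles on *every* page allocation.
            if (freePages == 0)
                std::fprintf(stderr, "free list exhausted\n");
                grow();  // unconditional, despite how it is indented
            --freePages;
        }
    };

Modern compilers would flag this shape with -Wmisleading-indentation, but that warning didn't exist in 2009.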
Attachment #361531 - Flags: review?(gal)
Attachment #361531 - Flags: review?(gal) → review?(jorendorff)
Comment on attachment 361531 [details] [diff] [review]
Fix an uncovered, even worse bug, that is not my fault!

You've got a bunch of trailing whitespace after your `}`.  r+ with that fixed.
Attachment #361531 - Flags: review?(jorendorff) → review+
http://hg.mozilla.org/tracemonkey/rev/8590bf3b592c , now to see if talos feels generous.
cc'ing stuart. He was always curious how much JIT cache we need for SS.
Confusion upon confusion. Talos shows that private bytes have gone down quite
substantially:

http://graphs.mozilla.org/graph.html#show=2539196,2542722,2539660

While simultaneously claiming that working set / RSS is still up, as it was
last night:

http://graphs.mozilla.org/graph.html#show=2539199,2539662,2542724

Did I perhaps collide with someone else's regressing merge (say, shared code
increase)? Or ... anyone have any wild guesses about how private bytes can drop
while working set expands?
http://hg.mozilla.org/mozilla-central/rev/14d5479a5c73
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
*Sigh* ok, well, I spent the first 2/3 of the day chasing ghosts in the Fragmento's memory management system (which is somewhat silly) before giving up and doing what I should have done when this was first reported: talking to valgrind's leak checker and chasing the number down to 0.

Turns out there remained a fair number of hard leaks due to extra fiddly sub-structures that dangle off the things we moved into the LIR buffers. Enough that it quickly gets tiresome to track them all down when there are perfectly good destructors for them already.

So this patch just reverts the design decision of "store fragments and treeinfos in the LIR buffer". Back on the malloc heap they go, as before. We still keep the non-Fragmento hashtable keyed by multiple globals, and all that good stuff. Appears to help with RSS on real pageset tests I'm running locally.

Valgrind is (more) happy with this. We still leak a couple of JS structures (property tree bits?) but nothing inside the tracer or nanojit.
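The reverted ownership shape described above, as a minimal hypothetical sketch (not the actual patch):

    struct TreeInfo { /* ... */ };

    struct Fragment {
        TreeInfo* vmprivate = nullptr;
    };

    void attachTree(Fragment& root) {
        root.vmprivate = new TreeInfo();   // back on the malloc heap
    }

    void destroyTree(Fragment& root) {
        // Explicit delete restored: the TreeInfo's own destructor now runs,
        // which is what cleans up the fiddly sub-structures that were leaking
        // when it sat in a bulk-freed lirbuf.
        delete root.vmprivate;
        root.vmprivate = nullptr;
    }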
Attachment #361919 - Flags: review?(gal)
Comment on attachment 361919 [details] [diff] [review]
Back out poorly-implemented lirbuf placement-new allocation scheme

In the long term we should still go for the lirbuf allocation, but there's no need for that in 3.1.
Attachment #361919 - Flags: review?(gal) → review+
That took a swipe at the problem, anyway. Seems to recover the amount lost by my initial Jan 27 commit:

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136

It still looks like we haven't made up all the distance since the pre-Jan-27 baseline though; there's a secondary culprit that landed early in the day on Jan 30 that I haven't identified yet. Will continue hunting.
(In reply to comment #24)
> That took a swipe at the problem, anyway. Seems to recover the amount lost by
> my initial Jan 27 commit:
> 
> http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136
> 
> It still looks like we haven't made up all the distance since the pre-Jan-27
> baseline though; there's a secondary culprit that landed early in the day on
> Jan 30 that I haven't identified yet. Will continue hunting.

On Jan 30, the Talos boxes were unthrottled. This makes the tests run twice as fast, and our time-based caches (images, textruns, etc.) are probably running a little bigger.
Hmm, plausible. I wouldn't want to leap to that conclusion and overlook a real bug though.

On the other hand, your notion is at least supported by the evidence: Tracemonkey, 1.9.1 and 1.9.2 branches all show a simultaneous jump in RSS.

http://graphs.mozilla.org/graph.html#show=2539324,2542679,2568136&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=2418072,2421544,2424797&sel=1232787589,1233647313
http://graphs.mozilla.org/graph.html#show=395170,395141,395139&sel=1232787589,1233647313

I don't suppose you happened to do a merge across all 3 branches that day? I'll check the logs...
Confusing. Some data points:

  - The unthrottling happened between 06:00 - 09:00 PST
  - The regression happened around 01:00 PST, give or take. This contradicts the hypothesis.
  - The regression is not visible in the staging graph2 servers, which came online unthrottled and cover the same range. This supports the hypothesis.
  - Even with a very wide pushlog window, there are no changes that landed on Tracemonkey, 1.9.1 and m-c simultaneously. This supports the hypothesis.

Reference pushlogs covering a day on either side: 
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/tracemonkey/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00
http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?startdate=2009-01-29+00%3A00%3A00&enddate=2009-01-31+00%3A00%3A00

Reference staging graphs2 window:
http://graphs-stage2.mozilla.org/graph.html#tests=[{%22test%22:%225%22,%22branch%22:%221%22,%22machine%22:%225%22}]&sel=1233157419,1233539428

Reference announcement of downtime for unthrottling:
http://blog.mozilla.com/bhearsum/archives/76

I'm a little perplexed. It *could* be the unthrottling, but only if the timing is significantly different than reported. Any thoughts?
(In reply to comment #28)
> 
> I'm a little perplexed. It *could* be the unthrottling, but only if the timing
> is significantly different than reported.

It looks like that is the case. You can spot the first unthrottled build by looking at a performance test (say, SunSpider).
(In reply to comment #28)
>   - The unthrottling happened between 06:00 - 09:00 PST

In real time.

>   - The regression happened around 01:00 PST, give or take.

In the time of the build being tested.

> This contradicts
> the hypothesis

So I don't think there's a contradiction. Especially since, at least when I looked at the graphs when I filed the bug, different builds actually showed inconsistent ranges for when the "regression" was. That's consistent with a change associated with real time rather than the build timestamp.
Yeah, ok. I was a little suspicious since the times appeared to be a bit more spread out, but only if you look at a group of results chosen randomly; turns out that's just multi-machine slew. On the same machine, you do get a simultaneous movement of RSS-up, sunspider-down:

http://graphs.mozilla.org/graph.html#show=395162&sel=1233279285,1233343445
http://graphs.mozilla.org/graph.html#show=395170&sel=1233266413,1233385107

I guess that's "case closed" then. Testing artifact.
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: General → JavaScript Engine
QA Contact: general → general
Flags: in-testsuite-
Flags: in-litmus-