Closed Bug 416628 Opened 16 years ago Closed 16 years ago

O(n^2) blowup due to overlong cx->tempPool arena list

Categories

(Core :: JavaScript Engine, defect, P1)


Tracking


RESOLVED FIXED
mozilla1.9beta4

People

(Reporter: brendan, Assigned: brendan)

References

Details

(Keywords: perf, verified1.8.1.13)

Attachments

(3 files, 3 obsolete files)

The attached shows a severe bug, evident in Shark with 39.6% self and total time under JS_ArenaRelease called from js_EmitTree (the TOK_FUNCTION case). It involves a huge top-level statement whose guts allocate JSParseNodes from cx->tempPool (these stay live, kept alive via a freelist), interleaved with the LIFO allocations done in tempPool by js_EmitTree's TOK_FUNCTION case.

The "first out" part of LIFO means JS_ARENA_RELEASE (JS_ArenaRelease) from js_EmitTree spends a huge amount of time linear-searching for the last (or nearly last) arena. This compounds quadratically over the testcase attached here, since it has many lambda functions under the one huge top-level if(true){...} statement.
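To make the quadratic pattern concrete, here is a minimal sketch of a singly linked arena pool with a scan-from-the-head release (hypothetical names; not the actual jsarena.c code). Each release back to a mark near the end of the list visits nearly every arena, so one release per emitted function costs O(n^2) overall:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { ARENA_SIZE = 1024 };   /* fixed, load-insensitive, as in tempPool */

typedef struct Arena {
    struct Arena *next;       /* singly linked: no way to walk backward */
    char *avail;              /* next free byte */
    char base[ARENA_SIZE];
} Arena;

typedef struct Pool {
    Arena *first;
    Arena *current;
} Pool;

static Arena *new_arena(void) {
    Arena *a = malloc(sizeof *a);
    a->next = NULL;
    a->avail = a->base;
    return a;
}

void pool_init(Pool *p) {
    p->first = p->current = new_arena();
}

char *pool_alloc(Pool *p, size_t n) {
    if (p->current->avail + n > p->current->base + ARENA_SIZE) {
        Arena *a = new_arena();
        p->current->next = a;
        p->current = a;
    }
    char *r = p->current->avail;
    p->current->avail += n;
    return r;
}

/* Release back to `mark`: linear scan from the head to find the arena
 * containing the mark. Returns the number of arenas visited. */
long pool_release(Pool *p, char *mark) {
    long visited = 0;
    for (Arena *a = p->first; a; a = a->next) {
        visited++;
        if (mark >= a->base && mark <= a->base + ARENA_SIZE) {
            a->avail = mark;
            a->next = NULL;   /* trailing arenas would be recycled here */
            p->current = a;
            break;
        }
    }
    return visited;
}
```

With many parse-node arenas preceding the emitter's mark, `pool_release` walks essentially the whole list every time.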

/be
(In reply to comment #0)
> The "first out" part of LIFO means JS_ARENA_RELEASE (JS_ArenaRelease) from
> js_EmitTree spends a huge amount of time linear-searching for the last (or
> nearly last) arena.

Nearly last, of course -- if the mark were in the last arena, which would be cx->stackPool.current, then the JS_ARENA_RELEASE macro would take its fast path and not call JS_ArenaRelease. 

One might want doubly-linked arenas, to enable searching from top of stack (current in pool) backward. Or you could complain about the fixed and relatively small net size (1024) of each tempPool arena, which adapts not at all to load.

But I think the primal bug here is the use of tempPool by parsing and codegen, which interleave by top-level statement in js_CompileScript (which saves lots of memory compared to parsing the entire script into an AST and then emitting). Given the AST node (JSParseNode) load preceding the mark and release for each emitted function in the big AST, there's no good ending with one pool for both workloads.

Using cg->codePool instead of &cx->tempPool to allocate cg2 in the js_EmitTree TOK_FUNCTION case fixes the bug (the profile is then top-heavy in js_GetToken, at 13% self/total).

/be
Status: NEW → ASSIGNED
Priority: -- → P1
Flags: blocking1.9+
Attached patch fix (obsolete) — Splinter Review
Attachment #302369 - Flags: review?(igor)
Flags: wanted1.8.1.x+
Flags: blocking1.8.1.13?
(In reply to comment #0)
> The attached shows a severe bug, evident in Shark with 39.6% self and total
> time under JS_ArenaRelease called from js_EmitTree (the TOK_FUNCTION case),

By "the attached," do you mean the patch, or are you referring to something that wasn't attached?
Attached patch 1.8 branch patch — Splinter Review
Easy to back-port, but the patch did not apply so requesting review.

/be
Attachment #302370 - Flags: review?(igor)
Attachment #302370 - Flags: approval1.8.1.13?
Attached file test .js file, gzipped
Thanks, reed -- I had this attached in the new bug page's attachment form, but it did not "take". Bugzilla bug? I shouldn't have had to click extra buttons, and my Firefox saved the "test .js file" description, so I wasn't imagining typing that!

/be
(In reply to comment #5)
> Created an attachment (id=302371) [details]
> test .js file, gzipped
> 
> Thanks, reed -- I had this attached in the new bug page's attachment form, but
> it did not "take". Bugzilla bug?

I bet I did not attach the gzipped version, and the unzipped file is huge. From this new-bug page, that attempt was rewarded by no red screen of rejection from bugzilla -- so definite bugzilla bug.

/be
(In reply to comment #1)
> But I think the primal bug here is the use of tempPool by parsing and codegen,

I fail to see this. Code generator allocations follow LIFO order, so if this is slow, then the culprit is the LIFO implementation. Using a different arena for codegen is just sweeping things under the carpet.

It is OK to do the latter on 1.8 branch as a safe fix, but the trunk asks for a proper fix. I.e. one either fixes arena's internals to make the mark-release O(1) in time on average or stop using arenas for the emitter or parser. Given that the parser never releases nodes, LIFO is not a good choice for it in any case. Using a linked list of parse node chunks can be a better idea.  
(In reply to comment #7)
> (In reply to comment #1)
> > But I think the primal bug here is the use of tempPool by parsing and codegen,
> 
> I fail to see this. Code generator allocations follow LIFO order, so if this is
> slow, then the culprit is the LIFO implementation.

The emitter uses LIFO order, but the parser does not because nodes are held in the AST as the emitter pushes and pops. It's true that there is a global last in first out order, though. My point is that the two phases are not using tempPool in the same way.

> Using a different arena for
> codegen is just sweeping things under the carpet.

Or it's using the tools at hand without overcomplicating arena pools to cope with a case we can avoid ;-).

> It is OK to do the latter on 1.8 branch as a safe fix,

Please review accordingly, want to get this fixed there.

> but the trunk asks for a
> proper fix. I.e. one either fixes arena's internals to make the mark-release
> O(1) in time on average or stop using arenas for the emitter or parser. Given
> that the parser never releases nodes,

But it does, at the end ("first out" happens all at once, at the last line in js_FinishParseContext).

> LIFO is not a good choice for it in any
> case. Using a linked list of parse node chunks can be a better idea.  

That sounds like an arena pool -- a linked list of chunks -- and it is morally equivalent to this fix. You are proposing another ad-hoc mechanism, formally equivalent to arena pools, that uses separate storage from tempPool so the emitter can keep using tempPool for cg2. Rename terms and unify, and it is the same as the patch in this bug. It might be more aesthetic, but it would produce exactly the same machine code, with storage addresses renamed to swap tempPool for cg->codePool and the new ad-hoc linked list of chunks for tempPool.

Proposal for making release O(1) without adding too much overhead?

/be
(In reply to comment #8)
> That sounds like an arena pool -- a linked list of chunks

No: the parse nodes do not require a LIFO structure; they just need something that can be allocated in chunks and then released all at once. Using an arena is overkill given its extra support for pop and mark operations.

Yes, it would be an ad-hoc structure, but one that can be implemented trivially and that can use, for example, exponential growth for chunk allocation. And given that the emitter also has this grow-for-a-while-then-release-at-once pattern, it may not be so ad hoc.
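A sketch of such a release-at-once chunk allocator with exponential chunk growth (hypothetical names; this is not code from any patch in this bug). There are no marks or pops: allocation bumps a cursor, the whole chunk list is freed in one pass at the end, and doubling chunk sizes keeps the chunk count at O(log N) for N nodes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Chunk {
    struct Chunk *next;
    size_t size, used;
    char data[];            /* flexible array member (C99) */
} Chunk;

typedef struct NodePool {
    Chunk *head;            /* most recently added chunk */
    size_t next_size;       /* doubles on each new chunk */
} NodePool;

void npool_init(NodePool *p) {
    p->head = NULL;
    p->next_size = 256;
}

void *npool_alloc(NodePool *p, size_t n) {
    if (!p->head || p->head->used + n > p->head->size) {
        size_t sz = p->next_size > n ? p->next_size : n;
        Chunk *c = malloc(sizeof *c + sz);
        c->next = p->head;
        c->size = sz;
        c->used = 0;
        p->head = c;
        p->next_size *= 2;  /* exponential growth */
    }
    void *r = p->head->data + p->head->used;
    p->head->used += n;
    return r;
}

/* "First out" happens all at once, as at the end of parsing.
 * Returns the number of chunks freed. */
long npool_free_all(NodePool *p) {
    long chunks = 0;
    while (p->head) {
        Chunk *c = p->head;
        p->head = c->next;
        free(c);
        chunks++;
    }
    return chunks;
}
```

Ten thousand 32-byte nodes land in about a dozen chunks, and teardown is one short list walk.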
Attachment #302370 - Flags: review?(igor) → review+
(In reply to comment #8)
> Proposal for making release O(1) without adding too much overhead?

Instead of storing just the mark, the code could also store the current arena. That is enough to implement a release operation that puts all freed arenas onto a free list in O(1) time.
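A sketch of this fat-mark idea (hypothetical names; not the patch that landed). Because the pool's current arena is the tail of the chain being freed, the whole freed chain can be spliced onto a free list with two pointer writes, so release needs no list scan at all:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { SZ = 1024 };

typedef struct Arena {
    struct Arena *next;
    char *avail;
    char data[SZ];
} Arena;

typedef struct Pool {
    Arena *first, *current, *freed;   /* freed = free list of arenas */
} Pool;

static Arena *grab(Pool *p) {
    Arena *a;
    if (p->freed) { a = p->freed; p->freed = a->next; }  /* reuse */
    else a = malloc(sizeof *a);
    a->next = NULL;
    a->avail = a->data;
    return a;
}

void pool_init(Pool *p) {
    p->freed = NULL;
    p->first = p->current = grab(p);
}

char *pool_alloc(Pool *p, size_t n) {
    if (p->current->avail + n > p->current->data + SZ) {
        Arena *a = grab(p);
        p->current->next = a;
        p->current = a;
    }
    char *r = p->current->avail;
    p->current->avail += n;
    return r;
}

/* The fat mark: the cursor plus the arena it lies in. */
typedef struct FatMark {
    Arena *arena;
    char *avail;
} FatMark;

FatMark pool_mark(Pool *p) {
    FatMark m = { p->current, p->current->avail };
    return m;
}

/* O(1) release: no scan. The old current arena is the tail of the
 * chain being freed, so the chain is spliced onto the free list whole. */
void pool_release(Pool *p, FatMark m) {
    if (m.arena->next) {
        p->current->next = p->freed;
        p->freed = m.arena->next;
    }
    m.arena->next = NULL;
    m.arena->avail = m.avail;
    p->current = m.arena;
}
```

The cost is a two-word mark everywhere a `void *mark;` lives today, which is exactly the codesize trade-off discussed below.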
Agreed, we use LIFO arena pools for stacks *and* for "release everything all at once when done". The latter case could use a better growth strategy, as noted in the second paragraph of comment 1.

But I would still like to get this patch in and file a followup bug. My motivation is time to Firefox 3 release and getting the most perf bang for buck. Reworking the parsenode allocator and perhaps other such things to use a new, better-growing, release-at-end data structure is not likely to pay off, compared to other work.

This is an argument about opportunity costs. Igor, what do you say?

/be
There are about 20 void *mark; variables in the codebase. Switching those to an (arena, cursor) pair could be done, and it would cure this bug. It might have other good effects, but not on our bug radar yet. Want to make a patch? I'm out for the rest of the day, and we both have other bugs.

/be
(In reply to comment #12)
> There are about 20 void *mark; variables in the codebase. Switching those to an
> (arena, cursor) pair could be done, and it would cure this bug. It might have
> other good effects, but not on our bug radar yet. Want to make a patch?

I do not have time for that.

To Brian Crowder: could you look at this? 

The idea of a specialized allocator for JSParseNodes and other growing-allocation cases is good, but it is for a separate bug.

The only issue here is the expedient fix (the attached patch, and its backported version for the 1.8 branch) vs. the O(1)-release fix; the choice turns on expedience and the runtime/codesize costs of a fatter "mark" representation.

If we had aligned arenas, the (arena,cursor) pair would be one word. That looks to me like the way to go for large-ish power-of-two sized arenas anyway. It would also cure bug 408921. We have a page allocator under the GC already. Comments?
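A sketch of the aligned-arena trick (assumptions: C11 aligned_alloc stands in for the GC's page allocator, and 4096-byte arenas). With power-of-two-sized, power-of-two-aligned arenas, the arena half of the (arena, cursor) pair can be recomputed from the cursor by masking, so a mark stays one word:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { ARENA_BYTES = 4096 };   /* power of two, page sized */

/* Any pointer into an aligned arena identifies that arena: clear the
 * low bits. The fat (arena, cursor) mark collapses back to one word. */
void *arena_of(void *cursor) {
    return (void *)((uintptr_t)cursor & ~(uintptr_t)(ARENA_BYTES - 1));
}

/* aligned_alloc (C11) plays the role of a real page allocator here. */
void *new_aligned_arena(void) {
    return aligned_alloc(ARENA_BYTES, ARENA_BYTES);
}
```

The constraint, noted in comment 28 below, is that arenas must then be at least a page in size.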

/be
Comment on attachment 302369 [details] [diff] [review]
fix

The patch fixes the problem and is small.
Attachment #302369 - Flags: review?(igor) → review+
I filed bug 416719 about release-at-once data structures and bug 416717 about O(1) arena release.
Attachment #302369 - Attachment is obsolete: true
Attachment #302467 - Flags: review?(igor)
Keywords: perf
Comment on attachment 302467 [details] [diff] [review]
can't assert about guardMark since cg2mark is not in tempPool

>Index: jsemit.c
>         js_FinishCodeGenerator(cx, cg2);
>-        JS_ASSERT(js_GuardedArenaMark(&cx->tempPool, cg2mark,
>-                                      cg->treeContext.
>-                                      parseContext->lastAllocMark));

Since the emitter does not use the tempPool, there is no need for lastAllocMark and it can be removed together with no longer applicable comments and code.
Attachment #302467 - Flags: review?(igor)
Attachment #302467 - Attachment is obsolete: true
Attachment #302500 - Flags: review?(igor)
Attachment #302500 - Flags: review?(igor) → review+
Fixed on trunk:

js/src/jsarena.c 3.37
js/src/jsarena.h 3.28
js/src/jsemit.c 3.302
js/src/jsparse.c 3.333
js/src/jsparse.h 3.59

/be
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
cc:ing Robin for thoughts on O(1) arena-release changes.
Flags: in-testsuite?
(In reply to comment #21)
> cc:ing Robin for thoughts on O(1) arena-release changes.

The bug for that, per comment 16, is bug 416717.

/be
Comment on attachment 302370 [details] [diff] [review]
1.8 branch patch

approved for 1.8.1.13, a=dveditz for release-drivers
Attachment #302370 - Flags: approval1.8.1.13? → approval1.8.1.13+
Flags: blocking1.8.1.13? → blocking1.8.1.13+
Fixed on the 1.8 branch:

js/src/jsemit.c 3.128.2.75

/be
Keywords: fixed1.8.1.13
Attached file attempt at test (obsolete) —
The test in attachment 302371 [details] is static and not conducive to an O(N^2)-type test. I tried to make something that could be called repeatedly with a size parameter to show the behavior, but couldn't get it to show anything significant. A function that takes a size input and scales as O(N^2) before this bug was fixed and as O(N) afterwards would be helpful. Otherwise, I'm moving on to other tests that need to be added.
Comment on attachment 303984 [details]
attempt at test

This had a mistake where it should have declared the code var in the createCode function as in 

function createCode(i)
{
  var code = '';

but that doesn't help as far as I can see. I need more coffee, but if you have a suggested testcase please attach.
Attachment #303984 - Attachment is obsolete: true
never mind, user error. 

/cvsroot/mozilla/js/tests/js1_5/Regress/regress-416628.js,v  <--  regress-416628.js
initial revision: 1.1
Flags: in-testsuite? → in-testsuite+
The test is performance-related and not stable when run on loaded virtual machines. Adding it to the performance-dependent test lists.

/cvsroot/mozilla/js/tests/performance-1.8.0.tests,v  <--  performance-1.8.0.tests
new revision: 1.3; previous revision: 1.2

/cvsroot/mozilla/js/tests/performance-1.8.1.tests,v  <--  performance-1.8.1.tests
new revision: 1.3; previous revision: 1.2

/cvsroot/mozilla/js/tests/performance-1.9.0.tests,v  <--  performance-1.9.0.tests
new revision: 1.3; previous revision: 1.2

(In reply to comment #14)
> If we had aligned arenas, the (arena,cursor) pair would be one word. That looks
> to me like the way to go for large-ish power-of-two sized arenas anyway. It
> would also cure bug 408921. We have a page allocator under the GC already.
> Comments?

Aligned arenas would be a good idea, but if implemented with the page allocator, would require that arenas be at least 4096 bytes in size.  Currently, they are often smaller.

Can someone point me to the page allocator in js?

I'll make my other comments in bug 416717, the O(1) release bug.
v 1.8.1 manually testing.
Note that js1_5/Regress/regress-416628.js passes on 1.8.1 (linux, mac, and windows, browser and shell) and on 1.9.0 (linux and mac browser/shell, and windows shell), but fails in the windows browser on 1.9.0. The shape of the curve starts linearly, then suffers an abrupt jump, and then appears to become a shallow quadratic.

windows xp vm 1.9.0 browser

Size   Time (debug)   Time (opt)
 100             54           24
 200            107           49
 300            157           70
 400            821           95
 500           1040          121
 600           1165          142
 700           1315          733
 800           1392          767
 900           1733          884

verifying 1.9.0. 
bc, that's disturbing since there is no platform-dependent code involved in this patch. Are you using a jemalloc-based windows browser? How about windows js shell (jemalloc or default MS malloc)?

/be
I believe I am using default MS malloc on both of these. I definitely don't have jemalloc enabled for the browser and don't know how to do jemalloc for the shell.

I _should have been_ testing both with jemalloc, right? 
(In reply to comment #31)
> The shape of the curve starts linearly then
> suffers an abrupt jump and then appears to become a shallow quadratic. 
> Size    Time    Time 
>        (debug)  (opt)
> 100     54      24
> 200     107     49
> 300     157     70
> 400     821     95
> 500     1040    121
> 600     1165    142
> 700     1315    733
> 800     1392    767
> 900     1733    884

I have encountered problems like this before. Your case *may* not be due to an algorithm that is not O(n), but rather to the effects of the memory hierarchy. In other words, the algorithm may be O(n) as desired, but the execution times for increasingly large n may not be linear.

This could happen when a particular level of the memory hierarchy (probably the L2 cache in this case) becomes full, and spills into the next level, which is slower.  When this is happening, one often sees linear execution times but with big jumps in the slope of the line.  The jumps correspond to the spilling.

In your data above, we can see that some level of the memory hierarchy fills up after about 200ms of execution, in both the debug and optimized versions. The execution time curve is not a shallow quadratic but two linear segments: the times are linear(ish) both before and after the 200ms point.

One could see this by timing an algorithm which is "known linear", like filling an array of size n with random numbers, and then dividing each of the numbers by pi, or whatever.  For increasingly large n, you will see that there are "jumps" in the execution time, like in your data above.  That is even though the algorithm is obviously O(n).
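A sketch of that known-linear baseline (illustrative only; not one of the tests in this bug). The loop is plainly O(n), yet the measured nanoseconds per element typically jump when the array outgrows a cache level:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fill a[0..n) with random numbers divided by pi; obviously O(n). */
double fill_and_divide(double *a, size_t n) {
    const double pi = 3.14159265358979323846;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        a[i] = (double)rand() / pi;
        sum += a[i];
    }
    return sum;
}

/* Time the O(n) loop for doubling n. ns/elem would be flat under a
 * pure cost model, but jumps where the working set spills a cache. */
void sweep(void) {
    for (size_t n = (size_t)1 << 16; n <= (size_t)1 << 22; n <<= 1) {
        double *a = malloc(n * sizeof *a);
        clock_t t0 = clock();
        double s = fill_and_divide(a, n);
        double ns = (double)(clock() - t0) * 1e9 / CLOCKS_PER_SEC / (double)n;
        printf("n=%8zu  ns/elem=%6.2f  (sum=%g)\n", n, ns, s);
        free(a);
    }
}
```

Plotting ns/elem against n shows the step changes without any quadratic term.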

More powerful/expensive machines (i.e. not PCs) tend to be better at hiding the effects of the memory hierarchy, IMO.