Closed Bug 939088 Opened 11 years ago Closed 11 years ago

massive regression in testClosureAssign

Categories

(Core :: JavaScript Engine, defect)

x86_64
macOS
defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla28
Tracking Status
firefox25 --- unaffected
firefox26 --- unaffected
firefox27 --- wontfix
firefox28 --- verified
firefox29 --- verified
firefox-esr17 --- unaffected
firefox-esr24 --- unaffected
b2g18 --- unaffected
b2g-v1.1hd --- unaffected
b2g-v1.2 --- unaffected

People

(Reporter: fb+mozdev, Assigned: bhackett1024)

References


Details

(Keywords: perf, regression)

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #517164 +++

(In reply to Florian Bender from Bug 517164 comment #9)
> "testClosureAssign" seems regressed (first bad: Aurora), this needs a few
> more runs and bisection.

Last good nightly: 2013-10-11
First bad nightly: 2013-10-12

Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6b101d4c6d24&tochange=73f37c7a3860


Does anything stand out? If not, can anyone bisect this further, please?
I only bisected on Mac, so resetting the platform info for now, though I suspect the numbers are the same for other platforms.
OS: All → Mac OS X
Hardware: All → x86_64
(In reply to Florian Bender from comment #0)
> Does anything stand out? If not, can anyone bisect this further, please?

Yes, bug 880085 seems related. Requesting needinfo from Brian.
Flags: needinfo?(bhackett1024)
Attached patch bandaid (Splinter Review)
This test is pretty silly but does point to a perf bottleneck that could affect large applications using run-once closures.  It generates a run-once closure with 14000 variables and an inner function that assigns to each of those variables once.  That inner function is called 150 times, so we never try to Ion compile it.  Baseline does a fine job generating ICs for it, and 90% of the time taken here is spent before we even baseline compile, almost all of it warming up in the interpreter.  That time has increased because the interpreter needs to keep track of type side effects on the object (since bug 880085 unbroke run-once closure optimizations), which requires it to compute the property name being accessed.  That computation incurs a linear search on the call object shape, which takes a while since there are 14000 entries in it.
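For readers unfamiliar with this code, here is a minimal self-contained C++ model (hypothetical types, not SpiderMonkey's actual Shape or ScopeCoordinateName code) of why recovering a variable's name from its slot is O(n) in the number of closed-over variables:

#include <cstdint>
#include <string>
#include <vector>

struct FakeProperty {
    std::string name;   // hypothetical stand-in for the property's atom
    uint32_t slot;      // slot index in the call object
};

// With ~14000 closed-over variables, every interpreted SETALIASEDVAR that has to
// update type information pays for a scan like this to recover the name.
const std::string* lookupNameForSlot(const std::vector<FakeProperty>& shapeEntries,
                                     uint32_t slot) {
    for (const FakeProperty& prop : shapeEntries) {  // linear search over the shape entries
        if (prop.slot == slot)
            return &prop.name;
    }
    return nullptr;
}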

The attached patch is a bandaid to avoid the linear search in cases where we know the call object's types don't need to be updated, i.e. when no code under it has been compiled.  This improves the testClosureAssign() case from 1560 to 95 for me, but won't have much effect in more diffuse large applications where we baseline/Ion compile some inner code and delazify the type but still do a lot of interpreting in cold code.  I'd like to do a follow-up patch to add a cache in the runtime mapping slots to names for singleton scope objects, similar to the source note cache.  (This seems a better use of memory than attaching the name to every SETALIASEDVAR op.)
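As a rough illustration of the follow-up idea (hypothetical names and types, not the actual attached patch), a single-entry runtime cache mapping slots to names for the most recently used singleton scope object could look roughly like this:

#include <cstdint>
#include <string>
#include <unordered_map>

struct ScopeNameCache {
    const void* scopeObject = nullptr;                 // which scope object the cache currently covers
    std::unordered_map<uint32_t, std::string> names;   // slot -> name

    // fill is a caller-supplied function that enumerates the scope object's
    // (slot, name) pairs once; it stands in for walking the shape lineage.
    template <typename FillFn>
    const std::string& nameForSlot(const void* scope, uint32_t slot, FillFn fill) {
        if (scope != scopeObject) {        // single-entry cache: purge and refill
            names.clear();
            fill(names);                   // O(n) once per scope object
            scopeObject = scope;
        }
        return names.at(slot);             // O(1) on subsequent accesses
    }
};

The trade-off is that the O(n) fill happens once per scope object rather than once per access, at the cost of purging whenever a different scope object is consulted.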
Assignee: nobody → bhackett1024
Attachment #832936 - Flags: review?(luke)
Flags: needinfo?(bhackett1024)
Attached patch cache (Splinter Review)
Attachment #8333581 - Flags: review?(luke)
Instead of adding a whole new mechanism that, as you said, doesn't really solve the problem for realistic workloads, can we just blow another 4 bytes on JOF_SCOPECOORD ops and embed the atom?

Actually, with bug 932276, the only remaining use of the 4 byte blockIndex currently embedded in JOF_SCOPECOORD ops is from the Ion/Baseline JITs, so we could use the O(n) GetBlockChainAtPC added by bug 932276 and remove the blockIndex.
(In reply to Luke Wagner [:luke] from comment #5)
> Instead of adding a whole new mechanism that, as you said, doesn't really
> solve the problem for realistic workloads, can we just blow another 4 bytes
> on JOF_SCOPECOORD ops and embed the atom?
> 
> Actually, with bug 932276, the only remaining use of the 4 byte blockIndex
> current embedded in JOF_SCOPECOORD ops is from Ion/Baseline JITs, so we
> could use the O(n) GetBlockChainAtPC added by bug 932276 and remove the
> blockIndex.

I think the cache mechanism will work fine for realistic workloads; it's the first patch that's just a bandaid.  Embedding the atoms directly will be a good deal less efficient on memory than the cache; it's not just the extra four bytes for the atom index, but also (usually) a new word-sized entry in the script's atoms table.
(In reply to Brian Hackett (:bhackett) from comment #6)
comment 5 explains how to avoid increasing the bytecode size.  To avoid increasing atoms-table size, we can use the aliased var's index in Bindings::bindingArray.  Before doing either, though, we should see how prevalent aliased-var opcodes even are or how many extra atoms-table entries would be added; this might only be a few KB we're talking about here.
(In reply to Luke Wagner [:luke] from comment #7)
> (In reply to Brian Hackett (:bhackett) from comment #6)
> comment 5 explains how to avoid increasing the bytecode size.

Well, I'm confused now.  Comment 5 sounds like an independent optimization which would allow shortening the bytecode length by 4 bytes; if we did that and also used this cache then the bytecode size would be improved over the current state of things.
(In reply to Brian Hackett (:bhackett) from comment #8)
> Well, I'm confused now.  Comment 5 sounds like an independent optimization

If we removed the block index *without* embedding the var name's index in the opcode then we'd add a second O(n) lookup in ScopeCoordinateName with the same potential for pathological performance we're seeing here.  If we remove the block index *and* add the atom index, then ScopeCoordinateName won't need to call GetBlockChainAtPC so ScopeCoordinateName will become O(1).
(In reply to Luke Wagner [:luke] from comment #9)
> (In reply to Brian Hackett (:bhackett) from comment #8)
> If we removed the block index *without* embedding the var name's index in
> the opcode then we'd add a second O(n) lookup in ScopeCoordinateName with
> the same potential for pathological performance we're seeing here.  If we
> remove the block index *and* add the atom index, then ScopeCoordinateName
> won't need to call GetBlockChainAtPC so ScopeCoordinateName will become O(1).

Does GetBlockChainAtPC need to be O(n)?  Looking at the code it is doing a linear search of what looks like a sorted array and could be doing a binary search instead.
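As a sketch of that suggestion (illustrative only; the entry type here is hypothetical, not the actual per-script block-note data structure), a binary search over notes sorted by pc offset would look roughly like this:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct BlockNote {
    uint32_t pcOffset;   // bytecode offset where the block becomes active
    const void* block;   // the static block object covering that range
};

const void* findBlockAtPC(const std::vector<BlockNote>& notes, uint32_t pcOffset) {
    // Find the first entry past the target pc; the block in effect is the one
    // just before it.
    auto it = std::upper_bound(notes.begin(), notes.end(), pcOffset,
                               [](uint32_t pc, const BlockNote& n) { return pc < n.pcOffset; });
    if (it == notes.begin())
        return nullptr;          // no block covers this pc
    return std::prev(it)->block;
}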
Hmm, that's a nice thought, probably so.

Still, embedding the name in the bytecode is the obvious/simple strategy -- it's what one would expect when reading the code -- so I think we need to justify the complexity of adding a caching mechanism with some memory measurements.  If the savings are significant, great.  It'll also justify the work to remove blockIndex.
Naveed, does this FF27 perf regression have any major user or external test impact? If this is a critical test, shouldn't we be testing more frequently?
Flags: needinfo?(nihsanullah)
Alex, this is not a general-case issue. The issues are confined to a few specific unusual test cases that are not indicative of real-world web apps.
Flags: needinfo?(nihsanullah)
(In reply to Luke Wagner [:luke] from comment #11)
> Still, embedding the name in the bytecode is the obvious/simple strategy --
> it's what one would expect when reading the code -- so I think we need to
> justify the complexity of adding a caching mechanism with some memory
> measurements.  If the savings are significant, great.  It'll also justify
> the work to remove blockIndex too.

Running the asm.js bullet test with --no-asmjs, 26% of the bytecode we emit (194k out of 759k) is used up by aliasedvar ops.
Comment on attachment 832936 [details] [diff] [review]
bandaid

Well, that is non-trivial.  I measured a few more sites and saw percentages from 6% (GMail) to 18% (Facebook).

Still, it feels strange to add a whole cache for a pathological case (tons of closed-over variables and tons of accesses that don't get baseline-compiled) when it doesn't really fix the problem (X accesses to X unique closed-over variables will miss X times; also performing aliased var accesses in alternating functions will invalidate the cache each time).

If you want to land this first band-aid patch, that's fine, but I think the cache is just a bigger bandaid.  If we want to fix the problem of "it's slow to find the name of an aliased var", I think we should do it fully and embed the atom index.  We can trade out the blockIndex and avoid any regression.  But I'd like to see this problem show up in a non-synthetic situation first.
Attachment #832936 - Flags: review?(luke) → review+
Attachment #8333581 - Flags: review?(luke)
(In reply to Luke Wagner [:luke] from comment #15)
> Still, it feels strange to add a whole cache for a pathological case (tons
> of closed-over variables and tons of accesses that don't get
> baseline-compiled) when it doesn't really fix the problem (X accesses to X
> unique closed-over variables will miss X times; also performing aliased var
> accesses in alternating functions will invalidate the cache each time).

I don't think you understand what this cache's behavior is.  This cache has benefits even in code we eventually baseline compile --- per comment 3 we baseline compile this script fine, but eat a huge amount of time doing linear lookups while we're still in the interpreter.  Use of this cache will also benefit lookups during baseline and Ion compilation; even making one pass over scripts with these aliased vars (the interpreter only does 10) will still hurt.  This does fix the problem of X accesses to X closed-over vars when those closed-over vars are associated with the same scope object, and while it would repeatedly invalidate when code is visiting different large call objects over and over, the resulting behavior will not be any worse asymptotically than what the code is currently doing.

> If you want to land this first band-aid patch, that's fine, but I think the
> cache is just a bigger bandaid.  If we want to fix the problem of "it's slow
> to find the name of an aliased var", I think we should do it fully and embed
> the atom index.  We can trade out the blockIndex and avoid any regression. 
> But I'd like to see this problem show up in a non-synthetic situation first.

When is a cache *anything* except a bandaid of some size?  The reason we have caches is to trade off speed and memory concerns.  We already know that the size of aliased var ops can be quite large, so reducing the size of those ops while speeding up their accesses in common cases seems an entirely reasonable thing to do, even if it requires *gasp* a cache.  It's pretty clear this problem can happen in non-synthetic situations, as comment 3 describes: all you need is a big app inside a closure (e.g. all asm.js-style apps fit this pattern).
Attachment #8333581 - Flags: review?(luke)
(In reply to Brian Hackett (:bhackett) from comment #16)
> This does fix the problem of X accesses to X closed over vars
> when those closed over vars are associated with the same scope object, and

Ah, I misread the code; I thought the cache was lazily adding (id, slot) pairs as they were accessed, but I see now the cache eagerly populates all (id, slot) pairs on first access.  (Perhaps you can see why I was hesitant under this assumption.)

However, this seems to suggest a new realistic pathological scenario: a program with several large modules that calls back and forth between them can incur an O(n) purge/fill at each call.  This could produce much worse behavior than the status quo.  This actually seems likely to occur in practice given the popular requireJS/program-composed-of-many-modules style.

This is of course fixable by having a two-level map (shape->(id->slot)), if you want to go there.
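A minimal sketch of that two-level arrangement, assuming hypothetical types (here the inner table is keyed by slot, matching the slot-to-name lookups discussed earlier, rather than the id->slot direction written above):

#include <cstdint>
#include <string>
#include <unordered_map>

using SlotToName = std::unordered_map<uint32_t, std::string>;

struct TwoLevelScopeNameCache {
    std::unordered_map<const void* /* shape */, SlotToName> tables;

    // fill is a caller-supplied function that populates (slot, name) pairs for a shape.
    template <typename FillFn>
    const std::string& nameForSlot(const void* shape, uint32_t slot, FillFn fill) {
        auto [it, inserted] = tables.try_emplace(shape);
        if (inserted)
            fill(it->second);    // populate once per shape
        return it->second.at(slot);
    }
};

Keying on the shape means switching between large modules no longer purges the whole cache; each shape keeps its own table.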

> even if it requires *gasp* a cache.

Save the snark.

> It's pretty clear this problem can happen in non-synthetic situations, as comment 3
> describes, all you need is a big app inside a closure (e.g. all asm.js style
> apps fit this pattern).

Also, a bunch of closed-over variables and a bunch of accesses in functions that don't get quickly ICed.  asm.js has highly-aliased variables, but a program-independent fixed quantity (~139, far less than testClosureAssign).

That being said, of course it would be nice to solve this problem w/o using more memory.  With the abovementioned problem fixed as suggested or otherwise, I think the cache *would* be quite solid and worth the added complexity.
(In reply to Luke Wagner [:luke] from comment #17)
> However, this seems to suggest a new realistic pathologically scenario: a
> program with several large modules that calls back and forth between them
> can incur an O(n) purge/fill at each call.  This could produce much worse
> behavior than the status quo.   This actually seems likely to occur in
> practice given the popular requireJS/program-composed-of-many-modules style.
> 
> This is of course is fixable by having a two-level map (shape->(id->slot)),
> if you want to go there.

Right now we eat an O(n) lookup for every single access, so I don't see how purging/filling the cache when calling between modules is 'much worse' than the status quo.  Using the cache would be asymptotically the same, with a not much worse constant factor, and anyhow most accesses will typically be within the same module as the previous access, not a different one.

> Save the snark.

Sorry, but you've been dragging this bug out for a week and a half now when the solution I came up with initially is perfectly fine.  I'd like to be spending my time on more interesting problems.
(In reply to Brian Hackett (:bhackett) from comment #18)
> Right now we eat an O(n) lookup for every single access, so I don't see how
> purging/filling the cache when calling between modules is 'much worse' than
> the status quo.  Using the cache would be asymptotically the same, with a
> not much worse constant factor, and anyhow most accesses will typically be
> within the same module as the previous access, not a different one.

Ideally, yes.  But I was thinking about the case I explained with small functions calling each other.  Also, the constant difference between N linked-list hops and N hash-table insertions/deletions isn't exactly small, nor are aliased variable accesses necessarily randomly distributed.  This is all in the pathological realm of "tons of aliased variables" so I guess I don't really know.  I'll drop it.

> Sorry, but you've been dragging this bug out for a week and a half now when
> the solution I came up with initially is perfectly fine.  I'd like to be
> spending my time on more interesting problems.

I've had a sequence of questions, which I think are quite valid.  This is not dragging feet; this is a reticence to add complexity to an area that is already somewhat confusing to those unfamiliar, without being quite clear about what we're getting.  I'm sorry for the latency, but I'm also doing other things and this bug doesn't exactly motivate the context switch.  I'll get to it now.
On the question of "is this worth it?", the initial measurements we did in comments 14/15 still leave open the question of what the actual impact on browser memory usage would be.  This is really easy to measure with the attached patch to about:memory.  I get these results, where "Saved" is the accumulation of the per-SharedScriptData number of aliased vars * sizeof(uint32_t):

Site                    Saved    script-data   explicit   Saved as % of script-data   Saved as % of explicit
fresh browser           44K      3.7M          102M       1.2%                        .04%
amazon.com              52K      4.5M          163M       1.2%                        .03%
gmail.com               88K      6.7M          210M       1.3%                        .04%
cnn.com article         94K      5.8M          165M       1.6%                        .05%
facebook.com            106K     5.7M          250M       1.8%                        .04%
techcrunch.com article  114K     5.4M          150M       2.1%                        .08%
turbulenz.com games     115K     6.7M          888M       1.7%                        .01%
Citadel (w/o asm.js)    1.3M     18.9M         1.2G       6.8%                        .11%

I'm probably biased by this point, but the cache doesn't seem like a particular win over embedding the atomsIndex.  Do you still think it is?
cc'ing njn and jorendorff in case they have an opinion on this tradeoff.
(In reply to Naveed Ihsanullah [:naveed] from comment #13)
> Alex this is not a general case issue. The issues are confined to a few
> specific unusual test cases that are not indicative of real world web apps.

Given this, we are not going to track for release. But if a low risk uplift is found, feel free to uplift!
(In reply to Luke Wagner [:luke] from comment #20)
> I'm probably biased by this point, but the cache doesn't seem like a
> particular win over embedding the atomsIndex.  Do you still think it is?

Thanks for measuring.  I think a simple patch to save .05% of explicit is definitely worth taking.  It's only a tiny fraction of the effectiveness of a big project like objshrink, but there aren't many such projects left to be done and the little stuff still counts.
Attachment #8333581 - Flags: review?(luke) → review+
I'll put together a patch to remove the blockIndex on the aliasedvar ops.

https://hg.mozilla.org/integration/mozilla-inbound/rev/62e94f70b2cd
https://hg.mozilla.org/mozilla-central/rev/62e94f70b2cd
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
(In reply to Brian Hackett (:bhackett) from comment #24)
> I'll put together a patch to remove the blockIndex on the aliasedvar ops.

Will you file a new bug, or do it as part of this bug? If the latter, this bug should be reopened and marked [leave open].
Flags: needinfo?(bhackett1024)
(In reply to Florian Bender from comment #26)
> (In reply to Brian Hackett (:bhackett) from comment #24)
> > I'll put together a patch to remove the blockIndex on the aliasedvar ops.
> 
> Will you file a new bug, or do it as part of this bug? Then this bug should
> be re- and [leave open]ed.

Filed bug 944930.
Flags: needinfo?(bhackett1024)
(In reply to Luke Wagner [:luke] from comment #21)
> cc'ing njn and jorendorff in case they have an opinion on this tradeoff.

No opinion.

Thanks for obtaining actual numbers. I don't have anything to compare them to, though. We do a ton of work on stuff that wins a few percent on some vanishingly small proportion of the web. This wins a vanishingly small percent on pretty much everything. The future portion of the cost of this patch still seems possibly worth taking and possibly not.
Florian, can you please verify this is fixed now in Firefox 28?
Flags: needinfo?(fb+mozdev)
I got about 180 now (see Bug 517164 comment #9).
Status: RESOLVED → VERIFIED
Flags: needinfo?(fb+mozdev)
Brian is away on PTO until February. Luke, could you nominate this for Firefox 27 Beta uplift?
Flags: needinfo?(luke)
In agreement with comment 13, I'm not sure this problem is so significant as to require uplift.
Flags: needinfo?(luke)
Thanks Luke, marking this as wontfix for Firefox 27.