Closed Bug 939088 Opened 11 years ago Closed 11 years ago

massive regression in testClosureAssign

Categories

(Core :: JavaScript Engine, defect)

x86_64
macOS
defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla28
Tracking Status
firefox25 --- unaffected
firefox26 --- unaffected
firefox27 --- wontfix
firefox28 --- verified
firefox29 --- verified
firefox-esr17 --- unaffected
firefox-esr24 --- unaffected
b2g18 --- unaffected
b2g-v1.1hd --- unaffected
b2g-v1.2 --- unaffected

People

(Reporter: fb+mozdev, Assigned: bhackett1024)

References


Details

(Keywords: perf, regression)

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #517164 +++

(In reply to Florian Bender from Bug 517164 comment #9)
> "testClosureAssign" seems regressed (first bad: Aurora), this needs a few
> more runs and bisection.

Last good nightly: 2013-10-11
First bad nightly: 2013-10-12

Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6b101d4c6d24&tochange=73f37c7a3860


Does anything stand out? If not, can anyone bisect this further, please?
I only bisected on Mac, so resetting the platform info for now, though I suspect the numbers are the same for other platforms.
OS: All → Mac OS X
Hardware: All → x86_64
(In reply to Florian Bender from comment #0)
> Does anything stand out? If not, can anyone bisect this further, please?

Yes, bug 880085 seems related. Requesting needinfo from Brian.
Flags: needinfo?(bhackett1024)
Attached patch bandaid (Splinter Review)
This test is pretty silly but does point to a perf bottleneck that could affect large applications using run-once closures.  It generates a run-once closure with 14000 variables and an inner function that assigns to each of those variables once.  That inner function is called 150 times, so we never try to Ion compile it.  Baseline does a fine job generating ICs for it, and 90% of the time taken here is spent before we even baseline compile, almost all of it warming up in the interpreter.  That time has increased because the interpreter needs to keep track of type side effects on the object (since bug 880085 unbroke run-once closure optimizations), which requires it to compute the property name being accessed.  That computation incurs a linear search on the call object shape, which takes a while since there are 14000 entries in it.
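For readers unfamiliar with this code, here is a minimal self-contained C++ model (hypothetical types, not SpiderMonkey's actual Shape or ScopeCoordinateName code) of why recovering a variable's name from its slot is O(n) in the number of closed-over variables:

#include <cstdint>
#include <string>
#include <vector>

struct FakeProperty {
    std::string name;   // hypothetical stand-in for the property's atom
    uint32_t slot;      // slot index in the call object
};

// With ~14000 closed-over variables, every interpreted SETALIASEDVAR that has to
// update type information pays for a scan like this to recover the name.
const std::string* lookupNameForSlot(const std::vector<FakeProperty>& shapeEntries,
                                     uint32_t slot) {
    for (const FakeProperty& prop : shapeEntries) {  // linear search over the shape entries
        if (prop.slot == slot)
            return &prop.name;
    }
    return nullptr;
}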

The attached patch is a bandaid to avoid the linear search in cases where we know the call object's types don't need to be updated, i.e. when no code under it has been compiled.  This improves the testClosureAssign() case from 1560 to 95 for me, but won't have much effect in more diffuse large applications where we baseline/Ion compile some inner code and delazify the type but still do a lot of interpreting in cold code.  I'd like to do a follow-up patch to add a cache in the runtime mapping slots to names for singleton scope objects, similar to the source note cache.  (This seems a better use of memory than attaching the name to every SETALIASEDVAR op.)
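As a rough illustration of the follow-up idea (hypothetical names and types, not the actual attached patch), a single-entry runtime cache mapping slots to names for the most recently used singleton scope object could look roughly like this:

#include <cstdint>
#include <string>
#include <unordered_map>

struct ScopeNameCache {
    const void* scopeObject = nullptr;                 // which scope object the cache currently covers
    std::unordered_map<uint32_t, std::string> names;   // slot -> name

    // fill is a caller-supplied function that enumerates the scope object's
    // (slot, name) pairs once; it stands in for walking the shape lineage.
    template <typename FillFn>
    const std::string& nameForSlot(const void* scope, uint32_t slot, FillFn fill) {
        if (scope != scopeObject) {        // single-entry cache: purge and refill
            names.clear();
            fill(names);                   // O(n) once per scope object
            scopeObject = scope;
        }
        return names.at(slot);             // O(1) on subsequent accesses
    }
};

The trade-off is that the O(n) fill happens once per scope object rather than once per access, at the cost of purging whenever a different scope object is consulted.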
Assignee: nobody → bhackett1024
Attachment #832936 - Flags: review?(luke)
Flags: needinfo?(bhackett1024)
Attached patch cache (Splinter Review)
Attachment #8333581 - Flags: review?(luke)
Instead of adding a whole new mechanism that, as you said, doesn't really solve the problem for realistic workloads, can we just blow another 4 bytes on JOF_SCOPECOORD ops and embed the atom?

Actually, with bug 932276, the only remaining use of the 4 byte blockIndex currently embedded in JOF_SCOPECOORD ops is from the Ion/Baseline JITs, so we could use the O(n) GetBlockChainAtPC added by bug 932276 and remove the blockIndex.
(In reply to Luke Wagner [:luke] from comment #5)
> Instead of adding a whole new mechanism that, as you said, doesn't really
> solve the problem for realistic workloads, can we just blow another 4 bytes
> on JOF_SCOPECOORD ops and embed the atom?
> 
> Actually, with bug 932276, the only remaining use of the 4 byte blockIndex
> current embedded in JOF_SCOPECOORD ops is from Ion/Baseline JITs, so we
> could use the O(n) GetBlockChainAtPC added by bug 932276 and remove the
> blockIndex.

I think the cache mechanism will work fine for realistic workloads; it's the first patch that's just a bandaid.  Embedding the atoms directly will be a good deal less efficient on memory than the cache; it's not just the extra four bytes for the atom index, but also (usually) a new word-sized entry in the script's atoms table.
(In reply to Brian Hackett (:bhackett) from comment #6)
comment 5 explains how to avoid increasing the bytecode size.  To avoid increasing atoms-table size, we can use the aliased var's index in Bindings::bindingArray.  Before doing either, though, we should see how prevalent aliased-var opcodes even are or how many extra atoms-table entries would be added; this might only be a few KB we're talking about here.
(In reply to Luke Wagner [:luke] from comment #7)
> (In reply to Brian Hackett (:bhackett) from comment #6)
> comment 5 explains how to avoid increasing the bytecode size.

Well, I'm confused now.  Comment 5 sounds like an independent optimization which would allow shortening the bytecode length by 4 bytes; if we did that and also used this cache then the bytecode size would be improved over the current state of things.
(In reply to Brian Hackett (:bhackett) from comment #8)
> Well, I'm confused now.  Comment 5 sounds like an independent optimization

If we removed the block index *without* embedding the var name's index in the opcode then we'd add a second O(n) lookup in ScopeCoordinateName with the same potential for pathological performance we're seeing here.  If we remove the block index *and* add the atom index, then ScopeCoordinateName won't need to call GetBlockChainAtPC so ScopeCoordinateName will become O(1).
(In reply to Luke Wagner [:luke] from comment #9)
> (In reply to Brian Hackett (:bhackett) from comment #8)
> If we removed the block index *without* embedding the var name's index in
> the opcode then we'd add a second O(n) lookup in ScopeCoordinateName with
> the same potential for pathological performance we're seeing here.  If we
> remove the block index *and* add the atom index, then ScopeCoordinateName
> won't need to call GetBlockChainAtPC so ScopeCoordinateName will become O(1).

Does GetBlockChainAtPC need to be O(n)?  Looking at the code it is doing a linear search of what looks like a sorted array and could be doing a binary search instead.
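As a sketch of that suggestion (illustrative only; the entry type here is hypothetical, not the actual per-script block-note data structure), a binary search over notes sorted by pc offset would look roughly like this:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct BlockNote {
    uint32_t pcOffset;   // bytecode offset where the block becomes active
    const void* block;   // the static block object covering that range
};

const void* findBlockAtPC(const std::vector<BlockNote>& notes, uint32_t pcOffset) {
    // Find the first entry past the target pc; the block in effect is the one
    // just before it.
    auto it = std::upper_bound(notes.begin(), notes.end(), pcOffset,
                               [](uint32_t pc, const BlockNote& n) { return pc < n.pcOffset; });
    if (it == notes.begin())
        return nullptr;          // no block covers this pc
    return std::prev(it)->block;
}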
Hmm, that's a nice thought, probably so.

Still, embedding the name in the bytecode is the obvious/simple strategy -- it's what one would expect when reading the code -- so I think we need to justify the complexity of adding a caching mechanism with some memory measurements.  If the savings are significant, great.  It'll also justify the work to remove blockIndex.
Naveed, does this FF27 perf regression have any major user or external test impact? If this is a critical test, shouldn't we be testing more frequently?
Flags: needinfo?(nihsanullah)
Alex, this is not a general-case issue. The issues are confined to a few specific unusual test cases that are not indicative of real-world web apps.
Flags: needinfo?(nihsanullah)
(In reply to Luke Wagner [:luke] from comment #11)
> Still, embedding the name in the bytecode is the obvious/simple strategy --
> it's what one would expect when reading the code -- so I think we need to
> justify the complexity of adding a caching mechanism with some memory
> measurements.  If the savings are significant, great.  It'll also justify
> the work to remove blockIndex too.

Running the asm.js bullet test with --no-asmjs, 26% of the bytecode we emit (194k out of 759k) is used up by aliasedvar ops.
Comment on attachment 832936 [details] [diff] [review]
bandaid

Well, that is non-trivial.  I measured a few more sites and saw percentages from 6% (GMail) to 18% (Facebook).

Still, it feels strange to add a whole cache for a pathological case (tons of closed-over variables and tons of accesses that don't get baseline-compiled) when it doesn't really fix the problem (X accesses to X unique closed-over variables will miss X times; also performing aliased var accesses in alternating functions will invalidate the cache each time).

If you want to land this first band-aid patch, that's fine, but I think the cache is just a bigger bandaid.  If we want to fix the problem of "it's slow to find the name of an aliased var", I think we should do it fully and embed the atom index.  We can trade out the blockIndex and avoid any regression.  But I'd like to see this problem show up in a non-synthetic situation first.
Attachment #832936 - Flags: review?(luke) → review+
Attachment #8333581 - Flags: review?(luke)
(In reply to Luke Wagner [:luke] from comment #15)
> Still, it feels strange to add a whole cache for a pathological case (tons
> of closed-over variables and tons of accesses that don't get
> baseline-compiled) when it doesn't really fix the problem (X accesses to X
> unique closed-over variables will miss X times; also performing aliased var
> accesses in alternating functions will invalidate the cache each time).

I don't think you understand what this cache's behavior is.  This cache has benefits even in code we eventually baseline compile --- per comment 3 we baseline compile this script fine, but eat a huge amount of time doing linear lookups while we're still in the interpreter.  Use of this cache will also benefit lookups during baseline and Ion compilation; even making one pass over scripts with these aliased vars (the interpreter only does 10) will still hurt.  This does fix the problem of X accesses to X closed-over vars when those closed-over vars are associated with the same scope object, and while it would repeatedly invalidate when code is visiting different large call objects over and over, the resulting behavior will not be any worse asymptotically than what the code is currently doing.

> If you want to land this first band-aid patch, that's fine, but I think the
> cache is just a bigger bandaid.  If we want to fix the problem of "it's slow
> to find the name of an aliased var", I think we should do it fully and embed
> the atom index.  We can trade out the blockIndex and avoid any regression. 
> But I'd like to see this problem show up in a non-synthetic situation first.

When is a cache *anything* except a bandaid of some size?  The reason we have caches is to trade off speed and memory concerns.  We already know that the size of aliased var ops can be quite large, so reducing the size of those ops while speeding up their accesses in common cases seems an entirely reasonable thing to do, even if it requires *gasp* a cache.  It's pretty clear this problem can happen in non-synthetic situations, as comment 3 describes: all you need is a big app inside a closure (e.g. all asm.js-style apps fit this pattern).
Attachment #8333581 - Flags: review?(luke)
(In reply to Brian Hackett (:bhackett) from comment #16)
> This does fix the problem of X accesses to X closed over vars
> when those closed over vars are associated with the same scope object, and

Ah, I misread the code; I thought the cache was lazily adding (id, slot) pairs as they were accessed, but I see now the cache eagerly populates all (id, slot) pairs on first access.  (Perhaps you can see why I was hesitant under this assumption.)

However, this seems to suggest a new realistic pathological scenario: a program with several large modules that calls back and forth between them can incur an O(n) purge/fill at each call.  This could produce much worse behavior than the status quo.  This actually seems likely to occur in practice given the popular requireJS/program-composed-of-many-modules style.

This is of course fixable by having a two-level map (shape->(id->slot)), if you want to go there.
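A minimal sketch of that two-level arrangement, assuming hypothetical types (here the inner table is keyed by slot, matching the slot-to-name lookups discussed earlier, rather than the id->slot direction written above):

#include <cstdint>
#include <string>
#include <unordered_map>

using SlotToName = std::unordered_map<uint32_t, std::string>;

struct TwoLevelScopeNameCache {
    std::unordered_map<const void* /* shape */, SlotToName> tables;

    // fill is a caller-supplied function that populates (slot, name) pairs for a shape.
    template <typename FillFn>
    const std::string& nameForSlot(const void* shape, uint32_t slot, FillFn fill) {
        auto [it, inserted] = tables.try_emplace(shape);
        if (inserted)
            fill(it->second);    // populate once per shape
        return it->second.at(slot);
    }
};

Keying on the shape means switching between large modules no longer purges the whole cache; each shape keeps its own table.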

> even if it requires *gasp* a cache.

Save the snark.

> It's pretty clear this problem can happen in non-synthetic situations, as comment 3
> describes, all you need is a big app inside a closure (e.g. all asm.js style
> apps fit this pattern).

Also, a bunch of closed-over variables and a bunch of accesses in functions that don't get quickly ICed.  asm.js has highly-aliased variables, but a program-independent fixed quantity (~139, far less than testClosureAssign).

That being said, of course it would be nice to solve this problem w/o using more memory.  With the abovementioned problem fixed as suggested or otherwise, I think the cache *would* be quite solid and worth the added complexity.
(In reply to Luke Wagner [:luke] from comment #17)
> However, this seems to suggest a new realistic pathologically scenario: a
> program with several large modules that calls back and forth between them
> can incur an O(n) purge/fill at each call.  This could produce much worse
> behavior than the status quo.   This actually seems likely to occur in
> practice given the popular requireJS/program-composed-of-many-modules style.
> 
> This is of course is fixable by having a two-level map (shape->(id->slot)),
> if you want to go there.

Right now we eat an O(n) lookup for every single access, so I don't see how purging/filling the cache when calling between modules is 'much worse' than the status quo.  Using the cache would be asymptotically the same, with a not much worse constant factor, and anyhow most accesses will typically be within the same module as the previous access, not a different one.

> Save the snark.

Sorry, but you've been dragging this bug out for a week and a half now when the solution I came up with initially is perfectly fine.  I'd like to be spending my time on more interesting problems.
(In reply to Brian Hackett (:bhackett) from comment #18)
> Right now we eat an O(n) lookup for every single access, so I don't see how
> purging/filling the cache when calling between modules is 'much worse' than
> the status quo.  Using the cache would be asymptotically the same, with a
> not much worse constant factor, and anyhow most accesses will typically be
> within the same module as the previous access, not a different one.

Ideally, yes.  But I was thinking about the case I explained with small functions calling each other.  Also, the constant difference between N linked-list hops and N hash-table insertions/deletions isn't exactly small, nor are aliased variable accesses necessarily randomly distributed.  This is all in the pathological realm of "tons of aliased variables" so I guess I don't really know.  I'll drop it.

> Sorry, but you've been dragging this bug out for a week and a half now when
> the solution I came up with initially is perfectly fine.  I'd like to be
> spending my time on more interesting problems.

I've had a sequence of questions, which I think are quite valid.  This is not dragging feet; this is a reticence to add complexity to an area that is already somewhat confusing to those unfamiliar, without being quite clear about what we're getting.  I'm sorry for the latency, but I'm also doing other things and this bug doesn't exactly motivate the context switch.  I'll get to it now.
On the question of "is this worth it?", the initial measurements we did in comments 14/15 still leave open the question of what the actual impact on browser memory usage would be.  This is really easy to measure with the attached patch to about:memory.  I get these results, where "Saved" is the accumulation of the per-SharedScriptData number of aliased vars * sizeof(uint32_t):

Site                    Saved    script-data   explicit   Saved as % of script-data   Saved as % of explicit
fresh browser           44K      3.7M          102M       1.2%                        .04%
amazon.com              52K      4.5M          163M       1.2%                        .03%
gmail.com               88K      6.7M          210M       1.3%                        .04%
cnn.com article         94K      5.8M          165M       1.6%                        .05%
facebook.com            106K     5.7M          250M       1.8%                        .04%
techcrunch.com article  114K     5.4M          150M       2.1%                        .08%
turbulenz.com games     115K     6.7M          888M       1.7%                        .01%
Citadel (w/o asm.js)    1.3M     18.9M         1.2G       6.8%                        .11%

I'm probably biased by this point, but the cache doesn't seem like a particular win over embedding the atomsIndex.  Do you still think it is?
cc'ing njn and jorendorff in case they have an opinion on this tradeoff.
(In reply to Naveed Ihsanullah [:naveed] from comment #13)
> Alex this is not a general case issue. The issues are confined to a few
> specific unusual test cases that are not indicative of real world web apps.

Given this, we are not going to track for release. But if a low risk uplift is found, feel free to uplift!
(In reply to Luke Wagner [:luke] from comment #20)
> I'm probably biased by this point, but the cache doesn't seem like a
> particular win over embedding the atomsIndex.  Do you still think it is?

Thanks for measuring.  I think a simple patch to save .05% of explicit is definitely worth taking.  It's only a tiny fraction of the effectiveness of a big project like objshrink, but there aren't many such projects left to be done and the little stuff still counts.
Attachment #8333581 - Flags: review?(luke) → review+
I'll put together a patch to remove the blockIndex on the aliasedvar ops.

https://hg.mozilla.org/integration/mozilla-inbound/rev/62e94f70b2cd
https://hg.mozilla.org/mozilla-central/rev/62e94f70b2cd
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
(In reply to Brian Hackett (:bhackett) from comment #24)
> I'll put together a patch to remove the blockIndex on the aliasedvar ops.

Will you file a new bug, or do it as part of this bug? If the latter, this bug should be reopened and marked [leave open].
Flags: needinfo?(bhackett1024)
(In reply to Florian Bender from comment #26)
> (In reply to Brian Hackett (:bhackett) from comment #24)
> > I'll put together a patch to remove the blockIndex on the aliasedvar ops.
> 
> Will you file a new bug, or do it as part of this bug? Then this bug should
> be re- and [leave open]ed.

Filed bug 944930.
Flags: needinfo?(bhackett1024)
(In reply to Luke Wagner [:luke] from comment #21)
> cc'ing njn and jorendorff in case they have an opinion on this tradeoff.

No opinion.

Thanks for obtaining actual numbers. I don't have anything to compare them to, though. We do a ton of work on stuff that wins a few percent on some vanishingly small proportion of the web. This wins a vanishingly small percent on pretty much everything. The future portion of the cost of this patch still seems possibly worth taking and possibly not.
Florian, can you please verify this is fixed now in Firefox 28?
Flags: needinfo?(fb+mozdev)
I got about 180 now (see Bug 517164 comment #9).
Status: RESOLVED → VERIFIED
Flags: needinfo?(fb+mozdev)
Brian is away on PTO until February. Luke, could you nominate this for Firefox 27 Beta uplift?
Flags: needinfo?(luke)
In agreement with comment 13, I'm not sure this problem is so significant as to require uplift.
Flags: needinfo?(luke)
Thanks Luke, marking this as wontfix for Firefox 27.