Closed Bug 1608290 Opened 5 years ago Closed 3 years ago

JS parsing (and possibly execution) slower in Fenix vs Fennec

Categories

(Core :: JavaScript Engine, defect, P1)


Tracking


RESOLVED INCOMPLETE
Performance Impact medium

People

(Reporter: bas.schouten, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf:pageload)

Current Fenix builds are showing slower parsing, considerably slowing JS execution during pageload. Examples where this can be seen are:

Fennec: https://perfht.ml/37Jvx0u vs Fenix: https://perfht.ml/30aNdzp

In particular during the execution of the inline <script> at line 72 of https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2F&followup=https%3A%2F%2Faccounts.google.com%2F

Or on another site:

Fennec: https://perfht.ml/36DJk8x vs Fenix: https://perfht.ml/2tNXghC

In particular during the execution of <script src="https://support.microsoft.com/bundles/jslibraries?v=7595FctQ4nhDnzr1rSpN4H12FKAoBCwtOJxjwZt9U2s1">

In both these cases we are also seeing GC majors in the slow case. I filed this in a separate bug (see bug 1608289).

The inverted call stacks mildly suggest this may be related to Unicode handling, although that could just be differences in inlining. One function that was added since 68 and stands out is CountCodePoints. But considerably more time is also spent waiting on a futex inside the code generator, so there may be multiple things at play here.

ni?waldo since I believe you added the CountCodePoints call there.

Blocks: 1604930
Whiteboard: [qf:p1:pageload]

Note that GeckoView without e10s does not have the GC, and still has significantly slower execution (although obviously a little faster than default GV, since it doesn't have the GC). See: https://perfht.ml/35yJOvD and https://perfht.ml/2N6Fi0T respectively for single process GV examples.

Looks like I linked the wrong profile for Fennec for accounts.google.com, here's the correct one: https://perfht.ml/35SqhpQ

Flags: needinfo?(jwalden)

There is somewhat irreducible complexity to what's motivating unicode::CountCodePoints being called.

We need in various places to determine a column number for particular pieces of JS syntax within a line. These column numbers are encoded into the compiled JS script using our "source note" mechanism. (We have source saved; we can recompute column numbers on demand if necessary. But we have no infrastructure for it now, and we would have to churn it out from scratch right fast to do this. Doesn't seem like a good short-term solution.)

Historically, we only parsed UTF-16 -- and we computed "columns" as merely the offset in 16-bit code units into the line: a simple subtraction and divide by two. That's wrong for UTF-16 given anything outside the BMP, obviously, but non-BMP is debatably ignorably rare.

But now that we've decided to parse UTF-8 as well, we can't just count code units (i.e. subtract "and divide by one") -- anything non-ASCII in a line would throw off column numbers in strange ways, and non-ASCII is not wholly common but is far from rare. Computing an accurate column number requires counting forward from the start of the line to the offset whose column number is needed.
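
For intuition, here's a minimal sketch of what counting code points in a UTF-8 range involves -- illustrative only, not SpiderMonkey's actual unicode::CountCodePoints. Every code point contributes exactly one byte that isn't a continuation byte, so counting non-continuation bytes counts code points:

    #include <cstddef>
    #include <cstdint>

    // Count UTF-8 code points in [begin, end). Continuation bytes match the
    // bit pattern 10xxxxxx; every code point starts with exactly one
    // non-continuation byte.
    size_t CountCodePointsSketch(const uint8_t* begin, const uint8_t* end) {
      size_t count = 0;
      for (const uint8_t* p = begin; p < end; ++p) {
        if ((*p & 0xC0) != 0x80) {
          ++count;
        }
      }
      return count;
    }

Note this walk is linear in the distance counted, which is exactly why the caching below matters.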

We do a bunch of tricks to avoid repeating this full count-forward, which would otherwise be quadratic for very long lines where many columns are needed (a sketch of the caching scheme follows this list):

  • Within each line, we cache the column number at periodic offsets (every 128 code units, presently) -- lazily, when a column is required at that offset in the line. So if the column at offset 135 is being computed, and a column at offset 256 was previously computed, the count will only be from offset 128 to offset 135.
  • If a chunk contains only single-unit code points, we store a boolean true with it, indicating that you can just add the count of code units within that chunk to the column of the start of the chunk. That is, chunk info would look like [(0, true), (128, false), (255, true)] for a UTF-8 line consisting of 128 single-unit 'a' code points, 55 single-unit 'b' code points, 1 two-unit 'ÿ' code point, 71 single-unit 'c' code points, and 5 single-unit 'd' code points. A lookup in the first 128 code units is then a flat subtraction; a lookup in the second 128 code units has to count forward from offset 128 (through up to 128 code units) and add that count to the cached 128; a lookup in the remaining units counts from offset 256 up to the offset and adds that count to the cached 255.
  • We cache the last offset => (line, column) computed so we can count forward from there, for successive lookups at columns further into the current line. And this applies atop all the other caching, if it would improve the starting point to count code points from -- if you had lookups at offset 8, 33, 78, 91, 125 there would be CountCodePoints(0, 8), then to that cached value would be added CountCodePoints(8, 33) for the next computation, and so on.
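
As promised above, here's a hypothetical sketch of a lookup against that chunk cache. Names, types, and layout are illustrative, not SpiderMonkey's actual data structures, and the lazy filling of chunk entries is omitted:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // From the sketch above.
    size_t CountCodePointsSketch(const uint8_t* begin, const uint8_t* end);

    constexpr uint32_t ChunkSize = 128;  // code units per cached chunk

    struct ChunkInfo {
      uint32_t column;    // column number at the start of this chunk
      bool isSingleUnit;  // true if every code point in the chunk is one unit
    };

    // Column lookup within one line; assumes the chunk entry covering
    // `offset` has already been (lazily) filled in.
    uint32_t ColumnAt(const std::vector<ChunkInfo>& chunks,
                      const uint8_t* lineStart, uint32_t offset) {
      uint32_t chunkIndex = offset / ChunkSize;
      uint32_t chunkStart = chunkIndex * ChunkSize;
      const ChunkInfo& chunk = chunks[chunkIndex];
      if (chunk.isSingleUnit) {
        // Only single-unit code points in this chunk: flat subtraction.
        return chunk.column + (offset - chunkStart);
      }
      // Otherwise count code points from the chunk start up to the offset.
      return chunk.column + static_cast<uint32_t>(CountCodePointsSketch(
                                lineStart + chunkStart, lineStart + offset));
    }

For the example line above, the cache would read [(0, true), (128, false), (255, true)], so a lookup at offset 258 (in the 'd' run) is just the cached 255 plus a flat 2.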

Even with all these tricks, we cannot fully amortize the cost. Suppose lookups at offsets 125, 91, 78, 33, 8 in that order. Because the last offset decreases each time, the final cache doesn't help. The isSingleUnit boolean will eliminate CountCodePoints(0, 125 or 91 or 78 or 33 or 8) if the chunk is single-unit, but if there are any multi-unit code points the overall cost of the counts will be triangular, O(125 + 91 + 78 + 33 + 8). Each individual computation is constant-time...but the constant is O(128). Do a bunch in the wrong order and the cost will mount, quadratically (but with n=128 being small).
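
To make that concrete, here's a quick tally of the decreasing-offset worst case, assuming the chunk contains a multi-unit code point so the single-unit shortcut doesn't apply:

    #include <cstdio>

    int main() {
      // Lookups at decreasing offsets each count from the chunk start, so
      // the scanned code units accumulate triangularly.
      const int offsets[] = {125, 91, 78, 33, 8};
      int unitsScanned = 0;
      for (int off : offsets) {
        unitsScanned += off;  // CountCodePoints(0, off) touches `off` units
      }
      // Prints 335 -- nearly triple the 125 units one forward pass would scan.
      printf("total units scanned: %d\n", unitsScanned);
      return 0;
    }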

If we want to reduce time costs, we could drop 128 to 64. But then you're probably storing chunk information for a lot more lines than before, if we figure lines in non-minified code cap at 80 or 100. I doubt the memory hit is worth it.

Seems to me the alternative at that point is to just do fewer column computations...but I'm not sure how, without losing column-number capabilities we currently support, and without implementing the do-it-all-lazily approach.

Flags: needinfo?(jwalden)
Priority: -- → P1

(In reply to Jeff Walden [:Waldo] from comment #3)

> There is somewhat irreducible complexity to what's motivating unicode::CountCodePoints being called.
> [...]
> Seems to me the alternative at that point is to just do fewer column computations...but I'm not sure how, without losing column-number capabilities we currently support, and without implementing the do-it-all-lazily approach.

Thanks. FWIW, CountCodePoints itself is a fairly small part of this profile's additional time (it's not insignificant, but also not enough to account for the 20-40% difference we see). So if we cannot easily experiment with any of these options, we may be better off looking elsewhere for solutions.

If it's easy to experiment with some of your suggestions, I'd say the best thing to do is just throw up some patches, and someone on the JS or perf team can evaluate their effectiveness.

Additional investigation leads us to believe that some of the speed difference is very possibly due to Fenix having more profiler overhead, because of the larger number of processes and threads being profiled. It should be noted that the CountCodePoints regression is likely real, and that sites (such as accounts.google.com) still load slower even without profiling. But the difference in JS execution speed seen in these profiles may be inflated by this overhead, and may in practice be much smaller.

Whiteboard: [qf:p1:pageload] → [qf:p2:pageload]

Bas, should we close this bug as INVALID? It might be worth doing one more comparison run with "No periodic sampling" checked, just so that we can see based on the markers whether the times are the same.

Flags: needinfo?(bas)

Actually, Ted told me they had found some things making JS parsing slower in Fenix, and that they were working on them. I'm sure they have their own bugs for that and don't need this one, though?

Flags: needinfo?(bas)

NI'ing Ted re: comment 6 and comment 7. Please close-dupe if appropriate.

Flags: needinfo?(tcampbell)
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
Performance Impact: --- → P2
Keywords: perf:pageload
Whiteboard: [qf:p2:pageload]
Flags: needinfo?(tcampbell)