Optimize computing a column number as count of code points by caching column numbers (and whether each chunk might contain anything multi-unit) and counting forward from them

RESOLVED FIXED in Firefox 69



Last month
10 days ago


(Reporter: Waldo, Assigned: Waldo)



Firefox Tracking Flags

(firefox68 wontfix, firefox69 fixed)



(5 attachments, 1 obsolete attachment)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review


Last month

Bug 1504947 made column numbers be counts of code points (at least til its final step was reverted in bug 1549758). Because code points generally aren't all the same length, determining column numbers thus requires some sort of O(n) scan -- O(n**2) when it happens repeatedly. Computations on the same line can avoid redoing work by counting forward from a prior offset. But not all lookups are successive, and non-successive cases hit issues: bug 1545015, bug 1545034, bug 1546313 at minimum.

I've implemented a solution: for lines that are "long enough", lazily compute a vector of the column number at almost-regular intervals, then you can count only from the nearest column. --compileonly --file with bug 1545034's testcase indicates this is effective.

There are a couple different axes of design space/implementation choice here.

First, chunk size.

128 could be any number that will fit the lengthiest code point (2/4 for UTF-8/16). A smaller number reduces worst-case scanning (which can still be quadratic, just within a chunk) but costs more memory and has more overhead to populate the data structures. I bet most non-minified lines are under 100ch, so 128 seems most reasonable. (We could data-gather to support that; I don't think the effort is justified.)

Second, the vector elements.

Regular chunking may split code points, so you must round chunk beginnings down to the start of a code point. This can be done every time a true boundary need be computed, or you could save the exact offset. The memory cost of exact offset saving, when retracting to a code point boundary should be a few predictable instructions, IMO rules out offset-saving. (I have a patch for this, tho, if we think different.)

A better thing to save in vector elements, IMO, is a bit that conservatively indicates whether a chunk is guaranteed to contain only single-unit code points. Then the scanning cost is only paid by chunks that definitely contain multi-unit code points, or by an instantaneously-at-the-end chunk where we just can't know yet. That one byte per chunk seems a much better cost to pay than four bytes per chunk for retracted offsets.

Anyway -- I'll post a patch where vector elements are column numbers only, a patch that adds in the guaranteed-single-unit bit too, and a patch that also stores the offset (that we probably don't want). I also have a nice big ol' testcase -- written specifically for that 128-unit chunk size, to trigger behaviors and edge cases within it -- that seems to indicate things are okay. (And it'd be easy enough to see how the test runs with various values for ColumnChunkLength besides 128, for scattershot testing -- at some point I tried it with value 7 for a way-too-low approach and it worked fine.)

Attachment #9065156 - Attachment is obsolete: true
Priority: -- → P1

Comment 7

Last month

We're somewhat freezy right now, so I'm going to land most of what's here, but I'm not going to land -- yet -- the final patch that actually begins to invoke any of this new code or long-line optimization mechanisms. The work here is hidden behind #if JS_COLUMN_DIMENSION_IS_CODE_POINTS(), so landing it won't pose risk but will get it out of my tree so I don't have rebase/conflict worries. If memory serves, the branch is next week, and I'll land that final bit after that point.

Here's a try run of all the patches except that last effective one:


And here's a try-run that includes that last effective patch:


Keywords: leave-open

Comment 8

Last month
Pushed by jwalden@mit.edu:
Remove |TokenStreamAnyChars::undoInternalUpdateLineInfoForEOL| as unused.  r=arai
Make JS_COLUMN_DIMENSION_IS_CODE_POINTS a no-argument macro function so that calling it before its definition is an error, and move it upward in TokenStream.h so it's defined in places where subsequent changes will need it.  r=arai
Optimize column-number computations for offsets more than |ColumnChunkLength = 128| code units into a line by saving column information at 128-unit increments (rounded down to the nearest code point start) so that at most (length of... r=arai
Add a boolean to every chunk for a long-line vector indicating whether that chunk contains any multiple-unit code points, so that column computations inside wholly-single-unit chunks can do a constant-time pointer-range computation... r=arai

Comment 10

26 days ago
Pushed by jwalden@mit.edu:
Flip column numbers back to being counts of code points and not code units.  r=arai


26 days ago
Keywords: leave-open

Comment 11

26 days ago
Closed: 26 days ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
You need to log in before you can comment on or make changes to this bug.