Bug 1504947 made column numbers counts of code points (at least until its final step was reverted in bug 1549758). Because code points aren't all encoded with the same number of code units, determining a column number requires some sort of O(n) scan -- O(n**2) when it happens repeatedly. Computations on the same line can avoid redoing work by counting forward from a prior offset, but not all lookups are successive, and the non-successive cases hit pathological behavior: bug 1545015, bug 1545034, and bug 1546313 at minimum.
I've implemented a solution: for lines that are "long enough", lazily compute a vector of column numbers at (almost) regular intervals, so a column computation need only count forward from the nearest stored column.
Running --compileonly --file against bug 1545034's testcase indicates this is effective.
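Roughly, the shape of the thing -- a sketch only, not the patch: the names (LineColumnComputer, chunkColumns_, countCodePoints) are mine for this comment, it assumes UTF-16 units, and "column" here means a 0-origin count of code points from the line start:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static constexpr size_t ColumnChunkLength = 128;

class LineColumnComputer {
  const char16_t* lineUnits_;  // code units of one source line
  size_t lineLength_;          // number of code units in the line

  // Lazily filled: element i holds the column number (a count of code points)
  // at code-unit offset i * ColumnChunkLength.  If that offset would split a
  // surrogate pair, the stored column is that of the code point containing it.
  std::vector<uint32_t> chunkColumns_;

  static bool isTrailSurrogate(char16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
  }

  // Count code points in [begin, end): every unit counts except a trail
  // surrogate, whose code point was already counted at its lead surrogate.
  uint32_t countCodePoints(size_t begin, size_t end) const {
    uint32_t n = 0;
    for (size_t i = begin; i < end; i++) {
      if (!isTrailSurrogate(lineUnits_[i])) {
        n++;
      }
    }
    return n;
  }

 public:
  LineColumnComputer(const char16_t* units, size_t length)
      : lineUnits_(units), lineLength_(length) {}

  // Column number at code-unit |offset| within the line.
  uint32_t columnAt(size_t offset) {
    assert(offset <= lineLength_);
    size_t chunk = offset / ColumnChunkLength;

    // Extend the chunk vector up to |chunk|, each new entry counting only
    // from the previous boundary, so total fill work over a line stays O(n).
    if (chunkColumns_.empty()) {
      chunkColumns_.push_back(0);
    }
    while (chunkColumns_.size() <= chunk) {
      size_t prev = chunkColumns_.size() - 1;
      chunkColumns_.push_back(chunkColumns_[prev] +
                              countCodePoints(prev * ColumnChunkLength,
                                              (prev + 1) * ColumnChunkLength));
    }

    // Count only from the nearest preceding chunk boundary, never from the
    // start of the line.
    return chunkColumns_[chunk] +
           countCodePoints(chunk * ColumnChunkLength, offset);
  }
};
```

Any single lookup then scans at most about one chunk's worth of code units, and populating the vector itself is linear in the line length.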
There are a couple of different axes of design/implementation choice here.
First, chunk size.
128 could be any number at least as large as the longest code point (4 code units in UTF-8, 2 in UTF-16). A smaller number reduces worst-case scanning (which can still be quadratic, just confined within a chunk) but costs more memory and has more overhead to populate the data structures. I bet most non-minified lines are under 100 characters, so 128 seems most reasonable. (We could gather data to support that; I don't think the effort is justified.)
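For concreteness, a hypothetical sketch of the constraint and the offset arithmetic (ColumnChunkLength is the only name taken from this bug; ChunkIndex is illustrative):

```cpp
#include <cstddef>

// The only hard requirement on the chunk length is that it's no smaller than
// the longest code point: 4 code units in UTF-8, 2 in UTF-16.
static constexpr size_t ColumnChunkLength = 128;
static_assert(ColumnChunkLength >= 4,
              "a chunk must be long enough to hold any single code point");

// Mapping a code-unit offset to its chunk is a single shift/divide, e.g.
// code-unit offset 300 falls in chunk 300 / 128 == 2.
constexpr size_t ChunkIndex(size_t offsetInLine) {
  return offsetInLine / ColumnChunkLength;
}
```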
Second, the vector elements.
Regular chunking may split code points, so chunk beginnings must be rounded down to the start of a code point. This can be done every time a true boundary needs to be computed, or the exact retracted offset could be saved in the vector. IMO the memory cost of saving exact offsets -- when retracting to a code point boundary should take only a few predictable instructions -- rules out offset-saving. (I have a patch for it, though, if we decide otherwise.)
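For reference, retracting on demand is only a comparison or two per boundary. A sketch, with illustrative names, assuming the offset points at a valid code unit within the line:

```cpp
#include <cstddef>

// UTF-16: a boundary needs to step back at most one unit, off a trail surrogate.
size_t RetractToCodePointBoundary(const char16_t* units, size_t offset) {
  if (offset > 0 && units[offset] >= 0xDC00 && units[offset] <= 0xDFFF) {
    offset--;
  }
  return offset;
}

// UTF-8: step back over continuation bytes (0b10xxxxxx), at most three of them.
size_t RetractToCodePointBoundary(const unsigned char* units, size_t offset) {
  while (offset > 0 && (units[offset] & 0xC0) == 0x80) {
    offset--;
  }
  return offset;
}
```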
A better thing to save in vector elements, IMO, is a bit that conservatively indicates whether the chunk is guaranteed to contain only single-unit code points. Then the scanning cost is paid only for chunks that definitely contain multi-unit code points, or for the chunk at the current end of the tokenized source, where we just can't know yet. One byte per chunk for that bit seems a much better cost to pay than four bytes per chunk for retracted offsets.
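Concretely, something like this -- again illustrative names and layout, not the patch, and only a UTF-16 slow path shown:

```cpp
#include <cstddef>
#include <cstdint>

struct ColumnChunkEntry {
  uint32_t column;     // column number at the chunk's start
  bool allSingleUnit;  // conservatively true only if the chunk is known to
                       // contain nothing but single-unit code points
};

// Column at |offset|, which lies within the chunk described by |entry|,
// beginning at code-unit offset |chunkStart| of |units|.
uint32_t ColumnWithinChunk(const char16_t* units, const ColumnChunkEntry& entry,
                           size_t chunkStart, size_t offset) {
  if (entry.allSingleUnit) {
    // Fast path: code units and code points coincide, so no scan is needed.
    return entry.column + static_cast<uint32_t>(offset - chunkStart);
  }

  // Slow path, bounded by the chunk length: count code points by counting
  // code units that aren't trail surrogates.
  uint32_t column = entry.column;
  for (size_t i = chunkStart; i < offset; i++) {
    if (units[i] < 0xDC00 || units[i] > 0xDFFF) {
      column++;
    }
  }
  return column;
}
```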
Anyway -- I'll post a patch where vector elements are column numbers only, a patch that adds the guaranteed-single-unit bit, and a patch that also stores the offset (which we probably don't want). I also have a nice big ol' testcase -- written specifically for the 128-unit chunk size, to trigger behaviors and edge cases within it -- that seems to indicate things are okay. (And it'd be easy enough to see how the test runs with various values of ColumnChunkLength besides 128, for scattershot testing -- at some point I tried the way-too-low value 7 and it worked fine.)