Open Bug 1552008 Opened 1 year ago Updated 1 year ago

HTML parser should supply column numbers for <script> elements

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

Tracking Status
firefox68 --- affected

People

(Reporter: loganfsmyth, Unassigned)

References

(Blocks 4 open bugs)

Details

(Whiteboard: [debugger-mvp])

Attachments

(1 file, 1 obsolete file)

This was spawned from https://bugzilla.mozilla.org/show_bug.cgi?id=1538056 but I'm going to file it separately since there is at least some work that can be done there on its own without this.

In 67, the JS debugger will begin treating breakpoints as line/column positions, where traditionally they have been treated as line-based. One of the outstanding issues is that while that is fine for

<script>
  console.log("foo");
</script>

where a breakpoint might be specifically at column 2 (the c in console.log)
we run into issues with

<script>console.log("foo");</script>

because the position of c, as far as the engine knows is 0, even though visually it is column 8, which causes our breakpoint markers to render in the wrong place.

While hand-written markup usually has that newline and doesn't tend to encounter this, we see it pop up in contexts where the inline code has been minified, which can be quite frustrating for users since debugging minified code is already tough and this will make it even tougher. Things can be especially frustrating if there are multiple <script> elements on the same line, because then it isn't just that the breakpoints are rendered off by 8 characters, but also that there could in fact be multiple sets of breakpoints overlapping.

The nsIScriptElement type already has a SetScriptColumnNumber, which is set by the XML parser, but not the HTML parser, so that would be the objective of this bug.

The XML case also doesn't work, so I'm going to co-opt https://bugzilla.mozilla.org/show_bug.cgi?id=1538056 to address that (we just never used the column from nsIScriptElement at all right now). Then once this lands we should be all set.

Attached patch Untested tokenizer patch (obsolete) — Splinter Review

This patch is untested beyond checking that it compiles. This is the part that is affected by Java to C++ translation. The rest should be buildable on top of this in non-generated C++.

Please check the impact on the innerHTML setter early (with text input that has no tags).

One issue here is that this is a code-unit column count, but SpiderMonkey's column values expect a code-point count, so it's not just an char16_t indexed counter that we need, but a codepoint count.

The alternate approach I've thought about is storing the start offset of the current line start, and in the rare case where getColumnNumber is actually called, it could run through the buffer between that location and the current location, calculating the codepoint offset. That has the downside of the tokenizer needing to be aware of the current buffer and position, which are currently scoped to only the stateLoop, but it would potentially have less of a performance effect since checkChar would have to do less per character.

Thoughts?

Flags: needinfo?(hsivonen)
Whiteboard: [debugger-mvp]

(In reply to Logan Smyth [:loganfsmyth] from comment #2)

One issue here is that this is a code-unit column count, but SpiderMonkey's column values expect a code-point count, so it's not just an char16_t indexed counter that we need, but a codepoint count.

:-( Code point counts instead of code unit counts are generally bad when identifying a position in a text buffer, but I guess these are supposed to "make sense" independently of buffer encoding in messages.

Since line numbers in document.written scripts are hopeless anyway and innerHTML-inserted scripts don't run, you can rely on data from the network being guaranteed to be valid UTF-16, so you can simply make checkChar not increment the column upon encountering a low surrogate.

The alternate approach I've thought about is storing the start offset of the current line start, and in the rare case where getColumnNumber is actually called, it could run through the buffer between that location and the current location, calculating the codepoint offset. That has the downside of the tokenizer needing to be aware of the current buffer and position, which are currently scoped to only the stateLoop, but it would potentially have less of a performance effect since checkChar would have to do less per character.

Since the tokenizer tokenizes a sequence of buffers such that a boundary can be anywhere, dealing with syncing things on buffer boundary is potentially complicated and/or error-prone, so I suggest seeing what the perf impact of the simple solution is first.

Which reminds me that my untested patch doesn't handle restoring the column number when rewinding the input stream after a failed speculation. I'll post a new patch.

Flags: needinfo?(hsivonen)
Priority: -- → P2
No longer blocks: dbg-69
Whiteboard: [debugger-mvp] → [debugger-reserve]
Blocks: dbg-71
No longer blocks: dbg-70
Whiteboard: [debugger-reserve] → [debugger-mvp]
You need to log in before you can comment on or make changes to this bug.