The JS subscript loader should load scripts as UTF-8 only, never as Latin-1

RESOLVED FIXED in Firefox 66

Status

()

enhancement
P2
normal
RESOLVED FIXED
11 months ago
8 months ago

People

(Reporter: Waldo, Assigned: Waldo)

Tracking

Trunk
mozilla66
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox64 wontfix, firefox65 wontfix, firefox66 fixed)

Details

Attachments

(2 attachments)

There are only a few callers of JS::CompileLatin1 and JS::CompileLatin1ForNonSyntacticScope now.  Two of the calls are in PrepareScript in mozJSSubScriptLoader.cpp, dealing (I presume) with loading JS subscripts from trusted code.

https://searchfox.org/mozilla-central/search?q=symbol:_ZN2JS13CompileLatin1EP9JSContextRKNS_22ReadOnlyCompileOptionsEPKcmNS_13MutableHandleIP8JSScriptEE&redirect=false
https://searchfox.org/mozilla-central/search?q=symbol:_ZN2JS33CompileLatin1ForNonSyntacticScopeEP9JSContextRKNS_22ReadOnlyCompileOptionsEPKcmNS_13MutableHandleIP8JSScriptEE&redirect=false

Last I knew, JS subscript loading was an XPCOM/XPConnect API not directly exposed to Web Extensions, so we internally are the only users of it.  Therefore we might be able to just unilaterally change to compiling subscripts as UTF-8.  I haven't attempted this to see whether or not it would work.  Anyone want to?

Note that changing to UTF-8 probably does not entail any different performance.  Latin-1 source text currently gets inflated to UTF-16 before it's compiled, and UTF-8 undergoes essentially the same inflation, so it's six of one, half a dozen of the other now.  But direct parsing, compilation, and evaluation of UTF-8 source text is coming soon via bug 1351107 and other bugs, and in that future world using UTF-8 instead of Latin-1 won't require inflation and likely will be some sort of perf win.
Priority: -- → P2
There are super-few uses of this, and they all pass in ASCII scripts, if you feel like verifying yourself.  (Future patches here may not be so verifiable.  :-\  I don't have an answer to that problem, save to say positive treeherder results are probably the best we can do for the time/effort/reward tradeoff.)

https://searchfox.org/mozilla-central/search?q=loadSubScriptWithOptions&case=true&regexp=false&path=
Attachment #9030383 - Flags: review?(kmaglione+bmo)
Assignee: nobody → jwalden+bmo
Status: NEW → ASSIGNED
Attachment #9030383 - Flags: review?(kmaglione+bmo) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/31eb07ce57ae
Make mozIJSSubScriptLoader.loadSubScriptWithOptions interpret script data only as UTF-8, without any way to use another charset.  r=kmag
Keywords: leave-open
Hi all!  I would appreciate moderately fast reviews on this, because loadSubScript calls and fresh mochitest-browser tests spring up often, and any one of them can bust this patch.

I audited all loadSubScript calls (and all mochitest-browser tests, because one call handles all of them) that load non-ASCII UTF-8 as Latin-1, and I fixed the ones that would break if UTF-8 were used.  First, I added a runUtf8Script identical to loadSubScript, save for using only UTF-8.  Second, I laboriously swapped over callers that processed only ASCII, and callers that already demanded UTF-8, to this new function.  Third, I adjusted callers that passed in UTF-8 but would break if the script were actually interpreted as UTF-8.  And fourth, I changed a bunch of subscripts that clearly were dancing around this bug, to use UTF-8 directly where they really wanted to -- put this fix to immediate use.

A try-push of the full stack of patches as I worked on them is available here.  The full stack makes a lot of runUtf8Script-related changes that are erased in this rollup patch.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ce6c31b86ea09b61e68920bab99f6ac20ed20dd

The first revision in the full patch stack is 4ea4ca0687de5e413b1ae864c9afd052472c0e8b, and the revision preceding it is e305cb6319226b712e95315eeaf95cc0eddeae56.  (The try-push depends on a few patches from a previous try-push, hence they don't show up just looking at that single push.)  I would provide a nice clean pushloghtml-like view of exactly these changes, but I don't know how to get one out of hgweb.

For who reviews what:

* accessible/: surkov

This change fixes accessible/tests/browser/bounds/browser_test_zoom_text.js and accessible/tests/browser/scroll/browser_test_zoom_text.js which passed non-ASCII to addAccessibleTask, which calls snippetToURL, which passes in non-8-bit text to (probably) nsContentUtils::Btoa which promptly throws.  (It does not appear going to one code point from several Latin-1 code points in these snippets changes test behavior, but you tell me, I'm leaning on try.  :-) )  Encode/escape the document into a data: URL in a more webby way that doesn't have btoa restrictions.

* browser/, toolkit/: MattN

The URL bar copying goo has a toUnicode that converts strings containing UTF-8 spewed into separate characters in a string, to strings containing the corresponding code points in UTF-8.  When browser scripts are UTF-8, we can just use the strings directly without going through toUnicode.

Incidentally, I haven't hit content-task.js-related issues of the sort you mentioned yesterday in any of this bug's patching.  (The one failure of that nature I *currently* see in the push, was not in a prior push that didn't include the rename-backward-to-loadSubScript:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b68e4e6e5f9725283c7e9b708870b49583470566&group_state=expanded

but is otherwise the same, up to a couple tweaks, to the try-push I noted earlier.)

* devtools/: jimb

The lambda character isn't in that element any more for browser_dbg-outline.js, so the replacing is just not necessary at all.

https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/devtools/client/debugger/new/src/components/PrimaryPanes/Outline.js#152
https://searchfox.org/mozilla-central/rev/3160ddc1f0ab55d230c595366662c62950e5c785/devtools/client/debugger/new/src/components/shared/PreviewFunction.js#52

I think devtools/client/netmonitor/test/browser_net_curl-utils.js may be buggy.  If I use the code points from UTF-8 directly, the resulting escaped string will contain \x-escapes of those extended-ASCII code points.  But as this test is written, it seems to think UTF-8 encodings of stuff are what curl uses!  I can't find good curl docs on what is right for this, so I'd like to do the stupid thing of exactly replicating the Latin-1 interpretation of the UTF-8 bytes into that string, then file a followup bug to determine what ought be done here.  It looks like *possibly* CurlUtils.escapeStringPosix may just be buggy, but I don't know.

devtools/client/performance/test/browser_timeline-waterfall-sidebar.js could stand to directly use UTF-8 in a comparison string, judging by the comment there.  But I don't have the desired comparison string on speed-dial, so I'd rather devtools people make that change.

* js/xpconnect, netwerk/, security/, testing/mochitest: kmag

LoadSubScriptOptions::charset as a field, and charset conversion code related to it, will die in a subsequent patch I haven't written yet.  I'd like to pocket the gains of the bug summary before doing that cleanup.

I guess if the Android xpcshell needs bumping to pick up server changes, as you hypothesized in the other bug, this change will require such.  Let me know where/how to file a bug to make that happen?  Shouldn't leave the footgun around *even* *more* when this is in.

Thanks in advance to everyone!
Attachment #9031072 - Flags: review?(surkov.alexander)
Attachment #9031072 - Flags: review?(kmaglione+bmo)
Attachment #9031072 - Flags: review?(jimb)
Attachment #9031072 - Flags: review?(MattN+bmo)
(In reply to Jeff Walden [:Waldo] from comment #4)
> I guess if the Android xpcshell needs bumping to pick up server changes, as
> you hypothesized in the other bug, this change will require such.

...as soon as there's a .sjs in the tree, containing UTF-8 in a semantically important manner, that runs on Android.  We do not have one now, of course.  And as of this bug, merely picking up a new httpd.js (if we guessed at the cause of the problem correctly) will no longer be enough to fix UTF-8 .sjs on Android.
Comment on attachment 9031072 [details] [diff] [review]
Make loadSubScript always interpret script data as UTF-8, and fix existing subscripts for this where necessary

Review of attachment 9031072 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks
Attachment #9031072 - Flags: review?(MattN+bmo) → review+
Attachment #9031072 - Flags: review?(kmaglione+bmo) → review+
(In reply to Jeff Walden [:Waldo] from comment #4)
> I guess if the Android xpcshell needs bumping to pick up server changes, as
> you hypothesized in the other bug, this change will require such.  Let me
> know where/how to file a bug to make that happen?  Shouldn't leave the
> footgun around *even* *more* when this is in.

Just file a dupe of bug 1457012, and make sure to ask that it includes these changes. Filing sooner is probably better than later, since it can take them a while, and who knows what unexpected problems will show up when you try to land...
Blocks: 1514075
See Also: → 1514075
Comment on attachment 9031072 [details] [diff] [review]
Make loadSubScript always interpret script data as UTF-8, and fix existing subscripts for this where necessary

Review of attachment 9031072 [details] [diff] [review]:
-----------------------------------------------------------------

::: devtools/client/debugger/new/test/mochitest/browser_dbg-outline.js
@@ +51,5 @@
>      dbg,
>      ".outline-list__element .function-signature"
>    );
> +  is(firstAlphaFunction.innerText, "doEval()",
> +     "Alphabetized first function is correct");

I verified with the debugger front end folks that, indeed, the lambda is in a separate element nowadays.

::: devtools/client/netmonitor/test/browser_net_curl-utils.js
@@ +209,5 @@
>    is(CurlUtils.escapeStringPosix(controlChars), "$'\\x07 \\x09 \\x0c \\x1b'",
>      "Control characters should be escaped.");
>  
> +  // æ ø ü ß ö é
> +  const extendedAsciiChars = "\xc3\xa6 \xc3\xb8 \xc3\xbc \xc3\x9f \xc3\xb6 \xc3\xa9";

So if I understand correctly:

The file itself is using the UTF-8 encoding: for example, the 'ash' (fused a/e) character is encoded as two bytes, 0xc3 0xa6.

Pre-patch, the file is being loaded as Latin-1, such that the string JavaScript builds actually holds *two* 16-bit code units for the ash character, 0x00c3 and 0x00a6, so that JavaScript considers its length (in 16-bit code units, like any JS string) to be 17. (I verified this.)

Post-patch, the file is loaded as UTF-8, and uses an encoding-independent way to represent the same string as before, even though it's a very odd string, that seems extremely unlikely to be what someone intended to test --- yet the test passes.

So although the situation is very fishy, the patch doesn't change what's being tested, which seems like the right thing to do.
Attachment #9031072 - Flags: review?(kmaglione+bmo)
Attachment #9031072 - Flags: review?(jimb)
Attachment #9031072 - Flags: review?(MattN+bmo)
Attachment #9031072 - Flags: review+
Comment on attachment 9031072 [details] [diff] [review]
Make loadSubScript always interpret script data as UTF-8, and fix existing subscripts for this where necessary

Saving MattN and kmag's r+ marks, wiped out by bugzilla with my blind "Okay" button mashing
Attachment #9031072 - Flags: review?(kmaglione+bmo)
Attachment #9031072 - Flags: review?(MattN+bmo)
Attachment #9031072 - Flags: review+
Attachment #9031072 - Flags: review?(surkov.alexander) → review?(yzenevich)
Comment on attachment 9031072 [details] [diff] [review]
Make loadSubScript always interpret script data as UTF-8, and fix existing subscripts for this where necessary

Review of attachment 9031072 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks good.
Attachment #9031072 - Flags: review?(yzenevich) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3e4039f9d1e
Make the JS subscript loader load scripts exclusively as UTF-8, with no way to specify any other encoding, and adjust a bunch of existing tests to use UTF-8 directly, rather than Unicode escape sequences or similar.  (This also changes the encoding of .sjs scripts and all mochitest-browser tests in the tree from Latin-1 to UTF-8.)  r=yzen, r=MattN, r=jimb, r=kmag
Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Blocks: 1515801
See Also: → 1515831
Commit pushed to master at https://github.com/mozilla/activity-stream

https://github.com/mozilla/activity-stream/commit/f7dfbf28998d8697b6099a88077b577f0683ef01
Port Bug 1492937 - Make the JS subscript loader load scripts exclusively as UTF-8 (#4600)
Summary: Should the JS subscript loader load scripts as UTF-8, not as Latin-1? → The JS subscript loader should load scripts as UTF-8 only, never as Latin-1
Target Milestone: --- → mozilla66
You need to log in before you can comment on or make changes to this bug.