Remove UTF-8 handling from SpiderMonkey




7 years ago
6 years ago


(Reporter: terrence, Assigned: terrence)


Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)


(Whiteboard: [leave open][js:t])


(2 attachments)



7 years ago
The current UTF-8 handling code in SpiderMonkey is super-gross and completely untested.  If you wanted to use it, you have to turn it on dynamically before creating the first runtime.  Once turned on, it replaces ASCII in most, but not all cases.  This adds an unnecessary extra branch to every Inflate call and, much worse, makes any call to Inflate a potential GC.  The actual UTF-8 implementation itself appears to be cobbled together and not from any well-known (and well tested library).

This code appears to have been strapped on by an embedder in a totally ad-hoc manner and is causing us real problems right now.  We should remove it until we can build something that integrates well with the rest of SpiderMonkey.

Comment 1

7 years ago
After more investigation, it looks like CTypes has started using the UTF-8 functions, because they had public visibility, for no good reason.  However, CTypes doesn't set js_CStringsAreUTF8, which makes these functions go off the rails if there is an error in the UTF-8 string.  So there is at least one bug in CTypes currently because of this weirdness.

Comment 2

7 years ago
Posted patch v0Splinter Review
It's hard to express just how totally messed up our text encoding situation is right now.  I have epsilon less sanity after untangling this gordian knot of misdirection and lies, so I'm in a good position to explain at the moment.  Imagine that Laurel and Hardy are doing their "Who's on first" sketch, only instead of being about baseball it's about text encodings and the argument is between SpiderMonkey and the Internet instead of Laurel and Hardy.  That's the level of confusion we are talking about here.

This patch is my attempt to explain to SpiderMonkey that "Who's a UTF-8 string" does not imply in any way that it should decode Who as a CESU-8 string, and similar awful punning.  There is still a great deal of work to do, but I'd really like to get this in-tree ASAP: I feel we have a duty to save the sanity of anyone who might try to pass a UTF-8 string into our CompileUTF8 functions and expect a anything but comedy to come out.
Assignee: general → terrence
Attachment #676682 - Flags: review?(luke)
Comment on attachment 676682 [details] [diff] [review]

Review of attachment 676682 [details] [diff] [review]:

Thanks for excising all this dead and (in my experience) confusing logic.

::: js/src/jsapi.cpp
@@ +6246,5 @@
>      size_t necessaryLength = GetDeflatedStringLength(NULL, chars, str->length());
>      if (necessaryLength == size_t(-1))
>          return size_t(-1);
> +    if (writtenLength != length)
> +        JS_NOT_REACHED("CStrings are NOT UTF-8");

Seems like we'd want either JS_ASSERT or JS_Assert (if we want a release-mode assert).

::: js/src/jsapi.h
@@ +4965,2 @@
>  JS_CompileUCScript(JSContext *cx, JSObject *obj,
> +                   const jschar *ucs2, size_t length,

Norbert was telling us not to call it UCS2 (because it isn't).  "twoByteChar" would be more honest but, given that we call ascii 'ascii' and utf8 'utf8' and the type 'jschar' tells us "this is a two-byte char", perhaps we can leave all these cases you changed to 'ucs2' as 'chars'.
Attachment #676682 - Flags: review?(luke) → review+

Comment 4

7 years ago
Green try run at:

Pushed at:

(In reply to Luke Wagner [:luke] from comment #3)
> Norbert was telling us not to call it UCS2 (because it isn't). 
> "twoByteChar" would be more honest but, given that we call ascii 'ascii' and
> utf8 'utf8' and the type 'jschar' tells us "this is a two-byte char",
> perhaps we can leave all these cases you changed to 'ucs2' as 'chars'.

You are correct.  I would like to make the character type names actual C++ types when we convert these interfaces to mozilla::Range subclasses.
Blocks: 805081
Whiteboard: [leave open]


7 years ago
Depends on: 807151
Depends on: 807518

Comment 6

7 years ago
Now that we have CompileOptions, we really don't need the absurd diaspora of Compile varieties.  This reigns in some of the madness by removing the API entries that are UTF-8 specific.

Since I was in the area, this also kills off the NOPed parts of our legacy API.  I was planning to do more here, but a quick visual inspection of shows that somewhere from a third to half of our API surface is either obsoleted or deprecated.  You have to start somewhere, but I can easily move this into a different patch or even bug if you'd like.
Attachment #679430 - Flags: review?(luke)
Comment on attachment 679430 [details] [diff] [review]
v0: move UTF-8 api to CompileOptions

Review of attachment 679430 [details] [diff] [review]:

Rock it.  (Sorry for the delay)
Attachment #679430 - Flags: review?(luke) → review+


7 years ago
Attachment #676682 - Flags: checkin+


7 years ago
Depends on: 811060
Whiteboard: [leave open] → [leave open][js:t]
Depends on: 816888

Comment 10

6 years ago
I think all of the CESU-8 is gone as well as the weird modal stuff. If there is more badness to trim, we can do it in other bugs.
Last Resolved: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.