Open Bug 1347877 Opened 8 years ago Updated 11 months ago

Remove nsIScriptableUnicodeConverter

Categories

(Core :: Internationalization, enhancement, P3)

enhancement

Tracking

()

Performance Impact none

People

(Reporter: hsivonen, Unassigned)

References

(Depends on 6 open bugs, Blocks 1 open bug)

Details

We should remove nsIScriptableUnicodeConverter and have chrome script use the same facilities as Web scripts (TextDecoder/TextEncoder) instead of maintaining a parallel XPCOM universe of things.
No longer depends on: post-57-api-changes
This is still used a lot in calendar, chat and mailnews. Can you please describe what the replacement is?
The problem is that TextEncoder does not support non-UTF-8 encodings. We will have to move nsIScriptableUnicodeConverter to c-c unless we fix bug 862292.
Depends on: 1349722
Priority: -- → P3
(In reply to :aceman from comment #1) > This is still used a lot in calendar, chat and mailnews. > Can you please describe what the replacement is? Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in legacy encodings.) (In reply to Masatoshi Kimura [:emk] from comment #3) > The problem is that TextEncoder does not support non-UTF-8 encodings. We > will have to move nsIScriptableUnicodeConverter to c-c unless we fix bug > 862292. Fixing bug 862292 would be a fine solution. :-) Failing that, c-c needs to figure out on its own how it wants to call into the upcoming (bug 1261841) mozilla::Encoding/mozilla::Encoder for ISO-2022-JP encode if it moves the message encoding step from C++ to JS in the future (I believe the relevant call is in C++ at present). My advice would be not moving nsIScriptableUnicodeConverter over to c-c, but c-c devs may, of course, opt to do so.
(In reply to Henri Sivonen (:hsivonen) from comment #4) > (In reply to :aceman from comment #1) > > This is still used a lot in calendar, chat and mailnews. > > Can you please describe what the replacement is? > > Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in > legacy encodings.) > Henri, as far as nsIScriptableUnicodeConverter, do you have an opinion on using the unescape(encodeURIComponenet(str))<->decodeURIComponent(escape(str)) method to do the js<->cpp conversion, as proposed in Bug 1349722 patch? (I don't think async is important for the use case.)
Whiteboard: [qf]
Whiteboard: [qf] → [qf-]
(In reply to alta88 from comment #5) > (In reply to Henri Sivonen (:hsivonen) from comment #4) > > (In reply to :aceman from comment #1) > > > This is still used a lot in calendar, chat and mailnews. > > > Can you please describe what the replacement is? > > > > Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in > > legacy encodings.) > > > > Henri, as far as nsIScriptableUnicodeConverter, do you have an opinion on > using the > unescape(encodeURIComponenet(str))<->decodeURIComponent(escape(str)) method > to do > the js<->cpp conversion, as proposed in Bug 1349722 patch? (I don't think > async is > important for the use case.) I think I'm missing some context for this question. Do you mean to perform a conversion between an UTF-16 JS string and a UTF-8 string represented as a JS string with each byte zero-extended into a 16-bit code unit without actually caring about URL encoding? Or does the escaping and unescaping actually serve some URL-related purpose here?
(In reply to Henri Sivonen (:hsivonen) from comment #6) > (In reply to alta88 from comment #5) > > (In reply to Henri Sivonen (:hsivonen) from comment #4) > > > (In reply to :aceman from comment #1) > > > > This is still used a lot in calendar, chat and mailnews. > > > > Can you please describe what the replacement is? > > > > > > Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in > > > legacy encodings.) > > > > > > > Henri, as far as nsIScriptableUnicodeConverter, do you have an opinion on > > using the > > unescape(encodeURIComponenet(str))<->decodeURIComponent(escape(str)) method > > to do > > the js<->cpp conversion, as proposed in Bug 1349722 patch? (I don't think > > async is > > important for the use case.) > > I think I'm missing some context for this question. Do you mean to perform a > conversion between an UTF-16 JS string and a UTF-8 string represented as a > JS string with each byte zero-extended into a 16-bit code unit Yes. > without > actually caring about URL encoding? The context is that the xhr response is an xml doc and utf8 is the encoding, and nothing is sniffed or otherwise detected explicitly for any other encoding. Upon js (using UTF-16 internally) parsing of the doc, it is given to xpcom as a |string| which carries no explicit (likely 8859-1 implied) encoding info. The string is currently converted into the right char * bytes using nsIScriptableUnicodeConverter.ConvertFromUnicode and UTF-8 as the encoding. > Or does the escaping and unescaping > actually serve some URL-related purpose here? This seems to be part and parcel of using *codeURIComponenet, meaning it doesn't work without that conversion. So the question is whether you're familiar with this method and whether it's an acceptable alternative to writing some other conversion code as filed in Bug 1353285. It wfm for all manner of idn and non ascii urls as well as utf8 js string content.
(In reply to alta88 from comment #7) > (In reply to Henri Sivonen (:hsivonen) from comment #6) > > (In reply to alta88 from comment #5) > > > (In reply to Henri Sivonen (:hsivonen) from comment #4) > > > > (In reply to :aceman from comment #1) > > > > > This is still used a lot in calendar, chat and mailnews. > > > > > Can you please describe what the replacement is? > > > > > > > > Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in > > > > legacy encodings.) > > > > > > > > > > Henri, as far as nsIScriptableUnicodeConverter, do you have an opinion on > > > using the > > > unescape(encodeURIComponenet(str))<->decodeURIComponent(escape(str)) method > > > to do > > > the js<->cpp conversion, as proposed in Bug 1349722 patch? (I don't think > > > async is > > > important for the use case.) > > > > I think I'm missing some context for this question. Do you mean to perform a > > conversion between an UTF-16 JS string and a UTF-8 string represented as a > > JS string with each byte zero-extended into a 16-bit code unit > > Yes. It seems inefficient for that purpose. > > without > > actually caring about URL encoding? > > The context is that the xhr response is an xml doc and utf8 is the encoding, > and nothing is sniffed or otherwise detected explicitly for any other > encoding. Upon js (using UTF-16 internally) parsing of the doc, it is given > to xpcom as a |string| which carries no explicit (likely 8859-1 implied) > encoding info. The string is currently converted into the right char * bytes > using nsIScriptableUnicodeConverter.ConvertFromUnicode and UTF-8 as the > encoding. This doesn't make sense to me. XHR gives an AString (UTF-16) to XPCOM and ConvertFromUnicode takes UTF-16. There doesn't appear to be an 8-bit string between those steps. > > Or does the escaping and unescaping > > actually serve some URL-related purpose here? > > This seems to be part and parcel of using *codeURIComponenet, meaning it > doesn't work without that conversion. > > So the question is whether you're familiar with this method and whether it's > an acceptable alternative to writing some other conversion code as filed in > Bug 1353285. It seems excessively inefficient as an implementation for bug 1353285.
(In reply to Henri Sivonen (:hsivonen) from comment #8) > (In reply to alta88 from comment #7) > > (In reply to Henri Sivonen (:hsivonen) from comment #6) > > > (In reply to alta88 from comment #5) > > > > (In reply to Henri Sivonen (:hsivonen) from comment #4) > > > > > (In reply to :aceman from comment #1) > > > > > > This is still used a lot in calendar, chat and mailnews. > > > > > > Can you please describe what the replacement is? > > > > > > > > > > Filed bug 1353285. (Covers only encode to UTF-8. Please don't do output in > > > > > legacy encodings.) > > > > > > > > > > > > > Henri, as far as nsIScriptableUnicodeConverter, do you have an opinion on > > > > using the > > > > unescape(encodeURIComponenet(str))<->decodeURIComponent(escape(str)) method > > > > to do > > > > the js<->cpp conversion, as proposed in Bug 1349722 patch? (I don't think > > > > async is > > > > important for the use case.) > > > > > > I think I'm missing some context for this question. Do you mean to perform a > > > conversion between an UTF-16 JS string and a UTF-8 string represented as a > > > JS string with each byte zero-extended into a 16-bit code unit > > > > Yes. > > It seems inefficient for that purpose. > > > > without > > > actually caring about URL encoding? > > > > The context is that the xhr response is an xml doc and utf8 is the encoding, > > and nothing is sniffed or otherwise detected explicitly for any other > > encoding. Upon js (using UTF-16 internally) parsing of the doc, it is given > > to xpcom as a |string| which carries no explicit (likely 8859-1 implied) > > encoding info. The string is currently converted into the right char * bytes > > using nsIScriptableUnicodeConverter.ConvertFromUnicode and UTF-8 as the > > encoding. > > This doesn't make sense to me. XHR gives an AString (UTF-16) to XPCOM and > ConvertFromUnicode takes UTF-16. There doesn't appear to be an 8-bit string > between those steps. > The xpcom function argument is cast as a const char * for the data string. > > > Or does the escaping and unescaping > > > actually serve some URL-related purpose here? > > > > This seems to be part and parcel of using *codeURIComponenet, meaning it > > doesn't work without that conversion. > > > > So the question is whether you're familiar with this method and whether it's > > an acceptable alternative to writing some other conversion code as filed in > > Bug 1353285. > > It seems excessively inefficient as an implementation for bug 1353285. First, thanks for taking the time to consider this. Measuring the difference between using encodeURIComponent vs. ConvertFromUnicode for the conversion, on a 65kb string (which is large for the use case), both show it taking 1ms. On smaller strings, encodeURIComponent may show 1ms while it takes less than that for ConvertFromUnicode such that it doesn't register out to ms. Given that 1) The upstream api provider is notorious for pulling apis without notice (not necessarily this api/case), 2) The new implementation isn't written, 3) The encodeURIComponent variants are standard spec since ECMA 5 and more.. reliable for a downstream consumer, whatever inefficiency may exist (fractions of ms) is meaningless for prudent technical risk management.
See Also: → 1460100
Performance Impact: --- → -
Whiteboard: [qf-]
Depends on: 1761317
Depends on: 1762335
Depends on: 1773535
Depends on: 1773932
Severity: normal → S3
Depends on: 1803986
Depends on: 1811339

Just to note there are now < 30 instances of this accessed from JavaScript files in the tree. It may be worth a small drive to remove the rest if this is still desired.

I think it still makes sense to do this. A quick glance at some of the remaining uses indicates that the key mismatch arises from XPCOM streams not using ArrayBuffers, so if replacing XPCOM streams with Web Platform Streams is too large a prerequisite step, an intermediate step would be making XPCOM streams able to deal with bytes as ArrayBuffers.

Depends on: 1811994
Depends on: 1851797
No longer depends on: 923017
Depends on: 1861645
You need to log in before you can comment on or make changes to this bug.