mozIJSSubscriptLoader::loadSubScript does not support specifying the character encoding (treats all files as ISO-8859-1)

RESOLVED FIXED in mozilla2.0b7

Status

()

Core
XPConnect
RESOLVED FIXED
11 years ago
7 years ago

People

(Reporter: Georg Maaß, Assigned: kmag)

Tracking

({dev-doc-complete})

Trunk
mozilla2.0b7
dev-doc-complete
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060414
Build Identifier: 

loadSubScript does not support JavaScript 1.7 and does not support specifying the character encoding nor does detect the character encoding automatically or the language version.

http://developer.mozilla.org/en/docs/mozIJSSubScriptLoader

I sugesst we should unfreese the interface and add 2 additional parameters to specify the script language in the format of a mime type with version information, because in the actual implementation JavaScript 1.7 is supported only partially. The let statement is missing. Also a parameter specifying the character encoding should be added.

This is especially a problem, for local script files, that do not send such information via HTTP header. In XULRunner applications all files are local files and do not provide a content type HTTP header telling which character encoding is used. This causes local files to be parsed wrong.

I have trouble with utf-8 files, that are assumed to be 8859-1 but are utf-8.

Reproducible: Always
we don't unfreeze interfaces. we could add an mozIJSSubScriptLoader2 version though.

Comment 2

11 years ago
One issue per bug, please. Can you file a separate bug for the character encoding bug and let this be about loading JS 1.7 with mozIJSSubScriptLoader?

Can you provide a testcase for this? It appears that the sub-script is evaluated with the same JS version as the parent script, i.e. if you run 

var loader = Components.classes["@mozilla.org/moz/jssubscript-loader;1"].getService(Components.interfaces.mozIJSSubScriptLoader);
loader.loadSubScript("data:text/plain,let f=1")

- that works, as long as |let| works in the parent script (the one that uses loadSubScript). That sounds sensible to me. Do you disagree?
(Reporter)

Comment 3

11 years ago
(In reply to comment #1)
> we don't unfreeze interfaces. we could add an mozIJSSubScriptLoader2 version
> though.

that is ok too
(Reporter)

Comment 4

11 years ago
(In reply to comment #2)
> It appears that the sub-script is
> evaluated with the same JS version as the parent script.

You are right. The script version is inherited from the parent script. The charset is not inherited. (tested on Mac using XULRunner 1.9a4)

<script type="application/x-javascript; version=1.7; charset=utf-8" src="chrome://ida/content/js/ida.js"/>

The charset in the type attribute is ignored, so that is no work around. Instead of using the charset from the type attribut ida.js allways uses the charset of the embeding XUL document, but scripts loaded with the script loader use iso-8859-1. They do not inherit the charset from ida.js.

You see an extended interface for script loading to control this individually script by script is useful. This is, why i requested for this enhancement.

eval does not support let. Can I set the ContentScriptType of the XUL window? How can I do that?

My attempt

<html:meta html:http-equiv="Content-Script-Type" html:content="application/x-javascript; version=1.7; charset=utf-8"/>


does not result in eval supporting let.

Comment 5

11 years ago
You're confusing matters by talking about multiple things in a single bug. Again, the charset issue appears to be a valid one - can you file it separately, please?

Not sure what Content-Script-Type or eval have to do with the JS 1.7 part of the report. FWIW, I still don't think the change you're suggesting is necessary (while it gives more flexibility, just switching all of your scripts to 1.7 should not be very hard.)
(Reporter)

Comment 6

11 years ago
It is not a bug, it is a request for an enhancement of an existing interface, which can be done completely transparent to existing applications or for creating an additional more powerful interface as Christian answered in his comment #1.

For the eval it is unclear to me, where it takes its script version from. It does not take it from the script, where it is written, so it is not affected by the inheritance implementes in the script loader interface.

Comment 7

10 years ago
any news about this? we still get a broken behaviour which looks pretty critical to me.
"à".match(/\u00e0/) matches if executed in a script loaded by <script> or in firebug console (where probably the charset is correctly set to utf-8) but stops matching in a script loaded by mozIJSSubScriptLoader.
is it that big? excellent javascript unicode support is broken in firefox extensions?

Thanks,
Federico
(Reporter)

Comment 8

10 years ago
For the moment we encode all umlauts and ligatures as unicode escape sequences to make the script file an ASCII file, because we can't wait for the Mozilla project to provide a fixed implementation of that interface or a new interface with the necessary features to circumvent that trouble of different behaviour of the same file depending on whether loade via script tag or via loadSubScript call.

Comment 9

10 years ago
(In reply to comment #8)
> For the moment we encode all umlauts and ligatures as unicode escape sequences

which version of firefox? i don't get how we can still get "à".match(/\u00e0/) == null,
under loadsubscript on firefox 2.0.0.11, if you actually convert "à" to "\u00e0".
Do you want me to make a test case?

Thanks,
Federico
(Reporter)

Comment 10

10 years ago
Our experience is not from FF 2 but from XulRunner 1.9.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: loadSubScript does not support JavaScript 1.7 and does not support specifying the character encoding → mozIJSSubscriptLoader::loadSubScript does not support specifying the character encoding
Duplicate of this bug: 489137
Duplicate of this bug: 467182
Severity: enhancement → normal
Summary: mozIJSSubscriptLoader::loadSubScript does not support specifying the character encoding → mozIJSSubscriptLoader::loadSubScript does not support specifying the character encoding (treats all files as ISO-8859-1)
This bug we also encounter now when running Mozmill tests which contain non-ascii characters. Is mozIJSSubScriptLoader the only interface to execute scripts in a synchronous way or do we have another interface? If not, is the usage of the numerical value the only way to circumvent this bug for the moment?
Blocks: 506760
Blocks: 559814
Is there any possible way this can be fixed before Firefox 4.0, especially now that interfaces are unfrozen? If not, there are quite a lot of extensions that will have to maintain hacks for Unicode support in their scripts for years to come.

I can provide a patch, if that would help expedite matters.
Interfaces are unfrozen in general, but we're trying to not change them after the first beta, so we don't break people who have started developing against the interface.

If this API is only used from JS, then additive changes that won't break existing JS consumers are probably ok.

Comment 16

7 years ago
Given that billions of potential users don't use English, I should think that not handling international character set is a rather critical issue that was reported more than 3 years ago.
Hope that can be done by adding an optional parameter that will not break existing code.
Thanks!
Users don't use this API; developers do.  And the developer either controls the script being loaded (in which case he can use \u escapes for all non-ASCII characters) or doesn't (in which case he has no idea what encoding to claim for the script anyway).

Fixing this bug is about developer convenience; it doesn't affect users in any way I can see other than by possibly increasing developer productivity.

Comment 18

7 years ago
Developers work for users.
And when developers are not able to do something that works in users' languages, it's not good for users.
If they are able to do it but the workload is too big, they don't do it and it's bad for users too. That explains why so many services don't handle non-English language properly (try creating a developper account at apple.com with a last name using a non ascii characterset: it will be cancelled).

When writing scripts that use localized strings, it's a real pain to maintain \u escapes, the script becomes unreadable.

Have you done it yourself ?

It can be ok when you have a few strings, it's unrealistic when you have hundreds, each in several languages where most need unicode.

I think more use cases are easy to find, like reading JSonified UTF-8 data, scripts being translated by volunteers on tools like crowdin or babelzilla

If it's a matter of not breaking APIs, then as Christian Biesinger suggested 3 years ago, may be a mozIJSSubScriptLoader2 API would be safer.

Comment 19

7 years ago
I have to agree with Xavier. Using \u escapes is so uncomfortable that it's practically impossible for more than a 5-10 strings. It's not just an inconvenience.

FWIW, the JSON parser in Gecko is supposed to work with Unicode (but I don't use it, because I need comments, so I'm back to subscriptloader).
+1 to comment 18.

I originally CC'ed myself on this bug when working on Ubiquity, when we were soliciting parser description files from users with some casual JS experience who spoke the target language (many first-time Mozilla contributors) for inclusion in Ubiquity. We ended up going a different route to circumvent this issue, because telling contributors to use \u was a non-starter.
But telling contributors to use UTF-8 and no other encoding was a starter?

In any case, I agree this should be fixed and would be happy to review a patch that does so in a safe way (or unsafe, for post-Firefox-4).  I just don't think it's a "critical because it immediately affects billions of users" issue, which is how comment 16 affected to characterize it.
(In reply to comment #21)
> But telling contributors to use UTF-8 and no other encoding was a starter?

From my experience then, yes it was.

But I agree that this does not affect billions of users.
Created attachment 479855 [details]
Adds charset argument to mozIJSSubScriptLoader.loadSubScript

The attached patch adds an optional charset string argument to mozIJSSubScriptLoader.loadSubScript which, if present, causes the script to be decoded from that character set and evaluated as Unicode.
Comment on attachment 479855 [details]
Adds charset argument to mozIJSSubScriptLoader.loadSubScript

Oh, this thing is totally not callable from C++ anyway!  Ok, then.

Thank you for writing this!
Attachment #479855 - Flags: review?(bzbarsky)
Comment on attachment 479855 [details]
Adds charset argument to mozIJSSubScriptLoader.loadSubScript

>+++ b/js/src/xpconnect/loader/mozJSSubScriptLoader.cpp
>@@ -158,7 +160,8 @@ mozJSSubScriptLoader::LoadSubScript (con
>+    ok = JS_ConvertArguments (cx, argc, argv, "s / o s", &url, &target_obj, &charsetStr);

Probably better to use an 'W' param here, not 's', which will hand out a jschar*.  And then instead of NS_ConvertUTF8toUTF16 you can use nsDependentString(reinterpret_cast<PRUnichar*>()).  Just saves allocating that extra char* buffer and all.

>+        nsString script;
>+        rv = nsScriptLoader::ConvertToUTF16(chan, (PRUint8*)(char*)buf, PRUint32(len),
>+                                            charset, nsnull, script);

Unless you actually want to allow the channel charset to override |charset|, pass null for the channel?

And does reinterpret_cast<PRUint8*>(buf) do the right thing instead of those two C casts?

Why do you need the cast on |len|?

>+                                               (jschar*)script.get(), script.Length(),

Again, reinterpret_cast<> here.

r=me with those fixes and the channel thing addressed.

Thanks again for the patch!  Do you want to update it to the comments above?  I can do it, but I'll need answers to some of those questions from you.
Attachment #479855 - Flags: review?(bzbarsky) → review+
(In reply to comment #25)
> Comment on attachment 479855 [details]
> Adds charset argument to mozIJSSubScriptLoader.loadSubScript
> 
> >+++ b/js/src/xpconnect/loader/mozJSSubScriptLoader.cpp
> >@@ -158,7 +160,8 @@ mozJSSubScriptLoader::LoadSubScript (con
> >+    ok = JS_ConvertArguments (cx, argc, argv, "s / o s", &url, &target_obj, &charsetStr);
> 
> Probably better to use an 'W' param here, not 's', which will hand out a
> jschar*.  And then instead of NS_ConvertUTF8toUTF16 you can use
> nsDependentString(reinterpret_cast<PRUnichar*>()).  Just saves allocating that
> extra char* buffer and all.

Ok.

> >+        nsString script;
> >+        rv = nsScriptLoader::ConvertToUTF16(chan, (PRUint8*)(char*)buf, PRUint32(len),
> >+                                            charset, nsnull, script);
> 
> Unless you actually want to allow the channel charset to override |charset|,
> pass null for the channel?

I actually struggled with that. At the moment, chrome and file
URL channels don't supply a charset, but it seemed to me that if
they at some point do, it would probably be more valid than the
charset supplied by the caller. But the more I think about it, the
more I think that it should really only be a fallback if the caller
doesn't provide a charset, or to preserve compatibility with the
current interface, if null or "" is passed instead.

> And does reinterpret_cast<PRUint8*>(buf) do the right thing instead of those
> two C casts?
> 
> Why do you need the cast on |len|?

Sorry, I realized just after I posted that I forgot to clean those
up. I was struggling to get the prototype to match, and as GCC so
unhelpfully fails to mention which arguments are incompatible, I
just cast everything until I got it working.

> >+                                               (jschar*)script.get(), script.Length(),
> 
> Again, reinterpret_cast<> here.

I thought that was strange, but every other invocation of that function
in mozilla-central seemed to use the C-style cast, so I figured
there was probably a reason.

> r=me with those fixes and the channel thing addressed.
> 
> Thanks again for the patch!  Do you want to update it to the comments above?  I
> can do it, but I'll need answers to some of those questions from you.

Ok. I'll have an updated patch a little later today.
> so I figured there was probably a reason.

Purely historical; the C++-style casts were not supported by all the compilers we wanted to support at one point in time (over a decade ago).
Created attachment 480197 [details] [diff] [review]
Requested updates to attachment 479855 [details]

Adds the changes requested in comment 25.

I'm sorely tempted to standardize the spacing before the opening paren in function calls, but it's so inconsistent that I don't know where to start.
Attachment #479855 - Attachment is obsolete: true
Attachment #480197 - Flags: review?(bzbarsky)
Comment on attachment 480197 [details] [diff] [review]
Requested updates to attachment 479855 [details]

r=me.

Requesting 2.0 approval; this is a safe additive change that allows the subscript loader to load non-ISO-8859-1 stuff.
Attachment #480197 - Flags: review?(bzbarsky)
Attachment #480197 - Flags: review+
Attachment #480197 - Flags: approval2.0?
Assignee: nobody → maglione.k

Updated

7 years ago
Attachment #480197 - Flags: approval2.0? → approval2.0+

Comment 30

7 years ago
http://hg.mozilla.org/mozilla-central/rev/3dcf107b7267
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b8
Target Milestone: mozilla2.0b8 → mozilla2.0b7

Comment 31

7 years ago
Has this been fixed?
I've tested it on firefox4 nightly build, but it doesn't work.
loader.loadSubScript(uri, module, "UTF-8");
That should work.  Can you attach an example that shows it not working?

Comment 33

7 years ago
This is my test code, correct me if I make some mistake.
try {
    var testFilePath = 'E:\\test.uc.js';
    var testFile = Cc["@mozilla.org/file/local;1"].createInstance(Ci.nsILocalFile);
    testFile.initWithPath(testFilePath);
    Cc["@mozilla.org/moz/jssubscript-loader;1"].getService(Ci.mozIJSSubScriptLoader).loadSubScript(Cc["@mozilla.org/network/io-service;1"].getService(Ci.nsIIOService).getProtocolHandler("file").QueryInterface(Ci.nsIFileProtocolHandler).getURLSpecFromFile(testFile), window, 'UTF-8');
} catch(e) {
    alert(e);
}

Here is the content of test.uc.js file.
(function() {
    alert("测试");
})();
It certainly works for me in nightlies and the 4.0 RCs.
I meant attaching the actual file you're loading (as an attachment, not via copy/paste).

Comment 36

7 years ago
Created attachment 520857 [details]
A testcase for loading files with utf-8 encoding characters.

I load this file using utf-8, but the string "测试" can not be alerted correctly, but four blocks are shown instead.

Comment 37

7 years ago
Thanks. I've got it.
It works. Sorry for that my file is not saved as utf-8.

Updated

7 years ago
Keywords: dev-doc-needed
Version: unspecified → Trunk
This had already been documented, but I've cleaned the docs up a bit:

https://developer.mozilla.org/en/XPCOM_Interface_Reference/mozIJSSubScriptLoader

And it's mentioned on Firefox 5 for developers.
Keywords: dev-doc-needed → dev-doc-complete
You need to log in before you can comment on or make changes to this bug.