User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202
When inserting content into elements dynamically by using the W3C DOM's createRange ... appendChild procedure or the IE DOM's innerHTML procedure in documents with a utf-8 charset, any raw ZWNJ (U+200C) or ZWJ (U+200D) multi-byte characters in the content are lost, causing serious corruption of Persian/Arabic words.
Reproducible: Always
Steps to Reproduce:
1. define a utf-8 string that includes ZWNJ and/or ZWJ
2. dynamically insert it into an element (any container element, e.g. a span or div) using either the W3C or IE DOM procedures supported by the Geckos
3.
Actual Results: The ZWMJ and ZWJ multi-byte characters are omitted from the inserted content.
Expected Results: The ZWMJ and ZWJ multi-byte characters should have been retained, as by other browsers such as IE.
ZWNJ and ZNJ are handled properly when present directly in utf-8 documents. Also, if one replaces the raw utf-8 ZWNJ and ZNJ with their corresponding HTML entities (&zwnj; and &zwj;) in string definitions, that content is handled properly by dynamic insertions. The testcase at the indicated URL compares handling of ZWNJ and ZNJ versus their HTML entities in Persian words or strings as span content present either directly in an HTML document (with a utf-8 charset specified) or inserted dynamically with the W3C or IE DOM procedures. The direct inclusions, and the dynamic insertions of content with the HTML entities, are handled correctly, and thus show what should be displayed for the raw utf-8 insertions as well.
Ugh, how many ways can I mis-spell 4- or 3-letter acronyms? Quite a few if I submit a bug report too late at night. In the Description, the utf-8 multi-byte character references should be ZWNJ ("Zero-Width Non-Joiner") or ZWJ ("Zero-Width Joiner") throughout.
This bug also applies for documents with the windows-1256 (Arabic) charset.
Created attachment 171357
Minimalish testcase demonstrating that the content model is correct
Over to JS engine. We have the right data, and we're passing it to JS_EvaluateUCScriptForPrincipals (via EvaluateString()). So it looks like the problem is in the JS engine's parsing of string literals...
From ECMA-262 Edition 3 (available at http://www.mozilla.org/js/language/E262-3.pdf among other places):
7.1 Unicode Format-Control Characters
The Unicode format-control characters (i.e., the characters in category Cf in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages). It is useful to allow these in source text to facilitate editing and display. The format control characters can occur anywhere in the source text of an ECMAScript program. These characters are removed from the source text before applying the lexical grammar. Since these characters are removed before processing string and regular expression literals, one must use a Unicode escape sequence (see section 7.6) to include a Unicode format-control character inside a string or regular expression literal.
/be
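To make the quoted rule concrete, here is a minimal sketch that imitates it on a present-day engine. The helper `stripFormatControls` is hypothetical (it is not part of any engine or of the patch in this bug); it simply deletes category Cf characters from source text before evaluation, which is what section 7.1 describes. It assumes an engine that supports `\p{Cf}` in regular expressions.

```javascript
// Hypothetical helper: imitates ECMA-262 Edition 3 section 7.1 by deleting
// every Unicode category Cf character from source text before evaluation.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

const ZWNJ = String.fromCharCode(0x200c);

// Source text with a raw ZWNJ inside the string literal.
const rawSource = '"a' + ZWNJ + 'b"';
// Source text that spells the same character as a Unicode escape sequence.
const escapedSource = '"a\\u200Cb"';

const rawResult = eval(stripFormatControls(rawSource));
const escapedResult = eval(stripFormatControls(escapedSource));

console.log(rawResult.length);     // 2: the raw joiner was removed
console.log(escapedResult.length); // 3: the escaped joiner survived
```

Under this rule only the escaped spelling keeps the joiner, which matches the behavior reported in the Description.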
That simply makes it unusable for Persian text. What kind of string 'literals' are they if they drop some characters?
ECMA-262 Edition 3 is also ISO-16262, it's pretty set in stone. I can see about changing things for Edition 4, but that's supposed to be compatible, and it has no definite completion date at this point. If you want to quarrel with the existing standard, it will probably just be frustrating for all sides. I don't recall the rationale for stripping formatting characters, but perhaps someone on the cc: list will. As a practical matter, I'm interested to hear what IE and Safari do. If they do not follow ECMA-262 Edition 3, what rule or rules do they implement? /be
(In reply to comment #7)
> That simply makes it unusable for Persian text.
> What kind of string 'literals' are they if they
> drop some characters?
Yes. More importantly, instead of tactlessly charging ahead with a change of Status to RESOLVED INVALID based on ECMA-262 E3, it might have been better to raise that issue for further discussion. We might then have been able to direct /be to UAX 9: http://www.unicode.org/reports/tr9/#X9 for the Unicode 4.0.1 standard, particularly its Implementation Notes section 5.5.1 Joiners, and its section X9 which explains: "The Zero Width Joiner and Non Joiner affect the shaping of the adjacent characters; those that are adjacent in the original backing-store order, even though those characters may end up being rearranged to be non-adjacent by the BIDI algorithm. For more information, see Joiners." It seems to me that getting past the Geckos' present adherence to the ECMA's brain dead formula for handling ZWNJ and ZWJ as "expendable" control characters, so that authors of Persian / Farsi resources might stop viewing the Geckos' implementation as inferior to IE's, is just as important as things such as the Geckos' "undetected document.all" or a number of "detected" non-standards implementations under the rubric of "de-facto" standards.
(In reply to comment #8)
> I'm interested to hear what IE and Safari do.
As I indicated in my description for this bug, IE behaves as if the ZWNJ or ZWJ were retained in the dynamically loaded (script-based) text, so that the adjacent Persian / Farsi characters are properly formed, just as when the ZWNJ or ZWJ are used directly in the document's content. However, it may be doing what UAX 9 of the Unicode 4.0.1 standard to which I referred you now recommends (which I understand to have been coordinated with what ISO now co-recommends): http://www.unicode.org/reports/tr9/
"5.3. Joiners
As described under X9, the Zero Width Joiner and Non Joiner affect the shaping of the adjacent characters; those that are adjacent in the original backing-store order, even though those characters may end up being rearranged to be non-adjacent by the BIDI algorithm. In order to determine the joining behavior of a particular character after applying the BIDI algorithm, there are two main strategies. When shaping, an implementation can refer back to the original backing store to see if there were adjacent ZWNJ or ZWJ characters. Alternatively, the implementation can replace ZWJ and ZWNJ by an out-of-band character property associated with those adjacent characters, so that the information does not interfere with the BIDI algorithm and the information is preserved across rearrangement of those characters. Once the BIDI algorithm has been applied, that out-of-band information can then be used for proper shaping."
Foteos: there's nothing tactless in resolving a bug INVALID, it happens all the time. The issue here is whether we can have a sane, coherent, *specifiable* result for every input, if we abandon ECMA-262 Edition 3 as you advocate. What's the spec now, "do what IE does"? It's true that we try to do that for a select list of hard cases (undetected document.all being just one of those). In this case, though -- and you are the one advocating a change, so it's on you to be specific -- what are the rules? Should we include all Unicode formatting characters in ECMAScript source? Does IE? Or is the issue *only* ZWNJ and ZWJ? Please test IE and report complete results, that will help get this bug REOPENED, or if appropriate, get a new bug filed. Thanks, /be
(In reply to comment #11)
> can have a sane, coherent, *specifiable* result for every input,
> if we abandon ECMA-262 Edition 3 as you advocate.
Please read what I wrote more carefully and do not put your own words in my mouth. I posted a bug specifically about "the Joiners" (ZWNJ and ZWJ). It was you who raised the broader issue of all Unicode format-control characters and seem to think that handling the two Joiners adequately would require that you "abandon" ECMA-262 E3. The issue of the Joiners and the inappropriateness of removing them without subsequently following a strategy for properly shaping the adjacent characters so as not to trash languages such as Persian / Farsi has been discussed by the standards-making organizations, and two implementation strategies have been offered by the Unicode Consortium in coordination with ISO for the Unicode 4.0.1 standard and its ISO-10646 homolog. In my Comment 9 and Comment 10 I posted URLs for the appropriate standards documents, together with quotations concerning the need for special handling of the Joiners (not all Unicode format-control characters) with the two recommended implementation strategies for accomplishing the needed special handling. It is adherence to those standards and adoption of a recommended implementation strategy that I "advocate." If you are curious about how IE handles the other format-control characters, feel free to investigate that yourself. But this bug is specifically about the Joiners, and should be REOPENED because the marking of it as INVALID was premature and is invalid.
Foteos, I did read what you wrote, so calm down. If I didn't hop to it and reopen this bug as fast as you would like, flaming me isn't going to help. Keep a civil tongue if you want to work well with others in the Mozilla community. I'll raise this issue with the ECMA TG1 working group next week. /be
Behdad, thanks for your comments. I have no problem with deviating from a spec, if there's a good interoperation or utility reason, and provided there is some kind of alternative spec to follow. I was not in on ECMA-262 Edition 3's changes for Unicode (I moved from the JS group I'd founded to help form mozilla.org in late 1997). Edition 2 has no such paragraphs excluding format-control characters, so this was an Edition 3 change made after August 1998. Possibly waldemar remembers the rationale, or knows of a document with rationales for the changes from Edition 2 to Edition 3. I'll whip up a patch to allow all Unicode characters in string literals and regular expressions, and attach it here. /be
> In the real world, there is not much rationale behind modifying string
> literals (and regular expressions) in this manner.
Is that true for regular expressions? For example, consider the regular expression (in English): /[abc]/ Now with a Persian equivalent, I assume I'd want to put ZWNJ in between the letters to keep them distinct from each other, for readability. But then, per the regular expression matching algorithm, the regular expression will match any string containing ZWNJ, which is clearly undesirable. At the same time, I agree that the regular expression /ab/ (with a ZWNJ between the two characters) should probably not match the string "ab" (without a ZWNJ between the two characters)...
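A small sketch of both cases raised above, using ASCII letters as stand-ins for Persian ones and assuming an engine that keeps format-control characters in patterns (the patterns are built with `new RegExp` so the joiner is certain to be present). The variable names are illustrative only.

```javascript
const ZWNJ = "\u200C";

// Character class written with ZWNJ between the letters "for readability":
// the joiner becomes a member of the class.
const spacedClass = new RegExp("[a" + ZWNJ + "b" + ZWNJ + "c]");
console.log(spacedClass.test("x" + ZWNJ + "y")); // true: matches on the joiner alone
console.log(spacedClass.test("xyz"));            // false

// Pattern "a<ZWNJ>b" kept verbatim: it requires the joiner in the subject.
const joinedPair = new RegExp("a" + ZWNJ + "b");
console.log(joinedPair.test("ab"));              // false
console.log(joinedPair.test("a" + ZWNJ + "b"));  // true
```

The first result is the undesirable one described above; the second is the behavior the comment argues should be preserved.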
bz: good point, character classes (which don't scale from ASCII to Unicode, hence lwall et al. changing Perl 6 to reuse [] for non-capturing parens) might want to unquote. Patch soon. /be
This is a problem outside character classes too... what about: /ab?c/ in situations where one would normally put a ZWNJ between "a" and "b" but not between "a" and "c"? Consider what the strings it'll be matched against will look like... Perhaps the right solution is to simply have the regexp engine skip over ZWJ and ZWNJ when matching? Otherwise, I bet the current impl doesn't match random strings that don't come from literals (e.g. text values from the DOM)...
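One way to picture the "skip over ZWJ and ZWNJ when matching" idea is the rough sketch below. It is not how any regexp engine implements it: the hypothetical helper `testIgnoringJoiners` just removes raw joiners from both the pattern source and the subject before matching, and it does not handle joiners written as `\u200C` escapes inside the pattern.

```javascript
// Raw ZWNJ (U+200C) and ZWJ (U+200D) characters.
const JOINERS = /[\u200C\u200D]/g;

// Hypothetical helper: approximate joiner-insensitive matching by removing
// raw joiners from the pattern source and from the subject string.
function testIgnoringJoiners(patternSource, subject) {
  const pattern = new RegExp(patternSource.replace(JOINERS, ""));
  return pattern.test(subject.replace(JOINERS, ""));
}

const ZWNJ = "\u200C";
const source = "a" + ZWNJ + "b?c"; // /ab?c/ with a ZWNJ between "a" and "b"

// Verbatim matching: the joiner is mandatory, so "ac" fails.
console.log(new RegExp(source).test("ac"));                // false
// Joiner-insensitive matching accepts both spellings.
console.log(testIgnoringJoiners(source, "ac"));            // true
console.log(testIgnoringJoiners(source, "a" + ZWNJ + "bc")); // true
```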
Created attachment 171584
patch for testing and discussion
This keeps format-control chars inside string literals and regexps. bz's further point is excellent and further highlights the asymmetry in Edition 3 between computed strings vs. literals, and computed RegExp objects and literal regexps: any computed string may contain format-control characters, likewise regular expressions created via new RegExp. What matches what depends on whether the subject or object was expressed literally. Seems like a big bad bug to me. /be
Thanks Brendan, the patch looks pretty good. Behnam, would you please test it? About regexps, I think any kind of special handling simply introduces more confusion. Personally I would never use ZWNJ between letters in a regexp to make them look better. OTOH, I have written regexps with ZWNJ in them. Humm, in Persian ZWNJ is used almost like a dash is used in English... And like Brendan noted, there is the asymmetry too. I'd say, if anything, a regexp modifier could be introduced (like 'i' for case insensitivity) to ignore all format characters when matching regexps. Removing them from the pattern doesn't help as long as they are out there in the to-be-matched text and not ignored. And don't forget, this all should be optional, and off by default. Humm, you said regexp classes do not scale to Unicode in ECMAScript? That's new to me. They work pretty well in Perl 5.8. I vote for applying the attached patch after testing, and postponing the regexp engine stuff for now. Thanks again,
Any word on the patch? I'll get it reviewed and checked in if it seems good. I didn't get much reaction from ECMA TG1 (really, the subset who met today), as we were busy with E4X issues, but it's clear that IE differs from ECMA-262 Edition 3 (and the MS guy was in the room). I think we should agree on something like what this patch does, but it may take a while. /be
Again, I'm fine with it. Roozbeh, maybe you can test it? Behnam?
Thanks Brendan. It WORKS FOR ME. ;)
Comment on attachment 171584
patch for testing and discussion
This patch breaks the invariant that a['bcd'] is equivalent to a.bcd when c is a format-control character. But, it allows users to spell strings the natural way, without having to use \uXXXX sequences. Hoping waldemar can give his thoughts. /be
I'm afraid I'm not following you. What is the invariant?
If c is a format-control character, it will be stripped from a.bcd but not from a['bcd'], and the two forms will denote different properties. /be
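The following sketch illustrates the two property names that result, with "b" and "d" as placeholder letters and ZWNJ as the format-control character. It does not run the patch; it only sets the two keys the comment says each form would denote under the patch.

```javascript
// Illustration only: the two property names the two spellings would denote.
const ZWNJ = "\u200C";
const a = {};

// Bracket form: the string literal keeps the format-control character.
a["b" + ZWNJ + "d"] = "from a['b<ZWNJ>d']";

// Dot form: the character is stripped from the identifier, leaving plain "bd".
a.bd = "from a.b<ZWNJ>d after stripping";

const keys = Object.keys(a);
console.log(keys.length);         // 2
console.log(keys[0] === keys[1]); // false: visually identical, yet distinct
```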
Several times I attempted to "fix" this in the ECMA committee but was unable to obtain support for any fix -- most of the other representatives preferred the text as written. There are technical problems with doing such a fix as well, particularly with the interaction of formatting characters and escape sequences -- what happens if you have a formatting character right after a backslash, within the characters of an escape sequence (such as \uab<formatchar>05), etc. This may not be an issue with ZWNJ, but things like these will come up with other popular ones such as LTR and RTL marks. I don't necessarily agree with the committee's conclusion on this one, but I assure you that this issue was looked at in detail several times, and the conclusion was deliberate. I'd campaign for changing this in the future, but only if we can come up with a sensible proposal that explains what happens in all of the cases, including in/around escape sequences and in regular expression literals. I don't see that here yet.
Comment on attachment 171584
patch for testing and discussion
That approach looks fine, for what we're doing in this patch, but I'm still not sure that we have a really good story on where and why we want various format-control chars to be kept or discarded.
Cc'ing i18n gurus who may have good ideas. I'm supposed to write up a proposal for regular expression changes in Edition 4, and better Unicode support (without switching incompatibly to Perl 6 regular expressions!) is a goal. That work item of mine is not directly related to this bug, but it touches on this topic. I am willing to write a separate proposal on ZWNJ and ZWJ, with the right guidance. When should format-control characters be ignored, and when not? Should ZWNJ and ZWJ be treated specially? /be
I for one think there shouldn't be anything special about ZW[N]J at all. Simply that string literals should be literal, no characters added or removed.
Behdad: got that, and the patch in this bug does that -- likewise for regular expressions. But should ZWNJ in a regexp, not in an escape or a character class, be taken verbatim? /be
Hi Brendan, this is not exactly one of my areas, but it might be a good idea to talk to Mark Davis and/or take a look at his latest draft of TR18: http://www.unicode.org/reports/tr18/tr18-10.html In particular, go to Annex C and start a search (find) for "format".
This does cause js/tests/ecma_3/Unicode/uc-001.js to fail.
Certainly another bug, but worth mentioning here: I've got reports that the new versions of Firefox (with Uniscribe backend I guess) simply ignore '\u200c' too and '&zwnj;' should be used instead.
(In reply to comment #36)
> Certainly another bug, but worth mentioning here: I've got reports that the new
> versions of Firefox (with Uniscribe backend I guess) simply ignore '\u200c' too
> and '&zwnj;' should be used instead.
Please file it separately if you can confirm the reports, especially with a testcase. Thanks.
I have proposed to ECMA TG1 that for Edition 4 (and all versions, really), we do at least as IE does per comment 14. That is, we do not strip format-control characters in string literals. We still need to test IE, evaluate what it does in these cases, and decide whether that's good enough to become de-jure standard:
* In regexps, including in character classes and outside of them.
* When matching a regexp against a target string containing a format-control char.
* After backslash in string literals and regexps.
* Other edge cases?
It would be a big help if interested folks would construct a test matrix and fill it in with results from IE, other browsers, and also what's considered ideal. /be
I hear Opera flouts ECMA-262 utterly and does not strip any format-control chars, anywhere in script source. If IE does strip outside of strings and possibly other literals, then Opera is +1. /be
We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do not strip format control characters; (b) therefore allow them in string literals and regular expression literals; (c) let them through outside of such quoted literals, where they become invalid source characters. We are going to codify this behavior in Edition 4. Look for mozilla1.9/firefox3 to implement it. /be
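Points (a) through (c) can be probed on a present-day engine with a sketch like the one below. It uses LEFT-TO-RIGHT MARK (U+200E) as the format-control character and `eval` so that the raw character is guaranteed to be in the source text. The results noted in the comments are what a current ECMAScript engine such as Node.js is expected to give; the thread itself only reports IE6 and IE7.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK, Unicode category Cf

// (b) Inside a string literal the character is kept.
const insideString = eval('"a' + LRM + 'b"');
console.log(insideString.length); // 3

// (b) Inside a regular expression literal it is kept and matched verbatim.
const insideRegExp = eval("/a" + LRM + "b/");
console.log(insideRegExp.test("a" + LRM + "b")); // true
console.log(insideRegExp.test("ab"));            // false

// (c) Outside quoted literals it is an invalid source character.
let outsideError = null;
try {
  eval("var x" + LRM + " = 1;");
} catch (e) {
  outsideError = e;
}
console.log(outsideError instanceof SyntaxError); // true
```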
Created attachment 240106
Don't skip format-control chars
diff -w version in a sec. /be
Fixed on the mozilla1.9 trunk: js/src/jsscan.c 3.118 /be
*grumble about old QA contact*
(In reply to comment #39)
> We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do
> not strip format control characters; (b) therefore allow them in string
> literals and regular expression literals; (c) let them through outside of such
> quoted literals, where they become invalid source characters.
>
> We are going to codify this behavior in Edition 4. Look for
> mozilla1.9/firefox3 to implement it.
I scanned through http://developer.mozilla.org/es4/ and didn't see anything yet codified for this -- I believe I understand (a) and (b), but (c) is less clear to me. In that case, would the program be "syntactically in error" (5.1.4 in ed. 3) and thus result in a SyntaxError exception (well, if the program were being passed as a string to eval)?
We will re-export pretty soon, and the latest wiki contents will say that Cf chars are not stripped. (a) and (b) follow from lexical productions in ECMA-262 Ed. 3 section 7 that have right parts of the form "SourceCharacter but not ...", once you remove stripping of Cf chars specified in 7.1. This allows Cf chars in comments as well as strings and regexps. (c) follows from other lexical productions, e.g. for Identifier, which restrict the classes of component chars, terminating on any char outside the set of valid chars. This leaves the offending char, if it's Cf, not satisfying the goal symbols at the top of the lexical grammar specified in section 7. /be
js> eval("hi\uFEFFthere")
typein:1: SyntaxError: illegal character:
typein:1: hi?there
typein:1: ..^
/be
/cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v <-- regress-274152.js
See bug 372198 where I added ecma_3/Unicode/uc-001.js to spidermonkey-n.tests.
Relevant Unicode Public Review issue: http://www.unicode.org/review/pr-96.html
Any chance we can get the fix (or the partial one in the patch attached earlier by Brendan) into ff2 tree?
updated due to bug 368516 to include format control chars and not BOM. /cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v <-- regress-274152.js new revision: 1.3; previous revision: 1.2
Should this fix be ported back to the 1.8 branch so web pages get consistent behavior?
No reply, I guess there's no pressing need on the 1.8 branch, and if not probably best not to break any existing scripts.