Closed Bug 274152 Opened 20 years ago Closed 18 years ago

ECMA-262 Edition 3 specifies ignoring ZWNJ and ZWJ along with other Unicode format-control characters

Categories

(Core :: JavaScript Engine, defect, P2)

VERIFIED FIXED
mozilla1.8beta3

People

(Reporter: fotemac, Assigned: brendan)

References

(Blocks 1 open bug)

Details

(Keywords: js1.5)

Attachments

(3 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202

When inserting content into elements dynamically by using the W3C DOM's
createRange ... appendChild procedure or the IE DOM's innerHTML procedure in
documents with a utf-8 charset, any raw ZWNJ (U+200C) or ZWJ (U+200D) multi-byte
characters in the content are lost, causing serious corruption of Persian/Arabic
words.

Reproducible: Always
Steps to Reproduce:
1. define a utf-8 string that includes ZWNJ and/or ZWJ
2. dynamically insert it into an element (any container element, e.g. a span or
div) using either the W3C or IE DOM procedures supported by the Geckos
3.

Actual Results:  
The ZWMJ and ZWJ multi-byte characters are omitted from the inserted content.

Expected Results:  
The ZWMJ and ZWJ multi-byte characters should have been retained, as by other
browsers such as IE.

ZWNJ and ZNJ are handled properly when present directly in utf-8 documents. 
Also, if one replaces the raw utf-8 ZWNJ and ZNJ with their corresponding HTML
entities (&zwnj; and &zwj;) in string definitions, that content is handled
properly by dynamic insertions.

The testcase at the indicated URL compares handling of ZWNJ and ZNJ versus their
HTML entities in Persian words or strings as span content present either
directly in an HTML document (with a utf-8 charset specified) or inserted
dynamically with the W3C or IE DOM procedures.  The direct inclusions, and the
dynamic insertions of content with the HTML entities, are handled correctly, and
thus show what should be displayed for the raw utf-8 insertions as well.
Ugh, how many ways can I mis-spell 4- or 3-letter acronyms?  Quite a few if I 
submit a bug report too late at night.  In the Description, the utf-8
multi-byte character references should be ZWNJ ("Zero-Width Non-Joiner") or ZWJ 
("Zero-Width Joiner") throughout.
I've also noticed that using \u200C and \u200D in the JavaScript code also
solves the problem.  Very weird!
This bug also applies for documents with the windows-1256 (Arabic) charset.
Over to JS engine.  We have the right data, and we're passing it to
JS_EvaluateUCScriptForPrincipals (via EvaluateString()).  So it looks like the
problem is in the JS engine's parsing of string literals...
Assignee: general → general
Status: UNCONFIRMED → NEW
Component: DOM: HTML → JavaScript Engine
Ever confirmed: true
OS: Windows 98 → All
QA Contact: ian → pschwartau
Hardware: PC → All
Summary: utf-8 ZWNJ and ZWJ are lost from dynamiic insertions → utf-8 ZWNJ and ZWJ are lost from dynamic insertions
From ECMA-262 Edition 3 (available at
http://www.mozilla.org/js/language/E262-3.pdf among other places):

7.1 Unicode Format-Control Characters

The Unicode format-control characters (i.e., the characters in category  Cf  in
the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK)
are control codes used to control the formatting of a range of text in the
absence of higher-level protocols for this (such as mark-up languages). It is
useful to allow these in source text to facilitate editing and display.

The format control characters can occur anywhere in the source text of an
ECMAScript program. These characters are removed from the source text before
applying the lexical grammar. Since these characters are removed before
processing string and regular expression literals, one must use a Unicode
escape sequence (see section 7.6) to include a Unicode format-control character
inside a string or regular expression literal.

/be
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
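For illustration, the Edition 3 rule quoted above can be simulated in a current engine.  This is a minimal sketch, not SpiderMonkey's scanner: stripFormatControls and evaluateLiteral are hypothetical helpers, and the \p{Cf} property escape assumes an engine that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper: mimic ECMA-262 Edition 3 section 7.1 by removing every
// Unicode category Cf character from source text before it is parsed.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

// Hypothetical helper: evaluate the source text of a string literal.
function evaluateLiteral(source) {
  return Function("return " + source + ";")();
}

const ZWNJ = "\u200C";

// Source text of a literal with a raw ZWNJ between "a" and "b".
const rawSource = '"a' + ZWNJ + 'b"';
// Source text of the same literal written with a Unicode escape instead.
const escapedSource = '"a\\u200Cb"';

const rawValue = evaluateLiteral(stripFormatControls(rawSource));
const escapedValue = evaluateLiteral(stripFormatControls(escapedSource));

console.log(rawValue.length);     // 2: the raw ZWNJ was removed before lexing
console.log(escapedValue.length); // 3: the escaped ZWNJ survives
```

Under this simulated rule only the escaped form keeps the joiner, which matches the reporter's observation that \u200C and \u200D "solve the problem".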
That simply makes it unusable for Persian text.  What kind of string 'literals'
are they if they drop some characters?
ECMA-262 Edition 3 is also ISO-16262, it's pretty set in stone.  I can see about
changing things for Edition 4, but that's supposed to be compatible, and it has
no definite completion date at this point.

If you want to quarrel with the existing standard, it will probably just be
frustrating for all sides.  I don't recall the rationale for stripping
formatting characters, but perhaps someone on the cc: list will.

As a practical matter, I'm interested to hear what IE and Safari do.  If they do
not follow ECMA-262 Edition 3, what rule or rules do they implement?

/be
(In reply to comment #7)
> That simply makes it unusable for Persian text.
> What kind of string 'literals' are they if they
> drop some characters?

Yes.  More importantly, instead of tactlessly charging ahead with a change of 
Status to RESOLVED INVALID based on ECMA-262 E3, it might have been better to 
raise that issue for further discussion.  We might then have been able to 
direct /be to UAX 9:

http://www.unicode.org/reports/tr9/#X9

for the Unicode 4.0.1 standard, particularly its Implementation Notes section 
5.5.1 Joiners, and its section X9 which explains:

"The Zero Width Joiner and Non Joiner affect the shaping of the adjacent 
characters; those that are adjacent in the original backing-store order, even 
though those characters may end up being rearranged to be non-adjacent by the 
BIDI algorithm. For more information, see Joiners."

It seems to me that getting past the Geckos' present adherence to the ECMA's 
brain dead formula for handling ZWNJ and ZWJ as "expendable" control 
characters, so that authors of Persian / Farsi resources might stop viewing the 
Geckos' implementation as inferior to IE's, is just as important as things such 
as the Geckos' "undetected document.all" or a number of "detected" non-
standards implementations under the rubric of "de-facto" standards.
(In reply to comment #8)
> I'm interested to hear what IE and Safari do.

As I indicated in my description for this bug, IE behaves as if the ZWNJ or ZWJ 
were retained in the dynamically loaded (script-based) text, so that the 
adjacent Persian / Farsi characters are properly formed, just as when the ZWNJ 
or ZWJ are used directly in the document's content.

However, it may be doing what UAX 9 of the Unicode 4.0.1 standard to which I 
referred you now recommends (which I understand to have been coordinated with 
what ISO now co-recommends):

http://www.unicode.org/reports/tr9/
"5.3. Joiners

As described under X9, the Zero Width Joiner and Non Joiner affect the shaping 
of the adjacent characters -- those that are adjacent in the original
backing-store order -- even though those characters may end up being rearranged
to be non-adjacent by the BIDI algorithm. In order to determine the joining behavior of a 
particular character after applying the BIDI algorithm, there are two main 
strategies.

When shaping, an implementation can refer back to the original backing store to 
see if there were adjacent ZWNJ or ZWJ characters. 

Alternatively, the implementation can replace ZWJ and ZWNJ by an out-of-band 
character property associated with those adjacent characters, so that the 
information does not interfere with the BIDI algorithm and the information
is preserved across rearrangement of those characters. Once the BIDI algorithm 
has been applied, that out-of-band information can then be used for proper 
shaping."
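
The second strategy quoted above (replacing the joiners by an out-of-band property on the adjacent characters) might look roughly like the sketch below.  The function name toOutOfBand and the before/after record fields are assumptions made for this illustration; they are not taken from UAX 9 or from any browser's code.

```javascript
const ZWNJ = "\u200C";
const ZWJ = "\u200D";

// Hypothetical sketch: drop ZWNJ/ZWJ from the text, but remember on each
// neighbouring character which joiner stood next to it in backing-store order.
function toOutOfBand(text) {
  const cells = [];
  let pending = null; // joiner seen since the last ordinary character
  for (const ch of text) {
    if (ch === ZWNJ || ch === ZWJ) {
      const kind = ch === ZWNJ ? "ZWNJ" : "ZWJ";
      if (cells.length > 0) {
        cells[cells.length - 1].after = kind;
      }
      pending = kind;
    } else {
      cells.push({ ch: ch, before: pending, after: null });
      pending = null;
    }
  }
  return cells;
}

const cells = toOutOfBand("a\u200Cb\u200Dc");
console.log(cells.map((cell) => cell.ch).join("")); // abc
console.log(cells[1]); // "b" with before "ZWNJ" and after "ZWJ"
```

The records could then be reordered by a BIDI step without losing the joining information, which is the point of the quoted strategy.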
Foteos: there's nothing tactless in resolving a bug INVALID, it happens all the
time.  The issue here is whether we can have a sane, coherent, *specifiable*
result for every input, if we abandon ECMA-262 Edition 3 as you advocate.

What's the spec now, "do what IE does"?  It's true that we try to do that for a
select list of hard cases (undetected document.all being just one of those).  In
this case, though -- and you are the one advocating a change, so it's on you to
be specific -- what are the rules?

Should we include all Unicode formatting characters in ECMAScript source?  Does
IE?  Or is the issue *only* ZWNJ and ZWJ?  Please test IE and report complete
results, that will help get this bug REOPENED, or if appropriate, get a new bug
filed.  Thanks,

/be
(In reply to comment #11)
> can have a sane, coherent, *specifiable* result for every input,
> if we abandon ECMA-262 Edition 3 as you advocate.

Please read what I wrote more carefully and do not put your own words in my 
mouth. I posted a bug specifically about "the Joiners" (ZWNJ and ZWJ).  It was 
you who raised the broader issue of all Unicode format-control characters and 
seem to think that handling the two Joiners adequately would require that 
you "abandon" ECMA-262 E3. The issue of the Joiners and the inappropriateness 
of removing them without subsequently following a strategy for properly shaping 
the adjacent characters so as not to trash languages such as Persian / Farsi 
has been discussed by the standards-making organizations, and two 
implementation strategies have been offered by the Unicode Consortium in 
coordination with ISO for the Unicode 4.0.1 standard and its ISO-10646 
homolog.  In my Comment 9 and Comment 10 I posted URLs for the appropriate 
standards documents, together with quotations concerning the need for special 
handling of the Joiners (not all Unicode format-control characters) with the 
two recommended implementation strategies for accomplishing the needed special 
handling. It is adherence to those standards and adoption of a recommended 
implementation strategy that I "advocate."

If you are curious about how IE handles the other format-control characters, 
feel free to investigate that yourself.  But this bug is specifically about the 
Joiners, and should be REOPENED because the marking of it as INVALID was
premature and is invalid.
 
Foteos, I did read what you wrote, so calm down.  If I didn't hop to it and
reopen this bug as fast as you would like, flaming me isn't going to help.  Keep
a civil tongue if you want to work well with others in the Mozilla community.

I'll raise this issue with the ECMA TG1 working group next week.

/be
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Summary: utf-8 ZWNJ and ZWJ are lost from dynamic insertions → ECMA-262 Edition 3 specifies ignoring ZWNJ and ZWJ along with other Unicode format-control characters
Assignee: general → brendan
Status: REOPENED → NEW
Hi everybody,

Fote, Thanks for opening this bug in the first place.  I think you can leave it
to me, I will follow up as best as I can.

Brendan, I read the ECMA spec, you are right.  That's exactly the source of the
problem, and the implementation is exactly following the spec.  Now a brief
discussion:

  - IE does not remove ANY of the format characters from string literals.  In
Persian at least, we use LRM, RLM, ZWJ, and ZWNJ, and a few others.  Among them,
ZWNJ is part of Persian orthography and used almost surely in any paragraph of
Persian text.  I don't have access to other browsers.

  - Reading the excerpt from the spec you quoted above, it looks like the
problem described in this bug is merely a technical side-effect of removing
Unicode format characters before lexical analysis.  In the real world, there is
not much rationale behind modifying string literals (and regular expressions) in
this manner.

  - I can imagine myself proposing to the Unicode Technical Committee, to change
the General Category of the ZWJ and ZWNJ characters to leverage the problem with
Persian orthography once and for all (not that this has not been discussed there
before), but there still remains the problem with other format characters.  The
offending assumption is that format characters are used to format the source
code for better visual rendering, where they are really much more useful in
string literals.  And the more annoying part is that one doesn't know which of
the characters one is using are going to be removed in this process.

  - Another problem happens when the \uxxxx escaped sequence is not ignored,
while the UTF-8 representation is.  For example, using java2ascii and ascii2java
converters affects the semantics of the code.  I'm not sure what happens when
one uses \u200C (ZWNJ) in the middle of an identifier.  I suspect it's not ignored.


So, I appreciate you discussing this problem with the ECMA TG1 WG.  The
resolution IMHO should be "allowing" implementations to not remove format
characters from string literals and regular expressions, though I prefer it if
they change it to "should not" remove format characters...

Moreover, it would be helpful to hear your opinion about deviating from the
standard in the implementation, by not removing format characters from string
literals and regular expressions.  I believe there is no offensive side-effect
to it, and I can survey masses of JavaScript code to see what is the current
practice of using format characters in string literals, if that helps.

--be
Behdad, thanks for your comments.  I have no problem with deviating from a spec,
if there's a good interoperation or utility reason, and provided there is some
kind of alternative spec to follow.

I was not in on ECMA-262 Edition 3's changes for Unicode (I moved from the JS
group I'd founded to help form mozilla.org in late 1997).  Edition 2 has no such
paragraphs excluding format-control characters, so this was an Edition 3 change
made after August 1998.  Possibly waldemar remembers the rationale, or knows of
a document with rationales for the changes from Edition 2 to Edition 3.

I'll whip up a patch to allow all Unicode characters in string literals and
regular expressions, and attach it here.

/be
Status: NEW → ASSIGNED
Keywords: js1.5
Priority: -- → P2
Target Milestone: --- → mozilla1.8beta
> In the real world, there is not much rationale behind modifying string
> literals (and regular expressions) in this manner.

Is that true for regular expressions?  For example, consider the regular
expression (in English):

  /[abc]/

Now with a Persian equivalent, I assume I'd want to put ZWNJ in between the
letters to keep them distinct from each other, for readability.  But then, per
the regular expression matching algorithm, the regular expression will match any
string containing ZWNJ, which is clearly undesirable.

At the same time, I agree that the regular expression

  /ab/

(with a ZWNJ between the two characters) should probably not match the string
"ab" (without a ZWNJ between the two characters)...
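
Both cases above can be checked in an engine that keeps ZWNJ in regular expression source.  A small sketch, using new RegExp with \u escapes so the joiner is visible in the code; the variable names are illustrative only:

```javascript
const ZWNJ = "\u200C";

// Character class written with ZWNJ between the letters "for readability":
// the joiner becomes a member of the class.
const classWithJoiners = new RegExp("[a" + ZWNJ + "b" + ZWNJ + "c]");
console.log(classWithJoiners.test(ZWNJ)); // true: a lone ZWNJ matches

// Sequence with a ZWNJ between "a" and "b": the joiner must be present.
const sequenceWithJoiner = new RegExp("a" + ZWNJ + "b");
console.log(sequenceWithJoiner.test("ab"));              // false
console.log(sequenceWithJoiner.test("a" + ZWNJ + "b"));  // true
```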
bz: good point, character classes (which don't scale from ASCII to Unicode,
hence lwall et al. changing Perl 6 to reuse [] for non-capturing parens) might
want to unquote.  Patch soon.

/be
This is a problem outside character classes too... what about:

  /ab?c/

in situations where one would normally put a ZWNJ between "a" and "b" but not
between "a" and "c"?  Consider what the strings it'll be matched against will
look like....

Perhaps the right solution is to simply have the regexp engine skip over ZWJ and
ZWNJ when matching?  Otherwise, I bet the current impl doesn't match random
strings that don't come from literals (eg text values from the DOM)...
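
As a rough sketch of the "skip over ZWJ and ZWNJ when matching" idea, one could remove the two joiners from both the pattern and the subject before matching.  testIgnoringJoiners is a hypothetical helper written for this comment, not an existing RegExp feature, and naive stripping of the pattern would need more care around escapes.

```javascript
const JOINERS = /[\u200C\u200D]/g;

// Hypothetical helper: match as if ZWNJ and ZWJ were absent from both sides.
function testIgnoringJoiners(patternSource, subject) {
  const pattern = new RegExp(patternSource.replace(JOINERS, ""));
  return pattern.test(subject.replace(JOINERS, ""));
}

const subject = "a\u200Cbc"; // ZWNJ between "a" and "b", as in DOM text

console.log(new RegExp("ab?c").test(subject));     // false: the ZWNJ gets in the way
console.log(testIgnoringJoiners("ab?c", subject)); // true
console.log(testIgnoringJoiners("ab?c", "ac"));    // true
```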
Attached patch patch for testing and discussion (obsolete) — Splinter Review
This keeps format-control chars inside string literals and regexps.

bz's further point is excellent and further highlights the asymmetry in Edition
3 between computed strings vs. literals, and computed RegExp objects and
literal regexps: any computed string may contain format-control characters,
likewise regular expressions created via new RegExp.  What matches what depends
on whether the subject or object was expressed literally.  Seems like a big bad
bug to me.

/be
Thanks Brendan, the patch looks pretty good.  Behnam, would you please test it.

About regexps, I think any kind of special handling simply introduces more
confusion.  Personally I would never use ZWNJ between letters in a regexp to
make them look better.  OTOH, I have written regexps with ZWNJ in them.  Humm,
in Persian ZWNJ is used almost like a dash is used in English...  And like
Brendan noted, the asymmetry too.  I'd say, if anything, a regexp modifier
could be introduced (like 'i' is for case insensitivity), to ignore all format
characters when matching regexps.  Removing them from the pattern doesn't help,
as long as they are out there in the to-be-matched text, and not ignored.  And
don't forget, this all should be optional, and off by default.

Humm, you said regexp classes do not scale to Unicode in ECMA Script?  That's
new to me.  They work pretty well in Perl 5.8.

I vote for applying the attached patch after testing, and postpone regexp engine
stuff for now.

Thanks again,
Any word on the patch?  I'll get it reviewed and checked in if it seems good.  I
didn't get much reaction from ECMA TG1 (really, the subset who met today), as we
were busy with E4X issues, but it's clear that IE differs from ECMA-262 Edition
3 (and the MS guy was in the room).  I think we should agree on something like
what this patch does, but it may take a while.

/be
Again, I'm fine with it.  Roozbeh, maybe you can test it?  Behnam?
Thanks Brendan.  It WORKS FOR ME. ;)
Comment on attachment 171584 [details] [diff] [review]
patch for testing and discussion

This patch breaks the invariant that a['bcd'] is equivalent to a.bcd when c is
a format-control character.  But, it allows users to spell strings the natural
way, without having to use \uXXXX sequences.

Hoping waldemar can give his thoughts.

/be
Attachment #171584 - Flags: superreview?(shaver)
Attachment #171584 - Flags: review?(waldemar)
I'm afraid I'm not following you.  What is the invariant?  
If c is a format-control character, it will be stripped from a.bcd but not from
a['bcd'], and the two forms will denote different properties.

/be
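
To make the broken invariant concrete, here is a sketch that simulates Edition 3 stripping on the dot form only, as the patch would.  stripFormatControls is a hypothetical stand-in for the scanner, and the property names bd and b<ZWNJ>d are made up for the example.

```javascript
// Hypothetical helper: remove Unicode category Cf characters, as Edition 3
// section 7.1 prescribes for source text.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

const a = { bd: "plain", ["b\u200Cd"]: "with ZWNJ" };

// Property name as typed after the dot, with a raw ZWNJ in the middle.
const nameAsTyped = "b\u200Cd";

// Dot form: the identifier is still stripped, so a.b<ZWNJ>d means a.bd.
const viaDot = a[stripFormatControls(nameAsTyped)];
// Bracket form with the patch: the string literal keeps its ZWNJ.
const viaBracket = a[nameAsTyped];

console.log(viaDot);     // plain
console.log(viaBracket); // with ZWNJ
```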
Several times I attempted to "fix" this in the ECMA committee but was unable to
obtain support for any fix -- most of the other representatives preferred the
text as written.  There are technical problems with doing such a fix as well,
particularly with the interaction of formatting characters and escape sequences
-- what happens if you have a formatting character right after a backslash,
within the characters of an escape sequence (such as \uab<formatchar>05), etc. 
This may not be an issue with ZWNJ, but things like these will come up with
other popular ones such as LTR and RTL marks.

I don't necessarily agree with the committee's conclusion on this one, but I
assure you that this issue was looked at in detail several times, and the
conclusion was deliberate.  I'd campaign for changing this in the future, but
only if we can come up with a sensible proposal that explains what happens in
all of the cases, including in/around escape sequences and in regular expression
literals.  I don't see that here yet.
Well, the way I see it is that JavaScript tried to be the first language to
handle Unicode format characters intelligently, but has failed so far, and
sticking to the current spec, is quite an unwanted pain, offering almost no
advantage in return.  I've been in the Unicode debates for a few years now, and
I'm a native Persian speaker.  I've never ever seen people using format
characters (be it LRM, RLM, etc) for formatting source codes.  They just don't
think that way.  On the other hand, most of the time they need to put these very
same characters in their literals.

Around escape sequences and probably other cases, the change is pretty simple: 
They work quite like any non-format character:  They are not ignored.  In other
words, one should not use them inside an escape sequence, and why should they
need it really?  Being a master of the Unicode Bidirectional Algorithm, I can
verify that \uXXXX needs no format character and will be rendered either as
\uXXXX or uXXXX\ all the time, which is quite normal in a right-to-left context.
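
The escape-sequence case raised above can be sketched by comparing the two rules on \u00<ZWNJ>41.  stripFormatControls and tryEvaluate are hypothetical helpers for this illustration; the "not ignored" result is simply what a current engine does with a non-hex character inside a \u escape in a string literal.

```javascript
// Hypothetical helper: the Edition 3 rule, removing category Cf characters.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

// Hypothetical helper: evaluate source text and report success or the error name.
function tryEvaluate(source) {
  try {
    return { ok: true, value: Function("return " + source + ";")() };
  } catch (error) {
    return { ok: false, error: error.name };
  }
}

// String literal source with a ZWNJ in the middle of a \u escape.
const source = '"\\u00' + "\u200C" + '41"';

const stripped = tryEvaluate(stripFormatControls(source)); // escape becomes \u0041
const kept = tryEvaluate(source);                          // escape is malformed

console.log(stripped); // { ok: true, value: 'A' }
console.log(kept);     // { ok: false, error: 'SyntaxError' }
```

So "not ignored" gives a well-defined answer here: the malformed escape is rejected rather than silently repaired.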
Target Milestone: mozilla1.8beta1 → mozilla1.8beta2
Blocks: Persian
Comment on attachment 171584 [details] [diff] [review]
patch for testing and discussion

That approach looks fine, for what we're doing in this patch, but I'm still not
sure that we have a really good story on where and why we want various
format-control chars to be kept or discarded.
Attachment #171584 - Flags: superreview?(shaver) → superreview+
Cc'ing i18n gurus who may have good ideas.  I'm supposed to write up a proposal
for regular expression changes in Edition 4, and better Unicode support (without
switching incompatibly to Perl 6 regular expressions!) is a goal.

That work item of mine is not directly related to this bug, but it touches on
this topic.  I am willing to write a separate proposal on ZWNJ and ZWJ, with the
right guidance.  When should format-control characters be ignored, and when not?
 Should ZWNJ and ZWJ be treated specially?

/be
I for one think there shouldn't be anything special about ZW[N]J at all.  Simply
that string literals should be literal, no characters added or removed.
Behdad: got that, and the patch in this bug does that -- likewise for regular
expressions.  But should ZWNJ in a regexp, not in an escape or a character
class, be taken verbatim?

/be
Target Milestone: mozilla1.8beta2 → mozilla1.8beta3
I think so.  That's what I'm using these days.  ZWNJ is a regular character used
in Persian and a few other languages (Indic languages IIRC).  Removing ZWNJ from
regexps introduces the same problem as for strings.  And like somebody pointed out
earlier, string and regexp literals are not the only source of strings and
regexps in JavaScript, so, any special handling at parsing level just breaks things.
Hi Brendan, this is not exactly one of my areas, but it might be a good idea
to talk to Mark Davis and/or take a look at his latest draft of TR18:

http://www.unicode.org/reports/tr18/tr18-10.html

In particular, go to Annex C and start a search (find) for "format".
This does cause js/tests/ecma_3/Unicode/uc-001.js to fail.
Certainly another bug, but worth mentioning here:  I've got reports that the new
versions of Firefox (with Uniscribe backend I guess) simply ignore '\u200c' too
and '&zwnj;' should be used instead.
(In reply to comment #36)
> Certainly another bug, but worth mentioning here:  I've got reports that the
> new versions of Firefox (with Uniscribe backend I guess) simply ignore
> '\u200c' too and '&zwnj;' should be used instead.

Please file it separately if you can confirm the reports, especially with a testcase.  Thanks.

I have proposed to ECMA TG1 that for Edition 4 (and all versions, really), we do at least as IE does per comment 14.  That is, we do not strip format-control characters in string literals.

We still need to test IE, evaluate what it does in these cases, and decide whether that's good enough to become de-jure standard:

* In regexps, including in character classes and outside of them.

* When matching a regexp against a target string containing a format-control char.

* After backslash in string literals and regexps.

* Other edge cases?

It would be a big help if interested folks would construct a test matrix and fill it in with results from IE, other browsers, and also what's considered ideal.

/be
I hear Opera flouts ECMA-262 utterly and does not strip any format-control chars, anywhere in script source.  If IE does strip outside of strings and possibly other literals, then Opera is +1.

/be
We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do not strip format control characters; (b) therefore allow them in string literals and regular expression literals; (c) let them through outside of such quoted literals, where they become invalid source characters.

We are going to codify this behavior in Edition 4.  Look for mozilla1.9/firefox3 to implement it.

/be
Thanks Brendan
diff -w version in a sec.

/be
Attachment #171584 - Attachment is obsolete: true
Attachment #240106 - Flags: review?(mrbkap)
Attachment #171584 - Flags: review?(waldemar)
Attachment #240106 - Flags: review?(mrbkap) → review+
Fixed on the mozilla1.9 trunk:

js/src/jsscan.c 3.118

/be
Status: ASSIGNED → RESOLVED
Closed: 20 years ago → 18 years ago
Resolution: --- → FIXED
*grumble about old QA contact*

(In reply to comment #39)
> We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do
> not strip format control characters; (b) therefore allow them in string
> literals and regular expression literals; (c) let them through outside of such
> quoted literals, where they become invalid source characters.
> 
> We are going to codify this behavior in Edition 4.  Look for
> mozilla1.9/firefox3 to implement it.

I scanned through http://developer.mozilla.org/es4/ and didn't see anything yet codified for this -- I believe I understand (a) and (b), but (c) is less clear to me.  In that case, would the program be "syntactically in error" (5.1.4 in ed. 3) and thus result in a SyntaxError exception (well, if the program were being passed as a string to eval)?
QA Contact: pschwartau → general
We will re-export pretty soon, and the latest wiki contents will say that Cf chars are not stripped.

(a) and (b) follow from lexical productions in ECMA-262 Ed. 3 section 7 that have right parts of the form "SourceCharacter but not ...", once you remove stripping of Cf chars specified in 7.1.  This allows Cf chars in comments as well as strings and regexps.

(c) follows from other lexical productions, e.g. for Identifier, which restrict the classes of component chars, terminating on any char outside the set of valid chars. This leaves the offending char, if it's Cf, unable to satisfy the goal symbols at the top of the lexical grammar specified in section 7.

/be
js> eval("hi\uFEFFthere")
typein:1: SyntaxError: illegal character:
typein:1: hi?there
typein:1: ..^

/be
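
A sketch of cases (a) to (c) for anyone who wants to try them in a current engine.  It uses U+200E (LEFT-TO-RIGHT MARK) rather than U+FEFF, on the assumption that the engine follows later editions in which U+FEFF counts as white space; tryEvaluate is a hypothetical helper, not part of the js shell.

```javascript
// Hypothetical helper: evaluate source text and report success or the error name.
function tryEvaluate(source) {
  try {
    return { ok: true, value: Function("return " + source + ";")() };
  } catch (error) {
    return { ok: false, error: error.name };
  }
}

const LRM = "\u200E"; // a category Cf character

// (a)/(b): kept inside a string literal.
const inString = tryEvaluate('"hi' + LRM + 'there"');
// Also allowed inside a comment.
const inComment = tryEvaluate("/* " + LRM + " */ 3");
// (c): outside literals and comments it is an illegal source character.
const inCode = tryEvaluate("1 +" + LRM + " 2");

console.log(inString.value.length); // 8
console.log(inComment.value);       // 3
console.log(inCode.error);          // SyntaxError
```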
Blocks: 361783
Depends on: 368516
/cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v  <--  regress-274152.js
See bug 372198 where I added ecma_3/Unicode/uc-001.js to spidermonkey-n.tests.
Relevant Unicode Public Review issue:
http://www.unicode.org/review/pr-96.html
Any chance we can get the fix (or the partial one in the patch attached earlier by Brendan) into ff2 tree?
updated due to bug 368516 to include format control chars and not BOM.

/cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v  <--  regress-274152.js
new revision: 1.3; previous revision: 1.2
Flags: in-testsuite+
v 1.9.0
Status: RESOLVED → VERIFIED
Should this fix be ported back to the 1.8 branch so web pages get consistent behavior?
Flags: wanted1.8.1.x?
Flags: blocking1.8.1.16?
No reply, I guess there's no pressing need on the 1.8 branch, and if not probably best not to break any existing scripts.
Flags: wanted1.8.1.x?
Flags: blocking1.8.1.17?