274152 - ECMA-262 Edition 3 specifies ignoring ZWNJ and ZWJ along with other Unicode format-control characters

Reporter

Description

•

20 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20041202

When inserting content into elements dynamically by using the W3C DOM's
createRange ... appendChild procedure or the IE DOM's innerHTML procedure in
documents with a utf-8 charset, any raw ZWNJ (U+200C) or ZWJ (U+200D) multi-byte
characters in the content are lost, causing serious corruption of Persian/Arabic
words.

Reproducible: Always
Steps to Reproduce:
1. define a utf-8 string that includes ZWNJ and/or ZWJ
2. dynamically insert it into an element (any container element, e.g. a span or
div) using either the W3C or IE DOM procedures supported by the Geckos
3.

Actual Results:  
The ZWNJ and ZWJ multi-byte characters are omitted from the inserted content.

Expected Results:  
The ZWNJ and ZWJ multi-byte characters should have been retained, as by other
browsers such as IE.

ZWNJ and ZWJ are handled properly when present directly in utf-8 documents.
Also, if one replaces the raw utf-8 ZWNJ and ZWJ with their corresponding HTML
entities (&zwnj; and &zwj;) in string definitions, that content is handled
properly by dynamic insertions.

The testcase at the indicated URL compares handling of ZWNJ and ZWJ versus their
HTML entities in Persian words or strings as span content present either
directly in an HTML document (with a utf-8 charset specified) or inserted
dynamically with the W3C or IE DOM procedures.  The direct inclusions, and the
dynamic insertions of content with the HTML entities, are handled correctly, and
thus show what should be displayed for the raw utf-8 insertions as well.

Foteos Macrides

Reporter

Comment 1

•

20 years ago

Ugh, how many ways can I mis-spell 4- or 3-letter acronyms?  Quite a few if I 
submit a bug report too late at night.  In the Description, the utf-8 multi-
byte character references should be ZWNJ ("Zero-Width Non-Joiner") or ZWJ 
("Zero-Width Joiner") throughout.

Behdad Esfahbod

Comment 2

•

20 years ago

I've also noticed that using \u200C and \u200D in the JavaScript code also
solves the problem.  Very weird!

Foteos Macrides

Reporter

Comment 3

•

20 years ago

This bug also applies for documents with the windows-1256 (Arabic) charset.

Boris Zbarsky [:bzbarsky]

Comment 4

•

20 years ago

Attached file: Minimalish testcase demonstrating that the content model is correct

Boris Zbarsky [:bzbarsky]

Comment 5

•

20 years ago

Over to JS engine.  We have the right data, and we're passing it to
JS_EvaluateUCScriptForPrincipals (via EvaluateString()).  So it looks like the
problem is in the JS engine's parsing of string literals...

Assignee: general → general

Status: UNCONFIRMED → NEW

Component: DOM: HTML → JavaScript Engine

Ever confirmed: true

OS: Windows 98 → All

QA Contact: ian → pschwartau

Hardware: PC → All

Summary: utf-8 ZWNJ and ZWJ are lost from dynamiic insertions → utf-8 ZWNJ and ZWJ are lost from dynamic insertions

Brendan Eich [:brendan]

Assignee

Comment 6

•

20 years ago

From ECMA-262 Edition 3 (available at
http://www.mozilla.org/js/language/E262-3.pdf among other places):

7.1 Unicode Format-Control Characters

The Unicode format-control characters (i.e., the characters in category  Cf  in
the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK)
are control codes used to control the formatting of a range of text in the
absence of higher-level protocols for this (such as mark-up languages). It is
useful to allow these in source text to facilitate editing and display.

The format control characters can occur anywhere in the source text of an
ECMAScript program. These characters are removed from the source text before
applying the lexical grammar. Since these characters are removed before
processing string and regular expression literals, one must use a Unicode
escape sequence (see section 7.6) to include a Unicode format-control character
inside a string or regular expression literal.

/be
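
The effect of the quoted rule can be sketched in a few lines. This is only an illustration under stated assumptions: `stripFormatControls` is a hypothetical helper that imitates the Edition 3 pre-pass (it is not SpiderMonkey code), and `JSON.parse` stands in for the lexer because it understands `\uXXXX` escapes.

```javascript
// Hypothetical helper (not engine code): emulates the Edition 3 rule by
// deleting every Unicode category-Cf character before the text is parsed.
function stripFormatControls(sourceText) {
  return sourceText.replace(/\p{Cf}/gu, "");
}

const ZWNJ = "\u200C";

// Source text of a string literal with a raw ZWNJ between "a" and "b".
const rawSource = '"a' + ZWNJ + 'b"';
// Source text of the same literal written with the six-character escape.
const escapedSource = '"a\\u200Cb"';

// JSON.parse stands in for the lexer here; it understands \uXXXX escapes.
const rawValue = JSON.parse(stripFormatControls(rawSource));
const escapedValue = JSON.parse(stripFormatControls(escapedSource));

console.log(rawValue.length);     // 2: the raw ZWNJ was stripped
console.log(escapedValue.length); // 3: the escaped ZWNJ survived
```

This matches the observation in comment 2 that writing \u200C in the script avoids the loss.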

Status: NEW → RESOLVED

Closed: 20 years ago

Resolution: --- → INVALID

Behdad Esfahbod

Comment 7

•

20 years ago

That simply makes it unusable for Persian text.  What kind of string 'literals'
are they if they drop some characters?

Brendan Eich [:brendan]

Assignee

Comment 8

•

20 years ago

ECMA-262 Edition 3 is also ISO-16262, it's pretty set in stone.  I can see about
changing things for Edition 4, but that's supposed to be compatible, and it has
no definite completion date at this point.

If you want to quarrel with the existing standard, it will probably just be
frustrating for all sides.  I don't recall the rationale for stripping
formatting characters, but perhaps someone on the cc: list will.

As a practical matter, I'm interested to hear what IE and Safari do.  If they do
not follow ECMA-262 Edition 3, what rule or rules do they implement?

/be

Foteos Macrides

Reporter

Comment 9

•

20 years ago

(In reply to comment #7)
> That simply makes it unusable for Persian text.
> What kind of string 'literals' are they if they
> drop some characters?

Yes.  More importantly, instead of tactlessly charging ahead with a change of 
Status to RESOLVED INVALID based on ECMA-262 E3, it might have been better to 
raise that issue for further discussion.  We might then have been able to 
direct /be to UAX 9:

http://www.unicode.org/reports/tr9/#X9

for the Unicode 4.0.1 standard, particularly its Implementation Notes section 
5.5.1 Joiners, and its section X9 which explains:

"The Zero Width Joiner and Non Joiner affect the shaping of the adjacent 
characters; those that are adjacent in the original backing-store order, even 
though those characters may end up being rearranged to be non-adjacent by the 
BIDI algorithm. For more information, see Joiners."

It seems to me that getting past the Geckos' present adherence to the ECMA's 
brain dead formula for handling ZWNJ and ZWJ as "expendable" control 
characters, so that authors of Persian / Farsi resources might stop viewing the 
Geckos' implementation as inferior to IE's, is just as important as things such 
as the Geckos' "undetected document.all" or a number of "detected" non-
standards implementations under the rubric of "de-facto" standards.

Foteos Macrides

Reporter

Comment 10

•

20 years ago

(In reply to comment #8)
> I'm interested to hear what IE and Safari do.

As I indicated in my description for this bug, IE behaves as if the ZWNJ or ZWJ 
were retained in the dynamically loaded (script-based) text, so that the 
adjacent Persian / Farsi characters are properly formed, just as when the ZWNJ 
or ZWJ are used directly in the document's content.

However, it may be doing what UAX 9 of the Unicode 4.0.1 standard to which I 
referred you now recommends (which I understand to have been coordinated with 
what ISO now co-recommends):

http://www.unicode.org/reports/tr9/
"5.3. Joiners

As described under X9, the Zero Width Joiner and Non Joiner affect the shaping 
of the adjacent characters -- those that are adjacent in the original 
backing-store order -- even though those characters may end up being rearranged 
to be non-adjacent by the BIDI algorithm. In order to determine the joining 
behavior of a particular character after applying the BIDI algorithm, there are 
two main strategies.

When shaping, an implementation can refer back to the original backing store to 
see if there were adjacent ZWNJ or ZWJ characters. 

Alternatively, the implementation can replace ZWJ and ZWNJ by an out-of-band 
character property associated with those adjacent characters, so that the 
information does not interfere with the BIDI algorithm and the information
is preserved across rearrangement of those characters. Once the BIDI algorithm 
has been applied, that out-of-band information can then be used for proper 
shaping."

Brendan Eich [:brendan]

Assignee

Comment 11

•

20 years ago

Foteos: there's nothing tactless in resolving a bug INVALID, it happens all the
time.  The issue here is whether we can have a sane, coherent, *specifiable*
result for every input, if we abandon ECMA-262 Edition 3 as you advocate.

What's the spec now, "do what IE does"?  It's true that we try to do that for a
select list of hard cases (undetected document.all being just one of those).  In
this case, though -- and you are the one advocating a change, so it's on you to
be specific -- what are the rules?

Should we include all Unicode formatting characters in ECMAScript source?  Does
IE?  Or is the issue *only* ZWNJ and ZWJ?  Please test IE and report complete
results, that will help get this bug REOPENED, or if appropriate, get a new bug
filed.  Thanks,

/be

Foteos Macrides

Reporter

Comment 12

•

20 years ago

(In reply to comment #11)
> can have a sane, coherent, *specifiable* result for every input,
> if we abandon ECMA-262 Edition 3 as you advocate.

Please read what I wrote more carefully and do not put your own words in my 
mouth. I posted a bug specifically about "the Joiners" (ZWNJ and ZWJ).  It was 
you who raised the broader issue of all Unicode format-control characters and 
seems to think that handling the two Joiners adequately would require that 
you "abandon" ECMA-262 E3. The issue of the Joiners and the inappropriateness 
of removing them without subsequently following a strategy for properly shaping 
the adjacent characters so as not to trash languages such as Persian / Farsi 
has been discussed by the standards-making organizations, and two 
implementation strategies have been offered by the Unicode Consortium in 
coordination with ISO for the Unicode 4.0.1 standard and its ISO-10646 
homolog.  In my Comment 9 and Comment 10 I posted URLs for the appropriate 
standards documents, together with quotations concerning the need for special 
handling of the Joiners (not all Unicode format-control characters) with the 
two recommended implementation strategies for accomplishing the needed special 
handling. It is adherence to those standards and adoption of a recommended 
implementation strategy that I "advocate."

If you are curious about how IE handles the other format-control characters, 
feel free to investigate that yourself.  But this bug is specifically about the 
Joiners, and should be REOPENED because the marking of it as INVALID was 
premature and is invalid.

Brendan Eich [:brendan]

Assignee

Comment 13

•

20 years ago

Foteos, I did read what you wrote, so calm down.  If I didn't hop to it and
reopen this bug as fast as you would like, flaming me isn't going to help.  Keep
a civil tongue if you want to work well with others in the Mozilla community.

I'll raise this issue with the ECMA TG1 working group next week.

/be

Status: RESOLVED → REOPENED

Resolution: INVALID → ---

Summary: utf-8 ZWNJ and ZWJ are lost from dynamic insertions → ECMA-262 Edition 3 specifies ignoring ZWNJ and ZWJ along with other Unicode format-control characters

Brendan Eich [:brendan]

Assignee

Updated

•

20 years ago

Assignee: general → brendan

Status: REOPENED → NEW

Behdad Esfahbod

Comment 14

•

20 years ago

Hi everybody,

Fote, Thanks for opening this bug in the first place.  I think you can leave it
to me, I will follow up as best as I can.

Brendan, I read the ECMA spec, you are right.  That's exactly the source of the
problem, and the implementation is exactly following the spec.  Now a brief
discussion:

  - IE does not remove ANY of the format characters from string literals.  In
Persian at least, we use LRM, RLM, ZWJ, and ZWNJ, and a few others.  Among them,
ZWNJ is part of Persian orthography and used almost surely in any paragraph of
Persian text.  I don't have access to other browsers.

  - Reading the excerpt from the spec you quoted above, it looks like the
problem described in this bug is merely a technical side-effect of removing
Unicode format characters before lexical analysis.  In the real world, there is
not much rationale behind modifying string literals (and regular expressions) in
this manner.

  - I can imagine myself proposing to the Unicode Technical Committee, to change
the General Category of the ZWJ and ZWNJ characters to leverage the problem with
Persian orthography once and for all (not that this has not been discussed there
before), but there still remains the problem with other format characters.  The
offending assumption is that format characters are used to format the source
code for better visual rendering, where they are really much more useful in
string literals.  And the more annoying part is that one doesn't know which of
the characters one is using are going to be removed in this process.

  - Another problem happens when the \uxxxx escaped sequence is not ignored,
while the UTF-8 representation is.  For example, using java2ascii and ascii2java
converters affects the semantics of the code.  I'm not sure what happens when
one uses \u200C (ZWNJ) in the middle of an identifier.  I suspect it's not ignored.


So, I appreciate you discussing this problem with the ECMA TG1 WG.  The
resolution IMHO should be "allowing" implementations to not remove format
characters from string literals and regular expressions, though I prefer it if
they change it to "should not" remove format characters...

Moreover, it would be helpful to hear your opinion about deviating from the
standard in the implementation, by not removing format characters from string
literals and regular expressions.  I believe there is no offensive side-effect
to it, and I can survey masses of JavaScript code to see what is the current
practice of using format characters in string literals, if that helps.

--be

Brendan Eich [:brendan]

Assignee

Comment 15

•

20 years ago

Behdad, thanks for your comments.  I have no problem with deviating from a spec,
if there's a good interoperation or utility reason, and provided there is some
kind of alternative spec to follow.

I was not in on ECMA-262 Edition 3's changes for Unicode (I moved from the JS
group I'd founded to help form mozilla.org in late 1997).  Edition 2 has no such
paragraphs excluding format-control characters, so this was an Edition 3 change
made after August 1998.  Possibly waldemar remembers the rationale, or knows of
a document with rationales for the changes from Edition 2 to Edition 3.

I'll whip up a patch to allow all Unicode characters in string literals and
regular expressions, and attach it here.

/be

Status: NEW → ASSIGNED

Keywords: js1.5

Priority: -- → P2

Target Milestone: --- → mozilla1.8beta

Boris Zbarsky [:bzbarsky]

Comment 16

•

20 years ago

> In the real world, there is not much rationale behind modifying string
> literals (and regular expressions) in this manner.

Is that true for regular expressions?  For example, consider the regular
expression (in English):

  /[abc]/

Now with a Persian equivalent, I assume I'd want to put ZWNJ in between the
letters to keep them distinct from each other, for readability.  But then, per
the regular expression matching algorithm, the regular expression will match any
string containing ZWNJ, which is clearly undesirable.

At the same time, I agree that the regular expression

  /ab/

(with a ZWNJ between the two characters) should probably not match the string
"ab" (without a ZWNJ between the two characters)...
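
Both cases can be checked with a short sketch. The patterns are built with `new RegExp` so that the result does not depend on how a given engine treats raw format-control characters in literals; the Latin letters are placeholders for the Persian letters discussed above.

```javascript
const ZWNJ = "\u200C";

// Character class with ZWNJ typed between the letters "for readability".
// The ZWNJ becomes a member of the class, so it matches on its own.
const charClass = new RegExp("[a" + ZWNJ + "b" + ZWNJ + "c]");
console.log(charClass.test(ZWNJ)); // true

// Sequence with a ZWNJ between "a" and "b": the ZWNJ must be present
// in the subject string for the pattern to match.
const sequence = new RegExp("a" + ZWNJ + "b");
console.log(sequence.test("ab"));              // false
console.log(sequence.test("a" + ZWNJ + "b"));  // true
```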

Brendan Eich [:brendan]

Assignee

Comment 17

•

20 years ago

bz: good point, character classes (which don't scale from ASCII to Unicode,
hence lwall et al. changing Perl 6 to reuse [] for non-capturing parens) might
want to unquote.  Patch soon.

/be

Boris Zbarsky [:bzbarsky]

Comment 18

•

20 years ago

This is a problem outside character classes too... what about:

  /ab?c/

in situations where one would normally put a ZWNJ between "a" and "b" but not
between "a" and "c"?  Consider what the strings it'll be matched against will
look like....

Perhaps the right solution is to simply have the regexp engine skip over ZWJ and
ZWNJ when matching?  Otherwise, I bet the current impl doesn't match random
strings that don't come from literals (eg text values from the DOM)...
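
A minimal sketch of the "skip joiners when matching" idea, assuming it is done outside the engine by removing ZWNJ and ZWJ from both the pattern and the subject first. `matchIgnoringJoiners` is a hypothetical helper name, not an existing API.

```javascript
// Hypothetical helper: drops ZWNJ (U+200C) and ZWJ (U+200D) from both
// sides before matching, approximating an engine that skips them.
function matchIgnoringJoiners(patternSource, subject) {
  const joiners = /[\u200C\u200D]/g;
  const pattern = new RegExp(patternSource.replace(joiners, ""));
  return pattern.test(subject.replace(joiners, ""));
}

const ZWNJ = "\u200C";
const subject = "a" + ZWNJ + "bc"; // e.g. text read from the DOM

console.log(new RegExp("ab?c").test(subject));      // false: ZWNJ blocks the match
console.log(matchIgnoringJoiners("ab?c", subject)); // true
console.log(matchIgnoringJoiners("ab?c", "ac"));    // true
```

The trade-off is that a pattern can then no longer require a joiner to be present.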

Brendan Eich [:brendan]

Assignee

Comment 19

•

20 years ago

Attached patch: patch for testing and discussion (obsolete)

This keeps format-control chars inside string literals and regexps.

bz's further point is excellent and further highlights the asymmetry in Edition
3 between computed strings vs. literals, and computed RegExp objects and
literal regexps: any computed string may contain format-control characters,
likewise regular expressions created via new RegExp.  What matches what depends
on whether the subject or object was expressed literally.  Seems like a big bad
bug to me.

/be

Behdad Esfahbod

Comment 20

•

20 years ago

Thanks Brendan, the patch looks pretty good.  Behnam, would you please test it.

About regexps, I think any kind of special handling simply introduces more
confusion.  Personally I would never use ZWNJ between letters in a regexp to
make them look better.  OTOH, I have written regexps with ZWNJ in them.  Humm,
in Persian ZWNJ is used almost like a dash is used in English...  And like
Brendan noted, the asymmetry too.  I'd say, if anything, it should be like that
a regexp modifier can be introduced (like 'i' is for case insensitiveness), to
ignore all format characters when matching regexps.  Removing them from the
pattern doesn't help, as long as they are out there in the to-be-matched text,
and not ignored.  And don't forget, this all should be optional, and off by default.

Humm, you said regexp classes do not scale to Unicode in ECMA Script?  That's
new to me.  They work pretty well in Perl 5.8.

I vote for applying the attached patch after testing, and postpone regexp engine
stuff for now.

Thanks again,

Brendan Eich [:brendan]

Assignee

Comment 21

•

20 years ago

Any word on the patch?  I'll get it reviewed and checked in if it seems good.  I
didn't get much reaction from ECMA TG1 (really, the subset who met today), as we
were busy with E4X issues, but it's clear that IE differs from ECMA-262 Edition
3 (and the MS guy was in the room).  I think we should agree on something like
what this patch does, but it may take a while.

/be

Behdad Esfahbod

Comment 22

•

20 years ago

Again, I'm fine with it.  Roozbeh, maybe you can test it?  Behnam?

Behnam Esfahbod [:zwnj]

Comment 23

•

20 years ago

Thanks Brendan.  It WORKS FOR ME. ;)

Brendan Eich [:brendan]

Assignee

Comment 24

•

20 years ago

Comment on attachment 171584: patch for testing and discussion

This patch breaks the invariant that a['bcd'] is equivalent to a.bcd when c is
a format-control character.  But, it allows users to spell strings the natural
way, without having to use \uXXXX sequences.

Hoping waldemar can give his thoughts.

/be

Attachment #171584 - Flags: superreview?(shaver)

Attachment #171584 - Flags: review?(waldemar)

Behdad Esfahbod

Comment 25

•

20 years ago

I'm afraid I'm not following you.  What is the invariant?

Brendan Eich [:brendan]

Assignee

Comment 26

•

20 years ago

If c is a format-control character, it will be stripped from a.bcd but not from
a['bcd'], and the two forms will denote different properties.

/be
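
The broken invariant can be sketched as follows, with ZWNJ standing in for the format-control character "c". The dot form is emulated by writing the already-stripped name, on the assumption that the engine strips format-control characters from identifiers but, with the patch, not from string literals.

```javascript
const ZWNJ = "\u200C";
const a = {};

// a['b<ZWNJ>d']: with the patch the string literal keeps the ZWNJ.
a["b" + ZWNJ + "d"] = "bracket form";

// a.b<ZWNJ>d: the identifier is still stripped, so it names "bd".
a["bd"] = "dot form";

console.log(Object.keys(a).length); // 2: two distinct properties
console.log(a["b" + ZWNJ + "d"]);   // "bracket form"
console.log(a.bd);                  // "dot form"
```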

Waldemar Horwat

Comment 27

•

20 years ago

Several times I attempted to "fix" this in the ECMA committee but was unable to
obtain support for any fix -- most of the other representatives preferred the
text as written.  There are technical problems with doing such a fix as well,
particularly with the interaction of formatting characters and escape sequences
-- what happens if you have a formatting character right after a backslash,
within the characters of an escape sequence (such as \uab<formatchar>05), etc. 
This may not be an issue with ZWNJ, but things like these will come up with
other popular ones such as LTR and RTL marks.

I don't necessarily agree with the committee's conclusion on this one, but I
assure you that this issue was looked at in detail several times, and the
conclusion was deliberate.  I'd campaign for changing this in the future, but
only if we can come up with a sensible proposal that explains what happens in
all of the cases, including in/around escape sequences and in regular expression
literals.  I don't see that here yet.

Behdad Esfahbod

Comment 28

•

20 years ago

Well, the way I see it is that JavaScript tried to be the first language to
handle Unicode format characters intelligently, but has failed so far, and
sticking to the current spec, is quite an unwanted pain, offering almost no
advantage in return.  I've been in the Unicode debates for a few years now, and
I'm a native Persian speaker.  I've never ever seen people using format
characters (be it LRM, RLM, etc) for formatting source codes.  They just don't
think that way.  On the other hand, most of the time they need to put these very
same characters in their literals.

Around escape sequences and probably other cases, the change is pretty simple: 
They work quite like any non-format character:  They are not ignored.  In other
words, one should not use them inside an escape sequence, and why should they
need it really?  Being a master of the Unicode Bidirectional Algorithm, I can
verify that \uXXXX needs no format character and will be rendered either as
\uXXXX or uXXXX\ all the time, which is quite normal in a right-to-left context.

Brendan Eich [:brendan]

Assignee

Updated

•

20 years ago

Target Milestone: mozilla1.8beta1 → mozilla1.8beta2

Behnam Esfahbod [:zwnj]

Updated

•

19 years ago

Blocks: Persian

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 29

•

19 years ago

Comment on attachment 171584: patch for testing and discussion

That approach looks fine, for what we're doing in this patch, but I'm still not
sure that we have a really good story on where and why we want various
format-control chars to be kept or discarded.

Attachment #171584 - Flags: superreview?(shaver) → superreview+

Brendan Eich [:brendan]

Assignee

Comment 30

•

19 years ago

Cc'ing i18n gurus who may have good ideas.  I'm supposed to write up a proposal
for regular expression changes in Edition 4, and better Unicode support (without
switching incompatibly to Perl 6 regular expressions!) is a goal.

That work item of mine is not directly related to this bug, but it touches on
this topic.  I am willing to write a separate proposal on ZWNJ and ZWJ, with the
right guidance.  When should format-control characters be ignored, and when not?
 Should ZWNJ and ZWJ be treated specially?

/be

Behdad Esfahbod

Comment 31

•

19 years ago

I for one think there shouldn't be anything special about ZW[N]J at all.  Simply
that string literals should be literal, no characters added or removed.

Brendan Eich [:brendan]

Assignee

Comment 32

•

19 years ago

Behdad: got that, and the patch in this bug does that -- likewise for regular
expressions.  But should ZWNJ in a regexp, not in an escape or a character
class, be taken verbatim?

/be

Target Milestone: mozilla1.8beta2 → mozilla1.8beta3

Behdad Esfahbod

Comment 33

•

19 years ago

I think so.  That's what I'm using these days.  ZWNJ is a regular character used
in Persian and a few other languages (Indic languages IIRC).  Removing ZWNJ from
regexps introduces the same problem as for strings.  And like somebody pointed
earlier, string and regexp literals are not the only source of strings and
regexps in JavaScript, so, any special handling at parsing level just breaks things.

Erik van der Poel

Comment 34

•

19 years ago

Hi Brendan, this is not exactly one of my areas, but it might be a good idea
to talk to Mark Davis and/or take a look at his latest draft of TR18:

http://www.unicode.org/reports/tr18/tr18-10.html

In particular, go to Annex C and start a search (find) for "format".

Bob Clary [:bc] (inactive)

Comment 35

•

19 years ago

This does cause js/tests/ecma_3/Unicode/uc-001.js to fail.

Behdad Esfahbod

Comment 36

•

19 years ago

Certainly another bug, but worth mentioning here:  I've got reports that the new
versions of Firefox (with Uniscribe backend I guess) simply ignore '\u200c' too
and '&zwnj;' should be used instead.

Brendan Eich [:brendan]

Assignee

Comment 37

•

18 years ago

(In reply to comment #36)
> Certainly another bug, but worth mentioning here:  I've got reports that the
> new
> versions of Firefox (with Uniscribe backend I guess) simply ignore '\u200c' too
> and '&zwnj;' should be used instead.

Please file it separately if you can confirm the reports, especially with a testcase.  Thanks.

I have proposed to ECMA TG1 that for Edition 4 (and all versions, really), we do at least as IE does per comment 14.  That is, we do not strip format-control characters in string literals.

We still need to test IE, evaluate what it does in these cases, and decide whether that's good enough to become de-jure standard:

* In regexps, including in character classes and outside of them.

* When matching a regexp against a target string containing a format-control char.

* After backslash in string literals and regexps.

* Other edge cases?

It would be a big help if interested folks would construct a test matrix and fill it in with results from IE, other browsers, and also what's considered ideal.

/be
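
One possible starting point for such a test matrix is sketched below. The `probe` helper and the row labels are hypothetical, and no expected results are given: the point is to run the same source strings in each browser and record what comes back.

```javascript
const ZWNJ = "\u200C";

// Hypothetical helper: evaluates one source string and records either
// the result or the name of the error that was thrown.
function probe(label, sourceText) {
  try {
    return { label: label, result: String((0, eval)(sourceText)) };
  } catch (e) {
    return { label: label, result: e.name };
  }
}

const matrix = [
  probe("string literal with raw ZWNJ, length", '"a' + ZWNJ + 'b".length'),
  probe("regexp literal, ZWNJ outside class, vs 'ab'", "/a" + ZWNJ + 'b/.test("ab")'),
  probe("regexp literal, ZWNJ inside class, vs ZWNJ", "/[a" + ZWNJ + 'b]/.test("\\u200C")'),
  probe("raw ZWNJ after backslash in string, length", '"a\\' + ZWNJ + 'b".length'),
];

for (const row of matrix) {
  console.log(row.label + ": " + row.result);
}
```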

Brendan Eich [:brendan]

Assignee

Comment 38

•

18 years ago

I hear Opera flouts ECMA-262 utterly and does not strip any format-control chars, anywhere in script source.  If IE does strip outside of strings and possibly other literals, then Opera is +1.

/be

Brendan Eich [:brendan]

Assignee

Comment 39

•

18 years ago

We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do not strip format control characters; (b) therefore allow them in string literals and regular expression literals; (c) let them through outside of such quoted literals, where they become invalid source characters.

We are going to codify this behavior in Edition 4.  Look for mozilla1.9/firefox3 to implement it.

/be

Behdad Esfahbod

Comment 40

•

18 years ago

Thanks Brendan

Brendan Eich [:brendan]

Assignee

Comment 41

•

18 years ago

Attached patch: Don't skip format-control chars

diff -w version in a sec.

/be

Attachment #171584 - Attachment is obsolete: true

Attachment #240106 - Flags: review?(mrbkap)

Attachment #171584 - Flags: review?(waldemar)

Brendan Eich [:brendan]

Assignee

Comment 42

•

18 years ago

Attached patch: diff -w version of last attachment

Blake Kaplan (:mrbkap) (inactive)

Updated

•

18 years ago

Attachment #240106 - Flags: review?(mrbkap) → review+

Brendan Eich [:brendan]

Assignee

Comment 43

•

18 years ago

Fixed on the mozilla1.9 trunk:

js/src/jsscan.c 3.118

/be

Status: ASSIGNED → RESOLVED

Closed: 20 years ago → 18 years ago

Resolution: --- → FIXED

Jeff Walden [:Waldo]

Comment 44

•

18 years ago

*grumble about old QA contact*

(In reply to comment #39)
> We're testing IE here in an ECMA TG1 meeting, and it appears IE6 and IE7 (a) do
> not strip format control characters; (b) therefore allow them in string
> literals and regular expression literals; (c) let them through outside of such
> quoted literals, where they become invalid source characters.
> 
> We are going to codify this behavior in Edition 4.  Look for
> mozilla1.9/firefox3 to implement it.

I scanned through http://developer.mozilla.org/es4/ and didn't see anything yet codified for this -- I believe I understand (a) and (b), but (c) is less clear to me.  In that case, would the program be "syntactically in error" (5.1.4 in ed. 3) and thus result in a SyntaxError exception (well, if the program were being passed as a string to eval)?

QA Contact: pschwartau → general

Brendan Eich [:brendan]

Assignee

Comment 45

•

18 years ago

We will re-export pretty soon, and the latest wiki contents will say that Cf chars are not stripped.

(a) and (b) follow from lexical productions in ECMA-262 Ed. 3 section 7 that have right parts of the form "SourceCharacter but not ...", once you remove stripping of Cf chars specified in 7.1.  This allows Cf chars in comments as well as strings and regexps.

(c) follows from other lexical productions, e.g. for Identifier, which restrict the classes of component chars, terminating on any char outside the set of valid chars. This leaves the offending char, if it's Cf, not satisfying the goal symbols of the lexical grammar's topside specified in section 7.

/be

Brendan Eich [:brendan]

Assignee

Comment 46

•

18 years ago

js> eval("hi\uFEFFthere")
typein:1: SyntaxError: illegal character:
typein:1: hi?there
typein:1: ..^

/be
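
Cases (a) through (c) can be probed with a small sketch, assuming an engine that follows the non-stripping rule described in comment 39. `parses` is a hypothetical helper, and LEFT-TO-RIGHT MARK (U+200E) is used as the sample format-control character because U+FEFF was later handled separately (see bug 368516).

```javascript
const LRM = "\u200E";

// Hypothetical helper: reports whether the text parses as a function body.
function parses(sourceText) {
  try {
    new Function(sourceText);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// (a)/(b): allowed inside string literals, regexp literals and comments.
console.log(parses('var s = "hi' + LRM + 'there";'));  // true
console.log(parses("var r = /hi" + LRM + "there/;"));  // true
console.log(parses("// note " + LRM + "\nvar x;"));    // true

// (c): outside such literals it is an invalid source character.
console.log(parses("var hi" + LRM + "there;"));        // false
```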

Biju

Updated

•

18 years ago

Blocks: 361783

-fullmetaljacket-

Updated

•

18 years ago

Depends on: 368516

Bob Clary [:bc] (inactive)

Comment 47

•

17 years ago

/cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v  <--  regress-274152.js

Bob Clary [:bc] (inactive)

Comment 48

•

17 years ago

See bug 372198 where I added ecma_3/Unicode/uc-001.js to spidermonkey-n.tests.

Behdad Esfahbod

Comment 49

•

17 years ago

Relevant Unicode Public Review issue:
http://www.unicode.org/review/pr-96.html

Behdad Esfahbod

Comment 50

•

17 years ago

Any chance we can get the fix (or the partial one in the patch attached earlier by Brendan) into ff2 tree?

Bob Clary [:bc] (inactive)

Comment 51

•

17 years ago

updated due to bug 368516 to include format control chars and not BOM.

/cvsroot/mozilla/js/tests/ecma_3/extensions/regress-274152.js,v  <--  regress-274152.js
new revision: 1.3; previous revision: 1.2

Flags: in-testsuite+

Bob Clary [:bc] (inactive)

Comment 52

•

16 years ago

v 1.9.0

Status: RESOLVED → VERIFIED

Daniel Veditz [:dveditz]

Comment 53

•

16 years ago

Should this fix be ported back to the 1.8 branch so web pages get consistent behavior?

Flags: wanted1.8.1.x?

Flags: blocking1.8.1.16?

Daniel Veditz [:dveditz]

Comment 54

•

16 years ago

No reply, I guess there's no pressing need on the 1.8 branch, and if not probably best not to break any existing scripts.

Flags: wanted1.8.1.x?

Flags: blocking1.8.1.17?

Minimalish testcase demonstrating that the content model is correct; 20 years ago; Boris Zbarsky [:bzbarsky]; 1.65 KB, text/html
patch for testing and discussion; 20 years ago; Brendan Eich [:brendan]; 4.20 KB, patch; shaver: superreview+
Don't skip format-control chars; 18 years ago; Brendan Eich [:brendan]; 12.07 KB, patch; mrbkap: review+
diff -w version of last attachment; 18 years ago; Brendan Eich [:brendan]; 1.30 KB, patch