Pattern matching failing on non-Latin1 characters

VERIFIED FIXED

Status

()

Core
JavaScript Engine
VERIFIED FIXED
17 years ago
13 years ago

People

(Reporter: hyvu, Assigned: rogerl (gone))

Tracking

({intl})

Trunk
x86
Windows NT
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [Same problem exists in Rhino; see bug 91343])

Attachments

(5 attachments)

(Reporter)

Description

17 years ago
If you switch character encoding to Chinese Simplified (GB2312), "seach" method
returns other than -1 all the times.  It happens in NS6.0, 6.01, and Mozilla
build 20010215.
----------

<html>
<script>
    // check if only contains white space characters
    //var str = "    ";
    var str = "¿Í»§";
    // var str = "Hello World!";
    if( str.search( /[\S]+/ ) != -1 )
        alert("No space: " + str);
    else
        alert("All space: " + str);

    document.write("<body bgcolor=white>");
    document.write("<a href='http://www.yahoo.com'>");
    document.write(str);
    document.write("</a>");
    document.write("</body>");
</script>
</html>

Comment 1

17 years ago
Created attachment 28553 [details]
HTML testcase

Comment 2

17 years ago
Try the attached testcase above in NN 4.7 side-by-side with Mozilla.
This tests the literal string shown above, and a Unicode-escaped string: 


var str = "¿Í»§";
var str = '\u00BF\u00CD\u00BB\u00A7';


Mozlla/N6 behaves identically with NN4.7 as far as I can see.
Method: switch character encoding to Chinese Simplified (GB2312)
in both browsers, and bring them up side-by-side. Click on the
above testcase. 


This shows the problem is not with the regular expression engine
inside the JS Engine. When provided with the correct Unicode characters,
it does the search and match correctly. 


The problem may be in Mozilla the literal string produces the wrong 
character codes. This reminds me of bugs like bug 55468, but I don't 
know if it's a duplicate. 


Reassigning to Internationalization component for further analysis - 
Assignee: rogerl → nhotta
Component: Javascript Engine → Internationalization
QA Contact: pschwartau → andreasb

Comment 3

17 years ago
Changing QA Contact to ylong.  Yuying, can you try and confirm this bug?  Thanks.
QA Contact: andreasb → ylong

Comment 4

17 years ago
Looks like non Latin1 characters without unicode escape are regarded as spaces
when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).
Status: UNCONFIRMED → NEW
Ever confirmed: true

Updated

17 years ago
Target Milestone: --- → Future

Comment 5

17 years ago
JavaScript is now unicode internal. I think this is a JavaScript issue. naoki, 
could you please supply a better sample page which have meta tag in it and 
reassign this to beard. Thanks.

Updated

17 years ago
Keywords: intl
Frank, didn't pschwartau's comments dated 2001-03-22 18:59 convincingly
demonstrate that this is *not* a JS engine problem?

/be

Updated

17 years ago
Status: NEW → ASSIGNED

Comment 7

17 years ago
Created attachment 42088 [details]
test cases with Chinese in unicode escaped

Comment 8

17 years ago
brendan- this IS a JS engine problem. I attach a modified version of test case
which include CJK character in /u escaped form. And it still does not work. If
the problem is outside the JS engine, then the /u escaped of CJK characters will
work but direct encoded characters won't. However, the new test case prove that
is not the case. /u encoded non Latin1 (Unicode code point > U+00FF) does not
work. Change the compoenet to JS and reassign 
Try the 2nd attachement. You should not see -1 in the test 5. 
Assignee: nhotta → rogerl
Status: ASSIGNED → NEW
Component: Internationalization → Javascript Engine
QA Contact: ylong → pschwartau
Target Milestone: Future → ---

Comment 9

17 years ago
It looks like Frank is right. 

I am running this test in the standalone JS shell. The regular expression,
as always in this bug, is /[\S]+/ , which is looking for one or more 
non-white space characters:


var status = '';
var str = '';
var re = /[\S]+/;
print('\n**********   re = ' +  re +  '   **********\n');

status = '4 low Unicode chars = Latin1';
str = '\u00BF\u00CD\u00BB\u00A7';
showThis();

status = 'Now put a space in the middle';
str = '\u00BF\u00CD \u00BB\u00A7';
showThis();

status = '4 high Unicode chars = non-Latin1';
str = '\u4e00\uac00\u4e03\u4e00';
showThis();

status = 'Now put a space in the middle';
str = '\u4e00\uac00 \u4e03\u4e00';
showThis();

function showThis()
{
  print(status);
  print('str = ' + str);
  print('str.match(re) = ' + str.match(re));
  print('\n');
}


As Frank has stated, as soon as we have characters > \u00FF,
the pattern-matching fails - here is the output from the JS shell:


re = /[\S]+/

4 low Unicode chars = Latin1
str = +-+º
str.match(re) = +-+º

Now put a space in the middle
str = +- +º
str.match(re) = +-

4 high Unicode chars = non-Latin1
str =
str.match(re) = null

Now put a space in the middle
str =
str.match(re) = null 

Comment 10

17 years ago
This failure reminds me of the comment made above:


 ----- Additional Comments From nhotta@netscape.com 2001-03-23 16:19 -----

 Looks like non Latin1 characters without unicode escape are regarded as spaces
 when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).


But we are seeing this failure directly in the JS shell!
Let me ask rogerl if I'm making a mistake here or not -

Comment 11

17 years ago
Created attachment 42672 [details]
HTML testcase comparing high Unicode vs. low Unicode chars

Comment 12

17 years ago
Here's what I find with the HTML testcase directly above:

IE4.7  :  matches both the low and high Unicode chars
NN4.7  :  matches the low Unicode chars only
Moz/N6 :  matches the low Unicode chars only

Comment 13

17 years ago
Testcase added to JS testsuite:

           mozilla/js/tests/ecma_3/RegExp/regress-72964.js


Testcase failing in Rhino as well as in SpiderMonkey -

Comment 14

17 years ago
Bug 91343 has been filed against Rhino for this issue - 
Summary: String method for pattern matching failed for Chinese Simplified (GB2312) → Pattern matching failing on non-Latin1 characters
Whiteboard: [Same problem exists in Rhino; see bug 91343]

Comment 15

17 years ago
Rhino fix verified on WinNT. Testcase passes in both rhino, rhinoi shells -
(Assignee)

Comment 16

16 years ago
Created attachment 53594 [details] [diff] [review]
Fix by expanding not-cases to full unicode range

Comment 17

16 years ago
Need sr=, r= for this patch
(Assignee)

Updated

16 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 18

16 years ago
Created attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

Comment 19

16 years ago
I am getting a failure of tests from Phil (comment #9) in Spidermonkey using the
latest patch (12/18/2001)

Comment 20

16 years ago
Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

r=khanson
Previous comment in error.  Codewarrior shell was not correctly reading
non-Latin characters
Attachment #62145 - Flags: review+
Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

Can't we make the not-case use the smaller bitmap (for chars 0-255) but still
handle full Unicode?  Don't let that stop this fix from going in now, just rent
me a clue.  sr=brendan@mozilla.org.

/be
Attachment #62145 - Flags: superreview+
(Assignee)

Comment 22

16 years ago
Not sure what scheme you have in mind? We need to match high-byte unicode 
characters against the class. If the bitmap stops at 255, those characters will 
fail to match.  [I guess a potential idea is to flip the sense of the 
bitmap-test and build the not-bitmap, assuming it is smaller, though this will 
not help in some cases - where a high-byte unicode literal is specified in the 
class.]
For a negated class where the characters not included are all ISO-Latin-1 or
even ASCII, remember to check the range of the character being matched against
the domain limit (256 or 128).  I thought the 4.x code did this.

Of course you need a full-sized bitmap if there are high-byte Unichars in the
negated class, just as in a non-negated class.

/be
(Assignee)

Comment 24

16 years ago
That code's still there, if a text character is outside of the range of the 
bitmap then the '^' flag determines if that means inclusion or not. In a class 
with a negative term - [\S] we end up turning on most of the bits and testing 
for inclusion. Since the highest unicode whitespace is 12288, we could turn [\S] 
into [^\s] and save a bunch of those bits.
Patch has review, what's the hold-up?  (I missed an else after break
non-sequitur, fix that if you can please.)

/be
(Assignee)

Comment 26

16 years ago
Fix checked in with else fixed.
Status: ASSIGNED → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED

Comment 27

16 years ago
Verified FIXED. The above JS testcase now passes in the debug and
optimized JS shell on Linux, WinNT, and Mac9.1.

In addition, the HTML testcase above now matches both the high- and
low-byte characters in the test string, using Mozilla trunk binaries
20020116xx on all three platforms.
Status: RESOLVED → VERIFIED

Updated

13 years ago
Flags: testcase+
You need to log in before you can comment on or make changes to this bug.