Closed Bug 72964 Opened 19 years ago Closed 18 years ago

Pattern matching failing on non-Latin1 characters

Categories

(Core :: JavaScript Engine, defect)

x86
Windows NT
defect
Not set

Tracking

()

VERIFIED FIXED

People

(Reporter: hyvu, Assigned: rogerl)

Details

(Keywords: intl, Whiteboard: [Same problem exists in Rhino; see bug 91343])

Attachments

(5 files)

If you switch character encoding to Chinese Simplified (GB2312), "seach" method
returns other than -1 all the times.  It happens in NS6.0, 6.01, and Mozilla
build 20010215.
----------

<html>
<script>
    // check if only contains white space characters
    //var str = "    ";
    var str = "¿Í»§";
    // var str = "Hello World!";
    if( str.search( /[\S]+/ ) != -1 )
        alert("No space: " + str);
    else
        alert("All space: " + str);

    document.write("<body bgcolor=white>");
    document.write("<a href='http://www.yahoo.com'>");
    document.write(str);
    document.write("</a>");
    document.write("</body>");
</script>
</html>
Attached file HTML testcase
Try the attached testcase above in NN 4.7 side-by-side with Mozilla.
This tests the literal string shown above, and a Unicode-escaped string: 


var str = "¿Í»§";
var str = '\u00BF\u00CD\u00BB\u00A7';


Mozlla/N6 behaves identically with NN4.7 as far as I can see.
Method: switch character encoding to Chinese Simplified (GB2312)
in both browsers, and bring them up side-by-side. Click on the
above testcase. 


This shows the problem is not with the regular expression engine
inside the JS Engine. When provided with the correct Unicode characters,
it does the search and match correctly. 


The problem may be in Mozilla the literal string produces the wrong 
character codes. This reminds me of bugs like bug 55468, but I don't 
know if it's a duplicate. 


Reassigning to Internationalization component for further analysis - 
Assignee: rogerl → nhotta
Component: Javascript Engine → Internationalization
QA Contact: pschwartau → andreasb
Changing QA Contact to ylong.  Yuying, can you try and confirm this bug?  Thanks.
QA Contact: andreasb → ylong
Looks like non Latin1 characters without unicode escape are regarded as spaces
when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).
Status: UNCONFIRMED → NEW
Ever confirmed: true
Target Milestone: --- → Future
JavaScript is now unicode internal. I think this is a JavaScript issue. naoki, 
could you please supply a better sample page which have meta tag in it and 
reassign this to beard. Thanks.
Keywords: intl
Frank, didn't pschwartau's comments dated 2001-03-22 18:59 convincingly
demonstrate that this is *not* a JS engine problem?

/be
Status: NEW → ASSIGNED
brendan- this IS a JS engine problem. I attach a modified version of test case
which include CJK character in /u escaped form. And it still does not work. If
the problem is outside the JS engine, then the /u escaped of CJK characters will
work but direct encoded characters won't. However, the new test case prove that
is not the case. /u encoded non Latin1 (Unicode code point > U+00FF) does not
work. Change the compoenet to JS and reassign 
Try the 2nd attachement. You should not see -1 in the test 5. 
Assignee: nhotta → rogerl
Status: ASSIGNED → NEW
Component: Internationalization → Javascript Engine
QA Contact: ylong → pschwartau
Target Milestone: Future → ---
It looks like Frank is right. 

I am running this test in the standalone JS shell. The regular expression,
as always in this bug, is /[\S]+/ , which is looking for one or more 
non-white space characters:


var status = '';
var str = '';
var re = /[\S]+/;
print('\n**********   re = ' +  re +  '   **********\n');

status = '4 low Unicode chars = Latin1';
str = '\u00BF\u00CD\u00BB\u00A7';
showThis();

status = 'Now put a space in the middle';
str = '\u00BF\u00CD \u00BB\u00A7';
showThis();

status = '4 high Unicode chars = non-Latin1';
str = '\u4e00\uac00\u4e03\u4e00';
showThis();

status = 'Now put a space in the middle';
str = '\u4e00\uac00 \u4e03\u4e00';
showThis();

function showThis()
{
  print(status);
  print('str = ' + str);
  print('str.match(re) = ' + str.match(re));
  print('\n');
}


As Frank has stated, as soon as we have characters > \u00FF,
the pattern-matching fails - here is the output from the JS shell:


re = /[\S]+/

4 low Unicode chars = Latin1
str = +-+º
str.match(re) = +-+º

Now put a space in the middle
str = +- +º
str.match(re) = +-

4 high Unicode chars = non-Latin1
str =
str.match(re) = null

Now put a space in the middle
str =
str.match(re) = null 
This failure reminds me of the comment made above:


 ----- Additional Comments From nhotta@netscape.com 2001-03-23 16:19 -----

 Looks like non Latin1 characters without unicode escape are regarded as spaces
 when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).


But we are seeing this failure directly in the JS shell!
Let me ask rogerl if I'm making a mistake here or not -
Here's what I find with the HTML testcase directly above:

IE4.7  :  matches both the low and high Unicode chars
NN4.7  :  matches the low Unicode chars only
Moz/N6 :  matches the low Unicode chars only
Testcase added to JS testsuite:

           mozilla/js/tests/ecma_3/RegExp/regress-72964.js


Testcase failing in Rhino as well as in SpiderMonkey -
Bug 91343 has been filed against Rhino for this issue - 
Summary: String method for pattern matching failed for Chinese Simplified (GB2312) → Pattern matching failing on non-Latin1 characters
Whiteboard: [Same problem exists in Rhino; see bug 91343]
Rhino fix verified on WinNT. Testcase passes in both rhino, rhinoi shells -
Need sr=, r= for this patch
Status: NEW → ASSIGNED
I am getting a failure of tests from Phil (comment #9) in Spidermonkey using the
latest patch (12/18/2001)
Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

r=khanson
Previous comment in error.  Codewarrior shell was not correctly reading
non-Latin characters
Attachment #62145 - Flags: review+
Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

Can't we make the not-case use the smaller bitmap (for chars 0-255) but still
handle full Unicode?  Don't let that stop this fix from going in now, just rent
me a clue.  sr=brendan@mozilla.org.

/be
Attachment #62145 - Flags: superreview+
Not sure what scheme you have in mind? We need to match high-byte unicode 
characters against the class. If the bitmap stops at 255, those characters will 
fail to match.  [I guess a potential idea is to flip the sense of the 
bitmap-test and build the not-bitmap, assuming it is smaller, though this will 
not help in some cases - where a high-byte unicode literal is specified in the 
class.]
For a negated class where the characters not included are all ISO-Latin-1 or
even ASCII, remember to check the range of the character being matched against
the domain limit (256 or 128).  I thought the 4.x code did this.

Of course you need a full-sized bitmap if there are high-byte Unichars in the
negated class, just as in a non-negated class.

/be
That code's still there, if a text character is outside of the range of the 
bitmap then the '^' flag determines if that means inclusion or not. In a class 
with a negative term - [\S] we end up turning on most of the bits and testing 
for inclusion. Since the highest unicode whitespace is 12288, we could turn [\S] 
into [^\s] and save a bunch of those bits.
Patch has review, what's the hold-up?  (I missed an else after break
non-sequitur, fix that if you can please.)

/be
Fix checked in with else fixed.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Verified FIXED. The above JS testcase now passes in the debug and
optimized JS shell on Linux, WinNT, and Mac9.1.

In addition, the HTML testcase above now matches both the high- and
low-byte characters in the test string, using Mozilla trunk binaries
20020116xx on all three platforms.
Status: RESOLVED → VERIFIED
Flags: testcase+
You need to log in before you can comment on or make changes to this bug.