Closed Bug 72964 Opened 23 years ago Closed 23 years ago

Pattern matching failing on non-Latin1 characters

Tracking

()

Status:

VERIFIED FIXED

People

(Reporter: hyvu, Assigned: rogerl)

Details

(Keywords: intl, Whiteboard: [Same problem exists in Rhino; see bug 91343])

Attachments

(5 files)

HTML testcase 23 years ago Phil Schwartau 2.05 KB, text/html		Details
test cases with Chinese in unicode escaped 23 years ago Frank Tang 2.31 KB, text/html		Details
HTML testcase comparing high Unicode vs. low Unicode chars 23 years ago Phil Schwartau 824 bytes, text/html		Details
Fix by expanding not-cases to full unicode range 23 years ago rogerl (gone) 649 bytes, patch		Details \| Diff \| Splinter Review
Updated to current rev. & added context 23 years ago rogerl (gone) 1.60 KB, patch	khanson : review+ brendan : superreview+	Details \| Diff \| Splinter Review

hyvu

Reporter

Description

•

23 years ago

If you switch character encoding to Chinese Simplified (GB2312), "seach" method
returns other than -1 all the times.  It happens in NS6.0, 6.01, and Mozilla
build 20010215.
----------

<html>
<script>
    // check if only contains white space characters
    //var str = "    ";
    var str = "¿Í»§";
    // var str = "Hello World!";
    if( str.search( /[\S]+/ ) != -1 )
        alert("No space: " + str);
    else
        alert("All space: " + str);

    document.write("<body bgcolor=white>");
    document.write("<a href='http://www.yahoo.com'>");
    document.write(str);
    document.write("</a>");
    document.write("</body>");
</script>
</html>

Phil Schwartau

Comment 1

•

23 years ago

Attached file HTML testcase — Details

Phil Schwartau

Comment 2

•

23 years ago

Try the attached testcase above in NN 4.7 side-by-side with Mozilla.
This tests the literal string shown above, and a Unicode-escaped string: 


var str = "¿Í»§";
var str = '\u00BF\u00CD\u00BB\u00A7';


Mozlla/N6 behaves identically with NN4.7 as far as I can see.
Method: switch character encoding to Chinese Simplified (GB2312)
in both browsers, and bring them up side-by-side. Click on the
above testcase. 


This shows the problem is not with the regular expression engine
inside the JS Engine. When provided with the correct Unicode characters,
it does the search and match correctly. 


The problem may be in Mozilla the literal string produces the wrong 
character codes. This reminds me of bugs like bug 55468, but I don't 
know if it's a duplicate. 


Reassigning to Internationalization component for further analysis -

Assignee: rogerl → nhotta

Component: Javascript Engine → Internationalization

QA Contact: pschwartau → andreasb

Andreas Becker

Comment 3

•

23 years ago

Changing QA Contact to ylong.  Yuying, can you try and confirm this bug?  Thanks.

QA Contact: andreasb → ylong

nhottanscp

Comment 4

•

23 years ago

Looks like non Latin1 characters without unicode escape are regarded as spaces
when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).

Status: UNCONFIRMED → NEW

Ever confirmed: true

nhottanscp

Updated

•

23 years ago

Target Milestone: --- → Future

Frank Tang

Comment 5

•

23 years ago

JavaScript is now unicode internal. I think this is a JavaScript issue. naoki, 
could you please supply a better sample page which have meta tag in it and 
reassign this to beard. Thanks.

Andreas Becker

Updated

•

23 years ago

Keywords: intl

Brendan Eich [:brendan]

Comment 6

•

23 years ago

Frank, didn't pschwartau's comments dated 2001-03-22 18:59 convincingly
demonstrate that this is *not* a JS engine problem?

/be

nhottanscp

Updated

•

23 years ago

Status: NEW → ASSIGNED

Frank Tang

Comment 7

•

23 years ago

Attached file test cases with Chinese in unicode escaped — Details

Frank Tang

Comment 8

•

23 years ago

brendan- this IS a JS engine problem. I attach a modified version of test case
which include CJK character in /u escaped form. And it still does not work. If
the problem is outside the JS engine, then the /u escaped of CJK characters will
work but direct encoded characters won't. However, the new test case prove that
is not the case. /u encoded non Latin1 (Unicode code point > U+00FF) does not
work. Change the compoenet to JS and reassign 
Try the 2nd attachement. You should not see -1 in the test 5.

Assignee: nhotta → rogerl

Status: ASSIGNED → NEW

Component: Internationalization → Javascript Engine

QA Contact: ylong → pschwartau

Target Milestone: Future → ---

Phil Schwartau

Comment 9

•

23 years ago

It looks like Frank is right. 

I am running this test in the standalone JS shell. The regular expression,
as always in this bug, is /[\S]+/ , which is looking for one or more 
non-white space characters:


var status = '';
var str = '';
var re = /[\S]+/;
print('\n**********   re = ' +  re +  '   **********\n');

status = '4 low Unicode chars = Latin1';
str = '\u00BF\u00CD\u00BB\u00A7';
showThis();

status = 'Now put a space in the middle';
str = '\u00BF\u00CD \u00BB\u00A7';
showThis();

status = '4 high Unicode chars = non-Latin1';
str = '\u4e00\uac00\u4e03\u4e00';
showThis();

status = 'Now put a space in the middle';
str = '\u4e00\uac00 \u4e03\u4e00';
showThis();

function showThis()
{
  print(status);
  print('str = ' + str);
  print('str.match(re) = ' + str.match(re));
  print('\n');
}


As Frank has stated, as soon as we have characters > \u00FF,
the pattern-matching fails - here is the output from the JS shell:


re = /[\S]+/

4 low Unicode chars = Latin1
str = +-+º
str.match(re) = +-+º

Now put a space in the middle
str = +- +º
str.match(re) = +-

4 high Unicode chars = non-Latin1
str =
str.match(re) = null

Now put a space in the middle
str =
str.match(re) = null

Phil Schwartau

Comment 10

•

23 years ago

This failure reminds me of the comment made above:


 ----- Additional Comments From nhotta@netscape.com 2001-03-23 16:19 -----

 Looks like non Latin1 characters without unicode escape are regarded as spaces
 when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).


But we are seeing this failure directly in the JS shell!
Let me ask rogerl if I'm making a mistake here or not -

Phil Schwartau

Comment 11

•

23 years ago

Attached file HTML testcase comparing high Unicode vs. low Unicode chars — Details

Phil Schwartau

Comment 12

•

23 years ago

Here's what I find with the HTML testcase directly above:

IE4.7  :  matches both the low and high Unicode chars
NN4.7  :  matches the low Unicode chars only
Moz/N6 :  matches the low Unicode chars only

Phil Schwartau

Comment 13

•

23 years ago

Testcase added to JS testsuite:

           mozilla/js/tests/ecma_3/RegExp/regress-72964.js


Testcase failing in Rhino as well as in SpiderMonkey -

Phil Schwartau

Comment 14

•

23 years ago

Bug 91343 has been filed against Rhino for this issue -

Summary: String method for pattern matching failed for Chinese Simplified (GB2312) → Pattern matching failing on non-Latin1 characters

Whiteboard: [Same problem exists in Rhino; see bug 91343]

Phil Schwartau

Comment 15

•

23 years ago

Rhino fix verified on WinNT. Testcase passes in both rhino, rhinoi shells -

rogerl (gone)

Assignee

Comment 16

•

23 years ago

Attached patch Fix by expanding not-cases to full unicode range — Details — Splinter Review

Phil Schwartau

Comment 17

•

23 years ago

Need sr=, r= for this patch

rogerl (gone)

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

rogerl (gone)

Assignee

Comment 18

•

23 years ago

Attached patch Updated to current rev. & added context — Details — Splinter Review

Kenton Hanson (gone)

Comment 19

•

23 years ago

I am getting a failure of tests from Phil (comment #9) in Spidermonkey using the
latest patch (12/18/2001)

Kenton Hanson (gone)

Comment 20

•

23 years ago

Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

r=khanson
Previous comment in error.  Codewarrior shell was not correctly reading
non-Latin characters

Attachment #62145 - Flags: review+

Brendan Eich [:brendan]

Comment 21

•

23 years ago

Comment on attachment 62145 [details] [diff] [review]
Updated to current rev. & added context

Can't we make the not-case use the smaller bitmap (for chars 0-255) but still
handle full Unicode?  Don't let that stop this fix from going in now, just rent
me a clue.  sr=brendan@mozilla.org.

/be

Attachment #62145 - Flags: superreview+

rogerl (gone)

Assignee

Comment 22

•

23 years ago

Not sure what scheme you have in mind? We need to match high-byte unicode 
characters against the class. If the bitmap stops at 255, those characters will 
fail to match.  [I guess a potential idea is to flip the sense of the 
bitmap-test and build the not-bitmap, assuming it is smaller, though this will 
not help in some cases - where a high-byte unicode literal is specified in the 
class.]

Brendan Eich [:brendan]

Comment 23

•

23 years ago

For a negated class where the characters not included are all ISO-Latin-1 or
even ASCII, remember to check the range of the character being matched against
the domain limit (256 or 128).  I thought the 4.x code did this.

Of course you need a full-sized bitmap if there are high-byte Unichars in the
negated class, just as in a non-negated class.

/be

rogerl (gone)

Assignee

Comment 24

•

23 years ago

That code's still there, if a text character is outside of the range of the 
bitmap then the '^' flag determines if that means inclusion or not. In a class 
with a negative term - [\S] we end up turning on most of the bits and testing 
for inclusion. Since the highest unicode whitespace is 12288, we could turn [\S] 
into [^\s] and save a bunch of those bits.

Brendan Eich [:brendan]

Comment 25

•

23 years ago

Patch has review, what's the hold-up?  (I missed an else after break
non-sequitur, fix that if you can please.)

/be

rogerl (gone)

Assignee

Comment 26

•

23 years ago

Fix checked in with else fixed.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Phil Schwartau

Comment 27

•

23 years ago

Verified FIXED. The above JS testcase now passes in the debug and
optimized JS shell on Linux, WinNT, and Mac9.1.

In addition, the HTML testcase above now matches both the high- and
low-byte characters in the test string, using Mozilla trunk binaries
20020116xx on all three platforms.

Status: RESOLVED → VERIFIED

Bob Clary [:bc] (inactive)

Updated

•

19 years ago

Flags: testcase+

You need to log in before you can comment on or make changes to this bug.