Closed
Bug 72964
Opened 23 years ago
Closed 23 years ago
Pattern matching failing on non-Latin1 characters
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
VERIFIED
FIXED
People
(Reporter: hyvu, Assigned: rogerl)
Details
(Keywords: intl, Whiteboard: [Same problem exists in Rhino; see bug 91343])
Attachments
(5 files)
2.05 KB,
text/html
|
Details | |
2.31 KB,
text/html
|
Details | |
824 bytes,
text/html
|
Details | |
649 bytes,
patch
|
Details | Diff | Splinter Review | |
1.60 KB,
patch
|
khanson
:
review+
brendan
:
superreview+
|
Details | Diff | Splinter Review |
If you switch character encoding to Chinese Simplified (GB2312), "seach" method returns other than -1 all the times. It happens in NS6.0, 6.01, and Mozilla build 20010215. ---------- <html> <script> // check if only contains white space characters //var str = " "; var str = "¿Í»§"; // var str = "Hello World!"; if( str.search( /[\S]+/ ) != -1 ) alert("No space: " + str); else alert("All space: " + str); document.write("<body bgcolor=white>"); document.write("<a href='http://www.yahoo.com'>"); document.write(str); document.write("</a>"); document.write("</body>"); </script> </html>
Comment 1•23 years ago
|
||
Comment 2•23 years ago
|
||
Try the attached testcase above in NN 4.7 side-by-side with Mozilla. This tests the literal string shown above, and a Unicode-escaped string: var str = "¿Í»§"; var str = '\u00BF\u00CD\u00BB\u00A7'; Mozlla/N6 behaves identically with NN4.7 as far as I can see. Method: switch character encoding to Chinese Simplified (GB2312) in both browsers, and bring them up side-by-side. Click on the above testcase. This shows the problem is not with the regular expression engine inside the JS Engine. When provided with the correct Unicode characters, it does the search and match correctly. The problem may be in Mozilla the literal string produces the wrong character codes. This reminds me of bugs like bug 55468, but I don't know if it's a duplicate. Reassigning to Internationalization component for further analysis -
Assignee: rogerl → nhotta
Component: Javascript Engine → Internationalization
QA Contact: pschwartau → andreasb
Comment 3•23 years ago
|
||
Changing QA Contact to ylong. Yuying, can you try and confirm this bug? Thanks.
QA Contact: andreasb → ylong
Comment 4•23 years ago
|
||
Looks like non Latin1 characters without unicode escape are regarded as spaces when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312).
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•23 years ago
|
Target Milestone: --- → Future
Comment 5•23 years ago
|
||
JavaScript is now unicode internal. I think this is a JavaScript issue. naoki, could you please supply a better sample page which have meta tag in it and reassign this to beard. Thanks.
Comment 6•23 years ago
|
||
Frank, didn't pschwartau's comments dated 2001-03-22 18:59 convincingly demonstrate that this is *not* a JS engine problem? /be
Updated•23 years ago
|
Status: NEW → ASSIGNED
Comment 7•23 years ago
|
||
Comment 8•23 years ago
|
||
brendan- this IS a JS engine problem. I attach a modified version of test case which include CJK character in /u escaped form. And it still does not work. If the problem is outside the JS engine, then the /u escaped of CJK characters will work but direct encoded characters won't. However, the new test case prove that is not the case. /u encoded non Latin1 (Unicode code point > U+00FF) does not work. Change the compoenet to JS and reassign Try the 2nd attachement. You should not see -1 in the test 5.
Assignee: nhotta → rogerl
Status: ASSIGNED → NEW
Component: Internationalization → Javascript Engine
QA Contact: ylong → pschwartau
Target Milestone: Future → ---
Comment 9•23 years ago
|
||
It looks like Frank is right. I am running this test in the standalone JS shell. The regular expression, as always in this bug, is /[\S]+/ , which is looking for one or more non-white space characters: var status = ''; var str = ''; var re = /[\S]+/; print('\n********** re = ' + re + ' **********\n'); status = '4 low Unicode chars = Latin1'; str = '\u00BF\u00CD\u00BB\u00A7'; showThis(); status = 'Now put a space in the middle'; str = '\u00BF\u00CD \u00BB\u00A7'; showThis(); status = '4 high Unicode chars = non-Latin1'; str = '\u4e00\uac00\u4e03\u4e00'; showThis(); status = 'Now put a space in the middle'; str = '\u4e00\uac00 \u4e03\u4e00'; showThis(); function showThis() { print(status); print('str = ' + str); print('str.match(re) = ' + str.match(re)); print('\n'); } As Frank has stated, as soon as we have characters > \u00FF, the pattern-matching fails - here is the output from the JS shell: re = /[\S]+/ 4 low Unicode chars = Latin1 str = +-+º str.match(re) = +-+º Now put a space in the middle str = +- +º str.match(re) = +- 4 high Unicode chars = non-Latin1 str = str.match(re) = null Now put a space in the middle str = str.match(re) = null
Comment 10•23 years ago
|
||
This failure reminds me of the comment made above: ----- Additional Comments From nhotta@netscape.com 2001-03-23 16:19 ----- Looks like non Latin1 characters without unicode escape are regarded as spaces when a document charset is multibyte (e.g. UTF-8, EUC-JP, GB2312). But we are seeing this failure directly in the JS shell! Let me ask rogerl if I'm making a mistake here or not -
Comment 11•23 years ago
|
||
Comment 12•23 years ago
|
||
Here's what I find with the HTML testcase directly above: IE4.7 : matches both the low and high Unicode chars NN4.7 : matches the low Unicode chars only Moz/N6 : matches the low Unicode chars only
Comment 13•23 years ago
|
||
Testcase added to JS testsuite: mozilla/js/tests/ecma_3/RegExp/regress-72964.js Testcase failing in Rhino as well as in SpiderMonkey -
Comment 14•23 years ago
|
||
Bug 91343 has been filed against Rhino for this issue -
Summary: String method for pattern matching failed for Chinese Simplified (GB2312) → Pattern matching failing on non-Latin1 characters
Whiteboard: [Same problem exists in Rhino; see bug 91343]
Comment 15•23 years ago
|
||
Rhino fix verified on WinNT. Testcase passes in both rhino, rhinoi shells -
Assignee | ||
Comment 16•23 years ago
|
||
Comment 17•23 years ago
|
||
Need sr=, r= for this patch
Assignee | ||
Updated•23 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 18•23 years ago
|
||
Comment 19•23 years ago
|
||
I am getting a failure of tests from Phil (comment #9) in Spidermonkey using the latest patch (12/18/2001)
Comment 20•23 years ago
|
||
Comment on attachment 62145 [details] [diff] [review] Updated to current rev. & added context r=khanson Previous comment in error. Codewarrior shell was not correctly reading non-Latin characters
Attachment #62145 -
Flags: review+
Comment 21•23 years ago
|
||
Comment on attachment 62145 [details] [diff] [review] Updated to current rev. & added context Can't we make the not-case use the smaller bitmap (for chars 0-255) but still handle full Unicode? Don't let that stop this fix from going in now, just rent me a clue. sr=brendan@mozilla.org. /be
Attachment #62145 -
Flags: superreview+
Assignee | ||
Comment 22•23 years ago
|
||
Not sure what scheme you have in mind? We need to match high-byte unicode characters against the class. If the bitmap stops at 255, those characters will fail to match. [I guess a potential idea is to flip the sense of the bitmap-test and build the not-bitmap, assuming it is smaller, though this will not help in some cases - where a high-byte unicode literal is specified in the class.]
Comment 23•23 years ago
|
||
For a negated class where the characters not included are all ISO-Latin-1 or even ASCII, remember to check the range of the character being matched against the domain limit (256 or 128). I thought the 4.x code did this. Of course you need a full-sized bitmap if there are high-byte Unichars in the negated class, just as in a non-negated class. /be
Assignee | ||
Comment 24•23 years ago
|
||
That code's still there, if a text character is outside of the range of the bitmap then the '^' flag determines if that means inclusion or not. In a class with a negative term - [\S] we end up turning on most of the bits and testing for inclusion. Since the highest unicode whitespace is 12288, we could turn [\S] into [^\s] and save a bunch of those bits.
Comment 25•23 years ago
|
||
Patch has review, what's the hold-up? (I missed an else after break non-sequitur, fix that if you can please.) /be
Assignee | ||
Comment 26•23 years ago
|
||
Fix checked in with else fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 27•23 years ago
|
||
Verified FIXED. The above JS testcase now passes in the debug and optimized JS shell on Linux, WinNT, and Mac9.1. In addition, the HTML testcase above now matches both the high- and low-byte characters in the test string, using Mozilla trunk binaries 20020116xx on all three platforms.
Status: RESOLVED → VERIFIED
Updated•19 years ago
|
Flags: testcase+
You need to log in
before you can comment on or make changes to this bug.
Description
•