Problem with RegExp using \b with unicode chars

RESOLVED INVALID

Status

Core
JavaScript Engine
RESOLVED INVALID
7 years ago
7 years ago

People

(Reporter: Quim Perez, Unassigned)

Details

(Reporter)

Description

7 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; ca; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13
Build Identifier: Firefox 3.6 and Firefox 4 (beta 1..10)

The RegExp /\bword\b/ returns true when it finds the whole word "word". But when the search word starts or ends with a non-ASCII Unicode char, the results are not what I expect.



Reproducible: Always

Actual Results:  
For example:

1)  /\baixò\b/.test(" això ")  :  false
    should be TRUE

2)  /\baixò\b/.test("això")  :  false
    should be TRUE

3)  /\baixò\b/.test("aixòs")  :  true
   should be FALSE

Without Unicode chars it works fine:

/\baixo\b/.test(" aixo ")  : true

/\baixo\b/.test("aixo")   : true

/\baixo\b/.test("aixos")  : false




I've found some bugs related to this:

https://bugzilla.mozilla.org/show_bug.cgi?id=247179
https://bugzilla.mozilla.org/show_bug.cgi?id=550984

Both bugs say that, per ECMA-262 15.10.2.6, the \b assertion should break at 'word' boundaries, where 'word' means the characters A-Za-z0-9_ and no others.
If that is the case, examples 1 and 2 are correct in returning FALSE, but example 3 is still incorrect and should return FALSE.

Comment 1

> but example 3 is still incorrect

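No, it's correct and should return true.  \b matches between chars X and Y if one of X and Y is in [A-Za-z0-9_] and the other is not.  The beginning and end of the string count as not being in that set, so \b matches there only when the adjacent char is in the set.  So your example 3 matches, because 's' is in that set but 'ò' is not, so /ò\b/ matches the string "òs".  Examples 1 and 2 fail for the same reason: neither 'ò' nor what follows it (a space, or the end of the string) is in the set, so there is no boundary after 'ò'.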
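
The rule described above can be sketched in a few lines. This is only an illustration of the ASCII-only definition, not the engine's implementation; the helper names isWordChar and isBoundary are made up for this example, and 'ò' is written as \u00F2 so the result does not depend on how the file is encoded.

```javascript
// Hypothetical helper: the ASCII-only "word character" test that \b relies on.
// Positions outside the string (undefined) count as non-word.
function isWordChar(ch) {
  return ch !== undefined && /^[A-Za-z0-9_]$/.test(ch);
}

// Hypothetical helper: true if \b would match at position i of str,
// that is, between str[i - 1] and str[i].
function isBoundary(str, i) {
  return isWordChar(str[i - 1]) !== isWordChar(str[i]);
}

const word = "aix\u00F2"; // "això"
const re = new RegExp("\\b" + word + "\\b");

// No boundary after 'ò' when a space or the end of the string follows it.
console.log(isBoundary(word, 4));        // false
// Boundary after 'ò' when an ASCII letter follows it.
console.log(isBoundary(word + "s", 4));  // true

// The three cases from the report.
console.log(re.test(" " + word + " "));  // false
console.log(re.test(word));              // false
console.log(re.test(word + "s"));        // true
```

The last three lines reproduce the values in the report, which is why the behaviour is per spec rather than a bug.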
Status: UNCONFIRMED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INVALID
(Reporter)

Comment 2

7 years ago
Thank you Boris, now I understand how it works. But it seems that \b is a little useless when working with localized strings.
Well, yes.  It is.  But for general Unicode strings the concept of "word" doesn't exactly make sense (or more precisely, what sort of sense it makes, if any, is still an active research topic in linguistics, even with decades of work behind us).
lwall realized Perl 5 went down a bad path with regex extensions that were long-winded for the more common operations, e.g. (?:...) for non-capturing groups, yet still had [abc] for Unicode-hostile character classes. So for Perl 6 he broke compat utterly.

JS is stuck with Perl5-based regexps but we'll try to fix things up for Harmony. If you are interested, see

http://wiki.ecmascript.org/doku.php?id=strawman:strawman

under "Regular Expressions".

/be