Closed Bug 629163 Opened 15 years ago Closed 15 years ago

Problem with RegExp using \b with unicode chars

Status:

RESOLVED INVALID

People

(Reporter: noguer, Unassigned)

Quim Perez

Reporter

Description

15 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; ca; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13
Build Identifier: Firefox 3.6 and Firefox 4 (beta 1..10)

The RegExp /\bword\b/ returns true when it finds the whole word "word". But when the search word starts or ends with a Unicode char, it does very strange things.

Reproducible: Always

Actual Results:

For example:
1) /\baixò\b/.test(" això ") : false (should be TRUE)
2) /\baixò\b/.test("això") : false (should be TRUE)
3) /\baixò\b/.test("aixòs") : true (should be FALSE)

Without Unicode chars it works fine:
/\baixo\b/.test(" aixo ") : true
/\baixo\b/.test("aixo") : true
/\baixo\b/.test("aixos") : false

I've found some bugs related to this:
https://bugzilla.mozilla.org/show_bug.cgi?id=247179
https://bugzilla.mozilla.org/show_bug.cgi?id=550984

Both bugs say that, per ECMA-262 15.10.2.6, the \b assertion should break at 'word' boundaries, where 'word' means the characters A-Za-z0-9_ and no others. If that is the case, examples 1 and 2 are correct and should return FALSE, but example 3 is still incorrect and should return FALSE.

Boris Zbarsky [:bzbarsky]

Comment 1

15 years ago

> but example 3 is still incorrect

No, it's correct and should return true. \b matches between chars X and Y if one of X and Y is in [A-Za-z0-9_] and the other is not. It also matches at the beginning or end of the string when the adjacent char is in that set. So your example 3 matches, because 's' is in that set but 'ò' is not, so /ò\b/ matches the string "òs".
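The rule in this comment can be checked directly. The sketch below is only an illustration: `isWordChar` and `isBoundary` are hypothetical helper names, not engine internals, and they model \b for a regexp without the `i` or `u` flags. It then replays the six cases from the description.

```javascript
// Hypothetical helpers that model the \b rule quoted above.
// A "word char" is one of A-Z, a-z, 0-9 or _; positions outside the
// string count as non-word.
function isWordChar(ch) {
  return ch !== undefined && /^[A-Za-z0-9_]$/.test(ch);
}

// True when \b would succeed at index i of str (0 <= i <= str.length).
function isBoundary(str, i) {
  const before = i > 0 ? str[i - 1] : undefined;
  const after = i < str.length ? str[i] : undefined;
  return isWordChar(before) !== isWordChar(after);
}

const accented = /\baixò\b/;
const plain = /\baixo\b/;

// The three cases from the description that looked wrong.
console.log(accented.test(" això ")); // false: 'ò' and ' ' are both non-word
console.log(accented.test("això"));   // false: 'ò' and end of string are both non-word
console.log(accented.test("aixòs"));  // true: 'ò' is non-word, 's' is a word char

// The ASCII-only cases.
console.log(plain.test(" aixo "));    // true
console.log(plain.test("aixo"));      // true
console.log(plain.test("aixos"));     // false

// The same reasoning expressed with the helper.
console.log(isBoundary("això", "això".length)); // false
console.log(isBoundary("aixo", "aixo".length)); // true
```

All six results agree with what the reporter observed, which is why the behavior follows the spec even though it is surprising for non-ASCII text.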

Status: UNCONFIRMED → RESOLVED

Closed: 15 years ago

Resolution: --- → INVALID

Quim Perez

Reporter

Comment 2

15 years ago

Thank you Boris, now I understand how it works. But it seems that \b is a little useless when working with localized strings.

Boris Zbarsky [:bzbarsky]

Comment 3

15 years ago

Well, yes. It is. But for general Unicode strings the concept of "word" doesn't exactly make sense (or more precisely, what sort of sense it makes, if any, is still an active research topic in linguistics, even with decades of work behind us).
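For readers who still need whole-word matching on strings like these, one possible workaround is to replace \b with explicit lookarounds over a wider character class. This is a sketch under assumptions, not something proposed in this bug: it requires an engine with the `u` flag, Unicode property escapes and lookbehind (all standardized in ES2015/ES2018, after this report), and it simply treats letters, combining marks, digits and `_` as word characters, which is not linguistically correct for every script. `escapeRegExp` and `wholeWordRegExp` are hypothetical helper names.

```javascript
// Assumed definition of a "word char": any Unicode letter, combining mark,
// number, or underscore. This choice is an illustration, not a standard.
const WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

// Escape regexp syntax characters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regexp that matches `word` only when it is not preceded or
// followed by another word char (a hand-rolled stand-in for \b...\b).
function wholeWordRegExp(word) {
  const body = escapeRegExp(word);
  return new RegExp(`(?<!${WORD_CHAR})${body}(?!${WORD_CHAR})`, "u");
}

const re = wholeWordRegExp("això");
console.log(re.test(" això ")); // true
console.log(re.test("això"));   // true
console.log(re.test("aixòs"));  // false
```

These are the results the reporter originally expected from /\baixò\b/. The trade-off is that the definition of a word char is now the caller's responsibility.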

Brendan Eich [:brendan]

Comment 4

15 years ago

lwall realized Perl 5 went down a bad path with regex extensions that were long-winded for the more common operations, e.g. (?:...) for non-capturing groups, yet still had [abc] for Unicode-hostile character classes. So for Perl 6 he broke compat utterly. JS is stuck with Perl5-based regexps but we'll try to fix things up for Harmony. If you are interested, see http://wiki.ecmascript.org/doku.php?id=strawman:strawman under "Regular Expressions". /be

Categories

(Core :: JavaScript Engine, defect)