Open Bug 939739 Opened 7 years ago Updated 7 years ago

Wrong word boundary detection for MidLetter character

Component: Core :: Graphics: Text
Type: defect
Version: 28 Branch
Priority: Not set
Severity: normal
Reporter: jmontane
Assignee: Unassigned

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0 (Beta/Release)
Build ID: 20130911164256

Steps to reproduce:

1.a.- With Firefox: visit a web page containing the following words; visiting this bug page is enough.
1.b.- With Thunderbird: copy and paste the following text into a mail.

----------8<--8<--8<----------
U+00B7 --> abc·def
U+0387 --> abc·def
U+05F4 --> abc״def
U+2027 --> abc‧def
U+003A --> abc:def
U+FE13 --> abc︓def
U+FE55 --> abc﹕def
U+FF1A --> abc：def
U+02D7 --> abc˗def
----------8<--8<--8<----------

2.- Double-click on any "abc?def" word.


Actual results:

Only the "abc" (or "def") part of the word is selected.


Expected results:

According to Unicode UAX TR29 [1], the full word "abc?def" must be selected. The same problem occurs when moving the cursor with Ctrl+Left (or Ctrl+Right) arrow.

After some testing: Safari, Konqueror, Opera, Web GNOME browser (often called Epiphany), rekonq and Chromium work as intended. The full word abc?def is selected with a double-click.


[1] http://www.unicode.org/reports/tr29/#MidLetter, see WB6 and WB7.
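
For illustration, the expected behaviour under rules WB6 and WB7 can be sketched as follows. This is only a minimal, simplified sketch, not Gecko's or ICU's implementation: the helper names (`is_letter`, `joins`, `word_at`) are hypothetical, `str.isalpha()` is assumed as a rough stand-in for the ALetter class, and `MID_LETTER` contains just the code points listed in this report rather than the full property data.

```python
# Simplified sketch of UAX #29 rules WB5, WB6 and WB7 only.
# Assumption: MID_LETTER is limited to the code points from this report.
MID_LETTER = {0x00B7, 0x0387, 0x05F4, 0x2027, 0x003A,
              0xFE13, 0xFE55, 0xFF1A, 0x02D7}


def is_mid_letter(ch):
    return ord(ch) in MID_LETTER


def is_letter(ch):
    # Assumption: isalpha() approximates ALetter for this demo.
    return ch.isalpha() and not is_mid_letter(ch)


def joins(text, i):
    """Return True if there is NO word break between text[i-1] and text[i]."""
    before, after = text[i - 1], text[i]
    # WB5: letter x letter
    if is_letter(before) and is_letter(after):
        return True
    # WB6: letter x MidLetter, but only if a letter follows the MidLetter
    if is_letter(before) and is_mid_letter(after):
        return i + 1 < len(text) and is_letter(text[i + 1])
    # WB7: MidLetter x letter, but only if a letter precedes the MidLetter
    if is_mid_letter(before) and is_letter(after):
        return i >= 2 and is_letter(text[i - 2])
    return False


def word_at(text, pos):
    """Return the word that a double-click at index `pos` would select."""
    start = pos
    while start > 0 and joins(text, start):
        start -= 1
    end = pos + 1
    while end < len(text) and joins(text, end):
        end += 1
    return text[start:end]


if __name__ == "__main__":
    for cp in sorted(MID_LETTER):
        sample = "abc" + chr(cp) + "def"
        print("U+%04X" % cp, word_at(sample, 0) == sample)
```

With these rules, a click anywhere in "abc·def" selects all seven characters, while a MidLetter character that is not surrounded by letters on both sides still ends the word.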
Confirmed in FF 28.0a1 (2013-11-27), Win 7 x64
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: x86_64 → All
(In reply to Joan Montané from comment #0)

> After some testing: Safari, Konqueror, Opera, Web GNOME browser (often called
> Epiphany), rekonq and Chromium work as intended. Full word abc?def is
> selected with double-click.

Not quite, in my testing - in the last example (U+02D7 --> abc˗def), both Chrome and Safari on OS X select the parts separately. The others do all select as a single word, though.
(In reply to Jonathan Kew (:jfkthame) from comment #2)
> Not quite, in my testing - in the last example (U+02D7 --> abc˗def), both
> Chrome and Safari on OS X select the parts separately. The others do all
> select as a single word, though.

You are right, my fault. Chrome and Safari treat U+02D7 as a word separator. I guess it's because they use an outdated ICU library. Compare [1] (Unicode 6; date: 2010-08-19) for the ICU data used by Chrome and [2] (Unicode 6.3; date: 2013-07-05) for the current ICU library data around MidLetter characters.

But the issue here is that Mozilla doesn't follow UAX TR29 for double-click (or arrow-key) selection. It's annoying when selecting Catalan text, where "·" (U+00B7) is used as a MidLetter character.

[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/data/unidata/WordBreakProperty.txt&l=830
[2] http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
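
A comparison of two WordBreakProperty.txt versions like the one suggested above could be scripted roughly as below. This is a sketch under assumptions: `parse_word_break` is a hypothetical helper, and `OLD_SAMPLE` / `NEW_SAMPLE` are made-up two-line stand-ins written in the file's "code point(s) ; value # comment" layout, not actual excerpts from either linked file. In practice the contents of the two real files would be passed in instead.

```python
def parse_word_break(text, wanted="MidLetter"):
    """Collect the code points whose Word_Break value equals `wanted`."""
    result = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if ";" not in data:
            continue  # blank line or comment-only line
        cp_field, value = (part.strip() for part in data.split(";", 1))
        if value != wanted:
            continue
        # A field is either a single code point or a "first..last" range.
        first, _, last = cp_field.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        result.update(range(lo, hi + 1))
    return result


# Made-up stand-ins for an older and a newer data file (assumption).
OLD_SAMPLE = """\
# older data (illustrative only)
00B7          ; MidLetter # MIDDLE DOT
"""
NEW_SAMPLE = """\
# newer data (illustrative only)
00B7          ; MidLetter # MIDDLE DOT
02D7          ; MidLetter # MODIFIER LETTER MINUS SIGN
"""

if __name__ == "__main__":
    added = parse_word_break(NEW_SAMPLE) - parse_word_break(OLD_SAMPLE)
    print(["U+%04X" % cp for cp in sorted(added)])
```

The set difference lists the code points that are MidLetter only in the newer data, which is the kind of discrepancy that would explain a browser splitting "abc˗def" while joining the other examples.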