GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character

VERIFIED FIXED in mozilla15

Status

()

Core
Widget: Gtk
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: Yohei Yukawa, Assigned: masayuki)

Tracking

Trunk
mozilla15
All
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

5 years ago
This is a corner case of Bug 353776.

If I understood correctly, there exists a confusion between UTF-16 and the number of Unicode characters in the current implementation of nsGtkIMModule::OnRetrieveSurroundingNative.

nsGtkIMModule::GetCurrentParagraph returns aCursorPos as the number of UTF-16 elements.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l1469

Then the returned aCursorPos will be passed as the second argument of g_utf8_offset_to_pointer, which expects the number of Unicode characters instead.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l923

As a result, an invalid cursor position will be passed to gtk_im_context_set_surrounding when the surrounding text contains any non-BMP Unicode character before the cursor location. The current implementation should work as long as there is no non-BMP character before the cursor.

Please let me know if I need to elaborate more.

Thanks,
Blocks: 353776
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hmm... There is another issue. I'm not sure whether IVS is an independent character of UTF-8 in gnome. The document said nothing about IVS.

http://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-strlen
I expect variation selectors to count as separate characters in UTF-8.
http://unicode.org/reports/tr37/ says
"An Ideographic Variation Sequence (IVS) is a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."
Created attachment 625985 [details] [diff] [review]
Patch

I think that this the safest fix. I think that it doesn't make we handle all undocumented behavior of glib APIs. This make the cost increase. But it must not be too expensive.
Assignee: nobody → masayuki
Status: NEW → ASSIGNED
Attachment #625985 - Flags: review?(karlt)
> it doesn't make we handle all undocumented behavior of glib APIs.

I meant it doesn't make sense...
Created attachment 625986 [details] [diff] [review]
Patch

Oops. I attached wrong file. Sorry for the spam.
Attachment #625985 - Attachment is obsolete: true
Attachment #625985 - Flags: review?(karlt)
Attachment #625986 - Flags: review?(karlt)
Comment on attachment 625986 [details] [diff] [review]
Patch

>+    // Note that we shouldn't use g_utf8_offset_to_pointer() in this situation.
>+    // cursorPos is cursor position in UTF16 string.  If we want to convert
>+    // it to number of UTF-8 characters, we need to know the exactly same rules
>+    // of the glib APIs.  E.g., How do they count IVS sequence? How do they
>+    // handle broken surrogate pairs or IVS sequence at converting from UTF-16?
>+    // For safety, we should convert the string before cursor from UTF-16 to
>+    // UTF-8 by a glib API.  This must not be too expensive.

I expect these are not real issues.
Characters in g_utf8_offset_to_pointer are Unicode characters.  These are well
defined in UTF-8.  No characters (not even variation selectors) are treated
specially.  If uniStr does not contain valid UTF-16 (e.g. unpaired surrogates)
then g_utf16_to_utf8() will fail.

However, GLib doesn't provide a better method to use here.

If you feel like trying to tidy this up further so that text is only converted
once and zero/Length() special-case optimizations are not necessary,
AppendUTF16toUTF8 could be used with Gecko string classes to first convert the
portion before the cursor and then the rest.
Attachment #625986 - Flags: review?(karlt) → review+
The same bug exists in the delete-surrounding signal handler, where offset and
n_chars measure Unicode characters, but I assume nsSelectionEvent wants UTF-16
code point counts.

Should that be fixed at the same time?
(In reply to Karl Tomlinson (:karlt) from comment #7)
> The same bug exists in the delete-surrounding signal handler, where offset
> and
> n_chars measure Unicode characters, but I assume nsSelectionEvent wants
> UTF-16
> code point counts.
> 
> Should that be fixed at the same time?

Ah, I guess so. IME developers might be able to avoid this bug if they know this. In such situation, both of them should work in the same way.
Thinking about it a bit more, retrieve-surrounding and delete-surrounding differ in that the former wants a byte offset while the latter uses character counts, so I'm guessing it would be fine to fix retrieve-surrounding first.

Also delete-surrounding uses offsets from the cursor position, so there would only be a problem if there was a non-BMP character either in the deleted text or between the deleted text and the cursor.  That is possibly less likely than having a non-BMP character in the surrounding text before the cursor.
But IME may use both of them same time and compute the offset from the result of retrieve-surrounding. Anyway, I have a patch for both bugs, I'll test them tonight and tomorrow. I'll post the new patch soon.
Created attachment 627985 [details] [diff] [review]
Patch

How about this?

testing on tryserver:
https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=3665a4af9925
Attachment #625986 - Attachment is obsolete: true
Attachment #627985 - Flags: review?(karlt)
Comment on attachment 627985 [details] [diff] [review]
Patch

>+    NS_ENSURE_TRUE(!utf8Str.IsEmpty(), NS_ERROR_FAILURE);

This shouldn't be necessary.
The offset and length checks already deal with this.

>+    gchar* charAtOffset =
>+        g_utf8_offset_to_pointer(utf8Str.get(), offsetInUTF8Characters);
>+    if (!charAtOffset) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtOffset is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }
>+
>+    gchar* charAtEnd =
>+        g_utf8_offset_to_pointer(utf8Str.get(), endInUTF8Characters);
>+    if (!charAtEnd) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtEnd is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }

g_utf8_offset_to_pointer won't return NULL so don't bother checking the
result.  It can only imply a false sense of security.
Attachment #627985 - Flags: review?(karlt) → review+
Okay, I added aNChar check. That's never been checked.

https://hg.mozilla.org/integration/mozilla-inbound/rev/291c61234c4a
Target Milestone: --- → mozilla15
https://hg.mozilla.org/mozilla-central/rev/291c61234c4a
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Reporter)

Comment 15

5 years ago
I confirmed that undo command and reverse conversion command of Mozc 1.6.1187.102 are working fine on Firefox 15.0 / Ubuntu 12.04 even when surrounding text contains non-BMP characters.

Thank you for fixing this issue so quickly!
-> v. per comment 15

Thank you Yukawa-san.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.