Last Comment Bug 757049 - GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character
: GTK immodule receives invalid cursor position when the surrounding text conta...
Status: VERIFIED FIXED
:
Product: Core
Classification: Components
Component: Widget: Gtk (show other bugs)
: Trunk
: All Linux
: -- normal (vote)
: mozilla15
Assigned To: Masayuki Nakano [:masayuki] (Mozilla Japan)
:
Mentors:
Depends on:
Blocks: 353776
  Show dependency treegraph
 
Reported: 2012-05-21 07:57 PDT by Yohei Yukawa
Modified: 2012-08-31 01:22 PDT (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch (14.83 KB, patch)
2012-05-22 06:19 PDT, Masayuki Nakano [:masayuki] (Mozilla Japan)
no flags Details | Diff | Splinter Review
Patch (2.14 KB, patch)
2012-05-22 06:24 PDT, Masayuki Nakano [:masayuki] (Mozilla Japan)
karlt: review+
Details | Diff | Splinter Review
Patch (5.47 KB, patch)
2012-05-29 09:14 PDT, Masayuki Nakano [:masayuki] (Mozilla Japan)
karlt: review+
Details | Diff | Splinter Review

Description Yohei Yukawa 2012-05-21 07:57:22 PDT
This is a corner case of Bug 353776.

If I understood correctly, there exists a confusion between UTF-16 and the number of Unicode characters in the current implementation of nsGtkIMModule::OnRetrieveSurroundingNative.

nsGtkIMModule::GetCurrentParagraph returns aCursorPos as the number of UTF-16 elements.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l1469

Then the returned aCursorPos will be passed as the second argument of g_utf8_offset_to_pointer, which expects the number of Unicode characters instead.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l923

As a result, an invalid cursor position will be passed to gtk_im_context_set_surrounding when the surrounding text contains any non-BMP Unicode character before the cursor location. The current implementation should work as long as there is no non-BMP character before the cursor.

Please let me know if I need to elaborate more.

Thanks,
Comment 1 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-21 18:57:10 PDT
Hmm... There is another issue. I'm not sure whether IVS is an independent character of UTF-8 in gnome. The document said nothing about IVS.

http://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-strlen
Comment 2 Karl Tomlinson (:karlt) 2012-05-21 19:05:44 PDT
I expect variation selectors to count as separate characters in UTF-8.
http://unicode.org/reports/tr37/ says
"An Ideographic Variation Sequence (IVS) is a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."
Comment 3 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-22 06:19:47 PDT
Created attachment 625985 [details] [diff] [review]
Patch

I think that this the safest fix. I think that it doesn't make we handle all undocumented behavior of glib APIs. This make the cost increase. But it must not be too expensive.
Comment 4 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-22 06:22:03 PDT
> it doesn't make we handle all undocumented behavior of glib APIs.

I meant it doesn't make sense...
Comment 5 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-22 06:24:21 PDT
Created attachment 625986 [details] [diff] [review]
Patch

Oops. I attached wrong file. Sorry for the spam.
Comment 6 Karl Tomlinson (:karlt) 2012-05-28 19:15:13 PDT
Comment on attachment 625986 [details] [diff] [review]
Patch

>+    // Note that we shouldn't use g_utf8_offset_to_pointer() in this situation.
>+    // cursorPos is cursor position in UTF16 string.  If we want to convert
>+    // it to number of UTF-8 characters, we need to know the exactly same rules
>+    // of the glib APIs.  E.g., How do they count IVS sequence? How do they
>+    // handle broken surrogate pairs or IVS sequence at converting from UTF-16?
>+    // For safety, we should convert the string before cursor from UTF-16 to
>+    // UTF-8 by a glib API.  This must not be too expensive.

I expect these are not real issues.
Characters in g_utf8_offset_to_pointer are Unicode characters.  These are well
defined in UTF-8.  No characters (not even variation selectors) are treated
specially.  If uniStr does not contain valid UTF-16 (e.g. unpaired surrogates)
then g_utf16_to_utf8() will fail.

However, GLib doesn't provide a better method to use here.

If you feel like trying to tidy this up further so that text is only converted
once and zero/Length() special-case optimizations are not necessary,
AppendUTF16toUTF8 could be used with Gecko string classes to first convert the
portion before the cursor and then the rest.
Comment 7 Karl Tomlinson (:karlt) 2012-05-28 19:16:09 PDT
The same bug exists in the delete-surrounding signal handler, where offset and
n_chars measure Unicode characters, but I assume nsSelectionEvent wants UTF-16
code point counts.

Should that be fixed at the same time?
Comment 8 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-28 21:16:26 PDT
(In reply to Karl Tomlinson (:karlt) from comment #7)
> The same bug exists in the delete-surrounding signal handler, where offset
> and
> n_chars measure Unicode characters, but I assume nsSelectionEvent wants
> UTF-16
> code point counts.
> 
> Should that be fixed at the same time?

Ah, I guess so. IME developers might be able to avoid this bug if they know this. In such situation, both of them should work in the same way.
Comment 9 Karl Tomlinson (:karlt) 2012-05-29 00:47:26 PDT
Thinking about it a bit more, retrieve-surrounding and delete-surrounding differ in that the former wants a byte offset while the latter uses character counts, so I'm guessing it would be fine to fix retrieve-surrounding first.

Also delete-surrounding uses offsets from the cursor position, so there would only be a problem if there was a non-BMP character either in the deleted text or between the deleted text and the cursor.  That is possibly less likely than having a non-BMP character in the surrounding text before the cursor.
Comment 10 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-29 02:01:59 PDT
But IME may use both of them same time and compute the offset from the result of retrieve-surrounding. Anyway, I have a patch for both bugs, I'll test them tonight and tomorrow. I'll post the new patch soon.
Comment 11 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-05-29 09:14:52 PDT
Created attachment 627985 [details] [diff] [review]
Patch

How about this?

testing on tryserver:
https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=3665a4af9925
Comment 12 Karl Tomlinson (:karlt) 2012-05-31 23:05:43 PDT
Comment on attachment 627985 [details] [diff] [review]
Patch

>+    NS_ENSURE_TRUE(!utf8Str.IsEmpty(), NS_ERROR_FAILURE);

This shouldn't be necessary.
The offset and length checks already deal with this.

>+    gchar* charAtOffset =
>+        g_utf8_offset_to_pointer(utf8Str.get(), offsetInUTF8Characters);
>+    if (!charAtOffset) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtOffset is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }
>+
>+    gchar* charAtEnd =
>+        g_utf8_offset_to_pointer(utf8Str.get(), endInUTF8Characters);
>+    if (!charAtEnd) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtEnd is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }

g_utf8_offset_to_pointer won't return NULL so don't bother checking the
result.  It can only imply a false sense of security.
Comment 13 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-06-02 20:30:12 PDT
Okay, I added aNChar check. That's never been checked.

https://hg.mozilla.org/integration/mozilla-inbound/rev/291c61234c4a
Comment 14 Phil Ringnalda (:philor) 2012-06-03 12:32:17 PDT
https://hg.mozilla.org/mozilla-central/rev/291c61234c4a
Comment 15 Yohei Yukawa 2012-08-31 01:00:47 PDT
I confirmed that undo command and reverse conversion command of Mozc 1.6.1187.102 are working fine on Firefox 15.0 / Ubuntu 12.04 even when surrounding text contains non-BMP characters.

Thank you for fixing this issue so quickly!
Comment 16 Masayuki Nakano [:masayuki] (Mozilla Japan) 2012-08-31 01:22:34 PDT
-> v. per comment 15

Thank you Yukawa-san.

Note You need to log in before you can comment on or make changes to this bug.