Closed Bug 757049 Opened 13 years ago Closed 13 years ago

GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character

Categories

(Core :: Widget: Gtk, defect)

All
Linux
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla15

People

(Reporter: yukawa, Assigned: masayuki)

References

Details

Attachments

(1 file, 2 obsolete files)

This is a corner case of Bug 353776. If I understood correctly, there exists a confusion between UTF-16 and the number of Unicode characters in the current implementation of nsGtkIMModule::OnRetrieveSurroundingNative. nsGtkIMModule::GetCurrentParagraph returns aCursorPos as the number of UTF-16 elements. http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l1469 Then the returned aCursorPos will be passed as the second argument of g_utf8_offset_to_pointer, which expects the number of Unicode characters instead. http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l923 As a result, an invalid cursor position will be passed to gtk_im_context_set_surrounding when the surrounding text contains any non-BMP Unicode character before the cursor location. The current implementation should work as long as there is no non-BMP character before the cursor. Please let me know if I need to elaborate more. Thanks,
Blocks: 353776
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hmm... There is another issue. I'm not sure whether IVS is an independent character of UTF-8 in gnome. The document said nothing about IVS. http://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-strlen
I expect variation selectors to count as separate characters in UTF-8. http://unicode.org/reports/tr37/ says "An Ideographic Variation Sequence (IVS) is a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."
Attached patch Patch (obsolete) — Splinter Review
I think that this the safest fix. I think that it doesn't make we handle all undocumented behavior of glib APIs. This make the cost increase. But it must not be too expensive.
Assignee: nobody → masayuki
Status: NEW → ASSIGNED
Attachment #625985 - Flags: review?(karlt)
> it doesn't make we handle all undocumented behavior of glib APIs. I meant it doesn't make sense...
Attached patch Patch (obsolete) — Splinter Review
Oops. I attached wrong file. Sorry for the spam.
Attachment #625985 - Attachment is obsolete: true
Attachment #625985 - Flags: review?(karlt)
Attachment #625986 - Flags: review?(karlt)
Comment on attachment 625986 [details] [diff] [review] Patch >+ // Note that we shouldn't use g_utf8_offset_to_pointer() in this situation. >+ // cursorPos is cursor position in UTF16 string. If we want to convert >+ // it to number of UTF-8 characters, we need to know the exactly same rules >+ // of the glib APIs. E.g., How do they count IVS sequence? How do they >+ // handle broken surrogate pairs or IVS sequence at converting from UTF-16? >+ // For safety, we should convert the string before cursor from UTF-16 to >+ // UTF-8 by a glib API. This must not be too expensive. I expect these are not real issues. Characters in g_utf8_offset_to_pointer are Unicode characters. These are well defined in UTF-8. No characters (not even variation selectors) are treated specially. If uniStr does not contain valid UTF-16 (e.g. unpaired surrogates) then g_utf16_to_utf8() will fail. However, GLib doesn't provide a better method to use here. If you feel like trying to tidy this up further so that text is only converted once and zero/Length() special-case optimizations are not necessary, AppendUTF16toUTF8 could be used with Gecko string classes to first convert the portion before the cursor and then the rest.
Attachment #625986 - Flags: review?(karlt) → review+
The same bug exists in the delete-surrounding signal handler, where offset and n_chars measure Unicode characters, but I assume nsSelectionEvent wants UTF-16 code point counts. Should that be fixed at the same time?
(In reply to Karl Tomlinson (:karlt) from comment #7) > The same bug exists in the delete-surrounding signal handler, where offset > and > n_chars measure Unicode characters, but I assume nsSelectionEvent wants > UTF-16 > code point counts. > > Should that be fixed at the same time? Ah, I guess so. IME developers might be able to avoid this bug if they know this. In such situation, both of them should work in the same way.
Thinking about it a bit more, retrieve-surrounding and delete-surrounding differ in that the former wants a byte offset while the latter uses character counts, so I'm guessing it would be fine to fix retrieve-surrounding first. Also delete-surrounding uses offsets from the cursor position, so there would only be a problem if there was a non-BMP character either in the deleted text or between the deleted text and the cursor. That is possibly less likely than having a non-BMP character in the surrounding text before the cursor.
But IME may use both of them same time and compute the offset from the result of retrieve-surrounding. Anyway, I have a patch for both bugs, I'll test them tonight and tomorrow. I'll post the new patch soon.
Attached patch PatchSplinter Review
Attachment #625986 - Attachment is obsolete: true
Attachment #627985 - Flags: review?(karlt)
Comment on attachment 627985 [details] [diff] [review] Patch >+ NS_ENSURE_TRUE(!utf8Str.IsEmpty(), NS_ERROR_FAILURE); This shouldn't be necessary. The offset and length checks already deal with this. >+ gchar* charAtOffset = >+ g_utf8_offset_to_pointer(utf8Str.get(), offsetInUTF8Characters); >+ if (!charAtOffset) { >+ PR_LOG(gGtkIMLog, PR_LOG_ALWAYS, >+ (" FAILED, charAtOffset is NULL")); >+ return NS_ERROR_FAILURE; >+ } >+ >+ gchar* charAtEnd = >+ g_utf8_offset_to_pointer(utf8Str.get(), endInUTF8Characters); >+ if (!charAtEnd) { >+ PR_LOG(gGtkIMLog, PR_LOG_ALWAYS, >+ (" FAILED, charAtEnd is NULL")); >+ return NS_ERROR_FAILURE; >+ } g_utf8_offset_to_pointer won't return NULL so don't bother checking the result. It can only imply a false sense of security.
Attachment #627985 - Flags: review?(karlt) → review+
Okay, I added aNChar check. That's never been checked. https://hg.mozilla.org/integration/mozilla-inbound/rev/291c61234c4a
Target Milestone: --- → mozilla15
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I confirmed that undo command and reverse conversion command of Mozc 1.6.1187.102 are working fine on Firefox 15.0 / Ubuntu 12.04 even when surrounding text contains non-BMP characters. Thank you for fixing this issue so quickly!
-> v. per comment 15 Thank you Yukawa-san.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: