Closed
Bug 757049
Opened 11 years ago
Closed 11 years ago
GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character
Categories
(Core :: Widget: Gtk, defect)
Tracking
()
VERIFIED
FIXED
mozilla15
People
(Reporter: yukawa, Assigned: masayuki)
References
Details
Attachments
(1 file, 2 obsolete files)
5.47 KB,
patch
|
karlt
:
review+
|
Details | Diff | Splinter Review |
This is a corner case of Bug 353776. If I understood correctly, there exists a confusion between UTF-16 and the number of Unicode characters in the current implementation of nsGtkIMModule::OnRetrieveSurroundingNative. nsGtkIMModule::GetCurrentParagraph returns aCursorPos as the number of UTF-16 elements. http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l1469 Then the returned aCursorPos will be passed as the second argument of g_utf8_offset_to_pointer, which expects the number of Unicode characters instead. http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l923 As a result, an invalid cursor position will be passed to gtk_im_context_set_surrounding when the surrounding text contains any non-BMP Unicode character before the cursor location. The current implementation should work as long as there is no non-BMP character before the cursor. Please let me know if I need to elaborate more. Thanks,
Updated•11 years ago
|
Assignee | ||
Comment 1•11 years ago
|
||
Hmm... There is another issue. I'm not sure whether IVS is an independent character of UTF-8 in gnome. The document said nothing about IVS. http://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-strlen
Comment 2•11 years ago
|
||
I expect variation selectors to count as separate characters in UTF-8. http://unicode.org/reports/tr37/ says "An Ideographic Variation Sequence (IVS) is a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."
Assignee | ||
Comment 3•11 years ago
|
||
I think that this the safest fix. I think that it doesn't make we handle all undocumented behavior of glib APIs. This make the cost increase. But it must not be too expensive.
Assignee | ||
Comment 4•11 years ago
|
||
> it doesn't make we handle all undocumented behavior of glib APIs.
I meant it doesn't make sense...
Assignee | ||
Comment 5•11 years ago
|
||
Oops. I attached wrong file. Sorry for the spam.
Attachment #625985 -
Attachment is obsolete: true
Attachment #625985 -
Flags: review?(karlt)
Attachment #625986 -
Flags: review?(karlt)
Comment 6•11 years ago
|
||
Comment on attachment 625986 [details] [diff] [review] Patch >+ // Note that we shouldn't use g_utf8_offset_to_pointer() in this situation. >+ // cursorPos is cursor position in UTF16 string. If we want to convert >+ // it to number of UTF-8 characters, we need to know the exactly same rules >+ // of the glib APIs. E.g., How do they count IVS sequence? How do they >+ // handle broken surrogate pairs or IVS sequence at converting from UTF-16? >+ // For safety, we should convert the string before cursor from UTF-16 to >+ // UTF-8 by a glib API. This must not be too expensive. I expect these are not real issues. Characters in g_utf8_offset_to_pointer are Unicode characters. These are well defined in UTF-8. No characters (not even variation selectors) are treated specially. If uniStr does not contain valid UTF-16 (e.g. unpaired surrogates) then g_utf16_to_utf8() will fail. However, GLib doesn't provide a better method to use here. If you feel like trying to tidy this up further so that text is only converted once and zero/Length() special-case optimizations are not necessary, AppendUTF16toUTF8 could be used with Gecko string classes to first convert the portion before the cursor and then the rest.
Attachment #625986 -
Flags: review?(karlt) → review+
Comment 7•11 years ago
|
||
The same bug exists in the delete-surrounding signal handler, where offset and n_chars measure Unicode characters, but I assume nsSelectionEvent wants UTF-16 code point counts. Should that be fixed at the same time?
Assignee | ||
Comment 8•11 years ago
|
||
(In reply to Karl Tomlinson (:karlt) from comment #7) > The same bug exists in the delete-surrounding signal handler, where offset > and > n_chars measure Unicode characters, but I assume nsSelectionEvent wants > UTF-16 > code point counts. > > Should that be fixed at the same time? Ah, I guess so. IME developers might be able to avoid this bug if they know this. In such situation, both of them should work in the same way.
Comment 9•11 years ago
|
||
Thinking about it a bit more, retrieve-surrounding and delete-surrounding differ in that the former wants a byte offset while the latter uses character counts, so I'm guessing it would be fine to fix retrieve-surrounding first. Also delete-surrounding uses offsets from the cursor position, so there would only be a problem if there was a non-BMP character either in the deleted text or between the deleted text and the cursor. That is possibly less likely than having a non-BMP character in the surrounding text before the cursor.
Assignee | ||
Comment 10•11 years ago
|
||
But IME may use both of them same time and compute the offset from the result of retrieve-surrounding. Anyway, I have a patch for both bugs, I'll test them tonight and tomorrow. I'll post the new patch soon.
Assignee | ||
Comment 11•11 years ago
|
||
How about this? testing on tryserver: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=3665a4af9925
Attachment #625986 -
Attachment is obsolete: true
Attachment #627985 -
Flags: review?(karlt)
Comment 12•11 years ago
|
||
Comment on attachment 627985 [details] [diff] [review] Patch >+ NS_ENSURE_TRUE(!utf8Str.IsEmpty(), NS_ERROR_FAILURE); This shouldn't be necessary. The offset and length checks already deal with this. >+ gchar* charAtOffset = >+ g_utf8_offset_to_pointer(utf8Str.get(), offsetInUTF8Characters); >+ if (!charAtOffset) { >+ PR_LOG(gGtkIMLog, PR_LOG_ALWAYS, >+ (" FAILED, charAtOffset is NULL")); >+ return NS_ERROR_FAILURE; >+ } >+ >+ gchar* charAtEnd = >+ g_utf8_offset_to_pointer(utf8Str.get(), endInUTF8Characters); >+ if (!charAtEnd) { >+ PR_LOG(gGtkIMLog, PR_LOG_ALWAYS, >+ (" FAILED, charAtEnd is NULL")); >+ return NS_ERROR_FAILURE; >+ } g_utf8_offset_to_pointer won't return NULL so don't bother checking the result. It can only imply a false sense of security.
Attachment #627985 -
Flags: review?(karlt) → review+
Assignee | ||
Comment 13•11 years ago
|
||
Okay, I added aNChar check. That's never been checked. https://hg.mozilla.org/integration/mozilla-inbound/rev/291c61234c4a
Target Milestone: --- → mozilla15
Comment 14•11 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/291c61234c4a
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 15•11 years ago
|
||
I confirmed that undo command and reverse conversion command of Mozc 1.6.1187.102 are working fine on Firefox 15.0 / Ubuntu 12.04 even when surrounding text contains non-BMP characters. Thank you for fixing this issue so quickly!
Assignee | ||
Comment 16•11 years ago
|
||
-> v. per comment 15 Thank you Yukawa-san.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•