Closed Bug 2016451 Opened 4 months ago Closed 3 months ago

Dragging CJK text from external apps to Firefox results in corrupted characters (U+FFFD)

Categories

(Core :: Widget: Gtk, defect, P3)

Firefox 146
Desktop
Linux
defect

Tracking

()

VERIFIED FIXED
150 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr140 --- unaffected
firefox148 --- wontfix
firefox149 --- verified
firefox150 --- verified

People

(Reporter: oceancat365, Assigned: stransky)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: nightly-community, regression)

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0

Steps to reproduce:

  1. On Arch Linux (with KDE Plasma 6.5.5), open any application other than Firefox (e.g., Kate or Telegram Desktop).
  2. Select a string that contains CJK characters in that application.
  3. Drag the selected text and drop it onto the Firefox tab bar (which triggers a search).
  4. Drag the selected CJK text and drop it into a text input field on a web page opened in Firefox.
  5. Drag CJK text to anywhere within Firefox, for example, in the same tab, to another tab, or to the tab bar.
  6. Repeat steps 3, 4, and 5 with pure English text.

Actual results:

  1. When dropping CJK text on the tab bar, Firefox opens a new tab and searches for the Unicode Replacement Character (U+FFFD).
  2. In an input field, if the CJK text is short, it pastes U+FFFD too. If the text is relatively long, sometimes a portion of the characters is successfully moved, but often it results in corruption.
  3. Dragging any text within Firefox, with or without CJK characters, works perfectly fine.
  4. English text works correctly in all scenarios.

Expected results:

Firefox should correctly receive the UTF-8 string from the Wayland drag-and-drop protocol and search for or paste the original CJK text, matching the behavior of English text and internal drag operations.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

Can you run Firefox from a terminal with the MOZ_LOG="WidgetDrag:5" environment variable set, reproduce the issue, and attach the log here?
Thanks.

Blocks: linuxdad
Flags: needinfo?(oceancat365)
Priority: -- → P3
Flags: needinfo?(oceancat365)

I wonder if we're just using the wrong MIME type for the text string, so the result is corrupted. But it looks like we're getting the data correctly in UTF-8 format.
Please check the attached log and look for the "DragData() plain data MIME" entry. Is the text correct?
Thanks.

Flags: needinfo?(oceancat365)
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Unspecified → Linux
Hardware: Unspecified → Desktop

Hi, I checked the logs as requested.
Testing with the string "我能吞下玻璃而不伤身体" (I can eat glass, it doesn't hurt me) shows the following:

[Parent 18562: Main Thread]: D/WidgetDrag [D 2][7f53b58d7400]     nsDragSession::TargetDataReceived(7f53f723b710) MIME text/plain;charset=utf-8 
[Parent 18562: Main Thread]: D/WidgetDrag [D 2][7f53b58d7400]       TargetDataReceived(): plain data, MIME text/plain;charset=utf-8 len = 11
[Parent 18562: Main Thread]: D/WidgetDrag DragData() plain data MIME: text/plain;charset=utf-8 : 我能吞ä¸
[Parent 18562: Main Thread]: D/WidgetDrag [D 1][7f53b58d7400]   text/plain;charset=utf-8 received

I dug a bit deeper and found that the "garbled" text may be caused by truncation at the byte level.
The output 我能吞ä¸ occurs because the raw UTF-8 bytes are cut off, and the trailing incomplete sequence is then rendered as single-byte characters (likely Latin-1).

  1. Input string: "我能吞下玻璃而不伤身体" (11 CJK characters)
  • Original HEX (UTF-8): e6 88 91 e8 83 bd e5 90 9e e4 b8 8b e7 8e bb e7 92 83 e8 80 8c e4 b8 8d e4 bc a4 e8 ba ab e4 bd 93 (33 bytes)
  • Log shows len = 11. It seems the code uses the character count (11) to determine the buffer size or read length in bytes, instead of the actual byte length (33).
  • The first 11 bytes of the HEX are: e6 88 91 e8 83 bd e5 90 9e e4 b8.
  • The first 9 of these bytes decode to 我能吞; the remaining two (e4 b8) are an incomplete sequence that the log shows as the Latin-1 characters ä¸. The 4th character is corrupted because its 3rd byte (8b) was dropped.
  2. Another case: "测试" (2 CJK characters)
  • Original HEX (UTF-8): e6 b5 8b e8 af 95 (6 bytes)
  • Log shows len = 2.
  • The first 2 bytes are e6 b5, which renders as æµ in the log.
  3. Non-BMP test (characters that need surrogate pairs in UTF-16): "🐶🐶🐶"
  • Original HEX (UTF-8): f0 9f 90 b6 f0 9f 90 b6 f0 9f 90 b6 (12 bytes)
  • Log shows TargetDataReceived(): plain data, MIME text/plain;charset=utf-8 len = 3
  • The first 3 bytes are f0 9f 90, an incomplete 4-byte sequence, so the log shows garbage instead of 🐶.

My assumption is that it only requests or reads N bytes for a string of N characters.
This works for ASCII, where the character-to-byte ratio is 1:1, but causes truncation for multi-byte text.
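
The truncation described in this comment can be reproduced outside Firefox. The following is a minimal Python sketch, not Firefox code: `truncate_to_char_count` and `hex_bytes` are hypothetical helpers, and the sketch simply assumes a reader that mistakes the character count for the byte length.

```python
def truncate_to_char_count(text: str) -> bytes:
    """Return the first len(text) bytes of the UTF-8 encoding of text.

    Hypothetical helper: it mimics a reader that treats the number of
    characters (code points) as if it were the number of bytes.
    """
    raw = text.encode("utf-8")
    return raw[: len(text)]


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# Samples taken from the comment above, plus an ASCII control case.
for sample in ("我能吞下玻璃而不伤身体", "测试", "🐶🐶🐶", "hello"):
    raw = sample.encode("utf-8")
    cut = truncate_to_char_count(sample)
    print(f"{sample!r}: {len(sample)} chars, {len(raw)} bytes")
    print("  kept bytes:", hex_bytes(cut))
    # An incomplete trailing sequence decodes to U+FFFD with errors="replace".
    print("  decoded   :", cut.decode("utf-8", errors="replace"))
```

The ASCII sample survives intact because each character is one byte, while every multi-byte sample loses its tail and ends in U+FFFD when decoded strictly as UTF-8.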

Flags: needinfo?(oceancat365)

:handyman, since you are the author of the regressor, bug 1966443, could you take a look?

Flags: needinfo?(davidp99)

Thanks for the analysis. It's because we use g_utf8_strlen() to get the string length in characters for UTF-8 text. Will look at it.

Flags: needinfo?(davidp99) → needinfo?(stransky)

It's caused by this revision: https://phabricator.services.mozilla.com/D256877
We need to add a corresponding call to DragData to deal with the UTF-8 length being a char length and not a byte length.

Flags: needinfo?(stransky)
Assignee: nobody → stransky
Status: NEW → ASSIGNED
Pushed by stransky@redhat.com: https://github.com/mozilla-firefox/firefox/commit/689ebeae90d9 https://hg.mozilla.org/integration/autoland/rev/27437d27a166 [Linux] Use input string byte len for UTF8ToNewUnicode() instead of UTF-8 char len r=emilio
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 150 Branch
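
Going only by the commit title above (byte length instead of UTF-8 character length as the converter's input length), the difference can be sketched in Python. This is an illustration, not the landed patch: `convert_utf8` is a hypothetical stand-in for a converter that takes an explicit byte length, and `char_count` is a hypothetical analogue of counting UTF-8 characters.

```python
def convert_utf8(buf: bytes, byte_len: int) -> str:
    """Hypothetical stand-in for a converter that takes a byte length."""
    return buf[:byte_len].decode("utf-8", errors="replace")


def char_count(buf: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in buf if (b & 0xC0) != 0x80)


data = "测试".encode("utf-8")

# Passing the character count truncates the input (the reported bug).
wrong = convert_utf8(data, char_count(data))

# Passing the byte length converts the whole string.
right = convert_utf8(data, len(data))

print(char_count(data), len(data))
print(repr(wrong), repr(right))
```

The two lengths agree only for pure ASCII input, which is consistent with English text being unaffected in the report.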

The patch landed in nightly and beta is affected.
:stransky, is this bug important enough to require an uplift?

For more information, please visit BugBot documentation.

Flags: needinfo?(stransky)

firefox-beta Uplift Approval Request

  • User impact if declined: Broken D&D of CJK text.
  • Code covered by automated testing: no
  • Fix verified in Nightly: no
  • Needs manual QE test: yes
  • Steps to reproduce for manual QE testing: D&D any CJK text.
  • Risk associated with taking this patch: low
  • Explanation of risk level: We use correct text length for D&D.
  • String changes made/needed: none
  • Is Android affected?: no
Attachment #9551392 - Flags: approval-mozilla-beta?
Flags: qe-verify+
Attachment #9551392 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
QA Whiteboard: [uplift] [qa-ver-needed-c150/b149]
Flags: needinfo?(stransky)

Reproducible on a 2026-03-07 Firefox Nightly build on Ubuntu 22, following the STR from Comment 0.

Verified as fixed on Firefox Nightly 150.0a1 and Firefox 149.0b7 on Ubuntu 22.

Status: RESOLVED → VERIFIED
QA Whiteboard: [uplift] [qa-ver-needed-c150/b149] → [uplift] [qa-ver-done-c150/b149]
Flags: qe-verify+