Closed Bug 991427 Opened 10 years ago Closed 10 years ago

formatSearchContextItem does not handle surrogate pairs correctly

Categories

(Firefox :: Menus, defect)

28 Branch
defect
Not set
normal
Points:
5

Tracking

()

VERIFIED FIXED
Firefox 34
Iteration:
34.2

People

(Reporter: masanori.ogino, Assigned: andreaio.it, Mentored)

Details

Attachments

(2 files, 3 obsolete files)

Attached file test.html
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0 (Beta/Release)
Build ID: 20140331100457

Steps to reproduce:

1. Open attached test.html,
2. Select all, and
3. Do right-click.


Actual results:

In the context menu, <selection text> in 'Search <engine> for "<selection text>"' item contains seven fires and a "D83D in a box" character (depending fonts installed to the machine; the test file uses an Emoji character but any character beyond BMP causes similar problem) before an elipsis.


Expected results:

Selection text is trancated by a correct boundary (hopefully grapheme, at least code point).
Reproducible on:
Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0

A character that is not part of the text on the page is displayed before the ellipsis.
Status: UNCONFIRMED → NEW
Component: Untriaged → Menus
Ever confirmed: true
Flags: firefox-backlog+
Whiteboard: [mentor] p=5
Marco: You're mentoring this?
I think Marco just marked this to be mentored by someone TBD
Whiteboard: [mentor] p=5 → [mentor wanted] p=5
I think that truncating the selected characters to an even number in the formatSearchContextItem function will solve the problem.
Can someone assign me to this bug?
Yes, I can. Thanks for taking this up, Andrea! Please ask any questions you have here in the bug.
Assignee: nobody → andreaio.it
Unfortunately there's no guarantee that truncating at 16 "JS characters" is safer - it might fix that particular test case, but it's still possible for there to be a UTF-16 surrogate pair boundary at the 16 "JS character" point. Here's an example:

data:text/html,<p>aaaaaaaaaaaaaaa%26%23x1f525%3B</p>

I'm not actually sure how to best fix this "the right way". What we want is to truncate by code points rather than JS characters, so we need a way to map "15 Unicode code points of string X" to "X JS characters of string X", but I'm not sure how to do that via the XPCOM tools we have offhand.
Comment on attachment 8469319 [details] [diff] [review]
Changed the number of characters displayed from 15 to 16 to make UTF 16 work correctly

thanks for writing the patch, though! Let me see if I can provide better advice for an alternative approach.
Attachment #8469319 - Flags: review?(gavin.sharp) → review-
Attachment #8469319 - Attachment is obsolete: true
Attachment #8470465 - Flags: review?(gavin.sharp)
I did not know codePointAt existed! Neat.

I might suggest an alternate approach, though: how about just checking whether the 16th character in the string is a lead surrogate, and if so just including the next character? I.e.:

function truncateText(str) {
  let truncLength = 15;
  if (str.length > 15) {
    // Don't chop a surrogate pair in half
    let truncChar = str[15].charCodeAt(0);
    if (truncChar >= 0xD800 && truncChar <= 0xDBFF)
      truncLength++;
  }
  return str.substr(0, truncLength);
}
Because a surrogate pair is one character. If you just check the last the string would be a character shorter for every surrogate pair.
A surrogate pair is two "JS characters", though (that's why codePointAt and charCodeAt both exist). So the code I mocked up should work (though I haven't tested it!).
Essentially, I'm suggesting truncating at 15 "JS characters", like the current code, but change that to truncating to 16 characters if the 15th and 16th JS characters are a surrogate pair (you can tell whether they are by just looking for the "lead surrogate" in the 15th character).
Comment on attachment 8470465 [details] [diff] [review]
Now the length of text showed is right

I think my suggested fix is simpler and should work equally well, but let me know if I'm missing something!
Attachment #8470465 - Flags: review?(gavin.sharp) → feedback-
Attachment #8470465 - Attachment is obsolete: true
Attachment #8471683 - Flags: review?(gavin.sharp)
Comment on attachment 8471683 [details] [diff] [review]
Check if the last character of the string is part of a surrogate pair

Thanks Andrea! I would just elaborate in the comment:

// If the JS character after our truncation point is a trail surrogate,
// include it in the truncated string to avoid splitting a surrogate pair.
if (truncChar >= 0xDC00 && truncChar <= 0xDFFF)
[...]

We should get an automated test for this feature added as well. It should be relatively simple to edit browser_bug970746.js (and browser_bug970746.xhtml) to add coverage for this case. The details on how to run the browser chrome test suite (of which browser_bug970746 is part) are at https://developer.mozilla.org/en-US/docs/Browser_chrome_tests .

So the next steps would be to adjust the patch to include the new comment, as well as the test changes I propose above.

Feel free to let me know (here, or via IRC or email) if you have other questions, or if you'd like me to write the test instead.
Attachment #8471683 - Flags: review?(gavin.sharp) → review+
Mentor: gavin.sharp
Comment on attachment 8474156 [details] [diff] [review]
Check if the last character of the string is part of a surrogate pair and added the test

Thanks so much Andrea!

I pushed this to fx-team with a couple of minor tweaks (removed a now-redundant comment, fixed whitespace in browser_bug970746.xhtml).
Attachment #8474156 - Flags: review+
https://hg.mozilla.org/integration/fx-team/rev/82722ad59732
Points: --- → 5
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [mentor wanted] p=5
Target Milestone: --- → Firefox 34
https://hg.mozilla.org/mozilla-central/rev/82722ad59732
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Iteration: --- → 34.2
Flags: qe-verify?
Flags: qe-verify? → qe-verify+
Steps for verification:
- run the STR in comment 0 both with the testcase attached here (attachment 8401038 [details]) and the URL from comment 8:

data:text/html,<p>aaaaaaaaaaaaaaa%26%23x1f525%3B</p>
Reproduced the initial issue on Firefox 28 (build ID: 20140314220517)

Confirming the fix on Firefox 34 beta 10 (build ID: 20141117202603) using the STR from comment 0 and the URL from comment 8.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.