The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes

RESOLVED FIXED in Thunderbird 16.0

Status

Thunderbird
Search
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: florian, Assigned: florian)

Tracking

Trunk
Thunderbird 16.0

Thunderbird Tracking Flags

(thunderbird13 wontfix, thunderbird14+ fixed, thunderbird15+ fixed)

Details

Attachments

(2 attachments, 1 obsolete attachment)

Created attachment 623643 [details]
Screenshot of the bug

I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot.

The cause of the problem is that the current code assumes that an UTF8 character that requires more than 2 bytes is coded on 3 bytes. However, some UTF8 characters are coded on 4 bytes; they are seen by the JS code as 2 separate characters so I assumed at the time I wrote the flawed code that this case would just work, but actually it doesn't because the current code returns 3 bytes for each half of the 4 bytes character, and the character ends up counted as 6 bytes.
Blocks: 518597
Created attachment 623644 [details] [diff] [review]
Patch
Assignee: nobody → florian
Attachment #623644 - Flags: review?(bugmail)
Attachment #623644 - Flags: review?(bugmail) → review+
http://hg.mozilla.org/comm-central/rev/ec7a7bf505fd
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 15.0
Comment on attachment 623644 [details] [diff] [review]
Patch

[Approval Request Comment]
This patch fixes a bug in a feature landed for Thunderbird 12 (bug 518597) so I think we want to take the fix to aurora, and maybe even to beta.
Attachment #623644 - Flags: approval-comm-aurora?
Attachment #623644 - Flags: approval-comm-aurora? → approval-comm-aurora+
https://hg.mozilla.org/releases/comm-aurora/rev/b705a1702ac8
status-thunderbird13: --- → wontfix
status-thunderbird14: --- → fixed
> c >= 32768
c >= 65536
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536

Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
> 
> Why? The character with which I noticed the bug was <U+1F493>, and it's
> coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.

Ah, I forgot JS encodes characters in UTF-8. Then it should be (c >= 55296 || c <= 57343).
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 || c <= 57343).

Right. Reopening to fix this.
Status: RESOLVED → REOPENED
tracking-thunderbird14: --- → ?
tracking-thunderbird15: --- → ?
Resolution: FIXED → ---
Created attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

This time I actually looked for documentation about unicode instead of trying/guessing.

http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"


http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"

"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"

"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."

"Invalid code points

According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
Attachment #623644 - Attachment is obsolete: true
Attachment #637159 - Flags: review?(bugmail)
Attachment #637159 - Flags: review?(bugmail) → review+
Comment on attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

[Approval Request Comment]
My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Attachment #637159 - Flags: approval-comm-beta?
Attachment #637159 - Flags: approval-comm-aurora?
Landed attachment 637159 [details] [diff] [review] as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 15.0 → Thunderbird 16.0
Attachment #637159 - Flags: approval-comm-beta?
Attachment #637159 - Flags: approval-comm-beta+
Attachment #637159 - Flags: approval-comm-aurora?
Attachment #637159 - Flags: approval-comm-aurora+
tracking-thunderbird14: ? → +
tracking-thunderbird15: ? → +
Setting status 14 to affected until we land the second patch on beta.
status-thunderbird14: fixed → affected
Checked in:

https://hg.mozilla.org/releases/comm-aurora/rev/c90cfc006a30
https://hg.mozilla.org/releases/comm-beta/rev/949cb94b5667
status-thunderbird14: affected → fixed
status-thunderbird15: --- → fixed
You need to log in before you can comment on or make changes to this bug.