Closed Bug 545478 Opened 15 years ago Closed 5 years ago

UTF-8 containing astral plane glyphs wraps in the wrong place

Categories

(Thunderbird :: Untriaged, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dg, Unassigned)

Details

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/5.0.323.0 Safari/533.1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1

If I send a plain-text message using the UTF-8 encoding, with a line of text containing astral plane glyphs (that is, code points above U+FFFF), then the line is wrapped in the wrong place.

I've filed this as General because I don't know which component contains the line-wrapping code...

Reproducible: Always

Steps to Reproduce:
1. Start composing a plain-text message in UTF-8.
2. Paste in '
Oh, joy. Bugzilla has utterly mangled my bug. Apparently it doesn't support astral plane Unicode properly. Maybe I should file a bug on it...

My test text is the phrase "Ph'nglui mglw'nafh Cthulhu R'lnyeh wgah'nagl fhtagn.", but using letters from the Mathematical Alphanumeric Symbols Unicode range up in U+1D400 instead of plain text. I tried to post a copy to pastebin, but that doesn't support Unicode either. You can find a copy of the string on Twitter, here: http://twitter.com/hjalfi/statuses/8690602802

To reproduce:

1. Start composing a plain-text message in UTF-8.
2. Paste in the string above (the exotic Unicode form).
3. Send it to yourself.

What you see:

Ph'nglui mglw'nafh Cthulhu R'lnyeh
wgah'nagl fhtagn.

What you should see:

Ph'nglui mglw'nafh Cthulhu R'lnyeh wgah'nagl fhtagn.

This is a classic symptom of a particular problem: one astral plane code point is represented as two UTF-16 values, using surrogates. An awful lot of code makes the erroneous assumption that one UTF-16 value represents one glyph. If you encode the above message in UTF-16, the line has been wrapped 60 half-words in (even though it's only 33 code points). This leads me to suspect that the line-wrapping code is treating each UTF-16 value as if it's a character, rather than actually parsing the Unicode properly.
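To make the suspected failure mode concrete, here is a minimal Python sketch. It is not Thunderbird's code. The helper names (to_math_bold, utf16_units, greedy_wrap) are hypothetical, the choice of the Mathematical Bold letters is an assumption (the report only says the letters come from the block at U+1D400), and the 72-column limit is likewise assumed for illustration. The same greedy wrapper is run twice, once measuring in UTF-16 code units and once in code points.

```python
def to_math_bold(text: str) -> str:
    """Map ASCII letters to Mathematical Bold letters (assumed style); keep the rest."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))  # bold capitals start at U+1D400
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))  # bold small letters start at U+1D41A
        else:
            out.append(ch)  # apostrophes, spaces and the full stop stay in the BMP
    return "".join(out)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


def greedy_wrap(text: str, width: int, measure) -> list[str]:
    """Greedy word wrap; `measure` decides how long a candidate line is."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > width:
            lines.append(current)  # candidate is too long: close the current line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


sample = to_math_bold("Ph'nglui mglw'nafh Cthulhu R'lnyeh wgah'nagl fhtagn.")

# Python's len() counts code points; utf16_units() counts surrogate halves separately.
by_units = greedy_wrap(sample, 72, utf16_units)   # assumed 72-column limit
by_code_points = greedy_wrap(sample, 72, len)

print(len(sample), utf16_units(sample))
print(len(by_units), len(by_code_points))
```

Under these assumptions, measuring in UTF-16 code units breaks the line after "R'lnyeh", as in the report, while measuring in code points leaves it on one line.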
Is this also true when using Firefox?
HTML containing the test phrase does seem to wrap correctly.

Also, the test phrase gets miswrapped only when the message is *sent*, not when it's in the compose window. Does that help?

Also also it would appear that Twitter's astral plane support is also dodgy, and the above link to the test phrase no longer works (it now links to an empty message!). In the interests of having at least *something* to test with, there's an entity-encoded version here: http://pastebin.com/f11edd070 But you'll need to convert that to UTF-8 somehow before pasting it into a compose window...
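One possible way to do that conversion, sketched in Python. This assumes the paste consists of HTML numeric character references such as &#119823; (its exact format is not shown here), and the two-letter input below is a made-up sample, not the paste's real contents.

```python
import html

# Hypothetical sample: decimal character references for bold "P" (U+1D40F)
# and bold "h" (U+1D421). Substitute the text copied from the paste.
entity_text = "&#119823;&#119841;"

decoded = html.unescape(entity_text)   # str containing the real code points
utf8_bytes = decoded.encode("utf-8")   # each astral code point takes 4 bytes in UTF-8

print(decoded)
print(len(decoded), len(utf8_bytes))
```

The decoded string can then be copied into the compose window, or the bytes written to a file opened in binary mode.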
(Sorry, HTML containing the test phrase seems to wrap correctly *in Firefox*.)
David,

Can you reproduce this using a current version of Thunderbird?

If you are unable to reproduce, please close by setting status to RESOLVED, and resolution to WORKSFORME or another appropriate setting.

If you are able to reproduce, add new details, and a testcase if one does not already exist in the bug report.
Version: unspecified → 3.0
No, this is still extant in 10.0.2 and the supplied test case still verifies it.
Component: General → Untriaged
Having just rechecked, it looks like this is fixed in 17.0.8.

Then I think it is safe to close this bug...

Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME