contentWhittle doesn't cope with messages that include unicode non-breaking spaces

RESOLVED FIXED

Status

MailNews Core
Database
RESOLVED FIXED
9 years ago
9 years ago

People

(Reporter: davida, Assigned: davida)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

9 years ago
Created attachment 380049 [details] [diff] [review]
patch with test

I was noticing that some messages end up not de-quoted correctly by the contentWhittle code.

The following patch has code that works in my test email, which I'll attach.  The test was made to work with the code, which isn't the right way to do things, but the relationship between the test format and the original email is puzzling. Specifically, the key lines in the test email are:

 "Mark Banner <bugzilla@standard8.plus.com> wrote:\n\xc2\xa0\n> We haven't nailed anything down in detail yet, depending on how we are \n> when we get to beta 3, my proposal would be to do the final freeze two"  

note that the second line is \xc2\xa0, which is (so i learn) a unicode sequence meaning non-breaking space.

When dealing with that mail, the main for each loop in contentWhittle gets a line that is just "\xa0", not the "\xc2\xa0" that I'd expect.

So, the code seems to work in practice, the test matches the code, but I don't understand why the test shouldn't have the two-character sequence.
Attachment #380049 - Flags: review?(bugmail)
(Assignee)

Comment 1

9 years ago
Created attachment 380051 [details]
test email
(In reply to comment #0)
> When dealing with that mail, the main for each loop in contentWhittle gets a
> line that is just "\xa0", not the "\xc2\xa0" that I'd expect.

0xc2 0xa0 is the UTF-8 encoding of 0xa0, the nbsp character.  Once your real underlying character needs the high order bit set, it needs the second byte.

ipython:
In [30]: unichr(0xa0).encode('utf-8')
Out[30]: '\xc2\xa0'
Comment on attachment 380049 [details] [diff] [review]
patch with test

>diff --git a/mailnews/db/gloda/modules/fundattr.js b/mailnews/db/gloda/modules/fundattr.js
>-      if (!line)
>+      if (!line || (line == "\xa0")) /* unicode non breaking space */

as someone who just thought about this code, any thoughts on just doing:
if (!line.trim())
then any empty or all-whitespace line is treated as empty:

js> z = "\xa0"
�
js> !z.trim()
true

Please make that change unless you have a reason not to.

>+    bode: [[false, "Mark Banner <bugzilla@standard8.plus.com> wrote:"],
>+           [false, "\xa0"],
>+           [false, "> We haven't nailed anything down in detail yet, depending on 

You are a lucky man... messageGenerator doesn't perform any type of escaping/encoding on its output, and our default charset in use for these tests is ISO-8859-1, and 0xa0 is NBSP in ISO-8859-1, so this works.  Of course, it's probably no accident that 0xa0 is nbsp in unicode, but the lucky thing was that you didn't have to deal with making messageGenerator be cool.

Please add a comment before that test-case or that line indicating that the \xa0 is fine as long as the charset in use is ISO-8859-1 or the 0xa0 otherwise doesn't need to be encoded.  That way when that test inevitably explodes because of some change, the person will have a better chance of knowing why it broke.

r+ with those changes.
Attachment #380049 - Flags: superreview?(bienvenu)
Attachment #380049 - Flags: review?(bugmail)
Attachment #380049 - Flags: review+

Updated

9 years ago
Attachment #380049 - Flags: superreview?(bienvenu) → superreview+

Comment 4

9 years ago
Comment on attachment 380049 [details] [diff] [review]
patch with test

sr=me, modulo asuth's comments.
pushed: http://hg.mozilla.org/comm-central/rev/3016655dbf53

This got lost for a bit because of general bugzilla workflow issues.  davida, when possible, please mark things checkin-needed once reviews are complete.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Assignee: bugmail → david.ascher
You need to log in before you can comment on or make changes to this bug.