User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040526 MultiZilla/220.127.116.11h
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040526

The mail client displays the Unicode text (French, English) of an attached file inline as Chinese/Japanese characters, sometimes mixed with ISO Latin characters.

Reproducible: Always

Steps to Reproduce:
1. Send a Unicode text file as an attachment to yourself.

Actual Results:
2. When received, the attachment is displayed inline as a mix of Western characters and Asian characters (Japanese or Chinese, I don't know which).
3. Fortunately, when you save the attachment, you get it back as proper text!

Expected Results:
Display the text content correctly, or do not display it inline. Saving the file does at least allow access to the data with other software :(-
Created attachment 149397
sample text in unicode U-DOS (done with UltraEdit 10)

Zipped to preserve the file encoding.
down to the point where we are just trying to fix regressions. is this a regression? and if so any ideas when it might have been introduced?
I'm not sure if this is a recent regression. I doubt it, but I'll compare against 1.4 and 1.6.

Some notes:
1) When OE6 sends this file, it uses "Content-Disposition: attachment;".
2) When 1.7 sends this file, we use "Content-Disposition: inline;".
3) When OE receives the attachment version, it seems to handle it best: it doesn't show it inline, and if you open the attachment, it opens in an external text viewer (Notepad).
4) When 1.7 receives the attachment version, we try to show it inline. My guess is that we know we can render text/plain inline, so we try, but we do a poor job, possibly because there is no charset (or something) on the attachment.
5) When I double-click the attachment in 1.7 and we load it in the browser, it seems to display correctly. (Only the inline display is bad.)
6) Maybe we are supposed to specify the charset when we attach a Unicode file, like this?

Content-Type: text/plain; charset=<something>;
 name="spip-boucles-fake.txt"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
 filename="spip-boucles-fake.txt"
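To make note 6 concrete, here is a minimal sketch of a sender that labels a text attachment with an explicit charset. It uses Python's standard `email` package purely as an illustration; it is not Mozilla's compose code. The function name `build_message`, the subject, and the body text are hypothetical.

```python
from email.message import EmailMessage


def build_message(attachment_bytes: bytes, filename: str, charset: str) -> EmailMessage:
    """Build a message whose text/plain attachment carries an explicit charset.

    Hypothetical helper for illustration only.
    """
    msg = EmailMessage()
    msg["Subject"] = "test utf16"          # placeholder subject
    msg.set_content("See attached file.")  # placeholder main body

    # Passing bytes leaves the file's own encoding untouched; the charset
    # parameter only labels it so that the receiver does not have to guess.
    msg.add_attachment(
        attachment_bytes,
        maintype="text",
        subtype="plain",
        filename=filename,
        params={"charset": charset},
    )
    return msg
```

With this, the attachment part reports `utf-16` from `get_content_charset()`, so a receiver has no need to fall back on the charset of the main body.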
simon and jshin know way more about this than me, perhaps they have thoughts.
(In reply to comment #5)
> I'm not sure if this is a recent regression.
> I doubt it, but I'll compare against 1.4 and 1.6.

I also doubt it, but your test results would be nice to have anyway.

> 6) maybe we are supposed to be specifying the charset when we attach a unicode
> file, like this?

Yeah, we may have to, but we have to come up with a way that doesn't burden Mom'n'Pop users who have little clue about 'charset'. Currently, I think a text attachment is assumed to have the same character encoding as the main body of the message. That doesn't hold in cases like this one, but it holds in the majority of cases (although that may change as time goes by).

This case has another twist. The attached file in the sample message uploaded here is in UTF-16 with a BOM at the beginning, which is why Notepad has no trouble opening it (it detects the BOM and does the right thing). If it were a web page, we'd have no trouble, because we subject a web page to multiple mechanisms for determining the character encoding, one of which is BOM detection. For a (text) mail attachment, I'm not sure what exactly we do.

To sum up, there are two aspects to this bug. One is how to add the least obtrusive UI (or, if possible, an automatic way) to figure out the character encoding of a text attachment and add that information explicitly to 'text/*' attachments (especially text/plain). The other is how to figure out the character encoding of an unlabelled text attachment (which may differ from that of other 'text' parts of the same message).
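The second aspect (guessing the encoding of an unlabelled text attachment) could start with BOM detection, as already done for web pages. The following is a rough sketch, not the actual Mozilla logic; `sniff_charset`, `decode_attachment`, and the `body_charset` fallback are hypothetical names chosen for this illustration.

```python
import codecs

# Longer BOMs are checked first: the UTF-32LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_charset(data: bytes, fallback: str) -> str:
    """Return a codec name based on a leading BOM, else the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback


def decode_attachment(data: bytes, declared=None, body_charset="iso-8859-1") -> str:
    """Prefer an explicit label, then a BOM, then the main body's charset."""
    charset = declared or sniff_charset(data, body_charset)
    return data.decode(charset, errors="replace")
```

The order of preference (explicit label, then BOM, then the body's charset) is an assumption made for the sketch; it keeps the current "same as the body" behaviour as the last resort.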
>> I'm not sure if this is a recent regression.
>> I doubt it, but I'll compare against 1.4 and 1.6.
>
> I also doubt it, but anyway your testing result would be nice to have.

I tested both the 1.4 and 1.6 release bits, and this bug exists there as well. Since it is not a recent regression, setting blocking 1.7-.

jshin, thanks for the info.
Well, this bug doesn't depend on bug 236941. It might be argued that it's related to that bug remotely, but that's about it.
Just a comment: doesn't the fact that 'we' display part of the content as Chinese/Japanese characters suggest that Unicode is recognized as the charset, but not the right flavor of it (UTF-8 vs. UTF-16, for example)?
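One way a "wrong flavor" of Unicode can yield Asian characters is a byte-order mix-up: ASCII text stored as UTF-16LE but read as UTF-16BE lands in the CJK ideograph range. The sketch below only demonstrates that effect; the thread does not establish that this is the path Mozilla actually takes. The helper `is_cjk_ideograph` is a hypothetical name.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


text = "qsd"
le_bytes = text.encode("utf-16-le")   # 71 00 73 00 64 00
wrong = le_bytes.decode("utf-16-be")  # read as 0x7100, 0x7300, 0x6400

for original, garbled in zip(text, wrong):
    print(original, "->", f"U+{ord(garbled):04X}", is_cjk_ideograph(garbled))
```

Every lowercase ASCII letter has a code between 0x61 and 0x7A, so swapping the two bytes gives a code point between U+6100 and U+7A00, which falls inside the CJK block.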
Could people interested in this bug report please take a look at Bug 241821 ? My original subject/phrasing for 241821 is not appropriate now that I understand the nature of the problem. I agree now with "do not show the attachment inline" sentiment. I wonder what others think.
Still occurring in Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050306

From - Mon Mar 14 11:23:10 2005
X-Account-Key: account3
X-UIDL: 1110795605.16178.mrelay4-1
X-Mozilla-Status: 0001
X-Mozilla-Status2: 10000000
Return-Path: <firstname.lastname@example.org>
Delivered-To: email@example.com
Received: (qmail 16077 invoked from network); 14 Mar 2005 10:20:04 -0000
Received: from florius.duke-interactive.net (18.104.22.168)
  by mrelay4-1.free.fr with SMTP; 14 Mar 2005 10:20:04 -0000
Received: from aph-aug-103-2-1-4.w193-252.abo.wanadoo.fr ([22.214.171.124] helo=mail.duke-interactive.com)
  by florius.duke-interactive.net with smtp (Exim 4.44) id 1DAmg8-0004d9-3b
  for firstname.lastname@example.org; Mon, 14 Mar 2005 11:20:04 +0100
Received: from [10.42.10.79] (helo=[10.42.10.79])
  by mail.duke-interactive.com with esmtp (Exim 3.35 #1 (Debian)) id 1DAmfu-0003JW-00
  for <email@example.com>; Mon, 14 Mar 2005 11:19:50 +0100
Message-ID: <423565F0.firstname.lastname@example.org>
Date: Mon, 14 Mar 2005 11:22:40 +0100
From: Olivier Vit <email@example.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050306
MIME-Version: 1.0
To: undisclosed-recipients:;
Subject: test utf8
Content-Type: multipart/mixed;
 boundary="------------050801070906090602040700"
X-Duke-MailScanner: Found to be clean

This is a multi-part message in MIME format.
--------------050801070906090602040700
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


--------------050801070906090602040700
Content-Type: text/plain;
 name="spip-boucles-fake.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="spip-boucles-fake.txt"

//4NAAoADQAKAHEAcwBtAGQAbABtAHEAcwBsACAAcQBzAA0ACgANAAoADQAKAA0ACgBxAGQA
cQBzAGQAawBzAHEAZAANAAoAcQBzAGQAawBzAHEAagBkACAAcwBxAA0ACgBkAHEAcwBrAGQA
cQBzAGQAcwBxAGQADQAKAA==
--------------050801070906090602040700--
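For anyone who wants to inspect the attachment in the message above without a mail client, a small sketch like the following decodes the base64 payload and checks for a byte order mark. The variable names are arbitrary; only the payload string is taken from the message.

```python
import base64
import codecs

# Base64 payload of the "spip-boucles-fake.txt" part, copied from the message.
SAMPLE_B64 = (
    "//4NAAoADQAKAHEAcwBtAGQAbABtAHEAcwBsACAAcQBzAA0ACgANAAoADQAKAA0ACgBxAGQA"
    "cQBzAGQAawBzAHEAZAANAAoAcQBzAGQAawBzAHEAagBkACAAcwBxAA0ACgBkAHEAcwBrAGQA"
    "cQBzAGQAcwBxAGQADQAKAA=="
)

raw = base64.b64decode(SAMPLE_B64)

# FF FE at the start is the little-endian UTF-16 byte order mark.
has_le_bom = raw.startswith(codecs.BOM_UTF16_LE)

# The "utf-16" codec consumes the BOM and uses the byte order it indicates.
text = raw.decode("utf-16")

print(has_le_bom)
print(repr(text))
```

The attachment part declares no charset, so a receiver that does not look at the BOM has nothing but the ISO-8859-1 label of the main body to go on.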
Looking at the sample message, I am not (initially) seeing the Asian characters displayed in the attachment. (The test would have been more useful if composed of some actual text, rather than random characters.)

I have Auto-detect OFF. When I select the message, TB's encoding menu shows ISO-8859-1 selected (which is my default encoding); the inline'd attachment is shown as:

ÿþ

which I think is the "Little Endian" flag for 16-bit encoded messages.

If I save the attachment and open it in MS Word, it identifies the file as "little-endian Unicode." Word does not say UTF-16, so I'm not sure exactly which encoding this attachment has; but looking at it in hex, the text seems to be simple 7-bit ASCII values encoded, lo-byte first, in 16 bits.

If I add all the various Unicode encoding varieties to my "custom list" and select them all in turn, none of them display correctly. Naturally, all of the 16- and 32-bit varieties display the message body as question marks; selecting any of the UTF-16 varieties, the attachment text is misdisplayed but partly legible, and I see one ideogram character in the attachment text.

Note: If I tweak the message to add an explicit "charset=utf-16" to the attachment's headers, the attachment displays inline but not quite right, appearing just as it did when I selected UTF-16 from the encoding menu for the original message; so the problem is not (entirely) that the charset is missing.

Bug 238152 is about doing something to specify a correct charset when attaching a text/plain file.
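The two observations above (the "ÿþ" prefix and the lo-byte-first layout seen in hex) can be reproduced with a short sketch. It illustrates the byte patterns only and does not model what TB does internally; `hex_dump` is a hypothetical helper.

```python
import codecs


def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# The UTF-16 little-endian BOM is the two bytes FF FE. Read as ISO-8859-1,
# those bytes are the characters U+00FF and U+00FE.
bom_as_latin1 = codecs.BOM_UTF16_LE.decode("iso-8859-1")

# 7-bit ASCII stored lo-byte first in 16 bits: each letter is followed by 00.
sample = codecs.BOM_UTF16_LE + "qs".encode("utf-16-le")

print(bom_as_latin1)
print(hex_dump(sample))
```

The NUL bytes that follow each ASCII letter would be consistent with the display stopping right after the two BOM characters, though that is a guess rather than something verified against the code.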
I recently noticed that on Windows 2000 it doesn't display Asian characters, just blanks. Maybe some of this behaviour is specific to XP?
> save the attachment and open it in MS Word, it identifies the file as
> "little-endian Unicode." Word does not say UTF-16, so I'm not sure exactly
> which encoding this attachment has; but looking at it in hex, the text seems
> to be simple 7-bit ASCII values encoded, lo-byte first, in 16 bits.

That is UTF-16LE (little endian) :-)
sorry for the spam. making bugzilla reflect reality as I'm not working on these bugs. filter on FOOBARCHEESE to remove these in bulk.