Closed Bug 129443 Opened 23 years ago Closed 2 years ago

Incorrect encoding (charset) for mail and news/nntp URIs in browser

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: xslf, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC; en-US; rv:0.9.9+) Gecko/20020306
BuildID:    2002030608

When opening in the browser a Hebrew file with the MIME type message/rfc822,
Mozilla incorrectly displays it as windows-1255 and shows junk. The user has to
manually change the encoding to Unicode.

Reproducible: Always
Steps to Reproduce:
1. Go to http://www.typo.co.il/~sbforum/MESSAGES/86/1586.eml
2. See how the Hebrew is (not) displayed
3. Change the encoding to Unicode in order to view it correctly

Actual Results:  Mozilla displays the message with the wrong encoding

Expected Results:  Mozilla should pick the correct encoding

I have heard that this happens on Linux and Windows as well, but haven't tested
on them.
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255
This problem is not specific to Mac. It happens in Windows and Linux as well.

Some more examples:

http://www.typo.co.il/~sbforum/MESSAGES/80/1580.eml
http://www.typo.co.il/~sbforum/MESSAGES/73/1573.eml

These messages are encoded in 'windows-1255'. 

Mozilla displays them as either 'iso-8859-1' (latin) or 'windows-1255' 
(actually I'm not sure about the latter -- I'm using Windows now and the 
messages are shown using 'iso-8859-1', not 'windows-1255') and you see junk 
instead of Hebrew.

You have to switch to UTF-8 in order to read them.

It seems that the user's default encoding doesn't matter. Nor does it matter if 
the message is 'multipart/alternative' or single.

The following messages are encoded in UTF-8, so the problem is not specific to 
Hebrew:

http://www.typo.co.il/~sbforum/MESSAGES/75/975.eml
http://www.typo.co.il/~sbforum/MESSAGES/15/715.eml

Mozilla displays them as 'iso-8859-1' (latin) and you see junk instead of 
Hebrew.
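
The symptom described above can be reproduced outside the browser. The following is only a sketch under one assumption that is discussed later in this bug: that the message text has already been converted to UTF-8 by the time the browser picks a display charset. The sample string is hypothetical and is not taken from the linked messages.

```python
# Hypothetical sample text (Hebrew "shalom"), written with escapes.
original = "\u05e9\u05dc\u05d5\u05dd"

# Bytes as they would appear in a message declared as windows-1255.
raw_1255 = original.encode("windows-1255")

# Assumption: the text is converted to UTF-8 before it is displayed.
as_utf8 = raw_1255.decode("windows-1255").encode("utf-8")

# Interpreting the UTF-8 bytes under either charset named above gives junk.
junk_latin1 = as_utf8.decode("iso-8859-1")
junk_1255 = as_utf8.decode("windows-1255", errors="replace")

# Only switching the view to UTF-8 recovers the text.
fixed = as_utf8.decode("utf-8")

print(repr(junk_latin1))
print(repr(junk_1255))
print(fixed == original)
```

If that assumption holds, it would explain why switching to UTF-8 fixes the display even for messages whose declared charset is 'windows-1255'.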

OS: Mac System 9.x → All
Hardware: Macintosh → All
This should probably be in the intl component.
Assignee: mkaply → yokoyama
Component: BiDi Hebrew & Arabic → Internationalization
QA Contact: zach → ruixu
There is a summary of what's going on here at
http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17

Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.
.eml files are files saved by Mozilla/Netscape 6 Mail. If they are saved by 
Mozilla/Netscape 6, they are saved in UTF-8. So what you're seeing
is not a bug but behavior according to the current spec. If you want to see them in the
encoding of the system you are using, you should save them as ".txt" files.
Better yet, use HTML format when saving.

So in essence this is not a browser bug. These people are exposing
saved mail msgs without pointers and they should be told to 
include an instruction. My suggestions to eliminate the problem:

1. When you save mail msgs, use HTML format. This should get you the
   document encoding tag.
2. Turn on View | Character Coding | Auto-Detect | All.
   Auto-detectors normally check for UTF-8 sequences.

It is possible that we can build in automatic UTF-8 check on any 
encoding menu item. I wonder if that is a good idea or bad idea. 

During the Communicator 4.x days, we used to check for UCS-2 on any incoming
data and that turned out to cause some problems and so we restricted
the UCS-2 check to just when one of the Unicode encodings is 
chosen.
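
For reference, the kind of automatic UTF-8 check floated above could look roughly like this. It is a minimal sketch, not Mozilla's detector; the function name `looks_like_utf8` is made up for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 with at least one non-ASCII byte.

    Pure ASCII is rejected on purpose: it is valid in almost every charset,
    so it gives no evidence for UTF-8 in particular.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


sample = "\u05e9\u05dc\u05d5\u05dd"  # hypothetical Hebrew sample text
print(looks_like_utf8(sample.encode("utf-8")))         # well-formed UTF-8
print(looks_like_utf8(sample.encode("windows-1255")))  # legacy 8-bit bytes
print(looks_like_utf8(b"plain ascii"))                 # no evidence either way
```

The risk is the same as with the old UCS-2 check: short runs of legacy 8-bit text can be well-formed UTF-8 by coincidence, so applying such a check under every encoding menu item could override a correct user choice.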
> .eml files are files saved by Mozila/Netscape 6Mail. If it is saved by 
> Mozilla/Netscape 6, they are saved in UTF-8.

I correct myself. I explained this much better in 

http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17

The .eml data is saved as the original RFC 822 data. 

I should add one more workaround.

   Eliminate the .eml extension. You will be able to see it
   as a Windows-1255 file.

> Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.

Before you do anything, please check with the mail team to see
what consequences there are for changing the current behavior
as summarized in the above quoted comment for parsing .eml files.
Keywords: intl
QA Contact: ruixu → ylong
If I understand correctly, the problem is that we construct internally a DOM
representation of the message, with the text in UTF-8, but without setting any
charset attribute. I haven't located the code where this happens, but if my
assumptions are right, the fix ought to be trivial (famous last words).
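
To illustrate the idea (this is not the Mozilla code, which had not been located at this point): decode the body using the charset the message declares, then label the generated UTF-8 output explicitly so the viewer does not have to guess. The sample message and the HTML wrapper are hypothetical.

```python
from email import message_from_bytes, policy
from html import escape

# Hypothetical message: a Hebrew body sent as 8-bit windows-1255.
raw = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=windows-1255\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"\xf9\xec\xe5\xed\r\n"
)

msg = message_from_bytes(raw, policy=policy.default)

# get_content() decodes the body according to the declared charset.
text = msg.get_content().rstrip()

# The generated document is UTF-8, so say so instead of leaving it unlabeled.
html_doc = '<meta charset="utf-8"><pre>' + escape(text) + "</pre>"
payload = html_doc.encode("utf-8")

print(msg.get_content_charset())
print(payload.decode("utf-8"))
```

The original RFC 822 data is not modified here; only the generated display document carries the charset label.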
re-assign to smontagu


Assignee: yokoyama → smontagu
cc Xianglan and marina.
This is totally a mail/charset issue. CC'ing nhotta.
As Kat explained, saving the original RFC822 data in UTF-8 for .eml file
extension is by design. If we add any charset attribute to the file, it won't be
the original RFC822 data anymore. Should we resolve this as WFM then?
QA contact to myself.
Product: Browser → MailNews
QA Contact: ylong → ji
With regard to comment #6 by smontagu, we may be re-using 
the mail code for this because of the
.eml extension. CC'ing bienvenu@netscape.com also.

> As Kat explained, saving the original RFC822 data in UTF-8 
> for .eml file extension is by design.

My comment in this bug is incorrect. I think I was more
accurate in the original bug smontagu cited above. The data
are saved as the original data. But we use UTF-8 in internal
representation.
Kat, you're probably right, but I'm not the right person to ask - you might try
e-mailing mscott directly for the definitive answer.
Status: NEW → ASSIGNED
*** Bug 223225 has been marked as a duplicate of this bug. ***
From dupe: the same bug with news:// and nntp:// URIs
 
nntp://news.mozilla.org:119/tnhhsv6arys1.dlg@borumat.de
news:news.mozilla.org:119/tnhhsv6arys1.dlg@borumat.de
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → Incorrect encoding for mail and news URIs in browser
Summary: Incorrect encoding for mail and news URIs in browser → Incorrect encoding (charset) for mail and news/nntp URIs in browser
Yes, I'm the one who submitted the duplicated bug 223225.
In that case it shows that the problem is not the *.EML file in itself.
Apparently the same UTF-8 conversion mentioned in comment #4 is also performed
on external links to news articles. Probably the conversion is performed on all
non-webpages displayed in the browser, and comment #6 and comment #11 are
therefore perfectly right.
*** Bug 231524 has been marked as a duplicate of this bug. ***
*** Bug 244945 has been marked as a duplicate of this bug. ***
Blocks: 254868
None of the URLs provided in this bug as samples are valid any longer.
Could someone *attach* an actual .eml file that exhibits this problem to the 
bug?  Remember to give it type: message/rfc822

The file at attachment 11787 (from bug 33049) is pretty peculiar.  Loading it in 
the browser:
 - Autodetect:Universal identifies the charset as Greek (ISO-8859-7).
 - Autodetect:Japanese identifies the charset as Shift_JIS, which shows a bunch 
of Kanji (or Chinese) mixed with centered-dot characters -- including within the 
vCard.  
 - Forcing an encoding of ISO-2022-JP (the charset specified within the file 
itself), the display is all '?'.  
 - Forcing an encoding of UTF-8, the subject and body appear to be some form of 
kana, except in the vCard where the characters appear as '?'.
(In reply to comment #19)
>  - Forcing an encoding of UTF-8, the subject and body appear to be some form of 
> kana, except in the vCard where the characters appear as '?'.

This needs to be retested, but I believe that that is bug 221631, which has been
fixed since the date of the attachment.
(In reply to comment #20)
> (In reply to comment #19)
> >  - Forcing an encoding of UTF-8, the subject and body appear to be some form 
> >  of kana, except in the vCard where the characters appear as '?'.
> 
> This needs to be retested, but I believe that that is bug 221631, which has
> been fixed since the date of the attachment.

The fix there seems to be forcing a default of utf-8 on (some?) vCards -- which 
is how Mozilla sends vCards now.  The vCard in that attachment has an explicit 
2022-JP encoding.  Even when displayed in Mail/News, those characters are not 
shown correctly, so that problem is unrelated to this bug.


I forgot that attachment 139450, from the bug I filed that was duped to this 
one, shows the basic problem.  One symptom from that attachment which is not 
mentioned here: the 8bit characters which (illegally) are in the Subject header 
of that mail display correctly when the browser's encoding is 8859-1 (whereas 
the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the 
original 8859-1 characters).  Forcing the encoding to UTF-8, the body displays 
correctly but the headers are wrong.
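
The mismatch can be modelled in a few lines. This is a sketch under the assumption that the header is passed through as raw 8859-1 bytes while the body is converted to UTF-8; the sample text is hypothetical and is not the content of the attachment.

```python
sample = "Caf\u00e9"  # hypothetical text containing one accented character

# Header: raw 8-bit 8859-1 bytes, passed through untouched.
header_bytes = b"Subject: " + sample.encode("iso-8859-1")
# Body: the same text after conversion to UTF-8.
body_bytes = sample.encode("utf-8")

page = header_bytes + b"\n\n" + body_bytes

# One charset is applied to the whole page, so one part is always wrong.
as_latin1 = page.decode("iso-8859-1")
as_utf8 = page.decode("utf-8", errors="replace")

print(repr(as_latin1))  # header correct, body shows the UTF-8 bytes as 8859-1
print(repr(as_utf8))    # body correct, header byte becomes U+FFFD
```

With 8859-1 the header reads correctly and the body turns into two-character junk; with UTF-8 the body is right and the stray header byte is replaced by U+FFFD.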
*** Bug 38109 has been marked as a duplicate of this bug. ***
Product: MailNews → Core
(In reply to comment #20)
> (In reply to comment #19)
> >  - Forcing an encoding of UTF-8, the subject and body appear to be some form of 
> > kana, except in the vCard where the characters appear as '?'.
> 
> This needs to be retested, but I believe that that is bug 221631, which has been
> fixed since the date of the attachment.

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre

I see Character Encoding: Autodetect -> Universal and UTF-8. The one line of text is identical to the Subject; they look Japanese (including both hiragana and kanji). The vCard includes only ASCII plus a number of black diamonds with white question marks on them.
(In reply to comment #21)
[...]
> I forgot that attachment 139450 [details], from the bug I filed that was duped to this 
> one, shows the basic problem.  One symptom from that attachment which is not 
> mentioned here: the 8bit characters which (illegally) are in the Subject header 
> of that mail display correctly when the browser's encoding is 8859-1 (whereas 
> the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the 
> original 8859-1 characters).  Forcing the encoding to UTF-8, the body displays 
> correctly but the headers are wrong.

It is still so using "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre":

Autodetect -> Universal and Windows-1252 shows accented characters OK in Subject header and replaced by gibberish in the body. Forcing UTF-8 shows accented characters replaced by black diamonds with white question marks on them in the Subject header and OK in the body.
Product: Core → MailNews Core
QA Contact: ji → i18n
Assignee: smontagu → nobody
Status: ASSIGNED → NEW

Is this expected to still be a problem?

Flags: needinfo?(mkmelin+mozilla)

Probably not. Test cases are no longer available.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(mkmelin+mozilla)
Resolution: --- → INCOMPLETE