Closed Bug 115230 Opened 23 years ago Closed 12 years ago

MIME/UTF-8: Subject is interpreted as UTF-8

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: 3.14, Unassigned)

Details

(Whiteboard: intl)

Attachments

(2 files)

I use: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.6) Gecko/20011120

Very strange. Someone using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:0.9.6+) Gecko/20011212 (see <3C19A047.5040505@hmetzger.de>) sent a subject which should have been "Mozilla/Netscape 6: Vote für Bug #70728". It actually came out as "Mozilla/Netscape 6: Vote fÃ¼r Bug #70728". Clearly, this is meant to be UTF-8, which makes no sense for header fields (nor does any other non-ASCII).

Now the bug: Without any reason, Mozilla interprets the unencoded subject as "Mozilla/Netscape 6: Vote für Bug #70728".

pi
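The garbling described above is what raw UTF-8 bytes look like when they are read as ISO-8859-1: the two bytes that encode "ü" in UTF-8 are shown as two separate Latin 1 characters. A minimal Python sketch of that effect (the variable names are illustrative only, and this is not Mozilla code):

```python
# The intended subject text, as the sender typed it.
intended = "Vote für Bug #70728"

# What goes on the wire when the header is sent as raw, unlabeled UTF-8.
wire_bytes = intended.encode("utf-8")

# A reader that assumes ISO-8859-1 maps every byte to one character,
# so the two-byte sequence C3 BC for "ü" turns into "Ã¼".
seen_as_latin1 = wire_bytes.decode("iso-8859-1")

# A reader that assumes UTF-8 recovers the intended text.
seen_as_utf8 = wire_bytes.decode("utf-8")

print(seen_as_latin1)
print(seen_as_utf8)
```

Both readings are legal decodings of the same bytes, which is why an unlabeled 8-bit header is ambiguous.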
This is the problem message posted to a newsgroup de.comm.software.newsreader I said in Mozilla MailNews ML that I am not sure if this is a Mozilla bug or NNTP server bug. I think someone needs to provide reproducible steps. One thing is known. Mozilla does not send raw 8-bit headers unless the prefs.js contains the following: user_pref("mail.strictly_mime_headers", false);
Note that the Content-Type has charset=[object Window], which is weird:

Content-Type: text/plain; charset=[object Window]

The subject header is encoded in raw UTF-8. I will confirm this bug so that we can determine if this is a Mozilla-side bug or not.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I did a test posting (sending it directly to the server) which again shows that Mozilla reads it as UTF-8: <3c19edfb$0$27868$3b214f66@news.univie.ac.at>

As to why this was sent this way, I have no clue;-)

Sending a follow-up to the above article with my Mozilla results in "Re: ignore no reply =?ISO-8859-1?Q?f=FCr?=", but of course I insist on sending 7bit only, as RFC 1036 requires.

pi
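The follow-up subject quoted above uses a MIME encoded-word, which keeps the header 7-bit while still naming the charset. As a rough illustration, Python's standard `email.header` module can decode and produce such words; `decode_subject` is a hypothetical helper written for this sketch, not part of any mail client:

```python
from email.header import Header, decode_header


def decode_subject(value: str) -> str:
    """Decode a header value that may contain MIME encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unlabeled chunks are treated as ASCII in this sketch.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


subject = decode_subject("Re: ignore no reply =?ISO-8859-1?Q?f=FCr?=")
print(subject)

# Going the other way: label the charset explicitly and emit 7-bit text.
encoded = Header("für", "iso-8859-1").encode()
print(encoded)
```

Because the charset travels with the header, the reader does not have to guess.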
> I did a test posting (sending it directly to the server)
> which again shows that Mozilla reads it as utf-8:
> <3c19edfb$0$27868$3b214f66@news.univie.ac.at>

This is not accessible. Please post your test message to netscape.public.mozilla.test on the server news.mozilla.org.
[object window] as a font pref was related to bug #108939
Posted as <aee76d711627f178d15044097f20c3c0@pi.logic.univie.ac.at>. Also followup as described above.

pi
Thanks to Marc for comment 5. Holger Metzger posted this comment to mozilla-mail-news@mozilla.org:

"I checked. Under preferences, under character coding for sending messages I had a completely empty field, character coding under "Message Display" was also empty. The Mail startpage had "[object window]" in it - I set that back to default, and the character codings to iso-8859-1. Hope it works now."

This should take care of this problem.

With regard to comment 6 by Boris, you posted a news item to "netscape.public.mozilla.test" which contained a raw UTF-8 header. A reply to that will be done using the body charset or your default charset for Composition. That should explain your case. I don't know if there is any mystery in that.
>With regard to comment 6 by Boris, you posted
>a news item to "netscape.public.mozilla.test"
>which contained raw UTF-8 header.

Yes and no. There is no reason to believe that this is UTF-8 (after all, only ASCII is allowed there).

>Reply to that
>will be done using the body charset or your
>default charset for Composition. That should explain
>your case. I don't know if there is any mystery in that.

The problem is: Why does Mozilla assume this is UTF-8? I intended to post two octets there.

pi
IQA, could you help reproduce the problem?
> Yes and no. There is no reason to believe that this is
> UTF-8 (after all only ASCII is allowed there).

UTF-8 has distinct byte patterns that can be detected. Mozilla checks for UTF-8 bytes in the message headers and that is why UTF-8 is detected.
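To illustrate what "distinct byte patterns" means: in UTF-8, every non-ASCII character starts with a lead byte that announces how many continuation bytes (of the form 10xxxxxx) must follow. The sketch below is a simplified structural check written for illustration, not Mozilla's detection routine; `looks_like_utf8` is a hypothetical name, and the check does not reject every overlong or surrogate form that a full validator would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data holds at least one well-formed multi-byte
    UTF-8 sequence and no malformed ones (simplified check)."""
    i = 0
    saw_multibyte = False
    while i < len(data):
        lead = data[i]
        if lead < 0x80:            # plain ASCII byte
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:   # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= lead <= 0xEF:  # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= lead <= 0xF4:  # lead byte of a 4-byte sequence
            need = 3
        else:                      # stray continuation or invalid lead byte
            return False
        tail = data[i + 1:i + 1 + need]
        # Every continuation byte must match the bit pattern 10xxxxxx.
        if len(tail) != need or any((b & 0xC0) != 0x80 for b in tail):
            return False
        saw_multibyte = True
        i += 1 + need
    return saw_multibyte


print(looks_like_utf8(b"Vote f\xc3\xbcr"))  # "ü" as UTF-8 bytes
print(looks_like_utf8(b"Vote f\xfcr"))      # "ü" as a single Latin 1 byte
```

Text written in Latin 1 rarely satisfies these rules by accident, but it can, which is the ambiguity raised in the comment being answered.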
Looks similar to bug 68394 which has been fixed for a while.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.2
Target Milestone: mozilla1.2alpha → ---
This is a real problem. I just hit it again in a newsgroup (this time it was from Mozilla with Windows NT).

Many lusers use 8bit characters (outside of ASCII) in mail or news headers. Those are, of course, undefined, which is why they must be encoded. In this example the subject had an unencoded ä in it, which led Mozilla to believe it was the start of a UTF-8 sequence (but again, there was no hint that UTF-8 would be used in the subject).

Strangely enough, the original posting was displayed as intended (i.e., with the ä and whatever came later), while the followup by Mozilla had a totally different subject where ä and everything thereafter were question marks.

I produced something similar in e-mail, which I will attach next.

pi
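For illustration only: a single Latin 1 byte for "ä" (0xE4) happens to be the lead byte of a three-byte UTF-8 sequence, so treating it as UTF-8 cannot succeed unless two continuation bytes follow. The Python sketch below uses a made-up subject word and Python's own decoder; it shows the general failure mode and does not claim to reproduce exactly what Mozilla did to the reply subject.

```python
# Hypothetical subject fragment, sent as raw ISO-8859-1 bytes.
raw_subject = "Erklärung".encode("iso-8859-1")

# Strict UTF-8 decoding rejects the lone 0xE4 byte.
try:
    raw_subject.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes a replacement character for the bad byte.
lenient = raw_subject.decode("utf-8", errors="replace")

# Decoding with the charset the sender actually used gives the real text.
correct = raw_subject.decode("iso-8859-1")

print(strict_ok, lenient, correct)
```

A decoder that recovers less gracefully than Python's can lose the rest of the line as well.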
Severity: normal → major
Keywords: mozilla1.2
OS: Linux → All
Hardware: PC → All
Boris, we need to make clear what needs to be fixed. You keep throwing out different cases to solve. Here are what I think are 3 typical cases we might consider:

1. Header: raw 8-bit Latin 1 chars, no charset info
   Body: no charset info
2. Header: raw 8-bit Latin 1 chars, no charset info
   Body: ASCII charset info
3. Header: raw 8-bit Latin chars, no charset
   Body: UTF-8 (or other charsets) info

With regard to these 3 cases, we should have something like the following:

Header display:
Cases 1-3: Assume the folder (or default display) charset. The header will display OK if a Western charset is specified as the folder charset.

Reply headers (normal cases):
Case 1: Should assume the default send charset and MIME-encode it, since there is no hint from the body charset.
Case 2: Should assume Latin 1 (such as ISO-8859-1), since ASCII does not cover 8-bit headers and the natural superset of ASCII is Latin 1 such as ISO-8859-1. The subject header should be MIME-encoded.
Case 3: Should assume UTF-8 or the other charset specified for the body, since body and header should go out in the same charset. The subject header should be MIME-encoded.

A problem like the one you're describing occurs precisely because some primitive news posting software allows inconsistent charsets for header and body.

(Special) Case 4: When you have "Edit | Folder properties | Default Character Coding" set to a Latin 1 type like ISO-8859-1 and the checkbox for "applying the default to all messages" is checked, both subject header and body should go out in ISO-8859-1. The subject header should be MIME-encoded.

I am not sure about Case 2, but Cases 1, 3, and 4 are working as I describe above. These seem to be reasonable behavior for a news posting program. I think Case 2 might need some discussion.

Boris, if your default charset is ISO-8859-1, you can try Case 4. That should solve your problem.

The problem here is that users are posting message headers that there is no easy way to deal with. What Mozilla is doing seems reasonable to me, though I have not done extensive testing to verify all the cases. I am reasonably sure of how Mozilla responds right now.
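The four reply cases in this comment amount to a small decision table. As a sketch of that proposal only (not Mozilla's implementation; the function and parameter names are invented here), it could be written as:

```python
from typing import Optional


def reply_header_charset(
    body_charset: Optional[str],
    default_send_charset: str,
    forced_folder_charset: Optional[str] = None,
) -> str:
    """Pick the charset for MIME-encoding a reply's subject header,
    following the four cases proposed in the comment above."""
    if forced_folder_charset is not None:
        # Case 4: the folder charset is forced for all messages.
        return forced_folder_charset
    if body_charset is None:
        # Case 1: no hint from the body, use the default send charset.
        return default_send_charset
    if body_charset.lower() in ("us-ascii", "ascii"):
        # Case 2: ASCII cannot cover 8-bit headers, assume Latin 1.
        return "ISO-8859-1"
    # Case 3: send header and body in the same charset.
    return body_charset


print(reply_header_charset(None, "ISO-8859-15"))
print(reply_header_charset("us-ascii", "ISO-8859-15"))
print(reply_header_charset("utf-8", "ISO-8859-15"))
print(reply_header_charset("utf-8", "ISO-8859-15", forced_folder_charset="ISO-8859-1"))
```

Note that cases 2 and 3 tie the header charset to the body label, which is the assumption disputed later in the thread.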
Katsuhiko, I don't agree with your cases. Encoding in header fields (including the choice of the charset) is absolutely independent of what happens in the body. I agree, though, that assuming the default charset for undeclared charsets is the best solution.

> When you have "Edit | Folder properties | Default Character Coding"
> set to Latin 1 type like ISO-8859-1 and the checkbox for
> "applying the default to all messages" is checked,

I cannot see any good reason why I should ever ignore a charset info. So checking the last is not an option.

Let me give you a real world example:

Subject: Re: Auto-Vervollständigen deaktivieren
Message-ID: <ic34qugl2t7ia7d99etso79kftpeh5v33e@4ax.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Again, the body encoding (8bit) and its charset (utf-8) have no meaning whatsoever for the Subject. This has an ä (a umlaut, coded as for ISO-8859-1, which is the default charset of the group for me). Clearly, it is wrong in the first place to have a non-ASCII character in any header field, but it is common, so we have to deal with it. The best guess is to use the chosen default charset. What actually happens: Mozilla looks at the completely independent charset for the body.

This example for e-mail was in my latest attachment 102136.

pi
> I cannot see any good reason why I should ever ignore a
> charset info. So checking the last is not an option.

A good reason would be if you're reading a newsgroup which is essentially posting messages in Latin 1 charsets but has occasional bad headers like the one you're talking about. The folder charset "force" option applies only to that folder/newsgroup, so you don't have this same setting for any other folder such as your Inbox. We know this works well for some Chinese newsgroups where a sizable number of posters use software which has no charset specified, or an incorrect one specified for Chinese. The "force" option for that newsgroup folder only actually saves a lot of pain for the user.

> Let me give you a real world example:

Boris, this test case of yours prompted me to create Case 3 above.

> Encoding in header fields (including the choice of the
> charset) is absolutely independent from what happens
> in the body.

Theoretically this is true. However, if a message specifies MIME charsets correctly, Mozilla has no problem displaying them. We are talking about cases where there is no such info for the headers and Mozilla has to make an intelligent guess in replying. I outlined 4 cases above to deal with such cases. If you don't like the current behavior for cases 2-4, you should present plausible alternatives. I have no idea at this point as to what you want to do with these 3 cases.

By the way, your last test case assumes that the header is UTF-8 because of the body charset, I believe. Let me CC jgmyers in case. I don't know if his header UTF-8 auto-detection routine will apply to Case 3. Are we using the body charset info rather than the UTF-8 detection in Case 3 for the subject header?
Sorry, I missed your comment here:

> Clearly, it is wrong in the first place to have a non-ASCII
> character in any header field, but it is common, so we have
> to deal with it. The best guess is to use the chosen default charset.

I am not sure if I like this solution. In your UTF-8 case, this could mean that the body will go out in UTF-8 while the header might go out in ISO-8859-1. I don't like to see Mozilla send out header and body in different charsets if at all possible. This is theoretically possible but could cause problems for some existing mailers.

Let us hear others' opinions on whether we should use the body charset as a hint in Case 3, or whether we should use the user's default charset.

Boris, this type of problem happens most often in newsgroups. Can you not set the folder charset "force" option just for that newsgroup? Would that not be a reasonable compromise?
>> I cannot see any good reason why I should ever ignore a charset
>> info. So checking the last is not an option.
>
> A good reason would be if you're reading a newsgroup which is
> essentially posting messages in Latin 1 charsets but has occasional
> bad headers like the one you're talking about.

This bad header did not have any charset information. So we cannot even ignore it;-)

> The folder charset
> "force" option applies only to that folder/newsgroup and so you
> don't have this same setting for any other folder such as your
> Inbox.

Maybe I misunderstand the option. My reading is that it will ignore a charset definition: say, the message has utf-8 as the body charset, which would be overwritten, which would be really bad.

> We know this works well for some Chinese newsgroups where a
> sizable number of posters use software which has no charset
> specified,

That would be fine for articles with no definition, I agree.

>> Encoding in header fields (including the choice of the charset)
>> is absolutely independent from what happens in the body.
> We are talking about cases where there is no such info for the
> headers

Right.

> and Mozilla has to make an intelligent guess in replying. I
> outlined 4 cases above to deal with such cases.

Yes, there has to be a guess. But a "force option" is no good choice for any of the groups (mostly German) I read. The best guess IMHO is the default charset. Actually, Mozilla used the body charset instead.

> By the way your last test case assumes that the header is UTF-8
> because of the body charset, I believe.

This seems to be what Mozilla does. There is no reason to do it that way. The charset was given for the body only.

> Are we using the body charset info rather than the UTF-8 detection
> in Case 3 for the subject header?

It seems to be different. The message in question displayed "correctly", i.e., with the German umlaut. But when replying, Mozilla did use the UTF-8 interpretation. So this is even inconsistent.

pi
>> Clearly, it is wrong in the first place to have a non-ASCII
>> character in any header field, but it is common, so we have
>> to deal with it. The best guess is to use the chosen default charset.
>
> I am not sure if I like this solution. In your UTF-8 case,
> this could mean that the body will go out in UTF-8 while
> the header might go out in ISO-8859-1.

Sure, why not. Those are independent. Actually, I'd prefer a "minimal charset" (there was no character which does not fit into ISO-8859-1), but there are bugs for that.

> I don't like to see Mozilla send out header and body in different
> charsets if at all possible.

Why?

> This is theoretically possible
> but could cause problems for some existing mailers.

I see this all the time and never encountered problems with any reader which is capable of using the charsets used. If you want to use the same charset anyway, then you have to go to the bigger one, utf-8 in our case.

> Boris, this type of problems happen most often in
> newsgroups,

ACK

> can you not set the folder charset "force"
> option just for that newsgroup? Would that not be a
> reasonable compromise?

No, I just tried. The body of the first message was correctly coded for utf-8. So forcing latin1 broke the message. That's why I said it does not make sense to force some charset over a defined one. IMHO force should only apply to undefined charsets (as for the header of this message).

pi
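The "minimal charset" idea mentioned in this comment can be sketched as trying candidate charsets from smallest to largest and keeping the first one that can represent the text. This is an illustration only; `minimal_charset` and its candidate list are assumptions made for this sketch, not an existing Mozilla feature.

```python
def minimal_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first candidate charset that can encode all of text."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # some character does not fit, try the next charset
        return charset
    raise ValueError("no candidate charset can encode the text")


print(minimal_charset("Re: test"))
print(minimal_charset("Re: Auto-Vervollständigen deaktivieren"))
print(minimal_charset("Re: 日本語"))
```

Under this approach the header and body may end up labeled with different charsets, which is the point under discussion.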
As mentioned in comment 10, if Mozilla detects that a header has legal UTF-8 syntax, it will interpret the header as UTF-8, regardless of the default charset. This was done primarily to prepare for son-of-1036, the revised standard for usenet messages which (at least in draft) permits unencoded UTF-8 in headers. The header decoder pays no attention to any charset or charset label on the body. The header decoder has no effect on the header encoder beyond the fact that whichever characters result from the decoding are the characters that will be encoded in replies. The header decoder does not save the name of the charset or any other information about how the header was encoded, so this information is not available to the header encoder.
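The decoding rule described in this comment (legal UTF-8 wins, otherwise the default charset applies) can be approximated in a few lines. This is a sketch under that description, not the actual Mozilla routine; `decode_raw_header` and its default of ISO-8859-1 are assumptions for the example.

```python
def decode_raw_header(raw: bytes, default_charset: str = "iso-8859-1") -> str:
    """Decode an unlabeled 8-bit header value (hypothetical helper)."""
    try:
        # Strict decoding fails on any byte sequence that is not legal UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not legal UTF-8: fall back to the reader's default charset.
        return raw.decode(default_charset, errors="replace")


# Raw UTF-8 subject: detected and decoded as UTF-8.
print(decode_raw_header(b"Vote f\xc3\xbcr Bug #70728"))

# Raw Latin 1 subject: not legal UTF-8, so the default charset is used.
print(decode_raw_header(b"Re: Auto-Vervollst\xe4ndigen deaktivieren"))
```

As the comment notes, only the decoded characters survive this step; nothing records which charset was guessed.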
> As mentioned in comment 10, if Mozilla detects that a header has legal UTF-8
> syntax, it will interpret the header as UTF-8, regardless of the default
> charset. This was done primarily to prepare for son-of-1036, the revised
> standard for usenet messages which (at least in draft) permits unencoded UTF-8
> in headers.

Yes, but in the given case, it does not really have a meaning in utf-8. Actually, Mozilla makes a lot of question marks out of it. I think it is way too early to draw consequences from a draft which has not been implemented anywhere that I could note.

> The header decoder pays no attention to any charset or charset label on the body.

That's good news.

pi
For the test case in attachment 61727, the subject is in UTF-8, Mozilla detects it as UTF-8, and Mozilla displays it as intended. In the test case for attachment 102136, the subject does not have legal UTF-8 syntax, Mozilla does not detect it as UTF-8, and Mozilla interprets it per the default charset. I don't see what the problem is.
>For the test case in attachment 61727, the subject is in UTF-8, Mozilla detects
>it as UTF-8, and Mozilla displays it as intended.

It would be UTF-8 if we knew it. It is also something decodable as ISO-8859-x. There is no RFC which supports the view that undeclared characters are UTF-8. There is a draft which may in the far future become an RFC.

>In the test case for
>attachment 102136, the subject does not have legal UTF-8 syntax, Mozilla does
>not detect it as UTF-8, and Mozilla interprets it per the default charset.

Not when displaying. But if you reply, it does decode it as if it were UTF-8 and then produces a broken Subject, as you can see in the second mail in attachment 102136.

pi
Product: MailNews → Core
Product: Core → MailNews Core
QA Contact: ji → i18n
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
Whiteboard: intl
Software behaving as intended. Closing out INVALID.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID