Closed
Bug 115230
Opened 23 years ago
Closed 12 years ago
MIME/UTF-8: Subject is interpreted as UTF-8
Categories
(MailNews Core :: Internationalization, defect)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: 3.14, Unassigned)
Details
(Whiteboard: intl)
Attachments
(2 files)
I use: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.6) Gecko/20011120
Very strange. Someone using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
rv:0.9.6+) Gecko/20011212 (see <3C19A047.5040505@hmetzger.de>) sent a subject
which should have been "Mozilla/Netscape 6: Vote für Bug #70728". It actually
came out as "Mozilla/Netscape 6: Vote fÃ¼r Bug #70728". Clearly, this is meant
to be UTF-8, which makes no sense for header fields (nor does any other
unencoded non-ASCII).
Now the bug: Without any reason Mozilla interprets the unencoded subject as
"Mozilla/Netscape 6: Vote für Bug #70728"
pi
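For illustration, the mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the variable names are made up here, and it assumes the sender emitted the subject as raw UTF-8 while the reader's default charset is ISO-8859-1.

```python
# Sketch only: assumes the subject went out as raw, unencoded UTF-8 bytes.
subject = "Mozilla/Netscape 6: Vote für Bug #70728"

# What the sending client put on the wire (no MIME encoded-word).
raw = subject.encode("utf-8")

# A reader that applies its default charset (ISO-8859-1) sees mojibake.
as_latin1 = raw.decode("iso-8859-1")

# A reader that guesses UTF-8 recovers the intended text.
as_utf8 = raw.decode("utf-8")

print(as_latin1)  # Mozilla/Netscape 6: Vote fÃ¼r Bug #70728
print(as_utf8)    # Mozilla/Netscape 6: Vote für Bug #70728
```

The two-byte UTF-8 sequence for "ü" (0xC3 0xBC) maps to the two characters "Ã" and "¼" in ISO-8859-1, which is why the same octets can be read either way.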
Comment 1•23 years ago
This is the problem message posted to a newsgroup
de.comm.software.newsreader
I said in Mozilla MailNews ML that I am not sure if this is a
Mozilla bug or NNTP server bug. I think someone needs to provide
reproducible steps. One thing is known. Mozilla does not send raw
8-bit headers unless the prefs.js contains the following:
user_pref("mail.strictly_mime_headers", false);
Comment 2•23 years ago
Note that the Content-Type header has charset=[object Window], which is
weird.
Content-Type: text/plain; charset=[object Window]
The subject header is encoded in raw UTF-8.
I will confirm this bug so that we can determine if this is
a Mozilla-side bug or not.
Status: UNCONFIRMED → NEW
Ever confirmed: true
| Reporter | ||
Comment 3•23 years ago
I did a test posting (sending it directly to the server) which again shows that
Mozilla reads it as utf-8: <3c19edfb$0$27868$3b214f66@news.univie.ac.at>
Regarding the question why this was sent this way I have no clue;-)
Sending a follow-up to the above article with my Mozilla results in "Re: ignore
no reply =?ISO-8859-1?Q?f=FCr?=", but of course I insist on sending 7bit only as
RFC 1036 requires.
pi
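As a side note, the "=?ISO-8859-1?Q?f=FCr?=" form in the reply subject is a MIME encoded-word, which keeps the header 7-bit clean. A minimal sketch using Python's standard `email.header` module (not Mozilla code; the variable names are illustrative) shows the round trip:

```python
from email.header import Header, decode_header

# The encoded-word quoted in the comment above.
encoded_word = "=?ISO-8859-1?Q?f=FCr?="

# decode_header returns a list of (bytes, charset) pairs; here there is one.
(raw, charset), = decode_header(encoded_word)
decoded = raw.decode(charset)

# Going the other way: encode a non-ASCII word for use in a header field.
reencoded = Header("für", "iso-8859-1").encode()

print(decoded)    # für
print(reencoded)  # an ASCII-only encoded-word for the same text
```

Because the charset is named inside the encoded-word itself, a reader does not have to guess how to interpret the octets.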
Comment 4•23 years ago
> I did a test posting (sending it directly to the server)
> which again shows that Mozilla reads it as utf-8:
> <3c19edfb$0$27868$3b214f66@news.univie.ac.at>
This is not accessible. Please post your test message to:
netscape.public.mozilla.test
on the server
news.mozilla.org
Comment 5•23 years ago
[object window] as a font pref was related to bug #108939
| Reporter | ||
Comment 6•23 years ago
Posted as <aee76d711627f178d15044097f20c3c0@pi.logic.univie.ac.at>. Also
followup as described above.
pi
Comment 7•23 years ago
Thanks to Marc for comment 5.
Holger Metzger posted this comment to mozilla-mail-news@mozilla.org:
"I checked.
Under preferences, under character coding for sending messages I had a
completely empty field, character coding under "Message Display" was
also empty. The Mail startpage had "[object window]" in it - I set that
back to default, and the character codings to iso-8859-1.
Hope it works now."
This should take care of this problem.
With regard to comment 6 by Boris, you posted
a news item to "netscape.public.mozilla.test"
which contained raw UTF-8 header. Reply to that
will be done using the body charset or your
default charset for Composition. That should explain
your case. I don't know if there is any mystery in that.
| Reporter | ||
Comment 8•23 years ago
>With regard to comment 6 by Boris, you posted
>a news item to "netscape.public.mozilla.test"
>which contained raw UTF-8 header.
Yes and no. There is no reason to believe that this is UTF-8 (after all only
ASCII is allowed there).
>Reply to that
>will be done using the body charset or your
>default charset for Composition. That should explain
>your case. I don't know if there is any mystery in that.
The problem is: Why does Mozilla assume this is UTF-8? I intended to post two
octets there.
pi
Comment 9•23 years ago
IQA, could you help reproduce the problem?
Comment 10•23 years ago
> Yes and no. There is no reason to believe that this is
> UTF-8 (after all only ASCII is allowed there).
UTF-8 has distinct byte patterns that can be detected.
Mozilla checks for UTF-8 bytes in the message headers and
that is why UTF-8 is detected.
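To make the "distinct byte patterns" point concrete, here is a rough sketch of such a check in Python. It is not Mozilla's detection routine; `looks_like_utf8` is a hypothetical helper that simply relies on Python's strict UTF-8 decoder rejecting malformed sequences.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw has non-ASCII bytes that form well-formed UTF-8."""
    if raw.isascii():
        # Pure ASCII carries no evidence for or against UTF-8.
        return False
    try:
        raw.decode("utf-8")  # strict: raises on any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


# 0xC3 0xBC is a well-formed two-byte sequence (U+00FC, "ü").
print(looks_like_utf8(b"Vote f\xc3\xbcr"))  # True
# A lone 0xFC (Latin-1 "ü") is never valid in UTF-8.
print(looks_like_utf8(b"Vote f\xfcr"))      # False
```

Typical Latin-1 text rarely passes this test by accident, because a lead byte must be followed by the right number of continuation bytes in the range 0x80 to 0xBF. A positive result is still a guess, though, since the same octets are also decodable as ISO-8859-1.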
Comment 11•23 years ago
Looks similar to bug 68394 which has been fixed for a while.
Updated•23 years ago
Status: NEW → ASSIGNED
Updated•23 years ago
Target Milestone: --- → mozilla1.2
Updated•23 years ago
Target Milestone: mozilla1.2alpha → ---
| Reporter | ||
Comment 12•23 years ago
This is a real problem. I just hit it again in a newsgroup (this time it was
from Mozilla with Windows NT).
Many lusers use 8-bit characters (outside of ASCII) in mail or news headers. Those
are, of course, undefined, which is why they must be encoded. In this example
the subject had an unencoded ä in it, which made Mozilla believe it was the start
of a UTF-8 sequence (but again, there was no hint that UTF-8 would be used in the subject).
Strangely enough, the original posting was displayed as intended (i.e., with the ä
and whatever came later), while the followup by Mozilla had a totally different
subject where the ä and everything thereafter were question marks.
I produced something similar in e-mail, which I will attach next.
pi
| Reporter | ||
Comment 13•23 years ago
Comment 14•23 years ago
Boris, we need to make clear what needs to be fixed.
You keep on throwing out different cases to solve.
Here are what I think are 3 typical cases we might consider:
1. Header: raw 8-bit Latin 1 chars, no charset info
   Body: no charset info
2. Header: raw 8-bit Latin 1 chars, no charset info
   Body: ASCII charset info
3. Header: raw 8-bit Latin 1 chars, no charset info
   Body: UTF-8 (or other charsets) info
With regard to these 3 cases, we should have something
like the following:
Header Display:
Cases 1-3: assumes folder (or default display) charset.
will display OK if a Western charset is specified
for folder charset.
Reply headers (normal cases):
Case 1: Should assume default send charset and MIME-encode it
since there is no hint from the body charset.
Case 2: Should assume Latin 1 (such as ISO-8859-1)
since ASCII does not cover 8-bit headers and
natural superset of ASCII is Latin 1 such as ISO-8859-1.
The subject header should be MIME-encoded.
Case 3: Should assume UTF-8 or other charset specified for the body
since body and header should go out in the same
charset. The subject header should be MIME-encoded.
The problem like the one you're describing occurs precisely
because some primitive news posting software allows
inconsistent charsets for header and body.
(Special) Case 4:
When you have "Edit | Folder properties | Default Character Coding"
set to Latin 1 type like ISO-8859-1 and the checkbox for
"applying the default to all messages" is checked,
Both subject header and body should go out in ISO-8859-1. The subject
header should be MIME-encoded.
I am not sure about Case 2 but Cases 1, 3, and 4 are working
as I describe above.
These seem to be reasonable behavior for a news posting
program. I think Case 2 might need some discussion.
Boris, if your default charset is ISO-8859-1, you can
try Case 4. That should solve your problem.
The problem here is that users are posting message headers
that there is no easy way to deal with.
What Mozilla is doing seems reasonable to me -- though I
have not done extensive testing to verify all Cases. I am
reasonably sure of how Mozilla responds right now.
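The reply-charset rules proposed in this comment can be summarised as a small decision function. The sketch below only restates Cases 1-4 as written; `choose_reply_charset` and its parameter names are hypothetical and do not correspond to any Mozilla function.

```python
from typing import Optional


def choose_reply_charset(
    body_charset: Optional[str],
    default_send_charset: str,
    forced_folder_charset: Optional[str] = None,
) -> str:
    """Pick the charset for MIME-encoding a reply subject (Cases 1-4)."""
    # Case 4: the folder's default charset is forced onto all messages.
    if forced_folder_charset:
        return forced_folder_charset
    # Case 1: no hint from the body, so fall back to the default send charset.
    if body_charset is None:
        return default_send_charset
    # Case 2: ASCII cannot hold 8-bit header bytes; assume its Latin 1 superset.
    if body_charset.lower() in ("us-ascii", "ascii"):
        return "iso-8859-1"
    # Case 3: send header and body in the same charset.
    return body_charset


print(choose_reply_charset(None, "iso-8859-15"))        # Case 1
print(choose_reply_charset("us-ascii", "iso-8859-15"))  # Case 2
print(choose_reply_charset("utf-8", "iso-8859-15"))     # Case 3
```

Case 3 is the rule the reporter goes on to dispute, since it lets the body charset decide how an unlabelled header is treated.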
| Reporter | ||
Comment 15•23 years ago
Katsuhiko, I don't agree with your cases. Encoding in header fields (including the
choice of the charset) is absolutely independent from what happens in the body.
I agree, though, that assuming the default charset for undeclared charsets is
the best solution.
> When you have "Edit | Folder properties | Default Character Coding"
> set to Latin 1 type like ISO-8859-1 and the checkbox for
> "applying the default to all messages" is checked,
I cannot see any good reason why I should ever ignore a charset info. So
checking the last is not an option.
Let me give you a real world example:
Subject: Re: Auto-Vervollständigen deaktivieren
Message-ID: <ic34qugl2t7ia7d99etso79kftpeh5v33e@4ax.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Again, the body encoding (8bit) and its charset (utf-8) have no meaning
whatsoever for the Subject. This has an ä (a umlaut, coded as for ISO-8859-1
which is the default charset of the group for me). Clearly, it is wrong in the
first place to have a non-ASCII character in any header field, but it is common,
so we have to deal with it. The best guess is to use the chosen default charset.
What actually happens: Mozilla looks at the completely independent charset for
the body.
This example for e-mail was in my latest attachment 102136.
pi
Comment 16•23 years ago
> I cannot see any good reason why I should ever ignore a
> charset info. So checking the last is not an option.
A good reason would be if you're reading a newsgroup
which is essentially posting messages in Latin 1 charsets
but has occasional bad headers like the one you're talking
about. The folder charset "force" option applies only to that
folder/newsgroup and so you don't have this same setting
for any other folder such as your Inbox.
We know this works well for some Chinese newsgroups where
a sizable number of posters use software which has no
charset specified, or incorrect one specified for
Chinese. The "force" option, set for that newsgroup folder
only, actually saves a lot of pain for the user.
> Let me give you a real world example:
Boris, this test case of yours prompted me to create Case 3 above.
> Encoding in header fields (including the choice of the
> charset) is absolutely independent from what happens
> in the body.
Theoretically this is true. However, if a message specifies MIME
charsets correctly, Mozilla has no problem displaying
them.
We are talking about cases where there is no such info
for the headers and Mozilla has to make an intelligent
guess in replying. I outlined 4 cases above to deal
with such cases.
If you don't like the current behavior for cases 2-4,
you should present plausible alternatives. I have no
idea at this point as to what you want to do with these
3 cases.
By the way your last test case assumes that the
header is UTF-8 because of the body charset, I believe.
Let me CC jgmyers in case. I don't know if his header
UTF-8 auto-detection routine will apply to Case 3.
Are we using the body charset info rather than the UTF-8 detection
in Case 3 for the subject header?
Comment 17•23 years ago
Sorry I missed your comment here:
> Clearly, it is wrong in the first place to have a non-ASCII
> character in any header field, but it is common, so we have
> to deal with it. The best guess is to use the chosen default charset.
I am not sure if I like this solution. In your UTF-8 case,
this could mean that the body will go out in UTF-8 while
the header might go out in ISO-8859-1.
I don't like to see Mozilla send out header and body in different
charsets if at all possible. This is theoretically possible
but could cause problems for some existing mailers.
Let us hear others' opinions on whether we should
use the body charset as a hint in Case 3, Or we should
use the user's default charset.
Boris, this type of problem happens most often in
newsgroups, can you not set the folder charset "force"
option just for that newsgroup? Would that not be a
reasonable compromise?
| Reporter | ||
Comment 18•23 years ago
>> I cannot see any good reason why I should ever ignore a charset
>> info. So checking the last is not an option.
>
> A good reason would be if you're reading a newsgroup which is
> essentially posting messages in Latin 1 charsets but has occasional
> bad headers like the one you're talking about.
This bad header did not have any charset information. So we cannot even ignore it;-)
> The folder charset
> "force" option applies only to that folder/newsgroup and so you
> don't have this same setting for any other folder such as your
> Inbox.
Maybe I misunderstand the option. My reading is that it will ignore a charset
definition, say, the message has utf-8 as the body charset which would be
overwritten, which would be really bad.
> We know this works well for some Chinese newsgroups where a
> sizable number of posters use software which has no charset
> specified,
That would be fine for articles with no definition, I agree.
>> Encoding in header fields (including the choice of the charset)
>> is absolutely independent from what happens in the body.
> We are talking about cases where there is no such info for the
> headers
Right.
> and Mozilla has to make an intelligent guess in replying. I
> outlined 4 cases above to deal with such cases.
Yes, there has to be a guess. But a "force option" is no good choice for any of
the groups (mostly German) I read. The best guess IMHO is the default charset.
Actually, Mozilla used the body charset instead.
> By the way your last test case assumes that the header is UTF-8
> because of the body charset, I believe.
This seems to be what Mozilla does. There is no reason to do it that way. The
charset was given for the body only.
> Are we using the body charset info rather than the UTF-8 detection in Case
> 3 for the subject header?
It seems to be different. The message in question displayed "correctly", i.e.,
with the German umlaut. But when replying Mozilla did use UTF-8 interpretation.
So this is even inconsistent.
pi
| Reporter | ||
Comment 19•23 years ago
>> Clearly, it is wrong in the first place to have a non-ASCII
>> character in any header field, but it is common, so we have
>> to deal with it. The best guess is to use the chosen default charset.
>
> I am not sure if I like this solution. In your UTF-8 case,
> this could mean that the body will go out in UTF-8 while
> the header might go out in ISO-8859-1.
Sure, why not. Those are independent. Actually, I'd prefer a "minimal charset"
(there was no character which does not fit into ISO-8859-1), but there are bugs
for that.
> I don't like to see Mozilla send out header and body in different
> charsets if at all possible.
Why?
> This is theoretically possible
> but could cause problems for some existing mailers.
I see this all the time and never encountered problems with any reader who is
capable of using the charsets used.
If you want to use the same charset anyways, then you have to go to the bigger
one, utf-8 in our case.
> Boris, this type of problem happens most often in
> newsgroups,
ACK
> can you not set the folder charset "force"
> option just for that newsgroup? Would that not be a
> reasonable compromise?
No, I just tried. The body of the first message was correctly coded for utf-8.
So forcing latin1 broke the message. That's why I said it does not make sense
to force some charset over a defined one. IMHO force should only apply to
undefined charsets (as for the header of this message).
pi
Comment 20•23 years ago
As mentioned in comment 10, if Mozilla detects that a header has legal UTF-8
syntax, it will interpret the header as UTF-8, regardless of the default
charset. This was done primarily to prepare for son-of-1036, the revised
standard for usenet messages which (at least in draft) permits unencoded UTF-8
in headers.
The header decoder pays no attention to any charset or charset label on the body.
The header decoder has no effect on the header encoder beyond the fact that
whichever characters result from the decoding are the characters that will be
encoded in replies. The header decoder does not save the name of the charset or
any other information about how the header was encoded, so this information is
not available to the header encoder.
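The decoding behaviour described in this comment can be sketched roughly as follows. This is an illustration only, not the Mozilla implementation: `decode_raw_header` is a hypothetical name, and the fallback to ISO-8859-1 stands in for whatever default charset the user has configured.

```python
def decode_raw_header(raw: bytes, default_charset: str = "iso-8859-1") -> str:
    """Decode an unlabelled 8-bit header value without consulting the body."""
    try:
        # Legal UTF-8 syntax wins, regardless of the default charset.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise interpret the octets per the default charset.
        return raw.decode(default_charset, errors="replace")


# Only the resulting characters survive; the charset that produced them is
# not remembered, so a reply encoder has to pick a charset independently.
print(decode_raw_header(b"Vote f\xc3\xbcr"))                        # Vote für
print(decode_raw_header(b"Auto-Vervollst\xe4ndigen deaktivieren"))  # Auto-Vervollständigen deaktivieren
```

Note that the function takes no body charset argument at all, which mirrors the statement that the header decoder ignores the charset label on the body.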
| Reporter | ||
Comment 21•23 years ago
> As mentioned in comment 10, if Mozilla detects that a header has legal UTF-8
> syntax, it will interpret the header as UTF-8, regardless of the default
> charset. This was done primarily to prepare for son-of-1036, the revised
> standard for usenet messages which (at least in draft) permits unencoded UTF-8
> in headers.
Yes, but in the given case, it does not really have a meaning in utf-8.
Actually, Mozilla makes a lot of question marks out of it.
I think it is way too early to draw consequences from a draft which has not been
implemented anywhere that I could notice.
> The header decoder pays no attention to any charset or charset label on the body.
That's good news.
pi
Comment 22•23 years ago
For the test case in attachment 61727, the subject is in UTF-8, Mozilla detects
it as UTF-8, and Mozilla displays it as intended. In the test case for
attachment 102136, the subject does not have legal UTF-8 syntax, Mozilla does
not detect it as UTF-8, and Mozilla interprets it per the default charset.
I don't see what the problem is.
| Reporter | ||
Comment 23•23 years ago
>For the test case in attachment 61727, the subject is in UTF-8, Mozilla detects
>it as UTF-8, and Mozilla displays it as intended.
It would be UTF-8 if we knew that it was. It is also something decodable as
ISO-8859-x. There is no rfc which supports the view that undeclared characters
are utf-8. There is a draft which may in the far future become an rfc.
>In the test case for
>attachment 102136, the subject does not have legal UTF-8 syntax, Mozilla does
>not detect it as UTF-8, and Mozilla interprets it per the default charset.
Not when displaying. But if you reply it does decode it as if it were UTF-8
and then produces a broken Subject as you can see in the second mail in
attachment 102136.
pi
Updated•20 years ago
Product: MailNews → Core
| Assignee | ||
Updated•17 years ago
Product: Core → MailNews Core
Updated•16 years ago
QA Contact: ji → i18n
Updated•12 years ago
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
Whiteboard: intl
Comment 24•12 years ago
Software behaving as intended. Closing out INVALID.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID