Closed Bug 41706 Opened 24 years ago Closed 24 years ago

Unicode messages that are quoted printable parts have "ðž" at beginning

Categories

(MailNews Core :: MIME, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED INVALID
Future

People

(Reporter: pajs_1, Assigned: rhp)

Details

This bug is also in Netscape 4.7. Basically, if someone posts a MIME message to
a newsgroup, you get a ÿþ< in the message, and if you click Reply, that is all
of the message that gets quoted. In Netscape 4.7, you never even see any of the
posting other than that.

This happens if the header has anything like this in it :

MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_004F_01BFD000.1228BBA0"
X-Priority: 3

(Would like to say if the header had X-Newsreader: Microsoft Outlook Express
5.00.2919.6600 in it :-) .. pity)
Can you point to an example...that would help.

- rhp
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Target Milestone: --- → M19
Most of the postings this problem occurs in are in the newsgroup 0000000076, which
is not visible outside my ISP, I'm afraid. But I will put an offending message
below.

In netscape 4.7 I see the message as "

ÿþ<

"
In Mozilla M15 , I see "

ÿþ 
That's the chappie...cheers Tarz
 
NUKE
"
(Without the speech marks). 
If I hit reply in either version I just get 
"
"NUKE" wrote:

> ÿþ<

"

Page source below.
From: "NUKE" <me@here.com>
Newsgroups: 0000000076
References: <393ce7f2@news.server.worldonline.co.uk>
<393ceeae@news.server.worldonline.co.uk>
<393d44e8@news.server.worldonline.co.uk>
Subject: Re: YoYo
Date: Tue, 6 Jun 2000 20:33:27 +0100
Lines: 58
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_0008_01BFCFF6.788A57A0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.00.2615.200
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2615.200
NNTP-Posting-Host: 212.49.241.182
X-Original-NNTP-Posting-Host: 212.49.241.182
Message-ID: <393d5453@news.server.worldonline.co.uk>
X-Trace: 6 Jun 2000 19:43:15 GMT, 212.49.241.182
Path: news.server.worldonline.co.uk!212.49.241.182
Xref: news.server.worldonline.co.uk 0000000076:4498

This is a multi-part message in MIME format.

------=_NextPart_000_0008_01BFCFF6.788A57A0
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

That's the chappie...cheers Tarz

NUKE

------=_NextPart_000_0008_01BFCFF6.788A57A0
Content-Type: text/html;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

=FF=FE<=00!=00D=00O=00C=00T=00Y=00P=00E=00 =00H=00T=00M=00L=00 =
=00P=00U=00B=00L=00I=00C=00 =
=00"=00-=00/=00/=00W=003=00C=00/=00/=00D=00T=00D=00 =00H=00T=00M=00L=00 =
=004=00.=000=00 =
=00T=00r=00a=00n=00s=00i=00t=00i=00o=00n=00a=00l=00/=00/=00E=00N=00"=00>=00=
=0D=00=0A=
=00<=00H=00T=00M=00L=00>=00<=00H=00E=00A=00D=00>=00=0D=00=0A=
=00<=00M=00E=00T=00A=00 =
=00c=00o=00n=00t=00e=00n=00t=00=3D=00"=00t=00e=00x=00t=00/=00h=00t=00m=00=
l=00;=00 =
=00c=00h=00a=00r=00s=00e=00t=00=3D=00u=00n=00i=00c=00o=00d=00e=00"=00 =
=00h=00t=00t=00p=00-=00e=00q=00u=00i=00v=00=3D=00C=00o=00n=00t=00e=00n=00=
t=00-=00T=00y=00p=00e=00>=00=0D=00=0A=
=00<=00M=00E=00T=00A=00 =
=00c=00o=00n=00t=00e=00n=00t=00=3D=00"=00M=00S=00H=00T=00M=00L=00 =
=005=00.=000=000=00.=002=006=001=004=00.=003=005=000=000=00"=00 =
=00n=00a=00m=00e=00=3D=00G=00E=00N=00E=00R=00A=00T=00O=00R=00>=00=0D=00=0A=
=00<=00S=00T=00Y=00L=00E=00>=00<=00/=00S=00T=00Y=00L=00E=00>=00=0D=00=0A=
=00<=00/=00H=00E=00A=00D=00>=00=0D=00=0A=
=00<=00B=00O=00D=00Y=00 =
=00b=00g=00C=00o=00l=00o=00r=00=3D=00#=00f=00f=00f=00f=00f=00f=00>=00=0D=00=0A=
=00<=00D=00I=00V=00>=00<=00F=00O=00N=00T=00 =
=00f=00a=00c=00e=00=3D=00"=00C=00o=00m=00i=00c=00 =00S=00a=00n=00s=00 =
=00M=00S=00"=00 =
=00s=00i=00z=00e=00=3D=002=00>=00<=00S=00T=00R=00O=00N=00G=00>=00T=00h=00=
a=00t=00'=00s=00 =00t=00h=00e=00 =
=00c=00h=00a=00p=00p=00i=00e=00.=00.=00.=00c=00h=00e=00e=00r=00s=00 =
=00=0D=00=0A=
=00T=00a=00r=00z=00<=00/=00S=00T=00R=00O=00N=00G=00>=00<=00/=00F=00O=00N=00=
T=00>=00<=00/=00D=00I=00V=00>=00=0D=00=0A=
=00<=00D=00I=00V=00>=00&=00n=00b=00s=00p=00;=00<=00/=00D=00I=00V=00>=00=0D=
=00=0A=
=00<=00D=00I=00V=00>=00<=00F=00O=00N=00T=00 =
=00f=00a=00c=00e=00=3D=00"=00C=00o=00m=00i=00c=00 =00S=00a=00n=00s=00 =
=00M=00S=00"=00 =00=0D=00=0A=
=00s=00i=00z=00e=00=3D=002=00>=00<=00S=00T=00R=00O=00N=00G=00>=00N=00U=00=
K=00E=00<=00/=00S=00T=00R=00O=00N=00G=00>=00<=00/=00F=00O=00N=00T=00>=00<=
=00/=00D=00I=00V=00>=00<=00/=00B=00O=00D=00Y=00>=00<=00/=00H=00T=00M=00L=00=
>=00=0D=00=0A=
=00
------=_NextPart_000_0008_01BFCFF6.788A57A0--

There are hundreds of postings this occurs on if you need more feedback.
Keywords: correctness, nsbeta3
Summary: mime messages in news appear as ÿþ< → Quoted printable messages have "ÿþ" at beginning
Ok, the problem here is that the HTML part that is in quoted printable form is 
actually in Unicode. libmime doesn't handle that very well. I'll have to see 
what I can do here.

- rhp
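
To make the diagnosis above concrete, here is a minimal Python sketch (not libmime code) that decodes the first bytes of the sample HTML part twice: once as the ISO-8859-1 that the part header claims, and once as UTF-16. The names QP_SAMPLE and decode_both_ways are hypothetical, and the sample is cut from the message source earlier in this bug, stopping after "PUBLIC".

```python
import quopri

# Start of the text/html part from the sample message, truncated after
# "PUBLIC". The trailing "=" on the first line is a QP soft line break.
QP_SAMPLE = (
    b"=FF=FE<=00!=00D=00O=00C=00T=00Y=00P=00E=00 =00H=00T=00M=00L=00 =\n"
    b"=00P=00U=00B=00L=00I=00C=00"
)


def decode_both_ways(qp_bytes):
    """Undo quoted-printable, then decode the raw bytes two ways."""
    raw = quopri.decodestring(qp_bytes)
    # What the part header says: every byte is one ISO-8859-1 character,
    # so FF FE becomes the two visible characters at the start.
    as_labeled = raw.decode("iso-8859-1")
    # What the bytes actually are: UTF-16 with a leading byte order mark.
    # The "utf-16" codec consumes the BOM and picks the byte order from it.
    as_utf16 = raw.decode("utf-16")
    return as_labeled, as_utf16


as_labeled, as_utf16 = decode_both_ways(QP_SAMPLE)
print(repr(as_labeled[:3]))  # 'ÿþ<'
print(as_utf16)              # <!DOCTYPE HTML PUBLIC
```

Read as ISO-8859-1, the text starts with "ÿþ<" and every later character is followed by a NUL, which matches what the reporter sees. Read as UTF-16, it is ordinary HTML.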
Summary: Quoted printable messages have "ÿþ" at beginning → Unicode messages that are quoted printable parts have "ÿþ" at beginning
Target Milestone: M19 → M18
Hi Naoki,
I was wondering if there are any other bugs you've looked at that are similar 
to this and if any of our overrides would work for this. Basically, this is an 
HTML attachment that is in quoted printable encoding, and that HTML doc has a 
charset=unicode META tag.

I've tried hacking around with it, but without much luck. Any ideas or 
insights.

- rhp
Kat,
Wondering if you had any ideas on this one either?

- rhp
Rich, this is my guess but the problem is something like this:

1. Someone uses OE5 to compose a rich text message. He/she uses
   the Comic Sans font and types in text under ISO-8859-1.
2. For some reason (perhaps instead of typing in this text, he/she
   copied it from a text), a Byte Order Mark (BOM) u\FFFE gets into
   this message body. Since MS text files can contain a BOM,
   this is not unusual. 
3. So this message goes out. It is really just a plain English message
   without even any 8-bit characters, but because of the spurious
   BOM at the beginning of the text, the QP mechanism kicks in and 
   QP-encodes the whole thing.
4. My guess is that we don't handle a BOM well in mail messages
   and choke on this when we try to quote it.
   "ÿþ<" would be simply FFFE + the beginning of an HTML file, which goes
   like: <!DOCTYPE HTML ....

Recently bobj talked about supporting BOM in reading UTF-8 
text files. Maybe he has some more ideas about this.

I am curious as to how this type of BOM gets into an otherwise
simple ASCII HTML file. My suspicion is a copy/paste operation,
which may leave something like this without the user
ever knowing about it.
I would have Naoki look at this definitely.

There is a factual error above. It is "FEFF" rather 
than "FFFE" which we find in this message. So
in the above comment, substitute FEFF wherever
I say "FFFE".

To add a bit more, Windows 2000/NT stores files in
Unicode. I understand that they add a BOM for Little 
Endian, FEFF, to saved text files.
There is one more peculiar thing going on with this test message.
In fact I tried MS OE5 with the same type of message structure
and Mozilla has the same problem with every one of these messages
I created with OE5.

The problem is that we are not reading the Content charset correctly
with the multipart/alternative type messages produced by OE5. 
My test messages contained no BOM (FEFF) and so they are
pure ISO-8859 (actually only ASCII) msgs. 
In fact the test message contained here shows the same problem.

Just try this:

1. Set the Default Message View encoding (Edit | Prefs | Mail & Newsgroup | Languages | Character Coding) to
   something like Turkish.
2. Now display a message other than the test message and then come back to the test message.
3. Check View | Character Coding to see what the encoding is. It will say "Turkish".
4. Go back to step 1 and change the encoding to something else. Repeat steps 2 and 3, and the value of
   the encoding changes.

Steps 1-4 indicate that we are not picking up the charset parameter in the test message at all and
are falling back on the default viewing charset. If you reply, that fallback charset gets used too.

So we don't seem to be able to tell what the charset of this type of 
multipart/alternative message is. All my other test messages of multipart/alternative
type from OE5 show the same problem.  They all contain charset header info, however.

My test messages don't have the quotep problem, though. 
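
For comparison, here is a small Python sketch of what "picking up the charset parameter" means for a message shaped like the sample: walk the multipart/alternative container and read the charset from each leaf part's Content-Type header. It uses the standard library email package rather than anything from libmime. SAMPLE is a cut-down, hypothetical version of the test message (the HTML body is simplified) and part_charsets is an illustrative name.

```python
from email import message_from_string

# Cut-down version of the sample message: same structure and headers,
# with a simplified HTML body.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_0008_01BFCFF6.788A57A0"

This is a multi-part message in MIME format.

------=_NextPart_000_0008_01BFCFF6.788A57A0
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

That's the chappie...cheers Tarz

------=_NextPart_000_0008_01BFCFF6.788A57A0
Content-Type: text/html;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML><BODY>That's the chappie...cheers Tarz</BODY></HTML>

------=_NextPart_000_0008_01BFCFF6.788A57A0--
"""


def part_charsets(raw_message):
    """Return (content type, declared charset) for every leaf part."""
    msg = message_from_string(raw_message)
    return [
        (part.get_content_type(), part.get_content_charset())
        for part in msg.walk()
        if not part.is_multipart()  # skip the multipart container itself
    ]


for content_type, charset in part_charsets(SAMPLE):
    print(content_type, charset)
```

A reader that finds no charset on a part would have to fall back on a default, which is the behavior the Turkish experiment above exposes.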
Summary: Unicode messages that are quoted printable parts have "ÿþ" at beginning → Unicode messages that are quoted printable parts have "ðž" at beginning
One more fact: we don't seem to have a problem with multipart/alternative msgs 
from Communicator as far as picking up the charset info is concerned.
I tried a few more times to create a mail message with OE5 so that it contained nothing but
ASCII HTML text but still encoded in Unicode and QP-ed (with all the "=00" in front of 
the ASCII characters) with a BOM at the beginning. I haven't been successful so far.
The original filer of this bug says that there are a bunch of messages like that. I'm curious as
to how they get created.
I think this is a send problem. Since the message is labeled as 
charset="iso-8859-1", "=FF=FE" is displayed as "ÿþ".
The sender should not include a BOM when it's sending as "ISO-8859-1".
So are we thinking this is an invalid formatted message? If so, I'll mark 
invalid.

- rhp
Yes, since ISO-8859-1 includes 0xFE and 0xFF as valid code points, we cannot 
simply ignore them when displaying the message.
Naoki,
Well, I go back to my original question. Do you have any ideas on handling 
this? I'm in Mountain View today so maybe we can talk. I tried a few hacks, but 
they didn't seem to work...maybe I'm messing something up, but we can discuss.

- rhp
I think we need to know how common this type of mail is.
If we do hack, we need to make sure it won't affect correct MIME mails or 
performance.
Very much agreed. Let me continue down my hacking road and if I get a workable 
solution, I'll let you know.

- rhp
From the previous comments, it looks like we are handling it correctly as
nhotta commented.  If a QP'd message is labeled iso-8859-1, then the correct
interpretation of "=FF=FE" is "ÿþ".  So by "fixing" this we would break any
correct email that starts with "=FF=FE".

But since this is very unlikely, we have a choice:
   - leave it, and tell users to file bugs against OE
   - put a special hack (maybe only for email from OE) to ignore it

What happens if the OE email contains non-ASCII?  Are they sent encoded in
ISO-8859-1 in this case?

What doesn't make sense is that if there is a BOM (FE FF or FF FE), then the
data should be in UTF-16 and all "ASCII" characters should have a leading null
byte.  But that does not seem to be the case from the previous comments, 
otherwise you'd have seen something like a square box before every character.

Kat,
  For UTF-8 browsing, the bug about handling UTF-8 "BOM"s (I don't remember
the bug #) has been resolved as WORKSFORME.  In the UTF-8 case, the "BOM"
is EF BB BF, not either FE FF or FF FE.  BOMs (Byte Order Marks) don't really
make sense for UTF-8 since it is a byte stream -- unless used as merely a
UTF-8 signature.
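
The three byte sequences mentioned above can be told apart with a simple prefix check. The following Python sketch is illustrative only: sniff_bom is a hypothetical helper, not a Mozilla function, and it deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE).

```python
import codecs

# Byte order marks discussed in this bug, as defined by the codecs module:
#   EF BB BF -> UTF-8 signature
#   FF FE    -> UTF-16, little endian
#   FE FF    -> UTF-16, big endian
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# The sample message starts with FF FE, so it sniffs as little-endian UTF-16.
encoding, skip = sniff_bom(b"\xff\xfe<\x00!\x00")
print(encoding, skip)
```

A BOM match alone is only a hint: in a part labeled ISO-8859-1, FF FE is also a valid pair of characters, which is the concern raised earlier in this comment.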
bobj comments:

> What doesn't make sense is that if there is a BOM (FE FF or FE FF), then the
> data should be in UTF-16 and all "ASCII" characters should have a leading null
> byte

You're right. The original test data contain leading null bytes for all ASCII characters. 
And so this is UTF-16/UCS-2 data. 
My question really is how this data can be generated and how often this happens.
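
Since the data has a null byte after every ASCII character, one conceivable form of the "special hack" would be to require both signals (BOM plus interleaved nulls) before overriding the declared charset. The Python sketch below is purely hypothetical: the function name, the sample_size parameter and the rule itself are assumptions for illustration, not anything libmime does or was proposed to do in this bug.

```python
def looks_like_mislabeled_utf16le(raw, declared_charset, sample_size=64):
    """Guess whether decoded part bytes are UTF-16LE despite their label.

    raw is the part body after quoted-printable decoding. The check only
    fires for parts labeled ISO-8859-1 that start with FF FE and whose
    every second byte after the BOM (the high byte of each little-endian
    code unit) is zero, as in ASCII-only text stored as UTF-16LE.
    """
    if declared_charset.lower() != "iso-8859-1":
        return False
    if not raw.startswith(b"\xff\xfe"):
        return False
    body = raw[2:2 + sample_size]
    if len(body) < 2:
        return False
    high_bytes = body[1::2]
    return all(b == 0 for b in high_bytes)


# Bytes shaped like the sample message: BOM + ASCII HTML as UTF-16LE.
bogus = b"\xff\xfe" + "<!DOCTYPE HTML".encode("utf-16-le")
# A genuine ISO-8859-1 text that happens to start with the same two bytes.
genuine = "\u00ff\u00feHello".encode("iso-8859-1")

print(looks_like_mislabeled_utf16le(bogus, "iso-8859-1"))
print(looks_like_mislabeled_utf16le(genuine, "iso-8859-1"))
```

The second call is the case a hack must not break: a correctly labeled message that really does begin with "ÿþ". The rule would still miss UTF-16 text containing non-ASCII characters, and it adds a scan to every ISO-8859-1 part, so the correctness and performance concerns raised above still apply.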
If this is an edge case that is not that common, I would MUCH rather mark 
invalid and move on than muck with libmime at this stage of the game.

- rhp
Agreed. I'll keep an eye on the frequency issue now that we know
this kind of data exists.
However, I still need to file a separate bug on "multipart/alternative" mail by OE5 and 
us not recognizing the charset info. It happens even without Unicode mail and
is more general.
If none of us have any idea how prevalent this may be, then we could just
wait and see what PR2 feedback we get or generate.  Let's gather some
data before deciding to punt or fix...
Until this is more pervasive, I'm going to pull back on the beta3 nomination.

- rhp
Keywords: correctness, nsbeta3
Target Milestone: M18 → M20
I just thought I would say I only really get messages with this problem 
from one or two posters. But all follow-ups from that poster then get the 
problem. At a guess I wouldn't say it's too widespread a problem, because 
Netscape 4 has the same problem, and I'm the only person I know who's encountered 
the problem.
Due to the problem happening relatively infrequently, going to future this one.

- rhp
Target Milestone: M20 → Future
Steve Elmer wrote:

Jaime,

Did your team look at this bug during the triage?  It's not even nominated, so
we're thinking that means it should be
FUTUREd.  Let me know what you think.

Thanks,

Steve
Adding Frank and myself to cc: list.

What's the status of this one???
From the comments in this bug, it seems to me that NS6 is handling this
correctly and that the sent mail is incorrectly encoded.

The question is "Whether this was a common OE problem for which we need to
provide a workaround?"  From the data in this bug report, the answer seems to
be "no", that this is an unusual case of a bogus email.
Frank - Based on the comments in the bug, and the low frequency of occurrence, I 
vote to nsbeta3- or future this one for now.
I'd vote to RESOLVE this bug as INVALID.
The email is bogus.
Frank, do you have any objections to marking as invalid?  If not, please mark it as 
invalid and let's get it off the radar. On to bigger, badder bugs!
marking INVALID.  reopen if you disagree.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Product: MailNews → Core
Product: Core → MailNews Core