Closed Bug 41706 Opened 25 years ago Closed 24 years ago

Unicode messages that are quoted printable parts have "ðž" at beginning

Categories

(MailNews Core :: MIME, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED INVALID
Future

People

(Reporter: pajs_1, Assigned: rhp)

Details

This bug is also in Netscape 4.7. Basically, if someone posts a MIME posting to a newsgroup, you get a ÿþ< in the message, and if you click reply, that is all of the message. In Netscape 4.7, you never even saw any of the posting other than that. This happens if the header has anything like this in it:
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000_004F_01BFD000.1228BBA0"
X-Priority: 3
(Would like to say if the header had X-Newsreader: Microsoft Outlook Express 5.00.2919.6600 in it :-) .. pity)
Can you point to an example...that would help. - rhp
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Target Milestone: --- → M19
Most of the postings this problem occurs in are in the newsgroup 0000000076, which is not visible outside my ISP, I'm afraid. But I will put an offending message under this.
In Netscape 4.7 I see the message as " ÿþ< "
In Mozilla M15, I see " ÿþ That's the chappie...cheers Tarz NUKE " (without the speech marks).
If I hit reply in either version I just get " "NUKE" wrote: > ÿþ< "
Page source below.
From: "NUKE" <me@here.com>
Newsgroups: 0000000076
References: <393ce7f2@news.server.worldonline.co.uk> <393ceeae@news.server.worldonline.co.uk> <393d44e8@news.server.worldonline.co.uk>
Subject: Re: YoYo
Date: Tue, 6 Jun 2000 20:33:27 +0100
Lines: 58
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0008_01BFCFF6.788A57A0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.00.2615.200
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2615.200
NNTP-Posting-Host: 212.49.241.182
X-Original-NNTP-Posting-Host: 212.49.241.182
Message-ID: <393d5453@news.server.worldonline.co.uk>
X-Trace: 6 Jun 2000 19:43:15 GMT, 212.49.241.182
Path: news.server.worldonline.co.uk!212.49.241.182
Xref: news.server.worldonline.co.uk 0000000076:4498
This is a multi-part message in MIME format.
------=_NextPart_000_0008_01BFCFF6.788A57A0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable That's the chappie...cheers Tarz NUKE ------=_NextPart_000_0008_01BFCFF6.788A57A0 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable =FF=FE<=00!=00D=00O=00C=00T=00Y=00P=00E=00 =00H=00T=00M=00L=00 = =00P=00U=00B=00L=00I=00C=00 = =00"=00-=00/=00/=00W=003=00C=00/=00/=00D=00T=00D=00 =00H=00T=00M=00L=00 = =004=00.=000=00 = =00T=00r=00a=00n=00s=00i=00t=00i=00o=00n=00a=00l=00/=00/=00E=00N=00"=00>=00= =0D=00=0A= =00<=00H=00T=00M=00L=00>=00<=00H=00E=00A=00D=00>=00=0D=00=0A= =00<=00M=00E=00T=00A=00 = =00c=00o=00n=00t=00e=00n=00t=00=3D=00"=00t=00e=00x=00t=00/=00h=00t=00m=00= l=00;=00 = =00c=00h=00a=00r=00s=00e=00t=00=3D=00u=00n=00i=00c=00o=00d=00e=00"=00 = =00h=00t=00t=00p=00-=00e=00q=00u=00i=00v=00=3D=00C=00o=00n=00t=00e=00n=00= t=00-=00T=00y=00p=00e=00>=00=0D=00=0A= =00<=00M=00E=00T=00A=00 = =00c=00o=00n=00t=00e=00n=00t=00=3D=00"=00M=00S=00H=00T=00M=00L=00 = =005=00.=000=000=00.=002=006=001=004=00.=003=005=000=000=00"=00 = =00n=00a=00m=00e=00=3D=00G=00E=00N=00E=00R=00A=00T=00O=00R=00>=00=0D=00=0A= =00<=00S=00T=00Y=00L=00E=00>=00<=00/=00S=00T=00Y=00L=00E=00>=00=0D=00=0A= =00<=00/=00H=00E=00A=00D=00>=00=0D=00=0A= =00<=00B=00O=00D=00Y=00 = =00b=00g=00C=00o=00l=00o=00r=00=3D=00#=00f=00f=00f=00f=00f=00f=00>=00=0D=00=0A= =00<=00D=00I=00V=00>=00<=00F=00O=00N=00T=00 = =00f=00a=00c=00e=00=3D=00"=00C=00o=00m=00i=00c=00 =00S=00a=00n=00s=00 = =00M=00S=00"=00 = =00s=00i=00z=00e=00=3D=002=00>=00<=00S=00T=00R=00O=00N=00G=00>=00T=00h=00= a=00t=00'=00s=00 =00t=00h=00e=00 = =00c=00h=00a=00p=00p=00i=00e=00.=00.=00.=00c=00h=00e=00e=00r=00s=00 = =00=0D=00=0A= =00T=00a=00r=00z=00<=00/=00S=00T=00R=00O=00N=00G=00>=00<=00/=00F=00O=00N=00= T=00>=00<=00/=00D=00I=00V=00>=00=0D=00=0A= =00<=00D=00I=00V=00>=00&=00n=00b=00s=00p=00;=00<=00/=00D=00I=00V=00>=00=0D= =00=0A= =00<=00D=00I=00V=00>=00<=00F=00O=00N=00T=00 = 
=00f=00a=00c=00e=00=3D=00"=00C=00o=00m=00i=00c=00 =00S=00a=00n=00s=00 = =00M=00S=00"=00 =00=0D=00=0A= =00s=00i=00z=00e=00=3D=002=00>=00<=00S=00T=00R=00O=00N=00G=00>=00N=00U=00= K=00E=00<=00/=00S=00T=00R=00O=00N=00G=00>=00<=00/=00F=00O=00N=00T=00>=00<= =00/=00D=00I=00V=00>=00<=00/=00B=00O=00D=00Y=00>=00<=00/=00H=00T=00M=00L=00= >=00=0D=00=0A= =00 ------=_NextPart_000_0008_01BFCFF6.788A57A0-- There are hundreds of postings this occurs on if you need more feedback.
Keywords: correctness, nsbeta3
Summary: mime messages in news appear as ÿþ< → Quoted printable messages have "ÿþ" at beginning
Ok, the problem here is that the HTML part that is in quoted printable form is actually in Unicode. libmime doesn't handle that very well. I'll have to see what I can do here. - rhp
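To make the diagnosis above concrete, here is a minimal Python sketch (not libmime code) that decodes the first few quoted-printable bytes of the HTML part from the test message and interprets them two ways: as the part is labeled (iso-8859-1) and as UTF-16. The variable names are made up for illustration; only the byte string comes from the message in this bug.

```python
import quopri

# Start of the HTML part from the test message, still quoted-printable encoded.
qp_fragment = b"=FF=FE<=00!=00D=00O=00C=00T=00Y=00P=00E=00 =00H=00T=00M=00L=00"

# Undo the quoted-printable transfer encoding to get the raw bytes.
raw = quopri.decodestring(qp_fragment)

# Interpreted as labeled (charset="iso-8859-1"), the two BOM bytes become
# visible characters and every ASCII character is followed by a NUL.
as_labeled = raw.decode("iso-8859-1")

# Interpreted as UTF-16, the BOM is consumed and selects little-endian order.
as_utf16 = raw.decode("utf-16")

print(repr(as_labeled[:3]))  # 'ÿþ<'
print(as_utf16)              # <!DOCTYPE HTML
```

The first print matches what the reporter sees in the message pane; the second shows that the payload is ordinary HTML once the real encoding is used.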
Summary: Quoted printable messages have "ÿþ" at beginning → Unicode messages that are quoted printable parts have "ÿþ" at beginning
Target Milestone: M19 → M18
Hi Naoki, I was wondering if there are any other bugs you've looked at that are similar to this and if any of our overrides would work for this. Basically, this is an HTML attachment that is in quoted printable encoding, and that HTML doc has a charset=unicode META tag. I've tried hacking around with it, but without much luck. Any ideas or insights? - rhp
Kat, Wondering if you had any ideas on this one either? - rhp
Rich, this is my guess, but the problem is something like this:
1. Someone uses OE5 to compose a rich text message. He/she uses the Comic Sans font and types in text under ISO-8859-1.
2. For some reason (perhaps instead of typing in this text, he/she copied it from a text file), a Byte Order Mark (BOM), \uFFFE, gets into this message body. Since MS text files can contain a BOM, this is not unusual.
3. So this message goes out. It is really just a plain English message without any 8-bit characters, but because of the spurious BOM at the beginning of the text, the QP mechanism kicks in and QP-encodes the whole thing.
4. My guess is that we don't handle a BOM well in mail messages and choke on this when we try to quote it. "ÿþ<" would be simply FFFE + the beginning of an HTML file, which goes like: <!DOCTYPE HTML ....
Recently bobj talked about supporting BOM in reading UTF-8 text files. Maybe he has some more ideas about this. I am curious as to how this type of BOM gets into an otherwise simple ASCII HTML file. My suspicion is a copy/paste operation, which may leave something like this without the user ever knowing about it. I would definitely have Naoki look at this.
There is a factual error above. It is "FEFF" rather than "FFFE" which we find in this message. So in the above comment, substitute FEFF wherever I say "FFFE". To add a bit more, Windows 2000/NT stores files in Unicode. I understand that they add a BOM for Little Endian, FEFF, to saved text files.
There is one more peculiar thing going on with this test message. In fact I tried MS OE5 with the same type of message structure, and Mozilla has the same problem with every one of the messages I created with OE5. The problem is that we are not reading the content charset correctly with the multipart/alternative type messages produced by OE5. My test messages contained no BOM (FEFF), so they are pure ISO-8859 (actually only ASCII) msgs. In fact the test message contained here shows the same problem. Just try this:
1. Set the Default Message View encoding (Edit | Prefs | Mail & Newsgroup | Languages | Character Coding) to something like Turkish.
2. Now display a message other than the test message and then come back to the test message.
3. Check View | Character Coding to see what the encoding is. It will say "Turkish".
4. Go back to step 1 and change the encoding to something else. Repeat steps 2 and 3, and the value of the encoding changes.
Steps 1-4 indicate that we are not picking up the charset parameter in the test message at all and are falling back on the default viewing charset. If you reply, that will get engaged. So we don't seem to be able to tell what the charset of this type of multipart/alternative message is. All my other test messages of multipart/alternative type from OE5 show the same problem. They all contain charset header info, however. My test messages don't have the QP problem, though.
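As a point of comparison for the charset observation above, here is a small Python sketch using the standard `email` package on a hand-written message shaped like the OE5 ones (the boundary `b1` and the bodies are invented for illustration). It only shows where the charset parameter lives in such a message: on each sub-part, not on the multipart/alternative container. It says nothing about how libmime actually looks the value up.

```python
from email import message_from_string

# Hypothetical message with the same structure as the OE5 test messages.
sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "--b1\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "hello\n"
    "--b1\n"
    'Content-Type: text/html; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>hello</p>\n"
    "--b1--\n"
)

msg = message_from_string(sample)

# The container header carries a boundary but no charset parameter.
top_level_charset = msg.get_content_charset()

# Each leaf part declares its own charset.
part_charsets = [
    (part.get_content_type(), part.get_content_charset())
    for part in msg.walk()
    if not part.is_multipart()
]

print(top_level_charset)
print(part_charsets)
```

A reader that consults only the top-level header would find no charset and would have to fall back on a default, which would look like the Turkish experiment described above.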
Summary: Unicode messages that are quoted printable parts have "ÿþ" at beginning → Unicode messages that are quoted printable parts have "ðž" at beginning
One more fact: we don't seem to have a problem with multipart/alternative msgs from Communicator as far as picking up the charset info is concerned.
I tried a few more times to create a mail message with OE5 so that it contained nothing but ASCII HTML text but still encoded in Unicode and QP-ed (with all the "=00" in front of the ASCII characters) with a BOM at the beginning. I haven't been successful so far. The original filer of this bug says that there are a bunch of messages like that. I'm curious as to how they get created.
I think this is a send problem. Since the message is labeled as charset="iso-8859-1", "=FF=FE" is displayed as "ÿþ". Sender should not include BOM when it's sending as "ISO-8859-1".
So are we thinking this is an invalid formatted message? If so, I'll mark invalid. - rhp
Yes, since ISO-8859-1 includes 0xFE and 0xFF as valid code points, we cannot simply ignore them when displaying the message.
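A short Python sketch of why the label cannot be rejected on decoding grounds: every byte value maps to a character in ISO-8859-1, so a mislabeled part never produces a decode error that a client could react to. This is an illustration with made-up variable names, not a description of any Mozilla code path.

```python
# Every possible byte value decodes successfully under ISO-8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# The two bytes at the start of the test message are ordinary letters there.
bom_bytes = b"\xff\xfe"
shown_as = bom_bytes.decode("iso-8859-1")

# By contrast, a stricter encoding such as UTF-8 rejects the same bytes.
try:
    bom_bytes.decode("utf-8")
    utf8_rejects = False
except UnicodeDecodeError:
    utf8_rejects = True

print(len(decoded), shown_as, utf8_rejects)
```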
Naoki, Well, I go back to my original question. Do you have any ideas on handling this. I'm in Mountain View today so maybe we can talk. I tried a few hacks, but it didn't seem to work...maybe I'm messing something up, but we can discuss. - rhp
I think we need to know how common this type of mail is. If we do a hack, we need to make sure it won't affect correct MIME mails and performance.
Very much agreed. Let me continue down my hacking road and if I get a workable solution, I'll let you know. - rhp
From the previous comments, it looks like we are handling it correctly, as nhotta commented. If a QP'd message is labeled iso-8859-1, then the correct interpretation of "=FF=FE" is "ÿþ". So by "fixing" this we would break any correct email that starts with "=FF=FE". But since this is very unlikely, we have a choice:
- leave it, and tell users to file bugs against OE
- put in a special hack (maybe only for email from OE) to ignore it
What happens if the OE email contains non-ASCII? Are they sent encoded in ISO-8859-1 in this case?
What doesn't make sense is that if there is a BOM (FE FF or FF FE), then the data should be in UTF-16 and all "ASCII" characters should have a leading null byte. But that does not seem to be the case from the previous comments, otherwise you'd have seen something like a square box before every character.
Kat, for UTF-8 browsing, the bug about handling UTF-8 "BOM"s (I don't remember the bug #) has been resolved as WORKSFORME. In the UTF-8 case, the "BOM" is EF BB BF, not FE FF or FF FE. BOMs (Byte Order Marks) don't really make sense for UTF-8 since it is a byte stream, unless used merely as a UTF-8 signature.
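For reference, the three byte sequences mentioned above are available as constants in Python's `codecs` module. The sketch below uses them in a hypothetical helper, `sniff_bom`, that reports which signature (if any) a byte string starts with. It is only an illustration of the byte patterns, not a proposed patch.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding name, BOM length) for a leading BOM, or (None, 0)."""
    # Hypothetical helper. UTF-32 is deliberately ignored here, although its
    # little-endian BOM also begins with FF FE.
    candidates = (
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfe<\x00!\x00"))   # bytes as found in the test message
print(sniff_bom(b"\xef\xbb\xbf<html>"))   # UTF-8 signature
print(sniff_bom(b"<html>"))               # no BOM
```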
bobj comments:
> What doesn't make sense is that if there is a BOM (FE FF or FF FE), then the
> data should be in UTF-16 and all "ASCII" characters should have a leading null
> byte
You're right. The original test data contains leading null bytes for all ASCII characters, so this is UTF-16/UCS-2 data. My question really is how this data can be generated and how often this happens.
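The null-byte pattern can be checked mechanically. In the quoted-printable dump the zero byte follows each ASCII byte (`<=00!=00D=00`), which is the little-endian layout. The Python sketch below, with a hypothetical function name, tests for exactly that shape after an FF FE prefix; it is one possible way to tell this kind of message apart from a genuine ISO-8859-1 message that merely starts with "ÿþ", and nothing in this bug says such a check was ever implemented.

```python
def looks_like_utf16le_ascii(raw: bytes) -> bool:
    """True if raw is FF FE followed only by (ASCII byte, 0x00) pairs."""
    if not raw.startswith(b"\xff\xfe"):
        return False
    body = raw[2:]
    if not body or len(body) % 2:
        return False
    low_bytes = body[0::2]   # first byte of each pair: the character
    high_bytes = body[1::2]  # second byte of each pair: expected to be zero
    return all(b < 0x80 for b in low_bytes) and all(b == 0 for b in high_bytes)

# Bytes shaped like the test message: BOM, then "<!D" with interleaved NULs.
print(looks_like_utf16le_ascii(b"\xff\xfe<\x00!\x00D\x00"))
# A real ISO-8859-1 message that happens to begin with the same two bytes.
print(looks_like_utf16le_ascii(b"\xff\xfehello!"))
```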
If this is an edge case that is not that common, I would MUCH rather mark invalid and move on than muck with libmime at this stage of the game. - rhp
Agreed. I'll keep an eye on the frequency issue now that we know this kind of data exists. However, I still need to file a separate bug on "multipart/alternative" mail by OE5 and us not recognizing the charset info. It happens without the Unicode mail and is more general.
If none of us have any idea how prevalent this may be, then we could just wait and see what PR2 feedback we get or generate. Let's gather some data before deciding to punt or fix...
Until this is more pervasive, I'm going to pull back on the beta3 nomination. - rhp
Keywords: correctness, nsbeta3
Target Milestone: M18 → M20
I just thought I would say that I only really get messages with this problem from one or two posters. But all follow-ups to those posters then get the problem too. At a guess, I wouldn't say it's too widespread a problem, because Netscape 4 has the same problem and I'm the only person I know who has encountered it.
Due to the problem happening relatively infrequently, going to future this one. - rhp
Target Milestone: M20 → Future
Steve Elmer wrote: Jaime, Did your team look at this bug during the triage? It's not even nominated, so we're thinking that means it should be FUTUREd. Let me know what you think. Thanks, Steve
Adding Frank and myself to cc: list. What's the status of this one???
From the comments in this bug, it seems to me that NS6 is handling this correctly and that the sent mail is incorrectly encoded. The question is "Whether this was a common OE problem for which we need to provide a workaround?" From the data in this bug report, the answer seems to be "no", that this is an unusual case of a bogus email.
Frank - Based on the comments in the bug and the low frequency of occurrence, I vote to nsbeta3- or future this one for now.
I'd vote to RESOLVE this bug as INVALID. The email is bogus.
Frank, do you have any objections to marking this as invalid? If not, please mark it as invalid and let's get it off the radar. On to bigger, badder bugs!
marking INVALID. reopen if you disagree.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Product: MailNews → Core
Product: Core → MailNews Core