Closed Bug 166521 Opened 22 years ago Closed 21 years ago

Semi-HTML support -- conversion of numeric entities -- for plain-text emails

Categories

(MailNews Core :: Internationalization, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

VERIFIED WONTFIX

People

(Reporter: lapsap7+mz, Assigned: smontagu)

References

Details

(Keywords: intl)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1b) Gecko/20020826
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1b) Gecko/20020826

I'm getting more and more emails that contain &#ddddd; in the Subject as well as in the Body, even though the emails are plain text. Such numeric entities are shown as-is. Would it be possible to add a function, and thus an item in the View menu (like "Numeric entities as characters"), to convert numeric entities to characters? Of course, only valid Unicode code points should be converted.

Reproducible: Always
Isn't escaped UTF-8 code something like \uXXXX? What is &#XXXX?
Summary: Suggestion : semi-HTML support -- conversion of numeric entities -- for plain-text emails → semi-HTML support -- conversion of numeric entities -- for plain-text emails
\uXXXX is used in JavaScript strings. &#xXXXX; and &#DDDDD; are HTML numeric entities. The two can't be used interchangeably.
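To make the distinction concrete, here is a minimal sketch. Python is used purely for illustration (it is not part of this thread, and its `\uXXXX` string escape merely happens to look like JavaScript's). The standard library's `html.unescape` decodes HTML numeric entities, whereas `\uXXXX` is resolved by the language's own string-literal parser.

```python
import html

# "\u4e00" is a string-literal escape: the language parser turns it into
# a single character before the program ever sees it.
from_escape = "\u4e00"

# "&#x4E00;" and "&#19968;" are ordinary ASCII text until an HTML-aware
# decoder interprets them.
from_hex_entity = html.unescape("&#x4E00;")
from_dec_entity = html.unescape("&#19968;")

# A raw string keeps the backslash sequence literally; an HTML decoder
# does not recognise it and leaves it alone.
untouched = html.unescape(r"\u4e00")
```

All three decoded values are the same character, but each notation is only understood by its own decoder.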
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reassign to ben.bucksch@beonex.com, cc to ducarroz, mozilla@bucksch.org.
Assignee: nhotta → ben.bucksch
I don't plan to add any new features myself. Reassigning to default.

Consider that people might exchange source code via email, and this change might break it without any hint that something was altered. It is true that we "mangle" some other strings (like 2^5), but that is usually obvious to the reader.

The glyph conversion in mozTXTToHTMLConv.cpp is the function to change, if you nevertheless choose that route. I am not sure the converter is the right place, though. It seems like something is going wrong during the sender's encoding / charset conversion. Wouldn't the charset conversion routines be a better place for that?

WONTFIX is another option. I have no opinion myself.
This wish has been left in suspense for a year and a half. Any plan to "revive" it? Or maybe, if it's reassigned to I18N, it might get fixed? I say so because it mostly affects non-Western users. Here are two personal experiences:

1) A friend of mine used a mail client (I forget which, but very probably OE) to send me a message. The message was encoded in Big5, but since my friend had used some characters that aren't in Big5 (some Hong Kong words or simplified Chinese), those non-Big5 characters were simply sent as &#ddddd; entities in the mail!

2) (This case is the worst.) Another friend of mine sent me a message from yahoo.com (not hk.yahoo.com or any other Chinese-localised Yahoo). He was using Mozilla. The message was certainly not in Big5 or GB. Actually, yahoo.com doesn't seem to care: every character was purely and simply replaced by a &#ddddd; entity! A side note: if he had used IE, the message would not have been readable at all! I don't understand why, but I don't care :)

So, this feature is quite important for us. Please reconsider it.
Summary: semi-HTML support -- conversion of numeric entities -- for plain-text emails → Wish: semi-HTML support -- conversion of numeric entities -- for plain-text emails
> I don't plan to add any new features myself. reassign to default. Seems like I forgot to reassign. Generally, this "problem" is caused by a *very* broken sender. Either very strange architecture which went wrong or wrong thinking on the programmer's part. Please tell the sender to use a better email program. If Yahoo as ISP changes the message content (and makes it malformed), that's even worse. SMTP servers generally shouldn't change the message at all. A workaround might be to use HTML, but I'm not sure if encouraging HTML is a good idea. I'm not against fixing this on our end, but it's going to be hard and dangerous, because (as mentioned), these character strings may appear in valid mail (e.g. programmers or web designers talking), so there will probably be false positives where we mess up the mail, if we fix this bug. In any case, I'm not interested in fixing it.
Assignee: ben.bucksch → smontagu
OS: Windows 2000 → All
Hardware: PC → All
Summary: Wish: semi-HTML support -- conversion of numeric entities -- for plain-text emails → Semi-HTML support -- conversion of numeric entities -- for plain-text emails
What's requested here can be taken care of by adding an item in 'View | Message Body As'. Currently, there are three modes: 'Simple HTML', 'Original HTML' and 'Plain Text'. Something like 'Plain Text with NCRs' might be added. Needless to say, that should NOT be the default because, as pointed out in the previous comment, NCRs in plain text are not to be interpreted as they are in HTML.

As for Yahoo, you got NCRs because stupid Yahoo and many other web mail service providers don't use UTF-8 but made the false assumption that UI languages (English, S. Chinese, T. Chinese, Russian) have a one-to-one mapping to/from character encodings (ISO-8859-1, GB2312/GB18030, Big5, KOI8-R/Windows-1251). What they should do is summarized here: http://bugs.horde.org/show_bug.cgi?id=1052
Keywords: intl
> can be taken care of by adding an item in 'View | Message Body As'

I don't think it fits there very well. From the user's viewpoint it's more of a Character Encoding. And "View body as original html/simple html" still displays plaintext if the msg is plaintext, so it's somewhat orthogonal.

Also, I don't think users should be bothered with this. It's a plain bug, and it's nothing users understand (if not even the programmers of the sender's mail client understood it!) nor should have to care about. OTOH, they shouldn't have to care about character encoding either.
(In reply to comment #8)
> > can be taken care of by adding an item in 'View | Message Body As'
>
> I don't think it fits there very well.

Neither do I, but I couldn't find any better place.

> Also, I don't think users should be bothered with this. It's a plain bug,

What do you mean by saying it's a plain bug? I don't think it's a bug on Mozilla's side, which is why this bug is an enhancement request. We may as well resolve this as 'WONTFIX'. As you wrote in your comment #6, it's not us but stupid web mail service providers like Yahoo, Hotmail, etc. that need to fix their products.
> As you wrote in your comment #6, it's not us but stupid web mail service > providers like Yahoo, Hotmail, etc that need to fix their products. Right, it's *their* bug, of Yahoo SMTP and MS OE. Why litter our product because of their obscure bugs? Just complain to these vendors/users. BTW: Not webmail, but SMTP, as I understood it.
Concerning displaying multi-language mails, mail.com and e-garfield.com are doing quite a good job. For ex, mail.com uses xhtml to make several encodings display possible: if a mail contains two attachments which have diff encodings, the mail can still be displayed correctly. For the last few comments: it's not only webmail, but also other mailers.
No, their bug has NOTHING to do with their SMTP servers. I thought you would read the bug report to Horde IMP I referred to in comment #7, but you apparently didn't.

NCRs get into email messages because their web mail composition pages are sent to end-users in one of the legacy encodings (say, ISO-8859-1) instead of UTF-8. When a user enters characters _outside_ the repertoire of ISO-8859-1 (say, Chinese characters), Mozilla, MS IE, and Konqueror convert them to NCRs before submitting them to the server-side form processing CGI (or JSP or whatever) program. Mozilla, MS IE and Konqueror are _forced_ to resort to this 'NCR hack' (it's really a hack!) because the authors of HTML 4 failed to address I18N issues in form submission. Yahoo, Hotmail, etc. just pass along those NCRs to recipients. If they had used UTF-8 (which does NOT mean that they have to send outgoing emails in UTF-8), there'd be no such problem.

> but also other mailers

What stupid/broken mail clients do that?
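The form-submission behaviour described in this comment can be imitated in a few lines. This is only a sketch under stated assumptions: Python's `xmlcharrefreplace` error handler stands in for what the browsers do, and the helper name `encode_form_field` is hypothetical. It is not the code of Mozilla, IE, or Konqueror.

```python
def encode_form_field(text: str, page_charset: str) -> bytes:
    """Encode form input in the page's charset; characters the charset
    cannot represent are replaced by decimal NCRs (the 'NCR hack')."""
    return text.encode(page_charset, errors="xmlcharrefreplace")


# "café" followed by the Chinese character for "one" (U+4E00).
message = "caf\u00e9 \u4e00"

# Legacy page encoding: the Chinese character does not fit, so an NCR appears.
legacy_bytes = encode_form_field(message, "iso-8859-1")

# UTF-8 page encoding: every character fits, so no NCR is ever generated.
utf8_bytes = encode_form_field(message, "utf-8")
```

With the legacy charset, the submitted bytes contain the ASCII text `&#19968;` in place of the character, which is what the webmail backend then passes along to the recipient.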
> What stupid/broken mail clients do that? Outlook Express, well, I thought so... I've just configured my OE6 to do the test (message in Big5 with non-Big5 characters), but I failed. I don't know. Maybe it's in OE5 or Outlook 98/2000, or I didn't use the right character. Or some config isn't correct.
> No, their bug has NOTHING to do with their SMTP servers.

I was referring to comment 5: "[somebody] is sending me a message from yahoo.com ... He's using Mozilla." I was assuming that the guy uses Mozilla Mailnews with Yahoo's POP3/SMTP access to mailboxes. Maybe Seak just meant the Mozilla browser and the Yahoo webmail client.

> I thought you would read my bug report to Horde IMP

timeout

> Yahoo, Hotmail, etc just pass along those NCRs to recipients.

...which is a bug.

Seak, you should find out
- which exact clients (with version), in which configuration (and whether it's the default or not), expose the problem, and
- how many messages (percentage) are affected and how badly (are you still able to read them, or is it totally impossible to decipher them?).
> Maybe Seak just meant the Mozilla browser and the Yahoo webmail client.

Yes, that's what he meant. He couldn't have meant anything else because NO SMTP server (known to me) knows anything about NCRs. Even if they knew, there'd be nothing they could do. All that SMTP servers have in the SMTP data is a stream of octets (as opposed to a stream of 'characters'). Given just a stream of 'octets', they can do things like converting to and from base64/quoted-printable (see RFC 1652 and the '8BITMIME' ESMTP extension; see also what sendmail 8.12.x or later does), but they cannot replace characters outside the repertoire of the MIME charset (in the Content-Type header) with NCRs, because characters not representable in the MIME charset couldn't possibly be in the SMTP data handed over to them in the first place.

>> Yahoo, Hotmail, etc just pass along those NCRs to recipients.
> ...which is a bug.

No, that's not a bug. They don't have any way to tell whether an NCR is meant to represent itself or the Unicode character represented by the NCR (which is why I wrote that using 'NCRs' in form submission is a hack at best). All they can do is just pass it along. The 'bug' of Yahoo, Hotmail, etc. is that they don't use UTF-8 for the UI.
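The ambiguity mentioned here (an NCR that means itself versus an NCR that stands for a character) can be shown with a small sketch. As an assumption, Python's `xmlcharrefreplace` error handler again stands in for the browser's form-submission behaviour, and `encode_for_legacy_form` is a hypothetical helper name.

```python
def encode_for_legacy_form(text: str) -> bytes:
    """Encode form input for an ISO-8859-1 page, replacing characters
    that do not fit with decimal NCRs."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Case 1: the user typed the Chinese character U+4E00.
typed_character = encode_for_legacy_form("\u4e00")

# Case 2: the user literally typed the eight ASCII characters "&#19968;",
# for example while explaining HTML to a colleague.
typed_literal = encode_for_legacy_form("&#19968;")
```

Both cases produce identical bytes, so neither the webmail backend nor the receiving mail client can tell which one the sender intended.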
> are you still able to read it or it is totally impossible to decypher it?)

The easiest way to decipher it is to save it to a file (with an 'html' extension), edit it to enclose the whole thing in '<pre>' and '</pre>', and read the file with a browser. Mozilla can do the equivalent, which is what this bug is about.

> I didn't use the right character.

IIRC, MS OE is not that stupid (it just turns unrepresentable characters into question marks in text/plain), but I could be wrong. Anyway, you can test it by copying and pasting Korean characters (go to http://www.yahoo.co.kr) with the MIME charset set to Big5, which can't cover even a single Korean character.
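The `<pre>` workaround can be approximated as follows. This is a rough sketch, not what Mozilla does internally: the helper names are hypothetical, and Python's `html.parser.HTMLParser` is used only as a stand-in for a browser to show which text would end up visible.

```python
from html.parser import HTMLParser


def wrap_plain_text_as_html(body: str) -> str:
    """Enclose a plain-text mail body in minimal HTML, as in the workaround."""
    return "<html><body><pre>\n" + body + "\n</pre></body></html>"


class TextCollector(HTMLParser):
    """Collects the text an HTML consumer would see, with NCRs decoded."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def rendered_text(html_source: str) -> str:
    collector = TextCollector()
    collector.feed(html_source)
    collector.close()
    return "".join(collector.chunks)


page = wrap_plain_text_as_html("Number one: &#19968; <g>")
visible = rendered_text(page)
```

The NCR is decoded to the intended character, but the literal `<g>` in the body is parsed as a tag and disappears from the visible text. The workaround is therefore only safe for messages that contain no other HTML-like sequences.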
> They don't have any way to tell whether an NCR is meant
> to represent itself or the Unicode character represented by the NCR

But we do even less so!

> The 'bug' of Yahoo, Hotmail etc is that they don't use UTF-8 for the UI.

Whatever they do, it's their bug, because the message they send is badly formatted.

[Just throwing it verbatim into <pre>]
> Mozilla can do the equivalent, which is what this bug is about.

We can't, at least not without causing a ton of bugs. E.g. you wouldn't be able to write <g> in plaintext anymore, because it would be interpreted as HTML just as &#4657; would; neither would you be able to read this very bugmail message correctly with Mozilla Mailnews.
(In reply to comment #17)
> > They don't have any way to tell whether an NCR is meant
> > to represent itself or the Unicode character represented by the NCR
>
> But we do even less so!

I never wrote we could do better.

> > The 'bug' of Yahoo, Hotmail etc is that they don't use UTF-8 for the UI.
>
> Whatever they do, it's their bug, because the message they send is badly

Sure, it's their bug.

> [Just throwing it verbatim into <pre>]
> > Mozilla can do the equivalent, which is what this bug is about.
>
> We can't, at least not without causing a ton of bugs, e.g. you wouldn't be

Which is exactly why that mode should NOT be the default but should only be turned on by request. Anyway, this bug has to be given a very low priority or just be resolved as 'WONTFIX'.
(In reply to comment #14)
> Maybe Seak just meant the Mozilla browser and the Yahoo webmail client.

Yes, that's what I meant (and a little further on, I had also made a comparison with IE).

> > Yahoo, Hotmail, etc just pass along those NCRs to recipients.

What is NCR? Nowhere could I find its meaning.

> Seak, you should get to know
> - which exact clients (with version), in which configuration (and if it's
> default or not), expose the problem and

This bug was first reported more than a year and a half ago; I can't find those problem mails any more.

> - how many messages (percentage) are affected and how badly (are you still able

Well, hard to say. Because it's a matter of characters that are invalid in a certain encoding, it's hard to say when such characters are used. When it's about personal emails, since I have friends who have been studying in mainland China, they are used to writing a mixture of S. Chinese and T. Chinese. In this case, I would say about one mail out of five is like this. When it's about forwarded mails (you know, those stupid jokes or whatever), the ratio could drop to 1 mail out of 20. But these mails aren't important; they should not be counted.

> to read it or it is totally impossible to decypher it?).

Yes, it's totally impossible to do so because it's in &#ddddd; form; how could I decipher it? Of course, in rare cases it's possible to guess the character from the context, but what I generally do is 1) convert ddddd to hexadecimal, 2) launch "Character table" and 3) look it up there. Quite a pain, huh!?
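The manual lookup in steps 1) to 3) can be scripted. The sketch below is only an illustration using Python's standard `unicodedata` module; the helper name `describe_code_point` is hypothetical, and it expects just the decimal digits from between `&#` and `;`.

```python
import unicodedata


def describe_code_point(decimal_digits: str) -> str:
    """Turn the digits of an &#ddddd; entity into 'U+XXXX <char> <name>'."""
    code_point = int(decimal_digits)                 # step 1: parse the decimal value
    character = chr(code_point)                      # steps 2 and 3: look the character up
    name = unicodedata.name(character, "<unnamed>")  # official Unicode name, if any
    return f"U+{code_point:04X} {character} {name}"


# The digits from "&#19968;" and "&#65;"
chinese_one = describe_code_point("19968")
latin_a = describe_code_point("65")
```

This replaces the character-table lookup for a single entity, but it does nothing for a whole message full of them.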
> Yes, it's totally impossible to do so because it's in &#ddddd; form, how > could I decypher it? Of course, in rare cases it's possible to guess > the character by the context Well, in English, it's relatively easy to guess a single word from the &#dddd;, even more so when it's only one cha&#dddd;acter which is missing. > This bug was first reported more than a and a half year ago, I can't find > back those problem mails. If it was a frequent problem, you'd get problematic mails all the time. So, this implies to me that this is a rare problem. Because it's not even our bug, and "fixing" it would have bad effects on valid mails, I am about to WONTFIX this. If you have a good idea (preferably easy and straight-forward) how to fix this without making valid mails wrong, I'd be glad to hear, but I have none. (Just doing this for the CJK charsets would not be a good solution, only make the problem less bad.) Even if we have a good idea, we still need somebody to implement it. > China, they are used to use a mixture of s. Chinese and t. Chinese. > In this case, I would say about one mail out of five is like this. Could you ask your friends, if that's really that frequent and let them answer my questions in comment 14.
(In reply to comment #19) > > > Yahoo, Hotmail, etc just pass along those NCRs to recipients. > > What is NCR? Nowhere could I find its meaning. Well, google should turn up tons of hits for NCR. It stands for numerical character reference. '&#dddd;' and '&#xhhhh;' are NCRs with 'dddd' (decimal) and 'hhhh' (in hexadecimal) being Unicode scalar values of characters represented. For instance, the Chinese character for number one is '&#x4E00;' or '&#19968;' because the Unicode code point for the character is 0x4E00. Hmm, I shouldn't have to go this length because you used the term 'numerical entity' in your bug report. > When it's about personal emails and since I've friends who have been studying > in mainland China, they are used to use a mixture of s. Chinese and t. Chinese. > In this case, I would say about one mail out of five is like this. Ask your friends to use UTF-8 and they would never have this problem because UTF-8 can cover both S.C and T.C. Also, write to Yahoo, hotmail and other bone-head web mail service providers to fix their web mail backends. The following is what they have to do ( http://bugs.horde.org/show_bug.cgi?id=1052 ) <quote> To solve this kind of problem, IMP should go all the way to UTF-8 which can represent Chinese/Japanese/French and many other languages in a single encoding. Below is what Otto Stolz wrote to Unicode mailing list in July 2001 (http://www.unicode.org/mail-arch/unicode-ml/y2001-m07/0268.html) OS> So none of these WWW interfaces is able to handle mail from, or to, OS> non-Western partners; not even Sorbian, a minority language in our OS> own country, nor the languages of our neighbours, Polish and Czech, OS> can be handeled. For a university, this lack of functionality is OS> plainly intolerable. 
OS> OS> There is a simple solution for this problem: OS> - encode all forms and other texts of the interface software in Unicode, OS> once and for all; OS> - convert incoming mail to unicode; OS> - talk to the browser in UTF-8; OS> - accept input from the HTML forms in UTF-8; OS> - send mail as is (in UTF-8), or convert outgoing mail to suitable 8-bit OS> (or even 7-bit) encodings, the user should have an option to suggest OS> the encoding for a particular message or addressee. OS> But none of the WWW interface packages I have tried works this way. This is exactly the way Mozilla-mail and MS Outlook Express work. That's why they're so excellent I18Nized mail clients. Hotmail/Yahoo mail/ Lycos mail/IMP don't work this way and that's why their I18N support is so poor. </quote> > > to read it or it is totally impossible to decypher it?). > > Yes, it's totally impossible to do so because it's in &#ddddd; form, > how could I decypher it? I've already told you how to decipher them. It's not perfect by any means (if it always works, we'd implement it) but most of time it works. Save them as html files (just enclose the entire body with a minimal html tags like '<html><body>' and '</html>') and view them with Mozilla browser.
(In reply to comment #20)
> Well, in English, it's relatively easy to guess a single word from the &#dddd;,
> even more so when it's only one cha&#dddd;acter which is missing.

Yeah, sure, and it's true for all Indo-European languages. But your remark doesn't apply to, at least, eastern languages.

> If it was a frequent problem, you'd get problematic mails all the time.

No, not exactly. You can't reason this way in terms of letters. Every character for us is equivalent to a word for you. I have an idea of how to explain this, though of course it's not totally valid because "no analogy is perfect": written British English and American English have a lot of differences. One of them is the "-our" vs "-or" endings, like "behaviour" vs "behavior", "neighbour" vs "neighbor". Now, suppose one of them (I won't say whether it's British or American, to avoid a flame war ;) ) could cause a bug in a mailer. Could you jump to the conclusion that this problematic mail would appear all the time? Probably not. You could search for this pattern in this message and I'm sure you wouldn't find any (except in the above paragraph giving the example -- but that doesn't count). Its appearance depends on context. So it is with the words in S. Chinese vs T. Chinese.

> So, this
> implies to me that this is a rare problem. Because it's not even our bug, and
> "fixing" it would have bad effects on valid mails, I am about to WONTFIX this.

Sure, it's not Mozilla's bug. That's why I filed the bug as a wish/request. The situation is becoming overwhelming. You can probably imagine that nowadays web-based users outnumber users of all mail clients. Even if I wanted to "evangelize" a real mail client or the use of UTF-8 webmail to every friend I know, it would take me I-don't-know-how-long....
(In reply to comment #21)
> Well, google should turn up tons of hits for NCR. It stands for numerical
> character reference. '&#dddd;' and '&#xhhhh;' are NCRs with 'dddd' (decimal)
> [snipped]
> have to go this length because you used the term 'numerical entity' in your bug
> report.

Oh I see. I just knew that &...; is an HTML entity, so I extrapolated from its name and thought it might be called "numerical entity" :p

> Ask your friends to use UTF-8 and they would never have this problem because
> UTF-8 can cover both S.C and T.C.

But the problem isn't theirs. Well, the problem is twofold. Primo, most of my friends (actually, most of the world today) are using IE. Secundo, have you ever tried to switch IE's encoding to UTF-8 in the middle of a webmail session? For some, like hotmail, it's useless because IE switches back to the localised encoding :-/ For others, like sinaman, every time you change the encoding, you're logged out!

> Also, write to Yahoo, hotmail and other
> bone-head web mail service providers to fix their web mail backends.

I did so with Hotmail, last year or so. I wrote to them that I couldn't read or write Unicode messages. Guess what I received afterwards. They told me to use the "correct" encoding, where to click, blah blah blah, etc etc. If you're interested in "appreciating" their skill at customer support, I could search my old mails and paste it here :)

> This is exactly the way Mozilla-mail and MS Outlook Express work.
> That's why they're so excellent I18Nized mail clients. Hotmail/Yahoo mail/
> Lycos mail/IMP don't work this way and that's why their I18N support
> is so poor.

Actually, as I've written, the mail.com and e-garfield.com UIs are very well done. Maybe I'll have to do free advertisement for them by telling every friend of mine to migrate to these websites....

> I've already told you how to decipher them. It's not perfect by any means (if
> it always works, we'd implement it) but most of time it works. Save them as html
> files (just enclose the entire body with a minimal html tags like '<html><body>'
> and '</html>') and view them with Mozilla browser.

Yes, I saw that message of yours afterwards (since I had replied to the other message beforehand). That could be a temporary workaround.
(In reply to comment #20)
> If you have a good idea (preferably easy and straight-forward) how to fix this
> without making valid mails wrong, I'd be glad to hear, but I have none. (Just
> doing this for the CJK charsets would not be a good solution, only make the
> problem less bad.) Even if we have a good idea, we still need somebody to
> implement it.

I suppose mail content, i.e. every screen-displayed object (subject or body), is at its lowest level composed of a string (where applicable, of course) together with its screen properties, like screen position, font family, point size and colour. So, I could imagine passing these strings through a function which looks only for the &#ddddd; pattern and replaces each match with the original character. That should suffice.
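A minimal sketch of such a pattern-replacing function is given below. It is written in Python for illustration only and is not Mozilla's converter code (the thread points at mozTXTToHTMLConv.cpp for that). The names `NCR_PATTERN` and `decode_ncrs` are hypothetical, and skipping NUL, surrogates and out-of-range values is one assumed reading of "only valid Unicode code points".

```python
import re

# Matches decimal (&#19968;) and hexadecimal (&#x4E00;) character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));")


def decode_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        # Leave NUL, surrogates, and values beyond U+10FFFF untouched.
        if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


decoded = decode_ncrs("&#19968; &#x4E00;")
# A false positive: the sender was explaining HTML, not writing a copyright sign.
mangled = decode_ncrs("Type &#169; to get a copyright sign")
```

The second call shows the objection raised in the earlier comments: a message that talks about entities is silently altered, which is why any such conversion would have to be something the user requests explicitly.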
(In reply to comment #23)
> (In reply to comment #21)
> > Ask your friends to use UTF-8 and they would never have this problem because
> > UTF-8 can cover both S.C and T.C.
>
> But the problem isn't theirs. Well, the problem is double. Primo, most of my
> friends (actually, most of the world today) are using IE. Secundo, have you ever
> tried to switch IE's encoding to UTF8 in the middle of a webmail session?

No, I was not talking about web mail when I wrote that you had to ask your friends to switch to UTF-8. Whether it's IE or Mozilla, manually switching to UTF-8 does screw up hotmail's UI because its UI is in one of the legacy encodings, which is why I think hotmail is completely broken. They should use UTF-8 for their UI as I explained in my previous comment.

> > Also, write to Yahoo, hotmail and other
> > bone-head web mail service providers to fix their web mail backends.
>
> I had done so to Hotmail, last year or so. I wrote them that I couldn't read
> or write Unicode message. Guess what I received afterwards. They told me to use
> the "correct" encoding, where to click, blah blah blah, etc etc. If you're

They could have gotten it right in 1996 and they still don't get it right in 2004.

> Actually, as I've written, mail.com and e-garfield.com UI are very well done.
> Maybe I have to do free advertisement for them by telling every friend of mine
> to migrate to these websites....

I also heard that myrealbox.com does what I proposed in my previous comment.

> I could imagine repassing these strings to a function which looks only for
> &#ddddd; pattern and replacing these strings back to the original strings.

Gee. Needless to say, we knew that's what you meant from the beginning. The problem is that this doesn't always work (although most of the time it works). See comment #6. So if we ever do that, it has to be offered as an option users have to request explicitly __per__ message.
> I had done so to Hotmail, last year or so. I wrote them that I couldn't read
> or write Unicode message. Guess what I received afterwards.

I guess every tech support is the same. They assume that the user is dumb and at fault, and are stubborn about it. Insist that you're sure it's a problem on *their side*, that they pass the bug report on to the programmers, and that they fix it.

Seak, without quantifiable data (numbers) on how severe this problem is in reality, this is getting nowhere. See my questions above.
(In reply to comment #25)
> Gee. Needless to say, we knew that's what you meant from the beginning. The
> problem is that this doesn't always work (although most of time it works). See
> comment #6. So if we ever do that, that has to be offered as an option users
> have to request explicitly __per__ message.

Well, I didn't want to repeat the same sentence again and again. It's just that I was asked the same question, so I gave the same answer.

Anyway, Jungshik's suggestion works for me, so I'm closing this bug as WONTFIX and I think everybody (except me) is happy.

Allow me a little off-topic before closing this bug. I think I'm just going to set up my own web-based mail and provide it to my close friends so as to solve this encoding mess. I don't yet have a clear idea of what to do first, so my questions are:
* Apart from an HTTP server (apache?) and an SMTP server, what else do I need?
* I want the Web GUI to be Unicode-compliant. Do I have to build everything from scratch or is there anything (free) that I could use?
* And the encoding engine behind it, which must be able to read SMIME, etc., and which is needed to translate, say, a Big5 email to UTF-8: do I also need to write it from scratch? Would that mean I have to read through all those RFCs to understand in the smallest detail how an email is composed?

If any of you have any reference or advice, please write to my personal e-mail address. Thanks.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → WONTFIX
(In reply to comment #27)
> (In reply to comment #25)
> > Gee.

(Don't take that personally.)

> I think everybody (except me) is happy ... I think I'm just going
> to setup my own web-based mail and provide it to my close friends so as
> to solve this encoding mess.

*phew* :-(

If you want, you could try to hack the converter (for your own copy of Mozilla) yourself. It shouldn't be hard if you know C++ and can get Mozilla to build. As said above, mozTXTToHTMLConv.cpp::GlyphHit() is (IIRC) the function to change.

> Do I have to build everything
> from scratch or is that anything (free) that I could use?

There are tons of Free full-blown webmailers and emailing scripts (*very* often used for "contact us"-style forms). (For the latter, make sure that it can't be used to send to arbitrary addresses, otherwise spammers might use it.) I don't know how well they support Unicode. I'd try freshmeat.net.
Status: RESOLVED → VERIFIED
(In reply to comment #27)
> from scratch? Which means I have to read through all those RFC's to understand
> to the least details how an email is composed?

No, you don't have to. Get PHP and install IMP (http://www.horde.org). As I wrote before (comment #7), a couple of years ago I made a patch to make it almost fully internationalized (I installed it on my machine and opened it up for my friends). That patch was not applied to IMP, but since then they seem to have implemented most of the stuff I had implemented.
BTW, the first URL in comment #21 is no longer valid.
*** Bug 239066 has been marked as a duplicate of this bug. ***
*** Bug 265314 has been marked as a duplicate of this bug. ***
Product: MailNews → Core
Product: Core → MailNews Core