Closed Bug 448842 Opened 16 years ago Closed 15 years ago

solution for bug 410333 introducing regression for Japanese users

Categories

(Thunderbird :: Mail Window Front End, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: bugzilla.mozilla.org, Assigned: emk)

Details

(Keywords: jp-critical, regression)

Attachments

(1 file, 5 obsolete files)

I think that the fix for bug 410333 is generally desirable (send in utf-8 when there are glyphs unencodable in current charset).  But UTF-8 is not widespread in Japan.  Many mobile handsets for example don't support UTF-8 mails.  Having mails sent automatically as UTF-8 will send a lot of technically correct mails that the recipient will still be unable to read.

intl.fallbackCharsetList.* is supposedly able to help here.  But it is both a cumbersome and a power-user-only technique.  It also does not seem to work for me with TB 2.0.0.16 on hardy.

bug 410333 was about the body if I understand it correctly.  So this bug is about the body as well.  There are problems as well with Japanese and headers of a mail not being correctly encoded -> bug 369067
Attached patch Restore the "Send as UTF" prompt (obsolete)
This patch also contains a fix for bug 328938 and bug 311821.
* You will be asked only when you are about to send a message. It will silently switch to UTF-8 on saving.
* You will be asked only if you checked "Use the default character encoding in replies" option which is not checked by default. So most users will never see the prompt.
This is critical for Japanese users. Some Japanese are really, really conservative about mail format. Some people even filter out all non-ISO-2022-JP messages. (It is an efficient spam filter for those who are not interested in foreign messages.) Others believe that they can't use UTF-8 in the mail body unless they have a private agreement with each other.
Moreover, some Japanese mobile phones and Webmails still do not cope with UTF-8.
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Attachment #358179 - Flags: review?(mnyromyr)
Flags: blocking-thunderbird3?
Flags: blocking-thunderbird3? → blocking-thunderbird3+
OS: Linux → All
Hardware: x86 → All
Target Milestone: --- → Thunderbird 3.0b3
So why don't make the intl.fallbackCharsetList.* stuff happen automagically, at least for jp users?
(and if the fallback is buggy, fix that)
Attachment #358179 - Flags: review?(mnyromyr) → review-
Comment on attachment 358179 [details] [diff] [review]
Restore the "Send as UTF" prompt

This dialog is not coming back.
If there are any issues with intl.fallbackCharsetList.*, they should get fixed, like Magnus already said. For example, we could have the fallback config be localizable or something like that...
(In reply to comment #4)
> So why don't make the intl.fallbackCharsetList.* stuff happen automagically, at
> least for jp users?

intl.fallbackCharsetList.* can't help Japanese users.
The fallback should not happen automatically for almost all Japanese senders because some Japanese recipients can decode only ISO-2022-JP.
Only UTF-8 is mentioned in comment #1, but every encoding other than ISO-2022-JP is equally unacceptable, because there is no alternative encoding which is compatible with ISO-2022-JP and can represent Japanese.
(In reply to comment #7)
> The fallback should not happen automatically

I meant, "The fallback from ISO-2022-JP to another encoding."
(In reply to comment #7)
> intl.fallbackCharsetList.* can't help Japanese users.

... in all and every circumstance, agreed.  It is a great step forward IMHO, though.

Let's call it as it is, the Japanese are still using some rather obscure practices when it comes to IT.  I am strongly in favor of trying to accommodate their needs, but on the other hand, I think Japanese mail users can't expect the rest of the world to bend over backwards just because Unicode is not popular in Japan for political reasons.

Basically, there are two situations when sending a mail.

a) the mail contains characters outside ISO-2022-JP, then there is just no way to send it in ISO-2022-JP.  I'm inclined to say that mainly the recipient is to be blamed when using inappropriate spam filters or crippled software (although I understand this is not always under their control).

b) the characters can all be represented in ISO-2022-JP.  This is the difficult one, because I assume that in 95% of the cases it would still be inappropriate to send that mail in ISO-2022-JP.  intl.fallbackCharsetList.* can help for power users, but I think TB should do more.  One example is that it could maintain a list of recipient domains where ISO-2022-JP encoding is used by default even for users who have not dug into tweaking intl.fallbackCharsetList.*  *.jp comes to mind immediately.

ToDo:
* set intl.fallbackCharsetList.* appropriately for users with Japanese locale (does Windows have something like a locale?)
* add some code to set intl.fallbackCharsetList.* appropriately, independent of sender locale but based on recipient domain

Just thinking out loud here.
(In reply to comment #4)
> So why don't make the intl.fallbackCharsetList.* stuff happen automagically, at
> least for jp users?
Because intl.fallbackCharsetList.ISO-2022-JP does not *prevent* fallback to other charsets.
Here is the fallback algorithm before bug 410333:
1. Inspect intl.fallbackCharsetList.* and automatically fallback.
2. If no appropriate encoding is found in intl.fallbackCharsetList.*, prompt the user to decide whether to send as UTF-8 or not.
3. Fallback to UTF-8.
Thanks to step 2, Japanese users were asked whether they wanted to fall back to UTF-8.
Bug 410333 removed step 2. Now, messages will always fall back to UTF-8 automatically. This is very undesirable behavior for Japanese users.
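The three steps above can be sketched roughly as follows. This is only an illustration in Python, not Thunderbird's code: `choose_send_charset`, the `fallback_list` argument and the `ask_user` callback are hypothetical names, and Python's codecs stand in for Mozilla's converters.

```python
def can_encode(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def choose_send_charset(text, current, fallback_list, ask_user=None):
    """Pick the charset to send in (hypothetical helper).

    fallback_list stands in for intl.fallbackCharsetList.<current>.
    ask_user models the "Send as UTF-8?" prompt removed by bug 410333;
    pass None to get the behaviour without the prompt.
    Returns a charset name, or None if the user cancels the send.
    """
    if can_encode(text, current):
        return current
    # Step 1: try the configured fallback charsets in order.
    for charset in fallback_list:
        if can_encode(text, charset):
            return charset
    # Step 2 (removed by bug 410333): let the user decide.
    if ask_user is not None and not ask_user():
        return None
    # Step 3: fall back to UTF-8.
    return "utf-8"
```

With `ask_user=None` a message containing, say, a Euro sign silently becomes UTF-8; with a callback that returns False the send is cancelled instead.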
(In reply to comment #6)
> (From update of attachment 358179 [details] [diff] [review])
> This dialog is not coming back.
So how can we prevent messages from falling back to UTF-8 accidentally? Probably you misunderstand our difficulty. Please reconsider.
> If there are any issues with intl.fallbackCharsetList.*, they should get fixed,
intl.fallbackCharsetList.* has no bugs at all. It is just useless for our purpose.
(In reply to comment #9)
> a) the mail contains characters outside ISO-2022-JP, then there is just no way
> to send it in ISO-2022-JP.
But then we can still read most of the remaining characters, at least. If messages are converted to UTF-8 and the recipient's software doesn't understand it, *all* Japanese characters will be garbled. This is unlike the ISO-8859-* cases, where you can always read all ASCII chars even if the message is converted to UTF-8. Hence automatic fallback is not a big problem for Western users. But it is fatal for Japanese users.
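A small Python illustration of this asymmetry, assuming a recipient that decodes UTF-8 bytes with the wrong charset. The helper `view_as` and the sample strings are made up for the example, and the exact garbage a real client shows will differ; the point is only which characters survive.

```python
def view_as(text, sent_as, read_as):
    """Encode text in one charset and decode the bytes as another."""
    return text.encode(sent_as).decode(read_as, errors="replace")


# Western text: only the accented letter is damaged, the ASCII survives.
western = view_as("Thank you for the café tip", "utf-8", "latin-1")

# Japanese text: none of the original characters survive.
japanese = view_as("ありがとうございます", "utf-8", "iso2022_jp")

print(western)
print(japanese)
```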
Keywords: regression
I would propose adding a special value for intl.fallbackCharsetList.* - "ask". So the algorithm from comment #10 would become:

1. Inspect intl.fallbackCharsetList.* and automatically fallback to first charset that contains all the symbols from the email.
2. If "ask" is encountered during "1" then pop up the UTF-8 dialog.
3. If nothing actionable is found during 1, fallback to UTF-8.

This way, one can add "intl.fallbackCharsetList.ISO-2022-JP = ask" to the default prefs, while everybody else could keep their "this dialog is not coming back" preference. A side benefit - power users could bring the dialog back if they specifically want it.
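A rough Python sketch of the proposed "ask" entry, under the same caveats as any illustration here: `resolve_fallback` and `prompt_user` are hypothetical names, and the list argument stands in for the value of intl.fallbackCharsetList.<charset>.

```python
ASK = "ask"  # the proposed special value


def can_encode(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def resolve_fallback(text, fallback_list, prompt_user):
    """Walk the fallback list; return a charset name or None on cancel."""
    for entry in fallback_list:
        if entry == ASK:
            # Step 2: "ask" reached, so show the UTF-8 dialog.
            return "utf-8" if prompt_user() else None
        if can_encode(text, entry):
            # Step 1: first charset that covers the whole message.
            return entry
    # Step 3: nothing actionable found.
    return "utf-8"
```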
Asking is bad since users don't know and shouldn't have to know anything about Unicode/UTF-8.

So the case that is "broken" is that there is no way to send a broken email? I don't understand why you would want to do that? Either you use the jp charset (as default) and get no problem, or then you have mixed content, and therefore would have to have utf-8 to be able to show all text.

Thunderbird has ISO-2022-JP as default, but I notice seamonkey doesn't 
http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja-JP-mac/mail/chrome/messenger/messenger.properties#264
http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja-JP-mac/suite/chrome/mailnews/messenger.properties#259

I don't think broken charset spam filters are a valid reason to do anything. (Those people can fix.)
(In reply to comment #13)
> Asking is bad since users don't know and shouldn't have to know anything about
> Unicode/UTF-8.

I can agree with this idea as an engineer. However, the idea is bad in Japanese marketing, probably. Because many Japanese people still think that non-ISO-2022-JP mail should not be used between Japanese people (the main reason was written in comment 11). Therefore, with the current behavior, the silent fallback may damage the sender's reputation with the receiver. Nobody in Japan can accept the risk.

I think we need to notify users of the problem with a message dialog, at least in the Japanese build. Of course, it's better if the message avoids technical terms, if we can.

# adding the Mozilla Japan members to CC.
(In reply to comment #14)
> marketing, probably. Because many Japanese people still think that
> non-ISO-2022-JP mail should not be used between Japanese people (the main

Sure, but then they would have that set as default, and write in that charset, so the UTF-8 conversion wouldn't happen, no?

I mean, the situation where UTF-8 conversion might happen is when a jp user replies to/forwards a foreign mail - and then it would most likely be the right thing to do so.
(In reply to comment #16)

> I mean, the situation where UTF-8 conversion might happen is when a jp user
> replies to/forwards a foreign mail - and then it would most likely be the right
> thing to do so.

Not necessarily. In my experience (my "normal" charsets are US-ASCII for English and KOI8-R for Russian), the UTF-8 issue comes up the most when I paste something from another application (e.g. from Firefox or from Word). Often it would be a problem with just a character or two - e.g. a non-standard quote symbol, or some such.

In fact, for my personal usage, I'd rather have not just a dialog, but a dialog with a "select all foreign symbols" option so that I could find that one offending symbol, replace it and happily send my mail in an encoding that I am sure the recipient's MUA can handle.
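Finding the offending symbols is simple in principle. A minimal Python sketch, where `unencodable_chars` is a hypothetical helper and Python's codec names are used in place of the MUA's charset list:

```python
def unencodable_chars(text, charset):
    """Return (index, character) pairs that charset cannot represent."""
    offending = []
    for index, ch in enumerate(text):
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            offending.append((index, ch))
    return offending


# Curly quotes pasted from a word processor are not in KOI8-R.
print(unencodable_chars("Привет “мир”", "koi8_r"))
```

A compose window could use such a list to highlight the characters instead of only offering a yes/no choice.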
(In reply to comment #10)
> Please reconsider.

Well, we're still talking, and this bug is still open... ;-)

(In reply to comment #11)
> > a) the mail contains characters outside ISO-2022-JP, then there is just
> > no way to send it in ISO-2022-JP.
> But we can see the most rest characters, at least.

Because all offending characters are replaced with "?"...

I'd propose a boolean pref "mailnews.force_send_default_charset" which, if true, would enforce the default composition charset on send (eg. convert all offending characters to ?), or, if false, would convert the message to UTF-8 (as it is now).

That way, even extensions could hook into the editor UI and provide more sophisticated choices like a recipient-dependent target charset, etc.

Basically, I don't think that any user should have to make a choice "do you prefer crippled over garbled?" - in fact, users shouldn't have to worry about charsets at all! (Thus I don't think "ask" is a useful fallback setting.)
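To make the two outcomes of the proposed pref concrete, here is a small Python sketch. It is not the patch: `encode_body` and its `force_charset` flag are hypothetical stand-ins for the behaviour with "mailnews.force_send_default_charset" set to true or false.

```python
def encode_body(text, charset, force_charset):
    """Return (body bytes, charset label) for an outgoing message."""
    try:
        return text.encode(charset), charset
    except UnicodeEncodeError:
        if force_charset:
            # "Crippled": keep the charset, unencodable characters become "?".
            return text.encode(charset, errors="replace"), charset
        # Behaviour after bug 410333: relabel and send the body as UTF-8.
        return text.encode("utf-8"), "utf-8"


crippled, label = encode_body("価格は5€です", "iso2022_jp", force_charset=True)
print(label, crippled.decode("iso2022_jp"))
```

In the forced case only the Euro sign is lost and the rest of the sentence stays readable for an ISO-2022-JP-only recipient.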
Oh, btw, "[ISO-2022] also defines a way to specify coding systems that do not follow its own structure. Of particular interest, the sequence ESC % G designates the UTF-8 coding system" (Wikipedia) - I don't think those insisting on ISO-2022-JP can handle that, so they do use a severely broken client anyway. ;-)
(In reply to comment #16)
> (In reply to comment #14)
> > marketing, probably. Because many Japanese people still think that
> > non-ISO-2022-JP mail should not be used between Japanese people (the main
> 
> Sure, but then they would have that set as default, and write in that charset,
> so the UTF-8 conversion wouldn't happen, no?

I guess that the characters outside ISO-2022-JP come from other documents via the clipboard, e.g. from web pages. The share of UTF-8 web pages is growing in Japan, maybe. Web pages can also contain characters that are outside the page's own encoding by way of entities. E.g., web pages in Japanese legacy encodings can contain the Euro sign, en dash, NBSP, yen sign (U+00A5), etc., but these are not in ISO-2022-JP. Also, some Kanji characters (Chinese characters) are not in ISO-2022-JP but are used in the names of Japanese people.
(In reply to comment #13)
> Asking is bad since users don't know and shouldn't have to know anything about
> Unicode/UTF-8.
In an ideal world, yes. Reality, however, is not that simple (at least for Japanese users).

> So the case that is "broken" is that there is no way to send a broken email? I
> don't understand why you would want to do that? Either you use the jp charset
> (as default) and get no problem, or then you have mixed content, and therefore
> would have to have utf-8 to be able to show all text.
It is rare that Japanese characters are not found in ISO-2022-JP. If we are warned, we can cancel, rewrite the message without the offending characters, and click "Send" again. And some Japanese users actually do so. If they send a mojibake mail, recipients will blame them and lecture them on the lengthy Unicode story...
Or we could click "Send anyway". Although a few characters will break, recipients can still read most of the message. Mobile phone gateways will do a similar lossy conversion anyway, because many Japanese mobile phones only support JIS-variant charsets. That is much better than a completely garbled (mojibake) message.
Thunderbird no longer offers either option. It is unacceptable for us.

> Thunderbird has ISO-2022-JP as default, but I notice seamonkey doesn't 
> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja-JP-mac/mail/chrome/messenger/messenger.properties#264
> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja-JP-mac/suite/chrome/mailnews/messenger.properties#259
An official SeaMonkey Japanese build is not provided. We are using a community build whose default mail charset is set to ISO-2022-JP.
http://seamonkey.mozilla.gr.jp/

(In reply to comment #16)
> Sure, but then they would have that set as default, and write in that charset,
> so the UTF-8 conversion wouldn't happen, no?
No. Some (but not many) Japanese characters are not included in the JIS charset. And the input method may not warn even if those characters are used. If we send a message with those characters, it will turn into UTF-8 silently.
(In reply to comment #18)
> I'd propose a boolean pref "mailnews.force_send_default_charset" which, if
> true, would enforce the default composition charset on send (eg. convert all
> offending characters to ?), or, if false, would convert the message to UTF-8
> (as it is now).
Will do.

> Basically, I don't think that any user should have to make a choice "do you
> prefer crippled over garbled?" - in fact, users shouldn't have to worry about
> charsets at all! (Thus I don't think "ask" is a useful fallback setting.)
We still want a cancel feature, so that the user can change the wording and avoid sending either crippled or garbled mail.
We may be able to extend the spellchecker feature to warn users about offending characters...
Based on comment #18.
I don't think this is a final and perfect solution. But I'll file a separate bug about further improvement.
Attachment #358179 - Attachment is obsolete: true
Attachment #358541 - Flags: review?(mnyromyr)
Taken from the first patch.
Careful users will notice the title change on saving. It will not disturb the user operations at all. Also, no string change is required.
Attachment #358542 - Flags: review?(mnyromyr)
Attachment #358541 - Attachment is obsolete: true
Attachment #358541 - Flags: review?(mnyromyr)
Sorry, the previous patch broke composing if mailnews.force_send_default_charset is not created.
Attachment #358542 - Attachment is obsolete: true
Attachment #358592 - Flags: review?(mnyromyr)
Attachment #358542 - Flags: review?(mnyromyr)
Do you think "mailnews.force_send_default_charset" should be localizable pref? If so, the type should be nsIPrefLocalizedString. See following patch:
https://bugzilla.mozilla.org/attachment.cgi?id=289202&action=diff

Then, the value can be "true" or not.
(In reply to comment #26)
> Do you think "mailnews.force_send_default_charset" should be localizable pref?

No. Forcing crippled messages should not be the decision of L10N.
Comment on attachment 358592 [details] [diff] [review]
Update window title when fallback has occured

I'll comment on MsgComposeCommands.js only once for both.

>+++ b/mailnews/compose/resources/content/MsgComposeCommands.js
>@@ -1789,28 +1789,40 @@ function GenericSendMessage( msgType )
>+        // Check encoding, switch to UTF-8 if the default encoding doesn't fit
>+        // and force_send_default_charset isn't set.
>         if (!gMsgCompose.checkCharsetConversion(getCurrentIdentity(), fallbackCharset))
>-          fallbackCharset.value = "UTF-8";
>+        {
>+          var forceDefault = false;
>+          try {
>+            forceDefault = sPrefs.getBoolPref("mailnews.force_send_default_charset");
>+          } catch (e) {
>+          }

Align braces vertically, but "catch (e) {}" can be in one line. (Yes, there are still other bracing style violations in this file.)
Furthermore, add mailnews.force_send_default_charset=false to mailnews.js with a suitable comment.

>+      if (gMsgCompose && originalCharset != gMsgCompose.compFields.characterSet)
>+        SetComposeWindowTitle();

Good idea. But this won't work if gCharsetTitle is set; the title won't get updated.
Call SetDocumentCharacterSet instead.

>+++ b/mailnews/compose/src/nsMsgCompose.cpp
>         if (NS_ERROR_UENC_NOMAPPING == rv && m_editor) {
>           PRBool needToCheckCharset;
>           m_compFields->GetNeedToCheckCharset(&needToCheckCharset);
>           if (needToCheckCharset) {
>-            CopyUTF16toUTF8(msgBody.get(), outCString);
>-            m_compFields->SetCharacterSet("UTF-8");
>+            PRBool forceInDefault = PR_FALSE;
>+            nsCOMPtr<nsIPrefBranch> prefBranch (do_GetService(NS_PREFSERVICE_CONTRACTID, &rv));
>+            if (prefBranch) {
>+              prefBranch->GetBoolPref("mailnews.force_send_default_charset", &forceInDefault);
>+            }
>+            if (!forceInDefault) {
>+              CopyUTF16toUTF8(msgBody.get(), outCString);
>+              m_compFields->SetCharacterSet("UTF-8");
>+            }
>           }
>         }
>         // re-label to the fallback charset
>         else if (!fallbackCharset.IsEmpty())
>           m_compFields->SetCharacterSet(fallbackCharset.get());

Align braces vertically, correcting the entire quote here. While you're at it, the else-if content branch should have braces as well and its comment should go inside its content.

>+++ b/mailnews/compose/src/nsMsgSend.cpp
>       if (NS_ERROR_UENC_NOMAPPING == rv) {
>         PRBool needToCheckCharset;
>         mCompFields->GetNeedToCheckCharset(&needToCheckCharset);
>         if (needToCheckCharset) {
>-          // Just use UTF-8 and be done with it.
>-          CopyUTF16toUTF8(bodyText, outCString);
>-          mCompFields->SetCharacterSet("UTF-8");
>+          // Just use UTF-8 and be done with it
>+          // unless force_send_default_charset is set.
>+          PRBool forceInDefault = PR_FALSE;
>+          nsCOMPtr<nsIPrefBranch> prefBranch (do_GetService(NS_PREFSERVICE_CONTRACTID, &rv));
>+          if (prefBranch) {
>+            prefBranch->GetBoolPref("mailnews.force_send_default_charset", &forceInDefault);
>+          }
>+          if (!forceInDefault) {
>+            CopyUTF16toUTF8(bodyText, outCString);
>+            mCompFields->SetCharacterSet("UTF-8");
>+          }
>         }
>       }
>       // re-label to the fallback charset
>       else if (!fallbackCharset.IsEmpty())
>         mCompFields->SetCharacterSet(fallbackCharset.get());
>     }

Same again here.
Attachment #358592 - Flags: review?(mnyromyr) → review-
(In reply to comment #27)
> (In reply to comment #26)
> > Do you think "mailnews.force_send_default_charset" should be localizable pref?
> 
> No. Forcing crippled messages should not be the decision of L10N.

I don't think so, because most users don't know/change such hidden pref. I think the value should be true in default settings of Japanese localized build.
(In reply to comment #29)
> I don't think so, because most users don't know/change such hidden pref. I
> think the value should be true in default settings of Japanese localized build.
I don't think we should disable fallback from non-ISO-2022-JP charsets.
What about changing pref name to "mailnews.disable_fallback_to_utf8.<charset>" and setting "mailnews.disable_fallback_to_utf8.ISO-2022-JP" by default?
As I said in comment #11, fallback from ISO-8859-* is not so bad. It is not locale dependent but charset dependent. So I don't think this pref should be localizable.
That being said, we don't have to set even "mailnews.disable_fallback_to_utf8.ISO-2022-JP" by default, IMO. Rather, we should evangelize IMC recommendation.
http://www.imc.org/mail-i18n.html
As I said in comment #1, some people in Japan (mis)believe that they can't use UTF-8 in the mail body unless they have a private agreement with each other.
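The per-charset lookup could work along these lines. This is a hedged Python sketch, with a plain dict standing in for the preference service and `fallback_disabled` / `label_for_send` as hypothetical helper names; only the pref name itself comes from the proposal above.

```python
PREF_PREFIX = "mailnews.disable_fallback_to_utf8."


def fallback_disabled(prefs, charset):
    """Look up the per-charset pref; a missing pref means fallback is allowed."""
    return bool(prefs.get(PREF_PREFIX + charset, False))


def label_for_send(text, charset, prefs):
    """Return the charset label the message would be sent with."""
    try:
        text.encode(charset)
        return charset
    except UnicodeEncodeError:
        # Keep the charset only if fallback is disabled for this charset.
        return charset if fallback_disabled(prefs, charset) else "utf-8"


prefs = {PREF_PREFIX + "ISO-2022-JP": True}
print(label_for_send("こんにちは€", "ISO-2022-JP", prefs))
print(label_for_send("abc€", "ISO-8859-1", prefs))
```

Because the key includes the charset, ISO-8859-* users keep the automatic fallback while ISO-2022-JP users can opt out of it.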
Attached patch updated per comments (obsolete)
(In reply to comment #28)
> Good idea. But this won't work if gCharsetTitle is set; the title won't get
> updated.
True, however it will break the charset menu in the following scenario:
1. Set mail default encoding to ISO-2022-JP.
2. Open compose window.
3. Type characters which are not included in ISO-2022-JP.
4. Save the message being composed. SetDocumentCharacterSet() will be called, then gCurrentMailSendCharset will be set.
5. Select Options - Character encoding. InitCharsetMenuCheckMark will be called, however the menu item will not be updated because gCurrentMailSendCharset is already set.

I added code which clears gCharsetTitle manually to fix both issues.
Attachment #358592 - Attachment is obsolete: true
Attachment #358683 - Flags: review?(mnyromyr)
(In reply to comment #30)
> (In reply to comment #29)
> > I don't think so, because most users don't know/change such hidden pref. I
> > think the value should be true in default settings of Japanese localized build.
> I don't think we should disable fallback from non-ISO-2022-JP charsets.
> What about changing pref name to "mailnews.disable_fallback_to_utf8.<charset>"
> and setting "mailnews.disable_fallback_to_utf8.ISO-2022-JP" by default?

good idea! It's very useful for tb nightly testers in Japan, probably.
Attachment #358683 - Flags: review?(mnyromyr) → review-
Comment on attachment 358683 [details] [diff] [review]
updated per comments

>+++ b/mail/components/compose/content/MsgComposeCommands.js
>+            forceDefault = getPref("mailnews.disable_fallback_to_utf8." + originalCharset);

I would have proposed renaming the pref anyway, because we don't actually enforce the default charset but the last one set to the message - but using a charset-dependent pref is even better, agreed.

>+// mailnews.disable_fallback_to_utf8.<charset>
>+// don't fallback from <charset> to UTF-8 even if some characters are not found in <charset>.
>+// those characters will be crippled.
>+pref("mailnews.disable_fallback_to_utf8.ISO-2022-JP", false);

Especially with this default. :)


Alas, it doesn't quite work yet, so we should have a look at what should happen on save (and incidentally later on send, which we can't actually see then):
- the currently used/needed charset should appear in the window title
- the currently used/needed charset should be marked in Options->Character Encoding

The current patch doesn't do the second:
- set Latin-1 as default encoding
- compose a new mail with ISO-2022-JP characters and Unicode characters
- set Options->Character Encoding to ISO-2022-JP
- save
=> Title will have "UTF-8" in it, but Options->Character Encoding still shows ISO-2022-JP.

Basically, I think that the automatic conversion on save etc. should have the same visual effect as changing the Options->Character Encoding setting.
Comment on attachment 358683 [details] [diff] [review]
updated per comments

>+          PRBool forceInDefault = PR_FALSE;

Oh, and probably better to rename this to "disableFallback".
(In reply to comment #34)
> - set Options->Character Encoding to ISO-2022-JP
This operation sets gCurrentMailSendCharset.
> - save
> => Title will have "UTF-8" in it, but Options->Character Encoding still shows
> ISO-2022-JP.
Therefore InitCharsetMenuCheckMark refused to update the charset menu. Hmm...
I removed gCurrentMailSendCharset. It was only used by InitCharsetMenuCheckMark.
Attachment #358683 - Attachment is obsolete: true
Attachment #358712 - Flags: review?(mnyromyr)
Comment on attachment 358712 [details] [diff] [review]
resolved review comments
[Checkin: Comment 41]

Looks good now!

>+// mailnews.disable_fallback_to_utf8.<charset>
>+// don't fallback from <charset> to UTF-8 even if some characters are not found in <charset>.
>+// those characters will be crippled.
>+pref("mailnews.disable_fallback_to_utf8.ISO-2022-JP", false);

Technically, we don't need this change to mailnews.js anymore. But I think it's worth having it in, so that it will show up in about:config.
Attachment #358712 - Flags: review?(mnyromyr) → review+
Comment on attachment 358712 [details] [diff] [review]
resolved review comments
[Checkin: Comment 41]

Thank you!
Asking for sr.
Attachment #358712 - Flags: superreview?(bienvenu)
Comment on attachment 358712 [details] [diff] [review]
resolved review comments
[Checkin: Comment 41]

thx for the patch!
Attachment #358712 - Flags: superreview?(bienvenu) → superreview+
Keywords: checkin-needed
Comment on attachment 358712 [details] [diff] [review]
resolved review comments
[Checkin: Comment 41]


http://hg.mozilla.org/comm-central/rev/b7615869fd7b
Attachment #358712 - Attachment description: resolved review comments → resolved review comments [Checkin: Comment 41]
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: Thunderbird 3.0b3 → Thunderbird 3.0b2
UTF-8 corrupts NONASCII characters down the workflow
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
PMJI, that particular question seems about to be solved; however, after years of dealing with this question I think there is a misunderstanding that could make it resurface relatively quickly, so I will lay it out. Thx to anyone using, commenting on, complementing, or correcting it if necessary.

(In reply to comment #9)
> I think Japanese mail users can't expect the rest of the world to bend over backwards just because Unicode is not popular in Japan for political reasons.

I think the problem is NOT that Japanese people are backward, nor *Unicode*, but *UTF-8* not being ready yet (despite its high quality, good achievements already, and promising future). In fact it appears that:

- Unicode is not involved here. Only one of its practical implementations, UTF-8, is.
- UTF-8 works as intended in many occurrences, even large ones involving long workflows with a lot of editing by end-users (e.g. Wikipedia; parts of Microsoft or of Amazon). However too many people deny (because they are unaware of it) the sheer complexity of implementing UTF-8 *in the real workflow of the real world*, i.e. taking into account all the interfaces with the existing charsets, users, habits, systems; as a result, many programs or systems are corrupting NONASCII characters, in VERY FREQUENT circumstances, that have not been clearly identified yet (if they had, after so many years the problem would be fixed), but that always involve UTF-8 (that corruption never happens if UTF-8 is not used anywhere in the workflow); it seems it is mostly when transmitting NONASCII characters through different layers of programs, systems, protocols or charsets, of which at least one uses UTF-8. Quoted-Printable makes such transmissions sturdier generally, yet doesn't suffice here.
- 1st example, editing in OE (Outlook Express) the HTML source of an email containing NONASCII chars: see http://www.sitepoint.com/forums/showthread.php?t=450442&page=2#post3250318 "Please post successful test of source-editing UTF-8 European HTML" of Sun 21 Jan 2007 16:39 GMT
- 2nd example (very frequent), when a site replies with an UTF-8 email to a NONASCII text received in their email or posted on their site, see http://www.sitepoint.com/forums/showthread.php?t=613859#post4240270 "See the corruption happen" of Thu 30 Apr 2009 20:54 GMT
- (in either example, don't be impressed by post-count-appointed gurus clucking against any new idea or blowing hot air when short of arguments or knowledge. In addition, being a rare survivor of the race being exterminated, I am accustomed to be chased, in a masterfully, pitiless and vastly organized manner, whatever, however, wherever, whenever I post)

RESULT IS:
- countries EXCLUSIVELY using ASCII (USA, UK, AU) have NO PROBLEM AT ALL using UTF-8 (and no benefit either); self-appointed gurus will continue to impose UTF-8, with no opposition
- languages MOSTLY using ASCII but with SIGNIFICANT yet minority share of NONASCII chars (Western Europe), will (too often, but not always) get text badly crippled, yet remaining readable (at a price); individual users and many companies big or small will silently switch back from UTF-8 to specific charsets (or replace national chars with ASCII, making their mail inelegant yet more readable), and gurus will be able to continue touting UTF-8, with little effect yet no vocal opposition
- languages MOSTLY NONASCII (e.g. Japan) will (in the same circumstances, i.e. not general, but frequent and hard to identify) become COMPLETELY UNREADABLE; so, whatever their power, gurus will NOT be able to impose UTF-8 at all. So in short, it is NOT Japanese end-users who are uneducated, it is the UTF-8 gurus.

This is why I have recommended for a couple of years that users, *while waiting for UTF-8 to grow up and fulfill its promises*, use ISO-8859-1 (or the corresponding charsets outside Western Europe) and Quoted-Printable; details in http://groups.google.com/group/microsoft.public.outlookexpress.general/browse_frm/thread/d7a4c969ef3c8412 « For Long URLs, Accentuated Chars, encode as Quoted-Printable, Western European (ISO), use "EUR" for Euro symbol » of Sun 19 Nov 2006 18:56 GMT.

Versailles, Thu 7 May 2009 18:29:35 +0200
(In reply to comment #42)
> UTF-8 corrupts NONASCII characters down the workflow

What you posted here is just plain wrong. Yes, we're sometimes seeing reluctance from latin-1 countries against UTF-8, using various arguments like this. But looking from a perspective of latin-2 country, switch to UTF-8 was the only sane solution.

You probably know, that latin-2 and windows-1250 are not compatible and differ in several characters. Thus a charset corruption was seen here on daily basis just because a text was transferred between different OSes with no chance to fix that.

Instead, we implemented UTF-8 which is self-detectable and easily distinguishable from both latin-2 and windows-1250. Since then, the rate of such failures went down to insignificant minimum.
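Both points can be checked with Python's standard codecs. In this sketch the sample word and the helper `looks_like_utf8` are chosen purely for illustration; the check is the usual "does it decode as strict UTF-8" test, not any particular mail client's detector.

```python
def looks_like_utf8(data):
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "žluťoučký"
latin2 = word.encode("iso8859_2")
cp1250 = word.encode("cp1250")
utf8 = word.encode("utf-8")

# The two legacy charsets use different bytes for the same letters,
# so text labelled with the wrong one is silently corrupted.
print(latin2 != cp1250, latin2.decode("cp1250"))

# Legacy bytes are not valid UTF-8, so they are easy to tell apart from it.
print(looks_like_utf8(latin2), looks_like_utf8(cp1250), looks_like_utf8(utf8))
```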

As for your examples:

#1 is clearly a fault in ONE application. People should have complained to Microsoft and requested a fix for that. I bet there are multiple applications which have bugs also for latin-1.

#2 means the site admin doesn't understand UTF-8 or his job. If any sites send back emails claiming UTF-8 charset and containing unconverted latin-1 text, their automated scripts are broken. Again, this is no fault of UTF-8, in latin-2/win1250 region we got this right long time ago.

So posts like yours recommending to prefer latin-1 and avoid UTF-8 are not doing a favour to anyone. We're both living in the EU where 10 different legacy charsets are in use and prolonging wrong solutions instead of teaching site admins how to do their jobs better is the exact opposite of what's really needed.
UTF-8, good in theory, is unreliable in email practice
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
(In reply to comment #43)
Thx "Petr Hroudný" for replying (Mon 07 Sep 2009 11:11:16 GMT). However:

> What you posted here is just plain wrong
May I remind you that such a "comment" most often denotes an author who doesn't bother about truth, or to check or think about what he writes, which immediately turns away a large number of readers, and most often proves "wrong" itself in its very first lines. Let's see if yours is an exception.

> Yes, we're sometimes seeing reluctance from latin-1 countries against UTF-8
There is, from my experience at least, absolutely no "reluctance"; on the contrary, there may be *addiction* to UTF-8. *Only real experience in the real world* pushes a number of people to switch back from UTF-8 *in their acts*, but generally *not in their words*: most of them continue to tout (as you do today) UTF-8 as "the" solution.

> from a perspective of latin-2 country, switch to UTF-8 was the only sane solution... the rate of such failures went down to insignificant minimum
In Latin-1 countries too, people first switched to UTF-8. Only later are the problems in real life making some of them switch back.
While not mandatory, bringing a few precise and verifiable examples would help the discussion (I will try to add some on my side).

> UTF-8 which is self-detectable and easily distinguishable from both latin-2 and windows-1250
It is distinguishable *if the text contains characters in the above-127 part of the charset*. If the text is entirely in English, thus entirely in ASCII (which is the 0-127 part common to all the ISO/IEC 8859 charsets, in particular the 10 Latin ones, including of course Latin-1 and Latin-2), then UTF-8 will NOT be distinguishable, which explains why American people write "in UTF-8": actually they are simply writing in ASCII, which of course is NOT affected by UTF-8 problems. In addition, if the text contains chars in the above-127 parts of *several different* charsets, that "distinction" may become misleading or dubious.
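
The ASCII point can be checked with a short Python sketch (illustrative; the codec list and sample strings are arbitrary choices, not taken from the comment):

```python
CODECS = ("ascii", "utf-8", "iso8859_1", "iso8859_2", "cp1250")

# Pure-ASCII text: every codec yields the very same bytes.
english = "Plain English text"
english_forms = {english.encode(codec) for codec in CODECS}

# A single accented letter: the bytes now depend on the codec.
# "ascii" is skipped because it cannot encode the letter at all.
accented = "é"
accented_forms = {accented.encode(codec) for codec in CODECS[1:]}

print(len(english_forms))   # one distinct byte string
print(accented_forms)       # more than one distinct byte string
```

With only bytes below 128 there is nothing to detect, because the encodings coincide; the difference appears with the first byte above 127.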

> #1 is clearly a fault in ONE application... I bet there are multiple applications which have bugs also for latin-1
(recall: this #1 was "editing in OE... the HTML source of an email containing NONASCII chars")
Unfortunately those apps are the ones 80% of people in the world are using. So, no matter what you or I do or like, our writings WILL go through these apps and suffer the problems, with a frequency that, low and acceptable for web pages, is too high to be acceptable in email.

> #2 means the site admin doesn't understand UTF-8 or his job... Again, this is no fault of UTF-8, in latin-2/win1250 region we got this right long time ago
(recall: this #2 was, when a site replies with a UTF-8 email to a NONASCII text received in their email or posted on their site)
Again, the problem is NOT if a "fault" was made and by whom, it is that this misunderstanding DOES HAPPEN IN REAL LIFE, and is "very frequent", when UTF-8 is used, while very infrequent when ISO-8859-1 (or another fixed-length charset) is used.
Anyway, if you could bring a couple of examples, with all the necessary precision, this would greatly help the discussion (and possibly make both of us agree on more points).

> posts like yours recommending to prefer latin-1 and avoid UTF-8
I was recommending more precisely ISO-8859-1, which (in addition to covering a vast number of people and countries) is the default encoding for HTML and MIME.

> prolonging wrong solutions instead of teaching site admins how to do their jobs better
The "wrong" and "better" words are your appreciations. They do match *some* people's *speech*, NOT *all* people's *acts*.

Versailles, Mon 07 Sep 2009 14:52:00 +0200
Just a reminder, this bug is closed. Thus, please refrain from adding lengthy philosophical comments and move those discussions to newsgroups or forums.
Nobody is forced to use the new preferences, those who wish to can. If you think that further work is needed, open a new bug. Thanks.
OK, just a short comment to #44:

Look, we did a switch to UTF-8 at our University in May 2007, so what I'm telling here is based on >2 years of real life experience.

#1 Please don't tell me that 80 % of users are editing HTML source in OE.

#2 You actually hit the point - ISO-8859-1 works "better" for you, since it's the default in HTML and also e.g. MySQL. Yes, lame admins don't bother to change defaults and then they're surprised things don't work as expected.

Final remark - this bugzilla and a lot of other apps clearly prove UTF-8 works as expected. I posted this comment using the web form - you'll get it as a UTF-8 encoded email. And all accented chars (éçčšťžýáíé) are displayed fine in any decent email client.