Closed Bug 136664 Opened 22 years ago Closed 10 years ago

charset in header should use lowest common denominator charset

Categories

(MailNews Core :: MIME, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bobj, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intl)

Attachments

(1 file)

When sending email in Windows-1252 encoding, the charset in the
content-type header should use the "lowest common denominator" charset:

 - If there is only ASCII text in the mail, use "US-ASCII".
   Example: "abcde"
 - If there are non-ASCII characters, but all are within the ISO-8859-1
   charset, use "ISO-8859-1".  Example: "àbcdê "
 - If there are non-ISO-8859-1 characters (e.g., smart quotes, Euro), use
   "Windows-1252".  Example: "‘àbcd€’"


Currently, Mozilla will send all three cases (above) as "Windows-1252".
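
For illustration, the three cases above could be sketched as follows. This is
a Python sketch, not Mozilla's C++ code; the function name and the candidate
table are hypothetical, and Python's own codecs stand in for the mail
charset converters.

```python
# Candidate MIME labels, narrowest first, paired with the Python codec
# used to test whether the text fits. The table itself is an assumption.
CANDIDATES = [
    ("US-ASCII", "ascii"),
    ("ISO-8859-1", "iso-8859-1"),
    ("Windows-1252", "cp1252"),
]

def lowest_common_charset(text):
    """Return the narrowest label whose codec can encode text, or None."""
    for label, codec in CANDIDATES:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # does not fit, try the next wider charset
        return label
    return None

print(lowest_common_charset("abcde"))    # US-ASCII
print(lowest_common_charset("àbcdê "))   # ISO-8859-1
print(lowest_common_charset("‘àbcd€’"))  # Windows-1252
```

Text that fits none of the three (for example CJK text) returns None, so a
caller would still need a wider charset to fall back on.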

Excerpt from http://www.ietf.org/rfc/rfc2046.txt

   In general, composition software should always use the "lowest common
   denominator" character set possible.  For example, if a body contains
   only US-ASCII characters, it SHOULD be marked as being in the US-
   ASCII character set, not ISO-8859-1, which, like all the ISO-8859
   family of character sets, is a superset of US-ASCII.  More generally,
   if a widely-used character set is a subset of another character set,
   and a body contains only characters in the widely-used subset, it
   should be labelled as being in that subset.  This will increase the
   chances that the recipient will be able to view the resulting entity
   correctly.

We probably should look into this for GB18030, GBK and GB2312 too.
It would be a simple change for the windows-1252 case.
http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsMsgCompUtils.cpp#788

I think it could also be applied to other charsets, but 7-bit charsets
like ISO-2022-JP (bug 86255) need to be handled differently.
So.. is this a duplicate of bug 86255 (sure sounds like it).  Is it a dependency?
This bug is more general.  Bug 86255 is specific to iso-2022-jp, so if anything
that one should be made a dup of this bug.  But note the comment from that bug

http://bugzilla.mozilla.org/show_bug.cgi?id=86255#c6
  If we fix this we should beware of pitfalls when one encoding character set
  is almost a subset of another, but not quite. See bug 4238 for an example
  -- 0x5C in Japanese charsets represents Unicode U+00A5, not ASCII 0x5C.
There are actually 2 codepoints in the 7-bit range which map
differently depending on whether the charset is Japanese:

CODE POINT     ASCII      JIS X 0201
==========   =========    ==========
  0x5C       backslash     yen sign
  0x7E       tilde         overline

CHARACTER    UNICODE VALUE
=========    =============
backslash    0x005C
yen sign     0x00A5
tilde        0x007E       
overline     0x203E

So if we test for ASCII before converting from Unicode, then
testing for <= 0x007F will work.  If we test after converting from
Unicode, then we have to add a special-case check for 0x5C and 0x7E
for Japanese.
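
A small sketch of the two places the test could go. The function names and
the exceptions table are hypothetical; the table only records the two
JIS X 0201 differences listed above and is not a real converter.

```python
# The two 7-bit code points where JIS X 0201 differs from ASCII,
# with the Unicode character each one represents in Japanese charsets.
JIS_ROMAN_DIFFERENCES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # yen sign, overline

def is_ascii_before_conversion(text):
    """Test on the Unicode string: every code point is at most U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)

def is_ascii_after_conversion(data, japanese):
    """Test on converted bytes: 7-bit only, minus the Japanese exceptions."""
    for byte in data:
        if byte > 0x7F:
            return False
        if japanese and byte in JIS_ROMAN_DIFFERENCES:
            return False  # 0x5C or 0x7E may be a yen sign or an overline
    return True

# Before conversion the yen sign is plainly non-ASCII.
print(is_ascii_before_conversion("\u00a5100"))  # False

# After conversion it has become the byte 0x5C, so a plain 7-bit test
# would pass; only the special case catches it.
converted = bytes([0x5C]) + b"100"
print(is_ascii_after_conversion(converted, japanese=False))  # True
print(is_ascii_after_conversion(converted, japanese=True))   # False
```

The post-conversion check is conservative: it also rejects a genuine
backslash or tilde in Japanese text, which is one reason testing the Unicode
string is simpler.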
Another case we should consider: GB18030 -> GB2312
Currently GB2312 is more commonly used and likely to be supported by any mail
app that supports Simplified Chinese.  Probably the S. Chinese text in most
messages today is covered by GB2312.
QA contact to myself.
Keywords: intl
QA Contact: gayatri → ji
Reassign to nhotta.

Probably we need a mechanism to specify editable lists (e.g. pref, property) for
the charsets which need the mapping.
To implement this, we need to convert more than once, depending on the content
of the header/body. We may use the fallback charset mechanism to try converting
to multiple charsets (e.g. us-ascii, gb2312, gb18030). But that would be slow
because a Chinese message usually cannot be converted to us-ascii. The us-ascii
check may be substituted by a 7 bit check for the Chinese case but not for the
Japanese case (ISO-2022-JP is a 7 bit encoding).
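
A rough sketch of such a list-driven fallback, in Python rather than the
mailnews C++ code. The table contents and the function name are hypothetical
(in practice the table would come from a pref or property), and Python's
codecs stand in for the real converters. The us-ascii entry is handled with a
cheap 7 bit test on the Unicode text instead of a full conversion.

```python
# Hypothetical editable table: for each charset the user selects,
# the labels to try in order, narrowest first.
FALLBACK_LISTS = {
    "gb18030": ["us-ascii", "gb2312", "gb18030"],
    "windows-1252": ["us-ascii", "iso-8859-1", "windows-1252"],
}

def pick_charset(text, selected):
    """Return the first charset in the list that can encode text, or None."""
    # Charsets without an entry are tried as-is, with no narrowing.
    for charset in FALLBACK_LISTS.get(selected, [selected]):
        if charset == "us-ascii":
            # 7 bit check on the Unicode string, no converter involved.
            if text.isascii():
                return charset
            continue
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    return None

print(pick_charset("hello", "gb18030"))         # us-ascii
print(pick_charset("\u4e2d\u6587", "gb18030"))  # gb2312
print(pick_charset("\U00020000", "gb18030"))    # gb18030
```

The last example is a character outside the Basic Multilingual Plane, which
GB2312 cannot represent, so the list falls through to gb18030.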

An alternative approach would be to do the check while the user is editing.
This may work for the body.

Assignee: ducarroz → nhotta
> The us-ascii check may be substituted by a 7 bit check for the Chinese case
> but not for the Japanese case (ISO-2022-JP is a 7 bit encoding).
We may do the check while the body is Unicode and no special cases are needed.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.1beta
This link: http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen
 "Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)"

provides good info on the yen sign, etc. ambiguity problem.
The patch moves the ASCII check right before we convert Unicode to a mail
charset.
By doing this, we can skip the convert manager overhead if the body is ASCII
only.
After the conversion compFields remembers the result and it can be used later
in the code.
Additional changes are needed to actually label any 7 bit only body as us-ascii
regardless of the mail charset. That part is not included in this patch.
The ASCII check is also needed after the Unicode conversion because the 8 bit
string may turn into entities like &aacute; or &euro; in case of HTML mail.
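
To illustrate why the second check matters, here is a small Python sketch.
The helper name is hypothetical and it replaces every non-ASCII character
with an entity, which is an assumption made for the example and not a
description of when the compose code actually emits entities.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace each non-ASCII character with a named or numeric entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#" + str(cp) + ";")
    return "".join(out)

body = "caf\u00e9 5\u20ac"
converted = to_entities(body)

print(converted)            # caf&eacute; 5&euro;
print(body.isascii())       # False: the check before conversion says 8 bit
print(converted.isascii())  # True: the converted body is 7 bit only
```

A body that fails the ASCII check while it is Unicode can therefore still
end up 7 bit only once it has been converted.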
I realized that the current patch is actually for bug 86255
MIME charset header is incorrect when msg contains only ASCII characters.
So I put a new patch to that bug.

For this bug, we can do the mapping when we set a charset. For example, if the
user chooses GB18030 then we can map it to GB2312. Later, when we convert, it
may fail depending on the text contents. We can supply a fallback list for that
case.
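
A minimal sketch of that idea, again in Python with hypothetical names: the
mapping table, the fallback table and convert_body are made up for the
example, and Python's codecs stand in for the real converters.

```python
# Hypothetical mapping applied when the charset is set: selected -> narrower.
CHARSET_MAP = {"gb18030": "gb2312"}
# Hypothetical fallback list, used only if converting to the mapped
# charset fails for this particular text.
FALLBACKS = {"gb2312": ["gb18030"]}

def convert_body(text, selected):
    """Return (charset label, encoded bytes) for the body text."""
    mapped = CHARSET_MAP.get(selected, selected)
    for charset in [mapped] + FALLBACKS.get(mapped, []):
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue  # text does not fit, try the next fallback
    raise ValueError("no charset in the fallback list can encode the text")

print(convert_body("\u4e2d\u6587", "gb18030")[0])  # gb2312
print(convert_body("\U00020000", "gb18030")[0])    # gb18030
```

Unlike trying every charset up front, this converts only once in the common
case and pays for a second conversion only when the narrower charset fails.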
Target Milestone: mozilla1.1beta → ---
This behaviour causes compatibility problems with other MUAs. See bug 247958 for
details. In summary, OE silently discards information when replying. The patch
at bug 247958 prevents this behaviour (as added by bug 86255) by default, adding
a pref to allow a user to re-enable it if desired.
Product: MailNews → Core
Depends on: 296233
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
QA Contact: ji → mime
Product: Core → MailNews Core
The world moved on. When in doubt use UTF-8 and everyone will be happy.

-> WONTFIX
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX