Open Bug 71551 Opened 24 years ago Updated 3 years ago

Add charset=unknown-8bit when an attachment is of .txt type

Categories

(MailNews Core :: MIME, defect)

Tracking

(Not tracked)

Future

People

(Reporter: momoi, Unassigned)

Details

** Observed with 3/9/2001 Win32 build **

When a .txt file is attached to a message, Mozilla currently does not create the charset parameter in the Content-Type header. For example, the following are the typical headers for such an attachment (for multipart messages):

Content-Type: text/plain; name="mysigJ.txt"
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="mysigJ.txt"

According to RFC 2046 and RFC 2049, it seems that the charset parameter must be present if we declare the Content-Type as text/plain and the attached body part originally contains 8-bit bytes, i.e. before applying a transfer encoding such as Base64 or Quoted-Printable. However, the original motivation for leaving out the charset parameter is precisely that there is no easy way to discern the charset of .txt files. Unlike HTML, there is no formal way to embed charset info in .txt files.

jgmyers suggested that we can probably make use of the information in RFC 1428 (above URL). Though 'unknown-8bit' was originally intended for gateway servers that convert non-MIME mail to MIME, it is probably not illegal to use this charset value for the current purpose. This proposal has some merits:

1. RFC 2046 states that the default charset when the charset parameter is missing is US-ASCII. Thus, receiving agents are obligated to interpret text/plain multipart body parts without a charset parameter as US-ASCII. Mozilla currently either applies auto-detection or settles on the View Default charset instead.
2. The use of 'charset=unknown-8bit' then addresses the problem in 1. RFC 1428 states that a body with this charset parameter can be interpreted as seen fit by receiving agents.
3. RFC 1428 also states: "This character set is not intended to be used by mail composers. It is assumed that the mail composer knows the character set in use and will mark it with a character set value as specified in [1], ..." However, what we have here is precisely the case where the mail composer does not know the charset of an external .txt file.

Additional notes:

A. The proposal then is to use 'charset=unknown-8bit' when attaching .txt files.
B. We also need to do an 8-bit check on such .txt files. If a .txt file contains 8-bit bytes or the escape sequences used by ISO-2022-xx encodings, then regard it as 8-bit. Otherwise, regard it as US-ASCII.
C. In decoding such a charset parameter, we need to apply auto-detection if an auto-detect module is chosen, and then apply the view default charset as the final fallback in case no auto-detection is applied.

The end result of all this is that decoding will behave exactly as it did prior to this fix, but the creation of such body parts will now comply with RFC 2046 and RFC 2049. It has been reported that some mail agents have difficulty displaying body parts without the charset parameter because they interpret them as US-ASCII. If such body parts are in anything other than US-ASCII, e.g. Japanese Shift_JIS, this would cause a display problem in some mail viewing programs.
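Notes A and B above can be sketched roughly as follows. This is only an illustration in Python, not Mozilla code: the function names are hypothetical, and treating any ESC byte (0x1B) as an ISO-2022-xx escape sequence is a simplifying assumption.

```python
ESC = 0x1B  # ISO-2022-xx encodings switch character sets with ESC sequences


def guess_attachment_charset(data: bytes) -> str:
    """Pick a charset label for a .txt attachment (hypothetical helper).

    Returns 'unknown-8bit' if the data has bytes >= 0x80 or an ESC byte,
    and 'us-ascii' otherwise.
    """
    has_high_bit = any(b >= 0x80 for b in data)
    has_escape = ESC in data  # assumption: any ESC counts as ISO-2022 usage
    if has_high_bit or has_escape:
        return "unknown-8bit"
    return "us-ascii"


def content_type_header(filename: str, data: bytes) -> str:
    """Build the Content-Type header line with the charset parameter added."""
    charset = guess_attachment_charset(data)
    return f'Content-Type: text/plain; charset={charset}; name="{filename}"'
```

A stricter check would look for the specific escape sequences of each ISO-2022 variant rather than a bare ESC byte.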
Note that currently 'charset=unknown-8bit' is treated as ISO-8859-1. We probably should apply the default viewing charset in such a case. We should also check what we are really doing when the charset parameter is missing in received messages. I would think that the correct behavior in such a case would be:

1. Apply auto-detection if one is selected and ON.
2. When there is no auto-detection module ON, then assume that it is US-ASCII.
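The receiving-side order described in this comment might look like the sketch below. It is an assumption-laden illustration, not the actual decoder: `resolve_charset`, the `detector` callback, and the `view_default` parameter are all hypothetical names.

```python
from typing import Callable, Optional

# Hypothetical detector type: takes raw bytes, returns a charset name or None.
Detector = Callable[[bytes], Optional[str]]


def resolve_charset(
    declared: Optional[str],
    data: bytes,
    detector: Optional[Detector] = None,
    view_default: str = "iso-8859-1",
) -> str:
    """Decide which charset to decode a text/plain body part with."""
    # A real, known charset label is used as declared.
    if declared and declared.lower() != "unknown-8bit":
        return declared
    # Missing or unknown-8bit: try auto-detection first if it is ON.
    if detector is not None:
        guess = detector(data)
        if guess:
            return guess
    # No charset parameter and no detection: RFC 2046 default.
    if not declared:
        return "us-ascii"
    # unknown-8bit and no detection: fall back to the default viewing charset.
    return view_default
```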
http://bugzilla.mozilla.org/show_bug.cgi?id=71541 takes care of the default charset issue mentioned in the notes immediately above this comment.
Points raised here have been discussed extensively in Bugzilla-Japan (if you can read Japanese): http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=727
Momoi san, I have a couple of questions.

* In your additional note B, is that check for sending or viewing?
* For proposal A, what kind of impact would it have on users of existing mailers (e.g. NS4.x, NS6)?
Status: NEW → ASSIGNED
> In your additional notes B, is that check for sending or viewing?

This is for sending only. In order to append 'unknown-8bit', the .txt data needs to be truly 8-bit.

> The proposal A, what kind of impact to the users of existing mailers (e.g. NS4.x, NS6).

I don't think this will break NS4.x. My recollection is that we were not paying attention at all to the multipart charset parameter in NS4.x. For NS6, it is more complicated. Bug 71541 points to dealing with unknown-8bit as "ISO-8859-1". jgmyers is proposing to change that to "default viewing charset". I think this is the right approach since RFC 1428 says that it is up to the receiving mail agent to decide the charset of such a body part.

What about IE, Eudora and other mailers? RFC 1428 is not on the standards track. It is informational. If other mailers use this RFC as a guideline, then they would have their own way of dealing with "unknown-8bit". But if they don't know how to deal with this charset name but can deal with no charset name, then that is somewhat of a concern. If anyone reading this report has used other mailers with the 'unknown-8bit' charset name in messages, please help. I'll try to create data for this soon.
I don't think we want to do this soon since it would affect existing users. Other options could be:

* Have a flag to force the use of the OS file system (or user-selected) charset for text/plain.
* UI to allow the user to set a charset per attachment (e.g. an attachment info dialog invoked from the message compose view).
Target Milestone: --- → Future
I like the combined approach, i.e. the default would be the system charset, but by right-clicking on a particular attachment in the attachment pane, we could bring up a dialog to change the charset.

As for breaking existing users if we adopt charset=unknown-8bit: for NS6.0/Mozilla M18, what would happen if they receive a multipart marked with unknown-8bit? It would presumably not display it correctly because it is not a known charset. Would auto-detection not apply? If so, that would be a problem. If auto-detection is not ON, then I don't see that it makes any difference whether charset=(null) or charset=unknown-8bit. In either case, it would be interpreted as ISO-8859-1.

In any case, if we are allowing the user an option to set the charset, then this proposal can be postponed or even tabled.
I filed bug 72116 about the dialog for attachment info.
NS 4 seems to use the default charset (iso-8859-1 in my case). This seems like the best choice. At least one person has complained that his mailer (VM under Emacs) displays attachments without a charset as ASCII, and characters outside ASCII become \nnn, which is very ugly in non-English languages :-)
QA Contact: esther → trix
Bug 162440 is about extending nsIMsgAttachment to support a charset attribute (as well as others).
Depends on: 162440
IMHO the first step would be just to _allow_ attachments to have a charset. Currently, if I quit Mozilla and hand-edit the message in the Drafts or Unsent\ Messages folder, adding a charset, it still gets dropped by Mozilla. IMHO the best default would be to use the user's locale setting (e.g. LC_CTYPE on Unix).
OS: Windows NT → All
Hardware: PC → All
This should be a user configuration, with auto-detection when possible. For instance, the user could choose UTF-8 if this is a valid UTF-8 file, otherwise ISO-8859-1.
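The "UTF-8 if valid, otherwise a configured default" idea can be sketched as below. This is a hypothetical Python illustration; `pick_charset` and its `fallback` parameter are assumed names standing in for the user preference.

```python
def pick_charset(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Label the data UTF-8 if it decodes cleanly, else use the fallback.

    `fallback` stands in for the user-configured default charset.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Note that pure ASCII data is also valid UTF-8, so this sketch labels it "utf-8" as well.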
(In reply to comment #12) > This should be a user configuration, with auto-detection when possible. User configuration is bug 72116.
Not exactly. I was speaking about a user configuration for the default charset.
Product: MailNews → Core
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
QA Contact: stephend → mime
Product: Core → MailNews Core
Severity: normal → S3