User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a3) Gecko/20040817
Build Identifier:

Here's the current logic for determining which encoding to use for reading a message:

1. Infer the 'reported' encoding from the message headers (this is done with the rather borked libmime, which needs a rewrite - but that is bug 248846).
2. If no encoding is named in the headers, then use the default.
3. If 'Apply Default to All Messages' is set, then ignore 1. and 2. and use the default (this is unless I am mistaken and the encoding auto-detection for web pages is also applied to mail messages).

There are numerous problems with this scheme; I can come up with some of them, others could probably report more:

- After this logic is applied, if you make a manual choice of encoding using the View->Character Encoding menu, it does not persist. Thus if you move to another message and back, two bad things happen: first, all of the headers are parsed again (although this is again a problem with libmime and the fact that no internal representation of messages seems to be constructed), and second, the 3-step logic above is applied again, so you get the same wrong choice of encoding you had to manually override.

- The coercion to the default encoding also carries over to a 'reply-to' composer window. For example, if you've received a UTF-8 message with characters in some Asian script and are replying to it, they may be forced into the gibberish you get by reading them with your default Windows-1256 codepage, for instance, if you've chosen 'Apply Default to All Messages'.

- The current coercion scheme is not the most effective 'cheap' coercion possible. Even without checking the message body for whether the selected encoding seems to match the contents, it would give better results if the coercion option were not "always coerce to default encoding" but rather "coerce to default encoding whenever the headers say nothing, or name the default, or name a plain-ASCII encoding, e.g. ISO-8859-1 or US-ASCII". This is due to the fact that it is extremely rare for a message to arrive with, say, "charset=windows-1255" in the Content-Type header which is neither windows-1255 nor plain English in ASCII but rather, say, UTF-8 or Arabic in Windows-1256. I don't think this has ever happened.

- The message body needs (subject to a pref) to be considered when deciding the encoding. If it is not already used, it would be beneficial to apply the encoding auto-detection to mail messages as well as to documents shown in the browser. Of course it would be rather useless (at least AFAIAC) since it doesn't detect Hebrew (bug 86999), which means it will also mis-detect several other encodings, e.g. Cyrillic Windows-1251, for some Hebrew messages. A simpler alternative is some logic for deciding when the coercion was wrong: e.g. if you coerce text into Windows-1255 but get lots of repeated sequences of punctuation marks without letters, or many occurrences of characters which are completely unused in Windows-1255 or very rare (superscript three, inverted exclamation mark, double dagger, etc.) - then the coercion is probably a mistake and should be undone.

Reproducible: Always

Steps to Reproduce:
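For illustration, the smarter coercion rule and the cheap wrong-coercion heuristic proposed above could be sketched roughly as follows. This is not Mozilla/libmime code; the function names, the SAFE_TO_COERCE set, and the rare-character threshold are all hypothetical.

```python
# Sketch of the proposed logic, under stated assumptions; not actual
# Mailnews/libmime APIs.

# Charsets whose presence in the headers rarely reflects the sender's real
# intent, so coercing them to the user's default is usually safe (assumed set).
SAFE_TO_COERCE = {"us-ascii", "iso-8859-1"}

# Characters that are valid in Windows-1255 but almost never occur in genuine
# Hebrew text: superscript three, inverted exclamation mark, double dagger.
RARE_IN_WINDOWS_1255 = {"\u00b3", "\u00a1", "\u2021"}

def choose_encoding(header_charset, default_charset, coerce_all=False):
    """Pick a charset for decoding a message.

    Instead of 'always coerce to default', coerce only when the headers name
    nothing, name the default itself, or name a charset in SAFE_TO_COERCE.
    """
    if header_charset is None:
        return default_charset
    hc = header_charset.lower()
    if coerce_all and (hc == default_charset.lower() or hc in SAFE_TO_COERCE):
        return default_charset
    return header_charset

def coercion_looks_wrong(decoded_text, rare_chars=RARE_IN_WINDOWS_1255,
                         threshold=0.02):
    """Cheap heuristic: if characters that are rare or unused in the coerced
    charset make up more than `threshold` of the text, the coercion was
    probably a mistake and should be undone (threshold is a guess)."""
    if not decoded_text:
        return False
    rare = sum(1 for ch in decoded_text if ch in rare_chars)
    return rare / len(decoded_text) > threshold
```

Under this rule a message whose headers say "charset=utf-8" would keep UTF-8 even with 'Apply Default to All Messages' set, while a headerless or US-ASCII-labelled message would still be coerced to the user's default.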
This sounds awfully familiar. At least part of it is a dupe of Bug 208917 and I'm sure most other issues are dealt with in other bugs. You might want to go over bug 254868 (which was recently fixed) and other bugs that are linked from the tracking Bug 254868. Prog.
Correction: The recently fixed bug is Bug 227265. Sorry for the spam, Prog.
*** This bug has been marked as a duplicate of 254868 ***
Status: UNCONFIRMED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → DUPLICATE
Eyal, since Bug 254868 is for tracking other bugs, please move your analysis and suggestions (in comment 0) to another bug, such as Bug 208917. There's no reason to have this content lost in dupelivion. Prog.