Closed Bug 65277 Opened 24 years ago Closed 24 years ago

thread pane charset conversion should use folder charset override for non MIME headers

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9

People

(Reporter: nhottanscp, Assigned: jgmyers)

References

Details

(Keywords: intl)

Attachments

(14 files)

2.66 KB, patch
Details | Diff | Splinter Review
5.54 KB, text/plain
Details
15.61 KB, patch
Details | Diff | Splinter Review
18.49 KB, patch
Details | Diff | Splinter Review
19.04 KB, patch
Details | Diff | Splinter Review
9.69 KB, patch
Details | Diff | Splinter Review
13.57 KB, patch
Details | Diff | Splinter Review
14.13 KB, patch
Details | Diff | Splinter Review
14.89 KB, patch
Details | Diff | Splinter Review
32.99 KB, patch
Details | Diff | Splinter Review
10.20 KB, patch
Details | Diff | Splinter Review
16.74 KB, patch
Details | Diff | Splinter Review
39.51 KB, patch
Details | Diff | Splinter Review
14.36 KB, patch
Details | Diff | Splinter Review
Thread pane charset conversion should use folder charset. Currently, the followings are not supported. * When message headers does not have MIME charset, a folder charset should be used for the conversion. * If the folder charset override is set then all the header should be converted as a folder charset.
Keywords: intl
Summary: thread pane charset conversion should use folder charset → thread pane charset conversion should use folder charset for non MIME headers
QA contact to ji. CCs to others.
QA Contact: momoi → ji
Status: NEW → ASSIGNED
>* When message headers does not have MIME charset, a folder charset should be >used for the conversion. This is actually working, only the override case has the problem.
Summary: thread pane charset conversion should use folder charset for non MIME headers → thread pane charset conversion should use folder charset override for non MIME headers
Target Milestone: --- → mozilla0.8
I attached a patch to support charset override for message headers. The change was to ignore the charset specified in the header and use the folder charset for the charset conversion. The added code path is only executed if the user sets charset override either folder level or global preference. David, could you review?
sr=bievnenu
I need one more reviewer. Scott, could you review it?
r=mscott
checked in
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
i am reopning this bug: in 2001-02-08 mtrunk build the folder charset override is working until you check the box " apply to all messages ...". Messages that have MIME in headers turn out unreadable. Reloading doesn't help ( but unchecking " apply to all messages in the folder does..). I've seen this in russian newsgroup, messages that have koi8-r in theirs headers are garbled after applying koi8-r to all messages in the folder
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It has benn working about a week after I check in until yesterday (so this is actually a regression). The fix for bug 58114 changed a behavior of MIME decoder. It always returns UTF-8 string. I think it's fine to return UTF-8 there, but I just did not know about the recent change, I need to rewrite the code to adopt to the new behavior.
Status: REOPENED → ASSIGNED
Target Milestone: mozilla0.8 → mozilla0.9
Actually, this is caused by bug 65702. Can't decode subject if it contains ASCII and Japanese
Depends on: 65702
Assignee: nhotta → jgmyers
Status: ASSIGNED → NEW
Here is a list for charset precedence, manual override has a highest precedence. Let me know if you have questions. default (pref) - use when no other charset is specified default (folder) - folder level default header - charset specified by the header, use when no override is applied override (pref) - global override set by the pref override (folder) - folder level override (by folder default charset) override (manual) - set by UI (charset menu)
Would it be acceptable to treat UTF-8 as special? Labeled UTF-8 would have precedence over any override. Possibly detected UTF-8 should have precedence over any default charset.
Status: NEW → ASSIGNED
That would be okay for main body. How about a UTF-8 mail with (non UTF-8) attachments? The user may want to apply manual override to see attachments correctly when attachments are not charset labeled.
This would only fail if non-utf8 data were mislabeled as utf-8 or if there were no override adn unlabeled data were syntactically legal utf-8. Note that it is extremely unlikely for non-utf8 data to be syntactically legal utf-8.
Okay, that's fine, assuming the check for legal UTF-8 would not be time consuming. BTW, why do you want to treat UTF-8 specially?
UTF-8 is the only possible way to ever get out of the existing charset soup. Its use should not be precluded by the charset soup workarounds.
Whiteboard: blocked on ducarroz review of 63834
Current plan is to pass an input charset and an override flag down to the encoded-word decoder, replacing the current output charset. A lot of redundant or unused charset manipulation needs to be cleaned up in the process. I am considering having the low-level decoder output a PRUnichar *, as several callers immediately convert the UTF-8 output to UTF-16. The charset converter recycling that used to go in some but not all callers should probably be replaced with a recycler at or below the encoded-word decoder.
Whiteboard: blocked on ducarroz review of 63834
ducarroz: please review
The new decoder MIME_DecodeMimeHeader, is not called. Do you plan to provide another patch?
Yes. I will submit further patches to switch all callers of MIME_DeocdeMimePartIIStr() to that interface.
Keywords: review
Whiteboard: awaiting ducarroz review
nhotta, can you test this patch as it essentially about I18n. Thanks
The new decode function is not called yet, so would not affect i18n. The other change is to replace UTF-8 to US-ASCII conversion by strdup(), since US-ASCII is a subset of UTF-8, no conversion is needed.
Reviewers should consider whether they consider the new MIME_DecodeMimeHeader() a good replacement for MIME_DecodeMimePartIIStr(). Is my proposed architecture of passing a charset and a bool down to the encoded-word decoder acceptable? Am I incorrectly neglecting support for charset detection? intl_is_legal_utf8() should return a PRBool. I can fix that either in this review cycle or in my next patch.
MIME decode and charset conversion were separated before. Override flag and default charset are related to charset conversion but not needed for the decoding. It also allowed optimization of charset conversion (e.g. cache unicode decoder). After the change of the function does both MIME decode and charset conversion, the override flag and default charset should be passed to the function. So the change is necessary when we keep MIME decode and charset conversion together. >Is my proposed architecture > of passing a charset and a bool down to the encoded-word decoder acceptable? So this is necessary. The issue we could have discussed before was whether we wanted to keep MIME decode and charset conversion separated or process them in one function. >Am I incorrectly neglecting support for charset detection? Is this about charset auto detection? That's for body only, no auto detection is applied to headers.
Optimization of charset conversion should be done below header decoding, not above. It is better to implement any given function in one place rather than, say, five.
looks good for me. R=ducarroz
waiting for nhotta to finish reviewing the patch before I sr.
* There are a couple of calls to intl_copy_uncoded_header() which replace strcpy(). Is that necessary? Before coming there, the caller checks "=?", so the header contains MIME encoded words. I think we do not have to apply the default charset to unencoded parts. The case that the caller does not find "=?", I don't see charset conversion using the default charset. If no conversion there, the default charset will only be used for headers which contains "=?". * I see mixture of lower and upper case of "UTF-8". Please use always "UTF-8". * Type of length is different between intl_is_legal_utf8 and intl_copy_uncoded_header, please unify. * In the code below, why int is used instead of unsigned char? Same for intl_is_legal_utf8() +static void intl_copy_uncoded_header(char **output, const char *input, + int len, const char *default_charset) +{ + int c;
Just because a header has the 2-character sequence =? does not mean it has MIME encoded words. That sequence can occur as an ASCII = followed by an ASCII ?. If MIME_DecodeMimeHeader() does not find "=?" but the header has non-UTF8 unencoded data, then the call to intl_is_legal_utf8() from MIME_DecodeMimeHeader() will return zero, causing MIME_DecodeMimeHeader() to call intl_decode_mime_part2_str() in order to apply the default charset. int is used instead of unsigned char because the shorter size is not necessary. int is by definition the most natural size for the platform.
I also changed intl_is_legal_utf8 to return PRBool and included a 1-character fix for bug 69251.
So intl_is_legal_utf8() may be called twice once from MIME_DecodeMimeHeader() then intl_decode_mime_part2_str(). Is that true?
Correct. intl_is_legal_utf8() may be called twice. Once from MIME_DecodeMimeHeader() and once from intl_copy_uncoded_header() inside of intl_decode_mime_part2_str(). This will generally happen when there is unencoded non-utf8 data in the header.
Okay, I don't think that would make significant performance degradation. r=nhotta
ducarroz: please review updated patch.
looks good to me. R=ducarroz
Whiteboard: awaiting ducarroz review → awaiting sspitzer review
Keywords: review
Whiteboard: awaiting sspitzer review
Keywords: review
Whiteboard: waiting for ducarroz & nhotta review
r=nhotta Please remove following line in nsStreamConverter.cpp. + // in future, the override flag to be per folder instead of a global pref
R=ducarroz
sr=sspitzer before you check in, can you remove that comment line?
Keywords: review
Whiteboard: waiting for ducarroz & nhotta review
The above patch blindly preserves the semantics of the existing code, except it doesn't cache the converter it doesn't need to use. Previously, MimeHeaders_convert_header_value() inexplicably handled a null opt->rfc1522_conversion_fn by freeing its value and returning NULL. If this is a "don't care" case (it never gets called in the draft-converter case), then the above patch can be further simplified by removing opt->rfc1522_conversion_p, always treating it as being PR_TRUE.
The patch replaces MimeHeaders_convert_rfc1522 by MIME_DecodeMimeHeader. MimeHeaders_convert_rfc1522 does caching for charset converter. That should not be replaced unless MIME_DecodeMimeHeader has equivalent caching mechanism. Why don't you change mime_convert_rfc1522 to replace MIME_DecodeMimePartIIStr by MIME_DecodeMimeHeader?
MimeHeaders_convert_rfc1522() caches charset decoders, but only uses those cached decoders in the situation where there are no MIME encoded words. If there are MIME encoded words, it just converts UTF-8 to UTF-8. If I just change mime_convert_rfc1522() to call MIME_DecodeMimeHeader() instead of MIME_DecodeMimePartIIStr(), then MimeHeaders_convert_rfc1522() would cache the decoders but never use them. This would be because MIME_DecodeMimeHeader() always returns data, including just-send-8 data, in UTF-8. The only effective difference would be the cached but unused decoders would be available for use by MimeInlineText_rotate_convert_and_parse_line(). The converter caching is being done at too high a level. It should instead be done in MIME_ConvertCharset() or below. Correctness first, performance second.
>Correctness first, performance second. I do not agree, as "correctness" for coding point of view no actual correctness problem at user side. Are you able to finish all your remaining changes plus performance enhancement before mozilla0.9 closure?
Whether or not I am able to finish by 0.9 tree closure depends entirely on the responsiveness of reviewers. If I get many more 3 day delays, such as for the last patch, I may well not finish in time.
ok, looks good to me. have you tested it carrefully? if yes, R=ducarroz
I have tested it on an i18n message and verified the override charset is now applied to the header display in the message pane.
My concern is performance. The patch removes the cache of charset converter. I think MIME_DecodeMimeHeader() checkes the header string and only applies charset conversion when the string is not 8 bit and not valid UTF-8. Is this correct? If that's the case, it would not be so bad. In any case, it's better to check performance with the change using Quantify for example. I will try to review the patch today.
r=nhotta - if ((!value) || (!*value)) + if (!*value) return *value; Is that okay, not to check the pointer?
Either alternative will fault if the pointer is null.
Also, the function is static and it is easy to audit that it is always called with a non-null pointer.
Attached patch Fix more callersSplinter Review
In my latest patch, I also change nsDBFolderInfo::GetCharPtrCharacterSet() to use the default character set like nsDBFolderInfo::GetCharacterSet() does.
nsMimeConverter, there were separate functions, one just dif MIME decode and other to dif MIME decode plus converted to unicode. Now both functions return unicode string, the difference is either returns char* or PRUnichar*. I think we just need one, the caller can convert UCS2<->UTF-8 easily.
Another difference between the functions (which callers depend on) is that if no conversion is needed the one returning char* will return nsnull whereas the one returning PRUnichar* will copy the input. I think there are enough callers wanting a PRUnichar* to warrant a second function.
okay, please change the comments to indicate the new decoder behavior r=nhotta Can I expect the latest patch will fix this bug?
There are about two more patches coming. One for nsMimeHtmlEmitter.cpp, nsMsgHeaderParser.cpp, and nsMsgI18N.cpp, another for the callers of nsMsgI18NDecodeMimePartIIStr. This patch takes care of the folder charset and folder override charset. I don't know if my work will address the override caused by the view menu. The call in nsMimeBaseEmitter.cpp doesn't have any charset information as there is no apparent context to get that information from. I can't even figure out when this code is called.
nsMimeBaseEmitter.cpp, override is currently not working, see bug 57086. For menu override, the only change I have seen so far was you changed override field from char* to bool.
The patch I just landed, id 26422, passes the override_charset down from MimeHeaders_convert_header_value(). This is sufficient for menu overrides to now affect (at least) subjects in the header section of the message pane.
*** Bug 71548 has been marked as a duplicate of this bug. ***
this code seems ok to me, but I'd like the module owners bienvenu (for mailnews/db) and ducarroz (for mailnews/mime) to review it.
R=ducarroz for mime part.
John, can you wait until the performance branch lands to check this in? (at least the db part). On the performance branch, I've added an convenience function to get and cache the mime converter so your code should use that. Other than that, your changes look OK. Actually, you don't have much choice, because the performance branch is going to land before the tree opens for .9 development, so you'll need to merge with my changes. Sorry for the potential conflicts.
sr=bienvenu
Blocks: 66098
Blocks: 63829
Attached patch Fix more callersSplinter Review
you can use an nsXPIDLString for PRUnichar *decodedString; in ReplyOnTop and remove the PR_Free call, I believe. Still looking at the rest.
question for nhotta: is it ok to use kCMimeConverterCID as a service instead of using CreateInstance?
It caches unicode convertor, so use createinstance or change the class implementation (this might have been done since the new decoder always returns UTF-8 and no converter is needed). http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/nsMimeConverter.h#108
The nsIMimeConverter objects retrieved by do_GetService() are only used to invoke the DecodeMimeHeader() methods. This doesn't touch any of the members.
Reviewing nsMsgCompose.cpp, * @@ -986,57 +986,66 @@, why is the extra conversion needed only for "auther"? That extra conversion is also in the original code. I think you can just use "decodedString" like other fields. * @@ -1235,49 +1237,34 @@, The new code ignores the return value (error code), is that okay?
R=ducarroz for new patch.
In nsMsgSearchTerm.h, a protected function MatchString is added. Is that implemented? I don't see that in the diff. Other question is what does it mean to pass null for charset when calling MatchString?
MatchString was previously exported through the interface. This patch makes it private. A null charset means the input is already in UTF-8.
r=nhotta For charset override in search, bug 68706 is assigned to taka (cc to him).
I added mention of bug 68706 to the XXX comment in nsMsgSearchOfflineMail::MatchTerms(). I will reattach the combined patch for the convenience of further reviewers.
Attached patch recombined patchSplinter Review
sr=bienvenu
r=nhotta
sr=bienvenu
R=ducarroz
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
Verified with 04/10 builds. It's fixed.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core
Blocks: 74916
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: