Closed
Bug 12481
Opened 25 years ago
Closed 23 years ago
libmime to feed more characters for charset detection (reminder)
Categories
(MailNews Core :: MIME, defect, P3)
MailNews Core
MIME
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla0.9.6
People
(Reporter: nhottanscp, Assigned: shanjian)
References
Details
(Keywords: intl)
Attachments
(1 file, 7 obsolete files)
16.19 KB,
patch
|
sspitzer
:
superreview+
|
Details | Diff | Splinter Review |
This is separated from #8899.
libmime has been feeding characters to charset converter (and charset detector)
per line bases since 4.x. This is not efficient (e.g. converters creation per
line) also causes lack of enough data to charset detectors (causes detection
failure and charset conversion failure).
We need to change data passing MIME part base instead of line base whenever we
have chance to change.
Reporter | ||
Updated•25 years ago
|
QA Contact: lchiang → nhotta
Updated•25 years ago
|
Target Milestone: M14
Comment 1•25 years ago
|
||
M14
Comment 2•25 years ago
|
||
Bulk change to assigned.
- rhp
Updated•25 years ago
|
Target Milestone: M14 → M19
Comment 3•25 years ago
|
||
Would be nice, but is unfortunately non-trivial because of the "line-by-line"
nature of libmime.
- rhp
Updated•25 years ago
|
Target Milestone: M19 → M20
Comment 4•25 years ago
|
||
Is there someplace else to cache the charset converter and detector which
doesn't force a rewrite of libmime?
Updated•25 years ago
|
Target Milestone: M20 → Future
Reporter | ||
Comment 6•24 years ago
|
||
Can MimeObject cache converter and detector? MimeObject is not per line (per
message, I think).
Assignee | ||
Comment 8•24 years ago
|
||
add myself to cc list
Assignee | ||
Comment 10•23 years ago
|
||
This problem need to be taken care of.
Reassign to myself.
Assignee: ducarroz → shanjian
Assignee | ||
Comment 11•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 12•23 years ago
|
||
ducarroz, naoki,
I just post a draft patch for your suggestion and recommendation. This patch was
mostly implemented within MimeInlineText. It uses a buffer to keep data and
sends them in one batch to detector. I also add some optimization work there.
Please take a look and offer some advice. Thx!
Reporter | ||
Comment 13•23 years ago
|
||
I am looking at the patch. One question, how does this work for a multi part
message? Is the buffer allocated per part or just one buffer for one message?
Assignee | ||
Comment 14•23 years ago
|
||
The buffer is allocated per part. MimeInlineText will be the child of
MimeMultiPart in case of multipart message. As you seen, the buffer is allocated
for MimeInlineText.
Reporter | ||
Comment 15•23 years ago
|
||
Your change also includes caching unicode encoder/decoder in a mime object.
This is good. The caching is done regardless of the auto detection is on or off,
correct?
Assignee | ||
Comment 16•23 years ago
|
||
Yes. It does not seem reasonable to me to normalize charset name and query for
decoder/encoder for each line. I do allow charset change in the middle of
parsing to handle meta charset, so decoder/encoder might change as well.
Reporter | ||
Comment 17•23 years ago
|
||
Looks good, are you going to revise the working patch for reviews?
Reporter | ||
Comment 18•23 years ago
|
||
Shanjian, could you change the target milestone?
I think part of your change (cache charset converters) would improve performance.
Assignee | ||
Comment 19•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Attachment #51888 -
Attachment is obsolete: true
Reporter | ||
Comment 21•23 years ago
|
||
1) "mail.charset.detector" is not used, you can remove that part of the code.
2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not desirable.
+ res = detector->DoIt(aBuf, aLength, aCharset, oConfident);
+ if (NS_SUCCEEDED(res) && (eBestAnswer == oConfident || eSureAnswer ==
oConfident)) {
+ return NS_OK;
+ }
3) I think this will leak if text->charset is not null.
+ if (!text->lineDamBuffer || !text->lineDamPtrs)
+ {
+ text->charset = text->defaultCharset;
+ text->defaultCharset = nsnull;
4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.
+ if (text->utf8Encoder == nsnull)
+ MIME_get_unicode_encoder("UTF-8", getter_AddRefs(text->utf8Encoder));
5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?
+ if (text->inputAutodetect)
+ {
+ //we need to prepare lineDam for charset detection
+ text->lineDamBuffer = (char*)PR_Malloc(DAM_MAX_BUFFER_SIZE);
+ text->lineDamPtrs = (char**)PR_Malloc(DAM_MAX_LINES*sizeof(char*));
6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?
+ else {
+ //buffering current line
+ text->lineDamPtrs[text->lastLineInDam] = text->lineDamBuffer +
text->curDamOffset;
+ nsCRT::memcpy(text->lineDamPtrs[text->lastLineInDam], line, length);
+ text->lastLineInDam++;
+ text->curDamOffset += length;
+ }
Assignee | ||
Comment 22•23 years ago
|
||
1) "mail.charset.detector" is not used, you can remove that part of the code.
removed as you wish.
2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not
desirable.
now I return null in aCharset if confidence is not good, and detectede charset
is checked before use.
3) I think this will leak if text->charset is not null.
+ if (!text->lineDamBuffer || !text->lineDamPtrs)
+ {
+ text->charset = text->defaultCharset;
+ text->defaultCharset = nsnull;
Under no circumstance text->charset will be non-null at this stage. I put a
assert there.
4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.
I don't think 'm_unicodeToUTF8Encoder' is ever used. Those code probably need to
be removed.
5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?
Yes, one buffer for each text mimepart.
6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?
No, it will be checked before a new line is added, not after. There is no need
to do it twice.
Assignee | ||
Comment 23•23 years ago
|
||
Reporter | ||
Comment 24•23 years ago
|
||
3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.
5) Please add comments about the buffer and why you chose 8k.
Assignee | ||
Comment 25•23 years ago
|
||
3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.
In my patch, only when text->charset is null after we tried all sources, will
text->defaultCharset be assigned and text->inputAutodetect be set to true. So
only one of text->charset and text->defaultCharset will be assigned after
MimeInlineText_initialize.
5) Please add comments about the buffer and why you chose 8k.
Normally it only takes 1K of real data for autodetector to reach good result.
But unfortunately, because of tags and big chunk of javascripts normally
attached to the head of html page, 8K is a guess to cover most cases. (A better
choice might be to keep feeding until detector returns a result or reach certain
limit. But the interface of autodetector need to be changed and I am unwilling
to do it now.) Another reason is I assumed 8K of data can normally be contained
in one UDP package if UDP is used. (No sure if it is true or not.)
Reporter | ||
Comment 26•23 years ago
|
||
Comment on attachment 53434 [details] [diff] [review]
patch 2.0, revision after naoki's suggestion
r=nhotta
please add comments about the items (#3, #5).
Also, please do the i18n mail smoke test.
http://www.mozilla.org/quality/intl/tests/data/
Attachment #53434 -
Flags: review+
Assignee | ||
Comment 27•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Attachment #53349 -
Attachment is obsolete: true
Assignee | ||
Updated•23 years ago
|
Attachment #53434 -
Attachment is obsolete: true
Assignee | ||
Comment 28•23 years ago
|
||
Comment on attachment 53665 [details] [diff] [review]
patch 3.0
taking over last review
Attachment #53665 -
Flags: review+
Assignee | ||
Comment 29•23 years ago
|
||
There is only one difference between patch 3.0 and 2.0, that is
- if (NS_SUCCEEDED(res) && *detectedCharset)
+ if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset)
the previous one might crash when detector return nothing.
Mail news smoke test is done and looks fine.
Assignee | ||
Comment 30•23 years ago
|
||
Seth, could you sr this bug?
Comment 31•23 years ago
|
||
you're going to want the module owner for mime (ducarroz) to review this patch.
can you provide information on when this code is hit? (once per message, if my
pref is set for autodetect? any extra overhead for for US-ASCII users?)
I'll leave it to the reviewer (ducarroz) to review memory usage (allocs and
frees) and to make sure allocations and copying is minimized.
some initial comments:
1) + nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res));
there's a contact id for pref, please use it instead of the CID.
2) double check your string usage.
to determine the contract id, do something like this:
nsCAutoString detector_contractid;
detector_contractid = NS_STRCDETECTOR_CONTRACTID_BASE;
nsXPIDLString detector_name;
rv = prefs->GetLocalizedUnicharPref("intl.charset.detector",
getter_Copies(detector_name));
if (NS_SUCCEEDED(rv)) {
detector_contractid += NS_ConvertUCS2toUTF8(detector_name).get();
}
3) is there a contract id for kCharsetConverterManagerCID? if so, use it
instead of the static cid.
4) NS_ConvertASCIItoUCS2("ISO-8859-1").get()
NS_LITERAL_STRING("ISO-8859-1").get() is the preferred way.
Assignee | ||
Comment 32•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Attachment #53665 -
Attachment is obsolete: true
Assignee | ||
Comment 33•23 years ago
|
||
seth,
Thanks for taking a look at my patch. There are some points in your comment need
further explainations.
> you're going to want the module owner for mime (ducarroz) to review this >
patch.
I did ask ducarroz, and he told me naoki's review is OK.
>can you provide information on when this code is hit? (once per message, if my
>pref is set for autodetect? any extra overhead for for US-ASCII users?)
The code is hit only when autodetector is selected. If an email has no charset
information specified, autodetector will be launched. In another word, if email
says itself is in us-ascii, there will be no overhead.
1) + nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res));
Done.
2) double check your string usage.
Is it because nsCAutoString is available so that we have to use it? Existing
code works faster and has no heap memory allocation. IMO traditional C/C++
string practice is allowed when appropriate. We know
NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector
name are strings that can be enumerate, so 128 is enough and there is no reason
to use heap allocation. Is there other reason you want me to change so?
3) is there a contract id for kCharsetConverterManagerCID? if so, use it
instead of the static cid.
Done.
4) NS_ConvertASCIItoUCS2("ISO-8859-1").get()
My understanding of document
"http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that
NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong.
Comment 34•23 years ago
|
||
Right, I let Naoki review it as it essentially a I18n matter. I'll anyway take a
look at it...
Comment 35•23 years ago
|
||
Comment on attachment 54012 [details] [diff] [review]
updated patch
Looks OK. R=ducarroz
Attachment #54012 -
Flags: review+
Comment 36•23 years ago
|
||
> NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector
> name are strings that can be enumerate, so 128 is enough and there is no reason
> to use heap allocation. Is there other reason you want me to change so?
+ nsresult res;
+ char detector_contractid[128];
+ PRUnichar* detector_name = nsnull;
+ nsCOMPtr<nsIStringCharsetDetector> detector;
+ PL_strcpy(detector_contractid, NS_STRCDETECTOR_CONTRACTID_BASE);
+
+ nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &res));
+ if (NS_SUCCEEDED(res)) {
+ if (NS_SUCCEEDED(prefs->GetLocalizedUnicharPref("intl.charset.detector",
&detector_name))) {
+ PL_strcat(detector_contractid, NS_ConvertUCS2toUTF8(detector_name).get());
+ PR_FREEIF(detector_name);
+ }
+ }
if length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is > 128, you now have a buffer overrun
on the stack. checking in this code would expose a security exploit.
make detector_contractid a nsCString (not an AutoString, if it is > 63 bytes, an
autostring will be copied to the heap.) and make detector_name a nsXPIDLString,
so you don't have to worry about the free().
> 4) NS_ConvertASCIItoUCS2("ISO-8859-1").get()
> My understanding of document
> "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that
> NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong.
My understand is that pn platforms that support the "L" trick (like windows)
NS_LITERAL_STRING() does the conversion at compile time, instead of at run time.
Assignee | ||
Comment 37•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Attachment #54012 -
Attachment is obsolete: true
Assignee | ||
Comment 38•23 years ago
|
||
Comment on attachment 54520 [details] [diff] [review]
one more update as suggested by seth
carry over review
Attachment #54520 -
Flags: review+
Assignee | ||
Updated•23 years ago
|
Attachment #54520 -
Attachment is obsolete: true
Attachment #54520 -
Flags: review+
Assignee | ||
Comment 39•23 years ago
|
||
Assignee | ||
Comment 40•23 years ago
|
||
Seth,
I read the document carefully and you are right, NS_LITERAL_STRING is better.
But I still have some doubt about issue 3, though I did find several instances
of using nsCString and nsXPIDLString in such situation. Because I am sure that
length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is < 128, I believe my original code
is safe. While if you really think we should prepare for the unknown and try to
avoid using heap, things need to get a little bit more complicated. I checked
the string class implemenation. In nsStr.h, it says nsCString is heap based,
and nsCAutoString is partially stack based. 64 is not large enough in this case,
so I prepared the buf in stack for nsCAutoString.
I update the patch, please sr.
Comment 41•23 years ago
|
||
that last patch doesn't look like a patch.
can you attach the patch?
Assignee | ||
Comment 42•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Attachment #54527 -
Attachment is obsolete: true
Comment 43•23 years ago
|
||
Comment on attachment 54556 [details] [diff] [review]
post last patch again. (last one is corrupted.)
sr=sspitzer
it looks like we only execute this code if autodetect is turned on.
Is there a way re-use the 8K buffer we allocate, instead of allocating and creating it for each part?
I don't know the mime code well enough to suggest an approach.
Attachment #54556 -
Flags: superreview+
Assignee | ||
Comment 44•23 years ago
|
||
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Updated•20 years ago
|
Product: MailNews → Core
Updated•16 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•