Closed Bug 12481 Opened 25 years ago Closed 23 years ago

libmime to feed more characters for charset detection (reminder)

Categories

(MailNews Core :: MIME, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla0.9.6

People

(Reporter: nhottanscp, Assigned: shanjian)

References

Details

(Keywords: intl)

Attachments

(1 file, 7 obsolete files)

This is separated from #8899.
libmime has been feeding characters to charset converter (and charset detector)
per line bases since 4.x. This is not efficient (e.g. converters creation per
line) also causes lack of enough data to charset detectors (causes detection
failure and charset conversion failure).
We need to change data passing MIME part base instead of line base whenever we
have chance to change.
QA Contact: lchiang → nhotta
Target Milestone: M14
M14
Bulk change to assigned.

- rhp
Target Milestone: M14 → M19
Would be nice, but is unfortunately non-trivial because of the "line-by-line"
nature of libmime.

- rhp
Target Milestone: M19 → M20
Is there someplace else to cache the charset converter and detector which
doesn't force a rewrite of libmime?
Target Milestone: M20 → Future
reassigning to ducarroz
Assignee: rhp → ducarroz
Status: ASSIGNED → NEW
Blocks: 59798
Keywords: intl
Blocks: 90581
Can MimeObject cache converter and detector? MimeObject is not per line (per
message, I think).
*** Bug 44377 has been marked as a duplicate of this bug. ***
add myself to cc list
*** Bug 96025 has been marked as a duplicate of this bug. ***
This problem need to be taken care of. 
Reassign to myself. 
Assignee: ducarroz → shanjian
Attached patch working patch (obsolete) — Splinter Review
Status: NEW → ASSIGNED
ducarroz, naoki, 
I just post a draft patch for your suggestion and recommendation. This patch was 
mostly implemented within MimeInlineText. It uses a buffer to keep data and 
sends them in one batch to detector. I also add some optimization work there. 
Please take a look and offer some advice.  Thx!
I am looking at the patch. One question, how does this work for a multi part
message? Is the buffer allocated per part or just one buffer for one message?
The buffer is allocated per part. MimeInlineText will be the child of 
MimeMultiPart in case of multipart message. As you seen, the buffer is allocated 
for MimeInlineText. 
Your change also includes caching unicode encoder/decoder in a mime object.
This is good. The caching is done regardless of the auto detection is on or off,
correct?
Yes. It does not seem reasonable to me to normalize charset name and query for 
decoder/encoder for each line. I do allow charset change in the middle of 
parsing to handle meta charset, so decoder/encoder might change as well. 
Looks good, are you going to revise the working patch for reviews?
Shanjian, could you change the target milestone?
I think part of your change (cache charset converters) would improve performance.
Attached patch patch 1.0 (obsolete) — Splinter Review
Attachment #51888 - Attachment is obsolete: true
retarget this bug. 
Target Milestone: Future → mozilla0.9.6
1) "mail.charset.detector" is not used, you can remove that part of the code.

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not desirable.
+      res = detector->DoIt(aBuf, aLength, aCharset, oConfident);
+      if (NS_SUCCEEDED(res) && (eBestAnswer == oConfident || eSureAnswer ==
oConfident)) {
+        return NS_OK;
+      }

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.

+  if (text->utf8Encoder == nsnull)
+    MIME_get_unicode_encoder("UTF-8", getter_AddRefs(text->utf8Encoder));

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?

+  if (text->inputAutodetect)
+  {
+    //we need to prepare lineDam for charset detection
+    text->lineDamBuffer = (char*)PR_Malloc(DAM_MAX_BUFFER_SIZE);
+    text->lineDamPtrs = (char**)PR_Malloc(DAM_MAX_LINES*sizeof(char*));

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?

+      else {
+        //buffering current line
+        text->lineDamPtrs[text->lastLineInDam] = text->lineDamBuffer +
text->curDamOffset;
+        nsCRT::memcpy(text->lineDamPtrs[text->lastLineInDam], line, length);
+        text->lastLineInDam++;
+        text->curDamOffset += length;
+      }

1) "mail.charset.detector" is not used, you can remove that part of the code.
removed as you wish. 

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not 
desirable.
now I return null in aCharset if confidence is not good, and detectede charset 
is checked before use. 

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;
Under no circumstance text->charset will be non-null at this stage. I put a 
assert there.

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.
I don't think 'm_unicodeToUTF8Encoder' is ever used. Those code probably need to 
be removed. 

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?
Yes, one buffer for each text mimepart. 

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?
No, it will be checked before a new line is added, not after. There is no need 
to do it twice. 

3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.

5) Please add comments about the buffer and why you chose 8k.
3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.
In my patch, only when text->charset is null after we tried all sources, will 
text->defaultCharset be assigned and text->inputAutodetect be set to true. So 
only one of text->charset and text->defaultCharset will be assigned after 
MimeInlineText_initialize. 

5) Please add comments about the buffer and why you chose 8k.
Normally it only takes 1K of real data for autodetector to reach good result. 
But unfortunately, because of tags and big chunk of javascripts normally 
attached to the head of html page, 8K is a guess to cover most cases. (A better 
choice might be to keep feeding until detector returns a result or reach certain 
limit. But the interface of autodetector need to be changed and I am unwilling 
to do it now.) Another reason is I assumed 8K of data can normally be contained 
in one UDP package if UDP is used. (No sure if it is true or not.)
Comment on attachment 53434 [details] [diff] [review]
patch 2.0, revision after naoki's suggestion

r=nhotta
please add comments about the items (#3, #5).
Also, please do the i18n mail smoke test.
http://www.mozilla.org/quality/intl/tests/data/
Attachment #53434 - Flags: review+
Blocks: 82591
Attached patch patch 3.0 (obsolete) — Splinter Review
Attachment #53349 - Attachment is obsolete: true
Attachment #53434 - Attachment is obsolete: true
Comment on attachment 53665 [details] [diff] [review]
patch 3.0

taking over last review
Attachment #53665 - Flags: review+
There is only one difference between patch 3.0 and 2.0, that is 
- if (NS_SUCCEEDED(res) && *detectedCharset)   
+ if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset) 

the previous one might crash when detector return nothing.  

Mail news smoke test is done and looks fine. 
Seth, could you sr this bug?
you're going to want the module owner for mime (ducarroz) to review this patch.

can you provide information on when this code is hit?  (once per message, if my
pref is set for autodetect?  any extra overhead for for US-ASCII users?)

I'll leave it to the reviewer (ducarroz) to review memory usage (allocs and
frees) and to make sure allocations and copying is minimized.

some initial comments:

1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

there's a contact id for pref, please use it instead of the CID.

2)  double check your string usage. 

to determine the contract id, do something like this:

nsCAutoString detector_contractid;
detector_contractid = NS_STRCDETECTOR_CONTRACTID_BASE;

nsXPIDLString detector_name;
rv = prefs->GetLocalizedUnicharPref("intl.charset.detector",
getter_Copies(detector_name));
if (NS_SUCCEEDED(rv)) {
  detector_contractid += NS_ConvertUCS2toUTF8(detector_name).get();
}

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()

NS_LITERAL_STRING("ISO-8859-1").get() is the preferred way.
Attached patch updated patch (obsolete) — Splinter Review
Attachment #53665 - Attachment is obsolete: true
seth, 
Thanks for taking a look at my patch. There are some points in your comment need 
further explainations.

> you're going to want the module owner for mime (ducarroz) to review this > 
patch.
I did ask ducarroz, and he told me naoki's review is OK.

>can you provide information on when this code is hit?  (once per message, if my
>pref is set for autodetect?  any extra overhead for for US-ASCII users?)
The code is hit only when autodetector is selected. If an email has no charset 
information specified, autodetector will be launched. In another word, if email 
says itself is in us-ascii, there will be no overhead.
1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

Done. 

2)  double check your string usage. 

Is it because nsCAutoString is available so that we have to use it? Existing 
code works faster and has no heap memory allocation. IMO traditional C/C++ 
string practice is allowed when appropriate. We know 
NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
name are strings that can be enumerate, so 128 is enough and there is no reason 
to use heap allocation. Is there other reason you want me to change so?

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

Done.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
My understanding of document 
"http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. 
Right, I let Naoki review it as it essentially a I18n matter. I'll anyway take a
look at it...
Comment on attachment 54012 [details] [diff] [review]
updated patch

Looks OK. R=ducarroz
Attachment #54012 - Flags: review+
> NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
> name are strings that can be enumerate, so 128 is enough and there is no reason 
> to use heap allocation. Is there other reason you want me to change so?

+  nsresult res;
+  char detector_contractid[128];
+  PRUnichar* detector_name = nsnull;
+  nsCOMPtr<nsIStringCharsetDetector> detector;
+  PL_strcpy(detector_contractid, NS_STRCDETECTOR_CONTRACTID_BASE);
+
+  nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &res)); 
+  if (NS_SUCCEEDED(res)) {
+    if (NS_SUCCEEDED(prefs->GetLocalizedUnicharPref("intl.charset.detector",
&detector_name))) {
+      PL_strcat(detector_contractid, NS_ConvertUCS2toUTF8(detector_name).get());
+      PR_FREEIF(detector_name);
+    }
+  }

if length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is > 128, you now have a buffer overrun
on the stack.  checking in this code would expose a security exploit.

make detector_contractid a nsCString (not an AutoString, if it is > 63 bytes, an
autostring will be copied to the heap.) and make detector_name a nsXPIDLString,
so you don't have to worry about the free().


> 4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
> My understanding of document 
> "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
> NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. 

My understand is that pn platforms that support the "L" trick (like windows)
NS_LITERAL_STRING() does the conversion at compile time, instead of at run time.


Attachment #54012 - Attachment is obsolete: true
Comment on attachment 54520 [details] [diff] [review]
one more update as suggested by seth

carry over review
Attachment #54520 - Flags: review+
Attachment #54520 - Attachment is obsolete: true
Attachment #54520 - Flags: review+
Attached patch latest patch (obsolete) — Splinter Review
Seth, 
I read the document carefully and you are right, NS_LITERAL_STRING is better. 
But I still have some doubt about issue 3, though I did find several instances 
of using nsCString and nsXPIDLString in such situation. Because I am sure that 
length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is < 128, I believe my original code 
is safe. While if you really think we should prepare for the unknown and  try to 
avoid using heap, things need to get a little bit more complicated. I checked 
the string class implemenation. In nsStr.h, it says nsCString is heap based, 
and nsCAutoString is partially stack based. 64 is not large enough in this case,
so I prepared the buf in stack for nsCAutoString. 

I update the patch, please sr. 
that last patch doesn't look like a patch.

can you attach the patch?
Attachment #54527 - Attachment is obsolete: true
Comment on attachment 54556 [details] [diff] [review]
post last patch again. (last one is corrupted.)

sr=sspitzer

it looks like we only execute this code if autodetect is turned on.

Is there a way re-use the 8K buffer we allocate, instead of allocating and creating it for each part?

I don't know the mime code well enough to suggest an approach.
Attachment #54556 - Flags: superreview+
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: