libmime to feed more characters for charset detection (reminder)

RESOLVED FIXED in mozilla0.9.6

Status

P3
normal
RESOLVED FIXED
19 years ago
10 years ago

People

(Reporter: nhottanscp, Assigned: shanjian)

Tracking

({intl})

Trunk
mozilla0.9.6
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 7 obsolete attachments)

(Reporter)

Description

19 years ago
This is separated from #8899.
libmime has been feeding characters to charset converter (and charset detector)
per line bases since 4.x. This is not efficient (e.g. converters creation per
line) also causes lack of enough data to charset detectors (causes detection
failure and charset conversion failure).
We need to change data passing MIME part base instead of line base whenever we
have chance to change.
(Reporter)

Updated

19 years ago
QA Contact: lchiang → nhotta

Updated

19 years ago
Target Milestone: M14

Comment 1

19 years ago
M14

Comment 2

19 years ago
Bulk change to assigned.

- rhp

Updated

19 years ago
Target Milestone: M14 → M19

Comment 3

19 years ago
Would be nice, but is unfortunately non-trivial because of the "line-by-line"
nature of libmime.

- rhp

Updated

19 years ago
Target Milestone: M19 → M20

Comment 4

19 years ago
Is there someplace else to cache the charset converter and detector which
doesn't force a rewrite of libmime?

Updated

18 years ago
Target Milestone: M20 → Future

Comment 5

18 years ago
reassigning to ducarroz
Assignee: rhp → ducarroz
Status: ASSIGNED → NEW
(Reporter)

Updated

18 years ago
Blocks: 59798
(Reporter)

Updated

18 years ago
Keywords: intl
(Reporter)

Updated

17 years ago
Blocks: 90581
(Reporter)

Comment 6

17 years ago
Can MimeObject cache converter and detector? MimeObject is not per line (per
message, I think).
(Assignee)

Comment 7

17 years ago
*** Bug 44377 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 8

17 years ago
add myself to cc list

Comment 9

17 years ago
*** Bug 96025 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 10

17 years ago
This problem need to be taken care of. 
Reassign to myself. 
Assignee: ducarroz → shanjian
(Assignee)

Comment 11

17 years ago
Created attachment 51888 [details] [diff] [review]
working patch
(Assignee)

Updated

17 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 12

17 years ago
ducarroz, naoki, 
I just post a draft patch for your suggestion and recommendation. This patch was 
mostly implemented within MimeInlineText. It uses a buffer to keep data and 
sends them in one batch to detector. I also add some optimization work there. 
Please take a look and offer some advice.  Thx!
(Reporter)

Comment 13

17 years ago
I am looking at the patch. One question, how does this work for a multi part
message? Is the buffer allocated per part or just one buffer for one message?
(Assignee)

Comment 14

17 years ago
The buffer is allocated per part. MimeInlineText will be the child of 
MimeMultiPart in case of multipart message. As you seen, the buffer is allocated 
for MimeInlineText. 
(Reporter)

Comment 15

17 years ago
Your change also includes caching unicode encoder/decoder in a mime object.
This is good. The caching is done regardless of the auto detection is on or off,
correct?
(Assignee)

Comment 16

17 years ago
Yes. It does not seem reasonable to me to normalize charset name and query for 
decoder/encoder for each line. I do allow charset change in the middle of 
parsing to handle meta charset, so decoder/encoder might change as well. 
(Reporter)

Comment 17

17 years ago
Looks good, are you going to revise the working patch for reviews?
(Reporter)

Comment 18

17 years ago
Shanjian, could you change the target milestone?
I think part of your change (cache charset converters) would improve performance.
(Assignee)

Comment 19

17 years ago
Created attachment 53349 [details] [diff] [review]
patch 1.0
(Assignee)

Updated

17 years ago
Attachment #51888 - Attachment is obsolete: true
(Assignee)

Comment 20

17 years ago
retarget this bug. 
Target Milestone: Future → mozilla0.9.6
(Reporter)

Comment 21

17 years ago
1) "mail.charset.detector" is not used, you can remove that part of the code.

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not desirable.
+      res = detector->DoIt(aBuf, aLength, aCharset, oConfident);
+      if (NS_SUCCEEDED(res) && (eBestAnswer == oConfident || eSureAnswer ==
oConfident)) {
+        return NS_OK;
+      }

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.

+  if (text->utf8Encoder == nsnull)
+    MIME_get_unicode_encoder("UTF-8", getter_AddRefs(text->utf8Encoder));

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?

+  if (text->inputAutodetect)
+  {
+    //we need to prepare lineDam for charset detection
+    text->lineDamBuffer = (char*)PR_Malloc(DAM_MAX_BUFFER_SIZE);
+    text->lineDamPtrs = (char**)PR_Malloc(DAM_MAX_LINES*sizeof(char*));

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?

+      else {
+        //buffering current line
+        text->lineDamPtrs[text->lastLineInDam] = text->lineDamBuffer +
text->curDamOffset;
+        nsCRT::memcpy(text->lineDamPtrs[text->lastLineInDam], line, length);
+        text->lastLineInDam++;
+        text->curDamOffset += length;
+      }

(Assignee)

Comment 22

17 years ago
1) "mail.charset.detector" is not used, you can remove that part of the code.
removed as you wish. 

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not 
desirable.
now I return null in aCharset if confidence is not good, and detectede charset 
is checked before use. 

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;
Under no circumstance text->charset will be non-null at this stage. I put a 
assert there.

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.
I don't think 'm_unicodeToUTF8Encoder' is ever used. Those code probably need to 
be removed. 

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?
Yes, one buffer for each text mimepart. 

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?
No, it will be checked before a new line is added, not after. There is no need 
to do it twice. 

(Assignee)

Comment 23

17 years ago
Created attachment 53434 [details] [diff] [review]
patch 2.0, revision after naoki's suggestion
(Reporter)

Comment 24

17 years ago
3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.

5) Please add comments about the buffer and why you chose 8k.
(Assignee)

Comment 25

17 years ago
3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.
In my patch, only when text->charset is null after we tried all sources, will 
text->defaultCharset be assigned and text->inputAutodetect be set to true. So 
only one of text->charset and text->defaultCharset will be assigned after 
MimeInlineText_initialize. 

5) Please add comments about the buffer and why you chose 8k.
Normally it only takes 1K of real data for autodetector to reach good result. 
But unfortunately, because of tags and big chunk of javascripts normally 
attached to the head of html page, 8K is a guess to cover most cases. (A better 
choice might be to keep feeding until detector returns a result or reach certain 
limit. But the interface of autodetector need to be changed and I am unwilling 
to do it now.) Another reason is I assumed 8K of data can normally be contained 
in one UDP package if UDP is used. (No sure if it is true or not.)
(Reporter)

Comment 26

17 years ago
Comment on attachment 53434 [details] [diff] [review]
patch 2.0, revision after naoki's suggestion

r=nhotta
please add comments about the items (#3, #5).
Also, please do the i18n mail smoke test.
http://www.mozilla.org/quality/intl/tests/data/
Attachment #53434 - Flags: review+
(Reporter)

Updated

17 years ago
Blocks: 82591
(Assignee)

Comment 27

17 years ago
Created attachment 53665 [details] [diff] [review]
patch 3.0
(Assignee)

Updated

17 years ago
Attachment #53349 - Attachment is obsolete: true
(Assignee)

Updated

17 years ago
Attachment #53434 - Attachment is obsolete: true
(Assignee)

Comment 28

17 years ago
Comment on attachment 53665 [details] [diff] [review]
patch 3.0

taking over last review
Attachment #53665 - Flags: review+
(Assignee)

Comment 29

17 years ago
There is only one difference between patch 3.0 and 2.0, that is 
- if (NS_SUCCEEDED(res) && *detectedCharset)   
+ if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset) 

the previous one might crash when detector return nothing.  

Mail news smoke test is done and looks fine. 
(Assignee)

Comment 30

17 years ago
Seth, could you sr this bug?
you're going to want the module owner for mime (ducarroz) to review this patch.

can you provide information on when this code is hit?  (once per message, if my
pref is set for autodetect?  any extra overhead for for US-ASCII users?)

I'll leave it to the reviewer (ducarroz) to review memory usage (allocs and
frees) and to make sure allocations and copying is minimized.

some initial comments:

1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

there's a contact id for pref, please use it instead of the CID.

2)  double check your string usage. 

to determine the contract id, do something like this:

nsCAutoString detector_contractid;
detector_contractid = NS_STRCDETECTOR_CONTRACTID_BASE;

nsXPIDLString detector_name;
rv = prefs->GetLocalizedUnicharPref("intl.charset.detector",
getter_Copies(detector_name));
if (NS_SUCCEEDED(rv)) {
  detector_contractid += NS_ConvertUCS2toUTF8(detector_name).get();
}

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()

NS_LITERAL_STRING("ISO-8859-1").get() is the preferred way.
(Assignee)

Comment 32

17 years ago
Created attachment 54012 [details] [diff] [review]
updated patch
(Assignee)

Updated

17 years ago
Attachment #53665 - Attachment is obsolete: true
(Assignee)

Comment 33

17 years ago
seth, 
Thanks for taking a look at my patch. There are some points in your comment need 
further explainations.

> you're going to want the module owner for mime (ducarroz) to review this > 
patch.
I did ask ducarroz, and he told me naoki's review is OK.

>can you provide information on when this code is hit?  (once per message, if my
>pref is set for autodetect?  any extra overhead for for US-ASCII users?)
The code is hit only when autodetector is selected. If an email has no charset 
information specified, autodetector will be launched. In another word, if email 
says itself is in us-ascii, there will be no overhead.
1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

Done. 

2)  double check your string usage. 

Is it because nsCAutoString is available so that we have to use it? Existing 
code works faster and has no heap memory allocation. IMO traditional C/C++ 
string practice is allowed when appropriate. We know 
NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
name are strings that can be enumerate, so 128 is enough and there is no reason 
to use heap allocation. Is there other reason you want me to change so?

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

Done.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
My understanding of document 
"http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. 
Right, I let Naoki review it as it essentially a I18n matter. I'll anyway take a
look at it...
Comment on attachment 54012 [details] [diff] [review]
updated patch

Looks OK. R=ducarroz
Attachment #54012 - Flags: review+
> NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
> name are strings that can be enumerate, so 128 is enough and there is no reason 
> to use heap allocation. Is there other reason you want me to change so?

+  nsresult res;
+  char detector_contractid[128];
+  PRUnichar* detector_name = nsnull;
+  nsCOMPtr<nsIStringCharsetDetector> detector;
+  PL_strcpy(detector_contractid, NS_STRCDETECTOR_CONTRACTID_BASE);
+
+  nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &res)); 
+  if (NS_SUCCEEDED(res)) {
+    if (NS_SUCCEEDED(prefs->GetLocalizedUnicharPref("intl.charset.detector",
&detector_name))) {
+      PL_strcat(detector_contractid, NS_ConvertUCS2toUTF8(detector_name).get());
+      PR_FREEIF(detector_name);
+    }
+  }

if length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is > 128, you now have a buffer overrun
on the stack.  checking in this code would expose a security exploit.

make detector_contractid a nsCString (not an AutoString, if it is > 63 bytes, an
autostring will be copied to the heap.) and make detector_name a nsXPIDLString,
so you don't have to worry about the free().


> 4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
> My understanding of document 
> "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
> NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. 

My understand is that pn platforms that support the "L" trick (like windows)
NS_LITERAL_STRING() does the conversion at compile time, instead of at run time.


(Assignee)

Comment 37

17 years ago
Created attachment 54520 [details] [diff] [review]
one more update as suggested by seth
(Assignee)

Updated

17 years ago
Attachment #54012 - Attachment is obsolete: true
(Assignee)

Comment 38

17 years ago
Comment on attachment 54520 [details] [diff] [review]
one more update as suggested by seth

carry over review
Attachment #54520 - Flags: review+
(Assignee)

Updated

17 years ago
Attachment #54520 - Attachment is obsolete: true
Attachment #54520 - Flags: review+
(Assignee)

Comment 39

17 years ago
Created attachment 54527 [details] [diff] [review]
latest patch
(Assignee)

Comment 40

17 years ago
Seth, 
I read the document carefully and you are right, NS_LITERAL_STRING is better. 
But I still have some doubt about issue 3, though I did find several instances 
of using nsCString and nsXPIDLString in such situation. Because I am sure that 
length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is < 128, I believe my original code 
is safe. While if you really think we should prepare for the unknown and  try to 
avoid using heap, things need to get a little bit more complicated. I checked 
the string class implemenation. In nsStr.h, it says nsCString is heap based, 
and nsCAutoString is partially stack based. 64 is not large enough in this case,
so I prepared the buf in stack for nsCAutoString. 

I update the patch, please sr. 
that last patch doesn't look like a patch.

can you attach the patch?
(Assignee)

Comment 42

17 years ago
Created attachment 54556 [details] [diff] [review]
post last patch again. (last one is corrupted.)
(Assignee)

Updated

17 years ago
Attachment #54527 - Attachment is obsolete: true
Comment on attachment 54556 [details] [diff] [review]
post last patch again. (last one is corrupted.)

sr=sspitzer

it looks like we only execute this code if autodetect is turned on.

Is there a way re-use the 8K buffer we allocate, instead of allocating and creating it for each part?

I don't know the mime code well enough to suggest an approach.
Attachment #54556 - Flags: superreview+
(Assignee)

Comment 44

17 years ago
fix checked in. 
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.