Closed Bug 12481 Opened 25 years ago Closed 23 years ago

libmime to feed more characters for charset detection (reminder)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

mozilla0.9.6

People

(Reporter: nhottanscp, Assigned: shanjian)

References

Details

(Keywords: intl)

Attachments

(1 file, 7 obsolete files)

working patch 23 years ago Shanjian Li 15.45 KB, patch		Details \| Diff \| Splinter Review
patch 1.0 23 years ago Shanjian Li 16.16 KB, patch		Details \| Diff \| Splinter Review
patch 2.0, revision after naoki's suggestion 23 years ago Shanjian Li 16.04 KB, patch	nhottanscp : review+	Details \| Diff \| Splinter Review
patch 3.0 23 years ago Shanjian Li 16.06 KB, patch	shanjian : review+	Details \| Diff \| Splinter Review
updated patch 23 years ago Shanjian Li 16.10 KB, patch	bugzilla : review+	Details \| Diff \| Splinter Review
one more update as suggested by seth 23 years ago Shanjian Li 17.40 KB, patch		Details \| Diff \| Splinter Review
latest patch 23 years ago Shanjian Li 96 bytes, patch		Details \| Diff \| Splinter Review
post last patch again. (last one is corrupted.) 23 years ago Shanjian Li 16.19 KB, patch	sspitzer : superreview+	Details \| Diff \| Splinter Review

nhottanscp

Reporter

Description

•

25 years ago

This is separated from #8899.
libmime has been feeding characters to charset converter (and charset detector)
per line bases since 4.x. This is not efficient (e.g. converters creation per
line) also causes lack of enough data to charset detectors (causes detection
failure and charset conversion failure).
We need to change data passing MIME part base instead of line base whenever we
have chance to change.

nhottanscp

Reporter

Updated

•

25 years ago

QA Contact: lchiang → nhotta

Phil Peterson

Updated

•

25 years ago

Target Milestone: M14

Phil Peterson

Comment 1

•

25 years ago

M14

rhp (gone)

Comment 2

•

25 years ago

Bulk change to assigned.

- rhp

rhp (gone)

Updated

•

25 years ago

Target Milestone: M14 → M19

rhp (gone)

Comment 3

•

25 years ago

Would be nice, but is unfortunately non-trivial because of the "line-by-line"
nature of libmime.

- rhp

rhp (gone)

Updated

•

25 years ago

Target Milestone: M19 → M20

Phil Peterson

Comment 4

•

25 years ago

Is there someplace else to cache the charset converter and detector which
doesn't force a rewrite of libmime?

rhp (gone)

Updated

•

24 years ago

Target Milestone: M20 → Future

scottputterman

Comment 5

•

24 years ago

reassigning to ducarroz

Assignee: rhp → ducarroz

Status: ASSIGNED → NEW

nhottanscp

Reporter

Updated

•

23 years ago

Blocks: 59798

nhottanscp

Reporter

Updated

•

23 years ago

Keywords: intl

nhottanscp

Reporter

Updated

•

23 years ago

Blocks: 90581

nhottanscp

Reporter

Comment 6

•

23 years ago

Can MimeObject cache converter and detector? MimeObject is not per line (per
message, I think).

Shanjian Li

Assignee

Comment 7

•

23 years ago

*** Bug 44377 has been marked as a duplicate of this bug. ***

Shanjian Li

Assignee

Comment 8

•

23 years ago

add myself to cc list

Comment 9

•

23 years ago

*** Bug 96025 has been marked as a duplicate of this bug. ***

Shanjian Li

Assignee

Comment 10

•

23 years ago

This problem need to be taken care of. 
Reassign to myself.

Assignee: ducarroz → shanjian

Shanjian Li

Assignee

Comment 11

•

23 years ago

Attached patch working patch (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

Shanjian Li

Assignee

Comment 12

•

23 years ago

ducarroz, naoki, 
I just post a draft patch for your suggestion and recommendation. This patch was 
mostly implemented within MimeInlineText. It uses a buffer to keep data and 
sends them in one batch to detector. I also add some optimization work there. 
Please take a look and offer some advice.  Thx!

nhottanscp

Reporter

Comment 13

•

23 years ago

I am looking at the patch. One question, how does this work for a multi part
message? Is the buffer allocated per part or just one buffer for one message?

Shanjian Li

Assignee

Comment 14

•

23 years ago

The buffer is allocated per part. MimeInlineText will be the child of 
MimeMultiPart in case of multipart message. As you seen, the buffer is allocated 
for MimeInlineText.

nhottanscp

Reporter

Comment 15

•

23 years ago

Your change also includes caching unicode encoder/decoder in a mime object.
This is good. The caching is done regardless of the auto detection is on or off,
correct?

Shanjian Li

Assignee

Comment 16

•

23 years ago

Yes. It does not seem reasonable to me to normalize charset name and query for 
decoder/encoder for each line. I do allow charset change in the middle of 
parsing to handle meta charset, so decoder/encoder might change as well.

nhottanscp

Reporter

Comment 17

•

23 years ago

Looks good, are you going to revise the working patch for reviews?

nhottanscp

Reporter

Comment 18

•

23 years ago

Shanjian, could you change the target milestone?
I think part of your change (cache charset converters) would improve performance.

Shanjian Li

Assignee

Comment 19

•

23 years ago

Attached patch patch 1.0 (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #51888 - Attachment is obsolete: true

Shanjian Li

Assignee

Comment 20

•

23 years ago

retarget this bug.

Target Milestone: Future → mozilla0.9.6

nhottanscp

Reporter

Comment 21

•

23 years ago

1) "mail.charset.detector" is not used, you can remove that part of the code.

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not desirable.
+      res = detector->DoIt(aBuf, aLength, aCharset, oConfident);
+      if (NS_SUCCEEDED(res) && (eBestAnswer == oConfident || eSureAnswer ==
oConfident)) {
+        return NS_OK;
+      }

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.

+  if (text->utf8Encoder == nsnull)
+    MIME_get_unicode_encoder("UTF-8", getter_AddRefs(text->utf8Encoder));

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?

+  if (text->inputAutodetect)
+  {
+    //we need to prepare lineDam for charset detection
+    text->lineDamBuffer = (char*)PR_Malloc(DAM_MAX_BUFFER_SIZE);
+    text->lineDamPtrs = (char**)PR_Malloc(DAM_MAX_LINES*sizeof(char*));

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?

+      else {
+        //buffering current line
+        text->lineDamPtrs[text->lastLineInDam] = text->lineDamBuffer +
text->curDamOffset;
+        nsCRT::memcpy(text->lineDamPtrs[text->lastLineInDam], line, length);
+        text->lastLineInDam++;
+        text->curDamOffset += length;
+      }

Shanjian Li

Assignee

Comment 22

•

23 years ago

1) "mail.charset.detector" is not used, you can remove that part of the code.
removed as you wish. 

2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK
then NS_OK is returned and low confident charset to be use. That is not 
desirable.
now I return null in aCharset if confidence is not good, and detectede charset 
is checked before use. 

3) I think this will leak if text->charset is not null.
+    if (!text->lineDamBuffer || !text->lineDamPtrs)
+    {
+      text->charset = text->defaultCharset;
+      text->defaultCharset = nsnull;
Under no circumstance text->charset will be non-null at this stage. I put a 
assert there.

4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe.
I don't think 'm_unicodeToUTF8Encoder' is ever used. Those code probably need to 
be removed. 

5) Question: this is allocating a 8k buffer for one part then copy up to 1024
lines to the buffer. So only one buffer allocation per one part, correct?
Yes, one buffer for each text mimepart. 

6) Related to the last question, text->lastLineInDam++; do you need to check the
DAM_MAX_LINES there?
No, it will be checked before a new line is added, not after. There is no need 
to do it twice.

Shanjian Li

Assignee

Comment 23

•

23 years ago

Attached patch patch 2.0, revision after naoki's suggestion (obsolete) — Details — Splinter Review

nhottanscp

Reporter

Comment 24

•

23 years ago

3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.

5) Please add comments about the buffer and why you chose 8k.

Shanjian Li

Assignee

Comment 25

•

23 years ago

3) I am not clear about your change in MimeInlineText_initialize. Could you
explain your change there? It used to set to text->charset in case of it's null,
that's why I though it would leak. Now you changed not to set it there but set
it later conditionally only when text->inputAutodetect is true.
In my patch, only when text->charset is null after we tried all sources, will 
text->defaultCharset be assigned and text->inputAutodetect be set to true. So 
only one of text->charset and text->defaultCharset will be assigned after 
MimeInlineText_initialize. 

5) Please add comments about the buffer and why you chose 8k.
Normally it only takes 1K of real data for autodetector to reach good result. 
But unfortunately, because of tags and big chunk of javascripts normally 
attached to the head of html page, 8K is a guess to cover most cases. (A better 
choice might be to keep feeding until detector returns a result or reach certain 
limit. But the interface of autodetector need to be changed and I am unwilling 
to do it now.) Another reason is I assumed 8K of data can normally be contained 
in one UDP package if UDP is used. (No sure if it is true or not.)

nhottanscp

Reporter

Comment 26

•

23 years ago

Comment on attachment 53434 [details] [diff] [review]
patch 2.0, revision after naoki's suggestion

r=nhotta
please add comments about the items (#3, #5).
Also, please do the i18n mail smoke test.
http://www.mozilla.org/quality/intl/tests/data/

Attachment #53434 - Flags: review+

nhottanscp

Reporter

Updated

•

23 years ago

Blocks: 82591

Shanjian Li

Assignee

Comment 27

•

23 years ago

Attached patch patch 3.0 (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #53349 - Attachment is obsolete: true

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #53434 - Attachment is obsolete: true

Shanjian Li

Assignee

Comment 28

•

23 years ago

Comment on attachment 53665 [details] [diff] [review]
patch 3.0

taking over last review

Attachment #53665 - Flags: review+

Shanjian Li

Assignee

Comment 29

•

23 years ago

There is only one difference between patch 3.0 and 2.0, that is 
- if (NS_SUCCEEDED(res) && *detectedCharset)   
+ if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset) 

the previous one might crash when detector return nothing.  

Mail news smoke test is done and looks fine.

Shanjian Li

Assignee

Comment 30

•

23 years ago

Seth, could you sr this bug?

(not reading, please use seth@sspitzer.org instead)

Comment 31

•

23 years ago

you're going to want the module owner for mime (ducarroz) to review this patch.

can you provide information on when this code is hit?  (once per message, if my
pref is set for autodetect?  any extra overhead for for US-ASCII users?)

I'll leave it to the reviewer (ducarroz) to review memory usage (allocs and
frees) and to make sure allocations and copying is minimized.

some initial comments:

1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

there's a contact id for pref, please use it instead of the CID.

2)  double check your string usage. 

to determine the contract id, do something like this:

nsCAutoString detector_contractid;
detector_contractid = NS_STRCDETECTOR_CONTRACTID_BASE;

nsXPIDLString detector_name;
rv = prefs->GetLocalizedUnicharPref("intl.charset.detector",
getter_Copies(detector_name));
if (NS_SUCCEEDED(rv)) {
  detector_contractid += NS_ConvertUCS2toUTF8(detector_name).get();
}

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()

NS_LITERAL_STRING("ISO-8859-1").get() is the preferred way.

Shanjian Li

Assignee

Comment 32

•

23 years ago

Attached patch updated patch (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #53665 - Attachment is obsolete: true

Shanjian Li

Assignee

Comment 33

•

23 years ago

seth, 
Thanks for taking a look at my patch. There are some points in your comment need 
further explainations.

> you're going to want the module owner for mime (ducarroz) to review this > 
patch.
I did ask ducarroz, and he told me naoki's review is OK.

>can you provide information on when this code is hit?  (once per message, if my
>pref is set for autodetect?  any extra overhead for for US-ASCII users?)
The code is hit only when autodetector is selected. If an email has no charset 
information specified, autodetector will be launched. In another word, if email 
says itself is in us-ascii, there will be no overhead.
1) +  nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); 

Done. 

2)  double check your string usage. 

Is it because nsCAutoString is available so that we have to use it? Existing 
code works faster and has no heap memory allocation. IMO traditional C/C++ 
string practice is allowed when appropriate. We know 
NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
name are strings that can be enumerate, so 128 is enough and there is no reason 
to use heap allocation. Is there other reason you want me to change so?

3)  is there a contract id for kCharsetConverterManagerCID?  if so, use it
instead of the static cid.

Done.

4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
My understanding of document 
"http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong.

Jean-Francois Ducarroz

Comment 34

•

23 years ago

Right, I let Naoki review it as it essentially a I18n matter. I'll anyway take a
look at it...

Jean-Francois Ducarroz

Comment 35

•

23 years ago

Comment on attachment 54012 [details] [diff] [review]
updated patch

Looks OK. R=ducarroz

Attachment #54012 - Flags: review+

(not reading, please use seth@sspitzer.org instead)

Comment 36

•

23 years ago

> NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector 
> name are strings that can be enumerate, so 128 is enough and there is no reason 
> to use heap allocation. Is there other reason you want me to change so?

+  nsresult res;
+  char detector_contractid[128];
+  PRUnichar* detector_name = nsnull;
+  nsCOMPtr<nsIStringCharsetDetector> detector;
+  PL_strcpy(detector_contractid, NS_STRCDETECTOR_CONTRACTID_BASE);
+
+  nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &res)); 
+  if (NS_SUCCEEDED(res)) {
+    if (NS_SUCCEEDED(prefs->GetLocalizedUnicharPref("intl.charset.detector",
&detector_name))) {
+      PL_strcat(detector_contractid, NS_ConvertUCS2toUTF8(detector_name).get());
+      PR_FREEIF(detector_name);
+    }
+  }

if length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is > 128, you now have a buffer overrun
on the stack.  checking in this code would expose a security exploit.

make detector_contractid a nsCString (not an AutoString, if it is > 63 bytes, an
autostring will be copied to the heap.) and make detector_name a nsXPIDLString,
so you don't have to worry about the free().


> 4)  NS_ConvertASCIItoUCS2("ISO-8859-1").get()
> My understanding of document 
> "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that 
> NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. 

My understand is that pn platforms that support the "L" trick (like windows)
NS_LITERAL_STRING() does the conversion at compile time, instead of at run time.

Shanjian Li

Assignee

Comment 37

•

23 years ago

Attached patch one more update as suggested by seth (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #54012 - Attachment is obsolete: true

Shanjian Li

Assignee

Comment 38

•

23 years ago

Comment on attachment 54520 [details] [diff] [review]
one more update as suggested by seth

carry over review

Attachment #54520 - Flags: review+

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #54520 - Attachment is obsolete: true

Attachment #54520 - Flags: review+

Shanjian Li

Assignee

Comment 39

•

23 years ago

Attached patch latest patch (obsolete) — Details — Splinter Review

Shanjian Li

Assignee

Comment 40

•

23 years ago

Seth, 
I read the document carefully and you are right, NS_LITERAL_STRING is better. 
But I still have some doubt about issue 3, though I did find several instances 
of using nsCString and nsXPIDLString in such situation. Because I am sure that 
length(NS_ConvertUCS2toUTF8(detector_name).get()) +
length(NS_STRCDETECTOR_CONTRACTID_BASE) is < 128, I believe my original code 
is safe. While if you really think we should prepare for the unknown and  try to 
avoid using heap, things need to get a little bit more complicated. I checked 
the string class implemenation. In nsStr.h, it says nsCString is heap based, 
and nsCAutoString is partially stack based. 64 is not large enough in this case,
so I prepared the buf in stack for nsCAutoString. 

I update the patch, please sr.

(not reading, please use seth@sspitzer.org instead)

Comment 41

•

23 years ago

that last patch doesn't look like a patch.

can you attach the patch?

Shanjian Li

Assignee

Comment 42

•

23 years ago

Attached patch post last patch again. (last one is corrupted.) — Details — Splinter Review

Shanjian Li

Assignee

Updated

•

23 years ago

Attachment #54527 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 43

•

23 years ago

Comment on attachment 54556 [details] [diff] [review]
post last patch again. (last one is corrupted.)

sr=sspitzer

it looks like we only execute this code if autodetect is turned on.

Is there a way re-use the 8K buffer we allocate, instead of allocating and creating it for each part?

I don't know the mime code well enough to suggest an approach.

Attachment #54556 - Flags: superreview+

Shanjian Li

Assignee