Closed Bug 12481 Opened 25 years ago Closed 23 years ago

libmime to feed more characters for charset detection (reminder)

Categories

(MailNews Core :: MIME, defect, P3)

Product:

Component:

Type:

defect

Priority:

P3

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

mozilla0.9.6

People

(Reporter: nhottanscp, Assigned: shanjian)

References

Details

(Keywords: intl)

Attachments

(1 file, 7 obsolete files)

working patch 23 years ago Shanjian Li 15.45 KB, patch		Details \| Diff \| Splinter Review
patch 1.0 23 years ago Shanjian Li 16.16 KB, patch		Details \| Diff \| Splinter Review
patch 2.0, revision after naoki's suggestion 23 years ago Shanjian Li 16.04 KB, patch	nhottanscp : review+	Details \| Diff \| Splinter Review
patch 3.0 23 years ago Shanjian Li 16.06 KB, patch	shanjian : review+	Details \| Diff \| Splinter Review
updated patch 23 years ago Shanjian Li 16.10 KB, patch	bugzilla : review+	Details \| Diff \| Splinter Review
one more update as suggested by seth 23 years ago Shanjian Li 17.40 KB, patch		Details \| Diff \| Splinter Review
latest patch 23 years ago Shanjian Li 96 bytes, patch		Details \| Diff \| Splinter Review
post last patch again. (last one is corrupted.) 23 years ago Shanjian Li 16.19 KB, patch	sspitzer : superreview+	Details \| Diff \| Splinter Review

Reporter

Description

•

25 years ago

This is separated from #8899. libmime has been feeding characters to charset converter (and charset detector) per line bases since 4.x. This is not efficient (e.g. converters creation per line) also causes lack of enough data to charset detectors (causes detection failure and charset conversion failure). We need to change data passing MIME part base instead of line base whenever we have chance to change.

Reporter

Updated

•

25 years ago

QA Contact: lchiang → nhotta

Updated

•

25 years ago

Target Milestone: M14

Comment 1

•

25 years ago

M14

Comment 2

•

25 years ago

Bulk change to assigned. - rhp

Updated

•

25 years ago

Target Milestone: M14 → M19

Comment 3

•

25 years ago

Would be nice, but is unfortunately non-trivial because of the "line-by-line" nature of libmime. - rhp

Updated

•

25 years ago

Target Milestone: M19 → M20

Comment 4

•

25 years ago

Is there someplace else to cache the charset converter and detector which doesn't force a rewrite of libmime?

Updated

•

25 years ago

Target Milestone: M20 → Future

Comment 5

•

24 years ago

reassigning to ducarroz

Assignee: rhp → ducarroz

Status: ASSIGNED → NEW

Reporter

Updated

•

24 years ago

Blocks: 59798

Reporter

Updated

•

24 years ago

Keywords: intl

Reporter

Updated

•

24 years ago

Blocks: 90581

Reporter

Comment 6

•

24 years ago

Can MimeObject cache converter and detector? MimeObject is not per line (per message, I think).

Assignee

Comment 7

•

24 years ago

*** Bug 44377 has been marked as a duplicate of this bug. ***

Assignee

Comment 8

•

24 years ago

add myself to cc list

Comment 9

•

23 years ago

*** Bug 96025 has been marked as a duplicate of this bug. ***

Assignee

Comment 10

•

23 years ago

This problem need to be taken care of. Reassign to myself.

Assignee: ducarroz → shanjian

Assignee

Comment 11

•

23 years ago

Attached patch working patch (obsolete) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

Assignee

Comment 12

•

23 years ago

ducarroz, naoki, I just post a draft patch for your suggestion and recommendation. This patch was mostly implemented within MimeInlineText. It uses a buffer to keep data and sends them in one batch to detector. I also add some optimization work there. Please take a look and offer some advice. Thx!

Reporter

Comment 13

•

23 years ago

I am looking at the patch. One question, how does this work for a multi part message? Is the buffer allocated per part or just one buffer for one message?

Assignee

Comment 14

•

23 years ago

The buffer is allocated per part. MimeInlineText will be the child of MimeMultiPart in case of multipart message. As you seen, the buffer is allocated for MimeInlineText.

Reporter

Comment 15

•

23 years ago

Your change also includes caching unicode encoder/decoder in a mime object. This is good. The caching is done regardless of the auto detection is on or off, correct?

Assignee

Comment 16

•

23 years ago

Yes. It does not seem reasonable to me to normalize charset name and query for decoder/encoder for each line. I do allow charset change in the middle of parsing to handle meta charset, so decoder/encoder might change as well.

Reporter

Comment 17

•

23 years ago

Looks good, are you going to revise the working patch for reviews?

Reporter

Comment 18

•

23 years ago

Shanjian, could you change the target milestone? I think part of your change (cache charset converters) would improve performance.

Assignee

Comment 19

•

23 years ago

Attached patch patch 1.0 (obsolete) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Attachment #51888 - Attachment is obsolete: true

Assignee

Comment 20

•

23 years ago

retarget this bug.

Target Milestone: Future → mozilla0.9.6

Reporter

Comment 21

•

23 years ago

1) "mail.charset.detector" is not used, you can remove that part of the code. 2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK then NS_OK is returned and low confident charset to be use. That is not desirable. + res = detector->DoIt(aBuf, aLength, aCharset, oConfident); + if (NS_SUCCEEDED(res) && (eBestAnswer == oConfident || eSureAnswer == oConfident)) { + return NS_OK; + } 3) I think this will leak if text->charset is not null. + if (!text->lineDamBuffer || !text->lineDamPtrs) + { + text->charset = text->defaultCharset; + text->defaultCharset = nsnull; 4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe. + if (text->utf8Encoder == nsnull) + MIME_get_unicode_encoder("UTF-8", getter_AddRefs(text->utf8Encoder)); 5) Question: this is allocating a 8k buffer for one part then copy up to 1024 lines to the buffer. So only one buffer allocation per one part, correct? + if (text->inputAutodetect) + { + //we need to prepare lineDam for charset detection + text->lineDamBuffer = (char*)PR_Malloc(DAM_MAX_BUFFER_SIZE); + text->lineDamPtrs = (char**)PR_Malloc(DAM_MAX_LINES*sizeof(char*)); 6) Related to the last question, text->lastLineInDam++; do you need to check the DAM_MAX_LINES there? + else { + //buffering current line + text->lineDamPtrs[text->lastLineInDam] = text->lineDamBuffer + text->curDamOffset; + nsCRT::memcpy(text->lineDamPtrs[text->lastLineInDam], line, length); + text->lastLineInDam++; + text->curDamOffset += length; + }

Assignee

Comment 22

•

23 years ago

1) "mail.charset.detector" is not used, you can remove that part of the code. removed as you wish. 2) In case 'oConfident' is not eBestAnswer and eSureAnswer but 'res' is NS_OK then NS_OK is returned and low confident charset to be use. That is not desirable. now I return null in aCharset if confidence is not good, and detectede charset is checked before use. 3) I think this will leak if text->charset is not null. + if (!text->lineDamBuffer || !text->lineDamPtrs) + { + text->charset = text->defaultCharset; + text->defaultCharset = nsnull; Under no circumstance text->charset will be non-null at this stage. I put a assert there. 4) This is not necessary if the global one 'm_unicodeToUTF8Encoder' is avaialbe. I don't think 'm_unicodeToUTF8Encoder' is ever used. Those code probably need to be removed. 5) Question: this is allocating a 8k buffer for one part then copy up to 1024 lines to the buffer. So only one buffer allocation per one part, correct? Yes, one buffer for each text mimepart. 6) Related to the last question, text->lastLineInDam++; do you need to check the DAM_MAX_LINES there? No, it will be checked before a new line is added, not after. There is no need to do it twice.

Assignee

Comment 23

•

23 years ago

Attached patch patch 2.0, revision after naoki's suggestion (obsolete) — Details — Splinter Review

Reporter

Comment 24

•

23 years ago

3) I am not clear about your change in MimeInlineText_initialize. Could you explain your change there? It used to set to text->charset in case of it's null, that's why I though it would leak. Now you changed not to set it there but set it later conditionally only when text->inputAutodetect is true. 5) Please add comments about the buffer and why you chose 8k.

Assignee

Comment 25

•

23 years ago

3) I am not clear about your change in MimeInlineText_initialize. Could you explain your change there? It used to set to text->charset in case of it's null, that's why I though it would leak. Now you changed not to set it there but set it later conditionally only when text->inputAutodetect is true. In my patch, only when text->charset is null after we tried all sources, will text->defaultCharset be assigned and text->inputAutodetect be set to true. So only one of text->charset and text->defaultCharset will be assigned after MimeInlineText_initialize. 5) Please add comments about the buffer and why you chose 8k. Normally it only takes 1K of real data for autodetector to reach good result. But unfortunately, because of tags and big chunk of javascripts normally attached to the head of html page, 8K is a guess to cover most cases. (A better choice might be to keep feeding until detector returns a result or reach certain limit. But the interface of autodetector need to be changed and I am unwilling to do it now.) Another reason is I assumed 8K of data can normally be contained in one UDP package if UDP is used. (No sure if it is true or not.)

Reporter

Comment 26

•

23 years ago

Comment on attachment 53434 [details] [diff] [review] patch 2.0, revision after naoki's suggestion r=nhotta please add comments about the items (#3, #5). Also, please do the i18n mail smoke test. http://www.mozilla.org/quality/intl/tests/data/

Attachment #53434 - Flags: review+

Reporter

Updated

•

23 years ago

Blocks: 82591

Assignee

Comment 27

•

23 years ago

Attached patch patch 3.0 (obsolete) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Attachment #53349 - Attachment is obsolete: true

Assignee

Updated

•

23 years ago

Attachment #53434 - Attachment is obsolete: true

Assignee

Comment 28

•

23 years ago

Comment on attachment 53665 [details] [diff] [review] patch 3.0 taking over last review

Attachment #53665 - Flags: review+

Assignee

Comment 29

•

23 years ago

There is only one difference between patch 3.0 and 2.0, that is - if (NS_SUCCEEDED(res) && *detectedCharset) + if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset) the previous one might crash when detector return nothing. Mail news smoke test is done and looks fine.

Assignee

Comment 30

•

23 years ago

Seth, could you sr this bug?

(not reading, please use seth@sspitzer.org instead)

Comment 31

•

23 years ago

you're going to want the module owner for mime (ducarroz) to review this patch. can you provide information on when this code is hit? (once per message, if my pref is set for autodetect? any extra overhead for for US-ASCII users?) I'll leave it to the reviewer (ducarroz) to review memory usage (allocs and frees) and to make sure allocations and copying is minimized. some initial comments: 1) + nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); there's a contact id for pref, please use it instead of the CID. 2) double check your string usage. to determine the contract id, do something like this: nsCAutoString detector_contractid; detector_contractid = NS_STRCDETECTOR_CONTRACTID_BASE; nsXPIDLString detector_name; rv = prefs->GetLocalizedUnicharPref("intl.charset.detector", getter_Copies(detector_name)); if (NS_SUCCEEDED(rv)) { detector_contractid += NS_ConvertUCS2toUTF8(detector_name).get(); } 3) is there a contract id for kCharsetConverterManagerCID? if so, use it instead of the static cid. 4) NS_ConvertASCIItoUCS2("ISO-8859-1").get() NS_LITERAL_STRING("ISO-8859-1").get() is the preferred way.

Assignee

Comment 32

•

23 years ago

Attached patch updated patch (obsolete) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Attachment #53665 - Attachment is obsolete: true

Assignee

Comment 33

•

23 years ago

seth, Thanks for taking a look at my patch. There are some points in your comment need further explainations. > you're going to want the module owner for mime (ducarroz) to review this > patch. I did ask ducarroz, and he told me naoki's review is OK. >can you provide information on when this code is hit? (once per message, if my >pref is set for autodetect? any extra overhead for for US-ASCII users?) The code is hit only when autodetector is selected. If an email has no charset information specified, autodetector will be launched. In another word, if email says itself is in us-ascii, there will be no overhead. 1) + nsCOMPtr<nsIPref> prefs(do_GetService(kPrefCID, &res)); Done. 2) double check your string usage. Is it because nsCAutoString is available so that we have to use it? Existing code works faster and has no heap memory allocation. IMO traditional C/C++ string practice is allowed when appropriate. We know NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector name are strings that can be enumerate, so 128 is enough and there is no reason to use heap allocation. Is there other reason you want me to change so? 3) is there a contract id for kCharsetConverterManagerCID? if so, use it instead of the static cid. Done. 4) NS_ConvertASCIItoUCS2("ISO-8859-1").get() My understanding of document "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong.

Jean-Francois Ducarroz

Comment 34

•

23 years ago

Right, I let Naoki review it as it essentially a I18n matter. I'll anyway take a look at it...

Jean-Francois Ducarroz

Comment 35

•

23 years ago

Comment on attachment 54012 [details] [diff] [review] updated patch Looks OK. R=ducarroz

Attachment #54012 - Flags: review+

(not reading, please use seth@sspitzer.org instead)

Comment 36

•

23 years ago

> NS_STRCDETECTOR_CONTRACTID_BASE is a constant string, and all charset detector > name are strings that can be enumerate, so 128 is enough and there is no reason > to use heap allocation. Is there other reason you want me to change so? + nsresult res; + char detector_contractid[128]; + PRUnichar* detector_name = nsnull; + nsCOMPtr<nsIStringCharsetDetector> detector; + PL_strcpy(detector_contractid, NS_STRCDETECTOR_CONTRACTID_BASE); + + nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &res)); + if (NS_SUCCEEDED(res)) { + if (NS_SUCCEEDED(prefs->GetLocalizedUnicharPref("intl.charset.detector", &detector_name))) { + PL_strcat(detector_contractid, NS_ConvertUCS2toUTF8(detector_name).get()); + PR_FREEIF(detector_name); + } + } if length(NS_ConvertUCS2toUTF8(detector_name).get()) + length(NS_STRCDETECTOR_CONTRACTID_BASE) is > 128, you now have a buffer overrun on the stack. checking in this code would expose a security exploit. make detector_contractid a nsCString (not an AutoString, if it is > 63 bytes, an autostring will be copied to the heap.) and make detector_name a nsXPIDLString, so you don't have to worry about the free(). > 4) NS_ConvertASCIItoUCS2("ISO-8859-1").get() > My understanding of document > "http://lxr.mozilla.org/seamonkey/source/string/doc/string-guide.html" is that > NS_ConvertASCIItoUCS2 is the prefered way. Let me know if I am wrong. My understand is that pn platforms that support the "L" trick (like windows) NS_LITERAL_STRING() does the conversion at compile time, instead of at run time.

Assignee

Comment 37

•

23 years ago

Attached patch one more update as suggested by seth (obsolete) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Attachment #54012 - Attachment is obsolete: true

Assignee

Comment 38

•

23 years ago

Comment on attachment 54520 [details] [diff] [review] one more update as suggested by seth carry over review

Attachment #54520 - Flags: review+

Assignee

Updated

•

23 years ago

Attachment #54520 - Attachment is obsolete: true

Attachment #54520 - Flags: review+

Assignee

Comment 39

•

23 years ago

Attached patch latest patch (obsolete) — Details — Splinter Review

Assignee

Comment 40

•

23 years ago

Seth, I read the document carefully and you are right, NS_LITERAL_STRING is better. But I still have some doubt about issue 3, though I did find several instances of using nsCString and nsXPIDLString in such situation. Because I am sure that length(NS_ConvertUCS2toUTF8(detector_name).get()) + length(NS_STRCDETECTOR_CONTRACTID_BASE) is < 128, I believe my original code is safe. While if you really think we should prepare for the unknown and try to avoid using heap, things need to get a little bit more complicated. I checked the string class implemenation. In nsStr.h, it says nsCString is heap based, and nsCAutoString is partially stack based. 64 is not large enough in this case, so I prepared the buf in stack for nsCAutoString. I update the patch, please sr.

(not reading, please use seth@sspitzer.org instead)

Comment 41

•

23 years ago

that last patch doesn't look like a patch. can you attach the patch?

Assignee

Comment 42

•

23 years ago

Attached patch post last patch again. (last one is corrupted.) — Details — Splinter Review

Assignee

Updated

•

23 years ago

Attachment #54527 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 43

•

23 years ago

Comment on attachment 54556 [details] [diff] [review] post last patch again. (last one is corrupted.) sr=sspitzer it looks like we only execute this code if autodetect is turned on. Is there a way re-use the 8K buffer we allocate, instead of allocating and creating it for each part? I don't know the mime code well enough to suggest an approach.

Attachment #54556 - Flags: superreview+

Assignee

Comment 44

•

23 years ago

fix checked in.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → MailNews Core

You need to log in before you can comment on or make changes to this bug.