50413 - In saving JPN msgs into HTML format, characters which contain 0x3C are misconverted into <

Reporter

Description

•

25 years ago

** Observed with 8/25/2000 Win32 build ** Take the Intl Smoketest mail folder and select the 2nd message from the top. Now save this message using HTML format -- File | Save as | File menu item. Give it a name and supply teh extension. .html and then save it. Now open this file and view it with Browser under Japanese (ISO-2022-JP) encoding. The subject and the body contain exactly the same data but the body text is partially corrupted displays as non-Japanese. Switch to ISO-8859-1 encoding and view hte page source and compare the Subject and Body lines that are in ISO-2022-JP. $B$3$l$OF|K\8l$N%a!<%k$G$9!#(B $B$3$l$OF|K\8l$N%a!<%k$G$9!#(B --> Note the misconversion of "<" into "<". Initially sending this to Naoki.

Katsuhiko Momoi

Reporter

Comment 1

•

25 years ago

This is a familiar porblem. It will probably affect other languages which uses multi-byte encodings (e.g. Chinese) in addition to Japanese.

nhottanscp

Assignee

Comment 2

•

25 years ago

HTML escape problem, it should be escaped when the data is in unicode before the charset conversion.

Status: NEW → ASSIGNED

nhottanscp

Assignee

Comment 3

•

25 years ago

nsbeta3, since this is data loss.

Keywords: nsbeta3

Jaime Rodriguez, Jr.

Comment 4

•

25 years ago

Marking as nsbeta3+ (P3) per I18N Bug Triage.

Whiteboard: [nsbeta3+]

Frank Tang

Comment 5

•

25 years ago

mark as P4

Priority: P3 → P4

nhottanscp

Assignee

Comment 6

•

25 years ago

< is generated by ScanTXT() which is used for plain text display and save as file. So I think the bug only happens when the original message is plain text and it's saved as HTML.

Frank Tang

Comment 7

•

25 years ago

But the ScanTXT function pass parameter by using PRUnichar* and nsString. I think the problem is the caller convert char* into PRUnichar* . One of the problem is here. The problem is this routine convert char* into nsString by using ASCII to PRUnichar* rule. http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#340 255 256 static int 257 MimeInlineTextPlainFlowed_parse_line (char *line, PRInt32 length, MimeObject *obj) 258 { 259 int status; 260 PRBool quoting = ( obj->options 261 && ( obj->options->format_out == nsMimeOutput::nsMimeMessageQuoting || 262 obj->options->format_out == nsMimeOutput::nsMimeMessageBodyQuoting 263 ) ); // see above 264 PRBool plainHTML = quoting || (obj->options && 265 obj->options->format_out == nsMimeOutput::nsMimeMessageSaveAs); 266 // see above 267 268 struct MimeInlineTextPlainFlowedExData *exdata; 269 exdata = MimeInlineTextPlainFlowedExDataList; 270 while(exdata && (exdata->ownerobj != obj)) { 271 exdata = exdata->next; 272 } 273 274 NS_ASSERTION(exdata, "The extra data has disappeared!"); 275 276 NS_ASSERTION(length > 0, "zero length"); 277 if (length <= 0) return 0; 278 279 uint32 linequotelevel = 0; 280 char *linep = line; 281 // Space stuffed? 282 if(' ' == *linep) { 283 linep++; 284 } else { 285 // count '>':s before the first non-'>' 286 while('>' == *linep) { 287 linep++; 288 linequotelevel++; 289 } 290 // Space stuffed? 291 if(' ' == *linep) { 292 linep++; 293 } 294 } 295 296 // Look if the last character (after stripping ending end 297 // of lines and quoting stuff) is a SPACE. If it is, we are looking at a 298 // flowed line. Normally we assume that the last two chars 299 // are CR and LF as said in RFC822, but that doesn't seem to 300 // be the case always. 301 PRBool flowed = PR_FALSE; 302 PRInt32 index = length-1; 303 while(index >= 0 && ('\r' == line[index] || '\n' == line[index])) { 304 index--; 305 } 306 if (index > linep - line && ' ' == line[index]) 307 /* Ignore space stuffing, i.e. lines with just 308 (quote marks and) a space count as empty */ 309 flowed = PR_TRUE; 310 311 mozITXTToHTMLConv *conv = GetTextConverter(obj->options); 312 313 PRBool skipConversion = !conv || 314 (obj->options && obj->options->force_user_charset); 315 316 nsString lineSource; 317 nsString lineResult; 318 lineSource.SetCapacity(kInitialBufferSize); 319 lineResult.SetCapacity(kInitialBufferSize); 320 321 if (!skipConversion) 322 { 323 lineSource.AssignWithConversion(linep, (length - (linep - line)) ); 324 PRUnichar* wresult = nsnull; 325 nsresult rv = NS_OK; 326 PRBool whattodo = obj->options->whattodo; 327 if (plainHTML) 328 { 329 if (quoting) 330 whattodo = 0; 331 else 332 whattodo = whattodo & ~mozITXTToHTMLConv::kGlyphSubstitution; 333 /* Do recognition for the case, the result is viewed in 334 Mozilla, but not GlyphSubstitution, because other UAs 335 might not be able to display the glyphs. */ 336 } 337 338 /* This is the main TXT to HTML conversion: 339 escaping (very important), eventually recognizing etc. */ 340 rv = conv->ScanTXT(lineSource.GetUnicode(), whattodo, &wresult); 341 if (NS_FAILED(rv)) return -1; 342 343 lineResult = wresult; 344 Recycle(wresult); 345 } 346 else 347 { 348 lineResult.AssignWithConversion(line, length); 349 status = NS_OK; 350 } 351 352 nsCAutoString preface; 353 354 /* Correct number of blockquotes */ 355 int32 quoteleveldiff=linequotelevel - exdata->quotelevel; 356 if((quoteleveldiff != 0) && flowed && exdata->inflow) { 357 // From RFC 2646 4.5 358 // The receiver SHOULD handle this error by using the 'quote-depth-wins' rule, 359 // which is to ignore the flowed indicator and treat the line as fixed. That 360 // is, the change in quote depth ends the paragraph. 361 362 // We get that behaviour by just going on. 363 } 364 while(quoteleveldiff>0) { 365 quoteleveldiff--; 366 preface += "<blockquote type=cite"; 367 // This is to have us observe the user pref settings for citations 368 MimeInlineTextPlainFlowed *tObj = (MimeInlineTextPlainFlowed *) obj; 369 char *style = MimeTextBuildPrefixCSS(tObj->mQuotedSizeSetting, 370 tObj->mQuotedStyleSetting, 371 tObj->mCitationColor); 372 if (!plainHTML && style && strlen(style)) 373 { 374 preface += " style=\""; 375 preface += style; 376 preface += '"'; 377 PR_FREEIF(style); 378 } 379 preface += '>'; 380 } 381 while(quoteleveldiff<0) { 382 quoteleveldiff++; 383 preface += "</blockquote>"; 384 } 385 exdata->quotelevel = linequotelevel; 386 387 nsString lineResult2; 388 lineResult2.SetCapacity(kInitialBufferSize); 389 390 if(flowed) { 391 // Check RFC 2646 "4.3. Usenet Signature Convention": "-- "+CRLF is 392 // not a flowed line 393 if 394 ( // is "-- "LINEBREAK 395 lineSource.Length() >= 4 396 && lineSource[0] == '-' 397 && 398 ( 399 lineSource.EqualsWithConversion("-- \r", PR_FALSE, 4) || 400 lineSource.EqualsWithConversion("-- \n", PR_FALSE, 4) 401 ) 402 ) 403 { 404 if (linequotelevel > 0 || exdata->isSig) 405 { 406 preface += "--  "; 407 } else { 408 exdata->isSig = PR_TRUE; 409 preface += 410 "<div class=\"txt-sig\">--  "; 411 } 412 } else { 413 Line_convert_whitespace(lineResult, PR_FALSE /* Allow wraps */, 414 lineResult2); 415 } 416 417 exdata->inflow=PR_TRUE; 418 } else { 419 // Fixed paragraph. 420 Line_convert_whitespace(lineResult, 421 !plainHTML && !obj->options->wrap_long_lines_p 422 /* If wrap, convert all spaces but the last in 423 a row into nbsp, otherwise all. */, 424 lineResult2); 425 lineResult2 += NS_LITERAL_STRING(" "); 426 exdata->inflow = PR_FALSE; 427 } // End Fixed line 428 429 if (!(exdata->isSig && quoting)) 430 { 431 char* tmp = preface.ToNewCString(); 432 status = MimeObject_write(obj, tmp, preface.Length(), PR_TRUE); 433 Recycle(tmp); 434 tmp = lineResult2.ToNewCString(); 435 status = MimeObject_write(obj, tmp, lineResult2.Length(), PR_TRUE); 436 Recycle(tmp); 437 return status; 438 } 439 else 440 return NS_OK; 441 } Looks like the same kind of problem in http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpla.cpp#398 and http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpla.cpp#428 I think the function ScanTXT itself have no problem. The problem is we apply some expansion code which should apply to the PRUnichar* data after it have already convert to char*.

Frank Tang

Comment 8

•

25 years ago

add rhp (and if someone can, please add BenB's correct email address ) to the cc list since these code (Those code which convert char* to nsString and then pass to scanTXT() ) are check in by them.

rhp (gone)

Comment 9

•

25 years ago

Adding Ben.

Ben Bucksch (:BenB)

Comment 10

•

25 years ago

> I think the problem is the caller convert char* into PRUnichar* . > One of the problem is here. The problem is this routine convert char* into > nsString by using ASCII to PRUnichar* rule. You mean |lineSource.AssignWithConversion(linep, ...);|? What is wrong with that? What should I use instead? Is this a new problem? IIRC, this conversion has always (i.e. since end of last year) been done this way. Or did I break it with my last checkin? (I just noticed that we should propably do the quote handling ("...286 while('>' == *linep)...") with nsString's instead of char*s, too. But that is really an old problem. Adding Daniel Bratell.)

nhottanscp

Assignee

Comment 11

•

25 years ago

I think the usage of nsString is not correct here but that's not the reason which causes this bug. This part is shared by message display, quoting, save as, maybe more. So, I wouldn't correct that at this time. The problem here is that the string is usually converted to UTF-8 but for saveas case it is in mail charset (e.g. ISO-2022-JP). Since ScanTXT does not expect ISO-2022-JP as an input, it does not escape correctly. So I was thinking to do a special case for saveas. It is ugly but minimize possible side effects. The change would be for save as only convert from mail charset to unicode before ScanTXT then convert back to mail charset.

Ben Bucksch (:BenB)

Comment 12

•

25 years ago

> So I was thinking to do a special case for saveas. It is ugly but minimize > possible side effects. It's not that these classes were pretty right now :). > convert from mail > charset to unicode before ScanTXT then convert back to mail charset. I like this solution - this is what I thought it would work.

nhottanscp

Assignee

Comment 13

•

25 years ago

Attached patch patch file — Details — Splinter Review

Frank Tang

Updated

•

25 years ago

Whiteboard: [nsbeta3+] → [nsbeta3+]fix in hand, need review

Frank Tang

Comment 14

•

25 years ago

nhotta said he have a patch in hand, need code review and test. He today crash before he can hit that code, with or without his patch. nhotta- please attach your patch here. Thanks

Ben Bucksch (:BenB)

Comment 15

•

25 years ago

I did not review, but looked at the file. Suggestions/Corrections: In mimetpla.cpp: - |lineSourceStr| is just the nsString versionof |line|. Use that instead of |newcstr|. - In the charset-magic else block, ScanTXT is ran over the whole source line. This is wrong, you need to skip the first |logicalLineStart| chars. This is propably just the tip of the iceberg - I (and Daniel, too, as it seems) wrote this stuff in the assumption that I operate with Unicode only. We should leave this open or file a new bug about that. Also, independant from that, I need some education about I18N in general and in Mozilla, especially with |ns*String|. I'll look at the website, when I have time, but pointers to references, tutorials etc. appreciated.

nhottanscp

Assignee

Comment 16

•

25 years ago

> - |lineSourceStr| is just the nsString versionof |line|. Use that instead of > |newcstr|. The problem is that is using nsString but the actual string is UTF-8 because of the incorrect use of nsString. > - In the charset-magic else block, ScanTXT is ran over the whole source line. > This is wrong, you need to skip the first |logicalLineStart| chars. I ignore that because I thought that's only needed for quoting and the else block is only for save as. But do I also need that for saving quoted text?

rhp (gone)

Comment 17

•

25 years ago

Hi Naoki, Looks reasonable to me. r: rhp - rhp

Ben Bucksch (:BenB)

Comment 18

•

25 years ago

nhottanscp

Assignee

Comment 19

•

25 years ago

Attached file a new patch with Ben's suggestion — Details

Ben Bucksch (:BenB)

Comment 20

•

25 years ago

nhottanscp

Assignee

Comment 21

•

25 years ago

Attached patch patch for the crasher mentioned by ftang on 9/11 — Details — Splinter Review

nhottanscp

Assignee

Updated

•

25 years ago

Whiteboard: [nsbeta3+]fix in hand, need review → [nsbeta3+]fix in hand, reviewed

Frank Tang

Comment 22

•

25 years ago

also crashing problem. change the priority to P2 because of crash

Keywords: crash

Priority: P4 → P2

Ben Bucksch (:BenB)

Comment 23

•

25 years ago

woa :-( Do you happen to know, where it crashes? (Many this helps preventing similar crashs.) If you don't, that's OK, too.

nhottanscp

Assignee

Comment 24

•

25 years ago

Fix check in (both the crasher and "<" problem).

Status: ASSIGNED → RESOLVED

Closed: 25 years ago

Resolution: --- → FIXED

msanz

Comment 25

•

25 years ago

Another one for Xianglan. Thanks!

QA Contact: momoi → ji

ji

Comment 26

•

25 years ago

Verified with win32 2000092005 and linux 2000092008 build. It's fixed.

Status: RESOLVED → VERIFIED

Ben Bucksch (:BenB)

Comment 27

•

24 years ago

jgmyers wants to remove the saveashtml functinality altogether: bug 68982.

Myk Melez [:myk] [@mykmelez]

Updated

•

21 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

17 years ago

Product: Core → MailNews Core

patch file 25 years ago nhottanscp 6.37 KB, patch		Details \| Diff \| Splinter Review
a new patch with Ben's suggestion 25 years ago nhottanscp 5.89 KB, text/plain		Details
patch for the crasher mentioned by ftang on 9/11 25 years ago nhottanscp 789 bytes, patch		Details \| Diff \| Splinter Review

In saving JPN msgs into HTML format, characters which contain 0x3C are misconverted into &lt;

In saving JPN msgs into HTML format, characters which contain 0x3C are misconverted into <