Closed Bug 50413 Opened 25 years ago Closed 25 years ago

In saving JPN msgs into HTML format, characters which contain 0x3C are misconverted into <

Categories

(MailNews Core :: Internationalization, defect, P2)

x86
Windows NT
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: momoi, Assigned: nhottanscp)

Details

(Keywords: crash, Whiteboard: [nsbeta3+]fix in hand, reviewed)

Attachments

(3 files)

** Observed with 8/25/2000 Win32 build ** Take the Intl Smoketest mail folder and select the 2nd message from the top. Now save this message using HTML format -- File | Save as | File menu item. Give it a name and supply teh extension. .html and then save it. Now open this file and view it with Browser under Japanese (ISO-2022-JP) encoding. The subject and the body contain exactly the same data but the body text is partially corrupted displays as non-Japanese. Switch to ISO-8859-1 encoding and view hte page source and compare the Subject and Body lines that are in ISO-2022-JP. $B$3$l$OF|K\8l$N%a!<%k$G$9!#(B $B$3$l$OF|K\8l$N%a!&lt;%k$G$9!#(B --> Note the misconversion of "<" into "&lt;". Initially sending this to Naoki.
This is a familiar porblem. It will probably affect other languages which uses multi-byte encodings (e.g. Chinese) in addition to Japanese.
HTML escape problem, it should be escaped when the data is in unicode before the charset conversion.
Status: NEW → ASSIGNED
nsbeta3, since this is data loss.
Keywords: nsbeta3
Marking as nsbeta3+ (P3) per I18N Bug Triage.
Whiteboard: [nsbeta3+]
mark as P4
Priority: P3 → P4
&lt; is generated by ScanTXT() which is used for plain text display and save as file. So I think the bug only happens when the original message is plain text and it's saved as HTML.
But the ScanTXT function pass parameter by using PRUnichar* and nsString. I think the problem is the caller convert char* into PRUnichar* . One of the problem is here. The problem is this routine convert char* into nsString by using ASCII to PRUnichar* rule. http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#340 255 256 static int 257 MimeInlineTextPlainFlowed_parse_line (char *line, PRInt32 length, MimeObject *obj) 258 { 259 int status; 260 PRBool quoting = ( obj->options 261 && ( obj->options->format_out == nsMimeOutput::nsMimeMessageQuoting || 262 obj->options->format_out == nsMimeOutput::nsMimeMessageBodyQuoting 263 ) ); // see above 264 PRBool plainHTML = quoting || (obj->options && 265 obj->options->format_out == nsMimeOutput::nsMimeMessageSaveAs); 266 // see above 267 268 struct MimeInlineTextPlainFlowedExData *exdata; 269 exdata = MimeInlineTextPlainFlowedExDataList; 270 while(exdata && (exdata->ownerobj != obj)) { 271 exdata = exdata->next; 272 } 273 274 NS_ASSERTION(exdata, "The extra data has disappeared!"); 275 276 NS_ASSERTION(length > 0, "zero length"); 277 if (length <= 0) return 0; 278 279 uint32 linequotelevel = 0; 280 char *linep = line; 281 // Space stuffed? 282 if(' ' == *linep) { 283 linep++; 284 } else { 285 // count '>':s before the first non-'>' 286 while('>' == *linep) { 287 linep++; 288 linequotelevel++; 289 } 290 // Space stuffed? 291 if(' ' == *linep) { 292 linep++; 293 } 294 } 295 296 // Look if the last character (after stripping ending end 297 // of lines and quoting stuff) is a SPACE. If it is, we are looking at a 298 // flowed line. Normally we assume that the last two chars 299 // are CR and LF as said in RFC822, but that doesn't seem to 300 // be the case always. 301 PRBool flowed = PR_FALSE; 302 PRInt32 index = length-1; 303 while(index >= 0 && ('\r' == line[index] || '\n' == line[index])) { 304 index--; 305 } 306 if (index > linep - line && ' ' == line[index]) 307 /* Ignore space stuffing, i.e. lines with just 308 (quote marks and) a space count as empty */ 309 flowed = PR_TRUE; 310 311 mozITXTToHTMLConv *conv = GetTextConverter(obj->options); 312 313 PRBool skipConversion = !conv || 314 (obj->options && obj->options->force_user_charset); 315 316 nsString lineSource; 317 nsString lineResult; 318 lineSource.SetCapacity(kInitialBufferSize); 319 lineResult.SetCapacity(kInitialBufferSize); 320 321 if (!skipConversion) 322 { 323 lineSource.AssignWithConversion(linep, (length - (linep - line)) ); 324 PRUnichar* wresult = nsnull; 325 nsresult rv = NS_OK; 326 PRBool whattodo = obj->options->whattodo; 327 if (plainHTML) 328 { 329 if (quoting) 330 whattodo = 0; 331 else 332 whattodo = whattodo & ~mozITXTToHTMLConv::kGlyphSubstitution; 333 /* Do recognition for the case, the result is viewed in 334 Mozilla, but not GlyphSubstitution, because other UAs 335 might not be able to display the glyphs. */ 336 } 337 338 /* This is the main TXT to HTML conversion: 339 escaping (very important), eventually recognizing etc. */ 340 rv = conv->ScanTXT(lineSource.GetUnicode(), whattodo, &wresult); 341 if (NS_FAILED(rv)) return -1; 342 343 lineResult = wresult; 344 Recycle(wresult); 345 } 346 else 347 { 348 lineResult.AssignWithConversion(line, length); 349 status = NS_OK; 350 } 351 352 nsCAutoString preface; 353 354 /* Correct number of blockquotes */ 355 int32 quoteleveldiff=linequotelevel - exdata->quotelevel; 356 if((quoteleveldiff != 0) && flowed && exdata->inflow) { 357 // From RFC 2646 4.5 358 // The receiver SHOULD handle this error by using the 'quote-depth-wins' rule, 359 // which is to ignore the flowed indicator and treat the line as fixed. That 360 // is, the change in quote depth ends the paragraph. 361 362 // We get that behaviour by just going on. 363 } 364 while(quoteleveldiff>0) { 365 quoteleveldiff--; 366 preface += "<blockquote type=cite"; 367 // This is to have us observe the user pref settings for citations 368 MimeInlineTextPlainFlowed *tObj = (MimeInlineTextPlainFlowed *) obj; 369 char *style = MimeTextBuildPrefixCSS(tObj->mQuotedSizeSetting, 370 tObj->mQuotedStyleSetting, 371 tObj->mCitationColor); 372 if (!plainHTML && style && strlen(style)) 373 { 374 preface += " style=\""; 375 preface += style; 376 preface += '"'; 377 PR_FREEIF(style); 378 } 379 preface += '>'; 380 } 381 while(quoteleveldiff<0) { 382 quoteleveldiff++; 383 preface += "</blockquote>"; 384 } 385 exdata->quotelevel = linequotelevel; 386 387 nsString lineResult2; 388 lineResult2.SetCapacity(kInitialBufferSize); 389 390 if(flowed) { 391 // Check RFC 2646 "4.3. Usenet Signature Convention": "-- "+CRLF is 392 // not a flowed line 393 if 394 ( // is "-- "LINEBREAK 395 lineSource.Length() >= 4 396 && lineSource[0] == '-' 397 && 398 ( 399 lineSource.EqualsWithConversion("-- \r", PR_FALSE, 4) || 400 lineSource.EqualsWithConversion("-- \n", PR_FALSE, 4) 401 ) 402 ) 403 { 404 if (linequotelevel > 0 || exdata->isSig) 405 { 406 preface += "--&nbsp;<br>"; 407 } else { 408 exdata->isSig = PR_TRUE; 409 preface += 410 "<div class=\"txt-sig\"><span class=\"txt-tag\">--&nbsp;<br></span>"; 411 } 412 } else { 413 Line_convert_whitespace(lineResult, PR_FALSE /* Allow wraps */, 414 lineResult2); 415 } 416 417 exdata->inflow=PR_TRUE; 418 } else { 419 // Fixed paragraph. 420 Line_convert_whitespace(lineResult, 421 !plainHTML && !obj->options->wrap_long_lines_p 422 /* If wrap, convert all spaces but the last in 423 a row into nbsp, otherwise all. */, 424 lineResult2); 425 lineResult2 += NS_LITERAL_STRING("<br>"); 426 exdata->inflow = PR_FALSE; 427 } // End Fixed line 428 429 if (!(exdata->isSig && quoting)) 430 { 431 char* tmp = preface.ToNewCString(); 432 status = MimeObject_write(obj, tmp, preface.Length(), PR_TRUE); 433 Recycle(tmp); 434 tmp = lineResult2.ToNewCString(); 435 status = MimeObject_write(obj, tmp, lineResult2.Length(), PR_TRUE); 436 Recycle(tmp); 437 return status; 438 } 439 else 440 return NS_OK; 441 } Looks like the same kind of problem in http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpla.cpp#398 and http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpla.cpp#428 I think the function ScanTXT itself have no problem. The problem is we apply some expansion code which should apply to the PRUnichar* data after it have already convert to char*.
add rhp (and if someone can, please add BenB's correct email address ) to the cc list since these code (Those code which convert char* to nsString and then pass to scanTXT() ) are check in by them.
Adding Ben.
> I think the problem is the caller convert char* into PRUnichar* . > One of the problem is here. The problem is this routine convert char* into > nsString by using ASCII to PRUnichar* rule. You mean |lineSource.AssignWithConversion(linep, ...);|? What is wrong with that? What should I use instead? Is this a new problem? IIRC, this conversion has always (i.e. since end of last year) been done this way. Or did I break it with my last checkin? (I just noticed that we should propably do the quote handling ("...286 while('>' == *linep)...") with nsString's instead of char*s, too. But that is really an old problem. Adding Daniel Bratell.)
I think the usage of nsString is not correct here but that's not the reason which causes this bug. This part is shared by message display, quoting, save as, maybe more. So, I wouldn't correct that at this time. The problem here is that the string is usually converted to UTF-8 but for saveas case it is in mail charset (e.g. ISO-2022-JP). Since ScanTXT does not expect ISO-2022-JP as an input, it does not escape correctly. So I was thinking to do a special case for saveas. It is ugly but minimize possible side effects. The change would be for save as only convert from mail charset to unicode before ScanTXT then convert back to mail charset.
> So I was thinking to do a special case for saveas. It is ugly but minimize > possible side effects. It's not that these classes were pretty right now :). > convert from mail > charset to unicode before ScanTXT then convert back to mail charset. I like this solution - this is what I thought it would work.
Attached patch patch fileSplinter Review
Whiteboard: [nsbeta3+] → [nsbeta3+]fix in hand, need review
nhotta said he have a patch in hand, need code review and test. He today crash before he can hit that code, with or without his patch. nhotta- please attach your patch here. Thanks
I did not review, but looked at the file. Suggestions/Corrections: In mimetpla.cpp: - |lineSourceStr| is just the nsString versionof |line|. Use that instead of |newcstr|. - In the charset-magic else block, ScanTXT is ran over the whole source line. This is wrong, you need to skip the first |logicalLineStart| chars. This is propably just the tip of the iceberg - I (and Daniel, too, as it seems) wrote this stuff in the assumption that I operate with Unicode only. We should leave this open or file a new bug about that. Also, independant from that, I need some education about I18N in general and in Mozilla, especially with |ns*String|. I'll look at the website, when I have time, but pointers to references, tutorials etc. appreciated.
> - |lineSourceStr| is just the nsString versionof |line|. Use that instead of > |newcstr|. The problem is that is using nsString but the actual string is UTF-8 because of the incorrect use of nsString. > - In the charset-magic else block, ScanTXT is ran over the whole source line. > This is wrong, you need to skip the first |logicalLineStart| chars. I ignore that because I thought that's only needed for quoting and the else block is only for save as. But do I also need that for saving quoted text?
Hi Naoki, Looks reasonable to me. r: rhp - rhp
> > - |lineSourceStr| is just the nsString versionof |line|. Use that instead of > > |newcstr|. > The problem is that is using nsString but the actual string is UTF-8 because > of the incorrect use of nsString. Sorry, i was not clear. |line| is a cstr, and |lineSourceStr| holds the nsString equivalence. No point in converting from nsString back to cstr, use the original (|line|). > I ignore that because I thought that's only needed for quoting and the else > block is only for save as. But do I also need that for saving quoted text? This is for quotes in the recived msg. E.g. the previous paragraph would count as quote, and |logicalLineStart| would be 2. The ">"s are added at another place (to |prefaceResultStr|).
You reused |line| in mimemtpfl.cpp, too. However, in *this* class, |lineSource| != |line|. In this class, |lineSource| does *not* contain the ">"s anymore, while |line| does. (That's propably my fault - misleading variable names.) Take the patch for mimetpfl.cpp from teh first attachment, and the patch for mimetpla.cpp from the second attachment, and you should be set.
Whiteboard: [nsbeta3+]fix in hand, need review → [nsbeta3+]fix in hand, reviewed
also crashing problem. change the priority to P2 because of crash
Keywords: crash
Priority: P4 → P2
woa :-( Do you happen to know, where it crashes? (Many this helps preventing similar crashs.) If you don't, that's OK, too.
Fix check in (both the crasher and "&lt;" problem).
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Another one for Xianglan. Thanks!
QA Contact: momoi → ji
Verified with win32 2000092005 and linux 2000092008 build. It's fixed.
Status: RESOLVED → VERIFIED
jgmyers wants to remove the saveashtml functinality altogether: bug 68982.
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: