Closed Bug 269812 Opened 20 years ago Closed 15 years ago

"Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers

Categories

(MailNews Core :: Internationalization, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: World, Assigned: smontagu)

References

Details

(Keywords: dataloss)

Attachments

(7 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113

"Save as text" of non-asci mail is corrupted when mail body exceeds some size
limit .
This problem occurs on both Mozilla Suite and Thunderbird.
 - Mozilla suite latest-trunk 2004111305 on Win-2K
 - Thunderbird latest-trunk 2004/11/13 build on Win-2K

Reproducible: Always
Steps to Reproduce:





I tested next 6 cases (ISO-2022-JP text/plain; mails)
 (1) Case-01: Mail Size = 1K, Subject: header = Single line 
              ==> No problem               
 (2) Case-02: Mail Size = 1K, Subject: header = Multiple line
              ==> No problem               
 (3) Case-03: Mail Size = 8K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (4) Case-04: Mail Size = 8K, Subject: header = Multiple line
              ==> Null files(file size=0) was saved
 (5) Case-05: Mail Size =11K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (6) Case-06: Mail Size =11K, Subject: header = Multiple line
              ==> Only mid portion of mail is saved with no Linebreak

Result of Case-04 is similar to Bug 112069 and result of Case-06 is similar to
Bug 257828, althogh these two bugs does not refer to "large mail size" nor
"multiple line subject header".
Save this file in your mail directry(say "Test") and restart Mozilla or
Thunderbird, then "Test" folder will appear.
This folder contains 6 mails of above 6 test cases(date is set as case number).
Can this have anything to do with a recent (?) stream change? 
Keywords: dataloss
Product: MailNews → Core
Saving as text under an English Win2K system, I primarily see the symptom from 
bug 230042.  However, the corruption looks very similar to that in bug 261923 
(which is definitely not a recent regression).  That also hinges on a non-ASCII 
character, in the subject in that case.
(In reply to comment #4)
> That also hinges on a non-ASCII character

Mainly on non-ascii, I think.
This bug is problem in convertion from charset of mail to charset of system,
iso-2022-jp to Shift_JIS(CP 932) in my case.
I suspect buffer corruption/overlay or control block corruption when data is
longer than buffer size.
So I think this is also applicable to long iso-8859-1 mail on English MS
Windows, as you reported to Bug 257828.
Blocks: 261923
Workaround:
 (1) Start text editor(notepad, simpletext, e editor etc.)
 (2) CTRL+A, then CTRL+C on Mozilla/Thunderbird (Command+A,Command+C on Mac)
 (3) CTRL+V on editor (Command+V on Mac)
 (4) CTRL+S on editor (Command+S on Mac), then specifiy file name, then press OK
(Sorry but I don't know how to on Emacs)
  
Status: UNCONFIRMED → NEW
Ever confirmed: true
Cc: ing to to Jshin, because assignee of Bug 263797, and putting Bug 138578 in
"depends on:".
Jshin, remove if wrong, please.
Depends on: 138578
I wrote a comment but forgot to commit. My fix for bug 138578 fixes case 3, but
cases 4 to 6 appear to have a different cause. Stack is as shown below and I
wonder how   p == end (end = start  + N where N is 4096) can be satisfied for
every UTF-8 string passed to |CalculateUTF8Length::write| (there's no guarantee
that the fixed-size buffer boundary coincides with a 'UTF-8 character'
boundary). 4096 seems to be a constant defined somewhere in the stream code. The
beginning of a string passed to |write| is valid as UTF-8, but I wasn't able to
inspect some variables I want to inspect with VC++ (it wouldn't let me). I'll
try later with gdb on Linux.

nsDebug::Assertion(const char * 0x007e8908 `string', const char * 0x007e3b4c
`string', const c
har * 0x007eedd0 `string', int 0x00000108) line 109
CalculateUTF8Length::write(CalculateUTF8Length * const 0x0102bfb8, const char *
0x04473760, un
signed int 0x00001000) line 264 + 25 bytes
   N = 4096 : how come? 

  if ( p != end )
          {
            NS_ERROR("Not a UTF-8 string. This code should only be used for
converting from kn
own UTF-8 strings.");
            mErrorEncountered = PR_TRUE;
            return N;
          }
        return p - start;


copy_string(nsReadingIterator<char> & {...}, const nsReadingIterator<char> &
{...}, CalculateU
TF8Length & {...}) line 95 + 13 bytes
AppendUTF8toUTF16(const nsACString & {...}, nsAString & {...}) line 234
nsSaveMsgListener::OnDataAvailable(nsSaveMsgListener * const 0x00001077,
nsIRequest * 0x043ea7
e0, nsISupports * 0x044467bc, nsIInputStream * 0x00001000, unsigned int
0x00000000, unsigned i
nt 0x00001000) line 2032 + 44 bytes
nsMimeBaseEmitter::Complete(nsMimeBaseEmitter * const 0x00004077) line 949
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x01d6fc69,
nsIRequest * 0x0428de98
, nsISupports * 0x04414abc, unsigned int 0x037ed828) line 1028
nsMsgProtocol::OnStopRequest(nsMsgProtocol * const 0x04444e88, nsIRequest *
0x00000000, nsISup
ports * 0x04473760, unsigned int 0x002b7b20) line 388 + 25 bytes
Blocks: 230042
Blocks: 257828
No longer blocks: 230042
My observations to this bug:
I could reproduce the problem with the attached testcase in seamonkey 1.0.
On the other hand, Linux Thunderbird 1.0 gave different results:
-cases 1 and 2 were fine
-cases 3 to 5 produced an empty file
-case 6 produced a file containing just the string '??????????????????????????????????????????????? End of Mail Body '

This is surprising because my other experiments showed that TB on linux is much better - I couldn't reproduce the bug with messages that produced the bug on the same machine in Windows Mozilla 1.7.
Summary: "Save as text" of non-asci mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit
I add two attachments with a sample message. The first one 
produces an empty file when saved as txt. The other one 
works fine. The only difference between them are 2 bytes ('xx') in the 
sender's address. The message text is intentionally 
obfuscated (simple substitution 1 char for 1 char). But it
was necessary to leave it in, stripping the text removed the
bug. I can not reprocude the bug in Linux TB on these files, but it shows in Windows. I hope somebody can now trace where do the codepaths diverge on these 2 messages.

I propose to change the OS field, and I also am not sure about the component.
I am not sure this bug depends on the size of the message. Maybe partly, but I have large (30KB) messages that export correctly. Only about 10% of messages from one of my friends loose parts of the text when saved. It may have something to do with the buffered read, e.g. attached messages are encoded as quoted-printable.
There, accented characters are encoded as 3 bytes. What happens if such triplet happens to be split into 2 adjacent buffers? The same is in Japanese messages, where chars are 2 bytes wide.
(a part of comment #10)
> The first one produces an empty file when saved as txt. The other one 
> works fine. The only difference between them are 2 bytes ('xx') in the 
> sender's address.
(a part of comment #13)
> I am not sure this bug depends on the size of the message. Maybe partly,
> but I have large (30KB) messages that export correctly. Only about 10% of
> messages from one of my friends loose parts of the text when saved.

Your test result says that size itself is not cause of problem. 
Size has relation to only "whether multiple buffer are used or not", I think. 

(In reply to comment #13)
> There, accented characters are encoded as 3 bytes. What happens if such
> triplet happens to be split into 2 adjacent buffers?
> The same is in Japanese messages, where chars are 2 bytes wide.

Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
  - If shourt mail, whole data is saved in a buffer, then no split occurs.
  - If long mail, multiple buffers are used, and probablity of split is very
    high because it's ISO-2022-JP mail.
OS: Windows 2000 → All
It seems like our stream code has some 'bad interactions' with our encoding converter. biesi, can you take a look? 
I have no idea what mailnews does here or what code is involved...
(In reply to comment #14)
> Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
>   - If shourt mail, whole data is saved in a buffer, then no split occurs.
>   - If long mail, multiple buffers are used, and probablity of split is very
>     high because it's ISO-2022-JP mail.
> 
Yes, but we can't be sure. Comment 8 said, the buffer may be 4KB. But where does is start to read? At the beginning of the message data? At the beginning of the message BODY data, i.e. after the headers? Without this info, I can't prove this theory.

(In reply to comment #15)
> It seems like our stream code has some 'bad interactions' with our encoding
> converter.

At least next cases seems to be cared :
 (1) Split at mid of 3 byte encoded data (Unicode, EUC_JP etc.)
      --- Buffer ----><--- Next Buffer ----
      .......<Three-Byte-Encode-Data>.......
 (2) Split at mid of 3 byte Escape sequence/Unescape sequence (ISO-2022-JP)
   (2-A)    
      --- Buffer ----><--- Next Buffer --------------------
      .....<Escape-Sequence>.......<Unescape-Sequecce>.....
   (2-B)    
      ----------------------- Buffer ---><--- Next Buffer ---
      .....<Escape-Sequecce>......<Unescape-Sequence>........
 (3) Split of "Escape sequence" and "Unescape sequence" (ISO-2022-JP)
      -------------- Buffer ---><--- Next Buffer -------------
      ...<Escape-Sequence>.............<Unescape-Sequecnce>....

Above sounds to be the reason why many variations of corruption pattern.
Blocks: 230042
All test mail is ISO-2022-JP mails, and length of all mail headers of all test mails are same. Difference is only next three.
 (1) number of mail body lines
 (2) length of <Pad_1>
 (3) length of <Pad_2> of line-0030 (if line number<30, always no <Pad_2>).

Note: Esc == 0x1B, Esc$B == escape to DBCS, Esc(B == escape to SBCS

>Case-1 : ISO-2022-JP-7-26
> Split occurs in line-0027. This case is to see at where split starts.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0026) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-2 : ISO-2022-JP-7-30
> Split occurs in line-0027. Mail is splitted to 2 buffers only when 30 lines.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Split occurs
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-3 : ISO-2022-JP-7-70
> First split occurs in line-0027.
> Second split occurs in line-XXXX(between line-0031 to line-0070).
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 24
> <Pad_2> : ascii chars, Length of <Pad_2> : varies from 0 to 24
> Mail subject == ISO-2022-JP-70-NNNN-PPPP
>               ( NNNN == length of <Pad_1>, PPPP == length of <Pad_2> ) 
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== First split
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7 + <Pad_2>
>  (line-0031) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-XXXX) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Second split
>           |
>  (line-0070) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

Corruption pattern depends on both of (A) where first split occurs & (B) where second split occurs. This is the reason why many variations of problem exists when ISO-2022-JP.
See Bug 230042 Comment #30 for split of 3bytes UTF-8 code case.
Corrections about test cases. Sorry for spam.

>Case-1 : ISO-2022-JP-7-26
> Length of <Pad_1> : varies from 0 to 256
Subject varies from 0000 to 0256, but valid cases are 0000 to 0104 only. 0105 to 0256 are meaningless cases.
(I forgot to increase pad character data length from 104, which was sufficient when pad_len==99)

>Case-2 : ISO-2022-JP-7-30
> Length of <Pad_1> : varies from 0 to 256
It should be "varies from 0 to 99".
Summary: "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers
FYI.
Add-On of ImportExportTools can be a workaround of this bug for some users who can accept "Export" as alternative for "Save As/TEXT".
> http://importexporttools.en.softonic.com/
> export all messages in single files (html or plain text format), with an index.
Product: Core → MailNews Core
QA Contact: i18n
Simon, is this something you might be able to look at soonish?
No longer blocks: 257828, 261923
So as far as I can tell from comment 8 and an initial code inspection pass without having looked at any of the testcases in a debugger, there is at least one problem contributing to this bug (there may well be others):

nsSaveMsgListener::OnDataAvailable buffers the message data into a text buffer using NS_ConvertUTF8toUTF16.
http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#1787

In nsSaveMsgListener::OnStopRequest this whole buffer is converted to plain text and then encoded in the system charset (via nsMsgI18NSaveAsCharset).

The problem is in the first step, because NS_ConvertUTF8toUTF16 doesn't allow for UTF-8 sequences crossing buffer boundaries (see comments at http://mxr.mozilla.org/comm-central/source/mozilla/xpcom/string/public/nsUTF8Utils.h#395 and 479).

I think the right fix is to make the buffer an nsCString rather than an nsString, and hold off the conversion to UTF16 until we pass it to ConvertBufToPlainText
Status: NEW → ASSIGNED
Attached patch PatchSplinter Review
With this patch all the mails in attachment 165909 [details] are saved correctly. I'm not sure what magic I need to do with the other attachments to get them into a form where I can use them as testcases, and I have no clue how to make an automated testcase.
Attachment #391337 - Flags: superreview?(bugzilla)
Attachment #391337 - Flags: review?
Attachment #391337 - Flags: review? → review?(bienvenu)
Comment on attachment 391337 [details] [diff] [review]
Patch

I tried this, and it does fix the issue (WADA's six message drafts folder was the easiest way I found to reproduce the issue, FWIW).

I looked into an automated test case for this. Currently, it would have to be a mozmill test because the code path isn't exercisable from an xpcshell test at this point. It's probably worthwhile adding a method to nsIMessenger for saving a message as text so that extensions could use it, and I might have a go at that.
Attachment #391337 - Flags: review?(bienvenu) → review+
Setting in-testsuite? to remind us to see if we can write a test for this at some stage.
Flags: in-testsuite?
Comment on attachment 391337 [details] [diff] [review]
Patch

Sorry for the delay in reviewing this, looks good. sr=Standard8
Attachment #391337 - Flags: superreview?(bugzilla) → superreview+
http://build.mozillamessaging.com/mercurial/comm-central/rev/e10df4ee2d17
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Thanks, Simon!  Much appreciated.
Flags: in-testsuite?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: