"Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers

RESOLVED FIXED

Status

MailNews Core
Internationalization
RESOLVED FIXED
14 years ago
3 years ago

People

(Reporter: World, Assigned: smontagu)

Tracking

({dataloss})

Trunk
x86
All
dataloss
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(7 attachments)

(Reporter)

Description

14 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113

"Save as text" of non-asci mail is corrupted when mail body exceeds some size
limit .
This problem occurs on both Mozilla Suite and Thunderbird.
 - Mozilla suite latest-trunk 2004111305 on Win-2K
 - Thunderbird latest-trunk 2004/11/13 build on Win-2K

Reproducible: Always
Steps to Reproduce:





I tested next 6 cases (ISO-2022-JP text/plain; mails)
 (1) Case-01: Mail Size = 1K, Subject: header = Single line 
              ==> No problem               
 (2) Case-02: Mail Size = 1K, Subject: header = Multiple line
              ==> No problem               
 (3) Case-03: Mail Size = 8K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (4) Case-04: Mail Size = 8K, Subject: header = Multiple line
              ==> Null files(file size=0) was saved
 (5) Case-05: Mail Size =11K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (6) Case-06: Mail Size =11K, Subject: header = Multiple line
              ==> Only mid portion of mail is saved with no Linebreak

Result of Case-04 is similar to Bug 112069 and result of Case-06 is similar to
Bug 257828, althogh these two bugs does not refer to "large mail size" nor
"multiple line subject header".
(Reporter)

Comment 1

14 years ago
Created attachment 165909 [details]
Drafts folder file (contains 6 mails)

Save this file in your mail directry(say "Test") and restart Mozilla or
Thunderbird, then "Test" folder will appear.
This folder contains 6 mails of above 6 test cases(date is set as case number).
(Reporter)

Comment 2

14 years ago
Created attachment 165910 [details]
Zip file, contains "Saved as text" files of 6 test cases.

Comment 3

14 years ago
Can this have anything to do with a recent (?) stream change? 
Keywords: dataloss
Product: MailNews → Core

Comment 4

14 years ago
Saving as text under an English Win2K system, I primarily see the symptom from 
bug 230042.  However, the corruption looks very similar to that in bug 261923 
(which is definitely not a recent regression).  That also hinges on a non-ASCII 
character, in the subject in that case.
(Reporter)

Comment 5

14 years ago
(In reply to comment #4)
> That also hinges on a non-ASCII character

Mainly on non-ascii, I think.
This bug is problem in convertion from charset of mail to charset of system,
iso-2022-jp to Shift_JIS(CP 932) in my case.
I suspect buffer corruption/overlay or control block corruption when data is
longer than buffer size.
So I think this is also applicable to long iso-8859-1 mail on English MS
Windows, as you reported to Bug 257828.
(Reporter)

Updated

14 years ago
Blocks: 261923
(Reporter)

Comment 6

14 years ago
Workaround:
 (1) Start text editor(notepad, simpletext, e editor etc.)
 (2) CTRL+A, then CTRL+C on Mozilla/Thunderbird (Command+A,Command+C on Mac)
 (3) CTRL+V on editor (Command+V on Mac)
 (4) CTRL+S on editor (Command+S on Mac), then specifiy file name, then press OK
(Sorry but I don't know how to on Emacs)
  
(Reporter)

Updated

13 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Reporter)

Comment 7

13 years ago
Cc: ing to to Jshin, because assignee of Bug 263797, and putting Bug 138578 in
"depends on:".
Jshin, remove if wrong, please.
Depends on: 138578

Comment 8

13 years ago
I wrote a comment but forgot to commit. My fix for bug 138578 fixes case 3, but
cases 4 to 6 appear to have a different cause. Stack is as shown below and I
wonder how   p == end (end = start  + N where N is 4096) can be satisfied for
every UTF-8 string passed to |CalculateUTF8Length::write| (there's no guarantee
that the fixed-size buffer boundary coincides with a 'UTF-8 character'
boundary). 4096 seems to be a constant defined somewhere in the stream code. The
beginning of a string passed to |write| is valid as UTF-8, but I wasn't able to
inspect some variables I want to inspect with VC++ (it wouldn't let me). I'll
try later with gdb on Linux.

nsDebug::Assertion(const char * 0x007e8908 `string', const char * 0x007e3b4c
`string', const c
har * 0x007eedd0 `string', int 0x00000108) line 109
CalculateUTF8Length::write(CalculateUTF8Length * const 0x0102bfb8, const char *
0x04473760, un
signed int 0x00001000) line 264 + 25 bytes
   N = 4096 : how come? 

  if ( p != end )
          {
            NS_ERROR("Not a UTF-8 string. This code should only be used for
converting from kn
own UTF-8 strings.");
            mErrorEncountered = PR_TRUE;
            return N;
          }
        return p - start;


copy_string(nsReadingIterator<char> & {...}, const nsReadingIterator<char> &
{...}, CalculateU
TF8Length & {...}) line 95 + 13 bytes
AppendUTF8toUTF16(const nsACString & {...}, nsAString & {...}) line 234
nsSaveMsgListener::OnDataAvailable(nsSaveMsgListener * const 0x00001077,
nsIRequest * 0x043ea7
e0, nsISupports * 0x044467bc, nsIInputStream * 0x00001000, unsigned int
0x00000000, unsigned i
nt 0x00001000) line 2032 + 44 bytes
nsMimeBaseEmitter::Complete(nsMimeBaseEmitter * const 0x00004077) line 949
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x01d6fc69,
nsIRequest * 0x0428de98
, nsISupports * 0x04414abc, unsigned int 0x037ed828) line 1028
nsMsgProtocol::OnStopRequest(nsMsgProtocol * const 0x04444e88, nsIRequest *
0x00000000, nsISup
ports * 0x04473760, unsigned int 0x002b7b20) line 388 + 25 bytes
(Reporter)

Updated

13 years ago
Blocks: 230042
(Reporter)

Updated

13 years ago
Blocks: 257828
(Reporter)

Updated

13 years ago
No longer blocks: 230042

Comment 9

12 years ago
My observations to this bug:
I could reproduce the problem with the attached testcase in seamonkey 1.0.
On the other hand, Linux Thunderbird 1.0 gave different results:
-cases 1 and 2 were fine
-cases 3 to 5 produced an empty file
-case 6 produced a file containing just the string '??????????????????????????????????????????????? End of Mail Body '

This is surprising because my other experiments showed that TB on linux is much better - I couldn't reproduce the bug with messages that produced the bug on the same machine in Windows Mozilla 1.7.
Summary: "Save as text" of non-asci mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit

Comment 10

12 years ago
I add two attachments with a sample message. The first one 
produces an empty file when saved as txt. The other one 
works fine. The only difference between them are 2 bytes ('xx') in the 
sender's address. The message text is intentionally 
obfuscated (simple substitution 1 char for 1 char). But it
was necessary to leave it in, stripping the text removed the
bug. I can not reprocude the bug in Linux TB on these files, but it shows in Windows. I hope somebody can now trace where do the codepaths diverge on these 2 messages.

I propose to change the OS field, and I also am not sure about the component.

Comment 11

12 years ago
Created attachment 212605 [details]
When this message is saved as txt, empty file is produced.

Comment 12

12 years ago
Created attachment 212606 [details]
This message is saved correctly.

Comment 13

12 years ago
I am not sure this bug depends on the size of the message. Maybe partly, but I have large (30KB) messages that export correctly. Only about 10% of messages from one of my friends loose parts of the text when saved. It may have something to do with the buffered read, e.g. attached messages are encoded as quoted-printable.
There, accented characters are encoded as 3 bytes. What happens if such triplet happens to be split into 2 adjacent buffers? The same is in Japanese messages, where chars are 2 bytes wide.
(Reporter)

Comment 14

12 years ago
(a part of comment #10)
> The first one produces an empty file when saved as txt. The other one 
> works fine. The only difference between them are 2 bytes ('xx') in the 
> sender's address.
(a part of comment #13)
> I am not sure this bug depends on the size of the message. Maybe partly,
> but I have large (30KB) messages that export correctly. Only about 10% of
> messages from one of my friends loose parts of the text when saved.

Your test result says that size itself is not cause of problem. 
Size has relation to only "whether multiple buffer are used or not", I think. 

(In reply to comment #13)
> There, accented characters are encoded as 3 bytes. What happens if such
> triplet happens to be split into 2 adjacent buffers?
> The same is in Japanese messages, where chars are 2 bytes wide.

Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
  - If shourt mail, whole data is saved in a buffer, then no split occurs.
  - If long mail, multiple buffers are used, and probablity of split is very
    high because it's ISO-2022-JP mail.
OS: Windows 2000 → All

Comment 15

12 years ago
It seems like our stream code has some 'bad interactions' with our encoding converter. biesi, can you take a look? 
I have no idea what mailnews does here or what code is involved...

Comment 17

12 years ago
(In reply to comment #14)
> Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
>   - If shourt mail, whole data is saved in a buffer, then no split occurs.
>   - If long mail, multiple buffers are used, and probablity of split is very
>     high because it's ISO-2022-JP mail.
> 
Yes, but we can't be sure. Comment 8 said, the buffer may be 4KB. But where does is start to read? At the beginning of the message data? At the beginning of the message BODY data, i.e. after the headers? Without this info, I can't prove this theory.

(Reporter)

Comment 18

12 years ago
(In reply to comment #15)
> It seems like our stream code has some 'bad interactions' with our encoding
> converter.

At least next cases seems to be cared :
 (1) Split at mid of 3 byte encoded data (Unicode, EUC_JP etc.)
      --- Buffer ----><--- Next Buffer ----
      .......<Three-Byte-Encode-Data>.......
 (2) Split at mid of 3 byte Escape sequence/Unescape sequence (ISO-2022-JP)
   (2-A)    
      --- Buffer ----><--- Next Buffer --------------------
      .....<Escape-Sequence>.......<Unescape-Sequecce>.....
   (2-B)    
      ----------------------- Buffer ---><--- Next Buffer ---
      .....<Escape-Sequecce>......<Unescape-Sequence>........
 (3) Split of "Escape sequence" and "Unescape sequence" (ISO-2022-JP)
      -------------- Buffer ---><--- Next Buffer -------------
      ...<Escape-Sequence>.............<Unescape-Sequecnce>....

Above sounds to be the reason why many variations of corruption pattern.
(Reporter)

Updated

11 years ago
Blocks: 230042
(Reporter)

Comment 19

10 years ago
Created attachment 312548 [details]
ZIP of Mail folder files/saved text files (test mails for test of split of 3bytes escape sequence of ISO-2022-JP)

All test mail is ISO-2022-JP mails, and length of all mail headers of all test mails are same. Difference is only next three.
 (1) number of mail body lines
 (2) length of <Pad_1>
 (3) length of <Pad_2> of line-0030 (if line number<30, always no <Pad_2>).

Note: Esc == 0x1B, Esc$B == escape to DBCS, Esc(B == escape to SBCS

>Case-1 : ISO-2022-JP-7-26
> Split occurs in line-0027. This case is to see at where split starts.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0026) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-2 : ISO-2022-JP-7-30
> Split occurs in line-0027. Mail is splitted to 2 buffers only when 30 lines.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Split occurs
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-3 : ISO-2022-JP-7-70
> First split occurs in line-0027.
> Second split occurs in line-XXXX(between line-0031 to line-0070).
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 24
> <Pad_2> : ascii chars, Length of <Pad_2> : varies from 0 to 24
> Mail subject == ISO-2022-JP-70-NNNN-PPPP
>               ( NNNN == length of <Pad_1>, PPPP == length of <Pad_2> ) 
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== First split
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7 + <Pad_2>
>  (line-0031) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-XXXX) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Second split
>           |
>  (line-0070) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

Corruption pattern depends on both of (A) where first split occurs & (B) where second split occurs. This is the reason why many variations of problem exists when ISO-2022-JP.
See Bug 230042 Comment #30 for split of 3bytes UTF-8 code case.
(Reporter)

Comment 20

10 years ago
Created attachment 312550 [details]
Dir list of saved text files (file name & file size)
(Reporter)

Comment 21

10 years ago
Corrections about test cases. Sorry for spam.

>Case-1 : ISO-2022-JP-7-26
> Length of <Pad_1> : varies from 0 to 256
Subject varies from 0000 to 0256, but valid cases are 0000 to 0104 only. 0105 to 0256 are meaningless cases.
(I forgot to increase pad character data length from 104, which was sufficient when pad_len==99)

>Case-2 : ISO-2022-JP-7-30
> Length of <Pad_1> : varies from 0 to 256
It should be "varies from 0 to 99".
(Reporter)

Updated

10 years ago
Summary: "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers
(Reporter)

Comment 22

10 years ago
FYI.
Add-On of ImportExportTools can be a workaround of this bug for some users who can accept "Export" as alternative for "Save As/TEXT".
> http://importexporttools.en.softonic.com/
> export all messages in single files (html or plain text format), with an index.
Product: Core → MailNews Core

Updated

9 years ago
QA Contact: i18n
Simon, is this something you might be able to look at soonish?
(Reporter)

Updated

9 years ago
No longer blocks: 257828, 261923
(Assignee)

Comment 24

9 years ago
So as far as I can tell from comment 8 and an initial code inspection pass without having looked at any of the testcases in a debugger, there is at least one problem contributing to this bug (there may well be others):

nsSaveMsgListener::OnDataAvailable buffers the message data into a text buffer using NS_ConvertUTF8toUTF16.
http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#1787

In nsSaveMsgListener::OnStopRequest this whole buffer is converted to plain text and then encoded in the system charset (via nsMsgI18NSaveAsCharset).

The problem is in the first step, because NS_ConvertUTF8toUTF16 doesn't allow for UTF-8 sequences crossing buffer boundaries (see comments at http://mxr.mozilla.org/comm-central/source/mozilla/xpcom/string/public/nsUTF8Utils.h#395 and 479).

I think the right fix is to make the buffer an nsCString rather than an nsString, and hold off the conversion to UTF16 until we pass it to ConvertBufToPlainText
(Assignee)

Updated

9 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 25

9 years ago
Created attachment 391337 [details] [diff] [review]
Patch

With this patch all the mails in attachment 165909 [details] are saved correctly. I'm not sure what magic I need to do with the other attachments to get them into a form where I can use them as testcases, and I have no clue how to make an automated testcase.
Attachment #391337 - Flags: superreview?(bugzilla)
Attachment #391337 - Flags: review?
(Assignee)

Updated

9 years ago
Attachment #391337 - Flags: review? → review?(bienvenu)

Comment 26

9 years ago
Comment on attachment 391337 [details] [diff] [review]
Patch

I tried this, and it does fix the issue (WADA's six message drafts folder was the easiest way I found to reproduce the issue, FWIW).

I looked into an automated test case for this. Currently, it would have to be a mozmill test because the code path isn't exercisable from an xpcshell test at this point. It's probably worthwhile adding a method to nsIMessenger for saving a message as text so that extensions could use it, and I might have a go at that.
Attachment #391337 - Flags: review?(bienvenu) → review+
Setting in-testsuite? to remind us to see if we can write a test for this at some stage.
Flags: in-testsuite?
Comment on attachment 391337 [details] [diff] [review]
Patch

Sorry for the delay in reviewing this, looks good. sr=Standard8
Attachment #391337 - Flags: superreview?(bugzilla) → superreview+
(Assignee)

Comment 29

9 years ago
http://build.mozillamessaging.com/mercurial/comm-central/rev/e10df4ee2d17
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Thanks, Simon!  Much appreciated.
Flags: in-testsuite?
You need to log in before you can comment on or make changes to this bug.