Closed Bug 269812 Opened 20 years ago Closed 15 years ago

"Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: World, Assigned: smontagu)

References

Details

(Keywords: dataloss)

Attachments

(7 files)

Drafts folder file (contains 6 mails) 20 years ago WADA:World Anti-bad-Duping Agency 43.07 KB, text/plain		Details
Zip file, contains "Saved as text" files of 6 test cases. 20 years ago WADA:World Anti-bad-Duping Agency 1.77 KB, application/octet-stream		Details
When this message is saved as txt, empty file is produced. 18 years ago :aceman 7.33 KB, text/plain		Details
This message is saved correctly. 18 years ago :aceman 7.34 KB, text/plain		Details
ZIP of Mail folder files/saved text files (test mails for test of split of 3bytes escape sequence of ISO-2022-JP) 16 years ago WADA:World Anti-bad-Duping Agency 99.29 KB, application/octet-stream		Details
Dir list of saved text files (file name & file size) 16 years ago WADA:World Anti-bad-Duping Agency 7.30 KB, text/plain		Details
Patch 15 years ago Simon Montagu :smontagu 2.37 KB, patch	Bienvenu : review+ standard8 : superreview+	Details \| Diff \| Splinter Review

WADA:World Anti-bad-Duping Agency

Reporter

Description

•

20 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041113

"Save as text" of non-asci mail is corrupted when mail body exceeds some size
limit .
This problem occurs on both Mozilla Suite and Thunderbird.
 - Mozilla suite latest-trunk 2004111305 on Win-2K
 - Thunderbird latest-trunk 2004/11/13 build on Win-2K

Reproducible: Always
Steps to Reproduce:





I tested next 6 cases (ISO-2022-JP text/plain; mails)
 (1) Case-01: Mail Size = 1K, Subject: header = Single line 
              ==> No problem               
 (2) Case-02: Mail Size = 1K, Subject: header = Multiple line
              ==> No problem               
 (3) Case-03: Mail Size = 8K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (4) Case-04: Mail Size = 8K, Subject: header = Multiple line
              ==> Null files(file size=0) was saved
 (5) Case-05: Mail Size =11K, Subject: header = Single line 
              ==> Partial mail was saved(First portion only, Linebreak is kept)
 (6) Case-06: Mail Size =11K, Subject: header = Multiple line
              ==> Only mid portion of mail is saved with no Linebreak

Result of Case-04 is similar to Bug 112069 and result of Case-06 is similar to
Bug 257828, althogh these two bugs does not refer to "large mail size" nor
"multiple line subject header".

WADA:World Anti-bad-Duping Agency

Reporter

Comment 1

•

20 years ago

Attached file Drafts folder file (contains 6 mails) — Details

Save this file in your mail directry(say "Test") and restart Mozilla or
Thunderbird, then "Test" folder will appear.
This folder contains 6 mails of above 6 test cases(date is set as case number).

WADA:World Anti-bad-Duping Agency

Reporter

Comment 2

•

20 years ago

Attached file Zip file, contains "Saved as text" files of 6 test cases. — Details

Jungshik Shin

Comment 3

•

20 years ago

Can this have anything to do with a recent (?) stream change?

Keywords: dataloss

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Mike Cowperthwaite

Comment 4

•

20 years ago

Saving as text under an English Win2K system, I primarily see the symptom from 
bug 230042.  However, the corruption looks very similar to that in bug 261923 
(which is definitely not a recent regression).  That also hinges on a non-ASCII 
character, in the subject in that case.

WADA:World Anti-bad-Duping Agency

Reporter

Comment 5

•

20 years ago

(In reply to comment #4)
> That also hinges on a non-ASCII character

Mainly on non-ascii, I think.
This bug is problem in convertion from charset of mail to charset of system,
iso-2022-jp to Shift_JIS(CP 932) in my case.
I suspect buffer corruption/overlay or control block corruption when data is
longer than buffer size.
So I think this is also applicable to long iso-8859-1 mail on English MS
Windows, as you reported to Bug 257828.

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

20 years ago

Blocks: 261923

WADA:World Anti-bad-Duping Agency

Reporter

Comment 6

•

20 years ago

Workaround:
 (1) Start text editor(notepad, simpletext, e editor etc.)
 (2) CTRL+A, then CTRL+C on Mozilla/Thunderbird (Command+A,Command+C on Mac)
 (3) CTRL+V on editor (Command+V on Mac)
 (4) CTRL+S on editor (Command+S on Mac), then specifiy file name, then press OK
(Sorry but I don't know how to on Emacs)

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

19 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

WADA:World Anti-bad-Duping Agency

Reporter

Comment 7

•

19 years ago

Cc: ing to to Jshin, because assignee of Bug 263797, and putting Bug 138578 in
"depends on:".
Jshin, remove if wrong, please.

Depends on: 138578

Jungshik Shin

Comment 8

•

19 years ago

I wrote a comment but forgot to commit. My fix for bug 138578 fixes case 3, but
cases 4 to 6 appear to have a different cause. Stack is as shown below and I
wonder how   p == end (end = start  + N where N is 4096) can be satisfied for
every UTF-8 string passed to |CalculateUTF8Length::write| (there's no guarantee
that the fixed-size buffer boundary coincides with a 'UTF-8 character'
boundary). 4096 seems to be a constant defined somewhere in the stream code. The
beginning of a string passed to |write| is valid as UTF-8, but I wasn't able to
inspect some variables I want to inspect with VC++ (it wouldn't let me). I'll
try later with gdb on Linux.

nsDebug::Assertion(const char * 0x007e8908 `string', const char * 0x007e3b4c
`string', const c
har * 0x007eedd0 `string', int 0x00000108) line 109
CalculateUTF8Length::write(CalculateUTF8Length * const 0x0102bfb8, const char *
0x04473760, un
signed int 0x00001000) line 264 + 25 bytes
   N = 4096 : how come? 

  if ( p != end )
          {
            NS_ERROR("Not a UTF-8 string. This code should only be used for
converting from kn
own UTF-8 strings.");
            mErrorEncountered = PR_TRUE;
            return N;
          }
        return p - start;


copy_string(nsReadingIterator<char> & {...}, const nsReadingIterator<char> &
{...}, CalculateU
TF8Length & {...}) line 95 + 13 bytes
AppendUTF8toUTF16(const nsACString & {...}, nsAString & {...}) line 234
nsSaveMsgListener::OnDataAvailable(nsSaveMsgListener * const 0x00001077,
nsIRequest * 0x043ea7
e0, nsISupports * 0x044467bc, nsIInputStream * 0x00001000, unsigned int
0x00000000, unsigned i
nt 0x00001000) line 2032 + 44 bytes
nsMimeBaseEmitter::Complete(nsMimeBaseEmitter * const 0x00004077) line 949
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x01d6fc69,
nsIRequest * 0x0428de98
, nsISupports * 0x04414abc, unsigned int 0x037ed828) line 1028
nsMsgProtocol::OnStopRequest(nsMsgProtocol * const 0x04444e88, nsIRequest *
0x00000000, nsISup
ports * 0x04473760, unsigned int 0x002b7b20) line 388 + 25 bytes

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

19 years ago

Blocks: 230042

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

19 years ago

Blocks: 257828

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

19 years ago

No longer blocks: 230042

:aceman

Comment 9

•

18 years ago

My observations to this bug:
I could reproduce the problem with the attached testcase in seamonkey 1.0.
On the other hand, Linux Thunderbird 1.0 gave different results:
-cases 1 and 2 were fine
-cases 3 to 5 produced an empty file
-case 6 produced a file containing just the string '??????????????????????????????????????????????? End of Mail Body '

This is surprising because my other experiments showed that TB on linux is much better - I couldn't reproduce the bug with messages that produced the bug on the same machine in Windows Mozilla 1.7.

Summary: "Save as text" of non-asci mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit

:aceman

Comment 10

•

18 years ago

I add two attachments with a sample message. The first one 
produces an empty file when saved as txt. The other one 
works fine. The only difference between them are 2 bytes ('xx') in the 
sender's address. The message text is intentionally 
obfuscated (simple substitution 1 char for 1 char). But it
was necessary to leave it in, stripping the text removed the
bug. I can not reprocude the bug in Linux TB on these files, but it shows in Windows. I hope somebody can now trace where do the codepaths diverge on these 2 messages.

I propose to change the OS field, and I also am not sure about the component.

:aceman

Comment 11

•

18 years ago

Attached file When this message is saved as txt, empty file is produced. — Details

:aceman

Comment 12

•

18 years ago

Attached file This message is saved correctly. — Details

:aceman

Comment 13

•

18 years ago

I am not sure this bug depends on the size of the message. Maybe partly, but I have large (30KB) messages that export correctly. Only about 10% of messages from one of my friends loose parts of the text when saved. It may have something to do with the buffered read, e.g. attached messages are encoded as quoted-printable.
There, accented characters are encoded as 3 bytes. What happens if such triplet happens to be split into 2 adjacent buffers? The same is in Japanese messages, where chars are 2 bytes wide.

WADA:World Anti-bad-Duping Agency

Reporter

Comment 14

•

18 years ago

(a part of comment #10)
> The first one produces an empty file when saved as txt. The other one 
> works fine. The only difference between them are 2 bytes ('xx') in the 
> sender's address.
(a part of comment #13)
> I am not sure this bug depends on the size of the message. Maybe partly,
> but I have large (30KB) messages that export correctly. Only about 10% of
> messages from one of my friends loose parts of the text when saved.

Your test result says that size itself is not cause of problem. 
Size has relation to only "whether multiple buffer are used or not", I think. 

(In reply to comment #13)
> There, accented characters are encoded as 3 bytes. What happens if such
> triplet happens to be split into 2 adjacent buffers?
> The same is in Japanese messages, where chars are 2 bytes wide.

Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
  - If shourt mail, whole data is saved in a buffer, then no split occurs.
  - If long mail, multiple buffers are used, and probablity of split is very
    high because it's ISO-2022-JP mail.

OS: Windows 2000 → All

Jungshik Shin

Comment 15

•

18 years ago

It seems like our stream code has some 'bad interactions' with our encoding converter. biesi, can you take a look?

Christian :Biesinger (don't email me, ping me on IRC)

Comment 16

•

18 years ago

I have no idea what mailnews does here or what code is involved...

:aceman

Comment 17

•

18 years ago

(In reply to comment #14)
> Your guess sounds to explain problem of my test cases, IOS-2022-JP text mail.
>   - If shourt mail, whole data is saved in a buffer, then no split occurs.
>   - If long mail, multiple buffers are used, and probablity of split is very
>     high because it's ISO-2022-JP mail.
> 
Yes, but we can't be sure. Comment 8 said, the buffer may be 4KB. But where does is start to read? At the beginning of the message data? At the beginning of the message BODY data, i.e. after the headers? Without this info, I can't prove this theory.

WADA:World Anti-bad-Duping Agency

Reporter

Comment 18

•

18 years ago

(In reply to comment #15)
> It seems like our stream code has some 'bad interactions' with our encoding
> converter.

At least next cases seems to be cared :
 (1) Split at mid of 3 byte encoded data (Unicode, EUC_JP etc.)
      --- Buffer ----><--- Next Buffer ----
      .......<Three-Byte-Encode-Data>.......
 (2) Split at mid of 3 byte Escape sequence/Unescape sequence (ISO-2022-JP)
   (2-A)    
      --- Buffer ----><--- Next Buffer --------------------
      .....<Escape-Sequence>.......<Unescape-Sequecce>.....
   (2-B)    
      ----------------------- Buffer ---><--- Next Buffer ---
      .....<Escape-Sequecce>......<Unescape-Sequence>........
 (3) Split of "Escape sequence" and "Unescape sequence" (ISO-2022-JP)
      -------------- Buffer ---><--- Next Buffer -------------
      ...<Escape-Sequence>.............<Unescape-Sequecnce>....

Above sounds to be the reason why many variations of corruption pattern.

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

17 years ago

Blocks: 230042

WADA:World Anti-bad-Duping Agency

Reporter

Comment 19

•

16 years ago

Attached file ZIP of Mail folder files/saved text files (test mails for test of split of 3bytes escape sequence of ISO-2022-JP) — Details

All test mail is ISO-2022-JP mails, and length of all mail headers of all test mails are same. Difference is only next three.
 (1) number of mail body lines
 (2) length of <Pad_1>
 (3) length of <Pad_2> of line-0030 (if line number<30, always no <Pad_2>).

Note: Esc == 0x1B, Esc$B == escape to DBCS, Esc(B == escape to SBCS

>Case-1 : ISO-2022-JP-7-26
> Split occurs in line-0027. This case is to see at where split starts.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0026) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-2 : ISO-2022-JP-7-30
> Split occurs in line-0027. Mail is splitted to 2 buffers only when 30 lines.
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 256
> Mail subject == ISO-2022-JP-26-NNNN-0000 ( NNNN == length of <Pad_1> )
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Split occurs
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

>Case-3 : ISO-2022-JP-7-70
> First split occurs in line-0027.
> Second split occurs in line-XXXX(between line-0031 to line-0070).
> <Pad_1> : ascii chars, Length of <Pad_1> : varies from 0 to 24
> <Pad_2> : ascii chars, Length of <Pad_2> : varies from 0 to 24
> Mail subject == ISO-2022-JP-70-NNNN-PPPP
>               ( NNNN == length of <Pad_1>, PPPP == length of <Pad_2> ) 
>  (line-0000) Esc$B + 0x2422 + Esc(B + "X" + <Pad_1>
>  (line-0001) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-0027) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== First split
>           |
>  (line-0030) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7 + <Pad_2>
>  (line-0031) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>           |
>  (line-XXXX) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7  <== Second split
>           |
>  (line-0070) [ Esc$B + {0x2422}*4 + Esc(B + "A"*5 ]*7
>  (last line) Esc$B + 0x2422 + Esc(B + "Y"

Corruption pattern depends on both of (A) where first split occurs & (B) where second split occurs. This is the reason why many variations of problem exists when ISO-2022-JP.
See Bug 230042 Comment #30 for split of 3bytes UTF-8 code case.

WADA:World Anti-bad-Duping Agency

Reporter

Comment 20

•

16 years ago

Attached file Dir list of saved text files (file name & file size) — Details

WADA:World Anti-bad-Duping Agency

Reporter

Comment 21

•

16 years ago

Corrections about test cases. Sorry for spam.

>Case-1 : ISO-2022-JP-7-26
> Length of <Pad_1> : varies from 0 to 256
Subject varies from 0000 to 0256, but valid cases are 0000 to 0104 only. 0105 to 0256 are meaningless cases.
(I forgot to increase pad character data length from 104, which was sufficient when pad_len==99)

>Case-2 : ISO-2022-JP-7-30
> Length of <Pad_1> : varies from 0 to 256
It should be "varies from 0 to 99".

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

16 years ago

Summary: "Save as text" of non-ASCII mail is corrupted when mail body exceeds some size limit → "Save as text" of non-ASCII mail is corrupted when 3-bytes escape sequence is split to two buffers

WADA:World Anti-bad-Duping Agency

Reporter

Comment 22

•

16 years ago

FYI.
Add-On of ImportExportTools can be a workaround of this bug for some users who can accept "Export" as alternative for "Save As/TEXT".
> http://importexporttools.en.softonic.com/
> export all messages in single files (html or plain text format), with an index.

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → MailNews Core

Magnus Melin [:mkmelin]

Updated

•

15 years ago

QA Contact: i18n

Dan Mosedale (:dmosedale, :dmose)

Comment 23

•

15 years ago

Simon, is this something you might be able to look at soonish?

WADA:World Anti-bad-Duping Agency

Reporter

Updated

•

15 years ago

No longer blocks: 257828, 261923

Simon Montagu :smontagu

Assignee

Comment 24

•

15 years ago

So as far as I can tell from comment 8 and an initial code inspection pass without having looked at any of the testcases in a debugger, there is at least one problem contributing to this bug (there may well be others):

nsSaveMsgListener::OnDataAvailable buffers the message data into a text buffer using NS_ConvertUTF8toUTF16.
http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#1787

In nsSaveMsgListener::OnStopRequest this whole buffer is converted to plain text and then encoded in the system charset (via nsMsgI18NSaveAsCharset).

The problem is in the first step, because NS_ConvertUTF8toUTF16 doesn't allow for UTF-8 sequences crossing buffer boundaries (see comments at http://mxr.mozilla.org/comm-central/source/mozilla/xpcom/string/public/nsUTF8Utils.h#395 and 479).

I think the right fix is to make the buffer an nsCString rather than an nsString, and hold off the conversion to UTF16 until we pass it to ConvertBufToPlainText

Simon Montagu :smontagu

Assignee

Updated

•

15 years ago

Status: NEW → ASSIGNED

Simon Montagu :smontagu

Assignee

Comment 25

•

15 years ago

Attached patch Patch — Details — Splinter Review

With this patch all the mails in attachment 165909 [details] are saved correctly. I'm not sure what magic I need to do with the other attachments to get them into a form where I can use them as testcases, and I have no clue how to make an automated testcase.

Attachment #391337 - Flags: superreview?(bugzilla)

Attachment #391337 - Flags: review?

Simon Montagu :smontagu

Assignee

Updated

•

15 years ago

Attachment #391337 - Flags: review? → review?(bienvenu)

David :Bienvenu

Comment 26

•

15 years ago

Comment on attachment 391337 [details] [diff] [review]
Patch

I tried this, and it does fix the issue (WADA's six message drafts folder was the easiest way I found to reproduce the issue, FWIW).

I looked into an automated test case for this. Currently, it would have to be a mozmill test because the code path isn't exercisable from an xpcshell test at this point. It's probably worthwhile adding a method to nsIMessenger for saving a message as text so that extensions could use it, and I might have a go at that.

Attachment #391337 - Flags: review?(bienvenu) → review+

Mark Banner (:standard8)

Comment 27

•

15 years ago

Setting in-testsuite? to remind us to see if we can write a test for this at some stage.

Flags: in-testsuite?

Mark Banner (:standard8)

Comment 28

•

15 years ago

Comment on attachment 391337 [details] [diff] [review]
Patch

Sorry for the delay in reviewing this, looks good. sr=Standard8

Attachment #391337 - Flags: superreview?(bugzilla) → superreview+

Simon Montagu :smontagu

Assignee

Comment 29

•

15 years ago

http://build.mozillamessaging.com/mercurial/comm-central/rev/e10df4ee2d17

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Dan Mosedale (:dmosedale, :dmose)

Comment 30

•

15 years ago

Thanks, Simon!  Much appreciated.

Mark Banner (:standard8)

Updated

•

9 years ago

Flags: in-testsuite?

You need to log in before you can comment on or make changes to this bug.