Closed Bug 355209 Opened 14 years ago Closed 5 years ago

Long Japanese or Unicode sentences is broken by Tb/Sm when mail is sent or saved in Outbox/Drafts (When long line is split by SMTP line length limit==LINE_BREAK_MAX, Tb splits at mid of 3bytes code of utf-8 and 3bytes escape sequence of iso-2022-jp)

Categories

(MailNews Core :: Composition, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1225904

People

(Reporter: sugar.waffle, Unassigned)

References

Details

(Keywords: dataloss, intl)

Attachments

(2 files)

When a long Japanese sentence is entered, the message is not saved correctly to Drafts.
Up to 492 Japanese characters are saved correctly.

Reproducible: Always

Steps to Reproduce:

Thunderbird account settings:
 1) Open Tools --> Account Settings.
 2) Select "Composition & Addressing" from the list on the left.
 3) Uncheck "Compose messages in HTML format".
 4) Click OK.

Thunderbird options:
 5) Open Tools --> Options.
 6) Select the Composition tab.
 7) Enter 0 for "Wrap plain text message at xxx characters".
 8) Select the Display tab.
 9) Select "Japanese (ISO-2022-JP)" for both Outgoing Mail and Incoming Mail.
10) Click OK.

11) Create a new message.
12) Turn the IME on.
13) Enter 493 Japanese characters.
    e.g. enter only あ, 493 times.
14) Click the Save button and close the Compose window.
15) Open the message saved in Drafts.

Windows XP SP1
version 3 alpha 1 (20061002)
The message saved in Drafts was entered as 493 consecutive Japanese "あ"
characters.
(In reply to comment #2)
> Created an attachment (id=241033) [edit]
> Draft when problem occurs
> 
> The message saved in Draft was input continuing "あ" in Japanese by 493 
> characters. 
> 
Oops...
The last several characters were entered as "い".
I think it occurs here:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/mailnews/compose/src/nsMsgSend.cpp&rev=1.387#1876
|charsSinceLineBreak| actually counts bytes, not characters.
So a multi-byte character may be split by the inserted line break.
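
To illustrate the effect, here is a minimal Python sketch. It is not the nsMsgSend.cpp code; `naive_byte_split` is a hypothetical helper that simply cuts at a byte count, the way a byte counter mistaken for a character counter would.

```python
# Hypothetical illustration only; not the actual Thunderbird implementation.
def naive_byte_split(data: bytes, limit: int) -> list:
    """Cut data into pieces of at most `limit` bytes, ignoring character boundaries."""
    return [data[i:i + limit] for i in range(0, len(data), limit)]


text = "あ" * 4  # 4 characters, 12 bytes in UTF-8 (3 bytes each)
pieces = naive_byte_split(text.encode("utf-8"), 10)

for piece in pieces:
    # errors="replace" turns each broken byte sequence into U+FFFD
    print(piece.decode("utf-8", errors="replace"))
```

The first piece holds three whole characters plus the first byte of the fourth, so neither piece decodes cleanly on its own.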

Shouldn't the product be Core/MailNews rather than Thunderbird?
(In reply to comment #4)
> I think it occurs here:
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/mailnews/compose/src/nsMsgSend.cpp&rev=1.387#1876
> |charsSinceLineBreak| is actually bytes, not character.
> So, multi-byte character may be split with linebreak.

The number of characters at which the saved text becomes corrupted is consistent with the following value.
http://landfill.mozilla.org/mxr-test/mozilla/source/mailnews/compose/src/nsMsgSend.cpp#1845
1845 #define LINE_BREAK_MAX 990
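
The arithmetic can be checked with a short Python sketch. It assumes that the 990-byte limit applies to the whole encoded line, including the ISO-2022-JP escape sequences at both ends, and it uses Python's `iso2022_jp` codec rather than Mozilla's encoder; `iso2022jp_length` is a name made up for this example.

```python
LINE_BREAK_MAX = 990  # value quoted from nsMsgSend.cpp above


def iso2022jp_length(count: int) -> int:
    """Byte length of `count` copies of あ encoded as ISO-2022-JP."""
    # ESC $ B (3 bytes) + 2 bytes per character + ESC ( B (3 bytes)
    return len(("あ" * count).encode("iso2022_jp"))


for count in (492, 493):
    size = iso2022jp_length(count)
    print(count, size, "fits" if size <= LINE_BREAK_MAX else "exceeds the limit")
```

Under that assumption, 492 characters encode to exactly 990 bytes and 493 characters to 992 bytes, which matches the reported threshold.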

> Product is not Thunderbird, but Core/MailNews?
Yes.
The Mac version reproduces it, too.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
Assignee: mscott → nobody
Severity: normal → major
Component: General → Composition
Product: Thunderbird → MailNews Core
QA Contact: general → composition
Version: Trunk → unspecified
perhaps kozawa can test this with beta 3
 http://www.mozillamessaging.com/en-US/thunderbird/early_releases/
Whiteboard: [needs trunk test]
Version: unspecified → 1.8 Branch
Reproduced with Tb beta 3 on WinXP SP3.
As far as I can see from the CVS history, the problem has not been fixed.
Keywords: intl
Whiteboard: [needs trunk test]
Version: 1.8 Branch → Trunk
I guess there's something that can be done with the charset encoders/decoders?

I'm not sure exactly what the STR are on 3.1 RC. I can't get it to fail locally; I don't seem to have found the setting that triggers this.

Also, is that max line break a thing we do for RFCs or for our own sanity? That would determine how the encoding actually impacts what we're doing, namely glyphs vs bytes.
I think the "Wrap plain text message" option doesn't work now.

The plain text formatter always inserts a line break every 72 characters, and the mailnews code forcibly inserts a line break every 990 bytes.

I will consider the fix as part of bug 553526 and bug 26734. To support delsp=yes, I need to refactor the line break code in mailnews.
Attachment #241033 - Attachment mime type: text/plain → application/octet-stream
(In reply to comment #9)
> I will consider the fix by bug 553526 and Bug 26734.  To support delsp=yes, I
> need refactor linebreak code in mailnews.

If you mean you will fix them here, then please adjust the dependencies I have just set.
Depends on: 553526, 26734
(In reply to comment #10)
> (In reply to comment #9)
> > I will consider the fix by bug 553526 and Bug 26734.  To support delsp=yes, I
> > need refactor linebreak code in mailnews.
> 
> if you mean you will fix them here then please adjust the dependencies I have
> just test

To support DelSp=yes (bug 26734), I must fix this.

The fix plan is:
- Do not break lines when saving mail to Drafts.
- Break lines when sending.

But this is just an idea. I am still investigating the fix.
(In reply to comment #1)
> Created attachment 241032 [details]
> Screen shot when problem occurs

With Tb 3.1.7, I couldn't reproduce this kind of corruption of iso-2022-jp data (loss of an escape sequence due to an inserted CRLF) with HTML mode composition and "Send Later".
Tb 3.1 appears to take the charset and escape sequences into account when splitting at LINE_BREAK_MAX 990 (990 bytes).
Kato-san, do you still see the same corruption with Tb 3.1?

(In reply to comment #11)
> To support DelSp=yes (Bug 26734), I must fix this.
> Fix plan is 
> - It doesn't break line when saving mail to draft.

For a local Drafts folder this will be an improvement, because Tb can use any line length. But for an IMAP Drafts folder, the line length still has to be taken care of.

> - when sending it, it breaks lines

The "generated mail data stream for mail send" is the same data as the mail saved in Outbox (== Unsent Messages) by "Send Later".
In the generated mail data stream, a long line is split by any of the following.

(A) text/html part.
(A-1) By "#define LINE_BREAK_MAX 990".
      This applies to any charset.
(A-2) By editor.htmlWrapColumn (default=72; 72 characters, not 72 bytes).
      With SBCS characters such as ASCII, a word longer than this length
      is not split. Splitting of continuous characters seems to be a
      DBCS-charset-only phenomenon.
Because Tb 3.1 formats the HTML source (indenting by putting spaces before the text in the HTML), the data inserted by splitting becomes "CRLF + some spaces".

(B) text/plain part.
(B-0) The text/plain part data is generated by the text converter.
      Because a newline character in HTML is equivalent to a space,
      the CRLF inserted by (A) for text/html is converted to a space.
      So an excess space appears in the text/plain part data.
After text conversion, the following are applied.
(B-1) By mailnews.wraplength (default=72; 72 bytes, not 72 characters).
      Because the limit is 72 bytes rather than 72 characters, additional
      line splitting occurs if a multi-byte charset is used.
(B-2) By format=flowed (max 80 bytes or 78 bytes including CRLF).
      With ASCII, the split happens at a space.
      I don't know the behaviour for DBCS characters well.

Note: In text mode composition, hard wrapping is done during composition. So a line split by "LINE_BREAK_MAX 990" occurs only when the user intentionally sets mailnews.wraplength=0 or a value larger than 990.

See bug 611411 comment #3 for the procedure to observe the above.

Tb 3.1 doesn't display a very long line as if it were hard-wrapped during HTML mode composition.
Also, even if a long line is split in the text/html part by Save As or Send Later, Tb 3.1 appears to show it as continuous characters (i.e. it ignores or removes the CRLF inserted into the HTML source by line splitting), if a text/html part exists and View/Message Body As/HTML is chosen.
I don't know the behaviour for a <pre> part.
I guess bug 611411 is about the excess space from (B-0) in the text/plain part.

To properly support line splitting of multi-byte charsets by wrap length or line length limit, DelSp=Yes support or something similar is required for both the text/html part and the text/plain part.
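
As a rough illustration of why DelSp matters for text without word spaces, here is a simplified Python sketch of joining format=flowed lines as described in RFC 3676. It is an assumption-laden sketch, not Thunderbird's code: `unflow` is a hypothetical helper, and it ignores quoting, space-stuffing and the "-- " signature separator.

```python
def unflow(lines, delsp):
    """Join format=flowed lines: a line ending in a space is a soft line break."""
    paragraphs = []
    current = ""
    for line in lines:
        if line.endswith(" "):
            # Soft break. With DelSp=yes the trailing space is only a marker
            # and is dropped; with DelSp=no it stays in the text.
            current += line[:-1] if delsp else line
        else:
            # A line without a trailing space ends the paragraph.
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


flowed = ["あああ ", "いいい"]
print(unflow(flowed, delsp=False))  # the space survives between the characters
print(unflow(flowed, delsp=True))   # the characters are joined without a space
```

With DelSp=no the soft break leaves a space in the middle of the Japanese text, which is the kind of excess space described in (B-0).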

Further, splitting in the middle of a multi-byte character or an escape sequence by LINE_BREAK_MAX 990 should be taken care of. "Draft or sent mail data corruption when many very long lines are pasted into the compose window" was once reported to a Japanese forum. It looked like a split of a three-byte code or of a three-byte iso-2022-jp escape sequence under a special condition (e.g. the three bytes being placed at a buffer boundary).
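
For the UTF-8 case, one possible way to avoid cutting inside a character is to back up from the byte limit until the cut no longer lands on a continuation byte. The Python sketch below shows that idea only; `split_utf8` is a hypothetical helper, not a proposed patch, and it does not handle iso-2022-jp escape sequences.

```python
def split_utf8(data: bytes, limit: int) -> list:
    """Split UTF-8 bytes into pieces of at most `limit` bytes without cutting a character."""
    if limit < 4:
        raise ValueError("limit must be able to hold a 4-byte UTF-8 character")
    pieces = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        cut = end
        # If the next piece would start with a continuation byte (10xxxxxx),
        # the cut is inside a character, so move it back.
        while start < cut < len(data) and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        if cut == start:
            # Malformed input with no boundary in range: fall back to a hard cut.
            cut = end
        pieces.append(data[start:cut])
        start = cut
    return pieces


encoded = ("あ" * 4).encode("utf-8")  # 12 bytes
for piece in split_utf8(encoded, 10):
    print(len(piece), piece.decode("utf-8"))
```

Each piece decodes on its own, at the cost of some pieces being a few bytes shorter than the limit.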
(In reply to comment #1)
> Created attachment 241032 [details]
> Screen shot when problem occurs

This problem still occurs in Tb 3.1 and Tb 3.3a3pre (2011/01/15 build). I wrongly assumed that Tb would also show the first line as corrupted data, because the iso-2022-jp spec says a line should end by escaping back to ASCII mode.
Sorry for my misunderstanding.

(1) Text mode composition, mailnews.wraplength=0, iso-2022-jp.
    CRLF is inserted by LINE_BREAK_MAX 990 without regard to the
    iso-2022-jp escape sequences.
    Tb 3.1 shows the first line in Japanese characters, with U+FFFD at the line end.
    The second and all later lines are shown as garbled text.
    View/Message Source of Tb 3.1/trunk shows the first line garbled as well.
    A text editor also shows all lines garbled.
(2) HTML mode composition, mailnews.wraplength=0, iso-2022-jp,
    editor.htmlWrapColumn=0.
    CRLF is inserted in the text/html part regardless of the
    editor.htmlWrapColumn setting (it looks like always 72 characters).
    In the text/plain part, the line length looks like "LINE_BREAK_MAX 990",
    but data corruption is not observed.
    The last character's column is 498, 499, or 500, depending on the excess spaces.
    It seems that splitting in the text/plain part is done at a character
    boundary, taking escape sequences into account, if the split is done on
    the iso-2022-jp binary line.
Depends on: 653342
Severity: major → critical
Keywords: dataloss
Summary: Long Japanese sentences are not normally saved in Draft → Long Japanese sentences is broken by Tb/Sm when mail is sent or saved in Outbox/Drafts (When long line is split by SMTP line length limit==LINE_BREAK_MAX, Tb splits at mid of 3bytes code of utf-8 and 3bytes escape sequence of iso-2022-jp)
(Correction of comment #12)
> With Tb 3.1.7, I couldn't see this kind of corruption of iso-2022-jp
> data(loss of escape sequence due to inserted CRLF) with HTML mode
> composition and "Send Later".
> Tb 3.1 looks to care for charset and escape sequence upon split by
> LINE_BREAK_MAX 990(990 bytes).

My observation and guess were wrong.
The reason why a split of the 3-byte iso-2022-jp escape sequence doesn't occur in HTML mode composition was:
  The HTML editor always inserts a line break about every 80 Unicode
  characters (not bytes; perhaps at "80 - line break length" characters).
This insertion of a line break about every 80 Unicode characters doesn't happen inside <pre>. So if <pre> is used, this bug occurs in the text/html part even in HTML mode composition.
See Also: → 653342
See Also: 653342
Summary: Long Japanese sentences is broken by Tb/Sm when mail is sent or saved in Outbox/Drafts (When long line is split by SMTP line length limit==LINE_BREAK_MAX, Tb splits at mid of 3bytes code of utf-8 and 3bytes escape sequence of iso-2022-jp) → Long Japanese or Unicode sentences is broken by Tb/Sm when mail is sent or saved in Outbox/Drafts (When long line is split by SMTP line length limit==LINE_BREAK_MAX, Tb splits at mid of 3bytes code of utf-8 and 3bytes escape sequence of iso-2022-jp)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1225904