Closed Bug 1225864 Opened 10 years ago Closed 10 years ago

Don't wrap/chop through long CJK strings when serialising

Categories

(Core :: DOM: Serializers, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla45

Tracking Flags:

Tracking

Status

firefox45

---

fixed

People

(Reporter: jorgk-bmo, Assigned: jorgk-bmo)

References

Details

Attachments

(5 files, 2 obsolete files)

Test showing wrapping of CJK characters when rendered onto the screen. 10 years ago Jorg K (CEST = GMT+2) 879 bytes, text/html		Details
Result of serialisation with line breaking (current state) 10 years ago Jorg K (CEST = GMT+2) 825 bytes, text/html		Details
Result of serialisation with line breaking (proposed new) 10 years ago Jorg K (CEST = GMT+2) 801 bytes, text/html		Details
Proposed solution (v1) 10 years ago Jorg K (CEST = GMT+2) 5.34 KB, patch	masayuki : review-	Details \| Diff \| Splinter Review
Proposed solution (v1b) 10 years ago Jorg K (CEST = GMT+2) 5.96 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review
Test (v1) 10 years ago Jorg K (CEST = GMT+2) 2.19 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review
Proposed solution (v1c). 10 years ago Jorg K (CEST = GMT+2) 6.72 KB, patch	jorgk-bmo : review+	Details \| Diff \| Splinter Review

Jorg K (CEST = GMT+2)

Assignee

Description

•

10 years ago

Attached file Test showing wrapping of CJK characters when rendered onto the screen. — Details

Don't wrap/chop through long CJK strings when serialising: nsEditor::OutputToString() and nsContentUtils::ConvertToPlainText() wrap/chop long (UTF-8) strings containing CJK characters (and no white-space) when called with nsIDocumentEncoder::OutputFormatted. They must not do so. No string must be broken into multiple output lines. For details see bug 653342 comment #161, bug 653342 comment #162 and bug 653342 comment #170. Long strings containing Western European characters like á (and no white-space) are not split, so it's not clear why CJK strings are split. Obviously screen rendering of CJK strings is a different story (see attachment), but serialising causes problems since white-space is added where there was none.

Jorg K (CEST = GMT+2)

Assignee

Updated

•

10 years ago

Blocks: 653342

Jorg K (CEST = GMT+2)

Assignee

Comment 1

•

10 years ago

Oops, that's nsPlaintextEditor::OutputToString(): http://mxr.mozilla.org/mozilla-central/source/editor/libeditor/nsPlaintextEditor.cpp#1333 The other call is from nsContentUtils::ConvertToPlainText(): http://mxr.mozilla.org/mozilla-central/source/dom/base/nsContentUtils.cpp#4529 In the end both call nsDocumentEncoder::EncodeToString(): http://mxr.mozilla.org/mozilla-central/source/dom/base/nsDocumentEncoder.cpp#1085

Comment hidden (obsolete)

Comment hidden (obsolete)

Jorg K (CEST = GMT+2)

Assignee

Comment 4

•

10 years ago

Another observation: We might have a little fight about this issue. The author of this test https://dxr.mozilla.org/mozilla-central/source/dom/base/test/TestPlainTextSerializer.cpp#78 thought that cutting through a CJK string is a good thing. Take a look, expecting to get cut after 36 characters. This was implemented in bug 553540 to support the delsp=yes stuff from bug 26734, which never landed since the authors abandoned it. Let's read up on RFC 3676 which *obsoletes* RFC 2646: https://www.ietf.org/rfc/rfc3676.txt 4.1. Interpreting Format=Flowed If the line is flowed and DelSp is "yes", the trailing space immediately prior to the line's CRLF is logically deleted. Finally the whole thing is dawning on me: The authors of bug 26734 and bug 553540 tried to suppress the spaces caused by the serialiser chopping through CJK strings by specifying "format=flowed;delsp=yes". This as a *desperate* measure to remove the erroneous spaces which would only ever have worked in plaintext e-mail. But we also have problem of the extra spaces in HTML e-mail. That's why we need to fix it here. No intelligence in the words will be able to remove spaces that got inserted "randomly".

Comment hidden (off-topic)

Comment hidden (off-topic)

Comment hidden (off-topic)

Comment hidden (off-topic)

(In reply to Jorg K (GMT+1) from comment #8) > Thanks for the comment. > Let me summarise the problem. The M-C serialiser inserts spaces into CJK > strings which we don't now how to remove later. It doesn't insert spaces > into Western European strings, regardless of their length. > > My aim is to treat all strings as anonymous data and transmit them > unadulterated. > > When I am finished, the data in the *body* will be transmitted as the user > entered it. > Good. Fair enough. > <off topic> > I can look at issues in the subject later. But are there really any? > I put > これは長い日本語のテキストですので、 > 行が折れると思います。これは長い日本語のテキストですので、行が折れると思います。これは長い日本語のテキストですので、行が折れると思います。 > The result is > Subject: =?UTF-8?B?44GT44KM44Gv6ZW344GE5pel5pys6Kqe44Gu44OG44Kt44K544OI44Gn?= > =?UTF-8?B?44GZ44Gu44Gn44CB6KGM44GM5oqY44KM44KL44Go5oCd44GE44G+44GZ44CC44GT?= > =?UTF-8?B?44KM44Gv6ZW344GE5pel5pys6Kqe44Gu44OG44Kt44K544OI44Gn44GZ44Gu44Gn?= > =?UTF-8?B?44CB6KGM44GM5oqY44KM44KL44Go5oCd44GE44G+44GZ44CC44GT44KM44Gv6ZW3?= > =?UTF-8?B?44GE5pel5pys6Kqe44Gu44OG44Kt44K544OI44Gn44GZ44Gu44Gn44CB6KGM44GM?= > =?UTF-8?B?5oqY44KM44KL44Go5oCd44GE44G+44GZ44CC?= > with no extra spaces. > </off topic> I will ask my friend for a sample of such broken subject line which he mentioned. > > I don't want to start the discussion where CJK strings could potentially be > broken. You say yourself that 'Japan' shouldn't be broken. I believe there > are complicated rules where you can break and where you can't, see: > https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages Let us forget about it. It is in the realm of formatter such as TeX, nroff, etc. > Again: The aim it to transmit what was entered, no more and no less. > Breaking lines is really unnecessary since everything that exceeds the > prescribed buffer length can be transmitted using CTE base64. We get exactly what we input, right? TIA

Comment hidden (off-topic)

Jorg K (CEST = GMT+2)

Assignee

Comment 11

•

10 years ago

Attached file Result of serialisation with line breaking (current state) — Details

Please look for spaces, you'll see three spurious spaces. Take a look at the HTML to see that they are caused by line breaking.

Jorg K (CEST = GMT+2)

Assignee

Comment 12

•

10 years ago

Attached file Result of serialisation with line breaking (proposed new) — Details

This was serialised without line breaking. No extra spaces.

Assignee: nobody → mozilla

Status: NEW → ASSIGNED

Jorg K (CEST = GMT+2)

Assignee

Comment 13

•

10 years ago

Attached patch Proposed solution (v1) (obsolete) — Details — Splinter Review

(Hi Ehsan, how was your holiday? Are you taking reviews at the moment?) I made this bug as streamlined as possible by obsoleting the conversation I had with a Japanese user. The problem is that when we serialise DOM to text (e-mail body) we get extra spaces. Many have tried to deal with them, and all have failed, because they need to be removed at their source. I know that serialisers are un-owned and un-maintained and a little broken here and there. This small tweak won't do any harm and we really need it badly. I haven't done a try run. I can't imagine that this will break anything, since the new flag is obviously not used anywhere. How can we proceed? I'd like to get this landed quickly, since I need to land the depending Thunderbird bugs. Due to this bug, TB has basically been broken for all CJK users for years. I can offer to do a try run and I can offer to add a test for the new feature. There is already dom/base/test/unit/test_xmlserializer.js where some test cases could easily be added.

Attachment #8690636 - Flags: review?(ehsan)

Jorg K (CEST = GMT+2)

Assignee

Comment 14

•

10 years ago

For a try build refer to bug 653342 comment #177.

(no longer active)

Comment 15

•

10 years ago

Comment on attachment 8690636 [details] [diff] [review] Proposed solution (v1) It's not immediately obvious to me where the bug comes from, and whether this is the right fix. I think Jonathan and/or Masayuki should be able to evaluate what the issue here is and whether this is the right approach. Note that whatever the underlying bug is, it may manifest itself in other places such as line-breaking preformatted text, or textarea's wrapping support, etc, so the bug may need to be fixed at a different level...

Attachment #8690636 - Flags: review?(masayuki)

Attachment #8690636 - Flags: review?(jfkthame)

Attachment #8690636 - Flags: review?(ehsan)

Jorg K (CEST = GMT+2)

Assignee

Comment 16

•

10 years ago

Thanks for the quick reply. All I can say it that for obvious reasons we need the DOM converted to attachment 8690633 [details] and not attachment 8690632 [details]. While I was debugging it yesterday I ended up in http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/nsJISx4051LineBreaker.cpp#801 JIS might stand for Japanese Industry Standard, I don't know. However, in my test case, as you can see, I was using Korean characters, the character representing the syllable "an", which is the first syllable of "Hello"/"Good Day" in Korean. I'm not even aware that Korean can be broken anywhere you want, since the delimit their words by white-space, at least in contemporary writing, unlike the Chinese or Japanese. Be that as it may, for the purpose of converting DOM to e-mail body, we need to treat the German word Hottentottenstottertrottelmutterattentäterlattengitterwetterkotterbeutelrattenfangprämie and this Japanese line これは長い日本語のテキストですので、行が折れると思います。これは長い日本語のテキストですので、行が折れると思います。 the same and *must* not inject random spaces. It's quite unimportant how this is achieved ... as long as it happens before the next branch date ;-)

Comment hidden (off-topic)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 18

•

10 years ago

(In reply to Jorg K (GMT+1) from comment #16) > JIS might stand for Japanese Industry Standard, I don't know. Right. E.g., some specs of ISO is redefined by JIS for internal usage in Japan. > これは長い日本語のテキストですので、行が折れると思います。これは長い日本語のテキストですので、行が折れると思います。 Perhaps, you may know about Japanese email manners (still) in these days but I should explain about that: 1. Japanese language isn't a language whose words are separated by whitespaces or something. Only when new paragraphs cause line breaks at the end of previous paragraphs. Therefore, a lot of long lines always exist in Japanese emails. 2. Plaintext mails are used basically because a lot of Japanese people don't like HTML mails. 3. ISO-2022-JP is still used mainly only for emails. However, UTF-8 is already de facto standard also in Japan at creating new web contents. 4. Japanese people believe that long lines are automatically broken by MUA because most of them believe breaking each lines every 72 or similar characters is one of email manners. We know Japanese email manners do not make sense especially in technical reasons. For example, it depends on devices of readers how much characters per line is good to read. But changing manners is much difficult... So, anyway, I don't see complain about ISO-2022-JP emails. However, I also use UTF-8 for Japanese emails personally. In this case, the unexpected spaces are indeed a big problem. I guess I must not follow your comments in other bugs enough. Do you try to change ISO-2022-JP's email format too? And ccing some other guys who are familiar with Japanese email manners.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 19

•

10 years ago

Comment on attachment 8690636 [details] [diff] [review] Proposed solution (v1) >diff --git a/dom/base/nsIDocumentEncoder.idl b/dom/base/nsIDocumentEncoder.idl >--- a/dom/base/nsIDocumentEncoder.idl >+++ b/dom/base/nsIDocumentEncoder.idl >@@ -236,16 +236,23 @@ interface nsIDocumentEncoder : nsISuppor > > /** > * Include ruby annotations and ruby parentheses in the output. > * PlainText output only. > */ > const unsigned long OutputRubyAnnotation = (1 << 26); > > /** >+ * Disallow breaking of long character strings. This is important >+ * for serialising e-mail which contains CJK strings. These must s/serialising/serializing ? >+ * not be broken just as "normal" longs strings aren't briken. s/briken/broken ?? > // Only create a linebreaker if we will handle wrapping. >- if (MayWrap()) { >+ if (MayWrap() && !(mFlags & nsIDocumentEncoder::OutputDisallowLineBreaking)) { You should create a helper inline method which checks mFlags like MayWrap(). I saw a lot of simple mistakes checking flags directly, so, I really don't like this. >diff --git a/dom/base/nsXMLContentSerializer.cpp b/dom/base/nsXMLContentSerializer.cpp >--- a/dom/base/nsXMLContentSerializer.cpp >+++ b/dom/base/nsXMLContentSerializer.cpp >@@ -107,16 +107,18 @@ nsXMLContentSerializer::Init(uint32_t aF > } > > mDoRaw = !!(mFlags & nsIDocumentEncoder::OutputRaw); > > mDoFormat = (mFlags & nsIDocumentEncoder::OutputFormatted && !mDoRaw); > > mDoWrap = (mFlags & nsIDocumentEncoder::OutputWrap && !mDoRaw); > >+ mAllowLineBreaking = !(mFlags & nsIDocumentEncoder::OutputDisallowLineBreaking); I wonder, do you really need new bool member here? It looks like that inline method which checks mFlags is enough. >diff --git a/dom/base/nsXMLContentSerializer.h b/dom/base/nsXMLContentSerializer.h >--- a/dom/base/nsXMLContentSerializer.h >+++ b/dom/base/nsXMLContentSerializer.h >@@ -350,16 +350,19 @@ class nsXMLContentSerializer : public ns > > // true = no formatting,(OutputRaw flag) > // no newline convertion and no rewrap long lines even if OutputWrap is set. > bool mDoRaw; > > // true = wrapping should be done (OutputWrap flag) > bool mDoWrap; > >+ // true = we can break lines (OutputDisallowLineBreaking flag) >+ bool mAllowLineBreaking; >+ > // number of maximum column in a line, in the wrap mode > uint32_t mMaxColumn; > > // current indent value > nsString mIndent; off topic, though... these order doesn't think the memory alignment... (not your fault and not need to fix it) Otherwise, looks good to me.

Attachment #8690636 - Flags: review?(masayuki) → review-

Makoto Kato [:m_kato]

Comment 20

•

10 years ago

DelSp=Yes is correct way for plain text into mail body, if CJT and you don't want break between character. Document encoder already has this option.

Makoto Kato [:m_kato]

Comment 21

•

10 years ago

If about HTML formatter, thunderbird's compose editor will use format option instead of raw option. If using raw option, it doesn't use line break. Of course, if this is issue for HTML formatter instead of plain text formatter, this option useful for tb's compose editor.

Jorg K (CEST = GMT+2)

Assignee

Comment 22

•

10 years ago

Masayuki and Makoto, thank you for your comments. Let me summarise: Japanese users use long lines, they don't use (ASCII) white-space. They prefer plain text and use ISO-2022-JP a lot. Delsp=yes is the preferred method for ISO-2022-JP encoded mail so that CTE 7bit can be used. There is one thing I don't understand: > 4. Japanese people believe that long lines are automatically broken by MUA [mail client] > because most of them believe breaking each lines every 72 or similar characters > is one of email manners. This I don't understand. There are very intricate "breaking" rules, see: https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages Another Japanese user told me in comment #7 that for example the word "Japan" shouldn't be broken. So why would it be good to break every 72 characters, perhaps right in the middle of words. Korean users most certainly would not appreciate that at all. Now let me summarise my approach to get rid of the extra spaces: 1) In this bug here we fix the serialiser so that extra spaces are not created any more if an additional flag is set. 2) In bug 1225904 we fix the Thunderbird sending mechanism so that it can transmit lines of any length using base64 encoding when lines are long and cannot be broken. This will lead to ISO-2022-JP being transmitted as base64 in some cases. I know that RFC 1468 (from 1993!) says that CTE 7bit *should* be used. There is a sample message in attachment 8689687 [details] that has ISO-2022-JP and CTE base64 and works perfectly well. 3) In bug 653342 we tie it altogether. We allow format=flowed for all character encodings and set the flag that won't break the lines any more from this bug here. Please try the binary from bug 653342 comment #177. Then one day we fix bug 26734 to implement delsp=yes, but only for ISO-2022-JP encoded messages. Sadly the patches proposed for this bug would have implemented delsp=yes for *all* flowed e-mail and that is not correct. Personally I think whether ISO-2022-JP messages are using CTE 7bit or base64 makes no difference at all to the end user. Masayuki, I will address the review comments in the next patch coming up soon. Thanks for the review!

Jorg K (CEST = GMT+2)

Assignee

Comment 23

•

10 years ago

(In reply to Makoto Kato [:m_kato] from comment #21) > If about HTML formatter, thunderbird's compose editor will use format option > instead of raw option. If using raw option, it doesn't use line break. Yes, there are no spaces created with the "raw" option. But using "raw", the HTML is not "pretty-printed" any more. For plain text, the "raw" option loses a lot of information. See bug 653342 comment #170. I considered using "raw", but having another flag is the better solution.

Jorg K (CEST = GMT+2)

Assignee

Comment 24

•

10 years ago

Attached patch Proposed solution (v1b) (obsolete) — Details — Splinter Review

Addressing review issues: "to serialise" is British/Australian English (I am Australian). "to serialize" is American English. I used that now. I have a new inline method MayBreakLines() in class nsPlainTextSerializer. This cannot be used in nsXMLContentSerializer so I maintained the bool flag mAllowLineBreaking. Or do you prefer another inline method?

Attachment #8690636 - Attachment is obsolete: true

Attachment #8690636 - Flags: review?(jfkthame)

Attachment #8692439 - Flags: review?(masayuki)

ISHIKAWA, Chiaki

Comment 25

•

10 years ago

(In reply to Jorg K (GMT+1) from comment #22) ... > There is one thing I don't understand: > > 4. Japanese people believe that long lines are automatically broken by MUA [mail client] > > because most of them believe breaking each lines every 72 or similar characters > > is one of email manners. > > This I don't understand. There are very intricate "breaking" rules, see: > https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages > Another Japanese user told me in comment #7 that for example the word > "Japan" shouldn't be broken. Jorg, " Another Japanese user told me in comment #7 that for example the word > "Japan" shouldn't be broken." is from me. I have to stress that this particular requirement is at a more higher-level of formatting rules that must be implemented at the formating programs that operate even at the higher level than TeX, TROFF, and other formating programs and meant for producing magazine layout, newspaper layout, etc. I once had to code such a special routine so that the names of baseball players won't be broken in the middle when their names appear below scoreboard in TV sport news segment. But expecting such higher-level line breaking rules (not even mentioned in the URL you quoted) often found only in the realm of publisher's in-house rules enforced only by meticulous editors is too much for TB. The particular characteristics I mentioned for line-breaking rules may not be even understood outside of (paper) publishing industry. So let's ignore that particular mention of preferring not to break proper nouns at this moment. I don't believe more then 50% of TB users in Japan care about this subtle issue. As long as TB passes the user input text data WITHOUT inserting whitespace, the text data can be fed into formating program to produce good layout. Extra whitespace may not be removed by formating program since it cannot tell whether the user intended an extra whitespace to be inserted. So that *IS* a problem. > So why would it be good to break every 72 characters, perhaps right in the > middle of words. > Korean users most certainly would not appreciate that at all. I think whether the user agrees with the forced breaking at 72 (the constant could be different and often settable by the user) is today moot. In the old days, when an earlier RFC had to warn the readers, that there is a chance that a long line may be chopped off by the transmission path somewhere due to various limitations of mail gateways, and such, this line breaking was accepted with a sigh or something. I didn't like it very much and so set the right margin at a slightly larger place (but not exceeding large, say, 78) and many times I created the original mail body using EMACS with my OWN short right margin around 74 or so, and basically overruled the line breaking position of my own to begin with. I simply used EMACS's fill-paragpraph mode to fill the paragraph with line breaking and only when I notice undesirable breaking like the product name being broken over two lines, etc. I fixed them manually. Since I set my mailer's right margin at larger value than my Emacs's setting I know that the particular line breaking layout I had won't be disturbed by the mailer. But again, this is MY OWN PARTICULAR experience, and I have a doubt that you can NOT satisfy everybody. Just NOT inserting whitespace would be a better approach today. Young users won't tolerate strange whitespace or forced line breaks.

Jorg K (CEST = GMT+2)

Assignee

Comment 26

•

10 years ago

(In reply to ISHIKAWA, Chiaki from comment #25) > As long as TB passes the user input text data WITHOUT inserting whitespace, > the text data can be fed into formating program to produce good layout. > Extra whitespace may not be removed by formating program since it cannot tell > whether the user intended an extra whitespace to be inserted. So that *IS* a > problem. I agree 100%. Thank you for your support. > Just NOT inserting whitespace would be a better approach today. > Young users won't tolerate strange whitespace or forced line breaks. I agree 100%. In the future the e-mail will be received exactly as entered. The problem is already solved, you can try a Windows build at https://bugzilla.mozilla.org/show_bug.cgi?id=653342#c177 The three bugs involved, this one and bug 1225904 and bug 653342 all have patches proposed for review which will land on Thunderbird 45. Kent James is committed to fixing this problem once and for all. The review of this bug here was positive (comment #19: Otherwise, looks good to me). I've addressed the issues so I'm positive that this will land soon.

Masatoshi Kimura [:emk]

Comment 27

•

10 years ago

If the old compatibility does not matter, why don't you just drop non-UTF-8 support entirely? (I already proposed this.)

Jorg K (CEST = GMT+2)

Assignee

Comment 28

•

10 years ago

This is a different issue. Additional spaces are added regardless of the encoding.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 29

•

10 years ago

Comment on attachment 8692439 [details] [diff] [review] Proposed solution (v1b) (In reply to Jorg K (GMT+1) from comment #24) > Addressing review issues: > "to serialise" is British/Australian English (I am Australian). > "to serialize" is American English. I used that now. Oh, I didn't know the difference, thank you for the information! > This cannot be used in nsXMLContentSerializer so I maintained the bool flag > mAllowLineBreaking. Or do you prefer another inline method? I like another inline method better, but up to you because at least for now, the new bool doesn't increase the size of the instance because it's allocated in existing padding area. And I realized that you don't change uuid of nsIDocumentEncoder. Please change it and request super review (to smaug?) because of changing nsIDocumentEncoder. With those changes, r=msasayuki.

Attachment #8692439 - Flags: review?(masayuki) → review+

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 30

•

10 years ago

(In reply to Jorg K (GMT+1) from comment #22) > There is one thing I don't understand: > > 4. Japanese people believe that long lines are automatically broken by MUA [mail client] > > because most of them believe breaking each lines every 72 or similar characters > > is one of email manners. > > This I don't understand. There are very intricate "breaking" rules, see: > https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages > Another Japanese user told me in comment #7 that for example the word > "Japan" shouldn't be broken. > So why would it be good to break every 72 characters, perhaps right in the > middle of words. The magic number may depend on people, though (as Ishikawa-san said). At breaking each line, of course, Kinsoku-Syori should be applied (nsJISx4051LineBreaker.cpp implements the rule with browser specific additional rules). So, for example, words of English shouldn't be broken middle of them. A word reaches and over a line limitation, the word should be put the head of next line. > Korean users most certainly would not appreciate that at all. As my understanding, Korean text should be treated as Western languages since each word is separated by ASCII whitespace. But I don't familiar with Korean text layout rules. > Now let me summarise my approach to get rid of the extra spaces: > 1) In this bug here we fix the serialiser so that extra spaces are not > created any more > if an additional flag is set. Sure. > 2) In bug 1225904 we fix the Thunderbird sending mechanism so that it can > transmit > lines of any length using base64 encoding when lines are long and cannot > be broken. > This will lead to ISO-2022-JP being transmitted as base64 in some cases. > I know that RFC 1468 (from 1993!) says that CTE 7bit *should* be used. > There is a sample message in attachment 8689687 [details] that has > ISO-2022-JP and CTE base64 > and works perfectly well. I'm not sure if this is right thing. But it should be discussed in the bug ;-) > 3) In bug 653342 we tie it altogether. We allow format=flowed for all > character encodings > and set the flag that won't break the lines any more from this bug here. > > Please try the binary from bug 653342 comment #177. Thank you, but the most important thing is　interoperability　between any MUAs. At least in Japan, I guess, about 10 or more MUAs are still being used (including web mails like Gmail). We cannot test it only by our hands. So, I strongly recommend that you should add a preference which can disable the new behavior for ability to disable the new behavior even after the release. If we've reported some trouble with some other MUAs, we should release new build which disables the new behavior with the pref. > Then one day we fix bug 26734 to implement delsp=yes, but only for > ISO-2022-JP encoded messages. Sadly the patches proposed for this bug would > have implemented delsp=yes for *all* flowed e-mail and that is not correct. I think that UTF-8 emails should be Japanese text layout aware. > Personally I think whether ISO-2022-JP messages are using CTE 7bit or base64 > makes no difference at all to the end user. If no MUAs didn't support encoded body for ISO-2022-JP, you're right. But I'm afraid that there are some MUAs not supporting our new behavior...

ISHIKAWA, Chiaki

Comment 31

•

10 years ago

(In reply to Masatoshi Kimura [:emk] from comment #27) > If the old compatibility does not matter, why don't you just drop non-UTF-8 > support entirely? (I already proposed this.) (In reply to Jorg K (GMT+1) from comment #28) > This is a different issue. Additional spaces are added regardless of the > encoding. I know this is a different issue, but I could not help chiming in. Established businesses don't want to break status quo. They are VERY conservative. And when it comes to conservative business practices, the banking industry comes to my mind. Sure enough, when I checked e-mails I receive from banks and other financial institutions, they are encoded in iso-2022-jp: Content-type: header contained the following. text/plain; charset=iso-2022-jp text/plain; charset=ISO-2022-JP text/html; charset="ISO-2022-JP" I did not realize the name are case-insensitive and we can quote the name in double quotes. Wait. Who is sending me HTML mail? It is very unlike the banking industry! I am afraid that we have to live with non-UTF-8 mail and the "mess" (for lack of better words) for a long time... The fact there are widely used different character code systems in a country is the source of practical problems. But there are already large user base (how large is relative) and, in some sectors, they can't be ignored. Of course, dropping non-utf-8 support for *WRITING*, I presume, means TB has to forget about penetrating stable business sector, which probably is not TB's strong field anyway (?) [We have to receive and display non-utf-8 e-mails for obvious reasons. Imagine the 15 years worth of e-mail archives which contain messages in many encodings.] Sorry for different topic, but I could not resist. But given the relative lack of voices from the Japanese communities, I think sometimes an opinion from there, even skewed to one's particular experience, may be in order. TIA

Magnus Melin [:mkmelin]

Comment 32

•

10 years ago

Yes we're only talking about dropping support for *sending* non-utf-8 (gmail has also done that recently and seem to get away with it). Reading/responding to mails of other encoding would still work.

Jorg K (CEST = GMT+2)

Assignee

Comment 33

•

10 years ago

Comment on attachment 8692439 [details] [diff] [review] Proposed solution (v1b) Thanks for the quick review. > And I realized that you don't change uuid of nsIDocumentEncoder. Please > change it and request super review (to smaug?) because of changing > nsIDocumentEncoder. With those changes, r=msasayuki. I was told in another bug that I shouldn't do the "./mach update-uuids" if something new gets added but nothing changes, which is the case here. Olli and Ehsan: Who can "superreview" this?

Attachment #8692439 - Flags: superreview?(bugs)

Jorg K (CEST = GMT+2)

Assignee

Comment 34

•

10 years ago

Comment on attachment 8692439 [details] [diff] [review] Proposed solution (v1b) Oops, Bugzilla didn't accept two people in the superreview field.

Attachment #8692439 - Flags: review?(ehsan)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 35

•

10 years ago

This is off-topic, though, some Japanese people still need to send emails ISO-2022-JP if the receiver uses feature phone. According to this article (written in Japanese), https://www.marketingbank.jp/special/cat01/51.php user share of feature phone is about 50% at March, 2015. Although, we can guess that the number of users sending emails from Thunderbird to feature phone are not majority, but we cannot ignore them at least in Japan since the number of Thunderbird users in Japan is No.2 in the world. So, even if they were minority than other users, we shouldn't ignore them since the actual number of such users may not so few.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 36

•

10 years ago

(In reply to Jorg K (GMT+1) from comment #33) > Comment on attachment 8692439 [details] [diff] [review] > Proposed solution (v1b) > > Thanks for the quick review. > > And I realized that you don't change uuid of nsIDocumentEncoder. Please > > change it and request super review (to smaug?) because of changing > > nsIDocumentEncoder. With those changes, r=msasayuki. > I was told in another bug that I shouldn't do the "./mach update-uuids" if > something new gets added but nothing changes, which is the case here. You're adding a constant value. It seems that it's necessary to change the UUID but not sure. > Olli and Ehsan: Who can "superreview" this? Ehsan isn't a super reviewer at least according to this document... https://www.mozilla.org/en-US/about/governance/policies/reviewers/ But he should be the best person to check the interface.

Jorg K (CEST = GMT+2)

Assignee

Comment 37

•

10 years ago

This bug has become a discussion forum about general Japanese encoding issues when it's about adding an encoder flag. I hope the superreviewers won't get confused. ;-) (In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #30) > So, I > strongly recommend that you should add a preference which can disable the > new behavior for ability to disable the new behavior even after the release. > If we've reported some trouble with some other MUAs, we should release new > build which disables the new behavior with the pref. Yes, that's easily done in Thunderbird. Here we want to land the fix that enables us to have the choice ... and fix it at least for Korean users. (In reply to ISHIKAWA, Chiaki from comment #31) > And when it comes to conservative business practices, the banking industry > comes to my mind. Yes, the banks are excellent at breaking the rules, see attachment 8687928 [details] from bug 1224426. There they stuff ISO-2022-JP into a subject header which *must* be RCF 2047 encoded.

Jorg K (CEST = GMT+2)

Assignee

Comment 38

•

10 years ago

(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #35) > This is off-topic, though, some Japanese people still need to send emails > ISO-2022-JP if the receiver uses feature phone. Thanks for the hint. I personally didn't plan to remove ISO-2022-JP. The question is: Can this "feature phone" deal with ISO-2022-JP and CTE base64? Anyway, I will comment on bug 653342 and propose the option you asked for.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 39

•

10 years ago

(In reply to Jorg K (GMT+1) from comment #38) > (In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #35) > > This is off-topic, though, some Japanese people still need to send emails > > ISO-2022-JP if the receiver uses feature phone. > Thanks for the hint. I personally didn't plan to remove ISO-2022-JP. The > question is: Can this "feature phone" deal with ISO-2022-JP and CTE base64? I'm not sure. Makoto-san, do you know this or can you test this?

Flags: needinfo?(m_kato)

Masatoshi Kimura [:emk]

Comment 40

•

10 years ago

I'm arguing about the encoding here because RFC 1468 is a big reason why we are using ISO-2022-JP for outgoing Japanese mail messages. If we use ISO-2022-JP, we should follow the entire RFC 1468 instead of picking some convenient parts of the RFC.

Jorg K (CEST = GMT+2)

Assignee

Comment 41

•

10 years ago

Attached patch Test (v1) — Details — Splinter Review

Did I hear anyone say "test" ;-) - Here it is. I don't know who wants to review this.

Attachment #8692965 - Flags: review?(ehsan)

Attachment #8692965 - Flags: review?(bugs)

Jorg K (CEST = GMT+2)

Assignee

Comment 42

•

10 years ago

(In reply to Masatoshi Kimura [:emk] from comment #40) > I'm arguing about the encoding here because RFC 1468 is a big reason why we > are using ISO-2022-JP for outgoing Japanese mail messages. If we use > ISO-2022-JP, we should follow the entire RFC 1468 instead of picking some > convenient parts of the RFC. This discussion is in the wrong bug. This bug here is about not breaking long lines without any control. The discussion about how to best send ISO-2022-JP is happening in Thunderbird bug 653342. Can we please stop any discussion about application/encoding issues so that the core reviewers don't get confused. The change proposed here, that is, having control about not braking the lines is *essential*. It is also needed for HTML serialisation. Let's continue the discussion in the other bug 653342. In bug 653342 comment #185 I proposed a preference that will the allow the user to maintain the existing status quo: Cut the plaintext up and encode ISO-2022-JP with 7bit. That will honour the entire RFC 1468.

(no longer active)

Comment 43

•

10 years ago

Comment on attachment 8692439 [details] [diff] [review] Proposed solution (v1b) Review of attachment 8692439 [details] [diff] [review]: ----------------------------------------------------------------- superreviews aren't really a thing any more. Masayuki's review is enough here. But you should rev the uuid still. (Also please don't ask review from more than one person when you don't actually want all of them to look at the patch, that's kind of spammy!)

Attachment #8692439 - Flags: superreview?(bugs)

Attachment #8692439 - Flags: review?(ehsan)

(no longer active)

Comment 44

•

10 years ago

Comment on attachment 8692965 [details] [diff] [review] Test (v1) Review of attachment 8692965 [details] [diff] [review]: ----------------------------------------------------------------- I think Masayuki should review this.

Attachment #8692965 - Flags: review?(ehsan)

Jorg K (CEST = GMT+2)

Assignee

Updated

•

10 years ago

Attachment #8692965 - Flags: review?(bugs) → review?(masayuki)

Jorg K (CEST = GMT+2)

Assignee

Comment 45

•

10 years ago

Attached patch Proposed solution (v1c). — Details — Splinter Review

Carrying forward Masayuki Nakano'r r+ Updated UUID.

Attachment #8692439 - Attachment is obsolete: true

Attachment #8693121 - Flags: review+

Jorg K (CEST = GMT+2)

Assignee

Comment 46

•

10 years ago

Cancelling the NI for Makoto. In bug 653342 we decided to maintain the status quo for ISO-2022-JP. Further discussion in that bug, please.

Flags: needinfo?(m_kato)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Updated

•

10 years ago

Attachment #8692965 - Flags: review?(masayuki) → review+

Jorg K (CEST = GMT+2)

Assignee

Comment 47

•

10 years ago

Dear Sheriff, there are two patches here which can be applied in any order. Note that the change is minor and the possibility for breakage is close to zero. So please, to avoid unnecessary testing, combine this with some other suitable patches when landing. Thanking you in advance, Jörg.

Keywords: checkin-needed

Jorg K (CEST = GMT+2)

Assignee

Comment 48

•

10 years ago

Dear Sheriff, please add "r=masayuki" to the Test patch. Thanks!

Comment 49

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/0e3af3ef46f0 https://hg.mozilla.org/integration/mozilla-inbound/rev/ca8916bb80cf

Keywords: checkin-needed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 50

•

10 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0e3af3ef46f0 https://hg.mozilla.org/mozilla-central/rev/ca8916bb80cf

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

status-firefox45: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla45

Wayne Mery (:wsmwk)

Updated

•

9 years ago

Depends on: 1274837

You need to log in before you can comment on or make changes to this bug.