Closed Bug 229367 Opened 21 years ago Closed 14 years ago

<br> confuses our bidiness (punctuation before <br> at end of line starting with number doesn't follow paragraph directionality)

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bugzillamozilla, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: html5, rtl, testcase)

Attachments

(4 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007
This happens in lines that end with words of the opposite direction, followed by lines that start with words of this opposite direction, or followed by lines that start with numbers. The linked testcase will make this much easier to understand.
Reproducible: Always
Steps to Reproduce:
1. In an RTL textarea, write a Hebrew word followed by an English one and a full stop.
2. Start the next line with an English word or a number.
Actual Results: The full stop at the end of the first line appears to the right of the English word.
Expected Results: The full stop should appear at the visual end of the sentence, to the left of the English word.
Arcane workaround: use RLM (Alt+0254) or LRM (Alt+0253) at the end of lines. See testcase.
Prog.
Attached file Testcase #1
Attaching the linked testcase... Prog.
I will look again tomorrow when less jet-lagged, but isn't Mozilla's behaviour correct according to the Unicode Bidi Algorithm?
I'm not sure about the Unicode BiDi algorithm, but certainly we can't expect most users to know about control characters, now can we? Not when other browsers handle these situations perfectly, out of the box and without expecting the user to come up with arcane workarounds. Prog.
There's definitely a bug here: http://www.hixie.ch/tests/adhoc/unicode/bidi/010.html There's no way our rendering of that testcase is per the algorithm. See also: http://www.hixie.ch/tests/adhoc/unicode/bidi/011.html (similar, works fine) It would appear the <br> is what is confusing us. (Original testcase: http://oren.gomen.org/mozilla/control_characters/testcase1.html ...)
Keywords: testcase
Summary: Punctuation marks appear in the wrong side of embedded sections with opposite direction → <br> confuses our bidiness (punctuation before <br> at end of line starting with number doesn't follow paragraph directionality)
I think the first question should be "why does [Enter] in a text area add a <br>?" Is there already a bug report on that?
Daniel: see comment 5.
I don't understand the question here. If the question is "why do we insert a <br> instead of just a CR", that's because the textarea is really an editor, and like in all instances of the editor, empty lines need a frame or they won't be selectable. We solve this problem with <br>. Known old problem... Whoever fixes this big problem wins my eternal consideration.
As can be seen from the test case (attachment 137938), a text area is just an easy way to reproduce this bug. It happens outside of textareas too, in regular page text.
In regular page text I consider this as INVALID or Evangelism, because authors can and should start a new block element rather than using <br>. For example, if attachment 137938 was a regular page I would just say "use an <ol>". At least according to HTML [1] and Unicode [2] <br> has the semantics of LINE SEPARATOR, not CR, and should not cause any kind of paragraph formatting.
[1] http://www.w3.org/TR/html4/struct/text.html#edef-BR
[2] http://www.unicode.org/reports/tr14/index.html#BK
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
smontagu, even if <br> is just a line separator, punctuation marks in RTL segments, if they appear at the end of an LTR word, should be on the left end of the word, unless I didn't understand what you mean here. And how come in Ian's test case in comment 10, two different lines are rendered as identical?
Re comment 10, I have two questions:
(a) Do we want to emulate IE's behaviour as a quirk?
(b) The modified testcase shows a genuine bug which I also encountered when testing this issue: there is extra space between the two words on the first line of each pair, which contradicts the UBA's statement [1] that "trailing white space will appear at the visual end of the line (in the paragraph direction)". Do we want to keep this open for that issue, or open another report?
[1] http://www.unicode.org/reports/tr9/#L1
Re comment 11: Punctuation between two LTR words should appear at the right edge of the first word, even in an RTL paragraph. Consider a case where you have "Profession: Engineer" in an RTL paragraph, and then suppose that you narrow the margin so that the line wraps just between those two words. Should that make the colon jump from the right edge to the left edge of the first word?
A half baked idea for a programmatic workaround has occurred to me: what if we inserted before the <br> in the editor a directional mark in the paragraph direction (i.e. LRM in left-to-right paragraph and RLM in right-to-left paragraph)?
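For illustration only, the idea above could be sketched in Python roughly as follows. This is not editor code: the function name `mark_before_breaks` and the literal `<br>` token are assumptions made for the example, and a real editor would work on DOM nodes rather than on a string.

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def mark_before_breaks(html, rtl=True, br="<br>"):
    """Insert a directional mark in the paragraph direction before each break.

    Hypothetical helper: `br` is matched as a plain substring, so variants
    such as <br/> or <BR> are not handled.
    """
    mark = RLM if rtl else LRM
    return html.replace(br, mark + br)


# A full stop followed by a break now has a strong RTL character after it,
# so it no longer sits between two LTR runs.
sample = "\u05e9\u05dc\u05d5\u05dd english.<br>123 \u05e9\u05dc\u05d5\u05dd"
marked = mark_before_breaks(sample)
```

Note that this only helps text typed after such a change; existing documents would be unaffected.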
The following:
HEBREW english .
123 HEBREW
...becomes:
HEBREW english . 123 HEBREW
111111122222222222221111111
...so the "." is LTR and part of the surrounding run of English text. The reason for this is that the numbers (weak, section 3.3.3:W2) are resolved before the punctuation (neutral, section 3.3.4:N1). The numbers thus get the direction LTR (from the English text; you search backwards) and the punctuation then finds itself between two LTR runs so it becomes LTR too.
A dot at the end:
HEBREW english .
1111111222222211
...gets the direction of the paragraph (RTL in this test). The important point in the test is to realise that the two lines are actually all part of the same paragraph. Is there a case we are doing incorrectly?
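The resolution order described above can be reproduced with a heavily simplified Python sketch. It is an assumption-laden illustration, not Gecko's implementation: it covers only a reduced form of W6, W7, N1/N2 and I1/I2, ignores explicit embeddings and the other weak rules, and the name `resolve_levels` is made up for the example. Six Hebrew letters stand in for "HEBREW".

```python
import unicodedata


def resolve_levels(text, rtl_paragraph=True):
    """Return one embedding-level digit per character (simplified UBA sketch)."""
    base = "R" if rtl_paragraph else "L"
    types = [unicodedata.bidirectional(ch) for ch in text]
    # Arabic letters behave as R once the weak rules are done.
    types = ["R" if t == "AL" else t for t in types]
    # W6, simplified: separators and terminators become neutral (W4/W5 ignored).
    types = ["ON" if t in ("CS", "ES", "ET") else t for t in types]
    # W7: a European number becomes L if the previous strong type is L.
    last_strong = base
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    def direction(t):
        # For N1, numbers count as R; anything else non-strong is neutral here.
        if t == "L":
            return "L"
        if t in ("R", "EN", "AN"):
            return "R"
        return None

    # N1/N2: a run of neutrals takes the direction of its neighbours if they
    # agree, otherwise the paragraph direction.
    i, n = 0, len(types)
    while i < n:
        if direction(types[i]) is not None:
            i += 1
            continue
        j = i
        while j < n and direction(types[j]) is None:
            j += 1
        before = direction(types[i - 1]) if i > 0 else base
        after = direction(types[j]) if j < n else base
        types[i:j] = [before if before == after else base] * (j - i)
        i = j

    # I1/I2: map resolved types to levels.
    if rtl_paragraph:
        levels = [1 if t == "R" else 2 for t in types]
    else:
        levels = [{"L": 0, "R": 1}.get(t, 2) for t in types]
    return "".join(str(level) for level in levels)


HEBREW = "\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5"  # six Hebrew letters
print(resolve_levels(HEBREW + " english . 123 " + HEBREW))
# 111111122222222222221111111
print(resolve_levels(HEBREW + " english ."))
# 1111111222222211
```

In the first call the digits have already become L by the time the full stop is resolved, which is why it ends up at level 2.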
> (a) Do we want to emulate IE's behaviour as a quirk?
I strongly feel we have enough (or even too many) quirks. If we want to add a quirk, let's remove another one first. What quirk do you think we should remove?
> (b) The modified testcase shows a genuine bug which I also encountered [...]
That's bug 132561.
> what if we inserted before the <br> in the editor a directional mark in the
> paragraph direction [...]
The <br> hack in the Editor has caused lots and lots of problems. The correct fix is to remove it altogether and replace it with a real newline character. I really don't think we should add even more characters to the mix.
Is this another case where we decide to stick with the UBA and the hell with users? Simon, even if we do provide a workaround based on control characters, it doesn't help one bit with existing texts. So not only do users have endure Mozilla's incompatibility with many (substandard) websites, they have to learn to live with HyphenMinus at the wrong side of the number, punctuation marks at the wrong side of the word and the like. Can we really expect anyone to switch? not when they only get "half the internet" (I don't remember when I heard that about alternative browsers, but it fits well in this context). Prog.
>> (b) The modified testcase shows a genuine bug which I also encountered [...]
> That's bug 132561.
I don't think it is, although it overlaps with it in the white-space:normal case. Let me try a testcase with white-space:pre
So... what do we suggest to users who: 1. Can't input RLM/LRM in their platform of choice (for technical keyboard layout reasons) -or- 2. Don't understand the concept of hidden control characters. -or- 3. Wish to read existing texts as their author envisioned them. If this behavior is "by design" then we should at least be able offer solutions to real world problems that are caused by this decision. Prog.
Blocks: 240501
> In regular page text I consider this as INVALID or Evangelism, because authors
> can and should start a new block element rather than using <br>.
Why? Is it improper or disallowed to have multi-line paragraphs in my HTML?
> For example, if
> attachment 137938 was a regular page I would just say "use an <ol>".
You're right in this case; the testcase is a bad example because it uses <br>'s as list item separators, which is wrong. But this is not the general case!
> At least according to HTML [1] and Unicode [2] <br> has the semantics of LINE
> SEPARATOR, not CR, and should not cause any kind of paragraph formatting.
Considering the last punctuation mark in a line within an RTL paragraph to be RTL is not paragraph formatting, it's line formatting. At least that's what my intuition says. Another point is that regardless of whether some standard says otherwise, this is plain wrong. Intolerable. It's like with bug 73251: if the standard does not fit reasonable expectations then the standard must change.
like bug 73251, the way to fix this bug probably goes through the unicode consortium.
*** Bug 150568 has been marked as a duplicate of this bug. ***
Looks like bug 73251 was fixed (as bug 240943) by going through the Unicode Consortium. I couldn't find references to this problem in Unicode's public mailing list. Did anyone propose this change to the Unicode Consortium? So far, I gathered this is the contact channel: http://www.unicode.org/reporting.html Or does the Mozilla Foundation have a better channel for feedback?
IANAUE (I am not a Unicode expert), but if I read the per-line examples here: http://www.unicode.org/reports/tr9/#Resolving_Implicit_Levels correctly, they indicate that treating the period at the end of a line as part of the embedding run leading up to it (rather than giving it the paragraph embedding level), which is the buggy behavior this bug is about, is not what the UBA requires, but rather the opposite of the requirement. Do correct me if I'm wrong.
Eyal, as I understand the Bidi algorithm the problem is that even if the <br> is set to the paragraph embedding level in rule L1, the embedding level of the punctuation characters has already been set before that in rule N1 to the level of the text on both sides (if it has the same direction) or to the paragraph level (if it has different directions). I have sent mail to the bidi discussion list at unicode.org proposing that the <br> should be set to the paragraph embedding level at an earlier stage in the process.
(In reply to comment #25)
> set to the paragraph embedding level in rule L1, the embedding level of the
> punctuation characters has already been set before that in rule N1 to the level
> of the text on both sides (if it has the same direction) or to the paragraph
> level (if it has different directions).
But that's a 'good thing' in our case: suppose the paragraph is RTL and the first line has a word in English, then a period, then the line break, like in the test case. Like you said, in rule N1 the level of the period is set to the paragraph level (since it is part of a sequence of neutrals which begins after the word in English and ends with the first Hebrew character on the next line), i.e. its embedding level is set to 1. Great! Now when L2 is applied, the period should not be reversed; only the word in English need be reversed. Or have I missed something?
(In reply to comment #26)
> But that's a 'good thing' in our case: suppose the paragraph is RTL and the
> first line has a word in English, then a period, than the line break - like in
> the test case.
The question is, what is the first directional character after the line break? If it's RTL, then the reordering is as you describe. But in the test case it's a number, which was set to LTR in rule W7, so the text on both sides of the period is LTR, i.e. level 2.
> The question is, what is the first directional character after the line break?
> If it's RTL, then the reordering is as you describe. But in the test case it's a
> number, which was set to LTR in rule W7, so the text on both sides of the period
> is LTR, i.e. level 2.
Yes, you're right. I would think the solution would be to restart level runs at line breaks; this way W7 would not un-neutralize the period.
In fact, I suggest that we change the UBA implementation to do exactly that. You know they'll approve this eventually. Just like they did with bug 73251. It's what the UBA should say.
Blocks: 269759
Now that I have editbugs, I'm reopening the bug. It is certainly not INVALID. I have made a suggestion in comments #28, #29. If people reject my suggestion, and any other possible fix, then the status will change to WONTFIX (_please_ don't do that, though... ).
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Assignee: mkaply → eyalroz
Status: REOPENED → NEW
I'm sorry, I can't take this bug due to lack of time to work on it.
Assignee: eyalroz → mkaply
I will, however, make some hopefully useful comments. It seems there are two alternative ways to modify the UBA implementation.
The first alternative would be to modify P1 so that text is pre-split into lines rather than into paragraphs, i.e. instead of
"P1. Split the text into separate paragraphs. A paragraph separator is kept with the previous paragraph. Within each paragraph, apply all the other rules of this algorithm."
we would have
"P1. Split the text into separate lines. A line/paragraph separator is kept with the previous line. Within each line, apply all the other rules of this algorithm."
This will mean minimal or no changes in nsBidi.*, which assumes P1 is performed externally.
The second alternative is to consider line breaks to be of type L if the paragraph embedding level is even, i.e. an LTR paragraph, and of type R if the paragraph embedding level is odd, i.e. an RTL paragraph. This second alternative probably means changes to some nsBidi.cpp functions (maybe also to nsBiDi.h, although I'm not too sure), but it has the advantage of allowing BiDi control chars (e.g. LRO) to work across line breaks.
Finally, we need to consider whether or not the behavior should be different for 'manual' line breaks (e.g. <br>s) and for automatic line breaks due to wrapping.
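To make the two alternatives concrete, here is a rough Python sketch that compares them with the single-paragraph behaviour on the testcase text. It is only an approximation under stated assumptions: the resolver handles a reduced form of W6, W7, N1/N2 and I1/I2 and nothing else, U+2028 LINE SEPARATOR stands in for <br>, and the names `levels`, `levels_per_line` and `strong_breaks` are invented for the example; none of this corresponds to actual nsBidi code.

```python
import unicodedata

LINE_SEP = "\u2028"  # stands in for <br>; its bidi class is WS
SIDE = {"L": "L", "R": "R", "EN": "R", "AN": "R"}  # direction used by N1


def levels(text, rtl=True, strong_breaks=False):
    """Simplified level resolution; strong_breaks=True models alternative 2."""
    base = "R" if rtl else "L"
    types = []
    for ch in text:
        t = unicodedata.bidirectional(ch)
        if ch == LINE_SEP and strong_breaks:
            t = base  # alternative 2: the break is strong, in paragraph direction
        elif t in ("CS", "ES", "ET"):
            t = "ON"  # W6, simplified
        elif t == "AL":
            t = "R"
        types.append(t)
    # W7: European numbers take L after a strong L.
    last = base
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last = t
        elif t == "EN" and last == "L":
            types[i] = "L"
    # N1/N2: neutrals look at the nearest non-neutral type on each side.
    resolved = []
    for i, t in enumerate(types):
        if t in SIDE:
            resolved.append(t)
            continue
        before = next((SIDE[u] for u in reversed(types[:i]) if u in SIDE), base)
        after = next((SIDE[u] for u in types[i + 1:] if u in SIDE), base)
        resolved.append(before if before == after else base)
    # I1/I2
    if rtl:
        return "".join("1" if t == "R" else "2" for t in resolved)
    return "".join(str({"L": 0, "R": 1}.get(t, 2)) for t in resolved)


def levels_per_line(text, rtl=True):
    """Alternative 1: apply the rules to each line separately (modified P1)."""
    return [levels(line, rtl) for line in text.split(LINE_SEP)]


HEBREW = "\u05d0\u05d1\u05d2\u05d3\u05d4\u05d5"
text = HEBREW + " english ." + LINE_SEP + "123 " + HEBREW
dot = text.index(".")

print(levels(text)[dot])                      # one paragraph, as today
print(levels_per_line(text)[0][-1])           # alternative 1
print(levels(text, strong_breaks=True)[dot])  # alternative 2
```

In this sketch both alternatives give the full stop the paragraph level, while the unmodified single-paragraph pass leaves it inside the LTR run.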
Attached file another testcase
Adding another testcase which does not raise the question of "why don't you just make it an ordered list" and emphasizes how this bug may occur in 'simple' text. Note that the two sentences can be said to form a single semantic whole which should be a single paragraph, despite an author's decision to break the line at certain points. So "why don't you just make it two paragraphs" is invalid here, or at least as invalid as it is in any case with <br>'s instead of paragraph breaks.
See W3C I18N Core WG comments at http://lists.w3.org/Archives/Public/public-i18n-core/2005JanMar/0065.html We believe Mozilla is behaving correctly, and that changing the behaviour will break the more general expectations of behaviour.
Richard, there's no argument that, in that regard (resetting levels at line breaks), Mozilla behaves by the Unicode Standard whereas Internet Explorer does not.
--
Let's review more realistic cases:
<p dir="rtl">
THE ENGLISH MAN SAID "hello,<br>
my friend" AND SMILED.
</p>
Mozilla would be right (according to Hebrew grammar), rendering:
hello," DIAS NAM HSILGNE EHT
.DELIMS DNA "my friend
Another case (note the order):
<p dir="rtl">
I USE mozilla,<br>
konqueror, opera AND internet explorer.
</p>
Mozilla would render:
mozilla, ESU I
.internet explorer DNA konqueror, opera
which would look wrong Hebrew-grammar-wise, but it's just making an already-existing problem more visible: the comma is part of the Hebrew item listing and, by the Hebrew grammar rules, it should be:
[...]
.internet explorer DNA opera ,konqueror <-- note the comma's position
and if the two lines were unified, grammar dictates it should render:
.internet explorer DNA opera ,konqueror ,mozilla ESU I
whereas, in both IE and Mozilla, it'll render:
.internet explorer DNA mozilla, konqueror, opera ESU I
To achieve the grammatically-correct rendering, in any of the browsers, the user will have to put in RLMs!
---
So, places where IE has it right due to a linebreak are places where there'd be a grammar mistake were it not for the linebreak. Therefore, I take back what I said about needing to change the UBA because IE's "standard" is linguistically better. The reason to change the UBA is to fit a de facto standard, one which isn't better than the UBA's standard.
Excuse the spam, but I'm posting my reply to Richard:
Thank you for your reply, Richard. I still find that your reply is more of a presentation of the current views of W3C / the Unicode people on the matter. If this had been an a priori argument about what <br>'s should mean and how they should be used, I would find your view a perfectly reasonable alternative (although it would mean there would have to be a way to 'break a line syntactically' without switching to a new paragraph, which in the current scheme of things <br> is not intended to do). However, there is the huge corpus of existing HTML documents with RTL text which all assume an alternative interpretation of <br>: being not just white space, but having some semantic significance of a break. True, this is in some part due to MSIE's conventions, but like I argued in my previous e-mails, this assumption has merit in itself.
I know, the
1. xxx
2. xxx
is not the best of examples when considered as HTML, but it is a very common example when plain text is displayed as HTML, i.e., if you were to take the following block of text:
-----
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisienim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure etc etc
-----
and 'HTML'ify it (this is very common practice, not just in Mozilla and other web browsers which feed text to their HTML renderer for display and/or edit text messages as HTML, but also in news sites which store stories as plain text and use it to prepare their displayed pages), you would not make every line of text a paragraph on the one hand, and on the other you would usually not go to the trouble of running some intelligent classifier to decide where paragraphs end (e.g. volutpat) and where they do not (e.g. at consequat). What you would normally do is append <br>'s to each line.
Thus you see text (or html) like
most people in the room don't know the Hebrew word ;DROW<br>
3 people do know it.
or
most people in the room don't know the Hebrew word ;DROW<br>
REHTONA is a better known word.
These are the more representative examples, at least from my experience, of the use of <br>'s, in comparison to your provided example of
1. xxxxxx xxxx English,<br>
and more xxx.
Because one immediately finds one's self asking: Why put a forced line break in the middle of a phrase? If it's a single phrase, authors prefer to keep it on the same line; if it is broken by a <br>, the author most likely considers it not to be an indivisible whole. And if this <br> comes from HTMLizing text, it is usually less confusing to err by endowing the comma with the 'general' direction of the text than to err by switching it to the other side of 'English'. The reason is that the first type of error does nonetheless correspond somewhat to the order of reading (you only need to reinterpret an 'end-of-the-line' punctuation mark as a 'coming-after-the-last-word-in-this-line' punctuation mark); while the second option, in case of an error, makes you pause when first viewing the comma or period thinking that a new clause is beginning, then move to the next line only to find that this is not the case and that 'English' belongs to the previous clause or sentence after all, and finally re-read the text while switching the comma or period back to the end of the line in your head.
I hope this clears up 'where I'm coming from' in requesting this change. This situation seems to me not entirely unlike the problem we had with minus as a number separator, https://bugzilla.mozilla.org/show_bug.cgi?id=73251 , in which the behavior of the minus was modified to accommodate one of its common uses in RTL text. There too, although MSIE was 'wrong' w.r.t. the Unicode standard's original scheme, it turned out that the 'wrong' way of laying out the minus was in general better than the 'right' way. I think we would be hard-pressed to find more than a handful, if even that, of people breaking up their "English, and more" with <br>s, especially within RTL text.
Eyal
PS - As for audio-only browsers, the point about <br>s being discarded is sort of moot, since when reading the HTML or text, the punctuation mark will always come after the last word if it is written after the last word; the 'switching' is done only for visual layout.
Ilya, what you've brought up (english words separated with commas in a Hebrew paragraph) is a different issue, pertaining to the UBA's behavior squarely within level runs. I think we should not try to link that up with the question of what happens at <br>'s.
CCing Richard Ishida (who probably missed the last few comments). Prog.
(In reply to comment #5)
> I think the first question should be "why does [Enter] in a text area add a
> <br>?" Is there already a bug report on that?
For the record, the answer is now "yes": bug 240933.
This bug doesn't happen only when <br> tags are present. Testcase #2 shows that it also occurs in pre-formatted texts. Prog.
(In reply to comment #40)
> This bug doesn't happen only when <br> tags are present. Testcase #2 shows that
> it also occurs in pre-formatted texts.
Since this bug is explicitly about <br>, that testcase shows a different bug, and one which is definitely valid without any changes to the Unicode Bidi Algorithm. It's mentioned in the comment at http://lxr.mozilla.org/seamonkey/source/layout/base/nsBidiPresUtils.cpp#486
// XXX: TODO: Handle preformatted text ('\n')
No longer blocks: 269759
*** Bug 269759 has been marked as a duplicate of this bug. ***
(In reply to comment #43) I would like to hear people's opinions, especially dbaron's, regarding my comment 36.
I've used BR in both ways, a number of times -- but I didn't have any problems since all my text was LTR. Probably most of my uses have been for breaks that do have semantics, rather than just a place I'd like to ensure the line be broken. Given what MSIE does, it sounds like we should probably switch to being compatible with it, since there are a significant number of Websites and applications that depend on the MSIE behavior.
Blocks: 137995
Assignee: mozilla → nobody
QA Contact: zach → layout.bidi
Not going to block 1.8.1 for this, but we'd be happy to consider a patch that's baked on the trunk.
Flags: blocking1.8.1? → blocking1.8.1-
This is a recent discussion on the matter on #devs@irc.mozilla.org : http://pastebin.mozilla.org/3187
That was a temporary post to pastebin. Resubmitted as http://pastebin.mozilla.org/3190
Blocks: 376359
Component: Layout: BiDi Hebrew & Arabic → Layout: Text
QA Contact: layout.bidi → layout.fonts-and-text
Keywords: rtl
Simon: are you aware of the latest status of this bug? (Feeling kind of lost due to the number of comments here.)
Attachment #137938 - Attachment mime type: text/html → text/html; charset=windows-1255
Attachment #174368 - Attachment mime type: text/html → text/html; charset=windows-1255
Attachment #190274 - Attachment mime type: text/html → text/html; charset=windows-1255
HTML5 is apparently going to require that we change our behavior to fix this; it's also tested in the CSS 2.1 test suite, in the fifth test of: http://test.csswg.org/suites/css2.1/20101001/xhtml1/bidi-breaking-002.xht (I think tests 3 and 4 are invalid.)
Blocks: html5bidi
Depends on: 263359
Fixed in bug 263359
Status: NEW → RESOLVED
Closed: 21 years ago → 14 years ago
Flags: in-testsuite+
Resolution: --- → FIXED