Closed Bug 229367 Opened 21 years ago Closed 13 years ago

<br> confuses our bidiness (punctuation before <br> at end of line starting with number doesn't follow paragraph directionality)

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

RESOLVED FIXED

People

(Reporter: bugzillamozilla, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: html5, rtl, testcase)

Attachments

(4 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007

This happens in lines that end with words of the opposite direction, followed by
lines that start with words of this opposite direction, or followed by lines
that start with numbers.

The linked testcase will make this much easier to understand.


Reproducible: Always

Steps to Reproduce:
1. In an RTL textarea, write a Hebrew word followed by an English one and a full
stop.
2. Start the next line with an English word or a number.

Actual Results:  
The full stop at the end of the first line appears to the right of the English word.


Expected Results:  
The full stop should appear at the visual end of sentence - to the left of the
English word.

Arcane workaround: use RLM (Alt+0254) or LRM (Alt+0253) at the end of lines. See
testcase.

Prog.
Attached file Testcase #1
Attaching the linked testcase...

Prog.
I will look again tomorrow when less jet-lagged, but isn't Mozilla's behaviour
correct according to the Unicode Bidi Algorithm?
I'm not sure about the Unicode BiDi algorithm, but certainly we can't expect
most users to know about control characters, now can we? not when other browsers
handle these situations perfectly - out of the box and without expecting the user
to come up with arcane workarounds.

Prog.
There's definitely a bug here:
   http://www.hixie.ch/tests/adhoc/unicode/bidi/010.html

There's no way our rendering of that testcase is per the algorithm.

See also:
   http://www.hixie.ch/tests/adhoc/unicode/bidi/011.html (similar, works fine)

It would appear the <br> is what is confusing us.

(Original testcase:
   http://oren.gomen.org/mozilla/control_characters/testcase1.html
...)
Keywords: testcase
Summary: Punctuation marks appear in the wrong side of embedded sections with opposite direction → <br> confuses our bidiness (punctuation before <br> at end of line starting with number doesn't follow paragraph directionality)
I think the first question should be "why does [Enter] in a text area add a
<br>?" Is there already a bug report on that?
Daniel: see comment 5.
I don't understand the question here. If the question is "why do we insert
a <br> instead of just a CR", that's because the textarea is really an editor, and
like in all instances of the editor, empty lines need a frame or they won't be
selectable. We solve this problem with <br>. Known old problem... Whoever fixes
this big problem wins my eternal consideration.
As can be seen from the test case (attachment 137938 [details]), a text area is just an
easy way to reproduce this bug. It happens outside of textareas too, in regular
page text.
In regular page text I consider this as INVALID or Evangelism, because authors
can and should start a new block element rather than using <br>. For example, if
attachment 137938 [details] was a regular page I would just say "use an <ol>".

At least according to HTML [1] and Unicode [2] <br> has the semantics of LINE
SEPARATOR, not CR, and should not cause any kind of paragraph formatting.

[1] http://www.w3.org/TR/html4/struct/text.html#edef-BR
[2] http://www.unicode.org/reports/tr14/index.html#BK
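
For reference, the bidi classes that Unicode assigns to the separator characters mentioned here can be checked with Python's standard `unicodedata` module. This is only a lookup of character properties, not a statement about how any browser maps <br> onto them; the helper name `separator_bidi_classes` is made up for this sketch.

```python
import unicodedata

def separator_bidi_classes() -> dict:
    """Return the Unicode bidi class of a few separator characters."""
    chars = {
        "LINE SEPARATOR": "\u2028",
        "PARAGRAPH SEPARATOR": "\u2029",
        "CARRIAGE RETURN": "\r",
        "LINE FEED": "\n",
    }
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

for name, bidi_class in separator_bidi_classes().items():
    print(f"{name}: {bidi_class}")
```

LINE SEPARATOR is classed as whitespace (WS), while PARAGRAPH SEPARATOR, CR and LF are all paragraph separators (B), which is the distinction the comment relies on.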
Ok, I fixed my testcase:
   http://www.hixie.ch/tests/adhoc/unicode/bidi/010.html

INVALID?
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
smontagu, even if <br> is just a line separator, punctuation marks in RTL
segments, if they appear at the end of a LTR word, should be on the left end of
the word.

unless i didn't understand what you mean here.

and how come in ian's test case in comment 10, two different lines are rendered
as identical?
Re comment 10, I have two questions:

(a) Do we want to emulate IE's behaviour as a quirk?
(b) The modified testcase shows a genuine bug which I also encountered when
testing this issue: there is extra space between the two words on the first line
of each pair, which contradicts the UBA's statement [1] that "trailing white
space will appear at the visual end of the line (in the paragraph direction)".
Do we want to keep this open for that issue, or open another report?

[1] http://www.unicode.org/reports/tr9/#L1

Re comment 11: Punctuation between two LTR words should appear at the right edge
of the first word, even in a RTL paragraph. Consider a case where you have
"Profession: Engineer" in a RTL paragraph, and then suppose that you narrow the
margin so that the line wraps just between those two words. Should that make the
colon jump from the right edge to the left edge of the first word?
A half baked idea for a programmatic workaround has occurred to me: what if we
inserted before the <br> in the editor a directional mark in the paragraph
direction (i.e. LRM in left-to-right paragraph and RLM in right-to-left paragraph)?
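
A minimal sketch of that workaround, assuming the editor content is available as a plain string with "\n" standing in for each <br>. The function name `mark_line_ends` and the `paragraph_rtl` flag are hypothetical and not part of the Mozilla editor.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

def mark_line_ends(text: str, paragraph_rtl: bool) -> str:
    """Insert a directional mark in the paragraph direction before each line break."""
    mark = RLM if paragraph_rtl else LRM
    return text.replace("\n", mark + "\n")

# In an RTL paragraph the full stop is now followed by a strong RTL character,
# so it no longer sits between two LTR runs.
print(repr(mark_line_ends("HEBREW english.\n123 HEBREW", paragraph_rtl=True)))
```

The mark only affects text produced by the editor after such a change; it does nothing for existing documents.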
The following:

   HEBREW english . 123 HEBREW

...becomes:

   HEBREW english . 123 HEBREW
   111111122222222222221111111

...so the "." is LTR and part of the surrounding run of english text.

The reason for this is that the numbers (weak, section 3.3.3:W2) are resolved
before the punctuation (neutral, section 3.3.4:N1). The numbers thus get the
direction LTR (from the english text; you search backwards) and the punctuation
then finds itself between two LTR runs so it becomes LTR too.

A dot at the end:

   HEBREW english .
   1111111222222211

...gets the direction of the paragraph (RTL in this test).

The important point in the test is to realise that the two lines are actually
all part of the same paragraph.

Is there a case we are doing incorrectly?
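
The level diagrams above can be reproduced with a toy resolver. This is a rough sketch under heavy assumptions, not the nsBidi implementation: it follows the thread's convention that UPPERCASE stands for Hebrew, treats every non-letter, non-digit character as neutral, and implements only the number rule (W7 in current UAX #9 numbering), the neutral rules N1/N2 and the implicit level rules. The name `toy_levels` is made up.

```python
def toy_levels(text: str, base_level: int = 1) -> str:
    """Toy embedding-level resolver (illustrative only)."""
    base_dir = "R" if base_level % 2 else "L"

    def classify(ch: str) -> str:
        if ch.isupper():
            return "R"   # stand-in for Hebrew letters
        if ch.islower():
            return "L"
        if ch.isdigit():
            return "EN"  # European number
        return "N"       # spaces and punctuation treated as neutral

    types = [classify(ch) for ch in text]

    # Number rule: a European number becomes L when the nearest
    # preceding strong type is L.
    last_strong = base_dir
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # Neutral rules: a run of neutrals takes the direction of its neighbours
    # when both sides agree, otherwise the paragraph direction.
    # Numbers that are still EN count as R here.
    def direction(t: str) -> str:
        return "L" if t == "L" else "R"

    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = direction(types[i - 1]) if i > 0 else base_dir
        after = direction(types[j]) if j < n else base_dir
        types[i:j] = [before if before == after else base_dir] * (j - i)
        i = j

    # Implicit levels, depending on whether the base level is even or odd.
    if base_level % 2 == 0:
        bump = {"L": 0, "R": 1, "EN": 2}
    else:
        bump = {"L": 1, "R": 0, "EN": 1}
    return "".join(str(base_level + bump[t]) for t in types)

print(toy_levels("HEBREW english . 123 HEBREW"))
print(toy_levels("HEBREW english ."))
```

With an RTL paragraph (base level 1) the first call gives the "." level 2 and the second gives it level 1, matching the two diagrams.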
> (a) Do we want to emulate IE's behaviour as a quirk?

I strongly feel we have enough (or even too many) quirks. If we want to add a
quirk, let's remove another one first. What quirk do you think we should remove?


> (b) The modified testcase shows a genuine bug which I also encountered [...]

That's bug 132561.


> what if we inserted before the <br> in the editor a directional mark in the 
> paragraph direction [...]

The <br> hack in the Editor has caused lots and lots of problems. The correct
fix is to remove it altogether and replace it with a real newline character. I
really don't think we should add even more characters to the mix.
Is this another case where we decide to stick with the UBA and the hell with users?

Simon, even if we do provide a workaround based on control characters, it
doesn't help one bit with existing texts. So not only do users have to endure
Mozilla's incompatibility with many (substandard) websites, they have to learn
to live with HyphenMinus at the wrong side of the number, punctuation marks at
the wrong side of the word and the like. Can we really expect anyone to switch?
not when they only get "half the internet" (I don't remember when I heard that
about alternative browsers, but it fits well in this context).

Prog.
>> (b) The modified testcase shows a genuine bug which I also encountered [...]
>That's bug 132561.

I don't think it is, although it overlaps with it in the white-space:normal
case. Let me try a testcase with white-space:pre
So... what do we suggest to users who:

1. Can't input RLM/LRM in their platform of choice (for technical keyboard
layout reasons)
-or-
2. Don't understand the concept of hidden control characters.
-or-
3. Wish to read existing texts as their author envisioned them.

If this behavior is "by design" then we should at least be able offer solutions
to real world problems that are caused by this decision.

Prog.
Blocks: 240501
> In regular page text I consider this as INVALID or Evangelism, because authors
> can and should start a new block element rather than using <br>.

Why? Is it improper or disallowed to have multi-line paragraphs in my HTML?

> For example, if
> attachment 137938 [details] was a regular page I would just say "use an <ol>".

You're right in this case, the testcase is a bad example because it uses <br>'s
as list item separators which is wrong. But this is not the general case!


> 
> At least according to HTML [1] and Unicode [2] <br> has the semantics of LINE
> SEPARATOR, not CR, and should not cause any kind of paragraph formatting.
> 

The consideration of the last punctuation mark in a line within an RTL paragraph
to be RTL is not paragraph formatting, it's line formatting. At least that's
what my intuition says. 

Another point is that regardless of whether some standard says otherwise, this
is plain wrong. Intolerable. It's like with bug 73251: if the standard does not
fit reasonable expectations then the standard must change.
like bug 73251, the way to fix this bug probably goes through the unicode
consortium.
*** Bug 150568 has been marked as a duplicate of this bug. ***
Looks like bug 73251 was fixed (as bug 240943) by going through the Unicode 
Consortium. 
 
I couldn't find references to this problem in Unicode's public mailing list. 
Did anyone propose this change to the Unicode Consortium? 
So far, I gathered this is the contact channel: 
http://www.unicode.org/reporting.html 
Or does the Mozilla Foundation have a better channel for feedback? 
IANAUE (I am not a Unicode expert), but if I read the per-line examples here:

http://www.unicode.org/reports/tr9/#Resolving_Implicit_Levels

correctly, they indicate that the behavior of considering the period at the end
of a line as part of the embedding run continuing up to it (rather than the
paragraph embedding level) - which is the buggy behavior this bug is about - is
not what the UBA requires, but rather the opposite of the requirement. Do
correct me if I'm wrong.
Eyal, as I understand the Bidi algorithm the problem is that even if the <br> is
set to the paragraph embedding level in rule L1, the embedding level of the
punctuation characters has already been set before that in rule N1 to the level
of the text on both sides (if it has the same direction) or to the paragraph
level (if it has different directions).

I have sent mail to the bidi discussion list at unicode.org proposing that the
<br> should be set to the paragraph embedding level at an earlier stage in the
process.
(In reply to comment #25)
> set to the paragraph embedding level in rule L1, the embedding level of the
> punctuation characters has already been set before that in rule N1 to the level
> of the text on both sides (if it has the same direction) or to the paragraph
> level (if it has different directions).

But that's a 'good thing' in our case: suppose the paragraph is RTL and the
first line has a word in English, then a period, then the line break - like in
the test case. Like you said, in rule N1 the level of the period is set to the
paragraph level (since it is part of a sequence of neutrals which begins after
the word in English and ends with the first Hebrew character on the next line),
i.e. its embedding level is set to 1. Great! Now when L2 is applied, the period
should not be reversed - only the word in English need be reversed.

Or have I missed something?
(In reply to comment #26)
> But that's a 'good thing' in our case: suppose the paragraph is RTL and the
> first line has a word in English, then a period, then the line break - like in
> the test case.

The question is, what is the first directional character after the line break?
If it's RTL, then the reordering is as you describe. But in the test case it's a
number, which was set to LTR in rule W7, so the text on both sides of the period
is LTR, i.e. level 2.
> The question is, what is the first directional character after the line break?
> If it's RTL, then the reordering is as you describe. But in the test case it's a
> number, which was set to LTR in rule W7, so the text on both sides of the period
> is LTR, i.e. level 2.

Yes, you're right. I would think the solution would be to restart level runs at
line breaks; this way W7 would not un-neutralize the period.

In fact, I suggest that we change the UBA implementation to do exactly that. You
know they'll approve this eventually. Just like they did with bug 73251. It's
what the UBA should say.
Blocks: 269759
Now that I have editbugs, I'm reopening the bug. It is certainly not INVALID. I
have made a suggestion in comments #28, #29. If people reject my suggestion, and
any other possible fix, then the status will change to WONTFIX (_please_ don't
do that, though... ).
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Assignee: mkaply → eyalroz
Status: REOPENED → NEW
I'm sorry, I can't take this bug due to lack of time to work on it.
Assignee: eyalroz → mkaply
I will, however, make some hopefully useful comments.

It seems there are two alternative ways to modify the UBA implementation. The
first alternative would be to modify P1 so that text is pre-split into lines
rather than into paragraphs, i.e. instead of 

"P1. Split the text into separate paragraphs. A paragraph separator is kept with
the previous paragraph. Within each paragraph, apply all the other rules of this
algorithm."

We would have

"P1. Split the text into separate lines. A line/paragraph separator is kept with
the previous line. Within each line, apply all the other rules of this algorithm."

This will mean minimal or no changes in nsBidi.* , which assumes P1 is performed
externally.

The second alternative is to consider line breaks to be of type L if the
paragraph embedding level is even, i.e. an LTR paragraph, and of type R if the
paragraph embedding level is odd, i.e. an RTL paragraph. This second alternative
probably means changes to some nsBidi.cpp functions (maybe also to nsBiDi.h
although I'm not too sure), but it has the advantage of allowing BiDi control
chars (e.g. LRO) to work across line breaks.

Finally, we need to consider whether or not the behavior should be different for
'manual' line breaks (e.g. <br>s) and for automatic line breaks due to wrapping.
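
The two alternatives can be compared on a toy model. This is a hedged sketch, not a patch against nsBidi.*: UPPERCASE stands for Hebrew, "\n" stands in for a <br> that is otherwise treated as a neutral, and only the number rule (W7), the neutral rules (N1/N2) and the implicit levels are modelled. The names `levels_with_break_rule`, `levels_per_line` and `break_is_strong` are invented for the illustration.

```python
def levels_with_break_rule(text: str, base_level: int = 1,
                           break_is_strong: bool = False) -> str:
    """Toy level resolver; optionally treats "\\n" as a strong character."""
    base_dir = "R" if base_level % 2 else "L"

    def classify(ch: str) -> str:
        if ch == "\n" and break_is_strong:
            return base_dir  # second alternative: break takes the paragraph direction
        if ch.isupper():
            return "R"
        if ch.islower():
            return "L"
        if ch.isdigit():
            return "EN"
        return "N"

    types = [classify(ch) for ch in text]

    # Number rule: EN becomes L if the nearest preceding strong type is L.
    last_strong = base_dir
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # Neutral rules: neighbours decide when they agree, else paragraph direction.
    def direction(t: str) -> str:
        return "L" if t == "L" else "R"

    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = direction(types[i - 1]) if i > 0 else base_dir
        after = direction(types[j]) if j < n else base_dir
        types[i:j] = [before if before == after else base_dir] * (j - i)
        i = j

    if base_level % 2 == 0:
        bump = {"L": 0, "R": 1, "EN": 2}
    else:
        bump = {"L": 1, "R": 0, "EN": 1}
    return "".join(str(base_level + bump[t]) for t in types)

def levels_per_line(text: str, base_level: int = 1) -> list:
    """First alternative: apply the rules to each line separately."""
    return [levels_with_break_rule(line, base_level) for line in text.split("\n")]

sample = "HEBREW english .\n123 HEBREW"
dot = sample.index(".")
print(levels_with_break_rule(sample)[dot])                        # current behaviour
print(levels_per_line(sample)[0][dot])                            # first alternative
print(levels_with_break_rule(sample, break_is_strong=True)[dot])  # second alternative
```

In this model both alternatives move the full stop from the LTR run back to the paragraph level. The sketch does not model explicit controls such as LRO, so it cannot show the difference between the alternatives that the comment points out for them.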
Attached file another testcase
Adding another testcase which does not raise the question of "why don't you
just make it an ordered list" and emphasizes how this bug may occur in 'simple'
text. Note that the two sentences can be said to form a single semantic whole
which should be a single paragraph, despite an author's decision to break the
line at certain points. So "why don't you just make it two paragraphs" is
invalid here, or at least as invalid as it is in any case with <br>'s instead
of paragraph breaks.
See W3C I18N Core WG comments at
http://lists.w3.org/Archives/Public/public-i18n-core/2005JanMar/0065.html

We believe Mozilla is behaving correctly, and that changing the behaviour will
break the more general expectations of behaviour.
Richard, there's no argument that, in that regard (resetting levels at line
breaks), Mozilla behaves by the Unicode Standard whereas Internet Explorer does not.

--

Let's review more realistic cases:

 <p dir="rtl">
  THE ENGLISH MAN SAID "hello,<br>
  my friend" AND SMILED.
 </p>

Mozilla would be right (according to Hebrew grammar), rendering:

  hello," DIAS NAM HSILGNE EHT
       .DELIMS DNA "my friend

Another case (note the order):

 <p dir="rtl">
 I USE mozilla,<br>
 konqueror, opera AND internet explorer.
 </p>

Mozilla would render:

                          mozilla, ESU I
 .internet explorer DNA konqueror, opera

which would look wrong Hebrew-grammar-wise, but it's just making an
already-existing problem more visible -- the comma is part of the Hebrew item
listing and, by the Hebrew grammar rules, it should be:

 [...]
 .internet explorer DNA opera ,konqueror <-- note the comma's position

and if the two lines would be unified, grammar dictates it should render:

 .internet explorer DNA opera ,konqueror ,mozilla ESU I

whereas, in both IE and Mozilla, it'll render:

 .internet explorer DNA mozilla, konqueror, opera ESU I

To achieve the grammatically-correct rendering, in any of the browsers, the user
will have to put in RLMs!

---

So, places where IE has it right due to a linebreak, are places where there'd be
a grammar mistake were it not for the linebreak. Therefore, I take my words back
about needing to change the UBA because IE's "standard" is better linguistically. The
reason to change the UBA is to fit a de facto standard which isn't better than
the UBA's standard.
Excuse the spam, but I'm posting my reply to Richard:

Thank you for your reply Richard,

I still find that your reply is more of a presentation of the current views of
W3C / the Unicode people on the matter.

If this had been an a priori argument about what <br>'s should mean and how they
should be used, I would find your view a perfectly reasonable alternative
(although it would mean there would have to be a way to 'break a line
syntactically' without switching to a new paragraph, which in the current scheme
of things <br> is not intended to do).

However, there is the huge corpus of existing HTML documents with RTL text which
all assume an alternative interpretation of <br> - being not just white space,
but having some semantic significance of a break. True, this is in some part due to
MSIE's conventions, but like I argued in my previous e-mails, this assumption
has merit for itself. I know, the 1. xxx 2. xxx is not the best of examples when
considered as HTML, but it is a very common example when plain text is displayed
as HTML. That is, if you were to take the following block of text:

-----
Lorem ipsum dolor sit amet, consectetuer
adipiscing elit, sed diam nonummy nibh
euismod tincidunt ut laoreet dolore magna
aliquam erat volutpat.
Ut wisienim ad minim veniam, quis nostrud
exerci tation ullamcorper suscipit lobortis
nisl ut aliquip ex ea commodo consequat.
Duis autem vel eum iriure etc etc
-----

and 'HTML'ify it (this is very common practice, not just in Mozilla and other
web browsers which feed text to their HTML renderer for display and/or edit text
messages as HTML, but also in news sites which store stories as plain text and
use it to prepare their displayed pages) - you would not make every line of text
a paragraph, on one hand, and usually not go to the trouble of running some
intelligent classifier to decide where paragraphs end (e.g. volutpat) and where
they do not (e.g. at consequat). What you would normally do is append <br>'s to
each line. Thus you see text (or html) like

most people in the room don't know the Hebrew word ;DROW<br>
3 people do know it.

or

most people in the room don't know the Hebrew word ;DROW<br>
REHTONA is a better known word.
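
A minimal sketch of that kind of naive conversion, assuming the plain text has one logical line per newline. The helper name `htmlify` is hypothetical; it does not describe any particular site's or Mozilla's converter.

```python
import html

def htmlify(plain_text: str) -> str:
    """Escape each line and join the lines with <br>, with no paragraph detection."""
    return "<br>\n".join(html.escape(line) for line in plain_text.splitlines())

print(htmlify("most people in the room don't know the Hebrew word ;DROW\n"
              "3 people do know it."))
```

Every line break in the source becomes a <br> inside one block, so whatever bidi treatment <br> gets is applied to each original line end.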

These are the more representative examples - at least from my experience - of the
use of <br>'s, in comparison to your provided example of

1. xxxxxx xxxx English,<br>
and more xxx.

Because one immediately finds one's self asking: Why put a forced line break in
the middle of a phrase? If it's a single phrase, authors prefer to keep it on
the same line; if it is broken by a <br>, the author most likely considers it
not to be an indivisible whole. And if this <br> comes from HTMLizing text, it
is usually less confusing to err by endowing the comma with the 'general'
direction of the text than to err by switching it to the other side of
'English'. The reason is that the first type of error does nonetheless
correspond somewhat to the order of reading (you only need to reinterpret an
'end-of-the-line' punctuation mark as a
'coming-after-the-last-word-in-this-line' punctuation mark); while the second
option, in case of an error, makes you pause when first viewing the comma or
period thinking that a new clause is beginning, then to move to the next line
only to find that this is not the case and that 'English' belongs to the
previous clause or sentence after all, and finally to re-read the text while
switching the comma or period back to the end of the line in your head.

I hope this clears up 'where I'm coming from' in requesting this change. This
situation seems to me not entirely unlike the problem we had with minus as a
number separator,

https://bugzilla.mozilla.org/show_bug.cgi?id=73251

, in which the behavior of the minus was modified to accommodate one of its
common uses in RTL text. There too, although MSIE was 'wrong' w.r.t. the Unicode
standard's original scheme, it turned out that the 'wrong' way of laying out the
minus was in general better than the 'right' way. I think we would be
hard-pressed to find more than a handful, if even that, of people breaking up
their "English, and more" with <br>s, especially within RTL text.

Eyal

PS - As for audio-only browsers, the point about <br>s being discarded is sort of
moot, since when reading the HTML or text, the punctuation mark will always come
after the last word if it is written after the last word - the 'switching' is
done only for visual layout.

Ilya, what you've brought up (english words separated with commas in a Hebrew
paragraph) is a different issue, pertaining to the UBA's behavior squarely
within level runs. I think we should not try to link that up with the question
of what happens at <br>'s.
CCing Richard Ishida (who probably missed the last few comments).

Prog.
(In reply to comment #5)
> I think the first question should be "why does [Enter] in a text area add a
> <br>?" Is there already a bug report on that?

For the record, the answer is now "yes": bug 240933.
This bug doesn't happen only when <br> tags are present. Testcase #2 shows that
it also occurs in pre-formatted texts.

Prog.
(In reply to comment #40)
> This bug doesn't happen only when <br> tags are present. Testcase #2 shows that
> it also occurs in pre-formatted texts.

Since this bug is explicitly about <br>, that testcase shows a different bug,
and one which is definitely valid without any changes to the Unicode Bidi
Algorithm. It's mentioned in the comment at
http://lxr.mozilla.org/seamonkey/source/layout/base/nsBidiPresUtils.cpp#486
   // XXX: TODO: Handle preformatted text ('\n')
No longer blocks: 269759
*** Bug 269759 has been marked as a duplicate of this bug. ***
(In reply to comment #43)
I would like to hear people's opinions, especially dbaron's, regarding my comment 36.
I've used BR in both ways, a number of times -- but I didn't have any problems since all my text was LTR.  Probably most of my uses have been for breaks that do have semantics, rather than just a place I'd like to ensure the line be broken.

Given what MSIE does, it sounds like we should probably switch to being compatible with it, since there are a significant number of Websites and applications that depend on the MSIE behavior.
Blocks: 137995
Assignee: mozilla → nobody
QA Contact: zach → layout.bidi
Not going to block 1.8.1 for this, but we'd be happy to consider a patch that's baked on the trunk.
Flags: blocking1.8.1? → blocking1.8.1-
This is a recent discussion on the matter on #devs@irc.mozilla.org :

http://pastebin.mozilla.org/3187
That was a temporary post to pastebin. Resubmitted as http://pastebin.mozilla.org/3190
Blocks: 376359
Component: Layout: BiDi Hebrew & Arabic → Layout: Text
QA Contact: layout.bidi → layout.fonts-and-text
Keywords: rtl
Simon: are you aware of the latest status of this bug?  (Feeling kind of lost due to the number of comments here.)
Attachment #137938 - Attachment mime type: text/html → text/html; charset=windows-1255
Attachment #174368 - Attachment mime type: text/html → text/html; charset=windows-1255
Attachment #190274 - Attachment mime type: text/html → text/html; charset=windows-1255
HTML5 is apparently going to require that we change our behavior to fix this; it's also tested in the CSS 2.1 test suite, in the fifth test of:
http://test.csswg.org/suites/css2.1/20101001/xhtml1/bidi-breaking-002.xht
(I think tests 3 and 4 are invalid.)
Blocks: html5bidi
Depends on: 263359
Fixed in bug 263359
Status: NEW → RESOLVED
Closed: 21 years ago → 13 years ago
Flags: in-testsuite+
Resolution: --- → FIXED