In plaintext, U+2028 and U+2029 each display as "?" but should display as newlines

Status: NEW
Assignee: Unassigned
Product: Core
Component: Layout: Text
Opened: 15 years ago
Updated: 7 years ago
Reporter: Sean M. Burke
Blocks: 1 bug
Keywords: testcase
Version: Trunk
Platform: x86, Windows 98
Attachments: 1 attachment

Description

15 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6

Unicode has two newline characters, U+2028 and U+2029, besides the
normal \n and \r that we all know so well.
2028 (that's hex) is "LINE SEPARATOR" and
2029 (that's hex) is "PARAGRAPH SEPARATOR".
Currently Mozilla doesn't seem to implement these, so viewing the plaintext file
at the above URL (which has several instances of these characters, and no \n's
or \r's) shows as just one line, whereas it should show as four.
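
For reference, the names and general categories of the two characters can be looked up with Python's standard unicodedata module. This is only an illustrative sketch and not part of the original report; the helper name describe and the SEPARATORS list are made up for the example.

```python
import unicodedata

# The two Unicode separators discussed in this report.
SEPARATORS = ["\u2028", "\u2029"]

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in SEPARATORS:
    print(describe(ch))
# ('U+2028', 'LINE SEPARATOR', 'Zl')
# ('U+2029', 'PARAGRAPH SEPARATOR', 'Zp')
```

Zl and Zp are dedicated general categories, each containing only one of these characters, which separates them from ordinary spaces (Zs) and from the control characters \n and \r (Cc).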

If the above URL is unreachable, you can reproduce it with this Perl program:
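use strict;
use warnings;

# Write a UTF-8 file whose only line breaks are U+2028 and U+2029.
open my $out, '>:encoding(UTF-8)', 'uninl.txt' or die "Cannot open uninl.txt: $!";
print {$out} "\x{FEFF}",                   # BOM
  "First paragraph.\x{2029}",              # PARAGRAPH SEPARATOR
  "Second paragraph\x{2029}",
  "Third paragraph, first line.\x{2028}",  # LINE SEPARATOR
  "Third paragraph, second line\x{2028}",
;
close($out) or die "Cannot close uninl.txt: $!";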

Also, these Unicode newline characters occur quite frequently in
ftp://www.unicode.org/Public/TEXT/FIVEBOOKS
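
To confirm that a given file really contains these characters (and no \n or \r), one could count them after decoding. The following is a rough sketch, assuming the file is UTF-8 with an optional BOM; count_separators and sample are hypothetical names, and sample simply mirrors the content that the Perl program above writes.

```python
import codecs

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

def count_separators(data):
    """Decode UTF-8 bytes (dropping a leading BOM, if any) and count line-break characters."""
    text = data.decode("utf-8-sig")
    return {
        "LS": text.count(LS),
        "PS": text.count(PS),
        "LF": text.count("\n"),
        "CR": text.count("\r"),
    }

# Same content as the reproduction file: a BOM, then four runs of text
# terminated by U+2029, U+2029, U+2028, U+2028.
sample = codecs.BOM_UTF8 + (
    "First paragraph.\u2029"
    "Second paragraph\u2029"
    "Third paragraph, first line.\u2028"
    "Third paragraph, second line\u2028"
).encode("utf-8")

print(count_separators(sample))
# {'LS': 2, 'PS': 2, 'LF': 0, 'CR': 0}
```

For a file on disk, the same function could be called on the result of open(path, "rb").read(). In UTF-8, U+2028 is the byte sequence E2 80 A8 and U+2029 is E2 80 A9.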


Reproducible: Always

Steps to Reproduce:
1. Start browser
2. View http://interglacial.com/~sburke/uninl.txt

Actual Results:  
It shows a single line of text that looks like this:
First paragraph.?Second paragraph?Third paragraph, first line.?Third paragraph, second line?


Expected Results:  
It should instead display as:

First paragraph.
Second paragraph
Third paragraph, first line.
Third paragraph, second line
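
The expected four-line result can be illustrated with Python, whose str.splitlines() treats both U+2028 and U+2029 as line boundaries. This is only a sketch of the desired behavior; it says nothing about how Gecko's plaintext rendering is or should be implemented, and the variable names are made up for the example.

```python
# Text content of the test file, without the BOM.
text = (
    "First paragraph.\u2029"
    "Second paragraph\u2029"
    "Third paragraph, first line.\u2028"
    "Third paragraph, second line\u2028"
)

# splitlines() breaks on \n, \r, U+2028, U+2029 (among others) and does not
# produce a trailing empty string for a terminating separator.
lines = text.splitlines()

for line in lines:
    print(line)
# First paragraph.
# Second paragraph
# Third paragraph, first line.
# Third paragraph, second line
```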


I don't know how these characters should be treated if they occur in HTML
(inside or outside of PRE and the like), whether raw or &-encoded.  But that
seems a larger and quite separate issue from what I'm reporting.

Presumably the issue of U+2028 and U+2029 in plaintext has a clearer and simpler
solution.

Comment 1

15 years ago
This could well be a dupe of bug 33032, which is about all the whitespace
characters in this Unicode range. However, not a lot is happening on that bug :-(
See also bug 138215.

Comment 2

15 years ago
Could you attach a testcase, please?
(Reporter)

Comment 3

15 years ago
Created attachment 129567 [details]
Short UTF8 text file containing Unicode newlines.

Comment 4

15 years ago
marking as duplicate of bug 33032
transferring over the testcase


*** This bug has been marked as a duplicate of 33032 ***
Status: UNCONFIRMED → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → DUPLICATE
Since the summary and comment 0 explicitly refer to plaintext, this isn't a dupe
of bug 33032.
Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---

Updated

14 years ago
Keywords: testcase

Updated

13 years ago
Status: UNCONFIRMED → NEW
Depends on: 33032
Ever confirmed: true
Assignee: layout.fonts-and-text → nobody
QA Contact: ian → layout.fonts-and-text

Comment 7

8 years ago
Still exists in:

Mozilla/5.0 (X11; U; Linux i686; ru; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
- The CSS 2.1 test suite recently got http://test.csswg.org/source/approved/css2.1/src/bidi-text/bidi-breaking-003.xht, which tests for this in the context of <pre>, where (as in <textarea>) there is absolutely no excuse not to support LINE SEPARATOR and PARAGRAPH SEPARATOR.

- Furthermore, HTML 4.01 (http://www.w3.org/TR/html401/struct/text.html#h-9.1) explicitly excluded LINE SEPARATOR and PARAGRAPH SEPARATOR from the categories of line breaks and whitespace. (Similarly, HTML5 leaves it all up to CSS, and CSS3 (http://dev.w3.org/csswg/css3-text/#white-space-rules) simply does not include LINE SEPARATOR and PARAGRAPH SEPARATOR in these categories.) It is the handling of these categories that constitutes the basic difference between the text inside <pre> and <textarea> and the text under other elements. Since LINE SEPARATOR and PARAGRAPH SEPARATOR are not in these categories, I do not see why they have to be handled any differently in <pre> and <textarea> and in other elements.

- Re http://www.w3.org/TR/unicode-xml/#Line, it is a set of guidelines for document authors. It is *not* a set of guidelines for what browsers should and should not support. Its recommendation to "use <xhtml:br /> instead of U+2028 and surround paragraphs by <xhtml:p> and </xhtml:p> instead of separating them with U+2029" never really held water. It certainly needs to be updated now that the HTML5 spec for <br> changed it from being a line separator to being a paragraph separator. If it really hates recommending using LINE SEPARATOR, it can recommend using <bdi><br/></bdi>, I guess. But I personally would prefer to use LINE SEPARATOR (as &#x2028;).

- It would be great if someone could mark this bug as blocking 613154; I don't have the rights.
BTW, I hope that when this bug is fixed, it will also result in support for LINE SEPARATOR and PARAGRAPH SEPARATOR in alert() and confirm() (where there is also no good excuse not to support them). Do I remember correctly that alert() and confirm() are implemented via an element with preformatted whitespace?
Blocks: 613154