Closed Bug 909344 Opened 6 years ago Closed 6 years ago

view source does not represent all characters

Categories

(Core :: Layout: Text and Fonts, defect)

12 Branch
x86_64
All
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox22 --- wontfix
firefox23 --- wontfix
firefox24 --- wontfix
firefox25 --- wontfix
firefox26 --- wontfix
firefox27 --- wontfix
firefox28 --- wontfix
firefox29 --- wontfix
firefox30 --- fixed
firefox-esr17 --- wontfix
firefox-esr24 --- wontfix

People

(Reporter: kdevel, Assigned: jfkthame)

References

Details

(Keywords: regression)

Attachments

(8 files)

Attached file test.html
User Agent: 

Steps to reproduce:

1. Open testcase.
2. View Page Source.


Actual results:

2. Source appears as if it were syntactically correct.


Expected results:

2. Display unallowed characters after 'u' in 'document'.
OS: Other → Linux
Hardware: Other → x86_64
Attachment #795438 - Attachment mime type: text/plain → text/html
Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/8ae16e346bd0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106015923
Bad:
http://hg.mozilla.org/mozilla-central/rev/fcc32e70c95f
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106042423
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=fcc32e70c95f


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/511078d51f71
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105035122
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/c0b62edd2917
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105041225
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=511078d51f71&tochange=c0b62edd2917

Regressed by: bug 703100
Blocks: 703100
Status: UNCONFIRMED → NEW
Component: Untriaged → Layout: Text
Ever confirmed: true
Keywords: regression
OS: Linux → All
Product: Firefox → Core
Version: 20 Branch → 12 Branch
I don't think this is a bug. Control characters (except for a few such as <tab> and <return>) "have no visual or spatial representation" .... i.e. they're invisible. If you sprinkle them through source code, it won't look any different, but it'll certainly break.

The same would apply to other zero-width, invisible characters such as U+200B, U+2060, U+034F, U+2062, and many more; if you add these to your source, they're unlikely to be visible yet they will break the functionality. Similarly if you replace the "o" (U+006F) in "document" with "ο" (U+03BF); it may not look any different, but don't expect Javascript to understand it!
(In reply to Jonathan Kew (:jfkthame) from comment #2)
> Control characters (except for a few such as
> <tab> and <return>) "have no visual or spatial representation" .... i.e.
> they're invisible.

They have no representation in a presentation type context/mode. But view-source does not constitute what type of context but rather an edit type context. And all editors I have checked (vi, nano, kate, kwrite, emacs) represent these control characters properly.
I think that invisible char should be indicate in viewsource and inspector at least. because debuging make easy.

The visualization of invisible char, I think that something like Ubuntu and adding special color is nice.
"Properly" meaning... what, exactly? There are many conventions for representing control characters in visible form.

(I notice that if I view the sequence <U+0061,U+0001,U+0062> in nano, there's no visible difference from the sequence <U+0061,U+005E,U+0041,U+0062>. To detect the difference, it's necessary to cursor through the text and notice that the "^A" is a single unit rather than two characters.)

If we displayed the control characters as hexboxes, would that be satisfactory? Or should view-source use some other convention, such as converting them to the appropriate escaped or encoded form for HTML or CSS or JS or whatever is being viewed?
One option would be a patch like this, which makes control characters in general render as hexboxes rather than remaining invisible. Note that this will apply -all- text rendering, not only view-source, so it may affect web pages that (accidentally?) include control characters and didn't expect them to be visible. However, I think it's probably a reasonable thing to do on the whole. (If people want view-source to display control chars in some other form, such as a vim/emacs-style "colored ^X" notation, that would need to be implemented as part of the view-source feature instead.)
Attachment #795729 - Flags: review?(roc)
Assignee: nobody → jfkthame
I don't think we should do this for all Web content, just for view-source. For regular Web content I see no upside and there is compatibility risk,
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #7)
> I don't think we should do this for all Web content, just for view-source.
> For regular Web content I see no upside and there is compatibility risk,

The benefit is that "stray" control chars will no longer be invisible to the user, which means that if Find, for example, is not giving the expected result (try searching for the word "benefit" here by typing it into the search bar), the user has some clue as to why. Similarly, it avoids unexpected results when copy-pasting text from a page into some other application where the presence of control chars may have unwanted effects, or the possibility of submitting invisibly-incorrect data via web forms.

The compatibility risk is minimal, IMO, especially in view of the fact that prior to Fx12 it was possible (depending on what font happened to get chosen during fallback, and whether it mapped the control chars to invisible glyphs or to a .notdef box) for such control characters to show up as visible blots on the page.
How about if we try landing this after the upcoming merge, and then see if any compatibility issues show up as it progresses through the channels towards release? It'd be easy to revert if necessary during Aurora or even Beta.

(Personally, I suspect instances of this are sufficiently rare that we won't encounter any significant compatibility issues; and for the rare cases that are affected, I believe the changed behavior is actually an improvement for all web content, not only for view-source.)
What do other browsers do? This should really be specced.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10)
> What do other browsers do? This should really be specced.

I prefer Jonathan Kew's approach. It ensures that authors who use a Firefox browser to enter their story become aware of a possible control character pollution. Take a look at this newpaper article:

http://www.faz.net/aktuell/feuilleton/medien/gema-alternative-c3s-legaler-tausch-schliesst-profit-nicht-aus-12578461.html

Do you see any of the 24 U+0084 and U+0093 characters or any of the six U+0096? How do I as an editor using a current Firefox browser ensure that my story does not contain these strange characters?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10)
> What do other browsers do?

It varies. On my Win8.1 (preview) VM, for example, Internet Explorer shows a "box" glyph for one of them, and blank space for a couple of others, while most remain invisible. On the same machine, Chrome leaves them all invisible. See attached screenshot.

I suspect the results are often font-dependent, as the browsers are simply letting these characters go through their normal rendering path, potentially with font fallback (as many fonts will omit them from the character map entirely). Note the selection highlighting in the screenshot of IE, which suggests that the control characters are being "drawn" with at least a couple of different fonts, resulting in different heights for the selection highlight.

> This should really be specced.

In which spec?
Attached image ctrl-chars.png
Screenshot showing examples of IE and Chrome rendering of text with stray control characters on Win8.1.
(See http://people.mozilla.org/~jkew/ctrlchars.html for the HTML page used for that screenshot.)
(In reply to Jonathan Kew (:jfkthame) from comment #12)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10)
> > This should really be specced.
> 
> In which spec?

CSS.

(In reply to Stefan from comment #11)
> Do you see any of the 24 U+0084 and U+0093 characters or any of the six
> U+0096? How do I as an editor using a current Firefox browser ensure that my
> story does not contain these strange characters?

There are a few options other than showing them to users in the browser:
-- Examine view-source
-- Web developer tools warnings
-- Have your HTML editor, CMS or Web framework check automatically
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #15)
> (In reply to Stefan from comment #11)
> > Do you see any of the 24 U+0084 and U+0093 characters or any of the six
> > U+0096? How do I as an editor using a current Firefox browser ensure that my
> > story does not contain these strange characters?
> 
> There are a few options other than showing them to users in the browser:
> -- Examine view-source
> -- Web developer tools warnings
> -- Have your HTML editor, CMS or Web framework check automatically

These are (or would be) useful options to help web authors/content creators avoid inadvertently including spurious control characters in pages. However, they don't address the issue for the end user viewing a page where such characters have (by whatever means) been included. Their presence affects the usability, and perhaps even the meaning, of the page; I suggest we are doing users a disservice by hiding them.
(In reply to Jonathan Kew (:jfkthame) from comment #16)
> These are (or would be) useful options to help web authors/content creators
> avoid inadvertently including spurious control characters in pages. However,
> they don't address the issue for the end user viewing a page where such
> characters have (by whatever means) been included. Their presence affects
> the usability, and perhaps even the meaning, of the page; I suggest we are
> doing users a disservice by hiding them.

Can you point to any real pages where users (not developers) would benefit from seeing control characters rendered as hexboxes?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> Can you point to any real pages where users (not developers) would benefit
> from seeing control characters rendered as hexboxes?

How do users benefit from automatic ligatures (fi, fl)? 

According to Eric E. Schmidt I would like to suggest a different perspective: If you have characters you don't want anyone to see, maybe they should not have been written in the first place.
(In reply to Stefan from comment #18)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> > Can you point to any real pages where users (not developers) would benefit
> > from seeing control characters rendered as hexboxes?
> 
> How do users benefit from automatic ligatures (fi, fl)?

They make the text look better. Hexboxes don't make the page look better.

> According to Eric E. Schmidt I would like to suggest a different
> perspective: If you have characters you don't want anyone to see, maybe they
> should not have been written in the first place.

That's fine if you're the author of the page. Most users aren't.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> (In reply to Stefan from comment #18)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> > > Can you point to any real pages where users (not developers) would benefit
> > > from seeing control characters rendered as hexboxes?
> > 
> > How do users benefit from automatic ligatures (fi, fl)?
> 
> They make the text look better.

Do they really?

First, that was probably neither intended at the time of their invention nor is it empirically true that they make text look better in general. There are also at least two kinds of 'false' ligatures. The first one is when ligatures are formed in monospace text: It does not look better but also breaks a possible text alignment--making text look worse even for non-expert eyes. 

The second one is when the processor forms ligatures contrary to the typesetting rules, e.g. in "Auflösung" a morpheme boundary is crossed. Does it look better if these rules are disregarded? Does it make text look that better if a processor, which is unable to get kerning/positioning right, makes use of fancy typography?

> Hexboxes don't make the page look better.

That's not the intended benefit.
 
> > According to Eric E. Schmidt I would like to suggest a different
> > perspective: If you have characters you don't want anyone to see, maybe they
> > should not have been written in the first place.
> 
> That's fine if you're the author of the page. Most users aren't.

Nowadays probably most users /are/ authors of content.
(In reply to Stefan from comment #20)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> > They make the text look better.
> 
> Do they really?

That's the intended benefit, anyway. If you want to argue that we should drop ligature support because you don't like ligatures, please make it in some other forum.

> > Hexboxes don't make the page look better.
> 
> That's not the intended benefit.

Of course. So what is the intended benefit to users who aren't the authors of the page?

> > > According to Eric E. Schmidt I would like to suggest a different
> > > perspective: If you have characters you don't want anyone to see, maybe they
> > > should not have been written in the first place.
> > 
> > That's fine if you're the author of the page. Most users aren't.
> 
> Nowadays probably most users /are/ authors of content.

A user being an author of *some* content doesn't mean that they care to see the errors made by authors of *other* content. This is obvious. I fear you are being intentionally obtuse.
The presence of spurious control characters will disrupt normally-expected functionality, such as search (both Find-in-Page and search-engine indexing) and copy/paste reuse of text, that may be relevant to all users - not only to the original author.

Making the control characters visible rather than invisible is a service to all users because it means the "unexpected" behavior such text will exhibit is reflected in a visible indication that "all is not what it seems". It makes the platform more transparent to users, less liable to mislead or surprise them.

Of course, there are many other ways in which authors may (intentionally or otherwise) produce content that behaves less-than-optimally without giving users any visible clue as to why it is behaving poorly. But in this case, as there is (AFAICT) no legitimate use case for invisible C0/C1 control characters within web content, and given that the web platform has never guaranteed nor consistently implemented this behavior (see attachment 807732 [details]), I don't see any reason to continue hiding them.
(In reply to Jonathan Kew (:jfkthame) from comment #22)
> Making the control characters visible rather than invisible is a service to
> all users because it means the "unexpected" behavior such text will exhibit
> is reflected in a visible indication that "all is not what it seems". It
> makes the platform more transparent to users, less liable to mislead or
> surprise them.

Control-characters disrupting copy-and-paste and find-in-page would occur rather less often than they occur in documents. So I think this benefit is only incurred on pages where showing hexboxes has already annoyed or confused a larger number of users. I don't think it can be a net win.

I think it makes sense to do this in view-source, in devtools, in HTML text input elements, and contenteditable content. I still think it's a bad idea for regular Web content.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #21)
> (In reply to Stefan from comment #20)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> > > They make the text look better.
> > 
> > Do they really?
> 
> That's the intended benefit, anyway.

Ligatures - especially those which come automatically (recall the contraindication monospace and crossed morpheme boundaries) - actually

> If you want to argue that we should
> drop ligature support because you don't like ligatures, please make it in
> some other forum.

No and no. Sorry, you got me wrong. My point is that 


> > > Hexboxes don't make the page look better.
> > 
> > That's not the intended benefit.
> 
> Of course. So what is the intended benefit to users who aren't the authors
> of the page?
> 
> > > > According to Eric E. Schmidt I would like to suggest a different
> > > > perspective: If you have characters you don't want anyone to see, maybe they
> > > > should not have been written in the first place.
> > > 
> > > That's fine if you're the author of the page. Most users aren't.
> > 
> > Nowadays probably most users /are/ authors of content.
> 
> A user being an author of *some* content doesn't mean that they care to see
> the errors made by authors of *other* content. This is obvious. I fear you
> are being intentionally obtuse.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #21)
> (In reply to Stefan from comment #20)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> > > They make the text look better.
> > 
> > Do they really?
> 
> That's the intended benefit, anyway.

Ligatures, especially those which come automatically (recall the contraindication monospace and crossed morpheme boundaries), actually do not empirically make "text look better". I don't know of any empirical study made in the group of nontypographs proving that automatic ligatures makes "text look better".

> If you want to argue that we should drop ligature support because you don't like ligatures, please
> make it in some other forum.

No and no. Sorry, you got me wrong. My point is that features (automatic ligatures) are implemented though they are of – let's face the truth – no use to the ordinary user: They do more harm than good for readers and authors of web pages.

> > > Hexboxes don't make the page look better.
> > 
> > That's not the intended benefit.
> 
> Of course. So what is the intended benefit to users who aren't the authors
> of the page?

What is the benefit of presenting � to the user? Shall the user really be confronted with illegal bytes in UTF-8 sequences? (Yes)

> > > > According to Eric E. Schmidt I would like to suggest a different
> > > > perspective: If you have characters you don't want anyone to see, maybe they
> > > > should not have been written in the first place.
> > > 
> > > That's fine if you're the author of the page. Most users aren't.
> > 
> > Nowadays probably most users /are/ authors of content.
> 
> A user being an author of *some* content doesn't mean that they care to see
> the errors made by authors of *other* content.

How do you know? How many authors of some content have you asked?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #23)
> (In reply to Jonathan Kew (:jfkthame) from comment #22)
> > Making the control characters visible rather than invisible is a service to
> > all users because it means the "unexpected" behavior such text will exhibit
> > is reflected in a visible indication that "all is not what it seems". It
> > makes the platform more transparent to users, less liable to mislead or
> > surprise them.
> 
> Control-characters disrupting copy-and-paste and find-in-page would occur
> rather less often than they occur in documents. So I think this benefit is
> only incurred on pages where showing hexboxes has already annoyed or
> confused a larger number of users. I don't think it can be a net win.

Do we have any evidence that such control characters (that we're currently hiding) are actually widespread on web pages, and that making them visible would be likely to annoy or confuse many users?

I suspect the most common scenario is where incorrect management of encodings somewhere in the authoring chain leads to the Windows-1252 characters in the x80-x9f range being treated as though they were Unicode C1-Controls values, and then UTF-8-encoded on the page. I've seen this happen to things like the "typographic quotes" and em-/en-dashes:

  data:text/html;charset=utf-8,%C2%93hello%C2%94%20foo%C2%97bar

where the author's intention was:

  data:text/html;charset=windows-1252,%93hello%94%20foo%97bar

With our current behavior, these incorrectly-encoded punctuation characters are completely invisible to the reader. This can actually be quite confusing, and make the page more difficult to read - even distort its apparent meaning.

I'd argue that while hexboxes on the page are ugly, they are in fact a BETTER experience - for the reader as well as the author - in this situation. They give the reader a clue that SOMETHING was meant to be there, and it's then possible to make a reasonable guess as to what the author intended; whereas making them invisible leaves the reader likely to mis-read without even realizing it.
Comment on attachment 795729 [details] [diff] [review]
render stray control characters as hexboxes instead of invisible

Review of attachment 795729 [details] [diff] [review]:
-----------------------------------------------------------------

Alright, I give up. I guess we'll see what happens.
Attachment #795729 - Flags: review?(roc) → review+
OK, here goes. :) We'll see...

https://hg.mozilla.org/integration/mozilla-inbound/rev/739edf98ca0b
Target Milestone: --- → mozilla28
Provided the patch here sticks (oops - after pushing it, I realized I couldn't remember if I've previously run this through tryserver), we should also have a reftest for it.
Attachment #830712 - Flags: review?(roc)
I backed this out because it caused layout/reftests/bugs/388367-1.html to fail.

https://hg.mozilla.org/integration/mozilla-inbound/rev/44cce34b7073

The testcase in 388367-1 includes a "formfeed" character U+000c, which became visible and thus caused the comparison with about:blank to fail.
Bug 388367 was about an assertion, not really about the 'rendering' of the form-feed control character; so I suggest we change its testcase to load as a simple crashtest instead of comparing to about:blank.
Attachment #830753 - Flags: review?(roc)
This also led to crashtest orange (due to an UNEXPECTED-PASS) on Linux/Linux64 only:

REFTEST TEST-UNEXPECTED-PASS | file:///builds/slave/test/build/tests/reftest/tests/layout/generic/crashtests/398332-3.html | assertion count 0 is less than expected 2 assertions

I notice that the testcase includes a control character (0x15), so it's plausible that the change here could affect it; by explicitly making this a missing-glyph rather than dropping it, we'll have affected the positioning of the following glyphs in the run, at least.

Given the comments in bug 436123 that seem to indicate people have seen various numbers of assertions in this testcase, presumably depending on platform/version issues, I think we should just change the annotation there from asserts(2) to asserts(0-2).
This just marks the testcase with asserts(0-2). I suppose we could be more precise, and mark it as asserting everywhere except Linux, but I suspect that whether the assertions are triggered is not really dependent on the platform as such, but more likely on exactly what fonts/metrics are involved. The real solution is to fix the underlying bug (436123) so that the assertion never happens anywhere.
Attachment #830776 - Flags: review?(roc)
FYI, one of the tests fails on OSX 10.8:
REFTEST TEST-UNEXPECTED-FAIL | file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/reftests/text/control-chars-01c.html | image comparison (!=)
So the old tests are now fine, but one of the new ones misbehaved. :( I wonder why it's platform-specific.

Marking the bug as [leave open] to investigate and reland the tests.
Whiteboard: [leave open]
Vertical tab and form feed used to show as boxes, leading to bug 106311, but it has been fixed by (?accidental) changes for years without being closed.  When this bug reaches a satisfactory end, it would be a good time to close bug 106311, too.

The original fontfont.de website has changed and the page is not on archive.org, but I think it had vertical tab at the start of every paragraph or perhaps every (layout) table row, and looked a mess.
The problem with testcase 1c on OS X happened because we fail to treat U+007F (DEL) as a control character in gfxFontGroup::IsInvalidChar; instead, it gets handled as though it were a printable character. That's seems wrong - and leads to platform/available-font-dependent behavior; OS X happens to have a font where this character code maps to a zero-width invisible glyph, and so the testcase failed, but the other platforms it must have generated space and/or a .notdef box, and so the test passed. Instead, we should simply treat it like the rest of the control characters.
Attachment #831003 - Flags: review?(roc)
Attachment #830712 - Flags: checkin-
https://hg.mozilla.org/mozilla-central/rev/1cb4343bea0d
https://hg.mozilla.org/mozilla-central/rev/f8a78d1723d8
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Depends on: 941940
Depends on: 943334
Depends on: 947588
Depends on: 963252
This was backed out from Firefox 28 and Firefox 29 for causing bug 947588. The fix for that bug is in Firefox 30 but was deemed too risky to backport.
>        } else if ((ch & 0x7f) < 0x20 || ch == 0x7f) {
I wonder why we need to "& 0x7f" here? Shouldn't it simply compare ch with 0x20?
Flags: needinfo?(jfkthame)
It is equivalent to (ch == 0x20 || ch == 0xa0). That is, normal space or no-break space.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #45)
> >        } else if ((ch & 0x7f) < 0x20 || ch == 0x7f) {
> I wonder why we need to "& 0x7f" here? Shouldn't it simply compare ch with
> 0x20?

We also want to catch the C1 controls 0x80..0x9f. Though on looking at the code now, it does look a bit odd how it's written; it would match the first 32 codepoints of any 128-character Unicode block! Something like (ch & ~0x80) < 0x20 would be more logical.

(I suspect it doesn't matter in practice, because this isn't applied to arbitrary Unicode characters, only to the "invalid" character codes that have been identified as terminating a run to be shaped. But it does seem confusing.)
Flags: needinfo?(jfkthame)
The type of ch is uint8_t. Is this really a Unicode character?
(In reply to Masatoshi Kimura [:emk] from comment #48)
> The type of ch is uint8_t. Is this really a Unicode character?

All this has moved somewhat since the patch here, so maybe we're not looking at the same code. In current m-c code, we have an expression like this in

  gfxFontGroup::IsInvalidChar(uint8_t ch)

where it is indeed an 8-bit value, so there's no issue there. But there's also similar code in the

  static bool
  IsInvalidControlChar(uint32_t aCh)

helper (see gfxFont.cpp), which I think is where the original fragment Xidorn quoted has moved; and that is used by both 8-bit and 16-bit codepaths.
Depends on: 1234039
You need to log in before you can comment on or make changes to this bug.