Closed Bug 236325 Opened 20 years ago Closed 13 years ago

UTF-8 documents containing Byte Order Mark (BOM), misdelivered as ISO-8859-1, fail to display

Categories

(Tech Evangelism Graveyard :: Other, defect)

PowerPC
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: yhlien2004, Unassigned)

References

()

Details

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7b) Gecko/20040302 Camino/0.7+
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7b) Gecko/20040302 Camino/0.7+

When the given URL was loaded, the page content area just showed the strange
symbols "ï»¿" in the upper left corner. I checked the source of the URL and
found the charset label of the page was UTF-8. However, Camino still selected
the Western (Latin ISO 1) encoding and ignored the charset label in that page.
The given URL showed up properly after selecting UTF-8 manually.

Reproducible: Always
Steps to Reproduce:
1. Load the given URL.

Actual Results:  
The strange symbols "ï»¿" showed up.

Expected Results:  
The page should display its Traditional Chinese text properly.
Also happens using Mozilla.

The document is sent with "Content-Type: text/html;charset=ISO-8859-1",
explicitly specifying ISO-8859-1. Those characters are a Unicode Byte Order Mark
(BOM).
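
To illustrate why a mislabeled BOM turns into visible characters, here is a minimal Python sketch. It assumes Python's "iso-8859-1" codec is a fair stand-in for the browser's Western decoder; the sample text is made up for illustration.

```python
import codecs

# The UTF-8 encoding of U+FEFF (the byte order mark) is the bytes EF BB BF.
bom = codecs.BOM_UTF8
assert bom == b"\xef\xbb\xbf"

# Decoded with the charset from the HTTP header, each byte becomes one
# visible Latin-1 character instead of an invisible mark.
mislabeled = bom.decode("iso-8859-1")
print(mislabeled)  # ï»¿

# Decoded as UTF-8 with BOM handling, the mark is consumed silently.
# The two characters below are an arbitrary sample, not the real page.
body = bom + "中文".encode("utf-8")
print(body.decode("utf-8-sig"))  # 中文
```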

Should the http-equiv META data override the Content-Type?

Should some special content sniffing detect a BOM in non-UTF-8 files and compensate?

Reassigning to Browser/Parser.
Assignee: pinkerton → parser
Severity: minor → normal
Status: UNCONFIRMED → NEW
Component: Page Layout → HTML: Parser
Ever confirmed: true
Product: Camino → Browser
Summary: the page content did not showed up until the proper encoding was selected → UTF-8 documents containing Byte Order Mark (BOM), misdelivered as ISO-8859-1, fail to display
Version: unspecified → Trunk
(In reply to comment #1)
> Should the http-equiv META data override the Content-Type?

See <http://www.w3.org/TR/html4/charset.html#h-5.2.2> : the Content-Type should
have preference.

Reporter, a workaround is to specify the UTF-8 charset with the View->Character
Coding menu, which overrides everything. Note: the auto-detector didn't work either.
invalid, http charset headers override everything else.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
actually... maybe not... shouldn't we show the frameset?
jshin?
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
<body> tags in HTML are optional.  Once we hit text, we automatically open a
<body>.  Once a <body> is open, <frameset> is no longer allowed (and the parser
drops it).
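
The rule above can be sketched as a toy model in Python. This is not Gecko's parser; the token format and the function name are hypothetical, and the sketch assumes that the three mislabeled BOM characters are what counts as the first text on the page.

```python
def toy_tree_builder(tokens):
    """Toy model of the implied-<body> rule; NOT a real HTML parser.

    tokens is a hypothetical list of ("text", str) or ("start", tag) tuples.
    Returns labels for what ends up in the document.
    """
    body_open = False
    kept = []
    for kind, value in tokens:
        if kind == "text":
            if not value.strip():
                continue  # whitespace alone does not open <body> in this sketch
            if not body_open:
                kept.append("body (implied)")  # the <body> tag is optional
                body_open = True
            kept.append("text")
        elif value == "frameset":
            if body_open:
                continue  # too late: <frameset> is dropped
            kept.append("frameset")
        elif value == "body":
            if not body_open:
                kept.append("body")
                body_open = True
        else:
            kept.append(value)
    return kept


# BOM bytes mislabeled as ISO-8859-1 arrive as three visible characters.
print(toy_tree_builder([("text", "\u00ef\u00bb\u00bf"), ("start", "frameset")]))
# ['body (implied)', 'text']
print(toy_tree_builder([("start", "frameset")]))
# ['frameset']
```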

Evang.
Assignee: parser → other
Status: REOPENED → NEW
Component: HTML: Parser → Other
Product: Browser → Tech Evangelism
QA Contact: other
Version: Trunk → unspecified
Though I may be completely off, shouldn't one also be able to put &#xFEFF; , &#xFFFE; , &#xEFBBBF; , etc. before the prolog of an XML document and have it not show up as the characters but rather be treated as byte-order marks?
Comment 6 is correct, but I don't see what application it has to this bug report. If you are asking whether the encoding determined by the byte-order mark should take precedence over the encoding specified in HTTP headers, the answer is no. However the BOM may be used to determine the encoding when none is specified or to identify the endianness of UTF-16, and we do this.
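
As a rough illustration of the precedence described in this comment, the sketch below lets the HTTP charset win and consults the BOM only when no charset was sent. The function names and the "iso-8859-1" fallback are assumptions made for the example, not Gecko's actual code.

```python
import codecs

# BOMs recognised by this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding_header_first(http_charset, data, default="iso-8859-1"):
    """HTTP charset wins; the BOM is used only when no charset was sent."""
    if http_charset:
        return http_charset.lower()
    return sniff_bom(data) or default


page = codecs.BOM_UTF8 + b"<html>"
print(pick_encoding_header_first("ISO-8859-1", page))      # iso-8859-1
print(pick_encoding_header_first(None, page))              # utf-8
print(pick_encoding_header_first(None, b"\xff\xfe<\x00"))  # utf-16-le
```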
INCOMPLETE due to lack of activity since the end of 2009.

If someone is willing to investigate the issues raised in this bug to determine whether they still exist, *and* work with the site in question to fix any existing issues, please feel free to re-open and assign to yourself.

Sorry for the bugspam; filter on "NO MORE PRE-2010 TE BUGS" to remove.
Status: NEW → RESOLVED
Closed: 20 years ago → 13 years ago
Resolution: --- → INCOMPLETE
(In reply to comment #2)
> (In reply to comment #1)
> > Should the http-equiv META data override the Content-Type?
> 
> See <http://www.w3.org/TR/html4/charset.html#h-5.2.2> : the Content-Type
> should have preference.

This conclusion is wrong, because:
 * HTML4 did not discuss the UTF-8 BOM.
 * HTML5 unfortunately still says the same; however, bugs have been filed.
 * IE and WebKit give the BOM higher priority than the Content-Type header.
 * The IE/WebKit behaviour is in tune with XML 1.0.
 * The Firefox/Opera behaviour triggers Quirks Mode in HTML and triggers the Yellow Screen of Death in XML; those errors are not seen in WebKit or IE.

Summary: for the encoding, the BOM should have higher priority than the HTTP header.

Test case: http://malform.no/testing/html5/bom/
HTML5 bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
 XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
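
The precedence argued for here could look roughly like the Python sketch below, in which a UTF-8 BOM outranks the HTTP header. The function name and the fallback encoding are hypothetical, and this only models the policy this comment attributes to IE and WebKit; it is not taken from either browser.

```python
import codecs


def pick_encoding_bom_first(http_charset, data, default="iso-8859-1"):
    """A UTF-8 BOM, if present, wins over the HTTP header in this policy."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return (http_charset or default).lower()


# The situation from this bug: UTF-8 bytes with a BOM, labeled ISO-8859-1.
# The two characters are an arbitrary sample, not the real page content.
body = codecs.BOM_UTF8 + "中文".encode("utf-8")
encoding = pick_encoding_bom_first("ISO-8859-1", body)
print(encoding)  # utf-8

# "utf-8-sig" strips the BOM while decoding.
text = body.decode("utf-8-sig" if encoding == "utf-8" else encoding)
print(text)  # 中文
```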

> Reporter, a workaround is to specify the UTF-8 charset with the
> View->Character Coding menu, which overrides everything. Note: the
> auto-detector didn't work either.

This is yet another issue: WebKit and IE do not allow you to override the encoding whenever the encoding is UTF-8 *and* there is a UTF-8 Byte Order Mark.
Product: Tech Evangelism → Tech Evangelism Graveyard