Cannot display UTF-16 encoded webpage correctly.

RESOLVED INVALID

Status

--
minor
RESOLVED INVALID
14 years ago
14 years ago

People

(Reporter: mika_adler, Assigned: smontagu)

(Reporter)

Description

14 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041231
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041231

My webpage is edited and saved with the Quanta editor in Gentoo Linux, using
UTF-16 encoding, and has a proper (I think) META tag to tell the browser that
I'm using UTF-16. Mozilla's auto-detect feature for encoding only tries
iso-8859-1, no matter how I configure the browser, and the result is
that the webpage just looks like garbage. It does work, though, if I manually
tell Mozilla to use UTF-16, but then there is a problem: the HTML <B>-tag
does only work with UTF-16 little endian, not with UTF-16.

Reproducible: Always

Steps to Reproduce:
1.Load my homepage into the browser

Actual Results:  
Characters decoded with iso-8859-1 = garbage.

Expected Results:  
If Mozilla read the HTML META tag and made an intelligent decision that this
page is encoded with UTF-16, the auto-detect feature for character encoding
would probably work _much_ better :-)

Comment 1

14 years ago
Your server is sending a content type of text/html; charset=ISO-8859-1 and it
seems to me that it would be difficult to locate a UTF-16 encoded meta tag in
such a document.

Comment 2

14 years ago
server headers override everything else, this is INVALID. fix your server.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → INVALID
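
The precedence described in the two comments above (a charset in the server's Content-Type header wins over a META tag in the document) can be sketched roughly as follows. This is only an illustration of the rule as stated in this bug, not Mozilla's actual code; the function names and the ISO-8859-1 fallback are hypothetical.

```python
import re


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, if any."""
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1) if match else None


def effective_charset(content_type_header, meta_charset, fallback="ISO-8859-1"):
    """Pick a charset: HTTP header first, then META tag, then a fallback.

    The fallback value is an assumption made for this sketch.
    """
    header_charset = None
    if content_type_header:
        header_charset = charset_from_content_type(content_type_header)
    return header_charset or meta_charset or fallback


# The situation in this report: the header says ISO-8859-1, the META tag UTF-16.
reported = effective_charset("text/html; charset=ISO-8859-1", "UTF-16")
# Without a charset in the header, the META tag would be used instead.
no_header_charset = effective_charset("text/html", "UTF-16")
```

Under this rule the META tag is consulted only when the header carries no charset, which is why the fix belongs on the server side.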

Comment 3

14 years ago
hm...
10 Content-Type: text/html; charset=UTF-16

it seems the server (now) sends the correct headers. and indeed, it works for me
in mozilla.
(Assignee)

Comment 4

14 years ago
(In reply to comment #3)
> hm...
> 10 Content-Type: text/html; charset=UTF-16
> 
> it seems the server (now) sends the correct headers. and indeed, it works for me
> in mozilla.

Not completely (for both parts of this statement). UTF-16 implies BE, and the
page is in fact little-endian; and there is still the issue mentioned in comment
0 --

> the HTML <B>-tag does only work with UTF-16 little endian, not with UTF-16.

I don't understand why if I change View | Character Encoding between UTF-16,
UTF-16-BE and UTF-16-LE, each one displays slightly differently. I would have
expected at least one to display garbage.

Comment 5

14 years ago
ah - reopening for those issues, then
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
(Assignee)

Comment 6

14 years ago
Created attachment 172170 [details] [diff] [review]
Patch for the font issue

Thanks to biesi for suggesting that the problem was that UTF-16 isn't
recognized as being in the "x-unicode" langGroup. This is actually a regression
from bug 68738, which removed the aliasing of UTF-16 to UTF-16BE, without
adding a new entry for it in charsetData.properties.
Assignee: general → smontagu
Status: UNCONFIRMED → ASSIGNED
(Assignee)

Updated

14 years ago
Attachment #172170 - Flags: superreview?(dbaron)
Attachment #172170 - Flags: review?(cbiesinger)
Attachment #172170 - Flags: review?(cbiesinger) → review+

Comment 7

14 years ago
Comment on attachment 172170 [details] [diff] [review]
Patch for the font issue

sr=dbaron

utf-32 isn't aliased to itself in charsetalias.properties.  Does it need to be?
 Do we support BOM detection on it?
Attachment #172170 - Flags: superreview?(dbaron) → superreview+
(Reporter)

Updated

14 years ago
Severity: major → minor
Status: ASSIGNED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → INVALID
(Reporter)

Comment 8

14 years ago
True, my server was sending the wrong data, but the issue with the font is
still a problem (a small one, though).
(Assignee)

Comment 9

14 years ago
> utf-32 isn't aliased to itself in charsetalias.properties.  Does it need to be?
>  Do we support BOM detection on it?

All the HTML utf-32 testcases at http://jshin.net/i18n/utftest/ seem to work
(with autodetection turned off), but maybe we need it for UTF-32 stylesheets
(parallel to bug 235090)?
(Assignee)

Comment 10

14 years ago
Comment on attachment 172170 [details] [diff] [review]
Patch for the font issue

Checked in.

Comment 11

14 years ago
(In reply to comment #4)

> Not completely (for both parts of this statement). UTF-16 implies BE, and the
> page is in fact little-endian; and there is still the issue mentioned in comment
> 0 --

That's because the 'UTF-16' decoder does 'sort of' endian detection at the
beginning instead of treating the input as 'UTF-16BE'.
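
The kind of endian detection mentioned here can be sketched as a check for a byte order mark at the start of the data. This is a minimal illustration, not the Mozilla decoder; the function names are hypothetical, and falling back to big-endian when no BOM is present is an assumption taken from the "UTF-16 implies BE" remark in comment 4.

```python
import codecs


def sniff_utf16_endianness(data, default="utf-16-be"):
    """Return (codec name, number of BOM bytes to skip) for UTF-16 data."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be", 2
    # No BOM: assume big-endian (an assumption of this sketch).
    return default, 0


def decode_utf16(data):
    """Decode UTF-16 data after sniffing the byte order from the BOM."""
    codec, skip = sniff_utf16_endianness(data)
    return data[skip:].decode(codec)


# The first ten bytes of the page, as quoted later in this comment.
page_start = bytes([0xFF, 0xFE, 0x3C, 0x00, 0x48, 0x00, 0x54, 0x00, 0x4D, 0x00])
decoded = decode_utf16(page_start)
```

With the little-endian BOM honoured, those ten bytes decode to the start of an HTML tag rather than to garbage.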
 
> > the HTML <B>-tag does only work with UTF-16 little endian, not with UTF-16.
> 
> I don't understand why if I change View | Character Encoding between UTF-16,
> UTF-16-BE and UTF-16-LE, each one displays slightly differently. I would have
> expected at least one to display garbage.

It would break in a more spectacular manner if it included a lot more characters
at or above U+0100. The page is mostly made of characters below U+0100 and it
begins with 0xFF 0xFE 0x3c 0x00 0x48 0x00 0x54 0x00 0x4d 0x00. What's happening
is that the UTF-16BE decoder interprets 0xFF as invalid ('?') and the rest (0xFE
0x3c 0x00 0x48 0x00 0x54 0x00 0x4d 0x00) as 'U+FE3C U+0048 U+0054 U+004D'. This
would work 'perfectly' if the page didn't have any characters at or above
U+0100. However, it has a few Japanese characters, which break this
interpretation of UTF-16LE as UTF-16BE with a 'one byte offset'.
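
The 'one byte offset' effect can be reproduced with a small model of the behaviour described above. This is a sketch only: it simulates the misreading by dropping the first byte and decoding the rest as big-endian, which is an assumption about how the bytes end up paired, not the decoder's real code. The function name is hypothetical.

```python
def misread_le_as_shifted_be(text):
    """Encode text as UTF-16LE with a BOM, then misread it one byte late."""
    # Little-endian BOM (0xFF 0xFE) followed by the UTF-16LE bytes.
    raw = b"\xff\xfe" + text.encode("utf-16-le")
    # Model the rejected leading 0xFF: every later pair starts one byte late.
    shifted = raw[1:]
    # Drop the unpaired trailing byte so the length is even.
    shifted = shifted[: len(shifted) - (len(shifted) % 2)]
    return shifted.decode("utf-16-be", errors="replace")


# Characters below U+0100 mostly survive, apart from the first one.
ascii_result = misread_le_as_shifted_be("<HTM")
# A Japanese character (U+65E5) has a non-zero high byte and gets mangled.
mixed_result = misread_le_as_shifted_be("<b>日")
```

For "<HTM" the result is U+FE3C followed by "HTM", the same code points listed above. For "<b>日" the low byte 0xE5 of U+65E5 pairs with the preceding 0x00 and comes out as U+00E5, while its high byte 0x65 would be paired with the low byte of whatever follows.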

As for stylesheets in UTF-32, indeed we need to do something like what we did in
bug 235090 for CSS stylesheets in UTF-16.