Closed
Bug 231610
Opened 22 years ago
Closed 22 years ago
topjobs.ie sends UTF-16LE encoded pages without charset specified in http header
Categories
(Tech Evangelism Graveyard :: Other, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: P, Unassigned)
References
()
Details
works in M$ IExplorer
Comment 1•22 years ago
I think this bug is invalid, since the server sends the page as ISO-8859-1 while the
content is UTF-16 or UCS-2.
Comment 2•22 years ago
It's not valid as it was filed, but it is valid as a tech-evangelism bug. The
page is in UTF-16LE, but the HTTP Content-Type header is just 'text/html' with no
charset specified. It should be:
Content-Type: text/html; charset=UTF-16LE
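To make the difference concrete, here is a small sketch (standard library only; the header values are illustrative) of how a client pulls the charset parameter out of a Content-Type value. The bare header topjobs.ie sent yields no charset at all, leaving the browser to guess:

```python
# Hedged sketch: extracting the charset parameter from an HTTP
# Content-Type value using the stdlib MIME machinery.
from email.message import Message

def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")

# The header the site actually sent leaves the browser guessing:
assert charset_of("text/html") is None
# The header it should have sent:
assert charset_of("text/html; charset=UTF-16LE") == "UTF-16LE"
```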
Our parser fails to detect it as UTF-16LE because the page has several CR/LFs
before "<html>". We may have to change our parser to cope with this situation. I
found that Tomcat (other Java application servers may do the same) emits
several newlines at the start of its HTML output.
bz and smontagu, what do you think?
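The parser change being proposed could look something like the following sketch. This is not Mozilla's actual parser code, just an illustration of the idea: skip the leading CR/LF/NUL noise, then look for a BOM or a little-endian '<'. (Big-endian detection would need parity tracking, because skipping NULs loses byte alignment.)

```python
def sniff_utf16le(data):
    """Sketch of the sniffing idea discussed above (not Mozilla's actual
    parser code): skip leading CR/LF/NUL bytes, then look for a BOM or
    a little-endian '<'."""
    i = 0
    # Skip the newline noise an app server may emit before the markup.
    while i < len(data) and data[i] in b"\r\n\x00":
        i += 1
    rest = data[i:i + 2]
    if rest == b"\xff\xfe":
        return "UTF-16LE"        # byte-order mark
    if rest == b"<\x00":
        return "UTF-16LE"        # '<' encoded as little-endian UTF-16
    return None

# Proper UTF-16LE preceded by blank lines is now detected:
page = "\r\n\r\n<html>".encode("utf-16-le")
assert sniff_utf16le(page) == "UTF-16LE"
```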
Component: Browser-General → Other
OS: Linux → All
Product: Browser → Tech Evangelism
Hardware: PC → All
Summary: unicode not displayed correctly → topjobs.ie sends UTF-16LE encoded pages without charset specified in http header
Version: Other Branch → unspecified
Updated•22 years ago
Component: Other → Arabic
OS: All → Linux
Hardware: All → PC
Comment 4•22 years ago
Hmm.. So the data I see sent is:
cr lf 0 cr lf 0 cr lf 0 .....
How is that anything resembling UTF-16?? The whole file is like that (the
actual HTML seems to correctly have null bytes every other byte, except at
newlines).
The parser could certainly look for null bytes and guess based on that (does it
not?) but in this case that would fail anyway...
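A crude version of that null-byte heuristic shows why it would fail on this stream: UTF-16 text with mostly-ASCII content is about half NUL bytes, but the mangled newline runs here are only one-third NULs. (This is an illustration of the guessing idea, not the real parser's check.)

```python
def null_ratio(data):
    """Fraction of NUL bytes.  UTF-16 with mostly-ASCII content sits
    near 0.5; a crude guessing heuristic, not the real parser's check."""
    return data.count(0) / max(len(data), 1)

# Genuine UTF-16LE ASCII markup: half the bytes are NUL.
assert abs(null_ratio("<html>".encode("utf-16-le")) - 0.5) < 1e-9

# The stream from this bug: each little-endian newline (0d 00 0a 00)
# collapsed to 0d 0a 00, so only a third of the prefix is NULs.
mangled = b"\r\n\x00" * 10
assert abs(null_ratio(mangled) - 1 / 3) < 1e-9
```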
Comment 5•22 years ago
You're absolutely right. It's horrible beyond imagination. I thought it was like
0d 00 0a 00 0d 00 0a 00 .... followed by '<html>' in UTF-16LE.
Part of the page can be interpreted as UTF-16 while other parts cannot.
I have no idea how in the world they produced such a beast.
Comment 6•22 years ago
> look for null bytes and guess based on that (does it not?)
It does, but only for certain characters. I guess extending it was considered
but never implemented. See
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2000
In the case of XML/XHTML, '<?xml' has to be at the very beginning, but as I wrote,
Tomcat emits several newlines first.
Even lynx displays something reasonable, because it accepts the server's
header, treats the page as iso8859-1 (my default), and just ignores the null
bytes. There is something to be said for just following the server. :-)
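The lynx behaviour described above is easy to reproduce: trust the server, decode as ISO-8859-1 (where every byte maps to a character), and the stray NUL bytes simply drop out of the visible text. A minimal illustration:

```python
# What lynx effectively does, per the comment above: accept the
# server's (default) ISO-8859-1, then the NUL bytes between the real
# characters render as nothing.
raw = "<html><body>jobs</body></html>".encode("utf-16-le")
text = raw.decode("iso-8859-1").replace("\x00", "")
assert text == "<html><body>jobs</body></html>"
```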
This looks like a unix UTF-16 document passed through a simple-minded newline
converter.
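That theory checks out. Run a unix-newline UTF-16LE file through a byte-oriented LF-to-CRLF converter that knows nothing about encodings, and you get exactly the byte pattern reported in comment 4:

```python
# Reproducing the theory above: a unix-newline UTF-16LE file fed
# through a byte-level LF -> CR LF converter.
unix_utf16 = "\n\n<html>".encode("utf-16-le")    # 0a 00 0a 00 3c 00 ...
dos_mangled = unix_utf16.replace(b"\n", b"\r\n")
# Each 0a 00 newline becomes 0d 0a 00 -- the "cr lf 0 cr lf 0"
# pattern seen on the wire -- while the markup itself stays UTF-16LE.
assert dos_mangled.startswith(b"\r\n\x00\r\n\x00")
assert b"<\x00h\x00" in dos_mangled
```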
Reporter
Comment 8•22 years ago
They've fixed it now. What happened was, quoting them directly:
"We have been doing some work on the site to improve the download speed.
The homepage was originally .asp (from dreamweaver) but to speed up the
download time we have converted it to HTML for the short term. This was
done simply by viewing the .asp source code from the browser and saving
as a new HTML file. You get 3 different options when you select "save as":
Default, unicode and utf-8. unicode is what originally caused the problems,
changing to utf-8 makes it work again in mozilla."
So in summary, using M$ Explorer and saving as unicode can generate
data that Mozilla can't parse (possibly only when the file starts with blank lines).
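For context, the "unicode" option in that dialog historically means UTF-16LE with a byte-order mark, while "utf-8" keeps ASCII markup as plain ASCII bytes. A hedged illustration of the difference between the two save options:

```python
# "Unicode" in the Save As dialog historically means UTF-16LE with a
# BOM; "utf-8" keeps the markup ASCII-compatible.  Illustration only.
html = "\r\n\r\n<html></html>"
as_unicode = html.encode("utf-16")   # BOM, then two bytes per char
as_utf8 = html.encode("utf-8")       # plain ASCII bytes for this text

assert as_unicode.startswith(b"\xff\xfe")   # UTF-16LE byte-order mark
assert as_utf8.startswith(b"\r\n")          # readable ASCII right away
assert as_utf8.decode("utf-8") == html
```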
Updated•10 years ago
Product: Tech Evangelism → Tech Evangelism Graveyard