Closed Bug 231610 Opened 22 years ago Closed 22 years ago

topjobs.ie sends UTF-16LE encoded pages without charset specified in http header

Categories

(Tech Evangelism Graveyard :: Other, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: P, Unassigned)


Works in MS Internet Explorer.
I think this bug is invalid: the server declares the page as ISO-8859-1, but the content is UTF-16 or UCS-2.
It's not valid as filed, but it is valid as a tech-evangelism bug. The page is UTF-16LE, but the HTTP Content-Type header is just 'text/html' with no charset specified. It should be 'Content-Type: text/html; charset=UTF-16LE'. Our parser fails to detect the page as UTF-16LE because it has several CR/LFs before "<html>". We may have to change our parser to cope with this situation. I found that Tomcat (other Java application servers may do the same) emits several newlines at the start of its HTML output. bz and smontagu, what do you think?
Component: Browser-General → Other
OS: Linux → All
Product: Browser → Tech Evangelism
Hardware: PC → All
Summary: unicode not displayed correctly → topjobs.ie sends UTF-16LE encoded pages without charset specified in http header
Version: Other Branch → unspecified
Component: Other → Arabic
OS: All → Linux
Hardware: All → PC
doh!
Component: Arabic → Other
Hmm.. so the data I see sent is: cr lf 0 cr lf 0 cr lf 0 ..... How is that anything resembling UTF-16? The whole file is like that (the actual HTML does seem to correctly have null bytes every other byte, except at newlines). The parser could certainly look for null bytes and guess based on that (does it not?), but in this case even that would fail...
You're absolutely right. It's horrible beyond imagination. I thought it was like 0d 00 0a 00 0d 00 0a 00 .... followed by '<html>' in UTF-16LE. Part of the page can be interpreted as UTF-16 while other parts cannot. I have no idea how in the world they produced such a beast.
> look for null bytes and guess based on that (does it not?)

It does, but only for certain characters. I guess extending it was considered but never implemented; see http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2000. In the case of XML/XHTML, '<?xml' has to be at the very beginning of the document, but as I wrote, Tomcat emits several newlines before it.
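A minimal sketch of the kind of null-byte heuristic being discussed, in Python rather than the actual nsParser C++ (the function name, sample size, and logic here are illustrative assumptions, not Mozilla's real detector). It also shows why the leading CR LF 00 bytes on this page defeat such a check:

```python
from typing import Optional

def sniff_utf16(data: bytes) -> Optional[str]:
    """Guess a UTF-16 variant from a byte buffer (hypothetical helper).

    For ASCII-range text encoded as UTF-16, every other byte is NUL,
    which is what the heuristic keys on.
    """
    sample = data[:256]
    if len(sample) < 4:
        return None
    # A BOM settles it outright: FF FE = little-endian, FE FF = big-endian.
    if sample.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if sample.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # Null-byte heuristic: NULs at odd offsets -> LE, at even offsets -> BE.
    if all(b == 0 for b in sample[1::2]):
        return "utf-16-le"
    if all(b == 0 for b in sample[::2]):
        return "utf-16-be"
    return None

# A clean BOM-less UTF-16LE page would be detected:
good = "<html>".encode("utf-16-le")
# But this page began with CR LF 00 triples, so the every-other-byte
# pattern is already broken before "<html>" is reached:
bad = b"\x0d\x0a\x00" * 3 + "<html>".encode("utf-16-le")
```

As the comment above notes, even a heuristic like this would fail on the page in question, because the damage starts in the very first bytes.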
Even lynx displays something reasonable, because it accepts the server's header, treats the page as ISO-8859-1 (my default), and just ignores the null bytes. There is something to be said for just following the server. :-) This looks like a Unix UTF-16 document passed through a simple-minded newline converter.
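The converter hypothesis is easy to reproduce. A sketch (in Python; the converter here is a hypothetical stand-in, assuming a tool that rewrites newlines without knowing the encoding):

```python
def bytewise_crlf(data: bytes) -> bytes:
    # Naive Unix-to-DOS newline converter that is oblivious to the
    # encoding: it rewrites every lone LF byte as CR LF.
    return data.replace(b"\x0a", b"\x0d\x0a")

# In UTF-16LE a Unix newline is the byte pair 0a 00; the byte-wise
# converter turns each one into the triple 0d 0a 00 -- exactly the
# "cr lf 0" pattern observed on the wire in this bug.
page = "\n\n\n<html>".encode("utf-16-le")
mangled = bytewise_crlf(page)
```

The HTML itself survives untouched (none of its bytes happen to be 0x0A), which matches the observation that only the newlines are corrupted.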
They've fixed it now. What happened, quoting them directly: "We have been doing some work on the site to improve the download speed. The homepage was originally .asp (from Dreamweaver), but to speed up the download time we have converted it to HTML for the short term. This was done simply by viewing the .asp source code from the browser and saving it as a new HTML file. You get three different options when you select "save as": Default, Unicode and UTF-8. Unicode is what originally caused the problems; changing to UTF-8 makes it work again in Mozilla." So in summary, using MS Internet Explorer and saving as Unicode should generate data that Mozilla can't parse (it may need blank lines at the start to trigger this).
Yup, fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Product: Tech Evangelism → Tech Evangelism Graveyard