Closed Bug 224772 Opened 21 years ago Closed 21 years ago

leefish.ch not display text as of Moz 1.5

Categories

(Core :: DOM: Core & HTML, defect)

Other Branch
defect
Not set
major

Tracking

()

VERIFIED WORKSFORME

People

(Reporter: silicon, Unassigned)

References

()

Details

(Keywords: intl)

Attachments

(3 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20031030
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20031030
This page looks correct from Netscape 6 through Mozilla 1.4 final. Since Mozilla 1.5, part of the content (text) is missing.
Reproducible: Always
Steps to Reproduce: 1. 2. 3.
Expected Results: the page should be rendered as in Mozilla 1.4.
The page is made with a CMS that uses IE's contentEditable features. Sites generated with the CMS work correctly in IE, Mozilla < 1.5, and Opera 7.2 on Win32. It is very difficult to say what is wrong, but something seems to be wrong :-(
Confirmed the problem on Moz 1.5 final on Win XP Pro. Confirmed it works OK on IE 6.0 on Win XP Pro. See screenshots. - severity -> major - clarify summary
Severity: normal → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: since mozilla 1.5 text content of this page is missing → leefish.ch not display text as of Moz 1.5
This is a regression from bug 200984, looks like. I can't find where the site is calling unescape(), but there is clearly a character-encoding issue here -- if I go back to treating the string to be unescaped as ASCII, things work. I have no idea why this works in IE, given that it does NOT treat the string as ASCII... but the site is doing a lot of UA-sniffing (eg passing the appname and UA string to all subframes in the URL), so I would not be surprised if they send IE and Mozilla different data...
The text is unescaped because of JavaScript issues. E.g.: div.innerHTML = 'O'Brien is a Scottish name'; would generate a JavaScript error, so we need to do the following: div.innerHTML = unescape('O%27Brien is a Scottish name'); Browser sniffing is done once (OK, reloading the page will do it again). The content for all browsers is absolutely the same.
Well, here is the problem. The long string with all sorts of URL-escapes that's passed to unescape() contains a char that's not expressible as ISO-8859-1 (the charset of the page). When converted from Unicode to UTF-8, the byte sequence is: S T R O N G % 3 E 0xe2 0x80 0xa6 F O R % 2 0 Y O U R ... So we end up discarding the whole thing, since we can't do the right charset conversions. We could probably try to recover and knowingly produce bad data and hope that it looks "about right", which is probably what IE does...
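The byte sequence quoted in this comment can be reproduced with a short sketch. It assumes a JavaScript runtime that provides `TextEncoder` (modern browsers, Node.js); `fitsIso88591` is a hypothetical helper written for illustration, not anything from the Mozilla code.

```javascript
// U+2026 HORIZONTAL ELLIPSIS, the character found in the escaped string.
const ellipsis = '\u2026';

// UTF-8 bytes of the character, rendered as lowercase hex strings.
const utf8Hex = Array.from(new TextEncoder().encode(ellipsis),
  (b) => b.toString(16).padStart(2, '0'));

// Hypothetical helper: ISO-8859-1 covers exactly U+0000..U+00FF,
// so a string fits only if every UTF-16 code unit is <= 0xFF.
function fitsIso88591(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

console.log(utf8Hex.join(' '));       // e2 80 a6
console.log(fitsIso88591(ellipsis));  // false
console.log(fitsIso88591('\u00e4'));  // true (U+00E4, a with diaeresis)
```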
I don't think IE makes any distinction between ISO-8859-1 and windows-1252, where the character (HORIZONTAL ELLIPSIS) does appear. If I override the encoding to windows-1252 in Mozilla, the problem seems to go away. Olaf, are you the page author or webmaster? If so, can you change the charset declaration of the page to windows-1252?
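The difference between the two encodings lies in the byte range 0x80 to 0x9F. The sketch below uses a deliberately partial, hand-written table (`WINDOWS_1252_EXTRAS` and `decodeByte` are hypothetical names) to show that byte 0x85 is HORIZONTAL ELLIPSIS in windows-1252 but a C1 control code in ISO-8859-1. Bytes in that range that are missing from the table would be decoded incorrectly by this sketch.

```javascript
// Partial table of windows-1252 bytes in 0x80..0x9F that map to printable
// characters. ISO-8859-1 assigns C1 control codes to this whole range.
const WINDOWS_1252_EXTRAS = new Map([
  [0x80, '\u20ac'], // EURO SIGN
  [0x85, '\u2026'], // HORIZONTAL ELLIPSIS
  [0x91, '\u2018'], // LEFT SINGLE QUOTATION MARK
  [0x92, '\u2019'], // RIGHT SINGLE QUOTATION MARK
  [0x93, '\u201c'], // LEFT DOUBLE QUOTATION MARK
  [0x94, '\u201d'], // RIGHT DOUBLE QUOTATION MARK
]);

// Decode one byte under the named encoding (illustration only).
function decodeByte(byte, encoding) {
  if (encoding === 'windows-1252' && WINDOWS_1252_EXTRAS.has(byte)) {
    return WINDOWS_1252_EXTRAS.get(byte);
  }
  // ISO-8859-1 maps byte N straight to code point U+00NN.
  return String.fromCharCode(byte);
}

console.log(decodeByte(0x85, 'windows-1252') === '\u2026'); // true
console.log(decodeByte(0x85, 'iso-8859-1') === '\u0085');   // true
```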
Thanks Simon for your answer. We did some testing with the charsets.
ISO-8859-1: Mozilla 1.1b to Mozilla 1.4 works; Mozilla 1.5 or greater does not work.
UTF-8: Mozilla 1.1b to Mozilla 1.4 has problems with ä, ü, ö and so on (the text is not displayed); Mozilla 1.5 or greater works.
Windows-1252: all Mozilla versions work... but as you may have seen, leefish.ch also sells fish in the Japanese market. Japanese characters are saved in escaped form in the database and displayed by unescaping them. This does not work with Windows-1252. At the moment there is no test page for this online, but we will provide you with a test case tomorrow.
With IE 5.1-6.0 the pages work, no matter which charset we use. Tests were done on Win2000 and WinXP.
This may be irrelevant, but the page works fine in Opera 7 too.
> japaneses characters are saved in escaped form
> in the database and displayed through unescaping the characters.
Olaf, the problem is that unescaping produces _bytes_, not characters. Those unescaped bytes can only be interpreted as characters encoded in some encoding. At the moment, Mozilla assumes that encoding is the page encoding (since that makes the most sense). But the page encoding here is ISO-8859-1, and Japanese characters can't possibly be encoded in that... so what encoding _are_ you using exactly? And how are we supposed to know that?
By the way, the problem you are seeing with UTF-8 in old Mozilla builds is precisely what bug 200984 was about.
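To make the bytes-versus-characters point concrete, here is a minimal sketch. `percentDecodeToBytes` is a hypothetical helper that assumes an ASCII input string, and `TextDecoder` is assumed to be available (modern browsers, Node.js). The same unescaped bytes yield different characters depending on which encoding is used to interpret them.

```javascript
// Percent-unescape into raw bytes; no character interpretation happens here.
function percentDecodeToBytes(escaped) {
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    const hex = escaped.slice(i + 1, i + 3);
    if (escaped[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(escaped.charCodeAt(i) & 0xff); // sketch: input assumed ASCII
    }
  }
  return Uint8Array.from(bytes);
}

const bytes = percentDecodeToBytes('%E2%80%A6FOR%20YOU');

// Interpretation 1: the bytes are UTF-8.
const asUtf8 = new TextDecoder('utf-8').decode(bytes);

// Interpretation 2: the bytes are ISO-8859-1 (byte N is code point U+00NN).
const asLatin1 = Array.from(bytes, (b) => String.fromCharCode(b)).join('');

console.log(asUtf8.length);   // 8  (ellipsis + "FOR YOU")
console.log(asLatin1.length); // 10 (three separate characters + "FOR YOU")
```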
> Comment #6
> So we end up discarding the whole thing, since we can't do the right charset
> conversions. We could probably try to recover and knowingly produce bad data
> and hope that it looks "about right", which is probably what IE does...
Please understand that I want to use an encoding like ISO-8859-1 as the general encoding. All characters that do not fit in this character set are available as a Unicode sequence, e.g. \u30D7\u30ED\u30B8\u30A7\u30AF\u30C8 (some Japanese characters). Unescaping such a sequence and applying it as innerHTML, or as the value of an HTML input element, produces strange results with different versions of Mozilla builds. Can anybody tell me why the behaviour changes from version to version? This is a nightmare!
> Comment #10
> But the page encoding here is ISO-8859-1, and Japanese
> characters can't possibly be encoded in that... so what encoding _are_ you
> using exactly? And how are we supposed to know that?
We are using ISO-8859-1 right now. I wonder why windows-1252 and ISO-8859-1 produce different results on ver. 1.1-1.4???
> By the way, the problem you are seeing with UTF-8 in old Mozilla builds is
> precisely what bug 200984 was about.
Using UTF-8 works fine with the latest builds (1.5+), but it is not really an option for us, since we are not working in a laboratory environment but dealing with the real world, and a lot of users are using versions before 1.5! All these people would be really upset if we changed to UTF-8. Any suggestions that work for all versions 1.2 through 1.5+?
> e.g. \u30D7\u30ED\u30B8\u30A7\u30AF\u30C8 (some japanese characters)
> unescapeing such a sequence
There is nothing there to unescape, is there? That's not a URL-escaped string...
> can anybody tell me why the behaviour changes from version to version?
Because the pre-1.5 behavior was buggy with non-western (and possibly with just non-ascii) chars.
> I wonder, why windows-1252 and iso-8859-1 produce different results
> on ver. 1.1-1.4???
Because they are different encodings?
> any suggestions that work for all version 1.2 through 1.5+?
You have to tell me the constraints for me to be able to answer this question... as far as I can tell, you have the following constraints:
1) The string you are passing to unescape() cannot always be represented as ISO-8859-1 (thus causing problems in 1.5).
2) Something about Japanese characters.
Please clearly explain _exactly_ what issue #2 is (what exact Unicode string you pass to unescape() and what exact unicode string you expect out and why).
Try the page with many different character encodings. The page is 'compatible' with any ASCII-preserving encoding. Characters in Java notation are converted to Unicode characters internally, so escaping and unescaping them doesn't work if the current document encoding can't represent them. What MSIE does might be either
A. not convert '\uxxxx' to Unicode characters internally until it is printed out, or
B. have escape/unescape convert characters unrepresentable in the current document encoding to Java notation (or some other representation).
If Mozilla 1.1-1.4 compatibility were not an issue, I would suggest using UTF-8. We (mozilla) might add 'Java-style notation' to nsISaveAsCharset (if it's not there yet) and use it in escape/unescape, but it wouldn't help Mozilla 1.1-1.4 users.
jshin, on a separate note, do you think we should handle conversion errors as I mention in comment 6 (fill in 0xFFFD and go on or something)? That's not so bad going native-to-unicode, but the problem here is going unicode-to-native...
bz, when going from unicode to native (in GlobalWindowImpl::Unescape), we're calling ConvertCharset which returns null on coming across an unrepresentable char. By using nsISaveAsCharset (with Java notation) instead of ConvertCharset, we can avoid that. nsUnescape would leave \uxxxx alone. Then, after converting back to Unicode, we have to replace \uxxxx with a PRUnichar corresponding to \uxxxx. The last step is not pretty, but we can copy code from JS engine or call it if it's public... What if there's literal '\uxxxx'......
nsISaveAsCharset needs to deal with that anyway, no? Preferably by escaping '\' as \uxxxx if that's the escaping method chosen...
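The approach discussed in the last two comments can be sketched roughly as follows. This is not Mozilla's code: `protectForLatin1` and `unescapeWithFallback` are hypothetical names, the page encoding is assumed to be ISO-8859-1 (so the byte-to-character step is the identity), and a literal backslash is protected by writing it as \u005C, along the lines suggested above.

```javascript
// Step 1: make the string representable in the page encoding by writing
// unrepresentable code units, and literal backslashes, as \uXXXX.
function protectForLatin1(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit > 0xff || unit === 0x5c) {
      out += '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
    } else {
      out += str[i];
    }
  }
  return out;
}

// Steps 2-3: one left-to-right pass that resolves %XX and restores \uXXXX.
// Because it is a single pass, a backslash produced by %5C is never
// rescanned and cannot be mistaken for the start of a \uXXXX escape.
function unescapeWithFallback(str) {
  return protectForLatin1(str).replace(
    /%([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})/g,
    (match, pct, uni) => String.fromCharCode(parseInt(pct || uni, 16)));
}

// The ellipsis survives even though ISO-8859-1 cannot represent it.
console.log(unescapeWithFallback('%3CSTRONG%3E\u2026FOR%20YOU') ===
  '<STRONG>\u2026FOR YOU'); // true
```

A literal backslash followed by `u0041` in the input stays literal, because step 1 rewrites the backslash itself before any escape is resolved.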
OK, to make things clearer, I prepared some test pages. Three encodings are available: utf-8, windows-1252, and iso-8859-1
http://dev.leefish.ch/utftest.jsp
http://dev.leefish.ch/1252test.jsp
http://dev.leefish.ch/iso88591test.jsp
12 tests per page (div and input): the first six fields are not escaped, the second six are. In IE and Opera 7.02 on Win32, all three encodings display the correct characters for the last six (unescaped) fields.
Addendum for Comment 17: Sorry folks: Please use the test pages without dev http://leefish.ch/utftest.jsp http://leefish.ch/1252test.jsp http://leefish.ch/iso88591test.jsp
Ok, so Jungshik's guess was right. The problem is that \uxxxx escapes are converted into unicode chars at JS compile time, so unescape('\uxxxx') can fail in Mozilla 1.5 if the unicode char in question cannot be represented in the page encoding... It would succeed in Mozilla 1.4 or earlier, but at the cost of corrupting the char (since it would "convert" the Unicode chars to bytes by simply casting each 16-bit unsigned int to an 8-bit signed int, then unescape %-escapes in the resulting bytes and convert the bytes back into chars using the page encoding). In short, doing Japanese this way through unescape() in Mozilla before 1.5 simply doesn't work (notice the garbage displayed in the last two fields of the iso88591 and 1252 test by Mozilla 1.4). It works in Mozilla 1.5 if the page encoding can encode those chars (due to the fix for bug 200984). If we make the changes Jungshik proposes, we can make it work even if the page does not support those chars.
Jungshik, could we do this in the 1.6b timeframe? How much change to the nsISaveAsCharset code would be needed? It seems to already support \uxxxx escapes (see http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/src/nsSaveAsCharset.cpp#335), but has the problem you mentioned with literal \\uxxxx being present in the string that would need addressing...
Thomas, I'm afraid I cannot offer you a solution that works in 1.2-1.5+ Mozilla builds. This is largely because there is simply no way to make the Japanese chars work in pre-1.5 builds without using an encoding in which all the Japanese chars involved are single-byte (such do not exist to my knowledge....). You basically have two options:
1) Use UTF-8. Then both Japanese and western non-ascii chars work in Mozilla 1.5 and neither really works in pre-1.5 builds.
2) Use Windows-1252. Then Japanese chars fail in all Mozilla builds and western chars work in all Mozilla builds, as you observed.
There is also "option three", which is to sniff the browser version and use UTF-8 for 1.5+ and windows-1252 for pre-1.5, which just gives you broken Japanese in pre-1.5 builds.... I don't know how your various markets compare and hence which decision makes the most sense for you. With any luck, things should work correctly in 1.6 even if you choose option 2 above. There is also the question of whether we want to try to land bug 200984 on the 1.4 branch. I doubt that would really improve the situation, though.
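The pre-1.5 corruption described in this comment can be simulated with a small sketch. `legacyUnescapeLatin1` is a hypothetical name, the page encoding is assumed to be ISO-8859-1, and the truncation is modelled by keeping the low 8 bits of each code unit, which yields the same byte values as the cast described above.

```javascript
// Simulates the old behaviour: truncate, percent-unescape, decode.
function legacyUnescapeLatin1(str) {
  // 1. "Convert" to bytes by keeping only the low 8 bits of each code unit.
  let bytes = '';
  for (let i = 0; i < str.length; i++) {
    bytes += String.fromCharCode(str.charCodeAt(i) & 0xff);
  }
  // 2. Resolve %XX escapes in the resulting byte string.
  const unescaped = bytes.replace(/%([0-9A-Fa-f]{2})/g,
    (match, hex) => String.fromCharCode(parseInt(hex, 16)));
  // 3. Decoding as ISO-8859-1 maps byte N to U+00NN, so nothing more to do.
  return unescaped;
}

// Western text with a %-escape survives...
console.log(legacyUnescapeLatin1('O%27Brien'));              // O'Brien
// ...but U+30D7 U+30ED lose their high bytes and become U+00D7 U+00ED.
console.log(legacyUnescapeLatin1('\u30D7\u30ED') === '\u00d7\u00ed'); // true
```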
Many thanks for the information. We can live with the windows-1252 charset for now, but it would be really helpful if Mozilla rendered the Unicode chars correctly in the future. I will change the bug to resolved as soon as Mozilla handles these chars correctly. thomas olaf
bz, I can't promise it will be done in the 1.6b timeframe. I'll give it a try but I have to solve real-life issues as well :-). As for nsISaveAsCharset, yes, it has '\uxxxx' already, but I guess it doesn't escape literal '\' because it's only for output and doesn't care about converting back. I wish nsISaveAsCharset::Convert had an out parameter indicating whether any character is escaped, or how many chars are escaped. [somewhat ot, but related] BTW, we might have to fix bug 44272 at the same time. If MS IE does the right thing (in regard to the ECMAScript standard) and hasn't caused compatibility problems, we should be able to do it without much worry. In that case, the last step in comment #15 has to deal with '%uxxxx' as well as '\uxxxx'. An alternative would be to add '%uxxxx' escaping to nsISaveAs and deal with only '%uxxxx' in GlobalWindowImpl:UnEscape. This addition can also be used by GlobalWindowImpl:Escape (to fix bug 44272).
I'm gonna fix this eventually. As for leefish site, I came up with a solution that should work across versions.
1. Use UTF-8
2. Instead of mixing '%xx' (for characters like the single quotation mark in O'Brien) and '\uxxxx' for Japanese, always use '\uxxxx' for the ASCII characters you currently url-escape as well as for Japanese characters. Actually, if you use UTF-8, you can put literal Japanese characters. You have to use '\uxxxx' notation only for url-unsafe ASCII characters. It should be very easy to write a Java function for this on the server side, shouldn't it?
3. With that, you don't need to call |unescape| in your ECMAscript (Javascript), so you don't have to worry about the version/browser dependency of 'escape()'
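A minimal illustration of points 2 and 3, using the O'Brien example from earlier in the thread. The variable names are made up for this sketch. Because \uXXXX escapes are resolved when the script source is parsed, the string literal already holds the final characters and no unescape() call is involved.

```javascript
// The apostrophe is written as \u0027, so the literal needs no unescape().
const viaEscapeSequence = 'O\u0027Brien is a Scottish name';

// The approach the site currently uses, for comparison.
const viaUnescape = unescape('O%27Brien is a Scottish name');

// The six Japanese characters quoted earlier in the thread, as escapes.
const japanese = '\u30D7\u30ED\u30B8\u30A7\u30AF\u30C8';

console.log(viaEscapeSequence === viaUnescape); // true
console.log(japanese.length);                   // 6
```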
> I'm gonna fix this eventually. As for leefish site, I came up with a solution
> that should work across versions.
Please do, as soon as you have dealt with the real-life issues :-)
> 1. Use UTF-8
Our customer does not have a UTF-8 database.
> 2. Instead of mixing '%xx' (for characters like single quotation as in O'Brien)
> and '\uxxxx' for Japanese, always use '\uxxxx' for ASCII characters you
> currently url-escape as well as for Japanese characters.
OK, that would be easy to do for the JavaScript character problems.
> Actually, if you use UTF-8, you can put literal Japanese characters. You have
> to use '\uxxxx' notation only for url-unsafe ASCII characters. It should be
> very easy to write a Java function for this on the server side, shouldn't it?
That's right :-)
> 3. With that, you don't need to call |unescape| in your ECMAscript(Javascript)
> so that you don't have to worry about the version/browser dependency of
> 'escape()'
- We need to use a default charset (like windows-1252). If we escaped every single non-ASCII char, the byte stream would be 6 times larger than it is now.
- There is also an extranet behind the page which needs sorting and search capabilities according to the region (e.g. regions/countries use different sorting).
- Mainly we use a default region charset (e.g. western style for Switzerland, shift-jis for Japan) for the customer. Chars which cannot be displayed by this charset are saved in escaped form (and then unescaped for display).
- For now the customer (leefish) does not need Japanese characters (maybe in 3 months), so we can live with windows-1252 for now.
- About 99% of the visitors use IE 5-6, and these browsers are doing OK, so Mozilla is not mission critical.
- We think that because IE and Opera cope with this problem/issue/bug, Mozilla should too. Maybe there are other people/companies around the globe who would be happy if Mozilla could display it.
I fixed the problem by fixing bug 44272. I'm gonna upload a patch there in a minute. Here is some clarification.
> mainly we use a default region charset (e.g western style for switzerland,
> shift-jis for japan) for the customer.
Well, I'm afraid this is not so good an idea. I would use UTF-8 everywhere from start to end. If your customer has a DB in a legacy character encoding, that would be the only point at which you have to deal with legacy character encodings. Once into your system, why bother to deal with those things of the past, especially considering that virtually all modern browsers have no problem dealing with UTF-8 (that is, site visitors would never notice the difference if you use 'lang' and 'xml:lang' to specify the language of the content correctly)? You may wish to visit http://www.w3.org/international where there are a couple of FAQ items on why to use Unicode.
> Chars which cannot displayed by this
> charset are saved in escaped form (and then unescaping them for display)
You have a problem because you use 'unescape()'. If you don't use 'unescape()', there's no problem. You do NOT have to use it as long as you use '\uxxxx' (Java notation) for a small subset of ASCII characters.
> we need to use a default charset (like windows-1252) if we would escape every
> single none ASCII char, the bytestream would be 6 times larger then yet.
As for the size bloat, note that you do NOT have to use \uxxxx for Japanese characters (or letters with diacritic marks for Western European languages) in pages encoded in Shift_JIS/EUC-JP (or Windows-1252/ISO-8859-1, respectively) or UTF-8. \uxxxx notation is only necessary in two cases:
1) to 'escape' characters that should not appear directly in a JS string literal (a _small_ subset of ASCII characters)
2) to represent characters OUTSIDE the character repertoire of the current page encoding (that is, Japanese characters in Windows-1252/ISO-8859-1 or Latin letters with diacritic marks in Shift_JIS/EUC-JP).
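The two cases listed above could be handled by a small escaper along these lines. The thread suggests a server-side Java function; this sketch uses JavaScript only to stay in the language of the page scripts. `escapeForJsLiteral` and `inLatin1` are hypothetical names, and the set of "unsafe" ASCII characters (control characters, both quote characters, backslash) is an assumption made for illustration.

```javascript
// Returns the body of a JS string literal for `text`, using \uXXXX only for
// (1) ASCII characters unsafe inside a quoted literal and
// (2) code units the page encoding cannot represent.
function escapeForJsLiteral(text, isRepresentable) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const unsafeAscii =
      unit < 0x20 || unit === 0x22 || unit === 0x27 || unit === 0x5c;
    if (unsafeAscii || !isRepresentable(unit)) {
      out += '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
    } else {
      out += text[i]; // representable and safe: emit as-is, no size bloat
    }
  }
  return out;
}

// Repertoire test for an ISO-8859-1 page (windows-1252 would differ slightly).
const inLatin1 = (unit) => unit <= 0xff;

const body = escapeForJsLiteral("O'Brien \u30D7 caf\u00e9", inLatin1);
console.log(body); // O\u0027Brien \u30D7 café
```

Only the apostrophe and the Japanese character are expanded here; the accented Latin letter passes through unchanged because the page encoding can represent it.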
Component: Layout → DOM Other
Depends on: 44272
Keywords: intl
OS: Windows XP → All
Hardware: PC → All
> I fixed the problem by fixing bug 44272. I'm gonna upload a patch there in a
> minute.
I will test it tomorrow. Today I had a f... hard working day at a customer and I'm too tired now for testing.
> Well, I'm afraid this is not so good an idea. I would use UTF-8 everywhere
> from the start to the end. If your customer has DB in legacy character
> encoding, that would be the only point at which you have to deal with legacy
> character encodings. Once into your system, why bother to deal with those
> things of the past especially considering that virtually all modern browsers
> have no problem dealing with UTF-8 (that is, site visitors would never notice
> the difference if you use 'lang' and 'xml:lang' to specify the language of the
> content correctly) You may wish to visit http://www.w3.org/international where
> there are a couple of FAQ items as to why use Unicode.
You're right, but sometimes you do not have the choice. I will visit the site and study charsets and encodings more; I must admit I'm not a guru with encodings and charsets. Many thanks! Your help and the Mozilla group's help blow away any support hotline from a normal company!!!
Thanks for your kind words and your willingness to support Mozilla. [OT] I wish Korean web admins/designers were like you. They don't have a single bit of interest in interoperability, platform/device independence, standards compliance and universal accessibility (all at the very heart of the web and internet). On bug 44272 (the fix for which has landed on the trunk), I am gonna 'lobby' for applying the patch to the 1.4 branch.
Works now on the 1.6 trunk thanks to the patch for bug 44272. It's not fixed on 1.5... are we gonna have 1.5.1? Maybe not. Not sure what to do here.
Version: Trunk → Other Branch
With Firebird nightly build 20031114, all test cases work for me :-)
http://leefish.ch/utftest.jsp
http://leefish.ch/1252test.jsp
http://leefish.ch/iso88591test.jsp
Many thanks! Will change to fixed when the bug is resolved in an "official" 1.x release.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → WORKSFORME
with moz 1.6b it works as it should :-) many thanks olaf
Status: RESOLVED → VERIFIED
Component: DOM → DOM: Core & HTML