Closed Bug 68738 (BOM) Opened 25 years ago Closed 22 years ago

UTF-XX with BOM always maps to UTF-XXBE

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla1.4beta

People

(Reporter: nhottanscp, Assigned: ftang)

References


Details

(Keywords: intl, testcase)

Attachments

(1 file, 2 obsolete files)

Currently "UTF-16" is always mapped to "UTF-16BE" during alias resolution, but it could be "UTF-16LE" depending on the data. http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties#69 We could keep "UTF-16" as-is, without mapping it to another name.
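For illustration, the BOM-aware resolution being asked for could look like the sketch below. The function name and return strings are made up for this example; this is not the actual uconv API.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: resolve the generic "UTF-16" label against the
// first bytes of the document instead of hard-aliasing it to UTF-16BE.
const char* ResolveUTF16(const unsigned char* aBytes, size_t aLen) {
  if (aLen >= 2) {
    if (aBytes[0] == 0xFF && aBytes[1] == 0xFE)
      return "UTF-16LE";   // BOM says little-endian
    if (aBytes[0] == 0xFE && aBytes[1] == 0xFF)
      return "UTF-16BE";   // BOM says big-endian
  }
  return "UTF-16BE";       // no BOM: fall back to network byte order
}
```

The point is that the label "UTF-16" stays generic and the data decides the endianness; only the BOM-less case falls back to big-endian.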
assign. Future
Status: NEW → ASSIGNED
Target Milestone: --- → Future
QA Contact: teruko → ftang
This bug seems to mean that Windows' 'save as Unicode' format (which is little-endian UTF-16 with a byte-order mark) cannot be used in web pages. If I tag my page as UTF-16, IE 5.5 reads it correctly but Mozilla attempts to read it as UTF-16BE, producing gibberish. If I tag it as UTF-16LE (and remove the BOM), Mozilla works fine but IE doesn't recognize the tag and defaults to ISO 8859-1, producing gibberish again. UTF-16 is neither UTF-16BE nor UTF-16LE. The -LE and -BE encodings NEVER have a byte-order mark, whereas text tagged as UTF-16 MAY or may not have a BOM (RFC 2781).
The same happens with UTF-32.
Summary: UTF-16 always maps to UTF-16BE → UTF-XX always maps to UTF-XXBE
Linking testcases made by Jungshik Shin.
There are too many testcases on that page, this bug is concerned only with these 2: http://jshin.net/i18n/utftest/bom_utf16le.utf16.html http://jshin.net/i18n/utftest/bom_utf32le.utf32.html
The problem is with pages that contain a BOM. We do the correct thing with BOM-less pages, which should be treated as BE. See: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4
Summary: UTF-XX always maps to UTF-XXBE → UTF-XX with BOM always maps to UTF-XXBE
*** Bug 112690 has been marked as a duplicate of this bug. ***
Always assuming BE for data without a BOM is not in the spirit of Annex F of the XML recommendation: http://www.w3.org/TR/REC-xml#sec-guessing It recommends checking whether the first character is '<', in any of the UTFs (i.e. padded with null bytes). Even though this annex is not normative, it sounds like a sensible thing to do. If the autodetection fails, you can still fall back to UTF-16BE. In a strict interpretation of this annex, you would require '<?' for autodetection to work, but I think the scheme readily extends to HTML.
*** Bug 175083 has been marked as a duplicate of this bug. ***
Frank, could we get this fixed within milestone 1.3 perhaps? We are starting to get dupes so people clearly are affected by this.
Alias: BOM
Attached patch a patch (obsolete) — Splinter Review
With this patch, all the test cases given at the page in the URL field get rendered as intended.
> Always assuming BE on data without a BOM is not in the spirit of Annex
> F of the XML recommendation:

I fully agree with Martin that the spirit of the XML recommendation is to try as hard as possible to detect the endianness. UTF-16/UTF-32 should not be taken to mean UTF-16BE/UTF-32BE by default. Only after all those efforts fail, as a last resort, should we assume network byte order (BE). My patch implements that, and all 48(?) test cases on my test page render successfully with it.
Attachment #108664 - Flags: review?(ftang)
Sorry for spamming. I didn't realize that I wasn't on the CC list.
This looks very good, and it also covers <!-- and <![CDATA - everything you'd want. But have a look at my solution in bug 183354. It's a very similar kind of code we're navigating, but in different places. I may be wrong, but the way I see it, BOM detection should be left to nsUniversalDetector.cpp, and BOM-less detection to nsParser.cpp. UniversalDetector needs BOM detection because a document isn't necessarily markup. text/plain UTF needs detection too, and it doesn't go through the Parser? And if markup goes through the Universal Auto Detector first, and you have detection in the Parser as well - isn't that doing the same work twice? This is merely speculation, for I don't know which path the document takes in different cases. It just sounds logical, but I might be wrong.
What a coincidence ! I was about to look at it because I was also concerned about text/plain, but you got there first :-) Thank you for the note. According to nsIParser.h, 'autodetection' has a lower priority than 'BOM detection' (see the set of macro definitions in the file). I haven't yet checked the code path followed by html/xml documents, but we may need both (yours and mine). I'll try your patch after removing BOM detection in nsParser and see what happens... Even if we don't need BOM detection in nsParser, it may be good to keep it just for the sake of robustness....
I haven't done the experiment mentioned in my prev. comment (my mozilla build got kinda screwed up and I'm clobbering and rebuilding it). Anyway, I think we have to put both patches in, because the universal detection is not always on (see View|Character Coding|AutoDetection menu). Perhaps a separate bug might as well be filed (if deemed necessary) to make Mozilla always detect the BOM even with autodetection off (it does for html/xml in nsParser.cpp, but not for text/plain). Alternatively, a new entry may be added to View|Character Coding|AutoDetection to enable BOM detection alone, with other detection turned off. Again, this is only useful for non-html/non-xml files. BTW, there are a couple of other places where the BOM is searched for. One of them is in intl/chardet, which is for language-specific (as opposed to 'universal') charset detection, I believe.
cc-ing Hixie, because we were discussing this issue on IRC.

> I may be wrong, but the way I see it - BOM detection should be left to
> nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp

I think exactly the opposite, but we probably need to define the different cases more systematically.
Yes, I was actually thinking the same thing yesterday about the user switching off auto detection completely. So I agree, it should be in both places. And I'd hope that the code avoids redundancy, so that if the charset is already known through the autodetector, the parser won't bother detecting it. But in the case of UTF-16 and UTF-32 (when the charset is already provided), what drives it to detect the endianness? By the way, there is at least one more place where this kind of code can be found. I think either the JavaScript or the CSS parser also does UTF detection.
Just a thought: to make BOM-less detection completely bulletproof, we could also skip leading whitespace before doing detection. Worth the bother?
> > I may be wrong, but the way I see it - BOM detection should be left to
> > nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp
> I think exactly the opposite,

I sorta agree, but the answer really depends on what 'BOM-less detection' is. If it's a simple check of the first four octets, it has to be done in nsParser and only works for html/xml. On the other hand, a more 'invasive' detection method (reading the first few tens of octets and analyzing the octet pattern, e.g. '00 xx 00 xx 00 xx 00 xx...' for UTF-16BE) that works for text/plain as well as for html/xml belongs in the universal character detection module. As for BOM detection, Alexey may have reached that conclusion because the BOM is not part of html/xml per se but is a signature for the endianness, and is found not only in html/xml but in text/plain as well. In any case, because the universal character detection can be turned off, I think we need to keep a simple 'BOM-less' detection as well as BOM detection in nsParser. Now, what should we do to avoid repeating the code in multiple modules? Do we need a method/function, shared across modules, that returns the endianness given four octets?
I think the XML recommendation requires that only the first four octets be checked. I should check the spec to make sure, but if so, then the more generic approach is not appropriate for XML.
Blocks: 182796
Comment on attachment 108664 [details] [diff] [review] a patch

It looks like http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info adds conditions beyond http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing. Your patch does not cover the following cases mentioned in XML 1.0 2nd edition:

00 00 FF FE  UCS-4, unusual octet order (2143)
FE FF 00 00  UCS-4, unusual octet order (3412)
Attachment #108664 - Flags: review?(ftang) → review-
Frank's concern addressed.
Attachment #108664 - Attachment is obsolete: true
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added

+ // 3C 00 XX 00
+ if(IS_SECOND_MARKER(aBytes[2])) {
+ // 00 3C SM 00 UTF-16, little-endian, no Byte Order Mark

There's a typo in the comment above:
// 00 3C SM 00 UTF-16, little...
should read
// 3C 00 SM 00 UTF-16, little...
The actual code is not affected. I have tested all the usual BO UTF-16/32's, but haven't tested the unusual BO cases. Do we have decoders for them? Anyway, the detection should work for them, too.
Attachment #115591 - Attachment description: a new patch with two unusual BO UCS-4 added → a new patch with two unusual BO UCS-4's added
Attachment #115591 - Flags: review?(ftang)
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added r=ftang
Attachment #115591 - Flags: review?(ftang) → review+
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added asking for sr. let's try to put this before 1.4
Attachment #115591 - Flags: superreview?(heikki)
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added

+ oCharset.AssignWithConversion(UCS2_LE); // should change to UTF-16BE

Typo in the comment? Besides, what do those comments mean anyways? What should be changed?
Changing milestone to 1.4beta to reflect that this is what we are trying to achieve. I've been going back and forth on this because intl is not really my speciality. Adding blizzard to Cc 'cos he seems to be (according to sr list).

Anyway, I think using an obscure function like PL_strnpbrk just to test whether a character matches one of four possible characters seems like overkill. Why not write a simple inline function for this, for example:

static inline PRBool IsSecondMarker(unsigned char aChar)
{
  switch (aChar) {
    case '!':
    case '?':
    case 'h':
    case 'H':
      return PR_TRUE;
  }
  return PR_FALSE;
}

Something that I am concerned about is that for XML the detection must be very strict; does this IsSecondMarker stuff change that? http://www.w3.org/TR/REC-xml#sec-guessing
Target Milestone: Future → mozilla1.4beta
> Why not write a simple inline function for this, for example:

Thanks for the suggestion. Done in my tree. I'll upload a new patch after rebuilding Mozilla. Another possibility is just using memcmp instead of a series of switch/if-else statements, as suggested by Chris in bug 183354. If memcmp is sorta 'inlined', it may be faster (I read in another bug that 'VC++ -O1' doesn't do that).

> does this IsSecondMarker stuff change that?
> http://www.w3.org/TR/REC-xml#sec-guessing

No, I don't think so. The reason I added it is that without it, Mozilla fails to detect the encoding of html docs in BOM-less UTF-16LE/UTF-16BE. The code I'm changing deals with both html and xml. See also comment #8, comment #12, comment #14.

> Besides, what do those comments mean anyways?

I'm not sure ...

> What should be changed?

I just kept them. Anyway, in the new patch I'm gonna upload, I removed them, replacing UCS2_LE|BE with UTF16_LE|BE.
Basically the same patch. Can I get r/sr from Frank and Heikki? This has been sitting for quite a while. Let's try to get this in for 1.4beta as Heikki suggested.
Attachment #115591 - Attachment is obsolete: true
Comment on attachment 119949 [details] [diff] [review] a new patch with inline function for SM detection

+ oCharset.AssignWithConversion(UCS4_LE); // should change to UTF-32LE
Comment on attachment 119949 [details] [diff] [review] a new patch with inline function for SM detection You don't need |const| in the parameter to IsSecondMarker().
Attachment #119949 - Flags: superreview+
Thank you all. Fix just checked in. In the patch checked in, I removed |const|. I don't know what I was thinking :-) I also removed the comment pointed out by Alexey.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified. Although it overrides Universal Charset Detector now, even for text/plain files. (See bug 183354)
Status: RESOLVED → VERIFIED
The HTML parser isn't the only consumer of encoders/decoders. This patch made the CSS parser not support UTF-16 anymore. See bug 235090.
Even without this patch, UTF-16 (with BOM indicating LE) didn't work because we mapped UTF-16 to UTF-16BE in charset.aliases.