Closed Bug 68738 (BOM) Opened 25 years ago Closed 22 years ago

UTF-XX with BOM always maps to UTF-XXBE

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla1.4beta

People

(Reporter: nhottanscp, Assigned: ftang)

References


Details

(Keywords: intl, testcase)

Attachments

(1 file, 2 obsolete files)

Currently "UTF-16" is always mapped to "UTF-16BE" during alias resolution, but it could be "UTF-16LE" depending on the data. http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties#69 We could keep "UTF-16" as-is, without mapping it to another name.
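For illustration, the BOM-aware resolution being asked for could look like the sketch below. The function name and return strings are made up for this example; this is not the actual uconv API.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: resolve the generic "UTF-16" label against the
// first bytes of the document instead of hard-aliasing it to UTF-16BE.
const char* ResolveUTF16(const unsigned char* aBytes, size_t aLen) {
  if (aLen >= 2) {
    if (aBytes[0] == 0xFF && aBytes[1] == 0xFE)
      return "UTF-16LE";   // BOM says little-endian
    if (aBytes[0] == 0xFE && aBytes[1] == 0xFF)
      return "UTF-16BE";   // BOM says big-endian
  }
  return "UTF-16BE";       // no BOM: fall back to network byte order
}
```

The point is that the label "UTF-16" stays generic and the data decides the endianness; only the BOM-less case falls back to big-endian.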
assign. Future
Status: NEW → ASSIGNED
Target Milestone: --- → Future
QA Contact: teruko → ftang
This bug seems to mean that Windows' 'save as Unicode' format (which is little-endian UTF-16 with a byte-order mark) cannot be used in web pages. If I tag my page as UTF-16, IE 5.5 reads it correctly but Mozilla attempts to read it as UTF-16BE, producing gibberish. If I tag it as UTF-16LE (and remove the BOM), Mozilla works fine but IE doesn't recognize the tag and defaults to ISO 8859-1, producing gibberish again. UTF-16 is neither UTF-16BE nor UTF-16LE. The -LE and -BE encodings NEVER have a byte-order mark, whereas text tagged as UTF-16 MAY or may not have a BOM (RFC 2781).
The same happens with UTF-32.
Summary: UTF-16 always maps to UTF-16BE → UTF-XX always maps to UTF-XXBE
Linking testcases made by Jungshik Shin.
There are too many testcases on that page, this bug is concerned only with these 2: http://jshin.net/i18n/utftest/bom_utf16le.utf16.html http://jshin.net/i18n/utftest/bom_utf32le.utf32.html
The problem is with pages that contain a BOM. We do the correct thing with BOM-less pages, which should be treated as BE. See: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4
Summary: UTF-XX always maps to UTF-XXBE → UTF-XX with BOM always maps to UTF-XXBE
*** Bug 112690 has been marked as a duplicate of this bug. ***
Always assuming BE for data without a BOM is not in the spirit of Annex F of the XML recommendation: http://www.w3.org/TR/REC-xml#sec-guessing It recommends checking whether the first character is '<', in any of the UTFs (i.e. padded with null bytes). Even though this annex is not normative, it sounds like a sensible thing to do. If the autodetection fails, you can still fall back to UTF-16BE. In a strict interpretation of this annex, you would require '<?' for autodetection to work, but I think the scheme readily extends to HTML.
*** Bug 175083 has been marked as a duplicate of this bug. ***
Frank, could we get this fixed within milestone 1.3 perhaps? We are starting to get dupes so people clearly are affected by this.
Alias: BOM
Attached patch a patch (obsolete) — Splinter Review
With this patch, all the test cases given at the page in the URL field get rendered as intended.
> Always assuming BE on data without a BOM is not in the spirit of Annex
> F of the XML recommendation:

I fully agree with Martin that the spirit of the XML recommendation is to try as hard as possible to detect the endianness. UTF-16/UTF-32 should not be taken to mean UTF-16BE/UTF-32BE by default. Only after all those efforts fail, as a last resort, should we assume network byte order (BE). My patch implements that, and all 48(?) test cases on my test page render successfully with it.
Attachment #108664 - Flags: review?(ftang)
Sorry for spamming. I didn't realize that I wasn't on the CC list.
This looks very good, and it also covers <!-- and <![CDATA - everything you'd want. But have a look at my solution in bug 183354. It's a very similar kind of code we're navigating, but in different places. I may be wrong, but the way I see it, BOM detection should be left to nsUniversalDetector.cpp, and BOM-less detection to nsParser.cpp. UniversalDetector needs BOM detection because a document isn't necessarily markup. text/plain UTF needs detection too, and it doesn't go through the Parser? And if markup goes through the Universal Auto Detector first, and you have detection in the Parser as well - isn't that doing the same work twice? This is merely speculation, for I don't know which path the document takes in different cases. It just sounds logical, but I might be wrong.
What a coincidence ! I was about to look at it because I was also concerned about text/plain, but you got there first :-) Thank you for the note. According to nsIParser.h, 'autodetection' has a lower priority than 'BOM detection' (see the set of macro definitions in the file). I haven't yet checked the code path followed by html/xml documents, but we may need both (yours and mine). I'll try your patch after removing BOM detection in nsParser and see what happens... Even if we don't need BOM detection in nsParser, it may be good to keep it just for the sake of robustness....
I haven't done the experiment mentioned in my prev. comment (my mozilla build got kinda screwed up and I'm clobbering and rebuilding it). Anyway, I think we have to put both patches in, because the universal detection is not always on (see View|Character Coding|AutoDetection menu). Perhaps a separate bug might as well be filed (if deemed necessary) to make Mozilla always detect the BOM even with autodetection off (it does for html/xml in nsParser.cpp, but not for text/plain). Alternatively, a new entry may be added to View|Character Coding|AutoDetection to enable BOM detection alone, with other detection turned off. Again, this is only useful for non-html/non-xml files. BTW, there are a couple of other places where the BOM is searched for. One of them is in intl/chardet, which is for language-specific (as opposed to 'universal') charset detection, I believe.
cc-ing Hixie, because we were discussing this issue on IRC.

> I may be wrong, but the way I see it - BOM detection should be left to
> nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp

I think exactly the opposite, but we probably need to define the different cases more systematically.
Yes, I was actually thinking the same thing yesterday about the user switching off auto detection completely. So I agree, it should be in both places. And I'd hope that the code avoids redundancy, so that if the charset is already known through the autodetector, the parser won't bother detecting it. But in the case of UTF-16 and UTF-32 (when the charset is already provided), what drives it to detect the endianness? By the way, there is at least one more place where this kind of code can be found. I think either the JavaScript or the CSS parser also does UTF detection.
Just a thought: to make BOM-less detection completely bulletproof, we could also skip leading whitespace before doing detection. Worth the bother?
> > I may be wrong, but the way I see it - BOM detection should be left to
> > nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp
> I think exactly the opposite,

I sorta agree, but the answer really depends on what 'BOM-less detection' is. If it's a simple check of the first four octets, it has to be done in nsParser and only works for html/xml. On the other hand, a more 'invasive' detection method (reading the first few tens of octets and analyzing the octet pattern, e.g. '00 xx 00 xx 00 xx 00 xx...' for UTF-16BE) that works for text/plain as well as for html/xml belongs in the universal character detection module. As for BOM detection, Alexey may have reached that conclusion because the BOM is not part of html/xml per se but is a signature for the endianness, and is found not only in html/xml but in text/plain as well. In any case, because the universal character detection can be turned off, I think we need to keep a simple 'BOM-less' detection as well as BOM detection in nsParser. Now, what should we do to avoid repeating the code in multiple modules? Do we need a method/function, shared across modules, that returns the endianness given four octets?
I think the XML recommendation requires that only the first four octets be checked. I should check the spec to make sure, but if so, then the more generic approach is not appropriate for XML.
Blocks: 182796
Comment on attachment 108664 [details] [diff] [review] a patch

It looks like http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info adds conditions beyond http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing. Your patch does not cover the following cases mentioned in XML 1.0 2nd edition:

00 00 FF FE  UCS-4, unusual octet order (2143)
FE FF 00 00  UCS-4, unusual octet order (3412)
Attachment #108664 - Flags: review?(ftang) → review-
Frank's concern addressed.
Attachment #108664 - Attachment is obsolete: true
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added

+ // 3C 00 XX 00
+ if(IS_SECOND_MARKER(aBytes[2])) {
+ // 00 3C SM 00 UTF-16, little-endian, no Byte Order Mark

There's a typo in the comment above:
// 00 3C SM 00 UTF-16, little...
should read
// 3C 00 SM 00 UTF-16, little...
The actual code is not affected. I have tested all the usual BO UTF-16/32's, but haven't tested the unusual BO cases. Do we have decoders for them? Anyway, the detection should work for them, too.
Attachment #115591 - Attachment description: a new patch with two unusual BO UCS-4 added → a new patch with two unusual BO UCS-4's added
Attachment #115591 - Flags: review?(ftang)
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added r=ftang
Attachment #115591 - Flags: review?(ftang) → review+
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added asking for sr. let's try to put this before 1.4
Attachment #115591 - Flags: superreview?(heikki)
Comment on attachment 115591 [details] [diff] [review] a new patch with two unusual BO UCS-4's added

+ oCharset.AssignWithConversion(UCS2_LE); // should change to UTF-16BE

Typo in the comment? Besides, what do those comments mean anyways? What should be changed?
Changing milestone to 1.4beta to reflect that this is what we are trying to achieve. I've been going back and forth on this because intl is not really my speciality. Adding blizzard to Cc 'cos he seems to be (according to sr list).

Anyway, I think using an obscure function like PL_strnpbrk just to test whether a character matches one of four possible characters seems like overkill. Why not write a simple inline function for this, for example:

static inline PRBool IsSecondMarker(unsigned char aChar)
{
  switch (aChar) {
    case '!':
    case '?':
    case 'h':
    case 'H':
      return PR_TRUE;
  }
  return PR_FALSE;
}

Something that I am concerned about is that for XML the detection must be very strict; does this IsSecondMarker stuff change that? http://www.w3.org/TR/REC-xml#sec-guessing
Target Milestone: Future → mozilla1.4beta
> Why not write a simple inline function for this, for example:

Thanks for the suggestion. Done in my tree. I'll upload a new patch after rebuilding Mozilla. Another possibility is just using memcmp instead of a series of switch/if-else statements, as suggested by Chris in bug 183354. If memcmp is sorta 'inlined', it may be faster (I read in another bug that 'VC++ -O1' doesn't do that).

> does this IsSecondMarker stuff change that?
> http://www.w3.org/TR/REC-xml#sec-guessing

No, I don't think so. The reason I added it is that without it, Mozilla fails to detect the encoding of html docs in BOM-less UTF-16LE/UTF-16BE. The code I'm changing deals with both html and xml. See also comment #8, comment #12, comment #14.

> Besides, what do those comments mean anyways?

I'm not sure ...

> What should be changed?

I just kept them. Anyway, in the new patch I'm gonna upload, I removed them, replacing UCS2_LE|BE with UTF16_LE|BE.
Basically the same patch. Can I get r/sr from Frank and Heikki? This has been sitting for quite a while. Let's try to get this in for 1.4beta as Heikki suggested.
Attachment #115591 - Attachment is obsolete: true
Comment on attachment 119949 [details] [diff] [review] a new patch with inline function for SM detection

+ oCharset.AssignWithConversion(UCS4_LE); // should change to UTF-32LE
Comment on attachment 119949 [details] [diff] [review] a new patch with inline function for SM detection You don't need |const| in the parameter to IsSecondMarker().
Attachment #119949 - Flags: superreview+
Thank you all. Fix just checked in. In the patch checked in, I removed |const|. I don't know what I was thinking :-) I also removed the comment pointed out by Alexey.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified. Although it overrides Universal Charset Detector now, even for text/plain files. (See bug 183354)
Status: RESOLVED → VERIFIED
The HTML parser isn't the only consumer of encoders/decoders. This patch made the CSS parser not support UTF-16 anymore. See bug 235090.
Even without this patch, UTF-16 (with BOM indicating LE) didn't work because we mapped UTF-16 to UTF-16BE in charset.aliases.