Closed Bug 183354 Opened 23 years ago Closed 23 years ago

Make Universal Charset Autodetector recognise UTF by BOM

Categories

(Core :: Internationalization, defect)

x86
All
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla1.4beta

People

(Reporter: alexeyc2003, Assigned: alexeyc2003)

References

()

Details

(Keywords: intl)

Attachments

(2 files, 2 obsolete files)

At the moment Universal Charset Autodetection only recognises UTF-16XX by it's BOM. It should recognise all UTF by their BOM, see: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ For testcases see: http://jshin.net/i18n/utftest/ In particular all these should be recognisable: http://jshin.net/i18n/utftest/bom_utf16be.html http://jshin.net/i18n/utftest/bom_utf16le.html http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html patch coming up.
Attached patch patch (obsolete) — Splinter Review
The patch recognises BOM in following encodings: UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE I have no build environment so I haven't tested it. I assumed that "UTF-8", "UTF-32BE" and "UTF-32LE" is meaningfull output, please see that it is so. I increased the minimum data size for BOM detection from 2 to 4 bytes.
taking the bug also see related patch in bug 68738
Assignee: smontagu → alexey
Target Milestone: --- → mozilla1.3beta
Status: NEW → ASSIGNED
Attachment #108117 - Flags: superreview?(blizzard)
Attachment #108117 - Flags: review?(ftang)
Comment on attachment 108117 [details] [diff] [review] patch >+ if (aLen > 3) ... >+ if ((0x00 == aBuf[1]) && (0xFE == aBuf[2]) && (0xFF == aBuf[3])) >+ // 00 00 FE FF UTF-32, big-endian BOM This is going to run off the end of the buffer if aLen == 3. Also, instead of mixed switch() and if() why not just use memcmp() and if() statements? It will probably be more readable.
Attachment #108117 - Flags: superreview?(blizzard) → superreview-
Comment on attachment 108117 [details] [diff] [review] patch sorry, I mis-read that
Attachment #108117 - Flags: superreview- → superreview+
Blocks: 182796
Comment on attachment 108117 [details] [diff] [review] patch + case 0xFE: + if (0xFF == aBuf[1]) + // FE FF UTF-16, big endian BOM + mDetectedCharset = "UTF-16BE"; not necessary true. What about the case FE FF 00 00 ? + case 0xFF: + if ((0xFE == aBuf[1]) && (0x00 == aBuf[2]) && (0x00 == aBuf[3])) This may cause access read violation since + if (aLen > 3) instead of + if (aLen > 4)
Attachment #108117 - Flags: review?(ftang) → review-
This patch just recognises all 5 UTF encodings by BOM instead of just 2 in the current code. When would FE FF 00 00 be encountered? And what should Mozilla do here in that case? And I can't see how aBuf[3] can cause access read violation in at least 4 byte long array?
>When would FE FF 00 00 be encountered? looks at http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info for the case
>how aBuf[3] can cause access read violation in at least 4 byte long array? you are right, I didn't see that if(aLen >3)
But this is not nessesarily markup. This could be a plaintext document. Or should the markup specific checks still be done? And what values should be assigned to mDetectedCharset then? By the way, I couldn't find what is receiving these values, and what are valid ones?
>But this is not nessesarily markup. This could be a plaintext document. >Or should the markup specific checks still be done? My ref to the XML spec it to anwer your question about "When would FE FF 00 00 be encountered" FE FF 00 00 is a BOM, not a mark up at all, it still apply to plain text. >And what values should be assigned to mDetectedCharset then? >By the way, I couldn't find what is receiving these values, and what are valid ones? 1987 #define UCS4_2143 "X-ISO-10646-UCS-4-2143" 1988 #define UCS4_3412 "X-ISO-10646-UCS-4-3412" as defined in htmlparser/nsParser.cpp I think
I don't think those nsParser strings are used here. They seem to be just for internal use in it's own detector function. And their definitions are not used anywhere else in the code. The Universal CharDet however seems to be using (via nsICharsetAlias.h): http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties And there are no entries for these unusual encodings in it. Moreover: http://lxr.mozilla.org/mozilla/source/intl/uconv/ucvlatin/nsUTF32ToUnicode.cpp doesn't seem to handle it either. And I can't find a converter/decoder in LXR that handles these. I have no idea what nsParser relies on to handle those. They seem to be useless. There is no mention of these in the UTF-32 spec: http://www.unicode.org/unicode/reports/tr19/ or in RFC: http://www.ietf.org/rfc/rfc3023.txt Or anywhere else: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4 apart from the XML spec: http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info And XML spec does not define UFT-32 or how it should be generally handled.
Besides, If BOM is an unusual FE FF 00 00, we still can't tell wether it is a (3412) or (3421) order. Do you know of any actual machines that operate with such byte orders? Do we have a backend to support it? Do we really *have* to support it when there is no mention of it in UTF-32 spec?
Attachment #108117 - Attachment is obsolete: true
Attachment #119607 - Flags: superreview?(blizzard)
Attachment #119607 - Flags: review?(ftang)
Target Milestone: mozilla1.3beta → mozilla1.4beta
Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added r=ftang
Attachment #119607 - Flags: review?(ftang) → review+
Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added sr=blizzard
Attachment #119607 - Flags: superreview?(blizzard) → superreview+
Checking in nsUniversalDetector.cpp; /cvsroot/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp,v <-- nsUniversalDetector.cpp new revision: 1.14; previous revision: 1.13 done (I added the missing "if" on the "else (..)" line)
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
however, after checking in, I believe this fix is not quite right. See these compile warnings: /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:150: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:153: warning: comparison is always false due to limited range of data type You just can't assume that a char is unsigned. Maybe this function should take a "const PRUint8*" isntead of "const char*".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
just some friendly reminders: //If the data starts with BOM, we know it is UTF 'BOM' needs an article 'a' 'data' doesn't match 'is' the line could be a sentence you could replace ',' with ' then'
This warning is also fun: 5.extensions/universalchardet/src/nsUniversalDetector.cpp:133 (See build log excerpt) Overflow in implicit constant conversion This is on the "case 0xFE" line. It's taking the 0xFE and casting it to "char". Yes, you can't assume that "char" is unsigned. It's not.
Attached patch fix (obsolete) — Splinter Review
Comment on attachment 119979 [details] [diff] [review] fix smontagu/dbaron still don't like data(pl) starts(sing) and i'm not fond of the end of that line, but we'll agree on something before someone checks in
Attachment #119979 - Flags: review+
Comment on attachment 119979 [details] [diff] [review] fix Um, |const unsigned char*| and |const char*| shouldn't be compatible types on platforms where |char| is signed, right? Also, "data start", not "data starts".
Attachment #119979 - Flags: superreview-
For what it's worth, the gcc 3.2.2 warnings are /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp: In member function `virtual void nsUniversalDetector::HandleData(const char*, unsigned int)': /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:128: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:133: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type
Attachment #119979 - Attachment is obsolete: true
Attachment #119985 - Flags: superreview?(bzbarsky)
Attachment #119985 - Flags: review?(smontagu)
Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings r=smontagu
Attachment #119985 - Flags: review?(smontagu) → review+
Attachment #119985 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings Checked in to trunk, 2003-04-09 14:13 PDT.
Due to bug 68738 check in that happened on the same day, there is no way to test this. Testing shows, that parser code runs even for text/plain files, and this detector is pretty much useless now. Anyways, just assume fixed. R.I.P Universal Detector.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Mark as verified. Pages: http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html are displayed fine on 05-08 trunk build while they were not in N7.02.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: