Closed
Bug 183354
Opened 22 years ago
Closed 21 years ago
Make Universal Charset Autodetector recognise UTF by BOM
Categories
(Core :: Internationalization, defect)
Tracking
()
VERIFIED
FIXED
mozilla1.4beta
People
(Reporter: alexeyc2003, Assigned: alexeyc2003)
References
()
Details
(Keywords: intl)
Attachments
(2 files, 2 obsolete files)
2.19 KB,
patch
|
ftang
:
review+
blizzard
:
superreview+
|
Details | Diff | Splinter Review |
3.14 KB,
patch
|
smontagu
:
review+
bzbarsky
:
superreview+
|
Details | Diff | Splinter Review |
At the moment Universal Charset Autodetection only recognises UTF-16XX by it's BOM. It should recognise all UTF by their BOM, see: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ For testcases see: http://jshin.net/i18n/utftest/ In particular all these should be recognisable: http://jshin.net/i18n/utftest/bom_utf16be.html http://jshin.net/i18n/utftest/bom_utf16le.html http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html patch coming up.
Assignee | ||
Comment 1•22 years ago
|
||
The patch recognises BOM in following encodings: UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE I have no build environment so I haven't tested it. I assumed that "UTF-8", "UTF-32BE" and "UTF-32LE" is meaningfull output, please see that it is so. I increased the minimum data size for BOM detection from 2 to 4 bytes.
Assignee | ||
Comment 2•22 years ago
|
||
taking the bug also see related patch in bug 68738
Assignee: smontagu → alexey
Target Milestone: --- → mozilla1.3beta
Assignee | ||
Updated•22 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Updated•22 years ago
|
Attachment #108117 -
Flags: superreview?(blizzard)
Attachment #108117 -
Flags: review?(ftang)
Comment 3•22 years ago
|
||
Comment on attachment 108117 [details] [diff] [review] patch >+ if (aLen > 3) ... >+ if ((0x00 == aBuf[1]) && (0xFE == aBuf[2]) && (0xFF == aBuf[3])) >+ // 00 00 FE FF UTF-32, big-endian BOM This is going to run off the end of the buffer if aLen == 3. Also, instead of mixed switch() and if() why not just use memcmp() and if() statements? It will probably be more readable.
Attachment #108117 -
Flags: superreview?(blizzard) → superreview-
Comment 4•22 years ago
|
||
Comment on attachment 108117 [details] [diff] [review] patch sorry, I mis-read that
Attachment #108117 -
Flags: superreview- → superreview+
Comment 5•21 years ago
|
||
Comment on attachment 108117 [details] [diff] [review] patch + case 0xFE: + if (0xFF == aBuf[1]) + // FE FF UTF-16, big endian BOM + mDetectedCharset = "UTF-16BE"; not necessary true. What about the case FE FF 00 00 ? + case 0xFF: + if ((0xFE == aBuf[1]) && (0x00 == aBuf[2]) && (0x00 == aBuf[3])) This may cause access read violation since + if (aLen > 3) instead of + if (aLen > 4)
Attachment #108117 -
Flags: review?(ftang) → review-
Assignee | ||
Comment 6•21 years ago
|
||
This patch just recognises all 5 UTF encodings by BOM instead of just 2 in the current code. When would FE FF 00 00 be encountered? And what should Mozilla do here in that case? And I can't see how aBuf[3] can cause access read violation in at least 4 byte long array?
Comment 7•21 years ago
|
||
>When would FE FF 00 00 be encountered? looks at http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info for the case
Comment 8•21 years ago
|
||
>how aBuf[3] can cause access read violation in at least 4 byte
long array?
you are right, I didn't see that if(aLen >3)
Assignee | ||
Comment 9•21 years ago
|
||
But this is not nessesarily markup. This could be a plaintext document. Or should the markup specific checks still be done? And what values should be assigned to mDetectedCharset then? By the way, I couldn't find what is receiving these values, and what are valid ones?
Comment 10•21 years ago
|
||
>But this is not nessesarily markup. This could be a plaintext document. >Or should the markup specific checks still be done? My ref to the XML spec it to anwer your question about "When would FE FF 00 00 be encountered" FE FF 00 00 is a BOM, not a mark up at all, it still apply to plain text. >And what values should be assigned to mDetectedCharset then? >By the way, I couldn't find what is receiving these values, and what are valid ones? 1987 #define UCS4_2143 "X-ISO-10646-UCS-4-2143" 1988 #define UCS4_3412 "X-ISO-10646-UCS-4-3412" as defined in htmlparser/nsParser.cpp I think
Assignee | ||
Comment 11•21 years ago
|
||
I don't think those nsParser strings are used here. They seem to be just for internal use in it's own detector function. And their definitions are not used anywhere else in the code. The Universal CharDet however seems to be using (via nsICharsetAlias.h): http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties And there are no entries for these unusual encodings in it. Moreover: http://lxr.mozilla.org/mozilla/source/intl/uconv/ucvlatin/nsUTF32ToUnicode.cpp doesn't seem to handle it either. And I can't find a converter/decoder in LXR that handles these. I have no idea what nsParser relies on to handle those. They seem to be useless. There is no mention of these in the UTF-32 spec: http://www.unicode.org/unicode/reports/tr19/ or in RFC: http://www.ietf.org/rfc/rfc3023.txt Or anywhere else: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4 apart from the XML spec: http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info And XML spec does not define UFT-32 or how it should be generally handled.
Assignee | ||
Comment 12•21 years ago
|
||
Besides, If BOM is an unusual FE FF 00 00, we still can't tell wether it is a (3412) or (3421) order. Do you know of any actual machines that operate with such byte orders? Do we have a backend to support it? Do we really *have* to support it when there is no mention of it in UTF-32 spec?
Assignee | ||
Comment 13•21 years ago
|
||
Attachment #108117 -
Attachment is obsolete: true
Assignee | ||
Updated•21 years ago
|
Attachment #119607 -
Flags: superreview?(blizzard)
Attachment #119607 -
Flags: review?(ftang)
Assignee | ||
Updated•21 years ago
|
Target Milestone: mozilla1.3beta → mozilla1.4beta
Comment 14•21 years ago
|
||
Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added r=ftang
Attachment #119607 -
Flags: review?(ftang) → review+
Comment 15•21 years ago
|
||
Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added sr=blizzard
Attachment #119607 -
Flags: superreview?(blizzard) → superreview+
Comment 16•21 years ago
|
||
Checking in nsUniversalDetector.cpp; /cvsroot/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp,v <-- nsUniversalDetector.cpp new revision: 1.14; previous revision: 1.13 done (I added the missing "if" on the "else (..)" line)
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Comment 17•21 years ago
|
||
however, after checking in, I believe this fix is not quite right. See these compile warnings: /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:150: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:153: warning: comparison is always false due to limited range of data type You just can't assume that a char is unsigned. Maybe this function should take a "const PRUint8*" isntead of "const char*".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 18•21 years ago
|
||
just some friendly reminders: //If the data starts with BOM, we know it is UTF 'BOM' needs an article 'a' 'data' doesn't match 'is' the line could be a sentence you could replace ',' with ' then'
Comment 19•21 years ago
|
||
This warning is also fun: 5.extensions/universalchardet/src/nsUniversalDetector.cpp:133 (See build log excerpt) Overflow in implicit constant conversion This is on the "case 0xFE" line. It's taking the 0xFE and casting it to "char". Yes, you can't assume that "char" is unsigned. It's not.
Assignee | ||
Comment 20•21 years ago
|
||
Comment 21•21 years ago
|
||
Comment on attachment 119979 [details] [diff] [review] fix smontagu/dbaron still don't like data(pl) starts(sing) and i'm not fond of the end of that line, but we'll agree on something before someone checks in
Attachment #119979 -
Flags: review+
Comment on attachment 119979 [details] [diff] [review] fix Um, |const unsigned char*| and |const char*| shouldn't be compatible types on platforms where |char| is signed, right? Also, "data start", not "data starts".
Attachment #119979 -
Flags: superreview-
For what it's worth, the gcc 3.2.2 warnings are /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp: In member function `virtual void nsUniversalDetector::HandleData(const char*, unsigned int)': /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:128: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:133: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type
Attachment #119979 -
Attachment is obsolete: true
Attachment #119985 -
Flags: superreview?(bzbarsky)
Attachment #119985 -
Flags: review?(smontagu)
Comment 24•21 years ago
|
||
Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings r=smontagu
Attachment #119985 -
Flags: review?(smontagu) → review+
Updated•21 years ago
|
Attachment #119985 -
Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings Checked in to trunk, 2003-04-09 14:13 PDT.
Assignee | ||
Comment 26•21 years ago
|
||
Due to bug 68738 check in that happened on the same day, there is no way to test this. Testing shows, that parser code runs even for text/plain files, and this detector is pretty much useless now. Anyways, just assume fixed. R.I.P Universal Detector.
Status: REOPENED → RESOLVED
Closed: 21 years ago → 21 years ago
Resolution: --- → FIXED
Comment 27•21 years ago
|
||
Mark as verified. Pages: http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html are displayed fine on 05-08 trunk build while they were not in N7.02.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•