Closed
Bug 289947
Opened 19 years ago
Closed 3 years ago
nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in conversion
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: mgueury, Assigned: smontagu)
References
Details
(Keywords: testcase)
Attachments
(1 file)
246 bytes,
text/html
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2 Context ------- I am using the nsIScriptableUnicodeConverter in writing one firefox extension. http://users.skynet.be/mgueury/mozilla (Html validator) The extension is doing basically this: - when browsing the page is fetched from internet. - when the page is fully loaded, the extension take it from the cache - convert it to unicode (BUG) - and then validate the HTML with a XPCom version of HtmlTidy from w3c. - and finally shows the errors at the bottom of the page. Bug --- In this context, I use this code: ------------------------------------------------------------------------ var ucConverter = Components.classes[UC_CTRID].getService(nsIUnicodeConverter); ucConverter.charset = urlCharset; var s2 = ucConverter.ConvertToUnicode( s ); ------------------------------------------------------------------------ It basically convert a string in a characterset urlCharset in Unicode. The problem appears when validating for example, an UTF8 page containing invalid UTF8 characters. For ex, ------------------------------------------------------------------------------<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head> <title>c</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body"> <input type="submit" value="é" /> </body> </html> ------------------------------------------------------------------------------ When this page is saved in ISO8859P1 or WIN1252, the character é is not coded correctly as a valid UTF8 character. When converting this iso8859 HTML in unicode it raises this error. ------------------------------------------------------------------------- Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIScriptableUnicodeConverter.ConvertToUnicode]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: chrome://tidy/content/tidyBrowser.js :: anonymous :: line 461" data: no] Source File: chrome://tidy/content/tidyBrowser.js Line: 461 ------------------------------------------------------------------------- What is in a way correct, because there is an error. Unhapilly, there is no way to use nsIScriptableUnicodeConverter to convert a scring that contain error. And it is really strange since most call the Unicode library written in other part of Mozilla handle this. The wrong code -------------- In nsScriptableUConv.cpp, there is this call simply: ---------------------------------------------------------------------------- NS_IMETHODIMP nsScriptableUnicodeConverter::ConvertToUnicode(const char *aSrc, PRUnichar **_retval) { ... rv = mDecoder->Convert(aSrc, &inLength, *_retval, &outLength); ---------------------------------------------------------------------------- If the stringis incorrect, it does not work. I am expecting the function to replace all the bad characters by '?' like mozilla is doing. For ex: - by a new option : ConvertUnicode( string src, boolean recover_error) In other part of the mozilla code, you can see how to correct the ConvertToUnicode function. A sample can be seen in nsScanner::append. An explanation on how convert is intended to be used in the Unicode convertion code. For ex, here is a code that work that does replace all errors by '?'. ---------------------------------------------------------------------------- NS_IMETHODIMP nsTidyImpl::ConvertToUnicode(const char *aSrc, const char *aCharset, PRUnichar **_retval) { nsresult rv = NS_OK; // This work-around a limitation in nsScriptableUnicodeConverter nsCOMPtr<nsICharsetConverterManager> ccm = do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv); if (NS_SUCCEEDED( rv) && (nsnull != ccm)) { // get charset atom due to getting unicode converter // get an unicode converter nsCOMPtr<nsIUnicodeDecoder> mDecoder; rv = ccm->GetUnicodeDecoder(aCharset, getter_AddRefs(mDecoder)); if (!mDecoder) return NS_ERROR_FAILURE; PRInt32 inLength = strlen(aSrc); PRInt32 outLength = 0; rv = mDecoder->GetMaxLength(aSrc, inLength, &outLength); if (NS_SUCCEEDED(rv)) { *_retval = (PRUnichar*) nsMemory::Alloc((outLength+1)*sizeof(PRUnichar)); if (!*_retval) { return NS_ERROR_OUT_OF_MEMORY; } PRUnichar *unichars, *start; start = unichars = *_retval; PRInt32 totalChars = 0; PRInt32 unicharLength = outLength; // rv = mDecoder->Convert(aSrc, &inLength, *_retval, &outLength); do { PRInt32 srcLength = inLength; rv = mDecoder->Convert(aSrc, &srcLength, unichars, &unicharLength); totalChars += unicharLength; // Continuation of failure case if(NS_FAILED(rv)) { // if we failed, we consume one byte, replace it with U+FFFD // and try the conversion again. unichars[unicharLength++] = (PRUnichar)0xFFFD; unichars = unichars + unicharLength; unicharLength = outLength - (++totalChars); mDecoder->Reset(); if(((PRUint32) (srcLength + 1)) > inLength) { srcLength = inLength; } else { ++srcLength; } aSrc += srcLength; inLength -= srcLength; } } while (NS_FAILED(rv) && (inLength > 0)); return NS_OK; } } *_retval = nsnull; return NS_ERROR_FAILURE; } ------------------------------------------------------------------------------ Reproducible: Always Steps to Reproduce: A way to reproduce the issue is to use my extension (version 0.58) and just open a HTML file that I will provide as testcase. 1. Install the extension v 0.58 (I will upload it in the testcase Linux/Win version). 2. Open the charset file in the browser 3. Look in the java console and you get the Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIScriptableUnicodeConverter.ConvertToUnicode]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" Expected Results: Handle the convertion error correctly. It should also be nice to print an error in the javascript console when getting such errors. Because it means that a file is incorrect.
The extension can be dowloaded here, I give the direct link since I rewrote the function convertion for the next version to handle the problem. (what is bad since I am duplicating code...) Windows: http://rars.sourceforge.net/mozilla/tidy_firefox_058.xpi Linux Download and install Html Validator (based on Tidy) for Firefox 1.0.x MacOsX Download and install Html Validator (based on Tidy) for Firefox 1.0.x
Linux http://rars.sourceforge.net/mozilla/tidy_firefox_linux_058.xpi MacOsX http://rars.sourceforge.net/mozilla/tidy_firefox_macos_058.xpi
The extension above are for Firefox 1.0 only. (please see this if you need a version for 1.0+, 1.1) > http://users.skynet.be/mgueury/mozilla/faq.html (point 2) Windows version 0.58 for Firefox 1.1. ------------------------------------ http://rars.sourceforge.net/mozilla/tidy_firefox_11_058.xpi
Assignee | ||
Comment 5•19 years ago
|
||
It would also be good to have more flexible error handling for the scriptable Unicode encoder. Currently we have hardcoded mEncoder->SetOutputErrorBehavior(nsIUnicodeEncoder::kOnError_Replace, nsnull, (PRUnichar)'?'); I think there are existing bugs about adding SetOutputErrorBehavior() to nsIUnicodeDecoder
Assignee: firefox → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
Updated•19 years ago
|
Comment 7•17 years ago
|
||
I have two suggestions to make. First, I think it's better to add attributes to the converter rather than parameters to the function (e.g. boolean attribute for recover/don't recover from errors). Also, if a coder is already adding this capability, perhaps it would be better to make it a bit more flexible, as I can see more than one 'useful' way to recover from errors... What to emit? - Emit an \xFFFD - Generalization: Emit a fixed character which is an attribute of the converter (e.g. '?') - Emit the raw unusable octets - Emit the values of the unusable octets in hexadecimal UTF-8 digits (\x0030 - \x0039) - Generalization: Emit the values of the unusable octets in hexadecimal UTF-8 digits (\x0030 - \x0039), preceded or wrapped in some indication that these digits are not part of the original text What to do with the state/previous chars in the continuation after an error? - Discard just the unusable octets, continue with the same state from the next octet - Discard the unusable octet and reset the state, e.g. discard incomplete continuations etc. etc.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•17 years ago
|
QA Contact: amyy → i18n
Summary: nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in convertion → nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in conversion
Comment 8•16 years ago
|
||
(In reply to comment #5) > It would also be good to have more flexible error handling for the scriptable > Unicode encoder. Currently we have hardcoded > mEncoder->SetOutputErrorBehavior(nsIUnicodeEncoder::kOnError_Replace, nsnull, > (PRUnichar)'?'); > > I think there are existing bugs about adding SetOutputErrorBehavior() to > nsIUnicodeDecoder I wasn't able to find such a bug. Regardless, it would be good if SetOutputErrorBehavior-type functionality were added to this API -- not being able to determine the converter's behavior in cases of erroneous/corrupted data is a serious limitation.
Comment 10•13 years ago
|
||
Do you testing with Chinese? (你测试过中文吗?)
Comment 11•3 years ago
|
||
Old-style extensions are no longer supported.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•