Closed Bug 289947 Opened 19 years ago Closed 3 years ago

nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in conversion

Categories

(Core :: Internationalization, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: mgueury, Assigned: smontagu)

References

Details

(Keywords: testcase)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Context
-------
I am using the nsIScriptableUnicodeConverter in writing one firefox extension.
http://users.skynet.be/mgueury/mozilla (Html validator)

The extension is doing basically this:
- when browsing the page is fetched from internet.
- when the page is fully loaded, the extension take it from the cache
- convert it to unicode (BUG)
- and then validate the HTML with a XPCom version of HtmlTidy from w3c.
- and finally shows the errors at the bottom of the page.

Bug
---
In this context, I use this code:

------------------------------------------------------------------------
var ucConverter =  Components.classes[UC_CTRID].getService(nsIUnicodeConverter);
ucConverter.charset = urlCharset;
var s2 = ucConverter.ConvertToUnicode( s );
------------------------------------------------------------------------

It basically convert a string in a characterset urlCharset in Unicode.
The problem appears when validating for example, an UTF8 page containing
invalid UTF8 characters. For ex,

------------------------------------------------------------------------------<!DOCTYPE
HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
  <title>c</title>
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body">
  <input type="submit" value="é" />
</body>
</html>
------------------------------------------------------------------------------

When this page is saved in ISO8859P1 or WIN1252, the character é is not coded 
correctly as a valid UTF8 character. When converting this iso8859 HTML in 
unicode it raises this error.

-------------------------------------------------------------------------
    Error: [Exception... "Component returned failure code: 0x80004005
(NS_ERROR_FAILURE) [nsIScriptableUnicodeConverter.ConvertToUnicode]"  nsresult:
"0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame ::
chrome://tidy/content/tidyBrowser.js :: anonymous :: line 461"  data: no]
    Source File: chrome://tidy/content/tidyBrowser.js Line: 461
-------------------------------------------------------------------------

What is in a way correct, because there is an error.
Unhapilly, there is no way to use nsIScriptableUnicodeConverter to convert
a scring that contain error. And it is really strange since most
call the Unicode library written in other part of Mozilla handle this.

The wrong code
--------------
In nsScriptableUConv.cpp, there is this call simply:

----------------------------------------------------------------------------
NS_IMETHODIMP
nsScriptableUnicodeConverter::ConvertToUnicode(const char *aSrc, PRUnichar
**_retval)
{
...
    rv = mDecoder->Convert(aSrc, &inLength, *_retval, &outLength);
----------------------------------------------------------------------------

If the stringis incorrect, it does not work.
I am expecting the function to replace all the bad characters by '?' like
mozilla is doing.

For ex:
- by a new option : ConvertUnicode( string src, boolean recover_error)

In other part of the mozilla code, you can see how to correct the
ConvertToUnicode function. A sample can be seen in nsScanner::append.
An explanation on how convert is intended to be used in the Unicode
convertion code.

For ex, here is a code that work that does replace all errors by '?'.

----------------------------------------------------------------------------
NS_IMETHODIMP nsTidyImpl::ConvertToUnicode(const char *aSrc, const char
*aCharset, PRUnichar **_retval)
{
  nsresult rv = NS_OK;

  // This work-around a limitation in nsScriptableUnicodeConverter
  nsCOMPtr<nsICharsetConverterManager> ccm =
do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);

  if (NS_SUCCEEDED( rv) && (nsnull != ccm))
  {
    // get charset atom due to getting unicode converter

    // get an unicode converter
    nsCOMPtr<nsIUnicodeDecoder> mDecoder;
    rv = ccm->GetUnicodeDecoder(aCharset, getter_AddRefs(mDecoder));

    if (!mDecoder)
      return NS_ERROR_FAILURE;

    PRInt32 inLength = strlen(aSrc);
    PRInt32 outLength = 0;
    rv = mDecoder->GetMaxLength(aSrc, inLength, &outLength);
    if (NS_SUCCEEDED(rv))
    {
      *_retval = (PRUnichar*) nsMemory::Alloc((outLength+1)*sizeof(PRUnichar));
      if (!*_retval)
      {
        return NS_ERROR_OUT_OF_MEMORY;
      }
      PRUnichar *unichars, *start;
      start = unichars = *_retval;
      PRInt32 totalChars = 0;
      PRInt32 unicharLength = outLength;

      // rv = mDecoder->Convert(aSrc, &inLength, *_retval, &outLength);

      do
      {
        PRInt32 srcLength = inLength;
        rv = mDecoder->Convert(aSrc, &srcLength, unichars, &unicharLength);

        totalChars += unicharLength;
        // Continuation of failure case
        if(NS_FAILED(rv))
        {
          // if we failed, we consume one byte, replace it with U+FFFD
          // and try the conversion again.
          unichars[unicharLength++] = (PRUnichar)0xFFFD;
          unichars = unichars + unicharLength;
          unicharLength = outLength - (++totalChars);

          mDecoder->Reset();

          if(((PRUint32) (srcLength + 1)) > inLength)
          {
            srcLength = inLength;
          }
          else
          {
            ++srcLength;
          }

          aSrc += srcLength;
          inLength -= srcLength;
        }
      } while (NS_FAILED(rv) && (inLength > 0));

      return NS_OK;
    }
  }
  *_retval = nsnull;
  return NS_ERROR_FAILURE;
}
------------------------------------------------------------------------------

Reproducible: Always

Steps to Reproduce:
A way to reproduce the issue is to use my extension (version 0.58)
and just open a HTML file that I will provide as testcase.
1. Install the extension v 0.58 (I will upload it in the testcase 
   Linux/Win version). 
2. Open the charset file in the browser
3. Look in the java console and you get the  

Error: [Exception... "Component returned failure code: 0x80004005
(NS_ERROR_FAILURE) [nsIScriptableUnicodeConverter.ConvertToUnicode]"  nsresult:
"0x80004005 (NS_ERROR_FAILURE)" 



Expected Results:  
Handle the convertion error correctly.

It should also be nice to print an error in the javascript console when 
getting such errors. Because it means that a file is incorrect.
The extension can be dowloaded here, I give the direct link since I rewrote 
the function convertion for the next version to handle the problem.
(what is bad since I am duplicating code...)

Windows: 
http://rars.sourceforge.net/mozilla/tidy_firefox_058.xpi

Linux
Download and install Html Validator (based on Tidy) for Firefox 1.0.x

MacOsX
Download and install Html Validator (based on Tidy) for Firefox 1.0.x
The extension above are for Firefox 1.0 only. 
(please see this if you need a version for 1.0+, 1.1)
> http://users.skynet.be/mgueury/mozilla/faq.html (point 2)

Windows version 0.58 for Firefox 1.1.
------------------------------------
http://rars.sourceforge.net/mozilla/tidy_firefox_11_058.xpi

It would also be good to have more flexible error handling for the scriptable
Unicode encoder. Currently we have hardcoded
mEncoder->SetOutputErrorBehavior(nsIUnicodeEncoder::kOnError_Replace, nsnull,
(PRUnichar)'?');

I think there are existing bugs about adding SetOutputErrorBehavior() to
nsIUnicodeDecoder
Assignee: firefox → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
URL: -
Keywords: testcase
I have two suggestions to make.

First, I think it's better to add attributes to the converter rather than parameters to the function (e.g. boolean attribute for recover/don't recover from errors).

Also, if a coder is already adding this capability, perhaps it would be better to make it a bit more flexible, as I can see more than one 'useful' way to recover from errors...

What to emit?

- Emit an \xFFFD
- Generalization: Emit a fixed character which is an attribute of the converter (e.g. '?')
- Emit the raw unusable octets
- Emit the values of the unusable octets in hexadecimal UTF-8 digits (\x0030 - \x0039)
- Generalization: Emit the values of the unusable octets in hexadecimal UTF-8 digits (\x0030 - \x0039), preceded or wrapped in some indication that these digits are not part of the original text

What to do with the state/previous chars in the continuation after an error?

- Discard just the unusable octets, continue with the same state from the next octet
- Discard the unusable octet and reset the state, e.g. discard incomplete continuations

etc. etc.
Status: UNCONFIRMED → NEW
Ever confirmed: true
QA Contact: amyy → i18n
Summary: nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in convertion → nsIScriptableUnicodeConverter::ConvertToUnicode is not able to handle error in conversion
(In reply to comment #5)
> It would also be good to have more flexible error handling for the scriptable
> Unicode encoder. Currently we have hardcoded
> mEncoder->SetOutputErrorBehavior(nsIUnicodeEncoder::kOnError_Replace, nsnull,
> (PRUnichar)'?');
> 
> I think there are existing bugs about adding SetOutputErrorBehavior() to
> nsIUnicodeDecoder

I wasn't able to find such a bug. Regardless, it would be good if SetOutputErrorBehavior-type functionality were added to this API -- not being able to determine the converter's behavior in cases of erroneous/corrupted data is a serious limitation.
It's being added as part of bug 174351
Depends on: 174351
Do you testing with Chinese? (你测试过中文吗?)

Old-style extensions are no longer supported.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: