Open Bug 320807 Opened 16 years ago Updated 4 years ago

Refactor URI unescape for status bar

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

People

(Reporter: masayuki, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: intl)

Attachments

(1 file, 3 obsolete files)

We have many problem in URI unescaping on status bar.
See bug 309671 comment 34, bug 153640 and bug 229548.

I have some idea. Please wait, I'll attach a proposal patch.
Status: NEW → ASSIGNED
Attached patch Patch rv1.0 (obsolete) — Splinter Review
We have some suggestions.

1. Don't decode the host name by non UTF-8 charset.(See bug 309671)
2. Don't decode the another host's unescaped string if it is not UTF-8.
3. Use charset attribute of a element for the URI origin charset.
4. Decode the URI on each fragments(e.g., Path, Query and Ref).(See bug 229548)
5. Unescape %1B if the charset is ISO-2022-*.(See bug 153640)

We have experienced many broken non-ASCII URI that is link to another host. Because we are using document charset for the URI of the link. This is written in HTML4.01 spec. But we should not use this spec. Because in many cases, this is failed to decode.

Therefore, we propose to use a[charset] for fix this problem. This attribute specifies the link target document's charst. So, the URI may be encoded by this charset. And now, this attribute is not used on many sites. So, we don't need to mind the regression.

But in this patch, if the URI is not having host(e.g., the scheme is "mailto", "javascript"), the URI spec is decoded by current spec.(I.e., in these cases, the process is falled back to old behavior.)

And if the URI is not having nsIURL interface, the nsIURI::path is decoded as single fragment.
Attachment #207194 - Flags: review?(jshin1987)
I'll get to this on January 4th (KST=JST :-)) Thanks for the patch, btw.

Jshin:

Could you review it?
(In reply to comment #4)
> Jshin:
> 
> Could you review it?

Sorry for the delay. I've just lost an hour's worth of reviewing (I don't know what happened or what I did....) I'll get back to this in a day or two.


Comment on attachment 207194 [details] [diff] [review]
Patch rv1.0

Sorry again for the delay. Looks good overall. 


>Index: content/html/content/src/nsGenericHTMLElement.cpp
>+nsGenericHTMLElement::GetCharsetForHref(nsAFlatCString &aCharset)
>+  /*
>+   * NOTE:
>+   * We need to change toolkit/components/typeaheadfind/findBar.js also when
>+   * you change following charset attribute behavior.
>+   * See gFindBar.updateStatusBar in that file.
>+   */

If the way charset attribute is obtained is changed here, the corresponding change needs to be made in |gFindBar.updateStatusbar| of toolkit/components/typeaheadfind/findBar.js

>+  if (!charsetAttr || charsetAttr->GetStringValue().IsEmpty())
>+    return;
>+  CopyUTF16toUTF8(charsetAttr->GetStringValue(), aCharset);

This comes from an external source so that using CopyUTF16toUTF8 is safer than otherwise, but I wonder if we can just get away with LossyCopyUTF16toASCII...

>+     * NOTE:
>+     * We need to change toolkit/components/typeaheadfind/findBar.js also when
>+     * you change following unescape process.
>+     * See gFindBar.updateStatusBar in that file.

If the unescaping process is changed here, the corresponding change needs to be made in |gFindBar.updateStatusbar| of toolkit .....


>   /**
>    * Unescapes the given URI fragment (for UI purpose only)
>+   * WARNING: Don't use this for host name and including it(e.g., full URI).
>+   *          You should use only for URI *fragment* that doesn't have
>+   *          host name.

Don't call this method for a hostname or a fragment containing a hostname (e.g. full URI).  


>    * Note: 
....
>    * @param aCharset the charset to convert from
>    * @param aURIFragment the URI (or URI fragment) to unescape
>    * @return Unescaped aURIFragment  converted to unicode
   
While you're here, will you s/unicode/UTF-16/?

>   AString unEscapeURIForUI(in ACString aCharset, in AUTF8String aURIFragment);
> 
>   /**
>+   * Unescapes each parts of the given URI (for UI purpose only)
>+   * Note: 
>+   * <ul>
>+   *  <li> escaping back the result (unescaped string) is not guaranteed to
>+   *       give the original escaped string
>+   *  <li> In case of a conversion error, the URI spec (escaped) is
>+   *       assumed to be in UTF-8 and converted to AString (UTF-16)
>+   *  <li> Always succeeeds (callers don't need to do error checking)
>+   *  <li> The result is guaranteed that the host name is not changed.
>+   * </ul>
>+   *
>+   * @param aURI the URI to unescape.
>+   * @param aBaseURI if aURI is link target, set the owner document URI.

       if aURI is a link target, set it to the URI of the owner document.

>+   * @param aPreferredCharset the preferred charset to convert from.
>+   *                          If this value is empty, this may use
>+   *                          origin charset of aURI.

  You don't mean 'may', do you? 
  The origin charset of |aURI| will be used. ?

>Index: intl/uconv/src/nsTextToSubURI.cpp

>-static PRBool statefulCharset(const char *charset)
>+static PRBool UsingEscape(const char *charset)

  Perhaps, |UseEscape| is a better name.

>+  return !nsCRT::strncasecmp(charset, "ISO-2022-", sizeof("ISO-2022-")-1);

  I think |charset| passed here is always canonical (you also made it 
canonical in |UnEscapeFullURIForUIInternal|. In that
case, you can just use 

  |StringBeginsWith(charset, NS_LITERAL_CSTRING("ISO-2022-"))|

>+static PRBool StatefulCharset(const char *charset)

  |IsStateful|, maybe.

>   if (!nsCRT::strncasecmp(charset, "ISO-2022-", sizeof("ISO-2022-")-1) ||

    if (UseEscape(charset) ||

>       !nsCRT::strcasecmp(charset, "UTF-7") ||
>       !nsCRT::strcasecmp(charset, "HZ-GB-2312"))

  charset.EqualsLiteral("UTF-7")

>+    if (aCharset == NS_LITERAL_CSTRING("UTF-8"))
>+      return NS_ERROR_FAILURE;

      if (aCharset.EqualsLiteral("UTF-8"))

>+  // if the result value contains ESC(U+001B), we failed to convert.
>+  if (NS_SUCCEEDED(rv) && UsingEscape(aCharset.get()) &&
>+      nsPromiseFlatString(_retval).FindChar((PRUnichar)0x1B) != kNotFound)

   I prefer pseudo-ctor style to C-style casting. |PRUnichar(0x1B)|


>+static PRBool
>+GetUnicodeFragmentIfUTF8(nsAFlatCString &aCharset,
>+                         nsAFlatCString &aFragment,
>+                         nsAString &_retval)

It's better to make this function just check if it's UTF-8. 
name suggestion: |IsFragmentUTF8|

>+  // if the charset is stateful charset and the unescaped spec is ASCII,
>+  // the spec may be encoded by the charset. So, we cannot decide the
>+  // spec is encoded by UTF-8 in this case.

if |aCharset| is stateful, the spec may not be encoded in US-ASCII even if 
it is entirely made of characters in the ASCII range. In that case, 
we can't tell whether the spec is UTF-8

>+  if (StatefulCharset(aCharset.get()) && IsASCII(unescapedFragment))
>+    return PR_FALSE;


>+nsresult
>+nsTextToSubURI::ReplaceFragment(const nsAFlatCString &aCharset,
>+  PRInt32 start = aURIPath.FindChar(aSeparator);
>+  if (start == kNotFound) {
>+    NS_ERROR("What's this case!?");

      NS_ASSERTION("Fragment separator not found");

>+nsresult
>+nsTextToSubURI::UnEscapeFullURIForUIInternal(nsIURI* aURI, nsIURI* aBaseURI,

>+  //////////////////////////////////////////////////////////////////////////////
>+  // Decide sub charset

I'm not sure what you meant by 'sub charset'. "Canonicalize |charset|" is better, I guess.

>+  nsCOMPtr<nsICharsetAlias> csAlias(do_GetService(kCharsetAliasCID));
>+  NS_ASSERTION(csAlias, "failed to get the CharsetAlias service");
>+  if (csAlias) {
>+    nsCAutoString preferredName;
>+    rv = csAlias->GetPreferred(charset, preferredName);
>+    if (NS_SUCCEEDED(rv))
>+      charset = preferredName;
>+    else
>+      charset = NS_LITERAL_CSTRING("UTF-8");

>+  //////////////////////////////////////////////////////////////////////////////
>+  // Check that the URI doesn't have escaped character
>+  //////////////////////////////////////////////////////////////////////////////

Do we really want these 3-line comments in this function? 
      
>+  if (!isStatefulCharset && kNotFound == spec.FindChar('%')) {
>+    // We don't need to unescape
>+    CopyUTF8toUTF16(spec, _retval);
>+    return NS_OK;
>+  }

>+  //////////////////////////////////////////////////////////////////////////////
>+  // Check that the URI doesn't refer to network resource
>+  //////////////////////////////////////////////////////////////////////////////
>+  rv = aURI->GetHost(host);
>+  if (NS_FAILED(rv) || host.IsEmpty()) {
>+    // aURI doesn't refer to network resource.
>+    PRBool isIRI = (!scheme.EqualsLiteral("file"));

Perhaps, it's a good idea to add a comment that this has to be removed
when bug 278161 is fixed.


>+  //////////////////////////////////////////////////////////////////////////////
>+  // Check that we should try to decode by non UTF-8 charset
>+  //////////////////////////////////////////////////////////////////////////////
>+  if (!charset.EqualsLiteral("UTF-8")) {
>+    PRBool shouldUseCharset = PR_FALSE;

      PRBool shouldUseCharset;

>+    if (!aPreferredCharset.IsEmpty()) {
>+      // We should always unescape if the link has linked document charset.

        We should always unescape if the link has an associated charset
        attribute.

>+      shouldUseCharset = PR_TRUE;
>+    } else if (!aBaseURI) {
>+      // We should not unescape. Because we may fail to convert.

        // We should not unescape because the conversion may fail.

        instead of adding a dummy |else if|, why don't you do this? 

        shouldUseCharset = PR_FALSE;

>+  //////////////////////////////////////////////////////////////////////////////
>+  // Try to decode on each frament

    Try to decode fragments one by one.

>+      if (atIndex != kNotFound && (PRUint32)atIndex > userPass.Length()) {

         PRUint32(aIndex) > userPass.Length 

>+    if (GetUnicodeFragmentIfUTF8(charset, fragment, tmp))
>+      _retval += tmp;
>+    else
>+      _retval += path;

      if (IsFragmentUTF8(charset, fragment)) 
        AppendUTF8toUTF16(fragment, _retval);
      else 
        _retval += path;


>Index: xpcom/io/nsEscape.h

>+  esc_SkipControl    = PR_BIT(15), /* skips C0 and DEL from unescaping */
>+  esc_DontSkipEsc    = PR_BIT(16)  /* don't skip 0x1B always(for ISO-2022-*) */
    unescapes ESC(0x1B) even if esc_SkipControl is set (for ISO-2022-*)
Attachment #207194 - Flags: review?(jshin1987)
Attached patch Patch rv1.1 (obsolete) — Splinter Review
Thank you, jshin. Please check this.
Attachment #207194 - Attachment is obsolete: true
Attachment #211737 - Flags: review?(jshin1987)
+  if (Tag() != nsHTMLAtoms::a && Tag() != nsHTMLAtoms::link &&
+      Tag() != nsHTMLAtoms::script)

don't you need to check the namespace here? also, what about area elements?
(In reply to comment #8)
> +  if (Tag() != nsHTMLAtoms::a && Tag() != nsHTMLAtoms::link &&
> +      Tag() != nsHTMLAtoms::script)
> 
> don't you need to check the namespace here? also,

Thanks.

> what about area elements? 

area element doesn't have charset attribute...(Is that a mistake of HTML spec?)
Attached patch Patch rv1.2 (obsolete) — Splinter Review
Attachment #211737 - Attachment is obsolete: true
Attachment #211745 - Flags: review?(jshin1987)
Attachment #211737 - Flags: review?(jshin1987)
Jungshik:

Please re-review the new patch.
Comment on attachment 211745 [details] [diff] [review]
Patch rv1.2

Sorry for the delay, again.

r=jshin. asking cbie for a second look. jst or darin would be a good superreviewer.

>Index: intl/uconv/src/nsTextToSubURI.cpp

>+  // Check that the URI doesn't have escaped character

   Check if the URI has any escaped character.

>+  if (!isStatefulCharset && kNotFound == spec.FindChar('%')) {
>+    // We don't need to unescape
>+    CopyUTF8toUTF16(spec, _retval);

>+  // Check that the URI doesn't refer to network resource

  Check if the URI refers to a network resource.

>+  rv = aURI->GetHost(host);
>+  if (NS_FAILED(rv) || host.IsEmpty()) {
>+    // aURI doesn't refer to network resource.
>+    // XXX this should be removed after bug 278161 is fixed.
>+    PRBool isIRI = (!scheme.EqualsLiteral("file"));

   LowercaseEqualsLiteral("file")

>+  // Check that we should try to decode by non UTF-8 charset

    See if we need to try an charset other than UTF-8

>+  if (!charset.EqualsLiteral("UTF-8")) {
>+    PRBool shouldUseCharset;
>+    if (!aPreferredCharset.IsEmpty()) {
>+      shouldUseCharset = PR_FALSE;
>+    } else {
>+      // We should use charset only at the URIs having same host.
        Should use charset for |URI| refering to a resource 
        at the same host as that of |aBaseURI|

>+      nsCAutoString baseHost;
>+      rv = aBaseURI->GetHost(baseHost);

>+  // Process the path from end to start

      perhaps, 'from the tail to the head'?
Attachment #211745 - Flags: review?(cbiesinger)
Attached patch Patch rv1.3Splinter Review
Thank you, Jungshik.
Attachment #211745 - Attachment is obsolete: true
Attachment #212480 - Flags: superreview?(darin)
Attachment #212480 - Flags: review+
Attachment #211745 - Flags: review?(jshin1987)
Attachment #211745 - Flags: review?(cbiesinger)
Comment on attachment 212480 [details] [diff] [review]
Patch rv1.3

>Index: content/html/content/src/nsGenericHTMLElement.cpp

>+nsGenericHTMLElement::GetCharsetForHref(nsAFlatCString &aCharset)

nit: nsAFlatCString is a deprecated typedef for nsCString.  Please
just use nsCString instead.


>+{
>+  /*
>+   * NOTE:
>+   * If the way charset attribute is obtained is changed here, the corresponding
>+   * change needs to be made in |gFindBar.updateStatusbar| of
>+   * toolkit/components/typeaheadfind/findBar.js
>+   */

This tells me that we should expose an API somewhere that allows
both places to use the same algorithm for getting the charset.
Perhaps we should define an attribute on nsIDOMNSHTMLElement?
However, I wouldn't do that without asking the DOM maintainers
first (maybe jst or bz).


>+  aCharset.Truncate();
>+  if ((!mNodeInfo->NamespaceEquals(kNameSpaceID_None) &&
>+       !mNodeInfo->NamespaceEquals(kNameSpaceID_XHTML)) ||
>+      (Tag() != nsHTMLAtoms::a && Tag() != nsHTMLAtoms::link &&
>+       Tag() != nsHTMLAtoms::script))
>+    return;

What about nsHTMLAtoms::area?


>+    nsresult rv;
>+    if (charset.IsEmpty()) {
>+      rv = nsContentUtils::NewURIWithDocumentCharset(aURI,
>+                                                     attr->GetStringValue(),
>+                                                     GetOwnerDoc(),
>+                                                     baseURI);
>+    } else {
>+      rv = NS_NewURI(aURI,
>+                     attr->GetStringValue(),
>+                     charset.get(),
>+                     baseURI);
>+    }
...
>+  nsresult rv;
>+  if (charset.IsEmpty()) {
>+    rv = nsContentUtils::NewURIWithDocumentCharset(getter_AddRefs(attrURI),
>+                                                   attrValue, GetOwnerDoc(),
>+                                                   baseURI);
>+  } else {
>+    rv = NS_NewURI(getter_AddRefs(attrURI),
>+                   attrValue,
>+                   charset.get(),
>+                   baseURI);
>+  }

Repeated code like this sort of begs for a subroutine.  Maybe you could
define something on nsContentUtils or something else on nsGenericHTMLElement.



>Index: intl/uconv/idl/nsITextToSubURI.idl

>   /**
>+   * Unescapes each parts of the given URI (for UI purpose only)
>+   * Note: 
>+   * <ul>
>+   *  <li> escaping back the result (unescaped string) is not guaranteed to
>+   *       give the original escaped string
>+   *  <li> In case of a conversion error, the URI spec (escaped) is
>+   *       assumed to be in UTF-8 and converted to AString (UTF-16)
>+   *  <li> Always succeeeds (callers don't need to do error checking)
>+   *  <li> The result is guaranteed that the host name is not changed.
>+   * </ul>
>+   *
>+   * @param aURI the URI to unescape.


>+   * @param aBaseURI if aURI is a link target, set it to the URI of the owner
>+   *                 document.

Reading this documentation leaves me confused by this parameter.  I
think you should say how aBaseURI will be used.


>Index: intl/uconv/src/nsTextToSubURI.cpp

>+static PRBool UseEscape(const nsAFlatCString &aCharset)

Again, please use nsCString instead of nsAFlatCString.


> {
>+  return StringBeginsWith(aCharset, NS_LITERAL_CSTRING("ISO-2022-"));
>+}

The name "UseEscape" doesn't seem to describe this function very well,
or maybe I just don't understand what it means.  Could you add a comment
or maybe pick a more descriptive name?


>+static PRBool IsStateful(const nsAFlatCString &aCharset)
>+{
>+  if (UseEscape(aCharset) ||
>+      aCharset.EqualsLiteral("UTF-7") ||
>+      aCharset.EqualsLiteral("HZ-GB-2312"))
>     return PR_TRUE;
> 
>   return PR_FALSE;
> }

nit: Might be nicer to write this function like this:

  {
    return UseEscape(aCharset) || aCharset.EqualsLiteral("UTF-7") ||
                                  aCharset.EqualsLiteral("HZ-GB-2312");
  }


>+  if (NS_SUCCEEDED(rv) && UseEscape(aCharset) &&
>+      nsPromiseFlatString(_retval).FindChar(PRUnichar(0x1B)) != kNotFound)

You shouldn't need to use nsPromiseFlatString here.  FindChar is defined
on nsAString.  Just use it directly:

  _retval.FindChar(...


>+      charset = NS_LITERAL_CSTRING("UTF-8");

nit: prefer |charset.AssignLiteral("UTF-8");|


>+      charset = NS_LITERAL_CSTRING("UTF-8");

ditto


>+    PRBool replaceUserPass;
>+    nsAutoString tmp;
>+    if (userPass.IsEmpty() || userPass.FindChar('%') == kNotFound) {
>+      replaceUserPass = PR_FALSE;
>+    } else {
>+      rv = UnEscapeURIForUIInternal(charset, userPass, PR_TRUE, tmp);
>+      replaceUserPass = NS_SUCCEEDED(rv) && !userPass.IsEmpty();
>+    }

Are you sure you need to test userPass.IsEmpty() here?  You test it
above, and if it is empty, then you never enter the "else" section.
Therefore, why bother testing it again?  Did you mean to test |tmp|
instead?

Also, from a security point of view, maybe we shouldn't unescape the
username and password since that may result in changing how the URL
would be parsed.  For example, consider what happens if the username
contains an escaped '/' character.  Then, the username would look like
a hostname.  I notice below that you test for '@', but that doesn't
seem to be sufficient to me.  Or, am I missing something?


>+  rv = aURI->GetPath(fragment);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  if (fragment.IsEmpty())
>+    return NS_OK;




>Index: toolkit/components/typeaheadfind/content/findBar.js

>+    // See nsWebShell::OnOverLink, nsGenericHTMLElement::GetHrefURIForAnchors
>+    // and nsGenericHTMLElement::GetURIAttr for following process.
>+    var targetURI = null;
>+    var baseURI = null;
>+    var preferredCharset = "";
>+
>     var docCharset = "";
>     var ownerDoc = this.mFoundLink.ownerDocument;
>+    if (ownerDoc) {
>       docCharset = ownerDoc.characterSet;
>+      baseURI = this.makeURI(ownerDoc.location.href, ownerDoc.characterSet);
>+    }
>+    if (this.mFoundLink.charset)
>+      preferredCharset = this.mFoundLink.charset;
>+    targetURI = this.makeURI(this.mFoundLink.href,
>+                             preferredCharset ? preferredCharset : docCharset);
> 
>     var url =
>+      this.mTextToSubURIService.unEscapeFullURIForUI(targetURI, baseURI,
>+                                                     preferredCharset);

So, unEscapeFullURIForUI only needs baseURI so that it can know
what hostname served the document.  Perhaps you should change
unEscapeFullURIForUI to only require the hostname of the originating
document or you could even require the originating document itself.
It just seems a bit awkward to require the baseURI of the document
since there is currently no convenient scriptable API to get that
info as a nsIURI without jumping through hoops to construct it as
you are doing here.


>Index: xpcom/io/nsEscape.cpp

>+    // ESC(0x1B) is not skipped for ISO-2022-* even if this flag is true.
>+    PRBool skipControl = (flags & esc_SkipControl);
...
>-                !(skipControl && 
>+                !(skipControl && !(*p1 == '1' && (*p2 == 'b' || *p2 == 'B')) &&

OK, is this really the proper thing to do for all charsets?


I think boris should review this as well.
Attachment #212480 - Flags: superreview?(darin)
Attachment #212480 - Flags: superreview-
Attachment #212480 - Flags: review?(bzbarsky)
> 3. Use charset attribute of a element for the URI origin charset.

Why?  That's wrong, imo.  The charset attribute of <a> is a charset hint for the resource the <a> points to.  Why would it have anything to do with the charset of the attribute?  Same thing for <script> and <link> and so forth.

Also, I am not happy adding information about the toolkit front end into Gecko.  If we have to do that, that indicates to me that we have a code-factoring bug.

Put another way, why is there ever any charset other than the documentCharacterSet involved?
(In reply to comment #15)
> > 3. Use charset attribute of a element for the URI origin charset.
> 
> Why?  That's wrong, imo.  The charset attribute of <a> is a charset hint for
> the resource the <a> points to.  Why would it have anything to do with the
> charset of the attribute?  Same thing for <script> and <link> and so forth.
The alternative charset is often same document charset on many site that is not using UTF-8 for query of URI. Therefore, if the linking page's author writes the target document charset in the charset attribute, it and the encoding of URI are much.

> Put another way, why is there ever any charset other than the
> documentCharacterSet involved?
In Japan, the many sites are using Shift_JIS or EUC-JP. So, we have twin standard. Therefore, the many links that target is another host, the URI encoding is not much the linking document charset. So, in many times, we can see the broekn URI on our statusbar...
(In reply to comment #14)
> >+  aCharset.Truncate();
> >+  if ((!mNodeInfo->NamespaceEquals(kNameSpaceID_None) &&
> >+       !mNodeInfo->NamespaceEquals(kNameSpaceID_XHTML)) ||
> >+      (Tag() != nsHTMLAtoms::a && Tag() != nsHTMLAtoms::link &&
> >+       Tag() != nsHTMLAtoms::script))
> >+    return;
> 
> What about nsHTMLAtoms::area?

See comment 9. the area element doesn't have charset attribute.
So wait.  Could we back up and actually explain the problem we're trying to solve?  That is, what escapes are we unescaping and how did they come to be escaped?
This is a major case in Japan:

A Shift_JIS document links to EUC-JP encoded CGI on another host. In this case, current code unescapes the query by Shift_JIS (i.e., document charset).

But my patch doesn't unescape in this case without charset attribute. But this is inconvenience. Therefore, we need a way for specifying the URL encoding. We think that charset attribute is best way.
> A Shift_JIS document links to EUC-JP encoded CGI on another host.

Ok.  How does it do that?  That is, does the actual HTML contain characters?  Or URI escapes? This is the "how did they come to be escaped" part of my previous question, more or less; actually answering it would really help.
Sorry. I cannot understand what you said...

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
</head>
<body>
<a href="http://example.com/?q=%xx%xx....">jump to another host</a>
</body>
</html>

If the value of href attribute is escaped by EUC-JP if the linked CGI uses EUC-JP, the URI is displayed in statusbar and unescaped by Shift_JIS. I want to solve this problem using charset attribute.

I.e.,

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
</head>
<body>
<a href="http://example.com/?q=%xx%xx...." charset="EUC-JP">jump to another host</a>
</body>
</html>
> <a href="http://example.com/?q=%xx%xx....">jump to another host</a>

OK.  So the web site itself is what's doing the escaping.  And we have to guess what charset it's escaping from?
(In reply to comment #22)
>  And we have to guess what charset it's escaping from?

It's simple. We need to display the 'correct' unescaped string.
E.g., searching query that contains the keywords.
That doesn't answer the question I asked (which was a yes or no question).  If you want me to review this, then please do answer the questions I ask?  It'll really help.  Truly.
Ah, Yes.  We have to guess what charset it's escaping from.
So if this is really targeting 1.9, I would prefer to expose some of the nsILink stuff (eg the nsIURI object for the href) on these nodes (see bug 324464 for similar ideas).  Then we could just use the origin charset of the URI throughout and be happy.  In any case, we should just be using the origin charset of the uri in webshell instead of looking at the content node...

More detailed comments hopefully later tonight or on Sunday if that doesn't happen.
Comment on attachment 212480 [details] [diff] [review]
Patch rv1.3

>Index: content/html/content/src/nsGenericHTMLElement.cpp
>+nsGenericHTMLElement::GetCharsetForHref(nsAFlatCString &aCharset)

I still think that this should be exposed on an API for 1.9.  for 1.8.1 I suppose I could live with the "see this code in this totally unrelated place" mess... :(

>+  if ((!mNodeInfo->NamespaceEquals(kNameSpaceID_None) &&
>+       !mNodeInfo->NamespaceEquals(kNameSpaceID_XHTML)) ||

No need for those checks.  The namespace will always be one or the other here.

>+      (Tag() != nsHTMLAtoms::a && Tag() != nsHTMLAtoms::link &&
>+       Tag() != nsHTMLAtoms::script))

I'd put Tag() into a local instead of making 3 virtual calls here.

>+  if (!charsetAttr || charsetAttr->GetStringValue().IsEmpty())

That crashes if the value is not a string value, no?

I think the safe thing to do here would be a GetAttr() into an nsAutoString followed by a LossyCopy....  Could make the GetAttr be nsGenericHTMLElement::GetAttr to avoid a virtual call, of course.

The other option is to check the type of charsetAttr and handle it sanely, of course...  Doesn't seem like it's worth the code to do that.

> nsGenericHTMLElement::GetHrefURIForAnchors(nsIURI** aURI)

>+    if (charset.IsEmpty()) {

If we're going to do the whole NS_NewURI thing here anyway, we might as well do:

  if (charset.IsEmpty()) {
    nsIDocument* doc = GetOwnerDoc();
    if (doc) {
      NS_LossyCopyUTF16toASCII(doc->GetDocumentCharacterSet(),
                               charset);
    }
  }

and then just call NS_NewURI.

But on the other hand, it looks like you have code much like this in other places, which suggests to me that it should be factored out better somehow.

>@@ -2952,20 +2986,33 @@ nsGenericHTMLElement::GetURIAttr(nsIAtom

>+  nsCAutoString charset;
>+  if (aAttr == nsHTMLAtoms::href)
>+    GetCharsetForHref(charset);

I'd prefer if we did this for all URI attrs... or if we have to treat some differently that we explicitly did that by using a different macro for those and refactoring this code accordingly.

>Index: content/html/content/src/nsGenericHTMLElement.h
>+  /**
>+   * Get charset attribute value if this is 'a' or 'link' or 'script' element.

That's obvious based on reading the code.  What this should document is _what_ this is really getting.  Perhaps something like "Get the charset which should be used when manipulating the href URI (eg encoding it to bytes, unescaping, etc) for this node".

Or something like that; feel free to modify the wording as needed.

>Index: docshell/base/nsDocShell.cpp

>+            nsCAutoString charset;
>+            aURI->GetOriginCharset(charset);
>+            // UnEscapeFullURIForUI always succeeds
>+            textToSubURI->UnEscapeFullURIForUI(aURI, nsnull, charset,

Why does UnEscapeFullURIForUI even take a charset arg?  And given that it takes one, why are you just passing in the origin charset of aURI?

Also, why is the second arg null here?  That seems wrong to me...

>Index: docshell/base/nsWebShell.cpp

>+    nsCOMPtr<nsIDOMHTMLAnchorElement> anchor = do_QueryInterface(aContent);

Why do you need this?  aURI already has the charset off this element....

>Index: intl/uconv/idl/nsITextToSubURI.idl

>+#define NS_TEXTTOSUBURI_CID \
>+{ 0xb5058d64, 0xf289, 0x4d59, \
>+  { 0x83, 0x3, 0x39, 0xb7, 0x7e, 0xd7, 0x8e, 0xbf } }
> #define NS_ITEXTTOSUBURI_CONTRACTID "@mozilla.org/intl/texttosuburi;1"
> %}

Is there a bug on moving that out of the IDL?  It doesn't belong here.

>+   * Unescapes each parts of the given URI (for UI purpose only)

"each part" and "for UI purposes".

>+   *  <li> escaping back the result (unescaped string) is not guaranteed to
>+   *       give the original escaped string

What escaped string?  T here's no escaped string in sight...

>+   *  <li> The result is guaranteed that the host name is not changed.

Perhaps "The hostname in the resulting unescaped string is guaranteed to be the same as the hostname in the original URI spec"?  But I'm not sure what that means either.  What does it mean if the hostname contains escapes?

>+   * @param aURI the URI to unescape.
>+   * @param aBaseURI if aURI is a link target, set it to the URI of the owner
>+   *                 document.

I really don't follow this...  For example, why should <a href=""> and <form action=""> be treated differently?  And generally, how is the caller of this supposed to know whether aURI is a "link target" and if so where the heck it came from?  It seems like this interface as written is really hard to use.  If the problem is that URI objects don't carry enough information to properly work with them, maybe we need to fix that?

It's also not clear whether aBaseURI is allowed to be null whatn aURI is a "link target".  Or what will go wrong if it is.  So as a consumer of the interface, I would tend to just pass null for the second arg, since there's no obvious reason I should bother getting a different value.

>+   * @param aPreferredCharset the preferred charset to convert from.
>+   *                          If this value is empty, this will use
>+   *                          origin charset of aURI.

I don't see any consumers that would want a different charset than the origin charset of aURI.  In fact, I can't really imagine any situations where we'd want that, so I propose removing this argument.

>Index: intl/uconv/src/nsTextToSubURI.cpp

>+static PRBool UseEscape(const nsAFlatCString &aCharset)

Why UseEscape for these particular charsets?  Unlike IsStateful (which is reasonably self-documenting), this is not at all clear.  Please either comment or give this a better name.

>+static PRBool IsStateful(const nsAFlatCString &aCharset)
>+{
>+  if (UseEscape(aCharset) ||
>+      aCharset.EqualsLiteral("UTF-7") ||
>+      aCharset.EqualsLiteral("HZ-GB-2312"))
>     return PR_TRUE;

While you're here, why not just return that condition instead of doing |if () return true; return false;| ?

>+GetUnescapedFragment(const nsAFlatCString &aCharset,

The aCharset argument seems unused... why is it here?

>+  // if the result value contains ESC(U+001B), we failed to convert.
>+  if (NS_SUCCEEDED(rv) && UseEscape(aCharset) &&
>+      nsPromiseFlatString(_retval).FindChar(PRUnichar(0x1B)) != kNotFound)

So why is this something we're checking _here_ instead of having ConvertURItUnicode return the right thing to start with?

>+IsFragmentUTF8(nsAFlatCString &aCharset,

So an ASCII-only fragment is "not UTF8"?  In any case, see below for why I think this method is not needed.

>+nsTextToSubURI::ReplaceFragment(const nsAFlatCString &aCharset,

What does this method actually do?  That should be documented in the header where it's declared.  It's really hard to review whether it does what it's supposed to otherwise... :(

>+  aURIPath.Replace(start + 1, aFragment.Length(), tmp);

Is there no way we can do something like convert tmp to UTF8, call SetRef (or whaver) on the URI, and then later just convert the whole URI spec from (escaped?) UTF8 to UTF16?  Or something?  I really don't like this hand-parsing of URI strings; more on this below.

>+nsTextToSubURI::UnEscapeFullURIForUIInternal(nsIURI* aURI, nsIURI* aBaseURI,

Please document this method in the header.

>+  NS_ENSURE_ARG_POINTER(aURI);

How could aURI possibly be null here?  This should be an assert, in my opinion.

>+  // Specail case of view-source scheme

"Special".

But why are we special-casing just this?  Why not jar:?  Or other URI schemes that may behave this way?

I'd rather we didn't do this view-source thing and actually fixed the bug on having a sane API for such "nested" URLs, then made this code use that API.  Please file a bug to do that, dependent on whatever the bug on the nested URL API is.  I'm pretty sure we have one already.

>+  // Canonicalize |charset|

And this could go away if aPreferredCharset died, no?  That is, is the charset of an nsIURI required to be canonical?  The API doesn't say....

>+  // Check if the URI has any escaped character.

"Check whether the URI has any escaped characters."

>+  // Check if the URI refers to a network resource.

"Check whether ..."

>+  rv = aURI->GetHost(host);

I'm not sure I follow.  Do you really care whether this is referring to a network resource?  Or just whether there is a host present?  If the latter, that's what the comments should talk about.

>+    // XXX this should be removed after bug 278161 is fixed.

How's anyone going to know?  File a bug dependent on bug 278161 to do said removal.  Then reference that bug number in the comment here.

>+    PRBool isIRI = (!scheme.LowerCaseEqualsLiteral("file"));

No need for the extra parens.

>+  // See if we need to try an charset other than UTF-8

"a charset"

>+  if (!charset.EqualsLiteral("UTF-8")) {
>+    PRBool shouldUseCharset;
>+    if (!aPreferredCharset.IsEmpty()) {
>+      // We should always unescape if the link has an associated charset

We're unescaping no matter what, so I really don't follow the comment...  I'm also not clear on what "the link" means here or why it's relevant that it has a charset attribute.

If you mean that "we should always use the caller-provided charset if there is one", that's a different story.

>+    } else if (!aBaseURI) {
>+      // We should not unescape because the conversion may fail.

Again, we're unescaping no matter what, no?  I don't follow this comment.

>+      // Should use charset for |aURI| refering to a resource
>+      // at the same host as that of |aBaseURI|

Why?  This is the sort of thing where you need to explain what you're really doing in the comments, not the mechanics of it.  In my opinion, of course.  That is, why does the unescaping of the URI depend on aBaseURI?  Will the request we actually make depend on aBaseURI?  I somehow strongly doubt that....

>+      nsCAutoString baseHost;
>+      rv = aBaseURI->GetHost(baseHost);
>+      NS_ENSURE_SUCCESS(rv, rv);

This will break, no?  There's no reason that aBaseURI should have a host.  It could be a data: URI, or a wyciwyg: URI or a jar: URI or any of a whole bunch of other things that throw when you call GetHost() on them.  Or do we want to bail out and use UTF8 for all of those?  How do we know?  For the rest of this review I'll assume that that's not the behavior we want...

>+    if (!shouldUseCharset)
>+      charset = NS_LITERAL_CSTRING("UTF-8");

Again, why?

>+  // Note that We MUST NOT change(unescape/decode) host name for security.

Can you cite a bug here?  If not, I think it's worth explaining the security concern in more detail.  I, for one, fail to see the problem, assuming that we unescape hostnames for DNS lookup... If we don't, then that's the problem, of course.  ;)

>+    rv = aURI->GetUserPass(userPass);
>+    NS_ENSURE_SUCCESS(rv, rv);

This will return failure for data:, jar:, etc URIs.

>+    if (userPass.IsEmpty() || userPass.FindChar('%') == kNotFound) {

Why bother with the IsEmpty() test?  The FindChar() call will return kNotFound if it's empty, I'd hope.

>+  rv = aURI->GetPath(fragment);
>+  NS_ENSURE_SUCCESS(rv, rv);

OK, so that will throw for various URIs.  Is that desirable?

>+  NS_ConvertUTF8toUTF16 path(fragment);
>+
>+  nsCOMPtr<nsIURL> url = do_QueryInterface(aURI);
>+  if (!url) {
>+    if (IsFragmentUTF8(charset, fragment))
>+      AppendUTF8toUTF16(fragment, _retval);
>+    else
>+      _retval += path;
>+    return NS_OK;

So let me get this straight.  |path| is just |fragment| converted from UTF8 to UTF16.  If IsFragmentUTF8 is true, we convert |fragment| from UTF8 to UTF16 and append to _retval.  Otherwise we append |path| to _retval.

Why are we even bothering with the call to IsFragmentUTF8?  We end up doing exactly the same thing no matter what it returns!


>+  // reference
>+  rv = url->GetRef(fragment);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  rv = ReplaceFragment(charset, fragment, '#', path);

I'm pretty sure this will break with jar: URIs where the URI to the zip has a ref.  Consider for example:

  jar:file:///home/bzbarsky/jprof.html.zip#x!/jprof.html#y

your code will unescape 'y' and replace 'x!/jprof.html#y' with the unescaped version.  This is why I commented above about avoiding parsing of URI strings -- you simply don't know exactly how a given URI string will be treated by its protocol handler....

The handling of query and param has the exact same issue.  The query case is the most serious -- while there's really no good reason to stick a ref at the end of the zip file URI in a jar: URI (other than to mess with this code, of course), there are _very_ good reasons to stick a query there: if you want the zip file to depend on some params that your server will receive.

>+  rv = url->GetFilePath(fragment);
>+  NS_ENSURE_SUCCESS(rv, rv);

So this will again throw for data:, etc (if they've gotten down here, of course).  For jar: this won't throw, but return what you're expecting -- if nothing else, the filePath of a jar: URI doesn't start at position 0 in the path.

So it generally seems to me like this code, as written, will only work correctly if aURI is an nsStandardURL....

>Index: intl/uconv/src/nsTextToSubURI.h

Please don't add methods without documenting them.

>Index: toolkit/components/typeaheadfind/content/findBar.js

>+      baseURI = this.makeURI(ownerDoc.location.href, ownerDoc.characterSet);

I believe I agree with darin that this is very cumbersome.  Further, note that ownerDoc.location.href is not necessarily the same thing as a useful "base URI".  Think wycywig: (document.written content) or javscript: or data: document URIs...

>+    if (this.mFoundLink.charset)

That'll produce a strict JS warning.  Not only that, but it seems like this will break if we add a non-HTML link thing that has a |charset| property that has a slightly different meaning....

It'd probably be better to just have an nsIURI accessor for the link URI on links.

>Index: xpcom/io/nsEscape.cpp
>-    PRBool skipControl = (flags & esc_SkipControl); 
>+
>+    // ESC(0x1B) is not skipped for ISO-2022-* even if this flag is true.
>+    PRBool skipControl = (flags & esc_SkipControl);

What's ISO-2022-*?  Where did it come from?  That is, how does this code know that it's dealing with ISO-2022-*?

Also, why is 0x1B special?  This code deserves a whole lot more in the way of documentation, imo.
Attachment #212480 - Flags: review?(bzbarsky) → review-
Comment on attachment 212480 [details] [diff] [review]
Patch rv1.3

ok, as this already has review- as well as superreview-, I'm clearing my review request for now.
Attachment #212480 - Flags: review?(cbiesinger)
Blocks: 261929
Boris:

Hi, sorry for the delay. I'm working on this.

You said:
>>Index: intl/uconv/idl/nsITextToSubURI.idl
> 
>>+#define NS_TEXTTOSUBURI_CID \
>>+{ 0xb5058d64, 0xf289, 0x4d59, \
>>+  { 0x83, 0x3, 0x39, 0xb7, 0x7e, 0xd7, 0x8e, 0xbf } }
>> #define NS_ITEXTTOSUBURI_CONTRACTID "@mozilla.org/intl/texttosuburi;1"
>> %}
> 
> Is there a bug on moving that out of the IDL?  It doesn't belong here.

What should I do? NS_TEXTTOSUBURI_CID is used on here.
http://lxr.mozilla.org/mozilla/ident?i=NS_TEXTTOSUBURI_CID
I think that we need the new file that defines CIDs for each services, right?
# http://lxr.mozilla.org/mozilla/source/intl/uconv/idl/
# all IDL files on intl/uconv/idl are having same issue.
If it's so, I think that it's not scope of this bug...
We're doing per-module CID files, yes.  And yes, it's not in scope for this bug; I just asked whether there's a bug on it.  If not, please file one.
Target Milestone: mozilla1.9alpha1 → ---
Masayuki Nakano: are you still working on this (2 years after)?
QA Contact: amyy → i18n
OK. I should fix this bug for bug 456375 ASAP. However, it might be on Fx3.2.
See also : bug 415491
I'm resetting bugs which are assigned to me but I'm not working on them and I don't have plan for fixing them in near future.
Assignee: masayuki → smontagu
In these days, most websites are written in UTF-8. So, I think that we don't need to take such complicated fix for some websites which still use legacy encoding.
Assignee: smontagu → nobody
Status: ASSIGNED → NEW
Priority: P1 → --
You need to log in before you can comment on or make changes to this bug.