nsIExternalProtocolService.loadURI() throws NS_ERROR_ILLEGAL_INPUT for URLs containing escaped non-ASCII characters

RESOLVED FIXED in Firefox 52

Status

()

Core
General
RESOLVED FIXED
5 years ago
3 months ago

People

(Reporter: wolschke, Assigned: Jorg K (GMT+2))

Tracking

({intl, regression, reproducible})

13 Branch
mozilla54
x86
All
intl, regression, reproducible
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox52 fixed, firefox53 fixed, firefox54 fixed, firefox-esr52 fixed)

Details

(Whiteboard: [regression:TB13])

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
User Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; AskTbAVR-W1/5.15.4.23930)

Steps to reproduce:

I updated to Thunderbird Version 13.0.1 . 
I get html-Mails with several Links to php-programs with a parameter as shown below. 
1. http://www.nix.xx/de/shop/kunde.php?name=Meier
oder
2. http://www.nix.xx/de/shop/kunde.php?name=M%FCller




Actual results:

The link shown in 1. operates well.
The link shown in 2. results no operation (the name Müller contents a special german Letter, this special letter is changed to %FC and that should work).
In version 12.0.1 both links where working well.


Expected results:

The link shown in 2. should send to the server. The server answers with a html-sequence. This answer should be shown in a browser window.
What charater of what charset does correspond to the binary of 0xFC in the escaped URL?
ü(small u umlaut) of Windows-1252, iso-8859-1, and so on?
> http://en.wikipedia.org/wiki/Windows-1252

There is no way to notify browser about charset of escaped data in URL.
UTF-8 for ü is 0xC3BC. ( http://www.fileformat.info/info/unicode/char/fc/index.htm )
Does following link in HTML mail or Text mail(linkified by Tb) work?
(a) Default Browser=Firefox, (b) Default Browser=IE
  http://www.nix.xx/de/shop/kunde.php?name=M%C3%BCller
(Reporter)

Comment 2

5 years ago
Hi WADA,
a) it dosn't work
b) it dosn't work
Both Links are searching a name Müller

You are right, the 0xFC corresponds to ü (small u umlaut)

I installed the version 12.0.1 of thunderbird, both links as described under 
1. http://www.nix.xx/de/shop/kunde.php?name=Meier
and
2. http://www.nix.xx/de/shop/kunde.php?name=M%FCller
are working well
Checked with Tb 13.0.1(en-US) on Japanese Win-XP, with Default Browser=Firefox. 

(1) Content-type: text/html; charset=windows-1252
    <a href="http://www.googl.com?name=M%C3%BCller">link utf-8 is escaped</a>.
Following log was obtained by LiveHTTPHeaders of Firefox. 
> http://www.google.com/?name=M%C3%BCller
> 
> GET /?name=M%C3%BCller HTTP/1.1

(2) Content-type: text/html; charset=windows-1252 
    <a href="http://www.googl.com?name=M%FCller">link windows-1252 is escaped</a>
Following exception was reported to Error Console.
> Timestamp: 2012/07/02 17:24:25
> Error: uncaught exception: [Exception... "Component returned failure code: 0x8050000e (NS_ERROR_ILLEGAL_INPUT) [nsIExternalProtocolService.loadUrl]"  nsresult: "0x8050000e (NS_ERROR_ILLEGAL_INPUT)"  location: "JS frame :: chrome://communicator/content/contentAreaClick.js :: openLinkExternally :: line 188"  data: no]

Confirming.

Phenomenon like next in Tb 13?
(i) Unescape %FC in URL => 0xFC
(ii) Try to escape binary of 0xFC in UTF-8
(iii) The binary of 0xFC produces "wrong char in UTF-8", then exception occurs.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
Summary: Links with modified special german letters don't work in version 13 → Links with escaped special german letters don't work in version 13, with 0x8050000e (NS_ERROR_ILLEGAL_INPUT) [nsIExternalProtocolService.loadUrl], location: "JS frame :: chrome://communicator/content/contentAreaClick.js :: openLinkExternally :: line 188"
(Reporter)

Comment 4

5 years ago
what happens now with the error ?
will anybody fix it ????

Comment 5

5 years ago
Please change Version to Trunk.

I can reproduce in this build.
ftp://ftp.mozilla.org/pub/thunderbird/nightly/2012-11-29-03-03-46-comm-central/thunderbird-20.0a1.en-US.win32.zip

Error: NS_ERROR_ILLEGAL_INPUT: Component returned failure code: 0x8050000e (NS_ERROR_ILLEGAL_INPUT) [nsIExternalProtocolService.loadUrl]
Source File: chrome://communicator/content/contentAreaClick.js
Line: 152
reproduces with current beta
Keywords: intl
OS: Windows XP → All
Whiteboard: [regression:TB13]
Duplicate of this bug: 1100017
Duplicate of this bug: 1335693
Keywords: reproducible
(Assignee)

Comment 9

4 months ago
Hmm, console shows:

NS_ERROR_ILLEGAL_INPUT: Component returned failure code: 0x8050000e (NS_ERROR_ILLEGAL_INPUT) [nsIExternalProtocolService.loadUrl]  contentAreaClick.js:187

And that code is:
function openLinkExternally(url)
{
  let uri = url;
  if (!(uri instanceof Components.interfaces.nsIURI))
    uri = Services.io.newURI(url);

  PlacesUtils.asyncHistory.updatePlaces({
    uri: uri,
    visits:  [{
      visitDate: Date.now() * 1000,
      transitionType: Components.interfaces.nsINavHistoryService.TRANSITION_LINK
    }]
  });

  Components.classes["@mozilla.org/uriloader/external-protocol-service;1"]
            .getService(Components.interfaces.nsIExternalProtocolService)
            .loadUrl(uri);
}

where line 187 is the last line with the loadUrl(uri). So it's a M-C service that failing there, but I didn't check which input it gets.

Maybe Alta88 has got some spare cycles to investigate this further.
Flags: needinfo?(alta88)

Comment 10

4 months ago
it's almost certainly due to what was done in some places in feeds when opening a web page link - the url (string) must be converted to utf8.  there may be a more 'modern' way to do this rather than nsIScriptableUnicodeConverter, which may even be on a hit list..

let converter = Cc["@mozilla.org/intl/scriptableunicodeconverter"]
                 .createInstance(Ci.nsIScriptableUnicodeConverter);
converter.charset = "UTF-8";
url = converter.ConvertToUnicode(url);
Flags: needinfo?(alta88)
(Assignee)

Comment 11

4 months ago
Care to help us out with a patch? That would be much appreciated.
AFAIK, conversion to UTF-8 can be done like this:
https://dxr.mozilla.org/comm-central/search?q=TextEncoder(%22UTF-8%22)&redirect=false
Flags: needinfo?(alta88)
(Assignee)

Comment 12

4 months ago
OK, comment #0 says to use http://www.nix.xx/de/shop/kunde.php?name=M%FCller which I believe is invalid. Since this is an ASCII string, converting this to UTF-8 does nothing. I even tried.

http://stackoverflow.com/questions/29988255/browser-support-for-utf8-encoded-characters-in-urls
suggests to try:
http://example.com/lörickè
or
http://example.com/l%c3%b6rick%c3%a8/

I made myself a message with both and both links work just fine in TB 54.

Note that here the encoding is UTF-8, so ö is %c3%b6.

All the examples from this bug and bug 1100017
http://www.nix.xx/de/shop/kunde.php?name=M%FCller
http://www.cx5-forum.de/f14/zubeh%F6rsammlung-f%FCr-cx-5-a-1245-new-post.html 
http://www.jyllands-posten.dk/protected/premium/indland/ECE7200862/Kommunerne-om-b%F8rnemillioner-R%E6kker-kun-til-015-p%E6dagog/
are not UTF-8 encoded, but windows-1252/ANSI/ISO-8859-1 encoded. But hey, Wada already said that in comment #1 !

Wada also said in comment #3, that if you want Müller, you need to use
http://www.googl.com?name=M%C3%BCller

Just hover you mouse over the links and you can see that FF can handle the UTF-8 ones, but not the ANSI ones.

I don't see a bug here I can fix, I'd mark this invalid. Maybe it worked differently in FF 12.

Comment 13

4 months ago
in Bug 1018589, the issue was that the header (and entire message) was stored as a utf8 string and had to be converted to unicode for the domain (because IDN is now on) part, in order to construct an accepted uri for loadUrl().  if the url isn't encoded correctly per the message charset, i'm not sure what you can do to 'craft' it.
Flags: needinfo?(alta88)
https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
https://en.wikipedia.org/wiki/Internationalized_domain_name
IDN is applicable to domain name which corresponds to "host" part in URI.
  URI =  scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Why IDN can be applied to path part, query part, fragment part of next URI in HTML mail?
  http://www.nix.xx/de/shop/kunde.php?name=M%FCller

Following is apparently wrong action.
1. Unescape %FC of Link in HTML mail and convert it to binary of 0xFC.
2. Try to escape the binary of 0xFC as UTF-8.

HTTP server can freely interpret and use %FC part in URI.
Who is reponsible to correctly escape a non-ascii character in URI according to server's request
   == Who created URI string of link in HTML
"path part, query part, fragment part of next URI in HTML mail" should be passed as-is to browser, after appropriate IDN processing on www.nix.xx part if required.
  <a href="http://www.nix.xx/de/shop/kunde.php?name=M%FCller">Link Text</a>
Please note that "character in URI string of link in HTML" won't be(shouldn't be) affected by character set conversion because "URI of link in HTML" is 7bit-ascii string except IDN part. 

Following are different issues.
(a) Linkify of URI like string in text/plain mail by Thunderbird, with IDN support.
    Because many HTTP servers already interpret escaped non-ascii as UTF-8,
    and many browsers already assumes that servers interpret escaped non-ascii as UTF-8,
    non-ascii character in URI like string is better escaped after UTF-8 conversion.
(b) Generate IDN compatible URI for RSS.
(c) HTML link creation support in HTML mode composer of Thunderbird with IDN support.
(Assignee)

Comment 15

4 months ago
I'd really like some clear answers here, so let's ask someone who knows.

Valentin, are these URLs valid or not:
http://www.nix.xx/de/shop/kunde.php?name=M%FCller
http://www.cx5-forum.de/f14/zubeh%F6rsammlung-f%FCr-cx-5-a-1245-new-post.html 
http://www.jyllands-posten.dk/protected/premium/indland/ECE7200862/Kommunerne-om-b%F8rnemillioner-R%E6kker-kun-til-015-p%E6dagog/

Note that the special characters are encoded with one byte, like %FC, which is some windows-1252 encoding, and not UTF-8, which would be %C3%BC.

I think they are all invalid, at least FF doesn't recognise then when it does recognise UTF-8 encoding like:
http://example.com/l%c3%b6rick%c3%a8/ or
http://www.example.com?name=M%C3%BCller

And a second question: Is it possible that FF 12 recognised them and FF from version 13 doesn't?

I'd really like to close this bug, one way or another.
Flags: needinfo?(valentin.gosu)

Comment 16

4 months ago
Hi,


1. My point of view
-------------------

In my opinion, those URL are valid.

It is the responsibility of the application to know the encoding of the parameters.

The application can expect UTF-8 *or* ISO-8859-1 *or* another encoding.

The browser just has to call the URL.

2. Firefox behavior
-------------------

Firefox loads the URL, but because UTF-8 is the better and recommended encoding today, it *visually* turns the URL http://example.com/?p=%C3%A9 into http://example.com/?p=é.

And, if you write the URL http://example.com/?p=é in the URL bar, Firefox *assumes* that it's encoded in UTF-8 and sends an HTTP request with 'C3%A9'.

And Firefox doesn't try to translate this URL (because it's obviously not UTF-8 encoded), but it loads it : http://example.com/?p=%E9.

3. RFC 3986 : Uniform Resource Identifier (URI): Generic Syntax
---------------------------------------------------------------

2.  Characters

   The URI syntax provides a method of encoding data, presumably for the
   sake of identifying a resource, as a sequence of characters. The URI
   characters are, in turn, frequently encoded as octets for transport
   or presentation.  This specification does not mandate any particular
   character encoding for mapping between URI characters and the octets
   used to store or transmit those characters.  


   ***When a URI appears in a
   protocol element, the character encoding is defined by that protocol;
   without such a definition, a URI is assumed to be in the same
   character encoding as the surrounding text.***

1.2.1.  Transcription

   In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this 
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.  Such a definition should specify the
   character encoding used to map those characters to octets prior to
   being percent-encoded for the URI.

2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.  A percent-encoded octet is encoded as a character
   triplet, consisting of the percent character "%" followed by the two
   hexadecimal digits representing that octet's numeric value.  For
   example, "%20" is the percent-encoding for the binary octet
   "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
   character (SP).  Section 2.4 describes when percent-encoding and
   decoding is applied.

https://tools.ietf.org/html/rfc3986

Comment 17

4 months ago
(In reply to msd+bugzilla from comment #16)
> 
> In my opinion, those URL are valid.
> 

There's no possible way to make that determination, given that no one has attached an actual email containing such a link. So the email's charset and the html document's charset (if it's even html) are unknown.

Further, the failure is in loadUrl(), which was passed an nsIURI with a value it rejects, with a pretty clear NS_ERROR_ILLEGAL_INPUT. So commentary regarding anything that Fx may do in the url bar or anything with what servers accept is irrelevant.
Indeed, the URLs, even though they are valid by themselves, there is no way for the browser to know that %FC is actually ü, without the charset being specified in some way.
Converting the URL to a UTF8 encoding _may_ be possible, but I'm not sure there is a one-size-fits-all solution here.
Flags: needinfo?(valentin.gosu)

Comment 19

4 months ago
But why, in Thunderbird, do you need to know the charset of the URL's parameters ? Except for a better *user* display, I don't understand why.

If you want to display to the user an 'é' intead of 'C3%A9', then you should do it only when you can presume it's UTF-8. If you can't because it is '%E9', then just display '%E9'. No ?

But when the user clicks on the URL, please, throw the URL to the browser with percent-encoding values ! Currently Thunderbird does nothing !
(Assignee)

Comment 20

4 months ago
(In reply to msd+bugzilla from comment #19)
> But when the user clicks on the URL, please, throw the URL to the browser
> with percent-encoding values ! Currently Thunderbird does nothing !
The URL is passed to the Mozilla Core function loadUrl() and that function returns an error, see comment #9. Most likely it expects UTF-8 encoded URLs, as was previously stated.

In a windows1252-encoded webpage, http://example.com/?p=%E9 does in fact translate to http://example.com/?p=é, however, when clicking the link, http://example.com/?p=%E9 appears in the URL bar.

Try:
<html><head><meta http-equiv="content-type" content="text/html; charset=windows-1252"></head>
<body><a href="http://example.com/?p=%E9">Test</a></body></html>

So we could drill open the box and transcode http://example.com/?p=%E9 to http://example.com/?p=%C3%A9 before passing it to loadUrl() if that link was found in a windows1252-encoded e-mail body. Alta88 already suggested that in comment #10.

My apologies for dragging out the discussion.

Comment 21

4 months ago
1. URL are in US-ASCII
----------------------

> The URL is passed to the Mozilla Core function loadUrl() and that function returns an error, see comment #9. Most likely it expects UTF-8 encoded URLs, as was previously stated.

The URL http://example.com/?p=%E9 and http://example.com/?p=%C3%A9 are both encoded in _US-ASCII_. 

So I don't understand why the Mozilla Core function loadUrl() tries to percent-decode the URL. *For me, the problem is may be here.*

2. The sample webpage
---------------------

> In a windows1252-encoded webpage, http://example.com/?p=%E9 does in fact translate to http://example.com/?p=é, however, when clicking the link, http://example.com/?p=%E9 appears in the URL bar.

It's interesting to see that Firefox behavior differs from Chrome and Internet Explorer behavior with your sample.

Firefox shows 'http://example.com/?p=é'.
Chrome and IE show 'http://example.com/?p=%E9'

With this page (be careful to encode it in Windows-1252), it's even better to understand that be displaying behavior differs : 

<html><head><meta http-equiv="content-type" content="text/html; charset=windows-1252"></head><body><a href="http://example.com/?p=%E9">link1 - %E9</a> - <a href="http://example.com/?p=%C3%A9">link2 - %CE%A9</a> - <a href="http://example.com/?p=é">link3 - é</a></body></html>

- Firefox :
  - link1 : show 'é'
  - link2 : show 'é'
  - link3 : show 'é'
- Chrome :
  - link1 : show '%E9' 
  - link2 : show 'é' (because it *presume* it's UTF-8)
  - link3 : show '%E9' (because of the _encoding of the page_)
- Internet Explorer 11 : 
  - link1 : show '%E9' 
  - link2 : show '%CE%A9'
  - link3 : show 'é'

But it's just information for the users.

When you click on the link1, on each browser, you reach the good webpage : http://example.com/?p=%E9
When you click on the link2, on each browser, you reach the good webpage : http://example.com/?p=%C3%A9
When you click on the link3 :
  - Firefox : http://example.com/?p=%E9
  - Chrome  : http://example.com/?p=%E9
  - IE11    : http://example.com/?p=é (and I don't know what is technically sent, surely '%E9')

3. Not a good solution to use the encoding of the email body
------------------------------------------------------------

For me, the encoding of the e-mail body is not the good encoding to decode the percent-encoded URL.

I explain why :

I want to send you a link to a blog http://blog.example.com/?p=%E9. This blog expect the param p to be encoded in ISO-8859-1.
So I write you an e-mail, and I share you this link : http://blog.example.com/?p=%E9.
The body of my email can be encoded in UTF-8, in ISO-8859-1 or anything else, but it's not the good encoding to match the encoding of %E9.
In fact, in Thunderbird, _you can't know the expected encoding_ : only the programmer of blog.example.com knows. 
And it's not a problem because Thunderbird doesn't have to percent-decode this URL it is not its responsibility.

The URL http://blog.example.com/?p=%E9 is in _US-ASCII_.

The param 'p' of the link http://example.com/?p=%41A is 'AA' with :
- UTF-8
- windows-1252
- iso-8859-1
- iso-8859-15

The param 'p' of the link http://example.com/?p=%BE is : 
- '¾' with iso-8859-1
- 'Ÿ' with iso-8859-15

The param 'p' of the link http://example.com/?p=%A4 is : 
- '¤' with iso-8859-1
- '€' with iso-8859-15

And Thunderbird can not know what is the expected encoding. Only example.com knows.

So it's impossible to convert http://example.com/?p=%BE in UTF-8 because Thunderbird don't know what is the good encoding.

4. To resume my though
----------------------

- Thunderbird doesn't have to percent-decode the URL (except may be for a display comfort for the user like Firefox and Chrome do)
- Firefox doesn't have to percent-decode the URL (except may be for a display comfort)
- The final website have to percent-decode the URL, and, perfect, this website knows what is the good encoding.

So, if I well understand, the problem is that the Mozilla Core function loadUrl() tries to percent-decode the URL, and should not have to do this. But why ?
(Assignee)

Comment 22

4 months ago
Thanks for you comment #21. Interesting reading. And thanks for doing a browser comparison.

You are quite right, all the mentioned URLs are US-ASCII and it's surprising that loadUrl() rejects those where the  percent encoding doesn't represent UTF-8.

A URL consists of various parts:
https://dxr.mozilla.org/mozilla-central/rev/af8a2573d0f1e9cc6f2ba0ab67d7a702a197f177/netwerk/base/nsIURI.idl#16
and in the examples in comment #15, only pathname and query contain percent encodings, not the hostname.

So your suggestion is to accept anything in the pathname and query and just open a that URL and let the server sort out which page needs to be served.

I looked up loadUrl() and it's deprecated:
https://dxr.mozilla.org/mozilla-central/rev/af8a2573d0f1e9cc6f2ba0ab67d7a702a197f177/uriloader/exthandler/nsIExternalProtocolService.idl#96
Another function loadURI() should be used instead. Let me try that and report back.
(Assignee)

Comment 23

4 months ago
I changed the code to use loadURI() and the result is just the same: NS_ERROR_ILLEGAL_INPUT

Mozilla Core code is long and twisted, but I found the function here:
http://searchfox.org/mozilla-central/source/uriloader/exthandler/nsExternalHelperAppService.cpp#1019

I tried a search for NS_ERROR_ILLEGAL_INPUT:
http://searchfox.org/mozilla-central/search?q=NS_ERROR_ILLEGAL_INPUT&case=false&regexp=false&path=
It seems to be mostly used in the "intl" module.

In the end I traced through it and I got NS_ERROR_UDEC_ILLEGALINPUT returned from 
rv = unicodeDecoder->Convert(aURI.get(), &srcLen, ustr, &dstLen);
where the input was surprisingly mData = 0x000000000033bec0 "http://example.com/?name=Müller/". Stack is this:
xul.dll!nsTextToSubURI::convertURItoUnicode(const nsCString & aCharset, const nsCString & aURI, nsAString_internal & _retval) Line 210	C++
xul.dll!nsTextToSubURI::UnEscapeNonAsciiURI(const nsACString_internal & aCharset, const nsACString_internal & aURIFragment, nsAString_internal & _retval) Line 285	C++
xul.dll!nsMIMEInfoWin::LoadUriInternal(nsIURI * aURL) Line 238	C++
xul.dll!nsMIMEInfoBase::LaunchWithURI(nsIURI * aURI, nsIInterfaceRequestor * aWindowContext) Line 326	C++
xul.dll!nsExternalHelperAppService::LoadURI(nsIURI * aURI, nsIInterfaceRequestor * aWindowContext) Line 1082	C++

The key to understanding the problem is this function:

NS_IMETHODIMP  nsTextToSubURI::UnEscapeNonAsciiURI(const nsACString & aCharset, 
                                                   const nsACString & aURIFragment, 
                                                   nsAString &_retval)
{
  nsAutoCString unescapedSpec;
  NS_UnescapeURL(PromiseFlatCString(aURIFragment),
                 esc_AlwaysCopy | esc_OnlyNonASCII, unescapedSpec);
  // leave the URI as it is if it's not UTF-8 and aCharset is not a ASCII
  // superset since converting "http:" with such an encoding is always a bad 
  // idea.
  if (!IsUTF8(unescapedSpec) && 
      (aCharset.LowerCaseEqualsLiteral("utf-16") ||
       aCharset.LowerCaseEqualsLiteral("utf-16be") ||
       aCharset.LowerCaseEqualsLiteral("utf-16le") ||
       aCharset.LowerCaseEqualsLiteral("utf-7") ||
       aCharset.LowerCaseEqualsLiteral("x-imap4-modified-utf7"))){
    CopyASCIItoUTF16(aURIFragment, _retval);
    return NS_OK;
  }

  return convertURItoUnicode(PromiseFlatCString(aCharset), unescapedSpec, _retval);
}

In my debugging I get:
aURIFragment = {mData=0x0000000008830e48 "http://example.com/?name=M%FCller/" mLength=0x00000022 mFlags=0x00010005 }
unescapedSpec = {mStorage=0x000000000033bec0 "http://example.com/?name=Müller/" }

but then, as mentioned, rv = unicodeDecoder->Convert(aURI.get(), &srcLen, ustr, &dstLen); fails.

I think it's time to ask someone who knows.

Masatoshi-san, can you help us here. The problem is this:

In an e-mail we have this link: http://example.com/?name=M%FCller/ and when we click on it, it goes through loadUrl() or loadURI() and the stack shown above.

In nsTextToSubURI::UnEscapeNonAsciiURI() the text is un-escaped to http://example.com/?name=Müller/ which is passed to convertURItoUnicode() and that fails. Why?

There are of course other problems here which we should ask other people, like for example: Why unescape the entire URI when the pathname and query should just be left alone.
Flags: needinfo?(VYV03354)
Information is insifficient to answer the question.

> unescapedSpec = {mStorage=0x000000000033bec0 "http://example.com/?name=Müller/" }

What is the binary byte value of "ü" in this string? ("\xFC" or "\xC3\xBC"?) What is aCharset value?
Flags: needinfo?(VYV03354) → needinfo?(jorgk)
FYI, bug 227268 added the UnEscapeNonAsciiURI() call.
(Assignee)

Comment 26

4 months ago
I added this in nsTextToSubURI::convertURItoUnicode() since I don't know how to convince the MS VC++ debugger to show me the bytes:
  printf("=== charset %s\n", aCharset.get());
  int32_t i;
  for (i=0; i< srcLen; i++)
    printf("=== %x\n", aURI.get()[i]);
  rv = unicodeDecoder->Convert(aURI.get(), &srcLen, ustr, &dstLen);
  // rv = NS_OK;
and that gives (I added the stuff in parenthesis):

=== charset UTF-8
=== 68 (h)
=== 74 (t)
=== 74 (t)
=== 70 (p)
[stuff deleted]
=== 3d (=)
=== 4d (M)
=== fffffffc
=== 6c (l)
=== 6c (l)
=== 65
=== 72
=== 2f

Sorry about the clumsy output, looks like the leading bit is carried into the whole int.
But as you can see, the answer is "\xFC". So that's ANSI and most likely a result of the unescaping of %FC.

I can see two problems here:
1) LoadURI() don't take a charset argument (unless that can be extracted from 'aWindowContext').
2) If UnEscapeNonAsciiURI() produced something that isn't UTF-8, it should most likely use the escaped value.
   If I just hover with the mouse over the HTML, I can see the same debug output.
   So the same function is called also without a click. Since there is an encoding error, the
   display just comes out as: http://example.com/?name=M%FCller/
   We should do the same in the click processing. So if there's an encoding error, revert back to the
   original escaped string.
Flags: needinfo?(jorgk)
(In reply to Jorg K (GMT+1) from comment #26)
> 2) If UnEscapeNonAsciiURI() produced something that isn't UTF-8, it should
> most likely use the escaped value.
>    If I just hover with the mouse over the HTML, I can see the same debug
> output.
>    So the same function is called also without a click. Since there is an
> encoding error, the
>    display just comes out as: http://example.com/?name=M%FCller/
>    We should do the same in the click processing. So if there's an encoding
> error, revert back to the
>    original escaped string.

It is caller's responsibility to handle unEscapeNonAsciiURI() failures because the nsITextToSubURI contract says unEscapeNonAsciiURI() can fail:
https://dxr.mozilla.org/mozilla-central/source/intl/uconv/nsITextToSubURI.idl#48-49
>   * @throws NS_ERROR_UCONV_NOCONV when there is no decoder for aCharset
>   *         or error code of nsIUnicodeDecoder in case of conversion failure

Blame nsMIMEInfoWin in this case.
(Assignee)

Comment 28

4 months ago
Thanks.

https://dxr.mozilla.org/mozilla-central/rev/f4f374622111022d41dd8d5eb9220624135c534a/uriloader/exthandler/win/nsMIMEInfoWin.cpp#238
    rv = textToSubURI->UnEscapeNonAsciiURI(urlCharset, urlSpec, utf16Spec);
    NS_ENSURE_SUCCESS(rv, rv);
Who would be in charge of that?
Bz according to <https://wiki.mozilla.org/Modules/Core>?
Feel free to redirect if you know a more appropriate person.
Flags: needinfo?(bzbarsky)
(Assignee)

Comment 30

4 months ago
OK, Boris, here's the problem. If you click this link http://example.com/?name=M%FCller/ in a mail display, you trigger an error in nsTextToSubURI::convertURItoUnicode() since one level up in nsTextToSubURI::UnEscapeNonAsciiURI() the %FC was converted to an ANSI character "\xFC" by calling NS_UnescapeURL().

Relevant comments: Comment #23 with call stack and comment #26 with more debug. The caller of UnEscapeNonAsciiURI() is here:
https://dxr.mozilla.org/mozilla-central/rev/f4f374622111022d41dd8d5eb9220624135c534a/uriloader/exthandler/win/nsMIMEInfoWin.cpp#238

The interesting thing is that when you hover the link, it displays as http://example.com/?name=M%FCller/ in the status bar. In my debug I can see that it also goes through nsTextToSubURI::convertURItoUnicode() but that is on another code path which handles the error better:

xul.dll!nsTextToSubURI::convertURItoUnicode(const nsCString & aCharset, const nsCString & aURI, nsAString_internal & _retval) Line 215	C++
xul.dll!nsTextToSubURI::UnEscapeURIForUI(const nsACString_internal & aCharset, const nsACString_internal & aURIFragment, nsAString_internal & _retval) Line 237	C++
xul.dll!nsDocShell::OnOverLink(nsIContent * aContent, nsIURI * aURI, const char16_t * aTargetSpec) Line 14208	C++

You have to look twice: The click calls UnEscapeNonAsciiURI() and the hover calls UnEscapeURIForUI() and the latter has:

  if (convertURItoUnicode(
                PromiseFlatCString(aCharset), unescapedSpec, _retval)
      != NS_OK) {
    // assume UTF-8 instead of ASCII  because hostname (IDN) may be in UTF-8
    CopyUTF8toUTF16(aURIFragment, _retval);
  }

Who would have some spare cycles for this? This bug has been on the books for five years. Surely I can send a small patch to rectify this, by the pain and time-consuming part is always to come up with some test.
UnEscapeURIForUI() is for UI (as the name implies) and the conversion is lossy. Don't mimic the behavior here.
OK, so the nsMIMEInfo* impls are owned by per-platform experts in theory, at least in parts.  In this case, probably Jim Mathies [:jimm] for the interaction with ShellExecute.

Anyway, if we do need to unescape before calling ShellExecute, and assuming we do in fact need to pass punycoded hostnames (so have to GetAsciiSpec on the URI)  there are several problems here:

1) We do in fact need a fallback if UnEscapeNonAsciiURI fails.  Just doing CopyASCIItoUTF16(urlSpec, utf16Spec) is probably the best we can do under the circumstances.  It will mean you kinda get garbage in some cases (e.g. those of bug 227268 when we end up taking the fallback path), but slightly-garbage is probably better than nothing at all.

2) The fact that we have a URI claiming a charset of UTF-8 that is clearly not escaped-UTF8 is somewhat wrong.  OK, but where does that URI come from?  Most immediately, nsExternalHelperAppService::LoadURI does this:

  nsAutoCString spec;
  aURI->GetSpec(spec);
  // Do some fixups on "spec"
  nsCOMPtr<nsIURI> uri;
  nsresult rv = ios->NewURI(spec, nullptr, nullptr, getter_AddRefs(uri));

It sure would be a good idea to pass the charset of aURI in as the second argument there.  Whether that would help on its own or not depends on where aURI is coming from.  The caller is openLinkExternally in contentAreaClick.js which has:

  if (!(uri instanceof Components.interfaces.nsIURI))
    uri = Services.io.newURI(url);

Note lack of charset arg.  So if a string is passed in, you will get a URI that thinks it's UTF-8-encoded.  If an actual URI is passed in we could maybe do better.  It looks like the caller in contentAreaClick in fact passes a string.  Passing along a charset, and specifically the charset of the target's owner document when the target is a node, might help somewhat in some cases.

You should probably get two separate bugs going here; one on issue #1 (which is a core issue) and one on #2, which is a Thunderbird UI issue.

As for testing... automated testing is hard, obviously, because the whole point of this codepath is to hand a load over to a different application.  Manual testing should probably be done; the obvious cases are http:// links being handed from Thunderbird to Firefox and other browsers and mailto: links handed from Firefox to Thunderbird and other mail clients (e.g. bug 227268 had some proposed fixes that worked with Thunderbird but not Outlook).  And obviously whatever the steps are in bug 227268.  But the upshot of at least the core change is that things will be no worse than before bug 227268 in the failing case (and better in the success case).  I wouldn't necessarily block landing on the manual testing here.
Flags: needinfo?(bzbarsky)
I guess actually we need three bugs:

1)  Add the fallback in mimeinfowin.  This is pure win, afaict, and can just land.
2)  Pass aURI's charset in nsExternalHelperAppService::LoadURI when calling NS_NewURI.  This part
    would probably need some careful testing before we land it.
3)  Pass a useful charset in to all this stuff in the Thunderbird UI.  Again, needs testing but no opinion
    whether that should block it being checked in.
(Assignee)

Comment 34

4 months ago
Boris, thanks for the detailed comment. You missed one question: ;-)
Who would have some spare cycles for this? I mean the M-C part. We're happy to do the TB parts of course.
Item #1 of the m-c part should be pretty simple.  I'll review if you write the patch.  ;)  That will at least get you to the point where the URI loads, whether it's "right" or not.

Item #2... I don't know.  You could ask Jim and see whether he has time or recommendations for someone who has time.
(Assignee)

Comment 36

4 months ago
Created attachment 8835148 [details] [diff] [review]
770022-unescape-fallback.patch (v1).

That's what you asked for. Works fine, BTW.

In case you approve, kindly add "checkin-needed". You know it's late here.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attachment #8835148 - Flags: review?(bzbarsky)
(Assignee)

Updated

4 months ago
Component: General → General
Product: Thunderbird → Core
Comment on attachment 8835148 [details] [diff] [review]
770022-unescape-fallback.patch (v1).

r=me
Attachment #8835148 - Flags: review?(bzbarsky) → review+
Keywords: checkin-needed

Comment 38

4 months ago
Pushed by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a765f1fd046e
Provide fallback when UnEscapeNonAsciiURI() fails in nsMIMEInfoWin::LoadUriInternal(). r=bz
Keywords: checkin-needed

Comment 39

3 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a765f1fd046e
Status: ASSIGNED → RESOLVED
Last Resolved: 3 months ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
I'm assuming you want this for TB52, so go ahead and request Aurora/Beta approval when ready :)
status-firefox52: --- → affected
status-firefox53: --- → affected
Flags: needinfo?(jorgk)
(Assignee)

Comment 41

3 months ago
Comment on attachment 8835148 [details] [diff] [review]
770022-unescape-fallback.patch (v1).

Approval Request Comment
[Feature/Bug causing the regression]: I don't know. We never looked for the regression, apparently this worked in TB/FF 12 but not 13.
[User impact if declined]: Can't click on some URLs in an e-mail display.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: Yes, local TB build.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No, just a small fix to provide some fallback behaviour when some URL with encoded international characters can't be decoded.
[Why is the change risky/not risky?]: See above.
[String changes made/needed]: None.
Flags: needinfo?(jorgk)
Attachment #8835148 - Flags: approval-mozilla-beta?
Attachment #8835148 - Flags: approval-mozilla-aurora?
Comment on attachment 8835148 [details] [diff] [review]
770022-unescape-fallback.patch (v1).

Minor fix/fallback for escaped characters, looks pretty safe.
Attachment #8835148 - Flags: approval-mozilla-beta?
Attachment #8835148 - Flags: approval-mozilla-beta+
Attachment #8835148 - Flags: approval-mozilla-aurora?
Attachment #8835148 - Flags: approval-mozilla-aurora+
Jorg, does this still need followup as bz mentions in comment 33? Or is what you have enough for now?
Flags: needinfo?(jorgk)
(Assignee)

Comment 44

3 months ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #42)
> Minor fix/fallback for escaped characters, looks pretty safe.
Thanks for the approval, yes, it looks pretty safe ;-)

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #43)
> Jorg, does this still need followup as bz mentions in comment 33? Or is what
> you have enough for now?
Thanks for the question. I was going to ask Boris if he would be very upset if we only went with the first part of the solution he envisaged since that's all we need at the moment. I've just tested it again in TB:
Clicking on http://example.com/?name=M%FCller/ shows exactly that in the status bar and opens exactly that URL.

Boris envisaged a solution which would take the charset into account and maybe show http://example.com/?name=Müller/ when hovering the link and opening the that link, too. I'm not convinced that this is required, at least not according to comment #21 (last paragraph).

So quick answer: What we have is enough for now, and perhaps even for the foreseeable future.

Thanks again!
Flags: needinfo?(jorgk)

Comment 45

3 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/afe4b9f1e68f
status-firefox53: affected → fixed

Comment 46

3 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/b9bc6b310328
status-firefox52: affected → fixed
If hovering the link doesn't show the "right" thing, then I don't think you'd get it even if you passed through the document charset.

I do think it's worth filing a bug on Gecko to at least do its part (use the incoming URI charset) right.
(Assignee)

Comment 48

3 months ago
Thanks for the comment, Boris, but I'm a little confused. As I pointed out in comment #20, if you hover the following in FF

<html><head><meta http-equiv="content-type" content="text/html; charset=windows-1252"></head>
<body><a href="http://example.com/?p=%E9">Test</a></body></html>

you see http://example.com/?p=é and if you click you open http://example.com/?p=%E9.

Is that not the desired behaviour that's already working? So which bug do you want me to file? Reading your comment #32 and comment #33 again I can see that you want to pass the character set further down the line, which is a good idea in general, but what's the immediate demonstrable bug in FF that this would fix?

If I hover this in TB with the e-mail charset of windows-1252, I see http://example.com/?p=%E9, but clicking opens http://example.com/?p= (yes, truncated). However, using http://example.com/?p=%E9e hovers and opens http://example.com/?p=%E9e. And http://example.com/?p=%E9%E9 hovers and opens http://example.com/?p=%E9%E9, so I'm not sure who eats the single character there. Strange.

But that's not the bug you're talking about.

So I could file two TB bugs here:
  Hovering http://example.com/?p=%E9 in windows-1252 e-mail doesn't show http://example.com/?p=é like in FF
and
  Clicking http://example.com/?p=%E9 in windows-1252 e-mail opens http://example.com/?p=
But for both, something seems to have gone wrong at the TB/Gecko interface. Without investigating, I can't blame that on Gecko ;-)
Flags: needinfo?(bzbarsky)
(Assignee)

Updated

3 months ago
Duplicate of this bug: 1138608
(Assignee)

Comment 50

3 months ago
I filed bug 1338758 for the second (truncation) issue. I also came across bug 1138608 which got fixed here.
> Is that not the desired behaviour that's already working?

Yes.

> So which bug do you want me to file?

Comment 33 item 2.

> but what's the immediate demonstrable bug in FF that this would fix?

Try doing the same thing but replace your http:// URL with something Firefox does not handle internally (e.g. mailto:).  See what gets handed out to the external helper app then.  Ideally, you'd register an app for said protocol that just spits out what it got, so you can easily observe incoming values.

No, it's not an obvious issue.  ;)

I can't speak to the TB issues.
Flags: needinfo?(bzbarsky)
(Assignee)

Updated

3 months ago
Blocks: 1338900
(Assignee)

Comment 52

3 months ago
Done, bug 1338900. I hope that's what you wanted, if not, please change the summary and/or post a comment there.
(Assignee)

Updated

3 months ago
No longer blocks: 1338900
Summary: Links with escaped special german letters don't work in version 13, with 0x8050000e (NS_ERROR_ILLEGAL_INPUT) [nsIExternalProtocolService.loadUrl], location: "JS frame :: chrome://communicator/content/contentAreaClick.js :: openLinkExternally :: line 188" → nsIExternalProtocolService.loadURI() throws NS_ERROR_ILLEGAL_INPUT for URLs containing escaped non-ASCII characters
Setting qe-verify- per comment 41.
Flags: qe-verify-

Comment 54

3 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-esr52/rev/b9bc6b310328
status-firefox-esr52: --- → fixed
You need to log in before you can comment on or make changes to this bug.