Closed Bug 1488356 Opened Last year Closed Last year

Port Bug 1488115 Remove the XPCOM component registration for nsUTF8ConverterService

Categories

(MailNews Core :: General, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 63.0

People

(Reporter: jorgk, Assigned: jorgk)

Details

Attachments

(4 files, 3 obsolete files)

We need to remove
#include "nsUConvCID.h"

and replace used of nsIUTF8ConverterService in nsSmtpService.cpp.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Comment on attachment 9006182 [details] [diff] [review]
1488356-nsUTF8ConverterService.patch

Henri, do you see a better solution for this than this fork?
Attachment #9006182 - Flags: feedback?(hsivonen)
Keywords: leave-open
Version: 24 → Trunk
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/1523aef1cb86
Port Bug 1488115: Replace use of nsUTF8ConverterService. rs=bustage-fix
Comment on attachment 9006182 [details] [diff] [review]
1488356-nsUTF8ConverterService.patch

Review of attachment 9006182 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not aware of a better solution, but since you call ToUTF8() only once, you could manually inline ToUTF8() and remove the branch that never runs given the boolean argument you pass.
Attachment #9006182 - Flags: feedback?(hsivonen) → feedback+
Hmm, I have failing tests now, too:

mozmake SOLO_TEST=composition/test-charset-upgrade.js mozmill-one
mozmake SOLO_TEST=composition/test-blocked-content.js mozmill-one

Both give:
EXCEPTION: Cc['@mozilla.org/intl/utf8converterservice;1'] is undefined, can't access property "getService" of it
    at: test-compose-helpers.js line 486
I really can't work out what
-  return Cc["@mozilla.org/intl/utf8converterservice;1"]
-           .getService(Ci.nsIUTF8ConverterService)
-           .convertURISpecToUTF8(content, "UTF-8");
was meant to do. Convert a body from which charset exactly to UTF-8? I don't get it. So I took a few stabs in the dark without success. Hard to tell where things go wrong, since I get a python error
  UnicodeEncodeError: 'charmap' codec can't encode characters in position 33-46
  most likely coming from prettyPrintException/prettyPrintResults :-(

I also tried decoding using TextEncoder, or should it be *de*coder, but that gave JS errors of TextEncoder not being defined. I thought it was somehow inbuilt.

I noticed that the 'true' argument in test-blocked-content.js can just be removed.
Attachment #9006230 - Flags: feedback?(acelists)
I finally worked it out ;-) - The comment was all wrong.
Attachment #9006230 - Attachment is obsolete: true
Attachment #9006230 - Flags: feedback?(acelists)
Attachment #9006231 - Flags: review?(acelists)
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> I'm not aware of a better solution, but since you call ToUTF8() only once,
> you could manually inline ToUTF8() and remove the branch that never runs
> given the boolean argument you pass.
Thanks Henri, I prefer to leave the copied code as it was copied.
Keywords: leave-open
Target Milestone: --- → Thunderbird 63.0
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/88de6bc9d111
replace use of nsUTF8ConverterService in MozMill's get_msg_source(). rs=bustage-fix DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
That comment threw me:
- * @param aUTF8    True if the contents should be returned in UTF-8.

The content was always returned as a JS string (internally UFT-16). The parameter was indicating whether the message content needed to be decoded. Since it was using that strange function, it could only decode UTF-8. Now it's more useful since it can decode anything.
Another question: The other six NewURI() functions *ignore* the charset argument. Should we do it here, too?
https://searchfox.org/comm-central/search?q=%3A%3ANewURI(&case=false&regexp=false&path=mailnews%2F
Oh, I see, this is for mailto: URL. There we can any type of wild stuff, so we'll have to leave the charset processing.
Comment on attachment 9006231 [details] [diff] [review]
1488356-mozmill.patch (v2)

Review of attachment 9006231 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thanks.
Attachment #9006231 - Flags: review?(acelists) → review+
Comment on attachment 9006236 [details] [diff] [review]
1488356-follow-up-ignore-charset.patch - MUST not be used

OK, the try was green here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=82ae420446994333a28870bb53e4ae3c4b2f85b5

Should be ignore the charset here?
Attachment #9006236 - Flags: feedback?(mkmelin+mozilla)
Attachment #9006236 - Flags: feedback?(acelists)
Comment on attachment 9006182 [details] [diff] [review]
1488356-nsUTF8ConverterService.patch

Magnus, do you want to take this any further, like inline ToUTF8() as Henri suggested in comment #4. If we rip it out, of course this question doesn't need to be answered.
Attachment #9006182 - Flags: review?(mkmelin+mozilla)
Attachment #9006182 - Flags: feedback?(acelists)
Comment on attachment 9006236 [details] [diff] [review]
1488356-follow-up-ignore-charset.patch - MUST not be used

Review of attachment 9006236 [details] [diff] [review]:
-----------------------------------------------------------------

If this works then all the better
Attachment #9006236 - Flags: feedback?(mkmelin+mozilla) → feedback+
(In reply to Magnus Melin from comment #17)
> If this works then all the better
Define "works". No tests fail. But I thing our forebears had a reason to acknowledge the charset for this type of link.

Henri, can you help? Any idea why all our NewURI() functions ignore the charset but this one? See comment #11. What's the mechanism of calling NewURI() and who would pass a charset?
Flags: needinfo?(hsivonen)
(In reply to Jorg K (GMT+2) from comment #18)
> Henri, can you help? Any idea why all our NewURI() functions ignore the
> charset but this one? See comment #11. What's the mechanism of calling
> NewURI() and who would pass a charset?

AFAIK, a non-UTF-8 charset would be passed if the URL object is created because the URL string exists in a link in a non-UTF-8 HTML page.

I'm guessing that the reason for ignoring the charset is that those URL schemes either don't have parts that would be analogous to non-ASCII query parts in http: URLs or that there doesn't exist a use case for URLs in those schemes to occur as links from an HTML page encoded in a legacy encoding.

https://tools.ietf.org/html/draft-earhart-url-smtp-00 looks like an expired draft, so it's unclear to me how/where smtp: URLs would even be used.
Flags: needinfo?(hsivonen)
Thanks, Henri!

(In reply to Henri Sivonen (:hsivonen) from comment #19)
> AFAIK, a non-UTF-8 charset would be passed if the URL object is created
> because the URL string exists in a link in a non-UTF-8 HTML page.
Oh! Well, this is about mailto: links and surely they can exist on any webpage. So we *cannot* ignore the charset here. 

> I'm guessing that the reason for ignoring the charset is that those URL
> schemes either don't have parts that would be analogous to non-ASCII query
> parts in http: URLs or that there doesn't exist a use case for URLs in those
> schemes to occur as links from an HTML page encoded in a legacy encoding.
The other URLs are internal, like imap, mailbox, pop, nntp/news, and yes, we control their parts.

> https://tools.ietf.org/html/draft-earhart-url-smtp-00 looks like an expired
> draft, so it's unclear to me how/where smtp: URLs would even be used.
We use them internally. But the question is about mailto: URLs.
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/897228f8e667
Added comment why charset cannot be ignored for mailto: URLs. rs=comment-only DONTBUILD
Attachment #9006236 - Attachment description: 1488356-follow-up-ignore-charset.patch - not useful → 1488356-follow-up-ignore-charset.patch - MUST not be used
Attachment #9006236 - Attachment is obsolete: true
Attachment #9006236 - Flags: feedback?(acelists)
(In reply to Pulsebot from comment #21)
> Pushed by mozilla@jorgk.com:
> https://hg.mozilla.org/comm-central/rev/897228f8e667
> Added comment why charset cannot be ignored for mailto: URLs.

I imagine this is wrong though. How do you end up with knowing a charset?

For any calls coming from us from say Firefox, there is no notion of encoding included in the mailto url. We must assume utf-8. 

I didn't check how mailto links in internal messages would do, but there too, I'd assume we're making them utf-8 before encoding (if anything) as the charset wouldn't be passed on to the system mailto handler which could be Thunderbird, or not.

Furthermore, the mailto url parameters must be urlencoded already by the author.
(In reply to Magnus Melin from comment #22)
> I imagine this is wrong though. How do you end up with knowing a charset?
Yes, those calls come out of M-C core. Best way to try it would be to do some windows-1252 pages/HTML messages with
mailto:xx@xx.com?subject=Hi, I föünd thät rëälly cööl site.

Let me try.
Regardless of what that outputs, how would you know which charset to decode to? The info isn't there once you reach Thunderbird.
OK, here's a message with windows-1252. We get called with:

=== mailto:xx@xx.com?subject=Hi, I fAA¼nd thAt rA«Ally cAAl site. UTF-8

using this debug:
printf("=== %s %s\n", PromiseFlatCString(aSpec).get(), aOriginCharset);

The funny characters come from the fact that the shell tries to print UTF-8 as single characters. But we see that it's really UTF-8 that's passed in, two bytes for every äöü.

I don't know where in Gecko the conversion happens, but I can imagine that historically other charsets got passed in.
Updated removal patch.
(In reply to Jorg K (GMT+2) from comment #25)
> === mailto:xx@xx.com?subject=Hi, I fAA¼nd thAt rA«Ally cAAl site. UTF-8
Oops, BMO ate some of this:

=== mailto:xx@xx.com?subject=Hi, I fA_A¼nd thA_t rA«A_lly cA_A_l site. UTF-8
I replaced the eaten characters with _ now.
(In reply to Magnus Melin from comment #24)
> Regardless of what that outputs, how would you know which charset to decode
> to? The info isn't there once you reach Thunderbird.
On the issue of triggering this from FF or another browser: I can't get Windows to start my local build for that right now, it would mean messing around with the Windows registry.

That said, ConvertURISpecToUTF8() also calls NS_UnescapeURL() So perhaps that is/was required.
OK, this example shows that you cannot remove the code.

If the URL is:
mailto:xx@xx.com?subject=%C3%B6
you get passed:
=== mailto:xx@xx.com?subject=%C3%B6 UTF-8

And you need to un-escape that.
Attachment #9006472 - Attachment is obsolete: true
Attachment #9006473 - Attachment description: 1488356-follow-up-ignore-charset.patch → 1488356-follow-up-ignore-charset.patch - MUST NOT be used.
Attachment #9006473 - Attachment is obsolete: true
Well, I tried to prove the point bug even if we don't unescape the mailto: URL, get get a correct subject for:
  mailto:xx@xx.com?subject=%C3%B6
So some part of the system does the unescaping later on anyhow. For there record, I wanted to add this comment:

+  // No cases have been observed where Gecko would actually pass a charset different
+  // to UTF-8, even if the original message is encoded in - say - windows-1252.
+  // However, the call to ConvertURISpecToUTF8() is needed to unescape mailto: URLs
+  // like for exmaple: mailto:example@example.com?subject=%C3%B6.

but it's not true :-(

So move further investigation to another bug?
Attachment #9006473 - Attachment description: 1488356-follow-up-ignore-charset.patch - MUST NOT be used. → 1488356-follow-up-ignore-charset.patch
Attachment #9006473 - Attachment is obsolete: false
(In reply to Henri Sivonen (:hsivonen) from comment #19)
> AFAIK, a non-UTF-8 charset would be passed if the URL object is created
> because the URL string exists in a link in a non-UTF-8 HTML page.

I tried all I could to get NewURI() called with windows-1252, but no success. My last attempt was running this in the error console:
openContentTab("http://www.jorgk.com/misc/windows-1252.html", "tab");
openContentTab("http://www.jorgk.com/misc/windows-1252x.html", "tab");

Those are windows-1252 pages with links and still when hovered I see escaped UTF-8 which is also passed in to the function.

So I haven't found a case what would make the special charset treatment necessary.

Let's ask some more people: Question: Can the charset parameter of the call to NewURI() be anything but UTF-8 when this is called from the core engine on a page that has a mailto: link?

I was going to ask Boris as well, but he's on PTO.
Flags: needinfo?(VYV03354)
Whatever the former reason was present, now we should always treat the charset as UTF-8 for mailto URLs:
https://url.spec.whatwg.org/#query-state
mailto is not a special URL:
https://url.spec.whatwg.org/#special-scheme
Flags: needinfo?(VYV03354)
Attachment #9006473 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9006182 [details] [diff] [review]
1488356-nsUTF8ConverterService.patch

Forget the review since we'll be removing it all again.
Attachment #9006182 - Flags: review?(mkmelin+mozilla)
Attachment #9006182 - Flags: feedback?(acelists)
Comment on attachment 9006473 [details] [diff] [review]
1488356-follow-up-ignore-charset.patch

Review of attachment 9006473 [details] [diff] [review]:
-----------------------------------------------------------------

Great! r=mkmelin
Attachment #9006473 - Flags: review?(mkmelin+mozilla) → review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/d1bb256c78f7
Follow-up: nsSmtpService::NewURI() (for mailto: URLs) can ignore the charset passed in. r=mkmelin
https://hg.mozilla.org/releases/comm-beta/rev/4ad95a32d3c082aaab01a79f9d8dc5f6d152bb5a
for TB 63 beta, so it joins the other changesets already on TB 63.
Type: enhancement → task
You need to log in before you can comment on or make changes to this bug.