Closed Bug 227268 Opened 21 years ago Closed 12 years ago

Subject line in File | Send Link... mail uses strange characters instead of non-ASCII characters (probably UTF-8)

Categories

(Core Graveyard :: File Handling, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla13

People

(Reporter: jesper.hertel.arbejde, Assigned: smontagu)

References

()

Details

(Keywords: intl, relnote)

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007

When I use File | Send Link... on a page whose title contains non-ASCII
characters, e.g. the Danish characters æøå, the subject line of the resulting
e-mail in Outlook 2000 shows two strange characters in each place where one
of the Danish characters should have been.

When I use File | Send Link... on the page
http://www.pcworld.dk/default.asp?Mode=2&ArticleID=4762 , this mailto URL is fired: 
mailto:?body=http%3A//www.pcworld.dk/default.asp%3FMode%3D2%26ArticleID%3D4762&subject=PC%20World%20-%20Digital%20video%20p%C3%A5%2065%20gram

(I collected this mailto by modifying the registry key
reg:\HKEY_CLASSES_ROOT\mailto\shell\open\command\(Default) to point to a small
Python script that collected the arguments sent.)
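A collector script of the kind described above could look roughly like the following. This is only a sketch: the original script was not posted, and the function name and the log file location are assumptions chosen for illustration.

```python
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_arguments(argv, log_path):
    """Append the received command-line arguments to a log file, one per line."""
    stamp = datetime.now(timezone.utc).isoformat()
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(f"--- {stamp}\n")
        for index, argument in enumerate(argv):
            # repr() keeps percent-escapes and unusual characters visible.
            log.write(f"argv[{index}] = {argument!r}\n")


if __name__ == "__main__":
    # Hypothetical log location in the temp directory; any writable path works.
    log_arguments(sys.argv[1:], Path(tempfile.gettempdir()) / "mailto_args.log")
```

The mailto handler command in the registry would then point at the Python interpreter and this script, with "%1" passed as the argument.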

The title of the page is "PC World - Digital video på 65 gram", but the subject
in the resulting e-mail in Outlook 2000 is "PC World - Digital video på 65
gram", which the given mailto URL also reflects.

I have found out that the two characters "Ã¥" are exactly what the UTF-8
encoding of "å" (the bytes C3 A5) looks like when it is read as code page 1252.
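The effect can be reproduced outside the browser. The snippet below is a small illustration in Python, taking the "p%C3%A5" fragment from the mailto URL in this report and decoding the same bytes two ways; the variable names are only for illustration.

```python
from urllib.parse import unquote_to_bytes

escaped = "p%C3%A5"  # fragment of the subject in the mailto URL above

raw = unquote_to_bytes(escaped)    # the bytes that are actually passed along
as_utf8 = raw.decode("utf-8")      # how the sender meant them
as_cp1252 = raw.decode("cp1252")   # how a reader assuming code page 1252 sees them

print(raw, as_utf8, as_cp1252)
```

The first decoding gives the intended "på", the second gives the "pÃ¥" seen in the Outlook subject line.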

Maybe Mozilla should be converting to code page 1252 in the Windows case before
constructing the mailto url?

Reproducible: Always

Steps to Reproduce:
1. Go to the given URL http://www.pcworld.dk/default.asp?Mode=2&ArticleID=4762 .
2. Choose File | Send Link...
3. Look at the subject line in the resulting mail.

Actual Results:  
The subject is "PC World - Digital video på 65 gram".

Expected Results:  
The subject should have been "PC World - Digital video på 65 gram".

I use Windows 2000 SP4, Mozilla 1.5, Outlook 2000 SP-3.
I must mention that I have patched my Mozilla 1.5 with (exactly) the patch
mentioned in Bug 217328 comment 14
(http://bugzilla.mozilla.org/show_bug.cgi?id=217328#c14). But this was a
mailto:body= issue and does not affect this problem. The current subject problem
has also been there all the time, including before I made the patch.

Otherwise I have changed nothing in my installation.
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.6b) Gecko/20031129

I'm not using Outlook but Mozilla MailNews as my default mail client.
Sending to an account I can access only via webmail, I got this in the header:
Subject: =?iso-8859-1?Q?Digital_video_p=E5_65_gram?=
and it was displayed as seen on the website.

Can you retest with Mozilla 1.6b, when it comes?
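For reference, the Subject header quoted above is an RFC 2047 encoded word (charset iso-8859-1, Q encoding). A short Python illustration of how a receiving client would turn it back into text, using the standard library's email.header module:

```python
from email.header import decode_header, make_header

raw_subject = "=?iso-8859-1?Q?Digital_video_p=E5_65_gram?="

# decode_header() splits the value into (bytes, charset) pairs.
parts = decode_header(raw_subject)

# make_header() joins the pairs back into a readable string.
subject = str(make_header(parts))

print(parts)
print(subject)
```

Here the charset travels with the header, so the receiver does not have to guess it.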
Comment #2 is invalid; I didn't test what the reporter was claiming.

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.6b) Gecko/20031202

I used File -> Send Link... to send the page to my own account using Mozilla Mail.
Mozilla Mail opened and showed the subject and body as below,
the same as what I received after sending.

So this works internally, but I can't test whether it also works with
an external mail client such as Outlook.
This should be tested by someone using Outlook or another mail client.

Sent/received:

Subject: PC World - Digital video på 65 gram

<http://www.pcworld.dk/default.asp?Mode=2&ArticleID=4762>


Component: XPApps, as in Bug 217328 ?
Component: Browser-General → XP Apps
Using either

Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.7a) Gecko/20040121
Firebird/0.8.0+ (scragz)

or Mozilla 1.6 release, I get the following in QM when I try to "send link"
(Moz) or "send page" (FB):

http%3A%2F%2Fbugzilla.mozilla.org%2Fshow_bug.cgi%3Fid%3D227268&subject=Bug%20227268%20-%20Subject%20line%20in%20File%20%7C%20Send%20Link...%20mail%20uses%20strange%20characters%20instead%20of%20non-ASCII%20characters%20(probably%20UTF-8)

With older versions of Firebird the page title was also appended (after the word
"subject"), but at least the URL part didn't convert the colon and slashes to
their hex equivalents, which makes the link completely useless.  This is a major
regression.

Flags: blocking1.7a?
Flags: blocking1.4.2?
I just installed Mozilla 1.6 ("Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
rv:1.6) Gecko/20040113 MultiZilla/1.6.0.0c"), and the problem is still the same.
Not a blocker.

Copying someone who might have an idea.
Flags: blocking1.7a?
Flags: blocking1.7a-
Flags: blocking1.4.2?
Flags: blocking1.4.2-
This is actually much easier to see with Firefox since the mail doesn't get in
the way.

I set up my mailto for Lotus Notes and sure enough my subject is wrong.
Assignee: general → file-handling
Status: UNCONFIRMED → NEW
Component: XP Apps → File Handling
Ever confirmed: true
QA Contact: general → ian
It's a URI.  It contains non-ascii chars.  There is no source document
information.  So it gets encoded as UTF-8 (which is the standard for non-ASCII
chars in URIs in any case); that's what GetAsciiSpec does on nsIURI objects.
<mkaply> biesi: The native charset would be the right call
<biesi> mkaply: yeah, it means unescaping the spec, converting to native
charset, escaping again (probably), and calling shellexecute on that, I suppose
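A rough sketch of the round trip described in the IRC exchange above (unescape, convert to the native charset, escape again), written in Python rather than as Mozilla code. The function name and the cp1252 default are assumptions for illustration. Only the non-ASCII escapes are touched, so escaped delimiters such as %3F and %26 in the body stay as they are.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# One or more consecutive %XX escapes whose byte value is 0x80 or above.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def reescape_non_ascii(spec, native_charset="cp1252"):
    """Re-escape UTF-8 percent-escapes in spec using native_charset instead.

    Raises UnicodeDecodeError if the escapes are not valid UTF-8 and
    UnicodeEncodeError if a character does not exist in native_charset.
    """
    def convert(match):
        text = unquote_to_bytes(match.group(0)).decode("utf-8")
        return quote(text.encode(native_charset), safe="")

    return _HIGH_ESCAPES.sub(convert, spec)


print(reescape_non_ascii("subject=Digital%20video%20p%C3%A5%2065%20gram"))
```

For the subject in this bug, "%C3%A5" becomes "%E5", which is what a mail client reading the URL as code page 1252 expects.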
Or you could GetSpec instead of GetAsciiSpec...
     * Some characters may be escaped.
     */
    attribute AUTF8String spec;

doesn't gain you anything.
What mozilla currently does is 'correct'. External mail programs should be
'fixed' to understand IRI (Internationalized Resource Identifier : URIs in
UTF-8) [1].  That is, it's not Mozilla but MS OE, Lotus Notes, etc that should
be fixed. Of course, there's little, if any, we can do about them. How does
Thunderbird work when used as an external mail program for firefox? At least, we
have to get TB to do the 'right thing'. 

We might convert to the most widely used _legacy_ character encoding for a given
locale, but that doesn't seem to be the 'right' thing. What should we do when
there are characters not representable in the encoding? Using IRIs for those
cases _only_ seems to be worse than what we do now. What if the default
character encoding for outgoing emails in MS OE is different from the legacy
encoding of the default system locale? 

A work-around for MS OE users is to change the default encoding (for outgoing
emails) to UTF-8. Nowadays, except for __stupid__ web mail services (such as
hotmail and yahoo mail),  mail clients can handle messages in UTF-8
'transparently'. There should be a similar option in Lotus Notes. We can add
this to the release notes. 

[1] http://www.w3.org/International/O-URL-and-ident.html
Keywords: intl, relnote
We should probably pay attention to RFC 2368 (mailto URL scheme):

http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2368.txt

Let me quote from this RFC a bit:

"8-bit characters in mailto URLs are forbidden. MIME encoded 
words (as defined in [RFC2047]) are permitted in header values, 
but not for any part of a "body" hname."

If the data provided by the original bug reporter is correct,
Mozilla does not use MIME in the mailto "subject" URL scheme that
is sent to the mail program when this "File > Send Link"
menu is used. If we are using mailto protocol to populate
the subject header with the title of the document, then 
we can at least indicate what the charset of the MIME encoded
string is. 

Assuming that this is the right way to go, then we have one of
two choices:

1. Use the encoding of the document in which the title resides
and use that as the MIME charset. In this Danish example, that
would be ISO-8859-1. As long as the charset is indicated in the MIME'd
URL, any mailer capable of interpreting MIME headers should be
able to handle it.

2. We can uniformly change all non-ASCII titles into UTF-8 MIME
encoded mailto url. 
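The two choices above can be sketched as follows. This is a Python illustration only, not a proposal for the actual Mozilla code path; the helper name is hypothetical, and percent-escaping the whole encoded word is one possible way to keep the URL free of characters that are special in a query string.

```python
from email.header import Header, decode_header, make_header
from urllib.parse import quote


def mailto_with_encoded_subject(title, charset):
    """Build a mailto URL whose subject is an RFC 2047 encoded word."""
    encoded_word = Header(title, charset).encode()
    # Escape everything, including "=" and "?", so the result is pure ASCII.
    return "mailto:?subject=" + quote(encoded_word, safe="")


title = "PC World - Digital video på 65 gram"
option_1 = mailto_with_encoded_subject(title, "iso-8859-1")  # document charset
option_2 = mailto_with_encoded_subject(title, "utf-8")       # uniform UTF-8

print(option_1)
print(option_2)
```

Either way the charset is stated inside the URL itself. Whether external mail clients actually decode encoded words found in a mailto subject is a separate question that this sketch does not answer.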


We earlier had this same discussion in:

http://bugzilla.mozilla.org/show_bug.cgi?id=12851

where I did not get my opinion to prevail. MIME-decode was checked
in but not MIME-encode. You might want to review the rationale
discussed there and see if that is still valid. 
Thanks for the note on RFC 2368.
Naoki had the following rationale for not supporting it.

> In fact, I am not sure MIME encode in mailto URL is practically useful.
> It is not supported by IE and 4.x. Using UTF-8 for URL is simpler than including
> MIME encoded words inside URL.
To support RFC 2368, we need to move some mailnews code out of mailnews into
necko (netwerk/mime: see bug 162765 for a similar change). It's doable, but
before actually doing it, we have to make sure that what you wrote below is
true. Being able to decode RFC 2047-encoded words is one thing; being able to
decode RFC 2047-encoded words in URLs before putting them in the 'Subject' header is
another.

> As long as the charset is indicated in the MIME'd
> URL, any mailer capable of interpreting MIME headers should be
> able to handle it.

 
Protocol handlers can actually be used to invoke any native applications.

These native applications all expect native character sets.

I think it is a bit naive to believe that we should change every mail client
rather than fix our own application.

However, I will point out that this doesn't work on IE either :)
> However, I will point out that this doesn't work on IE either

It works on my IE 6.0.2800.1106 with Outlook 2000 SP3 (9.0.0.6627) on Windows
2000 5.00.2195: The subject of the mail is right when I use File | Send | Link
(or whatever the English translation is - my IE speaks Danish). 

But -- it doesn't use the mailto: protocol. It doesn't matter what I set the
registry key HKEY_CLASSES_ROOT\mailto\shell\open\command\(default) to. IE must
speak directly with Outlook in some way. 
The concept of a 'native' charset is not so clear any more on modern Unicode-based
OSes like Windows 2k/XP, Mac OS X and BeOS.  Either we have to leave this alone or
we have to make Mozilla compliant with RFC 2368.
I was testing IE to Notes and it fails there using mailto.
If we do what's suggested in comment #9, characters that have never been a part
of any legacy code page can't be used at all. With what we have now, UTF-8-aware
mail clients can be configured to work for any characters. Obviously, we cannot
fix others' bugs (we can report them though), but we shouldn't _break_ (it's not a
fix) our code to work around others' bugs. Besides, as time goes on, IRI will be
supported by more and more programs.
I also have to point out that any characters outside the repertoire of the
current default system locale can't be used either if we use the so-called native
charset.  For instance, Chinese can't be used on Windows with the default locale
set to French.
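The limitation is easy to demonstrate. The small Python check below uses cp1252 as a stand-in for the code page of a French Windows system; the helper name is just for illustration.

```python
def representable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


print(representable("éèàç", "cp1252"))   # French accents fit in cp1252
print(representable("中文", "cp1252"))    # Chinese does not
print(representable("中文", "utf-8"))     # UTF-8 covers both
```

Any conversion to a legacy code page has to decide what to do in the second case.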

beos native charset is utf8, for windows we can use utf16 apis. don't know about
macos.
Well, I'm aware that BeOS uses UTF-8 throughout its APIs and file system. Mac OS
X uses UTF-8 (in NFD) on its file system and in some of its APIs (especially
POSIX-related ones). However, most of the Cocoa APIs are based on UTF-16 (afaik).
So, depending on how you look at this, the native charset on Mac OS X can be
either UTF-8 or UTF-16. I'd say it's UTF-8 (assuming what the file system uses is
what's closest to the 'native charset'). However, that doesn't help if mail
clients on Mac OS X don't understand IRI. For them, the 'native' charset could
be the most widely used legacy character encoding for a given locale.

More or less the same is true of Windows (2k/XP), except that NTFS uses UTF-16, so
our 'operational definition' (for the sake of discussion here [1]) on Win
2k/XP has to use either UTF-8 or what we get from GetACP() (or equivalent),
which I think is what mkaply meant by 'native' charset (if I'm not mistaken).
GetACP() returns cp1252, cp949, cp932, cp1251, etc. That doesn't work for
the cases I mentioned earlier.



[1] because using UTF-16 for emails is out of the question.
what prevents us from passing an utf16 url to ShellExecuteW?
Still happens in Firefox 1.0.7 (Win32) & Outlook 2003.
*** Bug 314477 has been marked as a duplicate of this bug. ***
Still happens in Firefox 1.5 & Outlook 2003.
(In reply to comment #24)
> what prevents us from passing an utf16 url to ShellExecuteW?

That's probably what we have to do, arguing that RFC 2368 is not so relevant for 'inter-process' communication. One 'minor' problem is that ShellExecuteW is not available on Win 9x/ME, but we can work around that.

Was there any progress on this bug?
Same as bug 412076. Bug fixed in Firefox 3.0.8. The site works fine on Vista.
Still happens in Firefox 3.0.8, Outlook 2003 SP3 and Windows 2000.
Still happens in Firefox 3.0.11, Outlook 2003 SP3 and Windows XP Professional SP3.
Assignee: file-handling → nobody
QA Contact: ian → file-handling
Not just one URL: every URL with non-ASCII (non-7-bit) characters generates a bad Subject line in the email.

In French, all of éèçàîù (and so on) generate messages with characters like this:

Subject line : éèàç - Recherche Google
From page : http://www.google.com/search?q=%C3%A9%C3%A8%C3%A0%C3%A7&btnG=Rechercher&meta=&aq=f&oq=

Still present in Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
I just tried this latest example in Firefox 3.5.5 and it appears to be working now. Has this been fixed, or is my mail client compensating in some way? I have Outlook 2007.
Tried using FF 3.5.5 under Linux

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.5) Gecko/20091107 Firefox/3.5.5 (Swiftfox) (.NET CLR 3.5.30729)

and Thunderbird version 2.0.0.23 (20090817)

and yes it seems to work.

That would mean the bug lies in the email client, I guess.
This might have been fixed, but it might also be a Windows-specific bug; in fact that is quite likely if the problem is in converting UTF-8 as if it was the native character set. I'll check next week when I have access to a Windows system.
Assignee: nobody → smontagu
The bug still occurs on Windows on trunk with Outlook Express set as default email client, but not with Thunderbird. The current patch for bug 411511 doesn't fix it, but I may be able to tweak it so that it does (I'm still not sure of the exact code path in question -- the reference in comment 10 is out-of-date).
Depends on: 411511
Attached patch Possible patch (obsolete) — Splinter Review
This is only a partial solution, at least on my system (Windows XP Professional with Outlook 2003 (11.83183.8221) SP3 and Outlook Express 6.00.2900.5512).

With both Outlook and Outlook Express it only works for Subjects that are expressible in the default Windows non-unicode character set, even though we call ShellExecuteW with UTF-16 arguments. 

The last 3 hunks of the patch are not strictly relevant, but fix a problem that I found in the WINCE code path while experimenting: 
 { sinfo.lpFile =  NS_ConvertUTF8toUTF16(urlSpec).get(); }
doesn't work because the converted string goes out of scope before it's used. This may be the cause of bug 518164, but I have no way to verify that.

Thoughts? Comments?
Attachment #418854 - Flags: superreview?(cbiesinger)
Attachment #418854 - Flags: review?(benjamin)
Blocks: 518164
Attachment #418854 - Flags: review?(benjamin) → review+
http://www.idg.se/2.1085/1.285190/ny-databas-raddningen-for-webben

Same result with Outlook 2003 and it works with IE 7 so it really looks like
this is the Firefox browser somehow.
Comment on attachment 418854 [details] [diff] [review]
Possible patch

I don't think this is the right fix. There's no guarantee that after unescaping you'll get UTF-8...
(In reply to comment #40)
> (From update of attachment 418854 [details] [diff] [review])
> I don't think this is the right fix. There's no guarantee that after unescaping
> you'll get UTF-8...

No? Did I misunderstand comment 8?
That's true for the specific case of Send Link, but LoadURI has other callers. For example, if a web page has a mailto: URI, in Firefox that will go through this function. And in those cases we do have the web page's charset information, and of course it's also possible that the web page already has specified escaped characters.

Maybe using GetSpec instead of GetAsciiSpec (and not unescaping stuff in addition to that) would actually be good enough, contrary to comment 12. It would probably fix this bug at least, though there's still cases it would get wrong.
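To illustrate the concern about unescaping: the same title can arrive escaped as UTF-8 or in the page's own charset, and only the first unescapes to valid UTF-8. The Python sketch below tries UTF-8 first and falls back to a page charset. The function name and the cp1252 fallback are assumptions for illustration, not what the patch does.

```python
from urllib.parse import unquote_to_bytes


def decode_escaped(value, page_charset="cp1252"):
    """Unescape value and return (text, charset actually used)."""
    raw = unquote_to_bytes(value)
    try:
        # Strict decoding rejects byte sequences that are not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the charset of the page the link came from.
        return raw.decode(page_charset, errors="replace"), page_charset


print(decode_escaped("p%C3%A5"))  # escaped as UTF-8
print(decode_escaped("p%E5"))     # escaped in a single-byte page charset
```

Both calls yield "på", but a heuristic like this can still guess wrong for byte sequences that happen to be valid in both encodings.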
(In reply to comment #42)
> Maybe using GetSpec instead of GetAsciiSpec (and not unescaping stuff in
> addition to that) would actually be good enough, contrary to comment 12. It
> would probably fix this bug at least, though there's still cases it would get
> wrong.

No, it turns out that it doesn't fix this bug.
Like the previous patch, this works well with Thunderbird, but only works with Outlook and Outlook Express if the page title is expressible in the native character set.
Attachment #418854 - Attachment is obsolete: true
Attachment #423795 - Flags: superreview?(cbiesinger)
Attachment #423795 - Flags: review?(benjamin)
Attachment #418854 - Flags: superreview?(cbiesinger)
Comment on attachment 423795 [details] [diff] [review]
More stable patch

Is there a way to write an automated test for this (without registering a system MIME handler, which doesn't sound wise)? If not, litmus?
Attachment #423795 - Flags: review?(benjamin) → review+
Attachment #423795 - Flags: superreview?(cbiesinger) → superreview+
https://hg.mozilla.org/mozilla-central/rev/68a94128a3b1
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard