Closed Bug 155569 Opened 18 years ago Closed 18 years ago

null filename (+ extension) used for downloading files with non-ASCII names

Categories

(Core Graveyard :: File Handling, defect, P1, major)

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.0.1

People

(Reporter: yoar_sux, Assigned: nhottanscp)

References

()

Details

(4 keywords, Whiteboard: [adt2 RTM] [ETA 08/21] [Post MachV])

Attachments

(2 files, 4 obsolete files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.1a) Gecko/20020610
BuildID:    20020610

Mozilla invents a msdos-like filename (8+3) and .exe extension   when trying to
download a file with non-english characters like ñ or á é etc

Reproducible: Always
Steps to Reproduce:
1. Try to download a file with non-english characteres (for example:
mañana_fria.mp3)

Actual Results:  Mozilla suggest a bad msdos-like filename like xangkbqh.exe,
SFensxGx.exe, ... always a different filename

Expected Results:  save file as his original filename =)
Please list exact steps to reproduce, and point to the exact link in question...
Are you saving by clicking on the link?  Or using "Save link as"?  Or what?

this sounds like helper app random strings.
Yes, it is.  We use those when we get no other filename of any sort.  Now the
question is why we got no other filename...
This does not seem to occur with build 2002061104 on Windows 2000.

Instead, mozilla returns a hex encoded filename.

I named an image file: δζηθικκ.jpg (for kicks) and put it here:

http://www.unc.edu/~jwatt/mozilla/155569/

When you right click on it and choose "Save link target as", you get the
following filename:

%84%91%87%8a%82%88%88.jpg
(which is its correct hex-encoded filename)
if that's all you have, then this is bug 154214 (which itself may be a dupe).
i didn't resolve it because your original description seemed to imply that the 
name was totally random.
Exact steps:
Browse to http://www.iespana.es/yoar/
Click any link (Save as or simple click)
Mozilla offers a random 8+3 filename based with .exe as extension.
OK.  "Save link as" works fine, in fact.  But clicking directly on the link does
not.  Investigating.
OK.  The issue is the following code in nsExternalHelperAppHandler.cpp:

    nsCAutoString leafName; // may be shortened by NS_UnescapeURL
    url->GetFileName(leafName);
    if (!leafName.IsEmpty())
    {
      NS_UnescapeURL(leafName);
      mSuggestedFileName = NS_ConvertUTF8toUCS2(leafName); // XXX leafName may
not be UTF-8

In this case, |leafName| is in fact not UTF-8, so NS_ConvertUTF8toUCS2 just
returns an empty string and there we are.

Possible options:

1) Implement NS_Unescape* for UCS2 data so that we can convert to UCS2 _before_
   unescaping.  This is bug 110943 (note in particular bug 110943 comment 6).
2) Store mSuggestedFilename in the native charset instead of in UCS2 (doable;
   since the only possible sources of this string are the URL and the
   Content-Disposition header, which must be ASCII).  This runs into trouble
   when the leafname in the URL is not a valid string in the native charset (eg
   this same URL on a system on which UTF8 is native).
3) Stop unescaping the URL (nice and simple)
4) ??? other cool intl stuff we can do ???

ccing some people who may have a clue as to what the right thing to do is
here... The problem is that the URI can encode arbitrary data as
percent-escapes; data that need not map usefully to anything resembling a
filename in any useful encoding....

Reporter, could you leave that site up for a while so possible solutions can be
tested against it?
Assignee: law → bzbarsky
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Priority: -- → P2
Hardware: PC → All
Target Milestone: --- → mozilla1.2beta
"OK.  "Save link as" works fine, in fact.  But clicking directly on the link does
not.  Investigating."

Save as doesn't work for me :(
bug happens in my computer when saving link from Save As or clicking directly :(


"Reporter, could you leave that site up for a while so possible solutions can be
tested against it?"

Yes, site will be up for testing (until the bug get fixed :-P)
*** Bug 161174 has been marked as a duplicate of this bug. ***
copy keywords from bug 161174
Keywords: intl, nsbeta1
>1) Implement NS_Unescape* for UCS2 data so that we can convert to UCS2 _before_
>   unescaping.  This is bug 110943 (note in particular bug 110943 comment 6).
In order to convert to UCS2, it needs charset conversion. So, unescape has to be
done first before the charset conversion.

After unescaping, we can try UTF-8, if that fails then try local system file
charset. I did a similar thing in mail code (nsSmtpService.cpp, rev1.112).
Yep.  That looks like exactly what's needed here.

We should just move that code out to live somewhere in a library where anyone
can call it...

I'm probably going to have no time to get to this for a good long while...
Keywords: helpwanted
Priority: P2 → P3
Reassign to nhotta.
Assignee: bzbarsky → nhotta
Blocks: 157673
Added topembed keyword.
Keywords: topembed
Keywords: nsbeta1nsbeta1+
Whiteboard: [adt2] [ETA Needed]
nsExternalAppHandler objects are very short-lived.  SetUpTempFile is called
exactly once in the life of these objects.  So there's no reason to add new
members; all the code could be inline.  If we wanted, would could move the
InitializeCharset stuff out to the nsExternalHelperAppService and pass the
relevant data to the constructor of the nsExternalAppHandler object...

> +          mSuggestedFileName.Assign(unescapedURI.get());

Lose the .get(); the code is more readable, the Assign() is faster, and things
are better if that's done.  ;)

Over in uconv:

+	wstring UnEscapeURI(in string aCharset, in string aURI);

Is there a reason not to use nsAString and nsAUTF8String instead of wstring and
string here?  I know some of the other interfaces in uconv do that, but... 
Module owner's call, I suppose.

Please use javadoc syntax for the comment.  So

/** (two stars -- important!)
 * Unescapes the given URI fragment .....
 * @param aCharset the charset to convert from
 * @param aURI the URI (or URI fragment) to unescape
 * @return Unescaped aURI converted to unicode
 * @throws [whatever error is thrown when there is no decoder for aCharset]
 */

I think "aURI" is misleading here.. aURIFragment may be a better name (since
we're using this for leafName and all).
 
I'd think we would want to name the function "unEscapeURI", (lowercase 'n') but
that's up to the module owner too, I guess.

In UnEscapeURI:

> +  NS_ENSURE_ARG_POINTER(_retval);

Just assert that it's not null.  Anyone passing in a null _retval deserves to
crash for violating COM rules.  There's no way to do that from JS, so it's ok. 
And we don't want to spend time doing that check in opt builds, imo.

> +    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));

Shouldn't that be NS_ConvertUTF8toUCS2?  I get the impression (from nsIURI.idl)
that the fileName (and other such) are UTF8, with some stuff possibly escaped...
but I could be wrong.

> +  rv = convertURItoUnicode(aCharset, unescapedSpec, PR_FALSE, _retval);

Shouldn't that be PR_TRUE?

In convertURItoUnicode:

> +  if (IsASCII(aURI)) {
> +    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));
> +    NS_ENSURE_TRUE(*_retval, NS_ERROR_OUT_OF_MEMORY);
> +  }

Couldn't we return NS_OK from inside this if() after the NS_ENSURE_TRUE?  Then
we do not need an "else" and can un-indent some code....

> +      if (aURI.Equals(NS_ConvertUCS2toUTF8(ustr))) {
> +        *_retval = ToNewUnicode(ustr);
> +        NS_ENSURE_TRUE(*_retval, NS_ERROR_OUT_OF_MEMORY);
> +      }

Same here...

> +      else
> +        nsMemory::Free(ustr);

shouldn't we set *_retval to null here?  Not a big deal either way, since we're
returning an error....
> Is there a reason not to use nsAString and nsAUTF8String instead of wstring and
> string here?  I know some of the other interfaces in uconv do that, but... 
> Module owner's call, I suppose.
I can do that. Is there a reason we want to use nsAString and nsAUTF8String
instead of wstring and string? Are they more efficient?

> +    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));
This line is after the IsAscii check, so it is Ascii only.

> +  rv = convertURItoUnicode(aCharset, unescapedSpec, PR_FALSE, _retval);
I plan to add an unescape funtion for UI display purpose only later. For that,
PR_TRUE will be used. But the one in the patch does not assume the input URI as
UTF-8 in order to avoid possible data loss.


Status: NEW → ASSIGNED
>> +    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));
>This line is after the IsAscii check, so it is Ascii only.

It's in two places.  The one I think is wrong is:

+  rv = convertURItoUnicode(aCharset, unescapedSpec, PR_FALSE, _retval);
+
+  if (NS_FAILED(rv) && 
+    rv != NS_ERROR_OUT_OF_MEMORY) {
+    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));

> I plan to add an unescape funtion for UI display purpose only later.

Well.. the one place this is being used right now (nsExternalHelperAppHandler)
_is_ for UI display purposes.  It's a default filename that he filepicker will
suggest; the user can then change it as needed.

As for nsAString and nsAUTF8String, there are a few reasons:

1)  Using nsAUTF8String _would_ be more efficient here than using "string".
    for the URI.  It would also be more flexible for the caller, since it would
    allow passing in a non-flat uri string.
2)  Using nsAUTF8String makes it clear that the aURI param _must_ be UTF8.
    "string" does not make that clear 
3)  Doing this hides the detail that you need flat strings for some of these (eg
    the charset).  You can use PromiseFlatCString() to get a flat string from
    the charset...
4)  More consistent with Mozilla coding style in general.

A drawback is that in this particular case it actually is more efficient to use
a wstring for the return value than to use an nsAString.  So I'd be fine with
keeping that as-is.... but the URI should really be nsAUTF8String, as it is in
all the necko IDL files.
AString and AUTF8String are more efficient because they maintain the length of
the string.  newly written code should use these instead of string and wstring
whenever possible.

> +    *_retval = ToNewUnicode(NS_ConvertASCIItoUCS2(aURI));

how about this instead:

       *_retval = ToNewUnicode(nsDependentCString(aURI));

that way you avoid an extra string copy.
Okay, I will change to use AString and AUTF8String.
we have some testcases at
ftp://10.169.114.243/
It's working on Communicator 4.x, the original Chinese filename is shown on File
Save dialog.
On Netscape 6.1, the Chinese filename is shown as escaped string.
On Netscape 6.2, the filename is shown as a junk ascii string.
On the latest branch build, the filename is not shown at all.
That page is not likely to get fixed by this patch... the patch just uses the
platform charset for the filename.  We should eventually use the document
charset, but I'm not quite sure how to get that over to this code yet...
Please ignore the previous one.
The correct URL is:
http://202.107.236.189/male/zhj/ÖÜ»ª½¡-ÕæÐÄÓ¢ÐÛ.mp3
http://61.133.3.244/update/mp3/windust/ÑîÀ¤-ÎÞËùν.mp3

Please view this page in Simplified Chinese (GB2312)
QA contact to myself.
QA Contact: sairuh → ji
About the comment #23, that is right. The patch works when the file system
charset of the server and the client matches. Using the document charset is better.
Let me focus on the current approach which covers the common cases then we can
do the document charset approach later separately.
Agreed. That should be a separate bug (probably assigned to me).
Upping the impact of this one to [adt1 RTM], as this is a visible issue, that is
functionaly different than tha asccii user experince.
Blocks: 143047
Severity: normal → critical
Keywords: regression
Priority: P3 → P1
Whiteboard: [adt2] [ETA Needed] → [adt1 RTM] [ETA 08/09]
Target Milestone: mozilla1.2beta → mozilla1.0.1
Keywords: 4xp
Attachment #94413 - Attachment is obsolete: true
bzbarsky, please review the patch, thanks.
>Actual Results:  Mozilla suggest a bad msdos-like filename like xangkbqh.exe,
>SFensxGx.exe, ... always a different filename
>
>Expected Results:  save file as his original filename =)

I try this and what I see is
actual result:
1. a file save dialog box show up. and it have a suggested file name. Thae
"suggest file name" looks like escaped url
2. the user can change the file name in the file save dialogbox. If he decide to
do that, then he know what is file name he pick. if the user leave the file name
along, then he can write down that file name. I don't think there are any data
loss here since user are prompt about the file name and have a chance to change it. 
i agree that the data loss is not very severe (bumping down to ADT2, Severity:
major), but loss of the name of file is, data that is lost. shoudl the user save
the file without the file name they will get ".exe" or ".hjpg" (i.e. no file
name), if the name is ascii. this is a visibly different experience then one
gets when using ascii file name, or if one uses 4.x or MSIE for either ascii or
non-ascii characters.
Severity: critical → major
Keywords: dataloss
Whiteboard: [adt1 RTM] [ETA 08/09] → [adt2 RTM] [ETA 08/09]
nhotta-
why we assume the charset is kPlatformCharsetSel_FileName ?
should we assume this is the "default charset" of the browser (from reading the
pref) instead ?
or, can we call charset detector to find out what it is ?
Comment on attachment 94520 [details] [diff] [review]
Changed string type and address other comments.

>+   * @throws [no decoder for aCharset, the input cannot be converted]

I'd do

@throws NS_ERROR_UCONV_NOCONV when there is no decoder for aCharset

I thought about it some more, and returning nsAString from this method is
better than returning wstring... for reasons explained below.  My apologies for
not thinking this through enough the first time around...

>+    *_retval = ToNewUnicode(nsDependentCString(aURI.get()));

*_retval = ToNewUnicode(aURI);

>+  // try UTF-8 if the caller sets an option for IRI

What is an "IRI"?  It's called the same in the function decl... I assume that
should be "URI"?

>+  if (NS_SUCCEEDED(rv)) {
>+    ustr[dstLen] = (PRUnichar) '\0';
>+    *_retval = ustr;

If we're returning nsAString, at this poing we can do:

_retval = nsDependentString(ustr, dstLen);

This way no scanning for the '\0' needs to happen...

The rest looks great!
Frank, how good is the charset detector with really short strings?  We're
talking filenames, here, right?  On the order of 10-20 chars....

About comment #34, I think the common case is that the charset of server and
client are the same. The other issue is that there is other restriction and we
cannot really save a file other than the default system charset.
Anyway, that part can be done later separately (see comment #27).
FYI, in bug 118179, we are reading the pref.
>What is an "IRI"?  
That is "Internationalized Resource Identifiers" which is always UTF-8. 
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
I need to put more comment.

>_retval = nsDependentString(ustr, dstLen);

If we do that, how the memory is freed? Does nsAString do that automatically?
Um, no.  We'd need to free it ourselves after the assignment.....

jag, is that something we should be avoiding?  I know that assigning out of an
nsDependentString copies at the moment, but is that something we can depend on?
OK.  I talked to scc, and we should just leave the code as it is now for that
nsAString/wstring stuff.  That is, returning a wstring.

sr=bzbarsky if you make those comment changes (like explaining what an IRI is
and what setting that argument true implies).
Attached patch Changed comments. (obsolete) — Splinter Review
Attachment #94520 - Attachment is obsolete: true
+  wstring unEscapeURIForUI(in ACString aCharset, in AUTF8String aURIFragment);

ok, so i had a talk with bz about this, and we concluded that returning AString
is actually the right thing here.

nhotta: sorry to make you rev this patch so many times, but it would be really
great to migrate toward a world that just uses abstract string classes instead
of raw character arrays (for all the reasons previously enumerated).  so, if you
don't want to make the changes back to AString now, i'd totally understand, but
one day i think we will want to make that change.  if you are so inclined to do
it now, that'd be great!  thx :-)
So can we use nsDependentString or assign the unicode string to the rerurn value
to do a copy (and free the original)?
Yes.  We can do that.
You mean nsDependentString?
Yep.  nsDependentString.
I am not sure what the benefit of using nsDependentString if I need to free the
PRUnichar* string.
What is different if I just assign PRUnichar* to the return value?

I mean,

_retval = nsDependentString(ustr, dstLen);
nsMemory::Free(ustr);

v.s.

_reval.Assign(ustr, dstLen);
nsMemory::Free(ustr);

Um...  I just forgot that Assign has a version on nsAString that takes a length.
 ;)  By all means, use that instead of the dependent string stuff.
BTW, nsIURI in nsExternalHelperAppService.cpp has originCharset set to the 
document charset, so probably we can use it instead of the system's default.
Oh.  Wow. I never noticed that attribute...  Yes, that should do exactly what we
want.  Excellent.
I suggest add a another function for unescape nsAString,then it will be used in 
bug110943.I think the patch should like this
+NS_IMETHODIMP  nsTextToSubURI::UnEscapeURIForUI(const nsACString & aCharset, 
+                                                const nsAString &aURIFragment, 
+                                                PRUnichar **_retval)
+{
+  *_retval = nsnull;
+  nsresult rv = NS_OK;
+  NS_ConvertUCS2toUTF8 cstr(aURIFragment)
+  nsCAutoString unescapedSpec(cstr.get());
+  NS_UnescapeURL(unescapedSpec);
+
+  rv = convertURItoUnicode(PromiseFlatCString(aCharset), unescapedSpec, 
PR_TRUE, _retval);
+
 
   return rv;
 }
About comment #52, let me do that separately in bug 110943.
For FTP listing (e.g. ftp://10.169.114.243/), the document charset is always
UTF-8 but the acutal file name is encoded in server's charset.
The current patch does not work for that case. But I think the change should be
done somewhere else (the document charset and the file name encoding charset
need to match).

bzbarsky/ftang, please review the last patch.
Comment on attachment 94648 [details] [diff] [review]
Changed to return AString for the unescape funtion, also changed to use originCharset instead of the platform charset.

r=ftang
Attachment #94648 - Flags: review+
> +  // note: there is no definate way to distinguish IRI and other URI encode with 
> +  // non UTF-8 charset and may cause dataloss
> +  // the option has to be used carefully (recommended to set to true for UI
purpose only)

"definite".  "distinguish between IRI and a URI encoded with a non-UTF-8 charset"
"Use this option carefully -- it may cause dataloss (...)"

Could we move this comment to the header insted of having it in the middle of
the function?

> +#include "nsIPlatformCharset.h"

This is no longer needed, is it?

Looks great, sr=bzbarsky with those changes, and thanks!
Attachment #94648 - Attachment is obsolete: true
Comment on attachment 94932 [details] [diff] [review]
Change to address bzbarsky's comment.

copy r/sr
Attachment #94932 - Flags: superreview+
Attachment #94932 - Flags: review+
Keywords: topembedtopembed+
The fix depends on bug 137182 (patch for nsWebShell.cpp only). 
If the current patch goes to the branch, the change of nsWebShell.cpp is also
needed.
Depends on: 137182
Keywords: helpwanted
checked in to the trunk
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Filed a seperate bug 162377 for ftp download.
Comment on attachment 94995 [details] [diff] [review]
patch for 1.0.1 (added nsWebShell.cpp from bug 137182, and makefile.win)

sr=bzbarsky
Attachment #94995 - Flags: superreview+
ji: can you pls verify this as fixed on the trunk? thanks!
Keywords: adt1.0.1+
Whiteboard: [adt2 RTM] [ETA 08/09] → [adt2 RTM] [ETA 08/14]
Verified as fixed.
Problem with ftp listing page is filed as bug 162377.
Problem with javaScript is filed as bug 162523.
Status: RESOLVED → VERIFIED
This is not a stopper for 1.0.1. Let's get it in the next release.
No longer blocks: 143047
Keywords: adt1.0.1+adt1.0.1-
Summary: Mozilla invents filename when trying to download a file with non-english characters. → Mozilla invents filename when trying to download a file with non-english characters
Whiteboard: [adt2 RTM] [ETA 08/14] → [adt2 RTM]
Updating summary (based upon testing results) from:
  Mozilla invents filename when trying to download a file with non-english
characters
to:
  null filename (+ extension) used for downloading files with non-ASCII names
Summary: Mozilla invents filename when trying to download a file with non-english characters → null filename (+ extension) used for downloading files with non-ASCII names
Let's schedule this to land on the 1.0.1 Tuesday afternoon, so that it is in the
08/21 Mozilla1.0.1 builds, and can be picked up by a major embedor, without
effecting MachV.
Blocks: 154896
Keywords: mozilla1.0.1
Whiteboard: [adt2 RTM] → [adt2 RTM] [ETA 08/21] [Post MachV]
Comment on attachment 94995 [details] [diff] [review]
patch for 1.0.1 (added nsWebShell.cpp from bug 137182, and makefile.win)

r=shanjian
Attachment #94995 - Flags: review+
Removing "-", to renominate this one for the branch after MachV.
Keywords: adt1.0.1-adt1.0.1
QA contact to Rui for the verification on the branch build. Thanks.
QA Contact: ji → ruixu
jaimejr, when should we land this into the m1.0 branch? how many approval we
need to got post machv?
adt1.0.1+ (on ADT's behalf) approval for checkin to the 1.0 branch, pending
Drivers' approval. pls check this in asap, the replace "Mozilla1.0.1+" with
"fixed1.0.1". [Using adt1.0.1 keywords as a proxy, since no edt1.0.1 keywords
were created for the 1.0.1 branch]. thanks! 
Keywords: adt1.0.1adt1.0.1+, approval
Comment on attachment 94995 [details] [diff] [review]
patch for 1.0.1 (added nsWebShell.cpp from bug 137182, and makefile.win)

a=chofmann for 1.0.2
Attachment #94995 - Flags: approval+
patch checked into 1.02 branch.
mark it as fixed1.0.2
Keywords: fixed1.0.2
No longer depends on: 110943
verified as fixed with 2002-08-28-08-1.0 build.
Depends on: 180372
No longer blocks: 157673
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.