Closed Bug 158006 Opened 22 years ago Closed 21 years ago

save as, save link target as, save image as : non-ASCII filename/title/URL handling

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.4alpha

People

(Reporter: jshin1987, Assigned: jshin1987)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 3 obsolete files)

This is to add an I18N twist to bug 115176, which
is by itself pretty complicated. 
When 'save as' is used to save the html doc
(with non-ASCII URL whether URL-encoded or not, non-ASCII title) in a
browser window  or 'save link target as' or 'save
image as' is invoked on docs and images
(with non-ASCII URL whether URL-encoded or not or 
non-ASCII title, non-ASCII
link name) linked from the page, Mozilla suggests a filename
which is not what users usually expect.

This is a complex problem. Some aspects of this issue
cannot be resolved by web clients alone and require
coordination with popular http servers. Implication is
that Mozilla.org may have to be involved with standardization
process to deal with this issue. 

I'll get back with more details soon.
Status: UNCONFIRMED → NEW
Ever confirmed: true
agree with Jungshik.
future for now
Status: NEW → ASSIGNED
Target Milestone: --- → Future
A real life example sent by KANG Jeong-hee :

http://www.mogaha.go.kr/webapp/viewNoticeBoardServlet?serial=214012&board_list=54&page=1&sort=reg_date&listnum=10&main=1

In the page, there's a link to  '국제포럼(020821).hwp' (followed by
'(61481 bytes)' above
the  row in a light purple( with 02-3703-4895) and right of a skyblue cell.

When 'save link target as' is selected, the filename suggested by Mozilla
URL-encodes non-ASCII characters. 

The page is in EUC-KR, but there's no guarantee that the local filesystem
uses the same encoding as the web page itself. 
As I wrote in my bug report, there are many factors to take into account.

BTW, in this particular case, 'Content-Disposition:' header is not
emitted by the server. Therefore, making Mozilla's C-D header handling
compliant to RFC 2231 wouldn't help. (see bug 155949 and bug 162765 for
RFC 2231 compliance)


 
Comment 2 is more relevant to bug 161242 than here.
In particular case mentioned in comment #2, fixing bug 155569 solved
the problem. The page in comment #2 is in EUC-KR(so are
the URL and the filename included in the URL in question) and I'm running
Mozilla in ko_KR.UTF-8 locale. The filename suggested by Mozilla
is the UTT-8(the local file system encoding) representation of
the file name represented in EUC-KR(the document encoding).
of course, this does not mean that we can close this bug because
there are more complicated cases as discussed in bug 115176. 
'save target as' and 'save image as' are handled by
http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/contentAreaUtils.js

Once I fix bug 162765 and expose necessary methods to Javascript,
it should be easy to fix this bug for most cases (some other related
bugs have been resolved over the last few months by Boris).  
On 'non-Unicode platforms', a mismatch between the
local file system encoding and what Mozilla put filenames in 
may  still remain, though. 

 
attachment 107217 [details] [diff] [review] to bug 162765 makes Mozilla work well 
for standard-compliant cases and some non-standard-compliant
cases as well(RFC 2231 has to be used, but some servers use
RFC 2047-encoding). Unlike its C++ counterpart (that makes
use of 'originCharset' of the current URI when untagged
raw 8bit chars are encountered), Javascript 
(for 'save target as') doesn't. I have to figure out how
to access 'originCharset' of the current URI with JS. 

BTW, Roy, can you assign this to me? 

sure. thanks
Assignee: yokoyama → jshin
Status: ASSIGNED → NEW
QA Contact: ruixu → ylong
Blocks: 162765
Sorry for spamming. I got the dependency the other way around.  
No longer blocks: 162765
Depends on: 162765
Attached patch a partial patch (obsolete) — Splinter Review
This patch has to be applied along with my patches for
bug 162765 and bug 191541. Still, it only works
for RFC 2231 and RFC 2047 encoded filename parameters
in C-D header. When raw 8bit characters are used as
filename paramter in C-D header, this patch doesn't work.
The key is to figure out how to get |originCharset| of
the referring document in contentAreaUtil.js. 

Last December, I tinkered with this problem for a while
and I believe I came up with a solution, but I couldn't
find any trace of it on my disk. (my computer underwent
a huge overhaul..).  Anyway, I'm uploading this patch 
so that others can take a look (there are a lot of
commented-out lines and debug output, which reflects the tentative nature
of the patch) and help me out. 

Getting originCharset in nsExternalHelperAppService.cpp 
was easy, but I couldn't quite figure out in JS code.

BTW, I always get  UTF-8 as |documentCharset|
in |getDocumentCharset| at the end of my patch 
regardless of the charset used in the refering document. 
For instance, http://jshin.net/moztest/download.html
 is explicitly marked as in EUC-KR, but when I try
to save one of files with raw 8bit chars in C-D
header, I get UTF-8 in JS code. 

I changed the locale under which Mozilla is launched
(my primary locale is ko_KR.UTF-8), but nothing
has changed. I tried ja_JP.EUC-JP, ko_KR.EUC-KR
and C. 

Any idea as to what's going on?
Adding Naoki to CC because he wrote 
|nsTextToSubURI| apparently because
he wanted to solve problems related to this
(and I used it a couple of places to solve
similar problems.)

Another example to show that getting
|originCharset| is important is 
as following:

http://www.mogaha.go.kr/webapp/bbs/pub/view.action?bid=200&serial=268306&sc_param=common&listStatusStr=rO0ABXNyABpjb20uZ2VuMTI4LmRhdGEuTGlzdFN0YXR1c7sGCWct5HSSAgAESQAHbWF4U2l6ZUkABm9mZnNldEwACmNvbmRpdGlvbnN0ABNMamF2YS91dGlsL0hhc2hNYXA7TAAEc29ydHQAEkxqYXZhL2xhbmcvU3RyaW5nO3hwAAAADwAAAABzcgARamF2YS51dGlsLkhhc2hNYXAFB9rBwxZg0QMAAkYACmxvYWRGYWN0b3JJAAl0aHJlc2hvbGR4cD9AAAAAAABLdwgAAABlAAAAAHh0AAZzZXJpYWw%3D

Sorry this URL is reaallly long. That document
is in EUC-KR and there's a jpg file linked
(look for '284943'). Now click on the image
linked and it will get loaded. Mozilla gets
the filename correctly in Korean and nicely
puts it on the titlebar. However, 'Save Image
as' is attempted, it comes up
with the suggested file name as if
'escaped URL' is in ISO-8859-1. Why?
Because |documentCharset| is assummed to be
in ISO-8859-1. 

It seems like there are two possible solutions.
One is to figure out a way to get charset
of the refering document (|originCharset|)
in nsContentAreaUtil.js and the other
is to set 'charset' (nominal) of image
documents to that of the referring document
instead of the default(?) ISO-8859-1. 

I just hit upon another idea, but that's beyond
the scope of this bug. On several places
in Mozilla, ISO-8859-1 is taken as the last
resort (or sometimes the default). This may have
to be subjected to the localization and/or
the configurability. Anyway, I'll discuss
this in another forum. 
 

I'm sorry it appears that my second solution
in my prev. comment would not work because
charset is obtained from this code 

+  else if (document.commandDispatcher.focusedWindow) {
+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
+    dump ("getDocumetnCharset: charset obtained from cmddispatch.fw :" +
charset + "\n");
+  }

rather than this code

+      charset = aDocument.characterSet;
+      charsetSrc = aDocument.characterSetSource;
+      dump ("getDocumetnCharset: charset obtained from aDcoument :" + charset +
"\n");
+      dump ("getDocumetnCharset: charset source :" + charsetSrc + "\n");
The originCharset of a URI will often be UTF8 just because someone in Mozilla
internals helpfully converted the unescaped URI string to UTF8 before creating
the URI object....  or just passed in a null charset at URI object creation (a
lot of people do that).
> The originCharset of a URI will often be UTF8 
   
   At least in |nsExternalHelperAppService::ExtractSuggestedFilename| 
(see attachment 117822 [details] [diff] [review]) and somehwer else in nsImageDocument.cpp(see 
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsImageDocument.cpp&root=/cvsroot&subdir=mozilla/content/html/document/src&command=DIFF_FRAMESET&rev1=1.92&rev2=1.93)
it helped me with guessing the encoding
of the URL.  In JS code, I think I need to figure out the equivalent
of what I do in attachment 117822 [details] [diff] [review] (getting the URI of a channel and 
the originCharset of that URI).

The following may be a bit off-topic in a sense.

> internals helpfully converted the unescaped URI string to UTF8 before 
> creating the URI object.... 

  Is that what originCharset is supposed to indicate? I thought
it indicates the charset of the referrer (referring document)
if it's 'well-defined' and 'well-known'. 
 
 
   When a standalone image is loaded in its own window by clicking on the link,
even getting originCharset the way described above (via channel) wouldn't help
because the charset of an 'image-only document' is set to ISO-8859-1[1], which
is why |charset=document.commandDispatcher.focusedWindow.document.characterSet| 
gives me ISO-8859-1 (comment #11)

[1] This may have to do with the following or related code:

http://lxr.mozilla.org/seamonkey/source/content/base/src/nsDocument.cpp#511

For 'image-only' documents, mCharacterSet seems never set. It may be
argued that |mCharacterSet| of 'stand-alone non-textual' documents
have to be set to charset of the referring document if it's well-defined.

 




 
> internals helpfully converted the unescaped URI string to UTF8 before 
> creating the URI object....  or just passed in a null charset at URI 
> object creation

That's why I want to have originCharset, instead. I suspect this is not
set to UTF-8 if the referring document is explicitly set to charset
other than UTF-8. For instance, http://jshin.net/moztest/download.html 
is in EUC-KR and I believe 
originCharset of http://jshin.net/moztest/random21.yyy
is EUC-KR. Otherwise, my patch to 162765 wouldn't work, but it works. 

Sorry the last paragraph in comment #13 had to be deleted. Pls, ignore it.  
Target Milestone: Future → mozilla1.4alpha
Attached patch a working patch (obsolete) — Splinter Review
It turned out that I was missing the obvious. Anyway, with this patch
along with patches for bug 162765 and bug 191541, 'save target as
and save image as' work fine. 

The only exception is 
when 'save image as' is loaded in a standalone browser pane.
In that case, 'charset' of the focused window is always set to
ISO-8859-1 (see my previous comment). Anyway, that has to be
dealt with in a separate bug I'm gonna file if not filed yet.
Attachment #117838 - Attachment is obsolete: true
Comment on attachment 118012 [details] [diff] [review]
a working patch

asking r/sr. this may be a bit premature
because bug 162765  that blocks this
is not yet fixed. But I'm pretty sure
it'll soon (a working patch is waiting for
review).
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: review?(ben)
bug 198598 has just been filed for a problem
with saving
'stand-alone' 'non-textual' document.
Comment on attachment 118012 [details] [diff] [review]
a working patch

>+function getCharsetforSave(aDocument)
>+{
>+  if (aDocument) 
>+      charset = aDocument.characterSet;

Too much indentation.

>+  else if (document.commandDispatcher.focusedWindow) {
>+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
>+  }
>+  else {
>+    charset = window._content.document.characterSet;
>+  }
>+  return charset;
>+}
>+

It's inconsistent to use { } for the second and third case but not for the
first.

sr=jag with those nits fixed.
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: superreview+
Attachment #118012 - Flags: review?(neil)
Attachment #118012 - Flags: review?(ben)
Thank you jag for sr. 
I'll fix nits as you pointed out.
BTW, could you take a look at my patch
for  bug 162765 if you can? This bug is blocked
by it and I like to land both patches together
before 1.4. 
Thanks again.
Comment on attachment 118012 [details] [diff] [review]
a working patch

>+    var dispHeader = this.mContentDisposition;
Is it worth duplicating this variable?

>+      // To make JS engine happy.
>+      var dummy = new Object();
Yeah, I hate that too, but you should use
var dummy = { value: null };

>+      try {
>+        fileName = mhp.getParameter(dispHeader, "filename", charset, true, dummy);
>+      } 
>+      catch (e) {
>+        fileName = mhp.getParameter(dispHeader, "name", charset, true, dummy);
>       }
OK, so you catch an exception for the filename, can getting the name throw an
exception too? That sounds bad.

>+function getCharsetforSave(aDocument)
>+{
>+  if (aDocument) 
>+      charset = aDocument.characterSet;
>+  else if (document.commandDispatcher.focusedWindow) {
>+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
>+  }
>+  else {
>+    charset = window._content.document.characterSet;
>+  }
>+  return charset;
>+}
You forgot to declare charset. But you could just use
if (...)
  return ...;
if (...)
  return ...;
return ...;
Attachment #118012 - Flags: review?(neil) → review-
Thanks for catching those things, Neil.
Thank you for review. I addressed concerns of both of you. 
Can you review this patch?  Thanks.
Attachment #118012 - Attachment is obsolete: true
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

realized that neil was not on CC. asking
for r/sr explicitly.
BTW, I'll change two statements below
with 'return fileName.replace(...);'.

+    fileName = fileName.replace(/^"|"$/g, "");
+    return fileName;
Attachment #121757 - Flags: superreview?(jaggernaut)
Attachment #121757 - Flags: review?(neil)
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

>+    reuturn document.commandDispatcher.focusedWindow.document.characterSet;
Silly typo :-)

>+  }
>+  else {
Don't bother with else after return (twice).
Attachment #121757 - Flags: review?(neil) → review+
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

+    reuturn document.commandDispatcher.focusedWindow.document.characterSet;

Typo. Combined with Neil's comment, this becomes:

function getCharsetforSave(aDocument)
{
  if (aDocument)
    return aDocument.characterSet;

  if (document.commandDispatcher.focusedWindow)
    return document.commandDispatcher.focusedWindow.document.characterSet;

  return  window._content.document.characterSet;
}


Does commandDispatcher.focusedWindow always point to the window you want? Can
we somehow get to this function while it points to another frame, or to the XUL
window instead?
Thank you for review, neil and jag.

> Does commandDispatcher.focusedWindow always point to the window you want? Can
> we somehow get to this function while it points to another frame, or to the XUL
> window instead?

  I guess we're safe as long as 'Save Target As/Save Image As/Save Frame As' are
concerned.
As for 'Save Page As', I'm not sure what it does for framed pages if the encoding
of a page is different from the encoding of a focused frame.  Do you think it may
be better to put window._content.document.characterSet before commandDispatcher.fW? 
I haven't tested framed pages, but in non-frame cases, both are always the came.

BTW, you may be interested in bug 199237, too. That bug has little to do with the
question at hand, but is one of several 'save'-related bugs. 
I did some experiments with a framed page
(http://jshin.net/moztest/dl_frame.html) and found that trying
aDocument.characterset,document.commandDispatcher.focusedWindow, and 
window._content.document.characterSet in that order is the 'right' thing to do. 

For right-click-activate menu items (saev link target as, save imgge as), 
document.commandDispatcher.focusedWindow always work giving us charset of the
focused frame. For Save Frame As, both aDocument.characterset and d.c.f work.
However, window._content.document.characterSet has the charset declared in the
frameset html (instead of a focused frame).  

Now bug 162765 this bug depends on was fixed, let's land this one as well. jag,
can I get sr? 
Status: NEW → ASSIGNED
Could you attach a new patch without the typo (and perhaps with the changes made
as suggested in comment 24 and comment 25)? Will sr when that's attached.
Attachment #121757 - Attachment is obsolete: true
Attachment #126311 - Flags: superreview?(jaggernaut)
Comment on attachment 126311 [details] [diff] [review]
a new patch (with typo fixed and redundant 'else if' eliminated)

carrying over r=Neil, sr=jag

When editing an existing file, prefer to stay in that file's style. In this
case, don't use {} for single line |if|s.

No need for a new patch.
Attachment #126311 - Flags: superreview?(jaggernaut)
Attachment #126311 - Flags: superreview+
Attachment #126311 - Flags: review+
Thank you all. Fix checked in (with braces removed). Now moving on to bug 199237.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Attachment #121757 - Flags: superreview?(jaggernaut)
*** Bug 119028 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: