save as, save link target as, save image as : non-ASCII filename/title/URL handling

RESOLVED FIXED in mozilla1.4alpha

Status

()

Core
Internationalization
RESOLVED FIXED
16 years ago
15 years ago

People

(Reporter: Jungshik Shin, Assigned: Jungshik Shin)

Tracking

({intl})

Trunk
mozilla1.4alpha
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(1 attachment, 3 obsolete attachments)

(Assignee)

Description

16 years ago
This is to add an I18N twist to bug 115176, which
is by itself pretty complicated. 
When 'save as' is used to save the html doc
(with non-ASCII URL whether URL-encoded or not, non-ASCII title) in a
browser window  or 'save link target as' or 'save
image as' is invoked on docs and images
(with non-ASCII URL whether URL-encoded or not or 
non-ASCII title, non-ASCII
link name) linked from the page, Mozilla suggests a filename
which is not what users usually expect.

This is a complex problem. Some aspects of this issue
cannot be resolved by web clients alone and require
coordination with popular http servers. Implication is
that Mozilla.org may have to be involved with standardization
process to deal with this issue. 

I'll get back with more details soon.
(Assignee)

Updated

16 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 1

16 years ago
agree with Jungshik.
future for now
Status: NEW → ASSIGNED
Target Milestone: --- → Future
(Assignee)

Comment 2

16 years ago
A real life example sent by KANG Jeong-hee :

http://www.mogaha.go.kr/webapp/viewNoticeBoardServlet?serial=214012&board_list=54&page=1&sort=reg_date&listnum=10&main=1

In the page, there's a link to  '국제포럼(020821).hwp' (followed by
'(61481 bytes)' above
the  row in a light purple( with 02-3703-4895) and right of a skyblue cell.

When 'save link target as' is selected, the filename suggested by Mozilla
URL-encodes non-ASCII characters. 

The page is in EUC-KR, but there's no guarantee that the local filesystem
uses the same encoding as the web page itself. 
As I wrote in my bug report, there are many factors to take into account.

BTW, in this particular case, 'Content-Disposition:' header is not
emitted by the server. Therefore, making Mozilla's C-D header handling
compliant to RFC 2231 wouldn't help. (see bug 155949 and bug 162765 for
RFC 2231 compliance)


 
(Assignee)

Comment 4

16 years ago
In particular case mentioned in comment #2, fixing bug 155569 solved
the problem. The page in comment #2 is in EUC-KR(so are
the URL and the filename included in the URL in question) and I'm running
Mozilla in ko_KR.UTF-8 locale. The filename suggested by Mozilla
is the UTT-8(the local file system encoding) representation of
the file name represented in EUC-KR(the document encoding).
of course, this does not mean that we can close this bug because
there are more complicated cases as discussed in bug 115176. 
(Assignee)

Comment 5

16 years ago
'save target as' and 'save image as' are handled by
http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/contentAreaUtils.js

Once I fix bug 162765 and expose necessary methods to Javascript,
it should be easy to fix this bug for most cases (some other related
bugs have been resolved over the last few months by Boris).  
On 'non-Unicode platforms', a mismatch between the
local file system encoding and what Mozilla put filenames in 
may  still remain, though. 

 
(Assignee)

Comment 6

16 years ago
attachment 107217 [details] [diff] [review] to bug 162765 makes Mozilla work well 
for standard-compliant cases and some non-standard-compliant
cases as well(RFC 2231 has to be used, but some servers use
RFC 2047-encoding). Unlike its C++ counterpart (that makes
use of 'originCharset' of the current URI when untagged
raw 8bit chars are encountered), Javascript 
(for 'save target as') doesn't. I have to figure out how
to access 'originCharset' of the current URI with JS. 

BTW, Roy, can you assign this to me? 

Comment 7

16 years ago
sure. thanks
Assignee: yokoyama → jshin
Status: ASSIGNED → NEW

Updated

16 years ago
QA Contact: ruixu → ylong
(Assignee)

Updated

15 years ago
Blocks: 162765
(Assignee)

Comment 8

15 years ago
Sorry for spamming. I got the dependency the other way around.  
No longer blocks: 162765
Depends on: 162765
(Assignee)

Comment 9

15 years ago
Created attachment 117838 [details] [diff] [review]
a partial patch

This patch has to be applied along with my patches for
bug 162765 and bug 191541. Still, it only works
for RFC 2231 and RFC 2047 encoded filename parameters
in C-D header. When raw 8bit characters are used as
filename paramter in C-D header, this patch doesn't work.
The key is to figure out how to get |originCharset| of
the referring document in contentAreaUtil.js. 

Last December, I tinkered with this problem for a while
and I believe I came up with a solution, but I couldn't
find any trace of it on my disk. (my computer underwent
a huge overhaul..).  Anyway, I'm uploading this patch 
so that others can take a look (there are a lot of
commented-out lines and debug output, which reflects the tentative nature
of the patch) and help me out. 

Getting originCharset in nsExternalHelperAppService.cpp 
was easy, but I couldn't quite figure out in JS code.

BTW, I always get  UTF-8 as |documentCharset|
in |getDocumentCharset| at the end of my patch 
regardless of the charset used in the refering document. 
For instance, http://jshin.net/moztest/download.html
 is explicitly marked as in EUC-KR, but when I try
to save one of files with raw 8bit chars in C-D
header, I get UTF-8 in JS code. 

I changed the locale under which Mozilla is launched
(my primary locale is ko_KR.UTF-8), but nothing
has changed. I tried ja_JP.EUC-JP, ko_KR.EUC-KR
and C. 

Any idea as to what's going on?
(Assignee)

Comment 10

15 years ago
Adding Naoki to CC because he wrote 
|nsTextToSubURI| apparently because
he wanted to solve problems related to this
(and I used it a couple of places to solve
similar problems.)

Another example to show that getting
|originCharset| is important is 
as following:

http://www.mogaha.go.kr/webapp/bbs/pub/view.action?bid=200&serial=268306&sc_param=common&listStatusStr=rO0ABXNyABpjb20uZ2VuMTI4LmRhdGEuTGlzdFN0YXR1c7sGCWct5HSSAgAESQAHbWF4U2l6ZUkABm9mZnNldEwACmNvbmRpdGlvbnN0ABNMamF2YS91dGlsL0hhc2hNYXA7TAAEc29ydHQAEkxqYXZhL2xhbmcvU3RyaW5nO3hwAAAADwAAAABzcgARamF2YS51dGlsLkhhc2hNYXAFB9rBwxZg0QMAAkYACmxvYWRGYWN0b3JJAAl0aHJlc2hvbGR4cD9AAAAAAABLdwgAAABlAAAAAHh0AAZzZXJpYWw%3D

Sorry this URL is reaallly long. That document
is in EUC-KR and there's a jpg file linked
(look for '284943'). Now click on the image
linked and it will get loaded. Mozilla gets
the filename correctly in Korean and nicely
puts it on the titlebar. However, 'Save Image
as' is attempted, it comes up
with the suggested file name as if
'escaped URL' is in ISO-8859-1. Why?
Because |documentCharset| is assummed to be
in ISO-8859-1. 

It seems like there are two possible solutions.
One is to figure out a way to get charset
of the refering document (|originCharset|)
in nsContentAreaUtil.js and the other
is to set 'charset' (nominal) of image
documents to that of the referring document
instead of the default(?) ISO-8859-1. 

I just hit upon another idea, but that's beyond
the scope of this bug. On several places
in Mozilla, ISO-8859-1 is taken as the last
resort (or sometimes the default). This may have
to be subjected to the localization and/or
the configurability. Anyway, I'll discuss
this in another forum. 
 

(Assignee)

Comment 11

15 years ago
I'm sorry it appears that my second solution
in my prev. comment would not work because
charset is obtained from this code 

+  else if (document.commandDispatcher.focusedWindow) {
+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
+    dump ("getDocumetnCharset: charset obtained from cmddispatch.fw :" +
charset + "\n");
+  }

rather than this code

+      charset = aDocument.characterSet;
+      charsetSrc = aDocument.characterSetSource;
+      dump ("getDocumetnCharset: charset obtained from aDcoument :" + charset +
"\n");
+      dump ("getDocumetnCharset: charset source :" + charsetSrc + "\n");
The originCharset of a URI will often be UTF8 just because someone in Mozilla
internals helpfully converted the unescaped URI string to UTF8 before creating
the URI object....  or just passed in a null charset at URI object creation (a
lot of people do that).
(Assignee)

Comment 13

15 years ago
> The originCharset of a URI will often be UTF8 
   
   At least in |nsExternalHelperAppService::ExtractSuggestedFilename| 
(see attachment 117822 [details] [diff] [review]) and somehwer else in nsImageDocument.cpp(see 
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsImageDocument.cpp&root=/cvsroot&subdir=mozilla/content/html/document/src&command=DIFF_FRAMESET&rev1=1.92&rev2=1.93)
it helped me with guessing the encoding
of the URL.  In JS code, I think I need to figure out the equivalent
of what I do in attachment 117822 [details] [diff] [review] (getting the URI of a channel and 
the originCharset of that URI).

The following may be a bit off-topic in a sense.

> internals helpfully converted the unescaped URI string to UTF8 before 
> creating the URI object.... 

  Is that what originCharset is supposed to indicate? I thought
it indicates the charset of the referrer (referring document)
if it's 'well-defined' and 'well-known'. 
 
 
   When a standalone image is loaded in its own window by clicking on the link,
even getting originCharset the way described above (via channel) wouldn't help
because the charset of an 'image-only document' is set to ISO-8859-1[1], which
is why |charset=document.commandDispatcher.focusedWindow.document.characterSet| 
gives me ISO-8859-1 (comment #11)

[1] This may have to do with the following or related code:

http://lxr.mozilla.org/seamonkey/source/content/base/src/nsDocument.cpp#511

For 'image-only' documents, mCharacterSet seems never set. It may be
argued that |mCharacterSet| of 'stand-alone non-textual' documents
have to be set to charset of the referring document if it's well-defined.

 




 
> internals helpfully converted the unescaped URI string to UTF8 before 
> creating the URI object....  or just passed in a null charset at URI 
> object creation

That's why I want to have originCharset, instead. I suspect this is not
set to UTF-8 if the referring document is explicitly set to charset
other than UTF-8. For instance, http://jshin.net/moztest/download.html 
is in EUC-KR and I believe 
originCharset of http://jshin.net/moztest/random21.yyy
is EUC-KR. Otherwise, my patch to 162765 wouldn't work, but it works. 

(Assignee)

Comment 14

15 years ago
Sorry the last paragraph in comment #13 had to be deleted. Pls, ignore it.  
Target Milestone: Future → mozilla1.4alpha
(Assignee)

Comment 15

15 years ago
Created attachment 118012 [details] [diff] [review]
a working patch

It turned out that I was missing the obvious. Anyway, with this patch
along with patches for bug 162765 and bug 191541, 'save target as
and save image as' work fine. 

The only exception is 
when 'save image as' is loaded in a standalone browser pane.
In that case, 'charset' of the focused window is always set to
ISO-8859-1 (see my previous comment). Anyway, that has to be
dealt with in a separate bug I'm gonna file if not filed yet.
Attachment #117838 - Attachment is obsolete: true
(Assignee)

Comment 16

15 years ago
Comment on attachment 118012 [details] [diff] [review]
a working patch

asking r/sr. this may be a bit premature
because bug 162765  that blocks this
is not yet fixed. But I'm pretty sure
it'll soon (a working patch is waiting for
review).
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: review?(ben)
(Assignee)

Comment 17

15 years ago
bug 198598 has just been filed for a problem
with saving
'stand-alone' 'non-textual' document.

Comment 18

15 years ago
Comment on attachment 118012 [details] [diff] [review]
a working patch

>+function getCharsetforSave(aDocument)
>+{
>+  if (aDocument) 
>+      charset = aDocument.characterSet;

Too much indentation.

>+  else if (document.commandDispatcher.focusedWindow) {
>+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
>+  }
>+  else {
>+    charset = window._content.document.characterSet;
>+  }
>+  return charset;
>+}
>+

It's inconsistent to use { } for the second and third case but not for the
first.

sr=jag with those nits fixed.
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: superreview+
Attachment #118012 - Flags: review?(neil)
Attachment #118012 - Flags: review?(ben)
(Assignee)

Comment 19

15 years ago
Thank you jag for sr. 
I'll fix nits as you pointed out.
BTW, could you take a look at my patch
for  bug 162765 if you can? This bug is blocked
by it and I like to land both patches together
before 1.4. 
Thanks again.

Comment 20

15 years ago
Comment on attachment 118012 [details] [diff] [review]
a working patch

>+    var dispHeader = this.mContentDisposition;
Is it worth duplicating this variable?

>+      // To make JS engine happy.
>+      var dummy = new Object();
Yeah, I hate that too, but you should use
var dummy = { value: null };

>+      try {
>+        fileName = mhp.getParameter(dispHeader, "filename", charset, true, dummy);
>+      } 
>+      catch (e) {
>+        fileName = mhp.getParameter(dispHeader, "name", charset, true, dummy);
>       }
OK, so you catch an exception for the filename, can getting the name throw an
exception too? That sounds bad.

>+function getCharsetforSave(aDocument)
>+{
>+  if (aDocument) 
>+      charset = aDocument.characterSet;
>+  else if (document.commandDispatcher.focusedWindow) {
>+    charset = document.commandDispatcher.focusedWindow.document.characterSet;
>+  }
>+  else {
>+    charset = window._content.document.characterSet;
>+  }
>+  return charset;
>+}
You forgot to declare charset. But you could just use
if (...)
  return ...;
if (...)
  return ...;
return ...;
Attachment #118012 - Flags: review?(neil) → review-

Comment 21

15 years ago
Thanks for catching those things, Neil.
(Assignee)

Comment 22

15 years ago
Created attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

Thank you for review. I addressed concerns of both of you. 
Can you review this patch?  Thanks.
Attachment #118012 - Attachment is obsolete: true
(Assignee)

Comment 23

15 years ago
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

realized that neil was not on CC. asking
for r/sr explicitly.
BTW, I'll change two statements below
with 'return fileName.replace(...);'.

+    fileName = fileName.replace(/^"|"$/g, "");
+    return fileName;
Attachment #121757 - Flags: superreview?(jaggernaut)
Attachment #121757 - Flags: review?(neil)

Comment 24

15 years ago
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

>+    reuturn document.commandDispatcher.focusedWindow.document.characterSet;
Silly typo :-)

>+  }
>+  else {
Don't bother with else after return (twice).
Attachment #121757 - Flags: review?(neil) → review+

Comment 25

15 years ago
Comment on attachment 121757 [details] [diff] [review]
a new patch per neil's and jag's comments

+    reuturn document.commandDispatcher.focusedWindow.document.characterSet;

Typo. Combined with Neil's comment, this becomes:

function getCharsetforSave(aDocument)
{
  if (aDocument)
    return aDocument.characterSet;

  if (document.commandDispatcher.focusedWindow)
    return document.commandDispatcher.focusedWindow.document.characterSet;

  return  window._content.document.characterSet;
}


Does commandDispatcher.focusedWindow always point to the window you want? Can
we somehow get to this function while it points to another frame, or to the XUL
window instead?
(Assignee)

Comment 26

15 years ago
Thank you for review, neil and jag.

> Does commandDispatcher.focusedWindow always point to the window you want? Can
> we somehow get to this function while it points to another frame, or to the XUL
> window instead?

  I guess we're safe as long as 'Save Target As/Save Image As/Save Frame As' are
concerned.
As for 'Save Page As', I'm not sure what it does for framed pages if the encoding
of a page is different from the encoding of a focused frame.  Do you think it may
be better to put window._content.document.characterSet before commandDispatcher.fW? 
I haven't tested framed pages, but in non-frame cases, both are always the came.

BTW, you may be interested in bug 199237, too. That bug has little to do with the
question at hand, but is one of several 'save'-related bugs. 
(Assignee)

Comment 27

15 years ago
I did some experiments with a framed page
(http://jshin.net/moztest/dl_frame.html) and found that trying
aDocument.characterset,document.commandDispatcher.focusedWindow, and 
window._content.document.characterSet in that order is the 'right' thing to do. 

For right-click-activate menu items (saev link target as, save imgge as), 
document.commandDispatcher.focusedWindow always work giving us charset of the
focused frame. For Save Frame As, both aDocument.characterset and d.c.f work.
However, window._content.document.characterSet has the charset declared in the
frameset html (instead of a focused frame).  

Now bug 162765 this bug depends on was fixed, let's land this one as well. jag,
can I get sr? 
Status: NEW → ASSIGNED

Comment 28

15 years ago
Could you attach a new patch without the typo (and perhaps with the changes made
as suggested in comment 24 and comment 25)? Will sr when that's attached.
(Assignee)

Comment 29

15 years ago
Created attachment 126311 [details] [diff] [review]
a new patch (with typo fixed and redundant 'else if' eliminated)
(Assignee)

Updated

15 years ago
Attachment #121757 - Attachment is obsolete: true
(Assignee)

Updated

15 years ago
Attachment #126311 - Flags: superreview?(jaggernaut)

Comment 30

15 years ago
Comment on attachment 126311 [details] [diff] [review]
a new patch (with typo fixed and redundant 'else if' eliminated)

carrying over r=Neil, sr=jag

When editing an existing file, prefer to stay in that file's style. In this
case, don't use {} for single line |if|s.

No need for a new patch.
Attachment #126311 - Flags: superreview?(jaggernaut)
Attachment #126311 - Flags: superreview+
Attachment #126311 - Flags: review+
(Assignee)

Comment 31

15 years ago
Thank you all. Fix checked in (with braces removed). Now moving on to bug 199237.
Status: ASSIGNED → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → FIXED
(Assignee)

Updated

15 years ago
Attachment #121757 - Flags: superreview?(jaggernaut)
(Assignee)

Comment 32

15 years ago
*** Bug 119028 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.