Closed Bug 205682 Opened 21 years ago Closed 21 years ago

Save page as (complete) : URLs referred to are turned to/sent back in UTF-8

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 3 obsolete files)

Mozilla sends URLs in UTF-8 when an inline image with a non-UTF-8 filename in a
non-UTF-8 document is right-clicked on and 'view image' is selected. The same is
true of background images. 

How to reproduce :

Go to  http://jshin.net/moztest/download.html.
In the page (in EUC-KR), right-click on the second blue ball (next to
'KO : EUC-KR : raw 8bit') and select 'view image'. Mozilla sends the URL
in url-escaped UTF-8. On the other hand, if you do the same on the third
blueball (<img src="url-escaped image name in EUC-KR.png">), Mozilla
sends the URL in url-escaped EUC-KR. 

---------------
Because viewing an in-line image in the current window  is not
speed-critical(unlike in general cases discussed in bug 129726), we can get away
with trying once more with the origin charset if the first request (with UTF-8
url) fails. 

Javascript code that invokes this is at 
http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/nsContextMenu.js#604
http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/utilityOverlay.js#213

There's already some fallback mechanism for failed requests, but afaik, it falls
back to the locale charset instead of origincharset. I'm writing from memory and
I have to check again. Related to this bug is bug 199237.
There's another symptom which is  more serious. When 'Web page complete' is
chosen in 'Save Page As' dialog box, images with names in the encoding 'native
to the document' are not downloaded because URLs are sent back in UTF-8, but the
server doesn't know how to handle URLs in UTF-8 because its filesystem charset
is not UTF-8. As a result, web pages saved with 'web page complete' do not
include images. This would surprise users because they are able to view images,
but can't save them using 'Save Page As' and "Save image as'
Summary: view image/view background image sends URL in UTF-8 → once loaded, image urls are sent back in UTF-8 (view image, save page as, save image as)
This bug  was fixed by bz's recent nsIURI-related   
check-in(s) ( bug 47534. I thought there's another   
bug in which |originCharset| was made to be set explicitly, but couldn't find.   
bz, can you tell me which one is?). That is, contextual menu 'view image' and   
'save image' work fine with non-ASCII urls. (Try it at  
http://i18nl10n.com/dl.html).   
 
However, now 'save page (as complete)' doesn't work. Files referred to in a 
html document with _relative_ urls are not downloaded/saved but instead  their 
urls (relative) are turned absolute in the downloaded/saved version of the 
html document.  I'm going to file a new bug on that.  
 
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Probably bug 166996?  That's what added the function the checkin for bug 47534
uses...

What's up with the persistence stuff?  Is it broken only on non-ASCII uris?  Or
all URIs?  Is it something that broke with the bug 47534 checkin?

If this is only broken for non-ASCII, does hacking all the NS_NewURI calls in
nsWebBrowserPersist to pass the page charset of whatever page it's failing on
fix things?   (I hope it does....)  In any case, please cc me on this bug that
you file.
I'm reopening with a revised summary line because only 'view image'/'save  
image' was fixed.   
  
'Save page as' (added in comment #1) issue  is still with us.  :-)  
  
Thanks, bz. Yeah, it's bug 166996.   
  
As for the persistent stuff, I was mistaken. It turned out that Mozilla always  
turned URLs (referred in <a href="....">) to absolute URLs and never  
downloaded referred documents (i.e. Mozilla is not a 'recursive batch-mode  
downloader'. there's a bug filed for this *feature*) although inline images  
are downloaded and saved. So, there's nothing wrong with turning relative urls  
to absolute urls and fixes for bug 166996 and bug 47534 did NOT introduce a 
new bug. (they just fixed a part of this bug.) 
  
However, when converting (relative) urls to absolute urls, it also turns URL  
specs to UTF-8 (url-escaped). It also tries to download inline images with  
UTF-8 URLs only to get an error message returned from the server. A bit  
interesting is that Mozilla saves 'the error page' (in html) as if it's an  
image file and ends up with  a strange link like '<img src="dl_files/a.html">  
  
I added PR_LOG() to nsStandardURL::Init() as suggested by Darin in bug 150376 
and found that |originCharset| is sometimes 'EUC-KR' (doccharset), but other 
times it's null for files with non-ASCII names.  I suspect it's null whenever 
a url spec (urlstring) is passed from xpfe code to load/open it.  
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> However, when converting (relative) urls to absolute urls, it also turns URL  
> specs to UTF-8 (url-escaped). 

Yeah; that could be the NS_NewURI calls I mentioned in comment 3.  Does hacking
them to pass "EUC-KR" (literal string for now) as the charset fix the EUC-KR
tescase?  Just trying to establish which code needs to be fixed, exactly... ;)
Thanks. I added "EUC-KR" to all 4 calls in NS_NewURI() in the file and it 
worked in ko_KR.EUC-KR, ja_JP.EUC-JP and ko_KR.UTF-8 locales.  Now I have to 
figure out how to get the value to use when calling NS_NewURI() and which one 
to  invoke with |charset|   
 
 
BTW, non-ASCII inline images are saved into 'a.gif' (and presumably 'b.gif', 
'c.gif', and so forth) instead of their original names (url-escaped. It's not 
clear what to do when the local filesystem charset is different from the 
charset of a document to save) This should be a separate bug, anyway.    
Summary: once loaded, image urls are sent back in UTF-8 (view image, save page as, save image as) → Save page as (complete) : URLs referred to are turned to/sent back in UTF-8
The persistance object only does this whole fixup thing if it's told to
saveDocument, so there should be a document object floating around somewhere to
get the charset off of....
Attached patch a patch (obsolete) — Splinter Review
It took a bit more extensive change than I thought at first. Anyway, it works
as intended. There may be still some rough edges. I used 'const char*' for
|a(Origin)Charset| (partly because that's what NS_NewURI accepts), but changing
some of them to |nsACString| may be better. 

Any comment would be welcome.
Comment on attachment 128053 [details] [diff] [review]
basically the same patch with nsACString

Asking for r/sr.

I'll replace the following comment to FixupNode() 

+   * The charset of a document is passed along so that the charset conversion 
+   * is done, if necessary, when fixing up nodes.

with

The charset of .... passed along so that the uri attribute of the fixed-up node
will be created with |originCharset| set to the charset.
Attachment #128053 - Flags: superreview?(bzbarsky)
Attachment #128053 - Flags: review?(adamlock)
Why pass the charset around?  Why not just set it as a member at the same time
as we set mCurrentBaseURI or something?

btw, check that adam is still reading his @netscape.com email address?
Attached patch a new patch (obsolete) — Splinter Review
Instead of passing around the charset, I use member variables, mCurrentCharset
and mCharset as suggested by bz.
Attachment #127800 - Attachment is obsolete: true
Attachment #128053 - Attachment is obsolete: true
Attachment #128053 - Flags: superreview?(bzbarsky)
Attachment #128053 - Flags: review?(adamlock)
Comment on attachment 128609 [details] [diff] [review]
a new patch

It seems that adamlock's new email address (adamlock @ eircom.net) is not yet
in bugzilla. 

bz, do you know anyone else I can ask for r?
Attachment #128609 - Flags: superreview?(bzbarsky)
Comment on attachment 128609 [details] [diff] [review]
a new patch

>Index: embedding/components/webbrowserpersist/src/nsWebBrowserPersist.cpp
>     // Persist the main document
>     nsCOMPtr<nsIDocument> doc(do_QueryInterface(aDocument));
>+    NS_ENSURE_TRUE(doc, NS_ERROR_UNEXPECTED);

Don't add this -- it's not going to happen.  We already ensure aDocument is
non-null and the QI will never fail for a document object.

>         // Save the document
>         nsCOMPtr<nsIDocument> docAsDoc = do_QueryInterface(aDocument);
>+        NS_ENSURE_TRUE(docAsDoc, NS_ERROR_UNEXPECTED);

Same.

>+    rv = NS_NewURI(getter_AddRefs(uri), 
>+                   nsDependentCString(((nsCStringKey *) aKey)->GetString()),
>+                   (data->mCharset).get());

How about nsDependentCString(((nsCStringKey*)aKey)->GetString(),
			     ((nsCStringKey*)aKey)->GetStringLength())
?

Also, data->mCharset.get() should work, no?  No need for parens there.

>-    nsIDOMNode *aNode, const char *aAttribute, PRBool aNeedsPersisting,
>+    nsIDOMNode *aNode, const char *aAttribute, PRBool aNeedsPersisting, 

Don't add spaces at ends of lines.

>+    nsresult rv = NS_NewURI(getter_AddRefs(uri), NS_ConvertUTF16toUTF8(aURI), 
>+                            mCurrentCharset.get(), mCurrentBaseURI);

Just pass aURI without converting; NS_NewURI will convert as and if needed.

sr=me with those changes.

Adam is really the only person qualified to do the r=, in my opinion.  Just
mail him at whatever address works and as for review if bugzilla has not been
updated to that address yet.
Attachment #128609 - Flags: superreview?(bzbarsky) → superreview+
Before asking Adam for r, I thought it might be easier if I upload a new patch
with bz's concerns all addressed.
Attachment #128609 - Attachment is obsolete: true
Comment on attachment 129359 [details] [diff] [review]
a new patch (per bz's comment)

Assuming bz's sr still stands, I'm now asking for r.
Attachment #129359 - Flags: review?(adamlock)
Comment on attachment 129359 [details] [diff] [review]
a new patch (per bz's comment)

r=adamlock

Looks good to me
Attachment #129359 - Flags: review?(adamlock) → review+
Comment on attachment 129359 [details] [diff] [review]
a new patch (per bz's comment)

Thanks, Adam. 

Asking for a1.5b. This is a rather low-risk and simple patch that fixes
problems with saving html documents with images/frames/iframes (in non-ASCII
filenames). I've been using this patch for a couple of weeks now on my build
and it's worked fine on both Linux/Win(well, there's nothing platform
specific). A lot of international users (who want to save non-ASCII documents)
would be glad to see this fixed.

This attachment doesn't have sr stamp, but this is an update to the previous
version addressing the concerns of bz who gave me sr.
Attachment #129359 - Flags: approval1.5b?
Comment on attachment 129359 [details] [diff] [review]
a new patch (per bz's comment)

approved for 1.5 landing
Attachment #129359 - Flags: approval1.5b? → approval1.5b+
Thank you all. Fix got landed in the trunk. 
Status: REOPENED → RESOLVED
Closed: 21 years ago21 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: