Closed Bug 393246 Opened 12 years ago Closed 6 years ago

CreateFixupURI should use UTF-8, not the system charset

Categories

(Core :: Document Navigation, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox28 --- fixed

People

(Reporter: dao, Assigned: hsivonen)

References

Details

(Keywords: intl, Whiteboard: [qa-])

Attachments

(1 file, 2 obsolete files)

Five years ago, bug 130393 comment 0:
> Try to type any DBCS URL in the location bar and enter.
> 
> You will get "file not found" error.
> 
> It's b/c mozilla encode/escape URL typed in location bar to UTF-8 b/f sending 
> it to server.
> 
> For IE, it sends in system charset when "always send URLs as UTF-8" is off,
> and it sends in UTF-8 encoding when the option is on,
> which increases the possibility to find the file in the server.
> 
> According to research done in bugs related to non-ASCII URLs, most modern 
> servers are not ready to receive URLs in UTF-8 encoding, but expects them in 
> their file system charset.
> 
> Therefore, sending URLs in system charset will increase the possibility of 
> fidning the file wanted, even though it would not work when the server charset 
> does not match with client machine's charset.

Chances are that "most modern servers" are now ready to accept UTF-8 encoded URLs and that using the system charset actually produces more errors.

Since CreateFixupURI is utilized by LoadURI, the current behavior also violates the interface, which states: "For HTTP and FTP URLs and possibly others, characters above U+007F will be converted to UTF-8 and then URL- escaped per the rules of RFC 2396."

Note that "always send URLs as UTF-8" is on by default in IE6.
Flags: blocking1.9?
Whiteboard: [parity-opera] [parity-IE]
Summary: Always send typed URLs as UTF-8 → Always send URLs as UTF-8
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 261929
How can this be a duplicate of bug 261929? That one claims to be fixed.
It looks like the patch for bug added a "network.standard-url.encode-query-utf8" pref that defaults to false, and then changed the default value of network.standard-url.encode-utf8 to true. According to the comments there this makes us match IE...
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: Always send URLs as UTF-8 → Always encode query string values as UTF-8 (network.standard-url.encode-query-utf8 = true)
Dão, is this fixed?
Bug 387723 is still open, which depends on this one (at least if we don't want the dirty fix). I think what should happen here is to revert the patch in bug 130393.
(In reply to comment #3)
> It looks like the patch for bug added

Doh, I meant "bug 261929 added".
Unclear what the issues here are.  Bug summary sounds like it's just flipping a pref anyway, so this shouldn't be a deal breaker.
Flags: blocking1.9? → blocking1.9-
Attached patch flip the pref (obsolete) — Splinter Review
I was told this would really fix bug 387723.
Attachment #284307 - Flags: superreview?(cbiesinger)
Attachment #284307 - Flags: review?(cbiesinger)
So what exactly has changed since the fix for bug 261929?
Bookmark keywords broke because the location bar decodes utf8-encoded addresses which are then encoded with the system charset.

bug 387723 comment 5:
> getWebNavigation().loadURI("http://www.google.com/search?&q=Искусство",
> nsIWebNavigation.LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP,null, null, null);
> 
> ...which, despite the nsIWebNavigation::loadURI IDL comments about characters
> being converted to UTF-8, encodes the URI assuming system charset (win-1251 in
> my case):

bug 387723 comment 7:
> getShortcutOrURI() encodes the typed value only if it's a keyword search.
> Doesn't that mean that entering
> "http://www.google.com/search?&q=Искусство" without the new binding
> leads to the wrong page as well?

bug 387723 comment 8:
> Yes.
> Bookmark keywords broke because the location bar

That wasn't my question.  What's changed in web content that would allow us to break compat with IE here?  If bookmark keywords have a problem, maybe we should fix it without changing the behavior exposed to the web at large.

> despite the nsIWebNavigation::loadURI IDL comments 

Maybe we should be fixing whatever is going on here, instead of blanket-changing the behavior of necko?

For example, having whatever NS_NewURI call loadURI does here pass in "UTF8" as the origin charset will make Necko use that for encoding URIs coming through loadURI, no matter what the preference values are or what the system encoding is.

That said, the code in LoadURI passes a null origin charset, which means UTF-8.  So what's _really_ going on here?
In particular, please breakpoint in 

2754 nsDocShell::LoadURI(const PRUnichar * aURI,
2755                     PRUint32 aLoadFlags,
2756                     nsIURI * aReferringURI,
2757                     nsIInputStream * aPostStream,
2758                     nsIInputStream * aHeaderStream)

step down to the NS_NewURI call on line 2777 (this is rev 1.861 of nsDocShell.cpp).  When you step across that, what's the mCharset of the resulting URI?  What is it encoded as?
(In reply to comment #11)
> > Bookmark keywords broke because the location bar
> 
> That wasn't my question.  What's changed in web content that would allow us to
> break compat with IE here?  If bookmark keywords have a problem, maybe we
> should fix it without changing the behavior exposed to the web at large.

Why should typing "g Искусство" be different from "http://www.google.com/search?&q=Искусство"?

> That said, the code in LoadURI passes a null origin charset, which means UTF-8.
>  So what's _really_ going on here?

As I wrote earlier, I think another possible fix is to back out bug 130393. It really seems to break the interface.
> Why should typing "g Искусство" be different

No reason it should.  But if a site has a URI pointing to itself and uses the page encoding on the server, we need to use the page encoding when sending the URI, for compat.  Therefore, blanket-changing the necko behavior is not acceptable.

> another possible fix is to back out bug 130393.

Aha!  So this is purely a URI fixup issue, eh?  Bug 387723 comment 7 made i sound like a pure loadURI issue.

I think that we should in fact back out bug 130393 if IE now defaults to having "always send URLs as UTF-8" on by default, which it sounds like it does.
Attachment #284307 - Attachment is obsolete: true
Attachment #284307 - Flags: superreview?(cbiesinger)
Attachment #284307 - Flags: review?(cbiesinger)
I agree with Boris. IE*5* turned on "always send URLs as UTF-8" by default.
Also, we don't have to consider file system charset about "file:" URL because it now always uses UTF-8.
Summary: Always encode query string values as UTF-8 (network.standard-url.encode-query-utf8 = true) → CreateFixupURI should use UTF-8, not the system charset
Whiteboard: [parity-opera] [parity-IE]
Flags: blocking1.9- → blocking1.9?
Attached patch patch (obsolete) — Splinter Review
Currently untested, feel free to wait with the review until I get that done.
Attachment #285020 - Flags: superreview?(bzbarsky)
Attachment #285020 - Flags: review?(bzbarsky)
Comment on attachment 285020 [details] [diff] [review]
patch

I think you should ask biesi for review.  I can try to do it if he can't, but it will be _very_ laggy.
Attachment #285020 - Flags: superreview?(bzbarsky)
Attachment #285020 - Flags: review?(bzbarsky)
Comment on attachment 285020 [details] [diff] [review]
patch

Tested and appears to be working. My system charset is UTF-8, though.
Attachment #285020 - Flags: superreview?(cbiesinger)
Attachment #285020 - Flags: review?(cbiesinger)
Status: REOPENED → ASSIGNED
Flags: blocking1.9? → blocking1.9-
Should this be [wanted-1.9]?
Please don't ask this on every single bug we triage blocking1.9-.  If we intend to mark something wanted-1.9, we will.  That marking isn't going to make much difference in whether something gets fixed anyway.
Frankly, I think we should be blocking on this bug.  It's got a patch; we just need to make sure it doesn't fall by the wayside and actually goes in.
(In reply to comment #20)
> Please don't ask this on every single bug we triage blocking1.9-.  If we intend
> to mark something wanted-1.9, we will.  That marking isn't going to make much
> difference in whether something gets fixed anyway.

I'm sorry, I didn't mean to spam; I was just a little puzzled because I thought
that usually when something is marked blocking1.9-, it is marked as
[wanted-1.9] because although it has been decided that it would not block the
release, it is still targeted for 1.9. Thank you for correcting me though.
(In reply to comment #21)
> Frankly, I think we should be blocking on this bug.

We've asked for a concise explanation and we haven't received one, and it's not clear that this a regression.
It's not a regression from 1.8, except insofar as more and more servers don't work with our code as they switch to UTF-8.  And they're switching, because IE switched.

We really need a way to track "bugs that shouldn't block but that shouldn't miss the train just because a reviewer is being lame", which is where this bug could easily end up.
We've got a front end blocker depending on this, so I'll try to make sure it gets reviewed and landed.
Bug 387723 got a different fix. However, this should still be fixed in order to make "http://www.google.com/search?&q=Искусство" work just like "g Искусство".
Status: ASSIGNED → NEW
Attachment #285020 - Flags: superreview?(cbiesinger)
Attachment #285020 - Flags: superreview+
Attachment #285020 - Flags: review?(cbiesinger)
Attachment #285020 - Flags: review+
Comment on attachment 285020 [details] [diff] [review]
patch

Pretty late now, and I think this could wait until after 1.9 (now that bug 387723 got a different fix), but Boris thought this should be blocking, thus requesting approval.
Attachment #285020 - Flags: approval1.9?
Comment on attachment 285020 [details] [diff] [review]
patch

a1.9=beltzner
Attachment #285020 - Flags: approval1.9? → approval1.9+
Keywords: checkin-needed
Assignee: nobody → dao
Checking in docshell/base/nsDefaultURIFixup.cpp;
/cvsroot/mozilla/docshell/base/nsDefaultURIFixup.cpp,v  <--  nsDefaultURIFixup.cpp
new revision: 1.52; previous revision: 1.51
done
Checking in docshell/base/nsDefaultURIFixup.h;
/cvsroot/mozilla/docshell/base/nsDefaultURIFixup.h,v  <--  nsDefaultURIFixup.h
new revision: 1.12; previous revision: 1.11
done
Status: NEW → RESOLVED
Closed: 12 years ago12 years ago
Component: Networking → Embedding: Docshell
Keywords: checkin-needed
QA Contact: networking → docshell
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9beta5
See bug 432836

This actually breaks web services (like Yahoo Calendar) that don't handle UTF-8.

And we do NOT work like IE. Even with the IE8 preference turned on (UTF8), they still interpret native characters correctly. They don't convert them to UTF8.

If you have native characters, there is no way to convert them to an encoding that Yahoo Calendar understands, since escape makes them escaped unicode, encodeURIComponent makes them UTF-8, and the default is basically to call encodeURIComponent under the covers.

What happens with that site in IE?
All I can determine from IE8 is that the native characters stay native in the URL bar and appear correctly on the page.

I believe I have a better testcase in bug 432836.

I think the problem is going to be servers that use PHP urldecode. It does not handle UTF-8.
Boris/MConnor are you ok with backing this out to fix bug 432836?
Backed out for now.  Think this is probably the right thing, but the change came a little too late for us to be comfortable shipping it now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'm fine with backing it out, esp. since the information provided in this bug about IE behavior was incorrect...
The confusion about IE is unfortunate, but it's not like bug 432836 would be very clear either. Seems to have mixed up two issues and didn't convince me at all that we were doing something wrong or harmful. The lack of other regression reports since beta 5 is most notable, so I definitely think we could re-land this in the not-so-far future. I won't push it again, though, leaving it to biesi, Boris or whoever feels responsible what to do with this bug.
Assignee: dao → nobody
Status: REOPENED → NEW
We WERE doing something wrong though. I'm investigating this further on IE and I finally figured out what the UTF-8 preference does - as someone else guessed on IRC, it only affects the path part of the URL. it does NOT affect the query part. We were converting the entire URL.

For this change to be correct (or at least consistent with IE), it should have only converted the path to UTF8, not the entire query, and it should probably have a preference just like IE does to cover legacy servers that still use ASCII encoding for path part (The existing preference network.standard-url.encode-utf8 is not good enough since it encodes the entire URL)

If anyone wants to mess around with this, I've created:

http://www.kaply.com/work/äëïöü/?string=äëïöü

In IE, you can not navigate to this URL unless you have UTF-8 URLs turned on. But even if you have UTF-8 URLs turned on, the string part is NOT encoded to UTF-8.

In Firefox 2, you cannot navigate to this URL unless you turn on network.standard-url.encode-utf8 which then converts the query to UTF-8 incorrectly.

In Firefox 3 Beta 5, you can navigate to the URL but if converts the query to UTF-8 incorrectly.

So to reiterate - this patch was based on incorrect information about IE's behavior and changed the behavior to actually be incompatible to with IE.

It made more sense to back it out then to create a new third behavior that is different IE.


And incidentally RFC 2396 does NOT mandate UTF-8. To quote section 2.1:

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.
(In reply to comment #37)
> In Firefox 3 Beta 5, you can navigate to the URL but if converts the query to
> UTF-8 incorrectly.

You keep claiming it's incorrect, but I don't see why. The system charset can be anything, including UTF-8. Why is using that to do the encoding better for the server?
(In reply to comment #37)
> For this change to be correct (or at least consistent with IE), it should have
> only converted the path to UTF8, not the entire query, and it should probably
> have a preference just like IE does to cover legacy servers that still use
> ASCII encoding for path part 

FWIW, that's the behaviour you'd get by backing out this patch.

Also note the network.standard-url.encode-query-utf8 pref (http://lxr.mozilla.org/seamonkey/source/modules/libpref/src/init/all.js#721)
oh right, the patch did get backed out. So yeah, what we do now is encode the path as UTF-8, the query in the platform charset.
Whiteboard: [not needed for 1.9]
Blocks: 461304
Duplicate of this bug: 552273
Blocks: 443588
Blocks: 943273
I found Firefox always converts a URL in the URL bar to UTF-8 regardless whether the user has typed in the URL bar since Firefox 10(!)
I propose to obsolete FIXUP_FLAG_USE_UTF8 and just always use UTF-8 because:
- This is already shipped by accident in two years, but nobody (including :mkaply) complained about the "breakage" of legacy Web pages.
- At least Yahoo! Calendar migrated to UTF-8. Bug 432836 shows no other Web pages in the real world. An artificially crafted test page is not a good reason to keep the behavior.
- We do not invent yet another new behavior because Chrome already does. It is not also minor anymore. (Chrome even beats our market share according to some statistics.)
support this motion.  Assuming the useragents were true, I have had numerous visits from mainland China on my website using Firefox that have incorrectly encoded the query params in GBK...
Also, Mac and most Linux configurations have been using UTF-8 here for a long time.
Assignee: nobody → hsivonen
Attachment #285020 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Target Milestone: mozilla1.9beta5 → ---
Duplicate of this bug: 943273
Blocks: 943272
(In reply to Masatoshi Kimura [:emk] from comment #42)
> I found Firefox always converts a URL in the URL bar to UTF-8 regardless
> whether the user has typed in the URL bar since Firefox 10(!)

Do we know what checkin caused this? Are we sure that it doesn't depend on prefs?
(In reply to Simon Montagu :smontagu from comment #48)
> (In reply to Masatoshi Kimura [:emk] from comment #42)
> > I found Firefox always converts a URL in the URL bar to UTF-8 regardless
> > whether the user has typed in the URL bar since Firefox 10(!)
> 
> Do we know what checkin caused this? Are we sure that it doesn't depend on
> prefs?

That's <https://hg.mozilla.org/mozilla-central/rev/5fdf8f720c7c>. In other words, it has never worked as expected from the start.
> That's <https://hg.mozilla.org/mozilla-central/rev/5fdf8f720c7c>. In other
> words, it has never worked as expected from the start.

The patch had two problems.
1. In loadCurrent(), "this" will no longer point gURLBar. So this.valueIsTyped will always be undefined.
2. Even if 1. is fixed, |this.value = url;| will clear .valueIsTyped in the setter function. 
   https://mxr.mozilla.org/mozilla-central/source/browser/base/content/urlbarBindings.xml?rev=dfb4db3ac446&mark=285-285#285
   So .valueIsTyped will always be false.
Attachment #8339136 - Flags: review?(bzbarsky)
Comment on attachment 8339136 [details] [diff] [review]
Patch that addresses bitrot

Can we also nix FIXUP_FLAG_USE_UTF8 and its consumers, please?
Attachment #8339136 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #51)

Thanks for the r+.

https://hg.mozilla.org/integration/mozilla-inbound/rev/9b4c3cd28738

> Can we also nix FIXUP_FLAG_USE_UTF8 and its consumers, please?

Bug 945627.
Blocks: 945627
https://hg.mozilla.org/mozilla-central/rev/9b4c3cd28738
Status: ASSIGNED → RESOLVED
Closed: 12 years ago6 years ago
Resolution: --- → FIXED
Whiteboard: [not needed for 1.9]
Target Milestone: --- → mozilla28
Blocks: 946114
Blocks: 552273
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.