Closed Bug 432836 Opened 16 years ago Closed 16 years ago

Firefox incorrectly converts characters when loading URLs

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mkaply, Assigned: mconnor)

References

()

Details

(Keywords: regression)

The above URL loads correctly on FF2 and IE8, but fails on FF3.

We are incorrectly encoding the characters in the URL to UTF-8.
You'll have to construct the URL and paste it

bugzilla is screwing up

http://calendar.yahoo.com/?v=60&ST=20080220T075200%2D0600&TITLE=25

–

å

rs+jubileumssalg
I've attached a better testcase.

go to:

http://www.kaply.com/work/chartest.php

Copy the URL on the page, and paste it into the URL bar. Notice the different results on FF2 vs. FF3.

Behind the scenes I'm using urldecode. So if a web app uses urldecode to convert an encoding string back, it is now broke.
Incidentally, as far as the "Send UTF-8 URLs" option in Internet Explorer goes, no one can figure out what this option does (you can search around the net and see) and my testing shows that it has no effect in the referenced scenarios, whether clicking on a link or pasting in the URL bar.

So the statement in the other bug "we're doing it like IE" was incorrect.
On researching, URL decode in PHP uses the charsets page to determine what to do.

Since yahoo calendar is iso-8859-1, that's what it uses to decode.

So really the core problem here is the changes made in bug 393246 removed the ability to even send a page native characters because they are always being encoded as UTF-8.

I'm not sure what "the right thing" is, but I don't think what we're doing now is it, especially since no other browser works this way.

(In reply to comment #5)
> On researching, URL decode in PHP uses the charsets page to determine what to
> do.
> 
> Since yahoo calendar is iso-8859-1, that's what it uses to decode.

Well, if we'd use the system charset again as you enter an address, you would get the same failure on a page that doesn't use the system charset (e.g. UTF-8). You can test this with google: entering http://www.google.com/search?&q=Искусство works as of bug 393246.
That google search worked on Firefox 2 as well which doesn't have the fix for 393246, so that's not what fixed it.

Incidentally, that search does not work on IE8.
(In reply to comment #7)
> That google search worked on Firefox 2

Well, depending on your system charset, right? I assume a Russian locale with a charset like Win-1251 which UTF-8 is incompatible with.
Dão Gottwald   

I'm sorry, but your testcase is invalid. Your say that you should be able to paste

http://www.google.com/search?&q=Искусство

Into the URL bar and have it work on a russian system. 

But the answer is no you shouldn't. You should be able to paste

http://www.google.ru/search?&q=Искусство

into the url bar and have it work on a russian system (which I believe it does)

or

http://www.google.com/search?hl=ru&q=Искусство

into the URL bar and have it work (again, which I believe it does)

The problem is that you are expecting the single byte representations of Russian characters to be interpreted properly by the non russian version of Google. This is not expected.

Converting everything to UTF-8 is not a solution because it means that websites that don't use UTF-8 (like Yahoo Calendar) are now broken.

The way we were doing it before was as correct as it could be. To maintain compatibility with older websites, single byte representations of characters are sent based on the native character set.
Just to be clear for blocking purposes, this is a big deal.

The fix from bug 393246 was wrong and caused a change to a behavior that has been around for years.

Any web service that does not handle UTF-8 is now broken (like Yahoo! Calendar).

We are forcing any URL that is sent to be UTF-8 encoded, when historically only characters that were not in the native charset were converted to UTF-8.

There was no actual bug that 393246 was fixing except possibly the "russian search bug" but per my comment 9, that was not a bug, it is working as best it can.

393246 needs to be backed out.
Flags: blocking1.9?
Moving to blocking list so we get a resolution one way or another.

Michael this shipped in all 5 Betas - why haven't we seen any other reports?   I'm reluctant to backout something from 9 months ago without being absolutely sure it is needed.
Flags: blocking1.9? → blocking1.9+
It's not nine months ago, it's mid march (march 18 to be exact). This fix has only been in for beta 5 (no other betas). I'll look through bugzilla to see if there have been any manifestations.

My primary problem with the patch from the previous bug is that it didn't actually fix a bug.

That bug was not a "this is broke so let's do a patch to fix it"

That bug was a "this should do something different than it has done for the past four or more years" and it got checked for a final beta.
The bug that it fixed was partly described in bug 387723, too. In the end it's about making nsIWebNavigation::loadURI work as expected: "For HTTP and FTP URLs and possibly others, characters above U+007F will be converted to UTF-8 and then URL- escaped per the rules of RFC 2396." (http://www.xulplanet.com/references/xpcomref/ifaces/nsIWebNavigation.html#method_loadURI)

As for google.ru, that's an UTF-8 site as well. So according to commont 5, urldecode() would fail there if we used the system charset, correct? (That's a hypothetical question, as I'm sure google uses something smarter.)
When I say bug, I mean "this thing doesn't work." I understand the spec issue, but I'd rather see something that doesn't work that we fixed.

As far as google.ru, no that doesn't fail. Google is smarter than that.

If they get single byte characters, they assume they are in the russian because you are probably viewing the site on a russian machine. Try this URL:

http://www.google.com/search?q=äëïöü

vs. this URL (on a russian system, the aeiou would be russian characters)

http://www.google.ru/search?q=äëïöü

Websites are already aware of the single byte vs. UTF-8 problem and generally do the right thing. We don't need to help them along.
(In reply to comment #14)
> As far as google.ru, no that doesn't fail. Google is smarter than that.

That's what I wrote. But you brought urldecode() into the game, and I say it would fail.

> vs. this URL (on a russian system, the aeiou would be russian characters)
> 
> http://www.google.ru/search?q=äëïöü

But since we send the URL UTF-8 encoded, google.ru actually searches for what the original URL says (äëïöü) rather than randomly converted Russian characters.
just FYI, the reason that the link isn't working as expected is because the dash can't be encoded in ISO-8859-1. It's only part of Windows-1252. It's somewhat unfortunate that our Unicode conversion doesn't round-trip :/ at least that part is not a regression.
I think we probably shouldn't have fixed bug 393246. Bug 261929 fixed the IE-compat part already.

The difference that bug 393246 made is that we now encode the query part also in UTF-8. That's arguably the right thing but does not seem very compatible with web apps, and is not what IE does (actually IE does not do any escaping... it just sends the URL in the platform charset. I think we should continue doing that though)
Blocks: 393246
Any reason not to just back out bug 393246 to fix this? Doesn't seem like that bug fixes anything that's critical IMO.
(In reply to comment #18)
> Any reason not to just back out bug 393246 to fix this? Doesn't seem like that
> bug fixes anything that's critical IMO.
> 

That's what biesi and myself are leaning towards...
I'm not sure why this one is critical to start with.

(In reply to comment #12)
> I'll look through bugzilla to see if there have been any manifestations.

Found anything?
I didn't find anything in bugzilla. 

The reason this is an issue is because very late in the Firefox 3 cycle, we made a change in behavior that has been a certain way for a long time, with no actual justification for making that change.

It causes our behavior to be different than other browsers on the market, and different than users expect.

If someone could show me this fixed a few bugs or made us work with some web service or something, I wouldn't take an issue.

All this checkin seemed to fix is the case where someone manually adds a google keyword to their bookmarks and wants to search for Russian? They doesn't strike as the best justification for a patch.

Obviously it's up to product folks to decide what to do, but I recommend that patch be backed out simply because it was an unnecessary change that only got tested in one beta. Biesi concurs.

It did bust some of my extension code, but I'll work around that.
(In reply to comment #21)
> All this checkin seemed to fix is the case where someone manually adds a google
> keyword to their bookmarks and wants to search for Russian? They doesn't strike
> as the best justification for a patch.

Well, google is a random UTF-8 site. If it was broken there, it was broken elsewhere. Why is your testcase more valid? You can create a very similar testcase that fails if we use the system charset while your page uses UTF-8.

There are also systems out there with UTF-8 as the system charset which doesn't seem to cause serious problems. Dunno, maybe I'm missing something...

If nothing else, it fixed the interface.
Severity: critical → normal
What you're "missing" is the unknown unknowns :), I'm worried about this showing up on more and more sites as more and more people start using Firefox 3, so I'd rather go back to the known state where we were and worry about fixing that later.
It's not that unknown given existing UTF-8 systems with Firefox 2. We could wait with this for another release cycle or two and be confronted with the same questionable testcase right before the release. (Part of the testcase is still unrelated to this bug.) Doesn't really matter how many betas it shipped with, given that the last one will be the most used one.

All in all, I don't think we're in a broken state, we're rather more predictable now and conforming to our own interface. Either way, there are sites where directly entering a query string will fail.

Anyway, no big deal, make your choice.
Going to back out for now.  We should get this in early in an appropriate cycle and tech evang like mad.
Assignee: nobody → mconnor
Backed out.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.