Closed Bug 292762 Opened 20 years ago Closed 20 years ago

Mozilla/FF reconverts a URL

Categories

(Core :: Networking, defect)

x86
All
defect
Not set
normal

Tracking

VERIFIED INVALID

People

(Reporter: ezh, Assigned: darin.moz)

Details

(Keywords: intl, testcase)

Attachments

(3 files)

1. Open the page in Moz/FF and in IE or Opera.
2. Click [2] in the main table (go to the second page).
3. Moz/FF opens a wrong empty page. IE/Opera opens the right page.

It happens because Moz/FF converts the URL into some other codepage.


PS: Encoding autodetection may be on or off; it does not matter.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050419
Firefox/1.0.4

WFM. You didn't provide a UA; you just said it doesn't work in Mozilla
Suite/Firefox but works in Internet Explorer/Opera. I have a screenshot here
(attached) that shows the same page in IE as in FF after clicking on the [2] as
specified.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WORKSFORME
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b2) Gecko/20050503
Firefox/1.0+

The URL to page 2 is
http://spravka.gramota.ru/buro.html?action=bytext&keyword=&rubrika=&findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&page=2
It should be
http://spravka.gramota.ru/buro.html?action=bytext&keyword=&rubrika=&findstr=%F1%E5%F0%E2%E5%F0%FB&page=2

I think that I am seeing the same as you. It is as though the
encoding/conversion were being done twice.

You are specifying charset=windows-1251 (see
http://code.cside.com/3rdpage/windows/cyrillic.html). It looks as though Firefox
is generating some sort of two-byte characters. What should be %F0
becomes %D1%80.
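
To make the one-byte versus two-byte difference concrete, here is a minimal Python sketch. It assumes the search word is 'серверы' (as stated later in this bug) and only uses the standard library's `urllib.parse.quote`; it says nothing about how Gecko does this internally.

```python
from urllib.parse import quote

word = "серверы"  # the search term from this bug ("servers")

# Percent-encode the same word under the two encodings being discussed.
as_cp1251 = quote(word, encoding="cp1251")
as_utf8 = quote(word, encoding="utf-8")

print(as_cp1251)  # %F1%E5%F0%E2%E5%F0%FB
print(as_utf8)    # %D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B

# The single letter "р" shows the one-byte vs. two-byte difference.
one_byte = quote("р", encoding="cp1251")
two_bytes = quote("р", encoding="utf-8")
print(one_byte, two_bytes)  # %F0 %D1%80
```

The first output matches the URL that "should be" requested, the second matches the URL Firefox actually requests.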

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050502

Does not work, as comment #2 says.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to comment #2)
> Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b2) Gecko/20050503
> Firefox/1.0+
> 
> 
> You are specifying charset=windows-1251 
> http://code.cside.com/3rdpage/windows/cyrillic.html It looks as though Firefox
> is generating some sort of two byte characters. 

I should have had the courage of my convictions. Pasting the hex into
http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder shows
that it has become UTF-8

I don't know why view source uses CP-1251, the status bar uses CP-1251,
but the URL handler uses UTF-8; but there must be a reason!
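
The same check can be done offline. This is a small sketch, assuming only Python's standard library; `try_decode` is a helper name made up for this example.

```python
from urllib.parse import unquote_to_bytes

def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# findstr value from the [2] link, and the value the site expects.
from_link = unquote_to_bytes("%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B")
expected = unquote_to_bytes("%F1%E5%F0%E2%E5%F0%FB")

print(len(from_link))                  # 14 bytes for a 7-letter word
print(try_decode(from_link, "utf-8"))  # серверы
print(try_decode(expected, "utf-8"))   # None (not valid UTF-8)
print(try_decode(expected, "cp1251"))  # серверы
```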


This is interesting to me, but the original report states

(In reply to comment #0)
> 3. Moz/FF opens a wrong empty page. IE/Opera opens the right page.

The page is not empty.  The page is the same page as seen in other browsers. 
Therefore, while there is a bug (seemingly) with the conversion, and there's
weird stuff in the address bar, the problem is not as described.  WORKSFORME
still applies.

Reporter: Are you not getting the requested page, and still a blank page?  Or
are you getting the right page?  If you're getting the right page and the
problem is aesthetic only (meaning: It only looks bad, but still functions just
fine) then I suppose this bug should be dropped under Core's "Location Bar" and
the severity dropped down to minor (it still works, just not properly).  Also,
reporter, can you go to "about:" and copy your UA?

Example: for me it is
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050502
Firefox/1.0+
We are doing a look up on the Russian word 'серверы' - 'servers'. The word
should be passed from one page to the next in the URL, but this is not
happening (seemingly because Firefox is using UTF-8 encoding in the URL, and
the site wants CP-1251).

It is not the case that the site functions (with or without looking bad). The
second page has no hits - is empty.

It is as though clicking on the next button (e.g. in Google) transformed your
search term from 'servers' to 'sXeYrZvXeYrZsX'.
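
A rough sketch of what the server probably sees, under the assumption (not confirmed by the site) that its script decodes query strings as windows-1251. `SERVER_ENCODING` and `search_term` are names invented for this illustration.

```python
from urllib.parse import parse_qs

# Assumption: the site's script decodes query strings as windows-1251.
SERVER_ENCODING = "cp1251"

utf8_query = "action=bytext&findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&page=2"
cp1251_query = "action=bytext&findstr=%F1%E5%F0%E2%E5%F0%FB&page=2"

def search_term(query: str) -> str:
    """Extract findstr the way a windows-1251 server-side script would."""
    params = parse_qs(query, encoding=SERVER_ENCODING)
    return params["findstr"][0]

print(search_term(cp1251_query) == "серверы")  # True
print(search_term(utf8_query) == "серверы")    # False
print(len(search_term(utf8_query)))            # 14 characters instead of 7
```

With the UTF-8 form the script ends up searching for a 14-character string that is not the original word, which would explain the empty result table.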

It is possible that http url encoding requires UTF-8 and both the 
site and any browser that uses CP-1251 in the URL are wrong, see
http://weblogs.mozillazine.org/gerv/archives/005539.html . That would imply
that IE is wrong, and any server requiring CP-1251 in the URL is mis-guided.
Having said that, the http request as viewed in tcpdump (I am using a 
proxy) does not bear any indication of what encoding is used for the query 
string.
Of course, by "empty page" I did not mean a completely empty page, but an empty
main table of results.

The second test is:
http://www.gramota.ru/dic/search.php?word=%ED%E8%E6%E5%ED%EA%E0&lop=x&gorb=x&efr=x&ag=x&zar=x&ab=x&sin=x&lv=x&pe=x&az=x

In this URL I misspelled a word, and the server offers me the closest matches to
choose from. In Moz/FF this does not work either. And look at the word in the
input area: it is also completely mangled.

Actually it's a very popular site in the Russian-speaking community (Russian
grammar pages)... Could someone check it on FF 1.0.?
After loading the URL I added &page=2 to the contents of the URL bar and
clicked Go to go to the second page:
copied from the Location Bar: 
http://spravka.gramota.ru/?action=bytext&findstr=%F1%E5%F0%E2%E5%F0%FB&page=2
copied from [2]
http://spravka.gramota.ru/buro.html?action=bytext&keyword=&rubrika=&findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&page=2


When you do a view selection source on [1] [2] you'll see:

<td align="center" bgcolor="#f1f0f0"> <a class="def"
href="buro.html?action=bytext&amp;keyword=&amp;rubrika=&amp;findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&amp;page=1">
[ 1 ] </a> <a class="def"
href="buro.html?action=bytext&amp;keyword=&amp;rubrika=&amp;findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&amp;page=2">
<b>[ 2 ]</b> </a></td>

View source in Opera (which uses WordPad) gives:
<A class=def
HREF='buro.html?action=bytext&keyword=&rubrika=&findstr=серверы&page=1'> [ 1 ] </A> 
Using Programmer's Notepad I made testcases from the Opera-saved copy and got
both representations: findstr=серверы changed to
findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&amp; (maybe I copied that from View
Selection Source in Mozilla).
<a class="def"
href="buro.html?action=bytext&amp;keyword=&amp;rubrika=&amp;findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&amp;page=1">
<b>[ 1 ]</b> </a>
Attached file testcase like URL:
<a class="def"
href="buro.html?action=bytext&amp;keyword=&amp;rubrika=&amp;findstr=%F1%E5%F0%E2%E5%F0%FB&amp;page=1">
<b>[ 1 ]</b> </a> 

I don't see differences between the testcases when hovering; the text shown in
the statusbar seems to be the same.
I do see differences when using the links.
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8b2) Gecko/20050504
right-click on a link of the original URL, properties, gives:
view Selection Source on that link gives:
http://spravka.gramota.ru/buro.html?action=bytext&keyword=&rubrika=&findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&page=2

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7.8) Gecko/20050430 Firefox/1.0.4
right-click on a link of the original URL, properties, gives:
http://spravka.gramota.ru/buro.html?action=bytext&keyword=&rubrika=&findstr=%F1%E5%F0%E2%E5%F0%FB&page=2
view Selection Source on that link gives:
 <a class="def"
href="buro.html?action=bytext&amp;keyword=&amp;rubrika=&amp;findstr=%D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80%D1%8B&amp;page=2">


check the different action of the links using the testcases:
compare the testcases by looking at the statusline when hovering
compare the testcases by using the links

the links of the testcases should go to: 
http://spravka.gramota.ru/?action=bytext&findstr=%F1%E5%F0%E2%E5%F0%FB&page=1
http://spravka.gramota.ru/?action=bytext&findstr=%F1%E5%F0%E2%E5%F0%FB&page=2

Keywords: testcase
This is probably in the wrong product....
Keywords: qawanted
I'm only seeing a "[1]" on the actual URL.  But clicking that exhibits the bug
with linux suite trunk build 2005060905, if I understand the description correctly.

This regressed between linux trunk builds 2005011206 and 2005011306, pointing to
bug 261929.  It's possible this is the new "correct" behavior.

==> networking
Assignee: general → darin
Status: REOPENED → NEW
Component: General → Networking
Keywords: qawanted → intl
OS: Windows XP → All
Product: Mozilla Application Suite → Core
QA Contact: general → benc
I thought I filed a bug on this (I clearly mentioned what I'm gonna write below
in another bug) , but I couldn't find it. What's happening is this:

1. Mozilla sends both path and query parts of URLs in UTF-8 

2. MS IE and Opera just url-escape the query part 'octet-wise' (without
converting to UTF-8).  They still use UTF-8 for the path part. (actually, I
haven't tested Opera yet, but MS IE certainly does that.)


What MS IE and Opera do makes sense (at least until every form processing
server-side program understands UTF-8) and I guess we have to do the same. 
We should do exactly what is described here:
   http://whatwg.org/specs/web-forms/current-work/#x-www-form-urlencoded
...or the spec should be changed.
Well, this bug has little to do with the form submission. We do more or less the
right thing when submitting forms (although not exactly the way specified in
WHATWG). This bug is about the way we handle URLs with the query part written
out in an HTML document like this: 

<a href="http://www.example.com/test1/test2/test3.cgi?f1=abc&amp;f2=def">Link 1</a>

 

Oh, my bad. In that case the spec that reigns in this situation is the IRI spec.
I'm not familiar with that spec though. Bjoern, care to make a judgement on what
the spec says we should do in this case?
Well, Martin is here, too :-)
The test case is interesting, here is what the browsers do:

Internet Explorer 6

47 45 54 20 2f 62 75 72  6f 2e 68 74 6d 6c 3f 61 GET /bur o.html?a
63 74 69 6f 6e 3d 62 79  74 65 78 74 26 6b 65 79 ction=by text&key
77 6f 72 64 3d 26 72 75  62 72 69 6b 61 3d 26 66 word=&ru brika=&f
69 6e 64 73 74 72 3d f1  e5 f0 e2 e5 f0 fb 26 70 indstr=. ......&p
61 67 65 3d 31 20 48 54  54 50 2f 31 2e 31 0d 0a age=1 HT TP/1.1..

Opera 8.0

47 45 54 20 2f 62 75 72  6f 2e 68 74 6d 6c 3f 61 GET /bur o.html?a
63 74 69 6f 6e 3d 62 79  74 65 78 74 26 6b 65 79 ction=by text&key
77 6f 72 64 3d 26 72 75  62 72 69 6b 61 3d 26 66 word=&ru brika=&f
69 6e 64 73 74 72 3d 25  46 31 25 45 35 25 46 30 indstr=% F1%E5%F0
25 45 32 25 45 35 25 46  30 25 46 42 26 70 61 67 %E2%E5%F 0%FB&pag
65 3d 31 20 48 54 54 50  2f 31 2e 31 0d 0a 55 73 e=1 HTTP /1.1..Us

Gecko/20050323

47 45 54 20 2f 62 75 72  6f 2e 68 74 6d 6c 3f 61 GET /bur o.html?a
63 74 69 6f 6e 3d 62 79  74 65 78 74 26 6b 65 79 ction=by text&key
77 6f 72 64 3d 26 72 75  62 72 69 6b 61 3d 26 66 word=&ru brika=&f
69 6e 64 73 74 72 3d 25  44 31 25 38 31 25 44 30 indstr=% D1%81%D0
25 42 35 25 44 31 25 38  30 25 44 30 25 42 32 25 %B5%D1%8 0%D0%B2%
44 30 25 42 35 25 44 31  25 38 30 25 44 31 25 38 D0%B5%D1 %80%D1%8
42 26 70 61 67 65 3d 31  20 48 54 54 50 2f 31 2e B&page=1  HTTP/1.

In other words, MSIE6 and Opera8 use the document encoding to construct the 
request URL, except that MSIE6 fails to %hh encode the URL. Mozilla uses UTF-8 
to construct the request URL (and does %hh escaping).
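
The findstr bytes from the three captures can be decoded directly to confirm this reading. A minimal sketch using only the Python standard library; the hex strings are copied from the dumps above.

```python
from urllib.parse import unquote_to_bytes

# Bytes following "findstr=" in each capture.
ie6 = bytes.fromhex("f1 e5 f0 e2 e5 f0 fb")
opera8 = bytes.fromhex(
    "25 46 31 25 45 35 25 46 30 25 45 32 25 45 35 25 46 30 25 46 42"
)
gecko = bytes.fromhex(
    "25 44 31 25 38 31 25 44 30 25 42 35 25 44 31 25 38 30 25 44 30"
    "25 42 32 25 44 30 25 42 35 25 44 31 25 38 30 25 44 31 25 38 42"
)

# MSIE6: raw windows-1251 octets, no %hh escaping at all.
print(ie6.decode("cp1251"))                      # серверы
# Opera 8: the same octets, but %hh escaped.
print(opera8.decode("ascii"))                    # %F1%E5%F0%E2%E5%F0%FB
print(unquote_to_bytes(opera8) == ie6)           # True
# Gecko: %hh escaped UTF-8 octets.
print(unquote_to_bytes(gecko).decode("utf-8"))   # серверы
```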

This is a little bit surprising. For a simple ISO-8859-1 test case

  <a href="Bj&ouml;rn?Bj&ouml;rn">...</a>

Mozilla (since Bug 261929 IIRC) and Opera 8 will request 

  Bj%C3%B6rn?Bj%C3%B6rn

and MSIE6 will request

  Bj%C3%B6rn?Bj%F6rn

So I'm not sure why MSIE fails to %hh escape the URL; Opera 8 seems to apply 
some heuristics to determine when to use %hh escaping in the query component 
(or maybe even the complete URL). So Mozilla consistently does what

  http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

suggests and other specifications (like SVG 1.1, XLink 1.0, etc.) require.

(If <a href=""> in HTML is assumed to be an IRI, the result would be the same 
as long as the URL is in NFC; If it is not in NFC, the user agent must 
normalize the URL first, depending on the document encoding. I am not aware of 
any real-world implementation that does that and I do not think any 
implementation should do that though.)
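
The two behaviours for the ISO-8859-1 test case can be modelled roughly as follows. This is only an illustration of the observed results, not any browser's actual code; `request_target`, its parameters and the `SAFE` character set are assumptions made up for the sketch.

```python
from urllib.parse import quote

# ASCII delimiters left untouched (an illustrative choice, not taken from a spec).
SAFE = "/?=&;+:@,$!*'()%"

def request_target(link: str, doc_encoding: str, query_uses_doc_encoding: bool) -> str:
    """Escape non-ASCII characters in a link; the path is always treated as UTF-8."""
    path, sep, query = link.partition("?")
    query_encoding = doc_encoding if query_uses_doc_encoding else "utf-8"
    escaped_path = quote(path, safe=SAFE, encoding="utf-8")
    escaped_query = quote(query, safe=SAFE, encoding=query_encoding)
    return escaped_path + sep + escaped_query

link = "Björn?Björn"
utf8_everywhere = request_target(link, "iso-8859-1", False)
doc_encoded_query = request_target(link, "iso-8859-1", True)
print(utf8_everywhere)    # Bj%C3%B6rn?Bj%C3%B6rn
print(doc_encoded_query)  # Bj%C3%B6rn?Bj%F6rn
```

The first result corresponds to what Mozilla requests, the second to what MSIE6 requests.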
(In reply to comment #14)
> I thought I filed a bug on this (I clearly mentioned what I'm gonna write below
> in another bug) , but I couldn't find it. What's happening is this:
> 
> 1. Mozilla sends both path and query parts of URLs in UTF-8 

This is the correct behavior. Putting actual characters into an URI/IRI
means that they have to be interpreted as UTF-8. The IRI spec (RFC 3987)
says so, and the HTML 4 spec said so years ago (even though it said that
putting such character into an URI was a bad idea). The URI/IRI might
be to the same site, to the same page, or to some different place. Just
because it's in a page encoded e.g. in windows-1251, that doesn't justify
in any way to assume that it should be interpreted as windows-1251.

If the site expects the data to come back in windows-1251, the right
thing is for the site to escape the URI; then there is no problem
at all on any browser (correct or not).
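
For illustration, a script that generates the pager links could escape them itself along these lines. This is a hedged sketch in Python, not the site's actual code; `FORM_ENCODING` and `page_link` are hypothetical names, and the parameter list is taken from the links quoted in this bug.

```python
from html import escape
from urllib.parse import urlencode

# Assumption: the encoding in which the site's script expects the data back.
FORM_ENCODING = "cp1251"

def page_link(findstr: str, page: int) -> str:
    """Build a pager link whose query is already %hh escaped in FORM_ENCODING."""
    query = urlencode(
        [("action", "bytext"), ("keyword", ""), ("rubrika", ""),
         ("findstr", findstr), ("page", page)],
        encoding=FORM_ENCODING,
    )
    # Escape "&" as "&amp;" so the URL is valid inside an HTML attribute.
    href = escape("buro.html?" + query, quote=True)
    return f'<a class="def" href="{href}">[ {page} ]</a>'

link_html = page_link("серверы", 2)
print(link_html)
```

Because the emitted href contains only ASCII, the browser has nothing left to convert and every browser requests the same URL.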

This is different from when an URI is constructed by the browser from
data in form fields; in that case, taking the encoding of the page to
encode the information is what has worked best for years, and should
not be changed. But it very clearly has to be distinguished from
the case above.

> 2. MS IE and Opera just url-escape the query part 'octet-wise' (without
> converting to UTF-8).  They still use UTF-8 for the path part. (actually, I
> haven't tested Opera yet, but MS IE certainly does that.)
> 
> 
> What MS IE and Opera do makes sense (at least until every form processing
> server-side program understands UTF-8) and I guess we have to do the same. 

No, it doesn't make sense. The server can easily escape the URI before
sending it out, and everything works fine. Copying MSIE's and Opera's
mistakes will only get us wedged in a corner, and it will be difficult
to get out again.

I suggest that issues like this be discussed on a non-browser-specific
mailing list, e.g. public-iri@w3.org.

Regards,     Martin.


(In reply to comment #20)


> putting such character into an URI was a bad idea). The URI/IRI might
> be to the same site, to the same page, or to some different place. Just
> because it's in a page encoded e.g. in windows-1251, that doesn't justify
> in any way to assume that it should be interpreted as windows-1251.

 I immediately regretted writing comment #14 without any qualification. I
certainly agree with you on the above points.
 
> If the site expects the data to come back in windows-1251, the right
> thing is for the site to escape the URI; then there is no problem
> at all on any browser (correct or not).

 A little practical problem here:  we can't expect every Joe on the street to
know this. ... Well, this has to be automatically taken care of by 'authoring
tools', but I wonder if there's anyone that  does. 

 
> not be changed. But it very clearly has to be distinguished from
> the case above.

 sure. comment #16
 
> The server can easily escape the URI before
> sending it out, and everything works fine.

 Did you mean the server should examine every html it serves and escape URIs
before emitting it? 

resolving as invalid
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
(In reply to comment #21)
> (In reply to comment #20)

> > The server can easily escape the URI before
> > sending it out, and everything works fine.
> 
>  Did you mean the server should examine every html it serves and escape URIs
> before emitting it? 

Sorry, I was imprecise. I should have said "the CGI script (or whatever) that
constructs the URI/IRI and puts it into the page". That is the place where
it should be known in what encoding the data is expected back at the server.
"The server", meaning the generic parts of a Web server, doesn't know anything
about the encoding.

Regards,    Martin.
verifying invalid
Status: RESOLVED → VERIFIED
(In reply to comment #22)

> Sorry, I was imprecise. I should have said "the CGI script (or whatever) that
> constructs the URI/IRI and puts it into the page". It's at that place where
> it should be known in what encoding that data is expected back at the server.

So, we just have to live with 'ignorant' Joe (not the CGI author but someone who
just refers to a page with a URL with the query part) putting 'raw' characters
in html unless authoring tools help him deal with this problem. 