Closed Bug 87298 Opened 23 years ago Closed 23 years ago

URL: tabs inside URLs

Categories

(Core :: Networking, defect, P3)

x86
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: caruso, Assigned: vishy)

References

()

Details

(Keywords: testcase)

Attachments

(3 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2 i686; en-US; rv:0.9.1) Gecko/20010607
BuildID:    2001060713

The context of the article are not displayed at all. Almost blank page
is the result of any selection on article contex.

Reproducible: Always
Steps to Reproduce:
1.Go to htpp://www.corriere.it/
2. Follow the link "sezioni del giornale" on the right 
3. Follow one of the link on the right (ex. "Economia")
4. Choice an article.

Actual Results:  No article displayed

Expected Results:  Display the context of the article

URL: http://www.corriere.it/edicola/index.jsp?path=Economia
and try to read some article.
The problem here is the url generated on the 'Ecomonia' page.
If you click on the 'Accordo raggiunto a Bruxelles: sì all?ingresso della Cina
nel Wto' link in Mozilla you go to the following url:
http://www.corriere.it/edicola/index.jsp?path=ECONOMIA%09%09&doc=BOX
However in IE you go to:
http://www.corriere.it/edicola/index.jsp?path=ECONOMIA&doc=BOX

The %09%09 are two tab characters caused by the definition of the link:

		<A CLASS="titolo" HREF="/edicola/index.jsp?path=ECONOMIA
		&doc=BOX" CLASS="nav">

Is it valid to have a link defined on two lines?
mozilla seems to be stripping out the Carriage Return.

either invalid or evangelism...
Don't know if this applies, but rfc 2396 ( http://www.ietf.org/rfc/rfc2396.txt )
says:

 In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
   need to be added to break long URI across lines. The whitespace
   should be ignored when extracting the URI.

Component: Browser-General → Parser
reassign & confirm, I thought something like that might exist.  I'll let parser 
qa consider..
Assignee: asa → harishd
Status: UNCONFIRMED → NEW
Ever confirmed: true
QA Contact: doronr → bsharma
Attached file Testcase
The decision, about stripping off \n,\t, etc, should probably happen in layout.
-->attinasi
Assignee: harishd → attinasi
Currently, attributes are processed in the parser or the content sink, not 
layout. I think it is important that we follow the advise of the RFC on URIs, 
specifically the part that says:

... The whitespace should be ignored when extracting the URI.

My guess is that this is a parser problem. If it turns out to be a problem in 
the ContentSink's handling of the attributes, then by all means send it back to 
me to look at (I have little clue what happens in the parser, but I believe that 
the attributes are read up into parser nodes or something that are sent to the 
sink...). Harish, can you check what can be done at the parser level? Since 
the CR is being stripped, stripping the tabs should be pretty easy. If 
necessary, the content sink can post-process the attribute, but that is messy 
since it will not know if the whitespace characters came from line breaks or if 
they were intentional...
Assignee: attinasi → harishd
Mark: For quotted attributes parser just consumes content withing the quotes. It
does not make any decision about what needs to be stripped off ( including CRs
). I don't think even the content sink can make this call. IMO, the attribute
handling code, where ever it is, should be responsible in stripping off unwanted
tabs, spaces, etc.
Yea, you are right Harish. The more I think about this, the more it seems like 
this is another case where our URI parsing code should take care of it (like bug 
 79929).

Over to dougt to look at: my read is that all whitespace from the URI should be 
stripped, and this is in accordance with the RFC on URIs 
(http://www.ietf.org/rfc/rfc2396.txt)
Assignee: harishd → dougt
Component: Parser → Networking
Priority: -- → P1
Target Milestone: --- → mozilla0.9.3
Although, I could not reproduce the problem with the above URL, I could with the
test case provided.  I attached a simple patch with strips out the whitespace
from the url string.  I am not totally convinced that this is kosher in all
cases even with the RFC telling us that it is.. :-/.  

andreas, could you review the patch?
Status: NEW → ASSIGNED
Pollman has a patch to do the same thing in bug 79929 - we might want to
coordinate these two fixes...
I opened bug 87298 to handle the general case of whitespace trimming on URLs...
The patch presented here by dougt could apply to that bug.

I don't think that pollman's fix will be needed with the one I attached.
From a quick look, it seems that the patch on this bug also strips internal
whitespace, no?  If so, will it be able to handle pages like this:

http://eric.302easyst.net/mozilla/bugs/gotofunny.html

Note that this page works in Nav 4.x, IE 5.5, and Netscape 6...  Note that this
testcase only demonstrates 2 of the 25 or so cases we need to verify backwards
compatability with Nav and IE, the complete list is on bug 79929.
Sorry, had not looked at the testcase on *this* bug.  :)  It appears that IE
treats spaces differently than tabs and linebreaks when stripping whitespace
from urls?  We really should build up so more extensive testcases before going
with any particular solution...
I agree with Eric, and hopefully we can hash it all out in bug 87298 instead of
in miscellaneous bugs reported on specific symptoms... Parsing URLs correctly
seems pretty important, from both a Standards perspective and a compatability
perspective.
Did you mean Bug 87894?
Yes, bug 87894. I guess I did that a couple of times... sorry.
tom, is there a group of testcases against URI handling correctness?
Always stripping whitespaces in the urlparser is really wrong because the parser
does not know the context (especially in Resolve). Currently in SetSpec we strip
leading and trailing whitespaces and control characters. We also strip
linebreaks in the middle. Whitespaces and other control characters left after
that are getting escaped. 

If we apply the patch we would kill all access to files with whitespace in the
directory names, at least when accessed with Resolve. Not very uncommon ... and
the rfc is not very helpful here. 

I guess we could fix this case with also removing the tab-char from the url in
SetSpec as we do with the linebreaks. It is very likely that this char was not
used as part of the url but as attempt to format it. 

The real deal would be to have code in htmlparser/contentSink/layout/... that
guesses from the context if it should remove or mask spaces/linebreaks/tabs from
an url. What gets through to the urlparser then gets stripped or escaped.

See bug 23485 for another lengthy discussion on this problem.

We do not strip anything when using Resolve because sometimes leading or
trailing whitespaces in relative urls are valid. After resolving the url SetSpec
does the rest ...

This bug sounds like a dup of 23485.  Over to vishy.
Assignee: dougt → vishy
Status: ASSIGNED → NEW
Keywords: nsbeta1
nav triage: moving to m1.0, not a stopper at this time. 
Keywords: nsbeta1nsbeta1+
Priority: P1 → P3
Target Milestone: mozilla0.9.3 → mozilla1.0
I agree with Marc Attinasi that, to be standards compliant, ``all'' whitespace
from the URI should be stripped.

But Andreas Otte brings up the good point that
people need to access files that have whitespace in the directory names.

I think we can do both, if we are very careful about how we handle URIs.

When I look at the original file created by the content author,
I can easily tell the difference between

(a) whitespace added to break a long file name into
reasonable-length lines, that should be stripped out before using it.
For example, href="more_
     details/really_long_file_name"
obviously refers to a file in a directory named
  ``more_details''.

(b) whitespace in the actual directory name that is necessary to retrieve the
file, written by a content author that is careful to comply to the standards.
For example, href="more
     %20details/really_long_file_name"
obviously refers to a file in a directory named
  ``more details''
.

(c) whitespace in the actual directory name that is necessary to retrieve the
file, written by a less careful content author.
For example, href="more details/really_long_file_name"
obviously refers to a file in a directory named
``more details'',
although to be pedantic the author really should have
encoded that space with the percent escape %20.

Does discriminating between (a) and (b) sound do-able ?
Is (c) really an option, or must we strip out even that whitespace
to maintain standards compliance ?
I would vote for not only removing \n from the middle of urls but also tabs. I
think this would fix this case. I would not remove whitespace from the middle of
an uri. That would kill tons of pages.
sr=darin
fix checked in
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Would someone like to update the subject so it is more descriptive of the
current problem? 

-TIA.
Summary: page context not displayed → tab character inside an absolute url is not getting removed (was: page context not displayed)
I'll verify this as I get through some URL testcase work...
Keywords: testcase
Summary: tab character inside an absolute url is not getting removed (was: page context not displayed) → URL: tabs inside URLs
QA Contact: bsharma → benc
VERIFIED: Mozilla 1.2.1, Win98
Whiteboard: checkmac checklinux
VERIFIED:
Mozilla 1.3a, Linux, Mac OS X 10.2.3.
Status: RESOLVED → VERIFIED
Whiteboard: checkmac checklinux
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: