We are currently resolving fragment identifiers incorrectly. Per RFC 2396 section 4.2:

> A URI reference that does not contain a URI is a reference to the current document. In other words, an empty URI reference within a document is interpreted as a reference to the start of that document, and a reference containing only a fragment identifier is a reference to the identified fragment of that document. Traversal of such a reference should not result in an additional retrieval action.

The last sentence is especially important.
Created attachment 169375 [details] testcase for page without document URI

In fact, a fragment identifier should not be resolved at all. Usually this is the same as resolving against the document URI, except when a document URI doesn't exist. It should be applied to the current in-memory document. This is important for documents that have been created with script, as in this testcase. This makes it a bit awkward to determine what should happen to the location (both in the DOM and in the location bar). The fragment identifier should certainly be updated.
That part of the URI spec doesn't make very much sense when different parts of the document have different base URIs, unfortunately... The net result is that the behavior of fragment identifiers in various cases is rather underdefined and is often abused. As a result there are compat issues involved with changing any aspect of it. That said, please check for existing bugs on fragment identifiers when <base> is present. I'm pretty sure we have some.
There was bug 241981 comment 6. That bug was marked fixed however, since it was depending on a different issue.
> That part of the URI spec doesn't make very much sense when different > parts of the document have different base URIs, unfortunately... Why? It means that the base URI does not matter at all, so that makes a lot of sense to me. Note that this only applies to fragment-id-only relative URIs. Not to URIs that happen to be the same URI as the document.
> Why? It means that the base URI does not matter at all, so that makes a lot of > sense to me. In an environment where a document is assembled from multiple pieces (one of the desiderata for XML), that makes no sense to me. This is why XML allows setting the base URI on a per-element basis -- so elements can reference things relative to that base using relative URIs without having to hardcode the location of the document pieces in the pieces themselves (the xml:base insertion can happen at assembly time). If the base URI is ignored for fragment ident resolution, how do you propose that this work?
Are there any examples of this already used on the web? Anyway, the RFC mentioned in comment 0 is about constructs like |href="#foo"|. If a link points to the current document, for example |href="foo#foo"| it has to be resolved.
It's the other way around: in an environment where a document is assembled from multiple pieces, this bug has to be fixed! Suppose you have wrapper.xhtml, which x-includes content.xhtml. Content.xhtml contains an index with a link like: <a href="#section1">. The include action makes content.xhtml the base location for the index. If I click on the link in the index in Mozilla, the link will take me to content.xhtml#section1 which is clearly not intended!
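The wrapper.xhtml / content.xhtml case above can be sketched with Python's urllib.parse.urljoin standing in for the browser's resolver. The host name and directory are assumptions made up for illustration; only the two file names come from the comment.

```python
from urllib.parse import urljoin

# Hypothetical locations; only the file names appear in the comment above.
wrapper_uri = "http://example.org/site/wrapper.xhtml"  # the document being viewed
content_uri = "http://example.org/site/content.xhtml"  # base URI of the included index

href = "#section1"

# Resolving against the base URI of the included part (the behaviour complained about):
resolved_with_base = urljoin(content_uri, href)

# Treating the link as a same-document reference (the behaviour asked for):
resolved_with_document = urljoin(wrapper_uri, href)

print(resolved_with_base)
print(resolved_with_document)
```

The first result names content.xhtml, so following it leaves the wrapper document; the second stays inside wrapper.xhtml.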
> Are there any examples of this already used on the web? Yes, in any document that includes both the XUL and the XBL in a single file (and there are some floating around). > If a link points to the current document, for example |href="foo#foo"| Writing a link like that requires knowing the current document URI, though. (In reply to comment #9) The problem is that you want it to be possible for documents to make assumptions about how and from where they will be included and _im_possible for them to not make such assumptions... From a general web architecture perspective, it is better if documents can avoid making such assumptions, so they are not limited in the ways in which they can be included.
I keep seeing advantages where you see problems. Could you give an example (not in words but in XML, just to be precise) where you think this will cause problems? > so they are not limited in the ways in which they can be included. Fixing this bug improves that. Same-document references are always used to scroll to a different position. If you want to do this in a document that is included from multiple other documents, it only works if you fix this bug.
> > If a link points to the current document, for example |href="foo#foo"| > Writing a link like that requires knowing the current document URI, though. No, writing a link like that requires knowing the current *base* URI. (To be clear: such links are not affected by fixing this bug, it's not a same-document reference.)
According to RFC2396, <a href=""> - always references current document <form action=""> - always references base URI (And since the fragment identifier is not part of the URI, "" and "#foo" are always exactly equivalent as far as resolving a URI is concerned.) Good luck implementing this without a headache. HOWEVER, we already don't follow RFC2396 for resolving URIs. RFC2396 is not compatible with a lot of content on the Web. We currently follow RFC1808, like most modern browsers. I would recommend WONTFIX on the principle that we shouldn't break compat with IE6.
Come to think of it, RFC2396 implies we'd use different URIs for: <link xlink:type="simple" xlink:href=""> ...and <link xlink:type="embed" xlink:href=""> ...which seems a bit silly. For example that would mean that in SVG the following would be a link to the current document using an image from another document: <a xlink:href="#circle" xml:base="another-document"> <image xlink:href="#circle" ...> </a> ...which seems a tad unintuitive. (Or would it? Does an embed count as something that is "always intended to result in a new request"? Maybe it doesn't. Section 4.2 of RFC2396 is getting less and less clear the more I think of it.)
(In reply to comment #11) > I keep seeing advantages where you see problems That's because you seem to think that fragment identifiers are only used for scrolling. They're used for a variety of other reasons too, including SVG resources, XBL bindings, etc, etc. For many of these uses, using the current document is inappropriate -- for example XBL bindings are not parsed the same way as other XML documents, so putting the binding in the same document as your content, which we currently support and should continue to support absolutely requires reparsing the document....
Re comment #13
> I would recommend WONTFIX on the principle that we shouldn't break compat with IE6.
Note that we are breaking compat with IE6 when the base URI is unclear, like in the 3rd attachment of this bug. I'm guessing we'd fix more pages than we'd break, because people don't realize that adding a base href breaks internal links.

Re comment #15
Why would embed work any differently?

Re comment #16
> including SVG resources
OK, say I include an SVG resource, and it contains a same-document reference to a path definition. Result: The SVG resource is retrieved again, instead of using the included path definition. This bug should be fixed if you want to be flexible in the ways SVG can be included.
> absolutely requires reparsing the document
... using the local source of the document! There's no need to resolve the URI and use the cache or something like that. Again: not fixing this bug would limit the use of embedded XBL bindings; you cannot use them when there is a different base URI.

And there have already been a few cases in practice where this bug (also in other rendering engines) made things complicated for me.
#1 HTML E-mail: I wanted to use the same XSLT template for online content and E-mail content (so it is easy to mail a page to someone). To make all the links work, I added a base href. But some pages also had an index of the page. Clicking on a link in the index opened the browser, instead of scrolling to the right location in the mail.
#2 Locally saved HTML. Same story actually.

To conclude: I can't think of any use case for same-document references being resolved using the base URI.
> Note that we are breaking compat with IE6 when the base URI is unclear, like > in the 3rd attachment of this bug. I'm guessing we'd fix more pages than we'd > break, because people don't realize that adding a base href breaks internal > links. That's another bug, already filed separately. > Why would embed work any differently? See the spec. It explicitly says so. > OK, say I include a SVG resource, and it contains a same-document reference to > a path definition. Result: The SVG resource is retrieved again, instead of > using the included path definition. Not if the base URI is the same as the current document (which it is unless the author changes it), since then the document is found to already be loaded and is therefore simply reused. > #1 HTML E-mail: I wanted to use the same XSLT template for online content and > E-mail content (so it is easy to mail a page to someone). Good lord. XSLT in HTML e-mail. What a heinous idea. This use case on its own is the strongest argument so far for a WONTFIX. > #2 Locally saved HTML. Same story actually. > > To conclude: I can't think of any use case for same-document references being > resolved using the base URI. I agree with that, the problem is that the way the spec says to do it is simply screwed up and wouldn't work in the real world. Handling the empty URI differently based on context is just asking for trouble. If you can find a better way to handle this, then I'm happy to listen, but the current proposal is IMHO unworkable on the long run.
I believe Sjoerd talks about generating a HTML e-mail /with/ XSLT.
(In reply to comment #18) > That's another bug, already filed separately. At least that one is fixed too if this is fixed. > > > Why would embed work any differently? > > See the spec. It explicitly says so. Do you mean this part: "However, if the URI reference occurs in a context that is always intended to result in a new request, as in the case of HTML's FORM element, then an empty URI reference represents the base URI of the current document and should be replaced by that URI when transformed into a request." I don't think that applies to embedding. FORM isn't even a real exception: You either POST, or when you do a GET, the relative path automatically gets a query part, so it's not a same-document reference anymore. > > > OK, say I include a SVG resource, and it contains a same-document reference to > > a path definition. Result: The SVG resource is retrieved again, instead of > > using the included path definition. > > Not if the base URI is the same as the current document (which it is unless the > author changes it), since then the document is found to already be loaded and is > therefore simply reused. When the base URI is the same as the current document there's not a bug in any of the cases mentioned here. > > #1 HTML E-mail: I wanted to use the same XSLT template for online content and > > E-mail content (so it is easy to mail a page to someone). > > Good lord. XSLT in HTML e-mail. What a heinous idea. This use case on its own is > the strongest argument so far for a WONTFIX. :) I meant server-side XSLT. This applies to any server-side method that tries to generate html mail with the same code and templates as the online pages. > > #2 Locally saved HTML. Same story actually. > > > > To conclude: I can't think of any use case for same-document references being > > resolved using the base URI. > > I agree with that, the problem is that the way the spec says to do it is simply > screwed up and wouldn't work in the real world. 
Handling the empty URI > differently based on context is just asking for trouble. If you can find a > better way to handle this, then I'm happy to listen, but the current proposal is > IMHO unworkable on the long run. My suggestion is to not look at the context at all. The empty URI is always the current document, and a fragment-id-only URI is always a fragment of the current document. FORM may look special, but it isn't, as explained above.
(In reply to comment #17) > ... using the local source of the document! There is no such beastie; there is only cache (and network if the data is no longer in cache).
I accidentally found this example today: <http://www.stadtaus.com/docu/gallery_script/index_en.html> It shows exactly what people expect to happen. It is similar to attachment 169373 [details], only this one is from the real world. (They use the BASE element only for referring to IMG elements, but it breaks the rest of the document, in contradiction with RFC 2396.)
And the fact that it doesn't work in IE either suggests to me that we shouldn't change our behaviour here.
So there is finally a case where we can be more standard compliant and support something better than IE does and we do not, because we want to follow IE here?
I have absolutely no problems with ignoring specs to save compatibility with IE. But in this case I don't see the point. We really are fixing more existing pages than we are breaking (...if any. Try to find a good use for the current behavior, I couldn't find it.) I couldn't care less if there was a work-around, but there isn't. If this stays broken, there's simply *no way* to combine a base href or xml:base and same-document references. In more detail: you could add the location of the current document to your same-document links, but the whole point of a base href and xml:base is exactly that you don't know the location of the current document. What also isn't really mentioned yet is that the "fix" for bug 241981 can be reverted. Boris says: "If all of our behavior is correct, the only option I see is to back out support for content-location (perhaps leaving a comment explaining that no one should ever try implementing it because it is broken-by-design)". Well, not all of our behaviour is correct, and there's no need to tell the world to stop adoption of content-location headers. Boris also mentions there that "We've had similar issues with not being able to use fragment identifiers on pages where the document URI is something like wyciwyg, by the way...." So this really isn't just some freak side effect. It's about the essence of same-document references.
We're already better than IE here, as the testcases in comment 14 show. We _are_ compliant to a spec, as mentioned in comment 13. We're _already_ non-compliant with the new spec for other reasons, and that won't change, since that _does_ break sites. The new spec is extremely vague as to how it should be implemented, and there are still unanswered questions (see comment 21, comment 16, comment 15, comment 13). There are also other problems, like the effect it should have on the DOM (when you query a link for an absolute URI) and on window.location. Yes, it would be great if attachment 169375 [details] worked. If someone can propose an actual way of doing it that doesn't make matters worse, then I'd be all for it. But we need an actual proposal on how to do it.
Here's another example of this bug: Same-document references in Google Cache pages don't work, because Google adds a base href to the document. I agree that correctly solving this bug is complicated by details. Therefore I propose we fix this bug by resolving same-document references with the document URI. Any missing details should get their own bug. (If they haven't already been filed, like the one Ian mentioned in comment #18.) The Wayback Machine has the same problem, but oddly enough it works in Mozilla. (Not in IE.) That may be another URI resolving bug caused by the odd format of the addresses, e.g. http://web.archive.org/web/20040202074232/www.w3.org/TR/REC-html40/intro/intro.html
> Therefore I propose we fix this bug by resolving same-document references with > the document uri. This isn't really a feasible approach -- it would require playing major whack-a-mole, since the existing apis for resolving URI are frozen, so we'd have to hardcode this at every single place where we resolve a URI.
Is the API frozen, or also its semantics? Because this wouldn't need an interface change, only a change in the semantics.
Both are frozen. And yes, it would need an interface change, since resolving a string to a URI would require three items -- string, document URI, base URI. Unless every single caller ends up having to do a check for URIs starting with '#', or something equally silly.
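To make the "three items" point concrete, here is a minimal sketch of what a resolver taking the string, the document URI, and the base URI might look like. The function name and signature are hypothetical and do not correspond to any existing Mozilla interface; Python's urljoin stands in for the real URI code.

```python
from urllib.parse import urljoin


def resolve_reference(reference: str, document_uri: str, base_uri: str) -> str:
    """Hypothetical three-argument resolver; not an existing Mozilla API."""
    # An empty or fragment-only reference is treated as a same-document
    # reference, so it is resolved against the document URI.
    if reference == "" or reference.startswith("#"):
        return urljoin(document_uri, reference)
    # Everything else is resolved against the base URI, as today.
    return urljoin(base_uri, reference)


doc = "http://example.org/doc.html"      # illustrative document URI
base = "http://example.net/base/"        # illustrative <base href>

print(resolve_reference("#foo", doc, base))
print(resolve_reference("other.html#foo", doc, base))
```

The leading '#' check inside the function is exactly the test that, without an interface change, every caller would have to repeat.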
If we do this at all, I would recommend doing one thing, and one thing only: When the user activates an <html:a href=""> link, if it starts with a "#" character, just go to that location without doing a reload or anything else. Don't fix it for XLinks, don't fix it for <link href="">, don't fix it for SVG, definitely don't fix it for url(), don't fix it for _anything_ except <html:a>.
Which means that we'd inconsistently resolve URIs, effectively? That's not happening, sorry.
I don't see any other way to do it, so I guess this is WONTFIX. It does suck that we don't handle links well in document.written() pages, though. (Along with the other cases raised on this bug.)
Technically what Ian proposes is not "inconsistently resolve URIs". It does not resolve URIs at all in a special case, i.e. bypassing the URI apis. What about extending the api? http://www.mozilla.org/projects/embedding/rev-interfaces.html#Extending
How could we all have forgotten about rfc2396bis? It defines same-document references completely differently: <http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html#same-document>. The effect is the same, or even better, as more references are considered to be same-document references. I think it also fits the current implementation much better. At least the URI API doesn't have to change, but I doubt it will make the implementation any easier. The bis draft says: "When a same-document reference is dereferenced for the purpose of a retrieval action, the target of that reference is defined to be within the same entity (representation, document, or message) as the reference; therefore, a dereference should not result in a new retrieval action." I hope this makes sense and that this can be unified with how same-document references are currently handled when there's no base href, as they don't do a new retrieval action either.
I'm thinking the bug summary should be rephrased as something like: "The decision to navigate to a different *document* is now made by comparing the current document location with the link location. Instead the comparison should be made by comparing the current *base* location with the link location." It seems to be somewhat unintuitive, but I think this is how it should work.
(In reply to comment #35) What is a "retrieval action"? Does this mean that URIs are resolved differently depending on how you plan to use them? Sorry, but that's no good either. (In reply to comment #36) What is "current base location" in a document in which every single node has a different base URI?
Good points. About "retrieval action". There's nothing you need to change here. Currently there are times, including when clicking a link, where the reference is compared to the document location and it is somehow decided to use the current document. The only change is that the reference should similarly be compared to the base location. About which base location to use: it's the base location of the node containing the same-document reference. (Easy answer, hard to implement probably.)
So linking to an image is not a "retrieval action"? What about doing a form submission? How is <a href="#foo" target="bar"> to be handled? > About which base location to use: it's the base location of the node > containing the same-document reference. This wouldn't be too bad to do, actually, but it would mean, for example, that one can't get the href of a link, set window.location to it, and have it do the same thing as clicking a link... That would break some existing content, for sure.
Just ask yourself the question: what happens if there is no base location (or the base location is the same as the document location.) Same-document references in combination with: 1. a base href or 2. no document location is the main point of this bug. Changing anything else based on the new rfc doesn't seem like a good idea to me. Setting window.location should imho have the same semantics as just following a link. Currently when you set window.location to a URI that is the same as the document location (apart from the fragment identifier), only the fragment identifier is changed. The same should happen when you set window.location to a URI that is the same as the base location (again apart from the fragment identifier). A way to think about what should happen in each case is to remember that if you have a working document without a base href, you should be able to move it somewhere else, and add a base href equal to the original document location and still have everything working. (Take the Google cache as an example.) You should also be able to include it into another document, and have everything working if you set xml:base on the included part. (Taking this to an extreme would mean that in script window.location should actually contain the base location of the script element.)
*** Bug 146441 has been marked as a duplicate of this bug. ***
> Same-document references in combination with: 1. a base href or 2. no document > location is the main point of this bug. The whole point is that these concepts are not particularly well-defined... Is the example in comment 39 a same-document reference if "bar" is the frame the anchor is in, for example? In brief, it sounds like we're trying to have different behavior for identical markup based on ambiguous criteria. This is rather hard to implement in any sort of logical setup, for obvious reasons. Breaking window.location based on the base href is simply not an option. Neither is treating two different URIs (document base URI and document URI) identically when they are loaded an option. I'm still tempted to mark this wontfix, due to a lack of clear understanding by anyone, as far as I can see, of what the behavior should be.
Oh, and also of note, see bug 146441 comment 6 and bug 146441 comment 8. Note that those are both retrieval actions, and are implemented in the same code in Mozilla. I see no reason for the arbitrary distinction the RFC makes between those two cases, past the fact that they realized that breaking URI resolution like they did would break real-world pages that used forms.
Created attachment 171374 [details] Link and From with empty URI Guess what: a href="" does follow the base href, but form action="" doesn't. Apparently the code is not that equal.
> How is <a href="#foo" target="bar"> to be handled? First resolve the href like it is done now (using the base URI of the link node). Compare it with the base URI of the link node. If they, aside from the fragment identifier, match then this is a same-document reference. When clicked it should set the location of the target window to the document location of the current window, with the fragment identifier of the link. This works if the target window happens to be the current window. You are right that window.location shouldn't change. But that doesn't break any existing pages. The only effect would be that pages that are currently broken because of this bug will still be broken when this bug is fixed if they use script to set window.location. (Instead, people now have to write script just to get links working correctly: <http://annevankesteren.nl/archives/2005/01/fragment-identifiers#comment-2974>.)
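The steps proposed in this comment can be sketched roughly as follows. This is only an illustration of the proposal as worded here, not of any shipped behaviour; the function name and the example URIs are made up, and target windows are not modelled.

```python
from urllib.parse import urldefrag, urljoin


def retrieval_uri(href: str, base_uri: str, document_uri: str) -> str:
    """Hypothetical helper illustrating the proposed same-document check."""
    # Step 1: resolve the href against the base URI of the link node, as now.
    resolved = urljoin(base_uri, href)
    # Step 2: compare with the base URI, ignoring fragment identifiers.
    target, fragment = urldefrag(resolved)
    if target == urldefrag(base_uri)[0]:
        # Step 3: same-document reference, so use the document location
        # plus the fragment identifier of the link.
        document = urldefrag(document_uri)[0]
        return document + "#" + fragment if fragment else document
    # Otherwise keep the normally resolved URI.
    return resolved


base = "http://www.w3.org/TR/REC-html40/intro/intro.html"  # e.g. an added base href
doc = "http://cache.example/copy.html"                     # made-up cached copy

print(retrieval_uri("#h-2.1", base, doc))
print(retrieval_uri("sgml.html#x", base, doc))
```

Note that an absolute href equal to the base URI plus a fragment is also classified as a same-document reference by this check, which is a consequence of comparing resolved URIs rather than looking for a leading '#'.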
> Guess what: a href="" does follow the base href, but form action="" doesn't. That's a quirk thing for HTML documents only. See comment in the code at http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsHTMLFormElement.cpp#1327 and code following. Note that this does NOT apply to href="#whatever". I'd rather not scatter this sort of crap all over the rest of our code, yes. ;) > First resolve the href like it is done now (using the base URI of the link > node). Compare it with the base URI of the link node. If they, aside from the > fragment identifier, match then this is a same-document reference. When > clicked it should set the location of the target window to the document > location of the current window, with the fragment identifier of the link. That's not how this situation behaves in practice; such code is often used to scroll one frame from another one (generally a table of contents). Breaking this is not really acceptable -- a lot of websites depend on it. Please do try to propose solutions that are not self-contradictory and don't break existing content, ok?
> Please do try to propose solutions that are not self-contradictory > and don't break existing content, ok? Did you mean <a href="#foo" target="bar"> together with a base href?
I mean that whatever solution you propose needs to handle all the possible cases, including base hrefs, targeted links, links where just an anchor name is listed, links with relative URIs, links with absolute URIs, form submission, etc, in a reasonably consistent way. The solution should also not break significant amounts of existing content.
If <a href="#foo" target="bar"> together with a base href is abused for TOCs, then it's obvious you can't fix this bug in quirks mode, only in standards-compliant mode.
It's used without a base href. Who said anything about a base href? And it's commonly used in standards-mode pages, since all browsers implement it interoperably. So what's the benefit of implementing a complicated, self-contradictory spec that is incompatible with pretty much all the existing content that would be affected by it? Keep in mind -- just because someone called something a spec doesn't mean it needs to be implemented...
The document just had an update: <http://www.ietf.org/rfc/rfc3986.txt>.
So... the text quoted in comment 0 is gone. The "abnormal resolution" examples in RFC 3986 clearly indicate that '#foo' is resolved relative to the base URI. The long discussion of fragment identifiers makes it quite clear that they are resolved...
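For reference, the resolution examples in RFC 3986 section 5.4 use the base URI http://a/b/c/d;p?q, and the fragment-only and empty references resolve against it. Python's urllib.parse.urljoin reproduces those two results, as this small check shows (urljoin is used here only as a convenient stand-in for a conforming resolver):

```python
from urllib.parse import urljoin

# Base URI used by the reference resolution examples in RFC 3986 section 5.4.
base = "http://a/b/c/d;p?q"

fragment_only = urljoin(base, "#s")  # fragment-only reference
empty = urljoin(base, "")            # empty reference

print(fragment_only)
print(empty)
```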
> However, it does not change the status of existing same-document references. I don't see anything in RFC 3986 defining said same-document references, though. Could you please point to that part of the RFC? > - docURI, an absolute URI, the URI of the current document Which document is the "current document"? > resultURI = docURI.resolve(relURI); Where is this code being run? > (This also means that this code must also run on values that are assigned to > window.location It's not clear to me how that works, since at least one of the 4 objects your code depends on is missing in that case.
I reread your use case and I think you mean there is a link <a href="#foo" target="bar"> in, say, a.html, and there's a document b.html in the bar frame. And then clicking that link should change the URL of the bar frame to b.html#foo. But I created a testpage and that doesn't happen. The URL becomes a.html#foo.
> Then I don't understand your use case. Could you elaborate? Hmm... I had a use case this broke; I'll try to recreate it. > When a base URI changes, all relative links should be reresolved. They are. But I don't think you understood. When a link is clicked, we resolve the URI to load, then post an event to load it. When the event fires, we load it. Your system assumes that the "resolve" and "load" stage happen at the same time, but they do not. > The invariant is only violated is pages which are currently broken allready > because of this bug. The point is, the invariant is violated. Any time that happens, pages that depend on the invariant will break. Maybe that's acceptable in this case because there are very few such pages; I'd have to see numbers to make that call.
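The separation between the "resolve" and "load" stages described here can be illustrated with a toy model. Everything in it (the Page class, its attributes, the URIs) is invented for illustration and does not mirror the actual Mozilla classes; it only shows that the posted event carries a URI resolved against the base URI that was in effect at click time.

```python
from collections import deque
from urllib.parse import urljoin


class Page:
    """Toy model of a page; the names are illustrative, not Mozilla classes."""

    def __init__(self, document_uri: str, base_uri: str) -> None:
        self.document_uri = document_uri
        self.base_uri = base_uri
        self.pending_loads: deque = deque()
        self.loaded: list = []

    def click(self, href: str) -> None:
        # Stage 1: resolve immediately, then post an event carrying only
        # the already-resolved URI.
        self.pending_loads.append(urljoin(self.base_uri, href))

    def process_events(self) -> None:
        # Stage 2: runs later; the base URI may have changed by now.
        while self.pending_loads:
            self.loaded.append(self.pending_loads.popleft())


page = Page("http://example.org/doc.html", "http://example.org/old/")
page.click("#foo")
page.base_uri = "http://example.org/new/"  # e.g. script changes the base in between
page.process_events()

print(page.loaded)
```

A same-document check performed in the load stage against the then-current base URI would see http://example.org/new/ and fail to match the URI that was resolved against http://example.org/old/.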
> > When a base URI changes, all relative links should be reresolved. > > They are. But I don't think you understood. When a link is clicked, we resolve > the URI to load, then post an event to load it. When the event fires, we load > it. Your system assumes that the "resolve" and "load" stage happen at the same > time, but they do not. Maybe you can point to the implementations of these stages (for hyperlinks as an example)? (I can't write C, but the Mozilla code is usually very readable.) > > The invariant is only violated is pages which are currently broken allready > > because of this bug. > > The point is, the invariant is violated. Any time that happens, pages that > depend on the invariant will break. Maybe that's acceptable in this case > because there are very few such pages; I'd have to see numbers to make that call. It is acceptable because from the pages that depend on the invariant, only the ones that *are already broken* will break.
> Maybe you can point to the implementations of these stages Start at nsGenericHTMLElement::HandleDOMEventForAnchors at: http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsGenericHTMLElement.cpp#1437 Go through the NS_UI_ACTIVATE case to nsGenericElement::TriggerLink at http://lxr.mozilla.org/seamonkey/source/content/base/src/nsGenericElement.cpp#3151 This posts an event in nsWebShell::OnLinkClick. When the event fires, we end up in nsWebShell::OnLinkClickSync which proceeds to nsDocShell::InternalLoad
(In reply to comment #61) > Go through the NS_UI_ACTIVATE case to nsGenericElement::TriggerLink at > http://lxr.mozilla.org/seamonkey/source/content/base/src/nsGenericElement.cpp#3151 Thanks, this helps a lot. The NS_UI_ACTIVATE case seems to be the right spot. All the needed data is available. TriggerLink does some security stuff, so there the URI needs to be the real one. And most of the code before it is DOM related, so there the URI still needs to be the one resolved to the base href. The first line of the pseudo code can be skipped, because hrefURI already is resolved to the baseURI. (Although it doesn't have to be skipped if that might be convenient.) It probably needs to be a utility function, as it is going to be called from several different places, perhaps something like: nsCOMPtr<nsIURI> retrievalURI = nsContentUtils::convertURIIfSameDocumentRef( hrefURI, GetOwnerDoc(), baseURI); (OT: I see that the baseURI is passed as aOriginURI to securityManager->CheckLoadURI. This sounds like it would be possible to circumvent certain security restrictions by simply setting the baseURI to a more trusted domain.)
Note that not all nsIURIs support fragments.... This is an ongoing issue with Mozilla code, unfortunately. Again, this is just the code for clicked-on HTML links. This doesn't handle form submissions (how should those behave?), XLinks, etc. Those are elsewhere in the code, as you probably noticed. So if we decide we want to do this, we'd need an exhaustive list of places where URIs should be treated in this weird way. As for baseURI, that URI is already security-checked. See nsDocument::SetBaseURI and nsGenericElement::GetBaseURI.
*** Bug 281463 has been marked as a duplicate of this bug. ***
*** Bug 270606 has been marked as a duplicate of this bug. ***
-> default owner
It's not clear to me what the specs say, what browsers do, or what authors need. It's also not clear to me what the specs should say, browsers should do, and authors should write. If anyone wants to go ahead and fix this mess -- either in specs or by fixing browsers -- I encourage them to do so.
HTML5 handles the action="" case and navigation to frag IDs specially now. The rest is expected to work per RFC3986. Please let me know if this causes new problems that I can help fix.
anne, I'm going to wontfix this based on webcompat.. if there should be a concrete alternative action plan please reopen and ni valentin.
I agree that we should not do this. Thanks for closing.