Closed Bug 444931 Opened 16 years ago Closed 16 years ago

img src="" causes un-needed(?) network fetch

Categories

(Core :: DOM: HTML Parser, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mcmanus, Assigned: mcmanus)

References

()

Details

Attachments

(3 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008061017 Firefox/3.0
Build Identifier: mobile development tree

The referenced URL contains an img tag with src=""

firefox in turn tries to load an image from the base url. (i.e. the page foo.html will be loaded as the image). 

This is the correct interpretation of the "" relative url, but it is kind of nonsensical to load a url we know to be HTML (the HTML parser is running on it) and expect an image to come back. Indeed shortly after that response starts to come in, the firefox client aborts the HTTP transaction - realizing its not an image.

It is this abort that brought me to the issue. The abort means a new connection may need to be made to the server, and any items currently in the pipeline need to be rescheduled. 

I am looking at mobile network performance and these networks have very high latency, which is why when I started taking packet traces I noted this behavior. This abort will cost on average over 1000 ms to the page load time, a significant chunk.

As reasonable argument can be made that the browser is just doing what the page requested. That's true - but I think it is also forseeable that it will not ever work and so it is also reasonable to optimize this load away.

I've attached a one line patch that "works for me", but I harbor no illusions that I understand the code base well enough to have made the right decision about where to put this logic. My patch is attached for illustration, I am actively seeking advice on the "right path".



Reproducible: Always

Steps to Reproduce:
1. load the referenced URL, note the double requests for the base page. Caching may interfere with observation.

Actual Results:  
double fetch of base page, with second fetch aborted

Expected Results:  
assumed the img src="" would fail without going to network
It's not nonsense. A web application can return different things for the same URL and it would know that it needed to as the http request data would be different.

This seems to me to be a duplicate of bug 106912
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → DUPLICATE
Robert and Mats - Thanks for taking the time to look at this bugzilla entry.

I don't agree this is a dup of 106912. That bugzilla entry wrongly asserts that "" should not be treated as a reference to the same page. I agree the current behavior is correct in doing so as reflected by rfc 2396 4.2 (same-document referneces). However I hadn't seen that bug - thanks for the link. I do think this covers the same ground as some of the other bugs duped by 106912 - nonetheless I think it has merit to consider at this time.

I hope to make this bugzilla entry focus specifically on whether a img reference of "" from an HTML base object should force a second network retrieval of the base page. I believe that is the only case the patch addresses (and if it addresses something different, we need a different patch ;))

I don't believe the content negotiation referenced in #2 is really viable. content negotiation is meant for different mechanical representations of the same resource - if an image embedded within a document were really just a different representation of the document which embedded it, then we would be living in some kind of wacky mc escher world ;)

A strong argument in favor of this bugzilla entry is the fact that one document with multiple image tags using the same src="foo.png" attribute will be fetched from the network only once - even if they are returned with cache-control: no-cache. In my view, this is optimal behavior for multiple uses of the same resource within a document. This bug just suggests that src="" leverages the same logic - we already know what the base page is html and therefore cannot satisfy the img request without ever making it.

Finally, while this is hardly the most pressing issue, it is not as obscure as it might seem. It has a meaningful performance impact on one of just seven data sets I am measuring of "in the wild" data. Furthermore it has showed up in one form or another as a real performance issue in at least bugs 263980, 433481, and 339112 - so there certainly is a case for improving usability by making the change.

Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---
He duped to the wrong bug, a simple search with "img src" and you will find 225554
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → DUPLICATE
Blocks: 437953
Hi Matthias, thanks for commenting and providing the reference to the other bugzilla entry.

Before marking this a duplicate of a long closed isssue, can we please discuss the merits of comments 4 and 1?

The rationale cited in closing the other entries is simply "content
negotiation", which I have tried to communicate I believe is mistakenly applied
in this specific case.

Content negotiation is not carte blanche for a server to return completely
different resources. Con-neg deals with different representations of the same
resource - choosing between png, gif, jpeg - or between responses in different
languages for example. 

An img embedded in HTML is not simply a different representation of the HTML -
it is a different resource because of the embedding bit. Content negotiation is
not meant to be a recursive mechanism of some sort.

I think all of the bugs marked as dups of 225554 and 106912 are testament to
the practicality of the issue. Combine that with the very high latency
environments that the mobile project is targeting, which piqued my interest,
and the impact of this issue seems to make it worth revisiting to be sure we
are making the right decision.

Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---
Sorry, that is not the way how it works in bugzilla.
This is a clear duplicate of bug 225554 and i will mark it as such.
Please do not reopen and ask your questions in bug 225554


Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
No longer blocks: 437953
hey guys, i think we need to seriously consider this.  it is hurting network performance a good bit. 
Status: VERIFIED → UNCONFIRMED
Resolution: DUPLICATE → ---
Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: 437953
It's not clear to me why this bug is in HTML parser land.

It would also be a good idea to summarize the long offline thread that's happened.

And all this should probably be happening in bug 225554, no?
(In reply to comment #8)
> it is hurting network performance a good bit. 

On which site(s)?
(In reply to comment #9)
> It would also be a good idea to summarize the long offline thread that's
> happened.

You mean the one at the URL?  If so, here you go:

>Original post:
I'm searching on Google, I load up websites the first few work fine,
load fast. After about three or four, they'll get slower. I'll only
get the top half the page. From there it gets worse, it won't even load...

>np:
It sounds to me like Firefox is only making one request at a time...

>MacK:
It really acts random. ... It also seems like to me that Firefox is making
one request at a time.  ... Another odd thing is Firefox doesn't go above
56,468k in memory usage. 

>wheelercub:
Web pages *sometimes* load slowly and *sometimes* have errors.
When my browser is having one of these *sometimes* attacks, my mouse pointer
gets stuck for a split second as I am moving it across the screen, pages
take longer to load, and scrolling down pages also sticks for a split second.
...

>wirewhip:
A couple of months ago firefox started loading real slow after
double clicking icon. ...

>lonnstrom:
I was having the same problem and was blaming Norton's phishing protection.
Turns out in my Firefox Web Developer Toolbar, ...

[ a dozen or so "me too" posts saying "it's slow" ]

>Anonymous
I have found the problem!! It was CookieSafe that was slowing Firefox down.

>nash57:
I uninstalled FF then I went to the application data folder C:\Documents
and Settings\\Application Data\Mozilla and renamed the folder ...
Worked for me

>Fred
http://www.mepis.org/node/5420
Have a look and drive these parameters. It did good here ! 

==================================

So, there are many different issues in that thread, some were solved
by uninstalling extensions, creating a new profile...
I suspect most of the issues are in fact related to extensions, or in
some case, badly tuned about:config params.
None of the posts mentions <img src=""> AFAICT.
> You mean the one at the URL?

No, I mean the one that started with the email Patrick sent to a bunch of MoCo people.
Mats, Re #10 - one such site is listed in the URL field of this bug. The penalty is RTT based, and as such is orders of magnitude more severe in a mobile network than a traditional ethernet or high-speed WLAN.

bug 437953 has an attached pdf which discusses the dataset this was identified in and its impact on the measurements. the issue has apparently come up independently many other times based on the dups of 225554.
Boris, Re #12 - I will attach that thread here.
One other question, as far as the e-mail thread goes.  Does IE's, et al, behavior change if there is a <base> tag that sets the base URI for the page to something other than the page itself?
Hi Boris - re #16: good thought.

first the data
-----------------

illustrated at http://www.ducksong.com/cgi-bin/ff-conneg-test/base.cgi
added a <base href="http://www.ducksong.com/dne"></base> element

ff: fetches the image src="" as "/dne" using Accept: image*

msie-7: again misinterprets the relative nature of "" and fetches the image as "/" (just as before it fetched the base directory of the url instead of the url itself, it is doing the same thing with the base tag adjustment), and the image is fetched accept: */* making conneg a non-issue.. 

safari/webkit: acts exactly as before, and does not even attempt to fetch the img src="" depsite the different base element.

-------

msie7 is totally off in left field, obviously.

I think firefox here is more right in this sepcfic case than webkit. in this case there is no reliance on content-negotiation (which I continue to think is misapplied as general reasoning here) as two differnet urls are in play.

So the preferred change would not eliminate fetches of img src="" but rather cases where RESOLVED(img.src) == BASE_URL. 

re the confusion of comment #11 - the URL mentioned in this bug itself contains src="" in its HTML. The content of that page talks, coincidentally, about some other mozilla performance issue - but the interesting part from the pov of this bug is the markup of the page itself. I'm sorry about the confusing way I referenced it.

I could probably be happy with the last paragraph of comment 17.
This patch implements the full URL equivalence check from comments 17/19
Attachment #329248 - Attachment is obsolete: true
Attachment #332790 - Flags: review?(bzbarsky)
Comment on attachment 332790 [details] [diff] [review]
patch which prevents loading of img when imguri == documenturi

Could we only do that in the cases when the src value really is "" to start with?  The Equals() check there is not that cheap, and this is very perf-sensitive code, sadly.
version that screens for uri.isempty() before doing equals()..

it is not really equivalent, but it certainly catches the common case.
Attachment #332790 - Attachment is obsolete: true
Attachment #332828 - Flags: review?(bzbarsky)
Attachment #332790 - Flags: review?(bzbarsky)
Comment on attachment 332828 [details] [diff] [review]
patch to skip fetch of img when imguri == documenturi

Put the bug# in the commit message, and take it out of the code comment, please.  Also, I'd rather we give the exact rationale here as a comment: that it doesn't really mean anything useful and causes expensive-to-recover from network resets.

With that, looks great.
Attachment #332828 - Flags: superreview+
Attachment #332828 - Flags: review?(bzbarsky)
Attachment #332828 - Flags: review+
Patch adjusted for review comments in #23

Boris, Thanks for the help.
Pushed changeset cb3008a06288.

Not sure how to test it...
Assignee: nobody → mcmanus
Flags: in-testsuite?
+      NS_SUCCEEDED(doc->GetDocumentURI()->Equals(imageURI,&equal)) && 

missing space before the &equal

Also, can a document ever have a null URI?
nsIDocument documentation says yes.  I have no idea how to go about creating one via DOM-exposed methods, but until we actually ensure that it's always non-null (which I think we should!) this is safer.

This is fixed, since I pushed the patch.  I pushed an update to add the space too.
Status: NEW → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Depends on: 455226
Is this patch going to show up in Firefox 2.0.0.18 as well as 3.0.4?
It's not going to show up in either one.  It'll be in Firefox 3.1.
I don't understand the reasoning behind this patch. Isn't the behavior pretty clearly defined?

WebKit shows the image for markup, why don't we?
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%3Cbase%20href%3D%22image%22%3E%3Cimg%20src%3D%22%22%3E
> but it is kind of nonsensical to load a url we know to be HTML

The patch is based on an assumption that is incorrect; using content negotiation the same resource may have different representations.

Of course, usually this will not be the case, but then one might then also assert that usually websites won’t have empty src attributes.

> I don't believe the content negotiation referenced in #2 is really viable.
> content negotiation is meant for different mechanical representations of the
> same resource - if an image embedded within a document were really just a
> different representation of the document which embedded it, then we would be
> living in some kind of wacky mc escher world ;)

To provide an example: example.org/photos/11231 could point to both an HTML page containing meta-data of the photo and the photo itself (referenced by <img src="">). The HTML representation of the same resource shows the image’s EXIF meta-information in a more human-readable way, and this is perfectly valid use of content negotiation.

> A strong argument in favor of this bugzilla entry is the fact that one document
> with multiple image tags using the same src="foo.png" attribute will be fetched
> from the network only once - even if they are returned with cache-control:
> no-cache. In my view, this is optimal behavior for multiple uses of the same
> resource within a document. This bug just suggests that src="" leverages the
> same logic - we already know what the base page is html and therefore cannot
> satisfy the img request without ever making it.

Have you consulted the HTTP specification? Because the answer seems to be already there.

Specifically, a cached copy of a previous response can only be used if the headers indicated in the cached copy’s Vary header (if present) match the request header. When content negotiation is being applied, the Vary header must be present, so this is a reliable mechanism to base the caching on.

A related bug that occurs because Firefox does not properly consider HTTP headers when retrieving data from cache is bug 388141.

My suggestion would be to un-do the current patch that seems to special-case the "" image URI under certain conditions (which is based on an incorrect assumption and makes src handling more complex and harder to document/explain), and improve the caching mechanism so that it respects the Vary header.

~Laurens
Let me reference the HTTP specification just to be clear on what it says:

http://greenbytes.de/tech/webdav/rfc2616.html#caching.negotiated.responses

> When the cache receives a subsequent request whose Request-URI specifies one or
> more cache entries including a Vary header field, the cache MUST NOT use such a
> cache entry to construct a response to the new request unless all of the
> selecting request-headers present in the new request match the corresponding
> stored request-headers in the original request.

Also: http://greenbytes.de/tech/webdav/rfc2616.html#header.vary

~Laurens
Hi Laurens,

(In reply to comment #35)

> The patch is based on an assumption that is incorrect; using content
> negotiation the same resource may have different representations.

I think you mistake the reasoning. The important part of the reasoning is that in order to be compliant those different representations you cite of a URI must all represent the same resource. While they can be different sets of bytes and content-types, they must all be logically the same thing.

> To provide an example: example.org/photos/11231 could point to both an HTML
> page containing meta-data of the photo and the photo itself (referenced by <img
> src="">). 

I do not believe so - what you cite are two different resources. One is the meta data, one is the image. They are not different representations of the same thing, rather they are two different resources. You would want to be able to address, and link to, them individually. There is no URI sytax for linking to different representations. 

The very specific case impacted by this change is an HTML document with an img element embedding the same resource within itself via src="". That's recursive and doesn't make any sense - its an error and forseeing that error has a performance benefit.

fwiw its been agreed the caching issues are orthogonal and are not the supporting reason for the change.

There are dozens of cases linked off of this bug, especially via its myriad dups, which describe in the wild cases of src="" being used in an error context which do result in a meaningful performance impact. The impact is negated by this change. Changing it back creates an identifiable loss for users. Who would win if the change were to be reverted? There has been no citation of any production site engaging in this other than in error, nor does a survey of current client behavior make it appear that such behavior would be feasible.
That last duplicate (bug 473528) is actually about URIs in CSS background-images, whose behaviour hasn’t changed (intentionally, afaik), so it is not a duplicate.

Shows that the current change in <img> behaviour now makes it inconsistent with CSS, which is confusing.

Anyway, I still think that this change is a very bad idea; it deviates from the standard, and it is also not needed for compatibility with the majority browser (that is, IE also loads a file).

The extra request this creates is a performance problem for site owners to deal with; if they happen to make the error, it’s their loss. Behaviour might be not what un-knowing HTML authors expect, but then again, they make plenty of other logical errors that we’re not deviating from the standard for them either (and who creates img tags with empty src attributes anyway, really). Also, I’d say the underlying erroneous assumption, being that an empty attribute is the same as no attribute, affects much more elements than <img> alone.

The fact that a couple of un-named sites load a little slower is IMO a very poor reason to deviate from the standard in this manner, especially as nowhere actual site names are listed and we do not know if any of them are top-20 sites for example.

Of the dupes, they are not ‘dozens’ but only 7, two which are really not about <img> but about <a> and CSS background-image, showing that this confusion is a broader issue than <img> alone and thus an img-only ‘fix’ does not make sense. All those authors found out the issue, and this is an education issue, where the response should be ‘this is behaviour specified by the standards, and it works like so’ and not ‘we will deviate from the standard for you’.

Also trying to standardise this behaviour in HTML5 is not an excuse ;p. If you really want to change this, you should propose a change to RFC 2396 (which will never fly because it would break a lot of existing stuff).

~Laurens
Our testing revealed that neither IE (7.0.6001) or Chrome (1.0.154.43) fetched empty background-image URLs.

Also, what happens in JavaScript if you do "img.src='';"? I suppose the base URL would be fetched..

Anyway, the argument about standards is the best one. If this really is standard behaviour as described in the appropriate RFCs (and not an interpretation), then it should stay as-is. We don't want FF to turn into another IE.. ! :)
> Of the dupes, they are not ‘dozens’ but only 7

Did you go up the chain?  Some of those have multiple bugs duplicated to them.

Note that no browsers implement RFC 2396 last I checked (and in particular Gecko doesn't), because doing that in fact breaks the web...
HTML5 standardises special behavior around src="" specifically for this case. So that's not an argument.
@Will:
> Our testing revealed that neither IE (7.0.6001) [...] fetched
> empty background-image URLs.

Well I don’t know about CSS background-image. But at least for <img src="">, IE8 in IE7 mode does, according to my testcases. Although it behaves as if <img src=".">, which is wrong.

http://www.grauw.nl/test/test_imgsrc.html
http://www.grauw.nl/test/test_imgsrc_base.html

I must admit that I’m actually using IE8 in IE7 mode. But if you could re-confirm it on IE7… :)

If I understand comment 34 correctly, Webkit at least used to process it according to standards, maybe they changed it as well.

@Boris: hmz, not really, I see your point now :). Still, even though there may be a common misconception about how <img src=""> is treated, if a standard specifies a certain behaviour, browsers implement that behaviour, and it does not need to change to follow some IE bug (and even then it could be limited to quirks mode), why does it need to be changed? Just educate the people that file bugs.

Also, regardless of whether you implement RFC 2396 fully, I do assume you implement the base URI resolution mechanism specified there... Or at least the part that says that resolving the empty string against an absolute URI yields the absolute URI. Because that’s what happens everywhere else except for in <img src="">.

@Hixie: my argument was that the confusion applies to not just <img> but also <a> and URLS in CSS, so why special-case <img> and not everything else (aka change RFC 2396). HTML5 seems to be not documenting what is current behaviour, but modifying it.

Also, I wonder if the misconception is really that widespread, because I if you look for <form action="">, this is used all over the place [1]. So apparantly people *do* get it, and make good use of it.

I am just having a little trouble comprehending this change. All it seems to do is make URL resolving behaviour inconsistent, and <img> behaviour more complex than needed, to suit a few users that need to be explained how URI resolving works. I thought traditionally Mozilla favoured evangelisation and standards-compliance over non-standard behaviour and breaking backwards compatibility.

Anyway, thanks for putting up with me :).

~Laurens

[1] http://www.google.com/codesearch?q=%3Cform+action%3D%22%22%3E
> Or at least the part that says that resolving the empty string against an
> absolute URI yields the absolute URI

Actually, as I recall it depends on whether a retrieval action is being performed, etc (in the RFC)...

In terms of Gecko code, empty form actions get treated specially: they're resolved relative to the document URI, not the base URI.  In HTML.  In non-HTML documents, forms with empty action are not submitted at all.

This explains the form action="" thing, by the way: people use that as a way to submit a form to the current documentURI, which is in fact a common use case.  Note that submitting to the document's baseURI would break sites.

> I thought traditionally Mozilla favoured evangelisation and
> standards-compliance over non-standard behaviour and breaking backwards
> compatibility.

While true, in cases when the standard is basically broken (arguably the case with RFC 2396) we might make exceptions...  ;)

You're not saying anything I haven't already thought, by the way; you _did_ read all my comments in bugs on this issue, right?
Firefox 11 still load current page twice if page contain IMG with empty SRC. And if page contain frames, it reload them also. And this happens with image in DIV with "display:none"... :/
Maxim, that shouldn't be happening for <img> with empty src.  Can you please file a new bug with a link to a testcase and cc me?

For frames it's expected behavior.  display:none has no effect on image loading, both per spec and for web compat.
(In reply to Maxim from comment #51)
> Firefox 11 still load current page twice if page contain IMG with empty SRC.

At least the speculative loader should reject the empty string as URL:
http://mxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5TreeOpExecutor.cpp#1000
I have the problem that empty img-tags generated by js makes FF do a GET against the base-url if a base tag is present in head. Nothing bad happens otherwise.

http://stackoverflow.com/questions/15927998/firefox-calls-base-url/15930873

FF 20.0
> empty img-tags generated by js makes FF do a GET against the base-url if a base tag is
> present in head.

Yes, that's the right behavior.  If you have a non-default base, don't stick random <img src=""> in your document.
Yes, I'm definitely going to remove all those empty img-tags. But what I don't understand is why there is no call against the default base if there is no base-tag? Or maybe there was? Gotta check again :)
> But what I don't understand is why there is no call against the default base

Because it's super-special-cased to deal with all the broken sites that had it and UAs that special-cased it.  That's what this bug was about.
The same problem hat us scratching our heads for quite some time, but because of an errorneously empty script tag:
<script type="text/javascript" src=""></script>

This can easily happen when you have code like this:
<script type="text/javascript" src="<%= scriptURL %>"></script>
and the scriptURL is empty due to a bug or misconfiguration. It is super hard to find out why this happens in a large page if you have never encountered this problem yet - especially since this behavior is so counter intuitive. And I think it is save to ignore these broken references, since taking the HTML document as a script or image cannot ever be the the behavior intended by the developer.

Should I report a bug for the script tag as well?
You need to log in before you can comment on or make changes to this bug.