Open Bug 432391 Opened 12 years ago Updated 1 year ago

Slow loading background images sent with cache-control: no-cache do not reload upon completion

Categories

(Core :: ImageLib, defect, minor)

x86
Windows XP
defect
Not set
minor

Tracking

()

REOPENED

People

(Reporter: dvdckl, Assigned: gw280)

References

(Depends on 1 open bug, )

Details

Attachments

(3 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008050506 Minefield/3.0pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008050506 Minefield/3.0pre

When loading a dynamically generated image sent with a "Cache-control: no-cache, no-store" header using a background-image style, it is possible to cause Firefox to incorrectly display the previously downloaded version of that image when reloading the page that contains it. In a low latency environment where the image can be downloaded before the page begins rendering, this problem will not occur. By comparison, when the same image URL is used in an <img/> tag, it will show the obsolete image after the page is reloaded until the new image is downloaded, at which time it will re-render the image.

The workaround is to display images that rely upon a lack of caching using <img/> tags and/or to append a token that ensures a unique URL for each image request.

Reproducible: Always

Steps to Reproduce:
1. Load the test case page.
2. Confirm that both the background-image and <img/> correctly display the same dynamic image.
3. Reload the page by pressing F5, clicking the reload button, or pressing Enter in the URL bar.
4. Observe that both the background-image and <img/> show the same obsolete image for at least 2 seconds, and then once the new image download completes, only the <img/> re-renders to show the current image.
Actual Results:  
The image as displayed by the <img/> tag is current, whereas the same image displayed using background-image displays the obsolete version.

Expected Results:  
The image as displayed by the background-image style should have re-rendered once the image download completed.
Reporter, are you still seeing this issue with Firefox 3.6.x or later in safe mode? If not, please close. These links can help you in your testing.
http://support.mozilla.com/kb/Safe+Mode
http://support.mozilla.com/kb/Managing+profiles
Whiteboard: [CLOSEME 2010-09-15]
Version: unspecified → 3.0 Branch
Yes, this bug is reproducible in Firefox 3.6.8 and 4.0b5pre.
Whiteboard: [CLOSEME 2010-09-15]
Version: 3.0 Branch → 3.6 Branch
Note that doing something to cause a re-render of the page, like switching between tabs or resizing the window will fix the out-of-sync image.
Component: General → Layout: Images
Product: Firefox → Core
Version: 3.6 Branch → Trunk
QA Contact: general → layout.images
I can definitely reproduce on the page linked in the url field.

Bobby, Joe, how the _heck_ are we ending up showing the old image here at all (as opposed to showing nothing until the image loads)?  That seems totally broken.
Status: UNCONFIRMED → NEW
Component: Layout: Images → ImageLib
Ever confirmed: true
QA Contact: layout.images → imagelib
OK, so what's happening here is that for images with unknown expiry LoadImage hands back the cached image for the url, whatever that is, then goes off and validates it.  And if it comes back valid we're done, and otherwise it switches around the imgRequest the imgRequestProxy points to.

Which is fine (modulo bug 606309), but the proxies that handle invalidation for backgrounds are created via imgRequestProxy::Clone and know nothing about imgCacheValidator.
Attachment #485164 - Attachment is obsolete: true
Comment on attachment 485167 [details] [diff] [review]
Hook up imgRequestProxy clones to the imgCacheValidator for the imgRequest as needed.

Joe, any idea how to test this?
Attachment #485167 - Flags: review?(joe)
Assignee: nobody → bzbarsky
Priority: -- → P3
Whiteboard: [need review]
Comment on attachment 485167 [details] [diff] [review]
Hook up imgRequestProxy clones to the imgCacheValidator for the imgRequest as needed.

Totally correct, but please to add a test too.
Attachment #485167 - Flags: review+
Best bet is to do it programatically - manually load an image with sjs, which on subsequent load changes which image is sent.

var img = new Image("foo.sjs");
img.onload = function() { 
  var img2 = new Image("foo.sjs");
  img2.onload = function() { compareImages(img1, img2); }
}
Whiteboard: [need review]
Attachment #485167 - Flags: review?(joe)
Err sorry - it'll need to be more involved than that.

var img1 = imgLoader.loadImage("http://mochi.test/foo.sjs", blah blah blah)
// wait for onload
var img2 = imgLoader.loadImage(...)
var imgclone = img2.Clone();

There are tests that do things like this in libpr0n/test/unit.
Boris, do you think you're going to be able to finish this up?
Unlikely; writing that test is low on my priority list given the time investment needed....  If someone wants to pick this up, write the test, and land it, that would be great.
Brian, do you want to write this test? This is a pretty bad bug.
Sure I'll take it, I'll post a new bug and link it up though so it'll be on my list.
Depends on: 693023
This can probably be pushed if Joe is ok with it, since the test will be tracked in bug 693023.
It's almost certainly bitrotten, but sounds great.
> It's almost certainly bitrotten,

Still applies, still builds, still fixes the bug.  Good to be working in a non-churny part of the codebase.  ;)

http://hg.mozilla.org/integration/mozilla-inbound/rev/5bdc787cf16f
Flags: in-testsuite?
Target Milestone: --- → mozilla10
(In reply to Ed Morley [:edmorley] from comment #20)
> Backed this out at bz's request due to xpcshell crashes:

BTW, this also appears to have triggered a crash (in imgRequestProxy::OnStopRequest), on Linux Opt M-Oth: https://tbpl.mozilla.org/php/getParsedLog.php?id=6861224&tree=Mozilla-Inbound
(no crash on Linux64 Opt M-Oth, though, so it's either intermittent or 32bit-specific)
(In reply to Daniel Holbert [:dholbert] from comment #21)
> (In reply to Ed Morley [:edmorley] from comment #20)
> > Backed this out at bz's request due to xpcshell crashes:
> 
> BTW, this also appears to have triggered a crash (in
> imgRequestProxy::OnStopRequest), on Linux Opt M-Oth:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6861224&tree=Mozilla-Inbound
> (no crash on Linux64 Opt M-Oth, though, so it's either intermittent or
> 32bit-specific)

Seems to be crashing on all platforms intermittently either int imgRequestProxy::OnStopRequest or imgRequestProxy::OnStartRequest
Hmm.  So the xpcshell crash in Clone() looks like nsRefPtr<imgRequest>::operator-> asserting and dying because the nsRefPtr is null.

So in other words, mOwner is null?

Should that be possible?  imgRequestProxy null-checks it some places but not others, and I don't see how it could _become_ null unless the proxy was initially created with a null mOwner!
Yep.  The xpcshell test is cloning the return value of getStaticRequest, which is in fact initialized with a null mOwner.

Looking into the mOth fail.
The mOth fail is a lot more worrisome.  For example, https://tbpl.mozilla.org/php/getParsedLog.php?id=6863016&tree=Mozilla-Inbound shows us failing the browser-chrome test at browser/base/content/test/browser_bug609700.js because of a crash with this stack:

 0  XUL!imgRequestProxy::OnStartRequest [nsCOMPtr.h : 607 + 0x5]
    eip = 0x03189a71   esp = 0xbfffce50   ebp = 0x073066f8   ebx = 0x031899a7
    esi = 0x318078e0   edi = 0x04718408   eax = 0x00000001   ecx = 0x0000003b
    edx = 0x30a42080   efl = 0x00010202
    Found by: given as instruction pointer in context
 1  XUL!imgStatusTracker::SyncNotify [imgStatusTracker.cpp:5bdc787cf16f : 226 + 0xb]
    eip = 0x0318c71c   esp = 0xbfffcf20   ebp = 0x2f585690   ebx = 0x0318c33f
    esi = 0x03189070   edi = 0x00000001
    Found by: call frame info
 2  XUL!imgRequestProxy::SyncNotifyListener [imgRequestProxy.cpp:5bdc787cf16f : 859 + 0xb]
    eip = 0x031898ce   esp = 0xbfffd070   ebp = 0x2f960b60   ebx = 0x318078e0
    esi = 0x318078e0   edi = 0x00000001
    Found by: call frame info
 3  XUL!imgCacheValidator::OnStartRequest [imgLoader.cpp:5bdc787cf16f : 2235 + 0x7]
    eip = 0x0317ff32   esp = 0xbfffd090   ebp = 0x2f960b60   ebx = 0x0317fabf
    esi = 0x318078e0   edi = 0x00000001
    Found by: call frame info

so that crash is precisely due to being hooked up to the cache validator.
Ah, https://tbpl.mozilla.org/php/getParsedLog.php?id=6862644&tree=Mozilla-Inbound has a debug version of that crash, on Windows.  The OnStartRequest crash is on this line:

   727     nsCOMPtr<imgIDecoderObserver> kungFuDeathGrip(mListener);

and we have:

Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
Crash address: 0xffffffffdddddde1

Some of the other debug logs show the crash address as 0x42fa0008 (linux) or 0xffffffffcdcdcdd1 (another windows debug).

0xcdcdcdcd is "crt: clean land (new, uninitd objects)" according to firebot.  0xdddddddd is "MSVC's dead memory marker".

So the imgRequestProxy in frame 0 there is dead, presumably.
But imgCacheValidator holds a strong ref to the proxy it calls SyncNotifyListener on (via mProxies, from which I never see anything get removed), right?  So how could the imgRequestProxy have gone away?  Joe, any ideas?
The *imgCacheValidator* could have gone away, and it's the listener for the proxy in this case. Although I'm not sure when the Necko stream can outlive the imgCacheValidator, which is its mListener.
Ah, after the first OnStopRequest the imgRequestProxy drops its ref to mListener.  So maybe it's mListener that's garbage here and it just never nulled itself out on the proxy when it went away?

But I thought our listeners behaved sanely in that regard....
Necko holds strong refs to its listeners.  So the image cache validator is owned by the input stream pump here.
And for purposes of comment 29, we only care about consumers that clone imgRequestProxies, too, right?
The reason I say that the imgCacheValidator could have gone away is because the imgRequest doesn't have a strong reference to it. I'm not sure that makes much sense though. It might be required to catch this in a debugger to really get it; I wonder if this is all just us cloning static requests, and somehow getting all this fallout from it. IOW, maybe the MSVC sentinels are just red herrings.

In the case of a static request, it has no mOwner at all, so perhaps it'd be enough to check mOwner && mOwner->mValidator? Honestly I'm grasping at straws here. A check for mOwner is definitely required though.
If mOwner->mValidator is non-null, then mListener and owner->mValidator can be the same for any given imgRequestProxy.
(In reply to Boris Zbarsky (:bz) from comment #31)
> And for purposes of comment 29, we only care about consumers that clone
> imgRequestProxies, too, right?

I don't quite understand what this means - maybe you forgot a word?
> I don't quite understand what this means 

I mean that the patch in this bug only changes Clone(), so only consumers that actually use Clone() might possibly be affected, not all consumers that use imgIRequests.

> In the case of a static request, it has no mOwner at all

Yeah, we definitely need the mOwner null-check. Still trying to reproduce the other issue....
So back to comment 26.  0xdddddde1 - 0xddddddd == 4.  Likewise for 0xcdcdcdd1 - 0xcdcdcdcd.

The relevant crash line, as I said, is:

  nsCOMPtr<imgIDecoderObserver> kungFuDeathGrip(mListener)

If mListener is dead, then its vtable pointer points to 0xddddddddd and the addresses above mean the second entry in the vtable.  Which happens to be AddRef for any interface inheriting from nsISupports.

If, on the other hand, |this| is dead, then mListener would point to 0xdddddddd and the attempt to look up the vtable pointer would read data from 0xdddddddd and presumably crash at _that_ address.

If |this| is actually 0xdddddddd then we would crash at the address you get by adding offsetof(imgRequestProxy, mListener) to 0xdddddddd.  And that offset is much bigger than 4.

So my working hypothesis continues to be that mListener is dead.  Going to go audit them all again now...
So as far as I can tell, all the C++ callers of clone() on imgRequestProxy properly call CancelAndForgetObserver.  Those callers are: bullet frame, tree body frame, image box frame, nsImageLoader, nsImageLoadingContent.  Joe, would you be willing to double-check these?

I've found at least one JS caller that does not do that.  modules/libpr0n/test/unit/async_load_tests.js in function checkClone does just that.  The resulting listener calls cancel() from onStopRequest, but the async cancel could race with the cache validator, I think... and by the time we're in onStopRequest the imgRequestProxy has dropped its ref to us.

But that's an xpcshell test, and the crash failures are all in browser-chrome tests...
I don't see any other JS callers of clone() that would matter here.

Still trying (and failing) to reproduce this locally...
Aha!  I finally caught this in a debugger!

In my case, on Mac, I'm looking at an imgRequestProxy whose mListener pointer looks ok at first glance, but points to memory marked with 0x5a5a5a5a which is jemalloc's freed memory marker.  So mListener is definitely dead.

mURI is chrome://browser/skin/tabbrowser/tabbar-top-bg-inactive.png

This is used in two places and in both it's a background image....

That would seem to indicate nsImageLoader, but nsImageLoader most definitely calls CancelAndForgetObserver any time it's going away.  So what gives?

Joe, I'm keeping this in a debugger for now, until that starts hosing my machine.  Any idea what I can look for at this point?  :(
Oh, mCanceled is true here.

You know what might be happening?  Maybe we're calling Cancel() on an imgRequestProxy and only calling CancelAndForgetObserver after that?  So the order of operations is:

1)  Cancel().  Sets mCanceled, posts a runnable to really cancel things.
2)  CancelAndForgetObserver() from mListener destructor.  mCanceled is set, so becomes a
    no-op.
3)  OnStartRequest on the cache validator.  Calls into imgRequestProxy::OnStartRequest,
    which tries to call through to the listener even if mCanceled per the nsIRequest
    contract.  We crash.
4)  Runnable posted in step 1 would fire now, if we lived this long.

I think step 2 is buggy; we should NullOutListener() even if mCanceled in CancelAndForgetObserver, right?

But a larger issue, perhaps, is that it sounds like we're getting an entire extra set of notifications on clone with this patch, from the validator.  That seems bad to me....
Note that I have no idea who does step 1 above...  nsImageLoader does _not_.  And imgRequestProxies created via clone() are never added to loadgroups, right?
OK, I tried adding the null-check for mOwner and adding a call to NullOutListener before the early return on mCanceled in CancelAndForgetObserver.  The try push for that is https://tbpl.mozilla.org/?tree=Try&rev=445ea6c52dc3 and nicely green.

Joe, thoughts?  Especially about comment 41 last paragraph.
The other obvious option is to not pass through onStart/StopRequest if mCanceled and holding a weak ref to the listener, since in that case we've already done our required start/stop calls before.
CancelAndForgetObserver() should _always_ forget the observer, and we should further audit Cancel's runnable to make sure that mListener being null is handled properly.
Whiteboard: [need review]
Comment on attachment 567479 [details] [diff] [review]
This should do the right thing

Review of attachment 567479 [details] [diff] [review]:
-----------------------------------------------------------------

Two issues:

1. Remove the hunk that actually fixes bug 432391 (this bug) into its own patch.
2. I think we can get around the requirement for mFullyCanceled by:

[3:59 PM] <joe> in imgRequest::RemoveProxy, we can test the result of RemoveElement
[3:59 PM] <joe> if it returns "did not exist" (false), we don't have to do anything else
[3:59 PM] <joe> so we can just early-return
[3:59 PM] <joe> which should make RemoveProxy idempotent
[3:59 PM] <joe> and remove the need to test for mCanceled at all
[4:00 PM] <joe> and thus obviate the need for mFullyCanceled
Attachment #567479 - Flags: review?(joe) → review-
OK.  I can do that.
Attached patch fix for this bugSplinter Review
Working on pushing this to try too
Attachment #567565 - Flags: review?(joe)
Attachment #567479 - Attachment is obsolete: true
Attachment #567565 - Flags: review?(joe) → review+
Comment on attachment 567562 [details] [diff] [review]
Make canceling work right, take 2

Review of attachment 567562 [details] [diff] [review]:
-----------------------------------------------------------------

This is fine, but I'd also be OK with removing the if (mCanceled) checks in Cancel and CancelAndForgetObserver. If you don't have reason *not* to, I'd prefer that, since it emphasizes that they're idempotent.
Attachment #567562 - Flags: review?(joe) → review+
Well, one reason not to is to avoid posting more and more events from those methods....
Try run for 012e25cbc849 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=012e25cbc849
Results (out of 193 total builds):
    exception: 1
    success: 182
    warnings: 10
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bzbarsky@mozilla.com-012e25cbc849
Looks like the Android intermittent orange from comment 35 is still there.  Given the reftest analyzer data, the images are not rendered, which presumably implies that the validator managed to get onload to fire at a time when the image is not available.  Android likely only matters insofar as it's causing an HTTP load, not a file:// one, for the image, which almost certainly affects the cache validator's behavior.
Target Milestone: mozilla10 → ---
OK.  I can't reproduce this locally (on Mac) running the relevant tests over HTTP.  I'm going to stop spending time on this sinkhole...

Backing out the second changeset; the first one can probably stay in.
Assignee: bzbarsky → nobody
Priority: P3 → --
https://hg.mozilla.org/mozilla-central/rev/81665fc485dd

The other one has been backed out, leaving the bug open for it.
Attachment #485167 - Attachment is obsolete: true
Attachment #567562 - Flags: checkin+
By my decree, George gets to work on fixing the remainder of this bug. :)
Assignee: nobody → gwright
OK, so the weird thing is that even without the change to Clone() I'm seeing the R2 test failures on Android, so I would suspect that this is an intermittent bug that was present anyway.

The good news is I can reproduce this reliably so taking a look.
So it looks like basically this is a network latency issue and the reftests are showing it because CSS images don't block onload, so if the image takes too long to load the reftest shows a failure. This is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=697230

This is somewhat hard to reproduce on android because in the normal fennec browser, when you load the test manually it'll do the initial layout at 1.0 zoom (which about 10% of the time on my local test setup doesn't get laid out with CSS images) and then it relayouts to zoom the page to screen width, which forces the images to get laid out properly.
Depends on: 697230
Blocks: 705347
I think this is resolved now. Please reopen if we find that's not the case.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
I can still reproduce this as described in my original report in Firefox 36.0.4 on PC.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to David Eckel from comment #62)
> I can still reproduce this as described in my original report in Firefox
> 36.0.4 on PC.

Can you reproduce it on Nightly? Major changes were made to how this sort of thing works in more recent versions.
Flags: needinfo?(dvdckl)
Yes, I can also repro in nightly 39.0a1 (2015-03-28).
This sounds like bug 1063369. I made a try build to confirm. The builds should show up in a couple of hours at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tnikkel@gmail.com-ea6592f13dfd/ if you want to download and test to confirm that would be great.
I can still repro using the win64 build: 39.0a1 (2015-03-29).
(In reply to David Eckel from comment #66)
> I can still repro using the win64 build: 39.0a1 (2015-03-29).

If I understand correctly, you mean the normal Nightly build? Timothy was suggesting that you try the special build he made available at the link in comment 65.
I was using http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tnikkel@gmail.com-ea6592f13dfd/try-win64/firefox-39.0a1.en-US.win64.zip

When using that build, I still see the background image and the img tag displaying different images in the linked test case URL. Are you guys seeing something different?
(In reply to David Eckel from comment #68)
> When using that build, I still see the background image and the img tag
> displaying different images in the linked test case URL. Are you guys seeing
> something different?

Somehow I completely missed that you had posted a hosted version of the test. Thanks for pointing it out.

Indeed, I can reproduce the problem as well. (With current Nightly. Haven't tried with Timothy's build.)
Flags: needinfo?(dvdckl)
So after debugging this I think this is just bug 1063369 combined with the fact that we don't send any notifications whatsoever to the CSS ImageLoader when we ChangeOwner the imgRequestProxy, so it doesn't invalidate. If you use my try server build (or an equivalent local build) then after you switch tabs the correct image will then be drawn for the background image.
Oh, and that's just because the change I was making on the test build wasn't a fully correct way of testing. The request we store in the hashtable with nullptr as it's document key has a null listener. The clone gets a properly attached listener. So this is bug 1063369.
(In reply to Timothy Nikkel (:tn) from comment #71)
> So this is bug 1063369.

Should we dupe this, then?
You need to log in before you can comment on or make changes to this bug.