Closed Bug 421602 Opened 12 years ago Closed 12 years ago

[FIX]Image onload fails to fire intermittently

Categories

(Core :: DOM: UI Events & Focus Handling, defect, P2)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: alex, Assigned: bzbarsky)

Details

(Keywords: regression)

Attachments

(5 files)

While loading several images by creating an Image object and assigning the src property to it, along with an onload event handler, the onload event handler sometimes doesn't fire (in Firefox 3 Beta 3).

var preloaderImg = new Image();
preloaderImg.onload = function() { ... }
preloaderImg.src = '...';

To see the bug in action, clear the browser cache and open the attached html file with Firefox 3 Beta.
Attached file test case
Alex is this a regression from Firefox 2.0.12?
(In reply to comment #2)
> Alex is this a regression from Firefox 2.0.12?
> 

Yep, I believe it is. The attached test case passes on Firefox 2.x, but fails on Firefox 3.x 

(btw - thanks to Dan Malca from FoxyTunes for troubleshooting this and creating the test case)
Requesting blocking as this is a regression from Firefox 2.0.x
Flags: blocking1.9?
Keywords: regression
Nothing in the original testcase keeps the
Image objects alive while waiting data to be loaded from server.
Modified testcase works fine.
Opera behaves similarly as FF3 (well, Opera actually starts to load 39 images
 here, of which 37 seven loads succeeds). The modified testcase works fine.
Ok. I actually had a similar test case, but in mine the preloaderImgArray was local to the main() function, so it probably was garbage collected as well, so the test failed. making it global makes sure it stays alive along with all the contained images. Does this make sense?

FF3 seems more aggressive than FF2 with garbage collection (is this GC or some other mechanism taking place?). While strictly speaking this is 100% fine, I wonder how many (slightly sloppy) sites or libraries this could potentially break.


You could try to find the regression range and see which bug caused this.
That might give some hint why ff3 works the way it does.
(In reply to comment #10)
> Regression window is 
> http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&date=explicit&mindate=1182974220&maxdate=1182981179

Only change in that range that looks relevant is the cycle collector checkin. If this really is a case of us breaking cycles where we otherwise would have leaked, I'm not sure there's much to be done here...
(In reply to comment #11)
> Only change in that range that looks relevant is the cycle collector checkin.
> If this really is a case of us breaking cycles where we otherwise would have
> leaked, I'm not sure there's much to be done here...

Yep, I think this is it - looks like GC aging was changed to 0 in bug 381199, which causes the garbage collector to be very aggressive.  

I think the solution might be to increase the GC intervals, which would give legacy code such as image loading above a chance to run without being GCed (as in FF2) - this shouldn't result in any leaks, just less frequent GCs. 

As I mentioned above, I'd imagine that many JS libraries (probably wrongfully) assume that their vars won't go away so quickly, and this can potentially break quite a few of them.

(In reply to comment #12)
> Yep, I think this is it - looks like GC aging was changed to 0 in bug 381199,
> which causes the garbage collector to be very aggressive.  

No more aggressive than the JS GC was before the cycle collector landed.

That said, I don't see why the cycle collector would collect loading images; I'd think it ought to consider them reachable, since I'd think something else would have a reference to them.

That said, this was an issue in the nsIDOMGCParticipant world (pre-cycle-collector), and I fixed it in bug 321054.
(In reply to comment #12)
> I think the solution might be to increase the GC intervals, which would give
> legacy code such as image loading above a chance to run without being GCed (as
> in FF2) - this shouldn't result in any leaks, just less frequent GCs. 

Let me be very clear:  the solution here is definitely not to change the GC aging constant.  That might somehow fix one case of this bug, but it would definitely not fix others.

The GC aging constant affects only what objects are examined during a cycle collection.  However, any JavaScript object and anything reachable from a JavaScript object is always considered in cycle collection.  If we have any GC hazards, we need to fix those, not twiddle the aging constant so that we use more memory.
(In reply to comment #14)
> Let me be very clear:  the solution here is definitely not to change the GC
> aging constant.  That might somehow fix one case of this bug, but it would
> definitely not fix others.
David - sure, I'm not very familiar with Mozilla GC mechanisms, so whatever works best... :)



From Jonas:
The cycle collector is most likely the cause here. It used to be that setting .onload on an image would stick it in the 'wrapper has expando' has table on the document which would keep it alive for the lifetime of the document. So GC wouldn't be able to collect it. However the cycle collector traverses these references in such a way that an orphaned node and the wrapper would appear as a cycle and would get collected.

I would have expected the network or imglib code to keep strong references here while the image is loading, but maybe it uses weak references somehow.

What we could do is to explicitly keep alive all currently loading images until after their onload or onerror has fired.

Ben, you want something other than leak bugs to work on? :) Here's one for you.
Assignee: nobody → bent.mozilla
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
This is great fun!  Network holds strong refs to its listeners.  But imagelib holds _weak_ refs to its listeners, because those listeners always hold a strong ref to the imgIRequest (it's the only way they can get the image data).  So we used to leak until the imagelib referece was made weak.

I've been thinking about making it strong again now that we have cycle collection.  That would definitely address this bug.  The one issue it would cause is that an image whose network connection never closes (e.g. jpeg push) would effectively leak for the lifetime of that network connection, even if removed from the DOM and forgotten about.  It would be collected when the page was left, of course (because the necko load would be terminated at that point).  Right now this works, because when the last imagelib listener goes away the underlying necko channel is stopped, iirc.

I suppose one option would be to change the jpeg push stuff somehow so that necko no longer had a strong ref to the imagelib listener after the first OnStopRequest (using nsWeakPtr somehow).  But even if we don't do that, the jpeg push behavior we'd get with cycle collection seems to be a better tradeoff than what we have no without.
That certainly seems to be a good way to fix this, but I really don't think we want that for 1.9. We've been talking here about fixing this for 1.9 by simply making the image element hold on to itself while it's loading the image. Much lower impact even if not necessarily an elegant fix, IMO. I believe Ben has at least started hacking that up.
For what it's worth, "while it's loading the image" turned out to be harder than I thought when I did bug 321054, but there's probably a good bit of code from that (that was backed out when cycle collector landed -- that's probably the version you'd want to look at) that you could use.
OK, so holding a strong ref to the image decoder observer and cycle-collecting all this stuff is not good enough, because while necko has a strong ref to the imgRequest, the imgRequest holds weak refs to the imgRequestProxies.  So if we did what I suggest in comment 17, we'd just end up collecting the proxy/observer cycle, and it wouldn't do us much good.
What this does is hold a strong ref from the imgRequestProxy to the observer until OnStopRequest fires.  Then it goes back to holding a weak ref, the way the API promises.  For a multipart channel, that means after the first part is done it'll hold a weak ref.

If we really want to change only the content side, I guess we could.  As dbaron said, it's a lot more complicated, but it could be done.

Either way, the mochitest catches this bug and should go in.
Assignee: bent.mozilla → bzbarsky
Status: NEW → ASSIGNED
Attachment #309583 - Flags: superreview?(jonas)
Attachment #309583 - Flags: review?(pavlov)
Summary: Image onload fails to fire intermittently → [FIX]Image onload fails to fire intermittently
Do we really want to hold strong references to all image-load observers? Things like background image loads etc doesn't IMHO be kept alive if we no longer care about the image. <img> is special because people use that to pre-populate caches, and you can have onload-handlers that need to fire even if there are no references to the <img>, but that doesn't seem to be true for other types of loads.
I think this is just the simplest and safest way to fix this bug.  We could try to only do images, per comment 19.  Like dbaron said, that approach, while doable, would require a good bit more care.  If someone wants to do that, I'm not going to stop them, of course!
Attachment #309583 - Flags: review?(pavlov) → review+
Comment on attachment 309583 [details] [diff] [review]
Possible approach

>+void imgRequestProxy::NullOutListener()
>+{
>+  if (mListenerIsStrongRef) {
>+    // Releasing could do weird reentery stuff, so just play it super-safe
>+    imgIDecoderObserver* obs = mListener;
>+    mListener = nsnull;
>+    mListenerIsStrongRef = PR_FALSE;
>+    NS_IF_RELEASE(obs);

Or you could simply do:

nsCOMPtr<imgIDecoderObserver> obs;
obs.swap(mListener);

sr=me with that fixed.
Attachment #309583 - Flags: superreview?(jonas) → superreview+
Checked in, with sicking's nice simplification.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Attached file valgrind log
After this landed, nye's mochitest cycle been crashing on test_compute_data_with_start_struct.html.  It seems jemalloc debug builds are perhaps the only ones that will crash since jemalloc is filling the deleted object (imgRequestProxy) memory with 0x5a5a5a5a.  With jemalloc disabled, no crash but valgrind still sees the problem.
Attached patch Fix thatSplinter Review
The RemoveFromLoadGroup kills us, then we go on to touch members.  Bad!

This patch just keeps us alive till we get out of this method.
Attachment #310939 - Flags: review?(pavlov)
I checked that patch in for now to fix the orange.  Will make changes as needed based on review comments.
Attachment #310939 - Flags: review?(pavlov) → review+
Is it possible this is causing bug 425987? (see comment 6)
verified fixed using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008050614 Minefield/3.0pre

--> Verified fixed
Status: RESOLVED → VERIFIED
Component: Event Handling → User events and focus handling
You need to log in before you can comment on or make changes to this bug.