Closed Bug 1404422 Opened 2 years ago Closed 2 years ago

On session restore, some tabs never finish loading because imagelib unblocks load events before blocking them

Categories

(Core :: ImageLib, defect, P1)

56 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- unaffected
firefox56 + wontfix
firefox57 + wontfix
firefox58 + fixed

People

(Reporter: bzbarsky, Assigned: aosmond)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Attachments

(7 files, 10 obsolete files)

1.43 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
2.62 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
40.39 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
4.07 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
1.10 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
13.12 KB, patch
aosmond
: review+
Details | Diff | Splinter Review
5.05 KB, patch
tnikkel
: review+
Details | Diff | Splinter Review
When I restore my session, some of my tabs (only the foreground ones count here, of course) never finish loading.  Specifically, the little bouncy blue ball animation on the tab keeps going for them.

They're always all bugzilla tabs so far, though this may just be due to most of the foreground tabs I have open being bugzilla.  But not _all_ bugzilla foreground tabs show the problem; maybe half of them.  Some of them are the only tab in their window, some are in a window with many other tabs.  Some are search results, some are individual bugs...

This didn't use to happen until a few weeks ago, I think.  But I restart rarely enough, until these last few weeks, that it could have broken any time this summer, really.  :(

The main problem, apart from the visual distraction, is the CPU usage and lag and fan noise.  right now my nightly chrome process is taking about 70% of one core, just to run all those bouncy ball animations.  I just went through and hit "esc" on all of those tabs; there were 11 of them.  Now nightly is down to 12% of a core...
Inevitably, questions. :-)

(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #0)
> When I restore my session, some of my tabs (only the foreground ones count
> here, of course) never finish loading.  Specifically, the little bouncy blue
> ball animation on the tab keeps going for them.

But the bugzilla content shows up? Or stays blank? Or shows the tab spinner of death in the content (white content, big ugly grey spinner) ?

> They're always all bugzilla tabs so far, though this may just be due to most
> of the foreground tabs I have open being bugzilla.  But not _all_ bugzilla
> foreground tabs show the problem; maybe half of them.  Some of them are the
> only tab in their window, some are in a window with many other tabs.  Some
> are search results, some are individual bugs...

As an experiment, if you, say, use 'bookmark all tabs' to bookmark a set of about 10-15 tabs in one window, and then create a new window and load all the bookmarks ("open all in tabs"), does that cause the same issue?

> This didn't use to happen until a few weeks ago, I think.  But I restart
> rarely enough, until these last few weeks, that it could have broken any
> time this summer, really.  :(

Hmm. If it *is* recent, I wonder if it's to do with request filtering we do to turn off the spinners, if the content does in fact load. Jared, is that possible?

> The main problem, apart from the visual distraction, is the CPU usage and
> lag and fan noise.  right now my nightly chrome process is taking about 70%
> of one core, just to run all those bouncy ball animations.  I just went
> through and hit "esc" on all of those tabs; there were 11 of them.  Now
> nightly is down to 12% of a core...

Yeah, the perf impact has bug 1397092 filed for it. It's been difficult to get that narrowed down, though. :-(
Flags: needinfo?(jaws)
Flags: needinfo?(bzbarsky)
> But the bugzilla content shows up?

Yes.  The pages look fully loaded to me.  They just have the blue bouncy ball going, and the status bar saying "Read bugzilla.mozilla.org".

> As an experiment, if you, say, use 'bookmark all tabs' to bookmark a set of about 10-15 tabs in one window, and then create a new window and load all the bookmarks ("open all in tabs"), does that cause the same issue?

Oh, good idea.  I haven't managed to reproduce this way so far.  That said, on the restart I just did, only two tabs ended up with the problem, not 11.

Also, the "open all in tabs" setup is a bit different from session restore, because it loads all the background tabs too, whereas session restore lazy-loads those.  I also tried with "undo open window", which is pretty similar to session restore and haven't managed to reproduce that way either.  :(
Flags: needinfo?(bzbarsky)
(In reply to :Gijs from comment #1)
> > This didn't use to happen until a few weeks ago, I think.  But I restart
> > rarely enough, until these last few weeks, that it could have broken any
> > time this summer, really.  :(
> 
> Hmm. If it *is* recent, I wonder if it's to do with request filtering we do
> to turn off the spinners, if the content does in fact load. Jared, is that
> possible?

I'm not sure what "request filtering" you have in mind. Maybe bug 1399111? That bug just changes the colors of the animations. By "the spinners" maybe you're talking about the e10s tab spinner?
Flags: needinfo?(jaws)
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #3)
> (In reply to :Gijs from comment #1)
> > > This didn't use to happen until a few weeks ago, I think.  But I restart
> > > rarely enough, until these last few weeks, that it could have broken any
> > > time this summer, really.  :(
> > 
> > Hmm. If it *is* recent, I wonder if it's to do with request filtering we do
> > to turn off the spinners, if the content does in fact load. Jared, is that
> > possible?
> 
> I'm not sure what "request filtering" you have in mind. Maybe bug 1399111?

No, I meant bug 1380150 and bug 1400337. I have been seeing issues on 57 in the last day or 2 where pages never seem to load, either, though admittedly I also keep seeing a "Read <foo.com>" or "Transferring data from <foo.com>" status tooltip in the bottom corner. I don't know if the code in those 2 bugs affects both of those, or if there's an underlying network change that could be the cause of all this, or if bz and I are seeing different things. I haven't managed to find STR either. :-(

> That bug just changes the colors of the animations. By "the spinners" maybe
> you're talking about the e10s tab spinner?

No, sorry, I just meant "load indicators", it will probably take a few months before I stop calling them that - they were spinners for so long! :-\

bz: could you do a post-facto review of the patches from those 2 bugs? ( https://hg.mozilla.org/mozilla-central/rev/c9fe4ec27ca3, https://hg.mozilla.org/mozilla-central/rev/8eb9099d21dd )

Those are pretty small changes. The blue ball thing is triggered by the "progress" attribute on the tab. It should be getting removed here: https://dxr.mozilla.org/mozilla-central/rev/97efdde466f18cf580fda9673cf4c38ee21fc7b7/browser/base/content/tabbrowser.xml#795

which basically means, it gets removed once we get an onStateChange where the following things hold

- aRequest is not null
- **NOT** (aStateFlags & nsIWebProgressListener.STATE_STOP && --this.mRequestCount > 0 && aStatus == NS_ERROR_UNKNOWN_HOST)
- **NOT** (aStateFlags & STATE_START && aStateFlags & STATE_IS_NETWORK)
- (aStateFlags & STATE_STOP && aStateFlags & STATE_IS_NETWORK)



Are the pages you're seeing this with loading from cache? Because that's what I normally see with session restore. I would assume this would match the STATE_RESTORING case, and would never set the busy (or, therefore, progress) attributes, but somehow that doesn't seem to be true here... but then, if that isn't true, I don't understand why we wouldn't get an onStateChange with the right flags (per the above). :-(

Perhaps some logging in the tabbrowser code could help elucidate what's going on here?
Flags: needinfo?(bzbarsky)
Are you part of an experiment cohort that has Necko's RCWN enabled? This is controlled by network.http.rcwn.enabled.

I ask, because I recall observing something similar in bug 1385348, but it went away for me eventually for mysterious reasons and I closed the bug out.
> Are you part of an experiment cohort that has Necko's RCWN enabled?

I am not.

> bz: could you do a post-facto review of the patches from those 2 bugs?

Will look.

> Are the pages you're seeing this with loading from cache?

I'm not sure how to tell.

> I would assume this would match the STATE_RESTORING case

That's bfcache.  Totally different.  Necko loads that happen to come from cache will have busy indicators; indeed it may take a while to read stuff from cache.

> Perhaps some logging in the tabbrowser code could help elucidate what's going on here?

Could be.  If someone puts up a try build with the logging they want, I can try making a copy of the profile and reproducing with that build.
> bz: could you do a post-facto review of the patches from those 2 bugs? 

Those look reasonable at first glance.

I'm going to see whether I can create a sessionstore file that does not leak info but reproduces the problem and then bisect...
I _think_ this regressed in this range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1b065ffd8a535a0ad4c39a912af18e948e6a42c1&tochange=0985725c848ec0cfc6f2f3c3a5aa3d71321e7620

but it's hard to be sure because this was not 100% reliably reproducible on builds from back then.

Bisecting on local builds from back then is not really working, because they end up with half the tabs crashed at startup and other weirdness; I expect gfx is not happy about opening that many windows on remote X.  :(

I'll give the logging patch Gijs sent me a shot and see how that goes.
So Gijs and I poked around at this some with various logging bits.

When this fails, we have a docshell with busyFlags == 5 in the relevant tab.  That means we think it's still loading.

The loadgroup for that tab's docshell has a single request in it: a document-onload-blocker.

So the Firefox UI is blameless here, at the very least: the problem is presumably in DOM code somewhere.
Component: Session Restore → DOM
Product: Firefox → Core
Oh, and the relevant document's readyState is "interactive".
OK, more progress.  I logged all the places we call BlockOnload and UnblockOnload on documents, and there are two callsites that have very weird behavior.  Those callsites are ImageLoader::BlockOnload/UnblockOnload and nsImageLoadingContent::BlockOnload/UnblockOnload.  And by "weird behavior" I mean "they sometimes unblock before blocking".

For example, here's the log I get for ImageLoader for a document where we end up with a lingering load blocker:

UNBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x145da2100
UNBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x15599ea00
DOBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x145da2100
DOBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x145da2100
UNBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x145da2100
DOBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x15599ea00
DOBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x15599ea00
UNBLOCK StyleImage: 0x147c0f800, 0x1479d1fb0, 0x15599ea00

where on each line "UNBLOCK" means we're in ImageLoader::UnblockOnload, "DOBLOCK" means we're in ImageLoader::BlockOnload, and the three pointers are the nsIDocument*, the ImageLoader*, and the imgIRequest*.  Note how for each of the two imgIRequests involved the sequence of calls is: unblock, block, block, unblock.

I get the same sort of pattern for most things going through nsImageLoadingContent too, though some are just a single "unblock, block".

This is broken, because nsDocument::UnblockOnload has this:

  if (mOnloadBlockCount == 0 && mAsyncOnloadBlockCount == 0) {
    NS_NOTREACHED("More UnblockOnload() calls than BlockOnload() calls; dropping call");
    return;
  }

so if you mis-nest your blocking/unblocking bits you might get an unblock dropped on the floor and the "matching" block will then cause the document to never finish loading.

And now we have our smoking gun.  The regression range I found earlier has the patches for bug 1359833, which rejiggered how the image load/unload blocking works.  In particular, it made it sometimes-kinda-async.  Specifically, https://hg.mozilla.org/mozilla-central/rev/ca9a48e8a4fbfa5b236c7da73d20d626638f91ba changed imgRequestProxy::BlockOnload/UnblockOnload to look like this:

   if (!IsOnEventTarget()) {
     Dispatch(NS_NewRunnableFunction("imgRequestProxy::BlockOnload",
                                     [self]() -> void {
       self->BlockOnload();
     }));
     return;
   }
   blocker->BlockOnload(this);

and similar for unblocking, instead of just sync-calling the blocker methods.

Now consider this sequence of events:

1)  The imgRequestProxy is !IsOnEventTarget().
2)  imgRequestProxy::BlockOnload is called.
3)  Before the runnable from step 2 runs, IsOnEventTarget() becomes true.
4)  Before the runnable from step 2 runs, imgRequestProxy::UnblockOnload is called.

If this can happen, then "blocker" will see an UnblockOnload() call before it sees the corresponding BlockOnload().

Oh, and whether IsOnEventTarget() happens to be true is moderately random: it depends on whether the runnable we were reached from is labeled or not.

Andy, Timothy, could you please take a look at this?
Blocks: 1359833
Component: DOM → Image Blocking
Flags: needinfo?(tnikkel)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(aosmond)
Keywords: regression
Summary: On session restore, some tabs never finish loading → On session restore, some tabs never finish loading because imagelib unblocks load events before blocking them
Component: Image Blocking → ImageLib
> Andy, Timothy, could you please take a look at this?

Sorry, Andrew.  :(  I should not type after midnight...
Assignee: nobody → aosmond
Priority: -- → P2
Whiteboard: [gfx-noted]
Note that this bug also means onload can fire _early_ for pages with images on them.  Specifically, if the onload blocker count is 1 and there is no network activity and then an image load starts, that will _unblock_ onload first and fire the event, even though the page is not done loading...
Ugh, we sort of expected regressions from that labelling but this was a bit later than I had hoped :). Given the requirement that the listener be called in the correct labelled context, perhaps the best thing to do is if we have ever needed to dispatch, then we *always* dispatch BlockOnload/UnblockOnload due to their particular ordering requirements. Only ~0.3% of requests ever require dispatching, so 1 in ~333 requests will be impacted as an upper bound. Presumably not all of them trigger blocking, so the impact should be minimal. Thank you for hunting this down bz :).
Status: NEW → ASSIGNED
Flags: needinfo?(tnikkel)
Flags: needinfo?(aosmond)
That would at least fix the ordering issue, yes.

That said, doing the BlockOnload async is somewhat questionable in the first place, unfortunately.  While that runnable waits to be run, onload could fire...

In my log above, it seems like every single request ended up dispatching BlockOnload via a runnable.  It sounds like that's very much not expected, right?  Is it worth looking into why I'm seeing that?
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #15)
> That would at least fix the ordering issue, yes.
> 
> That said, doing the BlockOnload async is somewhat questionable in the first
> place, unfortunately.  While that runnable waits to be run, onload could
> fire...
> 

For raster images, it can make the decision to post BlockOnload off main thread. The conditions will easily be satisfied (got all the encoded data but no metadata decoded yet, wasn't a sync decode, not a transient images [testing mostly] and there are multiple processor cores). I would be surprised if we haven't been living with this race for some time but it didn't cause a problem because we were consistent in how we handled it.

> In my log above, it seems like every single request ended up dispatching
> BlockOnload via a runnable.  It sounds like that's very much not expected,
> right?  Is it worth looking into why I'm seeing that?

Yes I want to dig into this more. 1/333 is just the population spread -- it is possible they will tend to cluster (e.g. multiple documents in different tab groups sharing the same set of images...?).
(In reply to Andrew Osmond [:aosmond] from comment #16)
> (In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from
> comment #15)
> > That would at least fix the ordering issue, yes.
> > 
> > That said, doing the BlockOnload async is somewhat questionable in the first
> > place, unfortunately.  While that runnable waits to be run, onload could
> > fire...
> > 
> 
> For raster images, it can make the decision to post BlockOnload off main
> thread. The conditions will easily be satisfied (got all the encoded data
> but no metadata decoded yet, wasn't a sync decode, not a transient images
> [testing mostly] and there are multiple processor cores). I would be
> surprised if we haven't been living with this race for some time but it
> didn't cause a problem because we were consistent in how we handled it.
> 

My bad, this is a lie. OnImageDataComplete is on the main thread, I was thinking of OnImageDataAvailable. Another point to investigate (what labelling context does this happen on).
> e.g. multiple documents in different tab groups sharing the same set of images...

I _definitely_ have that.

> Another point to investigate (what labelling context does this happen on).

Is there a way I can log the current labelling context?  It would be pretty easy for me to add that to my logging here.
is this happening in beta as well?
Priority: P2 → P1
> is this happening in beta as well?

This happens in Firefox 56 release.  And yes, in 57 beta.
Fixing this will be tricky if we want it to be synchronous. I wonder if it would be better to rethink how we manage this.

All the BlockOnload and UnblockOnload callbacks from the imgIOnloadBlocker interface are used for is to add a dummy request to a given document's load group. This may or may not be the same document as the loading (owning) document which was supplied in imgLoader::LoadImage.

imgRequestProxy can already add itself to the given document's load group. Thus in the case where the composed document matches the original loading document, the imgIOnloadBlocker callbacks effectively do nothing, as we won't remove the imgRequestProxy from the load group until the appropriate time anyways.

When an imgRequestProxy is cloned, it may not add itself to the new loading document's load group. Adding to the load group is only triggered by calling imgCacheValidator::AddProxy (which *may* be called from clone but not necessarily), imgLoader::LoadImage, or an edge case for multipart images.

Naively, it seems to me that the correct thing to do is remove imgIOnloadBlocker entirely, make imgRequestProxy::Clone/SyncClone consistently add the clone to the given document's load group, and create a clone for the composed document if it differs from the one used for the loading document. This would ensure we are added to the load group in the correct scheduler group context, which will avoid any races caused by dispatching. But I don't know why it was designed the way it is -- bz, tnikkel, thoughts?

As for eliminating the dispatch entirely, I don't think that that is a workable idea. It is inherently unpredictable what documents will want the image when we start loading, so we always run the risk of our guess being wrong, unless we always use the unlabelled context. And *that* will undermine the whole premise of scheduler groups. The best I could come up with is if we can guarantee the work done in BlockOnload/UnblockOnload is safe to call in any scheduler group context, and thus avoid the dispatch in the first place?
Whoops, I forgot some calls to nsImageLoadingContent::CloneRequestForTree that I had mentally noted, but ended up not committing.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55bff57411eb5fcae113a9aa160a0afc3686267a

This set of changes implements my line of thinking in comment 21. imgIOnloadBlocker was removed entirely, there is a new variant of imgRequestClone::Clone which takes a given load group and consistently binds to it, and each of the former imgIOnloadBlocker users clone a request for the additional document (if any) at the appropriate times. Since the Clone adds the request to the load group, it will prevent onload for occurring while running in the right scheduler group context.
Attachment #8918315 - Attachment is obsolete: true
[Tracking Requested - why for this release]: This should probably be blocking 57 and get a 56 uplift.  Not having tabs load is about as bad as it gets - worse than a crash.
Version: unspecified → 56 Branch
Linux run seems surprisingly clean. Makes me doubt any tests trigger my code.

try for Windows/Mac/Android: https://treeherder.mozilla.org/#/jobs?repo=try&revision=675e69a8ae8b8d981aeccf8078c9ac386593a288
Adding tracking flags here for 56 and onwards. We can consider this for possible dot release once we have a fix.
(In reply to Andrew Osmond [:aosmond] [PTO Oct 14-22] from comment #29)
> Linux run seems surprisingly clean. Makes me doubt any tests trigger my code.
> 
> try for Windows/Mac/Android:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=675e69a8ae8b8d981aeccf8078c9ac386593a288

Now this is closer to what I expected. Probably causing test failures for:
- image/test/mochitest/test_animation_operators.html
- html/semantics/embedded-content/image-maps/image-map-processing-model/hash-name-reference.html

Need to narrow these down. In any event, I don't think this set of patches will be workable for release and beta, they are too big and raise too many behavioral questions. I'm working on an alternative.

The alternative will hopefully just ignore block/unblock if the document/load group used in nsDocument::BlockOnload matches that was used in the original LoadImage(WithChannel) request. Currently we always call it, even if there was no need to. This increases the room for failure more than we need to.
Ritu, given that we are considering dot release for this issue you may want to mark it as a 57 blocking issue.
Flags: needinfo?(rkothari)
This patch would be targeting uplift to release and/or beta. This should reduce the probability of the issue without actually fixing it. The actual fix as per parts 1-4 already attached is very complicated. This is simpler and easier to audit.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=449a2032c002138887af0a37476a60b5dcdb5e8a
Hi Milan, this still looks like a serious issue. Is there anyone who can follow up while Andrew is on PTO?
Flags: needinfo?(milan)
The WIP is probably "good enough". Looks like it didn't break Linux, so let's try on all platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b948c7e90db927e63745563e4554bfa9550e6bbc
(Good enough to buy us time I mean. I wouldn't want to land something bigger on beta, let alone release.)
Comment on attachment 8918473 [details] [diff] [review]
[WIP][release] 0001-Bug-1404422-Avoid-nsDocument-BlockOnload-UnblockOnlo.patch, v1

The try looks clean -- all of the failures are marked as intermittent and none of them stood out to me as related to the change.

Most of the time, the image's loading/owning document and the composed composed will be the same. If that's the case, in nsContentUtils::LoadImage, we should have used the load group from the loading document and imgLoader will have added the imgRequestProxy to the load group at this time. This would mean that the callbacks generated via imgIOnloadBlocker frequently do nothing.

This patch checks if the document we are going to call nsDocument::BlockOnload on matches the one we have already blocked on. If they are, it won't matter if the events come out of order because they will now do nothing *most* of the time. Right now they could affect pretty much any set of documents which share images. This will prevent most of the failures and will hopefully bring the problem down to an acceptable frequency, so that it can be properly solved in another release.

This leaves the original problem of the events coming out of order. Solving this is tricky -- an inability to dispatch due to the synchronous requirement to ensure onload is properly blocked for a document, and a requirement to dispatch due to an incompatible labelled context. I certainly won't want to sweep this under the rug! In a followup bug, I would suggest landing the other parts 1-4 targeting the 58 release. *That* set solves the event ordering by removing the events entirely -- we should be blocking the required documents when we start the request, not at some later date. There are still some outstanding issues with those patches (which are not present in *this* patch) where I failed to take into the behaviour of LoadImageWithChannel, which blocks the load group of the channel instead of the document -- we would need to clone the request again to block for the loading document to maintain the preconditions. All that said, they are quite large and risky, and I would be uncomfortable recommending for uplift to release/beta.
Attachment #8918473 - Flags: review?(tnikkel)
Comment on attachment 8918473 [details] [diff] [review]
[WIP][release] 0001-Bug-1404422-Avoid-nsDocument-BlockOnload-UnblockOnlo.patch, v1

Okay, so Andrew's patch is based on the idea that all LoadImage calls add the imgRequest to the passed in load group (this happens on the main thread during the LoadImage call) and don't get removed until we get all the data, and we have decoded the size of the image. And the onload blocking that we do on top of this that go through progress tracker and calls BlockUnload should be redundant. These are the ones that are the problem. This patch makes them basically no-ops by ignoring them if the document matches.

Does this seem like a reasonable thing to do in the short term until a proper fix with more code can be tested and landed?

Also, Andrew said he had problems reproducing the problem, so we'll probably need your help to test that this fixes the problem.
Attachment #8918473 - Flags: feedback?(bzbarsky)
Yes, this one should be a 57 blocker. Tracked as such.
Flags: needinfo?(rkothari)
Comment on attachment 8918473 [details] [diff] [review]
[WIP][release] 0001-Bug-1404422-Avoid-nsDocument-BlockOnload-UnblockOnlo.patch, v1

Sorry for the lag here.  I was trying to understand this patch, and I'm a bit confused by it, honestly.

Let's just pick one callsite, in nsImageLoadingContent::UnblockOnload:

   nsIDocument* doc = GetOurCurrentDoc();
   if (doc) {
-    doc->UnblockOnload(false);
+    doc->UnblockOnload(false, !usesChannel ? GetOurOwnerDoc() : nullptr);
   }

In practice the case we care about is the !usesChannel case (i.e. the "we are an <img> element in a document" case).  In that case, we do:

   doc->UnblockOnload(GetOurOwnerDoc());

and GetOurOwnerDoc() == doc, because "doc" is the current doc.  So this is doing:

  doc->UnblockOnload(doc);

So far so good.  This call will be ignored, with the new code, unless doc is an SVG resource document.

Is that the basic intent of the change?  To restrict the blocking/unblocking to the resource document case?  If so, that could be done much more simply at callsites, without changing the blocking/unblocking API...

If this is not the intent, then I'm not quite sure what the intent is.
Flags: needinfo?(aosmond)
Attachment #8918473 - Flags: feedback?(bzbarsky)
Also, is the imgRequest added to multiple loadgroups?  Or are the imgRequestProxies the things that get added to loadgroups?  Or something else?
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #40)
> Comment on attachment 8918473 [details] [diff] [review]
> [WIP][release]
> 0001-Bug-1404422-Avoid-nsDocument-BlockOnload-UnblockOnlo.patch, v1
> 
> Sorry for the lag here.  I was trying to understand this patch, and I'm a
> bit confused by it, honestly.
> 
> Let's just pick one callsite, in nsImageLoadingContent::UnblockOnload:
> 
>    nsIDocument* doc = GetOurCurrentDoc();
>    if (doc) {
> -    doc->UnblockOnload(false);
> +    doc->UnblockOnload(false, !usesChannel ? GetOurOwnerDoc() : nullptr);
>    }
> 
> In practice the case we care about is the !usesChannel case (i.e. the "we
> are an <img> element in a document" case).  In that case, we do:
> 
>    doc->UnblockOnload(GetOurOwnerDoc());
> 
> and GetOurOwnerDoc() == doc, because "doc" is the current doc.  So this is
> doing:
> 
>   doc->UnblockOnload(doc);
> 
> So far so good.  This call will be ignored, with the new code, unless doc is
> an SVG resource document.
> 
> Is that the basic intent of the change?  To restrict the blocking/unblocking
> to the resource document case?  If so, that could be done much more simply
> at callsites, without changing the blocking/unblocking API...
> 
> If this is not the intent, then I'm not quite sure what the intent is.

nsDocument::BlockOnload/UnblockOnload are recursive functions, such that the document that ends up handling the event is the top level "mDisplayDocument". That's what I couldn't simply check at the callsite. Now we could add another method (and one iteration of my patches did this) which gets you the root display document and then we check at the callsite. Easy enough to do.(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #41)

> Also, is the imgRequest added to multiple loadgroups?  Or are the
> imgRequestProxies the things that get added to loadgroups?  Or something
> else?

It is imgRequestProxy gets added to the load group.
Flags: needinfo?(milan)
Flags: needinfo?(aosmond)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #41)
> Also, is the imgRequest added to multiple loadgroups?  Or are the
> imgRequestProxies the things that get added to loadgroups?  Or something
> else?

To expand on my comment, my intention is to add more imgRequestProxies to account for any load groups we want added to the document. But I think that patch is a bit bigger since it really supplants the imgIOnloadBlocker interface entirely.
> nsDocument::BlockOnload/UnblockOnload are recursive functions, such that the document
> that ends up handling the event is the top level "mDisplayDocument"

While true, the "depth" here is always either 1 or 2.  Every document either has a null mDisplayDocument, or has an mDisplayDocument whose mDisplayDocument is null.  

The only case that has a non-null mDisplayDocument is an SVG external resource (i.e. paint server and the like).

In the non-null mDisplayDocument case, the loadgroup of the document we're looking at is a request in the loadgroup of mDisplayDocument; see <http://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/dom/base/nsDocument.cpp#1176-1180>.  So as long as we add ourselves to either of the two loadgroups we're blocking onload on mDisplayDocument.
After discussing this on IRC with bz, we believe we can just remove imgIOnloadBlocker wholesale. The document and channel load groups should already be set up such that they block each other in the appropriate order.

Bug 697230 was filed to block the load group until the decode was complete. Today that maps to the metadata decode. This added the imgIOnloadBlocker interface.

Bug 1163856 changed how imagelib notified for the load complete event, such that is mirrors the behaviour of imgIOnloadBlocker, in addition to removing its own request from the load group. This made imgIOnloadBlocker redundant (*** most of the time).

I have found a bug (that's what I'm calling it in any event) where LoadImageWithChannel won't necessarily add us to the load group. Once that is fixed, we should always be adding any imgRequestProxies to the document/channel's load group in the LoadImage* calls, and imgIOnloadBlocker can be removed.
> Today that maps to the metadata decode.

That part is still a bit confusing to me.  The whole point of bug 697230 and similar was that for images that we _want_ to decode "immediately" we shouldn't consider the page loaded until the decode is complete.  Otherwise there are issues with flicker, drawWindow/drawImage needing to sync-decode, etc...  Did we change our mind about that behavior at some point?
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #46)
> > Today that maps to the metadata decode.
> 
> That part is still a bit confusing to me.  The whole point of bug 697230 and
> similar was that for images that we _want_ to decode "immediately" we
> shouldn't consider the page loaded until the decode is complete.  Otherwise
> there are issues with flicker, drawWindow/drawImage needing to sync-decode,
> etc...  Did we change our mind about that behavior at some point?

Pretty sure I remember Seth changing that. I don't have bug numbers or changeset links handy though.
Liz, should this still track 56 at this point?
Flags: needinfo?(lhenry)
No, I don't think we'll do another dot release (56.0.2 will be out today)
Flags: needinfo?(lhenry)
This patch does the meat of the work needed in imagelib:
1) Blocks the load group in LoadImageWithChannel, we never used to.
2) Blocks the load group in imgRequestProxy::Clone variants.
3) Now fails to change the load group in imgRequestProxy::SetLoadGroup (we didn't implemented this properly to begin with).
4) Ensures we never remove ourselves from the load group during Clone/CancelAndForgetObserver unless it is safe to do so (will dispatch).

This should make imgIOnloadBlocker unnecessary. We are still failing one test case in conjunction with the other test cases. Specifically a print preview test case on Windows non-e10s only:

https://treeherder.mozilla.org/logviewer.html#?job_id=140217268&repo=try&lineNumber=4703

As far as I can tell, the timeout that should be deferred *is* deferred to run when it should, but for some reason, the second canvas capture of the print preview contains "1 timers" instead of "0 timers". Possibly fall out of a load group removal? The static requests don't block on the load group (maybe they should; it didn't have an impact on this test case).
Attachment #8918314 - Attachment is obsolete: true
Attachment #8918316 - Attachment is obsolete: true
Attachment #8918317 - Attachment is obsolete: true
Attachment #8918338 - Attachment is obsolete: true
Attachment #8918473 - Attachment is obsolete: true
Attachment #8918473 - Flags: review?(tnikkel)
This test case doesn't care about chrome:// URLs but was getting events for them. I added to the ignore list which already had resource:// and about://.
This change just disables the onload block/unblock events without ripping out the plumbing everywhere. This was really a just-in-case we didn't want to uplift the next part which is much larger.
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa223ffe913d25cf6cab4092f9d6f27368750bcd

I believe this will green the try at long last. It turns out I was too aggressive in adding to the load group on cloning, as this would somehow mess up a print preview test case on Windows non-e10s, letting a timer event get processed. I think that the failure is actually not a problem, because it gets fired in between printPreview calls, rather than during them (which is what it wanted to avoid) in the test case [1]. But it did make sense not to reblock the load group unless something actually changed.

One concern I have is with respect to bug 1120149. It is quite some time ago, but there is the possibility of a similar regression since I am dramatically changing how the load group notifications work. I think we may have to live with that and see what happens.

[1] http://searchfox.org/mozilla-central/source/layout/base/tests/chrome/printpreview_helper.xul
Attachment #8923348 - Flags: review?(tnikkel)
Attachment #8923349 - Flags: review?(tnikkel)
Attachment #8923350 - Flags: review?(tnikkel)
Attachment #8923352 - Flags: review?(tnikkel)
Attachment #8922921 - Flags: review?(tnikkel)
Attachment #8922922 - Flags: review?(tnikkel)
Attachment #8922923 - Flags: review?(tnikkel)
Attachment #8923348 - Flags: review?(tnikkel) → review+
Attachment #8923349 - Flags: review?(tnikkel) → review+
Comment on attachment 8923350 [details] [diff] [review]
0003-Bug-1404422-Part-1c.-Refactor-how-an-imgRequestProxy.patch

>+  /* Remove from the load group and readd as a background request. */
>+  void BackgroundInLoadGroup();

MoveToBackgroundInLoadGroup? ChangeToBackgroundInLoadGroup?
Attachment #8923350 - Flags: review?(tnikkel) → review+
Comment on attachment 8923350 [details] [diff] [review]
0003-Bug-1404422-Part-1c.-Refactor-how-an-imgRequestProxy.patch

This also means that NS_IMAGELIB_CHANGING_OWNER is now not needed correct? Can we remove this from the tree?
Comment on attachment 8923352 [details] [diff] [review]
0004-Bug-1404422-Part-1d.-Ensure-imgRequestProxy-PerformC.patch

>   if (GetOwner() && GetOwner()->GetValidator()) {
>     // Note that if we have a validator, we don't want to issue notifications at
>     // here because we want to defer until that completes.
>     clone->SetNotificationsDeferred(true);
>     GetOwner()->GetValidator()->AddProxy(clone);

Can you comment here that the AddProxy call does the add to load group?

Do we not want to avoid that AddToLoadGroup call if addToLoadGroup is false?
Attachment #8922923 - Flags: review?(tnikkel) → review+
Attachment #8922922 - Flags: review?(tnikkel) → review+
Attachment #8922921 - Flags: review?(tnikkel) → review+
(In reply to Timothy Nikkel (:tnikkel) from comment #58)
> Comment on attachment 8923350 [details] [diff] [review]
> 0003-Bug-1404422-Part-1c.-Refactor-how-an-imgRequestProxy.patch
> 
> >+  /* Remove from the load group and readd as a background request. */
> >+  void BackgroundInLoadGroup();
> 
> MoveToBackgroundInLoadGroup? ChangeToBackgroundInLoadGroup?

Accepted MoveToBackgroundInLoadGroup.

(In reply to Timothy Nikkel (:tnikkel) from comment #59)
> Comment on attachment 8923350 [details] [diff] [review]
> 0003-Bug-1404422-Part-1c.-Refactor-how-an-imgRequestProxy.patch
> 
> This also means that NS_IMAGELIB_CHANGING_OWNER is now not needed correct?
> Can we remove this from the tree?

Correct. Removed.
Attachment #8923350 - Attachment is obsolete: true
Attachment #8923823 - Flags: review+
Comment on attachment 8923824 [details] [diff] [review]
0004-Bug-1404422-Part-1d.-Ensure-imgRequestProxy-PerformC.patch

(In reply to Timothy Nikkel (:tnikkel) from comment #60)
> Comment on attachment 8923352 [details] [diff] [review]
> 0004-Bug-1404422-Part-1d.-Ensure-imgRequestProxy-PerformC.patch
> 
> >   if (GetOwner() && GetOwner()->GetValidator()) {
> >     // Note that if we have a validator, we don't want to issue notifications at
> >     // here because we want to defer until that completes.
> >     clone->SetNotificationsDeferred(true);
> >     GetOwner()->GetValidator()->AddProxy(clone);
> 
> Can you comment here that the AddProxy call does the add to load group?

Done.

> 
> Do we not want to avoid that AddToLoadGroup call if addToLoadGroup is false?

We don't know when the validator will complete, and it may discard the cache which may say we are complete. So I don't think we have a choice in this case. However the vast majority of LoadImage calls (and all LoadImageWithChannel calls) will have a load group associated with them, and so the document load group was probably already blocked... a nasty corner case if it turns out to be a problem though.
Attachment #8923824 - Attachment filename: 0004-Bug-1404422-Part-1d.-Ensure-imgRequestProxy-PerformC.patch → 0004-Bug-1404422-Part-1d.-Ensure-imgRequestProxy-PerformC.patch, v2
Attachment #8923824 - Flags: review?(tnikkel)
Attachment #8923352 - Attachment is obsolete: true
Attachment #8923352 - Flags: review?(tnikkel)
Attachment #8923824 - Flags: review?(tnikkel) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/442e399f7cb6
Part 1a. Ensure imgLoader::LoadImage(WithChannel) adds the request to the expected load group. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/a943db5eaa64
Part 1b. Make imgRequestProxy::SetLoadGroup return an error if changing the load group. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b420c5a7b11
Part 1c. Refactor how an imgRequestProxy is added/removed from its load group. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/fe1b72af4740
Part 1d. Ensure imgRequestProxy::PerformClone consistently adds the clone to the expected load group. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/1a1bcb63456d
Part 2. Make chrome test ignore generic, unrelated chrome URLs from events. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/5c766b0509d5
Part 3. Disable triggering imgIOnloadBlocker block/unblock events. r=tnikkel
https://hg.mozilla.org/integration/mozilla-inbound/rev/f24a18bffbc7
Part 4. Remove imgIOnloadBlocker and related from tree as redundant. r=tnikkel
Depends on: 1413981
Hi Milan, This is a bug I have tracked as 57 blocking. It's nice to see we landed a fix in Nightly58. Looking at the sheer number of patches, I am not sure if this is worth the risk to uplift in 57 now. Do you agree? I will be planning a dot release for 57 that will incrementally improve quality. Should we consider including this fix in the dot release 2-3 weeks from now? Thanks for your input.
Flags: needinfo?(milan)
Given the size and the fact that this shipped in 56, I would not immediately uplift to 57.  Perhaps in a later dot release if it surfaces as a big problem.
Flags: needinfo?(milan)
Depends on: 1414762
Depends on: 1414427
Depends on: 1414433
Taking this off the blocker list per comment 67.  We'll keep tracking for a couple of weeks and re-evaluate.
(In reply to Julien Cristau [:jcristau] from comment #68)
> Taking this off the blocker list per comment 67.  We'll keep tracking for a
> couple of weeks and re-evaluate.

In my experience the severity here is pretty serious. I can reproduce this easily by loading multiple pages in parallel (tp5o pages) and the typical scenario that the document stays in interactive state so the window load event is not fired, which can result a broken page for some sites. Anyway, it might not be realistic to uplift the patches, but if there is a chance it's worth the effort imo.
Flags: needinfo?(jcristau)
Strong +1, because I'm expecting plenty of Session Restore bugs to be filed that I need to triage where I'll have to say 'Can you try 58 beta?'
Also, a Shield study is being set up to investigate whether it's time now to restore your previous session on startup by default, and I'd like that to go a smooth as possible.
Given all the issues found with these patches already since landing, I don't think crash-landing them into the RC3 build is a wise strategy at all.
Unless there's a smaller / lowest-possible-risk fix we could take in a dot release this will have to wait for 58.
Flags: needinfo?(jcristau)
You need to log in before you can comment on or make changes to this bug.