Long-running NotifyIconObservers::Run on main-thread in parent process when closing a tab

NEW
Unassigned

Status

()

defect
P2
normal
Last year
8 months ago

People

(Reporter: mconley, Unassigned)

Tracking

(Blocks 1 bug, {perf})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fxperf:p3][qf:p3][fxsearch])

I just had a sluggish tab close, and was able to capture a profile:

https://perfht.ml/2ByeAcr

The profile is dominated with a "NotifyIconObservers::Run" function that seems to be doing a lot on the main thread. About 301ms worth.

That seems bad.
(In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and needinfos) from comment #0)

> The profile is dominated with a "NotifyIconObservers::Run" function that
> seems to be doing a lot on the main thread. About 301ms worth.

In this profile I see permitUnload taking 361ms, and spinning the event loop during that time, causing PageIconProtocolHandler.js to run some random code (in small chunks that probably wouldn't cause jank).
(In reply to Florian Quèze [:florian] from comment #1)
> (In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and
> needinfos) from comment #0)
> 
> > The profile is dominated with a "NotifyIconObservers::Run" function that
> > seems to be doing a lot on the main thread. About 301ms worth.
> 
> In this profile I see permitUnload taking 361ms, and spinning the event loop
> during that time, causing PageIconProtocolHandler.js to run some random code
> (in small chunks that probably wouldn't cause jank).

I take back "(in small chunks that probably wouldn't cause jank)", this is just a supposition, I don't see clearly in the profile if the time spent in mozilla::places::NotifyIconObservers::Run is happening in one chunk or in several small chunks.
well, this is spinning the events loop indeed, there's an onPageChanged notification, that actually was enqueued but due to spinning happens immediately, and that causes some UI (likely) to request a new icon through page-icon.
Since we are still spinning, instead of delaying the icon load it happens immediately and we call serveIcon.
Most of these operations wouldn't normally happen in the same tick, but spinning makes them so.

I'm not sure what to do in this case.
It's possible new notifications from bug 1340498 may help here, but I can't promise.
Keywords: perf
Priority: -- → P2
Whiteboard: [fxperf][qf] → [fxperf][qf][fxsearch]
(In reply to Marco Bonardo [::mak] from comment #3)
> Since we are still spinning, instead of delaying the icon load it happens
> immediately and we call serveIcon.
> Most of these operations wouldn't normally happen in the same tick, but
> spinning makes them so.

Maybe I'm misunderstanding how our event loops work, but why would spinning the event loop change the priority of events? Why would the delay of the icon load not occur in this case with the nested event loop, but would occur with the normal event loop?
Flags: needinfo?(mak77)
(In reply to Mike Conley (:mconley) (:⚙️) (Totally backlogged on reviews and needinfos) from comment #4)
> Maybe I'm misunderstanding how our event loops work, but why would spinning
> the event loop change the priority of events?

It doesn't.
But it looks like here we have a main thread stack of calls that are not supposed to be part of the same loop.
The spinning seems to cause enqueued runnables to be adjacent on the main thead, and then just executes them in the same tick. Since the profiler is a polling one, it may just poll before and after the spin, and think this is a unique stack?
Indeed PageIconProtocolHandler.js::newChannel2 calls getFaviconDataForPage, that sends an event to the async thread, that when done dispatches back an event to the main thread, that calls serveIcon. The profiler think newChannel2 directly invoked serveIcon.
The only way I could explain that is if the async thread gets scheduled for immediate execution, and the main thread event it dispatches is enqueued just after the original newChannel2 event (so nothing was enqueued to run after the newChannel2 event). Due to spinning the 2 runnables run consecutively.

I'm not sure how we'd avoid this, apart from adding further artificial delays of events, or marking the runnable as low priority? It sounds like a problem of events priority here.
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #5)

> It doesn't.
> But it looks like here we have a main thread stack of calls that are not
> supposed to be part of the same loop.
> The spinning seems to cause enqueued runnables to be adjacent on the main
> thead, and then just executes them in the same tick. Since the profiler is a
> polling one, it may just poll before and after the spin, and think this is a
> unique stack?

Plausible, I guess. Hey mstange, could that be what we're seeing here?
Flags: needinfo?(mstange)
Or njn, would you happen to know what's happening here?
Flags: needinfo?(n.nethercote)
I don't really understand most of mak's comment, but here's what I think happens:

The profiler measures jank by dispatching a timer with interval 16ms to the main thread's SystemGroup. If there's a jank marker in the profile, that means that the timer runnable hasn't executed until the end of the jank marker. So any runnables that run during the duration of the jank marker were either dispatched before the 16ms timer added its runnable, or they were dispatched to a higher priority event queue.
The order that the event queues get serviced in does not depend on whether we run a nested event loop or the root event loop.

The same applies for the onPageChanged notification: If it was dispatched before the NotifyIconObservers runnable, with the same priority, then it should run first. However, it does not, so my guess is that the NotifyIconObservers runnables were dispatched *before* the onPageChanged notification was dispatched.
Flags: needinfo?(mstange)
I don't have anything to add beyond what mstange said.
Flags: needinfo?(n.nethercote)
I talked to mstange IRL about this, and I think I can help clear this up.

The stack in perf.html says:

...
  ...
    newChannel2/<
      serveIcon

mak was concerned because newChannel2 does not call serveIcon directly - it is supposed to queue it to run asynchronously via PlacesUtils.favicons.getFaviconDataForPage.

However, what I learned today is that the /< at the end of newChannel2 isn't just noise and shouldn't be ignored; the /< means that the next frame beneath it was defined as a lambda within the /< function. So this is saying that serveIcon is a lambda inside newChannel2 - _not_ that newChannel2 calls serveIcon.
Almost: newChannel2/< means "this is an anonymous lambda that was created inside a function called newChannel2". So in your case, the anonymous lambda is the caller, and serveIcon is called by that lambda.
Got it, thanks!
That makes more sense.

(In reply to Markus Stange [:mstange] from comment #8)
> NotifyIconObservers runnables were dispatched *before* the onPageChanged
> notification was dispatched.

NotifyIconObservers is a runnable dispatched by the async thread to the main thread, and when it runs it call onPageChanged.
the order looks correct regarding that.
Excluding the high priority queue, since I don't think we even have one for these runnables, it sounds like:
NotifyIconObserver is already enqueued, we close the tab and start spinning, notifyIconObserver runs immediately, it sends onPageChanged, someone collects that and requires an updated icon (newChannel2).
=> At this point we dispatch a runnable to the async thread, that sends back a runnable on the main thread (serveIcon).
It sounds like only at this point the jank timer event gets enqueued, otherwise it would be served before serveIcon (that is what mstange says as well, IIUC).
I'm still not sure why the timer event is not between the 2 main thread events... maybe the scheduler decided to run our async thread before the thread that dispatches the timer event.

I'm still not sure what to do here, apart from adding artificial delays.
FWIW Looking at the profile it doesn't look to me like bug 1340498 will help, unfortunately. It looks to me like the time spent in XPConvert is due to converting for the callback, and not for the relatively simple data in the notification. (Also I suspect we didn't inline everything all the way down to nsNavHistory::SendPageChangedNotification, so if the notifications were taking time, I'd expect samples with those frames to be present.)
Giving this an [fxperf:p2] for now. I think we could at least experiment with making the callback less costly. It looks like changing the aData parameter of the callback to a jsval, and manually constructing a Uint8Array from the buffer might be worth exploring?
Whiteboard: [fxperf][qf][fxsearch] → [fxperf:p2][qf][fxsearch]

Updated

Last year
Whiteboard: [fxperf:p2][qf][fxsearch] → [fxperf:p2][qf:p1][qf:f61][fxsearch]

Updated

Last year
Whiteboard: [fxperf:p2][qf:p1][qf:f61][fxsearch] → [fxperf:p2][qf:p1][qf:f64][fxsearch]

Updated

Last year
Whiteboard: [fxperf:p2][qf:p1][qf:f64][fxsearch] → [fxperf:p2][qf:p1:f64][fxsearch]
I've come back around to this, and I think there's a piece of the puzzle missing.

Mainly, I think before I closed my tab, I must have opened up the Bookmarks menu:

https://perfht.ml/2K8sT94

Which, I would imagine, queued up a bunch of favicon requests inside a PlacesView.

So right off the bat, we have a large chunk of favicon requests in flight, and Places is about to get busy retrieving those.

Secondly, I must have been on a page with a beforeunload event handler set. That's why when I decided to close the tab, it started to spin the event loop. Once it started to spin the event loop, I guess that's when it started to service all of the icon events that were in the queue. Once once those events were serviced did the main thread service the message from content saying, "I don't have any beforeunload events".

A few ideas:

1) Since the bookmark menu was likely closed when I closed the tab, I wonder if it makes sense to cancel any in-flight favicon requests from a PlacesView for which the menu has closed. I don't even know if that's possible.

2) I don't know how possible it is, but it might make sense to decrease the priority of favicon events for the Bookmarks menu in the parent process. That menu (and maybe some other menus) are all likely to request a huge slew of favicons all at once... it's probably more important to service content process messages than it is to service favicon events. That might have helped us here, since the "go ahead and close" message from the content process would have taken priority over the favicon loads.

Hey mak, is my mental model here correct? And do you know if it's possible to cancel in-flight favicon loads?

Also cc'ing Mossop, since I think he's been working with favicons lately, and might have some insight here.
Flags: needinfo?(mak77)
(In reply to Mike Conley (:mconley) (:⚙️) (Catching up on needinfos / reviews) from comment #16)
> 1) Since the bookmark menu was likely closed when I closed the tab, I wonder
> if it makes sense to cancel any in-flight favicon requests from a PlacesView
> for which the menu has closed. I don't even know if that's possible.

Technically it's possible. Currently Places views for menus are kept alive, so the menus can be reopened faster, but I imagine image requests are redone every time. There's not much linking the 2 though, we may need to associate those icon loads to a loadgroup and cancel that in browserPlacesViews.js::_onPopupHidden handler (the trees may want something similar). This would probably not cancel the database fetching though, we'd need to forward cancellation from the ui to the channel created by PageIconProtocolHandler.js and from there to nsIFaviconService::getFaviconDataForPage through a canceler function.
So, while it's technically feasible, it sounds like quite some work.

> 2) I don't know how possible it is, but it might make sense to decrease the
> priority of favicon events for the Bookmarks menu in the parent process.

Or more in general for Places views. I'd have nothing against that, but I don't know if technically we already support re-prioritization of the main-thread events queue. Off-hand it sounds like it would be easier to mark certain events as critical/important, than to mark many events as not-critical. Thus, maybe we should give the onbeforeunload event an high priority, rather than reduce priority of everything else?
Flags: needinfo?(mak77)
Okay, thanks mak.

I'm downgrading this to a fxperf:p3 and a qf:p3 because it looks like the favicon loading was being kicked off by the bookmarks menu. I think it's still worth lowering the priority somehow of these favicon requests, but I don't think this is something that needs to be addressed asap.
Whiteboard: [fxperf:p2][qf:p1:f64][fxsearch] → [fxperf:p3][qf:p3:f64][fxsearch]

Updated

8 months ago
Whiteboard: [fxperf:p3][qf:p3:f64][fxsearch] → [fxperf:p3][qf:p3][fxsearch]
You need to log in before you can comment on or make changes to this bug.