Closed Bug 1522316 Opened 11 months ago Closed 9 months ago

Network loads seem to be started very late in this profile

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox66 --- wontfix
firefox67 --- fixed

People

(Reporter: mstange, Assigned: smaug)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [qf:p1:pageload][necko-triaged])

Attachments

(4 files, 2 obsolete files)

Please take a look at the network charts in this profile: http://bit.ly/2WgurnM, or in this profile: http://bit.ly/2WaMOL0

These are profiles from bug 1522311.

It seems to me that what's happening here is that the content process knows about network loads it wants to do many seconds before the parent process knows about them. And then the actual network loads start extremely late, long after they could have started. In fact, the network is idle during most of the time when the user is looking at a blank screen.

What causes these network loads to start in the end? It looks like it's not happening until the content process main thread becomes available. Do these loads require a jump through the content process event loop before they get forwarded to the parent process?

http://bit.ly/2DsPtIw smells like service workers, but I could be very wrong.

Is this possibly a regression from https://bug1300659.bmoattachments.org/attachment.cgi?id=8808282 or did that make the situation worse?

Flags: needinfo?(amarchesini)

I may sound a little bit pushy, perhaps, but I think BT would help here, clearly. If you are using a custom build, I would love to help to make the BT patch work and try to get something out of the instrumented build.

To provide some general SW context:

  • Service Worker interception currently happens in the child, although the e10s refactor will change that and is nearly done.
    • The parent only gets told about network requests if the ServiceWorker decides not to intercept.
    • The interception gets reset on the main thread (in the content process) because that's where the network request was happening in the first place.
  • fetch() calls made from workers do get bounced off the main thread off the content process

And to provide some context from the imgur SW:

  • It appears to implement a "fetch" event handler, so the SW has to be spun up for all requests if it's not already spun up. (It actually appears to implement 2 "fetch" handlers? The minified/mangled source doesn't make it super clear what's going on when trivially beautified.)
  • It appears to use pass-through fetch, which means that it effectively ends up doing event.respondWith(fetch(event.request)) which means that the the Response (headers) has to come down to the content process main thread, then bounce to the SW to resolve the promise, then get bounced back to the main thread to actually get the data flowing. The good news is the input stream stuff should end up efficiently streaming, although things could be further improved.

So in general, if the main thread in the content process is experiencing a lot of contention, yeah, it's a problem.

And to answer :smaug's question, it's certainly possible that the interactions of a highly contended main thread and the throttled event queue could make things worse.

(In reply to Andrew Sutherland [:asuth] from comment #5)

So in general, if the main thread in the content process is experiencing a lot of contention, yeah, it's a problem.

As a rule of thumb, the main thread of the content process is always contended. If we are round-tripping through it, we add arbitrary delays to everything. I think it's extremely important to address this problem.

(In reply to Markus Stange [:mstange] from comment #7)

As a rule of thumb, the main thread of the content process is always contended. If we are round-tripping through it, we add arbitrary delays to everything. I think it's extremely important to address this problem.

Yes, this is a known issue. If you'd like to help, please take a look at bug 1330826 and bug 1351231. Thanks!

Thanks for the pointers! And thanks for figuring out what causes the problem on this page.

Depends on: 1522593
Depends on: 1330826, 1351231

Talking with :smaug on IRC right now, I should also mention that the SW also seemed to potentially be doing Cache API lookups. These happen via PBackground and can 100% be impacted by PBackground contention (WebRTC can block this for dumb reasons) or disk I/O contention and would also stall the pass-through fetches. In the short term I'm hoping to add MOZ_LOG stuff for ServiceWorkers and the Cache API which can help get us more profiler markers with :jesup's show-MOZ_LOG-as-markers fix thing to better understand rather than guess. (Has that landed?)

Whiteboard: [qf:p1:pageload]

ytausky, can be interesting for you to work on this? The idea is probably to have 2 ways to dispatch worker runnables: with and without throttable queue. For networking runnables we should not use the throttable queue.

Flags: needinfo?(amarchesini) → needinfo?(ytausky)

The priority flag is not set for this bug.
:selenamarie, could you have a look please?

Flags: needinfo?(sdeckelmann)
Flags: needinfo?(sdeckelmann)
Priority: -- → P2
Whiteboard: [qf:p1:pageload] → [qf:p1:pageload][necko-triaged]
Blocks: 1335900
Component: Networking → DOM: Service Workers
Attached patch wip (obsolete) — Splinter Review

https://treeherder.mozilla.org/#/jobs?repo=try&revision=de6e79b3125ed00fa523c798d2e94973899c5b0d

Because of the debugger stuff, the queue handling in workers is a tad weird.
normal postMessage uses always mMainThreadDebuggeeEventTarget.
So this messaging queue is in fact used only by console API for now.

Assignee: nobody → bugs
Attached patch wip (v3) (obsolete) — Splinter Review

remote: View your change here:
remote: https://hg.mozilla.org/try/rev/74d184574e47797bc0a8414b6acf6c3692c2a018
remote:
remote: Follow the progress of your build on Treeherder:
remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=74d184574e47797bc0a8414b6acf6c3692c2a018
remote: recorded changegroup in replication log in 0.041s

Depends on: 1524006

Sorry for not being responsive, I just came back from vacation. I'm assuming :smaug is working on this now? Please ping me again if I should still take over.

Flags: needinfo?(ytausky)

remote: Follow the progress of your build on Treeherder:
remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=da1222000d145e9979b01d9d25739129b711259f
remote: recorded changegroup in replication log in 0.048s
push complete

Attachment #9043964 - Attachment is obsolete: true
Attachment #9044333 - Attachment is obsolete: true

"ForMessaging" is a bit weird name currently, but that is mostly because postMessage ends up using mMainThreadDebuggeeEventTarget queue (which super weird).

There are two perhaps unexpected cases,
(1) LifeCycleEventWatcher uses the lower priority queue so that .installing on main thread gets set correctly (and the case is sort about messaging)
(2) TopLevelWorkerFinishedRunnable uses lower priority queue so that all the other runnables dispatched during shutdown (some of them are dispatched without worker queues) are handled before.

Workers event loop handling is a tad weird even without the patch. Some of the runnables go to the throttable queue, meaning basically that they have lower priority, but some runnables bypass that and use just the normal NS_DispatchToMainThread. That feels error prone.

The patch should let us create certain objects which require main thread sooner, or open network connections etc. But still, there is main thread usage, and main thread is mostly blocked during the whole page load, so eventually we should remove all the unneeded main thread usage.

Attachment #9048025 - Flags: feedback?(amarchesini)
Comment on attachment 9048025 [details] [diff] [review]
unthrottled_worker_queue_7.diff

Review of attachment 9048025 [details] [diff] [review]:
-----------------------------------------------------------------

It seems all reasonable to me.
Attachment #9048025 - Flags: feedback?(amarchesini) → feedback+

try push, pref off
remote:
remote: Follow the progress of your build on Treeherder:
remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6061f4bae4a336c7ed99e19e0be6c51af148613

try push, pref on
remote:
remote: Follow the progress of your build on Treeherder:
remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=218bc59a2959982880545c1a9da430fc597251f6
remote: recorded changegroup in replication log in 0.046s

If main thread is busy handling runnables in the normal priority queue, control-type of messages from
workers are possibly postponed to run after those. That can lead to bad performance, if the page
expects workers to be able to proceed simultanously with the main thread.
This patch makes the control messages to use medium high priority queue in order to try to
ensure they are handled in timely manner.
Pref dom.worker.use_medium_high_event_queue can be set to false to disable this new behavior.

Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/73824b8dd855
use medium high priority queue for worker->main thread control messages, r=baku
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
See Also: → 1531511
Regressions: 1545577
You need to log in before you can comment on or make changes to this bug.