Closed Bug 1335900 Opened 8 years ago Closed 5 years ago

Enabling service worker slows down time-to-interactive on dropbox.com by 600ms

Categories

(Core :: DOM: Service Workers, defect, P3)

Tracking

RESOLVED WORKSFORME
Tracking Status
platform-rel --- +

People

(Reporter: dzbarsky, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [platform-rel-Dropbox])

Attachments

(1 file)

Two test accounts:
dzbarsky+test_sw@gmail.com (service worker enabled) and dzbarsky+test_sw_off@gmail.com (service worker disabled)
Both have password 123456

Log in and go to https://www.dropbox.com/home
Once it finishes loading, you can do "require.s.contexts['embedded-app'].require('modules/clean/web_timing_logger').time_to_interactive - performance.timing.navigationStart" in the console to figure out when we mark the app as interactive. The timing is pretty noisy, but aggregate telemetry numbers from users in the wild show a 600ms difference on the browse page.

You may be able to reproduce this on a simpler page: load https://www.dropbox.com/profiling/maestro_blank_embedded_app and run the same command in the console.
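For convenience, the same measurement can be written as a short console snippet (identical expression to the one above, just split up for readability):

  const timingLogger = require.s.contexts['embedded-app']
      .require('modules/clean/web_timing_logger');
  console.log(timingLogger.time_to_interactive - performance.timing.navigationStart);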
Catalin is going to help here (he's busy this week, though).
Assignee: nobody → catalin.badea392
From what I can see, the registered service worker doesn't serve any cached resources and will just fall back to the network, at least for https://www.dropbox.com/profiling/maestro_blank_embedded_app.

I've measured the time it takes from when we decide to intercept the request until we reset the channel, which should be the overhead introduced by going through the service worker. We spend most of the time waiting for the ResumeRequest runnable to be executed on the main thread. Not using the throttled event queue reduces this time considerably.

The first dispatch to the service worker takes between 14ms and 30ms; this is probably due to the SW waking up. Subsequent dispatches usually take less than 1ms.

The averages and max values are taken while loading https://www.dropbox.com/profiling/maestro_blank_embedded_app with a registered service worker and then refreshing 6 times.

ThrottledEventQueue (refreshing the page 6 times):
Time between interception and channel reset: average=76.7275 ms, max=574.190719 ms
Time between ResumeRequest dispatch and execution on MT: average=76.454 ms, max=574.041316 ms

The max value is a bit unusual; measuring just the first load doesn't yield such high values.

ThrottledEventQueue (start Firefox, load https://www.dropbox.com/profiling/maestro_blank_embedded_app, then quit) x 6:
Time between interception and channel reset: average=74.6755 ms, max=243.249065 ms
Time between ResumeRequest dispatch and execution on MT: average=74.2079 ms, max=243.175957 ms

NS_DispatchToMainThread:
Time between interception and channel reset: average=17.0206 ms, max=68.695675 ms
Time between ResumeRequest dispatch and execution on MT: average=16.8428 ms, max=68.552611 ms
They claim that serving resources out of the Cache API also slowed it down, but it's hard to measure when they have that disabled. I tried to explain that they should expect a perf hit if they interpose a JavaScript fetch event handler that doesn't do anything. Running the JavaScript is always going to be slower than not.
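For illustration, the "handler that doesn't do anything" case looks roughly like this (a minimal sketch, not Dropbox's actual worker script):

  // sw.js: this handler never calls respondWith(), so every request falls
  // back to the network, but each request still pays the cost of dispatching
  // a fetch event to the worker (and of waking the worker up the first time).
  self.addEventListener('fetch', event => {
    // intentionally empty
  });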
This is the patch I used to do the measurements. It's a bit of a hack, and I still need to make sure the measurements are correct.
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #3)
> They claim that serving resources out of the Cache API also slowed it down,
> but it's hard to measure when they have that disabled. I tried to explain
> that they should expect a perf hit if they interpose a JavaScript fetch
> event handler that doesn't do anything. Running the JavaScript is always
> going to be slower than not.

Right, but we may be able to improve on this by not throttling fetch response runnables or channel reset runnables. Also, this might be a good opportunity to write talos tests.

Could we skip the throttled queue in this case? Or give these runnables a higher priority, not sure how it works.
Flags: needinfo?(bkelly)
(In reply to Cătălin Badea (:catalinb) from comment #5)
> Right, but we may be able to improve on this by not throttling fetch
> response runnables or channel reset runnables. Also, this might be a good
> opportunity to write talos tests.
> 
> Could we skip the throttled queue in this case? Or give these runnables a
> higher priority, not sure how it works.

You mean bypassing ThrottledEventQueue?  I don't think we should do that.  That only affects timing if the main thread is busy.  If the main thread is busy, then you're not going to be getting good performance here anyway.

ThrottledEventQueue does not actually "throttle".  It just yields to other main thread work between each runnable.
Flags: needinfo?(bkelly)
You could see if disabling their loading animation thing helps.  Perhaps they are hitting the main thread hard to do that animation.
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #6)
> (In reply to Cătălin Badea (:catalinb) from comment #5)
> > Right, but we may be able to improve on this by not throttling fetch
> > response runnables or channel reset runnables. Also, this might be a good
> > opportunity to write talos tests.
> > 
> > Could we skip the throttled queue in this case? Or give these runnables a
> > higher priority, not sure how it works.
> 
> You mean bypassing ThrottledEventQueue?  I don't think we should do that. 
> That only affects timing if the main thread is busy.  If the main thread is
> busy, then you're not going to be getting good performance here anyway.
> 
> ThrottledEventQueue does not actually "throttle".  It just yields to other
> main thread work between each runnable.

Yes, but we're delaying the network request, which is not handled on the main thread. Maybe this can cause an overall longer loading time.
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #7)
> You could see if disabling their loading animation thing helps.  Perhaps
> they are hitting the main thread hard to do that animation.

I'll be sure to check that.
David, does your telemetry data include the client version? I wonder if there's any difference between Firefox releases.
Flags: needinfo?(dzbarsky)
(In reply to Cătălin Badea (:catalinb) from comment #8)
> > ThrottledEventQueue does not actually "throttle".  It just yields to other
> > main thread work between each runnable.
> 
> Yes, but we're delaying the network request, which is not handled on the
> main thread. Maybe this can cause an overall longer loading time.

I don't think letting the worker jank the main thread is the solution.  Also, any benefit would be very racy and timing-dependent anyway, because you might still end up scheduled behind the work that's currently causing problems.

Let's see what other work is happening on the main thread that is conflicting with the SW runnables.
Btw, I'm glad to see that LinkedIn turned on Cache-Control: immutable.
After investigating the tests at https://github.com/samertm/firefox-sw-perf, it turns out that using URL objects in the fetch handler is one of the main causes of the increased loading time. For the "/with_dbxsw" path, dropping the use of URL objects reduces the loading time from ~500ms to ~30ms.

It's interesting that the service worker actually causes Gecko to perform more font face style flushes, which keep the main thread busy and make URL syncloop requests even slower. I *think* the increase in style flushes happens because the CSS resources arrive more spread out in time, so the style operations are no longer coalesced.
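For illustration, the slow pattern is roughly the first handler below, where each new URL() in the worker requires a synchronous round trip to the (busy) main thread, while the second sticks to plain string matching (a hypothetical sketch, using the benchmark's "/with_dbxsw" path as the example):

  // Slow: constructing a URL object in the fetch handler blocks on the main thread.
  self.addEventListener('fetch', event => {
    const url = new URL(event.request.url);
    if (url.pathname.startsWith('/with_dbxsw')) {
      // ...
    }
  });

  // Faster: plain string matching avoids the URL object entirely.
  self.addEventListener('fetch', event => {
    if (event.request.url.includes('/with_dbxsw')) {
      // ...
    }
  });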
Another path we probably want to optimize is event.respondWith(fetch(event.request)). While this will always add some overhead, I think a lot of people will end up doing something like:

  self.addEventListener('fetch', event => {
    if (list_of_special_paths.some(path => event.request.url.includes(path))) {
      // special handling
    } else {
      event.respondWith(fetch(event.request));
    }
  });

On https://github.com/samertm/firefox-sw-perf, a service worker that just responds with a fetch of the same request bumps the loading times from 30-40ms to 600ms. Chrome doesn't have this issue.
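For reference, the pass-through worker measured there is essentially the following (a minimal sketch; the actual benchmark script may differ):

  // sw.js: every request is re-issued via respondWith(fetch(...)), so each
  // one takes the interception, worker-dispatch, and channel-reset path.
  self.addEventListener('fetch', event => {
    event.respondWith(fetch(event.request));
  });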
platform-rel: --- → ?
Whiteboard: [platform-rel-Dropbox]
Whiteboard: [platform-rel-Dropbox] → [platform-rel-Dropbox][qf]
Priority: -- → P1
Whiteboard: [platform-rel-Dropbox][qf] → [platform-rel-Dropbox][qf:p1]
platform-rel: ? → +
Dropbox dropped the use of URL objects in their service worker, but still experiences poor performance with Firefox (compared to Chrome). I couldn't reproduce these new issues using the perf benchmark because the Performance API is broken for synthesized responses (bug 1351521). From what I investigated some time ago, I couldn't find individual issues we can fix at this time. I'd like to take another stab at this after we land parent interception, which would allow me to fix the benchmark.
Whiteboard: [platform-rel-Dropbox][qf:p1] → [platform-rel-Dropbox][qf:p2]
(I'm moving this to our P3 bucket because this bug in itself is not actionable at this point and depends on other work here)
Priority: P1 → P3
Keywords: perf
Re-evaluating in 2 months when Service Worker rewrite lands.
Whiteboard: [platform-rel-Dropbox][qf:p2] → [platform-rel-Dropbox][qf:p3]
Assignee: catalin.badea392 → nobody

Part of this is most likely that worker code uses the throttled queue for any URL creation (used inside fetch() and the Cache API) and also for initiating network connections.
The throttled queue should be used only for things like postMessage.

Whiteboard: [platform-rel-Dropbox][qf:p3] → [platform-rel-Dropbox][qf?]

Removing qf, since this is just a variant of bug 1522316.

Depends on: 1522316
Whiteboard: [platform-rel-Dropbox][qf?] → [platform-rel-Dropbox]

(In reply to Olli Pettay [:smaug] (PTO-ish Feb 16-23) from comment #18)
> Part of this is most likely that worker code uses the throttled queue for
> any URL creation (used inside fetch() and the Cache API) and also for
> initiating network connections.
> The throttled queue should be used only for things like postMessage.

So, I just looked into this and was confused, because we made http/https URL creation bypass the main thread in bug 1344751. But it seems that bug 1454656 regressed that as part of an attempt to unify/clean up URL code, causing us to consult the main thread in every case. Specifically, the hunk removed at https://hg.mozilla.org/mozilla-central/rev/b5051b2393f2#l5.272 was our fast path.

Component: DOM → DOM: Core & HTML

Given the new information, is this still a P3?

Flags: needinfo?(htsai)
Component: DOM: Core & HTML → DOM: Service Workers

Bug 1558923 could have helped here.

I'll defer to Jens as he is managing the team. :)

Flags: needinfo?(htsai)

So my understanding of the situation is:

  • The DropBox ServiceWorker uses (used?) URL a lot. (Which is reasonable.)
  • Main thread contention was delaying ServiceWorker-initiated fetches.

So we think performance should have improved from the above. However, on Nightly we also think that the enabling of parent-intercept in bug 1456995 may introduce additional latency, which we are tracking in bug 1587759.

In general, we know we will need to focus on addressing ServiceWorker latency problems in the near and medium term. So the big question is whether we have active contacts at Dropbox who would like to work with us on optimizing our performance as it relates to Dropbox specifically.

So, I am going to toggle the needinfo on dzbarsky. We'll resolve WFM if we don't hear back in a while.

Flags: needinfo?(dzbarsky)

Do you have any numbers on DropBox ServiceWorker performance under Firefox 69 release or the current 70 release? Thanks!

Flags: needinfo?(dzbarsky)

Thanks for the update, Andrew.

We experimented with service workers when I first reported this, but ultimately did not end up using them in production (both due to Firefox perf issues and because Chrome didn't let us reliably update the worker script).

Happy to hear you've got improvements coming though!

Flags: needinfo?(dzbarsky)

Okay, I'm going to resolve this WORKSFORME because it seems there are no specific further actions to be taken here, but please feel free to reopen or file a new bug if you perform further ServiceWorker investigations involving Firefox. In the meantime, we know we have a lot of performance enhancements to work on and will be looking to get representative tests added to talos (automated performance regression tests).

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME