Warn and gather data if ServiceWorker hits max workers per domain limit

RESOLVED FIXED in Firefox 39

Status

()

Core
DOM: Workers
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: nsm, Assigned: jaoo, Mentored)

Tracking

unspecified
mozilla39
x86_64
Linux
Points:
---

Firefox Tracking Flags

(firefox39 fixed)

Details

(Whiteboard: [good first bug][lang=C++])

Attachments

(1 attachment, 6 obsolete attachments)

This should be a fairly easy bug requiring adding one telemetry probe and some warnings to RuntimeService::RegisterWorker().

1) Warn to window console or browser console (when no window) if a ServiceWorker spawn gets queued due to hitting max workers per domain limit.
2) Add a Telemetry probe to count when a ServiceWorker hits the limit. We may also need another probe to count that a SW was spawned, so we get a percentage of failures, unless metrics has some other way to determine this.

Andrea, would you be willing to mentor since step 1 is the big piece and you know about it?
Blocks: 1059784
Flags: needinfo?(amarchesini)
Whiteboard: [good first bug][lang=C++]
(Assignee)

Comment 1

3 years ago
Go for it!
Assignee: nobody → josea.olivera

Updated

3 years ago
Status: NEW → ASSIGNED
Mentor: amarchesini@mozilla.com
Flags: needinfo?(amarchesini)
(Assignee)

Comment 2

3 years ago
Created attachment 8576971 [details] [diff] [review]
WIP

Just show a warning in the web console when a ServiceWorker spawn gets queued.
(Assignee)

Comment 3

3 years ago
Created attachment 8577501 [details] [diff] [review]
WIP

Added telemetry probes for service worker spawns. I'm planing to add a couple of more probes for the number of domains and the number of domains hitting the limit.
Attachment #8576971 - Attachment is obsolete: true
(Assignee)

Comment 4

3 years ago
Created attachment 8578731 [details] [diff] [review]
WIP

(In reply to José Antonio Olivera Ortega [:jaoo] from comment #3)
> I'm planing to add a
> couple of more probes for the number of domains and the number of domains
> hitting the limit.

Probes for the hit-the-limit thing I commented in the last version of the patch.
(Assignee)

Updated

3 years ago
Attachment #8577501 - Attachment is obsolete: true
(Assignee)

Comment 5

3 years ago
Comment on attachment 8578731 [details] [diff] [review]
WIP

Andrea, I would be nice to have some feedback at this point. Would you mind to have a look a it? Thanks!

PS. When you are back from your holidays! ;)
Attachment #8578731 - Flags: feedback?(amarchesini)
Comment on attachment 8578731 [details] [diff] [review]
WIP

Review of attachment 8578731 [details] [diff] [review]:
-----------------------------------------------------------------

Patch wide fixes:
1) You need to call telemetry only when it is a service worker, and not for both service and shared workers.
2) Always AssertIsOnMainThread() before calling Telemetry, unless the containing block already does so.

::: dom/workers/RuntimeService.cpp
@@ +1420,5 @@
>                                   aWorkerPrivate->IsServiceWorker();
>    if (isSharedOrServiceWorker) {
>      AssertIsOnMainThread();
>  
> +    Telemetry::Accumulate(Telemetry::SERVICE_WORKER_SPAWN_ATTEMPS, 1);

Only for serviceworkers (use aWorkerPrivate->IsServiceWorker()), don't count shared workers.

@@ +1455,5 @@
>  
>        domainInfo = new WorkerDomainInfo();
>        domainInfo->mDomain = domain;
> +      if (isSharedOrServiceWorker && !domainInfo->mDomainTracked) {
> +        Telemetry::Accumulate(Telemetry::DOMAIN_WITH_SERVICE_WORKERS, 1);        

Nit: Whitespace.

@@ +1473,5 @@
> +        // ServiceWorker spawn gets queued due to hitting max workers per domain
> +        // limit so let's log a warning.
> +        // Note: aWorkerPrivate->GetDocument() call might result nullptr due to
> +        // no window so the message warning will show up in the browser console.
> +        nsContentUtils::ReportToConsole(nsIScriptError::warningFlag,

Before you call this, AssertIsOnMainThread().

@@ +1489,5 @@
>      else if (parent) {
>        domainInfo->mChildWorkerCount++;
> +      if (isSharedOrServiceWorker) {
> +        Telemetry::Accumulate(Telemetry::SERVICE_WORKER_WAS_SPAWNED, 1);
> +      }

SWs can never have a parent (yet), so this block is not needed

@@ +1495,5 @@
>      else {
>        domainInfo->mActiveWorkers.AppendElement(aWorkerPrivate);
> +      if (isSharedOrServiceWorker) {
> +        Telemetry::Accumulate(Telemetry::SERVICE_WORKER_WAS_SPAWNED, 1);
> +      }

If you put this at the end of RegisterWorker, right before |return true;| then you will only count workers that really did get a thread created for them, which is what we want.

@@ +1515,5 @@
>    }
>  
>    // From here on out we must call UnregisterWorker if something fails!
> +  // TODO: Should we decrease the counters of service workers spawns if the
> +  // worker ends up unregistered?

You don't need this if you move the probe as suggested above.

::: dom/workers/RuntimeService.h
@@ +47,5 @@
>      nsTArray<WorkerPrivate*> mQueuedWorkers;
>      nsClassHashtable<nsCStringHashKey, SharedWorkerInfo> mSharedWorkerInfos;
>      uint32_t mChildWorkerCount;
> +    bool mDomainTracked;
> +    bool mDomainHitMaxWorkersLimitTracked;

You set these to true but never use them. Is there a follow up? I don't think you need to preserve this state.

::: toolkit/components/telemetry/Histograms.json
@@ +7678,5 @@
>      "expires_in_version": "never",
>      "kind": "count",
>      "description": "Count plugin hang notices in e10s"
> +  },
> +  "SERVICE_WORKER_SPAWN_ATTEMPS": {

Nit: ATTEMPTS

@@ +7681,5 @@
> +  },
> +  "SERVICE_WORKER_SPAWN_ATTEMPS": {
> +    "expires_in_version": "never",
> +    "kind": "count",
> +    "description": "Tracking ServiceWorker spawn attemps"

description should be "Count attempts to spawn a ServiceWorker for a domain"
Attachment #8578731 - Flags: feedback?(amarchesini)
(Assignee)

Comment 7

3 years ago
Created attachment 8579318 [details] [diff] [review]
WIP

(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #6)
> Comment on attachment 8578731 [details] [diff] [review]
> WIP
> 
> Review of attachment 8578731 [details] [diff] [review]:
> -----------------------------------------------------------------

Thanks for the review!

> Patch wide fixes:
> 1) You need to call telemetry only when it is a service worker, and not for
> both service and shared workers.
> 2) Always AssertIsOnMainThread() before calling Telemetry, unless the
> containing block already does so.

Done!

> ::: dom/workers/RuntimeService.h
> @@ +47,5 @@
> >      nsTArray<WorkerPrivate*> mQueuedWorkers;
> >      nsClassHashtable<nsCStringHashKey, SharedWorkerInfo> mSharedWorkerInfos;
> >      uint32_t mChildWorkerCount;
> > +    bool mDomainTracked;
> > +    bool mDomainHitMaxWorkersLimitTracked;
> 
> You set these to true but never use them. Is there a follow up? I don't
> think you need to preserve this state.

I use them. There is no such follow up work. They are useful to have a way to avoid reporting twice the same domain hit the limit.

I'll ping you today so we could go over the patch and see whether this domain probes are useful or no.
Attachment #8578731 - Attachment is obsolete: true
Comment on attachment 8579318 [details] [diff] [review]
WIP

Review of attachment 8579318 [details] [diff] [review]:
-----------------------------------------------------------------

The reason I'm not in favour of tracking it across domains is that it doesn't give us much more useful information.
We measure "X SWs hit the limit out of Y attempts". "A domains hit the limit out of Z domains" does not add much to it. Also, using booleans like this is wrong because:
1) user visits example.com, spawns various workers flips counts and booleans etc. So telemetry is bumped up once.
2) user closes all tabs for example.com, at this point runtime service removes the domain info from domain map
3) step 1 happens again.

Which means we measure how many times the domaininfo is created and then sw is spawned and hits the limit. the domain infos don't persist, another reason why I think having this isn't useful.

::: dom/workers/RuntimeService.cpp
@@ +1458,5 @@
>        NS_ASSERTION(!parent, "Shouldn't have a parent here!");
>  
>        domainInfo = new WorkerDomainInfo();
>        domainInfo->mDomain = domain;
> +      if (isServiceWorker && !domainInfo->mDomainTracked) {

mDomainTracked will always be false here since it was just created.

::: dom/workers/RuntimeService.h
@@ +46,5 @@
>      nsTArray<WorkerPrivate*> mActiveWorkers;
>      nsTArray<WorkerPrivate*> mQueuedWorkers;
>      nsClassHashtable<nsCStringHashKey, SharedWorkerInfo> mSharedWorkerInfos;
>      uint32_t mChildWorkerCount;
> +    bool mDomainTracked;

if you decide to keep it, rename to mTelemetryUsesServiceWorker.

@@ +47,5 @@
>      nsTArray<WorkerPrivate*> mQueuedWorkers;
>      nsClassHashtable<nsCStringHashKey, SharedWorkerInfo> mSharedWorkerInfos;
>      uint32_t mChildWorkerCount;
> +    bool mDomainTracked;
> +    bool mDomainHitMaxWorkersLimitTracked;

similarly, mTelemetryHitMaxWorkersLimit.
(Assignee)

Comment 9

3 years ago
Created attachment 8581514 [details] [diff] [review]
v1

(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #8)
> Comment on attachment 8579318 [details] [diff] [review]
> WIP
> 
> Review of attachment 8579318 [details] [diff] [review]:
> -----------------------------------------------------------------

Thanks for explain your reason in detail.

> The reason I'm not in favour of tracking it across domains is that it
> doesn't give us much more useful information.

Let's remove the domain related probes then.

I guess this's ready for review. Would you mind please? Thanks!
Attachment #8579318 - Attachment is obsolete: true
Attachment #8581514 - Flags: review?(nsm.nikhil)
Comment on attachment 8581514 [details] [diff] [review]
v1

Review of attachment 8581514 [details] [diff] [review]:
-----------------------------------------------------------------

looks good. Needs worker peer review.

::: dom/locales/en-US/chrome/dom/dom.properties
@@ +160,5 @@
>  # LOCALIZATION NOTE: Do not translate "IndexedDB".
>  IndexedDBTransactionAbortNavigation=An IndexedDB transaction that was not yet complete has been aborted due to page navigation.
>  # LOCALIZATION NOTE (WillChangeBudgetWarning): Do not translate Will-change, %1$S,%2$S,%3$S are numbers.
>  WillChangeBudgetWarning=Will-change memory consumption is too high. Surface area covers %1$S pixels, budget is the document surface area multiplied by %2$S (%3$S pixels). All occurences of will-change in the document are ignored when over budget.
> +HittingMaxWorkersPerDomain=ServiceWorker spawn gets queued due to hitting max workers per domain limit.

Nit: s/gets/got
Attachment #8581514 - Flags: review?(nsm.nikhil) → review?(amarchesini)
Comment on attachment 8581514 [details] [diff] [review]
v1

Review of attachment 8581514 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm but before landing I want to know if we really need these 2 additional telemetry logs.

::: dom/workers/RuntimeService.cpp
@@ +27,5 @@
>  #include "jsfriendapi.h"
>  #include "mozilla/ArrayUtils.h"
>  #include "mozilla/CycleCollectedJSRuntime.h"
>  #include "mozilla/TimeStamp.h"
> +#include "mozilla/Telemetry.h"

alphabetic order.

@@ +1418,5 @@
>  
> +  bool isServiceWorker = aWorkerPrivate->IsServiceWorker();
> +  if (isServiceWorker) {
> +    AssertIsOnMainThread();
> +    Telemetry::Accumulate(Telemetry::SERVICE_WORKER_SPAWN_ATTEMPTS, 1);

I don't think we want this. Have you spoken with nsm about this?

@@ +1551,5 @@
>    }
>  
> +  if (isServiceWorker) {
> +    AssertIsOnMainThread();
> +    Telemetry::Accumulate(Telemetry::SERVICE_WORKER_WAS_SPAWNED, 1);

Same for this.

::: toolkit/components/telemetry/Histograms.json
@@ +7688,5 @@
> +    "expires_in_version": "never",
> +    "kind": "count",
> +    "description": "Count ServiceWorkers that really did get a thread created for them"
> +  },
> +  "SERVICE_WORKER_SPAWN_GETS_QUEUED": {

Actually, this is the only telemetry is required for this bug, right?
The other 2 are not related to this bug: I don't think we want to know how many SWs are created.
Attachment #8581514 - Flags: review?(amarchesini) → review+
> looks good. Needs worker peer review.

nsm, what do you think about these 2 additional telemetry logs? attempt/spawned?
Comment on attachment 8581514 [details] [diff] [review]
v1

Review of attachment 8581514 [details] [diff] [review]:
-----------------------------------------------------------------

bent, we need a review from a worker peer. The patch looks good to me.
Attachment #8581514 - Flags: feedback?(bent.mozilla)
(In reply to Andrea Marchesini (:baku) from comment #12)
> > looks good. Needs worker peer review.
> 
> nsm, what do you think about these 2 additional telemetry logs?
> attempt/spawned?
Flags: needinfo?(nsm.nikhil)
(In reply to Andrea Marchesini (:baku) from comment #12)
> > looks good. Needs worker peer review.
> 
> nsm, what do you think about these 2 additional telemetry logs?
> attempt/spawned?

I'd like to have it in to get a good idea of the relative number of SWs that get queued. That way, if we see a large percentage of queues, we will have to look at alternate strategies for SWs like a thread pool and aggressive termination. If it is fairly small, we can just tweak the limits a little.

:jaoo feel free to land once :bent approves. Thanks for the patch!
Flags: needinfo?(nsm.nikhil)
Comment on attachment 8581514 [details] [diff] [review]
v1

Review of attachment 8581514 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, thanks! I just have two suggestions:

::: dom/locales/en-US/chrome/dom/dom.properties
@@ +160,5 @@
>  # LOCALIZATION NOTE: Do not translate "IndexedDB".
>  IndexedDBTransactionAbortNavigation=An IndexedDB transaction that was not yet complete has been aborted due to page navigation.
>  # LOCALIZATION NOTE (WillChangeBudgetWarning): Do not translate Will-change, %1$S,%2$S,%3$S are numbers.
>  WillChangeBudgetWarning=Will-change memory consumption is too high. Surface area covers %1$S pixels, budget is the document surface area multiplied by %2$S (%3$S pixels). All occurences of will-change in the document are ignored when over budget.
> +HittingMaxWorkersPerDomain=ServiceWorker spawn gets queued due to hitting max workers per domain limit.

Nit: I think this would be better:

"A ServiceWorker could not be started immediately because other documents in the same origin are already using the maximum number of workers. The ServiceWorker is now queued and will be started after some of the other workers have completed."

And please mark with

  # LOCALIZATION NOTE: Do not translate "ServiceWorker".

so that our localizers know that "ServiceWorker" is something special.

::: dom/workers/RuntimeService.cpp
@@ +1415,5 @@
>    }
>  
>    nsCString sharedWorkerScriptSpec;
>  
> +  bool isServiceWorker = aWorkerPrivate->IsServiceWorker();

Nit: Make this const
Attachment #8581514 - Flags: feedback?(bent.mozilla) → feedback+
Created attachment 8583046 [details] [diff] [review]
v2

Addresses review comments from comment 11 and comment 16.

Ben, would mind to have a look a it again please? Thank!

PS. Should we request review at someone from the telemetry team?
Attachment #8581514 - Attachment is obsolete: true
Attachment #8583046 - Flags: review?(bent.mozilla)
Attachment #8583046 - Flags: review?(bent.mozilla) → review+
(In reply to José Antonio Olivera Ortega [:jaoo] from comment #17)
> PS. Should we request review at someone from the telemetry team?

If you'd like, but I think this is fine. Thanks!
Comment on attachment 8583046 [details] [diff] [review]
v2

I'd like :rvitillo to have a look at the patch so he could let's know if this is ready to go.
Attachment #8583046 - Flags: review?(rvitillo)
Ben, could you confirm that this probe should never expire? Just making sure "never" is not misused as a default value.
Flags: needinfo?(bent.mozilla)
I went ahead and pushed this to try. Find the results at https://treeherder.mozilla.org/#/jobs?repo=try&revision=9899f1f39e47 please.
(In reply to rvitillo@mozilla.com from comment #20)
> Ben, could you confirm that this probe should never expire? Just making sure
> "never" is not misused as a default value.

Good point, something like 50 should be reasonable.
Flags: needinfo?(bent.mozilla)
Created attachment 8583305 [details] [diff] [review]
v3

Setting expire to gecko 50.


Carrying out r=bent.

Thanks!
Attachment #8583046 - Attachment is obsolete: true
Attachment #8583046 - Flags: review?(rvitillo)
Attachment #8583305 - Flags: review?(rvitillo)
Attachment #8583305 - Flags: review?(rvitillo) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b3c166fa508a
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/3604d6055fe5
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/3604d6055fe5
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox39: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.