Open Bug 1663512 Opened 4 years ago Updated 1 year ago

Crash in [@ mozilla::dom::RemoteWorkerManager::GetRemoteType]

Categories

(Core :: DOM: Service Workers, defect, P3)

defect

Tracking

()

ASSIGNED

People

(Reporter: gsvelto, Assigned: rpl)

References

Details

(Keywords: crash, leave-open, regression)

Crash Data

Attachments

(4 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/d96a07ff-1576-4fb6-b73d-cc9900200906

Top 10 frames of crashing thread:

0 xul.dll static mozilla::dom::RemoteWorkerManager::GetRemoteType dom/workers/remoteworkers/RemoteWorkerManager.cpp:157
1 xul.dll static mozilla::dom::RemoteWorkerManager::IsRemoteTypeAllowed dom/workers/remoteworkers/RemoteWorkerManager.cpp:223
2 xul.dll mozilla::dom::RemoteWorkerChild::ExecWorkerOnMainThread dom/workers/remoteworkers/RemoteWorkerChild.cpp:321
3 xul.dll mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/dom/workers/remoteworkers/RemoteWorkerChild.cpp:297:17'>::Run xpcom/threads/nsThreadUtils.h:577
4 xul.dll mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:146
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:512
6 xul.dll mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:86:7'>::Run xpcom/threads/nsThreadUtils.h:577
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1234
8 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:109
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:327

This crash appears to be a regression introduced most likely with bug 1568597. Luca can you have a look?

Flags: needinfo?(lgreco)

We are clearly triggering this diagnostic assert: https://searchfox.org/mozilla-central/rev/76a83d0a218837ba6937d6a0fac51cb0008c2334/dom/workers/remoteworkers/RemoteWorkerManager.cpp#157-158
and so E10SUtils.getRemoteTypeForWorkerPrincipal did throw an exception.

On non-nightly channel the diagnostic assert wouldn't be triggered and so it wouldn't crash, but the remote worker being launched would be aborted and so it would still be an issue on non nightly channels.

It would be interesting to know which error condition we are triggering in those calls to E10SUtils.getRemoteTypeForWorkerPrincipal, e.g. to double-check if any of our assumptions in that method isn't actually met, but unfortunately we are not currently including in the diagnostic assert message enough details to determine that.

The following are some of the scenarios that would make E10SUtils.getRemoteTypeForWorkerPrincipal to throw for sure:

I think that some reasonable next steps to handle this could be:

  • adding some additional details to the diagnostic assert to help us to better assess what scenario is making us to trigger this diagnostic assertion (e.g. a subset of the details related to the call parameters, like the principal type and the URI scheme, the worker type and the preferred remote type may be enough).

  • bring back the old RemoteWorkerManager::GetRemoteType and switch between the old and new logic based on a mirror: once static pref, and keep the new logic enabled on Nightly only for now, until we have this figured out and we are confident that we can let the new logic to ride the train (especially given that, based on the telemetry environment related to the crash reports I did look into, the diagnostic assert is being triggered with fission disabled).

I'm needinfo-ing Andrew to let him know about this issue, and to get his opinion about the issue and the proposed next steps described above.

Flags: needinfo?(lgreco) → needinfo?(bugmail)

Another details that is also worth to mention is that the diagnostic assertion is being triggered in the child process when we are double-checking that the remote worker should be allowed to run in that child process, which means that RemoteWorkerManager::GetRemoteType didn't fail when we have previously called it to get the remoteType where the remote worker should be launched.

Assignee: nobody → lgreco
Status: NEW → ASSIGNED

Clearing pending needinfo (:asuth and I already discussed about this over slack and zoom and we agreed to proceed with adding some additional details to the diagnostic assert).

Flags: needinfo?(bugmail)
Attachment #9174465 - Flags: data-review?(chutten)
Keywords: leave-open

Updated the answer to the question 6 of the data review request after discussing about it with :chutten over slack.

Attachment #9174465 - Attachment is obsolete: true
Attachment #9174465 - Flags: data-review?(chutten)
Attachment #9174479 - Flags: data-review?(chutten)

Comment on attachment 9174479 [details]
request_bug1663512_remoteworkermanager_diagnostic_assert.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Luca Greco is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for pre-release channels only.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9174479 - Flags: data-review?(chutten) → data-review+
Pushed by luca.greco@alcacoop.it:
https://hg.mozilla.org/integration/autoland/rev/6b7d0d5a2c25
Add details to RemoteWorkerManager::GetRemoteType diagnostic assert to aid investigating issues. r=asuth
Regressions: 1664433

Aggregating on crashes, the bulk of crashes seem to have an assertion of type E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=web. These all appear to be calls to RemoteWorkerChild::ExecWorkerOnMainThread made from the main thread calling into RemoteWorkerManager::IsRemoteTypeAllowed.

One crash had the more initially concerning failure E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=parent. But when looking at it, it's not a security issues, it's that when loading the registration at startup we're calling GetRemoteType from within the ServiceWorkerPrivateImpl::Initialize invoked by the ServiceWorkerPrivate constructor and it's throwing.

Flags: needinfo?(lgreco)

This patch include some additional changes to the diagnostic assert to be able to include:

  • the error name for the E10SUtils call failure
  • the file name and line number that originated the error

The change itself does not include a new automated test, but I did verify locally the new expected crash message:

  • by temporarily a js error and a Components.Exception to be thrown while running one of the existing tests (dom/workers/test/xpcshell/test_remoteworker_launch_new_process.js).
  • and also temporarily forcing the failure while the method is being called in both the parent and child processes.

TODO:

  • double-check with asuth that this does cover the addition to the diagnostic assert that we agreed on
  • double-check with asuth that the approach used to retrieve the file and line number from the exception is the right one
    (and/or if there is any existing helper that would cover part of what being done in this patch)
  • new data review request to get an explicit sign-off about the new details that this change would include in the crash details.

Andrew and I did meet over zoom and discussed about possible next steps a couple of weeks ago.

We agreed that the information currently available in the diagnostic assert are not yet enough to figure out what is triggering the failures and that the next step was to look into adding to the diagnostic assert at least the line number that did trigger the issue.

The patch attached in comment 11 includes the changes needed to add those additional details.

As described in the commit message, a new review request is going to be needed before landing that change, I'll create it after having double-checked with asuth the attached patch from his perspective.

Flags: needinfo?(lgreco)

Hi Chris,
this new attachment contains a new data review request for the additional details that the patch attached as https://bugzilla.mozilla.org/attachment.cgi?id=9186948 to this bug would collect as part of the diagnostic assert.

Attachment #9187216 - Flags: data-review?(chutten)
Severity: -- → S3
Priority: -- → P3

Comment on attachment 9187216 [details]
request_bug1663512_remoteworkermanager_diagnostic_assert_part2.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, :rpl and :asuth are responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for pre-release channels only.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9187216 - Flags: data-review?(chutten) → data-review+
Pushed by luca.greco@alcacoop.it:
https://hg.mozilla.org/integration/autoland/rev/182e7b68fb7e
Add error name and source file location to RemoteWorkerManager::GetRemoteType diagnostic assert to aid investigating issues. r=asuth

I just took a look into some more crash reports that do include the new additional details (error code name and error location if available)

The more useful ones seems the two crash reports with error code NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS, which does seem to be throwing when E10SUtils is creating a DOM console instance here:

That should be happening when E10SUtils is going to emit a log for the first time and so it may be triggered by one of the class to log from this searchfox query:

Follows a summary of the crash report that I did look into.


Two content process crashes from a recent 85 nightly are triggered by an error with code
NS_ERROR_FAILURE:

Three content process crashed on beta (84.0b2):

The leave-open keyword is there and there is no activity for 6 months.
:rpl, maybe it's time to close this bug?

Flags: needinfo?(lgreco)

The patches landed from this bug did add some more details (whenever possible) to the crash, but the crash is still being tracked and so leave-open seems to be still appropriate.

Flags: needinfo?(lgreco)

On 91.0a1 I see the following infos now:

E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=web, errorName=NS_ERROR_FAILURE, errorLocation=(unknown):0
E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=web, errorName=NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS, errorLocation=resource://gre/modules/E10SUtils.jsm:716
E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=web, errorName=NS_ERROR_XPC_JS_THREW_STRING, errorLocation=:0
E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=webIsolated, errorName=NS_ERROR_FAILURE, errorLocation=(unknown):0
E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=webIsolated, errorName=NS_ERROR_XPC_JS_THREW_STRING, errorLocation=:0

Does this help, Luca?

Flags: needinfo?(lgreco)

(In reply to Jens Stutte [:jstutte] from comment #20)

On 91.0a1 I see the following infos now:

E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=web, errorName=NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS, errorLocation=resource://gre/modules/E10SUtils.jsm:716

Does this help, Luca?

Hi Jens, do you recall from which crash report this one was coming from?
I'd like to take a look to what line 716 in E10SUtils.jsm was doing in the same mercurial revision the build that send the report was build from (e.g. to double-check if that could give us any useful information, the one I did mention in comment 17 was a call to console.createInstance).

E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=webIsolated, errorName=NS_ERROR_XPC_JS_THREW_STRING, errorLocation=:0

I was taking another look into what may be triggering this one, and I'm wondering if we may be hitting the call at the end of at the end of XPCConvert::JSValToXPCException which converts the given jsval into a string.

Flags: needinfo?(lgreco) → needinfo?(jstutte)

(In reply to Jens Stutte [:jstutte] from comment #22)

https://crash-stats.mozilla.org/report/index/d4c44abf-6ca5-46eb-b939-bbe0a0210606

Thanks Jens,
based on the crash report, the line that is originating the expection should be:
https://hg.mozilla.org/mozilla-central/file/b5eb71cf51d58da1616bf50fd04fcaa860494df4/toolkit/modules/E10SUtils.jsm#l716

I'm not sure yet in which conditions that line may be the one throwing the exception, I was expecting to be line 720.

The only things I can think of at the moment if there are cases that may make accessing Ci or Ci.nsIE10SUtils to be throwing (given that line 716 is doing a aWorkerType === Ci.nsIE10SUtils.REMOTE_WORKER_TYPE_SERVICE check).

Flags: needinfo?(lgreco)

The leave-open keyword is there and there is no activity for 6 months.
:rpl, maybe it's time to close this bug?

Flags: needinfo?(lgreco)

The leave-open keyword is there and there is no activity for 6 months.
:rpl, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(lgreco)

There is one last instance in beta 103, saying:

E10SUtils.getRemoteTypeForWorkerPrincipal did throw: workerType=service, principal=https, preferredRemoteType=web, processRemoteType=parent, errorName=NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS, errorLocation=resource://gre/modules/XPCOMUtils.sys.mjs:375 -

I am having a hard time to find the right revision for XPCOMUtils.sys.mjs, I can see only XPCOMUtils.jsm in that tree (I am running a build to see if it is generated). Edit: It seems bug 1777486 migrated this, so this is puzzling actually.

Flags: needinfo?(lgreco)
See Also: → 1818569
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: