Closed Bug 1487227 Opened 2 years ago Closed 2 years ago

Crash in mozilla::dom::ClientSource::WindowExecutionReady hitting MOZ_DIAGNOSTIC_ASSERT(spec.LowerCaseEqualsLiteral("about:blank") || StringBeginsWith(spec, static_cast<const nsLiteralCString&>(nsLiteralCString("" "blob:"))) || nsContentUtils::Storag...

Categories

(Core :: DOM: Service Workers, defect, P1)

Platform: Unspecified
OS: Windows 10
Type: defect

Tracking

Status: RESOLVED DUPLICATE of bug 1493211
Tracking Status: firefox63 --- fixed

People

(Reporter: dbaron, Assigned: asuth)

References

Details

(Keywords: crash, topcrash, Whiteboard: DWS_NEXT)

Crash Data

This bug was filed from the Socorro interface and is
report bp-70192428-acdc-471c-adf6-b4d1f0180829.
=============================================================

There are a bunch of crashes showing up in nightly that are hitting:
MOZ_DIAGNOSTIC_ASSERT(
    spec.LowerCaseEqualsLiteral("about:blank") ||
    StringBeginsWith(spec, static_cast<const nsLiteralCString&>(
        nsLiteralCString("" "blob:"))) ||
    nsContentUtils::StorageAllowedForWindow(aInnerWindow) ==
        nsContentUtils::StorageAccess::eAllow)

(It seems like one user filed 6 of them, and somebody else hit it once.)


Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::ClientSource::WindowExecutionReady dom/clients/manager/ClientSource.cpp:286
1 xul.dll nsGlobalWindowOuter::SetNewDocument dom/base/nsGlobalWindowOuter.cpp:1948
2 xul.dll nsresult nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:1033
3 xul.dll nsDocumentViewer::Init layout/base/nsDocumentViewer.cpp:772
4 xul.dll nsresult nsDocShell::Embed docshell/base/nsDocShell.cpp:6674
5 xul.dll nsDSURIContentListener::DoContent docshell/base/nsDSURIContentListener.cpp:196
6 xul.dll bool nsDocumentOpenInfo::TryContentListener uriloader/base/nsURILoader.cpp:759
7 xul.dll nsDocumentOpenInfo::OnStartRequest uriloader/base/nsURILoader.cpp:306
8 xul.dll void mozilla::net::HttpChannelChild::DoOnStartRequest netwerk/protocol/http/HttpChannelChild.cpp:765
9 xul.dll void mozilla::net::HttpChannelChild::OnStartRequest netwerk/protocol/http/HttpChannelChild.cpp:639

=============================================================
Flags: needinfo?(bugmail)
Priority: -- → P1
Immediate crash stack observations:
- The Firefox 61 release failures seem like legitimate crashes that are not related to this assertion.
- In all of the nightly crashes I've looked at:
  - The controller is set, but we're actually receiving an OnStartRequest via IPC from the parent.
  - There rarely appears to be a ServiceWorker thread active in the child.  This suggests that some SW scope matched but that we reset interception prior to actually spinning up the ServiceWorker.  The most likely explanation for this is our no-fetch optimization.
    - In the no-fetch optimization we'll set the controller at https://searchfox.org/mozilla-central/rev/05d91d3e02a0780f44599371005591d7988e2809/dom/serviceworkers/ServiceWorkerManager.cpp#2200 and reset interception at https://searchfox.org/mozilla-central/rev/2fe43133dbc774dda668d4597174d73e3969181a/dom/serviceworkers/ServiceWorkerPrivate.cpp#1694 (see the sketch after this list).
    - This is notable because it's a very common occurrence when ServiceWorkers are registered for push notifications, and it doesn't involve complicated edge cases like the registration having disappeared.
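
To make the suspected sequence concrete, here's a standalone model of the no-fetch flow.  Every type and function name below is a simplified stand-in I made up for illustration, not the actual ServiceWorkerManager / ServiceWorkerPrivate code at the links above:

#include <iostream>
#include <optional>

struct ServiceWorker {
  bool handlesFetch = false;
};

struct Document {
  std::optional<ServiceWorker> controller;  // the "controlled" mark
};

struct Channel {
  bool intercepted = false;
};

void MaybeInterceptNavigation(Channel& aChannel, Document& aDoc,
                              const ServiceWorker& aMatching) {
  // 1. A registration's scope matched: mark the controller and
  //    intercept (cf. ServiceWorkerManager.cpp#2200 above).
  aDoc.controller = aMatching;
  aChannel.intercepted = true;

  // 2. No-fetch optimization: the active worker has no fetch handler,
  //    so reset interception without spinning up a worker thread
  //    (cf. ServiceWorkerPrivate.cpp#1694 above).
  if (!aMatching.handlesFetch) {
    aChannel.intercepted = false;
    // 3. Suspected bug at this point in the investigation: nothing
    //    clears the controller mark here, so WindowExecutionReady can
    //    later see a controller on a window whose storage check denies.
  }
}

int main() {
  Channel channel;
  Document doc;
  MaybeInterceptNavigation(channel, doc, ServiceWorker{});
  std::cout << "intercepted=" << channel.intercepted
            << " controller=" << doc.controller.has_value() << '\n';
  // Prints: intercepted=0 controller=1
}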


A notable point about the assertion is that it's about us having a controller while storage access is forbidden, which means we shouldn't have a controller.  This means the assertion doesn't care about other logic errors, like us having a controller marked when the document isn't actually being controlled.  (Although we may have an assertion about that somewhere?)  In practice, it appears that we could screw this up when the no-fetch optimization is involved and just continually bounce things off the no-fetch ServiceWorker.  Ugh, which could explain a performance complaint that reached me via the grapevine that seemed like it shouldn't happen because of our no-fetch optimization...

Anyway, the point is also that we do a StorageAllowedForChannel check when we get the navigation request, and that should avoid this problem... BUT... there is an asymmetry in the decision-making process between nsContentUtils::StorageAllowedForWindow, used by the assertion, and nsContentUtils::StorageAllowedForChannel(aChannel), used by ServiceWorkerInterceptController::ShouldPrepareForIntercept.

StorageAllowedForChannel passes aWindow=null where StorageAllowedForWindow does not, and this results in the following deviations:
- InternalStorageAllowedForPrincipal does an aWindow-dependent SandboxFlags SANDBOXED_ORIGIN check.  (It also does a privatebrowsing document check, but that's the legacy check that would only deviate for system principal windows marked as private browsing.  The OriginAttribute checks should already be good enough for our purposes.)
- StorageDisabledByAntiTracking gets invoked with the window and it...
  - farms out a decision to StorageDisabledByAntiTrackingInternal that will return disabled=true if !AntiTrackingCommon::IsFirstPartyStorageAccessGrantedFor(aWindow,...)
    - AntiTrackingCommon::IsFirstPartyStorageAccessGrantedFor does a ton of stuff.

So there does seem to be some appreciable deviation here.
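
To make the asymmetry concrete, here's a standalone model of the two entry points.  Everything below is a simplified stand-in (the real logic lives in nsContentUtils and is considerably more involved), but it captures why the channel path can say "allow" while the window path says "deny":

#include <cstdint>
#include <iostream>

enum class StorageAccess { eAllow, eDeny };

constexpr uint32_t SANDBOXED_ORIGIN = 1 << 0;  // mirrors the real flag's role

struct Window {
  uint32_t sandboxFlags = 0;
  bool disabledByAntiTracking = false;  // stand-in for the anti-tracking verdict
};

StorageAccess InternalStorageAllowedForPrincipal(const Window* aWindow) {
  if (aWindow) {
    // Only the window path reaches these two checks.
    if (aWindow->sandboxFlags & SANDBOXED_ORIGIN) {
      return StorageAccess::eDeny;
    }
    if (aWindow->disabledByAntiTracking) {
      return StorageAccess::eDeny;
    }
  }
  // ... common principal / OriginAttributes checks elided ...
  return StorageAccess::eAllow;
}

// Window path: what the MOZ_DIAGNOSTIC_ASSERT consults.
StorageAccess StorageAllowedForWindow(const Window& aWindow) {
  return InternalStorageAllowedForPrincipal(&aWindow);
}

// Channel path: what ShouldPrepareForIntercept consults; aWindow is null.
StorageAccess StorageAllowedForChannel() {
  return InternalStorageAllowedForPrincipal(nullptr);
}

int main() {
  Window w;
  w.disabledByAntiTracking = true;  // e.g. a tracking-protection hit
  // Channel path says allow, so we intercept and mark a controller...
  std::cout << (StorageAllowedForChannel() == StorageAccess::eAllow) << '\n';  // 1
  // ...but the window path says deny, so the assertion later fires.
  std::cout << (StorageAllowedForWindow(w) == StorageAccess::eAllow) << '\n';  // 0
}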

So the general theory here would be:
- In at least some situations we fail to clear the controller when we reset interception.  This might be specific to interception resetting prior to actually dispatching the fetch event.
- This means no-fetch optimized sites are impacted.
- Tracking protection is involved, which narrows the set of impacted sites further.  It depends on tracking protection being enabled (which is only enabled by default in private browsing, at least prior to our recent announcement, and ServiceWorkers are categorically denied in private browsing mode, so we'd never experience those scenarios there).  It's also quite possible that many users who enable tracking protection outside of private browsing set other preferences that disable ServiceWorkers before tracking protection even gets involved.


The obvious next steps for my investigation are:
- Ensure that our no-fetch optimization is properly clearing the controller.
- Consider landing a patch that gives us an additional bit of entropy: whether the channel-check and window-check for storage agree or not (see the sketch after this list).  This would explain a lot, but it still wouldn't explain why we have a controller marked when we no longer should.
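
A minimal sketch of what that extra bit could look like, assuming we simply split the existing assert in two so the crash signature's line number encodes which case fired.  The function and parameter names are hypothetical, not an actual patch; MOZ_DIAGNOSTIC_ASSERT is modeled with plain assert here:

#include <cassert>

#define MOZ_DIAGNOSTIC_ASSERT(expr) assert(expr)

enum class StorageAccess { eAllow, eDeny };

// Hypothetical split of the single existing assert into two, so crash
// reports tell us whether the window-check and channel-check agreed.
void CheckControllerStorageAgreement(StorageAccess aWindowResult,
                                     StorageAccess aChannelResult) {
  const bool windowAllows = aWindowResult == StorageAccess::eAllow;
  const bool channelAllows = aChannelResult == StorageAccess::eAllow;
  // Fires only when both checks denied storage: the controller should
  // never have been marked at all.
  MOZ_DIAGNOSTIC_ASSERT(windowAllows || channelAllows);
  // Fires when the checks disagree: the channel path allowed
  // interception but the window path forbids storage.
  MOZ_DIAGNOSTIC_ASSERT(windowAllows);
}

int main() {
  // The disagreement case: a diagnostic build would abort at the
  // second assert's line, giving us the extra bit of signal.
  CheckControllerStorageAgreement(StorageAccess::eDeny, StorageAccess::eAllow);
}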

Leaving the needinfo up on myself for the next steps.
QA Whiteboard: DWS_NEXT
Whiteboard: DWS_NEXT
P1: What's the status on this bug?
The debugger confirms that we're not clearing the controller when we reset interception, which is indeed problematic for navigations in general.  (This is from our dom/serviceworkers/test/test_nofetch_handler.html test, confirming that at WindowExecutionReady time we still have a controller marked after invoking ResetInterception, and we end up with the same IPC frames on the stack.)

I need to do a little more investigation / reproduction as it relates to the asymmetry in storage-allowed-checking.  It looks likely that the existing tests for third-party-cookie settings at https://searchfox.org/mozilla-central/source/dom/serviceworkers/test/test_third_party_iframes.html could be extended to cover whatever's going on.  I should have a test and fix on Wednesday.
Flags: needinfo?(bugmail)
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Adding top crash keyword since it is in the top 3 in nightly.
Keywords: topcrash
Duplicate of this bug: 1491389
If it helps, I encounter this bug consistently when trying to play embedded Twitter videos on this Web page with third-party cookies and 'slow loading' trackers disabled:
http://www.bbc.co.uk/programmes/articles/57f0LhFBpYpbzvTzXTjvklC/7-ways-strictly-week-one-made-a-splash-on-social
Yes, it does, thank you.  I also got a good STR in bug 1491389 but the stumbling block has been in adding automated test coverage.  It appears there may be an answer in bug 1493211 as to why I was not able to reproduce the crash.  Specifically, that tracking protection breaks Service Worker interception when nested iframes are involved, which is how the test itself is structured.  We should be able to resolve this shortly.
FWIW I suspect the root cause here may be the imbalance between the window and channel paths that I mentioned in bug 1493211 comment 13...
(In reply to :Ehsan Akhgari from comment #8)
> FWIW I suspect the root cause here may be the imbalance between the window
> and channel paths that I mentioned in bug 1493211 comment 13...

If my guess here is correct, this should fix the crash: https://hg.mozilla.org/integration/autoland/rev/b69245da928a
(In reply to :Ehsan Akhgari from comment #9)
> If my guess here is correct, this should fix the crash:
> https://hg.mozilla.org/integration/autoland/rev/b69245da928a

This crash did go away starting with the build that that commit landed in.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Depends on: 1493211
Resolution: --- → FIXED
Resolution: FIXED → DUPLICATE
Duplicate of bug: 1493211
For posterity, my mentions in comment 1 and comment 3 about us improperly resetting interception are incorrect.  ServiceWorkerContainer.controller should still be set to the matching ServiceWorker per step 12.6 of "Handle Fetch", even if we don't dispatch a fetch event or FetchEvent.respondWith is not invoked.  The higher-level section 2.4 (https://w3c.github.io/ServiceWorker/#selection) covers this well; selection is independent of the fetch event.
The crash stats on this seem to suggest this wasn't actually fixed? (or regressed?)

My friend seems to be running into this on linux: https://crash-stats.mozilla.com/report/index/be0e63ae-56bc-4ccc-ae24-e6f2a0181028#tab-details
(In reply to Alexis Beingessner [:Gankro] from comment #14)
> The crash stats on this seems to suggest this wasn't actually fixed? (or
> regressed?)
> 
> My friend seems to be running into this on linux:
> https://crash-stats.mozilla.com/report/index/be0e63ae-56bc-4ccc-ae24-
> e6f2a0181028#tab-details

This was just fixed on nightly on the 25th, and it appears from the crash report he is running a build from the 23rd. Please have him download a build after the 25th. Thanks.
(Note that the final fix for this crash landed in bug 1493211.)
(In reply to :Ehsan Akhgari from comment #16)
> (Note that the final fix for this crash landed in bug 1493211.)

Err, sorry, bug 1499995.

Also note https://bugzilla.mozilla.org/show_bug.cgi?id=1499995#c44; this may actually not be completely fixed yet...