Closed Bug 1662925 Opened 4 years ago Closed 4 years ago

Clicking on a twitter link inside of a google doc opens an about:blank page without loading the twitter url

Categories

(Core :: DOM: Service Workers, defect)

defect

Tracking

()

RESOLVED FIXED
83 Branch
Fission Milestone M6b
Tracking Status
firefox-esr78 --- unaffected
firefox82 --- fixed
firefox83 --- fixed

People

(Reporter: annyG, Assigned: asuth)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(3 files, 2 obsolete files)

This is with Fission enabled and I came across this issue quite a few times.

Steps: (ensure fission is enabled)

  1. Open https://docs.google.com/document/d/1WDpmoJnK_pPrWposjWfkZWSA1gzB4uC2Ft7_ehyWN18/edit
  2. Wait until the document has loaded, try clicking on the twitter link a few times (close the twitter tabs after you open them)
  3. If you haven't succeeded yet, try opening a few other sites, e.g. I've been opening a few https://web.telegram.org/ pages VERY QUICKLY (copy pasting the url and hitting enter fast) and closing them after they load, before doing the next step.
  4. Wait a few minutes, make sure there are no twitter processes in about:process and try clicking on the twitter link again.

Expected:

  • A new tab opens with twitter url

Current:

  • about:blank page opens and does not redirect us to twitter.com

Though this is quite annoying, it is the only way I have been able to reproduce.

I have used mozregression (for the first time ever) to try to determine if it's due to my patches in bug 1589102, but 2 out of the 4 times, mozregression showed me that it was this range https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3ab251f67b71bfefc6f3aa29bea44f1d9b1de1a4&tochange=4ee63a7a01c558ab33cbe80885cf966630e27dc9 and not my patches.

Perhaps QA can retest this to verify if it is indeed the patches above that caused this? Or maybe it will turn out to be my patches after all.

Flags: needinfo?(nkochar)
Flags: needinfo?(nika)

Another way to check this, is for me to download the firefox binary that includes my changes, but excludes the commit range above, use it for a day or two and see if it happens with my patches as well. I'll do that in the meantime.

I'm seeing very similar symptoms. For me, this is also happening frequently with Reddit as well as Twitter. I also see a similar symptom where the page just doesn't load when typing the URL in the bar.

(For me, I was following links to Twitter from a message in Discord.)

:rpl added a bunch of "RemoteWorkerManager" MOZ_LOGging in those patches, it would be great if you could enable that for any runs where you haven't backed out those changes.

:mccr8 regarding pages not loading when typing the URL in the bar - I CC'ed you on bug 1662927, which I also filed.

Today I have been using firefox binary that only has my changes and none of the service worker changes and I haven't run into either of these issues.. Will report back tomorrow.

I've discussed this a bit on Matrix, so clearing my ni?

There's a chance this is being caused by a race when intercepting the twitter.com load due to a serviceworker, potentially related to the use of GetNewOrUsedBrowserProcessAsync here: https://searchfox.org/mozilla-central/rev/b2716c233e9b4398fc5923cbe150e7f83c7c6c5b/dom/workers/remoteworkers/RemoteWorkerManager.cpp#621-627

Ideally, this sort of thing would be switched to use GetNewOrUsedLaunchingBrowserProcess, AddKeepAlive, and WaitForLaunchAsync like is done by CanonicalBrowsingContext here: https://searchfox.org/mozilla-central/rev/b2716c233e9b4398fc5923cbe150e7f83c7c6c5b/docshell/base/CanonicalBrowsingContext.cpp#1127-1144

Flags: needinfo?(nika)
Fission Milestone: ? → M6b
Component: DOM: Navigation → DOM: Service Workers
Flags: needinfo?(nkochar) → needinfo?(bugmail)
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

Andrew, please provide an estimate date for the patch in the Whiteboard field as M6b is nearing completion.

Flags: needinfo?(bugmail)
Whiteboard: ETA: ?
Assignee: bugmail → nika
Whiteboard: ETA: ?

Our current understanding of this bug thanks to Nika's excellent test and investigation work is that the ServiceWorker's mechanism for enqueueing FetchEvent operations is deficient and has always been. Specifically, FetchEventOpParent::Initialize is assuming that the RemoteWorkerParent already exists when the request to send a fetch event reaches PBackground from the main-thread. However, the act of creating a RemoteWorkerParent is asynchronous when process launching is involved, which it frequently is in the fission case. (In the non-fission "web" case, there's usually already a "web" process.)

I'm going to provide a fix for this newly uncovered problem and then we will ideally land the whole stack. I'm clearing the needinfo so I can be re-needinfo'ed but this bug remains my top priority.

Flags: needinfo?(bugmail)

re-assigning back to asuth, as they're handling the changes to FetchEventOpParent

Assignee: nika → bugmail

Andrew, any updates on the patch and an ETA? The experiment launch date is Oct 6 and we need fixes by Oct 1.

Flags: needinfo?(bugmail)

I have a fix locally that passes Nika's test (with a minor shutdown cleanup change; no need to try and manually shutdown the active worker) and the bug 1622451 verification test. My plan is to augment Nika's test to have a variant that creates an appropriate request body, clean up the fix patch a little more, then do some try pushes and ask for review.

Flags: needinfo?(bugmail)

moz-phab really didn't like my arc checkout of nika's test patch and my attempt to placate it and fell over in a "Validation Error" leaving my checkout without my test enhancement and fix patches. Unfortunately I'm using hg and have been historically burned by hg a lot so I'm not quite comfortable assuming that the hg bookmark mechanism works as reliably as git reflog (hg bookmarks shows I'm not on my bookmark, so presumably the stack is still there?) at this time of day, so I'll deal with this tomorrow, but I think https://github.com/mystor/phlay may have found a new user!

Try pushes which do have the test enhancements and fix safely preserved in the cloud:

Depends on D91173

Excerpting some documentation I wrote up for an OVERVIEW.md for Service
Workers for this exact use situation below. The basic situation of this patch
is that we were trying to create the FetchEventOpProxyParent immediately
before the managing PRemoteWorker instance existed, which isn't a thing we can
do. Because FetchEvent ops are more complicated they require somewhat more
unique handling, but it should have been unified with the PendingOp
infrastructure and was not. The one notable thing going on actor-wise is that
the request body (if present) requires special RemoteLazyInputStream
serialization and this is something that can only be done once we have the
RemoteWorkerParent. See https://phabricator.services.mozilla.com/D73173 and
its commit message and my "Restating" blocks for more context.

Threads and Proxies

Main Thread

ServiceWorkerManager's authoritative ServiceWorker state lives on the main
thread in the parent process. This is due to a combination of legacy factors
and that currently the nsHttpChannel AsyncOpen logic where navigation
interception occurs must be on the main thread.

IPDL Background Thread

The IPDL Background Thread is the thread where PBackground parent actors are
created. Because IPDL actors are explicitly tied to the thread they are created
on and PBackground is the only top-level protocol created for use by DOM
Workers, this thread is the natural home for book-keeping and authoritative
state for APIs that are accessed via PBackground-managed protocols. For
example, IndexedDB and other QuotaManager-managed storage APIs.

The Remote Worker APIs are all PBackground managed and so all messages sent via
the Remote Worker API need to be sent from the IPDL Background thread.

Main Thread to IPDL Background Proxies

There are 2 ways to get data from the main thread to the IPDL Background thread:
either via direct runnable dispatch or via IPDL IPC. We use IPDL IPC (which
has optimizations for same-process communication).

The following interfaces exist exclusively to proxy requests from the
ServiceWorkerManager on the main thread to the IPDL Background thread.

  • PRemoteWorkerController is a proxy wrapper around the
    RemoteWorkerController API exposed on the IPDL Background thread.
  • PFetchEventOp is paired with PFetchEventOpProxy managed by the above
    PRemoteWorkerController. PFetchEventOp gets the data to the IPDL
    Background thread from the main thread, then PFetchEventOpProxy gets the
    data down to the content process.

Non-Fetch ServiceWorker events AKA ExtendableEvents

How non-fetch events are dispatched to the serviceworker (including the IPC).

Because ServiceWorkers are intended to be shutdown and restarted on demand and
most event processing is asynchronous, there needs to be a way to track
outstanding requests and for content logic to indicate when it is done
processing requests. ExtendableEvents are the mechanism for this. A method
waitUntil(promise) adds promises to be track as long as the event is still
"active".

This straightforward lifecycle means that these events map well to our IPC
async return value mechanism. This is used by
PRemoteWorker::ExecServiceWorkerOp.

Fetch events and FetchEvent.respondWith()

FetchEvents have a different lifecycle and dataflow than regular
ExtendableEvents. They expose a respondWith() method that will eventually
resolve with a fetch Response object that potentially needs to be propagated
before the ExtendableEvent is no longer active. (The response will never be
propagated after waitUntil() settles because every call to respondWith()
implicitly calls waitUntil().)

From an IPC perspective, this means that FetchEvent instances need their own
IPC actor rather than being able to depend on the async return value mechanism
of IPDL. That's why PFetchEventOpProxy exists and is managed by
PRemoteWorker.

Depends on D92020

It turns out moz-phab has its own mechanism of downloading patches from phabricator, nicely documented at https://github.com/mozilla-conduit/review#downloading-a-patch-from-phabricator so I should have done moz-phab patch D91173. (My previous use of arc had come from previous uses of arc as directed by since-updated Firefox version control or conduit documentation but I'd never tried to submit any of the resulting applied patches, just use them.)

Unfortunately when submitting my stack which built on Nika's test this did permute the canonical test revision and its stack to no longer depend on Nika's provision fix. I think this might actually be desired, but it's weird. (And to its credit moz-phab did warn me about this, but after I panicked the first time after agreeing and hit control-c it left my hg bookmark pointing at the wrong thing, so I restored my state and just went ahead with the stack clobbering.)

Whiteboard: patches waiting for review

This has been reported even without Fission so removing Fission milestone flag.

Fission Milestone: M6b → ---

This is much worse with Fission so we do need to block M6b on this.

Fission Milestone: --- → M6b
Whiteboard: patches waiting for review
See Also: → 1665942
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/5572b2a892c9
Part 1: Test fetch intercepting spawning a serviceworker process, r=asuth
https://hg.mozilla.org/integration/autoland/rev/c4920ccb9fbd
Test enhancements. r=nika
https://hg.mozilla.org/integration/autoland/rev/49008be99669
Make SW fetch events properly queue. r=dom-workers-and-storage-reviewers,janv
Regressions: 1669876
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch

Comment on attachment 9178901 [details]
Bug 1662925 - Make SW fetch events properly queue. r=#dom-workers-and-storage

Beta/Release Uplift Approval Request

  • User impact if declined: Websites that involve ServiceWorker interception may not properly launch on the first try if they are launching a content process to host the ServiceWorker. This is more of a major problem for fission, but the circumstances also seem to arise when normal e10s Firefox is first started and possibly also GeckoView.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): After all is said and done, this is a relatively straightforward life-cycle management bug that definitely was broken before. The underlying pending operation tracking has been exercising related paths for some time and has been refined and the MozPromises involved in all of this already existed before and had the same lifetime relationships.

We don't actually have to take the new test and its refinements, but it seems appropriate to help verify the fix.

  • String changes made/needed: none
Attachment #9178901 - Flags: approval-mozilla-beta?
Attachment #9177434 - Flags: approval-mozilla-beta?
Attachment #9178900 - Flags: approval-mozilla-beta?

As we're about to build the last beta for 82, uplifting makes me nervous. is this a new issue, or somehow getting worse for users? if there are regressions from it, how likely are we to notice and link them back to that change within a couple of days?

Flags: needinfo?(bugmail)

This is not a new issue. It also is recoverable if the user goes to the URL bar and hits enter again, etc. That said, bug 1663125 which impacted Firefox 80 would have made the problem disappear because ServiceWorkers were not having their registrations loading from disk on startup. That regression was fixed late in the 81 cycle (81.0rc2) as well, so it might appear like a new regression for some set of users.

See the blockers and see also's for potentially related bugs. Note that there's a separate twitter issue which is more concerning and getting some awareness and that's the one that involves the corrupted content UI (bug 1665368). That is likely unrelated to this bug.

In terms of regressions, I think we'd notice it pretty quickly because the failure modes for a screw-up would be:

  1. Crash. We should see a ton of crashes on twitter.com for those that report the URL if so.
  2. Hangs forever / displays about:blank. This is actually the problem that the bug fixes, but if that didn't get fixed right or increased the incidence, I'd still expect us to hear about it because of twitter.
  3. Increased corrupted content error rate. We actually may see this in the success case because whatever's happening with the twitter SW throwing a rejection might happen more because the set of successfully launched ServiceWorkers with FetchEvents dispatched at them would increase.

Note again that the Twitter corrupted content error (bug 1665368) is a separate bug I'm working on a fix for and that one is definitely not suitable for immediate uplift. So maybe it makes sense to wait to for that fix to land and both to ride the beta train. I just wanted to make sure this potential fix was known to release management for consideration.

Flags: needinfo?(bugmail)

Comment on attachment 9178901 [details]
Bug 1662925 - Make SW fetch events properly queue. r=#dom-workers-and-storage

Thanks for the details Andrew. Let's get this one in 82.0b9.

Attachment #9178901 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9177434 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9178900 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

This was caused by the changes in bug 1626362. I didn't see it in the regressed by field.

Flags: in-testsuite+
Regressed by: 1626362
Has Regression Range: --- → yes
Regressions: 1673534
See Also: → 1683595
See Also: → 1701726

Comment on attachment 9178652 [details]
Bug 1662925 - Doc improvements for Remote and Shared workers.

Revision D91887 was moved to bug 1701726. Setting attachment 9178652 [details] to obsolete.

Attachment #9178652 - Attachment is obsolete: true
Attachment #9177007 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: