Closed Bug 1857882 Opened 2 years ago Closed 2 years ago

Either a promise or an async function, created in the window scope when adding a bookmark, stale forever

Categories

(Firefox :: Bookmarks & History, defect, P3)

defect

Tracking

()

RESOLVED FIXED
122 Branch
Tracking Status
firefox122 --- fixed

People

(Reporter: mak, Assigned: mak)

References

(Depends on 1 open bug)

Details

(Whiteboard: [sng])

Attachments

(1 file)

Bug 1832222 was opening a new window, adding a bookmark and closing it. This creates a BookmarkState and registers various transactions with the PlacesTransactions module.

One of these transactions never resolves after closing the window, likely because of bug 1674505, though with various experiment on Try I couldn't identify which promise is hanging.

The bug actually happens at the clearTransactionsHistory call in head.js and it's surely due to the window scope where the work happens going away.

This may cause the TransactionsManager to hang indefinitely, as it manages a queue of promises. We have a timeout race on it, but it's a 5 minutes timeout so it won't help tests (also reducing that didn't seem to help here)

Depends on: 1674505

does it help if we have debug-only and chrome-only assertion for checking if there's a promise reaction job in a dying global where the promise reaction job is for different global?
or will it hit frequently outside of the problematic case?

(In reply to Tooru Fujisawa [:arai] from comment #1)

does it help if we have debug-only and chrome-only assertion for checking if there's a promise reaction job in a dying global where the promise reaction job is for different global?
or will it hit frequently outside of the problematic case?

Yes, it would help to have some debugging tool to get a stack or something that allows us to recognize the stalled code.
I can't tell how frequently it would hit, but it can surely hit both users and tests.
So, it would be nice to have something we can enable on demand and run tests locally or on Try with it.

See Also: → 1858480

I've posted a prototype in bug 1858480, and I'd like to see if it's useful here.
Is the issue still happening on central tip? if so, is it intermittent? what's the steps to reproduce?

then, about the frequency, the situation happens very frequently, and we cannot assert. so the prototype just prints it.

forgot to set ni?
:mak, can you take a look into the above comment?

Flags: needinfo?(mak)

(In reply to Tooru Fujisawa [:arai] from comment #3)

I've posted a prototype in bug 1858480, and I'd like to see if it's useful here.
Is the issue still happening on central tip? if so, is it intermittent? what's the steps to reproduce?

It was workarounded in bug 1855218, so reverting to a m-c revision before 489e7952cd67, and running browser/components/places/tests/browser/browser_bookmarkProperties_folderSelection.js should intermittently cause this and hang the test.
I'll keep the ni open to try it.

I tried pushing with the patch and the modified prefs though I don't see any interesting print out in the logs. I must note it only failed in opt builds, if that matters.
I wonder if I misintepreted the happenings though there's clearly a never resolving promise (as per comment 0), a chain of promises that doesn't proceed, and the problem goes away if everything happens in the same window.

My steps to reproduce:

hg update 0097ac44fd89
moz-phab patch D190886 --apply-to .
<commit patch changing the 2 prefs for debugging>
./mach try fuzzy browser/components/places/tests/browser/ --rebuild 10
Flags: needinfo?(mak)

Thank you for checking! and I'm sorry that I haven't expected Windows case.
The code prints details to stderr, but I'm not sure if stderr is supposed to be visible in the opt-build Windows automation log.
I'll check how the debug print interacts with Windows build and also with automation.

so far, I observe the stderr/stdout are visible in local ./mach test ... run with Windows opt build, but the issue doesn't reproduce with local run.
Now checking how it goes on automation.

Then, the debug print covers the following scenario:

  1. promise P1 is created in global G1
  2. promise P2 is created in global G2, directly depending on P1
  3. G1 becomes dying
  4. P1 gets resolved, and reaction job is enqueued
  5. the reaction job for P1 is ignored because G1 is already dying
  6. P2 remains unresolved forever

and the patch detects the situation by the following algorithm:

  • when a reaction job is getting ignored because of the global is dying:
    • if a graph of promises and their reactions contains promise or reaction from other global:
      • print the graph of promises and reactions

If the issue here doesn't match this scenario, it won't print anything.
Possible cases are:

  • (a) P1 isn't get resolved
  • (b) P2 doesn't directly depend on P1, but there's some function or abstraction between them, and all of them belongs to G1

"P2 directly depends on P1" means the following case:

  • let P2 = P1.then(...)
  • let P2 = async function(...) { await P1; }

and "not-directly" case is, for example, something like the following:

// in G2
let P2_resolve = null;
let P2 = new Promise(r => P2_resolve = r);

// in G1
P1.then(function onFulfilled(v) { G2.P2_resolve(v) });

in this case, even when P1 gets resolved, the onFulfilled isn't executed because G1 is dying.
then, given onFullfilled belongs to G1, and the reference to G2 exists only inside the function's code, the graph of promises and reactions doesn't contain anything from G2.

Just realized that the above try run is using artifact build (possibly because of the try syntax?).
also, there was another problem in my patch that null-check wasn't sufficient (already submitted the fixed patch).

anyway, I'm seeing the debug print in the try run with the fix, and I'm now retriggering until it hits the failure case.
https://treeherder.mozilla.org/jobs?repo=try&revision=5e44b7474fdc3755bc885421b9e018022850fa72

it happened 2 times in the try run, but the situation covered by the patch doesn't seem to happen for the failure case, given there's no debug print around that, while there's the debug print for other cases.
So, possibility is the (a) or (b) above, or more complex case.

I'll look into other options around printing stalled promises and reactions.

I've added more logging to the actual code
https://treeherder.mozilla.org/jobs?repo=try&revision=9922eacf8e730dd91183cbda55cfd1a42743bfb8&selectedTaskRun=GNrDaJZ3R1W1nEyp_iYNHw.0

And apparently this is a dead-lock due to cyclic dependency around promises and tasks between transactEnqueuer and _mainEnqueuer.

Here's the details:

Basic flow of BookmarkState._createBookmark

Let's use a graph, each node being a promise or task, and directed edge being a dependency between promises, such as Promise.prototype.then.

Suppose [M.0] being _mainEnqueuer's promise and [T.0] being _transactEnqueuer's promise, the initial state is the following:

[M.0]

[T.0]

When BookmarkState._createBookmark is called when bookmark panel is hidden, it calls PlacesTransactions.batch.
There, a promise for the batch result is created by this._promise.then below, depending on the task func's promise, which depends on the _mainEnqueuer's current promise (ignoring minor details around Promise.race, Promise.catch, and some intermediate promises):

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/browser/components/places/PlacesUIUtils.sys.mjs#378-379

async _createBookmark() {
  await lazy.PlacesTransactions.batch(async () => {

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/toolkit/components/places/PlacesTransactions.sys.mjs#570-571

batch(task) {
  return this._mainEnqueuer.enqueue(async () => {

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/toolkit/components/places/PlacesTransactions.sys.mjs#453,469-470,474

enqueue(func) {
...
  let promise = this._promise.then(() =>
    Promise.race([func(), timeoutPromise])
...
  this._promise = promise.catch(console.error);

The graph becomes to the following, where [M.1] being the promise for this._promise.then:

[M.0]<--[task()]<--[M.1]

[T.0]

Then, given [M.0] is already resolved, the task is executed:

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/browser/components/places/PlacesUIUtils.sys.mjs#380-386

    this._guid = await lazy.PlacesTransactions.NewBookmark({
      parentGuid: this.parentGuid,
      tags: this._newState.tags,
      title: this._newState.title ?? this._originalState.title,
      url: this._newState.uri ?? this._originalState.uri,
      index: this._originalState.index,
    }).transact();

It creates a transaction and awaits on it.

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/toolkit/components/places/PlacesTransactions.sys.mjs#538,552,555,565-566

transact(txnProxy) {
...
  let promise = this._transactEnqueuer.enqueue(async () => {
...
    let retval = await rawTxn.execute();
...
  });
  this._mainEnqueuer.alsoWaitFor(promise);

it enqueues a new function to _transactEnqueuer, which creates a new promise [T.1], and also calls this._mainEnqueuer.alsoWaitFor to create another promise [M.2] for _mainEnqueuer to depend on [T.1]:

[M.0]<--[task()]<--[M.1]<----[M.2]
           |                    |
           +------------------+ |
                              v v
[T.0]<--[rawTxn.execute()]<--[T.1]

For this case the graph is DAG, and things work as expected.
Given [T.0] is resolved, rawTxn.execute() runs, and it resolves [T.1], and then task() finishes, and [M.1] resolves, and then [M.2] resolves.

Basic flow of TransactionsManager.clearTransactionsHistory

Then, going back to initial state:

[M.0]

[T.0]

TransactionsManager.clearTransactionsHistory enqueues a function to _mainEnqueuer which creates [M.1] depending on the function's result, and also calls _transactEnqueuer.alsoWaitFor with the [M.1]:

https://searchfox.org/mozilla-central/rev/40d51bef58b8e901d6ab4b60dd280f372a0e417d/toolkit/components/places/PlacesTransactions.sys.mjs#651-652,662,664

clearTransactionsHistory(undoEntries, redoEntries) {
  let promise = this._mainEnqueuer.enqueue(function () {
...
  });
...
  this._transactEnqueuer.alsoWaitFor(promise);

If we represent the anonymous function in clearTransactionsHistory as [cleanup()], the graph becomes the following, so, in this case, it's still DAG, and thing s work as expected:

[M.0]<--[cleanup()]<--[M.1]
                        ^
                        |
[T.0]<----------------[T.1]

Conflict between BookmarkState._createBookmark and TransactionsManager.clearTransactionsHistory

Then, the problem happens if TransactionsManager.clearTransactionsHistory happens in the middle of BookmarkState._createBookmark,
which is happening in the failure case, according to the log.

Suppose that BookmarkState._createBookmark is first called, which enqueus a task function and creates [M.1] depends on it:

[M.0]<--[task()]<--[M.1]

[T.0]

Then, before the task() is executed, TransactionsManager.clearTransactionsHistory is called, which makes the graph into the following:

[M.0]<--[task()]<--[M.1]<--[cleanup()]<--[M.2]
                                           ^
                                           |
[T.0]<-----------------------------------[T.1]

After that, given [M.0] is already resolved, task() is executed, and
it enqueues a new function to _transactEnqueuer, which creates a new promise [T.2], and also calls this._mainEnqueuer.alsoWaitFor to create another promise [M.3] for _mainEnqueuer to depend on [T.2]:

[M.0]<--[task()]<--[M.1]<--[cleanup()]<--[M.2]<-----------------------[M.3]
           |                               ^                             |
           |                               |                             |
           +-------------------------------|---------------------------+ |
                                           |                           v v
[T.0]<-----------------------------------[T.1]<--[rawTxn.execute()]<--[T.2]

Here, there's a cycle, and task() never finished because it depends on rawTxn.execute() which depends on task() itself because of the route added by TransactionsManager.clearTransactionsHistory.

Wow, nice debug! Then we can assume the separate window was pretty much just making the case more likely to be hit due to timing, and had nothing to do with dying globals. That also explains why my many experiments to cope with dying globals didn't bring improvements.

Let's remove the dependencies on dying global for now, and sorry for the red herring.

I'll think how to improve the code here. I had already some in-flight patches to change batch() so it only handles arrays, reducing its complexity to a single case, and adding logging to the whole thing. That's not solving the problem, but it may be worth doing anyway.

Probably the batch should block transactEnqueue, but transactions that are part of the batch should not go into it. That'd be easy to do.
Afaict undo() and redo() have similar problems with batch() as clearTransactionsHistory.
It would be nice to try to go back a single enqueuer mode, I don't see clearly why it wouldn't work once batch transactions are not added to a queue, but this code is quite complex so it's possible I'm missing something.

Let's go one thing at a time.

Assignee: nobody → mak
Status: NEW → ASSIGNED
No longer depends on: 1674505
See Also: 1858480
Whiteboard: [sng]
Depends on: 1865087

Using 2 queues increases the risk of deadlocks, and complicates our lives.
Let's test this in Nightly for now to ensure there's no unexpected effects.

Depends on D193815

The patch here depends on the refactorings in bug 1865087.

Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/autoland/rev/a90c00b4d9d8 Test a single queue PlacesTransactions in Nightly. r=places-reviewers,daisuke
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch
Blocks: 1865776
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: