Either a promise or an async function, created in the window scope when adding a bookmark, stalls forever
Categories
(Firefox :: Bookmarks & History, defect, P3)
Tracking
 | Tracking | Status |
---|---|---|
firefox122 | --- | fixed |
People
(Reporter: mak, Assigned: mak)
References
(Depends on 1 open bug)
Details
(Whiteboard: [sng])
Attachments
(1 file)
The test in bug 1832222 was opening a new window, adding a bookmark, and closing it. This creates a BookmarkState and registers various transactions with the PlacesTransactions module.
One of these transactions never resolves after closing the window, likely because of bug 1674505, though with various experiments on Try I couldn't identify which promise is hanging.
The failure actually happens at the clearTransactionsHistory call in head.js, and it's surely due to the window scope where the work happens going away.
This may cause the TransactionsManager to hang indefinitely, as it manages a queue of promises. We have a timeout race on it, but it's a 5 minute timeout, so it won't help tests (also, reducing it didn't seem to help here).
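For illustration, the timeout race works roughly like the following sketch (the helper name and shape are assumptions, not the actual PlacesTransactions code): if a queued promise never settles, the race only resolves once the full timeout elapses, so the queue stays blocked until then.
// Minimal sketch, not the real implementation: guard a queued promise with
// a timeout so a hung task can only block the queue for `ms` milliseconds.
function raceWithTimeout(queuedPromise, ms = 5 * 60 * 1000) {
  const timeout = new Promise(resolve =>
    setTimeout(() => resolve("transactions-timeout"), ms)
  );
  return Promise.race([queuedPromise, timeout]);
}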
Comment 1•2 years ago
Does it help if we have a debug-only and chrome-only assertion that checks whether there's a promise reaction job in a dying global when the reaction job is for a different global?
Or will it hit frequently outside of the problematic case?
Assignee
Comment 2•2 years ago
(In reply to Tooru Fujisawa [:arai] from comment #1)
Does it help if we have a debug-only and chrome-only assertion that checks whether there's a promise reaction job in a dying global when the reaction job is for a different global?
Or will it hit frequently outside of the problematic case?
Yes, it would help to have some debugging tool to get a stack or something that allows us to recognize the stalled code.
I can't tell how frequently it would hit, but it can surely hit both users and tests.
So, it would be nice to have something we can enable on demand and run tests locally or on Try with it.
Comment 3•2 years ago
I've posted a prototype in bug 1858480, and I'd like to see if it's useful here.
Is the issue still happening on central tip? If so, is it intermittent? What are the steps to reproduce?
Then, about the frequency: the situation happens very frequently, and we cannot assert, so the prototype just prints it.
Comment hidden (Intermittent Failures Robot)
Comment 5•2 years ago
Forgot to set ni?
:mak, can you take a look at the above comment?
Assignee
Comment 6•2 years ago
(In reply to Tooru Fujisawa [:arai] from comment #3)
I've posted a prototype in bug 1858480, and I'd like to see if it's useful here.
Is the issue still happening on central tip? If so, is it intermittent? What are the steps to reproduce?
It was worked around in bug 1855218, so reverting to an m-c revision before 489e7952cd67 and running browser/components/places/tests/browser/browser_bookmarkProperties_folderSelection.js should intermittently cause this and hang the test.
I'll keep the ni open to try it.
Assignee
Comment 7•2 years ago
I tried pushing with the patch and the modified prefs, though I don't see any interesting printout in the logs. I must note it only failed in opt builds, if that matters.
I wonder if I misinterpreted what's happening, though there's clearly a never-resolving promise (as per comment 0), a chain of promises that doesn't proceed, and the problem goes away if everything happens in the same window.
My steps to reproduce:
hg update 0097ac44fd89
moz-phab patch D190886 --apply-to .
<commit patch changing the 2 prefs for debugging>
./mach try fuzzy browser/components/places/tests/browser/ --rebuild 10
Comment 8•2 years ago
Thank you for checking! And I'm sorry, I hadn't expected the Windows case.
The code prints details to stderr, but I'm not sure if stderr is supposed to be visible in the opt-build Windows automation log.
I'll check how the debug print interacts with Windows build and also with automation.
Comment 9•2 years ago
So far, I observe that stderr/stdout are visible in a local ./mach test ... run with a Windows opt build, but the issue doesn't reproduce in local runs.
Now checking how it goes on automation.
Then, the debug print covers the following scenario:
- promise P1 is created in global G1
- promise P2 is created in global G2, directly depending on P1
- G1 becomes dying
- P1 gets resolved, and reaction job is enqueued
- the reaction job for P1 is ignored because G1 is already dying
- P2 remains unresolved forever
and the patch detects the situation by the following algorithm:
- when a reaction job is getting ignored because the global is dying:
  - if the graph of promises and their reactions contains a promise or reaction from another global:
    - print the graph of promises and reactions
If the issue here doesn't match this scenario, it won't print anything.
Possible cases are:
- (a) P1 never gets resolved
- (b) P2 doesn't directly depend on P1, but there's some function or abstraction between them, and all of them belong to G1
"P2 directly depends on P1" means the following case:
let P2 = P1.then(...)
let P2 = async function(...) { await P1; }
and "not-directly" case is, for example, something like the following:
// in G2
let P2_resolve = null;
let P2 = new Promise(r => P2_resolve = r);
// in G1
P1.then(function onFulfilled(v) { G2.P2_resolve(v) });
In this case, even when P1 gets resolved, onFulfilled isn't executed because G1 is dying.
Then, given onFulfilled belongs to G1, and the reference to G2 exists only inside the function's code, the graph of promises and reactions doesn't contain anything from G2.
Comment 10•2 years ago
Just realized that the above try run is using an artifact build (possibly because of the try syntax?).
Also, there was another problem in my patch: a null check wasn't sufficient (I've already submitted the fixed patch).
Anyway, I'm seeing the debug print in the try run with the fix, and I'm now retriggering until it hits the failure case.
https://treeherder.mozilla.org/jobs?repo=try&revision=5e44b7474fdc3755bc885421b9e018022850fa72
Comment 11•2 years ago
It happened 2 times in the try run, but the situation covered by the patch doesn't seem to happen in the failure case, given there's no debug print around it, while there is the debug print for other cases.
So the possibility is (a) or (b) above, or a more complex case.
I'll look into other options around printing stalled promises and reactions.
Comment 12•2 years ago
I've added more logging to the actual code
https://treeherder.mozilla.org/jobs?repo=try&revision=9922eacf8e730dd91183cbda55cfd1a42743bfb8&selectedTaskRun=GNrDaJZ3R1W1nEyp_iYNHw.0
And apparently this is a deadlock due to a cyclic dependency around promises and tasks between _transactEnqueuer and _mainEnqueuer.
Here are the details:
Basic flow of BookmarkState._createBookmark
Let's use a graph, each node being a promise or a task, and each directed edge being a dependency between promises, such as Promise.prototype.then.
Suppose [M.0] being _mainEnqueuer's promise and [T.0] being _transactEnqueuer's promise, the initial state is the following:
[M.0]
[T.0]
When BookmarkState._createBookmark is called when the bookmark panel is hidden, it calls PlacesTransactions.batch.
There, a promise for the batch result is created by the this._promise.then below, depending on the task func's promise, which depends on the _mainEnqueuer's current promise (ignoring minor details around Promise.race, Promise.catch, and some intermediate promises):
async _createBookmark() {
  await lazy.PlacesTransactions.batch(async () => {

batch(task) {
  return this._mainEnqueuer.enqueue(async () => {

enqueue(func) {
  ...
  let promise = this._promise.then(() =>
    Promise.race([func(), timeoutPromise])
  ...
  this._promise = promise.catch(console.error);
The graph becomes the following, where [M.1] is the promise for this._promise.then:
[M.0]<--[task()]<--[M.1]
[T.0]
Then, given [M.0] is already resolved, the task is executed:
this._guid = await lazy.PlacesTransactions.NewBookmark({
  parentGuid: this.parentGuid,
  tags: this._newState.tags,
  title: this._newState.title ?? this._originalState.title,
  url: this._newState.uri ?? this._originalState.uri,
  index: this._originalState.index,
}).transact();
It creates a transaction and awaits on it.
transact(txnProxy) {
  ...
  let promise = this._transactEnqueuer.enqueue(async () => {
    ...
    let retval = await rawTxn.execute();
    ...
  });
  this._mainEnqueuer.alsoWaitFor(promise);
it enqueues a new function to _transactEnqueuer, which creates a new promise [T.1], and also calls this._mainEnqueuer.alsoWaitFor to create another promise [M.2] for _mainEnqueuer to depend on [T.1]:
[M.0]<--[task()]<--[M.1]<----[M.2]
           |                   |
           +-----------------+ |
                             v v
[T.0]<--[rawTxn.execute()]<--[T.1]
For this case the graph is a DAG, and things work as expected.
Given [T.0] is resolved, rawTxn.execute() runs, and it resolves [T.1], and then task() finishes, and [M.1] resolves, and then [M.2] resolves.
Basic flow of TransactionsManager.clearTransactionsHistory
Then, going back to initial state:
[M.0]
[T.0]
TransactionsManager.clearTransactionsHistory enqueues a function to _mainEnqueuer which creates [M.1] depending on the function's result, and also calls _transactEnqueuer.alsoWaitFor with the [M.1]:
clearTransactionsHistory(undoEntries, redoEntries) {
  let promise = this._mainEnqueuer.enqueue(function () {
    ...
  });
  ...
  this._transactEnqueuer.alsoWaitFor(promise);
If we represent the anonymous function in clearTransactionsHistory as [cleanup()], the graph becomes the following, so, in this case, it's still a DAG, and things work as expected:
[M.0]<--[cleanup()]<--[M.1]
                        ^
                        |
[T.0]<----------------[T.1]
Conflict between BookmarkState._createBookmark and TransactionsManager.clearTransactionsHistory
Then, the problem happens if TransactionsManager.clearTransactionsHistory happens in the middle of BookmarkState._createBookmark, which is happening in the failure case, according to the log.
Suppose that BookmarkState._createBookmark is called first, which enqueues a task function and creates [M.1] depending on it:
[M.0]<--[task()]<--[M.1]
[T.0]
Then, before the task() is executed, TransactionsManager.clearTransactionsHistory is called, which makes the graph into the following:
[M.0]<--[task()]<--[M.1]<--[cleanup()]<--[M.2]
                                           ^
                                           |
[T.0]<-----------------------------------[T.1]
After that, given [M.0] is already resolved, task() is executed, and it enqueues a new function to _transactEnqueuer, which creates a new promise [T.2], and also calls this._mainEnqueuer.alsoWaitFor to create another promise [M.3] for _mainEnqueuer to depend on [T.2]:
[M.0]<--[task()]<--[M.1]<--[cleanup()]<--[M.2]<-----------------------[M.3]
           |                               ^                            |
           |                               |                            |
           +-------------------------------|-------------+              |
                                           |             v              v
[T.0]<-----------------------------------[T.1]<--[rawTxn.execute()]<--[T.2]
Here, there's a cycle, and task() never finishes because it depends on rawTxn.execute(), which depends on task() itself because of the route added by TransactionsManager.clearTransactionsHistory.
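For illustration, the cycle can be reproduced outside Firefox with a stripped-down model of the two queues; the Enqueuer class and the other names below are assumptions for this sketch, not the real PlacesTransactions code:
// Stripped-down sketch of the deadlock above; Enqueuer and the surrounding
// names are assumptions, not the real PlacesTransactions code.
class Enqueuer {
  constructor() {
    this._promise = Promise.resolve();
  }
  enqueue(func) {
    // Run func after everything queued so far; the queue then waits on it.
    let promise = this._promise.then(func);
    this._promise = promise.catch(console.error);
    return promise;
  }
  alsoWaitFor(promise) {
    // Make this queue also wait for an external promise.
    this._promise = Promise.all([this._promise, promise]).catch(console.error);
  }
}

let mainEnqueuer = new Enqueuer();
let transactEnqueuer = new Enqueuer();

// The batched task ([task()] -> [M.1]) enqueues a transaction on the
// transact queue and awaits it, and makes the main queue wait for it too.
let batchDone = mainEnqueuer.enqueue(async () => {
  let txnDone = transactEnqueuer.enqueue(async () => "rawTxn.execute()");
  mainEnqueuer.alsoWaitFor(txnDone);
  return txnDone;
});

// clearTransactionsHistory arrives before the task runs: cleanup goes on
// the main queue ([cleanup()] -> [M.2]) and the transact queue waits for
// it ([T.1]), which closes the cycle once the task finally runs.
let cleanupDone = mainEnqueuer.enqueue(() => "history cleared");
transactEnqueuer.alsoWaitFor(cleanupDone);

batchDone.then(() => console.log("batch finished")); // never logs
Moving the clearTransactionsHistory lines above the batch in this sketch avoids the cycle, matching the DAG cases described earlier.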
Assignee
Comment 13•2 years ago
Wow, nice debug! Then we can assume the separate window was pretty much just making the case more likely to be hit due to timing, and had nothing to do with dying globals. That also explains why my many experiments to cope with dying globals didn't bring improvements.
Let's remove the dependencies on the dying-globals bugs for now, and sorry for the red herring.
I'll think about how to improve the code here. I already had some in-flight patches to change batch() so it only handles arrays, reducing its complexity to a single case, and to add logging to the whole thing. That doesn't solve the problem, but it may be worth doing anyway.
Probably the batch should block transactEnqueue, but transactions that are part of the batch should not go into it. That'd be easy to do.
Afaict undo() and redo() have similar problems with batch() as clearTransactionsHistory.
It would be nice to try to go back to a single-enqueuer mode; I don't clearly see why it wouldn't work once batch transactions are not added to a queue, but this code is quite complex, so it's possible I'm missing something.
Let's go one thing at a time.
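To illustrate the "transactions that are part of the batch should not go into it" idea above, here is a rough sketch (assumed names, reusing the simplified Enqueuer from the sketch in comment 12, ignoring concurrent callers outside the batch; this is not the actual patch):
// Rough sketch of the proposal above; TransactionsManagerSketch and the
// _batching flag are assumptions for illustration, not the real code.
class TransactionsManagerSketch {
  constructor() {
    this._mainEnqueuer = new Enqueuer();
    this._transactEnqueuer = new Enqueuer();
    this._batching = false;
  }

  batch(task) {
    let promise = this._mainEnqueuer.enqueue(async () => {
      this._batching = true;
      try {
        return await task();
      } finally {
        this._batching = false;
      }
    });
    // The batch blocks the transact queue: stand-alone transactions queued
    // meanwhile only run after the batch completes.
    this._transactEnqueuer.alsoWaitFor(promise);
    return promise;
  }

  transact(executeFn) {
    if (this._batching) {
      // A transaction belonging to the running batch is awaited by the batch
      // task itself; re-entering a queue that already waits on the batch (or
      // on pending cleanup work) is what created the cycle, so run directly.
      return executeFn();
    }
    let promise = this._transactEnqueuer.enqueue(executeFn);
    this._mainEnqueuer.alsoWaitFor(promise);
    return promise;
  }

  clearTransactionsHistory() {
    let promise = this._mainEnqueuer.enqueue(() => {
      // drop undo/redo entries
    });
    this._transactEnqueuer.alsoWaitFor(promise);
    return promise;
  }
}
With this shape, replaying the conflict from comment 12 (batch enqueued, then clearTransactionsHistory, then the batch task runs) resolves: the batched transaction executes directly, the task finishes, and the cleanup and both queues drain afterwards.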
Assignee
Updated•2 years ago
Assignee
Updated•2 years ago
Updated•2 years ago
Assignee
Comment 14•2 years ago
Using 2 queues increases the risk of deadlocks, and complicates our lives.
Let's test this in Nightly for now to ensure there are no unexpected effects.
Depends on D193815
Assignee
Comment 15•2 years ago
The patch here depends on the refactorings in bug 1865087.
Comment 16•2 years ago
Comment 17•2 years ago
bugherder