Slack tabs hung up for minutes in LSObject::StoreDatabase
Categories
(Core :: Storage: localStorage & sessionStorage, defect, P2)
People
(Reporter: emilio, Unassigned)
References
(Depends on 1 open bug)
Comment 1•3 months ago
Seems like a deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)
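For illustration, here is a minimal standalone C++ sketch of the dispatch-then-wait pattern described above. It uses std::mutex/std::condition_variable rather than the actual mozilla::Monitor API, and all names are hypothetical; it only shows why the main thread hangs if the notification from the helper thread never arrives (or why both threads would hang if the helper also waited on the same monitor).

```cpp
// Minimal sketch (not Gecko code): one thread dispatches work to a helper
// thread and then blocks on a monitor until it is notified.
#include <condition_variable>
#include <mutex>
#include <thread>

int main() {
  std::mutex mutex;               // stand-in for RequestHelper::mMonitor
  std::condition_variable cv;
  bool done = false;

  // Stand-in for the RemoteLzyStream thread: it runs the dispatched
  // "runnable" and only notifies the monitor. If it instead waited on the
  // same monitor, both threads would block forever: the deadlock suspected
  // in this comment.
  std::thread helper([&] {
    // ... do the actual work of the runnable here ...
    std::lock_guard<std::mutex> lock(mutex);
    done = true;
    cv.notify_one();
  });

  // Stand-in for the content process main thread: after dispatching, it
  // waits on the monitor until the helper finishes. If the helper thread
  // never picks up the runnable, this wait hangs indefinitely, which is
  // what the profiles show.
  {
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, [&] { return done; });
  }
  helper.join();
  return 0;
}
```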
Comment 2•3 months ago
Interesting; brief notes:
- Slack process 2 shows a single tick in the Wait call under State::Canceling, with the rest of the samples in the outer wait
- Slack process 1 is also blocking, but only has samples in the outer wait.
- The "QuotaManager IO" and "LS Thread" both seem fine/idle.
Comment 3•3 months ago
I'm currently busy with other tasks, and this requires further investigation.
Comment 4•3 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)
- The "QuotaManager IO" and "LS Thread" both seem fine/idle.
Actually, the LS Thread seems to be flushing data to the database on disk. If there's too much data, the operation can take a long time.
There's only one IO thread for loading/flushing LS data (for all origins), so if the flushing for one origin is taking a long time, it can block loading of data for some other origin.
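To make the single-IO-thread point concrete, here is a small sketch with hypothetical names (standard C++ rather than the real LS thread code) showing how a slow flush for one origin delays a load for another when both are queued on the same serial thread.

```cpp
// Sketch with hypothetical names (not the real LS thread implementation):
// one shared I/O thread runs flush and load tasks strictly in order, so a
// slow flush for one origin delays a load for a different origin.
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

int main() {
  using namespace std::chrono_literals;

  // Work already queued on the single "LS Thread", in dispatch order.
  std::vector<std::function<void()>> ioQueue = {
      [] {
        // Big flush for origin A (e.g. an origin with a lot of stored data).
        std::puts("flushing origin A...");
        std::this_thread::sleep_for(5s);  // stands in for slow disk I/O
      },
      [] {
        // Load for origin B, queued behind the flush.
        std::puts("loading origin B");
      },
  };

  // The single I/O thread drains the queue in order: the load for origin B
  // cannot start until the flush for origin A has finished, so a page
  // waiting on that load appears hung even though nothing is deadlocked.
  std::thread lsThread([&] {
    for (auto& task : ioQueue) {
      task();
    }
  });
  lsThread.join();
  return 0;
}
```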
Comment 5•3 months ago
(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #1)
Seems like a deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)
Actually, RemoteLzyStream never calls Wait; it only calls Notify.
Comment 6•3 months ago
There are only a few short spikes on the LS thread.
What strikes me most is that on two of the three slack processes the RemoteLazyThread never executes any runnable, and on the third the RequestHelper runs only very late with respect to when it was dispatched. I cannot really see what could hold back its execution; the thread seems to exist and to be idle beforehand. The dispatch does not use any special flag, so there should be nothing between that Dispatch and running RequestHelper::Run except normal thread handling. All possible dispatch failures seem to be handled, too. From the profile info this happened on Windows 11 on an AMD Ryzen Threadripper 3990X 64-Core Processor, which sounds like "pretty fast, and every thread that wants to run has its own core".
Edit: Probably the first execution happened before the profile started, and then we are waiting for something to come back from the parent process.
Comment 7•3 months ago
It's interesting that multiple content processes for slack.com hang at the same time; that points more to a problem in the parent process, which probably isn't responding in a reasonable time to the request to prepare a database. Requests for preparing databases in the parent are always shared for a given origin, so if another content process for the same origin needs a database, it just waits for the existing request in the parent to finish.
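A sketch of what "shared per origin" could look like (hypothetical API, not the actual ActorsParent.cpp code): the parent keeps one in-flight prepare request per origin and every caller waits on the same result, so a single stalled request hangs all content processes for that origin at once.

```cpp
// Sketch with a hypothetical API (not the actual parent-process code): at
// most one in-flight "prepare datastore" request per origin, shared by
// every caller for that origin.
#include <future>
#include <map>
#include <mutex>
#include <string>

class DatastoreService {
 public:
  // Callers for the same origin get the same shared future.
  std::shared_future<bool> PrepareDatastore(const std::string& aOrigin) {
    std::lock_guard<std::mutex> lock(mMutex);
    auto it = mInFlight.find(aOrigin);
    if (it != mInFlight.end()) {
      return it->second;  // reuse the existing request for this origin
    }
    // Start a new request; std::async stands in for the real async
    // machinery. (Removing finished entries is omitted for brevity.)
    std::shared_future<bool> request =
        std::async(std::launch::async, [] {
          // ... open or create the database for the origin ...
          return true;
        }).share();
    mInFlight.emplace(aOrigin, request);
    return request;
  }

 private:
  std::mutex mMutex;
  std::map<std::string, std::shared_future<bool>> mInFlight;
};

int main() {
  DatastoreService service;
  // Two content processes asking for the same origin share one request; if
  // that single request stalls in the parent, both of them hang at once.
  auto first = service.PrepareDatastore("https://app.slack.com");
  auto second = service.PrepareDatastore("https://app.slack.com");
  first.wait();
  second.wait();
  return 0;
}
```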
Comment 8•3 months ago
I just verified that the special logging which we added in bug 1535221 still works.
Emilio, you can enable it by setting:
MOZ_LOG=QuotaManager:3,LocalStorage:3,sync
LSNG_CRASH_ON_CANCEL=1
The logging can look like this (printed just before the crash):
0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage LSRequestBase [105a5ec00]
0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage mState: Nesting
0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage mNestedState: CheckExistingOperations
0:51.30 pid:56210 [56210] Hit MOZ_CRASH(LSNG: Crash on cancel.) at /Users/jvarga/Projects/Mozilla15/src/dom/localstorage/ActorsParent.cpp:6473
Have you experienced the hang only once, or multiple times?
Reporter
Comment 9•3 months ago
I only hit this once (maybe after the 129 -> 130 update?). Does anything like that ring a bell?
I assume the logging isn't worth enabling if I can't reproduce the issue, but let me know if you want me to run with it anyway.
Comment 10•3 months ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #9)
I only hit this once (maybe after the 129 -> 130 update?). Does anything like that ring a bell?
I assume the logging isn't worth enabling if I can't reproduce the issue, but let me know if you want me to run with it anyway.
Yeah, since you only encountered it once, there's no point in enabling the logging.
So, until we see this happening much more often and affecting other users, I think we can downgrade this back to S3.
We can also add some new telemetry; I'll file a bug for it.
Thanks.