Open Bug 1906921 Opened 3 months ago Updated 3 months ago

Slack tabs hung up for minutes in LSObject::StoreDatabase

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P2)

People

(Reporter: emilio, Unassigned)

References

(Depends on 1 open bug)

Details

Seems like some deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)

Flags: needinfo?(jvarga)

Interesting; brief notes:

I'm currently busy with other tasks, and this requires further investigation.

Flags: needinfo?(jvarga)
Severity: -- → S2
Priority: -- → P2

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)

  • The "QuotaManager IO" and "LS Thread" both seem fine/idle.

Actually, the LS Thread seems to be flushing data to the database on disk. If there's too much data, the operation can take a long time.
There's only one IO thread for loading/flushing LS data (for all origins), so if the flushing for one origin is taking a long time, it can block loading of data for some other origin.
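
To illustrate that serialization, here's a minimal standalone sketch (plain C++ with made-up names, not the actual LS Thread code): a single worker thread runs flush and load tasks strictly in order, so a slow flush for one origin delays a load queued behind it for another origin.

// Minimal model of a single serialized IO thread: a long flush for
// origin A delays the load for origin B queued behind it.
// (Illustrative only; not the actual Gecko LS Thread implementation.)
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class SerialIOThread {
 public:
  SerialIOThread() : mThread([this] { Run(); }) {}
  ~SerialIOThread() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mDone = true;
    }
    mCond.notify_one();
    mThread.join();
  }

  void Dispatch(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mQueue.push(std::move(task));
    }
    mCond.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mMutex);
        mCond.wait(lock, [this] { return mDone || !mQueue.empty(); });
        if (mQueue.empty()) return;
        task = std::move(mQueue.front());
        mQueue.pop();
      }
      task();  // Tasks run strictly one after another.
    }
  }

  std::mutex mMutex;
  std::condition_variable mCond;
  std::queue<std::function<void()>> mQueue;
  bool mDone = false;
  std::thread mThread;
};

int main() {
  SerialIOThread io;
  // Slow flush for origin A (simulated with a sleep)...
  io.Dispatch([] {
    std::cout << "flushing originA...\n";
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "flush originA done\n";
  });
  // ...blocks the datastore load for origin B queued behind it.
  io.Dispatch([] { std::cout << "loading originB (was blocked)\n"; });
  std::this_thread::sleep_for(std::chrono::seconds(3));
}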

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #1)

Seems like some deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)

Actually, RemoteLzyStream never calls Wait; it only calls Notify.
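
For reference, a rough standalone model of the dispatch-and-wait pattern being discussed (using standard C++ primitives instead of mozilla::Monitor; names and structure are simplifications, not the actual RequestHelper code):

// Simplified model of the RequestHelper pattern: the requesting thread
// dispatches work to another thread and blocks on a monitor until that
// thread signals completion. The worker only notifies; it never waits,
// so the monitor itself cannot deadlock unless the notify never happens
// (e.g. the runnable is never executed or the reply never arrives).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct Monitor {
  std::mutex mMutex;
  std::condition_variable mCond;
};

int main() {
  Monitor monitor;
  bool done = false;

  // "RemoteLzyStream"-style worker: performs the request, then notifies.
  std::thread worker([&] {
    // ... do the actual work (e.g. talk to the parent process) ...
    {
      std::lock_guard<std::mutex> lock(monitor.mMutex);
      done = true;
    }
    monitor.mCond.notify_one();  // Notify only; the worker never waits.
  });

  // Requesting ("main") thread: waits until the worker signals completion.
  {
    std::unique_lock<std::mutex> lock(monitor.mMutex);
    monitor.mCond.wait(lock, [&] { return done; });
  }
  std::cout << "request completed\n";

  worker.join();
}

The point is that the waiting side only hangs if the notify never arrives, e.g. because the runnable never runs or the reply from the parent never comes back.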

There are only a few short spikes on the LS thread.

What strikes me most is that in two of the three Slack processes the RemoteLzyStream thread never executes any runnable, and in the third the RequestHelper runs only very late with respect to when it is dispatched. I cannot really see what could hold its execution back; the thread seems to exist and to be idle beforehand. The dispatch does not use any special flag. There should be nothing between that Dispatch and running RequestHelper::Run except normal thread handling. All possible dispatch failures seem to be handled, too. From the profile info this happened on Windows 11 on an AMD Ryzen Threadripper 3990X 64-Core Processor, which sounds like "pretty fast and every thread that wants to run has its core".

Edit: Probably the first execution happened before the profile started. And then we are waiting for something to come back from the parent process.

It's interesting that multiple content processes for slack.com hang at the same time; that points more to a problem in the parent process, which probably doesn't respond in a reasonable time to the request to prepare a database. Requests for preparing databases in the parent are always shared for a given origin, so if another content process for the same origin needs a database, it just waits for the existing request in the parent to finish.
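
A minimal sketch of that sharing, assuming a per-origin map of in-flight requests (illustrative standard C++, not the actual parent-process code):

// Minimal model of per-origin request coalescing: the first request for
// an origin creates the shared "prepare datastore" operation; subsequent
// requests for the same origin wait on the same future. If that one
// operation stalls, all waiters (all content processes for the origin)
// stall with it. (Illustrative only; not the actual parent-process code.)
#include <future>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <string>

class PrepareDatastoreCoalescer {
 public:
  std::shared_future<std::string> Request(const std::string& origin) {
    std::lock_guard<std::mutex> lock(mMutex);
    auto it = mPending.find(origin);
    if (it != mPending.end()) {
      return it->second;  // Share the request already in flight.
    }
    // First request for this origin: start the (async) preparation.
    auto fut = std::async(std::launch::async, [origin] {
      // ... open/load the datastore for `origin` ...
      return "datastore:" + origin;
    }).share();
    mPending.emplace(origin, fut);
    return fut;
  }

 private:
  std::mutex mMutex;
  std::map<std::string, std::shared_future<std::string>> mPending;
};

int main() {
  PrepareDatastoreCoalescer coalescer;
  // Two "content processes" asking for the same origin share one request.
  auto a = coalescer.Request("https://app.slack.com");
  auto b = coalescer.Request("https://app.slack.com");
  std::cout << a.get() << "\n" << b.get() << "\n";
}

With this shape, one stalled preparation for an origin stalls every content process waiting on it, which would match seeing all slack.com processes hang at once.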

I just verified that the special logging which we added in bug 1535221 still works.

Emilio, you can enable it by setting:

MOZ_LOG=QuotaManager:3,LocalStorage:3,sync
LSNG_CRASH_ON_CANCEL=1

The logging can look like this (printed just before the crash):

 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage LSRequestBase [105a5ec00]
 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage   mState: Nesting
 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage   mNestedState: CheckExistingOperations
 0:51.30 pid:56210 [56210] Hit MOZ_CRASH(LSNG: Crash on cancel.) at /Users/jvarga/Projects/Mozilla15/src/dom/localstorage/ActorsParent.cpp:6473
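
For context, a toy standalone model of how such an env-var-gated diagnostic could be wired up (the real check lives in dom/localstorage/ActorsParent.cpp and uses MOZ_CRASH; everything below is an illustrative assumption):

// Toy model of an opt-in "crash on cancel" diagnostic: if the environment
// variable is set and a request gets cancelled while still pending, dump
// the request state (mirroring the log lines above) and abort so that the
// state ends up in the crash report. (Not the actual ActorsParent.cpp code.)
#include <cstdio>
#include <cstdlib>

void OnRequestCancelled(const char* state, const char* nestedState) {
  std::fprintf(stderr, "LSRequestBase\n  mState: %s\n  mNestedState: %s\n",
               state, nestedState);
  if (std::getenv("LSNG_CRASH_ON_CANCEL")) {
    std::fprintf(stderr, "LSNG: Crash on cancel.\n");
    std::abort();  // Stand-in for MOZ_CRASH.
  }
}

int main() {
  OnRequestCancelled("Nesting", "CheckExistingOperations");
}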

Have you experienced the hang only once, or multiple times?

Flags: needinfo?(emilio)

I only hit this once (maybe after the 129 -> 130 update?). Does something like that ring a bell?

I assume the logging is not worth it if I'm not reproducing the issue, but let me know if you want me to run with it anyway.

Flags: needinfo?(emilio) → needinfo?(jvarga)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #9)

I only hit this once (maybe after the 129 -> 130 update?). Does something like that ring a bell?

I assume the logging is not worth it if I'm not reproducing the issue, but let me know if you want me to run with it anyway.

Yeah, since you only encountered it once, there's no point in enabling the logging.
So, until we see this happening much more often and other users being affected, I think we can downgrade this back to S3.
We can also add some new telemetry, I'll file a bug for it.
Thanks.

Severity: S2 → S3
Flags: needinfo?(jvarga)
Depends on: 1590635