Open Bug 1906921 Opened 3 months ago Updated 3 months ago

Slack tabs hung up for minutes in LSObject::StoreDatabase

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P2)

People

(Reporter: emilio, Unassigned)

References

(Depends on 1 open bug)

Details

Seems like some deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)

Flags: needinfo?(jvarga)

Interesting; brief notes:

I'm currently busy with other tasks, and this requires further investigation.

Flags: needinfo?(jvarga)
Severity: -- → S2
Priority: -- → P2

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)

  • The "QuotaManager IO" and "LS Thread" both seem fine/idle.

Actually, the LS Thread seems to be flushing data to the database on disk. If there's too much data, the operation can take a long time.
There's only one IO thread for loading/flushing LS data (for all origins), so if the flushing for one origin is taking a long time, it can block loading of data for some other origin.
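
To illustrate that serialization, here's a minimal standalone sketch (plain C++ with made-up names, not the actual LS Thread code): a single worker thread runs flush and load tasks strictly in order, so a slow flush for one origin delays a load queued behind it for another origin.

// Minimal model of a single serialized IO thread: a long flush for
// origin A delays the load for origin B queued behind it.
// (Illustrative only; not the actual Gecko LS Thread implementation.)
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class SerialIOThread {
 public:
  SerialIOThread() : mThread([this] { Run(); }) {}
  ~SerialIOThread() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mDone = true;
    }
    mCond.notify_one();
    mThread.join();
  }

  void Dispatch(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mQueue.push(std::move(task));
    }
    mCond.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mMutex);
        mCond.wait(lock, [this] { return mDone || !mQueue.empty(); });
        if (mQueue.empty()) return;
        task = std::move(mQueue.front());
        mQueue.pop();
      }
      task();  // Tasks run strictly one after another.
    }
  }

  std::mutex mMutex;
  std::condition_variable mCond;
  std::queue<std::function<void()>> mQueue;
  bool mDone = false;
  std::thread mThread;
};

int main() {
  SerialIOThread io;
  // Slow flush for origin A (simulated with a sleep)...
  io.Dispatch([] {
    std::cout << "flushing originA...\n";
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "flush originA done\n";
  });
  // ...blocks the datastore load for origin B queued behind it.
  io.Dispatch([] { std::cout << "loading originB (was blocked)\n"; });
  std::this_thread::sleep_for(std::chrono::seconds(3));
}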

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #1)

Seems like some deadlock on RequestHelper::mMonitor? (The main thread of slack.com (2/3) dispatches a runnable to the RemoteLzyStream thread and waits on mMonitor, but that thread also waits on mMonitor.)

Actually, RemoteLzyStream never calls Wait; it only calls Notify.
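
For reference, a rough standalone model of the dispatch-and-wait pattern being discussed (using standard C++ primitives instead of mozilla::Monitor; names and structure are simplifications, not the actual RequestHelper code):

// Simplified model of the RequestHelper pattern: the requesting thread
// dispatches work to another thread and blocks on a monitor until that
// thread signals completion. The worker only notifies; it never waits,
// so the monitor itself cannot deadlock unless the notify never happens
// (e.g. the runnable is never executed or the reply never arrives).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct Monitor {
  std::mutex mMutex;
  std::condition_variable mCond;
};

int main() {
  Monitor monitor;
  bool done = false;

  // "RemoteLzyStream"-style worker: performs the request, then notifies.
  std::thread worker([&] {
    // ... do the actual work (e.g. talk to the parent process) ...
    {
      std::lock_guard<std::mutex> lock(monitor.mMutex);
      done = true;
    }
    monitor.mCond.notify_one();  // Notify only; the worker never waits.
  });

  // Requesting ("main") thread: waits until the worker signals completion.
  {
    std::unique_lock<std::mutex> lock(monitor.mMutex);
    monitor.mCond.wait(lock, [&] { return done; });
  }
  std::cout << "request completed\n";

  worker.join();
}

The point is that the waiting side only hangs if the notify never arrives, e.g. because the runnable never runs or the reply from the parent never comes back.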

There are only a few short spikes on the LS thread.

What strikes me most is that in two of the three Slack processes the RemoteLzyStream thread never executes any runnable, and in the third the RequestHelper runs only very late with respect to when it is dispatched. I cannot really see what could hold its execution back; the thread seems to exist and to be idle beforehand. The dispatch does not use any special flag. There should be nothing between that Dispatch and running RequestHelper::Run except normal thread handling. All possible dispatch failures seem to be handled, too. From the profile info this happened on Windows 11 on an AMD Ryzen Threadripper 3990X 64-Core Processor, which sounds like "pretty fast and every thread that wants to run has its core".

Edit: Probably the first execution happened before the profile started. And then we are waiting for something to come back from the parent process.

It's interesting that multiple content processes for slack.com hang at the same time; that points more to a problem in the parent process, which probably doesn't respond in a reasonable time to the request to prepare a database. Requests for preparing databases in the parent are always shared for a given origin, so if another content process for the same origin needs a database, it just waits for the existing request in the parent to finish.
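
A minimal sketch of that sharing, assuming a per-origin map of in-flight requests (illustrative standard C++, not the actual parent-process code):

// Minimal model of per-origin request coalescing: the first request for
// an origin creates the shared "prepare datastore" operation; subsequent
// requests for the same origin wait on the same future. If that one
// operation stalls, all waiters (all content processes for the origin)
// stall with it. (Illustrative only; not the actual parent-process code.)
#include <future>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <string>

class PrepareDatastoreCoalescer {
 public:
  std::shared_future<std::string> Request(const std::string& origin) {
    std::lock_guard<std::mutex> lock(mMutex);
    auto it = mPending.find(origin);
    if (it != mPending.end()) {
      return it->second;  // Share the request already in flight.
    }
    // First request for this origin: start the (async) preparation.
    auto fut = std::async(std::launch::async, [origin] {
      // ... open/load the datastore for `origin` ...
      return "datastore:" + origin;
    }).share();
    mPending.emplace(origin, fut);
    return fut;
  }

 private:
  std::mutex mMutex;
  std::map<std::string, std::shared_future<std::string>> mPending;
};

int main() {
  PrepareDatastoreCoalescer coalescer;
  // Two "content processes" asking for the same origin share one request.
  auto a = coalescer.Request("https://app.slack.com");
  auto b = coalescer.Request("https://app.slack.com");
  std::cout << a.get() << "\n" << b.get() << "\n";
}

With this shape, one stalled preparation for an origin stalls every content process waiting on it, which would match seeing all slack.com processes hang at once.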

I just verified that the special logging which we added in bug 1535221 still works.

Emilio, you can enable it by setting:

MOZ_LOG=QuotaManager:3,LocalStorage:3,sync
LSNG_CRASH_ON_CANCEL=1

The logging can look like this (printed just before the crash):

 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage LSRequestBase [105a5ec00]
 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage   mState: Nesting
 0:51.30 pid:56210 [Parent 56210: IPDL Background]: I/LocalStorage   mNestedState: CheckExistingOperations
 0:51.30 pid:56210 [56210] Hit MOZ_CRASH(LSNG: Crash on cancel.) at /Users/jvarga/Projects/Mozilla15/src/dom/localstorage/ActorsParent.cpp:6473
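
For context, a toy standalone model of how such an env-var-gated diagnostic could be wired up (the real check lives in dom/localstorage/ActorsParent.cpp and uses MOZ_CRASH; everything below is an illustrative assumption):

// Toy model of an opt-in "crash on cancel" diagnostic: if the environment
// variable is set and a request gets cancelled while still pending, dump
// the request state (mirroring the log lines above) and abort so that the
// state ends up in the crash report. (Not the actual ActorsParent.cpp code.)
#include <cstdio>
#include <cstdlib>

void OnRequestCancelled(const char* state, const char* nestedState) {
  std::fprintf(stderr, "LSRequestBase\n  mState: %s\n  mNestedState: %s\n",
               state, nestedState);
  if (std::getenv("LSNG_CRASH_ON_CANCEL")) {
    std::fprintf(stderr, "LSNG: Crash on cancel.\n");
    std::abort();  // Stand-in for MOZ_CRASH.
  }
}

int main() {
  OnRequestCancelled("Nesting", "CheckExistingOperations");
}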

Have you experienced the hang only once, or multiple times?

Flags: needinfo?(emilio)

I only hit this once (maybe after the 129 -> 130 update?). Does something like that ring a bell?

I assume the logging is not worth it if I'm not reproducing the issue, but let me know if you want me to run with it anyway.

Flags: needinfo?(emilio) → needinfo?(jvarga)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #9)

I only hit this once (maybe after the 129 -> 130 update?). Does something like that ring a bell?

I assume the logging is not worth it if I'm not reproducing the issue, but let me know if you want me to run with it anyway.

Yeah, since you only encountered it once, there's no point in enabling the logging.
So, until we see this happening much more often and other users being affected, I think we can downgrade this back to S3.
We can also add some new telemetry, I'll file a bug for it.
Thanks.

Severity: S2 → S3
Flags: needinfo?(jvarga)
Depends on: 1590635