Open Bug 1680860 Opened 5 years ago Updated 8 months ago

[meta] Performance investigations surrounding large IndexedDB responses

Categories

(Core :: Storage: IndexedDB, task, P3)

task

Tracking


Performance Impact low

People

(Reporter: mstange, Unassigned)

References

(Depends on 4 open bugs)

Details

(Keywords: meta, perf:responsiveness, Whiteboard: dom-lws-bugdash-triage)

Attachments

(1 file, 4 obsolete files)

Profile: https://share.firefox.dev/2JTARaI

I just experienced severe jank when interacting with the profiler during symbolication, and then took a profile of that (linked above). I was moving my mouse continuously the whole time, and you can see the mousemove events for that in the parent process, but the mousemove feedback in the content process was interrupted multiple times, for 2.3 seconds at a time.
You can see a consistent increase in memory usage during those gaps.
After the gap, the first thing that runs is a success event from IndexedDB.

This indicates to me that "somehow IndexedDB is blocking the pipes".

Hi,

Thanks for reporting!

To be honest, I am not sure the jank was introduced by IDB, since the content process was idle for most of the time. I don't understand why the main thread of the content process was idle for around 2.3 seconds out of every 3 seconds when there were many mousemove events that needed to be propagated to the content process.

I guess it might be helpful to check the Background thread in the parent process (for IndexedDB actors) and the call stack/thread for mouse events in the parent process (I thought it's the main thread, but I couldn't find the call stack for it in the profiler) to make sure nothing is blocking them from being propagated to the content process.

cc :sg and :janv in case they know something.

Severity: -- → S4
Priority: -- → P3

I guess a big structured clone is being processed (de-serialized) during a "pause".

(In reply to Jan Varga [:janv] from comment #2)

I guess a big structured clone is being processed (de-serialized) during a "pause".

https://profiler.firefox.com/public/6v77h8r62tx4mgzgevq2v509phwp2hcaaj0q6bg/stack-chart/?globalTrackOrder=9-0-1-2-3-4-5-6-7-8&hiddenGlobalTracks=1-2-3-4-5-6-7&localTrackOrderByPid=97061-2-0-1~97066-0~97107-0~97102-0~97103-0~97106-0~97084-0~97083-0~97089-0~&thread=10&timelineType=stack&v=5

Right, ~StructuredCloneReadInfo took 57 ms during the IDBCursor.delete() that was executed in the success event handler, but I thought that was acceptable. Or maybe I was wrong.

Mousemove events are coalesced in the content process and processed along with vsync. From the profile, the content process didn't receive/process vsync during the "pause"; maybe the background thread of the parent process was busy with something and blocked vsync from being sent to the content process?

Would it be possible to get a profile with a bit more threads? IPC, PBackground etc.

Yeah I'll try to reproduce this again.

(In reply to Edgar Chen [:edgar] from comment #4)

From the profile, the content process didn't receive/process vsync during the "pause"; maybe the background thread of the parent process was busy with something and blocked vsync from being sent to the content process?

That sounds likely! I see PBackgroundIDBCursor::Msg_Response in the profile, and PVsync also goes through PBackground from the parent process background thread to the content process main thread. So my guess is that vsync was blocked on serializing / deserializing a large Response message.

I managed to capture a profile of this with more detail: https://share.firefox.dev/39TjDW2

The IPC data in the profile shows that, indeed, all IPC messages are delayed while a large message is sent, not just messages that go through PBackground.

Oh, that profile wasn't actually an IndexedDB message. Here's a better profile: https://share.firefox.dev/3qEePcY

Well, that profile does show a problem, but I'm not sure that it's the same problem as in the original profile. It's only showing one pause (when we get the libxul symbol table from the database), and the pause is not very long. I'm having a hard time reproducing the original bug with multiple long pauses.

Here's a better profile showing the original bug: https://share.firefox.dev/36BchEe

  • 450ms in GetStructuredCloneReadInfoFromExternalBlob
  • 108ms on the "IPDL Background" thread, putting the structured clone into an IPC message
  • 2 seconds (!) on the "IPC I/O Parent" thread, sending chunks to the child process. During this time, all other PBackground messages (such as vsync) are blocked.
  • On the content process, once all the chunks have arrived, the delivery to the main thread is actually pretty quick.

This bug is about the fact that the PBackground queue is clogged up by a big IndexedDB message, and in particular, the fact that this causes vsync notifications to be delayed during that time.

Summary: All IPC is paused while IndexedDB results are sent to the content process? → Large IndexedDB responses clog up the PBackground queue and delay vsync notifications
Whiteboard: [qf]

Jed, can you take a look at https://share.firefox.dev/36BchEe and advise? Should large IDB responses use shmems? Should they use bigger chunks?

Flags: needinfo?(jld)

This is… weird. The maximum message size is 256MiB, so if one message takes 2s to send, then that's less than 128MiB/s to send data to a local process. The way we're doing it isn't very optimized for large messages, but that's still surprising. In fact it's probably less than that: the memory track in the profiler shows an increase of ~118MB in the child process while receiving.

I remember that we'd previously found that local-domain sockets on macOS have relatively small buffers and don't perform well on messages that would exceed the buffer size, but this is still surprising.

I wonder if the profiler being active is part of this — we've had other bugs where frequently interrupting an operation with signals causes performance problems (e.g., we have to block SIGPROF while forking on Linux, because of how signals interact with the page table manipulations for copy-on-write VM).

It would help if we had a microbenchmark for sending a large message, so we could check that on different OSes, with the Gecko profiler vs. an OS-level profiler vs. no profiler, etc.
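A rough web-level approximation (a minimal sketch, not the internal IPC send-message microbenchmark described above; it assumes a database "perf-test" whose store "blobs" already holds a large value under the key "big", all hypothetical names) is to time a single large get() and watch how far apart requestAnimationFrame callbacks land while it is in flight:

  // Hedged sketch: time a large IndexedDB read and the worst rAF gap around it.
  // Assumes "perf-test"/"blobs"/"big" already exist (hypothetical names).
  const open = indexedDB.open("perf-test");
  open.onsuccess = () => {
    const db = open.result;
    let worstGap = 0;
    let last = performance.now();
    (function tick() {
      const now = performance.now();
      worstGap = Math.max(worstGap, now - last);
      last = now;
      requestAnimationFrame(tick);
    })();
    const t0 = performance.now();
    const req = db.transaction("blobs").objectStore("blobs").get("big");
    req.onsuccess = () => {
      console.log("get() latency:", (performance.now() - t0).toFixed(1), "ms;",
                  "worst rAF gap so far:", worstGap.toFixed(1), "ms");
    };
  };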

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #12)

I wonder if the profiler being active is part of this — we've had other bugs where frequently interrupting an operation with signals causes performance problems (e.g., we have to block SIGPROF while forking on Linux, because of how signals interact with the page table manipulations for copy-on-write VM).

Except that we use Mach APIs on Mac, not signals. But there might be a similar issue.

I noticed the bad performance without the profiler running. I only started the profiler because things were slow.

Large structured clones are stored as separate files on disk (outside of the database). However, the data is still read and decompressed in the parent process. We can instead send a file descriptor to the content process and let it read from the file and decompress data.
There's already some support for this, which was originally added for WebAssembly.Modules. See bug 1561876, which removed the support for WebAssembly.Modules and left some infrastructure for using the "preprocessing" in the future. Preprocessing is not yet supported for indexes and cursors; that would have to be added. There's also another bug that needs to be addressed, bug 1313185.

To summarize bug 1690921, there are serious but easily fixable problems with large message performance on Unix and especially on macOS. That should improve this situation, but there may still be noticeable latency (I've only tested on debug builds so far, which isn't an accurate measure of how much improvement we'll get on release builds).

Note that if the message had been over 256MiB, it would crash.

Flags: needinfo?(jld)
Whiteboard: [qf] → [qf:p1:responsiveness]

Here is a new profile I took from this testcase: https://share.firefox.dev/3nrWXRF

I ran it in Chrome too; in general we seem to have better numbers in Firefox than in Chrome on my machine.

This testcase may be different from Markus's initial bug: here I'm storing a big value in IDB but never retrieving it.
Also, I'm using requestAnimationFrame callbacks rather than mousemove events; I'm not sure whether that matters for this bug.
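For reference, the rough shape of that testcase (a hedged reconstruction, not the actual attachment; database/store names and the value size are made up) is to build a large structured-cloneable value, put() it into IndexedDB, and log any long requestAnimationFrame gaps while the write is in flight:

  // Hedged reconstruction of the testcase shape (not the attached file).
  const open = indexedDB.open("bigdata-test", 1);
  open.onupgradeneeded = () => open.result.createObjectStore("store");
  open.onsuccess = () => {
    const db = open.result;

    // A large structured-cloneable value; the size here is arbitrary.
    const big = Array.from({ length: 2_000_000 }, (_, i) => ({ i, s: "x".repeat(16) }));

    // Log requestAnimationFrame gaps longer than ~50ms while the put() runs.
    let last = performance.now();
    (function tick() {
      const now = performance.now();
      if (now - last > 50) console.log("rAF gap:", (now - last).toFixed(1), "ms");
      last = now;
      requestAnimationFrame(tick);
    })();

    const tx = db.transaction("store", "readwrite");
    tx.objectStore("store").put(big, "big");
    tx.oncomplete = () => console.log("store complete");
  };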

Attached file simpler testcase (obsolete) —

This testcase also retrieves the value. Here we're significantly worse than Chrome on my machine.

Attachment #9218973 - Attachment is obsolete: true
Attached file testcase-idb.html (obsolete) —

(fix typo)

https://share.firefox.dev/3ntT5Q0 <= this profile is from retrieving from IDB.

Hey Iain, Bas told me you could be interested in this :-) We spend a lot of time in js::NativeObject::addPropertyInternal.
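For context, the values in these testcases create a very large number of properties when the clone is read back, which is presumably what drives the addPropertyInternal time; a hedged illustration of that kind of value (not the actual testcase data):

  // Hedged illustration (not the actual testcase data): reading back a clone of
  // a value like this adds one property per key, so shape/property machinery
  // dominates deserialization.
  const value = {};
  for (let i = 0; i < 500_000; i++) {
    value["key" + i] = i;
  }
  // Storing and then retrieving `value` via IndexedDB replays those property
  // additions while the structured clone is read.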

Flags: needinfo?(iireland)
Attachment #9218988 - Attachment is obsolete: true
Attached file testcase-idb.html (obsolete) —

Actually access the result so that Chrome actually transfers it; this makes the comparison fair.

Attachment #9218989 - Attachment is obsolete: true
Attached file testcase-idb.html

with better timestamps

Attachment #9219001 - Attachment is obsolete: true

Jan points out in chat that "JSStructuredCloneReader::read looks pretty suboptimal". He's currently working on improving our representation of shapes (bug 1704430); once that's done we can bypass a lot of the unnecessary work here.

Opened bug 1708272.

Depends on: 1708272
Flags: needinfo?(iireland)

These testcases look like a rather different problem - they're spending time on the main thread. The original bug was about delays that are not due to work on the main thread, but due to work in the IPC system.

(In reply to Markus Stange [:mstange] from comment #27)

These testcases look like a rather different problem - they're spending time on the main thread. The original bug was about delays that are not due to work on the main thread, but due to work in the IPC system.

yeah, I agree.

In the profiles I took, I couldn't find anything interesting in the IPDL Background thread, which is why I didn't keep it when sharing. So could this specific bug be fixed by bug 1690921 then?

Here is another profile with all the same threads as your previous profile: https://share.firefox.dev/3u4SdE0 (store then retrieve a big object).
I think the time spent in IPC looks reasonable now, what do you think?

Some more things from looking at this profile:

  • moveToTenured happens repeatedly during ReadStructuredClone; can we check the size of the data and deserialize directly into the tenured generation?
  • in the "IPDL Background" thread, smaug thinks the free happens in the same task; we should move it to a separate task so that the thread isn't blocked.
  • Bas thinks that ComputeCrc32c could be optimized significantly
  • IPC is ~100ms for ~90MB (that's 800MB/s); on my five-year-old profiling machine, this looks OK.

IPC IO Parent looks pretty good now. We don't think this is P1 responsiveness anymore; please change it back if you disagree.

I'll file bugs for these few things.

Whiteboard: [qf:p1:responsiveness] → [qf:p3:responsiveness]

Florian noticed that in the "store" part of the profile (the first operation), in the indexedDb thread, there are 2 long fsync operations that blocked the profiler sampling. It would be nice to double check on more profiles.

(In reply to Julien Wajsberg [:julienw] from comment #30)

Florian noticed that in the "store" part of the profile (the first operation), in the indexedDb thread, there are 2 long fsync operations that blocked the profiler sampling. It would be nice to double check on more profiles.

What does "blocked" mean in this context?

I mean there's a big hole in samples on this thread during fsync.
Florian filed bug 1709111 about this issue.

Does anyone know why we're moving this much memory around? Maybe there's a way to make these queries return less data. I saw from comment 16 that we will just crash when the result size hits 256MB.

IndexedDB keeps an approximation of its results size and will generate a failure (and/or stop prefetching records) if the results get too large. It won't actually hit the crash size.

We can't actually stop users from putting giant things in IndexedDB without cross-browser standardization, and the lack of a computed size for structured serialization means that it can't yet be standardized, although there have been some partial efforts to that end.

There is a preprocess mechanism that could be used to avoid sending IDB responses as a single big message, but it inherently introduces additional latency, and it is not a switch that can just be flipped: it also breaks the response ordering required by the spec, and the compensations for that have not been implemented yet.
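For what it's worth, at the web-developer level the closest available mitigation today is to iterate with a cursor so each response stays comparatively small, rather than materializing everything in one request (a hedged sketch with hypothetical names; this helps for large result sets, not for one single gigantic value):

  // Hedged sketch: stream records with a cursor so each response stays small,
  // instead of a single getAll() that returns everything in one big message.
  // Assumes `db` is an open connection with an object store "records".
  const req = db.transaction("records").objectStore("records").openCursor();
  req.onsuccess = () => {
    const cursor = req.result;
    if (!cursor) return;          // no more records
    processRecord(cursor.value);  // hypothetical per-record handler
    cursor.continue();            // request the next record
  };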

I've re-titled the bug since it sounds like the quadratic performance hit of bug 1690921 was the cause of the very serious-sounding vsync jank.

Summary: Large IndexedDB responses clog up the PBackground queue and delay vsync notifications → [meta] Performance investigations surrounding large IndexedDB responses
Performance Impact: --- → P3
Whiteboard: [qf:p3:responsiveness]
Whiteboard: dom-lws-bugdash-triage
Type: defect → task