[meta] Performance investigations surrounding large IndexedDB responses
Categories
(Core :: Storage: IndexedDB, task, P3)
Performance Impact: low
People
(Reporter: mstange, Unassigned)
References
(Depends on 4 open bugs)
Details
(Keywords: meta, perf:responsiveness, Whiteboard: dom-lws-bugdash-triage)
Attachments
(1 file, 4 obsolete files)
5.41 KB, text/html
Profile: https://share.firefox.dev/2JTARaI
I just experienced severe jank when interacting with the profiler during symbolication, so I took a profile of it (linked above). I was moving my mouse continuously the whole time, and you can see the mousemove events for that in the parent process, but the mousemove feedback in the content process was interrupted multiple times, for 2.3 seconds at a time.
You can see a consistent increase in memory usage during those gaps.
After each gap, the first thing that runs is a success event from IndexedDB.
This indicates to me that "somehow IndexedDB is blocking the pipes".
Comment 1•5 years ago
Hi,
Thanks for reporting!
To be honest, I am not so sure the jank was introduced by IDB, since the content process was idle for most of the time. I don't understand why the main thread of the content process was idle for around 2.3 seconds out of every 3 seconds when there are many mousemove events that need to be propagated to the content process.
I guess it might be helpful to check the Background thread of the parent process (for IndexedDB actors) and the call stack/thread for mouse events on the parent process (I thought it's the main thread, but I couldn't find the call stack for it in the profiler) to ensure nothing blocks them from being propagated to the content process.
cc :sg and :janv in case they know something.
Comment 2•5 years ago
I guess a big structured clone is being processed (deserialized) during a "pause".
Comment 3•5 years ago
(In reply to Jan Varga [:janv] from comment #2)
> I guess a big structured clone is being processed (deserialized) during a "pause".
Right, ~StructuredCloneReadInfo took 57 ms during the IDBCursor.delete() which was executed in the success event handler, but I thought that was acceptable. Or maybe I was wrong.
The mousemove events are coalesced in the content process and processed along with vsync. From the profile, the content process didn't receive/process vsync during the "pause"; maybe the background thread of the parent process was busy with something and blocked vsync from being sent to the content process?
Comment 5•5 years ago
Would it be possible to get a profile with a few more threads? IPC, PBackground, etc.
Reporter
Comment 6•5 years ago
Yeah, I'll try to reproduce this again.
(In reply to Edgar Chen [:edgar] from comment #4)
> From the profile, the content process didn't receive/process vsync during the "pause"; maybe the background thread of the parent process was busy with something and blocked vsync from being sent to the content process?
That sounds likely! I see PBackgroundIDBCursor::Msg_Response in the profile, and PVsync also goes through PBackground from the parent process background thread to the content process main thread. So my guess is that vsync was blocked on serializing / deserializing a large Response message.
Reporter
Comment 7•5 years ago
I managed to capture a profile of this with more detail: https://share.firefox.dev/39TjDW2
The IPC data in the profile shows that, indeed, all IPC messages are delayed while a large message is sent, not just messages that go through PBackground.
Reporter
Comment 8•5 years ago
Oh, that profile wasn't actually an IndexedDB message. Here's a better profile: https://share.firefox.dev/3qEePcY
Reporter
Comment 9•5 years ago
Well, that profile does show a problem, but I'm not sure that it's the same problem as in the original profile. It's only showing one pause (when we get the libxul symbol table from the database), and the pause is not very long. I'm having a hard time reproducing the original bug with multiple long pauses.
Reporter
Comment 10•5 years ago
Here's a better profile showing the original bug: https://share.firefox.dev/36BchEe
- 450ms in GetStructuredCloneReadInfoFromExternalBlob
- 108ms on the "IPDL Background" thread, putting the structured clone into an IPC message
- 2 seconds (!) on the "IPC I/O Parent" thread, sending chunks to the child process. During this time, all other PBackground messages (such as vsync) are blocked.
- On the content process, once all the chunks have arrived, the delivery to the main thread is actually pretty quick.
This bug is about the fact that the PBackground queue is clogged up by a big IndexedDB message, and in particular the fact that this causes vsync notifications to be delayed during that time.
Reporter
Comment 11•5 years ago
Jed, can you take a look at https://share.firefox.dev/36BchEe and advise? Should large IDB responses use shmems? Should they use bigger chunks?
Comment 12•5 years ago
This is… weird. The maximum message size is 256MiB, so if one message takes 2s to send, that's less than 128MiB/s to send data to a local process. The way we're doing it isn't very optimized for large messages, but that's still surprising. In fact it's probably less than that: the memory track in the profiler shows an increase of ~118MB in the child process while receiving, which would put the effective rate closer to 60MB/s.
I remember that we'd previously found that local-domain sockets on macOS have relatively small buffers and don't perform well on messages that would exceed the buffer size, but this is still surprising.
I wonder if the profiler being active is part of this — we've had other bugs where frequently interrupting an operation with signals causes performance problems (e.g., we have to block SIGPROF while forking on Linux, because of how signals interact with the page table manipulations for copy-on-write VM).
It would help if we had a microbenchmark for sending a large message, so we could check that on different OSes, with the Gecko profiler vs. an OS-level profiler vs. no profiler, etc.
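[Editor's note: a JS-level proxy for such a microbenchmark could look like the sketch below. It times one large IndexedDB get(), which travels as a single big PBackground message; the database/store names and the 90MB payload size are illustrative, not taken from the attached testcase.]
// Rough JS-level proxy for the IPC microbenchmark: time a single large
// IndexedDB get(), which exercises parent-process read + IPC + deserialization.
function openDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("large-message-bench", 1);
    req.onupgradeneeded = () => req.result.createObjectStore("store");
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function bench() {
  const db = await openDb();

  // Store a large value (90MB here; stays well under the 256MiB limit).
  await new Promise((resolve, reject) => {
    const tx = db.transaction("store", "readwrite");
    tx.objectStore("store").put(new Uint8Array(90 * 1024 * 1024), "big");
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });

  // Time the retrieval: this is the large-message path discussed above.
  const start = performance.now();
  await new Promise((resolve, reject) => {
    const req = db.transaction("store").objectStore("store").get("big");
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
  console.log("get() took", (performance.now() - start).toFixed(1), "ms");
}

bench();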
Comment 13•5 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #12)
> I wonder if the profiler being active is part of this — we've had other bugs where frequently interrupting an operation with signals causes performance problems (e.g., we have to block SIGPROF while forking on Linux, because of how signals interact with the page table manipulations for copy-on-write VM).
Except that we use Mach APIs on Mac, not signals. But there might be a similar issue.
Reporter
Comment 14•5 years ago
I noticed the bad performance without the profiler running. I only started the profiler because things were slow.
Comment 15•5 years ago
Large structured clones are stored as separate files on disk (outside of the database). However, the data is still read and decompressed in the parent process. We could instead send a file descriptor to the content process and let it read from the file and decompress the data itself.
There's already some support for this, originally added for WebAssembly.Module. See bug 1561876, which removed the support for WebAssembly.Module but left some infrastructure for using the "preprocessing" in the future. The preprocessing is not supported by indexes and cursors; that would have to be added. There's also another bug which needs to be addressed first: bug 1313185.
Comment 16•5 years ago
To summarize bug 1690921, there are serious but easily fixable problems with large message performance on Unix and especially on macOS. That should improve this situation, but there may still be noticeable latency (I've only tested on debug builds so far, which isn't an accurate measure of how much improvement we'll get on release builds).
Note that if the message had been over 256MiB, it would crash.
Comment 17•4 years ago
Comment 18•4 years ago
Here is a new profile I took from this testcase: https://share.firefox.dev/3nrWXRF
I ran it in Chrome too; in general, on my machine, we seem to have better numbers in Firefox than in Chrome.
Comment 19•4 years ago
This testcase is maybe different from Markus' initial bug: here I'm storing a big value in IDB, but never retrieving it.
Also, I'm using requestAnimationFrame callbacks, not mousemove events; I'm not sure whether that matters for this bug.
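[Editor's note: for reference, a minimal sketch of what such a store-only testcase might look like — this is an assumed shape, not the actual attachment. It stores one big value while logging long gaps between requestAnimationFrame callbacks.]
// Hypothetical store-only testcase: put a large value into IndexedDB
// while watching for requestAnimationFrame gaps (i.e., dropped frames).
const req = indexedDB.open("idb-jank-test", 1);
req.onupgradeneeded = () => req.result.createObjectStore("store");
req.onsuccess = () => {
  const db = req.result;

  // Watch for rAF gaps: anything much longer than one ~16.7ms frame is jank.
  let last = performance.now();
  function frame(now) {
    if (now - last > 50) console.log(`rAF gap: ${(now - last).toFixed(0)} ms`);
    last = now;
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);

  // Store a big (~90MB) value; this variant never reads it back.
  const tx = db.transaction("store", "readwrite");
  tx.objectStore("store").put(new Uint8Array(90 * 1024 * 1024), "big");
  tx.oncomplete = () => console.log("store complete");
};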
Comment 20•4 years ago
This testcase also retrieves the value. Here we're significantly worse than Chrome on my machine.
Comment 21•4 years ago
(fix typo)
Comment 22•4 years ago
https://share.firefox.dev/3ntT5Q0 <= this one is when retrieving from IDB.
Comment 23•4 years ago
Hey Iain, Bas told me you could be interested in this :-) We spend a lot of time in js::NativeObject::addPropertyInternal.
Comment 24•4 years ago
Actually access the result, so that Chrome actually transfers it; this makes the comparison fair.
Comment 26•4 years ago
Jan points out in chat that "JSStructuredCloneReader::read looks pretty suboptimal". He's currently working on improving our representation of shapes (bug 1704430); once that's done we can bypass a lot of the unnecessary work here.
Opened bug 1708272.
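[Editor's note: a small hedged demo of why this matters — it uses the structuredClone() global, which is newer than this bug, but the IDB read path goes through the same JSStructuredCloneReader. An object with many properties deserializes much more slowly per byte than a typed array, because each property addition goes through the shape machinery (e.g., addPropertyInternal).]
// Hedged micro-demo: structured-clone cost is dominated by property
// (shape) work, not by the raw byte count of the data.
function time(label, fn) {
  const t0 = performance.now();
  fn();
  console.log(label, (performance.now() - t0).toFixed(1), "ms");
}

// ~8MB of raw data in one typed array: few properties, many bytes.
const bytes = new Uint8Array(8 * 1024 * 1024);

// ~100k properties: few bytes each, lots of shape work on read.
const props = {};
for (let i = 0; i < 100_000; i++) props["p" + i] = i;

time("typed array", () => structuredClone(bytes));
time("many properties", () => structuredClone(props));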
Reporter
Comment 27•4 years ago
These testcases look like a rather different problem - they're spending time on the main thread. The original bug was about delays that are not due to work on the main thread, but due to work in the IPC system.
Comment 28•4 years ago
(In reply to Markus Stange [:mstange] from comment #27)
> These testcases look like a rather different problem - they're spending time on the main thread. The original bug was about delays that are not due to work on the main thread, but due to work in the IPC system.
Yeah, I agree.
In the profiles I took, I couldn't find anything interesting in the IPDL Background thread; that's why I didn't keep it when sharing. So could this specific bug be fixed by bug 1690921, then?
Here is another profile with all the same threads as your previous profile: https://share.firefox.dev/3u4SdE0 (store then retrieve a big object).
I think the time spent in IPC looks reasonable now, what do you think?
Comment 29•4 years ago
Some more things from looking at this profile:
- moveToTenured happens repeatedly when ReadStructuredClone runs; can we check the size of the data and deserialize directly into the tenured generation?
- In the "IPDL Background" thread, smaug thinks that free happens in the same task; we should move that to a separate task so that the thread isn't blocked.
- Bas thinks that ComputeCrc32c could be optimized significantly.
- IPC is ~100ms for ~90MB (that's roughly 900MB/s); on my profiling machine from 5 years ago, this looks OK.
IPC I/O Parent looks pretty good now. We think that it's not P1 responsiveness anymore; please change if you disagree.
I'll file bugs for these few things.
Comment 30•4 years ago
Florian noticed that in the "store" part of the profile (the first operation), in the IndexedDB thread, there are 2 long fsync operations that blocked the profiler sampling. It would be nice to double-check on more profiles.
Comment 31•4 years ago
(In reply to Julien Wajsberg [:julienw] from comment #30)
> Florian noticed that in the "store" part of the profile (the first operation), in the IndexedDB thread, there are 2 long fsync operations that blocked the profiler sampling. It would be nice to double-check on more profiles.
What does "blocked" mean in this context?
Comment 32•4 years ago
I mean there's a big hole in samples on this thread during fsync.
Florian filed bug 1709111 about this issue.
Comment 33•4 years ago
Does anyone know why we're moving this much memory around? Maybe there's a way to make these queries return less data. I saw in comment 16 that we will just crash when the result size hits 256MB.
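[Editor's note: one way a page can keep individual responses small today, without any platform changes, is to chunk large values across several records and reassemble them with a cursor. A hedged sketch follows; the store name, key scheme, and chunk size are all illustrative.]
// Hypothetical chunking scheme: no single IDB response (and thus no
// single IPC message) ever carries the whole payload.
const CHUNK = 8 * 1024 * 1024; // 8MB per record

function putChunked(store, key, bytes) {
  for (let i = 0, n = 0; i < bytes.length; i += CHUNK, n++) {
    // slice() copies; cloning a subarray view would serialize the
    // entire underlying buffer with every chunk.
    const id = `${key}:${String(n).padStart(6, "0")}`; // zero-pad so keys sort in order
    store.put(bytes.slice(i, i + CHUNK), id);
  }
}

function getChunked(db, key) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    // ";" is the code point after ":", so this range covers every chunk key.
    const range = IDBKeyRange.bound(`${key}:`, `${key};`, false, true);
    const req = db.transaction("store").objectStore("store").openCursor(range);
    req.onerror = () => reject(req.error);
    req.onsuccess = () => {
      const cursor = req.result;
      if (cursor) { chunks.push(cursor.value); cursor.continue(); }
      else resolve(new Blob(chunks)); // reassemble lazily as a Blob
    };
  });
}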
Comment 34•4 years ago
IndexedDB keeps an approximation of its results size and will generate a failure (and/or stop prefetching records) if the results get too large. It won't actually hit the crash size.
We can't actually stop users from putting giant things in IndexedDB without cross-browser standardization, and the lack of a computed size for structured serialization means that it can't yet be standardized, although there have been some partial efforts to that end.
There is a preprocess mechanism that could be used to avoid sending IDB responses as a single big message, but this inherently introduces additional latency, and it's not a switch that can just be flipped: it also breaks the response ordering required by the spec, and compensations for that are not yet implemented.
I've re-titled the bug since it sounds like the quadratic performance hit of bug 1690921 was the cause of the very serious-sounding vsync jank.