Experiment ScriptPreloader with JS::FrontendContext-based APIs
Categories
(Core :: JavaScript Engine, task, P3)
Tracking
Tracking | Status |
---|---|
firefox117 | fixed |
People
(Reporter: arai, Assigned: arai)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
As an experiment to see how the JS::FrontendContext-based APIs work in the wild, rewrite ScriptPreloader's JS::DecodeMultiStencilsOffThread consumer with the JS::FrontendContext-based API and manual thread management.
Comment 1•2 years ago
Comment 2•2 years ago
The attached patch is an almost direct translation, using the following:
- JS::FrontendContext and JS::DecodeStencil for decoding
- Runnable and NS_DispatchBackgroundTask for off-thread task dispatch
- Vector<RefPtr<JS::Stencil>>* instead of JS::OffThreadToken* for passing the result to the main thread
I haven't touched batching or prioritization yet, but there may be room to optimize based on this approach.
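The result hand-off described above can be sketched in plain C++. This is an illustrative outline only, not the actual patch: std::thread stands in for NS_DispatchBackgroundTask, std::shared_ptr for RefPtr, and the Stencil struct and function names are made up.

```cpp
#include <memory>
#include <thread>
#include <vector>

// Stand-in for JS::Stencil; the real type is reference-counted and held
// via RefPtr<JS::Stencil>.
struct Stencil {
  int id;
};

// Decode the whole cache on one background thread and hand the result
// vector back, mirroring the Vector<RefPtr<JS::Stencil>>* hand-off in the
// patch. Here the caller simply joins; the real code notifies the main
// thread instead.
std::vector<std::shared_ptr<Stencil>> DecodeAllOffThread(int count) {
  std::vector<std::shared_ptr<Stencil>> results;
  std::thread task([&results, count] {
    for (int i = 0; i < count; i++) {
      // Stand-in for one JS::DecodeStencil call on a cached script.
      results.push_back(std::make_shared<Stencil>(Stencil{i}));
    }
  });
  task.join();
  return results;
}
```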
Comment 3•2 years ago
The existing code has some comments about the cost of each off-thread decode operation and the balancing act involved in chunking.
With manual thread handling, the situation and the requirements there can change.
For example, we could use a single background task and gradually send the decoded data to the main thread,
instead of re-dispatching a task for each chunk.
// There's a significant setup cost for each off-thread decode operation,
// so scripts are decoded in chunks to minimize the overhead. There's a
// careful balancing act in choosing the size of chunks, to minimize the
// number of decode operations, while also minimizing the number of buffer
// underruns that require the main thread to wait for a script to finish
// decoding.
//
// For the first chunk, we don't have much time between the start of the
// decode operation and the time the first script is needed, so that chunk
// needs to be fairly small. After the first chunk is finished, we have
// some buffered scripts to fall back on, and a lot more breathing room,
// so the chunks can be a bit bigger, but still not too big.
static constexpr int OFF_THREAD_FIRST_CHUNK_SIZE = 128 * 1024;
static constexpr int OFF_THREAD_CHUNK_SIZE = 512 * 1024;
// Ideally, we want every chunk to be smaller than the chunk sizes
// specified above. However, if we have some number of small scripts
// followed by a huge script that would put us over the normal chunk size,
// we're better off processing them as a single chunk.
//
// In order to guarantee that the JS engine will process a chunk
// off-thread, it needs to be at least 100K (which is an implementation
// detail that can change at any time), so make sure that we always hit at
// least that size, with a bit of breathing room to be safe.
static constexpr int SMALL_SCRIPT_CHUNK_THRESHOLD = 128 * 1024;
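The single-background-task idea could look like the following outline: one producer task decodes everything and pushes results as they complete, while the main thread pops them one at a time instead of waiting for whole chunks. This is a minimal sketch using standard-library primitives, assuming a mutex/condition-variable queue in place of Mozilla's SPSCQueue and Monitor; the names are illustrative, not from the patch.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Minimal single-producer/single-consumer hand-off queue.
class DecodedQueue {
 public:
  void Push(int stencil) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mItems.push_back(stencil);
    }
    mCond.notify_one();  // Corresponds to mal.Notify() in the patch.
  }
  int Pop() {
    std::unique_lock<std::mutex> lock(mMutex);
    mCond.wait(lock, [this] { return !mItems.empty(); });  // mal.Wait()
    int item = mItems.front();
    mItems.pop_front();
    return item;
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCond;
  std::deque<int> mItems;
};

// Single background task decodes everything; no per-chunk re-dispatch.
std::vector<int> DecodeWithSingleTask(int count) {
  DecodedQueue queue;
  std::thread producer([&queue, count] {
    for (int i = 0; i < count; i++) {
      queue.Push(i);  // Stand-in for one decoded stencil.
    }
  });
  std::vector<int> consumed;
  for (int i = 0; i < count; i++) {
    consumed.push_back(queue.Pop());
  }
  producer.join();
  return consumed;
}
```

With a single producer and a single consumer, results arrive on the main thread in decode order, so no per-chunk bookkeeping is needed.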
Comment 4•2 years ago
Experimenting with a single task and SPSCQueue:
https://treeherder.mozilla.org/jobs?repo=try&revision=b2f875c5f6aee1741557c35b689f664f1e2ddb12
Updated•2 years ago
Comment 5•2 years ago
Depends on D179765
Comment 6•2 years ago
Here are the experiment results:
Talos ts_paint
configuration | ts_paint [ms] |
---|---|
unpatched | 542.7 |
part 1 | 545.7 |
part 2 (queue size = 10) | 547 |
part 2 (queue size = 32) | 544 |
part 2 (queue size = 1) | 547 |
There's almost no difference between unpatched and patched, or between the different configurations. The reasons I can think of are the following:
- Stencil XDR decode is a light-weight operation (compared to the previous JSScript XDR decode), because:
  - most data is used as mmapped data
  - only a pointer vector is allocated
  - no JS runtime interaction happens
- the test is executed multiple times sequentially, which means the cache file is kept in memory instead of being written to and read from the disk
How long the main thread blocks without an in-memory cache
Setup
Test the following with unpatched and patched (= part 1 + part 2, queue size = 16) builds.
Add logging at the following place to see how long the main thread waits for off-thread decoding in the mal.Wait() call:
MonitorAutoLock mal(mMonitor);
// Process script batches until our target is found.
while (!script->mReadyToExecute) {
if (JS::OffThreadToken* token = mToken.exchange(nullptr)) {
MonitorAutoUnlock mau(mMonitor);
FinishOffThreadDecode(token);
} else {
MOZ_ASSERT(!mParsingScripts.empty());
mWaitingForDecode = true;
mal.Wait();
mWaitingForDecode = false;
}
}
Launch Nightly with a temporary profile placed in each of the following locations, 8 times each:
- internal SSD
- 2.5-inch HDD connected over USB
To reduce the effect of the in-memory cache of the cache files, the SSD and HDD profiles are launched alternately. If the same profile is launched multiple times in a row, no waits happen at all.
Result
SSD total
type | average [ms] | min [ms] | max [ms] |
---|---|---|---|
unpatched | 2.1 | 1.0 | 4.9 |
patched | 1.1 | 0.5 | 2.6 |
HDD total
type | average [ms] | min [ms] | max [ms] |
---|---|---|---|
unpatched | 87 | 64 | 107 |
patched | 56 | 43 | 65 |
Both SSD and HDD show improvements.
The reason I can think of is that the off-thread decode task can keep reading from the disk without waiting for the main thread to dispatch the next chunk, which makes it possible to absorb the disk latency.
Comment 7•2 years ago
Here's a graph showing when each (i-th) script is decoded and accessed, on SSD and HDD, in the parent process (which encodes ~250 files).
In the SSD case, only the first ~5 files can block main-thread execution.
In the HDD case, the first ~60 files can block main-thread execution.
So the queuing strategy should reflect this, for example:
- the first n% of files need to be decoded and passed to the main thread quickly, so the chunk size should be small
- the remaining files don't need to be queued frequently, so the chunk can be large

The current part 2 patch uses a fixed chunk size for the queue elements, for simplicity, but we could use a variable size depending on the index, or simply on the total decoded size:
// The minimum number of stencils in a single chunk.
static constexpr size_t OFF_THREAD_DECODE_MINIMUM_CHUNK_SIZE = 8;
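A variable chunk size along those lines could be sketched as below. The policy, function name, and thresholds are all made up for illustration; they are not values from the patch.

```cpp
#include <cstddef>

// Hypothetical chunking policy: use small chunks while the main thread is
// likely to be waiting (early in startup), then switch to larger chunks
// once enough scripts are buffered. The constants are illustrative only
// (cf. OFF_THREAD_FIRST_CHUNK_SIZE in the existing code).
size_t ChunkSizeForProgress(size_t totalDecodedBytes) {
  constexpr size_t kSmallChunk = 4;    // stencils per chunk, early phase
  constexpr size_t kLargeChunk = 32;   // stencils per chunk, later phase
  constexpr size_t kEarlyPhaseBytes = 128 * 1024;
  return totalDecodedBytes < kEarlyPhaseBytes ? kSmallChunk : kLargeChunk;
}
```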
Comment 8•2 years ago
Maybe the SPSCQueue element doesn't have to be a Vector, but can simply be RefPtr<JS::Stencil>.
It's only a matter of how often MonitorAutoLock mal(mMonitor) and mal.Notify() are performed.
Comment 9•2 years ago
It looks like calling MonitorAutoLock mal(mMonitor) and mal.Notify() for each decode doesn't regress performance,
so I'll go with that approach.
Updated•2 years ago
Updated•2 years ago
Comment 10•2 years ago
Comment 11•2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/10a341592712
https://hg.mozilla.org/mozilla-central/rev/6237f81208fc