Closed Bug 1836422 Opened 2 years ago Closed 2 years ago

Experiment ScriptPreloader with JS::FrontendContext-based APIs

Categories

(Core :: JavaScript Engine, task, P3)

Tracking

RESOLVED FIXED
117 Branch
Tracking Status
firefox117 --- fixed

People

(Reporter: arai, Assigned: arai)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

As an experiment to see how the JS::FrontendContext-based APIs work in the wild,
rewrite ScriptPreloader's JS::DecodeMultiStencilsOffThread consumer with the JS::FrontendContext-based API and manual thread management.

The attached patch is an almost direct translation, using the following:

  • JS::FrontendContext and JS::DecodeStencil for decoding
  • Runnable and NS_DispatchBackgroundTask for off-thread task dispatch
  • Vector<RefPtr<JS::Stencil>>* instead of JS::OffThreadToken* for passing the result to the main thread

I haven't touched batching or prioritization yet, but there may be room to optimize based on this approach.

The existing comments discuss the cost of each off-thread decode operation and the balancing act it requires.
With manual thread handling, that situation and those requirements can change.
For example, we could use a single background task that gradually sends the decoded data to the main thread,
instead of re-dispatching the task for each chunk.

https://searchfox.org/mozilla-central/rev/cde3d4a8d228491e8b7f1bd94c63bbe039850696/js/xpconnect/loader/ScriptPreloader.h#383-407

// There's a significant setup cost for each off-thread decode operation,
// so scripts are decoded in chunks to minimize the overhead. There's a
// careful balancing act in choosing the size of chunks, to minimize the
// number of decode operations, while also minimizing the number of buffer
// underruns that require the main thread to wait for a script to finish
// decoding.
//
// For the first chunk, we don't have much time between the start of the
// decode operation and the time the first script is needed, so that chunk
// needs to be fairly small. After the first chunk is finished, we have
// some buffered scripts to fall back on, and a lot more breathing room,
// so the chunks can be a bit bigger, but still not too big.
static constexpr int OFF_THREAD_FIRST_CHUNK_SIZE = 128 * 1024;
static constexpr int OFF_THREAD_CHUNK_SIZE = 512 * 1024;

// Ideally, we want every chunk to be smaller than the chunk sizes
// specified above. However, if we have some number of small scripts
// followed by a huge script that would put us over the normal chunk size,
// we're better off processing them as a single chunk.
//
// In order to guarantee that the JS engine will process a chunk
// off-thread, it needs to be at least 100K (which is an implementation
// detail that can change at any time), so make sure that we always hit at
// least that size, with a bit of breathing room to be safe.
static constexpr int SMALL_SCRIPT_CHUNK_THRESHOLD = 128 * 1024;
Attachment #9337133 - Attachment description: WIP: Bug 1836422 - Rewrite ScriptPreloader with JS::FrontendContext-based APIs. → WIP: Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs.

Here's the experimentation result:

Talos ts_paint

type                      ts_paint
unpatched                 542.7
part 1                    545.7
part 2 (queue size = 10)  547
part 2 (queue size = 32)  544
part 2 (queue size = 1)   547

There's almost no difference between unpatched and patched, or between the different configurations. Reasons I can think of are the following:

  • Stencil XDR decode is a lightweight operation (compared to the previous JSScript XDR decode), because:
    • most data is used directly as mmapped data
    • only a pointer vector is allocated
    • no JS runtime interaction happens
  • the test is executed multiple times sequentially, which means the cache file is kept in memory instead of being written to and read back from the disk

How long the main thread is blocked without an in-memory cache

Setup

Test the following with the unpatched build and the patched build (= part 1 + part 2, queue size = 16).

Add logging at the following place to see how long the main thread waited for off-thread decoding in the mal.Wait() call:

https://searchfox.org/mozilla-central/rev/887d4b5da89a11920ed0fd96b7b7f066927a67db/js/xpconnect/loader/ScriptPreloader.cpp#997-1010

MonitorAutoLock mal(mMonitor);

// Process script batches until our target is found.
while (!script->mReadyToExecute) {
  if (JS::OffThreadToken* token = mToken.exchange(nullptr)) {
    MonitorAutoUnlock mau(mMonitor);
    FinishOffThreadDecode(token);
  } else {
    MOZ_ASSERT(!mParsingScripts.empty());
    mWaitingForDecode = true;
    mal.Wait();
    mWaitingForDecode = false;
  }
}

Launch Nightly with a temporary profile placed on each of the following, 8 times each:

  • internal SSD
  • 2.5inch HDD connected over USB

In order to reduce the effect of the in-memory cache of the cache files, the SSD and HDD profiles are launched alternately.

If the same profile is launched multiple times in a row, no wait happens.

Result

SSD total

type       average [ms]  min [ms]  max [ms]
unpatched  2.1           1.0       4.9
patched    1.1           0.5       2.6

HDD total

type       average [ms]  min [ms]  max [ms]
unpatched  87            64        107
patched    56            43        65

Both SSD and HDD show improvements.
The reason I can think of is that the off-thread decode task can keep accessing the disk without waiting for the main thread to dispatch the next chunk, which makes it possible to absorb the disk latency.

Here's the graph that represents when each (i-th) script is decoded and accessed, on SSD and HDD, in the parent process (which encodes ~250 files).

In the SSD case, only the first ~5 files can block main-thread execution.
In the HDD case, the first ~60 files can block main-thread execution.

So, the queuing strategy should reflect this, for example:

  • the first n% of files need to be quickly decoded and passed to the main thread, so the chunk size should be small
  • the other files don't need to be queued frequently, and the chunks can be large

The current part 2 patch uses a fixed chunk size for the queue elements, for simplicity, but we could use a variable size depending on the index, or simply on the total decoded size:

  // The minimum number of stencils in a single chunk.
  static constexpr size_t OFF_THREAD_DECODE_MININUM_CHUNK_SIZE = 8;

Maybe the SPSCQueue element doesn't have to be a Vector, but could simply be a RefPtr<JS::Stencil>;
it's only a matter of how often it performs MonitorAutoLock mal(mMonitor) and mal.Notify().

It looks like calling MonitorAutoLock mal(mMonitor) and mal.Notify() for each decode doesn't regress performance,
so I'll go with that approach.

Blocks: 1837574
Attachment #9337133 - Attachment description: WIP: Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. → Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. r?bthrall!,kmag!
Attachment #9337649 - Attachment description: WIP: Bug 1836422 - Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. → Bug 1836422 - Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. r?bthrall!,kmag!
Pushed by arai_a@mac.com: https://hg.mozilla.org/integration/autoland/rev/10a341592712 Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. r=bthrall,mccr8 https://hg.mozilla.org/integration/autoland/rev/6237f81208fc Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. r=bthrall,mccr8
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch
Regressions: 1880104