Closed Bug 1836422 Opened 2 years ago Closed 2 years ago

Experiment ScriptPreloader with JS::FrontendContext-based APIs

Categories

(Core :: JavaScript Engine, task, P3)

Tracking

RESOLVED FIXED
117 Branch
Tracking Status
firefox117 --- fixed

People

(Reporter: arai, Assigned: arai)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

As an experiment to see how the JS::FrontendContext-based APIs work in the wild,
rewrite ScriptPreloader's JS::DecodeMultiStencilsOffThread consumer with the JS::FrontendContext-based API and manual thread management.

The attached patch is an almost direct translation, using the following:

  • JS::FrontendContext and JS::DecodeStencil for decoding
  • Runnable and NS_DispatchBackgroundTask for off-thread task dispatch
  • Vector<RefPtr<JS::Stencil>>* instead of JS::OffThreadToken* for passing the result to the main thread

I haven't touched batching or prioritization yet, but there may be room to optimize based on this approach.

The existing comments discuss the cost of each off-thread decode operation and the balancing act it requires.
With manual thread handling, that situation and those requirements can change.
For example, we could use a single background task that gradually sends the decoded data to the main thread,
instead of re-dispatching the task for each chunk.

https://searchfox.org/mozilla-central/rev/cde3d4a8d228491e8b7f1bd94c63bbe039850696/js/xpconnect/loader/ScriptPreloader.h#383-407

// There's a significant setup cost for each off-thread decode operation,
// so scripts are decoded in chunks to minimize the overhead. There's a
// careful balancing act in choosing the size of chunks, to minimize the
// number of decode operations, while also minimizing the number of buffer
// underruns that require the main thread to wait for a script to finish
// decoding.
//
// For the first chunk, we don't have much time between the start of the
// decode operation and the time the first script is needed, so that chunk
// needs to be fairly small. After the first chunk is finished, we have
// some buffered scripts to fall back on, and a lot more breathing room,
// so the chunks can be a bit bigger, but still not too big.
static constexpr int OFF_THREAD_FIRST_CHUNK_SIZE = 128 * 1024;
static constexpr int OFF_THREAD_CHUNK_SIZE = 512 * 1024;

// Ideally, we want every chunk to be smaller than the chunk sizes
// specified above. However, if we have some number of small scripts
// followed by a huge script that would put us over the normal chunk size,
// we're better off processing them as a single chunk.
//
// In order to guarantee that the JS engine will process a chunk
// off-thread, it needs to be at least 100K (which is an implementation
// detail that can change at any time), so make sure that we always hit at
// least that size, with a bit of breathing room to be safe.
static constexpr int SMALL_SCRIPT_CHUNK_THRESHOLD = 128 * 1024;
Attachment #9337133 - Attachment description: WIP: Bug 1836422 - Rewrite ScriptPreloader with JS::FrontendContext-based APIs. → WIP: Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs.

Here's the experimentation result:

Talos ts_paint

type                      ts_paint
unpatched                 542.7
part 1                    545.7
part 2 (queue size = 10)  547
part 2 (queue size = 32)  544
part 2 (queue size = 1)   547

There's almost no difference between unpatched and patched, or between the different configurations. Reasons I can think of are the following:

  • Stencil XDR decode is a lightweight operation (compared to the previous JSScript XDR decode), because:
    • most data is used directly as mmapped data
    • only a pointer vector is allocated
    • no JS runtime interaction happens
  • the test is executed multiple times sequentially, which means the cache file is kept in memory instead of being written to and read back from the disk

How long the main thread is blocked without an in-memory cache

Setup

Test the following with the unpatched build and the patched build (= part 1 + part 2, queue size = 16).

Add logging at the following place to see how long the main thread waited for off-thread decoding in the mal.Wait() call:

https://searchfox.org/mozilla-central/rev/887d4b5da89a11920ed0fd96b7b7f066927a67db/js/xpconnect/loader/ScriptPreloader.cpp#997-1010

MonitorAutoLock mal(mMonitor);

// Process script batches until our target is found.
while (!script->mReadyToExecute) {
  if (JS::OffThreadToken* token = mToken.exchange(nullptr)) {
    MonitorAutoUnlock mau(mMonitor);
    FinishOffThreadDecode(token);
  } else {
    MOZ_ASSERT(!mParsingScripts.empty());
    mWaitingForDecode = true;
    mal.Wait();
    mWaitingForDecode = false;
  }
}

Launch Nightly with a temporary profile placed on each of the following, 8 times each:

  • internal SSD
  • 2.5inch HDD connected over USB

In order to reduce the effect of the in-memory cache of the cache files, the SSD and HDD profiles are launched alternately.

If the same profile is launched multiple times in a row, no wait happens.

Result

SSD total

type       average [ms]  min [ms]  max [ms]
unpatched  2.1           1.0       4.9
patched    1.1           0.5       2.6

HDD total

type       average [ms]  min [ms]  max [ms]
unpatched  87            64        107
patched    56            43        65

Both SSD and HDD show improvements.
The reason I can think of is that the off-thread decode task can keep accessing the disk without waiting for the main thread to dispatch the next chunk, which makes it possible to absorb the disk latency.

Here's the graph that represents when each (i-th) script is decoded and accessed, on SSD and HDD, in the parent process (which encodes ~250 files).

In the SSD case, only the first ~5 files can block main-thread execution.
In the HDD case, the first ~60 files can block main-thread execution.

So, the queuing strategy should reflect this, for example:

  • the first n% of files need to be quickly decoded and passed to the main thread, so the chunk size should be small
  • the other files don't need to be queued frequently, and the chunks can be large

The current part 2 patch uses a fixed chunk size for the queue elements, for simplicity, but we could use a variable size depending on the index, or simply on the total decoded size:

  // The minimum number of stencils in a single chunk.
  static constexpr size_t OFF_THREAD_DECODE_MININUM_CHUNK_SIZE = 8;

Maybe the SPSCQueue element doesn't have to be a Vector, but could simply be a RefPtr<JS::Stencil>;
it's only a matter of how often it performs MonitorAutoLock mal(mMonitor) and mal.Notify().

It looks like calling MonitorAutoLock mal(mMonitor) and mal.Notify() for each decode doesn't regress performance,
so I'll go with that approach.

Blocks: 1837574
Attachment #9337133 - Attachment description: WIP: Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. → Bug 1836422 - Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. r?bthrall!,kmag!
Attachment #9337649 - Attachment description: WIP: Bug 1836422 - Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. → Bug 1836422 - Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. r?bthrall!,kmag!
Pushed by arai_a@mac.com: https://hg.mozilla.org/integration/autoland/rev/10a341592712 Part 1: Rewrite ScriptPreloader with JS::FrontendContext-based APIs. r=bthrall,mccr8 https://hg.mozilla.org/integration/autoland/rev/6237f81208fc Part 2: Use single task and SPSCQueue in ScriptPreloader::DecodeTask. r=bthrall,mccr8
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch
Regressions: 1880104