Closed Bug 1752870 Opened 2 years ago Closed 2 years ago

Lock contention and significant lookup time in indirect stubs table

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

Tracking

RESOLVED FIXED
98 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox96 --- wontfix
firefox97 --- wontfix
firefox98 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files, 1 obsolete file)

The TeamViewer wasm app is heavily threaded and is experiencing many problems after the landing of the indirect stubs code. Some were fixed in bug 1750930 but a number remain (bug 1750930 comment 16), quoting the reporter:

  1. Too slow (compared to versions before FF 96 and to Chromium-based browsers)
  2. Application times out in a lot of network operations
  3. Some functionalities not working at all
  4. Cannot perform VoIP calls, unlike in FF 95 and older
  5. No code errors, ONLY timeouts

Profiles show that a major culprit is lock contention and long lookup times in the indirect stubs lookup, which happens when stubs are looked up via table.get on funcref tables (bug 1750930 comments 19-20, which I will copy into comments here); this needs to be fixed first.

Speeding up the lookup will reduce contention, so I will look into that first, but I'm guessing the lock needs reengineering in practice (or we back out the indirect stubs code).

See Also: → 1750930

Bug 1750930 comment 19:

The original report's problem with the startup time remains fixed (mostly -- there's a slightly longer lag with the fix than there was in FF95, but this is acceptable to me), but once started the application seems significantly laggier in recent builds than in FF95; I've received a video through other channels with a side-by-side comparison.

According to the profiler there appears to be tremendous lock contention in js::wasm::Code::lookupIndirectStubRange which is called from Table::getFuncRef which is called from WasmTableObject::getImpl which is called from WasmTableObject::get. That is, the code very heavily reads functions from Table<funcref>. This is bound to be slow regardless, but the lock is a new thing that came in with the indirect stubs.

Profile for Nightly build 2022-01-30: https://share.firefox.dev/3ALszbn

(Unfortunately, FF95 does not appear to have worker profiling, but I'm going to try to obtain a good profile from FF96 before the indirect stubs landed, in the hope that worker profiles are available there.)

The reporter confirms that the application is compiled with -O2, so there should be no artifacts of debug code.

Bug 1750930 comment 20:

Profile for FF96 just before the indirect stubs landed: https://share.firefox.dev/3IPgi8K.

The application has 29 worker threads in the "WebCOOP+COEP Content" process. Certainly this creates room for some contention, should there be a lock.

The profiles are not completely comparable, but one might start the FF98 profile (previous comment) around 25s and focus only on the WebCOOP+COEP process. The execution time for most threads is dominated by futexwait - they're mostly idle - but for those threads that have long periods of activity initially in the segment starting at 25s, 20%-55% of the time is spent either in lock contention (dominates) or in LookupInSorted. Clearly these threads will tend to serialize each other since the lookups cannot happen in parallel. And the other threads are performing the same operations, which will tend to further exacerbate overhead and cause contention.

It was already known that we would want to avoid the lock in lookupIndirectStubRange, but it was assumed that applications would not hammer on this code in the manner of TeamViewer and that the lock would be affordable.

Steps to reproduce at least some symptoms:

Go to <REDACTED>. Log in with <REDACTED>/<REDACTED>. In the view that is shown, click on "All" under "Groups" at the lower left, then "Connect" in the view that is shown, then "WebClient" and "Connect". A new tab opens up, wait for it to load, then click "Connect". A few things are shown on a black background. Click on "Communicate" and a little tab pops down. Mouse over the icons and notice these are sluggish; they should not be.

Profiling notes:

A local release build may have a more informative profile because it has not been subject to PGO/LTO, but of course it may be misleading for that reason. For example, in an actual release build I see LookupInSorted being hot, while in a local Release build I see BinarySearch (called from LookupInSorted) being hot.

By and large, since the sluggishness of the UI depends on the speed of the computer and other factors, a profile may be a better guide to what's going on here. I tend to start the profile before I click on "All" in the above STR, and stop it after I have moused over the tabs in the toolbar a few times, for a total profile time of about 1 minute.

As an experiment, I reversed the order of the tiers searched for the code range (so, tier2 before tier1). This made things much worse, which is sensible since there will be a lot of tier1 code in the table, as the table is initialized early, before tier2 code is available. When we search tier2 before tier1 (once tier2 code is available), every search among the tier2 stubs fails to produce a hit before we move on to tier1; when we search tier1 before tier2, we do half the work.
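
To make the asymmetry concrete, here is a minimal standalone sketch of a two-tier lookup in the tier1-before-tier2 order. The type and function names are illustrative stand-ins, not the actual SpiderMonkey structures, and the per-tier containers are assumed to be sorted by code address.

  // Illustrative stand-ins only; not the real SpiderMonkey types.
  #include <algorithm>
  #include <cstdint>
  #include <optional>
  #include <vector>

  struct StubRange {
    uintptr_t begin;     // start address of the stub's code
    uintptr_t end;       // one past the end
    uint32_t funcIndex;  // function the stub forwards to
  };

  // Binary-search one tier's ranges (sorted by begin address) for pc.
  static std::optional<StubRange>
  lookupInTier(const std::vector<StubRange>& ranges, uintptr_t pc) {
    auto it = std::upper_bound(ranges.begin(), ranges.end(), pc,
                               [](uintptr_t p, const StubRange& r) {
                                 return p < r.begin;
                               });
    if (it == ranges.begin()) {
      return std::nullopt;
    }
    --it;
    if (pc < it->end) {
      return *it;
    }
    return std::nullopt;
  }

  // Search tier1 before tier2: most pointers in the table refer to stubs
  // generated against tier1 code (it exists first), so the common case hits
  // immediately instead of paying for a guaranteed miss in tier2.
  std::optional<StubRange>
  lookupStub(const std::vector<StubRange>& tier1,
             const std::vector<StubRange>& tier2, uintptr_t pc) {
    if (auto hit = lookupInTier(tier1, pc)) {
      return hit;
    }
    return lookupInTier(tier2, pc);
  }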

All of that variability is another reason to get rid of the lock, of course, or to radically speed up the search. Or to get rid of the lookup altogether - increasingly attractive, but it probably depends on storing yet more information in the table. Basically, we only want the code range because we want its function index. The CodeRange* is not stable in the face of insertions in the table, but the range's function index is. We could store that in the table.
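
A minimal sketch of that last idea, assuming a table element already carries the code pointer and instance that the call path needs; TableElem, FuncTable, and the field names here are hypothetical, not the real layout.

  #include <cstdint>
  #include <vector>

  // Hypothetical table element: cache the (stable) function index next to the
  // code pointer so table.get never needs a code-range search to recover it.
  struct TableElem {
    void* code;          // entry point, possibly an indirect stub
    void* instance;      // owning instance, for cross-instance calls
    uint32_t funcIndex;  // stable across insertions, unlike a CodeRange*
  };

  struct FuncTable {
    std::vector<TableElem> elems;

    // table.get path: read the index directly; no lock, no per-tier search.
    uint32_t getFuncIndex(size_t i) const { return elems[i].funcIndex; }

    // table.set / initialization path: whoever installs the element already
    // knows the function index, so recording it is cheap.
    void set(size_t i, void* code, void* instance, uint32_t funcIndex) {
      elems[i] = TableElem{code, instance, funcIndex};
    }
  };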

There are three problems:

Lock contention can be handled with a reader/writer lock that allows readers to be concurrent. I have a hack that seems to work; a profile of a longer run is here: https://share.firefox.dev/340nkJf. Everything feels pretty snappy with this fix: I can open the folders, run the browser, and so on. More testing is needed, but this is promising, and the profile does look good; contention is completely absent now. It would probably be worth landing a fix for this alone.
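
For illustration, a minimal standalone sketch of the multiple-reader/single-writer idea using std::shared_mutex; the actual patch is an ExclusiveData variant inside SpiderMonkey, so the class name and methods below are assumptions, not the real API.

  #include <mutex>
  #include <shared_mutex>
  #include <utility>

  // Illustrative read/write-guarded value, in the spirit of a
  // multiple-reader/single-writer ExclusiveData variant.
  template <typename T>
  class RWExclusiveData {
    mutable std::shared_mutex lock_;
    T value_;

   public:
    template <typename... Args>
    explicit RWExclusiveData(Args&&... args)
        : value_(std::forward<Args>(args)...) {}

    // Many threads may read concurrently, so lookups triggered by table.get
    // on different workers no longer serialize on a single mutex.
    template <typename F>
    auto read(F&& f) const {
      std::shared_lock<std::shared_mutex> guard(lock_);
      return f(static_cast<const T&>(value_));
    }

    // Writers (e.g. installing newly generated stubs) still get exclusive
    // access.
    template <typename F>
    auto write(F&& f) {
      std::unique_lock<std::shared_mutex> guard(lock_);
      return f(value_);
    }
  };

A lookup would then run roughly as stubs.read(...) around the existing search while stub generation takes the write side; the point is only that read-side critical sections can overlap.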

After that, some threads become very, very busy searching the stubs table for the appropriate code range for a function pointer. One thread appears to spend almost 70% of its time on this activity alone, triggered by table.get, which appears to be called from JS. This suggests an architectural problem with emscripten and/or the application, but we should try to mitigate it even so. The stubs table is searched per-tier by LazyStubTier::lookupRange, which calls LazyStubSegment::lookupRange on each segment in order, and the latter binary-searches the segment's codeRanges_ array for a match. At steady state there are almost 350 lazy stubs segments, almost all of them small segments devoted to entry stubs, which will yield no hits. For indirect stubs there are about 30, and they are very large - on the order of 45000 code ranges per segment. This strikes me as an architectural problem on our part. Stub segments can have stubs of mixed types but almost never will, and it would be easy to partition and search only the ones that are relevant. That might be a start. Some type of hashed LRU cache on the front of the stubs might be appealing too.
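
As an illustration of the cache idea mentioned above (it composes with partitioning the segment walk by stub kind), here is a minimal sketch of a small direct-mapped cache keyed on the code pointer being resolved; the class, its size, and the slowLookup hook are all hypothetical.

  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <optional>

  // Hypothetical front-side cache for stub lookups. Not thread-safe as
  // written; in the real engine it would need to be per-thread or guarded.
  class StubLookupCache {
    static constexpr size_t kSlots = 256;  // power of two, illustrative size
    struct Entry {
      uintptr_t pc = 0;
      uint32_t funcIndex = 0;
    };
    std::array<Entry, kSlots> entries_{};

    static size_t slot(uintptr_t pc) {
      // Code addresses are aligned, so shift before masking to spread them.
      return (pc >> 4) & (kSlots - 1);
    }

   public:
    // 'slowLookup' stands in for the existing per-segment binary search
    // (LazyStubTier::lookupRange and friends).
    std::optional<uint32_t> lookup(
        uintptr_t pc,
        const std::function<std::optional<uint32_t>(uintptr_t)>& slowLookup) {
      Entry& e = entries_[slot(pc)];
      if (e.pc == pc) {
        return e.funcIndex;  // hot case: repeated table.get of the same entry
      }
      std::optional<uint32_t> hit = slowLookup(pc);
      if (hit) {
        e = Entry{pc, *hit};  // remember for next time
      }
      return hit;
    }
  };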

The third problem is memory consumption: we allocate almost 150MB more code with the indirect stubs code present than without, almost doubling code size from what I can see. I need to investigate this more, but it does not look great.

(Also, about:memory appears unable to properly handle code memory shared between workers, but that's a separate bug.)

Has Regression Range: --- → yes

Optimizing the search in wasm::LookupInSorted by inlining the binary search and optimizing comparisons only shows how hot this function is, with close to 80% of the running time on a couple of the threads (the profile was lost, sorry), and the ultimate caller, WasmTableObject::getImpl, higher still. This suggests that work is always waiting to be processed and that we're basically blocking on this function. So the search must be improved somehow, with smarter filtering of where we search, caches, or avoiding the search altogether.
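
For reference, a hand-specialized search of this kind might look roughly like the sketch below, with a single integer comparison per step and the containment check done once at the end; this is illustrative only, not the actual wasm::LookupInSorted code.

  #include <cstdint>
  #include <vector>

  struct StubRange {
    uintptr_t begin, end;  // [begin, end), sorted by begin
    uint32_t funcIndex;
  };

  // Specialized binary search: one comparison per iteration, no comparator
  // object, final containment test outside the loop.
  const StubRange* lookupSorted(const std::vector<StubRange>& ranges,
                                uintptr_t pc) {
    size_t lo = 0;
    size_t hi = ranges.size();
    while (lo < hi) {
      size_t mid = lo + (hi - lo) / 2;
      if (ranges[mid].begin <= pc) {
        lo = mid + 1;  // answer, if any, is mid or later
      } else {
        hi = mid;      // answer, if any, is strictly before mid
      }
    }
    if (lo == 0) {
      return nullptr;  // pc precedes every range
    }
    const StubRange& r = ranges[lo - 1];
    return pc < r.end ? &r : nullptr;
  }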

Adding a filter before the search to check whether the value we're seeking can even be in the table (after all, it's sorted), the total time for WasmTableObject::getImpl on the busiest threads drops to 30%, with the time distributed among locking, searching, and iterating. This is still not great, but clearly the application is no longer really blocked on this operation, as even the busiest threads spend 60% of their time in atomic.wait. (There could be secondary effects, though, since the thread is still busy and other threads may not be serviced as quickly as they otherwise would be.)
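
The exact shape of the landed filter is in the attachment; as a minimal reading of the idea, a pre-filter over a sorted range list might look like this sketch (the names are stand-ins):

  #include <cstdint>
  #include <vector>

  struct StubRange {
    uintptr_t begin, end;  // [begin, end), sorted by begin
    uint32_t funcIndex;
  };

  // Cheap pre-filter: because the ranges are sorted, a pc outside
  // [front.begin, back.end) cannot be found here, so the binary search
  // (and the work around it) can be skipped for this range list entirely.
  inline bool mayContain(const std::vector<StubRange>& ranges, uintptr_t pc) {
    return !ranges.empty() && pc >= ranges.front().begin &&
           pc < ranges.back().end;
  }

  // Usage sketch inside a per-segment walk:
  //   if (!mayContain(segment.codeRanges, pc)) continue;  // next segment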

Profile after these optimizations: https://share.firefox.dev/3GjMXSn

Depends on D137516

Depends on D137517

Attachment #9261670 - Attachment description: WIP: Bug 1752870 - reader/writer ExclusiveData variant (HACK) (WIP) → WIP: Bug 1752870 - Multiple-reader/single-writer ExclusiveData variant.
Attachment #9261672 - Attachment description: WIP: Bug 1752870 - filter before searching (WIP) → WIP: Bug 1752870 - Filter code ranges before searching.

The hand-optimized ("bummed") binary search is faster, but the other two patches provide almost all of the gains, so we'll land only those two initially.

Attachment #9261670 - Attachment description: WIP: Bug 1752870 - Multiple-reader/single-writer ExclusiveData variant. → Bug 1752870 - Multiple-reader/single-writer ExclusiveData variant. r?jonco
Attachment #9261672 - Attachment description: WIP: Bug 1752870 - Filter code ranges before searching. → Bug 1752870 - Filter code ranges before searching. r?rhunt
Regressed by: 1742053
No longer regressed by: 1743586
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cf9543eb9137
Multiple-reader/single-writer ExclusiveData variant. r=jonco
https://hg.mozilla.org/integration/autoland/rev/519a25bb4431
Filter code ranges before searching. r=rhunt
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch

Konstantinos, the fix for this should be in this morning's Nightly 98 or, in the worst case, the one coming this evening; perhaps when you have a moment you can try to verify that things are working OK. The fix will be in FF98 which goes to beta next week, but will not be backported to FF97 (released next week) or FF96. There will be further work on this; in FF99 I hope to return us to the situation we had in FF95, which I believe is even better than what we'll have in FF98 (see bug 1753061). Thanks again for the bug report and don't hesitate to file more; your application is apparently a stress test for Firefox.

Lars, we tested on 98.0a1. It seems our app is now working better. We will double-check again on the FF98 beta release.

Attachment #9261671 - Attachment is obsolete: true