Closed Bug 1750930 Opened 3 years ago Closed 3 years ago

Indirect stubs code has quadratic table insertion time (was: Thread pool issue in emscripten web application)

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

Firefox 96
defect

Tracking

()

RESOLVED FIXED
98 Branch
Tracking Status
thunderbird_esr91 --- unaffected
firefox-esr91 --- unaffected
firefox96 --- wontfix
firefox97 + fixed
firefox98 + fixed

People

(Reporter: konstantinos.zagklis, Assigned: lth)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files)

Attached file MozillaBugReport.txt

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0

Steps to reproduce:

We ran an emscripten application that was working with FF 95 on the latest release (96) and it has become unresponsive.

The application utilizes emscriptens emsdk to build a wasm file, which has a thread pool in its C++ code.

More details in the attached file.

Actual results:

The application has been experiencing severely degraded performance and is unusable.
It hangs completely after a while.

Expected results:

The application should have worked as in FF 95.

Maybe this emscripten issue is related?
https://github.com/emscripten-core/emscripten/issues/15978

The Bugbug bot thinks this bug should belong to the 'Core::Javascript: WebAssembly' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Javascript: WebAssembly
Product: Firefox → Core

Do you happen to have either an online presence for this application or an archive of it you can share (either here or privately with me)? This is going to be hard to debug otherwise.

And/or, can you test with the latest Firefox Nightly? (See https://www.mozilla.org/en-US/firefox/channel/desktop/ for downloads.) I fixed an oom-during-streaming-compilation issue earlier this week and it would be useful to know if that was the same problem as what you're seeing.

Flags: needinfo?(konstantinos.zagklis)

From an initial profile in a clean local release build, it looks like the insert() method of the indirect stubs table is the problem, and more generally, that initElems takes 100% of the CPU for long stretches, due to insert and to a very, very heavily contended lock (or a very buggy lock). For a large thread pool it could take a long time to initialize everything. Not conclusive, but very suggestive. See https://share.firefox.dev/3Imxe6a.

27:36.37 INFO: Last good revision: d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0
27:36.37 INFO: First bad revision: 2200d1f5c9563332ec8162220ffa392b98a850da
27:36.37 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0&tochange=2200d1f5c9563332ec8162220ffa392b98a850da

The pushlog is exactly the landing of the indirect stubs code.

Assignee: nobody → lhansen
Severity: -- → S2
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Keywords: regression
OS: Unspecified → All
Priority: -- → P1
Regressed by: 1743586, 1639153
Hardware: Unspecified → All
Summary: Thread pool issue in emscripten web application → Indirect stubs code has serious lock contention (was: Thread pool issue in emscripten web application)

(Lock contention is just a symptom; renaming again.)

The problem here is the following. The application has a table containing 47951 functions. Initially we create indirect stubs for these after baseline-compiling the code; this takes 53ms. But then there are additional worker threads. Because the table is in the Module and the Module is shared, these threads also create stubs and insert them in the table (this is logically the right thing to do). But then we tier-2 compile the code, and because code pointers are different for tier-2 code we create new stubs for all the threads and insert those in the table too, for a total of ... stubs. In the end, the table has 815167 entries and creating one set of 48K stubs takes 55 seconds.

This is not the only thing going on; there appears to be more than one module involved too, but the second module is not a problem.

The main problem is the insertion code. It uses a binary search to find the insertion point (yeah!) but then it moves all elements that are in the way one spot to the right (well, ok) and then it does this repeatedly for every element it is trying to insert, and that basically means the code is O(n^2) in the size of the table (boo hiss).

None of this can happen in parallel, so all the threads are completely serialized when they do this, and so getting the page up and running can take a very long time indeed.

The stub creation happens because the table is not private to the module, and maybe it could be avoided, but that's the architecture we have right now and we're actively discussing what to do about that and that discussion will take some time to complete. So in the mean time the most appropriate fix here is to make the insertion code much faster, either by using a better algorithm or by using a better data structure than one linear table per module.

Summary: Indirect stubs code has serious lock contention (was: Thread pool issue in emscripten web application) → Indirect stubs code has quadratic table insertion time (was: Thread pool issue in emscripten web application)
Flags: needinfo?(konstantinos.zagklis)

Break the indirect stubs table it into per-tls tables to avoid the
very long insertion times resulting from having one shared table per
module.

The patch fixes the problem and is fairly contained. It could be smarter (but then it would be less contained). It needs to be cleaned up in various ways before landing. We'll want to uplift this to beta for sure.

Please post here when the beta is live so we can do a smoke test and verify our application is working as expected

Has Regression Range: --- → yes
Has STR: --- → yes
Attachment #9259921 - Attachment description: WIP: Bug 1750930 - Make the indirect stub table two-level → Bug 1750930 - Make the indirect stub table two-level. r?yury
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/edd95f545caf Make the indirect stub table two-level. r=yury
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch

Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury

Beta/Release Uplift Approval Request

  • User impact if declined: Some wasm applications will have really nasty hang-like behavior on startup.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Fairly contained reengineering of a data structure to avoid quadratic behavior when table grows large.
  • String changes made/needed:
Attachment #9259921 - Flags: approval-mozilla-beta?

Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury

Approved for 97.0b7.

Attachment #9259921 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Konstantinos from comment #9)

Please post here when the beta is live so we can do a smoke test and verify our application is working as expected

The fix is in the FF98 Nightly builds now. I don't know if it made it into FF97 Beta 7 but it will be in FF97 Beta 8, which is built tomorrow, and should be available to beta users by Wednesday of this week. We're not fixing FF96. FF97 releases on February 8.

Reply to Lars T Hansen
Tested on Firefox developer edition 97.0b8 and 98.0a1

  1. Too slow (compare to previous than FF 96 and hromium based browsers)
  2. Application times out in a lot of Network operations
  3. Some functionalities not working at all
  4. Cannot perform VoIP calls like FF 95 and older
  5. No code errors, ONLY time outs

Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Tested on Firefox developer edition 97.0b8 and 98.0a1

1.Too slow (compare to previous than FF 96 and chromium based browsers)
2.Application times out in a lot of Network operations
3.Some functionalities not working at all
4.Cannot perform VoIP calls like FF 95 and older
5.No code errors, ONLY time outs

Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?

Flags: needinfo?(lhansen)

Hi Konstantinos,

I will re-triage today and we'll see about what to do here. The original report was however about the slowness in establishing the initial connection (quoted below), and generally we try not to handle multiple different problems with the same bug report, so we'll likely want to close this one and file additional bugs for additional problems. I'm leaving it open for the time being.


Recording additional information from submitter, communicated through email:

You can access our application from this url:  login.teamviewer.com/Connect

Use the following credentials
E-Mail: <REDACTED>
Password: <REDACTED>

In the application's main screen, fill in the text box this id: <REDACTED> and press connect when the button appears.

Using FF 95.0.2 the button will appear in ~20sec
Using FF 96.0.1 up to Firefox Nightly 98.0a1 will take more than 5min and the application won't work
Flags: needinfo?(lhansen)

The original report's problem with the startup time remains fixed (mostly -- there's a slightly longer lag with the fix than there was in FF95 but this is acceptable to me) but once started the application seems significantly laggier in recent builds than in FF95, I've received a video through other chanels with a side-by-side comparison.

According to the profiler there appears to be tremendous lock contention in js::wasm::Code::lookupIndirectStubRange which is called from Table::getFuncRef which is called from WasmTableObject::getImpl which is called from WasmTableObject::get. That is, the code very heavily reads functions from Table<funcref>. This is bound to be slow regardless, but the lock is a new thing that came in with the indirect stubs.

Profile for Nightly build 2022-01-30: https://share.firefox.dev/3ALszbn

(Unfortunately, FF95 does not appear to have worker profiling, but I'm going to try to obtain a good profile from FF96 before the indirect stubs landed, in the hope that worker profiles are available there.)

The reporter confirms that the application is compiled with -O2, so there should be no artifacts of debug code.

Profile for FF96 just before the indirect stubs landed: https://share.firefox.dev/3IPgi8K.

The application has 29 worker threads in the "WebCOOP+COEP Content" process. Certainly this creates room for some contention, should there be a lock.

The profiles are not completely comparable, but one might start the FF98 profile (previous comment) around 25s and focus only on the WebCOOP+COEP process. The execution time for most threads is dominated by futexwait - they're mostly idle - but for those threads that have long periods of activity initially in the segment starting at 25s, 20%-55% of the time is spent either in lock contention (dominates) or in LookupInSorted. Clearly these threads will tend to serialize each other since the lookups cannot happen in parallel. And the other threads are performing the same operations, which will tend to further exacerbate overhead and cause contention.

It was already known that we would want to avoid the lock in lookupIndirectStubRange, but it was assumed that applications would not hammer on this code in the manner of TeamViewer and that the lock would be affordable.

See Also: → 1752870

I'm going to close this and spin off bug 1752870 for the new issue, which is related but different. Will copy comment 19 and comment 20 into that and populate it with STR, and will cc key people.

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Regressed by: 1742053
No longer regressed by: 1743586
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: