1750930 - Indirect stubs code has quadratic table insertion time (was: Thread pool issue in emscripten web application)

Konstantinos

Reporter

Description

•

3 years ago

Attached file MozillaBugReport.txt — Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0

Steps to reproduce:

We ran an emscripten application that was working with FF 95 on the latest release (96) and it has become unresponsive.

The application utilizes emscriptens emsdk to build a wasm file, which has a thread pool in its C++ code.

More details in the attached file.

Actual results:

The application has been experiencing severely degraded performance and is unusable.
It hangs completely after a while.

Expected results:

The application should have worked as in FF 95.

Konstantinos

Reporter

Comment 1

•

3 years ago

Maybe this emscripten issue is related?
https://github.com/emscripten-core/emscripten/issues/15978

BugBot [:suhaib / :marco/ :calixte]

Comment 2

•

3 years ago

The Bugbug bot thinks this bug should belong to the 'Core::Javascript: WebAssembly' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Javascript: WebAssembly

Product: Firefox → Core

Lars T Hansen [:lth]

Assignee

Comment 3

•

3 years ago

Do you happen to have either an online presence for this application or an archive of it you can share (either here or privately with me)? This is going to be hard to debug otherwise.

And/or, can you test with the latest Firefox Nightly? (See https://www.mozilla.org/en-US/firefox/channel/desktop/ for downloads.) I fixed an oom-during-streaming-compilation issue earlier this week and it would be useful to know if that was the same problem as what you're seeing.

Flags: needinfo?(konstantinos.zagklis)

Lars T Hansen [:lth]

Assignee

Comment 4

•

3 years ago

•

Edited

From an initial profile in a clean local release build, it looks like the insert() method of the indirect stubs table is the problem, and more generally, that initElems takes 100% of the CPU for long stretches, due to insert and to a very, very heavily contended lock (or a very buggy lock). For a large thread pool it could take a long time to initialize everything. Not conclusive, but very suggestive. See https://share.firefox.dev/3Imxe6a.

Lars T Hansen [:lth]

Assignee

Comment 5

•

3 years ago

27:36.37 INFO: Last good revision: d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0
27:36.37 INFO: First bad revision: 2200d1f5c9563332ec8162220ffa392b98a850da
27:36.37 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0&tochange=2200d1f5c9563332ec8162220ffa392b98a850da

The pushlog is exactly the landing of the indirect stubs code.

Assignee: nobody → lhansen

Severity: -- → S2

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Keywords: regression

OS: Unspecified → All

Priority: -- → P1

Regressed by: 1743586, 1639153

Hardware: Unspecified → All

Summary: Thread pool issue in emscripten web application → Indirect stubs code has serious lock contention (was: Thread pool issue in emscripten web application)

Lars T Hansen [:lth]

Assignee

Comment 6

•

3 years ago

(Lock contention is just a symptom; renaming again.)

The problem here is the following. The application has a table containing 47951 functions. Initially we create indirect stubs for these after baseline-compiling the code; this takes 53ms. But then there are additional worker threads. Because the table is in the Module and the Module is shared, these threads also create stubs and insert them in the table (this is logically the right thing to do). But then we tier-2 compile the code, and because code pointers are different for tier-2 code we create new stubs for all the threads and insert those in the table too, for a total of ... stubs. In the end, the table has 815167 entries and creating one set of 48K stubs takes 55 seconds.

This is not the only thing going on; there appears to be more than one module involved too, but the second module is not a problem.

The main problem is the insertion code. It uses a binary search to find the insertion point (yeah!) but then it moves all elements that are in the way one spot to the right (well, ok) and then it does this repeatedly for every element it is trying to insert, and that basically means the code is O(n^2) in the size of the table (boo hiss).

None of this can happen in parallel, so all the threads are completely serialized when they do this, and so getting the page up and running can take a very long time indeed.

The stub creation happens because the table is not private to the module, and maybe it could be avoided, but that's the architecture we have right now and we're actively discussing what to do about that and that discussion will take some time to complete. So in the mean time the most appropriate fix here is to make the insertion code much faster, either by using a better algorithm or by using a better data structure than one linear table per module.

Summary: Indirect stubs code has serious lock contention (was: Thread pool issue in emscripten web application) → Indirect stubs code has quadratic table insertion time (was: Thread pool issue in emscripten web application)

Lars T Hansen [:lth]

Assignee

Updated

•

3 years ago

Flags: needinfo?(konstantinos.zagklis)

Lars T Hansen [:lth]

Assignee

Comment 7

•

3 years ago

Attached file Bug 1750930 - Make the indirect stub table two-level. r?yury — Details

Break the indirect stubs table it into per-tls tables to avoid the
very long insertion times resulting from having one shared table per
module.

Lars T Hansen [:lth]

Assignee

Comment 8

•

3 years ago

The patch fixes the problem and is fairly contained. It could be smarter (but then it would be less contained). It needs to be cleaned up in various ways before landing. We'll want to uplift this to beta for sure.

Konstantinos

Reporter

Comment 9

•

3 years ago

Please post here when the beta is live so we can do a smoke test and verify our application is working as expected

Lars T Hansen [:lth]

Assignee

Updated

•

3 years ago

Has Regression Range: --- → yes

Has STR: --- → yes

status-firefox96: --- → affected

status-firefox97: --- → affected

status-firefox98: --- → affected

status-firefox-esr91: --- → unaffected

status-thunderbird_esr91: --- → unaffected

Phabricator Automation

Updated

•

3 years ago

Attachment #9259921 - Attachment description: WIP: Bug 1750930 - Make the indirect stub table two-level → Bug 1750930 - Make the indirect stub table two-level. r?yury

Lars T Hansen [:lth]

Assignee

Updated

•

3 years ago

Blocks: wasm-large-web-properties

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

status-firefox96: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

tracking-firefox97: --- → +

tracking-firefox98: --- → +

Pulsebot

Comment 10

•

3 years ago

Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/edd95f545caf Make the indirect stub table two-level. r=yury

Andreea Pavel [:apavel]

Comment 11

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/edd95f545caf

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox98: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 98 Branch

Lars T Hansen [:lth]

Assignee

Comment 12

•

3 years ago

Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury

Beta/Release Uplift Approval Request

User impact if declined: Some wasm applications will have really nasty hang-like behavior on startup.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Fairly contained reengineering of a data structure to avoid quadratic behavior when table grows large.
String changes made/needed:

Attachment #9259921 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 13

•

3 years ago

Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury

Approved for 97.0b7.

Attachment #9259921 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 14

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/d435109964ec

status-firefox97: affected → fixed

Lars T Hansen [:lth]

Assignee

Comment 15

•

3 years ago

(In reply to Konstantinos from comment #9)

Please post here when the beta is live so we can do a smoke test and verify our application is working as expected

The fix is in the FF98 Nightly builds now. I don't know if it made it into FF97 Beta 7 but it will be in FF97 Beta 8, which is built tomorrow, and should be available to beta users by Wednesday of this week. We're not fixing FF96. FF97 releases on February 8.

Konstantinos

Reporter

Comment 16

•

3 years ago

Reply to Lars T Hansen
Tested on Firefox developer edition 97.0b8 and 98.0a1

Too slow (compare to previous than FF 96 and hromium based browsers)
Application times out in a lot of Network operations
Some functionalities not working at all
Cannot perform VoIP calls like FF 95 and older
No code errors, ONLY time outs

Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Konstantinos

Reporter

Comment 17

•

3 years ago

Tested on Firefox developer edition 97.0b8 and 98.0a1

1.Too slow (compare to previous than FF 96 and chromium based browsers)
2.Application times out in a lot of Network operations
3.Some functionalities not working at all
4.Cannot perform VoIP calls like FF 95 and older
5.No code errors, ONLY time outs

Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 18

•

3 years ago

Hi Konstantinos,

I will re-triage today and we'll see about what to do here. The original report was however about the slowness in establishing the initial connection (quoted below), and generally we try not to handle multiple different problems with the same bug report, so we'll likely want to close this one and file additional bugs for additional problems. I'm leaving it open for the time being.

Recording additional information from submitter, communicated through email:

You can access our application from this url:  login.teamviewer.com/Connect

Use the following credentials
E-Mail: <REDACTED>
Password: <REDACTED>

In the application's main screen, fill in the text box this id: <REDACTED> and press connect when the button appears.

Using FF 95.0.2 the button will appear in ~20sec
Using FF 96.0.1 up to Firefox Nightly 98.0a1 will take more than 5min and the application won't work

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 19

•

3 years ago

The original report's problem with the startup time remains fixed (mostly -- there's a slightly longer lag with the fix than there was in FF95 but this is acceptable to me) but once started the application seems significantly laggier in recent builds than in FF95, I've received a video through other chanels with a side-by-side comparison.

According to the profiler there appears to be tremendous lock contention in js::wasm::Code::lookupIndirectStubRange which is called from Table::getFuncRef which is called from WasmTableObject::getImpl which is called from WasmTableObject::get. That is, the code very heavily reads functions from Table<funcref>. This is bound to be slow regardless, but the lock is a new thing that came in with the indirect stubs.

Profile for Nightly build 2022-01-30: https://share.firefox.dev/3ALszbn

(Unfortunately, FF95 does not appear to have worker profiling, but I'm going to try to obtain a good profile from FF96 before the indirect stubs landed, in the hope that worker profiles are available there.)

The reporter confirms that the application is compiled with -O2, so there should be no artifacts of debug code.

Lars T Hansen [:lth]

Assignee

Comment 20

•

3 years ago

Profile for FF96 just before the indirect stubs landed: https://share.firefox.dev/3IPgi8K.

The application has 29 worker threads in the "WebCOOP+COEP Content" process. Certainly this creates room for some contention, should there be a lock.

The profiles are not completely comparable, but one might start the FF98 profile (previous comment) around 25s and focus only on the WebCOOP+COEP process. The execution time for most threads is dominated by futexwait - they're mostly idle - but for those threads that have long periods of activity initially in the segment starting at 25s, 20%-55% of the time is spent either in lock contention (dominates) or in LookupInSorted. Clearly these threads will tend to serialize each other since the lookups cannot happen in parallel. And the other threads are performing the same operations, which will tend to further exacerbate overhead and cause contention.

It was already known that we would want to avoid the lock in lookupIndirectStubRange, but it was assumed that applications would not hammer on this code in the manner of TeamViewer and that the lock would be affordable.

Lars T Hansen [:lth]

Assignee

Updated

•

3 years ago

Comment 21

•

3 years ago

I'm going to close this and spin off bug 1752870 for the new issue, which is related but different. Will copy comment 19 and comment 20 into that and populate it with STR, and will cc key people.

Status: REOPENED → RESOLVED

Closed: 3 years ago → 3 years ago

Resolution: --- → FIXED

Lars T Hansen [:lth]

Assignee

Updated

•

3 years ago

Regressed by: 1742053
No longer regressed by: 1743586

MozillaBugReport.txt 3 years ago Konstantinos 1.14 KB, text/plain		Details
Bug 1750930 - Make the indirect stub table two-level. r?yury 3 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request	RyanVM : approval-mozilla-beta+	Details \| Review