Bug 1891227 (Closed) - Opened 10 months ago, Closed 8 months ago

40.93 - 18.9% addMarN time_duration / addMarN time_duration (Linux) regression on Wed April 3 2024

Categories

(Core :: Storage: IndexedDB, defect, P3)

defect

Tracking


RESOLVED FIXED
128 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox126 --- fixed
firefox127 --- fixed
firefox128 --- fixed

People

(Reporter: aesanu, Assigned: jstutte)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected a browsertime performance regression from push 58f4bf812f636518abbad47cfd558fb59315d4b5. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

| Ratio | Test | Platform | Options | Absolute values (old vs new) | Performance Profiles |
|-------|------|----------|---------|------------------------------|----------------------|
| 41% | addMarN time_duration | linux1804-64-shippable-qr | fission webrender | 18,524.29 -> 26,105.71 | Before/After |
| 19% | addMarN time_duration | linux1804-64-shippable-qr | fission webrender | 21,685.95 -> 25,783.67 | Before/After |

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs, you can trigger them yourself from the Treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42229

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(jjalkanen)

IIUC this does not happen on Windows; that is, the test is most likely exposing an underlying problem with nsThreadPool on Linux.

The high noise in the Linux test values, both before and after the patch landed, is also notable compared to the flat Windows line, but the upward tick after landing is real, I'd say.

Looking at one of the profiles, it seems that the test piles up some work that is then thrown at the IndexedDB thread pool all at once - but serialized, that is, all the events most likely come through the same connection. The difference before and after the patch is that before, all events hit the same single thread; now we bounce between the two threads we keep alive as idle threads. Though this might sound bad, it is not immediately clear why it would increase the overhead in any measurable way. And in fact on Windows it does not, even though the profile shows the same bouncing between two threads there, too.

Most of the extra time seems to be spent inside pthread_cond_timedwait, so it is largely out of our direct control, apart from signaling new events in time (which always happens at the same moment on Windows and Linux, AFAICS).
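
For illustration, here is a minimal sketch (plain pthreads C++, not the actual nsThreadPool code; the names, counter, and timeout handling are made up) of the kind of timed idle wait the profile is showing. The time attributed to pthread_cond_timedwait is essentially the gap between the dispatching side calling pthread_cond_signal and the parked worker being scheduled again:

```cpp
// Illustrative only; nsThreadPool's real event loop is more involved.
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gHasWork = PTHREAD_COND_INITIALIZER;
static int gPendingEvents = 0;  // hypothetical "events in the queue" counter

// Worker side: returns true if an event became available, false on idle timeout.
static bool WaitForEvent(unsigned aIdleTimeoutMs) {
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += aIdleTimeoutMs / 1000;
  deadline.tv_nsec += (aIdleTimeoutMs % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }

  pthread_mutex_lock(&gLock);
  int rv = 0;
  while (gPendingEvents == 0 && rv != ETIMEDOUT) {
    // The idle thread parks here; the latency between the signal below and
    // returning from this call is what the extra profile time is made of.
    rv = pthread_cond_timedwait(&gHasWork, &gLock, &deadline);
  }
  bool gotEvent = gPendingEvents > 0;
  if (gotEvent) {
    gPendingEvents--;
  }
  pthread_mutex_unlock(&gLock);
  return gotEvent;
}

// Dispatch side: make an event visible and wake one idle thread.
static void SignalEvent() {
  pthread_mutex_lock(&gLock);
  gPendingEvents++;
  pthread_cond_signal(&gHasWork);
  pthread_mutex_unlock(&gLock);
}
```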

IIUC all the Linux perf tests run on dedicated machines with the following hardware spec:

   1 Intel E3-1585Lv5 3.0GHz CPU
   8GB DDR4 2400MHz RAM

which gives us 4 cores / 8 threads. The test creates load on at most three threads, so this should hopefully not cause any overhead from context switching between threads on the same CPU core. Interestingly, repeating the same test locally on a 12-core / 24-thread Linux machine confirms roughly the same difference between keeping just 1 idle thread and keeping 2 (the current value) for the thread pool. Performance is even worse if we keep 4 idle threads. The worst case, however, is not keeping any idle thread, as this makes us spin threads up and down all the time.

AMD Ryzen 9 3900XT 12-Core Processor

mach raptor --browsertime -t addMarN --post-startup-delay=1000

---
Old dedicated thread

08:18:10     INFO -  raptor-browsertime Info: CPU / Power usage: 11240
08:18:10     INFO -  raptor-browsertime Info: Time duration was  8718.099999999999

08:19:08     INFO -  raptor-browsertime Info: CPU / Power usage: 10617
08:19:09     INFO -  raptor-browsertime Info: Time duration was  8369.46

---
Idle: 2 (current default)

20:08:54     INFO -  raptor-browsertime Info: CPU / Power usage: 11867
20:08:54     INFO -  raptor-browsertime Info: Time duration was  9241.9

20:09:36     INFO -  raptor-browsertime Info: CPU / Power usage: 12431
20:09:36     INFO -  raptor-browsertime Info: Time duration was  9557.88

---
Idle: 1

19:56:29     INFO -  raptor-browsertime Info: CPU / Power usage: 10282
19:56:29     INFO -  raptor-browsertime Info: Time duration was  7969.180000000001

20:11:51     INFO -  raptor-browsertime Info: CPU / Power usage: 10891
20:11:51     INFO -  raptor-browsertime Info: Time duration was  8427.64

20:12:33     INFO -  raptor-browsertime Info: CPU / Power usage: 10690
20:12:33     INFO -  raptor-browsertime Info: Time duration was  8219.84

---
Idle: 4

20:14:36     INFO -  raptor-browsertime Info: CPU / Power usage: 12811
20:14:36     INFO -  raptor-browsertime Info: Time duration was  9909.62

20:15:28     INFO -  raptor-browsertime Info: CPU / Power usage: 12873
20:15:28     INFO -  raptor-browsertime Info: Time duration was  9985.52

---
Idle: 0

20:19:49     INFO -  raptor-browsertime Info: CPU / Power usage: 20948
20:19:49     INFO -  raptor-browsertime Info: Time duration was  12701.56

So there seems to be something here we might want to investigate about the behavior of nsThreadPool on Linux. Maybe the findings could also shed some light on the noise we see in Linux execution times in general. This would probably deserve its own XPCOM bug.

But I do not think we should refrain from using thread pools here. The underlying problem can hit any thread pool under a similar load, and it was there before. This test is also not very representative of the load we normally expect on this thread pool, and in fact most other tests do not show any noticeable problems. At most, we might want to experiment with different idle settings for this pool, but that might be counterproductive for other scenarios (and/or on other OSes).
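
If we do experiment, the knob would presumably be the pool's idle thread settings. A rough sketch of what that could look like - hedged: the "@mozilla.org/thread-pool;1" contract ID and the setter names are as I remember them from nsIThreadPool, and all numeric values and names here are placeholders, not the actual IndexedDB pool code:

```cpp
#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"
#include "nsIThreadPool.h"
#include "nsString.h"

// Sketch only: creates a thread pool and tunes its idle behavior for an
// experiment. The values below are arbitrary examples, not recommendations.
static already_AddRefed<nsIThreadPool> CreateExperimentalPool() {
  nsresult rv;
  nsCOMPtr<nsIThreadPool> pool =
      do_CreateInstance("@mozilla.org/thread-pool;1", &rv);
  if (NS_FAILED(rv)) {
    return nullptr;
  }
  pool->SetName("IDBExperimentPool"_ns);  // placeholder name
  pool->SetThreadLimit(20);               // max threads the pool may spin up
  pool->SetIdleThreadLimit(1);            // keep at most one idle thread alive
  pool->SetIdleThreadTimeout(30000);      // ms before an idle thread is retired
  return pool.forget();
}
```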

"addMab1 time_duration opt fission webrender" is about 10% better with own thread management (15 test runs).
"addMarN time_duration opt fission webrender" is almost 25% better with own thread management (20 test runs).

(In reply to Jan Varga [:janv] from comment #3)

"addMab1 time_duration opt fission webrender" is about 10% better with own thread management (15 test runs).
"addMarN time_duration opt fission webrender" is almost 25% better with own thread management (20 test runs).

Weird that this is not visible in the graph? And also not in the addMab1 graphs.

As discussed offline, the old custom thread pool handled all runnables for a connection on one single thread and kept the threads idle for some time before joining them. These two features might contribute to the better performance in this particular test.
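
To make those two properties concrete, here is a small self-contained sketch in plain C++ (written for this comment, not the old Gecko code; all names are invented): every runnable for a connection runs in FIFO order on one dedicated thread, and that thread stays parked for an idle timeout before a real implementation would retire and join it.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class DedicatedConnectionThread {
 public:
  explicit DedicatedConnectionThread(std::chrono::milliseconds aIdleTimeout)
      : mIdleTimeout(aIdleTimeout),
        mThread(&DedicatedConnectionThread::Run, this) {}

  ~DedicatedConnectionThread() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mShutdown = true;
    }
    mCondVar.notify_one();
    mThread.join();  // remaining work is drained before the thread is joined
  }

  // Everything dispatched here runs in FIFO order on the same OS thread, so a
  // connection never bounces between pool threads.
  void Dispatch(std::function<void()> aRunnable) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mQueue.push(std::move(aRunnable));
    }
    mCondVar.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mMutex);
    while (!(mShutdown && mQueue.empty())) {
      if (mQueue.empty()) {
        // Park for the idle timeout before re-checking; this is where a real
        // pool would decide whether to retire (join) the idle thread.
        mCondVar.wait_for(lock, mIdleTimeout);
        continue;
      }
      std::function<void()> runnable = std::move(mQueue.front());
      mQueue.pop();
      lock.unlock();
      runnable();  // run outside the lock
      lock.lock();
    }
  }

  const std::chrono::milliseconds mIdleTimeout;
  std::mutex mMutex;
  std::condition_variable mCondVar;
  std::queue<std::function<void()>> mQueue;
  bool mShutdown = false;
  std::thread mThread;  // declared last so Run() sees initialized members
};
```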

Flags: needinfo?(jjalkanen)
Depends on: 1891664
See Also: → 1891696
Severity: -- → S3
Priority: -- → P3

Setting Fx126 to Fixed, the regressing patch was backed out of beta
https://bugzilla.mozilla.org/show_bug.cgi?id=1791767#c58

It seems bug 1791767 fixed bug 1878149 (see bug 1893807).

https://hg.mozilla.org/releases/mozilla-beta/rev/1d48046b04e7
Fx127 fixed: the regressing patch was backed out of beta

See Also: → 1897072
See Also: 1891696

After bug 1891664 landed, I can see a slight downward trend, though the noise in these test results is very high. We probably need some more data points to really see it. But based on local testing, I do not think we should wait for bug 1897072 or other improvements; we should just let bug 1791767 ride now.

(In reply to Jens Stutte [:jstutte] from comment #9)

> After bug 1891664 landed, I can see a slight downward trend, though the noise in these test results is very high. We probably need some more data points to really see it. But based on local testing, I do not think we should wait for bug 1897072 or other improvements; we should just let bug 1791767 ride now.

If I read the data correctly, before bug 1891664 the results fell mostly in the corridor between 20-30k, and after it landed mostly between 15-25k. And before bug 1791767 landed, we were mostly in the same corridor (15-20k), even if perhaps a small percentage better than now.

I thus think we recovered most, if not all, of the regression. I assume we did not get a perf alert because the first data points after landing were close to the upper limit of that corridor (probably for random reasons).

Can we close this out?

Flags: needinfo?(jstutte)

I think so, Andra?

Flags: needinfo?(jstutte) → needinfo?(aesanu)

I see this untriaged alert, which refers to a slightly later push but contains exactly the expected result. I do not know exactly how those alerts work, but I am sufficiently confident to close this out.

Fixed by bug 1891664.

Status: NEW → RESOLVED
Closed: 8 months ago
Flags: needinfo?(aesanu)
Resolution: --- → FIXED
Assignee: nobody → jstutte
Target Milestone: --- → 128 Branch