Closed Bug 1658847 Opened 4 years ago Closed 4 years ago

hypesquad

Tracking

()

Status:

RESOLVED FIXED

Milestone:

81 Branch

Tracking Flags:

Tracking

Status

firefox-esr68

---

unaffected

firefox-esr78

---

unaffected

firefox79

---

unaffected

firefox80

---

wontfix

firefox81

---

fixed

People

(Reporter: dholbert, Assigned: aosmond)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(2 files)

mozregression 4 years ago Gerald Squelart (he/him) (not at Mozilla since 2022-09-15) 22.17 KB, text/plain		Details
Bug 1658847 - Put profiler thread to sleep when allocating memory for shmems. 4 years ago Andrew Osmond [:aosmond] (he/him) 47 bytes, text/x-phabricator-request		Details \| Review

Daniel Holbert [:dholbert]

Reporter

Description

•

4 years ago

STR:

Starting with a fresh profile, visit https://profiler.firefox.com/ and click the button to enable the profiler menu button.
Visit https://discord.com/hypesquad
Ctrl+Shift+1 to start profiling

ACTUAL RESULTS:
The animated guy with a backpack freezes after you start profiling. It's kinda janky before that, but it just freezes entirely when the profiler starts.

EXPECTED RESULTS:
No such freezing.

This is blocking investigation of why the rendering is janky. In bug 1658282 comment 6, mstange suggests it's because the profiler causes a deadlock during shmem allocation.

This bug here covers that profiler issue.

Daniel Holbert [:dholbert]

Reporter

Comment 1

•

4 years ago

Here are two profiles that I tried to capture & ran into this issue:
https://share.firefox.dev/2DZyFeS
https://share.firefox.dev/3aiSdXQ

Daniel Holbert [:dholbert]

Reporter

Comment 2

•

4 years ago

Transferring ni=gerald from bug 1658282.

Flags: needinfo?(gsquelart)

Daniel Holbert [:dholbert]

Reporter

Comment 3

•

4 years ago

This is a regression -- I can profile the site without interrupting its animation (i.e. no deadlock) in Firefox 79 release.

Keywords: regression

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 4

•

4 years ago

Thank you Daniel.

Some early unprocessed thoughts:
The "nativeallocations" feature is not used, that would have been the main culprit.
"screenshots" is also not used, that could have caused hiccups.
The profiler sampler overhead is at 20%, that's fairly high!

Anyway I can reproduce it on Linux (the animation doesn't freeze for me, but gets jankier), and I can see a lot of time spend in posix_fallocate, I'll investigate further...

Assignee: nobody → gsquelart

Severity: -- → S3

Flags: needinfo?(gsquelart)

Priority: -- → P1

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 5

•

4 years ago

Attached file mozregression — Details

mozregression points at bug 1582954, which started using posix_fallocate, and that's the function that gets stuck when profiling.

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 6

•

4 years ago

Andrew, you introduced posix_fallocate in bug 1582954, and it's causing significant pauses when profiling. I'm not sure if it's only affecting the Gecko Profiler, or if it's really causing pauses by itself (in which case, it would also be the "regressed-by" for bug 1658282).

Could you please have a look?

Flags: needinfo?(aosmond)

Regressed by: 1582954

BMO Automation

Updated

•

4 years ago

Has Regression Range: --- → yes

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 7

•

4 years ago

(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)

(it would also be the "regressed-by" for bug 1658282)

Forget about that part, bug 1658282 is a graphics issue pre-dating your patch.

BugBot [:suhaib / :marco/ :calixte]

Comment 8

•

4 years ago

Set release status flags based on info from the regressing bug 1582954

status-firefox79: --- → unaffected

status-firefox80: --- → affected

status-firefox81: --- → affected

status-firefox-esr68: --- → unaffected

status-firefox-esr78: --- → unaffected

Julien Cristau [:jcristau]

Updated

•

4 years ago

status-firefox80: affected → fix-optional

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 9

•

4 years ago

I can reproduce with the STR. Investigating.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 10

•

4 years ago

It doesn't happen if I profile, but disable profiling the main thread. While the profiles suggest it is blocked, it actually appears to be getting EINTR many times in succession, and we loop retrying.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 11

•

4 years ago

Reducing the sample time also avoids the problem.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 12

•

4 years ago

Reviewing the posix_fallocate implementation:

https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/posix/posix_fallocate.c;hb=d614a7539657941a9201c236b2f15afac18e1213

Either the read or write hits EINTR, and then we need to start all over again. This is why larger allocations are more vulnerable. If we were handling EINTR directly inside this loop, I imagine we would avoid the problem, as we would be making incremental progress.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 13

•

4 years ago

Just for the record, given the profiler registers a frequent signal, SIGPROF,

https://searchfox.org/mozilla-central/rev/fa7f47027917a186fb2052dee104cd06c21dd76f/mozglue/baseprofiler/core/platform-linux-android.cpp#299

it is unlikely this impacts users outside of the profiler.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 14

•

4 years ago

kvark linked me to:

https://github.com/pmem/issues/issues/168
https://github.com/pmem/pmdk/pull/3580/files

which tries incrementally smaller chunks to work around this.

The only alternative would be to implement posix_fallocate ourselves.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 15

•

4 years ago

We already have mozilla::fallocate which implements its own fallback if we don't have posix_fallocate:

https://searchfox.org/mozilla-central/rev/fa7f47027917a186fb2052dee104cd06c21dd76f/xpcom/glue/FileUtils.cpp#80

Perhaps the best course of action is to improve this implementation to support EINTR and use our fallback path if we get EINTR-ed...

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 16

•

4 years ago

Thank you for the analysis.

So if I understand correctly, the profiler sampler's SIGPROF is interrupting posix_fallocate? And that's why it happens more often when allocating a lot and/or sampling more frequently, right?

So one alternate solution (apart from modifying posix_fallocate itself) would be to stop the profiler from interrupting the thread in this case.
I think we could use (abuse!) AUTO_PROFILER_THREAD_SLEEP and AUTO_PROFILER_THREAD_WAKE around the call to posix_fallocate:

If MOZ_GECKO_PROFILER is not #defined, these macros are empty.
Otherwise, If the profiler is not running, they will only do one TLS read.
Otherwise (i.e., the profiler is running), they will do one TLS read and one atomic write; after AUTO_PROFILER_THREAD_SLEEP the next sampling loop will still take one sample (thereby interrupting posix_fallocate), but the following loops will not sample that thread anymore, allowing posix_fallocate to do its job to completion; AUTO_PROFILER_THREAD_WAKE will then allow the sampler to resume sampling.

What do you think?

Markus Stange [:mstange]

Comment 17

•

4 years ago

I like this suggestion. You could even argue that it would be reasonable to put AUTO_PROFILER_THREAD_SLEEP there even if it didn't have these beneficial effects, because no interesting CPU work happens inside posix_fallocate.

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 18

•

4 years ago

•

Edited

Great!

Andrew, since you're able to reproduce the issue could you please try that? You just need to add AUTO_PROFILER_THREAD_SLEEP before the posix_fallocate call, and AUTO_PROFILER_THREAD_WAKE after. [edit: See comment 19, no need for _WAKE, but you'll need to enclose _SLEEP and fallocate in a scope]
If that works, ship it! 😉

And I see that we have more posix_fallocate calls in our code: https://searchfox.org/mozilla-central/search?q=posix_fallocate
So as a follow-up (in this bug or another one) we should consider doing the same in these other places, or combine them all into a profiler-friendly wrapper...

Markus Stange [:mstange]

Comment 19

•

4 years ago

(In reply to Gerald Squelart [:gerald] (he/him) from comment #18)

You just need to add AUTO_PROFILER_THREAD_SLEEP before the posix_fallocate call, and AUTO_PROFILER_THREAD_WAKE after.

It's an RAII class (it starts with "AUTO_"), so just AUTO_PROFILER_THREAD_SLEEP should be fine, no wake after. You only need the wake thing if you want to temporarily "unsleep" from the inside.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 20

•

4 years ago

Attached file Bug 1658847 - Put profiler thread to sleep when allocating memory for shmems. — Details

posix_fallocate iterates over each page/block in a shmem to ensure the
OS allocates memory to back it. Large shmems will cause many read/write
calls to be made, and when profiling, it is very likely a SIGPROF signal
will interrupt us at sufficiently high sampling rates. Most attempts at
retrying will fail for the same reason, and this can cause the threads
to block for an indeterminate period of time.

To work around this we put the profiler thread to sleep during the
posix_fallocate call. This will avoid us getting repeatably interrupted
until it can be completed.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 21

•

4 years ago

I verified this works. Thanks for the suggestion :).

Flags: needinfo?(aosmond)

Pulsebot

Comment 22

•

4 years ago

Pushed by aosmond@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4a008bd49ea4 Put profiler thread to sleep when allocating memory for shmems. r=mstange

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Updated

•

4 years ago

Blocks: 1659880

Narcis Beleuzu [:NarcisB]

Comment 23

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4a008bd49ea4

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox81: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 81 Branch

Cosmin Sabou [:CosminS]

Updated

•

4 years ago

Regressions: 1659965

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

4 years ago

Assignee: gsquelart → aosmond

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox80: fix-optional → wontfix

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Updated

•

4 years ago

Blocks: 1664038

You need to log in before you can comment on or make changes to this bug.