Closed Bug 1750452 Opened 2 years ago Closed 2 years ago

Thunderbird intermittently deadlocks on Linux in paf_prepare after clicking a link.

Tracking

()

Status:

RESOLVED FIXED

Milestone:

98 Branch

Tracking Flags:

Tracking

Status

firefox98

---

fixed

People

(Reporter: emilio, Assigned: mozbugz)

References

Details

Attachments

(3 files)

Stack of the hanging process. 2 years ago Emilio Cobos Álvarez (:emilio) 15.41 KB, text/plain		Details
Stacks on the parent process 2 years ago Emilio Cobos Álvarez (:emilio) 865.88 KB, text/plain		Details
Bug 1750452 - Replace mutex-locked sampling pause during exec() with atomic variable - r?mstange!,emilio 2 years ago Gerald Squelart (he/him) (not at Mozilla since 2022-09-15) 48 bytes, text/x-phabricator-request		Details \| Review

Emilio Cobos Álvarez (:emilio)

Reporter

Description

•

2 years ago

Attached file Stack of the hanging process. — Details

I've seen this a couple times and today I finally did some digging.

We're forking a process, and we deadlock on fork (see stack). The parent process waitpid()s the child and effectively waits infinitely.

Emilio Cobos Álvarez (:emilio)

Reporter

Comment 1

•

2 years ago

Attached file Stacks on the parent process — Details

Probably the parent process was holding the lock when the fork happens? I don't see how this isn't a problem with Firefox. Probably we hit it less because we fork less?

Emilio Cobos Álvarez (:emilio)

Reporter

Comment 2

•

2 years ago

Gerald, is comment 1 plausible?

Flags: needinfo?(gsquelart)

Emilio Cobos Álvarez (:emilio)

Reporter

Updated

•

2 years ago

Comment 3

•

2 years ago

According to the comment above paf_prepare():

in the child ... we always immediately exec() after fork(), and exec() clobbers all process state.

To verify:
fork keeps the state of mutexes.
But then execve does not keep them.

But then again, reading about pthread_atfork, that talks about mutex issues in multi-threaded libraries, which continue to run between fork and exec. The advice there is to lock all mutexes in paf_prepare(), and release them in paf_parent() and paf_child()!

Interestingly, our paf_prepare() comment also says:

At one point we did have a paf_child() function, but it caused problems related to locking gPSMutex. See bug 1348374.

There is a lot of discussion in that bug 1348374!

Back to the trace you got:
What looks strange to me, is that the child is in paf_prepare(), which presumably should only run in the parent before fork?
Also as you said, why don't we notice this in Firefox? Fission would have made it worse.

It'd be great if you could capture an rr or Pernosco trace, to see if we can find the source of locking/deadlocking.
And I'll see if I can give it a try on the side...

In any case, it looks dangerous to use mutexes in these pthread_atfork() handlers, if only because these handlers could be called from anywhere, where the mutex is indeed already locked.
So unless we can really understand what's happening and how this could be done safely, a potential alternative would be to replace the mutex lock with an atomic counter of forks, something like this:

paf_prepare() would increment it atomically.
paf_parent() would decrement it atomically.
The sampler thread would check it, and skip its sampling (with the interrupting signals we want to suppress) if the counter is not zero. (*)
paf_child() (to be re-added) would set it to zero. This assumes there could not be forks happening at the same time.

(*) There is a potential pseudo-race, where a sampling could have just started and gone past the atomic check, and a fork could then happen, which will be interrupted by sampling signals. But IIUC (from original bug 837390), the main issue was the frequent signalling by repeated samplings every 1ms, which could delay fork by seconds or even minutes(!). So after this initial unwanted sampling, the next samplings will be blocked by our new atomic flag, and fork will then be able to complete undisturbed.

What do you think?

Depends on: 1348374

Flags: needinfo?(gsquelart)

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P2

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Comment 4

•

2 years ago

Here's a patch with the atomic variable: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&collapsedPushes=962166&revision=0aa7c540e36279dc01d7ad385604ee03a8fbfa93

I tried it in Firefox (not Thunderbird), and verified locally that the sampler loop was skipped during forks. But I didn't notice significant fork delays/deadlocks beforehand, so I don't know if it's actually better.
Emilio, when possible could you please try it?

Emilio Cobos Álvarez (:emilio)

Reporter

Comment 5

•

2 years ago

Yeah, using an atomic seems nicer indeed. I can try to give it a spin in a local TB build for a bit, but I hit this very infrequently so...

The good part is that we can probably remove the ifdefs around the atomic variable now? If nobody increments it it'll do nothing, and a regular relaxed load is probably not worth ifdef'ing for performance.

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Comment 6

•

2 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Yeah, using an atomic seems nicer indeed. I can try to give it a spin in a local TB build for a bit, but I hit this very infrequently so...

Oh I see.
I managed to build Thunderbird, and tried to open one of the Home tab links (support, etc.) around 100 times while profiling, I could see forks happening at every click, but got no deadlocks before & after.
Please let me know if there was some sequence of steps that seemed to trigger it.

The good part is that we can probably remove the ifdefs around the atomic variable now? If nobody increments it it'll do nothing, and a regular relaxed load is probably not worth ifdef'ing for performance.

Are you talking of the new gaForkCount in my patch?
I think the #ifdefs are cheap, and remove a bit of unused code and data on Android and non-Linux platforms.
But I suppose it's not that critical, the test in the sampler loop is only called once per millisecond at wost, a tiny operation compared to everything else the sampler does. I'll consider it if we go ahead with this patch, thanks.

Relatedly, pthread_atfork() is supposedly not available on Android, but the comment saying this was written in 2013, I should investigate if it's still true. I see a few other calls in Firefox and at a glance they don't all seem to be excluded on Android.

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Comment 7

•

2 years ago

Uploading patch soon.
I've added markers, explaining that sampling is suspended.
Try: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&collapsedPushes=962166&revision=215ff131d104181b71729f2bbfdb69d1cbc354fb
Example profile from Thunderbird, zoomed on a fork marker -- Notice that sampling (blue rectangles in Parent Process rows) doesn't happen: https://share.firefox.dev/3Ki9a6l

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Comment 8

•

2 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

Probably the parent process was holding the lock when the fork happens? I don't see how this isn't a problem with Firefox. Probably we hit it less because we fork less?

One thing I discovered while testing this patch in Firefox, with the new marker: Early on, Firefox starts a "forkserver" process, which is responsible for starting other processes, and which itself doesn't get profiled. So the (profiled) parent process effectively only ever calls fork() once. That's probably why this issue is not noticeable there.

I guess Thunderbird doesn't do that, so all fork()s are done directly from the parent process, which makes it much more likely to hit a deadlock during profiling.

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Assignee

Comment 9

•

2 years ago

Attached file Bug 1750452 - Replace mutex-locked sampling pause during exec() with atomic variable - r?mstange!,emilio — Details

Note: The atomic variable is named gaSkipSampling, not mentioning forks because it could later be used in other situations, so it's best to describe it by the effect it has.

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → gsquelart

Status: NEW → ASSIGNED

Pulsebot

Comment 10

•

2 years ago

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/470e249abdc6
Replace mutex-locked sampling pause during exec() with atomic variable - r=emilio,mstange

Cristina Cozmuta (:CrissCozmuta)

Comment 11

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/470e249abdc6

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox98: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 98 Branch

Mathew Hodson

Updated

•

2 years ago

Bugzilla

Quick Search

Thunderbird intermittently deadlocks on Linux in paf_prepare after clicking a link.

Categories

(Core :: Gecko Profiler, defect, P2)

Tracking

()

People

(Reporter: emilio, Assigned: mozbugz)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Comment 1

Comment 2

Updated

Comment 3

Updated

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Updated

Attachment

General

Description

File Name

Content Type