Closed Bug 1750452 Opened 2 years ago Closed 2 years ago

Thunderbird intermittently deadlocks on Linux in paf_prepare after clicking a link.

Categories

(Core :: Gecko Profiler, defect, P2)

defect

Tracking

()

RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: emilio, Assigned: mozbugz)

References

Details

Attachments

(3 files)

I've seen this a couple times and today I finally did some digging.

We're forking a process, and we deadlock on fork (see stack). The parent process waitpid()s the child and effectively waits infinitely.

Probably the parent process was holding the lock when the fork happens? I don't see how this isn't a problem with Firefox. Probably we hit it less because we fork less?

Gerald, is comment 1 plausible?

Flags: needinfo?(gsquelart)
See Also: → 1348374

According to the comment above paf_prepare():

in the child ... we always immediately exec() after fork(), and exec() clobbers all process state.

To verify:
fork keeps the state of mutexes.
But then execve does not keep them.

But then again, reading about pthread_atfork, that talks about mutex issues in multi-threaded libraries, which continue to run between fork and exec. The advice there is to lock all mutexes in paf_prepare(), and release them in paf_parent() and paf_child()!

Interestingly, our paf_prepare() comment also says:

At one point we did have a paf_child() function, but it caused problems related to locking gPSMutex. See bug 1348374.

There is a lot of discussion in that bug 1348374!

Back to the trace you got:
What looks strange to me, is that the child is in paf_prepare(), which presumably should only run in the parent before fork?
Also as you said, why don't we notice this in Firefox? Fission would have made it worse.

It'd be great if you could capture an rr or Pernosco trace, to see if we can find the source of locking/deadlocking.
And I'll see if I can give it a try on the side...

In any case, it looks dangerous to use mutexes in these pthread_atfork() handlers, if only because these handlers could be called from anywhere, where the mutex is indeed already locked.
So unless we can really understand what's happening and how this could be done safely, a potential alternative would be to replace the mutex lock with an atomic counter of forks, something like this:

  • paf_prepare() would increment it atomically.
  • paf_parent() would decrement it atomically.
  • The sampler thread would check it, and skip its sampling (with the interrupting signals we want to suppress) if the counter is not zero. (*)
  • paf_child() (to be re-added) would set it to zero. This assumes there could not be forks happening at the same time.

(*) There is a potential pseudo-race, where a sampling could have just started and gone past the atomic check, and a fork could then happen, which will be interrupted by sampling signals. But IIUC (from original bug 837390), the main issue was the frequent signalling by repeated samplings every 1ms, which could delay fork by seconds or even minutes(!). So after this initial unwanted sampling, the next samplings will be blocked by our new atomic flag, and fork will then be able to complete undisturbed.

What do you think?

Depends on: 1348374
Flags: needinfo?(gsquelart)
Severity: -- → S3
Priority: -- → P2

Here's a patch with the atomic variable: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&collapsedPushes=962166&revision=0aa7c540e36279dc01d7ad385604ee03a8fbfa93

I tried it in Firefox (not Thunderbird), and verified locally that the sampler loop was skipped during forks. But I didn't notice significant fork delays/deadlocks beforehand, so I don't know if it's actually better.
Emilio, when possible could you please try it?

Yeah, using an atomic seems nicer indeed. I can try to give it a spin in a local TB build for a bit, but I hit this very infrequently so...

The good part is that we can probably remove the ifdefs around the atomic variable now? If nobody increments it it'll do nothing, and a regular relaxed load is probably not worth ifdef'ing for performance.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Yeah, using an atomic seems nicer indeed. I can try to give it a spin in a local TB build for a bit, but I hit this very infrequently so...

Oh I see.
I managed to build Thunderbird, and tried to open one of the Home tab links (support, etc.) around 100 times while profiling, I could see forks happening at every click, but got no deadlocks before & after.
Please let me know if there was some sequence of steps that seemed to trigger it.

The good part is that we can probably remove the ifdefs around the atomic variable now? If nobody increments it it'll do nothing, and a regular relaxed load is probably not worth ifdef'ing for performance.

Are you talking of the new gaForkCount in my patch?
I think the #ifdefs are cheap, and remove a bit of unused code and data on Android and non-Linux platforms.
But I suppose it's not that critical, the test in the sampler loop is only called once per millisecond at wost, a tiny operation compared to everything else the sampler does. I'll consider it if we go ahead with this patch, thanks.

Relatedly, pthread_atfork() is supposedly not available on Android, but the comment saying this was written in 2013, I should investigate if it's still true. I see a few other calls in Firefox and at a glance they don't all seem to be excluded on Android.

Uploading patch soon.
I've added markers, explaining that sampling is suspended.
Try: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&collapsedPushes=962166&revision=215ff131d104181b71729f2bbfdb69d1cbc354fb
Example profile from Thunderbird, zoomed on a fork marker -- Notice that sampling (blue rectangles in Parent Process rows) doesn't happen: https://share.firefox.dev/3Ki9a6l

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

Probably the parent process was holding the lock when the fork happens? I don't see how this isn't a problem with Firefox. Probably we hit it less because we fork less?

One thing I discovered while testing this patch in Firefox, with the new marker: Early on, Firefox starts a "forkserver" process, which is responsible for starting other processes, and which itself doesn't get profiled. So the (profiled) parent process effectively only ever calls fork() once. That's probably why this issue is not noticeable there.

I guess Thunderbird doesn't do that, so all fork()s are done directly from the parent process, which makes it much more likely to hit a deadlock during profiling.

Note: The atomic variable is named gaSkipSampling, not mentioning forks because it could later be used in other situations, so it's best to describe it by the effect it has.

Assignee: nobody → gsquelart
Status: NEW → ASSIGNED
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/470e249abdc6
Replace mutex-locked sampling pause during exec() with atomic variable - r=emilio,mstange
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
See Also: 1348374
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: