Closed Bug 1617078 Opened 4 years ago Closed 4 years ago

race conditions between ContentParent::NotifyTabDestroying, NotifyTabDestroyed and GetUsedBrowserProcess

Categories

(Core :: DOM: Content Processes, defect, P3)

defect

Tracking

()

RESOLVED WORKSFORME
Fission Milestone M6c

People

(Reporter: tnikkel, Assigned: tnikkel)

References

Details

Attachments

(1 obsolete file)

No description provided.
Blocks: R-fis

The following sequence was observed running the reftest layout/reftests/scrolling/frame-scrolling-attr-1.html with fission enabled:

  1. ContentParent::NotifyTabDestroying Provide's to the PreallocatedProcessManager (via TryToRecycle)
  2. a new document Take's the content parent from the PreallocatedProcessManager
  3. someone else calls TryToRecycle and Provide's to the PreallocatedProcessManager
  4. ContentParent::NotifyTabDestroyed is called, and now TryToRecycle returns false because PreallocatedProcessManager's slot for a process is filled by someone else

and so we shutdown the content parent while someone from step 2 is just getting set up.

To prevent this we don't give the content parent to the PreallocatedProcessManager in NotifyTabDestroying, we only check if we could. We do the actual provide in NotifyTabDestroyed.

One thing that is different with this patch is if NotifyTabDestroying is called and CanRecycle is true but the NotifyTabDestroyed call never comes for some reason. Before this patch the content parent would be in PreallocatedProcessManager, and someone else could try to use it, but it may not be in a good state. After this patch the content parent won't be in any list of available processes to use so it will just sit there I guess? The ForceKillTimer stuff doesn't come into play if we think we can recycle at NotifyTabDestroying time, so maybe this could be improved.

I'm not sure if this is the best approach, please let me know if there is a better way.

Assignee: nobody → tnikkel
Status: NEW → ASSIGNED
Flags: needinfo?(dteller)

What's the expected behavior? We should wait for 2. to be complete before shutting down, is that it?

Flags: needinfo?(dteller) → needinfo?(tnikkel)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #2)

What's the expected behavior? We should wait for 2. to be complete before shutting down, is that it?

I don't think that can work, because then we are shutting down the process of a document that is still being displayed.

I explored a couple of other solutions.

  1. Don't allow Take to return a content parent that has had NotifyTabDestroying and is waiting for an expected NotifyTabDestroyed call on it. Still have it store the content parent but just have a bool that makes it return null during this time. I didn't like this because NotifyTabDestroyed might not get called.
  2. Store some extra state in content parent or PreallocatedProcessManager so that if a content parent got Take()n between NotifyTabDestroying and NotifyTabDestroyed we make sure the TryToRecycle call suceeds in NotifyTabDestroyed. Keeping this new state up to date seemed tricky.

Ultimately I decided on what is in the patch: don't give the content parent to the PreallocatedProcessManager until NotifyTabDestroyed (if NotifyTabDestroyed is not called then I'm guessing the process is probably not in good shape and we don't want to re-use it anyway). This adds a little bit of complication in that we have to handle the case where the value of CanRecycle/TryToRecycle is different between the NotifyTabDestroying and NotifyTabDestroyed calls but handling that seems better than the complications of the above mentioned solutions.

Flags: needinfo?(tnikkel) → needinfo?(dteller)

Tracking for Fission Nightly (M6) unless this bug breaks a real website (then we can track for M5 Dogfooding).

Fission Milestone: --- → M6
Priority: -- → P1

Sorry, I'm still a bit confused. I see that you're offering a solution, but I still don't quite understand what the problem/expected outcome is. I mean, race conditions are a problem, but I assume that there's something more :)

Flags: needinfo?(dteller) → needinfo?(tnikkel)

The problem is that we shut down a process that is being used by a document that is very much alive and visible. And nothing replaces it, it just stays dead. We don't want to do that.

Flags: needinfo?(tnikkel)
Flags: needinfo?(dteller)

Note that we hit this problem while running reftests, it's not some theoretical concern.

Flags: needinfo?(dteller)

Tentatively moving P1 Fission M6 bugs to M6a.

Fission Milestone: M6 → M6a

ni for the question in the review.

Flags: needinfo?(nika)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:tnikkel, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(tnikkel)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #10)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:tnikkel, could you have a look please?
For more information, please visit auto_nag documentation.

The r+ was conditional on making a change that turns out to make the patch not fix the original problem.

Flags: needinfo?(tnikkel)

Timothy, is this still an issue for the reftest layout/reftests/scrolling/frame-scrolling-attr-1.html? I see that this is not marked as skip-if or fail-if for fission and my quick check does not show this test as failing in the most recent runs on central.

Flags: needinfo?(tnikkel)

It was an intermittent failure that I was able to reproduce reliably by making 100 copies of the test. So there is no annotation and probably no failures m-c. I will do another try run with the 100 copies of this test to see if it still happens.

Flags: needinfo?(tnikkel)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c85bf6dde110328fd28c004000343f19ee7e882

Now it seems that it's possible to crash the content process when running 100 copies of that test in a row. Before the problem was that we didn't draw the content of an OOP iframe because of a mixup in choosing which process to assign to the iframe. Not sure if the crash is related or something else entirely (there is no stack in the log), so I can't tell you if the original problem still exists or not because there is a worse problem happening.

I see these in the log:
Maximum number of clients reachedMaximum number of clients reached
Maximum number of clients reachedUnable to init server: Could not connect: Connection refused

It sounds like some kind of X error? Does each run of the test create a window and we're hitting some kind of limit?

Each copy of the reftest containes 13 oop iframes (and 13 more in the reference). There is bug 1635451 which seems like it's related. That said, something must have changed since I wasn't hitting any crash in March.

My guess is that some fission process count limit has been raised or lifted since March and now we hit the X connections limit instead of this issue (which deals with re-using processes so is less likely if the process limit is very large).

See Also: → semi-headless

Comment 14 indicates that the original issue the bug was filed for no longer reproduces, maybe because of recent preallocated process changes landed by jesup in the recent past. This is no longer a M6a issue.

Fission Milestone: M6a → M6c
Priority: P1 → P3
Flags: needinfo?(nika)

Since the original issue is no longer reproducible.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
Attachment #9128019 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: