Closed Bug 1617078 Opened 4 years ago Closed 4 years ago

race conditions between ContentParent::NotifyTabDestroying, NotifyTabDestroyed and GetUsedBrowserProcess

Tracking

()

Status:

RESOLVED WORKSFORME

Project Flags:

Fission Milestone

M6c

People

(Reporter: tnikkel, Assigned: tnikkel)

References

Details

Attachments

(1 obsolete file)

Bug 1617078. Fix race conditions between ContentParent::NotifyTabDestroying/ed and picking a content parent for a new tab. 4 years ago Timothy Nikkel (:tnikkel) 47 bytes, text/x-phabricator-request		Details \| Review

Timothy Nikkel (:tnikkel)

Assignee

Description

•

4 years ago

No description provided.

Timothy Nikkel (:tnikkel)

Assignee

Updated

•

4 years ago

Blocks: R-fis

Timothy Nikkel (:tnikkel)

Assignee

Comment 1

•

4 years ago

Attached file Bug 1617078. Fix race conditions between ContentParent::NotifyTabDestroying/ed and picking a content parent for a new tab. (obsolete) — Details

The following sequence was observed running the reftest layout/reftests/scrolling/frame-scrolling-attr-1.html with fission enabled:

ContentParent::NotifyTabDestroying Provide's to the PreallocatedProcessManager (via TryToRecycle)
a new document Take's the content parent from the PreallocatedProcessManager
someone else calls TryToRecycle and Provide's to the PreallocatedProcessManager
ContentParent::NotifyTabDestroyed is called, and now TryToRecycle returns false because PreallocatedProcessManager's slot for a process is filled by someone else

and so we shutdown the content parent while someone from step 2 is just getting set up.

To prevent this we don't give the content parent to the PreallocatedProcessManager in NotifyTabDestroying, we only check if we could. We do the actual provide in NotifyTabDestroyed.

One thing that is different with this patch is if NotifyTabDestroying is called and CanRecycle is true but the NotifyTabDestroyed call never comes for some reason. Before this patch the content parent would be in PreallocatedProcessManager, and someone else could try to use it, but it may not be in a good state. After this patch the content parent won't be in any list of available processes to use so it will just sit there I guess? The ForceKillTimer stuff doesn't come into play if we think we can recycle at NotifyTabDestroying time, so maybe this could be improved.

I'm not sure if this is the best approach, please let me know if there is a better way.

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → tnikkel

Status: NEW → ASSIGNED

Timothy Nikkel (:tnikkel)

Assignee

Updated

•

4 years ago

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 2

•

4 years ago

What's the expected behavior? We should wait for 2. to be complete before shutting down, is that it?

Flags: needinfo?(dteller) → needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Comment 3

•

4 years ago

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #2)

What's the expected behavior? We should wait for 2. to be complete before shutting down, is that it?

I don't think that can work, because then we are shutting down the process of a document that is still being displayed.

I explored a couple of other solutions.

Don't allow Take to return a content parent that has had NotifyTabDestroying and is waiting for an expected NotifyTabDestroyed call on it. Still have it store the content parent but just have a bool that makes it return null during this time. I didn't like this because NotifyTabDestroyed might not get called.
Store some extra state in content parent or PreallocatedProcessManager so that if a content parent got Take()n between NotifyTabDestroying and NotifyTabDestroyed we make sure the TryToRecycle call suceeds in NotifyTabDestroyed. Keeping this new state up to date seemed tricky.

Ultimately I decided on what is in the patch: don't give the content parent to the PreallocatedProcessManager until NotifyTabDestroyed (if NotifyTabDestroyed is not called then I'm guessing the process is probably not in good shape and we don't want to re-use it anyway). This adds a little bit of complication in that we have to handle the case where the value of CanRecycle/TryToRecycle is different between the NotifyTabDestroying and NotifyTabDestroyed calls but handling that seems better than the complications of the above mentioned solutions.

Flags: needinfo?(tnikkel) → needinfo?(dteller)

Chris Peterson [:cpeterson]

Comment 4

•

4 years ago

Tracking for Fission Nightly (M6) unless this bug breaks a real website (then we can track for M5 Dogfooding).

Fission Milestone: --- → M6

Priority: -- → P1

David Teller [:Yoric] - still alive but not very active

Comment 5

•

4 years ago

Sorry, I'm still a bit confused. I see that you're offering a solution, but I still don't quite understand what the problem/expected outcome is. I mean, race conditions are a problem, but I assume that there's something more :)

Flags: needinfo?(dteller) → needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Comment 6

•

4 years ago

•

Edited

The problem is that we shut down a process that is being used by a document that is very much alive and visible. And nothing replaces it, it just stays dead. We don't want to do that.

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Updated

•

4 years ago

Flags: needinfo?(dteller)

Timothy Nikkel (:tnikkel)

Assignee

Comment 7

•

4 years ago

Note that we hit this problem while running reftests, it's not some theoretical concern.

Timothy Nikkel (:tnikkel)

Assignee

Updated

•

4 years ago

Flags: needinfo?(dteller)

Chris Peterson [:cpeterson]

Comment 8

•

4 years ago

Tentatively moving P1 Fission M6 bugs to M6a.

Fission Milestone: M6 → M6a

Timothy Nikkel (:tnikkel)

Assignee

Comment 9

•

4 years ago

ni for the question in the review.

Flags: needinfo?(nika)

BugBot [:suhaib / :marco/ :calixte]

Comment 10

•

4 years ago

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:tnikkel, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Comment 11

•

4 years ago

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #10)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:tnikkel, could you have a look please?
For more information, please visit auto_nag documentation.

The r+ was conditional on making a change that turns out to make the patch not fix the original problem.

Flags: needinfo?(tnikkel)

Neha Kochar [:neha]

Comment 12

•

4 years ago

Timothy, is this still an issue for the reftest layout/reftests/scrolling/frame-scrolling-attr-1.html? I see that this is not marked as skip-if or fail-if for fission and my quick check does not show this test as failing in the most recent runs on central.

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Comment 13

•

4 years ago

It was an intermittent failure that I was able to reproduce reliably by making 100 copies of the test. So there is no annotation and probably no failures m-c. I will do another try run with the 100 copies of this test to see if it still happens.

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Assignee

Comment 14

•

4 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c85bf6dde110328fd28c004000343f19ee7e882

Now it seems that it's possible to crash the content process when running 100 copies of that test in a row. Before the problem was that we didn't draw the content of an OOP iframe because of a mixup in choosing which process to assign to the iframe. Not sure if the crash is related or something else entirely (there is no stack in the log), so I can't tell you if the original problem still exists or not because there is a worse problem happening.

Andrew McCreight [:mccr8]

Comment 15

•

4 years ago

I see these in the log:
Maximum number of clients reachedMaximum number of clients reached
Maximum number of clients reachedUnable to init server: Could not connect: Connection refused

It sounds like some kind of X error? Does each run of the test create a window and we're hitting some kind of limit?

Timothy Nikkel (:tnikkel)

Assignee

Comment 16

•

4 years ago

Each copy of the reftest containes 13 oop iframes (and 13 more in the reference). There is bug 1635451 which seems like it's related. That said, something must have changed since I wasn't hitting any crash in March.

Timothy Nikkel (:tnikkel)

Assignee

Comment 17

•

4 years ago

•

Edited

My guess is that some fission process count limit has been raised or lifted since March and now we hit the X connections limit instead of this issue (which deals with re-using processes so is less likely if the process limit is very large).

Neha Kochar [:neha]

Updated

•

4 years ago

Comment 18

•

4 years ago

Comment 14 indicates that the original issue the bug was filed for no longer reproduces, maybe because of recent preallocated process changes landed by jesup in the recent past. This is no longer a M6a issue.

Fission Milestone: M6a → M6c

Priority: P1 → P3

Neha Kochar [:neha]

Updated

•

4 years ago

Flags: needinfo?(nika)

Neha Kochar [:neha]

Comment 19

•

4 years ago

Since the original issue is no longer reproducible.

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

Resolution: --- → WORKSFORME

Phabricator Automation

Updated

•

3 years ago

Attachment #9128019 - Attachment is obsolete: true

You need to log in before you can comment on or make changes to this bug.