Closed Bug 171220 Opened 20 years ago Closed 20 years ago

Profile switching network teardown race condition with NSS shutdown


(Core Graveyard :: Profile: BackEnd, defect, P1)



(Not tracked)



(Reporter: KaiE, Assigned: KaiE)



(Keywords: topembed+, Whiteboard: [adt2] [ETA 10/15])


(1 file)

cc'ing everybody I know is involved in profile switching / turbo mode

We currently on switching profiles in embedding apps after doing some crypto
operations, becasue of a race condition between network shutdown and NSS shutdown.

In order to switch profiles, it is necessary to prepare a clean environment, to
release all usage of NSS code, to allow NSS to shut itself down and re-init cleanly.
In the past we prepared this by switching Necko of offline state, and profile
manager sends out events to prepare that clean environment.
Three events are used:
- profile-change-net-teardown
- profile-change-teardown
- profile-before-change
I believe the plan was to bring down all network activity during event

In file nsHttpHandler.cpp there is code to drop all http connections when events
are observed.
However, the code listens for "profile-before-change" (string at two places in
the file).
But the PSM/NSS code that assumes network is already down is also being called
when the same event is received!
In my test case, the NSS shutdown code was actually being reached before http
dropped the connections.

In order to fix the race condition, I propose to change nsHttpHandler to drop
connections on even "profile-change-net-teardown".

(When I talked to Darin yesterday, he reminded that a clean solution would be if
NSS itself took care, that it ignored any open handles after it has shut down
itself. But I'm unable to provide this approach at the current time. The NSS
team is still looking into a "hammer" shutdown, but still, I think we should
make sure that we avoid anything that could eventually mean stale resources
being still accessed/referenced.)
Attached patch Patch v1Splinter Review
The hammer solution is fraugh with problems:

1) It would probably result in a later crash. This crash, the result of an app
keeping an NSS ref around that has become invalid because of the NSS shutdown.
This crash would still happen in NSS (because the data would be passed to an NSS
function), and the support calls on NSS would be even higher.
 NSS cannot "ignore any open handle". The typical scenario would be: app keeps a
certificate around. NSS hammer shuts down. NSS starts up. app call
NSS_func(leaked_cert); How does NSS_func() know that leaked_cert is not valid.
 There are methods in which applications keep NSS data as Aliases (i.e., the
member can be set to null by NSS, because the data is an AliasTarget) and apps
always test for null before using the data. However, Alias and AliasTargets have
to be used carefully in multithreaded environment (i.e., alias usage and
aliastarget cleanup of it's alias must occur on the same thread or else it must
be protected), and the application changes would be very large.

2) It's just telling application developers: don't worry about making sure that
you release all NSS data when you need to.

If the requirement is that NSS be shutdown and restarted, and no crash can
happen in the future because of leaked reference to NSS data, then the hammer
solution must be ruled out.

i was only talking about the layered PRFileDesc... it seems like NSS could
protect itself at shutdown by going through and marking all of its open file
descriptors as invalid/broken (think: EPIPE).  having this in place would make
NSS less susceptible to coding errors in the client.  it doesn't mean that
clients shouldn't shutdown in the proper order; it just makes the system more
robust :-/
you assume NSS has access to all the open file descriptors. NSS is just a layer
on the stack, and has no knowledge about what file descriptors happenned to be
openned. All the state is stored in the file descriptor itself. There is no list
of open file descriptors in NSS -- that's up to the application to manage.

ic, well, wishful thinking i guess :-/
Keywords: topembed
Can you please review the patch?
Comment on attachment 100892 [details] [diff] [review]
Patch v1

Attachment #100892 - Flags: superreview+
bryner: you might be interested in this patch as well.
I'm not sure why we need to observe "profile-change-net-teardown" in the http
handler as well.

It gets used here
After the IOService has gone offline due to this notification (which I thought
happened synchronously), what's still hanging around?
yeah, socket transport service shutdown does happen synchronously (i.e., the UI
thread joins with the socket thread).
I can't confirm to your assumption.

The event observation code in HttpHandler is not a no-op.

When the event is observed in HttpHandler, it finds alive objects and cleans
them up.

If I remember correctly, it was happening in the call to:

The cleanup code results in some http objects getting freed, and in case of
https connections, SSL cleanup functions are reached.
Note: The problem was seen on recent MOZILLA_1_0_BRANCH code.
Blocks: 154896
Comment on attachment 100892 [details] [diff] [review]
Patch v1

Attachment #100892 - Flags: review+
Comment on attachment 100892 [details] [diff] [review]
Patch v1

a=asa for checkin to 1.2beta (on behalf of drivers)
Attachment #100892 - Flags: approval+
Fixed on trunk.

Darin cc'ed me on bug 174131, and if I understand correctly, that other bug
might be the real cause of the problem we are trying to fix here. While in
theory, this patch might no longer be required, I checked it in anyway, because
we all agreed it is safe - and it fixes the race condition.
Closed: 20 years ago
Resolution: --- → FIXED
yeah, i agree that this patch is still the right thing.  thanks kai!
QA Contact: ktrina → gbush for 1.0 branch.  Change mozilla1.0.2+ to fixed1.0.2 when
checked in.
Marking as topembed+, as this is needed by a major embedding customer.
Keywords: topembedtopembed+
Priority: -- → P1
Whiteboard: [adt2] [ETA 10/15]
Target Milestone: --- → mozilla1.0.2
Plussing for adt.  Need checkin by cob 10/16. Please make sure to get all
required  approvals (drivers) before checkin if this is not already done.
Keywords: adt1.0.2adt1.0.2+
Blocks: blackbird
Checked in to 1.0 branch.
verifying code fix for branch
marking verified1.0.2
verified code fix on trunk
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.