Closed Bug 171220 Opened 20 years ago Closed 20 years ago
Profile switching network teardown race condition with NSS shutdown
cc'ing everybody I know is involved in profile switching / turbo mode We currently on switching profiles in embedding apps after doing some crypto operations, becasue of a race condition between network shutdown and NSS shutdown. Intro: In order to switch profiles, it is necessary to prepare a clean environment, to release all usage of NSS code, to allow NSS to shut itself down and re-init cleanly. In the past we prepared this by switching Necko of offline state, and profile manager sends out events to prepare that clean environment. Three events are used: - profile-change-net-teardown - profile-change-teardown - profile-before-change I believe the plan was to bring down all network activity during event profile-change-net-teardown. Bug: In file nsHttpHandler.cpp there is code to drop all http connections when events are observed. However, the code listens for "profile-before-change" (string at two places in the file). But the PSM/NSS code that assumes network is already down is also being called when the same event is received! In my test case, the NSS shutdown code was actually being reached before http dropped the connections. Proposal: In order to fix the race condition, I propose to change nsHttpHandler to drop connections on even "profile-change-net-teardown". (When I talked to Darin yesterday, he reminded that a clean solution would be if NSS itself took care, that it ignored any open handles after it has shut down itself. But I'm unable to provide this approach at the current time. The NSS team is still looking into a "hammer" shutdown, but still, I think we should make sure that we avoid anything that could eventually mean stale resources being still accessed/referenced.)
The hammer solution is fraugh with problems: 1) It would probably result in a later crash. This crash, the result of an app keeping an NSS ref around that has become invalid because of the NSS shutdown. This crash would still happen in NSS (because the data would be passed to an NSS function), and the support calls on NSS would be even higher. NSS cannot "ignore any open handle". The typical scenario would be: app keeps a certificate around. NSS hammer shuts down. NSS starts up. app call NSS_func(leaked_cert); How does NSS_func() know that leaked_cert is not valid. There are methods in which applications keep NSS data as Aliases (i.e., the member can be set to null by NSS, because the data is an AliasTarget) and apps always test for null before using the data. However, Alias and AliasTargets have to be used carefully in multithreaded environment (i.e., alias usage and aliastarget cleanup of it's alias must occur on the same thread or else it must be protected), and the application changes would be very large. 2) It's just telling application developers: don't worry about making sure that you release all NSS data when you need to. If the requirement is that NSS be shutdown and restarted, and no crash can happen in the future because of leaked reference to NSS data, then the hammer solution must be ruled out.
stephane: i was only talking about the layered PRFileDesc... it seems like NSS could protect itself at shutdown by going through and marking all of its open file descriptors as invalid/broken (think: EPIPE). having this in place would make NSS less susceptible to coding errors in the client. it doesn't mean that clients shouldn't shutdown in the proper order; it just makes the system more robust :-/
you assume NSS has access to all the open file descriptors. NSS is just a layer on the stack, and has no knowledge about what file descriptors happenned to be openned. All the state is stored in the file descriptor itself. There is no list of open file descriptors in NSS -- that's up to the application to manage. bob
ic, well, wishful thinking i guess :-/
Can you please review the patch?
Comment on attachment 100892 [details] [diff] [review] Patch v1 sr=darin
Attachment #100892 - Flags: superreview+
bryner: you might be interested in this patch as well.
I'm not sure why we need to observe "profile-change-net-teardown" in the http handler as well. It gets used here http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsIOService.cpp#707 After the IOService has gone offline due to this notification (which I thought happened synchronously), what's still hanging around?
yeah, socket transport service shutdown does happen synchronously (i.e., the UI thread joins with the socket thread).
I can't confirm to your assumption. The event observation code in HttpHandler is not a no-op. When the event is observed in HttpHandler, it finds alive objects and cleans them up. If I remember correctly, it was happening in the call to: DropConnections(mIdleConnections); The cleanup code results in some http objects getting freed, and in case of https connections, SSL cleanup functions are reached.
Note: The problem was seen on recent MOZILLA_1_0_BRANCH code.
Comment on attachment 100892 [details] [diff] [review] Patch v1 r=ccarlen
Attachment #100892 - Flags: review+
Comment on attachment 100892 [details] [diff] [review] Patch v1 a=asa for checkin to 1.2beta (on behalf of drivers)
Attachment #100892 - Flags: approval+
Fixed on trunk. Darin cc'ed me on bug 174131, and if I understand correctly, that other bug might be the real cause of the problem we are trying to fix here. While in theory, this patch might no longer be required, I checked it in anyway, because we all agreed it is safe - and it fixes the race condition.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
yeah, i agree that this patch is still the right thing. thanks kai!
firstname.lastname@example.org for 1.0 branch. Change mozilla1.0.2+ to fixed1.0.2 when checked in.
Marking as topembed+, as this is needed by a major embedding customer.
Plussing for adt. Need checkin by cob 10/16. Please make sure to get all required approvals (drivers) before checkin if this is not already done.
verified code fix on trunk
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.