Closed Bug 171220 Opened 22 years ago Closed 22 years ago

Profile switching network teardown race condition with NSS shutdown

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

mozilla1.0.2

People

(Reporter: KaiE, Assigned: KaiE)

References

Details

(Keywords: topembed+, Whiteboard: [adt2] [ETA 10/15])

Attachments

(1 file)

Patch v1 22 years ago Kai Engert (:KaiE:) 1.27 KB, patch	ccarlen : review+ darin.moz : superreview+ asa : approval+	Details \| Diff \| Splinter Review

Kai Engert (:KaiE:)

Assignee

Description

•

22 years ago

cc'ing everybody I know is involved in profile switching / turbo mode

We currently on switching profiles in embedding apps after doing some crypto
operations, becasue of a race condition between network shutdown and NSS shutdown.


Intro:
In order to switch profiles, it is necessary to prepare a clean environment, to
release all usage of NSS code, to allow NSS to shut itself down and re-init cleanly.
In the past we prepared this by switching Necko of offline state, and profile
manager sends out events to prepare that clean environment.
Three events are used:
- profile-change-net-teardown
- profile-change-teardown
- profile-before-change
I believe the plan was to bring down all network activity during event
profile-change-net-teardown.


Bug:
In file nsHttpHandler.cpp there is code to drop all http connections when events
are observed.
However, the code listens for "profile-before-change" (string at two places in
the file).
But the PSM/NSS code that assumes network is already down is also being called
when the same event is received!
In my test case, the NSS shutdown code was actually being reached before http
dropped the connections.

Proposal:
In order to fix the race condition, I propose to change nsHttpHandler to drop
connections on even "profile-change-net-teardown".


(When I talked to Darin yesterday, he reminded that a clean solution would be if
NSS itself took care, that it ignored any open handles after it has shut down
itself. But I'm unable to provide this approach at the current time. The NSS
team is still looking into a "hammer" shutdown, but still, I think we should
make sure that we avoid anything that could eventually mean stale resources
being still accessed/referenced.)

Kai Engert (:KaiE:)

Assignee

Comment 1

•

22 years ago

Attached patch Patch v1 — Details — Splinter Review

Stephane Saux

Comment 2

•

22 years ago

The hammer solution is fraugh with problems:

1) It would probably result in a later crash. This crash, the result of an app
keeping an NSS ref around that has become invalid because of the NSS shutdown.
This crash would still happen in NSS (because the data would be passed to an NSS
function), and the support calls on NSS would be even higher.
 NSS cannot "ignore any open handle". The typical scenario would be: app keeps a
certificate around. NSS hammer shuts down. NSS starts up. app call
NSS_func(leaked_cert); How does NSS_func() know that leaked_cert is not valid.
 There are methods in which applications keep NSS data as Aliases (i.e., the
member can be set to null by NSS, because the data is an AliasTarget) and apps
always test for null before using the data. However, Alias and AliasTargets have
to be used carefully in multithreaded environment (i.e., alias usage and
aliastarget cleanup of it's alias must occur on the same thread or else it must
be protected), and the application changes would be very large.


2) It's just telling application developers: don't worry about making sure that
you release all NSS data when you need to.

If the requirement is that NSS be shutdown and restarted, and no crash can
happen in the future because of leaked reference to NSS data, then the hammer
solution must be ruled out.

Darin Fisher

Comment 3

•

22 years ago

stephane:

i was only talking about the layered PRFileDesc... it seems like NSS could
protect itself at shutdown by going through and marking all of its open file
descriptors as invalid/broken (think: EPIPE).  having this in place would make
NSS less susceptible to coding errors in the client.  it doesn't mean that
clients shouldn't shutdown in the proper order; it just makes the system more
robust :-/

Robert Relyea

Comment 4

•

22 years ago

you assume NSS has access to all the open file descriptors. NSS is just a layer
on the stack, and has no knowledge about what file descriptors happenned to be
openned. All the state is stored in the file descriptor itself. There is no list
of open file descriptors in NSS -- that's up to the application to manage.

bob

Darin Fisher

Comment 5

•

22 years ago

ic, well, wishful thinking i guess :-/

Kai Engert (:KaiE:)

Assignee

Updated

•

22 years ago

Keywords: topembed

Kai Engert (:KaiE:)

Assignee

Comment 6

•

22 years ago

Can you please review the patch?

Darin Fisher

Comment 7

•

22 years ago

Comment on attachment 100892 [details] [diff] [review]
Patch v1

sr=darin

Attachment #100892 - Flags: superreview+

Darin Fisher

Comment 8

•

22 years ago

bryner: you might be interested in this patch as well.

Conrad Carlen (not reading bugmail)

Comment 9

•

22 years ago

I'm not sure why we need to observe "profile-change-net-teardown" in the http
handler as well.

It gets used here
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsIOService.cpp#707
After the IOService has gone offline due to this notification (which I thought
happened synchronously), what's still hanging around?

Darin Fisher

Comment 10

•

22 years ago

yeah, socket transport service shutdown does happen synchronously (i.e., the UI
thread joins with the socket thread).

Kai Engert (:KaiE:)

Assignee

Comment 11

•

22 years ago

I can't confirm to your assumption.

The event observation code in HttpHandler is not a no-op.

When the event is observed in HttpHandler, it finds alive objects and cleans
them up.

If I remember correctly, it was happening in the call to:
  DropConnections(mIdleConnections);

The cleanup code results in some http objects getting freed, and in case of
https connections, SSL cleanup functions are reached.

Kai Engert (:KaiE:)

Assignee

Comment 12

•

22 years ago

Note: The problem was seen on recent MOZILLA_1_0_BRANCH code.

Jaime Rodriguez, Jr.

Updated

•

22 years ago

Blocks: 154896

Conrad Carlen (not reading bugmail)

Comment 13

•

22 years ago

Comment on attachment 100892 [details] [diff] [review]
Patch v1

r=ccarlen

Attachment #100892 - Flags: review+

Kai Engert (:KaiE:)

Assignee

Updated

•

22 years ago

Keywords: adt1.0.2, mozilla1.0.2

Asa Dotzler [:asa]

Comment 14

•

22 years ago

Comment on attachment 100892 [details] [diff] [review]
Patch v1

a=asa for checkin to 1.2beta (on behalf of drivers)

Attachment #100892 - Flags: approval+

Kai Engert (:KaiE:)

Assignee

Comment 15

•

22 years ago

Fixed on trunk.

Darin cc'ed me on bug 174131, and if I understand correctly, that other bug
might be the real cause of the problem we are trying to fix here. While in
theory, this patch might no longer be required, I checked it in anyway, because
we all agreed it is safe - and it fixes the race condition.

Status: NEW → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Darin Fisher

Comment 16

•

22 years ago

yeah, i agree that this patch is still the right thing.  thanks kai!

Gayatri Rimola

Updated

•

22 years ago

QA Contact: ktrina → gbush

Randell Jesup [:jesup] (needinfo me)

Comment 17

•

22 years ago

a=rjesup@wgate.com for 1.0 branch.  Change mozilla1.0.2+ to fixed1.0.2 when
checked in.

Keywords: mozilla1.0.2 → mozilla1.0.2+

Jaime Rodriguez, Jr.

Comment 18

•

22 years ago

Marking as topembed+, as this is needed by a major embedding customer.

Keywords: topembed → topembed+

Priority: -- → P1

Whiteboard: [adt2] [ETA 10/15]

Target Milestone: --- → mozilla1.0.2

Michael Buckland

Comment 19

•

22 years ago

Plussing for adt.  Need checkin by cob 10/16. Please make sure to get all
required  approvals (drivers) before checkin if this is not already done.

Keywords: adt1.0.2 → adt1.0.2+

Michael Buckland

Updated

•

22 years ago

Blocks: blackbird

Kai Engert (:KaiE:)

Assignee

Comment 20

•

22 years ago

Checked in to 1.0 branch.

Keywords: adt1.0.2+, mozilla1.0.2+ → fixed1.0.2

Grace Bush

Comment 21

•

22 years ago

verifying code fix for branch
marking verified1.0.2

Keywords: fixed1.0.2 → verified1.0.2

Grace Bush

Comment 22

•

22 years ago

verified code fix on trunk

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Core → Core Graveyard

You need to log in before you can comment on or make changes to this bug.