Closed Bug 508633 Opened 14 years ago Closed 7 years ago

Unresponsive OCSP server should not kill page load

Categories

(Core :: Security: PSM, defect)

defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: johnath, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [psm-fatal])

[Filing as PSM because I think this is a product issue, not an NSS one, but copying some NSS folks as well, because they'll know for sure]

We had a situation in bug 508408 where an unresponsive OCSP responder on https://addons.mozilla.org was causing page load to hang for a few minutes before timing out. Rerouting the OCSP requests to localhost (which immediately refused the connection) let the page load instantly, albeit downgraded to DV, as expected from bug 405139.

It is my opinion that an unresponsive OCSP responder should be treated like one that refuses the connection, that is, EV status should be revoked, but the connection should proceed. Indeed, in a bug I am having no luck finding, I seem to recall Kai implementing precisely that behaviour, with a timeout of something like 15s. But it would seem that fix did not do what was intended, or that we've discovered some similar situation not covered by the original patch.
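To make the proposed behaviour concrete, here is a minimal sketch of the decision being argued for. It is not PSM or NSS code: `OcspOutcome`, `decide`, and the `require_response` flag are hypothetical names, and the flag stands in for the "require an OCSP response" option that a later comment in this bug mentions.

```python
from enum import Enum


class OcspOutcome(Enum):
    GOOD = "good"            # responder answered: certificate not revoked
    REVOKED = "revoked"      # responder answered: certificate revoked
    REFUSED = "refused"      # connection to the responder was refused
    TIMED_OUT = "timed_out"  # responder never answered within the timeout


def decide(is_ev, outcome, require_response=False):
    """Return (proceed, indicator) for a connection, given an OCSP outcome.

    Hypothetical policy sketch only; not the actual PSM/NSS logic.
    """
    if outcome is OcspOutcome.REVOKED:
        return (False, None)
    if outcome is OcspOutcome.GOOD:
        return (True, "EV" if is_ev else "DV")
    # REFUSED and TIMED_OUT are deliberately handled the same way.
    if require_response:
        return (False, None)
    # No usable answer: let the load proceed, but without EV treatment.
    return (True, "DV")
```

Under this sketch, `decide(True, OcspOutcome.TIMED_OUT)` and `decide(True, OcspOutcome.REFUSED)` give the same result: the load proceeds with DV indications.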
Mike, although the NSS library uses locks and is quite capable of operating
simultaneously on multiple threads, I am told that PSM single-threads all
SSL connections.  Perhaps that is the problem.  If multiple requests are
being made to the same server, and for each one there is a long timeout,
and those timeouts are serialized rather than running concurrently ...

If NSS is not obeying the configured socket timeouts, that's an NSS bug.
Otherwise ...
Mozilla itself "single threads" all of its networking, too :-)
It uses a single dispatcher thread for all networking, which dispatches non-blocking I/O calls serially...

PSM uses the same strategy on its own separate thread.

Sometimes NSS decides to block a non-blocking read/write call... when it wants to perform OCSP.

Yes, at this given moment, PSM will not be able to perform any other read/write calls until the situation has been resolved and the non-SSL OCSP request has been completed.

During this time, all attempts to do read/write on an SSL socket, initiated by the Mozilla networking layer, will be temporarily rejected with error code WOULDBLOCK.
Johnathan, yes we had reduced the timeout, and then it had been declared as acceptable/fixed.

We need to reproduce the AMO situation and debug it to understand what's happening.
Kai, the difference is that Mozilla waits for all pending IO
operations simultaneously, whereas PSM issues and waits for
pending SSL IO operations one at a time.
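The cost of that difference can be illustrated with a toy model. This is only a sketch: `stalled_request` is a hypothetical stand-in that sleeps for its full timeout, and threads are used purely to show the waits overlapping; it does not model how Necko or PSM actually schedule I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stalled_request(timeout_s):
    """Stand-in for one request that waits out its full timeout."""
    time.sleep(timeout_s)
    return "timed out"


def wait_serially(n, timeout_s):
    """Issue n stalled requests one at a time; return elapsed seconds."""
    start = time.monotonic()
    for _ in range(n):
        stalled_request(timeout_s)
    return time.monotonic() - start


def wait_concurrently(n, timeout_s):
    """Issue n stalled requests at once; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(stalled_request, [timeout_s] * n))
    return time.monotonic() - start
```

Calling `wait_serially(3, 0.2)` takes roughly three timeouts, while `wait_concurrently(3, 0.2)` takes roughly one, which is the effect suspected above when several requests hit the same dead responder.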
(In reply to comment #4)
> Kai, the difference is that Mozilla waits for all pending IO
> operations simultaneously, whereas PSM issues and waits for
> pending SSL IO operations one at a time.

Agreed, that's true.
I found my old patch and have attached it to bug 511393 for review.
(In reply to comment #3)
> We need to reproduce the AMO situation and debug it to understand what's
> happening.

This appears trivial to duplicate: set up an OCSP responder that either has :80 firewalled off or accepts a connection and never responds.

In fact, I duplicated this just last night in production!
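For the second variant (accepts a connection and never responds), a minimal local stand-in could look like the sketch below. The function name `start_stalling_responder` is made up for illustration, and pointing a test certificate's OCSP URL (or a hosts-file entry) at it is left as an assumption about the test setup.

```python
import socket
import threading


def start_stalling_responder(host="127.0.0.1", port=0):
    """Listen on (host, port), accept connections, and never reply.

    Returns the listening socket; port=0 lets the OS pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # never read, never write, never close

    threading.Thread(target=accept_forever, daemon=True).start()
    return server
```

`start_stalling_responder().getsockname()` reports the address that was bound. The accepting thread is a daemon, so the calling process has to stay alive for as long as the stall is needed.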
Depends on: 511393
Whiteboard: [psm-fatal]
I think there are several issues:

* When an OCSP request times out, we should continue to let the page load, even if it is EV, if the option to require an OCSP response hasn't been set; in the case of EV, this should downgrade to DV indications. However, this isn't working; the timeout causes the page to fail to load.

* When a page has N HTTPS subresources that use the same OCSP responder, and that OCSP responder doesn't respond, we may make up to N requests to that OCSP responder, each of which will time out. This is different than the normal case where we get a response, because then we only make 1 OCSP request and then cache it. To resolve this, we would need to cache an indication that the OCSP responder is down and avoid making OCSP requests to it for the current page and/or for a certain time period.

* It may be the case that the 15s timeout isn't functioning correctly.

* 15s may be too long a timeout.

* Bug 511393 causes all SSL traffic to be serialized behind the current OCSP request, even when that traffic doesn't depend on that OCSP request.

Is there anything I missed? Let's make this bug about the first issue and then I'll file separate bugs about the others.
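The "cache an indication that the OCSP responder is down" idea from the second item could be sketched roughly as follows. Everything here is hypothetical: `ResponderDownCache`, `check_all`, the 60-second default lifetime, and the caller-supplied `query` function are illustration only, not an existing PSM or NSS interface.

```python
import time


class ResponderDownCache:
    """Remember responders that recently failed to answer."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # assumed lifetime of a "down" entry
        self._clock = clock
        self._down_until = {}    # responder URL -> time the entry expires

    def mark_down(self, responder_url):
        self._down_until[responder_url] = self._clock() + self._ttl

    def is_down(self, responder_url):
        expires = self._down_until.get(responder_url)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._down_until[responder_url]  # entry expired, retry allowed
            return False
        return True


def check_all(cert_responders, query, cache):
    """Query each certificate's responder, skipping ones known to be down.

    `query(url)` is a caller-supplied function that returns a status
    string or raises TimeoutError.
    """
    results = []
    for url in cert_responders:
        if cache.is_down(url):
            results.append("skipped")
            continue
        try:
            results.append(query(url))
        except TimeoutError:
            cache.mark_down(url)
            results.append("timed out")
    return results
```

With N subresources sharing one dead responder, only the first request pays the timeout; the remaining N-1 are skipped until the entry expires. Whether the entry should last for the current page, a fixed period, or both is the open question from the list above.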
reassign bug owner.
mass-update-kaie-20120918
Assignee: kaie → nobody
Blocks: 803582
We currently time out after 2 seconds for DV and 10 seconds for EV. I don't think there's anything else to do here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME