Closed Bug 705755 Opened 9 years ago Closed 3 years ago

Reintroduce handling of SSL short write after SSL thread removal

Categories

(Core :: Security: PSM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: briansmith, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [psm-assigned])

Attachments

(1 file, 3 obsolete files)

+++ This bug was initially created as a clone of Bug #674147 +++

(In reply to Honza Bambas (:mayhemer) from Bug 674147 Comment 37)
> You are regressing bug 378629:
> https://bugzilla.mozilla.org/show_bug.cgi?id=356470#c54
> https://bugzilla.mozilla.org/show_bug.cgi?id=378629#c11
> 
> PSMSend must watch for libssl sending just request - 1 amount of data (= a
> "short write").  In that case, the simplest solution is IMO that we have to
> limit next data send to just 1 byte, simply explained in a pseudo code:
> 
> PSMSend(buffer, count)
> {
>   static flag = false;
>   if (flag)
>     count = 1;
>     flag = false;
> 
>   written = lower->send(buffer, count)
>   if (written == count - 1)
>     flag = true;
> }
> 
> We can do this in a separate bug, but ASAP.  The final solution is up to
> you, but I'd like to review and test it.

This is marked blocking bug 674147 because that is how we show "this bug is caused by that bug" but I still intend to land the patch in bug 674147 first.
I renamed the bug to be more descriptive and also added refs to other bugs that gets affected by bug 674147.

I agree to land that bug first, this one is not a critical issue.
Blocks: 356470, 378629
Summary: remove the SSL thread (nsSSLThread) → Reintroduce handling of SSL short write after SSL thread removal
No longer blocks: 674147
Depends on: 674147
-> me

Stealing this one.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Attached patch v1 (obsolete) — Splinter Review
- implements what the SSL thread had been implementing before its removal
- adds also check for 16383 bytes of written data ; this often happens even for larger buffers like 32768, w/o it we jitter again with 2 bytes being later written separately
- the 16383 const is disputable

Added Nelson for feedback to get more info on how all ssl lib indicates the short write.

To have what we had before SSL thread removal we may just remove the 16383 const checking and we are done.
Attachment #582708 - Flags: review?(kaie)
Attachment #582708 - Flags: feedback?(nelson)
Aren't we short on time for reviews on these for ff11 ?
I agree with the patch in general, just some small comments:


+/* Secure write sometimes returns this amount of bytes written even for larger
+   sizes of the output buffer, consider this as indication of a short write */
+static const PRInt32 kShortWrite16k = 16383;

(a)

Please explain (in the comment) where this number comes from.
I had to research myself.

I believe this is based on libssl's constant 
   #define MAX_FRAGMENT_LENGTH             16384

Maybe add this comment:
  // Value is based on libssl's maximum buffer size (in a private header):
  //   MAX_FRAGMENT_LENGTH - 1
  // TODO: Query this value once at runtime (changes in NSS required)


(b)

Nit, optional: Given you only use this constant once,
why not move it closer to the place where you use it?
Makes the explanation easier to find when reading the code.



(c)

+  // Value of the last byte pending from the SSL short write that needs
+  // to be passed to subsequent calls to send to perform the flush.
+  char      mShortWritePendingByte;

Please change that to "unsigned char", because that's what you use everywhere else.



(d)


+  if (socketInfo->IsShortWritePending()) {
+    // We got "SSL short write" last time, try to flush the pending byte.
+    buf = socketInfo->GetShortWritePendingByteRef();
+    amount = 1;
+ ###
+    PR_LOG(gPIPNSSLog, PR_LOG_DEBUG, ("[%p] pushing 1 byte after SSL short write",
+           fd));
+  }
+


If I understand correctly, if we reach position ###,
the following must be true:

- the function was called with: amount >= 1 

- this expression is true:
  *(const unsigned char*) pending-ref == *buf-given-to-function[0]

I propose to add an assertion,
that way you don't need to change the "buf" pointer.

  if (socketInfo->IsShortWritePending()) {
    // We got "SSL short write" last time, try to flush the pending byte.
    NS_ASSERTION(amount >= 1 &&
                 buf[0] == *(const unsigned char*)
                           socketInfo->GetShortWritePendingByteRef(),
                 "unexpected buffer after short write");
    amount = 1;
    PR_LOG(gPIPNSSLog, PR_LOG_DEBUG, ("[%p] pushing 1 byte after SSL short write",
           fd));
  }


(e)

+  // SSL Short Write handling.
+  //
+  // True when SSL layer has indicated an "SSL short write", i.e. need
+  // to call on send one or more times to push all pending data to write.
+  bool      mIsShortWritePending;
+  // These are only valid iff mIsShortWritePending is true.
+  //
+  // Value of the last byte pending from the SSL short write that needs
+  // to be passed to subsequent calls to send to perform the flush.
+  char      mShortWritePendingByte;
+  // Original amount of data the upper layer has requested to write to
+  // return after the successful flush.
+  PRInt32   mShortWriteOriginalAmount;

In this block it's difficult to see which variables you refer to when saying
  "These are only valid iff mIsShortWritePending is true."

I'd align the block clearer, maybe like this:

  // True when SSL layer has indicated an "SSL short write", i.e. need
  // to call on send one or more times to push all pending data to write.
  bool      mIsShortWritePending;

  // Value of the last byte pending from the SSL short write that needs
  // to be passed to subsequent calls to send to perform the flush.
  // Only valid if mIsShortWritePending
  unsigned char mShortWritePendingByte;

  // Original amount of data the upper layer has requested to write to
  // return after the successful flush.
  // Only valid if mIsShortWritePending
  PRInt32   mShortWriteOriginalAmount;
Please run some quick local testing with a debug build, after you do (d), and confirm you don't see the assertion.

If you address my proposals, r=kaie

No second round of review necessary, but please attach the new patch for reference.
Comment on attachment 582708 [details] [diff] [review]
v1

r-

If you address my small comments, you may mark r=kaie on your new patch.
Attachment #582708 - Flags: review?(kaie) → review-
> r-

because of the char/unsigned char inconsistence.
Blocks: 674147
No longer depends on: 674147
We may want to check this is actually still needed.
Status: ASSIGNED → NEW
David, can you or someone from your team please verify this is still an issue and should be fixed?
Assignee: honzab.moz → nobody
Flags: needinfo?(dkeeler)
Sure, I can investigate this.
Flags: needinfo?(dkeeler)
See Also: → 1382886
"perf" key word?
Assignee: nobody → honzab.moz
Blocks: 1382886
Status: NEW → ASSIGNED
Attached patch v1 (merged) (obsolete) — Splinter Review
I addressed all the Kai's 5 years old comments where I believed it was necessary.  I can't trigger the code path tho, so hard to say if this breaks something or actually fixes something.
Attachment #582708 - Attachment is obsolete: true
Attachment #8900871 - Flags: review?(dkeeler)
Whiteboard: [psm-backlog] → [psm-active]
Target Milestone: mozilla11 → ---
Comment on attachment 8900871 [details] [diff] [review]
v1 (merged)

Review of attachment 8900871 [details] [diff] [review]:
-----------------------------------------------------------------

Cool. This looks good. r=me with comments addressed (particularly the assertion failure and the documentation).

I managed to reproduce the short write situation by running a server locally and using tc to limit the bandwidth and set a delay to it. It looks like Firefox can get into an inefficient state where it will send a large packet followed by a small packet followed by a large packet, etc.. This patch addresses the issue and makes the packet size much more uniform (and large). Unfortunately, it seems this would be hard to write a test for. Maybe we could add this as a QA smoketest task? Or I think we have some talos tests that simulate network effects that should be able to get Firefox in this state...

Anyway, I think we should elaborate on the documentation in this patch. Maybe something like this:

NSS indicates that it can't write all requested data (due to network congestion, for example) by returning either one less than the amount of data requested or 16383, if the requested amount is greater than 16384. We refer to this as a "short write". If we simply returned the amount that NSS did write, the layer above us would then call PSMSend with a very small amount of data (often 1). This is inefficient and can lead to alternating between sending large packets and very small packets. To prevent this, we alert the layer calling us that the operation would block and that it should be retried later, with the same data. When it does, we tell NSS to write the remaining byte it didn't write in the previous call. We then return the total number of bytes written, which is the number that caused the short write plus the additional byte we just wrote out.

::: security/manager/ssl/nsNSSIOLayer.cpp
@@ +1483,5 @@
>  #endif
>  
> +  if (socketInfo->IsShortWritePending() && amount > 0) {
> +    // We got "SSL short write" last time, try to flush the pending byte.
> +    MOZ_DIAGNOSTIC_ASSERT(*static_cast<unsigned char const*>(buf) == *socketInfo->GetShortWritePendingByteRef(),

This assertion fails for me. If I'm understanding correctly, the higher layer is retrying the original write, right? If so, buf wouldn't be the byte(s) that didn't get written - buf would be the original data, in which case buf[mShortWriteOriginalAmount - 1] should be mShortWritePendingByte, right? In fact, if we wanted to give ourselves more confidence that the right thing was happening, maybe we could save the original buf and compare it to the new buf (probably in debug-only builds, though).

@@ +1497,4 @@
>    int32_t bytesWritten = fd->lower->methods->send(fd->lower, buf, amount,
>                                                    flags, timeout);
>  
> +  /* Secure write sometimes returns this amount of bytes written even for larger

nit: //-style comments, please

@@ +1501,5 @@
> +     sizes of the output buffer, consider this as indication of a short write.
> +
> +     Value is based on libssl's maximum buffer size (in a private header):
> +        MAX_FRAGMENT_LENGTH - 1
> +     TODO: Query this value once at runtime (changes in NSS required) */

Let's file a bug in NSS :: Libraries to make this API available and reference it here.

::: security/manager/ssl/nsNSSIOLayer.h
@@ +175,5 @@
> +  // True when SSL layer has indicated an "SSL short write", i.e. need
> +  // to call on send one or more times to push all pending data to write.
> +  bool      mIsShortWritePending;
> +
> +  // These are only valid iff mIsShortWritePending is true.

nit: "only valid if and only if" is a bit redundant :)
Attachment #8900871 - Flags: review?(dkeeler) → review+
Priority: P3 → P1
Whiteboard: [psm-active] → [psm-assigned]
Attached patch v1.1Splinter Review
(updated ci message)
Attachment #8902272 - Attachment is obsolete: true
Attachment #8902311 - Flags: review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/21350cc2ff6b
Handle SSL short-write correctly to save CPU looping. r=keeler
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/21350cc2ff6b
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.