Closed Bug 1120046 Opened 9 years ago Closed 7 years ago

RFC mozilla/netwerk/base/src/nsBufferedStreams.cpp: better error reporting [forget the thread-race lock in the original post]

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: ishikawa, Assigned: ishikawa)

Details

(Whiteboard: [necko-active])

Attachments

(1 file, 4 obsolete files)

RFC mozilla/netwerk/base/src/nsBufferedStreams.cpp: better error reporting and maybe adding thread-race lock

Work-in-progress patch

[1] Subtle Error reporting Issues.

I have noticed a subtle error reporting failure in
|Write| in mozilla/netwerk/base/src/nsBufferedStreams.cpp.

The uploaded patch fixes this issue.

nsBufferedOutputStream::Write(const char *buf, uint32_t count, uint32_t *result)

This should report NS_BASE_STREAM_CLOSED immediately
at the top when mStream is nullptr.

A call of Flush() within this |Write| probably OUGHT TO return this
error, but for some reason, |Flush| returns NS_OK when mStream is
nullptr.

So, after thinking hard, I decided to return this error immediately at
the top of |Write| function.

Also, at the end, if |Flush()| returns an error (maybe flakey file
system such as overflowing file system, or transient network error for
a remotely mounted file system, etc.), then |write| probably also
returns THAT error to catch this problem soonest. So I added the check
for NS_SUCCEEDED(rc) before returning from |Write|.

nsBufferedOutputStream::Finish()

This function seems to lack proper error checks and may not report
errors properly (errors from underlying streams, etc.).

So I followed the existing handling if the error return
in  nsBufferedOutputStream::Close().

Background:

I am trying to find the root cause of a message download failure when
buffering is enabled for a POP3 download thread (in 
comm-central/mailnews/local/src/nsPop3Sink.cpp ).
See [1]

I noticed the proper error is not returned at the soonest opportunity.
Thus the proposed patch.

But I noticed a larger issue as well. The thread-race issue.

[2] Multi-thread issue.

Thread safety

As of now, some key output functions in nsBufferedStreams.cpp are not
thread-safe concerning the handling of the buffers maintained in the
file.

In the terms used in the following,

http://www.gnu.org/software/libc/manual/html_node/Block-Input_002fOutput.html

these functions are "unlocked" versions.
(In Microsoft parlance, it is "nolock" vs "lock".
http://msdn.microsoft.com/en-us/library/eadhs8bd.aspx )

I think we need "locked" version as well.  That is, the buffers
handled at the level of output functions inside nsBufferedStreams.cpp
are not protected against racey accesses of competing threads at all,
and to prevent the races, we need proper locks.

Background: 

This race of buffering output became evident when, with thunderbird,
writing to Inbox seems to be run with a file stream shared at least
with two (maybe more?)  threads.

As of now, this particular write of downloading messages to Inbox is
not buffered at all. And the race does not manifest itself.

However, for performance reasons, I would like to make this writing
buffered.  A large attachment such as PDF or JPEG sent by TB and
received by another TB user ends up calling |write| system call of 75
octets repeatedly.  (If you receive a 3MB attachments, 41K+ |write|
system calls are invoked. Not so nice in terms of system performance).
Using a smallish 16KB buffer, we can reduce the # of |write| system
calls by a factor of 218. (If the mail is stored on a shared file
system remotely mounted, this difference of |write| calls that need to
travel over the wire makes a world difference of I/O performance. Or
for that matter, if this Inbox is in SSD or USB memory stick, the
reduced # of |write| by coalescing the repeated subsequent writes is
a clear win.)

But when I tried to enable buffering in nsPop3Sink.cpp, I saw the
failure to merge a downloaded message into Inbox in a test scenario.

Why the failure?
A thread that downloads e-mail messages from POP3 server and writes to
Inbox seems disrupted by another phantom thread [I have no idea what
thread that is, and why such a rogue thread is running :-( ] that
seems to moves the read/write pointer by |Seek| (and low-level I/O
function) and, worse, the underlying file-descriptor that is
referenced implicitly from the buffered stream is reset
(closed?) before this download thread knows about it, and suddenly it
sees the error of NS_BASE_STREAM_CLOSED (as discussed above: it was
not evident until the strict error reporting at the soonest
opportunity was done in |Write|, etc.) when a write to Inbox is attempted.

So I suspect, in addition to simple lock to protect |Write|, |Seek|,
|Flush|, etc., counting-semaphore-like protection is required for
|Open/Close| so that underlying stream is not closed while at least
one thread is referencing the buffered write stream (at the level of
nsBufferedStreams.cpp). Only when the last remaining such thread calls
|Close| to clear the buffered stream at the level of
nsBufferedStreams.cpp, then the low-level stream should be closed.

My big question: 

Is the omission of locking for buffered write functions at the level
of mozilla/netwerk/base/src/nsBufferedStreams.cpp INTENTIONAL or just an oversight?

Or AM I MISSING SOMETHING and the functions in nsBufferedStreams
are indeed protected from races? I doubt it from what I observed so
far by sprinkling verbose dump statements and see that
the |Flush|, |Close|, etc. suddenly begin to return
NS_BASE_STREAM_CLOSED.

Probably this locking issue was missed when conversion of nsFilespec
to nsIFile.  [2]

I have no idea what other program in Mozilla software packages will
require thread-safe buffered output routines. But TB definitely does
and I suspect Firefox OS would require such routines anyway for its applications.

TIA

PS: 

As for the problem of nsPop3Sink.cpp, it is possible that
nsPop3Sink.cpp that downloads messages from POP3 server and write them
to Inbox needs still higher application-level locks after the fix to
add locking to output functions in BufferedStreams.cpp.

Observation: 

In nsPop3Sink.cpp, I see a pattern of

   |Seek| to a place in Inbox
   // what happens if the flow is interrupted here.
   |Write| something

in the download code. I am afraid something bad happens when this thread is interrupted between |Seek| and |Write|.
There is no guarantee of writing to the desired place.

So we seem to need to protect these pair of |Seek| and |Write| as well at the level of nsPop3sink.cpp
as the proposed addition of locking to nsBufferedStreams.cpp.
(Without the locks to output functions in nsBufferedStreams.cpp, the
buffer data structure in nsBufferedStreams.cpp loses integrity because
the data may be corrupted within |Write| processing.)

Oh, I forgot to mention that the buffered streams are passed to routines outside nsPop3Sink.cpp as well. So the locking at the low-level nsBufferedStreams.cpp is a must.
Tweaking nsPop3Sink.cpp alone will not solve the race issue.


[1]  See Bug
Bug 1116055 Performance issue: Failure to use buffered write (comm-central thunderbird)
The patch proposed there so far dealt with the buffering of copy/move
of messages once they are in local folders. As it turns out, the
download from POP3 server to Inbox is not buffered at all

[2] Comment 5 in the bugzilla of above.
https://bugzilla.mozilla.org/show_bug.cgi?id=1116055#c5
Comment on attachment 8546973 [details] [diff] [review]
Work-in-progress: better error reporting in a couple of places.

This is a work-in-progress patch and there are a few extra outputs meant for
easier debugging.
But the change in Write to report NS_BASE_STREAM_CLOSED sooner,
and the change of the return code,
and the checking of error values in Finish seem to be very important.

An optimization of not calling Flush in Seek when buffer is empty is a bonus.
(But obviously, Flush() needs a protection of MT-safety.)

TIA
Attachment #8546973 - Flags: review?(jduell.mcbugs)
I only mentioned output/write, but input/read probably should be protected in the same manner although I have not looked into the issue very much.

TIA
Comment on attachment 8546973 [details] [diff] [review]
Work-in-progress: better error reporting in a couple of places.

Review of attachment 8546973 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine but Honza knows the streams code better than I do.
Attachment #8546973 - Flags: review?(jduell.mcbugs) → review?(honzab.moz)
(In reply to Jason Duell [:jduell] (needinfo? me for lower latency) from comment #3)
> Comment on attachment 8546973 [details] [diff] [review]
> Work-in-progress: better error reporting in a couple of places.
> 
> Review of attachment 8546973 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks fine but Honza knows the streams code better than I do.

Thank you for your comment. 

BTW, the code that prompted debugging and made me come up with bugzilla entry 
turned out to be full of unnecessary extra bogus close calls.
So I filed the following.
Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends. 

Maybe the issue of thread-safety can wait until that clean up may end as far as TB is concerned.
(But I think MT-safety is a real issue and we probably want it 
for FireFox OS applications. So someone may want to take a lead in that direction.)

TIA
Re: error reporting:  it's always good to have it better, I'll check the patch and see.  Problem is that this is a very low level class and any change may cause a lot of, first hidden, troubles everywhere.  So, changing it is really sensitive.

Re: threading:  it's tricky.  The API as designed cannot ensure that when someone calls Seek() and then Write() that the cursor may not be altered in between by a different thread call.  However, if there are code paths allowing it then I would fairly say it's the software architectural problem!  The stream is a very low level thing, actually pretty dumb.  Mainly for performance, reliability (well predictable and clear simple behavior) and flexibility point of use and also because anything more complicated simply should be in hands of the consumer (user) of the stream.

Re: close only on full release:  I'm absolutely against putting it into this class!!  Such logic doesn't belong there, if anywhere.  If you want it, have a wrapper over the buffered stream.  Also, more then a single consumer working on multiple threads is (again) weird architecture...
Also, the interface doesn't say anything about thread-safety.  If you want a t-s version of buffered streams, you may reimplement it and use a different contractid.
Comment on attachment 8546973 [details] [diff] [review]
Work-in-progress: better error reporting in a couple of places.

Review of attachment 8546973 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsBufferedStreams.cpp
@@ +171,5 @@
>      METER(bufstats.mBufferReadUponSeek += mCursor);
>      METER(bufstats.mBufferUnreadUponSeek += mFillPoint - mCursor);
>      rv = Flush();
> +#if DEBUG
> +    NS_WARNING("Flush() is called WITHIN Seek().");

and what negative effect this should have?

@@ +628,5 @@
>              if (NS_FAILED(rv)) break;
>          }
>      }
>      *result = written;
> +    return (written > 0) && NS_SUCCEEDED(rv) ? NS_OK : rv;

hmm.. no, this is not correct.  if you have written something the result must be OK.  the consumer has to recognize only part of the data has been written by comparing the number of bytes it wants to write and the number of bytes actually written.  In reality the code is usually:

while (toWrite> 0)
  rv = stream.Write(buffer + cursor, toWrite, &written);
  if (FAILED(rv)) -> "ERROR!!"; return;
  toWrite -= written;
  cursor += written;
}

As you can see, if not all is written, you will loop again.  When the stream is in a failed state, you will discover that on the (say) second call to Write().  So you will know there is an error.

Note: stream is not obligated to write everything.  It's up to consumer to handle that.

@@ +641,5 @@
>          // Stream already cancelled/flushed; probably because of previous error.
>          return NS_OK;
>      }
> +    // optimize : some code within C-C needs to call Seek -> Flush() often.
> +    if(mFillPoint == 0) {

if (

@@ +643,5 @@
>      }
> +    // optimize : some code within C-C needs to call Seek -> Flush() often.
> +    if(mFillPoint == 0) {
> +#if DEBUG
> +        NS_WARNING("Flush: 0 bytes in the buffer.");

don't, this is a valid operation!

@@ +741,5 @@
>  
>          uint32_t read = 0;
>          rv = reader(this, closure, mBuffer + mCursor, *_retval, left, &read);
>  
> +        if (NS_FAILED(rv)) // If we have read some data, return ok

leave this comment as is
Attachment #8546973 - Flags: review?(honzab.moz) → review-
Hi, thank you for your comment.

(In reply to Honza Bambas (:mayhemer) from comment #5)
> Re: error reporting:  it's always good to have it better, I'll check the
> patch and see.  Problem is that this is a very low level class and any
> change may cause a lot of, first hidden, troubles everywhere.  So, changing
> it is really sensitive.

Yes, I was wondering about the impact of the change.
 
> Re: threading:  it's tricky.  The API as designed cannot ensure that when
> someone calls Seek() and then Write() that the cursor may not be altered in
> between by a different thread call.  However, if there are code paths
> allowing it then I would fairly say it's the software architectural problem!
> The stream is a very low level thing, actually pretty dumb.  Mainly for
> performance, reliability (well predictable and clear simple behavior) and
> flexibility point of use and also because anything more complicated simply
> should be in hands of the consumer (user) of the stream.

Yes, |Seek| and |Write| (or for that matter |Read|) is only loosely coupled.
I agree that multiple-thread (or even a callback) changing the position of seek behind
one's back  is a bad design.
With that in mind, please read this comment in
http://mxr.mozilla.org/comm-central/source/mailnews/local/src/nsPop3Sink.cpp#671
 
This piece of code is in a tight loop: TB downloads a line of message and
then writes it to the output stream. So it is executed many times.
(This is where MIME encoded data is written by 75 octets at a time...)

671     // seek to the end in case someone else has seeked elsewhere in our stream.
672     nsCOMPtr <nsISeekableStream> seekableOutStream = do_QueryInterface(m_outFileStream);
673     seekableOutStream->Seek(nsISeekableStream::NS_SEEK_END, 0);
674     uint32_t bytesWritten;
675     m_outFileStream->Write(buffer.BeginReading(), bufferLen, &bytesWritten);
676     NS_ENSURE_TRUE(bytesWritten == bufferLen, NS_ERROR_FAILURE);

And it seems that the |Seek()| on line 673 does not seem to be NOOP all the time.
During testing, I think I have seen |Seek()| moving the write position to a place where |Write()| on 675 has moved it previously, meaning that someone else moved the Write position (!?). Hmm...

Oh, line 675 and 676 are the classic examples of the failed coding which you correctly
pointed out should not be used.
Instead of doing the loop to call |Write| many times if |Write| failed
to write all the data in one go, the above code tries to call Write once and fails
if the whole data is not written in one go.
TB is full of such code, and I am trying to fix them [1][2] at least for pop3 message download and imap as I notice the incorrect code one by one 
(after exposing the buggy code
and first report the error by  warning by NS_WARNING/NS_ERROR. 
Finding the proper error recovery method
is not so straightforward. At least by marking troublesome spots, I hope more people work on these serious issues and come up with proper error recovery.)

[1] Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.
[2] Bug 1122698 - C-C Thunderbird - Cleaning of incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends. (Step-1) 


> Re: close only on full release:  I'm absolutely against putting it into this
> class!!  Such logic doesn't belong there, if anywhere.  If you want it, have
> a wrapper over the buffered stream. 

I am not sure which part of my patch/suggestion you are referring to.
I will study the comment on the patch and see if I can make sense of the comment above.

> Also, more then a single consumer
> working on multiple threads is (again) weird architecture...

Maybe multi-threading may not be an apt description.
TB's design seems to let a callback that happens asynchronously to move a Write position or something :-(
I have yet to figure out the details which code is moving the write position 
behind the main-loop of pop3 downloading (and writing the message one line at at time).

BTW, the TB code performs |Close| on already |Close|d streams many times and so it adds to confusion, too. At least I have been able to reduce bogus |Close| calls so far in [1][2] and newer local patches I am testing.

(In reply to Honza Bambas (:mayhemer) from comment #6)
> Also, the interface doesn't say anything about thread-safety.  If you want a
> t-s version of buffered streams, you may reimplement it and use a different
> contractid.

Yes, this occurred to me.
If TB code really seem to require such M-T safety, and I may not be able
to solve the subtle issues when buffering is enabled [message corruption was observed before according to old bugzill entries],
I will create a wrapper class over the existing buffered stream (by adding a
lock to stream structure) and see if such a class can solve the corruption issue.
Using buffered output in message downloading seems to be really necessary for solving poor I/O performance for certain
users who need to use remote CIFS-mount for their Mail storage directory.
(I have verified that using CIFS-mount under linux really slows down the writing. No coalescing of local |write| is attemped, it seems. NFSv2 and NFSv3 daemons seem to allow such coalescing and I did not see the peformance degradation in contrast to CIFS-mount.)

Thank you again for your comments about general approach.

I will study your comment for the  patch I uploaded, and
will come up with the better patch after digesting your specific comments.
(In reply to ISHIKAWA, Chiaki from comment #8)
> This piece of code is in a tight loop: TB downloads a line of message and
> then writes it to the output stream. So it is executed many times.
> (This is where MIME encoded data is written by 75 octets at a time...)
> 
> 671     // seek to the end in case someone else has seeked elsewhere in our
> stream.
> 672     nsCOMPtr <nsISeekableStream> seekableOutStream =
> do_QueryInterface(m_outFileStream);
> 673     seekableOutStream->Seek(nsISeekableStream::NS_SEEK_END, 0);
> 674     uint32_t bytesWritten;
> 675     m_outFileStream->Write(buffer.BeginReading(), bufferLen,
> &bytesWritten);
> 676     NS_ENSURE_TRUE(bytesWritten == bufferLen, NS_ERROR_FAILURE);

This code is obviously wrong anyway - you must check result of Write().  When it's a failure then bytesWritten is undefined!  Calling Seek() before Write() since some other might manipulate the stream write position in the mean time is pretty ill architecture.  Something else then a stream is wrong here.


I'll answer to the whole comment later.
Flags: needinfo?(honzab.moz)
(In reply to ISHIKAWA, Chiaki from comment #8)
> And it seems that the |Seek()| on line 673 does not seem to be NOOP all the
> time.
> During testing, I think I have seen |Seek()| moving the write position to a
> place where |Write()| on 675 has moved it previously, meaning that someone
> else moved the Write position (!?). Hmm...

Or you just didn't handle the error result correctly?

> 
> Oh, line 675 and 676 are the classic examples of the failed coding which you
> correctly
> pointed out should not be used.
> Instead of doing the loop to call |Write| many times if |Write| failed
> to write all the data in one go, the above code tries to call Write once and
> fails
> if the whole data is not written in one go.
> TB is full of such code, and I am trying to fix them [1][2] at least for
> pop3 message download and imap as I notice the incorrect code one by one 
> (after exposing the buggy code
> and first report the error by  warning by NS_WARNING/NS_ERROR. 
> Finding the proper error recovery method
> is not so straightforward. At least by marking troublesome spots, I hope
> more people work on these serious issues and come up with proper error
> recovery.)
> 
> [1] Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close,
> unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.
> [2] Bug 1122698 - C-C Thunderbird - Cleaning of incorrect Close, unchecked
> Flush, Write etc. in nsPop3Sink.cpp and friends. (Step-1) 
> 
> 
> > Re: close only on full release:  I'm absolutely against putting it into this
> > class!!  Such logic doesn't belong there, if anywhere.  If you want it, have
> > a wrapper over the buffered stream. 
> 
> I am not sure which part of my patch/suggestion you are referring to.

You wanted to leave the stream open regardless of any of the consumers called Close() on it.  Only after all consumers release the stream, do actual close.  That is unacceptable.

> I will study the comment on the patch and see if I can make sense of the
> comment above.
> 
> > Also, more then a single consumer
> > working on multiple threads is (again) weird architecture...

yup

> 
> Maybe multi-threading may not be an apt description.
> TB's design seems to let a callback that happens asynchronously to move a
> Write position or something :-(
> I have yet to figure out the details which code is moving the write position 
> behind the main-loop of pop3 downloading (and writing the message one line
> at at time).

If you are using Read/WriteSegments check the callback function.  The move might happen there via some non-obvious code path?

> 
> BTW, the TB code performs |Close| on already |Close|d streams many times and
> so it adds to confusion, too. At least I have been able to reduce bogus
> |Close| calls so far in [1][2] and newer local patches I am testing.

Hmm.. some kind of even deeper rewrite might be wise?

> 
> (In reply to Honza Bambas (:mayhemer) from comment #6)
> > Also, the interface doesn't say anything about thread-safety.  If you want a
> > t-s version of buffered streams, you may reimplement it and use a different
> > contractid.
> 
> Yes, this occurred to me.
> If TB code really seem to require such M-T safety, and I may not be able
> to solve the subtle issues when buffering is enabled [message corruption was
> observed before according to old bugzill entries],
> I will create a wrapper class over the existing buffered stream (by adding a
> lock to stream structure) and see if such a class can solve the corruption
> issue.

According the seek() you may use thread private data.  In your Seek() just set the value to your private record and in your Write(), atomically, call underlying Seek(my_stored_private_seek_position) and Write() on the real stream.

But when there is a Seek() on thread 1 and then Write() on thread 2 that expects the position be correct (which would be pretty weird arch too...) you are screwed.

> Using buffered output in message downloading seems to be really necessary
> for solving poor I/O performance for certain
> users who need to use remote CIFS-mount for their Mail storage directory.
> (I have verified that using CIFS-mount under linux really slows down the
> writing. No coalescing of local |write| is attemped, it seems. NFSv2 and
> NFSv3 daemons seem to allow such coalescing and I did not see the peformance
> degradation in contrast to CIFS-mount.)

I use a microSD card, a slow one, to emulate slow FS.  Might be simpler ;)

> 
> Thank you again for your comments about general approach.
> 
> I will study your comment for the  patch I uploaded, and
> will come up with the better patch after digesting your specific comments.

Good luck!  It's great what you do!
Flags: needinfo?(honzab.moz)
this appears abandoned, please reopen if not so
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Hi,

I am re-opening this bugzilla.

I am no longer interested in the LOCK/thread-safety issue.
I want to make the error reporting better.

The patch that addressed earlier comment has been in use on local PC for like a year or so.
(Finally, the I/O issue I mentioned ealier 
in C-C TB has begun slow review process, and so I want this low-level function to return error code early.)

Here are some comments to an earlier review comment.
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Comment on attachment 8546973 [details] [diff] [review]
> Work-in-progress: better error reporting in a couple of places.
> 
> Review of attachment 8546973 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/base/src/nsBufferedStreams.cpp
> @@ +171,5 @@
> >      METER(bufstats.mBufferReadUponSeek += mCursor);
> >      METER(bufstats.mBufferUnreadUponSeek += mFillPoint - mCursor);
> >      rv = Flush();
> > +#if DEBUG
> > +    NS_WARNING("Flush() is called WITHIN Seek().");
> 
> and what negative effect this should have?
> 

I removed this message.

> @@ +628,5 @@
> >              if (NS_FAILED(rv)) break;
> >          }
> >      }
> >      *result = written;
> > +    return (written > 0) && NS_SUCCEEDED(rv) ? NS_OK : rv;
> 
> hmm.. no, this is not correct.  if you have written something the result
> must be OK.  the consumer has to recognize only part of the data has been
> written by comparing the number of bytes it wants to write and the number of
> bytes actually written.  In reality the code is usually:
> 
> while (toWrite> 0)
>   rv = stream.Write(buffer + cursor, toWrite, &written);
>   if (FAILED(rv)) -> "ERROR!!"; return;
>   toWrite -= written;
>   cursor += written;
> }
> 
> As you can see, if not all is written, you will loop again.  When the stream
> is in a failed state, you will discover that on the (say) second call to
> Write().  So you will know there is an error.
> 
> Note: stream is not obligated to write everything.  It's up to consumer to
> handle that.
> 

Thank you for the explanation.
(C-C TB is full of problematic loop of read/write not checking the # of  really read/written octets and thus short-read or short-write is a real issue there.
But then I think M-C FF also suffers from similar issues, but I digress.)

So the return value is fixed according to your explanation.

> @@ +641,5 @@
> >          // Stream already cancelled/flushed; probably because of previous error.
> >          return NS_OK;
> >      }
> > +    // optimize : some code within C-C needs to call Seek -> Flush() often.
> > +    if(mFillPoint == 0) {
> 
> if (

Fixed.

> 
> @@ +643,5 @@
> >      }
> > +    // optimize : some code within C-C needs to call Seek -> Flush() often.
> > +    if(mFillPoint == 0) {
> > +#if DEBUG
> > +        NS_WARNING("Flush: 0 bytes in the buffer.");
> 
> don't, this is a valid operation!

It is commented out now.
I will remove it completely once the review is done.

> 
> @@ +741,5 @@
> >  
> >          uint32_t read = 0;
> >          rv = reader(this, closure, mBuffer + mCursor, *_retval, left, &read);
> >  
> > +        if (NS_FAILED(rv)) // If we have read some data, return ok
> 
> leave this comment as is

This is in a read function, i.e.,
@@ -700,17 +805,17 @@ nsBufferedOutputStream::WriteSegments(ns
                 return (*_retval > 0) ? NS_OK : rv;
 
             continue;
         }
 
         uint32_t read = 0;
         rv = reader(this, closure, mBuffer + mCursor, *_retval, left, &read);
 
-        if (NS_FAILED(rv)) // If we have written some data, return ok
+        if (NS_FAILED(rv)) // If we have read some data, return ok
             return (*_retval > 0) ? NS_OK : rv;

So my checking the comment to use "read" instead of "written" correct?

TIA
Assignee: nobody → ishikawa
Attachment #8546973 - Attachment is obsolete: true
Status: RESOLVED → REOPENED
Attachment #8798598 - Flags: review?(honzab.moz)
Resolution: INCOMPLETE → ---
Summary: RFC mozilla/netwerk/base/src/nsBufferedStreams.cpp: better error reporting and maybe adding thread-race lock → RFC mozilla/netwerk/base/src/nsBufferedStreams.cpp: better error reporting [forget the thread-race lock in the original post]
Status: REOPENED → ASSIGNED
Whiteboard: [necko-active]
Comment on attachment 8798598 [details] [diff] [review]
(updated) better error reporting (forget the LOCK).

Review of attachment 8798598 [details] [diff] [review]:
-----------------------------------------------------------------

This is not "better error reporting".  Simply - don't use stdout/err directly in mozilla code.  We have tools like NS_WARNING (an unexpected input), NS_ERROR (an unexpected non-fatal code path), NS_WARN_IF(cond) warns in conditions when !cond.  All print to screen/debug output/whatever is setup to eat that output.

Please use them or, if you want more info, add logging.  See LazyLogModule class and how it's used around the code base.

Otherwise the changes seem reasonable.  Please push to try before next review round.

::: netwerk/base/nsBufferedStreams.cpp
@@ +132,5 @@
>      // the failure data-dependent.
>      nsresult rv;
>      nsCOMPtr<nsISeekableStream> ras = do_QueryInterface(mStream, &rv);
> +    if (NS_FAILED(rv)) {
> +#if DEBUG

"#ifdef DEBUG" please

@@ +327,5 @@
> +    };
> +#endif
> +
> +    if (NS_FAILED(rv1))
> +        return rv1;

note: when you are here and do these mall changes, please do it as:

if () {
}

(always brace)

@@ +662,5 @@
>      uint32_t written = 0;
> +    *result = 0;
> +    if (!mStream) {
> +        // Flush() returns NS_OK in this case ! See Flush() below.
> +        // We better catch the failure ASAP.

please make this comment more comprehensible.

@@ +711,5 @@
> +#if 0
> +    // this message output causes enough timing skew to hide nsPop3Sini.cpp
> +    // issue when buffering is enabled.
> +        NS_WARNING("Flush: 0 bytes in the buffer.");
> +#endif

remove, even before review please.

but please keep the comment why you are doing this early return.

@@ +744,1 @@
>          NS_WARNING("failed to flush buffered data! possible dataloss");

please put braces around the if body (in all cases which you touch with this patch)

@@ +766,5 @@
> +    // to the existing handling in  nsBufferedOutputStream::Close():
> +    //
> +    // Prioritize the error return from Flush().
> +    // then the error from Sink()->Close() [or mSafeStream->Finish()],
> +    // finally the error from nsBufferedStream::Close()

please try to say WHY you are doing this.
Attachment #8798598 - Flags: review?(honzab.moz) → review-
Thank you for the review.

Here is an updated patch that tries to address your comments.

(In reply to Honza Bambas (:mayhemer) from comment #13)
> Comment on attachment 8798598 [details] [diff] [review]
> (updated) better error reporting (forget the LOCK).
> 
> Review of attachment 8798598 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is not "better error reporting".  Simply - don't use stdout/err
> directly in mozilla code.  We have tools like NS_WARNING (an unexpected
> input), NS_ERROR (an unexpected non-fatal code path), NS_WARN_IF(cond) warns
> in conditions when !cond.  All print to screen/debug output/whatever is
> setup to eat that output.
> 
> Please use them or, if you want more info, add logging.  See LazyLogModule
> class and how it's used around the code base.

I used NS_WARNING().

It seems a little awkward to print the value dumps with NS_WARNING(), but
should I need it in the future, I will concoct a better patch.
For now, I simply dumped the messages without value strings.

Logging seems an over kill at this moment (after the fact that I have already
figured out what was causing the delayed error report in the first place).

I made sure that the extra dumps only occur inside #ifdef DEBUG, #endif and
only when an error occurred.

> 
> Otherwise the changes seem reasonable.  Please push to try before next
> review round.

I mentioned in earlier comment that the patch which the current patch is based
has been in use for C-C TB and try-comm-central job with the current patch is
in
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=bbbc942447253719b6778721dffdbfb96fcb137e

The debug version compiles under linux, windows and OSX.

The patch is listed in about the middle position in
https://hg.mozilla.org/try-comm-central/pushloghtml?changeset=bbbc942447253719b6778721dffdbfb96fcb137e
as
45f0d3fc4298	ISHIKAWA, Chiaki — Bug 1120046 - mozilla/netwerk/base/src/nsBufferedStreams.cpp: earlier and better error reporting

I did not show FF push try because
I am unfamiliar with mozilla-central push, and because of the following issue
I have not attempted it yet.
When I built a debug version of firefox (my patch only gets compiled when
debug version is built) using the latest mozilla-central repository a few hours ago, I succeeded building it locally under Debian GNU/Linux 64-bit, but
the resulting binary crashed with the strange assertions. (I don't think this is related to my patch, but who knows? Maybe I am using a non-standard mozconfig for FF. That is why I said I am unfamiliar with mozilla-central code.
If someone can push this to mozilla-central try server, I would appreicate.)

The crash happens very soon after running FF. The following is from the
log under gdb.
[New Thread 0x7fff8affd700 (LWP 7663)]
[7630] ###!!! ASSERTION: Double-initing a frame?: '!mContent', file /REF-OBJ-DIR/mozilla-central/layout/generic/nsFrame.cpp, line 515
[7630] ###!!! ASSERTION: Double-initing a frame?: '!mContent', file /REF-OBJ-DIR/mozilla-central/layout/generic/nsFrame.cpp, line 515
[7630] ###!!! ASSERTION: Shouldn't happen: '!aNewFrame->GetNextSibling()', file /REF-OBJ-DIR/mozilla-central/layout/base/nsCSSFrameConstructor.cpp, line 1253

Program received signal SIGSEGV, Segmentation fault.
0x00007fffea9f0a73 in nsLayoutUtils::GetLastSibling (aFrame=<optimized out>)
    at /REF-OBJ-DIR/mozilla-central/layout/base/nsLayoutUtils.cpp:1917
1917	  if (!aFrame) {
(gdb) where
#0  0x00007fffea9f0a73 in nsLayoutUtils::GetLastSibling(nsIFrame*) (aFrame=<optimized out>)
    at /REF-OBJ-DIR/mozilla-central/layout/base/nsLayoutUtils.cpp:1917
#1  0x00007fffeaadf003 in nsFrameList::SetFrames(nsIFrame*) (this=<optimized out>, aFrameList=<optimized out>)
    at /REF-OBJ-DIR/mozilla-central/layout/generic/nsFrameList.cpp:68
#2  0x00007fffea98306f in nsFrameItems::AddChild(nsIFrame*) (this=<optimized out>, aChild=<optimized out>)
    at /REF-OBJ-DIR/mozilla-central/layout/base/nsCSSFrameConstructor.cpp:650
#3  0x00007fffea98edf4 in nsFrameConstructorState::AddChild(nsIFrame*, nsFrameItems&, nsIContent*, nsStyleContext*, nsContainerFrame*, bool, bool, bool, bool, nsIFrame*) (this=<optimized out>, aNewFrame=<optimized out>, aFrameItems=..., aContent=<optimized out>, aStyleContext=<optimized out>, aParentFrame=<optimized out>, aCanBePositioned=<optimized out>, aCanBeFloated=<optimized out>, aIsOutOfFlowPopup=<optimized out>, aInsertAfter=<optimized out>, aInsertAfterFrame=<optimized out>) at /REF-OBJ-DIR/mozilla-central/layout/base/nsCSSFrameConstructor.cpp:1304

..

If someone familiar with mozilla-central push job, I would appreciate to see the result. I believe compilation is a non-issue because it compiles fine on
linux, windows, OSX as you can see from the try-comm-central job mentioned above.


Well, as I said earlier, C-C TB can use the patch without any ill-effect (and as a matter of fact, during the simulation of network errors when the local preference and mail directories were on a remotely mounted file system, the early error report made possible my patch was very helpful. Without the patch, write to (bogus) already closed file stream won't return an error until much later.

So while the strange assertion failures and crash issue is verified,
I will explain the current patch based on your earlier comment.



	
> 
> ::: netwerk/base/nsBufferedStreams.cpp
> @@ +132,5 @@
> >      // the failure data-dependent.
> >      nsresult rv;
> >      nsCOMPtr<nsISeekableStream> ras = do_QueryInterface(mStream, &rv);
> > +    if (NS_FAILED(rv)) {
> > +#if DEBUG
> 
> "#ifdef DEBUG" please
> 

Now #ifdef DEBUG is used.

> @@ +327,5 @@
> > +    };
> > +#endif
> > +
> > +    if (NS_FAILED(rv1))
> > +        return rv1;
> 
> note: when you are here and do these mall changes, please do it as:
> 
> if () {
> }
> 
> (always brace)

I may have misread your comment here and below and
replaced all the instances of one line statement following |if (expr)| without
any brace pairs. It may have been an over-reaction.
But frankly speaking, in this file, there are if-statment that is indented
in a funny manner when there is a single statement without {} pair following it, and I think my converting them to use {} pair always makes it easier to follow the logic (especially the error handling issues).
 
> @@ +662,5 @@
> >      uint32_t written = 0;
> > +    *result = 0;
> > +    if (!mStream) {
> > +        // Flush() returns NS_OK in this case ! See Flush() below.
> > +        // We better catch the failure ASAP.
> 
> please make this comment more comprehensible.

I tried to make the message a little more comprehensible.

The point is that error is not issued soon enough when buffered stream is used.
If a program tries to write to (bogus) already closed stream using buffered write, it will not be noticed until, probably Close() time without my patch
since Flush() seemed to return NS_OK.

> 
> @@ +711,5 @@
> > +#if 0
> > +    // this message output causes enough timing skew to hide nsPop3Sini.cpp
> > +    // issue when buffering is enabled.
> > +        NS_WARNING("Flush: 0 bytes in the buffer.");
> > +#endif
> 
> remove, even before review please.

removed.
 
> but please keep the comment why you are doing this early return.

Kept the comment why I wanted to return early.
 
> @@ +744,1 @@
> >          NS_WARNING("failed to flush buffered data! possible dataloss");
> 
> please put braces around the if body (in all cases which you touch with this
> patch)
> 

I did. As I mentioned above, I may have over-reacted by converting all such cases.

> @@ +766,5 @@
> > +    // to the existing handling in  nsBufferedOutputStream::Close():
> > +    //
> > +    // Prioritize the error return from Flush().
> > +    // then the error from Sink()->Close() [or mSafeStream->Finish()],
> > +    // finally the error from nsBufferedStream::Close()
> 
> please try to say WHY you are doing this.

Again, I tried to explain WHY and HOW.

If someone who is familiar with mozilla-central try server, I appreciate it
if you can submit and see if this patch works OK (I am sure it will, but who knows).

I have prepared scripts to prepare the patches for submission to try-comm-central, and used them successfully over the last few years. But then I realize they handles the quirks of try-comm-central job submission and may not work with mozilla-central try server immediately.


TIA
Attachment #8798598 - Attachment is obsolete: true
Attachment #8801096 - Flags: review?(honzab.moz)
Comment on attachment 8801096 [details] [diff] [review]
(updated v2)  early and better error reporting (forget the LOCK).

Review of attachment 8801096 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the patch.

Please fix the commit message only as:
"Bug 1120046 - Improve error reporting in nsBufferedStreams.cpp, r=honzab"

Only few nits to fix.  I will push this for you to try.  If it goes well, we can land it (I'll give r+ to the next version of the patch more quickly).

::: netwerk/base/nsBufferedStreams.cpp
@@ +139,5 @@
>      nsresult rv;
>      nsCOMPtr<nsISeekableStream> ras = do_QueryInterface(mStream, &rv);
> +    if (NS_FAILED(rv)) {
> +#ifdef DEBUG
> +        NS_ERROR("nsIseekableStream ras could not be obtained.");

"mStream doesn't QI to nsISeekableStream"

@@ +269,1 @@
>      return rv;

leave a blank line before return rv;

@@ +269,2 @@
>      return rv;
> +    

remove this blank line

@@ +705,5 @@
> +        // However, Flush() returns NS_OK when mStream is null (!!),
> +        // and we don't get a meaningful error, NS_BASE_STREAM_CLOSED,
> +        // soon enough when we use buffered output.
> +#ifdef DEBUG
> +        NS_WARNING("(info) nsBufferedOutputStream::Write returns NS_BASE_STREAM_CLOSED immediately (mStream=null).");

" (mStream == null)"

@@ +779,5 @@
> +    if (NS_FAILED(rv1)) {
> +        NS_WARNING("(debug) nsBufferedOutputStream::Flush() failed in nsBufferedOutputStream::Finish()! Possible dataloss.");
> +    }
> +
> +    if (NS_SUCCEEDED(rv1)) {

nit: maybe do just } else { here instead

@@ +787,5 @@
> +            NS_WARNING("(debug) mSafeStream->Finish() failed within nsBufferedOutputStream::Flush()! Possible dataloss.");
> +        }
> +#endif
> +    }
> +    else {

today style is } else {

@@ +801,4 @@
>  
> +    // We want to return the errors precisely from Finish()
> +    // and mimick the existing error handling in
> +    // nsBufferedOutputStream::Close() as  reference:

as  reference  (two spaces)

@@ +804,5 @@
> +    // nsBufferedOutputStream::Close() as  reference:
> +    // So we follow the steps there, and
> +    // prioritize the error return from Flush().
> +    // then the error from Sink()->Close() [or mSafeStream->Finish()],
> +    // finally the error from nsBufferedStream::Close()

if you explain this refers to ::Close() then don't repeat what's obvious please. remove all comment lines under "nsBufferedOutputStream::Close() as reference", since it's just overcommenting.

a good comment says "why" you do stuff, not "what" stuff you are doing, I can easily it by looking at the code.
Attachment #8801096 - Flags: review?(honzab.moz) → feedback+
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Comment on attachment 8801096 [details] [diff] [review]
> (updated v2)  early and better error reporting (forget the LOCK).
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=2b5a1d553f87968c1c3922383f3fb36b928b1ba7

tc-Fxfn-r[tier 2](en-US):
TEST-UNEXPECTED-ERROR | test_safe_browsing_notification.py TestSafeBrowsingNotificationBar.test_notification_bar | TimeoutException: Timed out after 30.0 seconds

An intermittent failure is filed for this test (exactly this failure) but with your patch this seems fairly regular (perma failure).

https://bugzilla.mozilla.org/show_bug.cgi?id=1280878
Blocks: 1280878
(In reply to Honza Bambas (:mayhemer) from comment #17)
> An intermittent failure is filed for this test (exactly this failure) but
> with your patch this seems fairly regular (perma failure).
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=1280878

After checking the base changeset I used, it was affected too: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f9f3cc95d728&selectedJob=5394873

You are good to go with the polished version!
No longer blocks: 1280878
Thank you for testing my patch on try server.
I was initially concerned about the possible problem, but it looks it is OK now.
Great!

Here is an updated patch to reflect your comments.

(In reply to Honza Bambas (:mayhemer) from comment #15)
> Comment on attachment 8801096 [details] [diff] [review]
> (updated v2)  early and better error reporting (forget the LOCK).
> 
> Review of attachment 8801096 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for the patch.
> 
> Please fix the commit message only as:
> "Bug 1120046 - Improve error reporting in nsBufferedStreams.cpp, r=honzab"

Done.

> 
> Only few nits to fix.  I will push this for you to try.  If it goes well, we
> can land it (I'll give r+ to the next version of the patch more quickly).

Thank you again for the push.
> 
> ::: netwerk/base/nsBufferedStreams.cpp
> @@ +139,5 @@
> >      nsresult rv;
> >      nsCOMPtr<nsISeekableStream> ras = do_QueryInterface(mStream, &rv);
> > +    if (NS_FAILED(rv)) {
> > +#ifdef DEBUG
> > +        NS_ERROR("nsIseekableStream ras could not be obtained.");
> 
> "mStream doesn't QI to nsISeekableStream"
> 

Done. Changed the message. I am still new (after so many years) to the code base, since I am basically a user of TB and occasional patch submitter,
so using "QI" as a verb didn't occur to me  :-)

> @@ +269,1 @@
> >      return rv;
> 
> leave a blank line before return rv;

Done.

> 
> @@ +269,2 @@
> >      return rv;
> > +    
> 
> remove this blank line

Done.

> 
> @@ +705,5 @@
> > +        // However, Flush() returns NS_OK when mStream is null (!!),
> > +        // and we don't get a meaningful error, NS_BASE_STREAM_CLOSED,
> > +        // soon enough when we use buffered output.
> > +#ifdef DEBUG
> > +        NS_WARNING("(info) nsBufferedOutputStream::Write returns NS_BASE_STREAM_CLOSED immediately (mStream=null).");
> 
> " (mStream == null)"

Done. I was a little hazy, but I only changed "=" to "==".
This message is a rare event in a normal operation and so
the lengthy message should be OK (and it warns the developer very nicely
that the code is calling output routine against an already closed stream!)

 
> @@ +779,5 @@
> > +    if (NS_FAILED(rv1)) {
> > +        NS_WARNING("(debug) nsBufferedOutputStream::Flush() failed in nsBufferedOutputStream::Finish()! Possible dataloss.");
> > +    }
> > +
> > +    if (NS_SUCCEEDED(rv1)) {
> 
> nit: maybe do just } else { here instead

Right.
I tried to merge the following if() with "} else {" and
realized it has an else clause which ought to go into the
first if (NS_FAILED(rv1)) { } and so I modified it accordingly.
Then it hit upon me since the error message printed is so important and can happen at runtime, I should not hide it in #ifdef DEBUG, #endif and so removed it.
The result is in there.

> 
> @@ +787,5 @@
> > +            NS_WARNING("(debug) mSafeStream->Finish() failed within nsBufferedOutputStream::Flush()! Possible dataloss.");
> > +        }
> > +#endif
> > +    }
> > +    else {
> 
> today style is } else {

Fixed.

> 
> @@ +801,4 @@
> >  
> > +    // We want to return the errors precisely from Finish()
> > +    // and mimick the existing error handling in
> > +    // nsBufferedOutputStream::Close() as  reference:
> 
> as  reference  (two spaces)
> 
> @@ +804,5 @@
> > +    // nsBufferedOutputStream::Close() as  reference:
> > +    // So we follow the steps there, and
> > +    // prioritize the error return from Flush().
> > +    // then the error from Sink()->Close() [or mSafeStream->Finish()],
> > +    // finally the error from nsBufferedStream::Close()
> 
> if you explain this refers to ::Close() then don't repeat what's obvious
> please. remove all comment lines under "nsBufferedOutputStream::Close() as
> reference", since it's just overcommenting.
> 
> a good comment says "why" you do stuff, not "what" stuff you are doing, I
> can easily it by looking at the code.

I removed the latter part of the comment.

Now the attached patch is compiled and tested locally.

I wanted to at least make sure that it compiles and tested on try-comm-central, but as of a couple of hours earlier, I think it has some network infrastructure problem and the server even could not run compiler at all!
So you have to take my words for it as far as the compilation is concerned.

TIA
Attachment #8801096 - Attachment is obsolete: true
Attachment #8805314 - Flags: review?(honzab.moz)
Comment on attachment 8805314 [details] [diff] [review]
(updated v3) early and better error reporting (forget the LOCK).

Review of attachment 8805314 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, r=honzab

::: netwerk/base/nsBufferedStreams.cpp
@@ +190,2 @@
>  
> +    // xxx we probably want to handle EINTR-like retry here in the long run.

probably remove this comment or make it more clear.
Attachment #8805314 - Flags: review?(honzab.moz) → review+
Sorry I got tied up with day job (a big conference in December) and could only
follow this today.

(In reply to Honza Bambas (:mayhemer) from comment #20)
> Comment on attachment 8805314 [details] [diff] [review]
> (updated v3) early and better error reporting (forget the LOCK).
> 
> Review of attachment 8805314 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks, r=honzab
> 
> ::: netwerk/base/nsBufferedStreams.cpp
> @@ +190,2 @@
> >  
> > +    // xxx we probably want to handle EINTR-like retry here in the long run.
> 
> probably remove this comment or make it more clear.

I removed the comment line.

The patch (before the comment line was deleted) has been tested on try-comm-central over the last few weeks with many other patches. The recent submission was https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=09b6902e67d88644a59d035efec29d2c5b193aa5

Well, I wanted to show another tryserver run with the comment line removed, but
somehow when I refreshed the source tree, I have begun to see gcc-related errors about  "ISO C++ forbids in-class initialization of non-const static member '__gnu_cxx::__numeric_traits_integer<_Value>::__is_signed'" and many other ISO C standard violations, etc. So I can't show the truely the latest try-comm-central build. But the last should do.

TIA
Attachment #8831045 - Flags: review+
Attachment #8805314 - Attachment is obsolete: true
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7894b49435a1
Improve error reporting in nsBufferedStreams.cpp, r=honzab
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/7894b49435a1
Status: ASSIGNED → RESOLVED
Closed: 8 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: