Last Comment Bug 769346 - Poor performance of Move, Copy and Delete to Local Folders due to "one write request per a line of message data". Writing needs buffering.
: Poor performance of Move, Copy and Delete to Local Folders due to "one write ...
Status: REOPENED
[regression:TB??][patchlove]
: perf, regression, regressionwindow-wanted
Product: MailNews Core
Classification: Components
Component: Backend (show other bugs)
: 13
: x86 Windows 7
: -- major with 4 votes (vote)
: Thunderbird 17.0
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on: 1116055 815012
Blocks: 558528 tb-enterprise 693659
  Show dependency treegraph
 
Reported: 2012-06-28 10:30 PDT by Doug Engert
Modified: 2016-08-02 19:39 PDT (History)
15 users (show)
vseerror: in‑testsuite?
vseerror: in‑moztrap+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Diff file of change to mailnews/local/src/nsLocalMailFolder.cpp from version 13.0.1 (892 bytes, patch)
2012-06-28 10:30 PDT, Doug Engert
no flags Details | Diff | Splinter Review
mercurial diff (1.03 KB, patch)
2012-07-02 16:21 PDT, David :Bienvenu
mozilla: review+
Details | Diff | Splinter Review

Description Doug Engert 2012-06-28 10:30:13 PDT
Created attachment 637578 [details] [diff] [review]
Diff file of change to mailnews/local/src/nsLocalMailFolder.cpp  from version 13.0.1

User Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0; .NET4.0C; .NET4.0E)

Steps to reproduce:

A Move, Copy or Delete of a large message can take up to 60 seconds, especially when mail folders are on a network files system, such as DFS, NFS or AFS and can cause "Not Responding" messages with Thunderbird.


Actual results:

Mozilla has been addressed over the years, yet the problems continue to reappear. Bug #539389 for example. But yet in TB 12.0, 13.0 and 13.0.1 the performance problem still exists. 

Using NSPR_LOG_FILE and NSPR_LOG_MODULES to  trace the Read and Write operations shows the many small writes of 74 or 78 bytes are done, during Copy, Move or Delete operations, (as the Delete
is really a copy to the Trash folder). 


Expected results:

A proposed patch is to use the NS_BufferOutputStream within the
nsLocalMailFolder.cpp with an 8K buffer. Without this change a copy
(or delete) of a 3.8MB file can take as much as 60 seconds, with this
change is is less then 2 seconds. The number of Writes is reduced 
by a factor of 110, (8192/74) and the number of Write goes from 
around 54,000 to around 490 for the test file.
Comment 1 WADA 2012-06-28 14:23:24 PDT
Problem of "One Write request per a line of message data" was fixed by Bug 539389 in Tb 3.1, but the "One Write request per a line of message data" was observed in Tb 13.0.1 for "Copy mail to local Drafts" and for "Save as draft to local Drafts" by Process Monitor on MS Win.
Regression of Bug 539389? 

Bug 558528 is for larger buffer(larger than 4KB) mainly in temp file writing
(...\Temp\nsemail.eml, ...\Temp\nscopy.tmp upon draft save. See bug 589798 for temp for attachment case).
Will your patch resolve issue of Bug 558528 too?
Comment 2 Doug Engert 2012-06-28 15:30:41 PDT
The patch I submitted was for the routines used by the CopyService in mailnews/base/src which appears to be new and was not doing any buffering.
The patch would only effect writing to local mail folders. Other output destinations have thei own routines called by the CopyService.  

Bug 539389 attempted to fix the problems with compress and a number of other copy operations and preceeded the CopyService. The CopyService appears to queue up copy requests, and it may have replaced some of the code the Bug 539389 patched. 

The best I can tell, Bug 558528 should be fixed as sgNewBugfferedFileOutputStream is called in a lot of places in mailnews. But if the CopyService is now being used, to copy files to \Temp, then it may not be fixed. The same simple change could be made to the file that sets up the output for a file. I have not figured which file that is as that was not a problem I was seeing. Ther may be many other places buffering should be being used with the CopyService.
Comment 3 David Rajchenbach-Teller [:Yoric] (please use "needinfo") 2012-06-29 13:57:33 PDT
I would like to understand: is this a full file copy? Or a copy of bytes inside a file?
Comment 4 Doug Engert 2012-06-29 14:07:29 PDT
Its coping a message from one local folder to another appending to the output folder that is really a file. 
The CopyService sets up the input and output files and buffers the input, but not the output. The Listener reads lines and writes lines. Many attachments have lines that are 74 bytes long. 

With a debug version you can use the NSPR_LOG_DEBUG opetions to see all of this,
or on Windows use the sysinternals process monitor to look at the I/O events.
Comment 5 David Rajchenbach-Teller [:Yoric] (please use "needinfo") 2012-07-02 02:00:53 PDT
Ok, in that case, OS.File is not going to be very helpful for the moment.
Comment 6 David :Bienvenu 2012-07-02 16:21:55 PDT
Created attachment 638530 [details] [diff] [review]
mercurial diff

I've heard that 16K might be a better buffer size...but I'll give a quick run with this patch and make sure it doesn't break anything.
Comment 7 David :Bienvenu 2012-07-02 16:57:18 PDT
Comment on attachment 638530 [details] [diff] [review]
mercurial diff

thx for the patch, Doug, this sounds like something reasonable to try.
Comment 8 Doug Engert 2012-07-02 21:26:16 PDT
My only concern is if the Copy Service is called with a listener that for some reason can't use buffering correctly. It is not clear from the code what are all the listeners. Someone whith a vbeter understanding of the code needs to answer this. (I can not look at anything till next week.)
Comment 9 David :Bienvenu 2012-07-05 08:17:06 PDT
(In reply to Doug Engert from comment #8)
> My only concern is if the Copy Service is called with a listener that for
> some reason can't use buffering correctly. It is not clear from the code
> what are all the listeners. Someone whith a vbeter understanding of the code
> needs to answer this. (I can not look at anything till next week.)

CopyListeners just get told about the progress of the copy, not the actual data involved, so they shouldn't care about buffering.
Comment 10 Doug Engert 2012-07-16 07:30:41 PDT
(In reply to David :Bienvenu from comment #6)
> Created attachment 638530 [details] [diff] [review]
> mercurial diff
> 
> I've heard that 16K might be a better buffer size...but I'll give a quick
> run with this patch and make sure it doesn't break anything.

16K also works. So you make the call. Either is better then no buffering. 

So what is the next step in this process? 
More reviews?
Who can set the "wanted-thunderbird" flag?
How does this change get included in nightly builds? 

Thanks.
Comment 11 David :Bienvenu 2012-07-16 08:26:57 PDT
The two possible next steps are either for you to put up a version of the patch with 16K as the buffer size, and I can give it a quick review, or we can just go with the original patch, in which case you'd add the checkin-needed keyword in the keyword field of this bug, and someone will land the fix.
Comment 12 Doug Engert 2012-07-16 09:04:46 PDT
8K patch looks OK, adding checkin-needed keyword.
Comment 13 Ryan VanderMeulen [:RyanVM] 2012-07-16 16:46:39 PDT
https://hg.mozilla.org/comm-central/rev/a9f658106adb

Thanks for the patch, Doug! To make life easier for those checking in on your behalf, please follow the directions below for any future patches you submit. It really helps. Thanks again!
https://developer.mozilla.org/en/Creating_a_patch_that_can_be_checked_in
Comment 14 Jim Porter (:squib) 2012-07-31 20:40:44 PDT
Should this be resolved?
Comment 15 Mark Banner (:standard8) 2012-08-01 02:56:43 PDT
Yes, I see no reason why not.
Comment 16 Mark Banner (:standard8) 2012-12-24 15:31:58 PST
Unfortunately we've had to back this out due to its causing bug 815012.

The short summary is that it seems like either using the buffered stream, or possibly replacing the reference to the normal stream with the buffered stream version is capable of causing corruption on the local mail storage.
Comment 17 Wayne Mery (:wsmwk, NI for questions) 2013-01-08 08:40:04 PST
Followup to comment 16 - as a cause of critical dataloss, the next incarnation of fix needs a test either here, or in a bug that fixes the IO methods being called, or better yet(?) in both places.  We shouldn't be shipping these problems :)
Comment 18 Ronald Killmer 2013-01-08 14:08:57 PST
Been thinking about why this bug caused bug 815012.

A buffered I/O seems like a good performance improvement, as long as the patch includes a safety check that all message copy/moves in a serial sequence are committed to file I/O. That the last buffered segment of Msg1 was transfered. I am assuming that Msg1 has some form of end of message flag that can be tested for.

I am guessing that if the output buffer has not finished its' final I/O that the end of message flag will not be present in the destination Msg store. That the process for Msg2 is beginning and the code is disposing of the initial buffer for Msg1, thus loosing a needed flag so Msg2 may become appended to to Msg1, not as a new message in the destination file.

I am tossing this out as food for thought.
Comment 19 :aceman 2013-01-08 23:50:00 PST
Is nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream called repeatedly over the lifetime of mCopyState->m_fileStream ? If yes, is it OK to buffer an already buffered stream? Shouldn't it be flushed before? Or buffer it just once at m_fileStream creation time?
Comment 20 Doug Engert 2013-01-15 14:13:54 PST
Based on comments in Bug 815012, it appears that the data is written but the offsets
may not be correct. I am not much of a C++ programmmer, but it looks like in
nsLocalMailFolder.cpp
WriteStartOfNewMessage() {
   nsCOMPtr <nsISeekableStream> seekableStream = do_QueryInterface(mCopyState->m_fileStream);  int64_t filePos;
   seekableStream->Tell(&filePos);

Should this really be
 mCopyState->m_fileStream->Tell(&filePos);


i.e. the  nsIBufferedOutputStream::Tell will take account of the mCursor
into the buffer, where the nsISeekableStream will not?
Comment 21 Wayne Mery (:wsmwk, NI for questions) 2013-03-04 21:21:55 PST
(In reply to Doug Engert from comment #20)
> ...
> Should this really be
>  mCopyState->m_fileStream->Tell(&filePos);
> 
> i.e. the  nsIBufferedOutputStream::Tell will take account of the mCursor
> into the buffer, where the nsISeekableStream will not?
Comment 22 :Irving Reid (No longer working on Firefox) 2013-03-05 07:50:09 PST
I spent some time looking into this, and the lines Doug referred to seem to be OK. The do_QueryInterface() call is the XPCOM equivalent of a type cast; we're still calling Tell() on the same stream, just through a different reference type. I couldn't see any problems with the implementation of Tell() in nsIBufferedOutputStream; it looks to me like it correctly takes into account the buffering.

I've also tried running my mail client with some extra assertions to detect cases where we try to buffer an already buffered stream, re-use buffered streams, and a few other quirks like that; so far I haven't seen anything.

I'll try a couple of other tests today.
Comment 23 Wayne Mery (:wsmwk, NI for questions) 2013-06-24 18:26:14 PDT
Doug, if Irving has nothing to add from tests, can we do a try with your updated patch?
Comment 24 Doug Engert 2013-07-08 06:52:12 PDT
The problem may not be in the nsIBubberedOutPutStream, but in the newer code added
to support imap files. All the reports of problems with this patch came from trying
to more/copy between mail box type files (where there are multiple messages per file) 
to/from imap type files where there is (one message per file.) 

The c++ code gets very confusing but I expect that some imap code is assuming 
that the offset of the start of a message in the output file is always 0.
(Or it does not do a flush and thus the buffering offset in the file is wrong.)
 i.e. the imap file routine is assuming the output file is an imap type file,
but it could be a mbk type file. The index files then get the wrong offsets too. 

I have not had much time to look at this.
Comment 25 :aceman 2013-07-15 15:46:58 PDT
Doug, the nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream seems to have changed since you made your patch. Do you still see the slowness?
Can you try adding a buffered outputstream somewhere after http://mxr.mozilla.org/comm-central/source/mailnews/local/src/nsMsgBrkMBoxStore.cpp#649 ?
Comment 26 Wayne Mery (:wsmwk, NI for questions) 2014-04-01 05:57:11 PDT
(In reply to :Irving Reid from comment #22)
> I spent some time looking into this, and the lines Doug referred to seem to
> be OK. The do_QueryInterface() call is the XPCOM equivalent of a type cast;
> we're still calling Tell() on the same stream, just through a different
> reference type. I couldn't see any problems with the implementation of
> Tell() in nsIBufferedOutputStream; it looks to me like it correctly takes
> into account the buffering.
> 
> I've also tried running my mail client with some extra assertions to detect
> cases where we try to buffer an already buffered stream, re-use buffered
> streams, and a few other quirks like that; so far I haven't seen anything.
> 
> I'll try a couple of other tests today.

Irving did you have good results from testing?
Comment 27 :Irving Reid (No longer working on Firefox) 2014-04-01 08:49:45 PDT
I didn't spot any reasons for the failures we've observed, during the time I had available to investigate.
Comment 28 Wayne Mery (:wsmwk, NI for questions) 2014-07-19 12:17:05 PDT
(In reply to :aceman from comment #25)
> Doug, the nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream seems to have
> changed since you made your patch. Do you still see the slowness?
> Can you try adding a buffered outputstream somewhere after
> http://mxr.mozilla.org/comm-central/source/mailnews/local/src/
> nsMsgBrkMBoxStore.cpp#649 ?

Doug has retired from ANL and no longer had the problem environment, nor a build system. So we need a new developer here.


A piece of our exchange....

On 4/1/2014 3:40 PM, Douglas E Engert wrote:> 
>>   Is there anyone at anl that might be able to follow in your 
>> footsteps for https://bugzilla.mozilla.org/show_bug.cgi?id=769346
> 
> No.  I believe that the problem was that the AFS cache manager combined 
> with  Windows could not handle so many small writes, and would have a 
> time out of some sort.
> 
> When I got a newer Windows 7 machine, it did not have the problem. So 
> this could have been a one time problem.
> 
> The number of very small writes by Thunderbird should be buffered (some 
> cases buffering is used). But the attempt to use the buffering in all 
> cases was never solved, If it could have been solved, it would be 
> performance improvements for all implementations.
> 
> The timeout may have been an isolated incident, and not reproducible, so 
> the problem became a low priority problem for me.
Comment 29 ISHIKAWA, Chiaki 2014-07-19 17:46:19 PDT
(In reply to Wayne Mery (:wsmwk) from comment #28)
> (In reply to :aceman from comment #25)
> > Doug, the nsMsgLocalMailFolder::InitCopyMsgHdrAndFileStream seems to have
> > changed since you made your patch. Do you still see the slowness?
> > Can you try adding a buffered outputstream somewhere after
> > http://mxr.mozilla.org/comm-central/source/mailnews/local/src/
> > nsMsgBrkMBoxStore.cpp#649 ?
> 
> Doug has retired from ANL and no longer had the problem environment, nor a
> build system. So we need a new developer here.
> 
> 
> A piece of our exchange....
> 
> On 4/1/2014 3:40 PM, Douglas E Engert wrote:> 
> >>   Is there anyone at anl that might be able to follow in your 
> >> footsteps for https://bugzilla.mozilla.org/show_bug.cgi?id=769346
> > 
> > No.  I believe that the problem was that the AFS cache manager combined 
> > with  Windows could not handle so many small writes, and would have a 
> > time out of some sort.
> > 
> > When I got a newer Windows 7 machine, it did not have the problem. So 
> > this could have been a one time problem.
> > 
> > The number of very small writes by Thunderbird should be buffered (some 
> > cases buffering is used). But the attempt to use the buffering in all 
> > cases was never solved, If it could have been solved, it would be 
> > performance improvements for all implementations.
> > 
> > The timeout may have been an isolated incident, and not reproducible, so 
> > the problem became a low priority problem for me.

I recall fixing a bug in I/O routine that is used by imap.
But it probably is not related to the corruption problem.

I am interested in doing performance improvement for thunderbird.
On a PC with SCSI-disk, I can literally hear the strange sound coming from the disk for each mail download meaning that there are strangely large amount of I/O.
I noticed that fdatasync() is not used and fsync() is inside SQLITE3 by default.
Failure to use fdatasync() by default is one reason for the strange noise, but
I suspect we can speed up the I/O more.
Someone with a datastore on remote CIFS mount also had an issue
with the small chunk: it hurts when less than 80 octets is sent repeatedly to a remote server.

TIA
Comment 30 WADA 2015-03-15 21:45:13 PDT
Patch creation looks to be in progress by bug 1116055. Setting dependency.
Comment 31 ISHIKAWA, Chiaki 2015-07-11 14:08:18 PDT
(In reply to WADA from comment #30)
> Patch creation looks to be in progress by bug 1116055. Setting dependency.

Thank you, WADA san, for setting the dependency flag.

In Bug 1116055, I noticed an issue about
creating a buffered stream resets file seek position to 0 and
we have to re-position the seek position to whatever we wanted it to be.
That problem is taken care of in bug 1116055.

Re-reading this bugzilla almost one year later, I realize the corruption caused by the early attemp to fix the performance issue was in imap-related files.
I will investigate a little more if my fix in bug 1116055 and friends
does not have any issues with imap-related files.

TIA
Comment 32 Ludovic Hirlimann [:Usul] 2015-09-25 05:46:50 PDT
Removing myslef on all the bugs I'm cced on. Please NI me if you need something on MailNews Core bugs from me.
Comment 33 ISHIKAWA, Chiaki 2015-12-26 15:40:32 PST
> Re-reading this bugzilla almost one year later, I realize the corruption
> caused by the early attemp to fix the performance issue was in imap-related
> files.
> I will investigate a little more if my fix in bug 1116055 and friends
> does not have any issues with imap-related files.

I am checking the my patches for 1116055 on tryserver and found windows version caused
test failures for maildir storage types (but ONLY WINDOWS). I will investigate more, but
I suspect the failures to test the return values of low-level I/O routines, etc.  
(Also, I suspect that there are some unclear interactions in
move/copy as noticed by rkent in  bug 856519.)
Oh well, I will report more in 1116055 as the error symptoms become clear.

TIA
Comment 34 ISHIKAWA, Chiaki 2016-08-02 19:39:35 PDT
(In reply to ISHIKAWA, Chiaki from comment #33)
> > Re-reading this bugzilla almost one year later, I realize the corruption
> > caused by the early attemp to fix the performance issue was in imap-related
> > files.
> > I will investigate a little more if my fix in bug 1116055 and friends
> > does not have any issues with imap-related files.
> 
> I am checking the my patches for 1116055 on tryserver and found windows
> version caused
> test failures for maildir storage types (but ONLY WINDOWS). I will
> investigate more, but
> I suspect the failures to test the return values of low-level I/O routines,
> etc.  
> (Also, I suspect that there are some unclear interactions in
> move/copy as noticed by rkent in  bug 856519.)
> Oh well, I will report more in 1116055 as the error symptoms become clear.
> 
> TIA

I forgot to post to this bugzilla entry.

Earlier this year (1Q-2Q), I figured out for Windows version that uses Maildir, the timing of closing a stream needs to come a little earlier since renaming/moving an opened file is allowed under POSIX (linux, solaris, Mac OSX, etc) while it is not under Windows.
After fixing my patch, the |make mozmill| and xpcshell-tests ran fine (!), that is, it did not introduce any more new errors.
So the issue  I noticed with my earlier patch has been resolved.

Bug 856519 mentions something about " "Multiple mail Copy/Move from AnyStore/IMAP/Offline-Use=Off to MaildirStore/pop3_or_none" moves only "Target\tmp\nnn of last mail" to cur\nnn, so produces empty DIRECTORY in Target\cur except for last mail, and if Move, mails are deleted from IMAP folder".

The files with the pathname that ends in "target\tmp\nnn" and "cur\nnn" were affected by a bug in my patch under Windows (and only in the case of Maildir), and so I suspect that bug 856519 may have a similar issue under the hood although it seems to have been already fixed there.

Note You need to log in before you can comment on or make changes to this bug.