Last Comment Bug 518678 - archive crash [@ nsMsgLocalMailFolder::EndMessage(unsigned int)]
: archive crash [@ nsMsgLocalMailFolder::EndMessage(unsigned int)]
: crash, fixed-seamonkey2.0.1
Product: MailNews Core
Classification: Components
Component: Backend (show other bugs)
: 1.9.1 Branch
: x86 Mac OS X
-- critical (vote)
: Thunderbird 3.0rc1
Assigned To: David :Bienvenu
Depends on:
Blocks: TB2SM
  Show dependency treegraph
Reported: 2009-09-24 14:42 PDT by Wayne Mery (:wsmwk, NI for questions)
Modified: 2012-07-04 13:18 PDT (History)
10 users (show)
dmose: wanted‑thunderbird3+
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---

fix archive batching - checked in. (5.20 KB, patch)
2009-10-19 16:58 PDT, David :Bienvenu
davida: review+
Details | Diff | Splinter Review
fix crash - checked in. (817 bytes, patch)
2009-10-19 17:00 PDT, David :Bienvenu
standard8: review+
standard8: superreview+
standard8: approval‑thunderbird3+
Details | Diff | Splinter Review
fix for copy chaining in copycompleted notification (3.20 KB, patch)
2009-10-22 12:40 PDT, David :Bienvenu
rkent: review+
standard8: superreview+
Details | Diff | Splinter Review
unit test that asserts w/o this fix (4.73 KB, patch)
2009-10-22 15:44 PDT, David :Bienvenu
rkent: review+
Details | Diff | Splinter Review

Description User image Wayne Mery (:wsmwk, NI for questions) 2009-09-24 14:42:44 PDT
archive crash [@ nsMsgLocalMailFolder::EndMessage(unsigned int)]

10 crashes but likely a single person. how do we know? because they ALL have comments (reporter must be "the good samaratin of crashes", aka TGSC)

0	thunderbird-bin	nsMsgLocalMailFolder::EndMessage	 nsAutoPtr.h:968
1	thunderbird-bin	nsCopyMessageStreamListener::EndMessage	nsCopyMessageStreamListener.cpp:117
2	thunderbird-bin	nsImapMailFolder::EndMessage	nsImapMailFolder.cpp:5161
3	libxpcom_core.dylib	NS_InvokeByIndex_P	xpcom/reflect/xptcall/src/md/unix/xptcinvoke_unixish_x86.cpp:179
4	libxpcom_core.dylib	nsProxyObjectCallInfo::Run	xpcom/proxy/src/nsProxyEvent.cpp:181
5	libxpcom_core.dylib	nsThread::ProcessNextEvent	xpcom/threads/nsThread.cpp:510
6	libxpcom_core.dylib	NS_ProcessPendingEvents_P	nsThreadUtils.cpp:180
7	thunderbird-bin	nsBaseAppShell::NativeEventCallback	widget/src/xpwidgets/nsBaseAppShell.cpp:121
8	thunderbird-bin	nsAppShell::ProcessGeckoEvents	widget/src/cocoa/
9	CoreFoundation	CFRunLoopRunSpecific	
10	CoreFoundation	CFRunLoopRunInMode	
11	HIToolbox	RunCurrentEventLoopInMode	
12	HIToolbox	ReceiveNextEventCommon	
13	HIToolbox	BlockUntilNextEventMatchingListInMode	
14	AppKit	_DPSNextEvent	
15	AppKit	-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]	
16	thunderbird-bin	nsAppShell::ProcessNextNativeEvent	widget/src/cocoa/
17	thunderbird-bin	nsBaseAppShell::DoProcessNextNativeEvent	widget/src/xpwidgets/nsBaseAppShell.cpp:151
18	thunderbird-bin	nsBaseAppShell::OnProcessNextEvent	widget/src/xpwidgets/nsBaseAppShell.cpp:278
Comment 1 User image Wayne Mery (:wsmwk, NI for questions) 2009-10-05 16:08:08 PDT
Mac-only still, 3.0b2, 3.0b3 and 3.0b4

interesting comments are
selected all messages in a IMAP folder containing about 1400 and hit 'A' to archive all items. It seems to have worked for a while and some time later a modal message said that "another task is running on this folder, wait till it's done" or something like that. When I clicked OK, TB crashed.
switch to a folder (and it is indexing, downloading from the imap server, etc) and I select all and archive right away before it is done doing all of its updates.
Comment 2 User image Dawn Endico 2009-10-10 14:17:10 PDT
I'm not the original good Samaratin but I had this problem too. I'm also on a Mac (Leopard) using 3.0b4.

1. Tried out the new Archive button on my imap account and noticed that it created a new Archive folder on the imap server.

2. Changed the setting in Thunderbird to put the archive in Local Folders.

3. Switch to a local folder containing lots of old mail going back to 2004.

4. Select all and press Archive.

5. Got an alert dialog saying "Operation Failed because another operation is using the folder. Please wait for that operation to finish and then try again".

6. After dismissing the dialog I got another one just like it. I didn't count, but I think I probably got 5 dialogs in all from the original Archive operation. No crashes though, and some of the messages were archived.

7. Tried this again a few times with lots of failures but no crashes. Eventually it crashed but I'm not sure what was different that time. Maybe because by then I had switched to archiving messages in my IMAP inbox (I started out archiving a local folder).

The dialogs only happened when I tried archiving a lot of messages at once. Everything was ok with just a screenful of messages. I think this probably happens when the selection contains messages from multiple years, since in that case not all the messages are being copied to the same archive folder.
Comment 3 User image Dawn Endico 2009-10-10 14:24:51 PDT
I just tried reproducing the problem by selecting messages from my imap inbox from 2008 and 2009 and it worked just fine. No crashes, no dialogs. Apparently you need to archive a lot of messages at once.
Comment 4 User image Dawn Endico 2009-10-10 14:41:15 PDT
nominating for Thunderbird 3
Comment 5 User image Wayne Mery (:wsmwk, NI for questions) 2009-10-10 15:18:58 PDT
#28 crasher for 3.0b4, but my guess would be it will not be that low when we actually release 3.0
Comment 6 User image Dan Mosedale (:dmose) 2009-10-13 09:21:04 PDT
With the current numbers, leaving as wanted-.  bienvenu will try and look at anyway.  If we get worse numbers for RC1, renomination could be worthwhile.
Comment 7 User image Dawn Endico 2009-10-13 10:44:17 PDT
I'm confident that bienvenu will be able to reproduce this and if he can't, I have a backup of the mail folder that caused the problem.
Comment 8 User image Zandr Milewski [:zandr] 2009-10-17 00:18:56 PDT
I'm seeing a crash that looks a lot like this. Select about 4-5k messages and hit 'a' Spinning Pizza for a bit, then the Apple crash report dialog. I don't seem to be generating Mozilla crash reports, though.

Totally reproducible, so how can I generate useful info here?
Comment 9 User image Wayne Mery (:wsmwk, NI for questions) 2009-10-17 01:30:37 PDT
(In reply to comment #8)
> Totally reproducible, so how can I generate useful info here?

i think in your case we just wait for a patch to test
Comment 10 User image David :Bienvenu 2009-10-19 11:13:53 PDT
ah, archiving from an imap folder to a local folder - interesting. I did a large archive and I noticed that the batching code was behaving unexpectedly, in that it did multiple copies to my 2009 folder, instead of just one. That could partly account for the problem, though the copy service is really supposed to prevent contention over the dest folder.
Comment 11 User image David :Bienvenu 2009-10-19 13:33:06 PDT
There are two things going on here:

1. archive still does batches on a per-month basis even if the archive granularity is per-year (that's my fault). This means we'll generate up to 12 simultaneous copies to the same folder.

2. the copy service is supposed to serialize these requests but it isn't quite working in this case. It may be that there's something about imap to local copies, or it may be something more fundamental.

I'm going to fix #1, and then try to figure out #2. Marking blocking for now.
Comment 12 User image Kent James (:rkent) 2009-10-19 13:44:52 PDT
Just as a reminder, the unit tests that were written for some of these move cases also failed intermittently on the Mac (bug 502928). It is possible that the root cause of both these test failures, and your thing 2. from comment 11 are the same.
Comment 13 User image David :Bienvenu 2009-10-19 13:47:42 PDT
thx, could be related. I'm on Windows, but of course, there's really nothing platform-specific about the move/copy code.
Comment 14 User image David :Bienvenu 2009-10-19 16:58:21 PDT
Created attachment 407156 [details] [diff] [review]
fix archive batching - checked in.

this makes the batching match the archive granularity, which makes this crash much harder to reproduce. I figure out the archiveFolderUri and granularity in the batch coalescing code, and store it in the batch. And I changed the batch key to know whether the month should be used to determine batches.

There's still a little bit of duplication of code figuring out the identity, but we only run the dup code when the root archive folder doesn't exist.
Comment 15 User image David :Bienvenu 2009-10-19 17:00:08 PDT
Created attachment 407158 [details] [diff] [review]
fix crash - checked in.

this fixes the lowest level of the crash.

What's left is figuring out how the copy service is getting confused...
Comment 16 User image David :Bienvenu 2009-10-19 20:30:44 PDT
I've checked in the first patch, which suffices to take this off the blocker list.
Comment 17 User image David :Bienvenu 2009-10-21 15:56:40 PDT
The second patch was checked in as well.

If you disable the first patch, in order to make the copy service again chain urls that do a copy to the same target folder, the STR to reproduce this bug are as follows:

1. Setup an imap account to archive to a local folder, w/ the granularity left as the default (monthly)
2. Select 3 imap messages from different months but the same year.
3. Press archive.

Thx to asuth for listening to my addled description of the weird things I was seeing in the copy service and saying several true things that allowed me to make sense of what I was seeing. What's happening is that when the first move finishes, nsMsgCopyService::NotifyCompletion calls ClearRequest - this calls OnStopCopy on the listeners, which provokes the archive batching code into calling CopyMessages again, which adds a new request to the request queue. But nsMsgCopyService::NotifyCompletion is looping through the request queue looking for requests with a particular source and destination, and the batching code has unwittingly added a new one that matches, so it gets removed right away. But nsMsgLocalMailFolder::OnCopyCompleted doesn't get called for this second request, so it leaves the folder in an unhappy state, i.e., with a copy state object. The third request comes along, and the folder complains about the existing copy state...

So it turns out to be a variation on the classic issue of iterating over a list that has changed out from under us. I'm not sure why we think the second request is done before it has even started, and fixing that might fix the problem.
Comment 18 User image David :Bienvenu 2009-10-22 12:40:09 PDT
Created attachment 407825 [details] [diff] [review]
fix for copy chaining in copycompleted notification

I'm going to try to come up with a test case that breaks w/o this patch
Comment 19 User image David :Bienvenu 2009-10-22 15:44:24 PDT
Created attachment 407870 [details] [diff] [review]
unit test that asserts w/o this fix
Comment 20 User image David :Bienvenu 2009-10-22 15:47:25 PDT
Comment on attachment 407825 [details] [diff] [review]
fix for copy chaining in copycompleted notification

Kent knows this code better than anyone else, so I'm asking him for a review.

The change to where we increment sourceIndex is obviously not really a change...I can back that part out if you like.
Comment 21 User image Kent James (:rkent) 2009-10-23 12:02:13 PDT
Comment on attachment 407825 [details] [diff] [review]
fix for copy chaining in copycompleted notification

"The change to where we increment sourceIndex is obviously not really a
change...I can back that part out if you like."

That is my only complaint, I don't see the reason to remove a standard loop idiom like you did, so please restore the original.

Overall this has become a very fragile piece of code over time, and you are just patching over the fragility. But this is not the time for a major refactoring.
Comment 22 User image Kent James (:rkent) 2009-10-23 12:06:00 PDT
Comment on attachment 407870 [details] [diff] [review]
unit test that asserts w/o this fix


IFAICT you are not using this, so it could be removed.

>+var gMessageGenerator;
>+var gScenarioFactory;

These are only used locally in run_test, so should not be globals.

r=me with these nits fixed.
Comment 23 User image Dawn Endico 2009-10-24 20:52:35 PDT
Is the theory still that this bug is triggered when archiving messages from multiple years? I just got several crashes when trying to archive messages only from the same year. 

I'm trying to archive my itunes receipts which are dated 2008 and 2009. In the search bar i chose 'from filter' and type "itunes store' which displays all my receipts in the message list pane. To archive a mail i select 1 or more messages and hit the archive button in the message pane. I'm not using a separate search window.

At first it was only crashing when i selected messages from 2008. Archiving mails from 2009 failed too, but i only got an error dialog with no crash. 
   "an operation failed because another operation is using the folder"

Now that I'm done archiving the 2008 messages, archiving 2009 messages causes crashes too.

Still using 3.0b4

4b33ac82-4e53-4861-8c01-ff7172091024	10/24/09	8:46 PM
02fb4e29-db4e-4975-adaa-0144a2091024	10/24/09	8:39 PM
bded6a07-eaeb-4236-8d7b-094742091024	10/24/09	7:47 PM
7734f1a2-fd13-4921-b999-c79822091024	10/24/09	7:43 PM
fe2bc0af-e4f7-4c42-b0be-28fd12091024	10/24/09	7:42 PM
Comment 24 User image David :Bienvenu 2009-10-25 06:53:59 PDT
No, the theory was always archiving messages from different months in the same year caused crashes, if you had an archive granularity of years.
Comment 25 User image David :Bienvenu 2009-10-25 06:59:34 PDT
Dawn, I highly recommend trying a 3.0 pre build which should fix these crashes -
Comment 26 User image Dawn Endico 2009-10-25 12:53:06 PDT
Ok thanks, I misunderstood. I just installed the latest build and archiving works great. I love that feature!
Comment 27 User image David :Bienvenu 2009-11-12 16:55:05 PST
the last two patches have been landed on the trunk. They are additional bullet-proofing, w/ a unit test. Marking fixed.
Comment 28 User image Wayne Mery (:wsmwk, NI for questions) 2010-08-29 04:31:38 PDT
v.fixed per comment 26 and signature gone from crash stats *after* 3.0b4
Comment 29 User image Vincent (caméléon) 2012-07-04 13:18:10 PDT
This bug is coming back for at least one user of TB13 :-(

Note You need to log in before you can comment on or make changes to this bug.