Closed Bug 497622 Opened 15 years ago Closed 15 years ago

IMAP cross-server move fails silently after a copy to a different folder

Categories

(MailNews Core :: Networking: IMAP, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b3

People

(Reporter: rkent, Assigned: rkent)

References

Details

Attachments

(1 file, 4 obsolete files)

In trying to fix bug 448337 I ran into this issue. See attached unit test.
newest bug in this area is bug 495942
Attached patch Remove temp file earlier (obsolete) — Splinter Review
This patch, including an updated unit test, solves the unit test. I want to test it with bug 448337 before I recommend it for review.

I wanted to try to clear the copyState in the URL earlier, at the same point where in the patch I delete the file. But there needs to be a late reference to the copy state to preserve the move status, so that caused the move to fail. Still, I think it would be a good idea to try to remove any need for the copy state after the StopRunningURL so that it could be cleaned up there.
Attachment #382757 - Attachment is obsolete: true
In doing some testing of filters involving local copies to IMAP folders, I continue to have issues with IMAP that are related to this bug, and are not solved by my current patch.

The setup is this. Create a filter on a POP3 account, matchAll, which copies one message to an IMAP folder, and a second action on the same filter which moves the message to a different IMAP folder. Then, send 4 messages to the POP3 account, and download them together.

In three trials without the patch for this bug, I got:

1) two IMAP protocol error dialogs - then some messages misfiled, some filed twice.
2) everything worked correctly, and
3) a crash with the following abbreviated stack:

	msgimap.dll!ns_if_addref<nsIFile *>(nsIFile * expr=0x06b3d6e0)  Line 114 + 0xf bytes	C++
 	msgimap.dll!nsImapUrl::GetMsgFile(nsIFile * * aFile=0x097bfc3c)  Line 1157 + 0x19 bytes	C++
 	msgimap.dll!nsImapProtocol::OnAppendMsgFromFile()  Line 5638 + 0x3a bytes	C++
 	msgimap.dll!nsImapProtocol::ProcessAuthenticatedStateURL()  Line 7570	C++
 	msgimap.dll!nsImapProtocol::ProcessCurrentURL()  Line 1702	C++
 	msgimap.dll!nsImapProtocol::ImapThreadMainLoop()  Line 1354 + 0xd bytes	C++
 	msgimap.dll!nsImapProtocol::Run()  Line 1060	C++

I think this shows that the current code cannot take the stress of multiple IMAP copy/move requests fired at once. The copy/move to different folders probably prevents the move coalescer from keeping us out of trouble.

With the current patch, it works OK if there is one message - but crashes with multiple messages. I think that the patch is correct for what it is trying to do, so the crashes are showing a larger problem that really needs to be solved.

So I don't recommend the current patch.
I've added a couple more copies in my unit test to simulate the trials from comment 3, and the unit test fails consistently now - with an assertion about trying to remove too many messages from the local folder. I'm beginning to believe that this is this is the same root issue as bug 414723 about duplicate messages.
I want to use my new POP3Pump testing framework in a unit test here, so marking the dependency.
Depends on: 498154
Here's an update. Though much of the work I am doing is really directed more at bug 448337, the issues are really in the basic copying mechanism which this bug primarily deals with.

Whenever you submit multiple copy/move type operations that are local->IMAP, the initial stage of the copy from local mailbox to an intermediate file happens immediately. Multiple requests therefore clobber the single file used for shared copies.

I have implemented the createUnique file functionality to prevent this. This took awhile, because the act of creating a file slightly earlier seemed to trigger a well-known but hard-to-find Mozilla issue that the fileSize can be incorrect when operations are done on an output stream. I fixed that with a file clone as is common.

Then I had to get the unit test to work. It is tricky to get all of the IMAP file append operations to complete before I do the final folder updates and tests. I had to play around quite a bit with that, and it seems to be quite sensitive to timing with an extra printf or QI changing the behavior, but I seem to have it working as far as I can tell.

So now, my filter operations with a move followed by a copy will work when there is only one message being downloaded from POP3. But when I download 2 messages, the initial copy of the message to the first folder is lost. I get the same behavior in both unit tests and TB, which is good news. I'm going to try to figure out now why that happens.
Even on a single folder, when multiple copy requests are submitted, this results in the creation of multiple requests to copy to an intermediate file. Each request really has a separate copyState - hence more than one copyState can be active on a folder at one time. The failed message copy that I reported in comment 5 occurs because you are trying to init a copy state when there is already an active copy state, so that request fails here:

  http://mxr.mozilla.org/comm-central/source/mailnews/imap/src/nsImapMailFolder.cpp#7478

with this code:

  NS_ENSURE_TRUE(!m_copyState, NS_ERROR_FAILURE);

We need a mechanism to allow filters to put multiple copy and move requests to a folder, without it failing. Some possibilities:

1) See if nsImapMoveCoalescer can be extended to include copies.
2) Add some sort of copy/move queue on folders
3) Pass around the copyState between routines that use it, so that a single copy is not stored in the folder.
you're describing the copy service - it serializes copy requests to the same target. Either it has an issue, or some of the operations are not using the copy service.
(In reply to comment #8)
> you're describing the copy service - it serializes copy requests to the same
> target.

Then maybe what I need to do is to make sure that the copyState can be removed before the notification is sent to the copy service.
In general, that happens, e.g.:

nsImapMailFolder::OnCopyCompleted(nsISupports *srcSupport, nsresult rv)
{
  m_copyState = nsnull;
  nsresult result;
  nsCOMPtr<nsIMsgCopyService> copyService = do_GetService(NS_MSGCOPYSERVICE_CONTRACTID, &result);
  NS_ENSURE_SUCCESS(result, result);
  return copyService->NotifyCompletion(srcSupport, this, rv);
}
Yes I see that is supposed to be happening, let me figure out what is going wrong then.
This patch seems to work. I'm going to add it to my personal tree and run it for awhile before review though. I'll be marking a dependency to bug 498154, as the POP3pump framework is needed for the unit test with a filter.
Attachment #382855 - Attachment is obsolete: true
Whiteboard: [has patch]
Whiteboard: [has patch] → [has patch][blocked by 498154]
Depends on: 196732
Attached patch minor tweaks (obsolete) — Splinter Review
Now that the POP3pump testing framework is in checkin-needed, let's review this. I would really like to get it landed a week or so before beta3, as it could fix some long-standing copy bugs. But changes like this are always a little risky, so let's not wait until the last minute.

The actual changes in the code are quite small. Most of the patch consists of two unit tests. The test_localToImapCopy.js test leaks, I'm not sure why.
Attachment #383986 - Attachment is obsolete: true
Attachment #386076 - Flags: superreview?(bienvenu)
Attachment #386076 - Flags: review?(bienvenu)
Whiteboard: [has patch][blocked by 498154] → [needs r/sr bienvenu][blocked by 498154]
Target Milestone: Thunderbird 3.0b4 → Thunderbird 3.0b3
Whiteboard: [needs r/sr bienvenu][blocked by 498154] → [needs r/sr bienvenu]
do you mean test_localToImapFilter.js leaks? I don't see a test_localToImapCopy.js. This change looks reasonable. I'll try running with the patch tomorrow, and trying the tests. Thx for the patch!
(In reply to comment #14)
> do you mean test_localToImapFilter.js leaks? I don't see a
> test_localToImapCopy.js.

Look again. There are two unit tests. The second is test_localToImapFilter.js.
Never mind, you're right, I do mean that test_localToImapFilter.js leaks. There is no test_localToImapCopy.js.
Is this code to try to cause the bug to happen, or to make the test work? The url queue should work fine on its own, without you explicitly calling loadNextQueuedUrl. Commenting out the call to loadNextQueuedUrl seems to work, as far as the test succeeding is concerned.

  if (serverSink.queuedUrlCount)
  {
    dump("doTest loading queued URL\n");
    serverSink.loadNextQueuedUrl(null);
    do_timeout(20, "doTest();");
    return;
  }

I'm trying to figure out why the test is leaking...
The code that you mentioned was added because, without it, the unit tests would hang IIRC.

But I cannot reproduce this now. I see no difference in behavior of the test with the entire code block that you quoted commented out, both before and after the backend code changes are applied that the test is fixing. There have been other changes to the test code (and to POP3pump.js which is critical to this test) since I needed to add that code, so it is possible that other changes have solved whatever was causing this.

So I would have no problem with this being removed, though it makes me slightly nervous. Perhaps as a compromise we could leave the new attribute nsIImapServerSink.queuedUrlCount, and comment out the code in the test with a note?
(In reply to comment #18)

> So I would have no problem with this being removed, though it makes me slightly
> nervous. Perhaps as a compromise we could leave the new attribute
> nsIImapServerSink.queuedUrlCount, and comment out the code in the test with a
> note?

The imap code is supposed to manage the url queue itself, so there never should be a need for other code to call loadNextQueuedUrl - it's only in IDL because we're crossing the xpcom boundary and using xpcom proxies. My concern would be that compromise would lead others into temptation...I'd be OK with just adding a comment to the test code saying that at one point the test was hanging, perhaps because the url queue was getting stalled and calling loadNextQueuedUrl fixed it. Or perhaps this discussion in the bug is enough of a breadcrumb to help someone figure out what happened if it starts hanging again.
(In reply to comment #19)
> Or perhaps this discussion in the bug is enough of a breadcrumb to
> help someone figure out what happened if it starts hanging again.

I'd be happy with that as well. Tell me when you are done with your initial look, and ready for a revised patch. Or feel free to to this yourself.
I'm fine with the nsMsgCopyService.cpp change, and if we leave out the changes for the imap url queue, the remaining issue for me is why that one test leaks. I'll spend 30 minutes trying to figure that out before requesting an other patch...
no need for this return at the end of the test function:

+    dump(msgCount + ": " + hdr.subject + "\n");
+  }
+  return;

All I can tell about the leak is that something is holding onto the server objects, which hold their root folders. Eventually they do get gc'ed and freed, but it's after the leaks are calculated. So I'm going to punt for now because I have other b3 stuff to do - I'm ready for a new patch, and I'd like to get the core fix in, w/tests, and I'll worry about the leak later. I notice that this code copies stuff from test_nsIMsgFolderListenerIMAP.js, but it doesn't seem to leak. I think both these tests double create the local folders account, because loadLocalMailAccount creates the account that the code below creates again:

   loadLocalMailAccount()

  // We need an identity so that updateFolder doesn't fail
  let acctMgr = Cc["@mozilla.org/messenger/account-manager;1"]
                  .getService(Ci.nsIMsgAccountManager);
  let localAccount = acctMgr.createAccount();
  let identity = acctMgr.createIdentity();
  localAccount.addIdentity(identity);
  localAccount.defaultIdentity = identity;
  localAccount.incomingServer = gLocalIncomingServer;
  acctMgr.defaultAccount = localAccount;
Here's the revised patch.
Attachment #386076 - Attachment is obsolete: true
Attachment #387052 - Flags: superreview?(bienvenu)
Attachment #387052 - Flags: review?(bienvenu)
Attachment #386076 - Flags: superreview?(bienvenu)
Attachment #386076 - Flags: review?(bienvenu)
Attachment #387052 - Flags: superreview?(bienvenu)
Attachment #387052 - Flags: superreview+
Attachment #387052 - Flags: review?(bienvenu)
Attachment #387052 - Flags: review+
Keywords: checkin-needed
Whiteboard: [needs r/sr bienvenu]
Blocks: 502629
Checked in: http://hg.mozilla.org/comm-central/rev/3c01f6505372
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Flags: in-litmus+
Resolution: --- → FIXED
Note: Linux has shown some intermittent failures since this has been checked in:

http://tinderbox.mozilla.org/showlog.cgi?tree=Thunderbird3.0&errorparser=unittest&logfile=1246965888.1246966542.10829.gz&buildtime=1246965888&buildname=Linux%20comm-1.9.1%20check&fulltext=

I'm seeing how much they persist before I decide if to back out/disable/etc.
If the intermittent failures of comment 26 persist, I think the correct thing to do will be to disable the test, not back out the core code.

Looking at the output, the second operation, which is a move, has not completed. (This reminds me how valuable it is to have status dumps in unit tests, such as the listing of the message subjects for all 3 folders in this test.) I went through hoops getting things to complete during the test, and we ultimately removed one of the hoops (comment 17), though it is not clear that removing that was the cause.

The test leaks anyway, so it clearly needs work.
Blocks: 313026
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: