Closed Bug 1752323 Opened 4 years ago Closed 3 years ago

Intermittent TEST-UNEXPECTED-TIMEOUT | comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js | Test timed out

Categories

(MailNews Core :: Networking: IMAP, defect, P5)

Thunderbird 112
defect

Tracking

(thunderbird_esr102+ fixed)

RESOLVED FIXED
112 Branch
Tracking Status
thunderbird_esr102 + fixed

People

(Reporter: intermittent-bug-filer, Assigned: mkmelin)

References

Details

(Keywords: intermittent-failure)

Attachments

(5 files)

Assignee: nobody → nicolai
Status: NEW → ASSIGNED
Assignee: nicolai → nobody
Status: ASSIGNED → NEW

For me it fails consistently when I manually run unit test mailnews/imap/test/unit/test_imapStatusCloseDBs.js. It hangs due to ensure success:

0:00.65 pid:736381 [Parent 736381, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF (NS_ERROR_UNEXPECTED): file /home/gene/mozilla/comm/mailnews/imap/src/nsImapProtocol.cpp:809

I haven't proved it but I think this is probably a regression of bug 163964. When I record the imap log while running the test it appears there is an attempt to start a new connection slightly before the original imap connection goes away. With real servers this isn't a problem but with the "fakeserver" used by unit test, it can apparently only handle a single connection at at time so it can't handle this transition.

I tried setting pref max_cached_connection to 2 in test_imapStatusCloseDBs.js but it seemed to totally stop any imap activity as observed in the log, but the test "passed" anyhow!

I'll try backing out the changes from bug 163964 from daily and see if that fixes it.

I was also wondering if there's a way to run a unit test on an older archived version without rebuilding it? That is, how could I run the unit test test_imapStatusCloseDBs on, say, the last 91.x version to see how it behaved before the changes for bug 163964? I tried rebuilding 91 from source but got tons of compile error using the current daily toolchain.

I'll try backing out the changes from bug 163964 from daily and see if that fixes it.

Backed it out plus several later and related changes and it still fails.

I tried setting pref max_cached_connection to 2 in test_imapStatusCloseDBs.js but it seemed to totally stop any imap activity as observed in the log, but the test "passed" anyhow!

I think I am getting fooled by the imap log. Adding Services.prefs.setIntPref("mail.server.server1.max_cached_connections", 2); to the test file actually does let it pass the test. When I look with wireshark there actually is imap stuff happening even though the log file is empty after I ran the test manually. What's happening is the imap log is getting cleared somehow after the tests run. If I add the MOZ_LOG parameter append to the env var, I can see imap activitiy in the log and it's not blank. I.e., I run it like this:

$ MOZ_LOG=IMAP:5,timestamp,sync,append MOZ_LOG_FILE=~/tblog0  ./mach xpcshell-test  comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js

But a couple things I still don't understand is that the log file seems to indicate some authentication errors returned by the fakeserver. But I don't see the same corresponding errors in the wireshark dump. Edit: Looking closer, it appears the the "authentication failures" in the log correspond with fakeserver FIN'ing/dropping the connection but on retry the connection is re-established and continues on and the tests pass.

Also it appears that the bottom half of the imap log file is truncated when compared with wireshark, even with the added append parameter. So maybe trying to run an imap log while running xpcshell-test for imap will give misleading results.

Mossop, do you know what is going wrong with the logging in comment 43? If not, who might?

Gene, Emilio suggests you might use https://rr-project.org/ to debug this.

Flags: needinfo?(dtownsend)

(In reply to Wayne Mery (:wsmwk) from comment #44)

Mossop, do you know what is going wrong with the logging in comment 43? If not, who might?

padenot maybe

Flags: needinfo?(dtownsend) → needinfo?(padenot)

Can we try to not use MOZ_LOG_FILE and use shell redirection in append mode (e.g. >>) to get all logs (or just have a terminal with infinite scroll-back I suppose) ? With the sync mode it should work.

Flags: needinfo?(padenot)

Gene, Emilio suggests you might use https://rr-project.org/ to debug this.

I think that using rr was suggested before while debugging another MOZ_LOG problem. Don't think it helped.

Can we try to not use MOZ_LOG_FILE and use shell redirection in append mode (e.g. >>) to get all logs (or just have a terminal with infinite scroll-back I suppose) ? With the sync mode it should work.

Thanks, good idea. That works but I haven't checked the comparison of the log to wireshark results yet so not sure it tells me anything more. Also, don't know why this unit test doesn't fail consistently when tests are run when the official builds are done every day (or maybe it does?).

I see the issue, and for me the timeout is problematic.
And I suspect this time out is also an issue for treeherder compilation/test server farm since the test jobneeds to stay long than necessary.

Below is an excerpt from the summary created locally from a local xpcshell test run.
I simply compare the timestamp at the beginning of the local log line and print if no activity was recorded more than 20 seconds.

 ========================================
 TIMEOUT
 ========================================

00:15.88 TEST_END: Test PASS. Subtests passed 43/43. Unexpected 0
00:15.88 TEST_START: xpcshell-imap.ini:comm/mail/components/extensions/test/xpcshell/test_ext_messages.js
LONG PAUSE FOUND: 25.950000 seconds
00:49.82 TEST_END: Test FAIL. Subtests passed 999/1003. Unexpected 4
00:00.39 TEST_START: comm/mail/components/shell/test/unit/test_shellService.js
--
00:48.20 TEST_END: Test PASS. Subtests passed 38/38. Unexpected 0
00:48.20 TEST_START: xpcshell-cpp.ini:comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js
LONG PAUSE FOUND: 299.710000 seconds
05:48.20 TEST_END: Test TIMEOUT. Subtests passed 5/5. Unexpected 0
00:00.34 TEST_START: comm/mailnews/import/test/unit/test_AddrBookFileImporter.js

I regularly run the whole xpcshell test sequentially with --verbose flag on my local linux PC to capture and analyze warnings and failures that may be introduced by my local patches.
The whole xpcshell test run takes (including the time out above) slightly longer than 1100 seconds.

 * * * * * * * * * * * * * * 
It took  1148  seconds.
 * * * * * * * * * * * * * * 

If the almost 5 minutes timeout (299.7 ~ 300 seconds) is not there, the run will be shortened to 848 seconds. That will be a significant decrease.

On the treeherder, the xpcshell test is run concurrently, but still the elapsed time would decrease if the timeout is avoided.

So how should we debug this?
Is the fakeserver not capable enough?

From the attached log, the test itself seems to think it has finished cleanly?
But somehow after the errror at mozilla/comm/mailnews/imap/src/nsImapProtocol.cpp:808 is recorded, nothing happens for five minutes and times out.
The line number may be off a bit due to local patch to dump variables, etc.

00:48.49 INFO (xpcshell/head.js) | test run_next_test 3 finished (1)
00:48.49 INFO exiting test
00:48.49 pid:257556 [Parent 257556, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF (NS_ERROR_UNEXPECTED): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/imap/src/nsImapProtocol.cpp:808
05:48.20 TEST_END: Test TIMEOUT. Subtests passed 5/5. Unexpected 0
05:48.20 INFO xpcshell return code: None
05:48.20 INFO xpcshell-cpp.ini:comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js | Process still running after test!
05:48.21 INFO xpcshell-cpp.ini:comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js failed or timed out, will retry.
05:48.21 ERROR TEST-UNEXPECTED-FAIL | Received SIGINT (control-C), so stopped run. (Use --keep-going to keep running tests after killing one with SIGINT)
05:48.21 INFO INFO | Result summary:

The line 808 is as follows.

  803	
  804	    nsCOMPtr<nsIMsgMailNewsUrl> mailnewsUrl = do_QueryInterface(m_runningUrl);
  805	    nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(m_server);
  806	    if (!server) {
  807	      rv = mailnewsUrl->GetServer(getter_AddRefs(server));
* 808	      NS_ENSURE_SUCCESS(rv, rv);  
  809	      m_server = do_GetWeakReference(server);
  810	    }

Looking at the log, I noticed another early failure.: As Gene noted, there may be an authentication screwup (?).

00:48.37 INFO xpcshell-cpp.ini:comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js | Starting setupTest
00:48.37 INFO (xpcshell/head.js) | test setupTest pending (2)
00:48.39 pid:257556 [Parent 257556, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF (NS_ERROR_UNEXPECTED): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/base/src/nsMsgAccountManager.cpp:2065
00:48.42 pid:257556 AUTH PLAIN line -AHVzZXIAcGFzc3dvcmQ=-
00:48.42 pid:257556 authorize-id: --, username: -user-, password: -password-
00:48.43 pid:257556 NS_NewBufferedOutputStream: outputStream (= std::move(aOutputputStream)) =0x7ff4bc0119e8

mozilla/comm/mailnews/base/src/nsMsgAccountManager.cpp:2065 is this.
Oh, I hate the coding style. I won't be surprised if there are bugs in there.
Does it have to be nested-if instead of the if-statements at the same level with proper if-expression?

  // try ("nobody","Local Folders","none"), and work down to any "none" server.
 2052	  rv = findServerInternal("nobody"_ns, "Local Folders"_ns, "none"_ns, 0,
 2053	                          aServer);
 2054	  if (NS_FAILED(rv) || !*aServer) {
 2055	    rv = findServerInternal("nobody"_ns, EmptyCString(), "none"_ns, 0, aServer);
 2056	    if (NS_FAILED(rv) || !*aServer) {
 2057	      rv = findServerInternal(EmptyCString(), "Local Folders"_ns, "none"_ns, 0,
 2058	                              aServer);
 2059	      if (NS_FAILED(rv) || !*aServer)
 2060	        rv = findServerInternal(EmptyCString(), EmptyCString(), "none"_ns, 0,
 2061	                                aServer);
 2062	    }
 2063	  }
 2064	
 2065	  NS_ENSURE_SUCCESS(rv, rv);  <---
 2066	

Hmm. I will dump some values to see what is going on.

OK, I inserted some dump statements to see what the values of some variables involved.
But it did not give me much insight.

However, while I was adding dump statements, I realized one issue.
|m_runningUrlLatest| was replaced with the value of |m_runningUrl| even if the call to |do_QueryInterface(aURL, &rv)| to set it returned failure code in |rv|.

That seems to be wrong.

So replaced the value of |m_runningUrlLatest| only after the |do_QueryInterface| call succeeded.

Voila. No time out.

I don't know the code at all. I am attaching the debug patch and let someone in the know evaluate the issue. The attached log is created with the patch in the following attachment.

This is the patch that solved the timeout issue.

I would ask Gene who has been working on this for some time to see if this looks right.

The essential part is this:

    m_runningUrl = do_QueryInterface(aURL, &rv);
    // m_runningUrlLatest = m_runningUrl;
    // We should not be changing m_runningUrlLatest if this is bogus?
    NS_WARNING(nsPrintfCString(
               "(debug) "
               "do_QueryInterface(aURL, &rv) = "
               "rv=0x%" PRIx32 " m_runningUrl =%s\n",
               static_cast<uint32_t>(rv), m_runningUrl? "true" : "false").get());
    if (NS_FAILED(rv)) return rv;
    m_runningUrlLatest = m_runningUrl;

I moved the | m_runningUrlLatest = m_runningUrl;| after the error return.

This seems to solve the issue. But I defer the judgement if this is correct or not to Gene and others.

Assignee: nobody → ishikawa
Flags: needinfo?(gds)
Attachment #9308385 - Flags: review?(gds)

Oh, my comment was premature.
If I executed the test as a single target, it did not time out with my WIP patch. Great.

However, if I run the test as part of the WHOLE xpcshell test, it still times out (!?).
This seems to happen on local PC.
I am quoting the excerpt from my sequential
So it is quite likely the problem is with the fakeserver side (?) although I think "DO NOT update m_runningUrlLastest if m_runningUrl is bogus (=0)" is probably true.

the following part (marked with *) is only in the failed log (from the whole xpecshell execution) and is not found in the single target execution. Strange.

00:53.00 pid:292470 : file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/base/src/nsMsgAccountManager.cpp:2088
00:53.00 pid:292470 [Parent 292470, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF (NS_ERROR_UNEXPECTED): file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/base/src/nsMsgAccountManager.cpp:2096
00:53.00 pid:292470 [Parent 292470, Main Thread] WARNING: (debug) findServerInternal("nobody"_ns, "Local Folders"_ns,  "none"_ns, 0, aServer)=rv=0x0 aServer=7ffdbe1f95e8
00:53.00 pid:292470 : file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/base/src/nsMsgAccountManager.cpp:2058
* 00:53.02 pid:292470 [Parent 292470, Main Thread] WARNING: (debug) do_QueryInterface(aURL, &rv) = rv=0x0 m_runningUrl =true
* 00:53.02 pid:292470 : file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/imap/src/nsImapProtocol.cpp:803
* 00:53.02 pid:292470 [Parent 292470, Main Thread] WARNING: (debug) After do_QueryInterface(m_server): m_server=0, server = 0, m_runningUrl=true
* 00:53.02 pid:292470 : file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/imap/src/nsImapProtocol.cpp:813
00:53.05 pid:292470 AUTH PLAIN line -AHVzZXIAcGFzc3dvcmQ=-
00:53.05 pid:292470 authorize-id: --, username: -user-, password: -password-

More investigation seems to be necessary.

Hi Chiaki, Sorry for the long delay in responding to this. I'm not sure if you still need a review on this but I keep getting emails reminding me that it's overdue.
I did make the "essential" change in comment 56 (moving the line after the return) which does seem like a potential bug. However, it didn't prevent the timeout when I run:
./mach xpcshell-test ./comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js

If I patch the test like this it does pass (with or without the proposed change to nsImapProtocol.cpp):

diff --git a/mailnews/imap/test/unit/test_imapStatusCloseDBs.js b/mailnews/imap/test/unit/test_imapStatusCloseDBs.js
--- a/mailnews/imap/test/unit/test_imapStatusCloseDBs.js
+++ b/mailnews/imap/test/unit/test_imapStatusCloseDBs.js
@@ -7,18 +7,19 @@
 
 var { PromiseTestUtils } = ChromeUtils.import(
   "resource://testing-common/mailnews/PromiseTestUtils.jsm"
 );
 
 var gFolder1, gFolder2;
 
 add_task(function setupTest() {
   Services.prefs.setBoolPref("mail.check_all_imap_folders_for_new", true);
+  Services.prefs.setIntPref("mail.server.server1.max_cached_connections", 2);
 
   setupIMAPPump();
 
   IMAPPump.daemon.createMailbox("folder 1", { subscribed: true });
   IMAPPump.daemon.createMailbox("folder 2", { subscribed: true });
 
   IMAPPump.server.performTest("SUBSCRIBE");
 
   let rootFolder = IMAPPump.incomingServer.rootFolder;

Also, FWIW, I can see the imap log at t.log when I run this:

MOZ_LOG=IMAP:5,sync,timestamp  ./mach xpcshell-test ./comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js | grep IMAP > t.log
Flags: needinfo?(gds)

Comment on attachment 9308385 [details] [diff] [review]
WIP: dump some values and DO NOT update m_runningUrlLatest if m_runningUrl is bogus (=0?).

The essential change to nsImapProtocol.cpp looks like it's needed.

Attachment #9308385 - Flags: review?(gds) → review+

Chiaki, One more thing (actually two).
Do you know if there is a way to run a unit test on an old archived TB version like ESR 78? I thought maybe this could be used to tell where the test failure of this bug started. I couldn't find a way to do that.

The other question which is totally off-topic that I've been meaning to ask you: Do you ever use gdb debugger with TB? It seems that recently something has changed and I can't use it like I used to. See bug 1805773. I don't use it a lot but when I need it I would like to still be able to put breakpoints on a c++ line number.

(In reply to gene smith from comment #61)

Comment on attachment 9308385 [details] [diff] [review]
WIP: dump some values and DO NOT update m_runningUrlLatest if m_runningUrl is bogus (=0?).

The essential change to nsImapProtocol.cpp looks like it's needed.

So what should I do?
Should I clean up the comment (extra comment) and set r+=gds?

Flags: needinfo?(gds)

(In reply to gene smith from comment #62)

Chiaki, One more thing (actually two).
Do you know if there is a way to run a unit test on an old archived TB version like ESR 78? I thought maybe this could be used to tell where the test failure of this bug started. I couldn't find a way to do that.

The other question which is totally off-topic that I've been meaning to ask you: Do you ever use gdb debugger with TB? It seems that recently something has changed and I can't use it like I used to. See bug 1805773. I don't use it a lot but when I need it I would like to still be able to put breakpoints on a c++ line number.

The answer to the first question: Well, I have never thought about it.
It seems if you could find the original source file tree that corresponds to the particular ESR version, then it should be doable in principle.
Right, in principle. but I have never tried that.

Question No. 2: It rings a bell. I was quite tied up with a technology exhibition in November, December, but I distinctively remember some trouble using gdb with locally created DEBUG version of TB, and funny thing was that I found the very slow built-in GDB of valgrind somehow seemed to work. I mean, at least valgrind could print out the stacktrace and such, so I figured maybe the built-in gdb functionality of valgrind might work. But then I stopped short of verifying that due the busy schedule of technology exhibition.

So I think the gdb problem is real.

Flags: needinfo?(gds)

Thanks for the investigations ISHIKAWA and Gene!

Assignee: ishikawa → mkmelin+mozilla
Status: NEW → ASSIGNED
Target Milestone: --- → 112 Branch
Version: unspecified → Thunderbird 112

Pushed by elizabeth@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/798fcc52e90e
Fix comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js failures. r=gds

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Comment on attachment 9322019 [details]
Bug 1752323 - Fix comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js failures. r=gds

[Approval Request Comment]
Regression caused by (bug #): not a regression
User impact if declined: this is primarily a test fix, but it could potentially clear up some unknown bad behaviour in imap handling
Testing completed (on c-c, etc.): c-c and beta
Risk to taking this patch (and alternatives if risky): safe

Attachment #9322019 - Flags: approval-comm-esr102?

Comment on attachment 9322019 [details]
Bug 1752323 - Fix comm/mailnews/imap/test/unit/test_imapStatusCloseDBs.js failures. r=gds

[Triage Comment]
Approved for esr102

Attachment #9322019 - Flags: approval-comm-esr102? → approval-comm-esr102+

This seems to still be a problem on comm-esr102 even after the uplift.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: