Open Bug 1299562 Opened 8 years ago Updated 2 years ago

TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js

Categories

(Thunderbird :: General, defect)

x86_64
macOS
defect

Tracking

(Not tracked)

People

(Reporter: jorgk-bmo, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(5 files, 2 obsolete files)

First seen Wed Aug 31, 2016, 14:35:09:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=e5fd1b18853f50d5340a585cc86364d602eb9499&selectedJob=45190

TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | xpcshell return code: 0
TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | streamListener.onStopRequest - [streamListener.onStopRequest : 23] 2152398861 == 0
TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | streamListener.onStopRequest - [streamListener.onStopRequest : 28] "" == "From: John Doe <john.doe@example.com>\\nDate: Sat, 24 Mar 1990 10:59:24 -0500\\nNewsgroups: test.subscribe.simple\\nSubject: H2G2

Most likely fallout from bug 1021843.
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Version: unspecified → Trunk
Blocks: mail-cache2
m-c regression window b18c8bcdc116 to 506facea6316 (unless it turns out to be intermittent).
OK, looking at the test, the first thing that goes wrong is:
  onStopRequest: function(aRequest, aContext, aStatusCode) {
    do_check_eq(aStatusCode, 0);

We get status code 2152398861, that's 0x804B000D, and 0xD = 13.
  const kNetBase = 2152398848; // 0x804B0000
and:
  const NS_ERROR_MODULE_NETWORK = 2152398848;
NS_ERROR_CONNECTION_REFUSED = NS_ERROR_MODULE_NETWORK + 13;

As we already know, this seems to be an intermittent error, so perhaps there is a timing issue in connecting to the NNTP server.
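The arithmetic above can be checked in a few lines (a sketch, not part of the test; `decodeNetError` is a hypothetical helper):

```javascript
const NS_ERROR_MODULE_NETWORK = 0x804B0000; // 2152398848

// Subtract the network module base to get the error number within the module.
function decodeNetError(status) {
  return status - NS_ERROR_MODULE_NETWORK;
}

console.log(decodeNetError(2152398861));  // 13 → NS_ERROR_CONNECTION_REFUSED
console.log((2152398861).toString(16));   // "804b000d"
```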
This is really intermittent:

Hasn't failed here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=98014a396c21079c660920dd13cdc94c38ecc1dc&selectedJob=21263

Has failed here:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=cddb63dcd5cf51f63d01e4c01c551c458765f042&selectedJob=45227

Both runs are from today, the failed one at 10 AM, the non-failed one at 11 AM. My theory: as the day progresses the machines get busier and therefore slower, and a slower machine is less likely to hit the failure.

Aleth, can you try this on your local machine, please, perhaps with a suitable delay after the cacheEntry.markValid() call?

I'm really sorry for having introduced this failure with my cache2 work. However, I don't see that the transition to cache2 is at fault here; it's more likely a subtle timing issue that has only now been uncovered.
Flags: needinfo?(aleth)
The Daily build here
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=cddb63dcd5cf51f63d01e4c01c551c458765f042&selectedJob=45237
also doesn't show the failure.

Let me explain what the test does.

It asynchronously requests news://localhost:<port>/TSS1%40nntp.test from the cache via asyncOpenURI().

In the callback onCacheEntryAvailable() the cache delivers a new, empty entry. The callback then displays the message using DisplayMessage() and a stream listener.

Displaying the message will retrieve it from the NNTP server and *write* it to the cache. When the stream listener stops, we check the content.
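The flow above can be sketched as a synchronous toy model (plain Node.js with mock objects, not the real XPCOM APIs; the names mirror the test but the implementations are invented for illustration):

```javascript
// Mock of cacheStorage.asyncOpenURI(): the cache delivers a new, empty entry.
function asyncOpenURI(key, onCacheEntryAvailable) {
  onCacheEntryAvailable({ key, data: "" });
}

// Mock of DisplayMessage(): "fetch" the article from the NNTP server,
// write it to the cache entry, then notify the stream listener.
function displayMessage(entry, streamListener) {
  entry.data = "From: John Doe <john.doe@example.com>";
  streamListener.onStopRequest(0, entry.data); // status 0 = success
}

asyncOpenURI("news://localhost/TSS1%40nntp.test", entry => {
  if (entry.data !== "") throw new Error("expected a new, empty entry");
  displayMessage(entry, {
    onStopRequest(statusCode, content) {
      // These are the two checks that fail intermittently on Mac.
      console.log(statusCode);                       // 0
      console.log(content.startsWith("From: John")); // true
    },
  });
});
```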

With some debug a good case looks like this:
 0:05.42 TEST_STATUS: Thread-1 run_test/<.onCacheEntryAvailable PASS [run_test/<.onCacheEntryAvailable : 73] 0 == 0
 0:06.05 TEST_STATUS: Thread-1 run_test/<.onCacheEntryAvailable PASS [run_test/<.onCacheEntryAvailable : 84] 1 == 1
 0:06.05 TEST_STATUS: Thread-1 run_test/<.onCacheEntryAvailable PASS [run_test/<.onCacheEntryAvailable : 85] true == true
 0:06.06 PROCESS_OUTPUT: Thread-1 (pid:5900) "=== cacheStorage->AsyncOpenURI NNTP"
 0:06.07 PROCESS_OUTPUT: Thread-1 (pid:5900) "=== |news://localhost:61367/TSS1%40nntp.invalid|"
 0:06.07 PROCESS_OUTPUT: Thread-1 (pid:5900) "=== nsNNTPProtocol::OnCacheEntryAvailable: WRITE"
 0:06.07 PROCESS_OUTPUT: Thread-1 (pid:5900) "=== nsNNTPProtocol::OnCacheEntryAvailable: WRITE END"
 0:06.08 LOG: Thread-1 INFO (xpcshell/head.js) | test pending (2)
 0:06.08 LOG: Thread-1 INFO (xpcshell/head.js) | test MAIN run_test finished (2)
 0:06.08 LOG: Thread-1 INFO running event loop
 0:06.09 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS [streamListener.onStopRequest : 23] 0 == 0
 0:06.10 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS [streamListener.onStopRequest : 28] "From: John Doe <john.doe@example.com>\\nDate: Sat, 24 Mar 1990 10:59:24 -0500\\nNewsgroups: test.subscribe.simple\\nSubject: H2G2  == "From: John Doe <john.doe@example.com>\\nDate: Sat, 24 Mar 1990 10:59:24 -0500\\nNewsgroups: test.subscribe.simple\\nSubject: H2G2
 0:06.12 LOG: Thread-1 INFO (xpcshell/head.js) | test finished (1)

You can see that the Mailnews code opens the entry a second time, and this time it writes to the cache.

The failures on Mac look like this:
08:13:49     INFO -  TEST-FAIL | mailnews/news/test/unit/test_bug540288.js | took 2447ms
08:13:49     INFO -  mailnews/news/test/unit/test_bug540288.js failed or timed out, will retry.

later:

08:26:33     INFO -  TEST-START | mailnews/news/test/unit/test_bug540288.js
08:26:33  WARNING -  TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | xpcshell return code: 0
08:26:33     INFO -  TEST-INFO took 353ms
08:26:33     INFO -  >>>>>>>
08:26:33     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
08:26:33     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 73] 0 == 0
08:26:33     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 84] 1 == 1
08:26:33     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 85] true == true
08:26:33  WARNING -  TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | streamListener.onStopRequest - [streamListener.onStopRequest : 23] 2152398861 == 0

So the test gets run twice, and the *second* time it fails with NS_ERROR_CONNECTION_REFUSED.

There are also cases like this:

03:21:56     INFO -  TEST-START | xpcshell-unpack.ini:toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js
03:21:56     INFO -  TEST-TIMEOUT | mailnews/news/test/unit/test_bug540288.js | took 300000ms
03:21:56     INFO -  mailnews/news/test/unit/test_bug540288.js | Process still running after test!
03:21:56     INFO -  mailnews/news/test/unit/test_bug540288.js failed or timed out, will retry.

later:

03:28:34     INFO -  TEST-START | mailnews/news/test/unit/test_bug540288.js
03:28:35  WARNING -  TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | xpcshell return code: 0
03:28:35     INFO -  TEST-INFO took 354ms
03:28:35     INFO -  >>>>>>>
03:28:35     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
03:28:35     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 73] 0 == 0
03:28:35     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 84] 1 == 1
03:28:35     INFO -  TEST-PASS | mailnews/news/test/unit/test_bug540288.js | run_test/<.onCacheEntryAvailable - [run_test/<.onCacheEntryAvailable : 85] true == true
03:28:35  WARNING -  TEST-UNEXPECTED-FAIL | mailnews/news/test/unit/test_bug540288.js | streamListener.onStopRequest - [streamListener.onStopRequest : 23] 2152398861 == 0

I think the key here is to understand why the test fails the first time. It's fairly clear that the second run might not work if we try to connect to the server a second time.

Looking at how the test changed at
https://hg.mozilla.org/comm-central/rev/c679c89e86ba#l23.8
you may ask: Why did you remove "cacheEntry.close();"? The answer is in bug 1021843 comment #40 (quote):
There is no need to call entry.close(), there is no notion of an entry being closed any more.

That's all I know.
This is reproducible locally, but it immediately fails with NS_ERROR_CONNECTION_REFUSED.
Flags: needinfo?(aleth)
Really ;-) Then you're the man to fix it, I'm afraid. Add some sleep/wait/timeout after the cacheEntry.markValid();
I'm really sorry about this.
(In reply to Jorg K (GMT+2, PTO during summer) from comment #7)
> Really ;-) Then you're the man to fix it, I'm afraid. Add some
> sleep/wait/timeout after the cacheEntry.markValid();
> I'm really sorry about this.

That doesn't have any effect.

There is clearly an unidentified async dependency here. If you add dumps in various places, the timing changes, and you can get the test to timeout after the DisplayMessage message call, instead of the connection error.
Is the debug version of the cache patch in bug 1021843 up to date?
This should compile on all platforms now. (There is a terrible hassle about the right format for an int64_t.)
This bug/code is a mess. Additional debug logging shows that there are two kinds of onStopRequest calls:

nsNNTPProtocol::OnStopRequest, with status 2152398861 (it's unclear to me what's causing this one)

nsNNTPProtocol::CleanupAfterRunningUrl: this calls onStopRequest with status 0. A comment calls it a "truly evil" hack.

Both kinds always happen, and the test passes if the JS onStopRequest gets a status 0 call. The status 2152398861 call may or may not make it to JS, depending on whether m_channelListener is set. If it does, the test fails. Sometimes m_channelListener gets set to nullptr early, in which case no onStopRequest call reaches JS and the test times out.
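The listener gating described above can be modeled in a few lines (a sketch with hypothetical names mirroring the C++ members, not the actual code):

```javascript
// Toy model: the error status only reaches the JS listener when
// m_channelListener happens to still be set at that moment.
function makeProtocol(listener) {
  return {
    m_channelListener: listener,
    onStopRequest(status) {
      if (this.m_channelListener) {
        this.m_channelListener.onStopRequest(status);
      }
    },
  };
}

const results = [];
const p = makeProtocol({ onStopRequest: s => results.push(s) });

p.onStopRequest(2152398861); // error call reaches JS → the test fails
p.m_channelListener = null;  // cleanup removed the listener early...
p.onStopRequest(0);          // ...so the status 0 call is dropped → timeout

console.log(results);        // [ 2152398861 ]
```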

It would be interesting to know if the calls to nsNNTPProtocol::OnStopRequest also happen on other OS and with which status.
(In reply to aleth [:aleth] from comment #11)
> This bug/code is a mess
Why am I not surprised? Actually, I also don't understand the test. The bug that goes with it was:
Bug 540288 - News article is empty if selected during download from news server
See: https://hg.mozilla.org/comm-central/rev/1dcc58a75dbb

So they request the cache entry in the JS test, which is unusual, since it interferes with the internal logic in the NNTP protocol code. Surely this provokes an empty/new entry which we subsequently write to, see the debug output in comment #5. So the test ensures that empty/new cache entries are written to and not read from. That's the general idea of the cache and that hasn't changed; only with cache2 we are now told up front whether the entry is new.

Anything else that fails was flaky before, IMHO.

> It would be interesting to know if the calls to
> nsNNTPProtocol::OnStopRequest also happen on other OS and with which status.
I can run the test with any patch you provide.

Also, I'd be interested to see the debug output on Mac and whether we get to writing the cache at all, or whether the connection error happens before that. Most likely the latter.

BTW, "truly evil" is here:
https://dxr.mozilla.org/comm-central/rev/cddb63dcd5cf51f63d01e4c01c551c458765f042/mailnews/news/src/nsNNTPProtocol.cpp#4658
Note that test_getNewsMessage.js and test_bug403242.js/head_server_setup.js contain similar code.
I landed some documentation changes on nsImapProtocol.cpp, so here's the refreshed debug patch.
Oops, how silly of me. Just strip one of the patches down to the NNTP part and lose the IMAP hunks.
Adds a lot of additional dumps for debugging.
Here's the output with a ton of extra log dumps from the similar test that passes. You can see the issues that make test_bug540288.js fail intermittently are also present here. So it looks like the cache changes have just revealed the problem.

You can see the problematic nsNNTPProtocol::OnStopRequest calls with status 2152398850. I don't know yet exactly what's causing those, but I suspect it's CloseConnection(). The test passes because none of these happen to reach the JS listener in this case.

I'd be interested to know if these stop requests happen on other platforms.

Immediately obvious crazy codeflow examples: 
  * nsNNTPProtocol::CleanupAfterRunningUrl triggering CloseConnection, which calls closeSocket, which calls CleanupAfterRunningUrl (again)
  * the way it's completely unclear whether m_channelListener is set/unset when it should be

What is CleanupAfterRunningUrl actually supposed to do? Why does it close the socket? Does that mean NNTP opens a new connection for every single message it fetches?

Re test_bug540288.js: if you see a better/more sensible way to write the test given the possibilities of the new cache code, so that it doesn't interfere with or depend on the internal NNTP logic so much, that would be good. It would certainly be a good idea to add some explanatory comments to the test for future reference.


./mach xpcshell-test mailnews/news/test/unit/test_getNewsMessage.js 
[...]
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== loadNewsURL"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = no m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== CloseConnection (causes stop request via NNTP_CMD_QUIT)"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === nsNNTPProtocol::closeSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling CleanupAfterRunningUrl"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = no m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling nsMsgProtocol:closeSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- nsMsgProtocol::CloseSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- END nsMsgProtocol::CloseSocket 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === END nsNNTPProtocol::closeSocket 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END CloseConnection"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "----- nsNNTPProtocol::OnStopRequest 2152398850 "
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  no m_channellistener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before FinishMemCacheEntry"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== FinishMemCacheEntry valid=false "
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END FinishMemCacheEntry"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before msgprotocol:OnStopRequest"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 2152398850"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   mSuppressListenerNotifications false"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   m_channelListener false"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 2 - 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 3 - 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 4 - false"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 5 - 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  after msgprotocol:OnStopRequest"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  - no m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before CloseSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === nsNNTPProtocol::closeSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling CleanupAfterRunningUrl"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = no m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling nsMsgProtocol:closeSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- nsMsgProtocol::CloseSocket"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- END nsMsgProtocol::CloseSocket 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === END nsNNTPProtocol::closeSocket 0"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "----- END nsNNTPProtocol::OnStopRequest 0"
 0:01.70 TEST_STATUS: Thread-1 run_test PASS [run_test : 76] 1 == 1
 0:01.70 TEST_STATUS: Thread-1 run_test PASS [run_test : 77] true == true
 0:01.70 LOG: Thread-1 INFO (xpcshell/head.js) | test pending (2)
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 1"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 2"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 3"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 4"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 5"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 5b"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 6"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 7"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 8"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== DisplayMessage 9"
 0:01.70 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== GetMessageFromUrl 1"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== GetMessageFromUrl 3"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== nsNNTPProtocol::AsyncOpen 1"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "   --set m_channelListener in AsyncOpen"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== AsyncOpen 2"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "  fetcharticle"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== AsyncOpen 2-opencache"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== opencacheentry"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cacheStorage->AsyncOpenURI NNTP"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== |news://localhost:53275/TSS1%40nntp.invalid|"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== nsNNTPProtocol::OnCacheEntryAvailable: WRITE"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "--- init channel listener in OnCacheEntryAvailable"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== nsNNTPProtocol::OnCacheEntryAvailable: WRITE END"
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== OpenCacheURI returns 0 "
 0:01.71 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== GetMessageFromUrl 4 : 0 "
 0:01.70 LOG: Thread-1 INFO (xpcshell/head.js) | test MAIN run_test finished (2)
 0:01.70 LOG: Thread-1 INFO running event loop
 0:01.72 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.72 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = sends m_channelListener->stoprequest 0 "
 0:01.72 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS [streamListener.onStopRequest : 28] 0 == 0
 0:01.72 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS [streamListener.onStopRequest : 33] "From: John Doe <john.doe@example.com>\\nDate: Sat, 24 Mar 1990 10:59:24 -0500\\nNewsgroups: test.subscribe.simple\\nSubject: H2G2  == "From: John Doe <john.doe@example.com>\\nDate: Sat, 24 Mar 1990 10:59:24 -0500\\nNewsgroups: test.subscribe.simple\\nSubject: H2G2 
 0:01.72 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.72 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== CloseConnection (causes stop request via NNTP_CMD_QUIT)"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === nsNNTPProtocol::closeSocket"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling CleanupAfterRunningUrl"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = no m_channelListener"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling nsMsgProtocol:closeSocket"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- nsMsgProtocol::CloseSocket"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- END nsMsgProtocol::CloseSocket 0"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === END nsNNTPProtocol::closeSocket 0"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END CloseConnection"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "----- nsNNTPProtocol::OnStopRequest 2152398850 "
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "  no m_channellistener"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before FinishMemCacheEntry"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== FinishMemCacheEntry valid=false "
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END FinishMemCacheEntry"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before msgprotocol:OnStopRequest"
 0:01.74 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 2152398850"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "   mSuppressListenerNotifications false"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "   m_channelListener false"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 2 - 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 3 - 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 4 - false"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "==== nsMsgProtocol::OnStopRequest 5 - 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "  after msgprotocol:OnStopRequest"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "  - no m_channelListener"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "  before CloseSocket"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === nsNNTPProtocol::closeSocket"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling CleanupAfterRunningUrl"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== cleanup after running url"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = no m_channelListener"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "   = removing any m_channelListener"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "=== END cleanup after running url"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "    calling nsMsgProtocol:closeSocket"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- nsMsgProtocol::CloseSocket"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "     --- END nsMsgProtocol::CloseSocket 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "  === END nsNNTPProtocol::closeSocket 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "----- END nsNNTPProtocol::OnStopRequest 0"
 0:01.75 PROCESS_OUTPUT: Thread-1 (pid:33271) "Stream closed whilst sending, this may be expected"
 0:01.74 LOG: Thread-1 INFO (xpcshell/head.js) | test finished (1)
(In reply to aleth [:aleth] from comment #17)
> Here's the output with a ton of extra log dumps from the similar test that
> passes.
A ton, indeed, what a pain to trace through it and add the debug.

> You can see the issues that make test_bug540288.js fail
> intermittently are also present here. So it looks like the cache changes
> have just revealed the problem.
That's what I think.

> You can see the problematic nsNNTPProtocol::OnStopRequest calls with status
> 2152398850, I don't know yet what's causing those exactly but I suspect it's
> CloseConnection(). The test passes because none of these happen to reach the
> JS listener in this case.
2152398850 = 0x804B0002 = NS_ERROR_BINDING_ABORTED = NS_ERROR_MODULE_NETWORK + 2.

> I'd be interested to know if these stop requests happen on other platforms.
I'll run it.
 
> Immediately obvious crazy codeflow examples: 
>   * nsNNTPProtocol::CleanupAfterRunningUrl triggering CloseConnection, which
> calls closeSocket, which calls CleanupAfterRunningUrl (again)
>   * the way it's completely unclear whether m_channelListener is set/unset
> when it should be
Surely there needs to be a gate to prevent infinite recursion. Note the:
CleanupAfterRunningUrl(); // is this needed?
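The "gate" idea can be sketched in a few lines (JavaScript, with hypothetical names mirroring the C++ methods; this is the shape of the suggestion, not the actual fix):

```javascript
// Reentrancy guard: break the cleanup → closeSocket → cleanup cycle.
let cleaningUp = false;

function cleanupAfterRunningUrl() {
  if (cleaningUp) return; // gate: already cleaning up, don't recurse
  cleaningUp = true;
  try {
    closeConnection(); // may call back into cleanupAfterRunningUrl()
  } finally {
    cleaningUp = false;
  }
}

function closeConnection() {
  cleanupAfterRunningUrl(); // the re-entrant call is now a no-op
}

cleanupAfterRunningUrl();
console.log("no infinite recursion");
```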

> What is CleanupAfterRunningUrl actually supposed to do? Why does it close
> the socket? Does that mean NNTP opens a new connection for every single
> message it fetches?
Not my crufty code. Who wrote NNTP?

> Re test_bug540288.js, if you see a better/more sensible way to write the
> test given the possibilities of the new cache code, so that it doesn't
> interfere/depend on the internal NNTP logic so much, that would be good.
Well, thinking about it, the test makes some sense, see comment #12. 

> It would certainly be a good idea if you could add some explanatory
> comments to the test for future reference.
That I can do once we're done here.
Attached patch nntp-debug.patchSplinter Review
Needs to be applied before Aleth's patch. BTW, on Windows you can't print a char16_t with %s:
-printf("==== nsMsgProtocol::OnStopRequest 3b - %s\n", errorString);
+printf("==== nsMsgProtocol::OnStopRequest 3b - %d\n", aStatus);
Attachment #8787336 - Attachment is obsolete: true
Attachment #8787888 - Attachment is obsolete: true
Can I leave it with you? At first glance, nothing spectacular.
Hmm, a few 2152398850 there, before and after the
 0:02.54 PROCESS_OUTPUT: Thread-1 (pid:3936) "=== cleanup after running url"
 0:02.54 PROCESS_OUTPUT: Thread-1 (pid:3936) "   = sends m_channelListener->stoprequest 0 "
 0:02.53 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS [streamListener.onStopRequest : 23] 0 == 0
> nsNNTPProtocol::CleanupAfterRunningUrl triggering CloseConnection

The exact flow is

CleanupAfterRunningUrl 
nsMailNewsUrl::SetUrlState in this case calls NOTIFY_URL_LISTENERS(OnStopRunningUrl)
JS OnStopRunningUrl handler in the test calls nsNntpIncomingServer::CloseCachedConnections
nsNntpIncomingServer::CloseCachedConnections calls CloseConnection

I wonder why the OnStopRunningUrl handler is calling CloseCachedConnections, and whether this also happens outside of tests.
(In reply to Jorg K (GMT+2, PTO during summer) from comment #21)
> Hmm, a few 2152398850 there, before and after the
>  0:02.54 PROCESS_OUTPUT: Thread-1 (pid:3936) "=== cleanup after running url"
>  0:02.54 PROCESS_OUTPUT: Thread-1 (pid:3936) "   = sends
> m_channelListener->stoprequest 0 "
>  0:02.53 TEST_STATUS: Thread-1 streamListener.onStopRequest PASS
> [streamListener.onStopRequest : 23] 0 == 0

OK, so it looks like the other platforms at the moment are just lucky ;)
(In reply to aleth [:aleth] from comment #22)
> JS OnStopRunningUrl handler in the test calls
> nsNntpIncomingServer::CloseCachedConnections
Can that be removed?

Sorry, I'm not following this 100%; my head is spinning with the caching issues in bug 629738.
I noticed this error and read through the comments.

The number 2152398850 rang a bell.

I have seen it for a long time in my local log.

-- begin quote 
exiting test
PROCESS | 20199 | MODE READER,GROUP test.filter,XOVER 1-8,HEAD 1,HEAD 2,HEAD 3,HEAD 4,HEAD 5,HEAD 6,HEAD 7,HEAD 8
PROCESS | 20199 | Connection Lost 2152398850
--- end quote

I don't run the xpcshell tests as often as the mozmill tests since the xpcshell tests take much longer.

But the oldest occurrence is from October 2014, so a similar problem has been there for a long time, I think.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

I still see this when I run the test under valgrind, but not ordinarily, at least in one of the tests I ran.
This suggests the timing issue is still there :-(

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Status: REOPENED → NEW

Let us know if you still see this.
FWIW, we have open bug reports that match the description of bug 540288 comment 0

Flags: needinfo?(ishikawa)
See Also: → 540288

(In reply to Wayne Mery (:wsmwk) from comment #27)

FWIW, we have open bug reports that match the description of bug 540288 comment 0

Since you added me to Cc, I suspect that you were going to mention another bug, not an old one that was fixed before this issue was created.

This is the output from

grep -3 "Connection Lost 2152398850" /FF-NEW/log1390-xpcshell.txt  

The log was captured on March 14th. (The source is about a week old. I have not updated the source tree due to disk issues, and then a network issue. I am now seriously considering replacing the network adapter; I've ordered one from an auction site.)
The log contains additional debug warnings from local patches.

We can easily see that the "connection lost" message appears during or around the call to
|RemoveFromCache(this)|.

I have not analyzed the code much, but it seems that whichever module is contacted from the control flow of |RemoveFromCache(this)| closes the connection too early.
I say this based on the following log output.

 0:16.55 pid:180618 MODE READER,GROUP test.filter,XOVER 1-8,HEAD 1,HEAD 2,HEAD 3,HEAD 4,HEAD 5,HEAD 6,HEAD 7,HEAD 8,QUIT
 0:16.55 pid:180618 Connection Lost 2152398850
 0:16.58 pid:180618 console.warn: mailnews.nntp: "Failed to send \"QUIT\" because socket state is closed"
 0:16.60 pid:180618 ~nsMsgDatabase(): closing database    /NEW-SSD/moz-obj-dir/objdir-tb3/temp/xpc-profile-xd41hzri/News/localhost.msf

The particular log comes from a run on March 14th.
And the excerpted part is from

 0:16.05 TEST_START: xpcshell.ini:comm/mailnews/news/test/unit/test_filter.js

PS:

BTW, the following is very important.

xpcshell runs many tests in parallel.
This has the benefit of maximizing I/O and CPU utilization, which is probably good from the viewpoint of test farm management.

However, parallel execution means all the output from the various runs ends up in one output stream, and analyzing the log is difficult in such a situation.

This is indeed the case with xpcshell.

If I add "--verbose" to the xpcshell test execution, the output from the various parallel threads executing different tests is interleaved, and analysis is difficult.
So I have to add "--sequential" to make it easier to analyze the failures and the warning messages that precede them. Analyzing warning messages is important because we want to compare successful and failing similar code executions in one run.

The log I obtain for xpcshell is created with "--sequential" and "--verbose", but
I need a wrapper around |mach| because I invoke it per test directory
to group the test output in a known order. (I forget exactly why I did that. Initially I thought I was going to run EACH test separately, but that seemed silly, so I turned to per-directory execution with "--verbose" and "--sequential".)

I have been forced to execute xpcshell in a serialized manner for local analysis.

Try-comm-central:

It is not obvious how we can add "--sequential" and "--verbose" to the xpcshell test runs on try-comm-central.
I think somebody in the know ought to make it possible to execute xpcshell in a sequential manner there,
and make it easy to add "--verbose" and analyze the warnings.

BTW, only with this serialized execution could I run the tests under valgrind, where I found a few long-standing, genuine memory issues.

Flags: needinfo?(ishikawa)
Severity: normal → S3