Closed Bug 604565 Opened 9 years ago Closed 8 years ago

OS X: Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js or test_service_sync_remoteSetup.js | command timed out: 1200 seconds without output

Categories

(Firefox :: Sync, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED WORKSFORME
mozilla11
Tracking Status
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: orangereporter, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [see comment 112])

Attachments

(2 files, 1 obsolete file)

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287104594.1287108107.30640.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/14 18:03:14
s: talos-r3-leopard-041

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-old-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_syncengine_sync.js | running test ...

command timed out: 1200 seconds without output, killing pid 303
I'm guessing this is the same underlying problem:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287112010.1287113780.22974.gz

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-old-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_checkServerError.js | running test ...

command timed out: 1200 seconds without output, killing pid 304
Summary: Intermittent failure in test_syncengine_sync.js | command timed out: 1200 seconds without output → Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js | command timed out: 1200 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287161409.1287163187.20866.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/15 09:50:09
s: talos-r3-leopard-021

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-old-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_checkServerError.js | running test ...

command timed out: 1200 seconds without output, killing pid 318
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287345106.1287346887.27759.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-old-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_checkServerError.js | running test ...

command timed out: 1200 seconds without output, killing pid 318
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287370912.1287372682.16282.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-old-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_checkServerError.js | running test ...

command timed out: 1200 seconds without output, killing pid 308
Okay, this is getting pretty ridiculous. test_service_sync_checkServerError.js is hanging almost every other time at this point, though it only ever hangs on OSX.

I'm able to reproduce these hangs on OSX 10.6.4. It looks like they are always related to network calls. Though on my machine I'm not getting to the 1200 second timeout but the network resource implementation aborts after the 5 min timeout. I'm also seeing a lot of temporary hangs where the test is just stalled on a network call for a few seconds. For example:

2010-10-18 14:19:08	Net.Resource         DEBUG	PUT Length: 180
2010-10-18 14:19:08	ServerWBO            DEBUG	PUT request received
2010-10-18 14:19:08	ServerWBO            DEBUG	PUT 200 response written
2010-10-18 14:19:15	Net.Resource         DEBUG	PUT success 200 http://localhost:8080/1.0/johndoe/storage/crypto/clients

Here the client makes a PUT request to the test server that's running locally on port 8080 (using netwerk/test/httpserver/httpd.js), the server handler ("ServerWBO") receives and handles the request immediately but then it takes 7 seconds for the client to receive the response. This doesn't only happen with sync's test server handlers but also with the error handlers that httpd.js has built in (e.g. the 404 handler). Turning on debugging in httpd.js reveals that the problem is unlikely to be found there:

2010-10-18 14:24:05	Net.Resource         DEBUG	PUT Length: 102
2010-10-18 14:24:05 HTTPD-INFO *** onStopRequest async callback
2010-10-18 14:24:05 HTTPD-INFO *** onStopRequest [status=0x0]
2010-10-18 14:24:05 HTTPD-INFO *** closing connection 26 on port 52054
2010-10-18 14:24:05 HTTPD-INFO *** onSocketAccepted(socket=[xpconnect wrapped nsIServerSocket @ 0x4d44600 (native @ 0x4d4445c)], trans=[xpconnect wrapped (nsISupports, nsISocketTransport, nsITransport, nsIDNSListener) @ 0x7083390 (native @ 0x4d9f22c)])
2010-10-18 14:24:05 HTTPD-INFO >>> new connection on 127.0.0.1:52057
2010-10-18 14:24:05 HTTPD-INFO *** opening new connection 27 on port 52057
2010-10-18 14:24:05 HTTPD-INFO *** starting connection 27
2010-10-18 14:24:05 HTTPD-INFO *** onInputStreamReady(input=[xpconnect wrapped (nsISupports, nsIInputStream, nsIAsyncInputStream, nsISeekableStream, nsISearchableInputStream) @ 0x218819c0 (native @ 0x707cecc)]) on thread [xpconnect wrapped (nsISupports, nsIThread, nsIThreadInternal, nsIEventTarget, nsISupportsPriority) @ 0x7026240 (native @ 0x4d087f0)] (main is [xpconnect wrapped (nsISupports, nsIThread, nsIThreadInternal, nsIEventTarget, nsISupportsPriority) @ 0x7026240 (native @ 0x4d087f0)])
2010-10-18 14:24:05 HTTPD-INFO *** this._state == 0
2010-10-18 14:24:05 HTTPD-INFO *** _parseRequestLine('PUT /1.0/johndoe/storage/keys/pubkey HTTP/1.1')
2010-10-18 14:24:05 HTTPD-INFO *** _parseHeaders
2010-10-18 14:24:05 HTTPD-INFO _processHeaders, Content-length=102
2010-10-18 14:24:05 HTTPD-INFO *** loading data=123,34,112,97,121,108,111,97,100,34,58,34,123,92,34,116,121,112,101,92,34,58,92,34,112,117,98,107,101,121,92,34,44,92,34,107,101,121,68,97,116,97,92,34,58,92,34,115,101,107,114,105,116,92,34,44,92,34,112,114,105,118,97,116,101,75,101,121,85,114,105,92,34,58,92,34,112,114,105,118,107,101,121,92,34,125,34,44,34,105,100,34,58,34,112,117,98,107,101,121,34,125 len=102 excess=0
2010-10-18 14:24:05 HTTPD-INFO *** remaining body data len=0
2010-10-18 14:24:05 HTTPD-INFO *** _validateRequest
2010-10-18 14:24:05 HTTPD-INFO *** path == /1.0/johndoe/storage/keys/pubkey
2010-10-18 14:24:05 HTTPD-INFO calling override for /1.0/johndoe/storage/keys/pubkey
2010-10-18 14:24:05	ServerWBO            DEBUG	PUT request received
2010-10-18 14:24:05	ServerWBO            DEBUG	PUT 200 response written
2010-10-18 14:24:05 HTTPD-INFO *** complete()
2010-10-18 14:24:05 HTTPD-INFO *** partiallySent()
2010-10-18 14:24:05 HTTPD-INFO *** _startAsyncProcessor()
2010-10-18 14:24:05 HTTPD-INFO *** _sendHeaders()
2010-10-18 14:24:05 HTTPD-INFO *** non-async response, set Content-Length
2010-10-18 14:24:05 HTTPD-INFO *** header post-processing completed, sending response head...
2010-10-18 14:24:05 HTTPD-INFO *** preamble copying started
2010-10-18 14:24:05 HTTPD-INFO *** _waitToReadData
2010-10-18 14:24:05 HTTPD-INFO *** _waitForSinkClosure
2010-10-18 14:24:05 HTTPD-INFO *** onInputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** input wanted: 148
2010-10-18 14:24:05 HTTPD-INFO *** 148 bytes read
2010-10-18 14:24:05 HTTPD-INFO *** _waitToWriteData
2010-10-18 14:24:05 HTTPD-INFO *** _waitToReadData
2010-10-18 14:24:05 HTTPD-INFO *** onOutputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** wrote 148 bytes of data
2010-10-18 14:24:05 HTTPD-INFO *** _waitForSinkClosure
2010-10-18 14:24:05 HTTPD-INFO *** onInputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** input stream closed
2010-10-18 14:24:05 HTTPD-INFO *** _doneReadingSource(0x0)
2010-10-18 14:24:05 HTTPD-INFO *** _finishSource(0)
2010-10-18 14:24:05 HTTPD-INFO *** _cancelOrDispatchCancelCallback(0)
2010-10-18 14:24:05 HTTPD-INFO *** cancel(0)
2010-10-18 14:24:05 HTTPD-INFO *** _doneReadingSource(0x0)
2010-10-18 14:24:05 HTTPD-INFO *** _finishSource(0)
2010-10-18 14:24:05 HTTPD-INFO *** _cancelOrDispatchCancelCallback(0)
2010-10-18 14:24:05 HTTPD-INFO *** onStopRequest async callback
2010-10-18 14:24:05 HTTPD-INFO *** preamble copying complete [status=0x0]
2010-10-18 14:24:05 HTTPD-INFO *** _sendBody
2010-10-18 14:24:05 HTTPD-INFO *** starting async copier of body data...
2010-10-18 14:24:05 HTTPD-INFO *** onStartRequest
2010-10-18 14:24:05 HTTPD-INFO *** _waitToReadData
2010-10-18 14:24:05 HTTPD-INFO *** _waitForSinkClosure
2010-10-18 14:24:05 HTTPD-INFO *** onInputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** input wanted: 14
2010-10-18 14:24:05 HTTPD-INFO *** 14 bytes read
2010-10-18 14:24:05 HTTPD-INFO *** _waitToWriteData
2010-10-18 14:24:05 HTTPD-INFO *** _waitToReadData
2010-10-18 14:24:05 HTTPD-INFO *** onOutputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** wrote 14 bytes of data
2010-10-18 14:24:05 HTTPD-INFO *** _waitForSinkClosure
2010-10-18 14:24:05 HTTPD-INFO *** onInputStreamReady
2010-10-18 14:24:05 HTTPD-INFO *** input stream closed
2010-10-18 14:24:05 HTTPD-INFO *** _doneReadingSource(0x0)
2010-10-18 14:24:05 HTTPD-INFO *** _finishSource(0)
2010-10-18 14:24:05 HTTPD-INFO *** _cancelOrDispatchCancelCallback(0)
2010-10-18 14:24:05 HTTPD-INFO *** cancel(0)
2010-10-18 14:24:05 HTTPD-INFO *** _doneReadingSource(0x0)
2010-10-18 14:24:05 HTTPD-INFO *** _finishSource(0)
2010-10-18 14:24:05 HTTPD-INFO *** _cancelOrDispatchCancelCallback(0)
2010-10-18 14:24:05 HTTPD-INFO *** onStopRequest async callback
2010-10-18 14:24:05 HTTPD-INFO *** onStopRequest [status=0x0]
2010-10-18 14:24:05 HTTPD-INFO *** closing connection 27 on port 52057
2010-10-18 14:24:12	Net.Resource         DEBUG	PUT success 200 http://localhost:8080/1.0/johndoe/storage/keys/pubkey

I haven't dug much deeper. The failures started appearing the day we landed threaded crypto on m-c. However, these tests don't even use the real crypto service and if I switch Svc.Crypto back to the regular (unthreaded) WeaveCrypto implementation, I still get those hangs. I'm also puzzled by the fact that these hangs only occur on OSX.
Reduce amount of pointless network calls by not registering any engines for tests that provide their own fake engine implementations anyway.

This makes test_service_sync_checkServerError.js pass 10 out of 10 times whereas without this fix it hanged 9 out of 10 times for me.
Attachment #483997 - Flags: review?(mconnor)
Attachment #483997 - Flags: review?(mconnor) → review+
Rev3 MacOSX Leopard 10.5.8 places debug test xpcshell [testfailed]
http://tinderbox.mozilla.org/showlog.cgi?log=Places/1287418674.1287420478.5704.gz&fulltext=1
Another data point: these hangs only seem to occur on debug builds. According to tbpl, xpcshell tests on optimized builds take O(15min) whereas on debug builds they take O(1h), if they're successful at all.
From your push:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287441151.1287443072.12220.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/18 15:32:31
s: talos-r3-leopard-035

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 304

and the next one:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287447619.1287449667.10811.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/18 17:20:19
s: talos-r3-leopard-011

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 305

but just when I was starting to think backout, the third one went green.
Summary: Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js | command timed out: 1200 seconds without output → Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js | command timed out: 1200 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287469531.1287472012.14234.gz

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...

command timed out: 1200 seconds without output, killing pid 309
I have a theory, that this has to do with bug 591881. It looks like pruning of dead connections is not done anymore in xpcshell tests (but it does happen in all other tests). An argument against this theory, though, is that the failure only happens on OS X 32 Debug builds...

But, the theory can be tested, by manually forcing pruning of dead connections. That could be done by doing

  mozilla::services::GetObserverService()->NotifyObservers(nsnull, "net:prune-dead-connections", nsnull);

whenever a new xpcshell test begins. Is there a way to do that?
It's probably easiest to do in head.js.
(In reply to comment #35)
> But, the theory can be tested, by manually forcing pruning of dead connections.
> That could be done by doing
> 
>   mozilla::services::GetObserverService()->NotifyObservers(nsnull,
> "net:prune-dead-connections", nsnull);
> 
> whenever a new xpcshell test begins. Is there a way to do that?

I added

  Cu.import("resource://services-sync/util.js");
  Svc.Obs.notify("net:prune-dead-connections");

to head_appinfo.js but it doesn't seem to help.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287511843.1287514305.7663.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 11:10:43
s: talos-r3-leopard-040

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 303
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287514074.1287516511.18148.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 11:47:54
s: talos-r3-leopard-008

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 302
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287517902.1287520398.4798.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 12:51:42
s: talos-r3-leopard-040

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 304
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287517589.1287519439.627.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/19 12:46:29
s: talos-r3-leopard-024

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 300
> I added
> 
>   Cu.import("resource://services-sync/util.js");
>   Svc.Obs.notify("net:prune-dead-connections");
> 
> to head_appinfo.js but it doesn't seem to help.

I tried something similar with head.js, didn't work either. I guess that theory is out then.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287519931.1287522560.14822.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 13:25:31
s: talos-r3-leopard-016

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 303
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287518633.1287521588.10486.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 13:03:53
s: talos-r3-leopard-012

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 307
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287521796.1287524199.23146.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 13:56:36
s: talos-r3-leopard-006

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 299
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287531670.1287534037.9878.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 16:41:10
s: talos-r3-leopard-052

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 301
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287543602.1287546356.3134.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 20:00:02
s: talos-r3-leopard-029

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 315
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287543752.1287546644.4620.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 20:02:32
s: talos-r3-leopard-020

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 323
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 20:35:41
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287545741.1287548026.10681.gz

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...

command timed out: 1200 seconds without output, killing pid 307
process killed by signal 9
program finished with exit code -1
elapsedTime=2117.524234
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287543210.1287546220.2108.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/19 19:53:30
s: talos-r3-leopard-004

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_syncengine_sync.js | running test ...
command timed out: 1200 seconds without output, killing pid 304
(In reply to comment #30)
> Okay, that didn't help much. I'll try to find a regression range.

I went back in time to when services/* landed on m-c. Even there I witness these "time jumps" as described in comment 17, although oddly enough test_syncengine_sync completes in ~5 secs without any hangs at all.

I tried to find out when test_syncengine_sync started to regress (on tip it randomly stalls and hangs) and tracked it down to http://hg.mozilla.org/mozilla-central/rev/ca49c629bf81 (hg bisect ftw). On a build based on any previous revision, test_syncengine_sync completes in ~5 secs, after this revision it randomly stalls and hangs. I have no idea how that changeset could even cause this problem, but I've verified it multiple times.

Attaching a gdb to a hanging test reveals the following backtraces:

Thread 1 (process 74319):
#0  0x9168f0fa in mach_msg_trap ()
#1  0x9168f867 in mach_msg ()
#2  0x9888efaf in __CFRunLoopRun ()
#3  0x9888e094 in CFRunLoopRunSpecific ()
#4  0x9888dec1 in CFRunLoopRunInMode ()
#5  0x97008f9c in RunCurrentEventLoopInMode ()
#6  0x97008d51 in ReceiveNextEventCommon ()
#7  0x97008bd6 in BlockUntilNextEventMatchingListInMode ()
#8  0x9375ca89 in _DPSNextEvent ()
#9  0x9375c2ca in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#10 0x013d1603 in nsAppShell::ProcessNextNativeEvent (this=0x4d3dc40, aMayWait=1) at .../mozille-central/widget/src/cocoa/nsAppShell.mm:628
#11 0x01421307 in nsBaseAppShell::DoProcessNextNativeEvent (this=0x4d3dc40, mayWait=1) at .../mozille-central/widget/src/xpwidgets/nsBaseAppShell.cpp:161
#12 0x0142205f in nsBaseAppShell::OnProcessNextEvent (this=0x4d3dc40, thr=0x4d0ebc0, mayWait=1, recursionDepth=0) at .../mozille-central/widget/src/xpwidgets/nsBaseAppShell.cpp:317
#13 0x013d0e80 in nsAppShell::OnProcessNextEvent (this=0x4d3dc40, aThread=0x4d0ebc0, aMayWait=1, aRecursionDepth=0) at .../mozille-central/widget/src/cocoa/nsAppShell.mm:833
#14 0x016e24aa in nsThread::ProcessNextEvent (this=0x4d0ebc0, mayWait=1, result=0xbfffdb08) at .../mozille-central/xpcom/threads/nsThread.cpp:517
#15 0x016fe149 in NS_InvokeByIndex_P (that=0x4d0ebc0, methodIndex=8, paramCount=2, params=0xbfffdaf8) at .../mozille-central/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_unixish_x86.cpp:179
#16 0x00f5621e in CallMethodHelper::Invoke (this=0xbfffdad4) at .../mozille-central/js/src/xpconnect/src/xpcwrappednative.cpp:3054
#17 0x00f58b22 in CallMethodHelper::Call (this=0xbfffdad4) at .../mozille-central/js/src/xpconnect/src/xpcwrappednative.cpp:2321
#18 0x00f52865 in XPCWrappedNative::CallMethod (ccx=@0xbfffdc9c, mode=XPCWrappedNative::CALL_METHOD) at .../mozille-central/js/src/xpconnect/src/xpcwrappednative.cpp:2285
#19 0x00f5e65f in XPC_WN_CallMethod (cx=0x4d37440, argc=1, vp=0x7300258) at .../mozille-central/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1631
#20 0x01b34da0 in js::CallJSNative (cx=0x4d37440, native=0xf5e36b <XPC_WN_CallMethod(JSContext*, unsigned int, jsval_layout*)>, argc=1, vp=0x7300258) at jscntxtinlines.h:652
#21 0x0197b94c in js::Interpret (cx=0x4d37440, entryFrame=0x7300030, inlineCallCount=5, interpFlags=0) at .../mozille-central/js/src/jsinterp.cpp:4716
#22 0x0198f55f in js::RunScript (cx=0x4d37440, script=0x711fd80, fp=0x7300030) at jsinterp.cpp:638
#23 0x0198fb01 in js::Execute (cx=0x4d37440, chain=0x78050e0, script=0x711fd80, prev=0x0, flags=0, result=0xbfffeb60) at jsinterp.cpp:998
#24 0x018c5085 in JS_EvaluateUCScriptForPrincipals (cx=0x4d37440, obj=0x78050e0, principals=0x4e0d204, chars=0x711f2e0, length=25, filename=0x14121 "-e", lineno=1, rval=0xbfffeb60) at .../mozille-central/js/src/jsapi.cpp:4882
#25 0x018c5175 in JS_EvaluateScriptForPrincipals (cx=0x4d37440, obj=0x78050e0, principals=0x4e0d204, bytes=0xbffff249 "_execute_test(); quit(0);", nbytes=25, filename=0x14121 "-e", lineno=1, rval=0xbfffeb60) at .../mozille-central/js/src/jsapi.cpp:4908
#26 0x00006321 in ProcessArgs (cx=0x4d37440, obj=0x78050e0, argv=0xbfffed18, argc=18) at .../mozille-central/js/src/xpconnect/shell/xpcshell.cpp:1302
#27 0x00006fd7 in main (argc=18, argv=0xbfffed18, envp=0xbfffed64) at
.../mozille-central/js/src/xpconnect/shell/xpcshell.cpp:2002

So it's stuck waiting for the next event with the event loop here being spun by JS code. Both the xpcshell test harness (testing/xpcshell/head.js) and our Sync.js do that, they call nsIThread.processNextEvent(true). It looks like this sometimes causes Cocoa nsAppShell to stall briefly or even lock up completely in debug builds. Why http://hg.mozilla.org/mozilla-central/rev/ca49c629bf81 would be able to worsen that behaviour at all is beyond me, though.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287593406.1287595888.15368.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/20 09:50:06
s: talos-r3-leopard-024

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 320
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287593618.1287596326.17166.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/20 09:53:38
s: talos-r3-leopard-027

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 339
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287593375.1287596244.16733.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test xpcshell on 2010/10/20 09:49:35
s: talos-r3-snow-053

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_snowleopard-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_syncengine_sync.js | running test ...
command timed out: 1200 seconds without output, killing pid 332
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287588269.1287590124.21510.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test xpcshell on 2010/10/20 08:24:29
s: talos-r3-snow-033

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_snowleopard-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 324
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287600517.1287603398.20745.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/20 11:48:37
s: talos-r3-leopard-027

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 302
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287601683.1287603593.21734.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/20 12:08:03
s: talos-r3-leopard-046

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_updateEnabledEngines.js | running test ...
command timed out: 1200 seconds without output, killing pid 301
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287601309.1287604006.23857.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/20 12:01:49
s: talos-r3-leopard-013

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_syncengine_sync.js | running test ...
command timed out: 1200 seconds without output, killing pid 305
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/20 12:01:49
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287601309.1287604006.23857.gz

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_syncengine_sync.js | running test ...

command timed out: 1200 seconds without output, killing pid 305
process killed by signal 9
program finished with exit code -1
elapsedTime=2544.827277
I think it's time to disable these tests on OSX debug builds.
(In reply to comment #68)
> I think it's time to disable these tests on OSX debug builds.

I concur, this is basically perma-orange on OSX debug.
blocking2.0: --- → ?
Attachment #484837 - Attachment is obsolete: true
Attachment #484861 - Flags: review?(mconnor)
Attachment #484837 - Flags: review?(mconnor)
Attachment #484861 - Flags: review?(mconnor) → review+
Disabled test_syncengine_sync.js, test_service_sync_checkServerError.js and test_service_sync_updatedEnabledEngines.js on OSX debug builds:

http://hg.mozilla.org/mozilla-central/rev/11ab1b90ee17

Still need to land this on fx-sync.
(In reply to comment #78)
> Disabled test_syncengine_sync.js, test_service_sync_checkServerError.js and
> test_service_sync_updatedEnabledEngines.js on OSX debug builds:
> 
> http://hg.mozilla.org/mozilla-central/rev/11ab1b90ee17
> 
> Still need to land this on fx-sync.

Landed on fx-sync: http://hg.mozilla.org/services/fx-sync/rev/9c0f3c17554c
Piling on a new one:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287694348.1287696313.16217.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/21 13:52:28
s: talos-r3-leopard-042

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 323
Summary: Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js | command timed out: 1200 seconds without output → Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js or test_service_sync_remoteSetup.js | command timed out: 1200 seconds without output
(In reply to comment #82)
> Piling on a new one:
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287694348.1287696313.16217.gz
> Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/10/21
> 13:52:28
> s: talos-r3-leopard-042
> 
> TEST-INFO |
> /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js
> | running test ...
> command timed out: 1200 seconds without output, killing pid 323

If this is going to be a frequent flyer, we might have to disable that one too. Let's wait a bit and see what happens.

(In reply to comment #83)
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287730476.1287732397.6838.gz

This is bug 605458.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1287813809.1287815852.2001.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/22 23:03:29
s: talos-r3-leopard-023

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 312
(In reply to comment #85)
> http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1287813809.1287815852.2001.gz
> Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/22
> 23:03:29
> s: talos-r3-leopard-023

So this is for rev ee1746020149f6278ef03e577afa0c5c2b8db645 which definitely has those tests disabled for OSX debug builds. As in it should immediately return from run_test(). Wondering how this could still make it hang...
_remoteSetup.js is the one I piled on here after you disabled the other three, no?
(In reply to comment #87)
> _remoteSetup.js is the one I piled on here after you disabled the other three,
> no?

You're right of course. I just looked at the bug summary, my bad. Anyway, this doesn't bode well for _remoteSetup...
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1287880372.1287882385.19371.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/23 17:32:52
s: talos-r3-leopard-023

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 317
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1287953788.1287955710.15614.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/24 13:56:28
s: talos-r3-leopard-012

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 311
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1287980015.1287981936.16448.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/24 21:13:35
s: talos-r3-leopard-004

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 358
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288057003.1288059109.15748.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/25 18:36:43
s: talos-r3-leopard-052

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 329
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288075497.1288077498.30117.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/25 23:44:57
s: talos-r3-leopard-007

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 307
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288073057.1288075139.19713.gz
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test xpcshell on 2010/10/25 23:04:17
s: talos-r3-leopard-032

TEST-INFO | /Users/cltbld/talos-slave/tracemonkey_leopard-o-debug_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_service_sync_remoteSetup.js | running test ...
command timed out: 1200 seconds without output, killing pid 310
Rev3 MacOSX Leopard 10.5.8 places debug test xpcshell [testfailed]
http://tinderbox.mozilla.org/showlog.cgi?log=Places/1288189775.1288191774.31625.gz&fulltext=1
blocking2.0: ? → final+
Disabled test_service_sync_remoteSetup.js as well (r=mconnor via IRC):
http://hg.mozilla.org/services/fx-sync/rev/4000c04829e6

Will be picked up with the next merge to m-c
I've been doing some work on 10.6.4 x64, and I don't see any test failures when I set 

DISABLE_TESTS_BUG_604565 = false;

My Minefield build is debug.

Is this failure perhaps limited to 32-bit? Philipp, you said you repro'd on 10.6, right?

I'd be interested to see if we could narrow down the range of platforms on which to exclude these tests; there was a very real risk of them bitrotting during my implementation of Bug 603489.
(In reply to comment #97)
> I've been doing some work on 10.6.4 x64, and I don't see any test failures

They're not failures per se, they're hangs. See my analysis in comment 55.

> Is this failure perhaps limited to 32-bit? Philipp, you said you repro'd on
> 10.6, right?

Yes, I reproduced random hangs and random stalls on 10.6 (32 bit build). The problem is their randomness and their popping up after unrelated changesets. Maybe 32 bit and OSX 10.5 have to do with it, maybe they don't...

> I'd be interested to see if we could narrow down the range of platforms on
> which to exclude these tests; there was a very real risk of them bitrotting
> during my implementation of Bug 603489.

I agree it's not optimal for developers on OSX. But these tests still get run on all other platforms as well as OSX optimised builds. In the worst case your try build will tell you about them ;) (assuming you're pushing to try first which is definitely advised with a big change like bug 603489)
(In reply to comment #98)

> They're not failures per se, they're hangs. See my analysis in comment 55.

Apologies for being unclear. I mean the tests run straight through and pass, no hangs.

> Yes, I reproduced random hangs and random stalls on 10.6 (32 bit build). The
> problem is their randomness and their popping up after unrelated changesets.
> Maybe 32 bit and OSX 10.5 have to do with it, maybe they don't...

I'm building using the 10.6 SDK, with the following otool output:

obj-ff-dbg/dist/MinefieldDebug.app/Contents/MacOS/firefox-bin:
Mach header
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
MH_MAGIC_64  X86_64        ALL LIB64     EXECUTE    27       2952   NOUNDEFS DYLDLINK TWOLEVEL

I wouldn't be surprised if that did make a difference.

> In the worst case your
> try build will tell you about them ;) (assuming you're pushing to try first
> which is definitely advised with a big change like bug 603489)

I plan to!
These tests broke on 2010-12-07 briefly (see bug 617320) and were accidentally starred to this bug. The actual intermittent failures haven't been happening since these tests were disabled on OSX.
Whiteboard: [orange] → [orange][see comment 112]
blocking2.0: final+ → -
status2.0: --- → wanted
Comment 115 through 118 really meant to star bug 629664.

Comment 119 is an occurrence on Linux with a rather weird error. Definitely not this bug, seems more related to bug 587808.
Summary: Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js or test_service_sync_remoteSetup.js | command timed out: 1200 seconds without output → OS X: Intermittent failure in test_syncengine_sync.js or test_service_sync_checkServerError.js or test_service_sync_updateEnabledEngines.js or test_service_sync_remoteSetup.js | command timed out: 1200 seconds without output
Comment 121 is mis-star. It's a similar crash as comment 191, filed bug 633246 for them.
Comment 123 is a mis-star, much like comment 121. Ehsan meant bug 627985.
Actually, if anything, Ehsan meant bug 594666 or bug 633559, except that those are scoped to the jsreftests and reftests respectively.  I don't think we've seen the deadlock assertions in xpcshell before; probably worth filing a new bug and noting it in bug 594666.
Comment 126 is a mis-star. sfink meant to star bug 587420.
It's been a while, so I enabled the disabled OS X tests on services-central. If the oranges go away, great. Otherwise, it's easy enough to revert.

https://hg.mozilla.org/services/services-central/rev/5302ba54392c
(In reply to Gregory Szorc [:gps] from comment #128)
> It's been a while, so I enabled the disabled OS X tests on services-central.
> If the oranges go away, great. Otherwise, it's easy enough to revert.
> 
> https://hg.mozilla.org/services/services-central/rev/5302ba54392c

Re-enabled on m-c, too:

https://hg.mozilla.org/mozilla-central/rev/5302ba54392c
Looks like this has been working in the wild for a while.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Target Milestone: --- → mozilla11
Whiteboard: [orange][see comment 112] → [see comment 112]
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.