Closed Bug 775924 Opened 7 years ago Closed 2 years ago

Intermittent devtools/server/tests/unit/test_dbgsocket.js | Test timed out

Categories

(DevTools :: Debugger, defect, P3)

x86_64
macOS
defect

Tracking

(firefox24 disabled, firefox25 disabled, firefox26 affected)

RESOLVED INCOMPLETE
Tracking Status
firefox24 --- disabled
firefox25 --- disabled
firefox26 --- affected

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [test disabled on OSX][leave open])

Attachments

(6 files)

Rev4 MacOSX Snow Leopard 10.6 birch opt test xpcshell on 2012-07-19 22:36:13 PDT for push 6f5947eaf463

slave: talos-r4-snow-050

https://tbpl.mozilla.org/php/getParsedLog.php?id=13697088&tree=Birch

Helpful log is helpful :-/

{
TEST-INFO | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/devtools/debugger/tests/unit/test_getyoungestframe.js | running test ...
TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/devtools/debugger/tests/unit/test_getyoungestframe.js | test passed (time: 107.291ms)
TEST-INFO | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/devtools/debugger/tests/unit/test_nsjsinspector.js | running test ...
TEST-PASS | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/devtools/debugger/tests/unit/test_nsjsinspector.js | test passed (time: 105.442ms)
TEST-INFO | /Users/cltbld/talos-slave/test/build/xpcshell/tests/toolkit/devtools/debugger/tests/unit/test_dbgsocket.js | running test ...

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1233.488204
TinderboxPrint: xpcshell<br/><em class="testfail">T-FAIL</em>
buildbot.slave.commands.TimeoutError: command timed out: 1200 seconds without output, attempting to kill
}
(btw: Birch mirrors m-c except uses gcc rather than clang for OS X)
Priority: -- → P3
https://tbpl.mozilla.org/php/getParsedLog.php?id=16759448&tree=Mozilla-Inbound
Summary: Intermittent test_dbgsocket.js | command timed out: 1200 seconds without output → Intermittent test_dbgsocket.js | command timed out: 1200 seconds without output, attempting to kill
test_dbgsocket.js
It puts the URL in its clipboard, or it gets the silly comment again.

https://tbpl.mozilla.org/php/getParsedLog.php?id=17057463&tree=Firefox
Whiteboard: [orange]
We hit this more frequently than the comments here would suggest (since it's typically a manual star on TBPL). Mihai, do you or Panos have any idea what might be happening here?
It's unfortunate that the process is being sent a SIGKILL and not a SIGINT signal, otherwise we would be getting the log up to the point that the test got stuck. I wonder if we could modify the harness to sent SIGINT first, wait a bit and then deliver a SIGKILL.

My guess is that TCP port 2929 is in use and the test can't attach a debugger server to it. We could probably add a loop to try different port numbers and timeout if unavailable.
(In reply to Panos Astithas [:past] from comment #95)
> It's unfortunate that the process is being sent a SIGKILL and not a SIGINT
> signal, otherwise we would be getting the log up to the point that the test
> got stuck. I wonder if we could modify the harness to sent SIGINT first,
> wait a bit and then deliver a SIGKILL.
> 
> My guess is that TCP port 2929 is in use and the test can't attach a
> debugger server to it. We could probably add a loop to try different port
> numbers and timeout if unavailable.

Can we please? :)
Attached patch Patch v1Splinter Review
Randomized the port number.
Assignee: nobody → past
Status: NEW → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/5beec2eab11b
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 22
This just happened. :/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
With any luck, bug 859065 will help here.
Depends on: 859065
(In reply to TinderboxPushlog Robot from comment #154)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=21914116&tree=Mozilla-
> Inbound
> Rev5 MacOSX Mountain Lion 10.8 mozilla-inbound opt test xpcshell on
> 2013-04-17 07:46:16
> slave: talos-mtnlion-r5-043
> 
> 07:54:26  WARNING -  TEST-UNEXPECTED-FAIL |
> /builds/slave/talos-slave/test/build/tests/xpcshell/tests/toolkit/devtools/
> debugger/tests/unit/test_dbgsocket.js | test failed (with xpcshell return
> code: 0), see following log:
> 07:54:31  WARNING -  TEST-UNEXPECTED-FAIL |
> /builds/slave/talos-slave/test/build/tests/xpcshell/head.js | test timed out
> - See following stack:
> 08:06:17    ERROR - Return code: 1

Hooray for our fancy new logging! Panos, does anything in this log help?
Flags: needinfo?(past)
This should fix the error I see in the log, even though I'm not sure whether it's the real cause of the test failure:
https://hg.mozilla.org/integration/mozilla-inbound/rev/544ccbe0f7ef
Flags: needinfo?(past)
https://hg.mozilla.org/mozilla-central/rev/544ccbe0f7ef
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: Firefox 22 → Firefox 23
I'm going to give this a week or so and uplift it to the various branches if m-c stays clean.
Flags: needinfo?(ryanvm)
Guess not :(
Status: RESOLVED → REOPENED
Flags: needinfo?(ryanvm)
Resolution: FIXED → ---
Attached patch Add telemetrySplinter Review
Once m-i reopens I'd like to land this reversal of the previous attempt in comment 157, with some additional logging sprinkled in to help figure out where time is being spent during the test.
(In reply to Panos Astithas [:past] from comment #166)
> Once m-i reopens I'd like to land this reversal of the previous attempt in
> comment 157, with some additional logging sprinkled in to help figure out
> where time is being spent during the test.

To expand on this a little, my theory is that either the long message and unicode test is taking too much time, or that the socket shutdown test is not getting a connection refused error. In the former case, I'll split the test in two, but in the latter case I'll need to come up with something else, because it seems like a platform issue.
(In reply to Panos Astithas [:past] from comment #168)
> To expand on this a little, my theory is that either the long message and
> unicode test is taking too much time, or that the socket shutdown test is
> not getting a connection refused error. In the former case, I'll split the
> test in two, but in the latter case I'll need to come up with something
> else, because it seems like a platform issue.

Comment 171 indicates its the latter unfortunately. 5' were spent on the socket shutdown test. I need to think about it some more.
(In reply to TinderboxPushlog Robot from comment #183)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=22821730&tree=Birch
> Rev5 MacOSX Mountain Lion 10.8 birch opt test xpcshell on 2013-05-10 08:05:04
> slave: talos-mtnlion-r5-058
> 
> 08:13:11  WARNING -  TEST-UNEXPECTED-FAIL |
> /builds/slave/talos-slave/test/build/tests/xpcshell/tests/toolkit/devtools/
> debugger/tests/unit/test_dbgsocket.js | Test timed out
> 08:13:11  WARNING -  TEST-UNEXPECTED-FAIL |
> /builds/slave/talos-slave/test/build/tests/xpcshell/tests/toolkit/devtools/
> debugger/tests/unit/test_dbgsocket.js | test failed (with xpcshell return
> code: -9), see following log:
> 08:29:01    ERROR - Return code: 1

Now that the latest timeout patch has landed, maybe this log will generate more useful information.
Anything useful in the logs, Panos?
Flags: needinfo?(past)
Attached patch Add more loggingSplinter Review
No more info from the latest failures, but this should show if any data is received from the transport.

https://hg.mozilla.org/integration/mozilla-inbound/rev/56c2ea3fe3bd
Flags: needinfo?(past)
(In reply to Panos Astithas [:past] from comment #190)
> No more info from the latest failures, but this should show if any data is
> received from the transport.

None received.
*sigh*
Attached file Process sample
I ran the test in an infinite loop locally and I eventually was able to hit this hang. Then I took a sample of the process with Activity Monitor, but I don't know enough about the low-level bits of the platform to make sense of it. It is waiting for something, but that something doesn't look like it's the same as in bug 847558.
Attached file Process sample 2
Here is another process sample, slightly simpler I think. It doesn't have the CoreGraphics frames in the main thread and the JS watchdog thread is one frame shorter.
Here is the shell script I used to run the test until it hangs.
Steven, perhaps you could explain what the process is waiting for in this case (see the attached process samples)? netstat shows that the socket is gone when the process hangs, but for some reason the client doesn't get a connection refused error.

nsIInputStreamPump.asycRead is the last known thing that ran before the hang.
Flags: needinfo?(smichaud)
This really isn't my area of expertise -- I know very little about the networking code.  But I *do* know that profiles are almost entirely useless for figuring out hangs.  You need to reproduce the hang in gdb, then break and take an all-thread stack trace ("thread apply all bt").
Flags: needinfo?(smichaud)
Summary: Intermittent test_dbgsocket.js | command timed out: 1200 seconds without output, attempting to kill → Intermittent devtools/server/tests/unit/test_dbgsocket.js | Test timed out
Would anybody shed tears if I disabled this test? I'm strongly considering it for at least Fx24 so this doesn't carry on to the ESR, though I suppose starring it there for a year is no different than starring m-c for a year.
Flags: needinfo?(past)
Target Milestone: Firefox 23 → ---
I don't mind disabling it on anything besides m-c to be honest.
Flags: needinfo?(past)
Flags: needinfo?(ryanvm)
Disabled on OSX on Aurora and Beta.

https://hg.mozilla.org/releases/mozilla-aurora/rev/a5129f5b2e95
https://hg.mozilla.org/releases/mozilla-beta/rev/8490a48fc51d
Flags: needinfo?(ryanvm)
Whiteboard: [leave open] → [test disabled on OSX][leave open]
Does it seem safe to close this now?  Looks like it hasn't happened for quite some time.
Flags: needinfo?(ryanvm)
Are you ever planning a fix other than disabling the test? Your call ;)
Flags: needinfo?(ryanvm)
Priority: P3 → --
I'm not actually working on this.
Assignee: past → nobody
Something landed in m-i today that broke this again in optimized Linux builds. Any idea what it was?
Blocks: dbg-test