Closed Bug 432471 Opened 12 years ago Closed 12 years ago

qm-xserve01 failing during mochitest

Categories

(Release Engineering :: General, defect, P1)

PowerPC
macOS
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: rcampbell, Assigned: rcampbell)

Attachments

(1 file)

qm-xserve01 has been occasionally failing while running mochitests in the past few days. To try to track down the source of the problem, I've enabled debugging symbols and added stack-and-abort to the XPCOM_DEBUG_BREAK environment variable. Additionally, I enabled core dumps on the machine to try to get some useful debugging info for roc. Still waiting on a core.
Assignee: nobody → rcampbell
Priority: -- → P1
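The debugging setup described above can be sketched as a small launcher. This is a minimal sketch, not the harness's actual code: the `XPCOM_DEBUG_BREAK=stack-and-abort` setting comes from the comment above, while the helper names (`debug_env`, `raise_core_limit`, `launch`) and the use of Python's `resource` module to raise the per-process core-file limit are assumptions for illustration. System-wide core dump settings such as /etc/hostconfig are separate and are not touched here.

```python
import os
import resource
import subprocess


def debug_env(base=None):
    """Copy an environment and request stack-and-abort on XPCOM assertions."""
    env = dict(os.environ if base is None else base)
    env["XPCOM_DEBUG_BREAK"] = "stack-and-abort"
    return env


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    Intended to run in the child process just before exec.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def launch(cmd):
    """Start cmd (a hypothetical browser command line) with both settings."""
    return subprocess.Popen(cmd, env=debug_env(), preexec_fn=raise_core_limit)
```

Raising only the soft limit keeps the sketch usable without root; whether a core file actually appears still depends on the system configuration.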
error seems to consistently be:

*** 17123 INFO Running /tests/dom/tests/mochitest/dom-level1-core/test_characterdatagetdata.html...
FAIL Exited with code 10 during test run
kill TERM 10438
Process killed. Took 1 second to die.

Did not get a core dump, despite turning that on in /etc/hostconfig

Running mochitests interactively to see what's going on.
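To see whether the failure always lands on the same test, the last "Running" line before the exit message can be pulled out of each harness log. The following is a rough sketch that assumes the log lines look exactly like the excerpt above; the function name and regular expressions are illustrative only.

```python
import re

# Patterns modelled on the log excerpt in this bug; other harness
# versions may format these lines differently.
RUNNING = re.compile(r"^\*\*\* (\d+) INFO Running (\S+?)\.\.\.$")
FAILED = re.compile(r"^FAIL Exited with code (\d+) during test run$")


def last_test_before_exit(lines):
    """Return (test_path, exit_code) for the first abnormal exit, or None."""
    current = None
    for line in lines:
        line = line.strip()
        match = RUNNING.match(line)
        if match:
            current = match.group(2)
            continue
        match = FAILED.match(line)
        if match:
            return current, int(match.group(1))
    return None
```

Running this over several failed logs would show quickly whether the crash follows one test or moves around.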
I've attempted to reproduce this error by running the tests interactively 10 times now, with and without gdb attached to the firefox process, and with and without a vnc window into the machine to watch the results. So far I have been unable to reproduce it. I'm going to have to put this machine back online, and will try to scavenge another xserve to do backup unittest duty while we figure this out.
We should get the system core dump functionality working.

Can you get a core dump by manually doing "kill -11" on the process while it's running mochitests? How about when it's running mochitests in the harness?

How about "kill -10"?
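The manual check suggested above can be sketched as a small script that sends a signal to a throwaway child process and reports how it died. This is a sketch under assumptions: the child here is a sleeping Python process standing in for the browser, and the helper name is made up. Signal numbers differ between platforms (on Mac OS X, 11 is SIGSEGV and 10 is SIGBUS), so the sketch takes a named signal rather than a raw number.

```python
import os
import sys
import time


def signal_child(sig):
    """Send sig to a sleeping child; return (terminating_signal, core_dumped).

    Returns (None, False) if the child was not killed by a signal.
    """
    pid = os.spawnv(
        os.P_NOWAIT,
        sys.executable,
        [sys.executable, "-c", "import time; time.sleep(60)"],
    )
    time.sleep(0.5)  # give the child a moment to start up
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        # WCOREDUMP reports whether the kernel wrote a core file
        return os.WTERMSIG(status), os.WCOREDUMP(status)
    return None, False
```

Calling `signal_child(signal.SIGSEGV)` and then `signal_child(signal.SIGBUS)` would show whether either signal produces a core with the current limits.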
We should probably just disable the test for now, since we're losing unit test coverage when it happens, and it is happening more and more frequently.
agreed, though I do think this failure is due to a checkin on or around 4/29/2008 before 18:00.
(In reply to comment #3)
> We should get the system core dump functionality working.
> 
> Can you get a core dump by manually doing "kill -11" on the process while it's
> running mochitests? How about when it's running mochitests in the harness?
> 
> How about "kill -10"?

I tried using kill -6 when I first enabled core dumps. I can try again when I have non-hotel internets.
best guess is the checkin range in: http://tinyurl.com/6de2lo.

adding smichaud, smaug and shebs for a consult.
Feels like backing out and relanding one of these changes could be a blocker for FF3...
Anyone tried valgrinding this test?
Comment on attachment 319684 [details] [diff] [review]
disable test temporarily

>Index: dom/tests/mochitest/dom-level1-core/Makefile.in

>+# temporarily disabled because it's causing crashes
>+# 		test_characterdatagetdata.html \

I landed this change (without the \ which was causing my build to fail).
Failing now in a related test:

*** 17171 INFO Running /tests/dom/tests/mochitest/dom-level1-core/test_characterdatagetlength.html...
FAIL Exited with code 10 during test run
kill TERM 15467
Hmm, this is reminiscent of the purported Linux libpr0n fails, that apparently had less to do with the specific test than that the browser had been running for long enough to trigger the bug.
The png reftest fails were caused by an abnormally large places.sqlite causing the tests to time out when trying to load the files, and were fixed by removing that places.sqlite and making sure to use a new profile for each test run.

Given that the two tests are similar I think it's relatively likely that they're both just triggering the same bug. I tried disabling the second one, and robcee kicked off another build, this time with core dumps enabled. If that build goes green, the next build will catch my backout of the test disabling and we should be able to get a stack for the crash, assuming it happens again.
(In reply to comment #12)
> Hmm, this is reminiscent of the purported Linux libpr0n fails, that apparently
> had less to do with the specific test than that the browser had been running
> for long enough to trigger the bug.

Big difference is that Mochitest uses a clean profile every time, so there's no cross-run wackiness that can cause problems.  I don't believe we make any permanent changes outside the profile when we run, but I could be wrong.
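For illustration, the "clean profile every time" idea can be sketched as a context manager that hands out an empty directory per run and removes it afterwards. This is not how Mochitest builds its profile; the helper names are hypothetical, and the only browser detail assumed is the `-profile` command-line option.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def fresh_profile(prefix="mochitest-profile-"):
    """Yield a new, empty profile directory and delete it afterwards."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


def browser_command(binary, profile_dir):
    """Build a command line that points the browser at the given profile."""
    return [binary, "-profile", profile_dir]
```

Because the directory is deleted on exit, nothing such as an oversized places.sqlite can carry over from one run to the next.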
with crowder's awesome help, we were able to decode a core dump:

(gdb) bt
#0  js_Interpret (cx=0x2d6e62a0) at /builds/slave/trunk_osx/mozilla/js/src/jsinterp.c:5974
#1  0x00240941 in js_Execute (cx=0x2d6e62a0, chain=0x2a1808a0, script=0x2a19d000, down=0x0, flags=0, result=0xbfffe0fc) at /builds/slave/trunk_osx/mozilla/js/src/jsinterp.c:1535
#2  0x0020b603 in JS_EvaluateUCScriptForPrincipals (cx=0x2d6e62a0, obj=0x2a1808a0, principals=0x31128754, chars=0x2ad53008, length=150640, filename=0x2f29afc8 "http://localhost:8888/MochiKit/packed.js", lineno=1, rval=0xbfffe0fc) at /builds/slave/trunk_osx/mozilla/js/src/jsapi.c:4999
#3  0x01487e27 in nsJSContext::EvaluateString (this=0x2d6e6210, aScript=@0x2f29afa0, aScopeObject=0x2a1808a0, aPrincipal=0x31128750, aURL=0x2f29afc8 "http://localhost:8888/MochiKit/packed.js", aLineNo=1, aVersion=0, aRetValue=0x0, aIsUndefined=0xbfffe1f0) at /builds/slave/trunk_osx/mozilla/dom/src/base/nsJSEnvironment.cpp:1530
#4  0x01373942 in nsScriptLoader::EvaluateScript (this=0x338251c0, aRequest=0x2f29af90, aScript=@0x2f29afa0) at /builds/slave/trunk_osx/mozilla/content/base/src/nsScriptLoader.cpp:582
#5  0x01373d80 in nsScriptLoader::ProcessRequest (this=0x338251c0, aRequest=0x2f29af90) at /builds/slave/trunk_osx/mozilla/content/base/src/nsScriptLoader.cpp:496
#6  0x01373e52 in nsScriptLoader::ProcessPendingRequests (this=0x338251c0) at /builds/slave/trunk_osx/mozilla/content/base/src/nsScriptLoader.cpp:629
#7  0x01373f77 in nsScriptLoader::OnStreamComplete (this=0x338251c0, aLoader=0x319f2d80, aContext=0x2f29af90, aStatus=0, aStringLen=150640, aString=0x2a7eb008 "/***\n\n    MochiKit.MochiKit 1.4 : PACKED VERSION\n\n    THIS FILE IS AUTOMATICALLY GENERATED.  If creating patches, please\n    diff against the source tree, not this file.\n\n    See <http://mochikit.com/"...) at /builds/slave/trunk_osx/mozilla/content/base/src/nsScriptLoader.cpp:804
#8  0x0108d606 in nsStreamLoader::OnStopRequest (this=0x319f2d80, request=0x319f2a8c, ctxt=0x2f29af90, aStatus=0) at /builds/slave/trunk_osx/mozilla/netwerk/base/src/nsStreamLoader.cpp:108
#9  0x010e9ff5 in nsHttpChannel::OnStopRequest (this=0x319f2a60, request=0x2f6ecce0, ctxt=0x2f29af90, status=0) at /builds/slave/trunk_osx/mozilla/netwerk/protocol/http/src/nsHttpChannel.cpp:4443
#10 0x0107547e in nsInputStreamPump::OnStateStop (this=0x2f6ecce0) at /builds/slave/trunk_osx/mozilla/netwerk/base/src/nsInputStreamPump.cpp:576
#11 0x010758d3 in nsInputStreamPump::OnInputStreamReady (this=0x2f6ecce0, stream=0x2f6ecdc8) at /builds/slave/trunk_osx/mozilla/netwerk/base/src/nsInputStreamPump.cpp:401
#12 0x01a65e37 in nsInputStreamReadyEvent::Run (this=0x2f6ecc80) at /builds/slave/trunk_osx/mozilla/xpcom/io/nsStreamUtils.cpp:111
#13 0x018ad974 in nsThread::ProcessNextEvent (this=0x20d10380, mayWait=0, result=0xbfffe4dc) at /builds/slave/trunk_osx/mozilla/xpcom/threads/nsThread.cpp:510
#14 0x01875621 in NS_ProcessPendingEvents_P (thread=0x20d10380, timeout=20) at nsThreadUtils.cpp:180
#15 0x01833c47 in nsBaseAppShell::NativeEventCallback (this=0x25cb5f10) at /builds/slave/trunk_osx/mozilla/widget/src/xpwidgets/nsBaseAppShell.cpp:121
#16 0x01807303 in nsAppShell::ProcessGeckoEvents (aInfo=0x25cb5f10) at /builds/slave/trunk_osx/mozilla/widget/src/cocoa/nsAppShell.mm:302
#17 0x9082a09a in CFRunLoopRunSpecific ()
#18 0x90829b0e in CFRunLoopRunInMode ()
#19 0x92dd8bef in RunCurrentEventLoopInMode ()
#20 0x92dd8234 in ReceiveNextEventCommon ()
#21 0x92dd8154 in BlockUntilNextEventMatchingListInMode ()
#22 0x9327d465 in _DPSNextEvent ()
#23 0x9327d056 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#24 0x93276ddb in -[NSApplication run] ()
#25 0x01806745 in nsAppShell::Run (this=0x25cb5f10) at /builds/slave/trunk_osx/mozilla/widget/src/cocoa/nsAppShell.mm:591
#26 0x016bc7bb in nsAppStartup::Run (this=0x25cd1b50) at /builds/slave/trunk_osx/mozilla/toolkit/components/startup/src/nsAppStartup.cpp:181
#27 0x01011828 in XRE_main (argc=6, argv=0xbffff92c, aAppData=0x20d0b520) at /builds/slave/trunk_osx/mozilla/toolkit/xre/nsAppRunner.cpp:3170
#28 0x00001cc9 in start ()
(gdb) 
Igor and I have looked at this for a while.  The crash is happening as the result of a JSVAL_VOID being cast to an object, but there also seems to be heap corruption happening nearby (fp->script's first 12-16 bytes are nonsense).  Going to try another stab at reproducing here.
Depends on: 432729
Do we have multiple stacks? Are they all in similar places?

Should we try running the tests with the Mac OS malloc debugging environment options enabled?
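Once more than one core is available, one rough way to check whether the stacks are "in similar places" is to reduce each gdb backtrace to its function names and compare from the top. This is a sketch that assumes `bt` output formatted like the one in this bug; the regular expression and function names are illustrative.

```python
import re

# Matches lines such as "#1  0x00240941 in js_Execute (cx=...) at ..."
# and "#0  js_Interpret (cx=...) at ...".
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.+?) \(")


def frame_functions(bt_text):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in bt_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def shared_top_frames(bt_a, bt_b):
    """Return the leading frames that two backtraces have in common."""
    shared = []
    for a, b in zip(frame_functions(bt_a), frame_functions(bt_b)):
        if a != b:
            break
        shared.append(a)
    return shared
```

Comparing names rather than addresses keeps the check stable across rebuilds, at the cost of ignoring line numbers.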
(bug 427878 doesn't strictly depend on this, but it was another mochitest crasher that could be debugged if we're getting cores now.)
Blocks: 427878
(In reply to comment #17)
> Do we have multiple stacks? are they all in similar places?
> 
> Should we try running the tests with the Mac OS malloc debugging environment
> options enabled?
> 

These are both good ideas.  robcee, can you try to collect a few more cores?  Also turn on guard malloc on the test machine?
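As a sketch of what turning these options on could look like, the helper below builds an environment for the test process. The variable names are taken from Apple's malloc(3) and libgmalloc(3) man pages and should be checked against the OS version on the test machine; the helper itself and its defaults are assumptions, not part of the harness.

```python
import os

# Path documented in libgmalloc(3); verify it exists on the target machine.
GUARD_MALLOC_LIB = "/usr/lib/libgmalloc.dylib"


def malloc_debug_env(base=None, guard_malloc=False):
    """Return a copy of the environment with malloc debugging switched on."""
    env = dict(os.environ if base is None else base)
    env["MallocScribble"] = "1"     # overwrite freed memory with a fill pattern
    env["MallocPreScribble"] = "1"  # fill newly allocated memory with a pattern
    env["MallocGuardEdges"] = "1"   # add guard pages around large allocations
    if guard_malloc:
        # Guard Malloc is loaded by inserting its library into the process.
        env["DYLD_INSERT_LIBRARIES"] = GUARD_MALLOC_LIB
    return env
```

Guard Malloc slows a process down considerably, so it is kept behind a flag here rather than enabled for every run.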
sorry fellas, we're blocking everything else with this. I'm wrangling up a second xserve we can run in parallel so we can devote one machine to producing cores. I'll need some time to set up the machine. ETA approx. 12 hours.
Depends on: 432749
Group: security
Any update on this? 

Also, are you still using the loaner qm-xserve06 / bm-xserve09 machine?
this appears to be working now and we are still using qm-xserve06.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Did something actually get fixed, or is this really WFM? Comment 16 sounded bad but I don't see any progress toward fixing that, nor is there likely to be if this bug has the title it does and the status FIXED.
Blast from the past! I don't think it got fixed, but the errors stopped happening or weren't reproducible. Does that match your memory, crowder?

In any case, I bet that code's changed a fair bit since then.
My recollection agrees w/ robcee's, and I think WFM is a better resolution-status than FIXED.
Resolution: FIXED → WORKSFORME
Group: core-security
Product: mozilla.org → Release Engineering