Closed Bug 808671 Opened 12 years ago Closed 12 years ago

Testruns on Linux nodes are getting aborted due to unknown reasons

Categories

(Mozilla QA Graveyard :: Mozmill Automation, defect)

All
Linux
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)


So far this has only been seen on our Linux 64-bit machine [mm-ub-1204-64-1 (10.250.73.246)]. Under /tmp, the binary, the profile, and the mozmill-tests are not getting removed. I'm not sure yet under which conditions that happens, but we have to fix it ASAP. I will look into this issue in a bit.
It looks like this problem appears whenever Jenkins itself has to cancel a testrun. On the Linux 64-bit machine this happens for the functional testrun, especially in the testAddons_enableDisableExtension/test2.js test. It hangs and Mozmill doesn't kill the application. Not sure why yet.
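As a stopgap until the root cause is known, leftover testrun directories could be swept from /tmp by age. The following is a minimal Python sketch, not what mozmill-ci actually does: the function names, the age threshold, and the `.profile` suffix are assumptions for illustration (only the `.mozmill-tests` and `.binary` suffixes appear in the logs of this bug).

```python
import os
import shutil
import time

# ".mozmill-tests" and ".binary" are taken from the logs in this bug;
# ".profile" is an assumed suffix for the profile directory.
STALE_SUFFIXES = (".mozmill-tests", ".binary", ".profile")


def find_stale_dirs(tmp_root, max_age_seconds, suffixes=STALE_SUFFIXES, now=None):
    """Return tmp* directories under tmp_root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(tmp_root)):
        path = os.path.join(tmp_root, name)
        if not (name.startswith("tmp") and name.endswith(suffixes)):
            continue
        # Only real directories are considered; symlinks are skipped.
        if os.path.islink(path) or not os.path.isdir(path):
            continue
        if now - os.path.getmtime(path) > max_age_seconds:
            stale.append(path)
    return stale


def remove_stale_dirs(tmp_root, max_age_seconds, suffixes=STALE_SUFFIXES, now=None):
    """Delete stale testrun directories; return the paths that are gone afterwards."""
    removed = []
    for path in find_stale_dirs(tmp_root, max_age_seconds, suffixes, now):
        shutil.rmtree(path, ignore_errors=True)
        if not os.path.exists(path):
            removed.append(path)
    return removed
```

For example, `remove_stale_dirs("/tmp", 2 * 60 * 60)` would delete matching directories that have not been modified for more than two hours. The threshold should be longer than the 60 minute build timeout so that directories of running testruns are left alone.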

http://10.250.73.243:8080/job/mozilla-central_functional/1582/console

In some cases the run continues but fails here:

TEST-START | /tmp/tmpB8obr2.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test3.js | setupModule
WARNING | test3.js::setupModule | (SKIP) Architecture changes only supported on OSX 10.6
TEST-START | /tmp/tmpB8obr2.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test3.js | tBuild timed out (after 60 minutes). Marking the build as aborted.
TEST-START | /tmp/tmp7Whhgc.mozmill-tests/tests/functional/restartTests/testAddons_enableDisableExtension/test2.js | testDisableExtension

TEST-PASS | /tmp/tmp7Whhgc.mozmill-tests/tests/functional/restartTests/testAddons_enableDisableExtension/test2.js | test2.js::testDisableExtenBuild timed out (after 60 minutes). Marking the build as aborted.
Build was aborted
Recording test results
No emails were triggered.
Finished: ABORTED


I wish we would send out emails for aborted runs. Dave, would that be possible? Can we get this filed as a mozmill-ci issue?
Summary: Out of disk space due to testrun files are not getting removed → Out of disk space on Linux 64 node because testrun files are not getting removed when Jenkins aborts a testrun
The next task here is to figure out why Mozmill is not able to shut down the browser in this situation. Most likely I will have to file a Mozmill bug for it.
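For illustration, this is roughly the behaviour expected from the harness once the timeout is reached: terminate the application and everything it spawned. A minimal Python sketch for POSIX systems follows. It is not Mozmill's code, and `run_with_timeout` and its parameters are hypothetical names.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds, grace_seconds=5):
    """Run cmd; if it exceeds timeout_seconds, kill its whole process group.

    Returns (returncode, timed_out). returncode is None if the process
    could not be reaped even after SIGKILL. POSIX only.
    """
    # start_new_session puts the child into its own process group, so
    # helper processes it spawns can be signalled together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=timeout_seconds)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        pass

    # Ask politely first, then force the issue.
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # the group is already gone
        try:
            proc.wait(timeout=grace_seconds)
            break
        except subprocess.TimeoutExpired:
            continue
    return proc.poll(), True
```

A call such as `run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 1)` returns with `timed_out` set to `True` after about a second instead of blocking for the full minute.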
(In reply to Henrik Skupin (:whimboo) from comment #2)
> I wish we would send out emails for aborted runs. Dave, would that be
> possible? Can we get this filed as a mozmill-ci issue?

Not only possible, but relatively easy. I've raised https://github.com/mozilla/mozmill-ci/issues/182 for this.
Flags: needinfo?(hskupin)
So it's not a general issue with Linux64 but only for this specific node. In a local VM it works as expected.
Those errors are not related to this bug but bug 808548. I have reopened the other one.
This seems to happen across Linux nodes, so not only the 64-bit one is affected. We got a couple of those reports this morning:

32 bit: http://10.250.73.243:8080/job/mozilla-aurora_functional/1637/
64 bit: http://10.250.73.243:8080/job/mozilla-aurora_functional/1636/

Mainly we fail in 'testAddons_enableDisableExtension/test2.js' but also in 'testAddons_RestartlessExtensionWorksAfterRestart'. Both cause a hang in all of the cases and Mozmill is not able to shut down the browser.

Andrea, please file individual bugs for each of the cases under mozmill-tests and mark them dependent on this bug. Thanks!
Component: Mozmill Automation → Mozmill
Product: Mozilla QA → Testing
Hardware: x86_64 → All
Summary: Out of disk space on Linux 64 node because testrun files are not getting removed when Jenkins aborts a testrun → Testruns on Linux nodes are getting aborted due to Mozmill not being able to shutdown the application after the global timeout
Whiteboard: [mozmill-1.5.20?][mozmill-2.0?]
This morning an endurance testrun got aborted, but no test was run:
http://10.250.73.243:8080/job/mozilla-aurora_endurance/1644/

updating to branch default
401 files updated, 0 files merged, 0 files removed, 0 files unresolved
*** Installing 2012-11-13-04-20-14-mozilla-aurora-firefox-18.0a2.fr.win32.installer.exe => c:\docume~1\mozilla\locals~1\temp\tmpakaznl.binary\
*** Application: Firefox 18.0a2
*** Updating to branch 'mozilla-aurora'
pulling from mozmill-tests
searching for changes
no changes found
37 files updated, 0 files merged, 0 files removed, 0 files unresolved
Build timed out (after 60 minutes). Marking the build as aborted.
Build was aborted
Recording test results
No test report files were found. Configuration error?
Email was triggered for: Aborted
Sending email for trigger: Aborted
Sending email to: mozmill-ci@mozilla.org
Finished: ABORTED
Oh wow! So that's not related to any type of testrun but seems to be a general issue with the VM or the Jenkins master<->slave connection.

Dave, have you ever seen something like that?
I have not. I wouldn't expect a master/slave issue to cause a build to hang though. Has anyone witnessed this issue occurring? I wonder what is present during this time.
This may be related to bug 797389. Alex is going to demonstrate a hang to me now.
(In reply to Dave Hunt (:davehunt) from comment #13)
> This may be related to bug 797389. Alex is going to demonstrate a hang to me
> now.

I don't think so. Two of the referenced tests do not make use of a user shutdown.

When it happens the browser hangs. Not sure what else I should look for. Any ideas?
Since yesterday afternoon, we have about 10 aborted testruns. Here are the links for restart tests:
* http://10.250.73.243:8080/job/mozilla-central_functional/1834/
* http://10.250.73.243:8080/job/mozilla-aurora_functional/1680/
* http://10.250.73.243:8080/job/mozilla-central_functional/1826/
* http://10.250.73.243:8080/job/mozilla-aurora_functional/1673/
* http://10.250.73.243:8080/job/mozilla-aurora_functional/1663/

What I see now is that this also happened on the non-restart tests, failing at testPreferences/testPreferredLanguage.js:
* http://10.250.73.243:8080/job/mozilla-central_functional/1835/
* http://10.250.73.243:8080/job/mozilla-central_functional/1833/
* http://10.250.73.243:8080/job/mozilla-central_functional/1807/
* http://10.250.73.243:8080/job/mozilla-central_functional/1806
* http://10.250.73.243:8080/job/mozilla-central_functional/1801/

This is the most detailed error:

TEST-START | /tmp/tmpE9GcwZ.mozmill-tests/tests/functional/testPreferences/testPreferredLanguage.js | setupModule
TEST-PASS | /tmp/tmpE9GcwZ.mozmill-tests/tests/functional/testPreferences/testPreferredLanguage.js | testPreferredLanguage.js::setupModule
TEST-START | /tmp/tmpE9GcwZ.mozmill-tests/tests/functional/testPreferences/testPreferredLanguage.js | testSetLanguages
TEST-PASS | /tmp/tmpE9GcwZ.mozmill-tests/tests/functional/testPreferences/testPreferredLanguage.js | testPreferredLanguage.js::testSetLanguages
TEST-START | /tmp/tmpE9GcwZ.mozmill-tests/tests/functional/testPreferences/teNOTE: child process received `Goodbye', closing down
WARNING: waitpid failed pid:31116 errno:10: file /builds/slave/m-cen-lnx64-ntly/build/ipc/chromium/src/base/process_util_posix.cc, line 260
WARNING: waitpid failed pid:31116 errno:10: file /builds/slave/m-cen-lnx64-ntly/build/ipc/chromium/src/base/process_util_posix.cc, line 260
WARNING: Failed to deliver SIGKILL to 31116!(3).: file /builds/slave/m-cen-lnx64-ntly/build/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc, line 118
Build timed out (after 60 minutes). Marking the build as aborted.
Build was aborted
Recording test results
Email was triggered for: Aborted
Sending email for trigger: Aborted
Sending email to: mozmill-ci@mozilla.org
Finished: ABORTED

I will file a separate bug for it.
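To find out which test a run was in when it got aborted, the console output can be scanned for the last TEST-START without a matching result line. A rough Python sketch, assuming the `TEST-<STATUS> | <file> | <name>` layout seen in the logs above and that every status other than TEST-START reports a result; `last_unfinished_test` is a made-up helper, not part of Mozmill or Jenkins.

```python
def last_unfinished_test(log_text):
    """Return the fields of the last TEST-START line that has no matching
    result line after it, or None if every started test reported a result.

    The returned list is normally [file, test_name]; it can be shorter
    when the TEST-START line itself was truncated in the log.
    """
    pending = None
    for line in log_text.splitlines():
        if not line.startswith("TEST-"):
            continue
        fields = [field.strip() for field in line.split(" | ")]
        if fields[0] == "TEST-START":
            pending = fields[1:]
        elif pending is not None and len(fields) >= 3 and len(pending) >= 2:
            # Result lines name the test as "<file>::<test>".
            same_file = fields[1] == pending[0]
            same_test = fields[2].split("::")[-1] == pending[1]
            if same_file and same_test:
                pending = None
    return pending
```

Because Jenkins output can be interleaved, as in the logs above, a truncated result line leaves its test reported as unfinished. That is still a useful pointer to where the run was when it was aborted.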
Looks like Mozmill isn't involved at all here. So moving back to automation for now. I have restarted the box and will watch it today and do some trial runs. If it still happens I hope to be able to find the application in such a state.
Component: Mozmill → Mozmill Automation
Product: Testing → Mozilla QA
Whiteboard: [mozmill-1.5.20?][mozmill-2.0?]
When Firefox is running tests and the process is halted, the application is not frozen and doesn't hang in any way. It just sits around and does nothing. I will try to nail this down. This case could possibly be related to the userShutdown issue.
From what we have seen, the machines are extremely slow to respond. It's somewhat similar to what we discovered earlier with VMware Fusion: the host's memory fills up and the VM can no longer function properly. A manual run of 'purge' always fixed the problem for us.

I'm going to make use of the Linux VMs on qa-set again. I hope that will fix the problem until the new ESX cluster can be used.
I haven't swapped the machines yet, but I updated the Ubuntu 12.04 64-bit VM in ESX with the latest software, which also upgraded Java to 7.x. For now we do not see those aborts anymore. Not sure if it is because of the Java upgrade or the restart. I will do the same for the 32-bit machine and watch the results over the next day or two.
No longer blocks: 811239, 811241, 811296, 812099
Depends on: 814430
This bug doesn't depend on bug 814430; it's independent. I have also updated the Ubuntu 32-bit VM, and so far I can only see aborts due to bug 814430. If it stays that way we should be done here.
No longer depends on: 814430
Summary: Testruns on Linux nodes are getting aborted due to Mozmill not being able to shutdown the application after the global timeout → Testruns on Linux nodes are getting aborted due to unknown reasons
I call this done. There are no aborts anymore other than the known ones from the functional testrun, which are covered by bug 814430.

http://10.250.73.243:8080/computer/mm-ub-1204-32-1/builds
http://10.250.73.243:8080/computer/mm-ub-1204-64-1/builds
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard