Closed Bug 938073 Opened 12 years ago Closed 12 years ago

Aborted testruns on OS X 10.9 nodes after hanging

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

All
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: AndreeaMatei, Unassigned)

Details

(Whiteboard: [qa-automation-blocked])

All testruns on the OS X 10.9 nodes are timing out. At the moment mm-osx-109-1 is offline and we're investigating the others. Only 10.9 seems to be affected, as the other versions work fine. From what I noticed, after mounting the build the browser opens but no tests start, so the browser stays in that state until the timeout. Example: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/release-mozilla-esr24_remote/442
mm-osx-109-4 had a Firefox instance running that was responding but not running any tests. I restarted mm-osx-109-4 and started Jenkins, which then allowed the other nodes to continue. I found a similar situation on mm-osx-109-2, which I took offline so I could try running the tests locally. The first attempt stalled after printing "Updating branch of test repository to 'mozilla-esr24'" to the console, with a Firefox instance open. I interrupted this, stopped Firefox, and successfully switched to the target branch via the command line. I ran the testrun again and it ran to completion. I suspect an intermittent issue with launching Firefox on 10.9. We should try to replicate this locally and add debug output to identify what's happening. I did notice that com.apple.IconServicesAgent was using over 800MB on mm-osx-109-4, which appears to be a known Mavericks issue: https://discussions.apple.com/thread/5472367?start=0&tstart=0. It may or may not be related to this issue, as other nodes with high values were running successfully, and some nodes did not have such high values.
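For anyone wanting to compare memory use across nodes, here is a minimal sketch that flags processes above a threshold. It assumes the input is the text produced by `ps -axo pid,rss,comm` (RSS reported in kilobytes); the function name `heavy_processes` and the 800MB default are illustrative choices taken from the observation above, not part of the existing automation.

```python
from typing import List, Tuple


def heavy_processes(ps_output: str,
                    threshold_mb: float = 800.0) -> List[Tuple[int, float, str]]:
    """Return (pid, rss_mb, command) for processes at or above threshold_mb.

    Hypothetical helper: expects text in the layout of `ps -axo pid,rss,comm`.
    """
    found = []
    for line in ps_output.splitlines():
        # Split into at most three fields so commands with spaces stay intact.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, rss_kb, command = parts
        # The header row ("PID RSS COMM") and malformed rows are skipped here.
        if not (pid.isdigit() and rss_kb.isdigit()):
            continue
        rss_mb = int(rss_kb) / 1024.0
        if rss_mb >= threshold_mb:
            found.append((int(pid), rss_mb, command.strip()))
    return found
```

Running this on each node and comparing the results for stalled and healthy nodes would show whether high IconServicesAgent usage actually correlates with the hang.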
On an additional remote testrun on mm-osx-109-2 the tests stalled during the restart tests. I will restart this node and mm-osx-109-1 to see if we can get these back online.
Whiteboard: [qa-automation-blocked]
All nodes are back online and appear to be operating well. We need to keep an eye on these nodes and act quickly if we see a node hanging, as it can have a large impact, especially when running on-demand tests. If we see this, find the node that has caused the hang (usually the one with the earliest started build, or the one with an orphaned Firefox instance). Take the node temporarily offline and abort the stalled build. Other builds should then recover and free up the queue. As mentioned, we need to try to replicate this on OS X 10.9 and determine the underlying issue.
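The triage rule above can be sketched roughly as follows. The `RunningBuild` record and its fields are hypothetical and would need to be filled in from whatever the CI master reports; preferring an orphaned Firefox instance over the earliest start time is also an assumption, since the rule above lists both signals without ranking them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningBuild:
    # Hypothetical record; none of these fields come from the Jenkins API.
    node: str
    job: str
    started_at: float  # Unix timestamp of when the build started
    has_orphaned_firefox: bool = False


def find_suspect(builds: List[RunningBuild]) -> Optional[RunningBuild]:
    """Pick the build most likely to be causing the hang, or None if idle."""
    if not builds:
        return None
    # Assumption: a node with an orphaned Firefox is the stronger signal.
    orphaned = [b for b in builds if b.has_orphaned_firefox]
    candidates = orphaned or builds
    # Among the candidates, the earliest started build is the usual culprit.
    return min(candidates, key=lambda b: b.started_at)
```

The node returned here is the one to take offline before aborting its stalled build; the remaining builds should then be left to recover on their own.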
That's strange. Have you had a chance to take a look at the system log? I wonder if some helpful information is listed there. If it happens again, I would suggest not rebooting one of the affected nodes but instead putting it into offline mode. This should only be done for one of them.
(In reply to Henrik Skupin (:whimboo) from comment #4)
> That's strange. Have you had a chance to take a look at the system log? I
> wonder if some helpful information are listed there. If it happens again I
> would propose to not reboot one of the affected nodes but really put it in
> offline mode. This should only be done for one of them.
Yes, the system log had a lot of entries in it, similar to what we see in the Jenkins console. Given all this noise it's difficult to see anything useful. As mentioned above, we did put nodes into offline mode (mm-osx-109-1 was offline for the longest) but needed to restore full capacity for desktop testing.
Could you upload the parts of the log from this phase as an attachment? It would be good to have something for reference and as a starting point for the investigation.
As I said, I believe we first need to attempt to replicate this locally on an OS X 10.9 installation. Andreea has said she will try to sort this out. I don't think uploading the system log for the time of these failures is currently a good use of my time. If it occurs again I will attach the system logs; however, as I said, I didn't see anything there that related directly to the issue.
I have upgraded my machine to 10.9 and it is currently running testruns. I'll reply here if I can reproduce the issue.
I don't think that this issue is easy to reproduce in the short term. You might have to run tests on it for a couple of days.
Did this happen again? If not, let's get this bug closed for now.
I haven't seen this failure this week, so I'm closing this for now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Product: Mozilla QA → Mozilla QA Graveyard