Closed Bug 1087261 Opened 11 years ago Closed 10 years ago

Investigate mm-osx-107-2 for potential issues

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: andrei, Unassigned)

References

Details

For both bug 1055442 and bug 1066032 the root cause seems to lie with the machine mm-osx-107-2. At first glance I can't find anything wrong with it. The hardware is identical to the other 10.7 machines. Software Update is in the same state. top doesn't show any significant differences in load. uptime shows 91-92 days for these machines. ---- Given that we plan to update all 10.7 machines to 10.10 in bug 1059264, this might quickly become a non-issue.
I agree. Not sure if that warrants digging deeper into this. Especially if nothing specific is visible. I still have to fix some things for 10.10 before we can go live, but hopefully it will be soon!
(In reply to Henrik Skupin (:whimboo) from comment #1)
> I agree. Not sure if that warrants digging deeper into this. Especially if
> nothing specific is visible. I still have to fix some things for 10.10
> before we can go live, but hopefully it will be soon!
I think we can leave the machine offline for now. There are not a lot of reported failures on a daily basis, but since at least 2 different issues are clearly visible, there's a possibility there are more... so I'd leave the machine offline for now. If we see an impact (delay) in testrun times on Friday, we can bring it back online. But I don't think it would be necessary.
Maybe we could try with a restart first? Given that it is online for about 90 days it would be worth a try.
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Maybe we could try with a restart first? Given that it is online for about
> 90 days it would be worth a try.
Thought of that. Didn't do it because all 10.7 machines had the same uptime. But let's give it a try. I'll also re-enable it, and if we see it fail again, we can take it back down.
With bug 1119146 this machine will be reinstalled, and the problem will hopefully be gone.
Depends on: 1119146
The machine was reinstalled and I brought it back. Let's keep an eye on it over the next few days, but I hope we do not see the problem again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The two bugs (bug 1055442 and bug 1066032) still reproduce on this machine.
Let's disconnect this machine then!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Van, we recently reinstalled this machine via bug 1119146, but the same problem that was previously seen only on this machine has started to happen again. Could something have survived the reinstall? Or do you have any idea what the problem could be hardware-wise? The configuration is the same as on the other 10.7 boxes.
Flags: needinfo?(vle)
:whimboo, vinh formatted the drive then reimaged the host. it should have been a fresh install. i want to say because of its compact size and lack of good airflow, the components just break down over time. (especially when you it's constantly being hammered by tests.) historically, we begin to see issues with mac minis about 2 years into service. i'll run diagnostics for this host to see if we can find any issues, but more often than not the test comes back 'passed' with no issues. a good example of this is host mm-osx-106-4, which passed our diagnostics and an offsite diagnostics but keeps hanging after a few hours/days.
Flags: needinfo?(vle)
:whimboo/mihaelav, can i take this host offline to run diagnostics?
missed a word here: i want to say because of its compact size and lack of good airflow, the components just break down over time. (especially when you consider it's constantly being hammered by tests.)
Yes, this host is offline so you can use it for testing. I think that this has something to do with the keyboard and mouse events which our tests synthesize. They seem not to reach the appropriate elements in the browser. Please be aware that I will be off for the next two days.
i tried several Apple DVDs with an external and the internal DVD drive but the diagnostic application won't finish probing the hardware so i am unable to run diagnostics on this host. i'm going to drop this off at the Mac repair shop to see if they can find any issues with the host.
they want $200 to replace the optical drive, which may or may not resolve the issue you guys are facing. i think the mac repair shop is running into the same issue i did. they are reporting the optical drive is problematic. it doesn't accept a disk 90% of the time (that's why i had to keep retrying the multiple DVDs we have) so either they can't run the diagnostics as it can't finish probing, or the diagnostics is only showing the optical drive as an issue. fwiw, i used an external DVD drive to run the diagnostics with the same results. we recently decommissioned mm-osx-106-4 as it kept crashing. i think we can take the optical drive from that mac and place it in 107-2 to see if it resolves the issue.
something is wrong with this host. we replaced the optical drive in house from a decommissioned r4 but it still doesn't want to work correctly. we're unable to run diagnostics so i went ahead and formatted the drive, reimaged, and reconfigured the settings. i noticed the host is a little sluggish and a lot of things can contribute to this. the easy FRUs are the hard drives and the memory, which we can replace if you guys don't want to decommission it. on another note, since you guys are debating retiring the 10-6s, can we upgrade the 10-6 to 10-7 and use it to replace this host?
Flags: needinfo?(hskupin)
The decision to shut the 10.6 machines down is not final. So we may have to wait for bug 1119146. Something I wonder is if we can replace this host with a new machine. Reason is that our 10.7 boxes will have to be upgraded to 10.10, and existing 10.6 machines are too old for that release, right?
Flags: needinfo?(hskupin) → needinfo?(vle)
We want to keep the existing 10.6 machines and still run the tests on them. So we would need a replacement for this failing 10.7 node. As mentioned in my last comment it should fulfill the requirement to get 10.10 running on it.
> As mentioned in my last comment it should fulfill the requirement to get 10.10 running on it.
:whimboo, are you guys going to procure the mini or do you guys need help procuring one? since these are older and harder to find, they might cost a little more as well.
(In reply to Van Le [:van] from comment #20)
> :whimboo, are you guys going to procure the mini or do you guys need help
> procuring one? since these are older and harder to find, they might cost a
> little more as well.
I would like to defer this question to David.
Flags: needinfo?(dburns)
10.7 has a low user base so as the machines die we don't need to replace them
Flags: needinfo?(dburns)
>10.7 has a low user base so as the machines die we don't need to replace them
please confirm if i can decommission 107-2 as it was exhibiting issues or if it was resolved in the last reimage/hardware replacement.
Flags: needinfo?(vle)
(In reply to David Burns :automatedtester from comment #22)
> 10.7 has a low user base so as the machines die we don't need to replace them
Just to clarify...
* We have 3 machines left for 10.7. If one more machine dies we will have to stop testing OS X 10.7. I'm not sure if a decision has been made about such a step yet - similar to OS X 10.6 on bug 1119146. Jonathan, when you talked to people was it only about 10.6 or also 10.7/10.8?
* My plan was to upgrade those machines from 10.7 to 10.10 once we are OK with running tests for that version. Does that mean we should request new machines instead? Especially as long as our tests are not run in buildbot or taskcluster?
Flags: needinfo?(jgriffin)
Flags: needinfo?(dburns)
(In reply to Van Le [:van] from comment #23)
> please confirm if i can decommission 107-2 as it was exhibiting issues or if
> it was resolved in the last reimage/hardware replacement.
I set it up again and it's active now. I triggered an example testrun, so let's see how it works: http://mm-ci-production.qa.scl3.mozilla.com:8080/job/mozilla-central_functional/31294/console
It looks like it's working for now. I will observe the machine over the next couple of days.
(In reply to Henrik Skupin (:whimboo) from comment #24)
> (In reply to David Burns :automatedtester from comment #22)
> > 10.7 has a low user base so as the machines die we don't need to replace them
>
> Just to clarify...
>
> * We have 3 machines left for 10.7. If one more machine dies we will have to
> stop testing OS X 10.7. I'm not sure if there has been made a decision about
> such a step yet - similar to OS X 10.6 on bug 1119146. Jonathan, when you
> talked to people was it only about 10.6 or also 10.7/10.8?
Mozilla buildbot infrastructure only tests on 10.6 and 10.10 so we could switch these tests off for 10.7 and 10.8. Since we are going to be driving towards that, starting in Q2, I don't think we should invest much effort in this.
> * My plan was to upgrade those machines from 10.7 to 10.10 once we are ok in
> running tests for that version. That means we should better request new
> machines then? Especially as long our tests are not run in buildbot or
> taskcluster?
This might be good to make sure the transition happens cleanly, but I wouldn't consider it a high priority at the moment.
Flags: needinfo?(dburns)
The relative size of our user base on OSX is: 10.10 > 10.9 > 10.6 > 10.8 > 10.7. Populations on both 10.8 and 10.7 are negligible, so testing those is not a high priority.
Flags: needinfo?(jgriffin)
The node mm-osx-107-2 has been working without problems over the last few days. I think we are good here for now. Given the feedback from Jonathan above, we will work on the transition of the 10.7 boxes to 10.10 around May this year, when I'm back from my PTO.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard