Closed
Bug 1087261
Opened 11 years ago
Closed 10 years ago
Investigate mm-osx-107-2 for potential issues
Categories
(Mozilla QA Graveyard :: Infrastructure, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: andrei, Unassigned)
References
Details
For both bug 1055442 and bug 1066032 the root cause seems to lie with the machine mm-osx-107-2.
At first glance I can't find anything wrong with it. The hardware is identical to the other 10.7 machines, and Software Update shows it in the same state.
top doesn't show any significant differences in load.
uptime shows 91-92 days for these machines.
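A minimal sketch of how the uptime comparison above could be scripted, assuming the usual `up N days` wording in `uptime` output. The sample lines and the second host name are hypothetical and not taken from the actual machines.

```python
import re


def uptime_days(uptime_output: str) -> int:
    """Return the whole days of uptime from an `uptime` line, or 0 if under a day."""
    match = re.search(r"up\s+(\d+)\s+days?", uptime_output)
    return int(match.group(1)) if match else 0


# Hypothetical sample output; only the host name mm-osx-107-2 comes from this bug.
samples = {
    "mm-osx-107-2": "10:12  up 91 days,  3:04, 2 users, load averages: 0.52 0.61 0.58",
    "other-10.7-host": "10:12  up 92 days,  1:17, 2 users, load averages: 0.48 0.55 0.60",
}

days = {host: uptime_days(line) for host, line in samples.items()}
# A small spread means uptime alone does not single out one machine.
spread = max(days.values()) - min(days.values())
```

A spread of a day or so across hosts would match the observation that uptime does not distinguish this machine from the others.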
----
Given that we plan to update all 10.7 machines to 10.10 in bug 1059264, this might quickly become a non-issue.
Comment 1•11 years ago
I agree. Not sure if that warrants digging deeper into this. Especially if nothing specific is visible. I still have to fix some things for 10.10 before we can go live, but hopefully it will be soon!
Comment 2•11 years ago (Reporter)
(In reply to Henrik Skupin (:whimboo) from comment #1)
> I agree. Not sure if that warrants digging deeper into this. Especially if
> nothing specific is visible. I still have to fix some things for 10.10
> before we can go live, but hopefully it will be soon!
There are not a lot of reported failures on a daily basis, but since at least 2 different issues are clearly visible, there may be more. So I'd leave the machine offline for now.
If we see an impact (delay) in testrun times on Friday, we can bring it back online. But I don't think it would be necessary.
Comment 3•11 years ago
Maybe we could try a restart first? Given that it has been online for about 90 days, it would be worth a try.
Comment 4•11 years ago (Reporter)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Maybe we could try a restart first? Given that it has been online for about
> 90 days, it would be worth a try.
Thought of that. Didn't do it because all 10.7 machines had the same uptime.
But let's give it a try. I'll also re-enable it, and if we see it fail again, we can take it back down.
Comment 5•11 years ago
There are still issues on this machine:
http://mozmill-release.blargon7.com/#/functional/failure?app=Firefox&branch=All&platform=Mac&from=2014-10-26&to=2014-11-04&test=%2FtestTabView%2FtestTabGroupNaming.js&func=testTabGroupNaming
Taking the machine offline.
Comment 6•10 years ago
With bug 1119146 this machine will be reinstalled, and the problem will hopefully be gone.
Depends on: 1119146
Comment 7•10 years ago
The machine was reinstalled and I brought it back online. Let's keep an eye on it for the next few days, but I hope we do not see the problem again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 8•10 years ago
The two bugs (bug 1055442 and bug 1066032) still reproduce on this machine.
Comment 9•10 years ago
Let's disconnect this machine then!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•10 years ago
Van, we recently re-installed this machine via bug 1119146, but the same problem that was previously seen only on this machine has started to happen again. Could something have survived the re-install? Or do you have any idea what the problem could be hardware-wise? The configuration is identical to the other 10.7 boxes.
Flags: needinfo?(vle)
Comment 11•10 years ago
:whimboo, vinh formatted the drive then reimaged the host. it should have been a fresh install.
i want to say because of its compact size and lack of good airflow, the components just break down over time. (especially when you it's constantly being hammered by tests.)
historically, we begin to see issues with mac minis about 2 years into service. i'll run diagnostics on this host to see if we can find any issues, but more often than not the test comes back 'passed' with no issues. a good example of this is host mm-osx-106-4, which passed our diagnostics and an offsite diagnostic but keeps hanging after a few hours/days.
Flags: needinfo?(vle)
Comment 12•10 years ago
:whimboo/mihaelav, can i take this host offline to run diagnostics?
Comment 13•10 years ago
missed a word here:
i want to say because of its compact size and lack of good airflow, the components just break down over time. (especially when you consider it's constantly being hammered by tests.)
Comment 14•10 years ago
Yes, this host is offline so you can use it for testing. I think that this has something to do with the keyboard and mouse events which our tests synthesize. They seem not to reach the appropriate elements in the browser.
Please be aware that I will be off the next two days.
Comment 15•10 years ago
i tried several Apple DVDs with an external and the internal DVD drive but the diagnostic application won't finish probing the hardware so i am unable to run diagnostics on this host.
i'm going to drop this off at the Mac repair shop to see if they can find any issues with the host.
Comment 16•10 years ago
they want $200 to replace the optical drive which may or may not resolve the issue you guys are facing.
i think the mac repair shop is running into the same issue i did. they are reporting the optical drive is problematic. it doesn't accept a disc 90% of the time (that's why i had to keep retrying the multiple DVDs we have), so either they can't run the diagnostics because it can't finish probing, or the diagnostics only show the optical drive as an issue. fwiw, i used an external DVD drive to run the diagnostics with the same results.
we recently decommissioned mm-osx-106-4 as it kept crashing. i think we can take the optical drive from that mac and place it in 107-2 to see if it resolves the issue.
Comment 17•10 years ago
something is wrong with this host. we replaced the optical drive in house with one from a decommissioned r4 but it still doesn't want to work correctly. we're unable to run diagnostics so i went ahead and formatted the drive, reimaged, and reconfigured the settings. i noticed the host is a little sluggish and a lot of things can contribute to this. the easy FRUs are the hard drives and the memory, which we can replace if you guys don't want to decommission it.
on another note, since you guys are debating retiring the 10-6s, can we upgrade the 10-6 to 10-7 and use it to replace this host?
Flags: needinfo?(hskupin)
Comment 18•10 years ago
The decision to shut the 10.6 machines down is not final. So we may have to wait for bug 1119146.
I wonder whether we can replace this host with a new machine. The reason is that our 10.7 boxes will have to be upgraded to 10.10, and the existing 10.6 machines are too old for that release, right?
Flags: needinfo?(hskupin) → needinfo?(vle)
Comment 19•10 years ago
We want to keep the existing 10.6 machines and still run the tests on them. So we would need a replacement for this failing 10.7 node. As mentioned in my last comment it should fulfill the requirement to get 10.10 running on it.
Comment 20•10 years ago
> As mentioned in my last comment it should fulfill the requirement to get 10.10 running on it.
:whimboo, are you guys going to procure the mini or do you guys need help procuring one? since these are older and harder to find, they might cost a little more as well.
Comment 21•10 years ago
(In reply to Van Le [:van] from comment #20)
> :whimboo, are you guys going to procure the mini or do you guys need help
> procuring one? since these are older and harder to find, they might cost a
> little more as well.
I would like to defer this question to David.
Flags: needinfo?(dburns)
Comment 22•10 years ago
10.7 has a low user base so as the machines die we don't need to replace them
Flags: needinfo?(dburns)
Comment 23•10 years ago
>10.7 has a low user base so as the machines die we don't need to replace them
please confirm if i can decommission 107-2 as it was exhibiting issues or if it was resolved in the last reimage/hardware replacement.
Flags: needinfo?(vle)
Comment 24•10 years ago
(In reply to David Burns :automatedtester from comment #22)
> 10.7 has a low user base so as the machines die we don't need to replace them
Just to clarify...
* We have 3 machines left for 10.7. If one more machine dies we will have to stop testing OS X 10.7. I'm not sure whether a decision has been made about such a step yet, similar to OS X 10.6 on bug 1119146. Jonathan, when you talked to people, was it only about 10.6 or also 10.7/10.8?
* My plan was to upgrade those machines from 10.7 to 10.10 once we are OK with running tests for that version. Does that mean we should request new machines instead? Especially as long as our tests are not run in buildbot or taskcluster?
Flags: needinfo?(jgriffin)
Flags: needinfo?(dburns)
Comment 25•10 years ago
(In reply to Van Le [:van] from comment #23)
> please confirm if i can decommission 107-2 as it was exhibiting issues or if
> it was resolved in the last reimage/hardware replacement.
I set it up again and it's active now. I triggered an example testrun, so let's see how it works:
http://mm-ci-production.qa.scl3.mozilla.com:8080/job/mozilla-central_functional/31294/console
Comment 26•10 years ago
It looks like it's working for now. I will observe the machine over the next couple of days.
Comment 27•10 years ago
(In reply to Henrik Skupin (:whimboo) from comment #24)
> (In reply to David Burns :automatedtester from comment #22)
> > 10.7 has a low user base so as the machines die we don't need to replace them
>
> Just to clarify...
>
> * We have 3 machines left for 10.7. If one more machine dies we will have to
> stop testing OS X 10.7. I'm not sure if there has been made a decision about
> such a step yet - similar to OS X 10.6 on bug 1119146. Jonathan, when you
> talked to people was it only about 10.6 or also 10.7/10.8?
Mozilla buildbot infrastructure only tests on 10.6 and 10.10, so we could switch these tests off for 10.7 and 10.8. Since we are going to be driving towards that, starting in Q2, I don't think we should invest much effort in this.
>
> * My plan was to upgrade those machines from 10.7 to 10.10 once we are ok in
> running tests for that version. That means we should better request new
> machines then? Especially as long our tests are not run in buildbot or
> taskcluster?
This might be good to make sure the transition happens cleanly, but I wouldn't consider it a high priority at the moment.
Flags: needinfo?(dburns)
Comment 28•10 years ago
The relative size of our user base on OSX is: 10.10 > 10.9 > 10.6 > 10.8 > 10.7. Populations on both 10.8 and 10.7 are negligible, so testing those is not a high priority.
Flags: needinfo?(jgriffin)
Comment 29•10 years ago
The node mm-osx-107-2 has been working again without problems over the last few days. I think we are good here for now.
Given the feedback from Jonathan above, we will work on the transition of 10.7 boxes to 10.10 around May this year when I'm back from my PTO.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Mozilla QA → Mozilla QA Graveyard