Closed Bug 873566 Opened 7 years ago Closed 7 years ago

Determine what is the matter with these Windows 7 machines

Categories

(Infrastructure & Operations :: DCOps, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Attachments

(4 files)

It seems that four machines are having graphics issues which have caused a lot of the intermittent oranges since yesterday:
* t-w732-ix-012
* t-w732-ix-016
* t-w732-ix-043
* t-w732-ix-045
No longer blocks: win7-ix-releng
Depends on: t-w732-ix-070
Priority: -- → P3
I will not be able to look at these before I head out.
Assignee: armenzg → nobody
Whiteboard: [buildduty]
t-w732-ix-011 and t-w732-ix-070 also seem to suffer from this.
This bug should be with DCops so they can investigate all these machines as a class and see whether there is a common theme in the failures: bad graphics card batch, faulty power connect to rack, etc.
Assignee: nobody → server-ops-dcops
Component: Release Engineering: Machine Management → Server Operations: DCOps
QA Contact: armenzg → dmoore
>This bug should be with DCops so they can investigate all these machines as a class and see whether there is a common theme in the failures: bad graphics card batch, faulty power connect to rack, etc.

Every single one of these iX systems has a video card. I don't think 4 suspect video cards out of ~300-400 indicate a bad batch. That's about a 1% failure rate.
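
As a rough check of that figure, the rate can be worked out for both ends of the fleet-size estimate. This is only a sketch: the function name is made up for illustration, and 300 and 400 are simply the two bounds of the estimate above, not exact inventory counts.

```python
def failure_rate_percent(suspect, fleet_size):
    """Return the suspect units as a percentage of the whole fleet."""
    return 100.0 * suspect / fleet_size


# 4 suspect cards against the low and high estimates of the fleet size.
for fleet_size in (300, 400):
    rate = failure_rate_percent(4, fleet_size)
    print("%d machines: %.2f%% suspect" % (fleet_size, rate))
```

Either way the rate stays between 1% and 1.5%, which is why a bad batch looks unlikely from the numbers alone.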

These servers are installed 4 to a chassis with dual PSUs, so faulty power would affect all 4 nodes, unless the specific slot happens to be bad. Are the servers stable? Any random reboots? I'm not even sure what intermittent orange means from comment #1. Currently, the servers are all pinging.

I am not aware of any video card testing utilities, but I'd be more than happy to RMA these video cards for you. Are there any logs that can help us diagnose the problem, or that we can pass along to iX?
Whiteboard: [buildduty]
Armen, can you help out DCOps with answers to their questions?
Flags: needinfo?(armenzg)
colo-trip: --- → scl3
I will put it on me to give you some more info.
Assignee: server-ops-dcops → armenzg
Flags: needinfo?(armenzg)
Hi van,
Can you guys have a look at the graphics setup for machines 43, 45 & 70?

I will look at the other machines meanwhile.

t-w732-ix-011 - it seems to be fine - more releng investigation is needed
t-w732-ix-012 - it seems to be fine - more releng investigation is needed
t-w732-ix-016 - it seems to be fine - more releng investigation is needed
t-w732-ix-043 - small resolution & 2 monitors
t-w732-ix-045 - small resolution & 2 monitors
t-w732-ix-070 - small resolution & 2 monitors - IPMI does not work
Assignee: armenzg → server-ops-dcops
t-w732-ix-0[43,45,70] - fixed the small resolution, unsure how to get rid of the 2 monitors. I've made the add-on the primary video output and it's displaying at 1920x1200. BIOS also redirects to add-on video card.

Q: do you have any input on how to get rid of the 2 monitors issue?

also fixed IPMI on t-w732-ix-070.
Attached image Bad color depth
The following machines are also affected:
t-w732-ix-115
t-w732-ix-015

I have also been having intermittent failures in a test that uses WebGL (see bug 893715).

"Error: WebGL: Error during ANGLE OpenGL ES initialization"

The attached screenshot clearly shows graphics issues.

This is normally an issue with the Direct3D driver.
So is this something DCOps can help with, or is this more of a driver issue?
Depends on: t-w732-ix-115
(In reply to Michael Ratcliffe [:miker] [:mratcliffe] from comment #9)
> The following machines are also affected:
> t-w732-ix-115

This machine was recently re-imaged in bug 890035 and after that it messed up a bunch of jobs and got disabled.

I see two monitors and very few screen resolutions.
http://cl.ly/QfRQ

van, can you please look at this machine?

> t-w732-ix-015
> 
It is still taking jobs. If I connect through VNC I don't see anything weird in the screen.
The setup also seems fine (1 monitor; many resolutions):
http://cl.ly/QfSE

I don't think there's anything to be done for this second machine.
:armenzg, i was able to get t-w732-ix-115 to see 1900x1200 resolution. However it still displays 2 screens even though nothing else is connected. I still can't get rid of it. Everything I've read says it's a bug/issue with video drivers and we should try to reinstall the drivers. Is this something we can try? Can you point me to the drivers you guys are using or can you reinstall it remotely?
Adding in Mark in case he has any ideas here.
(In reply to Van Le [:van] from comment #12)
> :armenzg, i was able to get t-w732-ix-115 to see 1900x1200 resolution.
> However it still displays 2 screens even though nothing else is connected. I
> still can't get rid of it. Everything I've read says it's a bug/issue with
> video drivers and we should try to reinstall the drivers. Is this something
> we can try? Can you point me to the drivers you guys are using or can you
> reinstall it remotely?

I think it is best if we can re-image to see if it fixes it. I prefer to start from a known state.

Nevertheless, if you believe it is better to update the drivers and see, I believe Mark can let us know what is being used.
Just a quick note that we have seen certain iX hosts that exhibit inconsistent video behavior. This was a problem when writing the original video switch software. In some cases we could get the secondary video card recognized without disabling the on-board card, or we would still see the on-board card as active even though it was disabled in BIOS. These anomalies persisted across reboots and re-images.
Hmm, I seem to have missed this bug when I was originally added as a cc.

Is there one of the problematic machines I can jump on and take a look at?
(In reply to Mark Cornmesser [:markco] from comment #16)
> Hmm, I seem to have missed this bug when I was originally added as a cc.
> 
> Is there one of the problematic machines I can jump on and take a look at?

Feel free to look at any of them - they're all offline. Apologies for the long delay in response.
Van, do we have spare video cards on site?  If so, can we try swapping in a new one to see if that fixes the issue?
:arr - We still have the 25+ video cards that you ordered extra.  I can swap one out when I'm onsite.
Thanks Vinh.
So I've noticed that t-w732-ix-043, 045, and 070 all display 2 monitors, but once you disable the Matrox G200eW (Winbond) in Device Manager, they revert to 1 monitor. Upon re-enabling the Matrox G200eW, the 2-display issue returns. These three hosts have the older third-party video card (NOT the GT-610).

I then examined three other hosts that have the GT-610 video card installed and saw that they only detected 1 monitor. In Device Manager, all three showed an error message under the Matrox G200eW: "Windows has stopped this device because it has reported problems (code 43)." In effect, the Matrox G200eW is in a disabled state.

With that said, do you still want me to swap out the video card and replace it with the GT-610?
Vinh: hang on a second, all of the production machines should have identical video cards. Are you saying that some of them are different?
arr: Yup, some have GT-610 graphics cards and some do not. See the attached snapshots.
Attached image GT-610.JPG
Attached image non-gt-610.JPG
I also noticed that on hosts that have the lower graphics cards, once I physically remove the card and default to the Matrox G200eW, I am able to see the console through IPMI (instead of just a black screen). Currently t-w732-ix-043.wintest.releng.scl3.mozilla.com has no third-party graphics card installed and works with IPMI.
Hey Vinh, when you have a moment, could you please check and see what graphics cards are in t-w732-ix-003 and 004? And could you give me a heads up beforehand? I am working on both of the machines currently.

The variation in cards may be a key to another issue.
Side note (post this bug): if we find a way to write a script or a check that verifies the right graphics card is in use on a Win7 machine, I could add it to the "pre-flight" tasks (bug 712206) to prevent a misconfigured machine from taking a job.
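
A minimal sketch of what such a check might look like, in Python. It assumes the adapter names and a status string have already been collected on the machine (for example from something like the Name and Status fields of WMI's Win32_VideoController); the collection step, the function name, and the sample status values are all hypothetical and not taken from any existing pre-flight code. The rule it encodes is the one observed in this bug: a healthy machine has exactly one working adapter, and it is the NVIDIA card.

```python
def video_config_ok(adapters, wanted="NVIDIA"):
    """Return True when exactly one adapter is working and it is the wanted one.

    adapters: list of (name, status) tuples gathered elsewhere (hypothetical
    input format). A status of "OK" is assumed to mean the device is running;
    anything else (for example a stopped on-board card) counts as inactive.
    """
    active = [name for name, status in adapters if status.upper() == "OK"]
    return len(active) == 1 and wanted.lower() in active[0].lower()


# Sample data is illustrative only; real status strings may differ.
healthy = [("NVIDIA GeForce GT 610", "OK"), ("Matrox G200eW (Winbond)", "Error")]
two_monitors = [("NVIDIA GeForce GT 610", "OK"), ("Matrox G200eW (Winbond)", "OK")]

print(video_config_ok(healthy))
print(video_config_ok(two_monitors))
```

A pre-flight task could refuse to start the job whenever this returns False. Note that machines known to work with both adapters active would also be flagged by this rule, so it may need an exception list.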
Vinh: so that's device drivers, not the physical card. Can you please check the *physical cards* to make sure they're different?
*Maybe* t-w732-ix-118 might have an issue as well.
:arr - Ok physically they are the same card. Just the drivers were showing differently.
Hey Vinh,

We need to make sure the on-board card is disabled in BIOS. The machines throwing the code 43 error are correct in their configuration, since the on-board card cannot start due to BIOS settings. There is a documented oddity with these machines and Windows 7: if the primary video card is enabled in BIOS, the secondary graphics card isn't detected correctly. Can we get the BIOS settings checked, and possibly have the on-board graphics flipped back and forth on the problematic machines until they report the same code 43 error?
Q - Flipping the on-board graphics back and forth in BIOS did not produce the 43 error code. Looks like I have to disable the on-board graphics drivers under device manager in order to get it to display 1 monitor. Should I go ahead and do the same for all the problematic machines?
I hope to avoid that, but let's make that a good plan B. Is there any difference in the BIOS versions on those machines? I keep coming back to the oddity that machines with the same hardware, imaged the same way with the same OS, behave differently.
A more comprehensive update for what we're seeing and the short term and long term plans:

1) Machines would ONLY be affected if they're reimaged, not merely rebooted. A machine running tests will not encounter this issue in normal operation.

2) This issue does not always occur after a reimage, and we're not sure what causes it to occur when it does. We continue to look into that.

3) Out of the 130 w7 machines we imaged, fewer than 10 had this issue, and we were speculating that it might be hardware or configuration related on those few machines. We're apparently hitting this when we reimage now as well, though.

4) Visual verification can be done to make sure that the machine is configured correctly after being reimaged, and we will add steps to the documentation to show dcops (and releng, if they are self-servicing) how to correct the drivers if they are not configured correctly.

5) We believe that the lower-level C++ software that we've been working on writing should fix this issue; we've just been blocked on getting the contiguous cycles to finish this work.
What are the chances we could get 115 re-imaged?
Flags: needinfo?
115 currently re-imaging.
Flags: needinfo?
Thanks
Hmmm...looks like 115 has failed the re-imaging process twice. It is stuck at the login screen with an incorrect password, showing ".\cltbld". Going to attempt another try.
I saw that too. I also saw some network name lookup errors on the server. Can you run 115 through one more time?
115 looks good now. Can you unplug the Dell monitor for me?
Looks like we have a few issues here.

1) Some machines are behaving oddly when they have the full correct config 

2) Some newly re-imaged machines are still seeing the on-board card even though it is disabled

3) Some machines are failing to get the full NVIDIA driver install when being re-imaged. I think this may be due to file corruption in the NVIDIA install package; I am working on verifying that. Fixing the driver install issue should correct any issues we see with "known good" machines getting re-imaged (t-w732-ix-115 and t-w732-ix-005 were corrected with the driver fix).
This is an error from t-w732-ix-043 that states it can't find an NVIDIA card. I have only seen this on 043; maybe a re-seat is in order?
Flags: needinfo?(vhua)
118 also suffered from the driver issue. I have written a script to identify the problem and I am going to loop all the win7 machines today.
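
The actual script is not attached to this bug. As a rough illustration only, a loop over the Win7 hosts could be sketched in Python as below. Everything here is an assumption: the hosts are taken to be numbered 001 through 130 without gaps, the domain is copied from the FQDN quoted earlier for t-w732-ix-043, and `check` stands in for whatever per-host driver test is really used.

```python
def win7_hostnames(count=130, domain="wintest.releng.scl3.mozilla.com"):
    """Build FQDNs t-w732-ix-001 .. t-w732-ix-<count> (assumed contiguous)."""
    return ["t-w732-ix-%03d.%s" % (n, domain) for n in range(1, count + 1)]


def scan(hostnames, check):
    """Run check(hostname) on every host and return the ones that fail.

    check is a caller-supplied callable (hypothetical) returning True when
    the driver setup looks right. Hosts that cannot be reached are flagged
    too, so that they get a manual look rather than being silently skipped.
    """
    flagged = []
    for host in hostnames:
        try:
            ok = check(host)
        except OSError:
            ok = False
        if not ok:
            flagged.append(host)
    return flagged


# Usage with a stand-in check that flags a single host, for illustration.
def fake_check(host):
    return not host.startswith("t-w732-ix-118.")

print(scan(win7_hostnames(), fake_check))
```

Keeping the per-host check separate from the loop means the same loop can run a passive, read-only check on machines that are in the middle of a test job.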
I've reseated the graphics card on t-w732-ix-043.  Give it a try now.
Flags: needinfo?(vhua)
Looks like it finds the card now.
Armen, can 043 get dropped back into the queue? It should be good now.
Flags: needinfo?(armenzg)
Rebooted into production.

We should see it take jobs in the next hour or so:
https://secure.pub.build.mozilla.org/buildapi/recent/t-w732-ix-043
Flags: needinfo?(armenzg)
The report is almost complete. It is taking time because I am doing passive checks so that I don't interrupt any tests. I will manually fix any driver issues today and supply a list of affected and fixed machines, so they can be put back into use with a cautious eye on them.
I see mochitests passing
Out of 130 machines the following currently have driver issues:
T-W732-IX-016
T-W732-IX-017
T-W732-IX-045
T-W732-IX-070
T-W732-IX-100

I am working on correcting these tonight, and they should be ready to go back into production tomorrow.
The following are all set:
T-W732-IX-016
T-W732-IX-017
T-W732-IX-070
T-W732-IX-100

I am saving T-W732-IX-045 for some quick control testing in the morning.
All the below T-W732-IX- machines should be able to be added back into production as of now:
015
016
017
043
045
070
100
115

Can we get them added? I will keep an eye on them today.
Flags: needinfo?(armenzg)
I'm surprised that #015 came into the list as it has been in production like forever and the sheriffs would have caught it if it failed a lot of jobs.

I don't see the following machines in the list:
t-w732-ix-011
t-w732-ix-012
t-w732-ix-118
Should I move them to a separate bug to determine what malady they might have?
Maybe they got fixed in one of the comments above?
http://cl.ly/RLdb - Can you please have a look at this screenshot?
Maybe they should have not been added to this bug? (comment 30)

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-011
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-012
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-118
############

> 015 - This machine has been in production for many months and last 100 jobs seemed to be good
> 016 - Rebooted into production
> 017 - Rebooted into production
> 043 - Rebooted into production (yesterday)
> 045 - Rebooted into production
> 070 - IPMI credentials do not work - Can someone please fix it?
> 100 - IPMI credentials do not work - Can someone please fix it?
> 115 - Rebooted into production

WRT the IPMI credentials, I can move that to another bug to avoid the orthogonal noise.

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-015
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-016
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-017
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-043
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-045
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-070
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-100
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-115
Flags: needinfo?(armenzg)
118 is good now.

011 and 012 are some of the odd machines that will work without disabling the on-board card. However, they are set to use the NVIDIA graphics adapter and should be able to go back in.
(In reply to Q from comment #55)
> 118 is good now.
> 
> 011 and 012 are some of the odd machines that will work without disabling
> the on-board card. However, they are set to use the NVIDIA graphics adapter
> and should be able to go back in.

All have been rebooted into production.

I believe all we have to do is wait to see what the results of all these hosts are by tomorrow and close it, right?
Yes Sir, that is correct.
Such optimism!

Results I've seen so far: 

011 fails to start up the browser in talos, and fails reftests with radius corners
012 may be okay
015 remains just fine
016 failed to start up the browser in a talos job
017 failed to start up in several talos jobs, and failed a whole bunch of tests (https://tbpl.mozilla.org/php/getParsedLog.php?id=27802553&tree=Mozilla-Inbound) that I've never seen fail before, looking like maybe the clipboard wasn't working right, or something was in the way
043 may be okay (one explicable failure, one inexplicable failure but on Thunderbird so I have no opinion)
045 is either okay or hasn't taken enough jobs to fail yet
070 is the same, not enough jobs to say
100 has the same failure to start talos jobs
115 does too
118 does too

If I understood correctly, Q was saying the failure to start has something to do with Apache.
Reviewing philor's assessment and proposing action.

012 is showing test failures like this: http://cl.ly/RMfj
 |--> disabled for now

philor, should we disable 016, 017, 100, 115 & 118?
Probably check the graphics setup once more.
Probably re-image and try on staging?

045 & 070 - we need more jobs

Q, do you have more info wrt Apache? I can try to run a job manually and see what it looks like when connected to the machine.
Flags: needinfo?(q)
Flags: needinfo?(philringnalda)
Re Apache: I saw Apache errors on almost all the machines; it looks like there was a config error on several machines, now corrected. The failed calls were to localhost/xtalos/*

As far as the graphics errors go, let's get 012 re-imaged. Let me look at the other failures, as the graphics config *seems* good.
Flags: needinfo?(q)
(In reply to Q from comment #60)
> Re Apache: I saw Apache errors on almost all the machines; it looks like
> there was a config error on several machines, now corrected. The failed calls
> were to localhost/xtalos/* 
> 
> As far as the graphics errors let's get 012 re-imaged. Let me look at the
> other failures as the graphics config *seems* good.

To clarify this, we should not see the talos issues anymore. If we do, let Q know.
(In reply to Q from comment #60)
> 
> As far as the graphics errors let's get 012 re-imaged. Let me look at the
> other failures as the graphics config *seems* good.

Filed bug 916156 to re-image 012.
Looks like what remains is that 011 and 012 are odd and do not have working graphics, 017 is flaky and needs diagnostics (or to have the dust blown out of its fans, or to have its RAM reseated, or whatever), and 070 took just one job (which got cancelled). I'd fight those separately, and close this bug.
Flags: needinfo?(philringnalda)
Depends on: t-w732-ix-017
Thanks philor.

011 - to be re-imaged in bug 916239
012 - to be re-imaged in bug 916156
017 - to be investigated in bug 916840
* "017 is flaky and needs diagnostics (or to have the dust blown out of its fans, or to have its RAM reseated, or whatever)"
070 - needs more jobs
118 - From dep bug: "And exactly like it did in May, it fails reftests with a couple of pixels a bit off."
* to be re-imaged in bug 916837 and maybe some diagnostics (memory test or graphics replacement)
t-w732-ix-070 needs DCOps investigation as well - bug 916841
> I'd fight those separately, and close this bug.

The only open bug left from comment 64 is 012 which is owned by Q. Please open separate bugs for any additional issues.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations