Determine what is the matter with these Windows 7 machines

RESOLVED FIXED

Status

P3
normal
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: armenzg, Unassigned)

Attachments

(4 attachments)

(Reporter)

Description

6 years ago
It seems that four machines are having graphics issues which have caused a lot of the intermittent oranges since yesterday:
* t-w732-ix-012
* t-w732-ix-016
* t-w732-ix-043
* t-w732-ix-045
(Reporter)

Updated

6 years ago
No longer blocks: 770578
Depends on: 873634
Depends on: 876203
(Reporter)

Updated

6 years ago
Priority: -- → P3
(Reporter)

Comment 1

6 years ago
I will not be able to look at these before I head out.
Assignee: armenzg → nobody
Whiteboard: [buildduty]

Updated

6 years ago
No longer depends on: 872867, 872877, 872915, 873078, 873634, 876203
t-w732-ix-011 and t-w732-ix-070 also seem to suffer from this.
This bug should be with DCops so they can investigate all these machines as a class and see whether there is a common theme in the failures: bad graphics card batch, faulty power connect to rack, etc.
Assignee: nobody → server-ops-dcops
Component: Release Engineering: Machine Management → Server Operations: DCOps
QA Contact: armenzg → dmoore

Comment 4

6 years ago
>This bug should be with DCops so they can investigate all these machines as a class and see whether there is a common theme in the failures: bad graphics card batch, faulty power connect to rack, etc.

Every single one of these iX systems has a video card. I don't think 4 suspect video cards out of ~300-400 indicate a bad batch. That's about a 1% failure rate.
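The failure-rate estimate above can be checked with a few lines of arithmetic. This is only a sketch: the bounds of 300 and 400 machines are an assumed reading of the rough fleet size given in this comment, not an inventory count.

```python
# Failure-rate bounds for 4 suspect cards in a fleet of roughly 300 to 400.
suspect_cards = 4
fleet_low, fleet_high = 300, 400  # assumed bounds for the approximate fleet size

rate_high = suspect_cards / fleet_low * 100   # smaller fleet gives the higher rate
rate_low = suspect_cards / fleet_high * 100   # larger fleet gives the lower rate

print(f"{rate_low:.2f}% to {rate_high:.2f}%")  # prints "1.00% to 1.33%"
```

Either way the rate stays close to 1%, which is the basis for doubting a bad batch.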

These servers are installed 4 to a chassis with dual PSUs, so faulty power would affect all 4 nodes, unless the specific slot happens to be bad. Are the servers stable, or are there any random reboots? I'm not even sure what intermittent orange means from comment #1. Currently, the servers are all pinging.

I am not aware of any video card testing utilities but I'd be more than happy to RMA these video cards for you. Are there any logs that can help us diagnose the problem or that we can pass along to iX?
Whiteboard: [buildduty]
Armen, can you help out DCOps with answers to their questions?
Flags: needinfo?(armenzg)
colo-trip: --- → scl3
(Reporter)

Comment 6

6 years ago
I will assign this to myself so I can give you some more info.
(Reporter)

Updated

6 years ago
Assignee: server-ops-dcops → armenzg
Flags: needinfo?(armenzg)
(Reporter)

Comment 7

6 years ago
Hi van,
Can you guys have a look at the graphic setup for machines 43, 45 & 70?

I will look at the other machines meanwhile.

t-w732-ix-011 - it seems to be fine - more releng investigation is needed
t-w732-ix-012 - it seems to be fine - more releng investigation is needed
t-w732-ix-016 - it seems to be fine - more releng investigation is needed
t-w732-ix-043 - small resolution & 2 monitors
t-w732-ix-045 - small resolution & 2 monitors
t-w732-ix-070 - small resolution & 2 monitors - IPMI does not work
Assignee: armenzg → server-ops-dcops

Comment 8

6 years ago
t-w732-ix-0[43,45,70] - fixed the small resolution, unsure how to get rid of the 2 monitors. I've made the add-on the primary video output and it's displaying at 1920x1200. BIOS also redirects to add-on video card.

Q: do you have any input on how to get rid of the 2 monitors issue?

also fixed IPMI on t-w732-ix-070.
Created attachment 777760 [details]
Bad color depth

The following machines are also affected:
t-w732-ix-115
t-w732-ix-015

I have also been having intermittent failures in a test that uses WebGL (see bug 893715).

"Error: WebGL: Error during ANGLE OpenGL ES initialization"

The attached screenshot clearly shows graphics issues.

This is normally an issue with the Direct3D driver.

Comment 10

5 years ago
so is this something DCOPs can help with or is this more of a driver issue?
(Reporter)

Updated

5 years ago
Depends on: 890035
(Reporter)

Comment 11

5 years ago
(In reply to Michael Ratcliffe [:miker] [:mratcliffe] from comment #9)
> The following machines are also affected:
> t-w732-ix-115

This machine was recently re-imaged in bug 890035 and after that it messed up a bunch of jobs and got disabled.

I see two monitors and very few screen resolutions.
http://cl.ly/QfRQ

van, can you please look at this machine?

> t-w732-ix-015
> 
It is still taking jobs. If I connect through VNC I don't see anything weird in the screen.
The setup also seems fine (1 monitor; many resolutions):
http://cl.ly/QfSE

I don't think there's anything to be done for this second machine.

Comment 12

5 years ago
:armenzg, I was able to get t-w732-ix-115 to see 1920x1200 resolution. However, it still displays 2 screens even though nothing else is connected. I still can't get rid of it. Everything I've read says it's a bug/issue with video drivers and we should try to reinstall the drivers. Is this something we can try? Can you point me to the drivers you guys are using or can you reinstall it remotely?
Adding in Mark in case he has any ideas here.
(Reporter)

Comment 14

5 years ago
(In reply to Van Le [:van] from comment #12)
> :armenzg, I was able to get t-w732-ix-115 to see 1920x1200 resolution.
> However it still displays 2 screens even though nothing else is connected. I
> still can't get rid of it. Everything I've read says it's a bug/issue with
> video drivers and we should try to reinstall the drivers. Is this something
> we can try? Can you point me to the drivers you guys are using or can you
> reinstall it remotely?

I think it is best if we can re-image to see if it fixes it. I prefer to start from a known state.

Nevertheless, if you believe it is better to update the drivers and see, I believe Mark can let us know what is being used.

Comment 15

5 years ago
Just a quick note that we have seen certain IX hosts that exhibit inconsistent video behavior. This was a problem when writing the original video switch software. In some cases we could get the secondary video card to be recognized without disabling the on-board card, or we would still see the on-board card as active even though it was disabled in BIOS. These anomalies persisted across reboots and re-images.
Hmm, I seem to have missed this bug when I was originally added as a cc.

Is there one of the problematic machines I can jump on and take a look at?
(In reply to Mark Cornmesser [:markco] from comment #16)
> Hmm, I seem to have missed this bug when I was originally added as a cc.
> 
> Is there one of the problematic machines I can jump on and take a look at?

Feel free to look at any of them - they're all offline. Apologies for the long delay in response.
Van, do we have spare video cards on site?  If so, can we try swapping in a new one to see if that fixes the issue?

Comment 19

5 years ago
:arr - We still have the 25+ video cards that you ordered extra.  I can swap one out when I'm onsite.

Comment 20

5 years ago
Thanks Vinh.

Comment 21

5 years ago
So I've noticed that w732-ix-043, 045, and 070 all display 2 monitors but once you disable the Matrox G200eW (Winbond) in device manager, it will then revert to 1 monitor.  Upon enabling the Matrox G200eW, the 2 displays issue returns.  These three hosts have the older third party video card (NOT the GT-610).

I then examined three other hosts that have the GT-610 video card installed and saw that they only detected 1 monitor. Under device manager, all three had an error message under the Matrox G200eW: "Windows has stopped this device because it has reported problems (code 43)." In effect, the Matrox G200eW is in a disabled state.

With that said, do you still want me to swap out the video card and replace it with the GT-610?
Blocks: 890035, 876518
No longer depends on: 890035
Vinh: hang on a second, all of the production machines should have identical video cards. Are you saying that some of them are different?

Comment 23

5 years ago
arr: Yup, some have GT-610 graphics cards and some do not. See the attached snapshots.

Comment 24

5 years ago
Created attachment 802414 [details]
GT-610.JPG

Comment 25

5 years ago
Created attachment 802416 [details]
non-gt-610.JPG

Comment 26

5 years ago
Also noticed that on hosts that have the lower-end graphics cards, once I physically remove the card and default to the Matrox G200eW, I am able to see the console through IPMI (instead of just a black screen). Currently t-w732-ix-043.wintest.releng.scl3.mozilla.com has no third-party graphics card installed and works with IPMI.
Hey Vinh, when you have a moment, could you please check and see what graphics cards are in t-w732-ix-003 and 004? And could you give me a heads up beforehand? I am working on both of the machines currently.

The variation in cards may be a key to another issue.
(Reporter)

Comment 28

5 years ago
Side note (post this bug): if we find a way to write a script or a check that would verify the right graphics card on a Win7 machine, I could add it to the "pre-flight" tasks - bug 712206 - to prevent a machine from taking a job.
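A minimal sketch of what such a check might look like, in Python. It assumes the adapter list is obtained as Key=Value text, for example from something like `wmic path Win32_VideoController get Name,ConfigManagerErrorCode /format:list`; the exact command and output layout are assumptions here, and the sample reports below are illustrative rather than captured from a real machine. The rule it applies (exactly one working adapter, and it is the add-on card) is also an assumption drawn from the symptoms described in this bug, where a second active on-board adapter shows up as "2 monitors".

```python
EXPECTED_ADAPTER = "NVIDIA"  # assumption: substring identifying the add-on card


def parse_adapters(report):
    """Group 'Key=Value' lines into one dict per adapter.

    Blank lines are ignored; a repeated key starts the next adapter's record.
    """
    adapters, current = [], {}
    for raw in report.splitlines():
        key, sep, value = raw.strip().partition("=")
        if not sep:
            continue
        if key in current:
            adapters.append(current)
            current = {}
        current[key] = value
    if current:
        adapters.append(current)
    return adapters


def graphics_ok(report):
    """True when the only working adapter (error code 0) is the expected card."""
    working = [a.get("Name", "") for a in parse_adapters(report)
               if a.get("ConfigManagerErrorCode") == "0"]
    return len(working) == 1 and EXPECTED_ADAPTER in working[0]


# Illustrative reports only (hypothetical, not real machine output).
GOOD = """ConfigManagerErrorCode=0
Name=NVIDIA GeForce GT 610

ConfigManagerErrorCode=43
Name=Matrox G200eW (Winbond)
"""
BAD = GOOD.replace("=43", "=0")  # on-board card also active

print(graphics_ok(GOOD), graphics_ok(BAD))  # prints "True False"
```

A pre-flight task could refuse to start a job whenever `graphics_ok` returns False, leaving the machine idle until someone inspects it.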
Vinh: so that's device drivers, not the physical card.  Can you please check the *physical card* to make sure they're different?
(Reporter)

Comment 30

5 years ago
*Maybe* t-w732-ix-118 might have an issue as well.
Blocks: 876773

Comment 31

5 years ago
:arr - Ok physically they are the same card. Just the drivers were showing differently.

Comment 32

5 years ago
Hey Vinh,

We need to make sure the on-board card is disabled in BIOS. The machines throwing the 43 error are correct in their configuration, since the on-board card cannot start due to BIOS settings. There is a documented oddity on these machines and Windows 7: if the primary video card is enabled in BIOS, the secondary graphics card isn't detected correctly. Can we get the BIOS settings checked and possibly have the on-board graphics flipped back and forth on the problematic machines until they report the same 43 error?

Comment 33

5 years ago
Q - Flipping the on-board graphics back and forth in BIOS did not produce the 43 error code. Looks like I have to disable the on-board graphics drivers under device manager in order to get it to display 1 monitor. Should I go ahead and do the same for all the problematic machines?

Comment 34

5 years ago
I hope to avoid that, but let's make that a good plan B. Is there any difference in the BIOS versions on those machines? I keep coming back to how odd it is that machines with the same hardware, imaged the same way with the same OS, would behave differently.
A more comprehensive update for what we're seeing and the short term and long term plans:

1) Machines are ONLY affected if they're reimaged, not rebooted. A machine running tests will not encounter this issue in normal operation.

2) This issue does not always occur after a reimage, and we're not sure what causes it to occur when it does. We continue to look into that.

3) Out of the 130 w7 machines we imaged, fewer than 10 had this issue, and we were speculating that it might be hardware or configuration related on those few machines. We're apparently hitting this when we reimage now as well, though.

4) Visual verification can be done to make sure that the machine is configured correctly after being reimaged, and we will add steps to the documentation to show dcops (and releng, if they are self-servicing) how to correct the drivers if they are not configured correctly.

5) We believe that the lower-level C++ software that we've been working on writing should fix this issue; we've just been blocked on getting the contiguous cycles to finish this work.

Comment 36

5 years ago
What are the chances we could get 115 re-imaged?
Flags: needinfo?

Comment 37

5 years ago
115 currently re-imaging.
Flags: needinfo?

Comment 38

5 years ago
Thanks

Comment 39

5 years ago
Hmmm...looks like 115 has failed the re-imaging process twice. Stuck at the log in screen with incorrect password and showing ".\cltbld" Going to attempt another try.

Comment 40

5 years ago
I saw that too. I also saw some network name lookup errors on the server. Can you run 115 through one more time?

Comment 41

5 years ago
115 looks good now. Can you unplug the Dell monitor for me?

Comment 42

5 years ago
Looks like we have a few issues here.

1) Some machines are behaving oddly when they have the full correct config 

2) Some newly re-imaged machines are still seeing the on-board card even though they are disabled 

3) Some machines are failing to get the full NVIDIA driver install when being re-imaged. I think this may be due to file corruption in the NVIDIA install package; I am working on verifying that. Fixing the driver install issue should correct any issues we see with "known good" machines getting re-imaged (t-w732-ix-115 and t-w732-ix-005 were corrected with the driver fix).

Comment 43

5 years ago
Created attachment 802812 [details]
Error from 043 when trying to install video driver

This is an error from t-w732-ix-043 that states it can't find an NVIDIA card. I have only seen this on 043; maybe a re-seat is in order?
Flags: needinfo?(vhua)

Comment 44

5 years ago
118 also suffered from the driver issue. I have written a script to identify the problem and I am going to loop all the win7 machines today.
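The script mentioned here is not attached to this bug. Purely as an illustration, a loop over the Win7 pool might be shaped like the Python sketch below. Everything in it is hypothetical: the `probe` callable stands in for whatever per-host driver check is used, and the host numbering 001 through 130 is an assumption based on the machine names and count mentioned in this bug.

```python
def pool_hosts(count=130, prefix="t-w732-ix-"):
    """Build host names such as t-w732-ix-001 (numbering scheme is assumed)."""
    return [f"{prefix}{n:03d}" for n in range(1, count + 1)]


def find_suspect_hosts(probe, hosts):
    """Return hosts whose probe reports a driver problem.

    `probe` is a hypothetical callable: host name -> True when the driver
    setup looks right. Hosts that cannot be reached are reported as well,
    so they get a manual look instead of being silently skipped.
    """
    suspect = []
    for host in hosts:
        try:
            healthy = probe(host)
        except OSError:
            healthy = False
        if not healthy:
            suspect.append(host)
    return suspect


# Usage with a stand-in probe that flags two made-up hosts.
flagged = {"t-w732-ix-016", "t-w732-ix-045"}
print(find_suspect_hosts(lambda h: h not in flagged, pool_hosts()))
# prints "['t-w732-ix-016', 't-w732-ix-045']"
```

Keeping the probe separate from the loop means the same loop works whether the check is run remotely or read from a saved report.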

Comment 45

5 years ago
I've reseated the graphics card on t-w732-ix-043.  Give it a try now.
Flags: needinfo?(vhua)

Comment 46

5 years ago
Looks like it finds the card now.

Comment 47

5 years ago
Armen, can 043 get dropped back into the queue? It should be good now.
Flags: needinfo?(armenzg)
(Reporter)

Comment 48

5 years ago
Rebooted into production.

We should see it take jobs in the next hour or so:
https://secure.pub.build.mozilla.org/buildapi/recent/t-w732-ix-043
Flags: needinfo?(armenzg)

Comment 49

5 years ago
The report is almost complete. It is taking time because I am doing passive checks so I don't interrupt any tests. I will manually fix any driver issues today and supply a list of affected and fixed machines, to make sure they are put back into use with a cautious eye on them.

Comment 50

5 years ago
I see mochitests passing

Comment 51

5 years ago
Out of 130 machines the following currently have driver issues:
T-W732-IX-016
T-W732-IX-017
T-W732-IX-045
T-W732-IX-070
T-W732-IX-100

I am working on correcting these tonight and they should be ready to go back into production tomorrow.

Comment 52

5 years ago
The following are all set:
T-W732-IX-016
T-W732-IX-017
T-W732-IX-070
T-W732-IX-100

I am saving T-W732-IX-045 for some quick control testing in the morning.

Comment 53

5 years ago
All the below T-W732-IX- machines should be able to be added back into production as of now:
015
016
017
043
045
070
100
115

Can we get them added? I will keep an eye on them today.
Flags: needinfo?(armenzg)
(Reporter)

Comment 54

5 years ago
I'm surprised that #015 came into the list as it has been in production like forever and the sheriffs would have caught it if it failed a lot of jobs.

I don't see the following machines in the list:
t-w732-ix-011
t-w732-ix-012
t-w732-ix-118
Should I move them to a separate bug to determine what malady they might have?
Maybe they got fixed in one of the comments above?
http://cl.ly/RLdb - Can you please have a look at this screenshot?
Maybe they should not have been added to this bug? (comment 30)

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-011
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-012
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-118
############

> 015 - This machine has been in production for many months and last 100 jobs seemed to be good
> 016 - Rebooted into production
> 017 - Rebooted into production
> 043 - Rebooted into production (yesterday)
> 045 - Rebooted into production
> 070 - IPMI credentials do not work - Can someone please fix it?
> 100 - IPMI credentials do not work - Can someone please fix it?
> 115 - Rebooted into production

WRT the IPMI credentials, I can move them to another bug to avoid the orthogonal noise.

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-015
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-016
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-017
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-043
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-045
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-070
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-100
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w732-ix&name=t-w732-ix-115
Flags: needinfo?(armenzg)

Comment 55

5 years ago
118 is good now.

011 and 012 are some of the odd machines that will work without disabling the on-board card. However, they are set to use the NVIDIA graphics adapter and should be able to go back in.
(Reporter)

Comment 56

5 years ago
(In reply to Q from comment #55)
> 118 is good now.
> 
> 011 and 012 are some of the odd machines that will work without disabling
> the on-board card. However, they are set to use the NVIDIA graphics adapter
> and should be able to go back in.

All have been rebooted into production.

I believe all we have to do is wait to see what the results of all these hosts are by tomorrow and close it, right?

Comment 57

5 years ago
Yes Sir, that is correct.
Such optimism!

Results I've seen so far: 

011 fails to start up the browser in talos, and fails reftests with radius corners
012 may be okay
015 remains just fine
016 failed to start up the browser in a talos job
017 failed to start up in several talos jobs, and failed a whole bunch of tests (https://tbpl.mozilla.org/php/getParsedLog.php?id=27802553&tree=Mozilla-Inbound) that I've never seen fail before, looking like maybe the clipboard wasn't working right, or something was in the way
043 may be okay (one explicable failure, one inexplicable failure but on Thunderbird so I have no opinion)
045 is either okay or hasn't taken enough jobs to fail yet
070 is the same, not enough jobs to say
100 has the same failure to start talos jobs
115 does too
118 does too

If I understood correctly, Q was saying the failure to start has something to do with Apache.
(Reporter)

Comment 59

5 years ago
Reviewing philor's assessment and proposing action.

012 is showing test failures like this: http://cl.ly/RMfj
 |--> disabled for now

philor, should we disable 016, 017, 100, 115 & 118?
Probably check the graphics setup once more.
Probably re-image and try on staging?

045 & 070 - we need more jobs

Q, do you have more info wrt Apache? I can try to run a job manually and see what it looks like when connected to the machine.
Flags: needinfo?(q)
Flags: needinfo?(philringnalda)

Comment 60

5 years ago
Re Apache: I saw Apache errors on almost all the machines. It looks like there was a config error on several machines, now corrected. The failed calls were to localhost/xtalos/*

As for the graphics errors, let's get 012 re-imaged. Let me look at the other failures, as the graphics config *seems* good.
Flags: needinfo?(q)
(Reporter)

Comment 61

5 years ago
(In reply to Q from comment #60)
> Re Apache: I saw Apache errors on almost all the machines. It looks like
> there was a config error on several machines, now corrected. The failed
> calls were to localhost/xtalos/*
> 
> As for the graphics errors, let's get 012 re-imaged. Let me look at the
> other failures, as the graphics config *seems* good.

To clarify this, we should not see the talos issues anymore. If we do, let Q know.
(Reporter)

Comment 62

5 years ago
(In reply to Q from comment #60)
> 
> As for the graphics errors, let's get 012 re-imaged. Let me look at the
> other failures, as the graphics config *seems* good.

Filed bug 916156 to re-image 012.
Looks like what remains is that 011 and 012 are odd and do not have working graphics, 017 is flaky and needs diagnostics (or to have the dust blown out of its fans, or to have its RAM reseated, or whatever), and 070 took just one job (which got cancelled). I'd fight those separately, and close this bug.
Flags: needinfo?(philringnalda)
(Reporter)

Updated

5 years ago
Depends on: 895121
(Reporter)

Comment 64

5 years ago
Thanks philor.

011 - to be re-imaged in bug 916239
012 - to be re-imaged in bug 916156
017 - to be investigated in bug 916840
* "017 is flaky and needs diagnostics (or to have the dust blown out of its fans, or to have its RAM reseated, or whatever)"
070 - needs more jobs
118 - From dep bug: "And exactly like it did in May, it fails reftests with a couple of pixels a bit off."
* to be re-imaged in bug 916837 and maybe some diagnostics (memory test or graphics replacement)
(Reporter)

Comment 65

5 years ago
t-w732-ix-070 needs DCOps investigation as well - bug 916841

Comment 66

5 years ago
> I'd fight those separately, and close this bug.

The only open bug left from comment 64 is 012 which is owned by Q. Please open separate bugs for any additional issues.

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations