Closed Bug 1200180 Opened 9 years ago Closed 9 years ago

Post reimage processes failing to produce functional windows test hosts

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vciobancai, Assigned: q)

References

Details

(Whiteboard: [windows])

Attachments

(1 file)

      No description provided.
Started a reimage process for two slaves (t-xp32-ix-030 and t-w864-ix-020), but during the reimage I received the following error in MDT: "An error has occurred in the script on this page; ERROR: Name redefined"
Depends on: 1184571
Assignee: relops → q
It looks like troubleshooting XP installs this weekend also broke the w7/8 images. Please stand by.
Blocks: 1195785
Windows 7/8 look back on track now. XP is still busted.
Reinstalling t-w864-ix-020 now
t-w864-ix-020 was reinstalled touchlessly, took a mocha test, and succeeded. A pgo test is running now.
Re-image failed for the following slave: t-w732-ix-117
Because there's no video output on the w7 boxes, we don't know if this is an imaging issue or a problem with the box.
:vladc: I saw that dcops rebooted t-w732-ix-117 in bug 1200531. Did you try a reimage after that?
Flags: needinfo?(vlad.ciobancai)
Do we have a list of XP machines that need reimaging? Q has been working on a fix, and it would be good to have a number of machines to test it on.
Flags: needinfo?(kmoir)
Flags: needinfo?(alin.selagea)
I'll let vlad respond, I think he has a list
Flags: needinfo?(kmoir)
Below are the XP slaves that need to be re-imaged:
- t-xp32-ix-030 (bug 1198420)
- t-xp32-ix-032 (bug 1201396)
- t-xp32-ix-004 (needs a re-image; according to bug 880784, it has some problems with jobs)
- t-xp32-ix-033 (needs a re-image, according to bug 959635)
Flags: needinfo?(vlad.ciobancai)
(In reply to Amy Rich [:arr] [:arich] from comment #8)
> :vladc: I saw that dcops rebooted t-w732-ix-117 in bug 1200531. Did you try
> a reimage after that?

Amy, I re-installed the t-w732-ix-117 slave and the re-image process worked without any problems. The slave has been re-enabled in slavealloc.
(In reply to Amy Rich [:arr] [:arich] from comment #8)
> :vladc: I saw that dcops rebooted t-w732-ix-117 in bug 1200531. Did you try
> a reimage after that?

Yesterday I also re-imaged t-w864-ix-158 and it worked fine.
Flags: needinfo?(alin.selagea)
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-158 - it didn't work fine, it did 1.5 jobs, needed a reboot, and did .5 jobs after that, and is sitting idle. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-092 is the best Win8 slave which has been reimaged in the last two weeks, because once it did two jobs in a row instead of 1.5. In general, they do 1.5, sit idle, do 1.5, sit idle, and settle into doing .5 and then idling.

And t-w732-ix-117 has some sort of graphics problem, resolution or the wrong graphics card or whatever it is that causes them to fail every webgl test.
t-w732-ix-159 also got reimaged this week, also disabled for broken graphics.
vladc: hm, do you have a list of all windows machines you've reimaged since 2015-08-31? Q: we should correlate those and see if they're all failing or some subset or...

Philor: are you seeing machines with the same failure modes that haven't been reimaged since then?
Flags: needinfo?(vlad.ciobancai)
Flags: needinfo?(q)
Flags: needinfo?(philringnalda)
No, every Win8 slave which is doing 1.5 jobs and then stopping is one which has been reimaged this week or last week, every Win7 slave with busted graphics is one which has been reimaged this week or last week.
Flags: needinfo?(philringnalda)
And the reverse: to the best of my knowledge (not everyone always admits in a bug when they reimage something), every Win8 and Win7 slave which has been reimaged this week or last week is broken.
Summary: reimage failed for t-xp32-ix and t-w864-ix → reimaging failing to produce functional windows test hosts
Let's halt reimages. I will dive into a reimaged machine and start verifying state. We can roll back if we need to and start a new win 10 server.
So it isn't ALL reimages: t-w864-ix-020 was reimaged after the update, as noted in comment 5, and has been running tests fine with the proper resolution (http://buildbot-master119.bb.releng.scl3.mozilla.com:8201/buildslaves/t-w864-ix-020). Still looking for a common link and state on the broken machines.
Flags: needinfo?(q)
t-w864-ix-092 appears to be hardware locked up. It will not respond over OOB KVM even though the screen renders, and VNC and ssh are not working. I am cold resetting it via ipmi commands now.
It looks like t-w864-ix-158 was enabled during imaging and rebooted before the NVidia drivers had finished installing. After the next reboot the machine came back and the install repaired itself. That allowed the resolution scripts to run, and it has the correct resolution now. I don't think we have a technological problem with reimaging on windows 8 at this point. I am diving into the windows 7 issues now.
It does appear we can get into a race condition if the NVidia drivers are interrupted. I am adding some pieces to the install bat to delete the scheduled task if the install is good which should fix the condition.
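A minimal sketch of the cleanup step described above, written in Python for illustration rather than as the actual install batch file, which is not shown in this bug. It assumes the scheduled task is what re-runs the driver install and that the installer signals success with exit code 0. The task name `NvidiaDriverInstall` and the function name are hypothetical. `schtasks /Delete /TN <name> /F` is the standard Windows command for removing a scheduled task.

```python
from typing import List, Optional

# Hypothetical task name: the real scheduled task name is not given in this bug.
RETRY_TASK_NAME = "NvidiaDriverInstall"


def cleanup_command(install_exit_code: int,
                    task_name: str = RETRY_TASK_NAME) -> Optional[List[str]]:
    """Return the schtasks command that removes the scheduled task, or None.

    Assumption: exit code 0 means the driver install completed. On any other
    exit code the task is left in place so the install can be retried after
    the next reboot.
    """
    if install_exit_code != 0:
        return None
    return ["schtasks", "/Delete", "/TN", task_name, "/F"]
```

On a Windows host the returned list could be passed to `subprocess.run`. Building the command separately from running it keeps the decision logic testable on any platform.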
Summary: reimaging failing to produce functional windows test hosts → Post reimage processes failing to produce functional windows test hosts
Confirmed that win8 hosts are looking better. Windows 7 hosts have been slower to troubleshoot due to their graphics setups, but it looks like t-w732-ix-159 still has a dell 2048wfp monitor plugged into it, causing the resolution to get set to the wrong value. t-w732-ix-117 had the wrong setting for the onboard VGA card. After some research, this was my fault, since I had reset the settings for testing and I did put them back before vlad reimaged. After resetting the BIOS setting, 117 looks good so far.
Comment 24 should read "and I did NOT put them back before vlad reimaged."
We also looked at the win7&8 slaves and found that the following ones have been re-imaged since August 31:

Windows 8:
t-w864-ix-025
t-w864-ix-092
t-w864-ix-158
t-w864-ix-020
t-w864-ix-043

Windows 7:
t-w732-ix-117
t-w732-ix-159
Flags: needinfo?(vlad.ciobancai)
XP installs are back online. I am checking through windows 7, as it also touches the 32-bit install pipeline, to make sure no XP changes affected it. However, the test machine t-xp32-ix-033 is taking tests and passing after a touchless install:
http://buildbot-master119.bb.releng.scl3.mozilla.com:8201/buildslaves/t-xp32-ix-033
Also re-imaged t-xp32-ix-030, t-xp32-ix-032, t-xp32-ix-004 and returned them to the pool. Noticed that the tests are passing on each of them.
So,

Windows 8:
t-w864-ix-025 - hasn't been reenabled since 2015-09-15
t-w864-ix-092 - enabled, but hasn't taken a job or been rebootable since 2015-09-17
t-w864-ix-158 - redisabled, failing video tests
t-w864-ix-020 - redisabled, failing video tests
t-w864-ix-043 - hasn't been reenabled since 2015-09-10

Windows 7:
t-w732-ix-117 - disabled, failing video tests
t-w732-ix-159 - hasn't been enabled since 2015-09-14

Windows XP:
t-xp32-ix-030 - disabled, unable to do webgl and running at 4-bit color
t-xp32-ix-032 - so far, still enabled
t-xp32-ix-004 - so far, still enabled
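The status list above can be tallied with a short script. This is only a sketch: the host names and states are copied from the list, while the short state labels (`enabled`, `enabled-idle`, `disabled`, `not-reenabled`) and the function names are invented for illustration.

```python
from collections import Counter
from typing import Dict, List

# State per host, transcribed from the status list in this comment.
# The label strings are shorthand made up for this sketch.
STATUS: Dict[str, str] = {
    "t-w864-ix-025": "not-reenabled",
    "t-w864-ix-092": "enabled-idle",   # enabled, but not taking jobs
    "t-w864-ix-158": "disabled",
    "t-w864-ix-020": "disabled",
    "t-w864-ix-043": "not-reenabled",
    "t-w732-ix-117": "disabled",
    "t-w732-ix-159": "not-reenabled",
    "t-xp32-ix-030": "disabled",
    "t-xp32-ix-032": "enabled",
    "t-xp32-ix-004": "enabled",
}


def healthy_hosts(status: Dict[str, str]) -> List[str]:
    """Hosts that are enabled and, as far as reported, still taking jobs."""
    return sorted(host for host, state in status.items() if state == "enabled")


def broken_by_platform(status: Dict[str, str]) -> Dict[str, int]:
    """Count hosts that are not healthy, keyed by platform (e.g. 'w864')."""
    return dict(Counter(host.split("-")[1]
                        for host, state in status.items()
                        if state != "enabled"))
```

With the data above, `healthy_hosts(STATUS)` yields only the two XP hosts, and `broken_by_platform(STATUS)` reports five Windows 8, two Windows 7, and one XP host as not currently useful.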
bug 1207160 is possibly related and talks about the two w7 machines mentioned in comment 29.
Blocks: 1207160
Can we please attach the dxdiag of all failing machines plus a desktop screenshot?

Could we also check that the info in here [1] is still valid for working machines of each pool?
Could we also attach to that wiki a dxdiag for a working machine of each pool?

[1] https://wiki.mozilla.org/ReleaseEngineering:Buildduty:Slave_Management#Working_graphical_setup
Depends on: 1208270
We discovered that the media feature pack is not being installed. See bug 1209577
Depends on: 1209577
During the re-image of t-w864-ix-025, a warning message appeared in the management console. I attached the warning message.
Whiteboard: [windows]
Blocks: 1209492
Blocks: 1200250
The screen capture you show of t-w864-ix-025 was due to bug 1210344.
Do we have a status on the re-imaging process at the moment?

I've been able to re-image several win7&8 slaves this week using a remote command and did not notice issues (maybe bug 1205227 is the exception, but I don't think that the issue is related to the re-image process in that case).

Thanks.
Flags: needinfo?(q)
As far as we know, the issues with imaging have been solved.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(q)
Resolution: --- → FIXED