Closed Bug 977615 Opened 6 years ago Closed 6 years ago

Upgrade Windows 8 machines' Nvidia drivers to 335.23

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows 8
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: q)

Attachments

(5 files, 1 obsolete file)

We're going to need to upgrade the drivers to help with bug 938395.

I will run a machine through staging first.
OS: Linux → Windows 8
Summary: Upgrade one Windows 8 machine's Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 334.89
Blocks: 974684
Assignee: relops → q
Nvidia has released GeForce 335.23 WHQL drivers, can this driver be installed instead?

http://www.nvidia.com/download/driverResults.aspx/73780/en-us
I'm now testing the machine on staging and will have the results soon.

When could we have time to test the installation on a machine and check that we have the graphics update service still disabled?
http://us.download.nvidia.com/Windows/335.23/335.23-desktop-win8-win7-winvista-64bit-english-whql.exe
We should have all results by tomorrow morning.
All jobs passed.

Q: When can this fit within your schedule?

Thanks!
Flags: needinfo?(q)
I would recommend that we wait to make any major changes to production till after pwn2own.
For sure, we'll time the deployment after it.
After the release that fixes it goes out.
Q: any news on when this could be done?
Let's aim for Monday. Does that work for you?
Flags: needinfo?(q)
Flags: needinfo?(armenzg)
Monday wfm!
Thank you Q!

Is there an easy way to rollback in case anything goes wrong?
I don't think we will need it but we should try to test it once.

jrmuizel: we might also need your experience if we need to disable a patch temporarily. We got a green run on staging but the greenness could change.
Flags: needinfo?(armenzg)
I'm ready whenever you are. Thanks!
Sorry this didn't go out yesterday. I was testing rollback options, and we have them. I will work on getting this out today.

Q
For CCed people: No action required. Adding you as FYI. We're upgrading Nvidia drivers on Win8 machines. The change has been tested on staging.
Blocks: 988012
This was rolled back to figure out what caused the failures that closed the trees.
Q, could you please deploy the change to t-w864-ix-009?
I would like to test it on my staging master.

On another note, I'm putting t-w864-ix-042 again on staging to see why it didn't catch the issues from bug 988012.
will do
Q, can we please use 335.23 instead of 334.89?
Summary: Upgrade Windows 8 machines' Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 335.23
Targeting only t-w864-ix-009 with the update. Rebooting now to pick it up with driver version 335.23.
I think there's something funky with 009's graphical setup.

Q: can you have a look at it when you have time?
Could you also install the newer graphics driver on 042?

Thanks for your help!

On another note, my testing on staging was pretty much invalid.
GPO reverted my manual installation in between reboots.
This means that I was testing the older setup and that is why I got everything to be green. Yay me!
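
A minimal sketch of a check that would have caught this: confirm which driver release a machine actually has before trusting its results. It assumes the common convention that Windows reports Nvidia drivers as a four-part version (for example 9.18.13.3523) whose last five digits encode the release (335.23). The sample version strings and both function names are illustrative and do not come from this bug.

```python
def nvidia_release_from_windows_version(windows_version: str) -> str:
    """Map a Windows-style version such as '9.18.13.3523' to '335.23'.

    Assumption: the last five digits of the dotted version encode the
    Nvidia release number (three digits, then two).
    """
    digits = windows_version.replace(".", "")
    if len(digits) < 5 or not digits.isdigit():
        raise ValueError(f"unexpected driver version: {windows_version!r}")
    last_five = digits[-5:]
    return f"{last_five[:3]}.{last_five[3:]}"


def driver_matches(windows_version: str, expected_release: str) -> bool:
    """Return True only if the installed driver is the release under test."""
    return nvidia_release_from_windows_version(windows_version) == expected_release
```

Running a check like this at the start of each staging job, rather than once after the manual install, would flag a silent revert between reboots.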
009 will now only get the new driver (it is deployed via GPO). I ran c:\monitor_config\fakemon.vbs (which gets run on startup) and things look good. You should be good to test that machine. I will address 042 right now.
Everything is running green so far on 009.
The driver version seems the right one (it is still on the right version 335.23).

Could it be that we need a reboot after the driver install?
Maybe the tests fail on the first run after the drivers upgrade?

I should run tests for mozilla-beta since that is what the logs from bug 988012 list.
Matt, Jeff: I've attached the logs of the jobs that failed after running with the newer Nvidia driver.

Could you please have a look at them and find a way to make them green?

Once we have solutions for this we can move forward to the deployment (I will need to do another run across various branches).
Are you assuming that your problems with 009 were different than the problems in production? I don't think there's any reason to believe they were different, my memory of the failures would make "tiny resolution and no acceleration" a perfect fit.
It seems that we might have two issues:
* the installation can get us into a bad state with no acceleration and tiny screen resolution
* after upgrade, we have some perma orange failures on various branches

Matt, Jeff: when do you have time to look at the failures? Does the following plan make sense?

I can think of this path forward:
1) We can run a machine through all the release branches: m-c, m-a, m-b, m-r & esr24
1.1) File bugs for each failing test; ask the original authors to help fix them or disable them
1.2) Loan you a machine and ask
2) Once we deploy this again, we can quickly disable any machines that get a bad installation and fix them up
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #26)
> It seems that we might have two issues:
> * the installation can get us into a bad state with no acceleration and tiny
> screen resolution
> * after upgrade, we have some perma orange failures on various branches
> 
> Matt, Jeff: when do you have time to look at the failures? Does the
> following plan make sense?

Sounds reasonable to me.
Flags: needinfo?(jmuizelaar)
Yes, sounds good to me too.

Jeff and I looked at the logs last week, nothing stood out as obviously graphics related at all. It wasn't really obvious how the driver upgrade could have caused them.
Flags: needinfo?(matt.woodrow)
Jeff, Matt: would you mind preparing a patch to disable the perma-failure on beta?

Q: do you have any trick to see if the installations failed on some of the machines?

Any ideas on how to prevent machines from taking jobs if they are not ready to take jobs? (graphically speaking)
Anything I can check on the machine?
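
One possible shape for such a check, as a rough sketch only: gather the display state before the slave starts and refuse jobs if anything looks off. The minimum resolution below is a placeholder (the expected resolution is not stated in this bug), and `DisplayState` and `display_problems` are hypothetical names. How the values are collected on the machine is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisplayState:
    width: int
    height: int
    driver_release: str
    hardware_accelerated: bool


def display_problems(
    state: DisplayState,
    expected_release: str = "335.23",
    min_width: int = 1600,   # placeholder value, not taken from this bug
    min_height: int = 1200,  # placeholder value, not taken from this bug
) -> List[str]:
    """Return a list of reasons the machine should not take jobs (empty if ready)."""
    problems = []
    if state.driver_release != expected_release:
        problems.append(
            f"driver is {state.driver_release}, expected {expected_release}"
        )
    if state.width < min_width or state.height < min_height:
        problems.append(f"resolution {state.width}x{state.height} is too small")
    if not state.hardware_accelerated:
        problems.append("no hardware acceleration")
    return problems
```

A startup script could log the returned reasons and skip launching buildbot when the list is non-empty, which would cover the "tiny resolution and no acceleration" state mentioned earlier.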

I'm hoping to look into the machines that failed last time (by looking at Windows logs) to see if there's anything that failed during the installation.

I triggered jobs on mozilla-aurora and they all came out clean.
I've triggered jobs on mozilla-release and mozilla-esr24 and we should know by tomorrow. I assume they might share the perma-failure of mozilla-beta.

It seems that "WINNT 6.2 mozilla-central debug test mochitest-browser-chrome" was intermittent. It failed 2/5 times *and* the test failures from each were different.

"WINNT 6.2 mozilla-beta opt test mochitest-browser-chrome" has failed 3/3 times the same way [1].



[1] These errors and similar:
17:40:04  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_876944_customize_mode_create_destroy.js | The number of placeholders should be correct. - Got 2, expected 1
17:40:09  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_880382_drag_wide_widgets_in_panel.js | Area PanelUI-contents should have 13 items. - Got 12, expected 13
17:40:16  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_890140_orphaned_placeholders.js | Should no longer be in default state.
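
The distinction drawn above (different failures in 2/5 runs versus the same failures in 3/3 runs) can be sketched as a small log comparison. This is only an illustration: the regular expression assumes the `TEST-UNEXPECTED-FAIL | <test> | <message>` layout shown in the lines above, and the function names and category labels are made up for this sketch.

```python
import re
from typing import List, Set

# Matches the "TEST-UNEXPECTED-FAIL | <test> | <message>" layout seen in the logs.
FAIL_RE = re.compile(r"TEST-UNEXPECTED-FAIL \| (?P<test>\S+) \| (?P<message>.*)")


def failing_tests(log_text: str) -> Set[str]:
    """Collect the distinct test paths reported as unexpected failures."""
    return {match.group("test") for match in FAIL_RE.finditer(log_text)}


def classify(runs: List[Set[str]]) -> str:
    """Classify a job from the sets of failing tests seen in each run."""
    failed_runs = [run for run in runs if run]
    if not failed_runs:
        return "green"
    if len(failed_runs) < len(runs):
        return "intermittent"
    if all(run == failed_runs[0] for run in failed_runs):
        return "perma-orange"
    return "always failing, different tests each time"
```

Feeding it one set per run gives "intermittent" for a job that fails only some of the time and "perma-orange" for one that fails every run with the same tests.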
Any answers wrt the questions on comment 29?

m-r and m-esr24 have come out clean.
As of now, we only have a perma-orange on mozilla-beta.

Q: what do you think if we deploy the change again to five machines and keep a close eye on it?
I would like to see recent failures.

Meanwhile, I will look at one of the machines that failed in the past: t-w864-ix-105
Flags: needinfo?(q)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
I think deploying to five machines is a great idea. Do you have candidate machines for me?
Flags: needinfo?(q)
Let's do 010 to 015.

Thanks Q!
Q: if you could do it as early as possible either today or tomorrow, it will give me a higher chance of keeping an eye on them.

I will probably disable buildbot at the end of my day and re-enable them in the following morning.
Flags: needinfo?(jmuizelaar)
Q and I will be deploying the change to five machines and keeping a close eye on them.
Flags: needinfo?(matt.woodrow)
10 - 15 were set to get the new drivers after reboot. I still haven't seen an install get pulled, so I am going to make sure the change is getting picked up.
The install runs at boot and makes sure that runslave doesn't start. The install is still running on 14 and runslave was killed. Is it possible something rebooted this machine during the last install? The machine has only been up for 6 minutes.

Q
The installer completed on 014 and it rebooted without my intervention. Checking on the resolution etc. now.
The last job that #14 took successfully finished at: 4/23/2014, 3:16:44 PM
After that it should have rebooted (last step in this job [1])

At that point, I assume it came back from a reboot and installed the drivers.
At Wed Apr 23 15:34:43 2014 it started the next job and after 1 mins, 48 secs it failed (Wed Apr 23 15:36:32 2014) *without* rebooting. [3]

From your comment, who killed runslave? Did you mean that it was not running? ("...and runslave was killed.")

FTR I gracefully shutdown buildbot somewhere before my comment (15:50:02 PDT)
Taking that action should look like killing runslave.py when connected to the machine.

IIUC (correct me if I'm wrong) #14 took jobs even though the installation had not finished.

#######################
[1] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20try%20opt%20test%20reftest-no-accel/builds/521

[2] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20fx-team%20opt%20test%20mochitest-2/builds/251

[3]
15:35:27     INFO - #####
15:35:27     INFO - ##### Running clobber step.
15:35:27     INFO - #####
15:35:27     INFO - Running pre-action listener: _resource_record_pre_action
15:35:27     INFO - Running main action method: clobber
15:35:27     INFO - rmtree: C:\slave\test\build
15:35:27     INFO - Using _rmtree_windows ...
15:35:27     INFO - retry: Calling <bound method DesktopUnittest._rmtree_windows of <__main__.DesktopUnittest object at 0x025F9B30>> with args: ('C:\\slave\\test\\build',), kwargs: {}, attempt #1

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
I think I might have an idea here. Let me check the install logs.
This was my fault. For some reason, the hostname filter that told GPO NOT to revert the change back to the old driver was not taking effect. So the new driver was installing, then GPO was detecting the new driver and resetting the node, hence the kill behavior (we had this in place because tests would fail with the new driver). I removed the revert statements entirely and am retrying now.
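
The failure mode described here can be modelled in a few lines. This is a simplified sketch of the decision, not the actual GPO configuration: `should_revert`, its parameters, and the case-insensitive hostname comparison are all assumptions made for illustration.

```python
from typing import Iterable


def should_revert(
    hostname: str,
    installed_release: str,
    old_release: str,
    keep_new_driver_hosts: Iterable[str],
) -> bool:
    """Decide whether the revert rule resets a machine to the old driver.

    A machine is left alone if it already runs the old driver, or if its
    hostname is on the exclusion list. If the exclusion list fails to match
    (the "filter not taking" case), the new driver gets reverted.
    """
    if installed_release == old_release:
        return False
    excluded = {host.lower() for host in keep_new_driver_hosts}
    return hostname.lower() not in excluded
```

With an empty or non-matching exclusion list, a freshly upgraded machine is always reverted, which matches the install, detect, reset loop seen on 014.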
Okay 14 looks stable and came up with the correct res and did not revert.
Yay!

Jeff, Matt: I want to wait until next week so Q and I can coordinate deploying this further (I'm on PTO until Monday). Please reach out to coop if this cannot wait.
Thanks Q. I'm happy to see this figured out.
Do we have any metrics on how these machines are doing?
https://tbpl.mozilla.org/php/getParsedLog.php?id=38670508&tree=Mozilla-Central is 015 failing the webgl mochitests which fail when run (without hardware acceleration, at such a low resolution, whathaveyou) when webgl doesn't work on a slave.
Two of the others have done green mochitest-2 runs (though nobody has yet done either of my favorites, reftest or mochitest-1), so it may be that just 015 is broken.
https://tbpl.mozilla.org/php/getParsedLog.php?id=38674156&tree=Mozilla-Inbound is 015 failing reftest webgl tests. Disabled it in slavealloc.
Thank you philor.
It seems that only 015 has given trouble so far.

I sometimes feel that not being able to set a WebGL context should turn the job red.

Anyways, let's look tomorrow as to why 015 misbehaved.
Let's hope the other ones stay put.

18:08:41     INFO -  JavaScript warning: http://mochi.test:8888/tests/dom/imptests/html/webgl/common.js, line 3: WebGL: Can't get a usable WebGL context
18:43:29     INFO -  JavaScript warning: file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-utils.js, line 59: WebGL: Can't get a usable WebGL context
Attached image Capture1.PNG (obsolete)
Is this relevant? Should we re-image 015 and start again?

On another note, could we deploy the change to machines 020 to 029?
If machines 010 to 014 have been working, I'm confident those will work too.
After that we should deploy across the pool and disable individual machines that get into the state of 015.
Attachment #8414866 - Flags: feedback?(q)
We're going to deploy across the board.
Deploying now rather than in the morning will be less disruptive if we have some machines that don't work.

If this goes sideways please use the escalation wiki.
Comment on attachment 8414866 [details]
Capture1.PNG

Irrelevant event. Obsoleting screenshot.
Attachment #8414866 - Attachment is obsolete: true
Attachment #8414866 - Flags: feedback?(q)
Email sent to sheriffs.

This can be used to monitor the change: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=WINNT%206.2
And maybe this: http://builddata.pub.build.mozilla.org/reports/pending/pending.html (last two diagrams)
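
For anyone who wants to build similar filtered views for other trees, here is a small sketch that produces the same kind of TBPL link. Only the `tree` and `jobname` query parameters visible in the URL above are used, and `tbpl_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def tbpl_url(tree: str, jobname: str, base: str = "https://tbpl.mozilla.org/") -> str:
    """Build a TBPL link filtered by tree and job name.

    quote_via=quote encodes spaces as %20 (urlencode's default would use '+').
    """
    query = urlencode({"tree": tree, "jobname": jobname}, quote_via=quote)
    return f"{base}?{query}"
```

Calling `tbpl_url("Mozilla-Inbound", "WINNT 6.2")` reproduces the monitoring link given above.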
It seems that this can kill the first job. Following jobs work as expected.
While the installation of the driver goes on, the script was killing any python process starting up. That caused the "blue" jobs. Once the installation finished the machine would reboot and come back with the right driver.

Unfortunately, some machines would get a messed up device/screen-resolution condition.
The way that fakemon.vbs runs, it can get in the way of buildbot (see [1]).
Some jobs would come out orange.
We have 005, which recovered from such a state. On the other hand, we had 004, which got 3 orange jobs.

The newer starttalos.bat that Q is deploying to the win8 machines will prevent machines like 004 from running jobs.

[1] TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_details.js | Enable button should be visible
So it seems that the new starttalos.bat is doing what we expected.

Currently, there are no production win8 jobs to hit.
There are a bunch of Try ones:
https://tbpl.mozilla.org/?tree=Try&jobname=WINNT%206.2
I will be checking before 9pm.
I will let Q know by then if there's anything else we need to do or not.

I hope to get more data points for later in here (I'm waiting on the try backlog to clear to get to my jobs):
https://tbpl.mozilla.org/?tree=Ash&jobname=WINNT%206.2%20ash%20opt%20test&rev=685ffafc35cf
https://tbpl.mozilla.org/?tree=Cedar&jobname=WINNT%206.2.*opt&rev=d0a03ae4832f
No longer depends on: t-w864-ix-030
It seems that we're out of the woods.
However, this machine (even though it had your new starttalos.bat) had the screen resolution script in the foreground.
It burned a job. I closed cmd and disabled the machine for when you have time to look at it.
Attachment #8414979 - Flags: feedback?(q)
t-w864-ix-117 is a bit suspicious, since the screenshot from a mochitest-5 timeout shows the Start screen, but I left it enabled to see what else it could manage to break.
083 looks like it had an error from before the starttalos change. A reboot brought the box back
t-w864-ix-117 seems fine from looking at it with VNC.

t-w864-ix-070 and t-w864-ix-077 are not OK. I will reboot them once more.

Re-enabled and rebooted:
t-w864-ix-015
t-w864-ix-030
t-w864-ix-042
t-w864-ix-062
t-w864-ix-063
t-w864-ix-083

I will watch them.
Ryan and philor have disabled 063 and 018.

Ryan says that we're actually seeing the issue on machines at random.
We've seen the same timeouts on at least 6 or 7 different slaves in the last hour or so. Looking at slave health, they appear to recover OK after rebooting.
The screenshot thing is maybe unrelated, since instances of bug 632290 on April 12th and 17th show the same screenshot of the Start screen, but the way we had one instance of that failure last week on Linux, and we've had six all on Win8 since last night seems related.
Q: can we put the call of fakemon.vbs outside of starttalos.bat the way it was? or maybe remove the 2nd call?
FYI I spoke with Q and we decided not to make any further changes, such as the one I asked for in comment 68.
I believe the extra failures came from some machines that I should not have put back into the pool (020/021), which brought back issues that only they showed.

It seems that we're OK now. philor/RyanVM let me know if not.

I'm going to go over any stragglers and file a follow up bug.
Blocks: 1004813
Done with the update.
I filed bug 1004813 for stragglers.
Blocks: 974684
Depends on: 1003614
Attachment #8414979 - Flags: feedback?(q)