Closed Bug 1273286 Opened 6 years ago Closed 6 years ago

Upgrade Linux puppet drivers to NVIDIA 361.42

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

All
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: acomminos, Assigned: acomminos)

Attachments

(2 files)

The current version of the NVIDIA driver on Linux puppet deployments (310.32) is outdated and is causing frequent hangs inside the driver, both in Gecko and in compiz. The latest long-lived driver, 361.42, is available in the graphics-drivers Ubuntu PPA (https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa) and has been verified on a loaner to resolve the hang issues in bug 1197954.

We should switch from xorg-edgers to this PPA and use its nvidia-361 package.
:rail, can you help us move this to the right place?
Flags: needinfo?(rail)
I presume this was in reference to releng CI machines, not IT machines.
Component: Infrastructure: Puppet → RelOps: Puppet
QA Contact: cshields → dustin
Thanks for the move.

I'm not sure how the APT mirrors on puppet work, but this assumes that the graphics-drivers PPA is available to pull from in place of xorg-edgers.
Flags: needinfo?(rail)
I'll mirror the files.

Would be great to coordinate this change with jmaher (CCed).
Please make it clear when this is deployed, so that we can look for talos changes and for new random intermittents.
Comment on attachment 8753077 [details]
MozReview Request: Bug 1273286 - Upgrade NVIDIA drivers to 361.42, switch to graphics-drivers PPA. r?rail

https://reviewboard.mozilla.org/r/52990/#review49830

Thank you for the patch!
Attachment #8753077 - Flags: review?(rail) → review+
I copied the files to match the patch. We can try to deploy it on a limited set of machines or the whole pool. In both cases reverting may require reimaging.
For the record, this patch assumes a reimage as a variety of packages pulled in incidentally by xorg-edgers will no longer need to be installed.
Keywords: checkin-needed
(In reply to Andrew Comminos [:acomminos] from comment #9)
> For the record, this patch assumes a reimage as a variety of packages pulled
> in incidentally by xorg-edgers will no longer need to be installed.

I think this is the most straightforward and safe way to proceed.

I'd like to reimage N machines and pin them to the new puppet env, so we can ensure everything works as expected.

Andrew, Joel, any ideas about how many machines we should pin?
Joel would probably know better than I do regarding the number of machines needed to get statistically significant data within a reasonable timeframe.
Flags: needinfo?(jmaher)
I like 20 machines. Let's document which ones get this; then we can push to try and look for odd talos patterns, as well as any issues on the trees. We should let the sheriffs know about the change when it happens.
Flags: needinfo?(jmaher)
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

https://reviewboard.mozilla.org/r/53992/#review50716

I'm ok with this landing. I do have one downside to comment on: wouldn't this still mean that we run puppet with 'production' puppet first and then the *next* run is with your environment?
Attachment #8754483 - Flags: review?(bugspam.Callek) → review+
(In reply to Justin Wood (:Callek) from comment #14)
> I'm ok with this landing, I do have one downside to comment on, wouldn't
> this still mean that we run puppet with 'production' puppet first and then
> *next* run is with your environment?

Not sure. I'll figure this out before enabling this slave.
Unfortunately kickstart won't talk to my environment at the beginning, so it tries to install the xorg-edgers packages first. After the initial puppetization it starts talking to my env and installs the new packages. I purged the packages from that repo with:

  apt-get purge nvidia-310 nvidia-settings-310
  apt-get install  libpixman-1-0=0.24.4-1 libpixman-1-0:i386=0.24.4-1


talos-linux64-ix-024 is enabled now and should start taking jobs. Let's start with just one for now.
Is there still something that needs to be checked in on mozilla-inbound or elsewhere?
Flags: needinfo?(andrew)
No, thanks. My mistake.
Flags: needinfo?(andrew)
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53992/diff/1-2/
Attachment #8754483 - Attachment description: MozReview Request: Bug 1273286 - Pin talos-linux64-ix-024 to rail's env r=Callek → MozReview Request: Bug 1273286 - Use NVIDIA 361.42 on ~20 talos machines. r=Callek
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

This should move the first ~20 slaves to my environment and update the drivers to NVIDIA 361.42. Reimaging is not required, because the drivers can live together without any issues.
Attachment #8754483 - Flags: review?(bugspam.Callek)
Attachment #8754483 - Flags: review+
Attachment #8754483 - Flags: checked-in+
Attachment #8754483 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

https://reviewboard.mozilla.org/r/53992/#review51822

r+ on diff based on https://gist.github.com/rail/f9b5a9454031384494c248fc5a9fc5f7 (since ReviewBoard's interdiff was odd)
Comment on attachment 8753077 [details]
MozReview Request: Bug 1273286 - Upgrade NVIDIA drivers to 361.42, switch to graphics-drivers PPA. r?rail

Apparently we need to remove the 310 related stuff to make X use the new drivers. I added this block https://gist.github.com/rail/efbe7e78e999358cb60c59f38e9b25a8#file-puppet-diff-L40-L42 to make it work.
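
A minimal sketch of how one might check a machine for leftover 310-series packages. This is not the puppet block from the gist above. The function name filter_310_packages is hypothetical, and the only package names it assumes are nvidia-310 and nvidia-settings-310, taken from the purge command earlier in this bug.

```shell
# Filter a list of package names (one per line, on stdin) down to the
# 310-series packages named in this bug. An optional ":arch" suffix,
# as printed by dpkg-query's ${binary:Package}, is tolerated.
filter_310_packages() {
    # "|| true" keeps the function from failing when nothing matches.
    grep -E '^nvidia-(settings-)?310(:|$)' || true
}

# Possible use on a Debian/Ubuntu machine (prints nothing if the
# old packages are gone):
#   dpkg-query -W -f='${binary:Package}\n' | filter_310_packages
```

An empty result would suggest the old packages are no longer installed; it does not by itself prove that X has loaded the new driver.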
It took quite a while to get all machines to switch to the new driver, because we don't reboot very often anymore. I just verified that talos-linux64-ix-001 to talos-linux64-ix-019 and talos-linux64-ix-024 have the new driver installed.
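
For reference, the set of machines named above can be enumerated with a short script. This is only a sketch: the function name pinned_hosts is hypothetical, and the commented-out loop assumes plain SSH access to each host, which this bug does not describe.

```shell
# Print the hostnames named above: talos-linux64-ix-001 through
# talos-linux64-ix-019, plus talos-linux64-ix-024.
pinned_hosts() {
    i=1
    while [ "$i" -le 19 ]; do
        printf 'talos-linux64-ix-%03d\n' "$i"
        i=$((i + 1))
    done
    printf 'talos-linux64-ix-024\n'
}

# Hypothetical per-host check (assumes SSH access, not confirmed here):
# for h in $(pinned_hosts); do
#     ssh "$h" "grep 'NVIDIA GLX Module' /var/log/Xorg.0.log | tail -n 1"
# done
```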

Joel, how long do you want to watch these machines for before we proceed further?
Flags: needinfo?(jmaher)
Can we give them until next Wednesday? With a 3-day weekend in the US, I imagine we won't get as much traffic as normal, so I would rather let this bake for a few days. If others have different desires/goals/plans, please speak up!
Flags: needinfo?(jmaher)
WFM!
Removing the keyword to get this out of our list of things to check in :))
Keywords: checkin-needed
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53992/diff/2-3/
Attachment #8754483 - Attachment description: MozReview Request: Bug 1273286 - Use NVIDIA 361.42 on ~20 talos machines. r=Callek → Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

Updated the patch to reflect recent Windows changes.
Attachment #8754483 - Flags: review?(bugspam.Callek)
Attachment #8754483 - Flags: review+
Attachment #8754483 - Flags: checked-in+
Attachment #8754483 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8754483 [details]
Bug 1273286 - Upgrade Linux drivers to NVIDIA 361.42

https://reviewboard.mozilla.org/r/53992/#review55202

Let's give this a shot...

::: modules/packages/manifests/nvidia_drivers.pp:13
(Diff revision 3)
>      case $::operatingsystem {
>          Ubuntu: {
> +            $nvidia_version = "361"
> +            $nvidia_full_version = "361.42"
> +            # The Ubuntu graphics-drivers repo embeds the version number in the
> +            # package name, so we can easily require "latest"

s/can/can't/ ?
It may take some time to get all machines to update the packages. So far I checked one of the machines and this is what I see in /var/log/Xorg.0.log:

  [    62.492] (II) NVIDIA GLX Module  361.42  Tue Mar 22 17:25:45 PDT 2016
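
A small sketch of how the same check could be scripted, assuming the log line keeps the format shown above. The function name nvidia_glx_version is hypothetical and is not part of the puppet patch.

```shell
# Print the NVIDIA GLX module version recorded in an Xorg log.
# Usage: nvidia_glx_version [logfile]   (defaults to /var/log/Xorg.0.log)
nvidia_glx_version() {
    log="${1:-/var/log/Xorg.0.log}"
    # Match lines such as:
    #   [    62.492] (II) NVIDIA GLX Module  361.42  Tue Mar 22 17:25:45 PDT 2016
    # and print only the version number that follows "Module".
    # If the module is logged more than once, keep the last occurrence.
    sed -n 's/.*NVIDIA GLX Module[[:space:]]\{1,\}\([0-9][0-9.]*\).*/\1/p' "$log" | tail -n 1
}
```

If the line is absent, the function prints nothing, which may mean X has not been restarted since the packages were updated.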
Just received some successful results with machines talos-linux64-ix-06{7,8}. Looks like the driver is working perfectly.

Thanks!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Wheeee!!!!
As a note, this has fixed a lot of bi-modal/noisy data on linux64:
https://bugzilla.mozilla.org/show_bug.cgi?id=1271948#c21

(Or it was a real coincidence.) Not sure I understand why that is, but it is good to keep in mind.
Wow! Sounds like the current driver is much better! \o/

Another reason to keep the systems up to date!