Closed Bug 1096734 Opened 10 years ago Closed 10 years ago

masters are using old manifests for production (or something)

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4088] )

Attachments

(2 files)

When I run 'puppet agent' on my Windows test system, b-2008-ix-0081, it often -- but not always -- fails with

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: This OS is not supported for collectd at /etc/puppet/production/modules/collectd/manifests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

But that error was fixed *hours* ago!  I've even restarted the master since then.  The puppet run works fine against other masters, so it's something particular to this master that's failing.
Mark's seeing this a bit, too.  It seems intermittent :(
As far as I can see, this is the only thing that includes collectd, so removing it should mean that nothing -- windows or not windows -- tries to install collectd.  It won't actually *remove* collectd.
Attachment #8520849 - Flags: review?(jwatkins)
Comment on attachment 8520849 [details] [diff] [review]
bug1096734-temp-remove-.patch

I don't see a problem with this.  Let's just make sure we re-enable it down the line.
Attachment #8520849 - Flags: review?(jwatkins) → review+
Well, I didn't see *collectd* errors after this, although I did see some old manifests being used -- I think.  This is a tough one to nail down.  Anyway, I'm going to back out attachment 8520849 [details] [diff] [review].
I'm going to close this as WORKSFORME for the moment, until I see this again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Dustin J. Mitchell changed story state to delivered in Pivotal Tracker
Resolution: WORKSFORME → FIXED
Dustin J. Mitchell changed story state to accepted in Pivotal Tracker
Status: RESOLVED → VERIFIED
This is still happening -- and it's happening randomly.  I just ran puppet three times: the first two runs worked, and the third failed with

> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: This
> OS is not supported for collectd at /etc/puppet/production/modules/collectd/mani
> fests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
> Warning: Not using cache on failed catalog
> Error: Could not retrieve catalog; skipping run

This seemed to stop while attachment 8520849 [details] [diff] [review] was in place, but "seemed" is as sure as I can be.

The only 'include collectd' appears within an 'if ($::operatingsystem != Windows)' block, and that block would fail on Windows for other reasons (packages::bash, for example) if it were evaluated there.  So to all appearances the conditional is sometimes true and sometimes false, which doesn't make any sense.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comparing the output of 'facter -p' over several minutes doesn't show anything interesting (time, memory use, etc. change, but that's expected).
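The snapshot comparison described above could be automated along these lines. This is only a minimal sketch: it assumes the plain `name => value` text output of `facter -p`, and the `VOLATILE` list and both function names are hypothetical, chosen for illustration rather than taken from this bug.

```ruby
# Parse `facter -p` text output ("name => value" per line) into a Hash.
# Lines without " => " (blank lines, continuation lines) are skipped.
def parse_facts(text)
  text.each_line.with_object({}) do |line, facts|
    name, value = line.chomp.split(' => ', 2)
    facts[name] = value unless value.nil?
  end
end

# Hypothetical list of facts that are expected to change between runs.
VOLATILE = %w[uptime_seconds memoryfree swapfree].freeze

# Return { fact => [before, after] } for every non-volatile fact that differs.
def changed_facts(before, after, ignore = VOLATILE)
  (before.keys | after.keys).each_with_object({}) do |name, diff|
    next if ignore.include?(name)
    diff[name] = [before[name], after[name]] if before[name] != after[name]
  end
end

first  = parse_facts("operatingsystem => windows\nuptime_seconds => 100\n")
second = parse_facts("operatingsystem => windows\nuptime_seconds => 160\n")
p changed_facts(first, second)
```

An empty result means only the ignored facts moved between the two snapshots, which matches what was observed here.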
I tried to capture the traffic between the Windows host and the master and decrypt it, but I couldn't get Wireshark to actually do the decryption.
From discussion with Callek, things to try:
 - add a warning() that will print the $operatingsystem fact somewhere early in the manifests.  Maybe it's spelled "NT" sometimes?
 - hack the client-side ruby to print out the list of facts it's sending
With http://hg.mozilla.org/build/puppet/rev/b192ebcf48b8 in place, I get
> Nov 20 12:50:59 releng-puppet2 puppet-master[6653]: (Scope(Class[main])) fqdn b-2008-ix-0081.winbuild.releng.scl3.mozilla.com operatingsystem windows
in the logs, but still a failure like comment 10.

Adding a warning(..) to modules/collectd/manifests/init.pp doesn't show up for windows.  But adding a warning(..) to modules/collectd/manifests/settings.pp *does*.  Why??
I still have no clue what's going on.  I tried to hack puppet itself on the puppetmaster to print out the include relationships (what includes what), but that seemed to bog down the puppetmaster and my SSH connection got oom'd.  The load averages are still in the 20's, so I'm not going to touch it anymore.  See also bug 1102540 - these hosts were already running hot.
I tried upgrading to Puppet-3.7.0 on the agent (it had 3.4.3 installed!) but that didn't help.
I added

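  def evaluate_classes(classes, scope, lazy_evaluate = true, fqname = false)
    # debugging aid: log each requested class list and the file:line that requested it
    Puppet.warning("evaluate_classes #{classes.inspect} #{scope.source.file}:#{scope.source.line}")
    # ... original method body continues unchanged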

in lib/puppet/parser/compiler.rb.  Here's what I see:

> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: (Scope(Class[main])) fqdn b-2008-ix-0081.winbuild.releng.scl3.mozilla.com operatingsystem windows
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["collectd::settings"] :
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: This OS is not supported for collectd at /etc/puppet/production/modules/collectd/manifests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
> Nov 20 16:31:23 releng-puppet2 puppet-master[31166]: last message repeated 2 times

Which looks weird.  But hey, check THIS out: even for a non-Windows host, collectd::settings is the first thing included...

> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: (Scope(Class[main])) fqdn talos-mtnlion-r5-099.test.releng.scl3.mozilla.com operatingsystem Darwin
> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["collectd::settings"] :
> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["toplevel::slave::releng::test::gpu"] /etc/puppet/production/manifests/nodes.pp:15
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["packages::setup"] /etc/puppet/production/modules/toplevel/manifests/base.pp:7
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["config"] /etc/puppet/production/modules/packages/manifests/setup.pp:4
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["puppet"] /etc/puppet/production/modules/toplevel/manifests/base.pp:7
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["packages::puppet"] /etc/puppet/production/modules/puppet/manifests/init.pp:7
Deleting modules/collectd allows runs to succeed on Windows (of course, they fail everywhere else).  So this automagical include of collectd::settings before anything else is parsed is apparently "soft", so if collectd::settings doesn't exist, no problem.

There's something "smart" that puppet's doing here, and when I find out what it is, it's going to get patched into oblivion.
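To spot this pattern without reading the syslog by eye, the `evaluate_classes` lines emitted by the temporary Puppet.warning call can be grouped per puppet-master process. The following is a rough sketch, not a tool used in this bug: the regular expression and the `includes_by_pid` name are assumptions, and since one process can compile many catalogs, grouping by PID is only meaningful for short excerpts like the ones quoted above.

```ruby
# Matches the debugging lines from the temporary Puppet.warning call, e.g.
#   ... puppet-master[31166]: evaluate_classes ["collectd::settings"] :
# Captures: PID, the inspected class list, and the "file:line" source.
LINE = /puppet-master\[(\d+)\]: evaluate_classes (\[.*?\]) (\S*)/

# Return { pid => [[classes, source], ...] } in log order.
def includes_by_pid(log)
  log.each_line.with_object(Hash.new { |h, k| h[k] = [] }) do |line, acc|
    next unless (m = LINE.match(line))
    acc[m[1]] << [m[2], m[3]]
  end
end

log = <<~LOG
  Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["collectd::settings"] :
  Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
  Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["collectd::settings"] :
LOG

includes_by_pid(log).each do |pid, entries|
  classes, source = entries.first
  origin = source == ':' ? '(no source location)' : source
  puts "pid #{pid}: first include #{classes} from #{origin}"
end
```

An include whose source is a bare `:` has no file or line attached, which is the signature of the unexplained collectd::settings evaluation in both excerpts.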
OK, I've replicated this in relabs, so I'm not live-hacking the production puppetmasters anymore.
Oh, and more importantly -- I didn't see the collectd error until the running master compiled a catalog that *did* involve collectd.  So this is a case of something in the compiler "leaking" between catalog runs.
A statement at top-scope level inside a module is added to the parser's context once the parser sees that manifest file -- which only happens when a POSIX run occurs.  After that, though, it's as if this include had been written in site.pp, so it is evaluated for every node, Windows included.
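The leak can be illustrated with a toy model. This is emphatically not Puppet's real code: `ToyMaster`, the file names, and the `toplevel::windows` / `toplevel::posix` class names are all hypothetical, and the thread does not say which manifest holds the top-scope statement. The sketch only shows the shape of the problem, namely parser state that outlives a single catalog compile.

```ruby
# Toy model only: none of these names are Puppet's real classes.
class ToyMaster
  def initialize(manifests)
    @manifests = manifests # path => { defines: [...], top_scope: [...] }
    @loaded = {}           # parser state that outlives a single compile
  end

  # Compile a "catalog" for a node that asks for the `wanted` classes.
  # Returns the classes evaluated, in order.
  def compile(wanted)
    # Top-scope statements from every file parsed so far run first,
    # as if they had been written in site.pp.
    leaked = @loaded.values.flat_map { |m| m[:top_scope] }
    wanted.each { |klass| load_file_defining(klass) }
    leaked + wanted
  end

  private

  # Parsing a file is what registers its top-scope statements.
  def load_file_defining(klass)
    path, manifest = @manifests.find { |_, m| m[:defines].include?(klass) }
    @loaded[path] = manifest if path
  end
end

manifests = {
  'windows.pp'    => { defines: ['toplevel::windows'], top_scope: [] },
  'posix_only.pp' => { defines: ['toplevel::posix'],
                       top_scope: ['collectd::settings'] }
}
master = ToyMaster.new(manifests)

p master.compile(['toplevel::windows']) # => ["toplevel::windows"]
p master.compile(['toplevel::posix'])   # => ["toplevel::posix"]
p master.compile(['toplevel::windows']) # => ["collectd::settings", "toplevel::windows"]
```

The first Windows compile is clean; the same request fails only after a POSIX compile has caused the offending file to be parsed, which would explain why the error looked intermittent and why restarting the master appeared to help for a while.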
Attachment #8526907 - Flags: review?(jwatkins)
Attachment #8526907 - Flags: review?(jwatkins) → review+
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED