Bug 1096734
Opened 10 years ago
Closed 10 years ago
masters are using old manifests for production (or something)
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4088] )
Attachments
(2 files)
1.05 KB, patch | dividehex: review+ | dustin: checked-in+
1.02 KB, patch | dividehex: review+ | dustin: checked-in+
When I run 'puppet agent' on my Windows test system, b-2008-ix-0081, it often -- but not always -- fails with
> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: This OS is not supported for collectd at /etc/puppet/production/modules/collectd/manifests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
> Warning: Not using cache on failed catalog
> Error: Could not retrieve catalog; skipping run
But that error was fixed *hours* ago! I've even restarted the master since then. The puppet run works fine against other masters, so it's something particular to this master that's failing.
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4088]
Assignee
Comment 1•10 years ago
Mark's seeing this a bit, too. It seems intermittent :(
Assignee
Comment 2•10 years ago
As far as I can see, this is the only thing that includes collectd, so removing it should mean that nothing -- windows or not windows -- tries to install collectd. It won't actually *remove* collectd.
Attachment #8520849 -
Flags: review?(jwatkins)
Comment 3•10 years ago
Comment on attachment 8520849 [details] [diff] [review] bug1096734-temp-remove-.patch I don't see a problem with this. Let's just make sure we re-enable it down the line.
Attachment #8520849 -
Flags: review?(jwatkins) → review+
Assignee
Comment 4•10 years ago
Comment on attachment 8520849 [details] [diff] [review] bug1096734-temp-remove-.patch https://hg.mozilla.org/build/puppet/rev/87ed518b3152
Attachment #8520849 -
Flags: checked-in+
Assignee
Comment 5•10 years ago
Well, I didn't see *collectd* errors after this, although I did see some old manifests being used -- I think. This is a tough one to nail down. Anyway, I'm going to back out attachment 8520849 [details] [diff] [review].
Assignee
Comment 6•10 years ago
I'm going to close this as WORKSFORME for the moment, until I see this again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Comment 7•10 years ago
Dustin J. Mitchell changed story state to delivered in Pivotal Tracker
Resolution: WORKSFORME → FIXED
Comment 8•10 years ago
Dustin J. Mitchell changed story state to accepted in Pivotal Tracker
Status: RESOLVED → VERIFIED
Comment 9•10 years ago
Dustin J. Mitchell changed story state to accepted in Pivotal Tracker
Assignee
Comment 10•10 years ago
This is still happening -- and it's happening randomly. I just ran puppet three times, and two times worked, while the third failed with
> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: This OS is not supported for collectd at /etc/puppet/production/modules/collectd/manifests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
> Warning: Not using cache on failed catalog
> Error: Could not retrieve catalog; skipping run
This seemed to stop while attachment 8520849 [details] [diff] [review] was in place, but "seemed" is as sure as I can be. The only 'include collectd' appears within an 'if ($::operatingsystem != Windows)' block which should fail for other reasons on Windows (packages::bash for example). So to all appearances that conditional is true sometimes and false sometimes. Which doesn't make any sense.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 11•10 years ago
Comparing the output of 'facter -p' over several minutes doesn't show anything interesting (time, memory use, etc change, but that's expected)
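A minimal sketch of that kind of comparison, assuming each snapshot is the plain `key => value` text that 'facter -p' prints. The helper names and the list of "expected to change" facts are hypothetical choices for illustration, not something taken from this bug.

```ruby
# Hypothetical helper: report which facts differ between two 'facter -p'
# snapshots, skipping facts that are expected to change between runs.
# The names in VOLATILE are an assumed selection; adjust for the host.
VOLATILE = %w[uptime uptime_seconds memoryfree swapfree].freeze

# Turn "key => value" lines into a Hash; lines without " => " are skipped.
def parse_facts(text)
  text.each_line.with_object({}) do |line, facts|
    key, value = line.chomp.split(' => ', 2)
    facts[key] = value if key && value
  end
end

# Sorted list of fact names whose values differ (or that exist in only one
# snapshot), ignoring the volatile ones.
def changed_facts(before, after, ignore: VOLATILE)
  (before.keys | after.keys)
    .reject { |k| ignore.include?(k) }
    .select { |k| before[k] != after[k] }
    .sort
end

first  = parse_facts("operatingsystem => windows\nuptime_seconds => 100\n")
second = parse_facts("operatingsystem => windows\nuptime_seconds => 460\n")
p changed_facts(first, second)
```

If operatingsystem really were reported differently between runs, it would show up in the returned list; here only a volatile fact differs, so nothing is reported.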
Assignee
Comment 12•10 years ago
I tried to capture the traffic between the windows host and the master and decrypt it, but I couldn't get wireshark to actually do the decryption.
Assignee
Comment 13•10 years ago
From discussion with Callek, things to try:
- add a warn() that will print the $operatingsystem fact somewhere early in the manifests. Maybe it's spelled "NT" sometimes?
- hack the client-side ruby to print out the list of facts it's sending
Assignee
Comment 14•10 years ago
With http://hg.mozilla.org/build/puppet/rev/b192ebcf48b8 in place, I get
> Nov 20 12:50:59 releng-puppet2 puppet-master[6653]: (Scope(Class[main])) fqdn b-2008-ix-0081.winbuild.releng.scl3.mozilla.com operatingsystem windows
in the logs, but still a failure like comment 10. Adding a warning(..) to modules/collectd/manifests/init.pp doesn't show up for windows. But adding a warning(..) to modules/collectd/manifests/settings.pp *does*. Why??
Assignee
Comment 15•10 years ago
I still have no clue what's going on. I tried to hack puppet itself on the puppetmaster to print out the include relationships (what includes what), but that seemed to bog down the puppetmaster and my SSH connection got oom'd. The load averages are still in the 20's, so I'm not going to touch it anymore. See also bug 1102540 - these hosts were already running hot.
Assignee
Comment 16•10 years ago
I tried upgrading to Puppet-3.7.0 on the agent (it had 3.4.3 installed!) but that didn't help.
Assignee
Comment 17•10 years ago
I added
    def evaluate_classes(classes, scope, lazy_evaluate = true, fqname = false)
      Puppet.warning("evaluate_classes #{classes.inspect} #{scope.source.file}:#{scope.source.line}")
in lib/puppet/parser/compiler.rb. Here's what I see:
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: (Scope(Class[main])) fqdn b-2008-ix-0081.winbuild.releng.scl3.mozilla.com operatingsystem windows
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["collectd::settings"] :
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
> Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: This OS is not supported for collectd at /etc/puppet/production/modules/collectd/manifests/settings.pp:46 on node b-2008-ix-0081.winbuild.releng.scl3.mozilla.com
> Nov 20 16:31:23 releng-puppet2 puppet-master[31166]: last message repeated 2 times
Which looks weird. But hey, check THIS out, even for a non-windows host, that collectd::settings is the first thing included...
> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: (Scope(Class[main])) fqdn talos-mtnlion-r5-099.test.releng.scl3.mozilla.com operatingsystem Darwin
> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["collectd::settings"] :
> Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["toplevel::slave::releng::test::gpu"] /etc/puppet/production/manifests/nodes.pp:15
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["packages::setup"] /etc/puppet/production/modules/toplevel/manifests/base.pp:7
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["config"] /etc/puppet/production/modules/packages/manifests/setup.pp:4
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["puppet"] /etc/puppet/production/modules/toplevel/manifests/base.pp:7
> Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["packages::puppet"] /etc/puppet/production/modules/puppet/manifests/init.pp:7
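To spot the pattern without reading the log by eye, something like the following could pull out the first "evaluate_classes" entry per puppet-master process. This is only a sketch: the regular expression is an assumption based on the log lines quoted in this comment, the method name is hypothetical, and grouping by PID only finds the first compile each process logged (a process that compiles several catalogs would need splitting on the "fqdn ..." line instead).

```ruby
# Assumed line shape, taken from the syslog excerpts in this comment:
#   ... puppet-master[PID]: evaluate_classes ["some::class"] file:line
EVAL_LINE = /puppet-master\[(\d+)\]: evaluate_classes (\[.*?\])/

# Map each puppet-master PID to the first class list it logged.
def first_class_per_pid(log)
  log.each_line.with_object({}) do |line, firsts|
    m = EVAL_LINE.match(line) or next   # skip lines that are not from the hook
    firsts[m[1]] ||= m[2]               # keep only the first entry per PID
  end
end

sample = <<~'LOG'
  Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["collectd::settings"] :
  Nov 20 16:31:17 releng-puppet2 puppet-master[31166]: evaluate_classes ["::config"] /etc/puppet/production/modules/collectd/manifests/settings.pp:4
  Nov 20 16:31:23 releng-puppet2 puppet-master[31169]: evaluate_classes ["collectd::settings"] :
  Nov 20 16:31:24 releng-puppet2 puppet-master[31169]: evaluate_classes ["toplevel::slave::releng::test::gpu"] /etc/puppet/production/manifests/nodes.pp:15
LOG

first_class_per_pid(sample).each { |pid, classes| puts "#{pid}: #{classes}" }
```

For the excerpt above, both processes start with collectd::settings, before anything from nodes.pp.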
Assignee
Comment 18•10 years ago
Deleting modules/collectd allows runs to succeed on Windows (of course, they fail everywhere else). So this automagical include of collectd::settings before anything else is parsed is apparently "soft", so if collectd::settings doesn't exist, no problem. There's something "smart" that puppet's doing here, and when I find out it's going to get patched into oblivion.
Assignee
Comment 19•10 years ago
OK, I've replicated this in relabs, so I'm not live-hacking the production puppetmasters anymore.
Assignee
Comment 20•10 years ago
Oh, and more importantly -- I didn't see the collectd error until the running master compiled a catalog that *did* involve collectd. So this is a case of something in the compiler "leaking" between catalog runs.
Assignee
Comment 21•10 years ago
A statement at topscope level inside a module is added to the parser's context once it sees that manifest file -- which only happens when a POSIX run occurs. After that, though, it's as if this include had been written in site.pp.
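The behaviour can be illustrated with a toy model. This is emphatically not Puppet's implementation: ToyCompiler, load_manifest, and the file and class names passed to it are all hypothetical, and the model only imitates the one property described above, namely that a long-lived compiler keeps top-scope statements from every manifest file it has ever loaded.

```ruby
# Toy model only, NOT Puppet's code. A long-lived "compiler" remembers the
# top-scope includes of every manifest file it has loaded so far.
class ToyCompiler
  def initialize(manifests)
    @manifests = manifests  # file name => classes included at top scope in it
    @loaded    = {}         # files already seen by this process
    @topscope  = []         # persists across compiles: this is the "leak"
  end

  # Returns the top-scope classes evaluated ahead of the node's own classes,
  # then loads whichever manifest files this node's catalog pulls in.
  def compile(node_files)
    leaked = @topscope.dup
    node_files.each { |file| load_manifest(file) }
    leaked
  end

  private

  # The first time a file is seen, its top-scope includes become global.
  def load_manifest(file)
    return if @loaded[file]
    @loaded[file] = true
    @topscope.concat(@manifests.fetch(file, []))
  end
end

compiler = ToyCompiler.new('collectd/settings.pp' => ['collectd::settings'])
p compiler.compile([])                        # Windows node on a fresh master
p compiler.compile(['collectd/settings.pp'])  # POSIX node loads the module
p compiler.compile([])                        # Windows node again
```

In this model the Windows node compiles cleanly until some POSIX node has caused the file to be loaded; after that, every compile in the same process starts by evaluating the leaked class, which matches the "works after a restart, fails later" pattern seen here.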
Attachment #8526907 -
Flags: review?(jwatkins)
Updated•10 years ago
Attachment #8526907 -
Flags: review?(jwatkins) → review+
Assignee
Comment 22•10 years ago
Comment on attachment 8526907 [details] [diff] [review] bug1096734-topscope.patch remote: https://hg.mozilla.org/build/puppet/rev/57b9d37c646a
Attachment #8526907 -
Flags: checked-in+
Assignee
Updated•10 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED