Closed Bug 1491732 Opened 6 years ago Closed 6 years ago

t-linux64-ms workers getting stuck on puppetize

Categories

(Infrastructure & Operations :: RelOps: Puppet, task, P2)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Assigned: dhouse)

References

Details

Attachments

(1 file)

55 bytes, text/x-github-pull-request
dhouse: checked-in+
During the puppet run, the workers try to connect to releng-puppet2.srv.releng.scl3.mozilla.com instead of releng-puppet2.srv.releng.mdc1.mozilla.com and end up unpuppetized. We currently have 79 machines across MDC1 and MDC2 in a bad state that cannot perform any work. The problem is explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35

TL;DR: I started the puppet service on the first machine (t-linux64-ms-001), looked in Papertrail, and found this:

> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140)

and

> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable

I also looked into the build-puppet repo and found several places where that hostname is used, but I'm not sure which of them causes the issue.
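As a quick triage step, one can ask the agent directly which master it is configured to use and test whether that host is reachable on the agent port. A minimal sketch, assuming stock puppet agent paths and the hostnames from the log above (not commands taken from this bug):

```
#!/usr/bin/env bash
# Triage sketch (assumed paths/hostnames, not taken from this bug):
# check which puppet master a worker is configured for and whether the
# scl3 and mdc1 masters are reachable on the agent port (8140).

HOST="t-linux64-ms-001.test.releng.mdc1.mozilla.com"

ssh "root@${HOST}" '
  # Which server does the agent think it should use?
  puppet config print server

  # Any lingering scl3 references in the agent configuration?
  grep -rn "scl3" /etc/puppet/ 2>/dev/null

  # Is each candidate master reachable on port 8140?
  for master in releng-puppet2.srv.releng.scl3.mozilla.com \
                releng-puppet2.srv.releng.mdc1.mozilla.com; do
    if timeout 5 bash -c "echo > /dev/tcp/${master}/8140" 2>/dev/null; then
      echo "${master}:8140 reachable"
    else
      echo "${master}:8140 unreachable"
    fi
  done
'
```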
Blocks: 1464064
Severity: normal → critical
Priority: -- → P2
When reimaging/rebuilding, puppetize.sh's first Puppet run hits the mig-agent error. Because of that error, puppetize.sh keeps retrying the puppet run. I can reboot and the worker will then start, but I think many of these machines are getting hung up on that failure.

Note: workers in mdc1 have been moved to https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (and I see most of the hosts as active in the list https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM/edit#gid=0 referenced in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35).

I don't think the scl3 puppetmaster warnings are what is causing this, and they will be resolved once the scl3 entries are removed (there is a PR for that which I expect will be merged in the next day).

I reimaged t-linux64-ms-001.test.releng.mdc1 and checked puppetize.log; it shows it is using the mdc1 puppet master and not attempting scl3:

```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-linux64-ms-001.test.releng.mdc1.mozilla.com
root@t-linux64-ms-001.test.releng.mdc1.mozilla.com's password:
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-66-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

root@t-linux64-ms-001:~# tail -f /root/puppetize.log
Contacting puppet server puppet
16 Sep 21:43:45 ntpdate[757]: no server suitable for synchronization found
Certificate request for t-linux64-ms-001.test.releng.mdc1.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
```
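The retry behavior described above has roughly the following shape. This is an illustrative sketch of the pattern, not the actual puppetize.sh; the server name and retry interval are assumptions:

```
#!/usr/bin/env bash
# Illustrative sketch only: the retry pattern described above, not the real
# puppetize.sh. If one resource (here, the mig-agent restart) fails on every
# run, the loop never exits and the worker stays stuck in puppetization.

PUPPET_SERVER="puppet"   # assumed; the real script resolves the master itself

while true; do
    # With --test, puppet uses detailed exit codes: 0 = no changes,
    # 2 = changes applied, 4/6 = failures occurred.
    puppet agent --test --server "${PUPPET_SERVER}"
    rc=$?
    if [ "${rc}" -eq 0 ] || [ "${rc}" -eq 2 ]; then
        echo "puppet run succeeded (exit ${rc}); finishing puppetization"
        break
    fi
    echo "puppet run failed (exit ${rc}); retrying in 60s" >&2
    sleep 60
done
```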
Reverting the mig-agent restart fixes the problem, so we'll need to find another way to fix the upgrade-hanging issue for bug 1489603. I verified this by reinstalling/reimaging t-linux64-ms-003 without a deploypass and then manually running puppetize.sh against my puppet environment with the mig-agent restart reverted (back to a kill); it puppetized correctly.
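For reference, the difference between the two behaviors amounts to roughly the following. These are assumed commands for illustration; the actual build-puppet change may invoke mig-agent differently:

```
# Illustration only; assumed commands, not the actual build-puppet diff.

# The change that broke first-run puppetization: a full service restart,
# which can hang on a freshly imaged host mid-puppetization.
service mig-agent restart

# The reverted behavior: just kill any running agent and let it be
# brought back up later, without blocking the puppet run.
pkill -f mig-agent || true
```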
Depends on: MIG_Agent
Summary: Stop using releng-puppet2.srv.releng.scl3.mozilla.com for t-linux64-ms workers → t-linux64-ms workers getting stuck on puppetize
Attached file GitHub Pull Request
Assignee: relops → dhouse
Attachment #9009535 - Flags: checked-in+
The PR was merged last night around 1 PST, and ciduty reimaged the linux moonshots that were stuck. So things look good for now; we'll need to reopen bug 1489603 to rework that change so it does not block on new builds.
I pushed to try to get some jobs running on the pool (forced workertype conversion to "-tw" to run in mdc1 taskcluster-worker workers).
(In reply to Dave House [:dhouse] from comment #5)
> I pushed to try to get some jobs running on the pool (forced workertype
> conversion to "-tw" to run in mdc1 taskcluster-worker workers).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168
(In reply to Dave House [:dhouse] from comment #6)
> (In reply to Dave House [:dhouse] from comment #5)
> > I pushed to try to get some jobs running on the pool (forced workertype
> > conversion to "-tw" to run in mdc1 taskcluster-worker workers).
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d896df6f066fb1bfc84bc4b68ba1a9d5b253d43

I had a typo in the worker type name in the first push (I needed to omit "64" from the workertype name). The second push worked, so jobs ran against both the mdc2 and mdc1 workers, and I see tasks running or completed in both worker types for the machines that had been broken/stuck:

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos (MDC2)
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (MDC1, temporary during the transition to generic-worker)
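For context, the "forced workertype conversion" means pointing the in-tree test tasks at the temporary "-tw" pool for the try push. A rough sketch of how one might do that locally, assuming the worker type string is referenced in the in-tree taskcluster configuration (the file locations and exact mechanism are assumptions, not taken from this bug):

```
#!/usr/bin/env bash
# Hypothetical sketch only: force try jobs onto the temporary "-tw" pool by
# rewriting the worker type in the in-tree task configuration before pushing.
# Paths and mechanism are assumptions, not taken from this bug.

cd mozilla-central   # assumed local checkout

# Find every file referencing the hardware talos worker type and append "-tw"
# to exact matches; the ([^-]|$) guard avoids touching names that already
# carry a suffix such as "-tw".
grep -rln "gecko-t-linux-talos" taskcluster/ |
  xargs sed -E -i 's/gecko-t-linux-talos([^-]|$)/gecko-t-linux-talos-tw\1/g'

hg diff taskcluster/   # review the change before pushing to try
```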
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED