Closed Bug 1491732 Opened 6 years ago Closed 6 years ago

t-linux64-ms workers getting stuck on puppetize

Categories: Infrastructure & Operations :: RelOps: Puppet (task, P2)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bcrisan; Assignee: dhouse
Attachments: 1 file (55 bytes, text/x-github-pull-request | dhouse: checked-in+)

Description
During the run, puppet tries to connect to releng-puppet2.srv.releng.scl3.mozilla.com instead of releng-puppet2.srv.releng.mdc1.mozilla.com, and the workers end up unpuppetized.
We currently have 79 machines in MDC1 and MDC2 in a bad state that can't perform any work.
Problem explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35
TL;DR: I started the puppet service on the first machine (t-linux64-ms-001), looked into Papertrail, and found this:
> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140)
and
> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
I also looked into the build-puppet repo and found some uses of the scl3 hostname, but I'm not sure which of them causes the issue.
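For reference, this is the kind of search that turns up the lingering scl3 references in a build-puppet checkout (a minimal sketch; the exact repository layout and paths are assumptions):

```
# Sketch: locate hard-coded references to the old scl3 puppetmaster in a local
# build-puppet checkout. The directory names below are assumptions.
cd ~/src/build-puppet

# Manifests, hiera data, and templates that still point at the scl3 master:
grep -rn "releng-puppet2.srv.releng.scl3.mozilla.com" . \
    --include="*.pp" --include="*.yaml" --include="*.erb"

# Broader sweep for anything scl3-specific that should now say mdc1:
grep -rln "scl3" manifests/ modules/ 2>/dev/null
```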
When reimaging/rebuilding, puppetize.sh's first run of Puppet hits the mig-agent error. Because of this error, puppetize.sh keeps retrying the puppet run. I can reboot and the worker will then start, but I think many of these machines are getting hung up on that failure.
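For context, puppetize.sh loops over the agent run until it succeeds, so a resource that fails on every run (like the mig-agent restart here) keeps the worker stuck in that loop. An illustrative sketch of that pattern, not the actual script:

```
#!/bin/bash
# Illustration only: a retry loop of the kind puppetize.sh uses. A resource
# that fails on every run blocks puppetization indefinitely.
PUPPET_SERVER="${PUPPET_SERVER:-releng-puppet2.srv.releng.mdc1.mozilla.com}"

while true; do
    puppet agent --test --server "$PUPPET_SERVER"
    rc=$?
    # --test implies --detailed-exitcodes: 0 = no changes, 2 = changes applied,
    # anything else means failures, so retry.
    if [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
        break
    fi
    echo "puppet run failed (exit $rc); retrying in 60s" >&2
    sleep 60
done
```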
Note: the mdc1 workers have been moved to https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (and I see most of the hosts as active in the list https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM/edit#gid=0 referenced in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35).
I don't think the scl3 puppetmaster warnings are what's causing this; they should be resolved once the scl3 entries are removed (there is a PR for that which I expect will be merged in the next day).
I reimaged t-linux64-ms-001.test.releng.mdc1 and checked puppetize.log; it shows the host using the mdc1 puppetmaster and not attempting scl3:
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-linux64-ms-001.test.releng.mdc1.mozilla.com
root@t-linux64-ms-001.test.releng.mdc1.mozilla.com's password:
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-66-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@t-linux64-ms-001:~# tail -f /root/puppetize.log
Contacting puppet server puppet
16 Sep 21:43:45 ntpdate[757]: no server suitable for synchronization found
Certificate request for t-linux64-ms-001.test.releng.mdc1.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
```
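A quick way to double-check which puppetmaster a host like this is actually configured against, and that it can reach it (a sketch; the config paths on these images are assumptions):

```
# Sketch: confirm which puppetmaster the agent will use. Config paths vary by
# puppet version; both common locations are checked here.
puppet config print server --section agent
grep -n "^ *server" /etc/puppet/puppet.conf /etc/puppetlabs/puppet/puppet.conf 2>/dev/null

# Verify the mdc1 master is reachable on the agent port (8140):
nc -zv -w 5 releng-puppet2.srv.releng.mdc1.mozilla.com 8140
```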
Reverting the mig-agent restart fixes the problem. So we'll need to find another way to fix the upgrade hanging issue for bug 1489603.
I verified by reinstalling/reimaging t-linux64-ms-003 without a deploypass. Then I manually ran puppetize.sh against my puppet environment with the mig-agent restart reverted (to a kill), and it puppetized correctly.
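For clarity on the revert, this is roughly the shell-level difference between the two approaches (a sketch; the actual change is in the build-puppet mig module, and the exact commands below are assumptions):

```
# Sketch only, not the actual manifest code.

# What bug 1489603 added: restart mig-agent as part of the puppet run. If the
# agent misbehaves during the first puppetize run, the restart can fail and
# puppetize.sh never gets a clean run.
systemctl restart mig-agent

# The reverted behavior: just kill any running agent and let it be started
# again later, which does not block the rest of the puppet run.
pkill -f mig-agent || true
```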
Depends on: MIG_Agent
Summary: Stop using releng-puppet2.srv.releng.scl3.mozilla.com for t-linux64-ms workers → t-linux64-ms workers getting stuck on puppetize
Assignee: relops → dhouse
Attachment #9009535 - Flags: checked-in+
The PR was merged last night around 1 PST and ciduty reimaged the linux moonshots that were stuck. So things look good for now, and we'll need to reopen bug 1489603 to rework that change so it does not block new builds.
I pushed to try to get some jobs running on the pool (forced workertype conversion to "-tw" to run in mdc1 taskcluster-worker workers).
(In reply to Dave House [:dhouse] from comment #5)
> I pushed to try to get some jobs running on the pool (forced workertype
> conversion to "-tw" to run in mdc1 taskcluster-worker workers).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168
(In reply to Dave House [:dhouse] from comment #6)
> (In reply to Dave House [:dhouse] from comment #5)
> > I pushed to try to get some jobs running on the pool (forced workertype
> > conversion to "-tw" to run in mdc1 taskcluster-worker workers).
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d896df6f066fb1bfc84bc4b68ba1a9d5b253d43
I had a typo in the worker type name (I needed to omit the "64").
That worked. So jobs ran against the mdc2 and mdc1 workers. I see tasks running or completed in both worker types for the machines that had been broken/stuck.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos (MDC2)
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (MDC1 temporary during transition to generic-worker)
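Those pages are backed by the Taskcluster Queue, so worker type names can also be checked directly against the API before pushing, which would have caught the "64" typo above. A sketch, assuming the Queue API endpoints of that era:

```
# Sketch: ask the Queue whether a worker type exists before pushing to try.
# The endpoint form is an assumption based on the tools.taskcluster.net URLs above.
curl -s "https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw" \
    | python -m json.tool

# A mistyped name (e.g. "gecko-t-linux64-talos-tw") returns an error payload instead.
```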
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED