Closed Bug 1491732 Opened 6 years ago Closed 6 years ago

t-linux64-ms workers getting stuck on puppetize

Categories

(Infrastructure & Operations :: RelOps: Puppet, task, P2)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Assigned: dhouse)

References

Details

Attachments

(1 file)

55 bytes, text/x-github-pull-request
dhouse: checked-in+
During the puppet run, the workers try to connect to releng-puppet2.srv.releng.scl3.mozilla.com instead of releng-puppet2.srv.releng.mdc1.mozilla.com and end up unpuppetized. We currently have 79 machines across MDC1 and MDC2 in a bad state that cannot perform any work. The problem is explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35

TL;DR: I started the puppet service on the first machine (t-linux64-ms-001), looked in Papertrail, and found this:

> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140)

and

> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable

I also looked into the build-puppet repo and found several places where that hostname is used, but I'm not sure which of them causes the issue.
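As a quick triage step, one can ask the agent directly which master it is configured to use and test whether that host is reachable on the agent port. A minimal sketch, assuming stock puppet agent paths and the hostnames from the log above (not commands taken from this bug):

```
#!/usr/bin/env bash
# Triage sketch (assumed paths/hostnames, not taken from this bug):
# check which puppet master a worker is configured for and whether the
# scl3 and mdc1 masters are reachable on the agent port (8140).

HOST="t-linux64-ms-001.test.releng.mdc1.mozilla.com"

ssh "root@${HOST}" '
  # Which server does the agent think it should use?
  puppet config print server

  # Any lingering scl3 references in the agent configuration?
  grep -rn "scl3" /etc/puppet/ 2>/dev/null

  # Is each candidate master reachable on port 8140?
  for master in releng-puppet2.srv.releng.scl3.mozilla.com \
                releng-puppet2.srv.releng.mdc1.mozilla.com; do
    if timeout 5 bash -c "echo > /dev/tcp/${master}/8140" 2>/dev/null; then
      echo "${master}:8140 reachable"
    else
      echo "${master}:8140 unreachable"
    fi
  done
'
```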
Blocks: 1464064
Severity: normal → critical
Priority: -- → P2
When reimaging/rebuilding, puppetize.sh's first Puppet run hits the mig-agent error. Because of that error, puppetize.sh keeps retrying the puppet run. I can reboot and the worker will then start, but I think many of these machines are getting hung up on that failure.

Note: workers in mdc1 have been moved to https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (and I see most of the hosts as active in the list https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM/edit#gid=0 referenced in https://bugzilla.mozilla.org/show_bug.cgi?id=1464064#c35).

I don't think the scl3 puppetmaster warnings are what is causing this, and they will be resolved once the scl3 entries are removed (there is a PR for that which I expect will be merged in the next day).

I reimaged t-linux64-ms-001.test.releng.mdc1 and checked puppetize.log; it shows it is using the mdc1 puppet master and not attempting scl3:

```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-linux64-ms-001.test.releng.mdc1.mozilla.com
root@t-linux64-ms-001.test.releng.mdc1.mozilla.com's password:
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-66-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

root@t-linux64-ms-001:~# tail -f /root/puppetize.log
Contacting puppet server puppet
16 Sep 21:43:45 ntpdate[757]: no server suitable for synchronization found
Certificate request for t-linux64-ms-001.test.releng.mdc1.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
```
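The retry behavior described above has roughly the following shape. This is an illustrative sketch of the pattern, not the actual puppetize.sh; the server name and retry interval are assumptions:

```
#!/usr/bin/env bash
# Illustrative sketch only: the retry pattern described above, not the real
# puppetize.sh. If one resource (here, the mig-agent restart) fails on every
# run, the loop never exits and the worker stays stuck in puppetization.

PUPPET_SERVER="puppet"   # assumed; the real script resolves the master itself

while true; do
    # With --test, puppet uses detailed exit codes: 0 = no changes,
    # 2 = changes applied, 4/6 = failures occurred.
    puppet agent --test --server "${PUPPET_SERVER}"
    rc=$?
    if [ "${rc}" -eq 0 ] || [ "${rc}" -eq 2 ]; then
        echo "puppet run succeeded (exit ${rc}); finishing puppetization"
        break
    fi
    echo "puppet run failed (exit ${rc}); retrying in 60s" >&2
    sleep 60
done
```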
Reverting the mig-agent restart fixes the problem, so we'll need to find another way to fix the upgrade-hanging issue for bug 1489603. I verified this by reinstalling/reimaging t-linux64-ms-003 without a deploypass and then manually running puppetize.sh against my puppet environment with the mig-agent restart reverted (back to a kill); it puppetized correctly.
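For reference, the difference between the two behaviors amounts to roughly the following. These are assumed commands for illustration; the actual build-puppet change may invoke mig-agent differently:

```
# Illustration only; assumed commands, not the actual build-puppet diff.

# The change that broke first-run puppetization: a full service restart,
# which can hang on a freshly imaged host mid-puppetization.
service mig-agent restart

# The reverted behavior: just kill any running agent and let it be
# brought back up later, without blocking the puppet run.
pkill -f mig-agent || true
```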
Depends on: MIG_Agent
Summary: Stop using releng-puppet2.srv.releng.scl3.mozilla.com for t-linux64-ms workers → t-linux64-ms workers getting stuck on puppetize
Attached file GitHub Pull Request
Assignee: relops → dhouse
Attachment #9009535 - Flags: checked-in+
The PR was merged last night around 1 PST, and ciduty reimaged the linux moonshots that were stuck. So things look good for now; we'll need to reopen bug 1489603 to rework that change so it does not block on new builds.
I pushed to try to get some jobs running on the pool (forced workertype conversion to "-tw" to run in mdc1 taskcluster-worker workers).
(In reply to Dave House [:dhouse] from comment #5)
> I pushed to try to get some jobs running on the pool (forced workertype
> conversion to "-tw" to run in mdc1 taskcluster-worker workers).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168
(In reply to Dave House [:dhouse] from comment #6)
> (In reply to Dave House [:dhouse] from comment #5)
> > I pushed to try to get some jobs running on the pool (forced workertype
> > conversion to "-tw" to run in mdc1 taskcluster-worker workers).
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=17414a7ea95c812f3e1bfa891d893baa803e8168

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d896df6f066fb1bfc84bc4b68ba1a9d5b253d43

I had a typo in the worker type name in the first push (I needed to omit "64" from the workertype name). The second push worked, so jobs ran against both the mdc2 and mdc1 workers, and I see tasks running or completed in both worker types for the machines that had been broken/stuck:

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos (MDC2)
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw (MDC1, temporary during the transition to generic-worker)
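For context, the "forced workertype conversion" means pointing the in-tree test tasks at the temporary "-tw" pool for the try push. A rough sketch of how one might do that locally, assuming the worker type string is referenced in the in-tree taskcluster configuration (the file locations and exact mechanism are assumptions, not taken from this bug):

```
#!/usr/bin/env bash
# Hypothetical sketch only: force try jobs onto the temporary "-tw" pool by
# rewriting the worker type in the in-tree task configuration before pushing.
# Paths and mechanism are assumptions, not taken from this bug.

cd mozilla-central   # assumed local checkout

# Find every file referencing the hardware talos worker type and append "-tw"
# to exact matches; the ([^-]|$) guard avoids touching names that already
# carry a suffix such as "-tw".
grep -rln "gecko-t-linux-talos" taskcluster/ |
  xargs sed -E -i 's/gecko-t-linux-talos([^-]|$)/gecko-t-linux-talos-tw\1/g'

hg diff taskcluster/   # review the change before pushing to try
```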
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED