Closed Bug 1489603 (MIG_Agent) Opened 6 years ago Closed 6 years ago

Problem with MIG agent

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: apop, Assigned: dhouse)

Attachments

(1 file, 1 obsolete file)

55 bytes, text/x-github-pull-request
During my daily checks, I saw a lot of notification mails from Puppet with the following message:

Fri Sep 07 11:40:55 -0700 2018 Puppet (err): Command exceeded timeout
Fri Sep 07 11:40:55 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[kill mig]/returns (err): change from notrun to 0 failed: Command exceeded timeout

See also bug 1471424. Can you please help me with this issue or point me to someone who can help?
<dividehex> https://bugzilla.mozilla.org/show_bug.cgi?id=1489603
<dividehex> it looks like the deb packages install a systemd service, correct?
<dividehex> i don't think the problem is with mig but the way releng puppet is managing it
<dividehex> I'm curious to know if the deb package scripts are set up to restart the systemd service after a package upgrade
<dividehex> in the past, releng puppet would query the mig agent for its pid and then use that to kill the daemon and then let puppet restart it
<dividehex> if the deb package restarts the service automatically after the package has been upgraded then there should be no need to have releng puppet restart the service
<zack> MIG actually tries to update its service itself. We’ve had problems before wherein this process of replacing its own older service would fail. In my team’s experience, running the new agent manually gets it up and going again. You may want to hard kill (`kill -9`) any hanging processes.
<zack> i.e. it’s not the package that does this, but the tool itself
<dividehex> https://github.com/mozilla-releng/build-puppet/blob/master/modules/mig/manifests/agent/daemon.pp#L23
<dividehex> that line is failing after the upgrade because '/sbin/mig-agent -q=pid' is hanging indefinitely
<dividehex> it doesn't time out at all
<dividehex> and therefore puppet keeps spawning calls on every puppet run
<dividehex> restarting the service via systemctl cleared it up
<dividehex> i'd like to not have to call mig-agent in order to get the pid in order to kill the agent... seems kinda chicken/eggy
<dividehex> if systemd is already managing it, maybe we just call systemctl restart mig-agent
<zack> That’s definitely a reasonable way to go.
<dividehex> this problem is only happening on ubuntu 16.04 hosts
<dividehex> and we only have a handful of those
<dividehex> ok.. thanks for this info! I'm going to make changes to fix this
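For anyone who still needs the PID query, the hang described in the chat can be bounded from the calling side. This is a minimal sketch, assuming GNU coreutils `timeout` is available on the host. The function name `agent_pid_or_fail`, the `AGENT` override variable, and the 10-second default are illustrative and do not come from build-puppet.

```shell
#!/bin/sh
# Path to the agent binary as quoted in the chat. AGENT is a hypothetical
# override so the function can be pointed at a different binary.
AGENT="${AGENT:-/sbin/mig-agent}"

# Print the agent's PID, giving up after a fixed number of seconds.
# Usage: agent_pid_or_fail [seconds]
agent_pid_or_fail() {
    limit="${1:-10}"
    # timeout(1) exits with status 124 when it had to stop the command.
    timeout "$limit" "$AGENT" -q=pid
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "mig-agent PID query did not return within ${limit}s" >&2
    fi
    return "$status"
}
```

A caller could treat exit status 124 as "agent is wedged" and fall back to `systemctl restart mig-agent`, which is what cleared the problem on the affected hosts.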
Attached file GitHub Pull Request (obsolete) —
Assignee: relops → jwatkins
I've restarted mig-agent via systemctl on the handful of hosts that got wedged here. The PR changes should prevent this the next time we upgrade mig.
Blocks: 1491732
I reverted the patch as it was causing new installs to hang during their first puppet run (tries to restart mig-agent before it is set up and so systemctl errors and then puppet errors out).
notes: hangs when it upgrades? systemd config in package or no? or puppet dependency order causes new-install failure? follow up w/ Zack https://github.com/mozilla/mig
Assignee: jwatkins → dhouse
The packages (rpm and deb) themselves don't contain a systemd config. Rather, the MIG Agent actually creates one itself when it's started. It tries to replace this file itself when it upgrades by terminating any running instances of itself before updating the Unit file it creates and persisting itself as a daemon. For now, manually restarting it with systemctl or by running the agent binary directly seems to be the best way to resolve the issue.

I'm also working on some infrastructure changes today which I expect will be causing some breakage. If the solutions above don't work, please ignore that for now. I can let you know when everything should be operational.
The MIG infrastructure looks to be back in shape, so there shouldn't be any issues resulting from the outage from earlier.
(In reply to Zack Mullaly [:zack] (use NEEDINFO) from comment #6)
> The packages (rpm and deb) themselves don't contain a systemd config.
> Rather, the MIG Agent actually creates one itself when it's started. It
> tries to replace this file itself when it upgrades by terminating any
> running instances of itself before updating the Unit file it creates and
> persisting itself as a daemon. For now, manually restarting it with
> systemctl or by running the agent binary directly seems to be the best way
> to resolve the issue.
>
> I'm also working on some infrastructure changes today which I expect will be
> causing some breakage. If the solutions above don't work, please ignore that
> for now. I can let you know when everything should be operational.

Thank you, Zack, for explaining that. The additional problem we saw is that when the agent binary has never been run, the service is not available. So when Puppet tries to restart the service, it fails to find it.

I'm thinking of just checking for the service first: `systemctl --no-pager status mig-agent && systemctl restart mig-agent`
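The guard proposed above can be sketched as a small shell function. This is only an illustration of the idea, not the content of the attached pull request. The function name `restart_if_present` and the `SYSTEMCTL` override variable are hypothetical, and the skip message is made up for the example.

```shell
#!/bin/sh
# SYSTEMCTL is a hypothetical override so the logic can be exercised
# without touching a real systemd instance.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Restart a unit only when `systemctl status` succeeds for it.
# Usage: restart_if_present mig-agent
restart_if_present() {
    unit="$1"
    if "$SYSTEMCTL" --no-pager status "$unit" >/dev/null 2>&1; then
        "$SYSTEMCTL" restart "$unit"
    else
        # Exit 0 here so a fresh install, where the agent has not yet
        # created its unit file, does not fail the calling Puppet run.
        echo "skipping restart: $unit is not installed or not running"
    fi
}
```

One caveat with this check: `systemctl status` also exits non-zero for a unit that is loaded but not running, so the guard skips the restart in that case as well, not only when the unit is missing.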
(In reply to Zack Mullaly [:zack] (use NEEDINFO) from comment #7)
> The MIG infrastructure looks to be back in shape, so there shouldn't be any
> issues resulting from the outage from earlier.

Thank you!
Attached file GitHub Pull Request
Attachment #9007385 - Attachment is obsolete: true
I tested this on Linux staging worker #240 with the mig-agent service missing. It still failed, but the failure is now reported for a different action. I need to repeat the test on a new install to see the actual state of a new install, as I think something is being cached (puppet knows mig-agent _was_ installed).

https://papertrailapp.com/systems/2015017211/events?focus=984611050395701264&selected=984611050395701264

```
Oct 04 13:59:50 t-linux64-ms-240.test.releng.mdc1.mozilla.com puppet-agent: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found.
Oct 04 13:59:50 t-linux64-ms-240.test.releng.mdc1.mozilla.com puppet-agent: (/Stage[main]/Mig::Agent::Daemon/Service[mig-agent]/ensure) change from stopped to running failed: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found.
```
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX