Closed
Bug 864823
Opened 9 years ago
Closed 7 years ago
Chief for Socorro
Categories
(Socorro :: Infra, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: lonnen, Unassigned)
We'd like to set up Chief to handle the main Socorro push. For non-trivial features we'll still need to involve IT, but most of the time we shouldn't have to interrupt ops.
Reporter
Comment 1•9 years ago
The setup is a little different than your standard playdoh project, but very similar to the socorro-crashstats chief setup that JakeM set up in our new cluster in Q1.
Comment 2•9 years ago
Is this a webdev bug or an ops bug?
Reporter
Comment 3•9 years ago
Reassigning to web operations.
Assignee: nobody → server-ops-webops
Component: Webdev → Server Operations: Web Operations
QA Contact: nmaul
Comment 4•9 years ago
We're not too comfortable with putting this in Chief just yet, because it seems to frequently fail. Socorro 44 had trouble with 3 nodes... it feels like the update mechanism isn't robust enough yet. It's not to the point where you can push a button and reasonably expect that it will work every time.

I'm also afraid it might take too long- Chief has a tendency to time out waiting for response. Would take some finagling.

What about shell access (sudo even, if necessary) to the relevant admin node, such that you could run the update script yourself? Would that be an acceptable alternative?
Flags: needinfo?(chris.lonnen)
Comment 5•9 years ago
(In reply to Jake Maul [:jakem] from comment #4)
> We're not too comfortable with putting this in Chief just yet, because it
> seems to frequently fail. Socorro 44 had trouble with 3 nodes... it feels
> like the update mechanism isn't robust enough yet. It's not to the point
> where you can push a button and reasonably expect that it will work every
> time.
>
> I'm also afraid it might take too long- Chief has a tendency to time out
> waiting for response. Would take some finagling.
>
> What about shell access (sudo even, if necessary) to the relevant admin
> node, such that you could run the update script yourself? Would that be an
> acceptable alternative?

lonnen and I were just looking at this, and I suspect this is what's going on:

* apps are started with daemonize, which handles the lock
  https://github.com/mozilla/socorro/blob/master/scripts/init.d/socorro-processor#L33
* killproc gives a 15s window for the app to shut down after a normal kill is sent, then sends a -9 (see /etc/init.d/functions)

I suspect at this point, daemonize doesn't clear the lockfile properly... I'll test this and put in a check if that's the case.

Are there problems other than with these init scripts?
Comment 6•9 years ago
(In reply to Robert Helmer [:rhelmer] from comment #5)
> (In reply to Jake Maul [:jakem] from comment #4)
> > We're not too comfortable with putting this in Chief just yet, because it
> > seems to frequently fail. Socorro 44 had trouble with 3 nodes... it feels
> > like the update mechanism isn't robust enough yet. It's not to the point
> > where you can push a button and reasonably expect that it will work every
> > time.
> >
> > I'm also afraid it might take too long- Chief has a tendency to time out
> > waiting for response. Would take some finagling.
> >
> > What about shell access (sudo even, if necessary) to the relevant admin
> > node, such that you could run the update script yourself? Would that be an
> > acceptable alternative?
>
> lonnen and I were just looking at this, and I suspect this is what's going
> on:
>
> * apps are started with daemonize, which handles the lock
> https://github.com/mozilla/socorro/blob/master/scripts/init.d/socorro-processor#L33
> * killproc gives a 15s window for the app to shut down after a normal kill
> is sent, then sends a -9 (see /etc/init.d/functions)
>
> I suspect at this point, daemonize doesn't clear the lockfile properly...
> I'll test this and put in a check if that's the case.

OK I have just tested this and it does clear the lockfile properly (checked the pidfile to make sure it's not actually there anymore).

The only option I see is that perhaps the app is in uninterruptible sleep, so it doesn't respond to the kill -9 before we turn around and try to start the app.

Our timeout is 15s right now, we may need to either extend that or change the design of the apps.
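To illustrate the failure mode described above, here is a minimal Python sketch of a stop step that does not hand control back to the start step until the old process has actually exited. It is not the Socorro init script or the killproc implementation; the function names, the `kill_wait` default, and the injectable `send`/`alive` hooks are assumptions made for illustration. Only the 15s grace period comes from the thread.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_until_gone(pid, timeout, poll, alive):
    """Poll until alive(pid) is False; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def stop_and_wait(pid, grace=15.0, kill_wait=30.0, poll=0.5,
                  send=os.kill, alive=pid_alive):
    """Stop a process and report how it ended: 'term', 'kill', or 'stuck'.

    'stuck' means the process was still present kill_wait seconds after
    SIGKILL (for example, it is in uninterruptible sleep), so the caller
    should not start a replacement yet.
    """
    for sig, timeout, outcome in ((signal.SIGTERM, grace, "term"),
                                  (signal.SIGKILL, kill_wait, "kill")):
        try:
            send(pid, sig)
        except ProcessLookupError:
            return outcome  # the process exited before this signal was sent
        if wait_until_gone(pid, timeout, poll, alive):
            return outcome
    return "stuck"
```

A caller would restart the app only when the result is 'term' or 'kill', and report an error on 'stuck' rather than racing the old process. Note that `pid_alive` also reports True for a zombie that its parent has not yet reaped, which is why the check is injectable.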
Comment 7•9 years ago
Filed bug 868512 to track down fixing this on the Socorro side.
Reporter
Comment 8•9 years ago
@JakeM -- I think we'd rather have chief, and I hope that after 868512 is resolved we can set it up. As an interim solution, though, giving rhelmer and myself the means to push code ourselves would be appreciated.
Flags: needinfo?(chris.lonnen)
Reporter
Comment 10•9 years ago
Ping. What's up with this?
Reporter
Comment 12•9 years ago
This is highly desirable as part of a Q3 goal to sync the release processes of Socorro and the Django frontend. How can we move forward with this?
Updated•9 years ago
Component: Server Operations: Web Operations → WebOps: Socorro
Product: mozilla.org → Infrastructure & Operations
Comment 13•9 years ago
:lonnen and I discussed this on Vidyo and we've settled on a phase 1 for Chief and Socorro.

Phase 1:
* Install Chief on socorroadm.private.phx1 via Puppet
* Setup Chief for socorro staging
* Initial commander compatible python script will just call the existing shell scripts
* cron job will start calling Chief instead of the script directly
* Chief + cronjob takes over the staging auto-deploy
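As a rough sketch of the "python script will just call the existing shell scripts" step in phase 1, the wrapper below runs a list of commands in order and stops at the first failure. It deliberately uses only the standard library and does not use the commander API; the step name, the script path, and the `stage` argument are hypothetical, since the thread does not name the real scripts.

```python
import subprocess
import sys


def run_step(cmd, timeout=None):
    """Run one command, echo its combined output, and return its exit code."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True,
                            timeout=timeout)
    sys.stdout.write(result.stdout)
    return result.returncode


def deploy(steps, timeout=None):
    """Run (name, cmd) steps in order and stop at the first failure.

    Returns a list of (name, exit_code) for the steps that actually ran.
    """
    results = []
    for name, cmd in steps:
        code = run_step(cmd, timeout=timeout)
        results.append((name, code))
        if code != 0:
            break
    return results


# Hypothetical step list; replace the path with the real update script.
STAGE_STEPS = [
    ("update", ["/path/to/existing/update-script.sh", "stage"]),
]
```

Passing a `timeout` makes `subprocess.run` raise `subprocess.TimeoutExpired` if a script hangs, which is one way to surface the slow-update concern raised in comment 4 instead of waiting indefinitely.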
Assignee: server-ops-webops → bburton
Status: NEW → ASSIGNED
Flags: needinfo?(server-ops-webops)
Flags: needinfo?(nmaul)
Priority: -- → P3
Comment 14•9 years ago
bburton@macbookair-00886541dab0 [06:16:31] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "setting up chief for socorro cluster, bug 864823" modules/webapp/manifests/chief/socorro.pp
Adding modules/webapp/manifests/chief/socorro.pp
Transmitting file data .
Committed revision 71035.
Comment 15•9 years ago
bburton@macbookair-00886541dab0 [06:17:23] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "setting up chief for socorro cluster, bug 864823" manifests/nodes/socorro.pp
Sending manifests/nodes/socorro.pp
Transmitting file data .
Committed revision 71036.
Comment 16•9 years ago
FYI, recent socorro stage commits from sysadmins r71551..71558 seem to have broken puppet on certain stage hosts:

socorro-collector1.stage.webapp.phx1
socorro-mware1.stage.webapp.phx1
socorro-web1.stage.webapp.phx1
etc.

The error is a missing dependency for /var/log/xyz, where xyz varies based on the host. For instance:

Failed to apply catalog: Could not find dependency Package[httpd] for File[/var/log/httpd/crash-reports.mozilla.com]
Comment 17•9 years ago
(In reply to Richard Soderberg [:atoll] from comment #16)
> FYI, recent socorro stage commits from sysadmins r71551..71558 seem to have
> broken puppet on certain stage hosts:
>
> socorro-collector1.stage.webapp.phx1
> socorro-mware1.stage.webapp.phx1
> socorro-web1.stage.webapp.phx1
> etc.
>
> The error is a missing dependency for /var/log/xyz, where xyz varies based
> on the host. For instance:
>
> Failed to apply catalog: Could not find dependency Package[httpd] for
> File[/var/log/httpd/crash-reports.mozilla.com]

Thanks, I got lost in a maze of edits trying to track down a conflicting httpd service resource that ended up being in ganglia::server.

I guess I failed to completely clean up my mess.

I shall fix it shortly.
Updated•9 years ago
Assignee: bburton → server-ops-webops
Updated•8 years ago
Assignee: server-ops-webops → nobody
Component: WebOps: Socorro → Infra
Product: Infrastructure & Operations → Socorro
QA Contact: nmaul
Reporter
Comment 18•7 years ago
Kind of got it, but then we moved out.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED