Closed Bug 864823 Opened 7 years ago Closed 4 years ago

Chief for Socorro

Categories

(Socorro :: Infra, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lonnen, Unassigned)

We'd like to set up Chief to handle the main Socorro push. For non-trivial features we'll still need to involve IT, but most of the time we shouldn't have to interrupt ops.
The setup is a little different than your standard playdoh project, but very similar to the socorro-crashstats chief setup that JakeM set up in our new cluster in Q1.
Is this a webdev bug or an ops bug?
Reassigning to Web Operations.
Assignee: nobody → server-ops-webops
Component: Webdev → Server Operations: Web Operations
QA Contact: nmaul
We're not too comfortable with putting this in Chief just yet, because it seems to frequently fail. Socorro 44 had trouble with 3 nodes... it feels like the update mechanism isn't robust enough yet. It's not to the point where you can push a button and reasonably expect that it will work every time.

I'm also afraid it might take too long: Chief has a tendency to time out waiting for a response. It would take some finagling.

What about shell access (sudo even, if necessary) to the relevant admin node, such that you could run the update script yourself? Would that be an acceptable alternative?
Flags: needinfo?(chris.lonnen)
(In reply to Jake Maul [:jakem] from comment #4)
> We're not too comfortable with putting this in Chief just yet, because it
> seems to frequently fail. Socorro 44 had trouble with 3 nodes... it feels
> like the update mechanism isn't robust enough yet. It's not to the point
> where you can push a button and reasonably expect that it will work every
> time.
> 
> I'm also afraid it might take too long- Chief has a tendency to time out
> waiting for response. Would take some finagling.
> 
> What about shell access (sudo even, if necessary) to the relevant admin
> node, such that you could run the update script yourself? Would that be an
> acceptable alternative?

lonnen and I were just looking at this, and I suspect this is what's going on:

* apps are started with daemonize, which handles the lock https://github.com/mozilla/socorro/blob/master/scripts/init.d/socorro-processor#L33
* killproc gives a 15s window for the app to shut down after a normal kill is sent, then sends a -9 (see /etc/init.d/functions)

I suspect at this point, daemonize doesn't clear the lockfile properly... I'll test this and put in a check if that's the case.

Are there problems other than with these init scripts?
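For reference, the stop sequence described above (normal kill, wait out a grace window, then -9) can be sketched in Python. This is only an illustration under assumptions: `stop_with_grace` is a hypothetical helper, not the actual killproc code from /etc/init.d/functions, and the grace window is passed in by the caller rather than hard-coded to 15s.

```python
import subprocess


def stop_with_grace(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    proc is a subprocess.Popen object (hypothetical stand-in for the daemon).
    Returns "term" if the process exited inside the grace window,
    "kill" if it had to be killed.
    """
    proc.terminate()  # SIGTERM on POSIX, the "normal kill"
    try:
        proc.wait(timeout=grace_seconds)
        return "term"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL, the "-9"
        proc.wait()  # reap the child so its pid is really gone
        return "kill"
```

The final `proc.wait()` matters for the restart case: it only returns once the process has actually exited, so a start issued after it cannot race with the old instance.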
(In reply to Robert Helmer [:rhelmer] from comment #5)
> lonnen and I were just looking at this, and I suspect this is what's going
> on:
> 
> * apps are started with daemonize, which handles the lock
> https://github.com/mozilla/socorro/blob/master/scripts/init.d/socorro-processor#L33
> * killproc gives a 15s window for the app to shut down after a normal kill
> is sent, then sends a -9 (see /etc/init.d/functions)
> 
> I suspect at this point, daemonize doesn't clear the lockfile properly...
> I'll test this and put in a check if that's the case.

OK I have just tested this and it does clear the lockfile properly (checked the pidfile to make sure it's not actually there anymore).

The only explanation I can see is that the app may be in uninterruptible sleep, so it has not yet acted on the kill -9 by the time we turn around and try to start the app.

Our timeout is 15s right now; we may need to either extend that or change the design of the apps.
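One way to make the restart less sensitive to the timeout is to confirm the old pid is really gone before starting the app again. A rough sketch, assuming the pid is known (for example, read from the pidfile before it is cleared); `pid_alive` and `wait_for_exit` are hypothetical names, and this is not what the init scripts currently do:

```python
import os
import time


def pid_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout, poll_interval=0.1):
    """Poll until pid disappears; return False if it is still there at timeout."""
    deadline = time.monotonic() + timeout
    while pid_alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A start step could then refuse to run (or retry later) when `wait_for_exit` returns False, instead of launching a second copy next to a process that has not yet gone away.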
Depends on: 868512
Filed bug 868512 to track fixing this on the Socorro side.
@JakeM -- I think we'd rather have chief, and I hope that after 868512 is resolved we can set it up. As an interim solution, though, giving rhelmer and myself the means to push code ourselves would be appreciated.
Flags: needinfo?(chris.lonnen)
What are the next steps on this?
Flags: needinfo?(nmaul)
Ping. What's up with this?
ping?
Flags: needinfo?(server-ops-webops)
This is highly desirable as part of a Q3 goal to sync the release processes of Socorro and the Django frontend. How can we move forward with this?
Component: Server Operations: Web Operations → WebOps: Socorro
Product: mozilla.org → Infrastructure & Operations
:lonnen and I discussed this on Vidyo and we've settled on a phase 1 for Chief and Socorro.

Phase 1:

* Install Chief on socorroadm.private.phx1 via Puppet
* Set up Chief for Socorro staging
 * Initial commander-compatible Python script will just call the existing shell scripts
 * cron job will start calling Chief instead of the script directly
* Chief + cron job take over the staging auto-deploy
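
As an illustration of the "just call the existing shell scripts" step in Phase 1, a thin Python wrapper could look roughly like the following. This is a sketch under assumptions: it uses only the standard library rather than commander's own task API, and `run_step`, `deploy`, and the script path in the comment are hypothetical.

```python
import subprocess


def run_step(name, argv):
    """Run one existing deploy script and fail loudly on a non-zero exit."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{name} failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout


def deploy(steps):
    """Run (name, argv) steps in order; the first failure aborts the rest."""
    return [(name, run_step(name, argv)) for name, argv in steps]


# Hypothetical usage; the path is a placeholder, not the real script location:
# deploy([("update", ["/path/to/existing-update-script.sh"])])
```

Raising on the first non-zero exit keeps a failed update from being reported as a successful push.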
Assignee: server-ops-webops → bburton
Status: NEW → ASSIGNED
Flags: needinfo?(server-ops-webops)
Flags: needinfo?(nmaul)
Priority: -- → P3
bburton@macbookair-00886541dab0 [06:16:31] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "setting up chief for socorro cluster, bug 864823" modules/webapp/manifests/chief/socorro.pp
Adding         modules/webapp/manifests/chief/socorro.pp
Transmitting file data .
Committed revision 71035.
bburton@macbookair-00886541dab0 [06:17:23] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "setting up chief for socorro cluster, bug 864823" manifests/nodes/socorro.pp
Sending        manifests/nodes/socorro.pp
Transmitting file data .
Committed revision 71036.
Depends on: 894648
Depends on: 896104
FYI, recent socorro stage commits from sysadmins r71551..71558 seem to have broken puppet on certain stage hosts:

socorro-collector1.stage.webapp.phx1
socorro-mware1.stage.webapp.phx1
socorro-web1.stage.webapp.phx1
etc.

The error is a missing dependency for /var/log/xyz, where xyz varies based on the host. For instance:

Failed to apply catalog: Could not find dependency Package[httpd] for File[/var/log/httpd/crash-reports.mozilla.com]
(In reply to Richard Soderberg [:atoll] from comment #16)
> FYI, recent socorro stage commits from sysadmins r71551..71558 seem to have
> broken puppet on certain stage hosts:
> 
> socorro-collector1.stage.webapp.phx1
> socorro-mware1.stage.webapp.phx1
> socorro-web1.stage.webapp.phx1
> etc.
> 
> The error is a missing dependency for /var/log/xyz, where xyz varies based
> on the host. For instance:
> 
> Failed to apply catalog: Could not find dependency Package[httpd] for
> File[/var/log/httpd/crash-reports.mozilla.com]

Thanks. I got lost in a maze of edits trying to track down a conflicting httpd service resource, which ended up being in ganglia::server.

I guess I failed to completely clean up my mess.

I shall fix it shortly.
Depends on: 896870
Depends on: 901015
Depends on: 909900
Depends on: 910480
Depends on: 912647
Depends on: 949693
Assignee: bburton → server-ops-webops
Assignee: server-ops-webops → nobody
Component: WebOps: Socorro → Infra
Product: Infrastructure & Operations → Socorro
QA Contact: nmaul
Kind of got it, but then we moved out.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED