Bug 1194331 (Closed): Opened 10 years ago, Closed 9 years ago

crash-analysis server should not require manual setup

Categories: Socorro :: Infra
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: rhelmer; Assigned: jschneider

Details

There are some bits I set up on crash-analysis manually that are not properly captured in puppet/terraform. I just set this node back up manually and made a list of what still needs to be done.
Notes from setting up the crash-analysis node manually just now
---

Mount volume on /mnt/crashanalysis (labeled socorroanalysis-prod)

Add rkaiser user (uid 1002)

Crontab for rkaiser:

  MAILTO="kairo@mozilla.com"
  # run reports daily at 0:10 (matviews usually finish at 6am UTC, CSVs are there at 5am Pacific)
  CRON_TZ=UTC
  40 08 * * * /mnt/crashanalysis/rkaiser/crash-report-tools/run-reports
  # TODO make kairo's reports use envconsul

Create /home/rkaiser/.socorro-prod-dbsecret.json:

  {
    "host": "",
    "port": "5432",
    "user": "analyst",
    "password": ""
  }

Edit /etc/nginx/conf.d/socorro-analysis.conf; in the location block:

  root /mnt/crashanalysis;

# FIXME not sure why the CSV job in the old infra didn't need these copies, we should do this in the

  sudo cp /data/socorro/application/scripts/config/dailyurlconfig.py.dist /data/socorro/application/scripts/config/dailyurlconfig.py
  sudo cp /data/socorro/application/scripts/config/commonconfig.py.dist /data/socorro/application/scripts/config/commonconfig.py

# TODO make CSV job use envconsul:

  sudo vi /etc/socorro/common.conf

  export database_hostname=''
  export database_name='breakpad'
  export database_username='breakpad_rw'
  export database_password=''
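A minimal sketch of what capturing these steps in automation could look like, as a shell provisioning script run as root. The script name, the ext4 filesystem type, and the assumption that "socorroanalysis-prod" is a filesystem label (rather than only an EBS Name tag) are mine; the uid, cron schedule, and paths come from the notes above:

  # setup-crash-analysis.sh - hypothetical sketch, not the actual puppet/terraform code
  set -euo pipefail

  # Mount the data volume by filesystem label (assumes ext4 and a LABEL of socorroanalysis-prod).
  mkdir -p /mnt/crashanalysis
  grep -q socorroanalysis-prod /etc/fstab || \
    echo 'LABEL=socorroanalysis-prod /mnt/crashanalysis ext4 defaults,nofail 0 2' >> /etc/fstab
  mount -a

  # Create the rkaiser user with the fixed uid from the notes.
  id rkaiser >/dev/null 2>&1 || useradd --uid 1002 --create-home rkaiser

  # Install the report crontab for rkaiser.
  printf '%s\n' \
    'MAILTO="kairo@mozilla.com"' \
    'CRON_TZ=UTC' \
    '40 08 * * * /mnt/crashanalysis/rkaiser/crash-report-tools/run-reports' \
    | crontab -u rkaiser -

  # Drop an (empty) DB secrets file; the real values should come from a secret store, not this script.
  printf '%s\n' '{ "host": "", "port": "5432", "user": "analyst", "password": "" }' \
    > /home/rkaiser/.socorro-prod-dbsecret.json
  chown rkaiser:rkaiser /home/rkaiser/.socorro-prod-dbsecret.json
  chmod 0600 /home/rkaiser/.socorro-prod-dbsecret.json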
jp - the only real problem I see with comment 1 is that the volume is in us-west-2b, so the EC2 node must be in that zone too. How can we handle that from the terraform side? Should we set that as the only acceptable AZ?
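One way to keep that constraint visible is to read the volume's availability zone at provision time and pin the instance to it. A sketch with the AWS CLI, assuming the volume carries a Name tag of socorroanalysis-prod (which is my assumption, not stated in the bug):

  # Find the AZ the EBS volume lives in, so the instance can be launched in the same zone.
  aws ec2 describe-volumes \
    --filters Name=tag:Name,Values=socorroanalysis-prod \
    --query 'Volumes[0].AvailabilityZone' \
    --output text
  # expected output: us-west-2b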
Flags: needinfo?(jschneider)
The correlation scripts need http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/ in /data/crash-data-tools
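For reference, a sketch of pulling that repository into place, assuming Mercurial is installed on the node:

  # Clone dbaron's crash-data-tools into the path the correlation scripts expect,
  # or update it if it is already checked out.
  if [ -d /data/crash-data-tools/.hg ]; then
    hg pull -u -R /data/crash-data-tools
  else
    hg clone http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/ /data/crash-data-tools
  fi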
Getting my SSH key onto the machine so that I can log in as the rkaiser user is probably another thing that should be captured here.
To make the last comment clearer: there should be an rkaiser user on the machine that I can log into via ssh and that runs my cron job (besides the crontab, it also holds a hidden file in the rkaiser home directory that contains the PostgreSQL connection data for my scripts to use). FWIW, having that available was the primary reason we even created this machine; putting the correlation scripts on it only came later, because the machine was already there and the files they produce were always hosted on the crash-analysis machine.
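In provisioning terms, that access amounts to something like the following sketch (the key material is a placeholder, and the paths assume a standard home directory):

  # Let rkaiser log in over SSH with his public key (key content is a placeholder).
  install -d -o rkaiser -g rkaiser -m 0700 /home/rkaiser/.ssh
  echo 'ssh-rsa AAAA...placeholder... kairo@mozilla.com' > /home/rkaiser/.ssh/authorized_keys
  chown rkaiser:rkaiser /home/rkaiser/.ssh/authorized_keys
  chmod 0600 /home/rkaiser/.ssh/authorized_keys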
Robert, I would rather we work together and get you set up to do the same things you need to do in the new setup. We're looking to make this a sustainable setup, where servers are treated as cattle rather than pets. Right now the analysis server is not there yet, but we're moving toward that as we speak, and I think the rkaiser user running cron jobs is a step in the wrong direction. Instead, what we can do is make sure you are set up to do what you have to do in a sustainable way.

I'm fine making temporary modifications to the server, but we then need to put that into puppet/imaging to make sure that if that server dies (as it did, by my hand, yesterday) we're not scrambling to make all those manual modifications to the server to get things back up. I will work with you in IRC to come up with what you need.

For the moment, you can do the following to get cron jobs set up:

  ssh centos@serveraddress
  sudo -sH             # you will become root
  sudo -sHu socorro    # you will become the socorro user, which runs the tasks on this node
  commandyouneedtorun  # or vi /etc/cron.d/socorro to modify the crontab for the user

I'll take what you set up and make it sustainable in puppet/images after you have things set up as you want them.
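For illustration, an /etc/cron.d/socorro entry for the reports job might look like the sketch below. Unlike a per-user crontab, cron.d entries carry a user column (here the socorro user, per the comment above); the schedule and path are carried over from comment 1:

  # /etc/cron.d/socorro (sketch) - system crontab entries include a user field.
  MAILTO="kairo@mozilla.com"
  CRON_TZ=UTC
  40 08 * * * socorro /mnt/crashanalysis/rkaiser/crash-report-tools/run-reports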
Flags: needinfo?(jschneider)
One important thing that's missing on the crash-analysis server is auto-deploy of new software and configs: http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/ should be auto-updated (like we do for the Socorro software), and currently config files are used rather than environment variables (which would let it run under envconsul). Also missing is shipping the logs on the analysis node off to loggly; currently one must ssh in to see the logs, which isn't really sustainable long-term, since nodes can and will go away unexpectedly. Other than that, the big thing stopping us from using our "normal" deployment process ("spin up a new instance, kill the old") is that there's an EBS volume that can only be mounted on one host at a time. I think if all of these things were done, ssh access to the active node would not be necessary.
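A sketch of what running the reports under envconsul could look like once the scripts read their database settings from the environment. The Consul key prefix is an assumption, and this is not the actual deployed setup:

  # Run the report job with its database settings injected from Consul
  # rather than from config files on disk (prefix name is hypothetical).
  envconsul -prefix socorro/crash-analysis \
    /mnt/crashanalysis/rkaiser/crash-report-tools/run-reports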
Just an FYI, we do now have those logs in loggly, so anyone who needs them (Robert? others?) can absolutely see those logs. The auto-deploy doesn't fit into our current mechanism for updates, and the 'sudo yum remove socorro; sudo yum install socorro' method hasn't worked on the old analysis node. We'll try it and see if we can do something like that, but it won't be automated in that setup either. We'll have to talk about how we make that node safe to kill if we want to squeeze it into our current methodology, or figure out how to safely and automatically update it to the current prod (versus stage) version in the repo.
I often need to do manual runs of those scripts, so auto-deploy would only get us so far. We have a long-term plan that we discussed before the switch to AWS, but if I cannot rely on a consistent user environment on this machine that, for now, I can always log into, then I cannot do my job and we are doing releases blind, i.e. I will have the whole chain of people I report to poke you until we can get it to do what I need. Sorry if that sounds harsh, but this story puts me in a bad mood: right now we are flying blind in a release week and even losing data from some of my scripts (because some of it cannot be backfilled if the cron job does not run), so I'm really grumpy.
I totally understand, and you will have a consistent, albeit differently named than you are used to, user environment you can always log into if you need to run ad-hoc stuff. What we're looking to ensure is that the batch-type things we need to run on a regular (rather than ad-hoc) and repeatable basis are properly captured in automation. Thus, if a node dies, there are no manual and potentially fallible steps needed to get it back to working order.
(In reply to JP Schneider [:jp] from comment #6)
> I'll take what you setup and make it sustainable in puppet/images after you
> have things setup as you want them.

I don't really want to mingle my stuff with other stuff on a machine that was originally set up just for my stuff. I had everything I need set up correctly under the rkaiser user; why can't we just have that work? Also, I can't really run my jobs as any user other than rkaiser or root right now, I guess, as the DB secrets file my jobs need (to have the right host and password to access the PostgreSQL DB) lives as a dot-file in the rkaiser home directory.
OK, after a lot of IRC conversation with jp, I have chowned the data files over to the centos user, moved the DB secrets file over to that user as well, and integrated the cron job into /etc/cron.d/socorro.
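Roughly, the steps described here would map to something like the sketch below (the exact data path and the user column in the cron entry are assumptions based on the earlier comments):

  # Hand the report data and DB secrets over to the centos user and fold the
  # report job into the system crontab.
  sudo chown -R centos:centos /mnt/crashanalysis/rkaiser
  sudo mv /home/rkaiser/.socorro-prod-dbsecret.json /home/centos/
  sudo chown centos:centos /home/centos/.socorro-prod-dbsecret.json
  echo '40 08 * * * centos /mnt/crashanalysis/rkaiser/crash-report-tools/run-reports' | \
    sudo tee -a /etc/cron.d/socorro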
Over to jp, who is actually working on this. Let me know if you need my help!
Assignee: rhelmer → jschneider
So we don't forget: once the code in https://bugzilla.mozilla.org/show_bug.cgi?id=1202739 lands, we'll be able to set up Pingdom monitoring that will alert us sooner if the crash-analysis stuff isn't working. A baby step.
Is work from this thread still something we are interested in doing?
Flags: needinfo?(peterbe)
We no longer care about missing symbols on crash-analysis; that's been moved to the webapp. The correlations are still running in a cron job there, but they require no manual work. The script needs the right Firefox versions, and it gets them from the webapp's API (over the internets). Correlations are on a slow trajectory toward death, but we're not entirely ready to completely pull the plug on them yet.
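For illustration, fetching those versions from the crash-stats webapp could look like the sketch below; the exact endpoint name and parameters are assumptions on my part, not something confirmed in this bug:

  # Ask the crash-stats webapp which Firefox versions are currently active
  # (endpoint path and query parameters are hypothetical).
  curl -s 'https://crash-stats.mozilla.com/api/ProductVersions/?product=Firefox&active=true'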
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(peterbe)
Resolution: --- → FIXED