Closed Bug 598908 Opened 14 years ago Closed 12 years ago

deploy map/reduce job from bug 594777 in production

Categories

(Socorro :: General, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: rhelmer)

References

Details

Attachments

(1 file, 2 obsolete files)

I need the map/reduce job from bug 594777 deployed as a daily cron job in production. This isn't super urgent, since I need to fix bug 598757 before I can actually use it, but I'd like to get this rolled out reasonably soon.
Xavier: can you take care of this?
Assignee: nobody → xstevens
Target Milestone: --- → 1.7.7
Yeah, I'll work on this today and post instructions for deployment.
So we need to do the following for deployment:

cd $SOCORRO_CHECKOUT/analysis
ant hadoop-jar

Deploy build/lib/socorro-analysis-job.jar to $DEPLOYMENT_HOME
Deploy bin/modulelist.sh to $DEPLOYMENT_HOME

Then you can set up a cron job to run modulelist.sh once a day. Currently modulelist.sh copies stuff to people.mozilla.org, but we'll probably want to change that.
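A daily crontab entry for this could look like the following (the schedule, user, and paths are illustrative, not the actual production values):

```shell
# Hypothetical crontab entry: run the module-list wrapper once a day at 01:00
# as the socorro user, logging output so failures can be inspected later.
# /data/socorro/analysis is an assumed deployment path.
0 1 * * * socorro /data/socorro/analysis/modulelist.sh >> /var/log/socorro/modulelist.log 2>&1
```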

Ted: Where do you want the module lists to go?
Also, just to re-verify the format, here is a module list I produced for 2011-02-15:

http://people.mozilla.org/~xstevens/20110215-modulelist.txt
Format is still fine. Laura was thinking people.mo/crash_analysis, and that's fine with me.
Laura,

Do you know what user is used to copy over to crash_analysis on people?
Don't know if this helps, but the directory and files are owned by bacula:

drwxr-xr-x 2 bacula bacula 12288 Feb 18 03:56 20110217
(In reply to comment #6)
> Laura,
> 
> Do you know what user is used to copy over to crash_analysis on people?

You want Jabba.
Currently, the socorro user scp's the files to people using the bacula account on people. The socorro user's ssh pubkey is in bacula's authorized_keys.
So this needs to be deployed and run under the socorro user on sp-admin01. Can we set up hudson to do this? Or who handles our deployment?
(In reply to comment #10)
> So this needs to be deployed and run under the socorro user on sp-admin01. Can
> we set up hudson to do this? Or who handles our deployment?

We just need to write a cron job for this, and make sure it's added to the crontab (that part is managed by puppet on stage and prod).

An example cron script that does something like this:
http://code.google.com/p/socorro/source/browse/trunk/scripts/crons/cron_daily_reports.sh

The ". /etc/socorro/socorrorc" pulls in the socorro configuration for users, paths, etc., and provides a couple of helper functions such as lock()/unlock().

Once that's checked in we can make sure it works ok on stage, and IT can push to prod (via puppet).

Let me know if you need help with any of the above.
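The wrapper pattern described above could be sketched like this; in production the first step is sourcing /etc/socorro/socorrorc, which supplies the lock()/unlock() helpers, so the stubs and the lock-file location and job name below are assumptions made to keep the sketch self-contained:

```shell
#!/bin/bash
# Sketch of a Socorro-style cron wrapper, modeled on cron_daily_reports.sh.
# In production ". /etc/socorro/socorrorc" provides configuration plus the
# lock()/unlock() helpers; minimal stand-in stubs are defined here.
LOCKDIR="${TMPDIR:-/tmp}"
lock()   { touch "$LOCKDIR/$1.lock"; }   # stub: prevent overlapping runs
unlock() { rm -f "$LOCKDIR/$1.lock"; }   # stub: release the lock

NAME="modulelist"   # hypothetical job name used for the lock file

lock "$NAME"
# ... the actual hadoop job would run here ...
unlock "$NAME"
```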
Assignee: xstevens → rhelmer
This takes care of deployment, install docs and cron. One side-effect is that it makes our default "make install" depend on java+ant, but we can split this out later if it becomes an issue.

Hudson has direct support for Ant, but I think it makes things clearer to just call everything from the Makefile.
Attachment #514529 - Flags: review?(xstevens)
Attachment #514529 - Flags: feedback?(jdow)
Small adjustment to cron_modulelist.sh; looks good to xstevens per IRC.

I am going to test this using my account on sp-admin01, once it passes hudson.

Committed revision 2954.
Attachment #514529 - Attachment is obsolete: true
Attachment #514529 - Flags: review?(xstevens)
Attachment #514529 - Flags: feedback?(jdow)
(In reply to comment #13)
> I am going to test this using my account on sp-admin01, once it passes hudson.

This test revealed a firewall change we need, waiting on bug 634396 for that.
Firewall should be unblocked now, I am going to test this evening when we're out of peak hours.

xstevens points out that the cron job should be scheduled for non-peak hours in the future as well.

How often should this job be run? Is once per day enough? 5 PM Pacific?
Once per day is exactly what I wanted. The time doesn't matter at all, as long as it's consistent and I can rely on the output being available at a certain time.
Priority: -- → P1
rhelmer: I would suggest running this at 1 a.m. and giving it the previous day's date stamp.
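One way to implement the 1 a.m. run with the previous day's date stamp is GNU date's relative-date parsing (a sketch; the production script may compute the stamp differently, and the output filename here is only illustrative):

```shell
# Compute yesterday's date in the YYYYMMDD form the job and its output
# filenames use (GNU date syntax, as on the Linux hosts involved).
YESTERDAY=$(date -d "yesterday" +%Y%m%d)
OUTFILE="$YESTERDAY-modulelist.txt"
echo "$OUTFILE"
```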
(In reply to comment #15)
> Firewall should be unblocked now, I am going to test this evening when we're
> out of peak hours.

BTW I didn't actually do this, and wanted to leave prod alone right after the release. Going to test this tonight, if there are no objections.

Xavier, now that staging is up, is it ok to run this there?
Status: NEW → ASSIGNED
Tested this on staging, and code was deployed with 1.7.7

I thought that we were waiting on a firewall change, but I don't see it looking at the deps for this.

I'll test this on prod this evening, and we can enable this job if it looks good.
It looks like something is up with the hbase/hadoop install on sp-admin01:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Bytes
	at com.mozilla.socorro.hadoop.CrashReportJob.<clinit>(CrashReportJob.java:77)

Looks like it's in hbase.jar, though:

$ jar tvf /usr/lib/hbase/hbase.jar | grep hadoop | grep util | grep Bytes.class
 17950 Tue Jan 01 00:00:00 PST 1980 org/apache/hadoop/hbase/util/Bytes.class

The command being run specifically is:

/usr/lib/hadoop/bin/hadoop jar /data/socorro/analysis/socorro-analysis-job.jar com.mozilla.socorro.hadoop.CrashReportModuleList -Dproduct.filter=Firefox '-Dos.filter=Windows NT' -Dstart.date=20110518 -Dend.date=20110518 20110518-modulelist-out

If I run this script with "bash -x", I can see it's putting /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar on HADOOP_CLASSPATH but we don't have that version installed (we have /usr/lib/hbase/hbase-0.89.20100924+28.jar).

All that said, this works on staging, so we should figure out what's different (I think this was set up via puppet in both cases).

Also, I noticed there's a typo in the wrapper (cron_modulelist.sh). I'll check in a fix for that.
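The classpath check described above can be reproduced with standard shell tools; the sample classpath string below is illustrative, standing in for the HADOOP_CLASSPATH value that "bash -x" reveals on the real host:

```shell
# Splitting a Java classpath on ':' makes it easy to see which hbase jar
# the hadoop launcher actually references. SAMPLE_CLASSPATH is a stand-in
# for the value observed via "bash -x" on the wrapper script.
SAMPLE_CLASSPATH="/usr/lib/hadoop/conf:/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper.jar"
echo "$SAMPLE_CLASSPATH" | tr ':' '\n' | grep '/usr/lib/hbase/'

# On the real host, the equivalent checks would be along the lines of:
#   bash -x /path/to/cron_modulelist.sh 2>&1 | grep HADOOP_CLASSPATH
#   ls /usr/lib/hbase/hbase-*.jar
#   jar tvf /usr/lib/hbase/hbase.jar | grep Bytes.class
```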
jabba, can you please take a look at the hbase/hadoop install on sp-admin01?

Specifically it looks like /usr/lib/hadoop/bin/hadoop is expecting 0.90.1 but we only have 0.89 actually installed, see comment 21.
Assignee: rhelmer → jdow
jabba upgraded all the cloudera packages and re-ran puppet; now everything seems to work, but we're getting exceptions such as:

Exception in createBlockOutputStream java.net.ConnectException: Connection timed out

I seem to be able to run basic hdfs commands though. I mailed the full output to xstevens for further analysis.
Assignee: jdow → rhelmer
This works now, just need to make a couple adjustments to the paths in the scripts so the cron job calls the modulelist script, and it's uploaded to the right server.
Target Milestone: 1.7.7 → 2.0
Priority: P1 → P3
(In reply to comment #24)
> This works now, just need to make a couple adjustments to the paths in the
> scripts so the cron job calls the modulelist script, and it's uploaded to
> the right server.

Landed on trunk (for 2.0) and also the 1.7.8 branch, so this will ride along if we do any other 1.7.8 point releases before 2.0.
We actually got this as a ride-along into 1.7.8.2 (bug 664021), jabba ran it by hand in production and it produced:

https://crash-analysis.mozilla.com/crash_analysis/modulelist/20110613-modulelist.txt

Filed dependent bug 664033 to enable this as a nightly cronjob.
I think we're done here; reopen if you disagree. modulelist reports will be in:

https://crash-analysis.mozilla.com/crash_analysis/modulelist/
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Did this stop running at some point? The last file in that directory is from June 29th.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #28)
> Did this stop running at some point? The last file in that directory is from
> June 29th.

Looks like we can't connect to hbase from sp-admin01 again (confirmed this w/ "hbase shell" too):

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase

I imagine this is a firewall problem, will investigate and reopen bug 634396 if so.

Also, we should have received email alerts about this and did not; I'll fix that as part of this bug too.
Hey Rob,

I'm adding tmary because I know our zookeeper cluster was moved to a set of new machines a while back. If the configs on sp-admin01 weren't updated that could be your problem. If they have been updated then it's probably a firewall issue.
(In reply to comment #30)
> Hey Rob,
> 
> I'm adding tmary because I know our zookeeper cluster was moved to a set of
> new machines a while back. If the configs on sp-admin01 weren't updated that
> could be your problem. If they have been updated then it's probably a
> firewall issue.

Thanks for the hint - it looks like the config was updated via puppet, but we don't have the firewall open to the new machines. I'll file a bug about that.
The modulelist script doesn't do any error checking currently, which is why we didn't catch this. Here's a patch that adds a "fatal" function to make this easier (requires bash, so I changed the #! line).
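The pattern in that patch could be sketched like this (the function name `fatal` is from the comment; its body and the stand-in job function here are a guess at the idea, not the committed code):

```shell
#!/bin/bash
# Sketch of the error-checking pattern: a fatal() helper that logs to stderr
# and exits nonzero, so cron mails the failure instead of the job silently
# "succeeding". Requires bash, hence the #! change mentioned above.
fatal() {
  exit_code=$1
  shift
  echo "FATAL: $*" >&2
  exit "$exit_code"
}

# usage: guard each critical step with a fatal() check
run_hadoop_job() { true; }   # hypothetical stand-in for the real job
run_hadoop_job || fatal 1 "hadoop module-list job failed"
```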
Attachment #514645 - Attachment is obsolete: true
Attachment #550110 - Flags: review?(xstevens)
Comment on attachment 550110 [details] [diff] [review]
add error checking to modulelist command

r=xstevens per irc

Committed revision 3328.
Attachment #550110 - Flags: review?(xstevens) → review+
Should start working tonight, the extra error checking will go out with the next Socorro release.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
It didn't work last night; dumitru had to run it manually. Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry, I pointed you at this bug for the history. rhelmer filed bug 779912 on the most recent breakage.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED