Closed
Bug 598908
Opened 14 years ago
Closed 12 years ago
deploy map/reduce job from bug 594777 in production
Categories
(Socorro :: General, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
2.0
People
(Reporter: ted, Assigned: rhelmer)
References
Details
Attachments
(1 file, 2 obsolete files)
1.65 KB, patch
rhelmer: review+
I need the map/reduce job from bug 594777 deployed as a daily cron job in production. This isn't super urgent, since I need to fix bug 598757 before I can actually use it, but I'd like to get this rolled out reasonably soon.
Comment 1•13 years ago
Xavier: can you take care of this?
Assignee: nobody → xstevens
Target Milestone: --- → 1.7.7
Comment 2•13 years ago
Yeah, I'll work on this today and post instructions for deployment.
Comment 3•13 years ago
So we need to do the following for deployment:
1. cd $SOCORRO_CHECKOUT/analysis
2. ant hadoop-jar
3. Deploy build/lib/socorro-analysis-job.jar to $DEPLOYMENT_HOME
4. Deploy bin/modulelist.sh to $DEPLOYMENT_HOME
Then you can set up a cron job to run modulelist.sh once a day. Currently modulelist.sh copies its output to people.mozilla.org, but we'll probably want to change that. Ted: where do you want the module lists to go?
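Stitched together, those steps read as a short script. This is a sketch only: SOCORRO_CHECKOUT and DEPLOYMENT_HOME are the placeholder variables from this comment, not real paths on the host, and the crontab line is an example schedule.

```shell
# Sketch of the deployment steps above; SOCORRO_CHECKOUT and
# DEPLOYMENT_HOME are placeholders, not real paths on the host.
set -e
: "${SOCORRO_CHECKOUT:?set SOCORRO_CHECKOUT first}"
: "${DEPLOYMENT_HOME:?set DEPLOYMENT_HOME first}"

cd "$SOCORRO_CHECKOUT/analysis"
ant hadoop-jar                                    # builds the analysis job jar
cp build/lib/socorro-analysis-job.jar "$DEPLOYMENT_HOME/"
cp bin/modulelist.sh "$DEPLOYMENT_HOME/"

# then schedule modulelist.sh once a day, e.g. in the socorro user's crontab:
# 0 1 * * * $DEPLOYMENT_HOME/modulelist.sh
```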
Comment 4•13 years ago
Also, just to re-verify the format, here is a module list I produced for 02-15-2011: http://people.mozilla.org/~xstevens/20110215-modulelist.txt
Reporter
Comment 5•13 years ago
Format is still fine. Laura was thinking people.mo/crash_analysis, and that's fine with me.
Comment 6•13 years ago
Laura, do you know what user is used to copy over to crash_analysis on people?
Comment 7•13 years ago
Don't know if this helps, but the directory and files are owned by bacula:

drwxr-xr-x 2 bacula bacula 12288 Feb 18 03:56 20110217
Comment 8•13 years ago
(In reply to comment #6)
> Laura,
>
> Do you know what user is used to copy over to crash_analysis on people?

You want Jabba.
Comment 9•13 years ago
Currently, the socorro user scp's the files to people using the bacula account on people. The socorro user's ssh pubkey is in bacula's authorized_keys.
Comment 10•13 years ago
So this needs to be deployed and run under the socorro user on sp-admin01. Can we set up hudson to do this? Or who handles our deployment?
Assignee
Comment 11•13 years ago
(In reply to comment #10)
> So this needs to be deployed and run under the socorro user on sp-admin01.
> Can we set up hudson to do this? Or who handles our deployment?

We just need to write a cron job for this and make sure it's added to the crontab (that part is managed by puppet on stage and prod). An example crontab script that does something like this is:
http://code.google.com/p/socorro/source/browse/trunk/scripts/crons/cron_daily_reports.sh
The ". /etc/socorro/socorrorc" line pulls in the Socorro configuration (users, paths, etc.) and provides a couple of helper functions like lock()/unlock(). Once that's checked in, we can make sure it works OK on stage, and IT can push it to prod (via puppet). Let me know if you need help with any of the above.
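The general shape of such a wrapper might look like the sketch below. The real script sources /etc/socorro/socorrorc for its lock()/unlock() helpers; the stand-in definitions here, the lock path, and the placeholder job body are all assumptions for illustration.

```shell
#!/bin/bash
# Sketch of a cron wrapper in the cron_daily_reports.sh style.
# In the real script, ". /etc/socorro/socorrorc" provides lock()/unlock();
# these stand-ins are assumptions, not the actual Socorro helpers.
NAME=modulelist
LOCKDIR="${TMPDIR:-/tmp}"

lock() {
  # mkdir is atomic, so it doubles as a mutex across overlapping cron runs
  if ! mkdir "$LOCKDIR/socorro-$1.lock" 2>/dev/null; then
    echo "$1 is already running; exiting" >&2
    exit 1
  fi
}

unlock() {
  rmdir "$LOCKDIR/socorro-$1.lock"
}

lock $NAME
echo "running $NAME job"  # placeholder for the real hadoop jar invocation
unlock $NAME
```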
Assignee
Updated•13 years ago
Assignee: xstevens → rhelmer
Assignee
Comment 12•13 years ago
This takes care of deployment, install docs and cron. One side-effect is that it makes our default "make install" depend on java+ant, but we can split this out later if it becomes an issue. Hudson has direct support for Ant, but I think it makes things clearer to just call everything from the Makefile.
Attachment #514529 - Flags: review?(xstevens)
Attachment #514529 - Flags: feedback?(jdow)
Assignee
Comment 13•13 years ago
Small adjustment to cron_modulelist.sh; looks good to xstevens per IRC. I am going to test this using my account on sp-admin01 once it passes hudson. Committed revision 2954.
Attachment #514529 - Attachment is obsolete: true
Attachment #514529 - Flags: review?(xstevens)
Attachment #514529 - Flags: feedback?(jdow)
Assignee
Comment 14•13 years ago
(In reply to comment #13)
> I am going to test this using my account on sp-admin01, once it passes hudson.

This test revealed a firewall change we need; waiting on bug 634396 for that.
Assignee
Comment 15•13 years ago
Firewall should be unblocked now; I am going to test this evening when we're out of peak hours. xstevens points out that the cron job should be scheduled for non-peak hours in the future as well. How often should this job be run: is once per day enough? 5 PM Pacific?
Reporter
Comment 16•13 years ago
Once per day is exactly what I wanted. The time doesn't matter at all, as long as it's consistent and I can rely on the output being available at a certain time.
Assignee
Updated•13 years ago
Priority: -- → P1
Comment 17•13 years ago
rhelmer: I would suggest running this at 1 a.m. and giving it the previous day's date stamp.
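That previous-day stamp can be computed in the wrapper itself; a sketch assuming GNU date (as on the Linux admin hosts), so a 1 a.m. run names its output after the previous calendar day:

```shell
# Previous day's YYYYMMDD stamp for the output file name.
# Assumes GNU date (Linux); BSD date would need "date -v-1d" instead.
yesterday_stamp() {
  date -d yesterday +%Y%m%d
}

# e.g. name the report "$(yesterday_stamp)-modulelist.txt"
```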
Assignee
Comment 18•13 years ago
(In reply to comment #15)
> Firewall should be unblocked now, I am going to test this evening when we're
> out of peak hours.

BTW I didn't actually do this; I wanted to leave prod alone right after the release. Going to test this tonight, if there are no objections. Xavier, now that staging is up, is it OK to run this there?
Status: NEW → ASSIGNED
Comment 19•13 years ago
Yep.
Assignee
Comment 20•13 years ago
Tested this on staging, and the code was deployed with 1.7.7. I thought that we were waiting on a firewall change, but I don't see it looking at the deps for this bug. I'll test this on prod this evening, and we can enable the job if it looks good.
Assignee
Comment 21•13 years ago
It looks like something is up with the hbase/hadoop install on sp-admin01:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Bytes
    at com.mozilla.socorro.hadoop.CrashReportJob.<clinit>(CrashReportJob.java:77)

The class does look like it's in the hbase jar, though:

$ jar tvf /usr/lib/hbase/hbase.jar | grep hadoop | grep util | grep Bytes.class
17950 Tue Jan 01 00:00:00 PST 1980 org/apache/hadoop/hbase/util/Bytes.class

The command being run is:

/usr/lib/hadoop/bin/hadoop jar /data/socorro/analysis/socorro-analysis-job.jar com.mozilla.socorro.hadoop.CrashReportModuleList -Dproduct.filter=Firefox '-Dos.filter=Windows NT' -Dstart.date=20110518 -Dend.date=20110518 20110518-modulelist-out

If I run this script with "bash -x", I can see it's putting /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar on HADOOP_CLASSPATH, but we don't have that version installed (we have /usr/lib/hbase/hbase-0.89.20100924+28.jar). All that said, this works on staging, so we should figure out what's different (I think this was set up via puppet in both cases).

Also, I noticed there's a typo in the wrapper (cron_modulelist.sh); I'll check in a fix for that.
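One way to cross-check which installed jars actually contain the missing class, independent of what HADOOP_CLASSPATH claims. The helper below is hypothetical (not part of Socorro), and it lists jars with Python's zipfile module so it needs neither unzip nor a JDK on the box:

```shell
# List which of the given jars contain a class file matching $1.
# Hypothetical helper; a jar is just a zip archive, so python's
# zipfile module can list its contents.
jars_containing() {
  class="$1"; shift
  for jar in "$@"; do
    if python3 -m zipfile -l "$jar" 2>/dev/null | grep -q "$class"; then
      echo "$jar"
    fi
  done
}

# e.g.: jars_containing org/apache/hadoop/hbase/util/Bytes.class /usr/lib/hbase/*.jar
```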
Assignee
Comment 22•13 years ago
jabba, can you please take a look at the hbase/hadoop install on sp-admin01? Specifically it looks like /usr/lib/hadoop/bin/hadoop is expecting 0.90.1 but we only have 0.89 actually installed, see comment 21.
Assignee: rhelmer → jdow
Assignee
Comment 23•13 years ago
jabba upgraded all the cloudera packages and re-ran puppet. Now everything seems to work, but we're getting exceptions such as:

Exception in createBlockOutputStream java.net.ConnectException: Connection timed out

I seem to be able to run basic hdfs commands, though. I mailed the full output to xstevens for further analysis.
Assignee: jdow → rhelmer
Assignee
Comment 24•13 years ago
This works now, just need to make a couple adjustments to the paths in the scripts so the cron job calls the modulelist script, and it's uploaded to the right server.
Assignee
Updated•13 years ago
Target Milestone: 1.7.7 → 2.0
Assignee
Updated•13 years ago
Priority: P1 → P3
Assignee
Comment 25•13 years ago
(In reply to comment #24)
> This works now, just need to make a couple adjustments to the paths in the
> scripts so the cron job calls the modulelist script, and it's uploaded to
> the right server.

Landed on trunk (for 2.0) and also on the 1.7.8 branch, so this will ride along if we do any other 1.7.8 point releases before 2.0.
Assignee
Comment 26•13 years ago
We actually got this as a ride-along into 1.7.8.2 (bug 664021). jabba ran it by hand in production and it produced:
https://crash-analysis.mozilla.com/crash_analysis/modulelist/20110613-modulelist.txt
Filed dependent bug 664033 to enable this as a nightly cron job.
Assignee
Comment 27•13 years ago
I think we're done here; reopen if you disagree. modulelist reports will be in:
https://crash-analysis.mozilla.com/crash_analysis/modulelist/
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 28•13 years ago
Did this stop running at some point? The last file in that directory is from June 29th.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 29•13 years ago
(In reply to comment #28)
> Did this stop running at some point? The last file in that directory is from
> June 29th.

Looks like we can't connect to hbase from sp-admin01 again (confirmed this with "hbase shell" too):

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase

I imagine this is a firewall problem; I will investigate and reopen bug 634396 if so. Also, we should have received email alerts about this and did not. I'll fix that as part of this bug too.
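A quick TCP reachability probe against the ZooKeeper client port can separate a firewall problem from an hbase-side one. The helper below is hypothetical (the hostname in the example is made up; 2181 is ZooKeeper's default client port), and it uses bash's /dev/tcp redirection so it needs no extra tools installed:

```shell
# Succeeds only if a TCP connection to host:port can be opened.
# Hypothetical diagnostic helper, not part of the Socorro scripts.
zk_reachable() {
  host="$1"
  port="${2:-2181}"   # ZooKeeper's default client port
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# e.g.: zk_reachable zookeeper1.example.com && echo "zk reachable"
```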
Comment 30•13 years ago
Hey Rob,

I'm adding tmary because I know our zookeeper cluster was moved to a set of new machines a while back. If the configs on sp-admin01 weren't updated, that could be your problem. If they have been updated, then it's probably a firewall issue.
Assignee
Comment 31•13 years ago
(In reply to comment #30)
> Hey Rob,
>
> I'm adding tmary because I know our zookeeper cluster was moved to a set of
> new machines a while back. If the configs on sp-admin01 weren't updated that
> could be your problem. If they have been updated then it's probably a
> firewall issue.

Thanks for the hint. It looks like the config was updated via puppet, but we don't have the firewall open to the new machines. I'll file a bug about that.
Assignee
Comment 32•13 years ago
The modulelist script doesn't do any error checking currently, which is why we didn't catch this. Here's a patch that adds a "fatal" function to make this easier (it requires bash, so I changed the #! line).
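The idea is roughly the following. This is a sketch of the pattern, not the committed patch; the function names and bodies here are assumptions:

```shell
#!/bin/bash
# Sketch: fail loudly instead of silently. When a step fails, log to
# stderr and exit with that step's code so cron mails the output.
fatal() {
  code=$1
  shift
  echo "FATAL: $*" >&2
  exit "$code"
}

# run a step, bailing out via fatal if it fails
run_step() {
  "$@" || fatal $? "step failed: $*"
}

# e.g.: run_step /usr/lib/hadoop/bin/hadoop jar ...
```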
Attachment #514645 - Attachment is obsolete: true
Attachment #550110 - Flags: review?(xstevens)
Assignee
Comment 33•13 years ago
Comment on attachment 550110 [details] [diff] [review]
add error checking to modulelist command

r=xstevens per IRC. Committed revision 3328.
Attachment #550110 - Flags: review?(xstevens) → review+
Assignee
Comment 34•13 years ago
Should start working tonight; the extra error checking will go out with the next Socorro release.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Socorro → General
Product: Webtools → Socorro
Comment 35•12 years ago
It didn't work last night; dumitru had to run it manually. Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 36•12 years ago
Sorry, I pointed you at this bug for the history. rhelmer filed bug 779912 on the most recent breakage.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED