Closed Bug 878052 Opened 12 years ago Closed 12 years ago

sp-admin01.phx1.mozilla.com:Socorro Admin - crash analysis is CRITICAL: CRITICAL: File 20130529-modulelist.txt

Categories

(Socorro :: Backend, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Usul, Assigned: rhelmer)

Details

Nagios is alerting since this morning about the file not being present.

ashish: long running cron job, killed it and started a new one by hand [10:43am]
ashish: let's see how this goes [10:43am]
ashish: it was running for 8h, a little unusual
ashish: no, running one by hand [11:21am]
ashish: can downtime it
ashish: Usul: fyi - 20130530-modulelist.txt is now generated by the hand run [12:29pm]
ashish: i'll work on generating 20130529 and step away for a coffee
nagios-pek1 joined the chat room. [12:57pm]
ashish: eh 0 byte file [12:57pm]
• ashish reruns
ashish: nope, i'm unable to get a good 20130529-modulelist.txt [1:28pm]
ashish: Usul: can you check with laura on #breakpad in about 3h if that is ok or whether it needs a bug filed?
lars is looking through the crontabber logs [3:57pm]
lars: Usul: file a bug about this issue - assign to rhelmer, he'll likely either solve it or reassign to a more appropriate person...
This is the "cron_modulelist.sh" cron job, I see this in my cron email:

Fatal exit code: 143
pig run failed

sp-admin01:/var/log/socorro/cron_modulelist.log says:

2013-05-31 04:00:57,932 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
getmerge: Target /mnt/crashanalysis/crash_analysis/modulelist/20130529-modulelist.txt already exists
Moved to trash: hdfs://hp-admin01.phx1.mozilla.com/user/socorro/modulelist-20130529-20130529

Maybe it got re-run for the wrong day?
Can you try re-running this job now? Looks like it tried to regenerate the 29th, not the 30th, before for some reason; see comment 1.

This should do it, I think:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0; /data/socorro/application/scripts/crons/cron_modulelist.sh
Flags: needinfo?(ashish)
This is a pig (map reduce) job that runs against hadoop, may take a while, use screen :)
(In reply to Robert Helmer [:rhelmer] from comment #3)
> This is a pig (map reduce) job that runs against hadoop, may take a while,
> use screen :)

Ashish is on week-end right now.
(In reply to Ludovic Hirlimann [:Usul] from comment #4)
> (In reply to Robert Helmer [:rhelmer] from comment #3)
> > This is a pig (map reduce) job that runs against hadoop, may take a while,
> > use screen :)
>
> Ashish is on week-end right now.

Oh ok anyone should be able to do it, can you or should I start casting about ? :)
Flags: needinfo?(ashish)
I'm on the verge of week-end too :-)
I'm running this now.
[socorro@sp-admin01.phx1 ~]$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0; /data/socorro/application/scripts/crons/cron_modulelist.sh
warn (lock): removing lock for cron_modulelist, 17941 not running
Fatal exit code: 255
hadoop getmerge failed
(In reply to Shyam Mani [:fox2mike] from comment #8)
> [socorro@sp-admin01.phx1 ~]$ export JAVA_HOME=/usr/lib/jvm/java-1.6.0;
> /data/socorro/application/scripts/crons/cron_modulelist.sh
> warn (lock): removing lock for cron_modulelist, 17941 not running
> Fatal exit code: 255
> hadoop getmerge failed

Looks like it already completed:

getmerge: Target /mnt/crashanalysis/crash_analysis/modulelist/20130530-modulelist.txt already exists

$ ls -latr /mnt/crashanalysis/crash_analysis/modulelist/20130530-modulelist.txt
-rwxrwxrwx 1 socorro socorro 4376709 May 31 03:20 /mnt/crashanalysis/crash_analysis/modulelist/20130530-modulelist.txt

So everything seems fine? Did nagios stop alerting?
(In reply to Robert Helmer [:rhelmer] from comment #9)
> So everything seems fine? Did nagios stop alerting?

Yes
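
For anyone repeating the manual check from comment 9, here is a minimal sketch of a shell function that tests whether a given day's modulelist file is present and non-empty (the 0 byte file mentioned in the description would fail it). The function name check_modulelist, its arguments, and the message wording are hypothetical and are not taken from Socorro or from the actual Nagios check; the return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```shell
# Hypothetical helper, not part of Socorro or the real Nagios check.
# Usage: check_modulelist DIR YYYYMMDD
check_modulelist() {
    dir="$1"    # directory holding the modulelist files
    day="$2"    # day to check, formatted YYYYMMDD
    file="$dir/$day-modulelist.txt"

    if [ ! -e "$file" ]; then
        # File was never generated for that day.
        echo "CRITICAL: File $day-modulelist.txt is missing"
        return 2
    elif [ ! -s "$file" ]; then
        # File exists but has zero bytes.
        echo "CRITICAL: File $day-modulelist.txt is empty"
        return 2
    fi

    echo "OK: File $day-modulelist.txt is present"
    return 0
}
```

It could be called as: check_modulelist /mnt/crashanalysis/crash_analysis/modulelist 20130530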
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED